Error Analysis for AI: Turning Messy Review Notes Into a Fix List

Written by Madalina Turlea
15 Jan 2026
Before you can improve an AI feature, you have to know how it is failing. Not how you think it fails, but how it actually fails. The way to find that out is error analysis, and it starts with reading the answers by hand.
Read every answer and annotate
The first step is manual. You go through every single answer the feature produced and annotate it as specifically as possible, focusing on where the AI missed the mark. You are looking for the spots where the result differs from what you would have delivered yourself.
This is slow, and that is the point. For one feature, this meant manually reviewing over 100 answers (around 150 cases in total) and leaving notes on exactly how each one failed.
Group the notes by root cause
A pile of individual notes is hard to act on. The next step is to classify them into mutually exclusive categories, so that each note belongs to exactly one. Each category groups together issues that share the same root cause, the same type of error. Those issues might show up in different ways in the actual responses, but because the root cause is the same, the potential fix is the same for all of them.
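As a rough illustration of the grouping step, the sketch below stores each review note with exactly one category and counts how often each category occurs. The `Annotation` fields, the category names, and the sample notes are all hypothetical, invented for this example; they are not the data from the feature described here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    answer_id: str  # hypothetical identifier for the reviewed answer
    note: str       # free-text note written during manual review
    category: str   # exactly one root-cause category per note


def rank_categories(annotations):
    """Return (category, count) pairs, most frequent first."""
    counts = Counter(a.category for a in annotations)
    return counts.most_common()


# Illustrative notes only (assumed for this sketch).
annotations = [
    Annotation("a1", "Response is missing a required field", "schema_mismatch"),
    Annotation("a2", "Returned prose instead of JSON", "schema_mismatch"),
    Annotation("a3", "Suggests dropping models after one test case", "premature_narrowing"),
]

for category, count in rank_categories(annotations):
    print(f"{category}: {count}")
```

Giving each note a single `category` field is one simple way to keep the categories from overlapping, and the ranked counts show which root cause to tackle first.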
For one feature, this produced eleven different types of errors. Looking at them as categories, you can already see which kind of fix each one needs. Some are structural and can be caught with rule-based, deterministic checks. The most common error was that the schema was not as expected, and you do not need an AI to check whether a response is valid JSON with the right structure; you can do that in code. Others, like recommending that the set of models be narrowed down without enough evidence, have no simple rule that would catch every case, so they need an AI judge instead.
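A deterministic structure check of this kind can be only a few lines long. The sketch below uses Python's standard `json` module; the expected field names and types in `EXPECTED_FIELDS` are assumptions made up for illustration, since the actual schema of the feature is not given here.

```python
import json

# Hypothetical expected structure: field name -> required Python type.
EXPECTED_FIELDS = {"recommendation": str, "models": list}


def schema_errors(raw_response, expected=EXPECTED_FIELDS):
    """Return a list of structural problems; an empty list means the check passed."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for field, expected_type in expected.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors


print(schema_errors('{"recommendation": "keep all", "models": ["a", "b"]}'))
print(schema_errors("Sure! Here is my analysis..."))
```

Returning the list of problems, rather than a bare pass or fail, keeps the check useful for measuring how often each structural error occurs across a batch of answers.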
The failures are not the ones you expected
When you start building a feature, you have an idea of how it will fail. That idea is usually wrong. On one feature, the expectation was that the AI would read the data wrong, make up data points, or misread the metrics. None of the models actually did that. There were other types of failure that only became visible by reading the actual responses. That specificity, the real failure modes, comes only from looking at the output, not from guessing in advance.
What the analysis gives you
Once your notes are grouped into clear categories, you can work through them systematically instead of reacting to scattered observations. The same annotated notes do double duty. They become the test data you use to validate an AI judge, since you already know which answers had a given error and which did not. They can also be turned into expected outputs paired with metrics, so you can automatically check whether your next change actually improved things.
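One way to use the annotations as validation data is to compare a judge's verdicts against the human labels for a single error category. The sketch below assumes both are stored as dictionaries mapping an answer id to `True` (error present) or `False` (error absent); the function name, the ids, and the labels are hypothetical.

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of shared answers where the judge matches the human annotation.

    Both arguments map an answer id to True (error present) or False (absent).
    """
    shared = human_labels.keys() & judge_labels.keys()
    if not shared:
        raise ValueError("no answers in common to compare")
    matches = sum(human_labels[i] == judge_labels[i] for i in shared)
    return matches / len(shared)


# Illustrative labels only (assumed for this sketch).
human = {"a1": True, "a2": False, "a3": True, "a4": False}
judge = {"a1": True, "a2": False, "a3": False, "a4": False}

print(judge_agreement(human, judge))  # 3 of 4 verdicts match
```

Plain agreement is the simplest possible measure. If one label is much rarer than the other, it may be worth also counting missed errors and false alarms separately.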
Error analysis is the unglamorous front end of every reliable AI feature. You read the answers, you name the failures precisely, you group them by cause, and only then do you have something concrete to fix.