LLM-as-a-Judge: How to Evaluate AI Features Without Checking Every Answer by Hand

Written by Madalina Turlea
15 Jan 2026
Once an AI feature is running, you cannot read every answer it produces. So teams reach for an LLM as a judge: another model that checks the output. The idea is sound, but the way most teams do it produces a judge you cannot trust. Here is the approach that works, built on three stages of evaluation.
Start with the three stages, in order
There are three levels of evaluation, and it matters that you do not skip straight to the exciting one. Each level builds on the knowledge from the one before it.
The first level is manual, human evaluation. You go through every single answer and annotate it as specifically as possible, focusing on where the AI missed the mark. This is how you find patterns in the failure modes and refine the instructions in your prompt.
The second level is deterministic metrics. These are small pieces of code you run on each answer to check specific attributes you care about. For one feature, the response had to be valid JSON; otherwise it could not be displayed in the interface and everything broke. Another check confirmed that the answer returned three key insights. These metrics are cheap, automated, and efficient, and they can run every single time you get an answer. Check as much as you can this way.
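As a minimal sketch, the two checks mentioned above could look like the following. The field name `key_insights` and the function names are assumptions made for illustration; the original feature's schema is not shown here.

```python
import json


def is_valid_json(answer: str) -> bool:
    """Return True if the raw answer parses as JSON."""
    try:
        json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def has_three_insights(answer: str, key: str = "key_insights") -> bool:
    """Return True if the answer is a JSON object whose `key` holds exactly three items.

    The field name "key_insights" is a hypothetical placeholder.
    """
    try:
        data = json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    insights = data.get(key)
    return isinstance(insights, list) and len(insights) == 3


def run_metrics(answer: str) -> dict[str, bool]:
    """Run every deterministic check on one answer and report each result."""
    return {
        "valid_json": is_valid_json(answer),
        "three_insights": has_three_insights(answer),
    }
```

Each check reports a result rather than repairing the answer, so the output can be logged for every response and compared over time.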
The third level is the LLM as a judge, and you only need it for the things you cannot check with code, like how well the model understood the problem or how good the generated text is.
The mistake most teams make
The common pattern is to run the feature on some test data, get results, and immediately build a judge that says "check that this is a good answer," sometimes scoring it from one to five. This is suboptimal in several ways.
The more specific you are about what you want the judge to check, the better the results. It is the same strategy you use for the feature itself. When you tell an AI to check that everything is correct, it checks some things you expect and some you do not.
Scoring on a scale makes it worse. If the AI says four and a human would say three, are they in agreement or not? You cannot really tell. So the judge should return a binary answer, zero or one, which either matches the human evaluation or does not.
The principle is one judge, one error, one binary score, one question: is this specific error present in the response?
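One way to hold a judge to that principle is to accept only a strict binary verdict and reject anything else. The sketch below assumes the judge has been instructed to reply with a JSON object of the form `{"error_present": 0 or 1, "justification": "..."}`; that reply shape and the names `Verdict` and `parse_verdict` are hypothetical, not taken from the original setup.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    error_present: bool
    justification: str


def parse_verdict(raw: str) -> Verdict:
    """Parse one judge reply, accepting only a binary 0/1 score.

    Assumes a hypothetical reply shape:
    {"error_present": 0 or 1, "justification": "..."}
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("error_present")
    if score not in (0, 1):
        # A 1-5 rating or free text is rejected rather than silently coerced.
        raise ValueError(f"expected 0 or 1, got {score!r}")
    return Verdict(bool(score), str(data.get("justification", "")))
```

Rejecting out-of-range scores keeps every stored verdict directly comparable with the human label for the same error.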
You have to evaluate the evaluator
The judge is another AI, so you need to evaluate it too. You need data proving that this judge actually delivers quality answers before you trust its judgement. If it agrees with you 50% of the time, you might as well flip a coin, and you are paying AI credits for the privilege.
This is where the manual annotation pays off. Because you already went through the answers and marked which ones had a specific error and which did not, you have a test set with an expected output. You run the judge against it and measure how often it agrees with the human.
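Measuring agreement can be as simple as comparing two lists of binary labels. This is a minimal sketch; the function names are illustrative, and it assumes the human and judge labels are stored in the same case order.

```python
def agreement_rate(human_labels: list[int], judge_labels: list[int]) -> float:
    """Fraction of cases where the judge's 0/1 label matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labelled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels: list[int], judge_labels: list[int]) -> list[int]:
    """Indices of the cases to review by hand, where judge and human differ."""
    return [
        i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
    ]
```

The list of disagreeing indices is the useful part in practice: those are the cases whose justifications you read before revising the judge prompt.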
How the numbers actually moved
In a real example, the judge looked for one specific error: premature model narrowing, where the AI recommends dropping or restricting models too early, before there is enough evidence. The prompt was long and specific. It defined the role clearly, scored pass or fail, listed exactly what the error looks like (too few total cases, too few experiments, not enough iterations on the prompt), spelled out what does not count as the error, and asked for a justification alongside the score.
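The structure of such a prompt can be sketched as a small template. This is not the prompt used in the example, only an illustration of the parts listed above (role, pass or fail, signs of the error, exclusions, justification). The function name, the wording of the template, and the placeholder strings in angle brackets are all assumptions.

```python
def build_judge_prompt(
    error_name: str,
    definition: str,
    signs: list[str],
    exclusions: list[str],
    response: str,
) -> str:
    """Assemble a single-error judge prompt from its parts."""
    sign_lines = "\n".join(f"- {s}" for s in signs)
    exclusion_lines = "\n".join(f"- {e}" for e in exclusions)
    return (
        f"You are a reviewer checking an AI response for one error: {error_name}.\n"
        f"Definition: {definition}\n\n"
        f"Signs that the error is present:\n{sign_lines}\n\n"
        f"This does NOT count as the error:\n{exclusion_lines}\n\n"
        "Answer with pass or fail (fail means the error is present), "
        "followed by a one-sentence justification.\n\n"
        f"Response to review:\n{response}\n"
    )


prompt = build_judge_prompt(
    error_name="premature model narrowing",
    definition=(
        "The AI recommends dropping or restricting models too early, "
        "before there is enough evidence."
    ),
    signs=[
        "too few total cases",
        "too few experiments",
        "not enough iterations on the prompt",
    ],
    # Placeholder: the original exclusions are not listed in this article.
    exclusions=["<fill in from your own annotation notes>"],
    response="<the answer under review>",
)
```

Keeping the parts as separate arguments makes it easy to change one of them, for example the exclusions, between iterations and rerun the judge against the same test set.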
Even with all that specificity, the first iteration agreed with the human only 50% of the time across five models. The GPT models did a bit better than the Claude Opus models, which landed around 30%. Reviewing the cases where the judge disagreed, and reading its justifications to see why, led to refinements of the prompt and to a few good and bad examples added for few-shot prompting. A second iteration brought the best judge to 93% agreement.
One detail worth noting: when all the models disagreed with the human on a case, it sometimes turned out the test data itself had an error, where a second failure type was present that had not been marked originally. The judge surfaced mistakes in the ground truth.
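A simple way to find those cases is to flag every test case where all judge models disagree with the human label. The sketch below assumes labels are stored as dictionaries keyed by case ID; the data layout and the function name are illustrative.

```python
def suspect_ground_truth(
    human_labels: dict[str, int],
    judge_labels: dict[str, dict[str, int]],
) -> list[str]:
    """Return case IDs where every judge model disagrees with the human label.

    `judge_labels` maps a model name to that model's {case_id: label} results.
    A flagged case is only a candidate for re-review, not proof of a bad label.
    """
    suspects = []
    for case_id, human in human_labels.items():
        votes = [
            labels[case_id]
            for labels in judge_labels.values()
            if case_id in labels
        ]
        if votes and all(vote != human for vote in votes):
            suspects.append(case_id)
    return suspects
```

A unanimous disagreement can still mean the judges are all wrong in the same way, so the flagged cases go back to a human for a second look.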
Choosing the model for the judge
The judge runs in production, so its cost and latency matter. The same task, prompt, and test data gave different results across models, and the Claude models actually scored lower in the second iteration. DeepSeek had slightly lower accuracy than GPT-4.1 but was more than eleven times cheaper, which can make it worth tweaking the instructions until it is viable at scale.
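The trade-off can be framed as picking the cheapest model that still clears an agreement bar you are comfortable with. In this sketch the `Candidate` fields, the function name, and the cost unit are assumptions; you would fill them in with your own measurements rather than with any figures from this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    agreement: float          # share of test cases matching the human label
    cost_per_1k_cases: float  # hypothetical unit; use whatever you actually measure


def cheapest_viable(
    candidates: list[Candidate], min_agreement: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate meeting the agreement bar, or None."""
    viable = [c for c in candidates if c.agreement >= min_agreement]
    if not viable:
        return None
    return min(viable, key=lambda c: c.cost_per_1k_cases)
```

If the cheapest model falls just short of the bar, that is the signal that another round of prompt refinement for that model may pay for itself.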
Because the task was tightly defined and looked for one error at a time, token usage was similar across models, around four thousand tokens. If you asked one judge to look for all eleven error types at once, the cognitive load would be much higher, and so would the cost and latency.
What the judge unlocks
Once the judge agrees with you at a high enough rate on the test data, say 80% or higher, you can let it evaluate real responses and scale your testing from the 150 cases you could review by hand to hundreds or thousands. You still go in and check, but now you can move. And having these automatic checks in place before you start changing the prompt means that when you fix one error, you can tell whether you broke something else, like the JSON, instead of finding out a month later.
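The "did I break something else" check can be sketched as a comparison of pass rates before and after a prompt change. The check names and the data layout below are illustrative assumptions; each result is a dictionary of check name to pass or fail for one answer.

```python
def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of answers passing each named check."""
    if not results:
        return {}
    names = results[0].keys()
    return {
        name: sum(r[name] for r in results) / len(results) for name in names
    }


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Checks whose pass rate dropped by more than `tolerance`.

    Returns {check_name: (baseline_rate, candidate_rate)}.
    """
    return {
        name: (rate, candidate.get(name, 0.0))
        for name, rate in baseline.items()
        if candidate.get(name, 0.0) < rate - tolerance
    }
```

Running this on the same test cases for the old and new prompt shows, for example, a fix to the narrowing error that quietly lowered the share of valid JSON responses.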