Deterministic Metrics: Automating the Checks You Would Otherwise Do by Hand

By Madalina Turlea · 15 Jan 2026


Once you have annotated an AI feature's answers by hand, you cannot keep doing that for thousands of cases. This is where metrics come in. Evaluations, also called evals or metrics, can be deterministic or can use another AI model as the judge. The deterministic ones are small pieces of code that run on every single answer and check for something specific.

What a deterministic metric can check

A deterministic metric applies wherever there is a rule you can check in code. It can confirm that an answer is no longer than 300 or 500 characters, that it is formatted a certain way, that it is in a certain language, or that it is valid JSON with exactly the structure you expect, which matters when you need to display it in a UI. It can check that a date or time is in the right format, for example that a time is written like 9am or 12, or that a date follows an exact pattern.
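As a minimal sketch of the format checks described above, here are three small Python functions. The function names, the 300-character default, the required-keys approach to "exactly the structure you expect", and the ISO-style date pattern are all assumptions made for illustration; your own rules will differ.

```python
import json
import re

MAX_CHARS = 300  # assumed limit; pick whatever your UI actually needs


def within_length(answer: str, limit: int = MAX_CHARS) -> bool:
    """True if the answer is no longer than `limit` characters."""
    return len(answer) <= limit


def is_json_with_exact_keys(answer: str, required_keys: set[str]) -> bool:
    """True if the answer parses as a JSON object with exactly these top-level keys."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed.keys()) == required_keys


# Assumed date pattern for illustration: YYYY-MM-DD (checks shape, not calendar validity).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_date_pattern(value: str) -> bool:
    """True if the whole string follows the assumed date pattern."""
    return DATE_PATTERN.fullmatch(value) is not None
```

Each function returns a plain pass or fail for one answer, which is what makes it cheap to run on every answer in an experiment.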

It can also check content in a rule-based way. For a shopping list, a metric can take your expected output and confirm that the name of each item appears somewhere in the answer, comparing case-insensitively, since it does not matter whether tuna comes before olive oil. A separate metric can check that the quantity is correct.
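The shopping-list checks might look something like the sketch below. It assumes the expected output is a list of item names, and, for the quantity check, that the model's answer has already been parsed into a list of dictionaries with hypothetical `name` and `quantity` keys. Neither shape comes from the article.

```python
def all_items_mentioned(answer: str, expected_items: list[str]) -> bool:
    """True if every expected item name appears somewhere in the answer.

    The comparison ignores case and does not care about item order.
    """
    haystack = answer.casefold()
    return all(item.casefold() in haystack for item in expected_items)


def quantities_correct(answer_items: list[dict], expected_items: list[dict]) -> bool:
    """True if every expected item is present with the expected quantity.

    Assumes each item looks like {"name": ..., "quantity": ...} (hypothetical shape).
    """
    got = {
        str(item.get("name", "")).casefold(): item.get("quantity")
        for item in answer_items
    }
    return all(
        got.get(str(exp["name"]).casefold()) == exp["quantity"]
        for exp in expected_items
    )
```

Keeping the name check and the quantity check as two functions means each one reports its own pass rate.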

Where there is no rule, a deterministic metric does not apply. Checking whether an answer is in the right tone of voice or on brand requires judgement, and that needs an LLM as a judge, which is a different and more advanced tool.

A metric measures, it does not enforce

A metric does not force the model to comply. You cannot enforce something in the answer; you can only check for it automatically. So the workflow is to add the metric first to get a baseline, run it, and look at the distribution: for example, 30% of one model's answers complied with the rule and 20% of another's did. Then you go to the prompt, add the instruction, run the whole experiment again, and re-apply the metric to see whether compliance actually improved.

This matters because AI is non-deterministic. Writing an instruction into the prompt does not guarantee the model follows it, so you measure rather than assume. Automating the check also means you do not have to read through every answer to find the failures, and you can measure precisely whether a given failure got better in the next iteration.
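The baseline-then-rerun workflow can be sketched as a pass-rate calculation applied to two runs. The helper names and the placeholder answers below are made up purely to show the mechanics; they are not results from any real experiment.

```python
from typing import Callable

Metric = Callable[[str], bool]


def pass_rate(answers: list[str], metric: Metric) -> float:
    """Share of answers that satisfy the metric (0.0 for an empty run)."""
    if not answers:
        return 0.0
    return sum(1 for answer in answers if metric(answer)) / len(answers)


def failures(answers: list[str], metric: Metric) -> list[str]:
    """The answers that fail, so you only read those instead of all of them."""
    return [answer for answer in answers if not metric(answer)]


def under_300_chars(answer: str) -> bool:
    return len(answer) <= 300


# Placeholder answers, invented for illustration only.
baseline_run = ["x" * 400, "short", "y" * 301, "z" * 500]
after_prompt_change = ["short", "ok", "fine", "w" * 400]

baseline = pass_rate(baseline_run, under_300_chars)        # 0.25
improved = pass_rate(after_prompt_change, under_300_chars)  # 0.75
```

Running the same metric on both runs is what lets you say whether the prompt change helped, rather than assuming it did.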

Split checks into separate metrics

It is tempting to write one big metric that looks at everything. Splitting the task into multiple metrics is more useful, especially at scale. When you measure each part separately, you might find that one part of the task is understood better by one model and another part better by a different model.

That is a useful hint. It can point you toward splitting the work into two prompts. If one model handles shopping items best and another handles to-do items best, you can route each input to the model that is strongest for it, and get higher accuracy than a single model would give you across the whole task.
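One way to act on that hint is to keep a pass rate per model and per metric, then pick the strongest model for each part of the task. The sketch below assumes a nested dictionary of scores; the model names, metric names, and numbers are hypothetical placeholders.

```python
def best_model_per_metric(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each metric name to the model with the highest pass rate on it.

    `scores[model][metric]` is a pass rate between 0 and 1.
    A model with no score for a metric is treated as 0.0.
    Ties go to the model listed first.
    """
    metric_names = {name for per_model in scores.values() for name in per_model}
    return {
        metric: max(scores, key=lambda model: scores[model].get(metric, 0.0))
        for metric in metric_names
    }


# Hypothetical pass rates, for illustration only.
example_scores = {
    "model_a": {"shopping_items": 0.9, "todo_items": 0.6},
    "model_b": {"shopping_items": 0.7, "todo_items": 0.8},
}

routing = best_model_per_metric(example_scores)
# routing == {"shopping_items": "model_a", "todo_items": "model_b"}
```

A single combined score would hide this split: averaged together, the two example models look almost the same.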

How metrics fit the bigger picture

Metrics connect directly to your error analysis. The error analysis groups your review notes into categories, and for each category you look for both a metric you can check it with and an improvement you can make to the prompt. A category like malformed JSON or incorrect date resolution maps to a deterministic metric. A category like wrong tone of voice maps to an LLM judge. Working through the categories this way turns a pile of notes into an evaluation framework you can run automatically, every time, at scale.
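As a rough sketch of that mapping, the error categories can live in one table that records which ones have a deterministic check and which ones are left for an LLM judge. The category names and the `CHECKS` table are assumptions for illustration, and the judge itself is out of scope here.

```python
import json
from typing import Callable, Optional


def is_valid_json(answer: str) -> bool:
    """True if the answer parses as JSON."""
    try:
        json.loads(answer)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical error categories from an error analysis.
# A value of None marks a category that needs an LLM judge instead of a rule.
CHECKS: dict[str, Optional[Callable[[str], bool]]] = {
    "malformed_json": is_valid_json,
    "wrong_tone_of_voice": None,
}


def run_deterministic_checks(answer: str) -> dict[str, bool]:
    """Run every rule-based check; True means the answer passed that category's check."""
    return {
        category: check(answer)
        for category, check in CHECKS.items()
        if check is not None
    }
```

Calling `run_deterministic_checks` on every answer gives one pass or fail per category, which can then be aggregated across the whole run.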


