Deterministic Metrics: Automating the Checks You Would Otherwise Do by Hand

By Madalina Turlea

15 Jan 2026

Once you have annotated an AI feature's answers by hand, you cannot keep doing that for thousands of cases. This is where metrics come in. Evaluations, also called evals or metrics, come in two kinds: deterministic checks and checks performed by another AI model. The deterministic ones are small pieces of code that run on every single answer and check for something specific.

What a deterministic metric can check

A deterministic metric applies wherever there is a rule you can check. It can confirm that an answer is no longer than 300 or 500 characters, that it is formatted a certain way, that it is in a certain language, or that it is valid JSON with exactly the structure you expect, which matters when you need to display it in a UI. It can check that a date or time is in the right format, for example that the time is 9am or 12, or that the date follows an exact pattern.
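As a minimal sketch of what such rule checks can look like in Python: the 300-character limit is one of the figures mentioned above, while the `title` and `due_date` keys and the `YYYY-MM-DD` pattern are assumptions made up for illustration. Swap in whatever rules your own feature needs.

```python
import json
from datetime import datetime

MAX_CHARS = 300  # assumed limit; use whatever your UI allows
REQUIRED_KEYS = {"title": str, "due_date": str}  # hypothetical JSON structure


def within_length(answer: str, max_chars: int = MAX_CHARS) -> bool:
    """Pass if the answer is no longer than max_chars characters."""
    return len(answer) <= max_chars


def is_expected_json(answer: str) -> bool:
    """Pass if the answer is a JSON object with exactly the expected keys and types."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(REQUIRED_KEYS):
        return False
    return all(isinstance(data[key], kind) for key, kind in REQUIRED_KEYS.items())


def is_iso_date(value: str) -> bool:
    """Pass if value is a real calendar date written exactly as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2026-1-5", so round-trip to enforce zero padding.
    return parsed.strftime("%Y-%m-%d") == value
```

Each function returns a plain pass or fail for one answer, which makes it cheap to run on every answer in an experiment.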

It can also check content in a rule-based way. For a shopping list, a metric can take your expected output and confirm that the name of each item appears somewhere in the answer. The comparison is case-insensitive and ignores order, since it does not matter whether tuna comes before olive oil. A separate metric can check that the quantity is correct.
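The shopping-list checks could be sketched roughly as follows. The function names are hypothetical, and the quantity check assumes the answer has already been parsed into a list of `{"name": ..., "quantity": ...}` entries, which is a simplification made for this example.

```python
def missing_items(expected_items: list[str], answer: str) -> list[str]:
    """Return the expected item names that do not appear anywhere in the answer."""
    haystack = answer.casefold()  # case-insensitive; position in the text is ignored
    return [item for item in expected_items if item.casefold() not in haystack]


def items_present(expected_items: list[str], answer: str) -> bool:
    """Pass if every expected item name appears somewhere in the answer."""
    return not missing_items(expected_items, answer)


def quantities_correct(expected: dict[str, int], parsed_answer: list[dict]) -> bool:
    """Separate metric: pass if each expected item has the expected quantity.

    Assumes parsed_answer looks like [{"name": "Tuna", "quantity": 2}, ...].
    """
    actual = {
        str(entry.get("name", "")).casefold(): entry.get("quantity")
        for entry in parsed_answer
    }
    return all(actual.get(name.casefold()) == qty for name, qty in expected.items())
```

Substring matching is deliberately loose: an expected item "oil" would also match "olive oil". If that is too permissive for your data, a stricter comparison on parsed item names is the natural next step.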

Where there is no rule, a deterministic metric does not apply. Checking whether an answer is in the right tone of voice or on brand requires judgement, and that needs an LLM as a judge, which is a different and more advanced tool.

A metric measures, it does not enforce

A metric does not force the model to comply. You cannot enforce something in the answer; you can only check for it automatically. So the workflow is to add the metric first, run it to get a baseline, and look at the distribution, for example that 30% of one model's answers complied with the rule and 20% of another's did. Then you go to the prompt, add the instruction, run the whole experiment again, and re-apply the metric to see whether compliance actually improved.
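One way to picture the baseline-then-rerun loop is the sketch below. It assumes you have each model's answers as lists of strings keyed by a model name, once from before the prompt change and once from after. The helper names are hypothetical and not part of any particular eval tool.

```python
from typing import Callable

Metric = Callable[[str], bool]  # a deterministic check: answer in, pass/fail out


def pass_rate(answers: list[str], metric: Metric) -> float:
    """Share of answers that pass the metric, between 0.0 and 1.0."""
    if not answers:
        return 0.0
    return sum(metric(answer) for answer in answers) / len(answers)


def compare_runs(
    baseline: dict[str, list[str]],
    revised: dict[str, list[str]],
    metric: Metric,
) -> dict[str, tuple[float, float]]:
    """For each model, return (pass rate before, pass rate after the prompt change)."""
    return {
        model: (
            pass_rate(answers, metric),
            pass_rate(revised.get(model, []), metric),
        )
        for model, answers in baseline.items()
    }
```

Because the same metric is applied to both runs, any change in the two numbers reflects the prompt change and not a change in how you judged the answers.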

This matters because AI is non-deterministic. Writing an instruction into the prompt does not guarantee the model follows it, so you measure rather than assume. Automating the check also means you do not have to read through every answer to find the failures, and you can measure precisely whether a given failure got better in the next iteration.

Split checks into separate metrics

It is tempting to write one big metric that looks at everything. Splitting the task into multiple metrics is more useful, especially at scale. When you measure each part separately, you might find that one part of the task is understood better by one model and another part better by a different model.

That is a useful hint. It can point you toward splitting the work into two prompts. If one model handles shopping items best and another handles to-do items best, you can route each input to the model that is strongest for it, and get higher accuracy than a single model would give you across the whole task.
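To make the routing idea concrete, here is a small sketch. It assumes you already have a pass rate per model and per metric, stored as `scores[model][metric]`. The model and metric names used with it are placeholders.

```python
def best_model_per_metric(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Given scores[model][metric] = pass rate, pick the strongest model per metric.

    Ties go to whichever model appears first in `scores`.
    """
    metrics = {metric for per_model in scores.values() for metric in per_model}
    return {
        metric: max(scores, key=lambda model: scores[model].get(metric, 0.0))
        for metric in sorted(metrics)
    }


def route(task_kind: str, routing_table: dict[str, str], default_model: str) -> str:
    """Return the model to use for an input of the given kind."""
    return routing_table.get(task_kind, default_model)
```

For example, with per-metric scores for a hypothetical `model_a` and `model_b`, `best_model_per_metric` gives you a routing table, and `route("shopping_items", table, default_model="model_a")` picks the model for a shopping input. How you decide that an input is a shopping item or a to-do item is a separate step that this sketch does not cover.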

How metrics fit the bigger picture

Metrics connect directly to your error analysis. The error analysis groups your review notes into categories, and for each category you look for both a metric you can check it with and an improvement you can make to the prompt. A category like malformed JSON or incorrect date resolution maps to a deterministic metric. A category like wrong tone of voice maps to an LLM judge. Working through the categories this way turns a pile of notes into an evaluation framework you can run automatically, every time, at scale.
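The category-to-evaluator mapping can be kept as plain data, as in the sketch below. The category identifiers are hypothetical spellings of the examples above, and `uncovered_categories` is an illustrative helper, not an existing API.

```python
from enum import Enum


class Evaluator(Enum):
    DETERMINISTIC = "deterministic"  # a rule you can check in code
    LLM_JUDGE = "llm_judge"          # requires judgement from another model


# Hypothetical category names taken from an error analysis.
EVALUATOR_FOR_CATEGORY = {
    "malformed_json": Evaluator.DETERMINISTIC,
    "incorrect_date_resolution": Evaluator.DETERMINISTIC,
    "wrong_tone_of_voice": Evaluator.LLM_JUDGE,
}


def uncovered_categories(categories: list[str]) -> list[str]:
    """Return the error-analysis categories that have no evaluator assigned yet."""
    return [c for c in categories if c not in EVALUATOR_FOR_CATEGORY]
```

Running `uncovered_categories` over the full list of categories from your review notes shows which failure types you are still checking only by hand.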