AI Evals for Product Managers: The Complete Guide for 2026

By Madalina Turlea

16 Jun 2026

In 2025, product teams were adding an AI chatbot to their product without much validation. Quality checks were superficial and vibe-based only. In 2026, users' tolerance for generic, mediocre AI outputs has dropped significantly, and the quality of your AI is making or breaking your feature.

The shift already happened. Teams stopped asking "which model should we use?" and started asking the harder question: "how do we actually know our AI feature is good?" That question, how you measure AI quality, who owns the measurement, and what you do with the answer, is what evals are. And it is more a product responsibility than an engineering one.

This guide is the complete reference on AI evals for product managers. The practical version, built from running 1,500+ experiments with product teams across fintech, HR tech, procurement, logistics, health tech, and sustainability. Throughout, we'll follow one running example, an AI shopping assistant for an e-commerce platform, so every concept lands on something concrete.

If you own an AI feature and you can't answer "how good is it, and how do you know?" with numbers and data-backed evidence behind it, this guide is for you.

What this guide covers

  • - What an AI eval is and what it isn't
  • - Why evals are a product manager's job, not engineering's
  • - Why AI breaks the testing instincts you already have
  • - The three levels of evals: manual, code-based, and LLM-as-judge
  • - The PM's eval workflow, step by step: the two ways to start and the shared loop
  • - How to turn failures into metrics, with real examples
  • - Offline vs online evals, and the eval suite that evolves with your product
  • - The metrics that lie, the common mistakes, and the tools

Key takeaways

  • - An AI eval is a structured, repeatable test suite that measures AI output quality against your own success criteria. It is not a leaderboard, not a vibe check, and not production monitoring.
  • - Evals are a product decision, not an engineering one. Quality lives in domain and product expertise. The people who can judge a good output are the ones closest to the user and most knowledgeable in the domain, usually the PM.
  • - AI fails silently. It always returns something. Without evals, your AI quality validation is outsourced to your users, and they don't file bug reports. They just leave.
  • - Start with manual annotation, earn automation later. The teams that build an LLM-judge before they've hand-labeled a single output build the wrong judge.
  • - You can start from either end. Already shipped? Analyze your real traces. Only have an idea? Run an experiment on your best-guess inputs. Both lead to the same place: looking at the outputs and annotating them.
  • - The payoff is measurable. Teams running structured evals before launch go from idea to validated feature in 3–7 days. Teams iterating through production complaints take 8–14 weeks at 10x the cost.

What is an AI eval?

An AI eval (evaluation) is a structured, repeatable test suite that measures the quality of an AI feature's output against criteria you define. You give the AI a set of representative inputs, score each output on multiple dimensions with quantitative metrics (each usually returning a 0-1 score), and track how those scores change across iterations to measure improvements or regressions.

That's it. An eval answers one question: "How good is the AI output for my users, given my custom definition of good?", in a way you have defined and can repeat, share, and trust.
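To make the definition concrete, here is a minimal sketch of that loop in Python. Everything in it is a placeholder for illustration: `fake_assistant` stands in for your real AI feature, and the two metrics are hypothetical examples of 0-1 scores, not a recommended set.

```python
from statistics import mean
from typing import Callable

# Hypothetical stand-in for the AI feature under test.
def fake_assistant(user_input: str) -> str:
    return f"Here are some options for: {user_input}"

# Each metric takes (input, output) and returns a score between 0 and 1.
def mentions_request(user_input: str, output: str) -> float:
    return 1.0 if user_input.lower() in output.lower() else 0.0

def within_length(user_input: str, output: str, max_chars: int = 200) -> float:
    return 1.0 if len(output) <= max_chars else 0.0

METRICS: dict[str, Callable[[str, str], float]] = {
    "mentions_request": mentions_request,
    "within_length": within_length,
}

def run_eval(inputs: list[str]) -> dict[str, float]:
    """Run every input through the feature and average each metric."""
    scores: dict[str, list[float]] = {name: [] for name in METRICS}
    for user_input in inputs:
        output = fake_assistant(user_input)
        for name, metric in METRICS.items():
            scores[name].append(metric(user_input, output))
    return {name: mean(values) for name, values in scores.items()}

results = run_eval([
    "educational toys for a child under 2",
    "laptop under 3,000 EUR",
])
print(results)
```

Re-running `run_eval` on the same inputs after each prompt or model change is what turns a one-off check into something you can track.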

What an eval is not:

  • - It is not a public benchmark. "GPT-5-x tops the leaderboard" tells you nothing about whether it works on your data, your task, your edge cases.
  • - It is not production monitoring. Monitoring tells you the feature didn't crash. Evals tell you the feature was wrong, which monitoring will never catch, because a wrong answer never returns an error.
  • - It is not a one-time demo. "It worked pretty good when I tried it" is an anecdote. An eval is a measurement across the full range of inputs your feature will actually meet.
| | Demo / vibe check | Public benchmark | Production monitoring | AI eval |
|---|---|---|---|---|
| Uses your real data | Sometimes | No | Yes (too late) | Yes |
| Repeatable | No | N/A | N/A | Yes |
| Catches silent wrong answers | No | No | No | Yes |
| Runs before users see it | Yes | N/A | No | Yes |
| Gives you a number you can track | No | Not yours | Indirect | Yes |

Why evals are a product manager's job and not engineering's

Here's the uncomfortable truth most teams discover the hard way: AI quality is a product problem wearing an engineering costume.

An engineer can build the pipeline that runs the model, captures the output, and computes a score. What an engineer usually cannot do is look at a contract summary and know it missed the liability clause. Or read a user question and determine the right compliance rule. Or spot that a transaction was filed under the wrong category in a way that will quietly break a customer's books.

That judgment, what good looks like for your users, is product knowledge. It belongs to the people who've done the work: the PM, the subject-matter expert, the person who handled these cases manually for years.

When PMs hand prompt and eval ownership entirely to engineering, three things happen:

  1. - The criteria get encoded by the wrong people. Engineers optimize for what's easily measurable (does it return an answer, and is it valid JSON?) and rarely for what matters (is the answer high value, complete, on point, aligned with the product goals?).
  2. - The feedback loop slows to a crawl. If a PM can't directly shape the prompt or adjust an eval without filing a ticket, the process is the bottleneck, not the actual implementation.
  3. - Quality plateaus early. The biggest accuracy gains we see come from domain experts rewriting prompts and sharpening eval criteria, not from infrastructure.

This is the core reframe of 2026: the PM doesn't approve the eval, the PM owns it.

Why AI breaks the testing you already know

If you've shipped traditional software, you have testing instincts. Most of them are wrong for AI.

Traditional software is deterministic: same input, same output. Test a login flow 100 times and it works, and you can be confident about case 101. AI is probabilistic: the same input can produce different outputs on different runs. Test it 100 times successfully and you still have no guarantee about case 101.

That single difference cascades into everything:

  • - You can't test exhaustively. There is no finite set of code paths. There's an infinite space of inputs your users will throw at the model: empty fields, typos, conflicting instructions, and so on.
  • - "It worked in ChatGPT" is a trap. ChatGPT is a finished product with its own system prompt and guardrails. When you integrate AI into your product, you talk directly to the LLM, so you get the raw model. You supply the instructions, the edge-case handling, the format. Different foundation entirely.
  • - Most impactful failures are silent. This is the one that costs real money. Bad AI output doesn't throw an error. It returns a confident, plausible, well-formatted answer that happens to be wrong. Your monitoring sees success. Your user sees a mistake. And by the time it shows up as churn, three months later, the context for why is gone.

AI rarely says "I don't know how to handle this." It gives an answer that is confident, structured, and well articulated, but too generic, mediocre, or even completely wrong for your users' expectations. Evals are how you find those before your users do.

One team ran structured evaluation across their full dataset before deployment and caught five distinct failure categories across 36 model runs, in 14 minutes. Every one of those failures would have reached production under their old "test three examples and ship" approach. The failures weren't hidden. They were invisible to the way the team had been testing. (More on this in Why Ship and Learn Just Doesn't Work for AI Features.)

The 3 levels of AI evals every PM should know

Once you've moved to a systematic way of measuring AI quality, every eval you'll ever run falls into one of three levels. You'll use all three, in combination.

Our running example: the shopping assistant. Throughout this guide we'll follow one feature: an AI agent for an e-commerce platform whose job is to help shoppers discover the right products faster and get them to checkout. A shopper describes what they want in plain language ("educational toys for my child under 2," "a laptop and accessories under 3,000 EUR"), and the assistant searches the catalog, compares options, and recommends a pick, carrying the conversation across many turns.

1. Human (manual) evaluation

A person reviews outputs and annotates them. This is the first level, the most important one for quality, and the foundation for everything automated. The only way to know you're automating the right checks is to base them on failures you've actually seen your AI make, not the failures you think it will make.

Three rules make manual evaluation actually work:

  • - Capture failures as specifically as possible. When you mark something failed, be very specific about the why — "missed key clause," "it addressed the user in a formal tone," "it missed the root cause of the metric drop." The more specific the failure description, the easier it will be to understand the failure taxonomy and how to measure for each type of failure automatically. Just writing "this is a wrong answer" leaves very little visibility into how to fix or measure for the failure automatically.
  • - Aggregate annotations in a single place. Usually several team members evaluate AI answers: product managers and subject-matter experts such as financial analysts, compliance experts, and legal experts. All notes and feedback need to be centralized in one place. In most teams, manual annotations are scattered across different sources, which makes it impossible to spot failure patterns across iterations.
  • - Compare across multiple models. Run the same system prompt and test inputs across multiple AI models and providers. Comparison of answers makes it easier to articulate which answer is better and why. It's an amazing tool to extract additional context on the problem, which you might have never thought of including in the model instructions in the first place. Hide which model produced each output. The moment a reviewer knows "this one's from Claude-opus-4-8," bias creeps in. Blind review consistently overturns teams' assumptions about which model is best.

[Screenshot: human annotations in Lovelaice]

For one conversation with the AI shopping assistant, the annotations would capture exactly what is wrong with the answer:

  • - Forgets the user's age constraint after 2 turns
  • - Runs an unnecessary tool call instead of getting the information from the database
  • - Misrepresents the catalog: tells the user the platform doesn't have any product for their need when matching products do exist, and tells them to search elsewhere
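Two of the rules above, keeping annotations in one place and hiding which model produced an output, are easy to picture as data. The sketch below is only an illustration under assumed names: the `Annotation` fields, the reviewer roles, and the model names are hypothetical, not the schema of any particular tool.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    output_id: str     # which AI output was reviewed
    reviewer: str      # PM, analyst, compliance expert, ...
    passed: bool
    failure_note: str  # specific description of the failure; empty when passed

def blind_labels(model_names: list[str], seed: int = 0) -> dict[str, str]:
    """Map real model names to neutral labels so reviewers can't see the source."""
    shuffled = model_names[:]
    random.Random(seed).shuffle(shuffled)
    return {name: f"Model {chr(ord('A') + i)}" for i, name in enumerate(shuffled)}

# One shared list instead of notes scattered across documents and chats.
annotations = [
    Annotation("conv-1", "pm", False, "forgets the shopper's age constraint after 2 turns"),
    Annotation("conv-1", "catalog-expert", False, "says no matching products exist when they do"),
    Annotation("conv-2", "pm", True, ""),
]

labels = blind_labels(["model-x", "model-y", "model-z"])
failures_per_output = Counter(a.output_id for a in annotations if not a.passed)
print(labels)
print(failures_per_output)
```

Because every note lands in the same structure, counting failures per output or per reviewer becomes a one-liner instead of a copy-paste exercise.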

2. Code-based (deterministic) checks

Code-based metrics are small functions that take the AI model's response, the test input, and (optionally) the expected output, and score the answer on a particular dimension. These are programmatic rules that pass or fail with no ambiguity: output format validation, JSON schema validation, exact match against an expected value, length checks, presence of required fields, absence of forbidden content, and so on. Use these for everything that can be checked objectively. They're free, instant, and easy to set up.

These should be the first type of automatic evals you write and put into place, especially when still exploring system architecture setups and implementation solutions. Setting up your first code-based metrics is also the fastest way to scale your manual testing. Automate the checks for the failures you've uncovered in the first level, manual annotations, and review only answers that pass these to check for new failure patterns. Any new failure that you uncover, you can automate in the same way.

[Screenshot: AI-generated deterministic metrics in Lovelaice, with the definition of a single metric]

Worked example — design the metrics. Most of the shopping-assistant checks turn out to be deterministic, because the catalog has structured fields to compare against:

  • - Did it identify the right products for the query? Compare the recommended product IDs against the catalog products that actually match the shopper's filters (category + age + price). Anything outside that set → fail.
  • - Did it search the right category? Check each recommended product's category against what the shopper asked for — Toys vs. Lifestyle vs. Computers.
  • - Did it report stock correctly? Check each product's in_stock field; if the assistant claims something is unavailable when it isn't (or "only one matches" when several do), fail.
  • - Did it make an unnecessary tool call? Compare the tool calls the AI made with the expected output. If the answer didn't require a tool call and the AI made one anyway, the metric returns a fail (score 0).
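The four checks above could look roughly like this in code. Treat it as a sketch under assumptions: the catalog rows, the field names (`min_age_months`, `in_stock`), the filter keys, and the `search_catalog` tool name are all invented for the example, and your real checks would read from your own catalog and traces.

```python
# Hypothetical catalog rows; in practice these come from your product database.
CATALOG = {
    "T1": {"category": "Toys", "min_age_months": 12, "price": 19.0, "in_stock": True},
    "T2": {"category": "Toys", "min_age_months": 36, "price": 25.0, "in_stock": True},
    "L1": {"category": "Lifestyle", "min_age_months": 0, "price": 12.0, "in_stock": False},
}

def matching_ids(filters: dict) -> set[str]:
    """Catalog products that satisfy the shopper's category, age and price filters."""
    return {
        pid for pid, product in CATALOG.items()
        if product["category"] == filters["category"]
        and product["min_age_months"] <= filters["child_age_months"]
        and product["price"] <= filters["max_price"]
    }

def right_products(recommended: list[str], filters: dict) -> int:
    """1 if something was recommended and every pick is in the matching set, else 0."""
    return int(bool(recommended) and set(recommended) <= matching_ids(filters))

def right_category(recommended: list[str], filters: dict) -> int:
    """1 if every recommended product is in the category the shopper asked for."""
    return int(all(CATALOG[pid]["category"] == filters["category"] for pid in recommended))

def stock_reported_correctly(claimed_unavailable: list[str]) -> int:
    """0 if the assistant called any in-stock product unavailable."""
    return int(not any(CATALOG[pid]["in_stock"] for pid in claimed_unavailable))

def no_unnecessary_tool_call(actual_calls: list[str], expected_calls: list[str]) -> int:
    """0 if the assistant made a tool call the expected output did not include."""
    return int(set(actual_calls) <= set(expected_calls))

# Shopper: toys for an 18-month-old, up to 30 EUR.
filters = {"category": "Toys", "child_age_months": 18, "max_price": 30.0}
print(right_products(["T1", "T2"], filters))            # T2 is a 3+ toy
print(stock_reported_correctly(["T1"]))                 # T1 is actually in stock
print(no_unnecessary_tool_call(["search_catalog"], []))
```

Each function returns 0 or 1 for one narrow question, which keeps a failing score traceable to a single failure type.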

3. LLM-as-judge

The most complex and intensive type of eval to set up is an LLM-as-judge. Another AI model evaluates the output to check whether a particular type of error is present, and returns a 0–1 score. LLM-as-judge evals should be the last layer you set up, prioritized for the high-impact, judgment-heavy failures you have observed in your manual annotation process. Unlike deterministic code-based checks, AI-based evals require more discipline and ongoing maintenance, and they carry additional costs. The discipline that separates a useful judge from a useless one:

  • - One failure per judge. Ask "Does this answer have this specific failure?", never "Is this answer good?" A narrow definition of the particular error the judge should look for is reliable; a broad one cannot be trusted, validated, or acted on.
  • - Validate against your human labels. A judge should agree with your manual ratings before you trust it to replace them. If it doesn't agree with you on the cases you've already labeled, it won't agree on the ones you haven't. This is where your manual annotations pay off: the outputs you have already annotated become the test dataset for your LLM-as-judge. The judge needs to be evaluated the same way as your main AI agent, because iterating on its prompt and finding the right model for it is the same evals process. It is scored on a dataset of AI outputs that a domain expert has annotated for whether the error is present. A judge scoring less than 90% on your validation dataset will not produce an eval you can trust for making decisions.
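Validating a judge against human labels comes down to simple arithmetic. The sketch below assumes that "scoring 90%" means the judge's 0/1 score matches the expert's 0/1 label on at least 90% of the annotated outputs; the label lists are made up for the example.

```python
def judge_agreement(human_labels: list[int], judge_scores: list[int]) -> float:
    """Share of annotated outputs where the judge gave the same 0/1 score as the human."""
    if len(human_labels) != len(judge_scores) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_scores))
    return matches / len(human_labels)

def disagreements(human_labels: list[int], judge_scores: list[int]) -> list[int]:
    """Positions of the outputs to re-read when tightening the judge's definition."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_scores)) if h != j]

# Hypothetical validation set: 1 = pass, 0 = the failure is present.
human = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
judge = [1, 0, 1, 1, 1, 0, 1, 0, 0, 1]

agreement = judge_agreement(human, judge)
trusted = agreement >= 0.90  # threshold taken from the 90% rule in the text
print(agreement, trusted, disagreements(human, judge))
```

The list of disagreements is the useful part: those are the exact outputs to re-read before you rewrite the judge prompt.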

Worked example — design an LLM-judge

One failure has no field to check, reading the shopper's intent, so it gets a narrow LLM-as-judge:

# Role
You are an evaluation judge for an AI shopping assistant on an e-commerce marketplace. Your task is to assess whether the AI shopping assistant commits the error of "dropping the user constraint": making product recommendations that violate constraints that have been expressed by the user.

# Scoring
- Score 1 (PASS): The AI output does NOT drop the user constraints.
- Score 0 (FAIL): The AI output drops the user constraints and recommends products that don't match the user's stated preferences and needs.

# Rules for determining constraint dropping

1. **Types of constraints**: Constraints can be about budget, shipping time, product specs, seller preferences, personal preferences. These can be expressed by the buyer as preferences, numbers, lists.

2. **Multiturn conversation**: The buyer may express the constraint upfront or in a subsequent turn. whenever expressed 
....

Dropping a constraint is JUSTIFIED (score 1) when ANY of these is true:
- The buyer specifically instructed the assistant to drop the constraint
- The buyer has changed or modified the constraint
- The buyer has pivoted to another category of products, unrelated to the initial ask and constraint

# Output Format
Respond with valid JSON only:
{
  "score": <0 or 1>,
  "justification": "<1-2 sentences explaining your assessment>"
}

Then validate that judge on the turns you already annotated by hand. Where it agrees with you, trust it; where it disagrees, tighten the definition.
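Before a judge score enters your metrics, it is worth checking that the reply actually follows the output format the prompt asks for. A minimal sketch, assuming the judge returns its answer as a plain string and that you want malformed replies rejected rather than silently counted:

```python
import json

def parse_judge_response(raw: str) -> tuple[int, str]:
    """Return (score, justification); raise ValueError if the reply breaks the format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # type() check rejects booleans and floats, which would otherwise compare equal to 0 or 1
    if type(score) is not int or score not in (0, 1):
        raise ValueError(f"score must be 0 or 1, got {score!r}")
    justification = data.get("justification")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")
    return score, justification.strip()

reply = '{"score": 0, "justification": "Recommends a 3+ set for a child under 2."}'
print(parse_judge_response(reply))
```

Rejecting malformed replies keeps a formatting problem in the judge from showing up as a fake quality change in your feature.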

| Eval type | Best for | Cost | Watch out for |
|---|---|---|---|
| Human / manual | Error analysis, building the foundation | Time intensive | Doesn't scale; needs domain expertise |
| Code-based | Format, structure, comparison with expected output, objective rules | ~Free | Can't score subjective attributes of the response |
| LLM-as-judge | Subjective quality attribute at scale | Per-call | Needs validation against human judgement |

The order matters: start with manual annotation, earn automation later. The mistake nearly every team makes is building an LLM-judge before they've read a single output, so the judge encodes assumptions nobody ever checked. Manual review first. Automate what you've validated.

The PM's eval workflow: two ways to start

Most guides assume you already have an AI feature live in production. In reality, PMs arrive at evals from two different places, and both are valid starting points.

  • - Path A — you already have a live feature and you want to know how it's actually performing.
  • - Path B — you only have an idea and you want to start evaluating and validating it before you ship anything.

The good news: both paths land in exactly the same place. You get a set of AI outputs in front of you, you look at what the AI actually answered, and you annotate. Everything after that, the analysis loop, is shared.

Path A: You have a live feature — analyze your traces

You're already in production, so you already have the most valuable test data there is: real user inputs and the real answers your AI gave them. A trace is the full log of one interaction: the user input, the model's response, the tools it called, latency, token usage.

[Screenshot: importing traces in Lovelaice]

  • - If you can access your traces: pull a representative sample into an evaluation tool. Don't cherry-pick the good ones — you want the messy reality, including the inputs nobody designed for.
  • - If you can't access your traces yet: collect a handful of real inputs your users actually sent, set up an experiment by pasting in your system prompt, uploading those sample inputs, selecting the model you run in production (plus a couple of others for comparison), and run it again. These won't be the exact answers your users saw, but the failures your users hit will mostly show up here too.

[Screenshot: test inputs in Lovelaice]

The advantage of starting here is that your test set is reality, no guessing what users will send, because you already know.

Path B: You only have an idea — run an experiment on your best guess

No live feature yet? You can still build your evaluation suite while you experiment with how to build the feature. You just create the data instead of importing it.

  1. - Make your best guess at the user input. Sketch the inputs you expect real users to send, based on the data you know you can collect from them or how you expect them to interact with your AI. It doesn't have to be the final format. A reasonable guess is enough to learn whether AI can solve the task at all, and at what accuracy and cost. Aim for a small, diverse sample of test cases; 10-15 should give you plenty to start with.
  2. - Write your best version of the system prompt. Write your instructions for the AI, describing as clearly as possible what its task is, what tools it has access to, and how it should output the answer. There are several tools you can use to improve your prompt, but don't overcomplicate it at this stage. Clear improvements to the prompt will become obvious after you look at the answers and the failure patterns.
  3. - Run the experiment on multiple LLMs. Send those test cases and the system prompt to several models, and capture the outputs. Even if your company or product is limited to a particular provider, running across multiple models is still worthwhile at this stage, because comparing answers makes the "best version" of the answer easier to recognize.
  4. - Analyze and annotate the results — the same way. Look at what the AI answered, mark pass/fail, write down why.

Starting here means you can validate a feature before committing engineering effort and walk into the build with evidence instead of optimism.

Where both paths converge: the shared analysis loop

Whether your outputs came from real traces or a best-guess experiment, what you do next is identical.

1. Human review first. Both cases should start with a domain-expert review of actual AI responses, real answers the LLM generated based on your instructions, architecture choices, and inputs. Everything builds on this, and it is the highest-leverage activity for building confidence in your AI evals.

2. Find the patterns in your notes: group failures through error analysis. Don't fix outputs one at a time or one failure at a time. Cluster failures into categories: you described the failures freely first, and now you group the ones that share a root cause. It's the single highest-leverage eval skill. A category should be a failure with a common root cause, mutually exclusive from the other categories. Keep categories general enough that they don't become too granular, but specific enough that the failure can be identified and measured (deterministically or with an LLM judge). For each category, capture severity × frequency, because a rare cosmetic glitch and a frequent reasoning error are not the same priority. (You can try this on your own outputs with our free Failure Patterns tool: paste them in and it runs the axial coding for you and suggests metrics you could set up as your automated evals.)

Worked example — cluster into failure patterns. The messy annotations collapse into four clear patterns, none of which any generic metric would catch:

[Screenshot: Lovelaice error analysis for the shopping assistant]

| Failure pattern | Root cause | Shows up as |
|---|---|---|
| Wrong recommendations | Drops the shopper's hard constraint (age, budget) on a pivot, or searches the wrong department | Recommends a 3+ magnetic set to a parent shopping for under-2 |
| Stock & availability errors | Claims a product is unavailable, or "we only have one," when the catalog says otherwise | "We only have one matching set" (false) |
| Talk instead of act | Describes a search it should execute; burns turns on confirmation loops | "I can search for that…" with no search actually run |
| Intent misread | Misunderstands what the shopper wants next | Reads "ok, go ahead" as the conversation ending |
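Once every failed output carries a category, the severity × frequency ranking is a few lines of counting. The numbers below are invented for illustration: the 1-3 severity scale, the category counts, and the 20 reviewed outputs are assumptions, not results from a real review.

```python
from collections import Counter

# Assumed severity scale: 3 = breaks the purchase, 1 = cosmetic.
SEVERITY = {
    "wrong_recommendation": 3,
    "stock_error": 3,
    "talk_instead_of_act": 2,
    "intent_misread": 1,
}

# One entry per failed output, taken from the annotation notes (hypothetical data).
failed = [
    "wrong_recommendation", "wrong_recommendation", "stock_error",
    "talk_instead_of_act", "wrong_recommendation", "intent_misread", "stock_error",
]

def prioritize(failure_categories: list[str], total_reviewed: int) -> list[tuple[str, float]]:
    """Rank failure categories by severity x frequency, highest priority first."""
    counts = Counter(failure_categories)
    scored = [
        (name, SEVERITY[name] * count / total_reviewed)
        for name, count in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranking = prioritize(failed, total_reviewed=20)
for name, priority in ranking:
    print(f"{name}: {priority:.2f}")
```

The ranking is only as good as the severity scores, so those are worth agreeing on with the domain experts who did the annotating.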

3. Turn the failures into metrics. Now you make each failure measurable, so you're not re-reading every output by hand forever. For each failure category, define one or multiple metrics and pick the cheapest/easiest type that can catch it:

  • - Deterministic checks for anything objective, rule based: JSON schema, value matching, format, required fields, comparison against an expected value. Free, instant, unambiguous. Reach for these first.
  • - LLM-as-judge only for what no code can check: subjective attributes like tone or respecting user constraints. Keep each judge narrow (one failure, never "overall quality"), and validate it against your manual labels before you trust it.

Be granular. You can roll your metrics into one aggregate number to track over time, but the per-attribute metrics are what let you see what actually moved the needle. And beware shortcuts: out-of-the-box judge templates (relevancy, coherence, hallucination, toxicity) feel tempting, but no generic template looks for your high-impact failures or is validated against your human labels — so it returns a number that won't help you improve.

4. Prioritize and pick a fix across five axes.

  1. - System Prompt — sharper instructions, domain context, examples, explicit edge-case handling. Almost always try this first.
  2. - Metrics — the error you found may reveal a quality dimension you weren't even measuring. Add the eval.
  3. - Test cases — expand coverage for the failure mode you just discovered.
  4. - Models — always explore multiple alternatives to compare on your own metrics.
  5. - Architecture — split the agent, reconsider the data the AI has access to, the tools it uses, or even how the user interacts with it. The biggest hammer, it might trigger a revamp of your existing evals as well.

Stay conservative early, exhaust the cheap prompt fixes before you re-architect, and get bolder only as iterations accumulate without progress.

5. Re-run and decide if it's ready. Re-run the new setup: improved prompt, new models, new test cases, and look at whether each failure pattern moved, pattern by pattern, not just one aggregate number. Now your testing scales easily, because you've automated the most impactful checks, so you can test a wide variety of combinations in a few hours.

[Screenshot: experiment results table and metrics]
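Reading a re-run pattern by pattern can be as simple as comparing two dictionaries of pass rates. In this sketch the metric names, the pass rates, and the 0.02 tolerance are all hypothetical; the point is that the average can go up while one pattern quietly gets worse.

```python
from statistics import mean

def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, str]:
    """Label each metric as improved, regressed or unchanged between two runs."""
    verdicts = {}
    for name, old_score in baseline.items():
        delta = candidate[name] - old_score
        if delta > tolerance:
            verdicts[name] = "improved"
        elif delta < -tolerance:
            verdicts[name] = "regressed"
        else:
            verdicts[name] = "unchanged"
    return verdicts

# Hypothetical pass rates per failure pattern, before and after a prompt change.
baseline = {"constraint_kept": 0.70, "stock_correct": 0.90, "acts_on_request": 0.80}
candidate = {"constraint_kept": 0.90, "stock_correct": 0.80, "acts_on_request": 0.81}

verdicts = compare_runs(baseline, candidate)
aggregate_went_up = mean(candidate.values()) > mean(baseline.values())
print(verdicts)
print("aggregate improved:", aggregate_went_up)
```

Here the aggregate improves even though stock reporting regressed, which is exactly the case a single number would hide.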

Offline vs online evals: building and evolving your golden dataset

There's a second axis to evals that decides when and where you measure quality: offline (before users ever see the output) and online (while the feature is live). You need both, and they do different jobs.

Offline evals: your golden dataset

An offline eval runs against a golden dataset: a curated set of test cases where, for each input, you've defined the ideal output yourself. You're not guessing what good looks like; you've written it down. You don't need the ideal outputs upfront, though. You can discover them through experimentation.

For the AI shopping assistant, the golden dataset is a collection of representative buyer questions, asks, and multi-turn conversations, each with an expected output describing the assistant's ideal response. Then you define the eval metrics that compare the AI's actual output against that ideal: did it recommend the right products? Did it miss any?
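Those two questions map onto two standard measures: precision (were the recommended products right?) and recall (did it miss any?). The sketch below assumes each golden case stores the ideal product IDs as a set; the IDs are placeholders.

```python
def recommendation_scores(expected_ids: set[str], actual_ids: set[str]) -> dict[str, float]:
    """Precision: share of recommended products that were right.
    Recall: share of the ideal products that were actually recommended."""
    if not expected_ids:
        raise ValueError("this sketch assumes every golden case lists at least one ideal product")
    hits = expected_ids & actual_ids
    precision = len(hits) / len(actual_ids) if actual_ids else 0.0
    recall = len(hits) / len(expected_ids)
    return {"precision": precision, "recall": recall}

# Hypothetical golden case: the ideal answer recommends T1 and T4.
expected = {"T1", "T4"}

print(recommendation_scores(expected, {"T1", "T2"}))  # one right pick, one wrong, one missed
print(recommendation_scores(expected, {"T1"}))        # nothing wrong, but T4 was missed
```

Keeping the two numbers separate tells you whether the assistant is recommending the wrong things or simply leaving good options out.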

Offline is your controlled environment, fully curated, built on human annotations and validated expected outputs. Because you have a ground truth, you can stack up a lot of metrics and get a real, numeric read on quality: how well the AI does, and whether a change made it better or worse. Run your offline suite every time you change something that could move quality:

  • - every time you update the system prompt
  • - every time you want to test out a new model
  • - every time you change the architecture of the AI system
  • - every time you change or expand coverage to a new use case or a new customer segment

This is your regression net, a regression being a change that quietly makes quality worse. Before any change ships, you know numerically whether it improved quality or broke something.

Online evals: measuring quality in production

Once the AI is live, you no longer have the expected output. Real users send inputs that aren't in your golden dataset, and nobody has pre-written the ideal answer. So your comparison-to-ideal metrics can't run here.

What can run are the metrics that validate quality from the input and the actual model output alone — no expected output required. You build these as part of your offline suite too, but their superpower is that they work online as well. That gives you a live signal of how your AI is doing on the messy reality of real user inputs, far outside your golden dataset — so you can be notified when quality drifts, and discover the inputs you never thought to test (and add them to your golden dataset).

| | Offline eval | Online eval |
|---|---|---|
| When | Before shipping a change | Live, on real traffic |
| Needs an expected output? | Yes: the golden dataset | No: input + actual output only |
| Best for | Regression testing; comparing prompts, models, architectures | Drift detection; finding unseen failures |
| Based on | Curated human annotations | Real user inputs |
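For the online side, one simple way to get notified when quality drifts is a rolling pass rate over the most recent production outputs. This is a sketch, not a monitoring product: the window size and alert threshold are arbitrary example values, and `passed` is assumed to come from any metric that needs only the input and the actual output.

```python
from collections import deque

class RollingPassRate:
    """Track the pass rate of a reference-free metric over the last N production outputs."""

    def __init__(self, window: int, alert_below: float):
        self.results = deque(maxlen=window)  # old results fall out automatically
        self.alert_below = alert_below

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

    def record(self, passed: bool) -> bool:
        """Store one result; return True once the window is full and the rate is below the threshold."""
        self.results.append(passed)
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.pass_rate() < self.alert_below

monitor = RollingPassRate(window=5, alert_below=0.8)
stream = [True, True, True, True, True, False, False]  # hypothetical metric results
alerts = [monitor.record(passed) for passed in stream]
print(alerts)
```

The inputs that trigger an alert are good candidates to review by hand and add to the golden dataset.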

Your golden dataset is a living thing

Here's the part most teams miss: the golden dataset is not static. It moves with your product.

When you start, you don't have realistic test data, so you make your best guess at the inputs (exactly the Path B move from earlier). That's fine to begin with. But the moment your feature is live, you have something far better: real production cases, including the ones your AI struggled with.

So the loop becomes: your online evals, or just a manual review of a fresh sample of traces, surface where the AI is failing in production. You take those real cases, manually curate the ideal output for each, and fold them into your golden dataset. Your next offline run now tests against reality, not your original guesses. Do this continuously and your evals evolve with your product: new use cases, new customer segments, and new edge cases all flow back in.

The golden dataset has to match production

One warning, because it's the trap that makes teams feel safe right before they get hurt: your golden dataset has to match the distribution of your real usage.

If you fill it with only happy cases, one user profile, one type of question, your offline evals will hit 100%, you'll celebrate that the AI is great, and you'll ship something that falls apart on the real range of inputs. A golden dataset that only contains easy cases isn't measuring quality. It's measuring your optimism. Cover the spread: the profiles, the question types, the edge cases, the messy inputs, in roughly the proportions they actually show up in production.
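A quick way to check the spread is to tag each input with a type and compare the shares in the golden dataset against a sample of production traffic. The input types, the counts, and the 10-point gap threshold below are assumptions chosen for the example.

```python
from collections import Counter

def share_by_type(labels: list[str]) -> dict[str, float]:
    """Fraction of inputs that fall under each input type."""
    counts = Counter(labels)
    return {input_type: count / len(labels) for input_type, count in counts.items()}

def coverage_gaps(golden: list[str], production: list[str], max_gap: float = 0.10) -> dict[str, float]:
    """Input types whose share in the golden dataset differs from production by more than max_gap.
    Positive values mean over-represented in the golden dataset, negative mean under-represented."""
    golden_share = share_by_type(golden)
    production_share = share_by_type(production)
    gaps = {}
    for input_type in set(golden_share) | set(production_share):
        diff = golden_share.get(input_type, 0.0) - production_share.get(input_type, 0.0)
        if abs(diff) > max_gap:
            gaps[input_type] = round(diff, 2)
    return gaps

# Hypothetical labels: a happy-path-heavy golden dataset versus real traffic.
golden = ["simple_search"] * 9 + ["multi_turn"] * 1
production = ["simple_search"] * 5 + ["multi_turn"] * 3 + ["typo_heavy"] * 2

gaps = coverage_gaps(golden, production)
print(gaps)
```

In this made-up case the golden dataset over-weights simple searches and has no typo-heavy inputs at all, which is the kind of gap that produces a comforting 100%.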

The metrics that matter (and the ones that lie)

Lying metric: a single accuracy number with no breakdown. "72% accurate" tells you nothing actionable. 72% on what? Failing how? With which errors acceptable? Always pair the number with the failure breakdown.

Lying metric: "the AI returned an answer." Returning something is not success. AI always returns something. That's the entire problem.

Lying metric: a confidence score from the same model. Asking the AI that produced the output to rate its own answer is a highly unreliable way to measure and track AI quality.

Lying metric: another AI judging overall quality, sentiment, or confidence. An AI scoring your AI's output on dimensions that are usually defined too broadly ("good quality") and never validated against human labels is a highly unreliable AI quality metric.

Honest metrics, defined on your terms:

  • - Task accuracy — against your expected outputs and human ratings.
  • - Failure rate by category — so you know how it fails, not just that it fails.
  • - Format compliance — did it follow the required structure? (Code-based, cheap.)
  • - Cost per request at scale — see below; this one sinks unit economics quietly.
  • - Latency — sometimes the deciding factor (4.9s vs 47.6s is a different product).

A note on cost, because it's the eval PMs forget: AI inverts software economics. Every request costs money, so your power users cost you the most. In one extraction task, GPT-5 scored 60% while GPT-4o hit 100% — at a fraction of the cost, with 4.9-second latency versus 47.6 seconds. If you didn't measure cost during evals, your projections are fiction. Make cost part of your eval metrics, not an after-launch surprise.
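Cost per request is straightforward to compute from token counts, so it can sit next to your quality metrics in every experiment. All the numbers in this sketch are placeholders: the token counts, the per-million-token prices, and the usage figures are invented, so substitute your provider's current pricing and your own traffic.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Cost of one call, given token counts and prices per million tokens."""
    return (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    ) / 1_000_000

def monthly_cost(per_request: float, requests_per_user_per_day: int, users: int, days: int = 30) -> float:
    """Projected monthly spend at a given usage level."""
    return per_request * requests_per_user_per_day * users * days

# Placeholder figures, not real pricing.
per_request = cost_per_request(
    input_tokens=2_000,
    output_tokens=500,
    price_in_per_million=1.00,
    price_out_per_million=4.00,
)
projected = monthly_cost(per_request, requests_per_user_per_day=20, users=1_000)
print(f"per request: {per_request:.4f}, per month: {projected:,.0f}")
```

Running the same projection for each model you compare makes the trade-off between accuracy, latency, and cost visible before launch.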

Common AI eval mistakes PMs make in 2026

  • - Validating on three happy-path examples. The demo shines, production breaks, and you find out from a churned customer or a lost deal.
  • - Building the LLM-judge before reading a single real response. You can't validate a judge you have nothing to validate it against.
  • - Outsourcing the eval to engineering or QA based on the acceptance criteria. Most of the time, engineering and QA do not have the full domain expertise to spot high-impact failures beyond what is easily measurable, like format.
  • - Treating the prompt like engineering work. "You are a helpful assistant" is the AI equivalent of a spec that says "make it good." In one test, every model except GPT-5 scored 0% with a basic prompt; a structured prompt took multiple models to 90%. The prompt isn't an engineering task. As a PM, it's the highest-leverage tool you have to shape AI behavior.
  • - Skipping cost modeling. Most teams have no idea what the average AI call is costing them, missing critical information for making informed decisions about their AI features.
  • - Generic metrics from observability platform templates. Measuring relevance, consistency, and other template metrics from observability vendors leaves teams with a number but no idea what actually went wrong or how to improve it.
  • - Relying on thumbs up/down from users only. A lot of teams outsource AI quality checks to users. Relying on users clicking thumbs down as your only evals strategy backfires.

Evals and regulation: the EU AI Act raises the floor

If your AI feature touches a regulated domain, evals stop being optional hygiene and become evidence. The EU AI Act expects documentation, traceability, and risk management — which is exactly what a disciplined eval process produces as a byproduct: a record of what you tested, what failed, how you fixed it, and why you judged it ready.

Teams shipping on vibes have nothing to show a regulator or an enterprise procurement team. Teams running structured evals have an audit trail.

The bottom line

Evals are the product skill of 2026 because they're how you answer the only question that matters about an AI feature: is it good enough for my users, and how do I know?

Not "our confidence score is high" or "QA is validating it" or "we are measuring with thumbs up/down from users." A suite of quantitative metrics that you have defined, based on what you know makes a good output and how you've actually seen the AI fail for your use case.

You don't need permission, technical skills or a perfect setup to start. Your first step takes an afternoon: put 20 real outputs in front of you and annotate them. If you've shipped, pull your traces and look at them. If you haven't, run an experiment on your best guess. Either way, the move is the same: put the outputs in front of you and start annotating.

I believe the teams winning at AI bring evidence to AI decisions the same way good product teams brought data to product decisions when usage analytics changed everything.

Your AI feature is live. Three happy test cases are not proof that it works. A structured eval is.

— Madalina