How to Move Past Vibe Checks: Scaling Manual AI Testing into Systematic Evaluation

By Madalina Turlea

16 Jun 2026

Every team building an AI feature starts the same way: type in a few test cases, skim through answers, call them "pretty good" and ship. That's a vibe check, and it's the right place to start. Human manual review is the first step of every serious AI evaluation process. The teams that get into trouble aren't the ones doing vibe checks. They're the ones still doing only vibe checks six months later.

To move past vibe checks, you scale the manual review you're already doing: build a real test set, run it across multiple models, centralize your expert annotations in one place, group them into named failure modes, and define automated checks for your high-impact failure modes. Keep reviewing new responses until they stop revealing new failure types, then prioritize fixes by impact and frequency. That's the whole path. The rest of this article walks through each step.

This isn't theory. It's the process we've run across more than a dozen AI features, in different industries and product stages, across 1,500+ experiments at Lovelaice, and the numbers say it matters: with this process you can improve your AI quality by over 40% in a matter of weeks.

What is a vibe check, and why does it break down?

A vibe check is informal, manual AI testing: you try a handful of inputs, read the outputs, and judge them by feel. No defined criteria, no record of what was wrong, no way to repeat the test after the next prompt change.

In practice, teams get stuck in one of two states.

State one: the five happy cases. You test two to five inputs, all happy paths. The review is a skim. The verdict is "yeah, looks good." The feature ships on the strength of a feeling.

State two: the scattered twenty. More disciplined teams keep a collection of 10–20 test cases they re-run at each iteration. Better. But the feedback on what's wrong with each answer lives everywhere: a Jira ticket here, a Notion page there, three Slack threads, a Google Doc someone made in a meeting. Nothing is consolidated. Six weeks later, nobody can tell you whether the new prompt fixed the problems from the last round, because nobody can find the last round.

Both states share the same root problem: AI fails silently. A bad AI answer doesn't throw an error. It returns a confident, well-formatted response that happens to be wrong, and users don't file bug reports for "just okay" answers. They leave. Your monitoring sees success while your retention sees the truth.

Vibe checks can't catch silent failure because they have no memory and no definition of "good." The six steps below build both.

Step 1: Build a real test set, not five happy cases

You don't need production data or thousands of examples to start. You need 5–10 test cases built from real scenarios you know well, written by the person who understands the domain. From there, grow toward 20–30 with a deliberate mix:

  • 40–50% standard cases. Clear inputs, known-good answers. Your happy paths.
  • 30–40% edge cases. Incomplete data, ambiguous requests, unusual but valid inputs.
  • 10–20% adversarial cases. Users trying to abuse the system, missing or corrupt data, typos, conflicting instructions, anything that can go wrong.

A test set that's 100% happy path only exercises the part of the input space that was never going to fail. The edge and adversarial cases are where your AI's real behavior shows up.
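If you want to keep an eye on that mix as the set grows, a few lines of code are enough. The sketch below is one possible way to do it, not a prescribed format: the `EvalCase` fields, the category names, and the `mix_report` helper are all assumptions made for illustration, and the sample set is invented.

```python
from collections import Counter
from dataclasses import dataclass

# Target share per category, taken from the ranges listed above.
TARGET_MIX = {
    "standard": (0.40, 0.50),
    "edge": (0.30, 0.40),
    "adversarial": (0.10, 0.20),
}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str    # "standard", "edge", or "adversarial"
    user_input: str
    expected: str    # what a domain expert would accept as a good answer


def mix_report(cases):
    """Return {category: (share, within_target)} for a list of EvalCase."""
    counts = Counter(case.category for case in cases)
    total = len(cases)
    report = {}
    for category, (low, high) in TARGET_MIX.items():
        share = counts[category] / total if total else 0.0
        report[category] = (share, low <= share <= high)
    return report


# Hypothetical 20-case set: 10 standard, 7 edge, 3 adversarial.
cases = (
    [EvalCase(f"s{i}", "standard", "sample input", "sample answer") for i in range(10)]
    + [EvalCase(f"e{i}", "edge", "sample input", "sample answer") for i in range(7)]
    + [EvalCase(f"a{i}", "adversarial", "sample input", "sample answer") for i in range(3)]
)
for category, (share, ok) in mix_report(cases).items():
    print(f"{category:12s} {share:.0%}  {'ok' if ok else 'outside target'}")
```

Run on a set of five happy paths, the same report flags every category as outside its target.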

Step 2: Run every test across multiple models, side by side

This step looks like a model-selection exercise. It's actually an evaluation exercise.

When you read one AI response in isolation, it's hard to say what's wrong with it. It sounds plausible. It's well structured. Something feels off, but you can't articulate what. Now put five or ten responses to the same input next to each other. Suddenly the quality attributes become obvious: this one missed the key recommendation, that one buried it under generic filler, this one invented a number, that one nailed the structure but got the priority backwards. Comparison turns vague feelings into words, and words are what you'll automate later.

The side effect is that you learn which model actually fits your problem. In one data-extraction experiment we ran, the accuracy gap between models was 83% versus 33% on identical test cases, and the worse model cost 50x more. You will not find that on a pricing page or a leaderboard. The model that wins on Twitter is not the model that wins on your problem.
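The mechanics of a side-by-side run are small. Here is a minimal sketch, assuming each model can be wrapped as a plain function from prompt to text. The stub models, their names, and the helper functions are hypothetical stand-ins so the example runs without any API key. In practice each function would call a real model client.

```python
from typing import Callable, Dict, List

# A model is anything that maps a prompt string to a response string.
ModelFn = Callable[[str], str]


def run_side_by_side(inputs: List[str], models: Dict[str, ModelFn]) -> List[dict]:
    """Run every input through every model; return one row per input."""
    rows = []
    for text in inputs:
        row = {"input": text}
        for name, call_model in models.items():
            try:
                row[name] = call_model(text)
            except Exception as exc:
                # Record the failure instead of losing the whole row.
                row[name] = f"<error: {exc}>"
        rows.append(row)
    return rows


def print_grid(rows: List[dict]) -> None:
    """Print each input followed by every model's answer, one per line."""
    for row in rows:
        print(f"INPUT: {row['input']}")
        for name, answer in row.items():
            if name != "input":
                print(f"  [{name}] {answer}")


# Hypothetical stub models, including one that always fails.
def broken_model(prompt):
    raise RuntimeError("timeout")


stub_models = {
    "model_a": lambda prompt: f"Short answer to: {prompt}",
    "model_b": lambda prompt: f"Long, generic answer about {prompt.lower()}",
    "model_c": broken_model,
}
rows = run_side_by_side(["Summarize client X's activity"], stub_models)
print_grid(rows)
```

Keeping one row per input is what makes the comparison readable: every answer to the same question sits in the same place.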

Step 3: Annotate everything in one centralized place

This is the step that separates teams that improve from teams that loop.

For every test case, every model, every variation, leave notes, and put them in one system rather than scattered across Slack threads and Jira tickets. Each annotation attaches to the exact response it describes, so the next iteration can be measured against the last one instead of against memory.
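"One system" can start very small. As an illustration only, the sketch below keeps annotations in memory and keys each note to the exact response it describes. The field names, the `AnnotationLog` class, and the second sample note are assumptions, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    case_id: str
    model: str
    prompt_version: str
    reviewer: str
    note: str
    failure_mode: Optional[str] = None  # assigned later, when notes are grouped


@dataclass
class AnnotationLog:
    entries: List[Annotation] = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        if not annotation.note.strip():
            raise ValueError("an annotation needs a non-empty note")
        self.entries.append(annotation)

    def for_response(self, case_id: str, model: str, prompt_version: str):
        """All notes attached to one exact response."""
        key = (case_id, model, prompt_version)
        return [
            a for a in self.entries
            if (a.case_id, a.model, a.prompt_version) == key
        ]

    def for_case(self, case_id: str):
        """All notes for one test case, across models and prompt versions."""
        return [a for a in self.entries if a.case_id == case_id]


log = AnnotationLog()
log.add(Annotation("case-07", "model_a", "v1", "domain-expert",
                   "Recommends restricting to two models after only five test cases."))
# Hypothetical follow-up note on the next prompt version.
log.add(Annotation("case-07", "model_a", "v2", "domain-expert",
                   "Recommendation now waits for more data; format is fine."))
print(len(log.for_response("case-07", "model_a", "v1")))
print(len(log.for_case("case-07")))
```

Because the prompt version is part of the key, pulling up everything said about one test case shows its history across iterations in one query.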

Two rules make the notes valuable.

First, be specific. "Bad response" teaches you nothing. A real annotation from our own product reads: "Recommends restricting to two models after only five test cases. Not enough data to make that decision." That note names the failure precisely enough that, later, it can become an automated check. The quality of your notes determines the quality of everything downstream: your prompt fixes, your checks, your metrics.

Second, the right people write them. Evaluating AI output is not an engineering task (unless you're building a coding agent). It's a product and domain-expertise task. The question being answered on every response is "would an expert on my team have written it this way?", and only people who know what an expert answer looks like can answer it. One sustainability team came to us with a first iteration scoring below 40% accuracy. The turning point was putting their domain experts, the people who had done these assessments manually for years, in charge of reviewing outputs and shaping the prompt. Three weeks later: over 90%, on a cost-efficient model. The technology wasn't the differentiator. The domain expertise was.

Step 4: Turn your notes into failure modes

After a few rounds of annotation, patterns appear. Group your notes into named error categories, your AI's failure modes. These will be specific to your domain, your use case, and your product, and that specificity is exactly what makes them useful.

Say your feature generates a client activity report inside your product. You know precisely what a great report contains: which insights matter, which action points the client should take, what deserves the top spot. Every gap between the AI's output and that ideal is a candidate failure mode. For example: if a client is missing a critical setup step that's blocking them from getting value, that must be the first section of the report. An output that buries it is a failure, and now it's a named failure: "missed critical setup priority."

Expect surprises here. When we ran error analysis on our own insights feature, we expected hallucination to dominate. It barely registered. The actual top failures were superficial insights, premature recommendations made on too little data, and format violations. The failure distribution will surprise you, which is precisely why you can't skip the reading and go straight to automation: you'd build checks for failures you imagined instead of failures you have.
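Once each note carries a failure-mode label, seeing the distribution is a counting exercise. The sketch below assumes notes are stored as (note, failure mode) pairs. The sample notes and the `failure_distribution` helper are invented for illustration, and the grouping itself is still done by a human reader.

```python
from collections import Counter


def failure_distribution(labeled_notes):
    """Count notes per failure mode; return (mode, count, share), most common first."""
    counts = Counter(mode for _note, mode in labeled_notes)
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common()]


# Hypothetical notes, each already assigned to a failure mode by a reviewer.
labeled_notes = [
    ("Restates the raw numbers without saying what they mean", "superficial insight"),
    ("Lists usage stats but no takeaway", "superficial insight"),
    ("Says usage is growing with no cause or implication", "superficial insight"),
    ("Recommends restricting to two models after five test cases", "premature recommendation"),
    ("Suggests dropping a model after a single run", "premature recommendation"),
    ("Returned prose where a bulleted list was required", "format violation"),
]
for mode, count, share in failure_distribution(labeled_notes):
    print(f"{mode:26s} {count:2d}  {share:.0%}")
```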

Step 5: Automate one check per failure mode

This is the reframe that makes scaling possible: the quality of your AI is the absence of your known failure modes. Once each failure has a name and a crisp definition, each one can get one or more automatic checks. Two kinds, in this order:

Deterministic checks first. Small pieces of code that verify rules. Is the JSON valid? Are all required sections present? Is the response within length limits? And the domain rules from your failure modes: if the input is a client with missing critical setup, is the setup section first in the report? These checks cost nothing to run, are perfectly reliable, and in our experience catch 60–70% of failures before you need anything smarter.
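To make the deterministic checks concrete, here is a minimal sketch for the client report example. It assumes the model returns JSON shaped like `{"sections": [{"title": ...}, ...]}`. That shape, the section names, and the length limit are all hypothetical and would be replaced by your own contract.

```python
import json

# Hypothetical report contract; replace with your own sections and limits.
REQUIRED_SECTIONS = ("summary", "insights", "action_points")
SETUP_SECTION = "setup"
MAX_CHARS = 4000


def run_checks(raw: str, client_missing_setup: bool) -> dict:
    """Return {check_name: passed} for one raw model response."""
    results = {
        "valid_json": False,
        "required_sections": False,
        "within_length": len(raw) <= MAX_CHARS,
        "setup_first": False,
    }
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return results  # structure checks stay failed if parsing fails
    results["valid_json"] = True

    sections = report.get("sections") if isinstance(report, dict) else None
    if not isinstance(sections, list):
        return results
    titles = [s.get("title") for s in sections if isinstance(s, dict)]

    results["required_sections"] = all(t in titles for t in REQUIRED_SECTIONS)
    # Domain rule: a client with missing critical setup must see setup first.
    results["setup_first"] = (not client_missing_setup) or titles[:1] == [SETUP_SECTION]
    return results


# A response that contains every section but buries the setup step last.
buried = json.dumps({"sections": [
    {"title": "summary"}, {"title": "insights"},
    {"title": "action_points"}, {"title": "setup"},
]})
print(run_checks(buried, client_missing_setup=True))
```

Each check returns its own pass/fail rather than one combined verdict, so every named failure mode keeps its own score.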

LLM-as-a-judge for the rest. Some failures require interpretation: is this insight shallow, is this recommendation premature, does this summary miss what matters? For these, another AI checks for one specific failure and returns a binary pass/fail. One judge, one error, one score. And validate the judge against your own human annotations before trusting it: our first judge agreed with our human labels 50% of the time, which is a coin flip, not a judge. After iterating on its prompt with examples from our error analysis, it reached 93%. Aim for at least 80% agreement before you rely on one.
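Validating a judge comes down to comparing two lists of pass/fail labels for the same responses. A minimal sketch, assuming both are available as Python booleans; the sample labels and function names are made up for illustration, and the judge call itself is left out.

```python
def agreement_rate(human_labels, judge_labels):
    """Share of responses where the judge's pass/fail matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled response")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels, judge_labels):
    """Indices where judge and human differ; these are the examples to study."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]


MIN_AGREEMENT = 0.80  # threshold suggested in the text above

# Hypothetical labels for one failure mode: True = pass, False = fail.
human = [True, True, False, True, False, False, True, True, False, True]
judge = [True, True, False, True, True, False, True, True, False, True]

rate = agreement_rate(human, judge)
print(f"agreement: {rate:.0%}, trusted: {rate >= MIN_AGREEMENT}")
print("review these:", disagreements(human, judge))
```

The disagreement list is the useful part when iterating on the judge prompt: those are the responses where the judge and your experts read the same output differently.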

Your scores will never hit a permanent 100%. You're dealing with AI; even with well-written evals, some responses will fail. That's exactly the point. You now have a system that automatically checks every known failure mode on every response, so you can run thousands of test cases without reading thousands of outputs. You read the failures.

Step 6: Expand your data

You started with the small sample you could read by hand. Now widen it, depending on where your product is.

If your feature is live, you have the best test data there is: real production interactions. Pull them in and run the same annotation and checks on actual user inputs and AI responses.

If you're pre-launch, use AI to generate synthetic variations of your manually validated cases. Point it deliberately at the gaps: edge cases you haven't covered, adversarial inputs, missing and corrupt data, users pushing the system somewhere it shouldn't go. Synthetic data built on a foundation of real, hand-checked cases is how teams test thoroughly before they have a single user.
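One way to point generation at a specific gap is to build the request from a validated case plus a named gap type. The sketch below only assembles the prompt text. The gap names, the wording, and the `build_variation_prompt` helper are assumptions, and sending the prompt to a model is left to whichever client you already use.

```python
# Hypothetical gap types, described in the same terms as the test-set mix.
GAPS = {
    "edge": "incomplete data, ambiguous requests, or unusual but valid inputs",
    "adversarial": "attempts to abuse the system, conflicting instructions, or typos",
    "corrupt": "missing fields or corrupted values",
}


def build_variation_prompt(seed_input: str, gap: str, n: int = 5) -> str:
    """Build a generation prompt asking for n variations of one validated case."""
    if gap not in GAPS:
        raise ValueError(f"unknown gap {gap!r}; expected one of {sorted(GAPS)}")
    return (
        "Here is a real, manually validated input to our feature:\n"
        f"---\n{seed_input}\n---\n"
        f"Write {n} new inputs that keep the same scenario but introduce "
        f"{GAPS[gap]}. Return one input per line, with no commentary."
    )


prompt = build_variation_prompt(
    "Client has 3 active users and no billing setup", "corrupt", n=3
)
print(prompt)
```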

When do you stop? The saturation point

You stop expanding manual review when it stops teaching you anything new.

The signal is simple: as long as reading new responses keeps exposing new failure patterns, keep going. Somewhere around 100–200 carefully reviewed responses, most teams hit saturation, where the next batch of responses contains only failures you've already named. That's the finish line for discovery, not for evaluation. Your checks keep running forever; it's the manual pattern-hunting that winds down.
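The saturation signal can be tracked rather than felt. A minimal sketch, assuming you record which failure modes showed up in each review batch. Treating two consecutive batches with nothing new as saturation is an assumption chosen for illustration, not a rule from our data.

```python
def new_modes_per_batch(batches):
    """For each review batch, count failure modes not seen in any earlier batch."""
    seen = set()
    counts = []
    for batch in batches:
        fresh = set(batch) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def is_saturated(batches, quiet_batches=2):
    """True when the last `quiet_batches` batches revealed no new failure mode."""
    counts = new_modes_per_batch(batches)
    return len(counts) >= quiet_batches and all(c == 0 for c in counts[-quiet_batches:])


# Hypothetical review history: failure modes observed in four batches.
batches = [
    ["superficial insight", "format violation"],
    ["format violation", "premature recommendation"],
    ["superficial insight"],
    ["premature recommendation", "format violation"],
]
print(new_modes_per_batch(batches))
print(is_saturated(batches))
```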

Then comes the payoff. Rank your failure modes by impact times frequency: how much damage each one does and how often it shows up. Fix the top ones first, in the prompt, the architecture, the model choice, wherever the fix lives. And because every failure mode now has a metric attached, you'll know whether each fix actually moved the number or just moved the problem. That's the difference between iterating and guessing.
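The ranking itself is simple arithmetic. In the sketch below, impact is scored on an assumed 1 to 5 scale and frequency is failures per 100 reviewed responses. Both scales and all the numbers are hypothetical; what matters is that every failure mode is scored the same way.

```python
def rank_failure_modes(modes):
    """modes: {name: (impact, frequency)}. Return [(name, score)], highest first."""
    scored = [(name, impact * freq) for name, (impact, freq) in modes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical scores: (impact on a 1-5 scale, failures per 100 responses).
modes = {
    "format violation": (2, 30),
    "premature recommendation": (4, 22),
    "missed critical setup priority": (5, 8),
}
for name, score in rank_failure_modes(modes):
    print(f"{score:4d}  {name}")
```

In this made-up example the most frequent failure is not the top priority, and neither is the most damaging one. The product of the two puts "premature recommendation" first.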

The loop, not the finish line

Moving past vibe checks isn't a single leap to automation. It's a sequence where each step earns the next: a real test set makes multi-model comparison meaningful, comparison makes quality articulable, centralized annotations make patterns visible, patterns become failure modes, failure modes become automated checks, and checks turn every future iteration into a measured experiment instead of a feeling.

The teams that do this go from "yeah, looks good" to "92% pass rate on our 14 failure modes, and the new prompt fixed the top two." Same feature. Completely different conversation with leadership, with engineering, and with yourself at 11pm before a release.

Lovelaice is built for exactly this workflow: test cases, multi-model runs, centralized annotations, failure-mode metrics, and side-by-side iteration history in one place. If you're somewhere between five happy cases and your first automated eval, that's the gap we close.

FAQ

How many test cases do I need to evaluate an AI feature? Start with 5–10 manual test cases built from real scenarios, and grow to 20–30 with a mix of roughly 40–50% standard cases, 30–40% edge cases, and 10–20% adversarial cases. For discovering failure patterns, plan to review 100–200 responses before reaching saturation.

What's the difference between a vibe check and an eval? A vibe check is informal manual testing judged by feel, with no defined criteria and no record. An eval is a repeatable test suite that measures AI output quality against your own definition of good, one automated check per known failure mode, producing scores you can track across iterations.

Who should evaluate AI quality, engineers or product teams? It depends on the domain you're building in. Domain experts should be the ones to evaluate quality. If you build a coding agent, software engineers are the experts. In other domains, it's product managers and field experts. Engineers build the pipeline that runs the checks, but judging whether an answer matches what an expert would write requires domain knowledge. One team that moved evaluation ownership to its domain experts went from below 40% to over 90% accuracy in three weeks.

When should I use LLM-as-a-judge? Only after manual error analysis, and only for failures that require interpretation, like shallow insights or premature recommendations. Use deterministic code checks first; they catch failures at zero cost. Validate every judge against human labels and require at least 80% agreement before trusting it.

How do I test an AI feature without production data? Build 5–10 test cases manually from real scenarios you know, validate them by hand, then use AI to generate synthetic variations that cover edge cases and adversarial inputs. Real production data can replace and extend the synthetic set once the feature is live.