How to Run an AI Experiment, Step by Step

Written by Madalina Turlea
15 Jan 2026
When you build a feature with AI, guessing whether it works is not enough. You run an experiment. An experiment is a configuration that combines a prompt, a set of test cases (or use cases), and several LLMs. Every combination runs in parallel, and then you look at the results. Here is the full loop, from idea to evaluation.
Step one: from idea to prompt
You start by writing the prompt, the set of instructions the model will follow. The usual sections are a role, a bit of context about how it will be used, the task and objectives described as clearly as you can, some examples depending on the task, a section for handling edge cases or inputs you do not expect, and the output format.
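As a rough sketch, those sections can be kept as separate fields and joined into one prompt, which makes it easier to change one section between versions. The class name, field names, and the example feature (summarizing a note) are all assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt sections listed above."""
    role: str
    context: str
    task: str
    examples: list[str] = field(default_factory=list)
    edge_cases: str = ""
    output_format: str = ""

    def render(self) -> str:
        # Pair each section heading with its body, then skip empty sections.
        sections = [
            ("Role", self.role),
            ("Context", self.context),
            ("Task", self.task),
            ("Examples", "\n".join(f"- {e}" for e in self.examples)),
            ("Edge cases", self.edge_cases),
            ("Output format", self.output_format),
        ]
        return "\n\n".join(f"{name}:\n{body}" for name, body in sections if body)


# Illustrative first version; the wording is made up for this sketch.
prompt_v1 = PromptSpec(
    role="You are an experienced teacher.",
    context="Your summaries are shown to students inside a study app.",
    task="Summarize the student's note in three bullet points.",
    edge_cases="If the note is empty or off-topic, say so instead of guessing.",
    output_format="Plain text, at most 60 words.",
)
print(prompt_v1.render())
```

Because the examples section is empty in this first version, it is simply left out of the rendered prompt. You can add examples in a later version once you have seen how the model behaves.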
You do not have to get this perfect on the first try. The process is about experimenting. You write one version, describing the problem as well as you can at that moment, run it, and refine it once you see how it behaves. Naming the role well matters: pointing the model at the right domain, such as a professional trainer or an experienced teacher, can steer it toward what it learned about that domain during training.
Step two: build the test cases
The test data should replicate how a user would actually use the feature. Start with three to five cases, each one a realistic input. Save each one as a test case.
This is also where you try to make the feature break. Alongside the obvious happy-path cases, add edge cases: an input in another language, a note that mixes two things, a recording that captured something unrelated, or text with typos. You cannot control what users send, so the test set should cover as many of those situations as possible.
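A small test set along these lines might look like the sketch below. The `Case` class, the `kind` labels, and the sample inputs are invented for illustration; real inputs should come from how your users actually write or speak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical test case: a name, a realistic input, and a label."""
    name: str
    input_text: str
    kind: str  # "happy_path" or "edge_case"


test_cases = [
    Case("plain note", "Met with the supplier, agreed on delivery by Friday.", "happy_path"),
    Case("short note", "Invoice sent to the client.", "happy_path"),
    Case("other language", "Reunión con el proveedor, entrega el viernes.", "edge_case"),
    Case("two topics mixed", "Delivery moved to Friday. Also, buy a birthday cake.", "edge_case"),
    Case("typos", "Mett wiht suplier, delivry on Firday.", "edge_case"),
]

# Count how much of the set is trying to break the feature.
edge = [c for c in test_cases if c.kind == "edge_case"]
print(f"{len(test_cases)} cases, {len(edge)} edge cases")
```

Labeling each case makes it easy to see later whether a model fails only on the edge cases or on the ordinary inputs too.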
Step three: select the models
Do not pick one model up front. Select across three families, and within each, a higher-level and a lower-level model. A higher-level model is typically a newer iteration that does much more reasoning and is also much more expensive. A lower-level model is cheaper and usually has a smaller context window. Running a higher and a lower model from each family lets you see which one actually understands your problem.
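One way to write that selection down is a small mapping from family to tier. The family and model names below are placeholders, not real model identifiers; substitute the ones your providers actually expose.

```python
# Placeholder names only; these are not real model identifiers.
MODEL_FAMILIES = {
    "family_a": {"higher": "family-a-large", "lower": "family-a-small"},
    "family_b": {"higher": "family-b-large", "lower": "family-b-small"},
    "family_c": {"higher": "family-c-large", "lower": "family-c-small"},
}


def select_models(families: dict[str, dict[str, str]]) -> list[str]:
    """Flatten the mapping into one list, higher tier first within each family."""
    return [
        model
        for tiers in families.values()
        for model in (tiers["higher"], tiers["lower"])
    ]


models = select_models(MODEL_FAMILIES)
print(models)
```

Three families with two tiers each gives up to six models. You can drop one if cost or availability makes it impractical.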
Step four: run and review
The prompt, the test cases, and the models run together in parallel. With one prompt version, five test cases, and five models, that is twenty-five calls to the AI at once.
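The arithmetic is just the product of the three lists. The sketch below builds every combination and runs them concurrently. `call_model` is a stand-in that only echoes its inputs, and the names in the lists are placeholders. In a real run it would call your provider's API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def call_model(model: str, prompt: str, case: str) -> str:
    """Stand-in for a real API call; it only echoes its inputs."""
    return f"[{model}] answer to: {case}"


prompt_versions = ["prompt v1"]
cases = [f"case {i}" for i in range(1, 6)]
models = [f"model-{i}" for i in range(1, 6)]

# Every (prompt, case, model) combination is one call: 1 * 5 * 5 = 25.
jobs = list(product(prompt_versions, cases, models))

# Threads suit this workload because real calls spend most of their time waiting on the network.
with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(lambda job: call_model(job[2], job[0], job[1]), jobs))

# Key each answer by its combination so it can be reviewed side by side later.
results = dict(zip(jobs, answers))
print(len(results))
```

Adding a second prompt version doubles the count to fifty calls, which is worth keeping in mind before you grow the test set.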
Then you review. The results view shows the different models' answers to the same question, with the same instructions, side by side. Going through these answers is one of the most impactful moments of building with AI. For a feature that generates text rather than a right-or-wrong answer, you are not just checking correctness; you are deciding what you like: the formatting, the tone of voice, the language, whether you want emojis, lists, shorter or longer paragraphs. These are things you often have not thought about until you see the answers next to each other.
You annotate each one with pass or fail, using the quick tags or your own free-text notes. If an answer stops mid-sentence, that is a sign it ran out of output tokens. This tends to happen with the more expensive models, because they spend many tokens on reasoning before they answer.
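A minimal sketch of an annotation record, plus a rough check for answers that stop mid-sentence, might look like this. The `Annotation` fields and the punctuation heuristic are assumptions for illustration. The heuristic will misfire on answers that legitimately end without punctuation, such as a bare list.

```python
from dataclasses import dataclass, field

# Characters that plausibly end a finished answer (an assumption, not a rule).
CLOSERS = (".", "!", "?", ")", '"', "'")


@dataclass
class Annotation:
    """Hypothetical review record for one answer."""
    model: str
    case: str
    passed: bool
    tags: list[str] = field(default_factory=list)
    note: str = ""


def looks_truncated(answer: str) -> bool:
    """Rough heuristic: non-empty text that does not end in closing punctuation."""
    text = answer.rstrip()
    return bool(text) and not text.endswith(CLOSERS)


answer = "The supplier agreed to deliver the order on"
review = Annotation(
    model="model-1",
    case="plain note",
    passed=not looks_truncated(answer),
    tags=["truncated"] if looks_truncated(answer) else [],
    note="Stops mid-sentence; probably hit the output token limit.",
)
print(review)
```

Many provider APIs also report why generation stopped. Where that field is available, it is a more reliable signal than inspecting the text.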
Step five: error analysis and evaluation
Reviewing by hand is interesting, but it is time-consuming. You can do it for five or ten test cases, but to release something with production quality you need to test on hundreds, or in critical cases like legal or finance, thousands. So the next step is to turn your notes into an error analysis and an evaluation framework, so you can scale the testing without reviewing every answer manually.
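The first pass of an error analysis can be as simple as counting which failure tags come up most and how often each model passes. The review data below is invented for illustration; in practice it would come from your own annotations.

```python
from collections import Counter

# Hypothetical review notes: (model, passed, tags) from a manual review.
reviews = [
    ("model-a", True, []),
    ("model-a", False, ["truncated"]),
    ("model-b", False, ["wrong language"]),
    ("model-b", False, ["truncated", "too long"]),
    ("model-c", True, []),
]

# Count how often each tag appears on failed answers.
failure_tags = Counter(
    tag for _, passed, tags in reviews if not passed for tag in tags
)

# Pass rate per model: share of that model's answers marked as passed.
pass_rate = {}
for model in sorted({m for m, _, _ in reviews}):
    outcomes = [passed for m, passed, _ in reviews if m == model]
    pass_rate[model] = sum(outcomes) / len(outcomes)

for tag, count in failure_tags.most_common():
    print(f"{tag}: {count}")
print(pass_rate)
```

The most frequent tags are natural candidates for the first automated checks, since they describe failures you have already seen rather than generic ones.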
It is still important to do the manual review first. Comparing the answers, deciding which you prefer and what is more correct, is what gives you a good basis to build evaluations that test for something meaningful instead of generic checks. Do not expect this to be a thirty-minute exercise. It usually takes a few hours to digest the answers and leave good notes, and that work is what the rest of the experiment is built on.