How to Run an AI Experiment, Step by Step

By Madalina Turlea

15 Jan 2026

When you build a feature with AI, guessing whether it works is not enough. You run an experiment. An experiment is a configuration that combines a prompt, a set of test cases (or use cases), and several LLMs. They all run in parallel, and then you compare the results. Here is the full loop, from idea to evaluation.

Step one: from idea to prompt

You start by writing the prompt, the set of instructions the model will follow. The usual sections are a role, a bit of context about how it will be used, the task and objectives described as clearly as you can, some examples depending on the task, a section for handling edge cases or inputs you do not expect, and the output format.
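The sections listed above can be kept as separate fields and assembled into one prompt, which makes it easier to change a single section between versions. This is a minimal sketch, not a required format: the class name `PromptSpec`, the field names, and the section labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt sections described above."""
    role: str
    context: str
    task: str
    examples: list[str] = field(default_factory=list)
    edge_cases: str = ""
    output_format: str = ""

    def render(self) -> str:
        # Assemble the sections in a fixed order, skipping any left empty.
        sections = [
            ("Role", self.role),
            ("Context", self.context),
            ("Task", self.task),
            ("Examples", "\n".join(f"- {e}" for e in self.examples)),
            ("Edge cases", self.edge_cases),
            ("Output format", self.output_format),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections if body)
```

A first version might fill in only the role, context, and task, and add examples or edge-case handling after the first run shows where the model struggles.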

You do not have to get this perfect on the first try. The process is about experimenting. You write one version, the best description of the problem you can manage at that moment, run it, and refine it once you see how it behaves. Naming the role well matters: pointing the model at the right domain, like a professional trainer or an experienced teacher, can steer it toward what it learned about that domain during training.

Step two: build the test cases

The test data should replicate how a user would actually use the feature. Start with three to five cases, each one a realistic input. Save each one as a test case.

This is also where you try to make the feature break. Alongside the obvious happy-path cases, add edge cases: an input in another language, a note that mixes two things, a recording that captured something unrelated, or text with typos. You cannot control what users send, so the test set should cover as many of those situations as possible.
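A small test set mixing happy-path and edge cases could be stored like this. It is only a sketch: the `ExperimentCase` class, the `kind` labels, and every input string are invented placeholders, so replace them with inputs that match your own feature.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentCase:
    """Hypothetical record for one saved test case."""
    name: str
    user_input: str
    kind: str  # "happy_path" or "edge_case"


# Placeholder inputs, invented for illustration only.
TEST_CASES = [
    ExperimentCase("typical_note", "Met with the design team, agreed on the new layout.", "happy_path"),
    ExperimentCase("short_note", "Call the supplier tomorrow.", "happy_path"),
    ExperimentCase("other_language", "Reunión con el equipo, revisar el presupuesto.", "edge_case"),
    ExperimentCase("mixed_topics", "Budget review at 3pm, also buy milk and call mom.", "edge_case"),
    ExperimentCase("typos", "Metting notse: shpi the relase by fridya.", "edge_case"),
]


def count_by_kind(cases):
    # Quick sanity check that the set is not all happy-path cases.
    return dict(Counter(case.kind for case in cases))
```

Calling `count_by_kind(TEST_CASES)` gives a quick view of how many cases of each kind the set holds.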

Step three: select the models

Do not pick one model up front. Select across three families, and within each, a higher-level and a lower-level model. A higher-level model is typically a newer iteration with much stronger reasoning, and it is also much more expensive. A lower-level model is cheaper and often has a smaller context window. Running a higher and a lower model from each family lets you see which one actually understands your problem.
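The selection rule above, one higher-level and one lower-level model from each family, can be sketched as a filter over a catalog. The family names, model names, and tier labels below are made up for illustration and do not refer to real models.

```python
# Hypothetical catalog; names and tiers are placeholders, not real models.
MODEL_CATALOG = [
    {"family": "family_a", "name": "a-large", "tier": "higher"},
    {"family": "family_a", "name": "a-medium", "tier": "mid"},
    {"family": "family_a", "name": "a-small", "tier": "lower"},
    {"family": "family_b", "name": "b-large", "tier": "higher"},
    {"family": "family_b", "name": "b-small", "tier": "lower"},
    {"family": "family_c", "name": "c-large", "tier": "higher"},
    {"family": "family_c", "name": "c-small", "tier": "lower"},
]


def select_models(catalog):
    """Pick the first higher-tier and first lower-tier model of each family."""
    picked = {}
    for entry in catalog:
        key = (entry["family"], entry["tier"])
        if entry["tier"] in ("higher", "lower") and key not in picked:
            picked[key] = entry["name"]
    return list(picked.values())
```

With three families in the catalog, this yields six models for the experiment.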

Step four: run and review

The prompt, the test cases, and the models run together in parallel. With one prompt version, five test cases, and five models, that is twenty-five calls to the AI at once.
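The arithmetic is the product of the three dimensions: 1 prompt × 5 test cases × 5 models = 25 calls. A minimal sketch of running that grid in parallel follows. `call_model` is a stand-in that returns a canned string, so swap in your provider's client there; the function names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def call_model(model, prompt, case):
    # Placeholder: a real implementation would call an LLM API here.
    return f"[{model}] reply to: {case}"


def run_experiment(prompt, cases, models, max_workers=8):
    """Run every (model, case) pair for one prompt version in parallel."""
    grid = list(product(models, cases))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns outputs in the same order as the grid.
        outputs = list(pool.map(lambda pair: call_model(pair[0], prompt, pair[1]), grid))
    return [
        {"model": model, "case": case, "output": output}
        for (model, case), output in zip(grid, outputs)
    ]
```

Threads suit this workload because each call mostly waits on the network. Adding a second prompt version would double the grid to fifty calls.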

Then you review. In the results view, each question appears with the different models' answers side by side, all produced from the same instructions. Going through these answers is one of the most impactful moments of building with AI. For a feature that generates text rather than a right-or-wrong answer, you are not just checking correctness, you are deciding what you like: the formatting, the tone of voice, the language, whether you want emojis, lists, shorter or longer paragraphs. These are things you often have not thought about until you see the answers next to each other.

You annotate each one with pass or fail, using the quick tags or your own free-text notes. If an answer stops mid-sentence, that is a sign it hit the output token limit. This tends to happen with the more expensive models, because they spend many tokens on reasoning before they write the answer.
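An annotation can be stored as a small record, and a rough text check can pre-flag answers that look cut off. Both are sketches: the `Annotation` fields are assumptions, and `looks_truncated` is a crude heuristic that will also flag complete answers ending without punctuation, such as a bullet list. If your provider reports why generation stopped, prefer that signal.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical review record for one model answer."""
    model: str
    case: str
    passed: bool
    tags: list[str] = field(default_factory=list)
    note: str = ""


def looks_truncated(text: str) -> bool:
    """Heuristic: treat an answer as cut off if it does not end on
    sentence-closing punctuation or a closing quote or bracket."""
    stripped = text.rstrip()
    if not stripped:
        return True
    return stripped[-1] not in ".!?\"')]}"
```

For example, `looks_truncated("The next step is to")` flags the answer, while `looks_truncated("Done.")` does not.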

Step five: error analysis and evaluation

Reviewing by hand is interesting, but it is time-consuming. You can do it for five or ten test cases, but to release something of production quality you need to test on hundreds of cases, or in critical domains like legal or finance, thousands. So the next step is to turn your notes into an error analysis and an evaluation framework, so you can scale the testing without reviewing every answer manually.
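A first pass at error analysis can be as simple as counting your own annotations: pass rate per model, and which failure tags come up most often. The sketch below assumes each annotation is a dict with `model`, `passed`, and `tags` keys. That shape is an assumption, not a fixed format.

```python
from collections import Counter, defaultdict


def pass_rate_by_model(annotations):
    """Return the share of passing answers for each model."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for ann in annotations:
        totals[ann["model"]] += 1
        if ann["passed"]:
            passes[ann["model"]] += 1
    return {model: passes[model] / totals[model] for model in totals}


def top_failure_tags(annotations, n=3):
    """Return the n most frequent tags among failed answers."""
    counts = Counter(
        tag
        for ann in annotations
        if not ann["passed"]
        for tag in ann["tags"]
    )
    return counts.most_common(n)
```

The most frequent failure tags are natural candidates for the first automated checks.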

It is still important to do the manual review first. Comparing the answers, deciding which you prefer and what is more correct, is what gives you a good basis to build evaluations that test for something meaningful instead of generic checks. Do not expect this to be a thirty-minute exercise. It usually takes a few hours to digest the answers and leave good notes, and that work is what the rest of the experiment is built on.
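Once the notes show what you care about, some preferences can become automated checks. The sketch below assumes two preferences that might come out of a review, no emojis and a length cap. The check names, the 120-word limit, and the emoji ranges are illustrative choices, and the emoji test is a rough heuristic covering only common Unicode blocks.

```python
def has_emoji(text: str) -> bool:
    # Rough heuristic: common emoji and symbol code point ranges only.
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in text
    )


# Hypothetical checks derived from review notes; adjust to your own preferences.
CHECKS = {
    "no_emoji": lambda output: not has_emoji(output),
    "max_120_words": lambda output: len(output.split()) <= 120,
}


def evaluate(output: str) -> dict:
    """Run every check on one model answer and report pass or fail per check."""
    return {name: check(output) for name, check in CHECKS.items()}
```

Checks like these only cover preferences that can be stated as a rule. Judgments such as tone usually need a different kind of evaluation.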