Why "ship and learn" doesn't work for AI

Written by Madalina Turlea
03 Jan 2026
The AI MVP Trap
Why "ship fast and optimize later" works everywhere except AI and what to do instead
I've been in product management for over 10 years and have probably made most of the product mistakes there are to make.
The first one was the classic: build in isolation until you think the product is feature-complete, then show it to customers. I learned that lesson the expensive way in traditional SaaS. By the time we got customer feedback, refactoring cost more than rebuilding from scratch.
So I learned. I joined the MVP train.
Build the minimum viable product. Prove the value prop with the smallest possible thing. Don't waste engineering resources on unvalidated hypotheses. Ship fast, learn fast, iterate.
And it worked. For years, this worked. In fintech. In payments. In every deterministic product I built.
Then came the AI era.
And suddenly, everyone's adopting the same playbook: "Let's ship something, deploy it, and see how customers use it. See what they ask. See how the AI behaves."
This sounds like good product practice. It feels like the MVP approach. Fast. Lean. Customer-focused.
But here's what I've learned after building AI products and working with teams across fintech, healthtech, and beyond:
This approach can silently kill your product.
The illusion of success
Here's the thing about AI that nobody tells you:
AI always returns something.
In traditional software, your product fails visibly when it's broken.
Users click a button and get an error. The page doesn't load. The transaction fails. They complain. You see it in your logs. The feedback loop is immediate and obvious.
You fix it. You iterate. You learn.
But AI doesn't work this way.
It doesn't crash. It doesn't throw errors. It doesn't fail visibly.
It just... responds. Confidently. Whether it's right or completely wrong.
And this changes everything.
When you ship an AI feature that's subtly wrong, users don't complain about errors. They don't file bug reports. They just quietly stop trusting your product. They disengage. They churn. And you have no idea why.
I recently experienced this first-hand. After a long stretch of inactivity, I came back to Miro to build a brainstorming dashboard and noticed a big chat box in the middle of the board. My immediate assumption was that I could describe what I wanted to build and it would put the different frames together for me. The reality: it just reformatted the session description I pasted into the chat into a single frame. Frustrated and disappointed, I closed the board and went back to FigJam.
That's the trap.
The Cost Optimization Paradox
Now here's where the advice gets even more dangerous.
I see this all over LinkedIn and in product circles: "Don't optimize costs early. Build first. Solve for retention and adoption. Then optimize."
This is gospel in traditional SaaS, and for good reason. In traditional software, the incremental cost of serving one more user is effectively zero. Your power users, the ones who use your product the most, are your most profitable customers.
The economics work in your favor. You can focus on building the right thing first, then optimize your infrastructure when you reach massive scale.
I spent over 10 years in fintech. Financial services is data-heavy, process-heavy, and highly regulated. I've led architecture refactorings, cost optimizations, and system redesigns. But these were only priorities when we reached incredible scale or were planning to.
AI breaks this model completely.
In AI products, there is an incremental cost with every single request. The more users use your AI feature, the higher your costs to support them. Your power users, the ones who love your product most, are now the ones who are least profitable.
Let me make this concrete.
The CEO of Lovable shared a story just a few months back: a user vibecoded on their platform for 30 hours straight. At only 5 user interactions per hour and an average cost of $0.70 per request, the AI cost for that single session comes to roughly $105, more than five times the $20 monthly subscription price, burned in under two days.
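The math is worth running yourself before you ship anything. A back-of-the-envelope sketch in a few lines, using the anecdote's figures (rough numbers for illustration, not Lovable's actual unit economics):

```python
# Back-of-the-envelope session economics. The numbers are the ones from
# the anecdote above, not measured data.
hours = 30                   # length of the session
requests_per_hour = 5        # user interactions per hour
cost_per_request = 0.70      # average AI cost per request, in dollars
subscription_price = 20.00   # monthly subscription, in dollars

session_cost = hours * requests_per_hour * cost_per_request
print(f"AI cost for one session: ${session_cost:.2f}")                        # $105.00
print(f"Months of revenue burned: {session_cost / subscription_price:.1f}")   # 5.2
```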
This isn't a hypothetical concern. This is happening right now to teams who followed the "optimize later" advice.
The Two Types of "Cost Optimization"
Here's where the advice gets confusing.
When I talk about testing models and costs before shipping, people hear "premature optimization." They think I'm saying: spend weeks comparing API pricing sheets, obsessing over whether you should fine-tune an open-source model or go with a frontier one.
That's not what I'm saying at all.
There are two completely different activities that get called "cost optimization" and only one of them is premature.
Type 1: Paper Optimization (This Is Premature)
You open pricing pages for OpenAI, Anthropic, Gemini.
You compare: GPT-5 costs X per million tokens. Claude Opus 4.5 costs Y. Gemini 3 Pro costs Z.
You make a spreadsheet. You pick the cheapest one.
You've never run your actual prompt. You've never tested with your data. You've never seen how any of these models actually perform for your specific use case.
This is premature optimization. And it's almost worthless.
Why? Because you're optimizing a variable you don't understand yet. You don't know which model will actually work for your use case. You don't know if a "cheaper" model will cost you more because it requires 3x the requests to get acceptable results.
You're making decisions in a vacuum.
Type 2: Hands-On Experimentation (This Is Discovery)
You write your actual prompt.
You create 10-20 test cases, the real scenarios your AI will face, including the messy edge cases.
You run these test cases across 5-10 different models.
You evaluate the results blindly, without knowing which model produced which output.
And you discover things you could never learn from a pricing page:
- An older model outperforms the newest one on your specific task
- A "weaker" model with a better-structured prompt beats a "stronger" model with a generic prompt
- The cost difference between your options is 20x, for identical accuracy
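To make the "hands-on" part concrete: the whole experiment is just a grid of prompt variants x models x test cases, small enough to enumerate in a few lines. A minimal sketch with placeholder names:

```python
# The experiment is a grid: prompt variants x models x test cases.
# Names and counts below are placeholders; swap in whatever you're testing.
from itertools import product

prompt_variants = ["baseline", "structured_v2"]
models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
test_cases = [f"case_{i:02d}" for i in range(1, 16)]  # 15 scenarios, edge cases included

grid = list(product(prompt_variants, models, test_cases))
print(f"{len(grid)} responses to generate and evaluate")  # 2 * 5 * 15 = 150
```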
This isn't premature optimization.
This is discovery.
This is how you learn what actually works before you commit engineering resources to building it.
The distinction matters because in traditional software, you optimize after you've proven something works. In AI, experimentation is how you prove it works in the first place.
"But we'll set up traces and run evals in production"
The common advice I hear is: "Deploy your AI feature, set up traces, evaluate production responses, and see where the AI fails."
This sounds reasonable. It's what we do with traditional software: monitor in production, respond to issues, iterate.
But here's what this approach misses:
You can uncover 70-80% of your AI's failure modes before you deploy it to users.
Before you erode their trust.
Before you burn through your runway on avoidable costs.
All while having a clear view of exactly how much it will cost you at scale.
You don't need to learn these lessons from production. You can learn them in hours, not weeks, through systematic experimentation on your own data.
You might think that higher costs are acceptable as a cost of learning. And that would be true, if production was the only way to learn how AI fails.
But there's a better way.
From MVP to Minimum Validated Product
So if "ship and learn" doesn't work for AI, what does?
I call it the Minimum Validated Product approach.
Not Minimum Viable Product. Minimum Validated Product.
The difference is when and how you learn.
In traditional MVP thinking:
- Build the smallest thing
- Ship it to users
- Learn from their behavior
- Iterate based on feedback
In Minimum Validated Product thinking:
- Define what good looks like for your use case
- Build test cases that represent real usage (including edge cases)
- Experiment with different prompting approaches across multiple AI models, and track performance
- Make a data-backed decision about your AI deployment
- Monitor and refine based on production patterns you couldn't test
The key insight: You can uncover 70-80% of AI failure patterns through systematic experimentation.
What changes when you experiment first
When you validate AI features before deployment, you discover things you'd never learn from users:
- You learn that an older, cheaper model might outperform the frontier model for your specific use case, at 1/20th the cost.
- You learn which exact prompts break on edge cases that represent 30% of your real users.
- You learn that your cost per request could be €0.005 instead of €0.05 with identical quality.
- You learn all of this in a few hours of experimentation, not months of expensive production learning.
This isn't about being cautious. It's not about slowing down.
It's about learning faster and in the right order.
In traditional software, you optimize after you've proven the feature works. In AI, you need to prove it works and understand the cost structure before you ship.
Not because you're optimizing prematurely.
Because experimentation before deployment is how you discover what actually works in the first place. This is product discovery at its best.
The Systematic Experimentation Framework
Here's how you validate an AI feature before shipping:
Step 1: Define Success
What does "works" mean for your specific feature?
For a customer research synthesis tool: Quotes must be real, themes must be accurate.
For a product recommendation feature: Suggestions must match user preferences, handle edge cases (conflicting requirements), return results in the expected format.
For a medical test interpreter: Units must be correct, ranges must be appropriate, missing data must be handled safely.
Be specific.
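One way to force yourself to be specific is to write "works" down as checks you could actually run. A minimal sketch for a hypothetical product recommendation feature; the field names and checks are illustrative assumptions, not a standard:

```python
# "What good looks like", written as checks you can run against a response.
# The fields below are assumptions for a hypothetical recommendation feature.
SUCCESS_CRITERIA = {
    "matches_preferences": lambda resp, case:
        set(case["must_have"]) <= set(resp.get("features", [])),
    "expected_format": lambda resp, case:
        {"products", "reasoning"} <= resp.keys(),
    "handles_missing_data": lambda resp, case:
        not case["fields_missing"] or resp.get("asked_clarification", False),
}

def passes(resp: dict, case: dict) -> bool:
    """A response is adequate only if it satisfies every criterion."""
    return all(check(resp, case) for check in SUCCESS_CRITERIA.values())
```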
Step 2: Build Your Test Dataset
Start with 10-20 test cases.
Don't just test happy paths. Those are the easy ones. Spend 70% of your effort on edge cases:
Happy paths:
- Clear, complete information
- Obvious correct answer
- Perfect formatting
Edge cases (where AI actually breaks):
- Conflicting requirements ("I want it fast AND heavily cushioned")
- Missing information ("user left some fields blank")
- Ambiguous requests ("I want something good")
- Unusual combinations
- Typos and real-world messiness
This is where you'll discover your AI's actual failure modes.
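In practice the dataset can be as simple as a list of structured cases, weighted toward edge cases. A sketch that reuses the field names from the Step 1 example (the cases are invented, not real user data):

```python
# A small test dataset: a couple of happy paths, most effort on edge cases.
# Invented examples for a hypothetical running-shoe recommendation feature.
TEST_CASES = [
    # Happy path: clear, complete information with an obvious answer
    {"id": "happy-01",
     "input": "I run 40 km a week on roads and want a durable daily trainer.",
     "must_have": ["road", "durable"], "fields_missing": [],
     "tags": ["happy_path"]},

    # Conflicting requirements
    {"id": "edge-01",
     "input": "I want the lightest shoe possible AND maximum cushioning.",
     "must_have": ["lightweight", "cushioned"], "fields_missing": [],
     "tags": ["edge_case", "conflicting_requirements"]},

    # Missing information
    {"id": "edge-02",
     "input": "I need new shoes.",
     "must_have": [], "fields_missing": ["distance", "terrain"],
     "tags": ["edge_case", "missing_information"]},

    # Typos and real-world messiness
    {"id": "edge-03",
     "input": "somethin good for trailz, i dunno, under 100??",
     "must_have": ["trail"], "fields_missing": ["distance"],
     "tags": ["edge_case", "messy_input"]},
]
```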
Step 3: Run Multi-Model Experiments
Test your prompt across 5-10 models. Experiment with multiple prompting techniques.
Don't assume you know which model is "best." Let the data tell you.
If you test 2 prompt variations across 5 test cases with 4 models, you get 40 responses to evaluate. This takes a few hours, not weeks.
Critical: Evaluate blindly. Don't look at which model produced which output while evaluating. This eliminates bias. You might discover your assumptions about "frontier models" are completely wrong.
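A minimal harness might look like the sketch below. `call_model` is a placeholder, not a real SDK function; wire it to whichever providers you're testing. The shuffle and the stripped-down export are what keep the evaluation blind:

```python
# A minimal multi-model experiment harness. call_model() is a placeholder,
# not a real SDK call; replace its body with your provider's API client.
import random
import uuid

def call_model(model: str, prompt: str, case_input: str) -> str:
    """Placeholder: swap in your provider's chat/completion call."""
    return f"[stub response from {model}]"

def run_experiments(prompts: dict[str, str], models: list[str], cases: list[dict]) -> list[dict]:
    results = []
    for prompt_name, prompt_text in prompts.items():
        for model in models:
            for case in cases:
                results.append({
                    "response_id": str(uuid.uuid4()),  # the only id the evaluator sees
                    "prompt": prompt_name,             # hidden during evaluation
                    "model": model,                    # hidden during evaluation
                    "case_id": case["id"],
                    "output": call_model(model, prompt_text, case["input"]),
                })
    random.shuffle(results)  # so ordering doesn't hint at which model answered
    return results

# For the blind pass, export only what the evaluator needs:
# [{k: r[k] for k in ("response_id", "case_id", "output")} for r in results]
```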
Step 4: Evaluate and Document
For each inadequate response, document why it failed:
- "Ignored the format instructions"
- "Recommended the wrong product because it prioritized speed over cushioning"
- "Hallucinated a feature that doesn't exist"
- "Couldn't handle missing data, gave a generic recommendation instead of asking for clarification"
This is where you learn the failure patterns.
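A lightweight way to make the patterns visible is to tag each failed response with a reason and count the tags. A sketch with invented records:

```python
# Tag every inadequate response with a failure reason, then count the patterns.
# The records and tags below are invented examples.
from collections import Counter

evaluations = [
    {"response_id": "a1", "passed": False, "failure_reason": "ignored_format_instructions"},
    {"response_id": "b2", "passed": True,  "failure_reason": None},
    {"response_id": "c3", "passed": False, "failure_reason": "hallucinated_feature"},
    {"response_id": "d4", "passed": False, "failure_reason": "ignored_conflicting_requirements"},
    # ... one record per evaluated response
]

failure_patterns = Counter(e["failure_reason"] for e in evaluations if not e["passed"])
for reason, count in failure_patterns.most_common():
    print(f"{count:>3}  {reason}")
```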
Step 5: Iterate Your Prompt
Look at your failure patterns. Adapt your prompt instructions to better account for them.
If 5 responses failed because "the AI ignored conflicting requirements," add explicit instructions: "When users have conflicting requirements, prioritize X over Y, and explain the tradeoff."
If 3 responses failed because "the AI hallucinated features," add: "Only recommend features that are explicitly listed in the product catalog. If unsure, say 'I don't have enough information to recommend this feature.'"
Run the experiments again with your improved prompt.
Step 6: Establish Your Baseline
Keep iterating until you hit your success threshold.
For most use cases, you want 90%+ accuracy before considering deployment.
Why 90%? Because the remaining 10% are usually edge cases you can't anticipate until you have real production data. But 90% means you've caught the systematic failures, the predictable ways your AI breaks.
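Concretely, the baseline is just a pass rate per model-and-prompt combination, compared against your threshold once the blind labels are revealed. A sketch:

```python
# Baseline = pass rate per (model, prompt) combination, checked against a threshold.
# Assumes each record has its model/prompt revealed after blind scoring.
from collections import defaultdict

THRESHOLD = 0.90

evals = [
    {"model": "model_a", "prompt": "structured_v2", "passed": True},
    {"model": "model_a", "prompt": "structured_v2", "passed": True},
    {"model": "model_b", "prompt": "baseline",      "passed": False},
    # ... one record per evaluated response
]

passed, total = defaultdict(int), defaultdict(int)
for e in evals:
    key = (e["model"], e["prompt"])
    total[key] += 1
    passed[key] += e["passed"]

for key, n in sorted(total.items(), key=lambda kv: -passed[kv[0]] / kv[1]):
    rate = passed[key] / n
    status = "ready to build on" if rate >= THRESHOLD else "keep iterating"
    print(f"{key[0]} + {key[1]}: {rate:.0%}  ({status})")
```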
At this point, you know:
- Which model works best for your use case
- What your actual cost per request will be
- What your accuracy baseline is
Now you're ready to build.
The Counter-Intuitive Part
Here's what makes this approach feel wrong to many product people:
It feels slower.
It feels like you're not "shipping fast."
It feels like you're optimizing too early.
But here's what actually happens:
You move faster. Because you're not rebuilding features that fail in production. You're not chasing silent failures you don't understand. You're not burning engineering time on models that don't work for your use case.
You waste less. That Lovable user who vibecoded for 30 hours on a $20 subscription? With experimentation, you'd have known your cost structure before that happened.
You maintain trust. Your users never see the obvious failures, the hallucinations, the misinterpretations. They see a feature that works reliably.
The illusion is that skipping validation saves time.
The reality is that skipping validation means you learn the expensive lessons in production, with real users, burning real money, eroding real trust.
Systematic experimentation isn't about being cautious. It's about being strategic. It's about learning the right things at the right time.