Should you still write PRDs when building AI features?

By Madalina Turlea

14 Apr 2026

For over 10 years, the PRD was one of my most important outputs as a product manager.

It didn't matter the company, the product, or the stage. Before a single line of code was written, I wrote the spec. Requirements, acceptance criteria, edge cases, expected behavior. The PRD was how product teams managed risk, because code was expensive, and the cost of building the wrong thing was high. So you did all the discovery upfront, reduced uncertainty as much as possible, and handed a tight spec to engineering.

This worked. For a decade, it worked. And it still works for deterministic software, though coding assistants have significantly reduced the cost of writing code.

Then I started building AI products.

For the very first AI agent I worked on, I did exactly what I'd always done. I wrote a detailed PRD: what the agent should do, how customers would use it, how we should protect against abuse and errors. I gave it to engineering to figure out the implementation.

It took me longer than I'd like to admit to realize this was the problem.

Because here's the thing about AI that changes everything: the programming language is plain English. The most important lever you have to shape AI behavior is the system prompt, and it's written in the same language as your PRD. When a PM writes a PRD describing how the AI should behave, then hands it to an engineer to "implement," the engineer translates your spec into... another document written in English. A shorter, blurrier version of what you already wrote.

You're adding a translation layer that doesn't need to exist. And every translation loses context.

I'm not the only PM who starts here. From our work with product teams at Lovelaice, I've seen the same pattern in roughly 80% of teams. The PM writes the PRD. The engineer writes the prompt. Testing covers 3-4 happy-path cases, and the QA team is "tasked to figure out how to automate AI eval".

Meanwhile, the loudest voices in product are telling PMs that "evals are the new PRDs": that AI has made your spec work obsolete, and that the thing replacing it is evaluations. This sounds right. It isn't. And it's making the real problem worse.

The process that's failing you

Let me walk you through what the AI product building process looks like in most teams.

The PM writes a PRD for an AI feature. Detailed requirements, acceptance criteria, expected behavior, edge cases. Solid work, the kind of spec that would serve you well in traditional software.

The engineer reads the PRD and writes the system prompt. Here's what that usually looks like:

"You are a helpful assistant that provides insights on user data. Be concise and accurate."

Two sentences. Your detailed PRD, the domain knowledge, the implicit expectations, the nuanced understanding of what your users actually need, compressed into a generic instruction.

And here's what happens next: when the team sees the first results from this prompt, the instinct isn't to iterate on the prompt. It's to switch the model. "Let's try the latest GPT." "Let's move to Claude." The prompt stays the same two sentences; the team just throws a more expensive model at it.

We've seen teams double their accuracy, from 40% to 80%, by iterating on the prompt alone, without changing the model. The prompt is the highest-leverage thing you can change. And in most teams, nobody is changing it. If you're a PM and you're not involved in this experimentation, you're leaving enormous value on the table, for your users and for your product.

Then QA takes the PM's acceptance criteria and validates the output. What that usually looks like: 3-5 happy-path cases are skimmed and the conclusion is "the answers look good." In some cases, teams move directly to automating evaluations: they instruct another AI to look at what their own AI is outputting and check for the same acceptance criteria. They're testing for failure modes that nobody has actually observed, because nobody has read what the AI is actually saying yet.

The feature ships.

And the first person to actually read what the AI says to your users... is the user.

Not the PM who defined the feature. Not the engineer who wrote the prompt. Not QA. The user is the first human being to judge whether the AI's response is good.

You might be thinking: "That's not how we do it, we test and validate." But walk through it honestly: did anyone sit down and read 20, 30, 50 real AI responses before launch? And really spend time thinking through the answers?

Or did the team test 3-4 happy-path cases and call it done?

From our work with product teams at Lovelaice, here's what we see consistently:

  • In ~80% of teams, engineers write the production prompt. The PM has never seen it.
  • In ~90% of teams, quality evaluation is manual and "vibe-based": no structured criteria, no systematic testing, not based on actual observed failures.

And the cost of this isn't theoretical.

We've seen a model upgrade cause a 10x increase in negative user feedback overnight, with no eval catching it. We've seen AI features shipped when 70% of the failures could have been caught before launch. We've seen accuracy jump from 50% to 93% in a single iteration on the prompt, when proper error analysis was performed.

This is what the traditional handoff chain produces when applied to AI. In deterministic software, the handoff works: code either does what the spec says or it doesn't, and QA can verify with a clear pass/fail. But AI is non-deterministic. The prompt is simultaneously the spec and the implementation. When you separate the person who understands the user from the artifact that controls the AI's behavior, you create a translation gap. And AI's non-determinism exploits that gap ruthlessly.

Why "evals are the new PRDs" makes this worse

If you follow product content on LinkedIn, you've probably heard some version of this: evals are the new PRDs. I've heard it too from several product leaders. The idea is that for AI products, all that matters now is evaluation and that PMs should be able to write them.

I had the same reaction when I first heard it. It sounds right.

It isn't.

Think about what PRDs actually are. A PRD is written before you build. It describes intent: what the feature should do, how it should behave, what constraints apply. When PMs hear that evals are the new PRDs, what most understand is that they should write AI evals upfront. Before any implementation, write the evals. Same as acceptance criteria for traditional software.

Here's the disconnect: an eval is written after you see actual outputs from AI. It catches failures you have actually seen your AI make, not what you imagine it might get wrong.

When you tell PMs that evals replaced their PRDs, you're telling them two things: that their spec work is obsolete, and that the evals are the only way they can shape AI behavior.

Both are wrong.

The spec work isn't obsolete. It just moved. The prompt IS the spec: it's where you encode what the AI should do, in what tone, with what constraints, for what user. And the PM is still the person best positioned to write it for most products. Not because PMs should become engineers, but because the prompt is product work written in a different medium.

Evals are highly important, and you should absolutely write and run them before shipping. But the quality of your evals is only as good as the work you put in to pinpoint your AI's actual failures and the feedback loop you build around them. If you're only measuring failures you imagined and never improve your prompt based on evals, your efforts are wasted.

The prompt is the PRD

Here's what changes when you see the prompt for what it is.

The prompt is where you encode your domain knowledge directly: what the AI should do, how it should behave, what constraints apply, what tone to use, what context matters. This is not prompt engineering. I'm not talking about technical tricks for getting better outputs. This is context engineering: deciding exactly what you need to instruct the AI on, what data it needs, and what minimum context is required to get the best quality output.

This is product work. It always was.

We've heard it from CPOs who started writing prompts themselves: the prompt a product person writes is fundamentally different from one an engineer writes, because the product person understands the user better. Not because engineers are bad at their jobs. Because they're optimizing for different things.

The teams we've seen with the best AI quality are the ones where PMs and domain experts are closest to the prompt. In one startup, the CPO writes prompts directly and found 12.5x cost savings by switching models, with only a 1% difference in accuracy. That's not a decision an engineer would make, because it requires knowing what "good enough" looks like for the customer. In another company, the PM has direct access to every AI interaction, and they've achieved AI resolution rates above what the human team delivers.

The pattern is consistent: the closer the person with domain knowledge is to the prompt and the evaluation, the better the AI performs.

But I hear the objections

When I tell PMs they should own the prompt and experiment with the AI directly, I get the same responses almost every time. They're reasonable on the surface. Let me address each one.

"I don't want to become an AI engineer, I just want to move business metrics."

That's exactly why you should own the prompt. For your AI to deliver value to users and push real outcomes, it needs to be high quality. Owning the prompt is how you ensure that. The prompt is not engineering work; it's product work written in a different medium. You're not becoming an engineer. You're doing product management where the product actually lives.

"I already describe the requirements in the PRD and push it to engineering to translate into prompts."

And you lose enormous context in that translation. Every instruction in a prompt impacts output quality in ways you can only see by reading the actual responses. When you write "the AI should provide helpful insights on user data" in a PRD, the engineer translates that into a two-sentence prompt. That's not your spec. That's a shadow of your spec. The gap between your intent and that prompt is where quality dies, and you'll never see it unless you're in the prompt yourself.

"AI quality is a QA task, not a product task."

QA can run the evals. But the PM is the one who best understands the customer and product and can best articulate what "good" looks like. QA doesn't carry the domain knowledge, the user context, the implicit expectations. Only the PM knows that the tone is slightly off, that the response answered the question but missed the real concern behind it, that the recommendation is technically correct but irrelevant to this user's situation. QA executes. The PM defines.

"Prompts are technical, I trust the engineers."

It's not a matter of trust. Your engineers are great at engineering. But they can't express how a model falls short the way a PM can, because they don't know the user the way you do. They don't know that a response "feels corporate" when your users expect directness. They don't know that the AI missed the subtext in the user's question. The PM sees quality gaps that engineers structurally cannot see, not because engineers are bad, but because the PM carries context that never made it into the PRD.

"That's not scalable, I can't review every AI response."

You don't need to review every response forever. You need to read enough to discover the real failure patterns, then build evals for those. Once you've identified the patterns, evals run automatically. The PM's job is to keep discovering new failure patterns as models and usage evolve, not to manually review every output for eternity.

"We already have evals."

If your evals were written before anyone read the AI's actual responses, they're testing for what you think goes wrong, not what actually goes wrong. We've seen teams spend months and significant budgets trying to automate evaluation, and fail, because there was no clear framework for what to measure. The PM wasn't close enough to the outputs to define it. Evals built on imagined failures miss the real ones.

The process

So what does a PM-owned AI development process actually look like?

1. Write the prompt, not just the PRD.

Encode your domain knowledge directly into the system prompt. What should the AI do? What should it never do? What tone? What data does it need? What does a good response look like for this specific user, in this specific context? This IS the spec. No translation layer needed. Start with your best guess of the first prompt. Once you see the first outputs, you'll know exactly what you need to change.
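One way to keep those questions visible is to treat each of them as a field and render the prompt from the answers. The snippet below is a minimal sketch of that idea, not a prescribed format: the `PromptSpec` class, its field names, and the example content are all hypothetical, and the section labels are just one plausible layout.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the product decisions a system prompt encodes."""
    role: str
    audience: str
    tone: str
    must_do: list[str] = field(default_factory=list)
    must_never: list[str] = field(default_factory=list)
    context_needed: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the spec into plain-text system prompt sections."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items)

        sections = [
            f"Role: {self.role}",
            f"Audience: {self.audience}",
            f"Tone: {self.tone}",
            "Always:\n" + bullets(self.must_do),
            "Never:\n" + bullets(self.must_never),
            "Context you will receive:\n" + bullets(self.context_needed),
        ]
        return "\n\n".join(sections)


# Example content is invented for illustration only.
spec = PromptSpec(
    role="Assistant that explains trends in a customer's usage data",
    audience="Operations managers with no analytics background",
    tone="Direct and plain-spoken, no corporate filler",
    must_do=["Lead with the single most important change", "State the time period compared"],
    must_never=["Speculate about causes the data does not show"],
    context_needed=["Last 90 days of usage metrics", "The customer's plan tier"],
)
system_prompt = spec.render()
print(system_prompt)
```

Compared with a two-sentence generic instruction, every line here traces back to a product decision, which makes it easier to see which decision to revisit when an output misses the mark.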

2. Read the AI's actual responses.

Not a dashboard. Not a summary metric. Read 20-50 real outputs. This is where your implicit expectations surface: the things you knew but didn't write down because they seemed obvious. You'll find yourself thinking "that's technically correct but not what I meant", and that gap is your most valuable signal. Think of this like a PRD review: the AI shows you what it understood from your specs. If its answer missed the mark, clarify the requirements. Or maybe you forgot an edge case altogether?
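If you have more logged responses than you can read, a random sample helps you avoid reading only the cases you already expect to work. The following is a rough sketch under assumptions: the log record shape and the `sample_for_review` helper are hypothetical, and the sample size of 30 is simply a value inside the 20-50 range mentioned above.

```python
import random


def sample_for_review(logged_responses: list[dict], sample_size: int = 30, seed: int = 7) -> list[dict]:
    """Pick a reproducible random sample so the review is not limited to happy paths."""
    rng = random.Random(seed)  # fixed seed so a colleague can re-read the same sample
    size = min(sample_size, len(logged_responses))
    return rng.sample(logged_responses, size)


# Hypothetical log records; a real export would come from your own logging.
logs = [
    {"id": i, "user_message": f"question {i}", "response": f"answer {i}"}
    for i in range(200)
]
to_read = sample_for_review(logs)
print(len(to_read), "responses queued for reading")
```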

Pro tip: run your prompt on multiple models at the same time. When you see multiple AI models trying to solve the same task, the differences in their responses reveal the missing context in your original prompt. The next iteration writes itself.
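As an illustration of that side-by-side comparison, here is a small sketch that sends one prompt to several models and collects the answers. It assumes each model can be wrapped as a plain function taking a system prompt and a user message; the two stub models and their outputs are invented so the example runs without any provider API.

```python
from typing import Callable

# A model is represented as any function taking (system_prompt, user_message)
# and returning text. Real provider calls would be wrapped to fit this shape.
ModelFn = Callable[[str, str], str]


def compare_models(models: dict[str, ModelFn], system_prompt: str, user_message: str) -> dict[str, str]:
    """Send the same prompt and message to every model and collect the answers."""
    return {name: call(system_prompt, user_message) for name, call in models.items()}


# Stand-in models so the sketch runs without any API access.
def stub_model_a(system_prompt: str, user_message: str) -> str:
    return "Usage dropped 12% versus last month."


def stub_model_b(system_prompt: str, user_message: str) -> str:
    return "Usage dropped 12%. This is likely due to seasonal effects."


answers = compare_models(
    {"model_a": stub_model_a, "model_b": stub_model_b},
    system_prompt="Explain trends in the customer's usage data.",
    user_message="How did we do this month?",
)
for name, answer in answers.items():
    print(f"{name}: {answer}")
```

In this invented example, one model guesses at a cause and the other does not. That difference suggests the prompt never said whether speculation is allowed, which is exactly the kind of missing context the comparison is meant to expose.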

3. Identify real failure patterns.

When you read the AI responses, write down where you think "that's not quite right", and, critically, why. These are your eval criteria. You couldn't have written them before seeing the outputs, because you didn't know what the AI would get wrong. This is the step most teams skip, and it's the one that matters most. It unlocks two things:

  • It gives you clear direction on how to improve your prompt.
  • It tells you exactly what to automate in your evals.
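A spreadsheet is enough for this step, but the tallying can also be sketched in a few lines. The review notes, pattern names, and counts below are made up for illustration; the point is only that each note records both the pattern and the reason, and that counting them shows which failure to tackle first.

```python
from collections import Counter

# Hypothetical review notes: one entry per response that felt "not quite right".
review_notes = [
    {"response_id": 3, "pattern": "speculates about causes", "why": "No data supports the claim"},
    {"response_id": 7, "pattern": "tone too corporate", "why": "Users expect directness"},
    {"response_id": 12, "pattern": "speculates about causes", "why": "Invents a seasonal effect"},
    {"response_id": 19, "pattern": "misses time period", "why": "No comparison window stated"},
    {"response_id": 23, "pattern": "speculates about causes", "why": "Blames a release we never mentioned"},
]

responses_read = 30  # total responses reviewed, including the ones that were fine
pattern_counts = Counter(note["pattern"] for note in review_notes)

# Most frequent failure patterns first.
for pattern, count in pattern_counts.most_common():
    print(f"{pattern}: {count} of {responses_read} responses ({count / responses_read:.0%})")
```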

4. Build evals based on observed failures.

Now you build the quality gates, for the failures you actually saw, not the ones you imagined. Evals are the quality assurance, not the specification. They test whether the AI still fails in the ways you've already discovered. I wrote more about how to build your evals here →
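To make this concrete, here is a minimal sketch of evals derived from two observed failure patterns. Everything in it is an assumption for illustration: the check names, the regular expressions, and the sample responses are hypothetical, and simple code checks like these only suit patterns that can be detected mechanically. Subjective patterns such as tone usually need a human or model-based grader instead.

```python
import re
from typing import Callable


# Each check returns True when the response passes.
def states_time_period(response: str) -> bool:
    """Observed failure: the answer gives a change without a comparison window."""
    pattern = r"\b(last|past|previous)\s+(\d+\s+)?(day|week|month|quarter)s?\b"
    return re.search(pattern, response, re.I) is not None


def avoids_speculation(response: str) -> bool:
    """Observed failure: the answer guesses at causes the data does not show."""
    pattern = r"\b(likely|probably|might be due|could be because)\b"
    return re.search(pattern, response, re.I) is None


EVALS: dict[str, Callable[[str], bool]] = {
    "states_time_period": states_time_period,
    "avoids_speculation": avoids_speculation,
}


def pass_rates(responses: list[str]) -> dict[str, float]:
    """Share of responses passing each eval."""
    return {
        name: sum(check(r) for r in responses) / len(responses)
        for name, check in EVALS.items()
    }


sample = [
    "Usage dropped 12% versus last month.",
    "Usage dropped 12%. This is likely due to seasonal effects.",
]
rates = pass_rates(sample)
print(rates)
```

Each eval carries a note about the failure it came from, so anyone reading the suite later can tell why the check exists.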

5. Run evals continuously.

Models update. Data drifts. User behavior changes. We've seen model upgrades cause catastrophic quality regressions overnight, with no eval catching them. Continuous evals catch regressions, but only against known failure patterns. Which is why you keep reading outputs to discover new ones. The PM's job isn't to review every response forever. It's to keep discovering failure patterns as the AI and usage evolve.
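A continuous run only helps if something compares the new numbers with the old ones. The sketch below shows one simple way to do that: compare per-eval pass rates before and after a change and flag any drop beyond a tolerance. The `find_regressions` helper, the 5-point tolerance, and the pass rates are all hypothetical values chosen for the example.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return evals whose pass rate fell by more than `tolerance`, with the size of the drop."""
    drops = {}
    for name, old_rate in baseline.items():
        new_rate = candidate.get(name, 0.0)  # a missing eval counts as failing
        drop = old_rate - new_rate
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Invented pass rates before and after a model upgrade.
baseline_rates = {"states_time_period": 0.96, "avoids_speculation": 0.92}
after_model_upgrade = {"states_time_period": 0.95, "avoids_speculation": 0.61}

regressions = find_regressions(baseline_rates, after_model_upgrade)
if regressions:
    print("Block the rollout:", regressions)
```

A gate like this only covers the failure patterns already encoded as evals, which is why reading fresh outputs remains part of the loop.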

Where does your team fall?

From our research across dozens of product teams, we've mapped a maturity spectrum. Most teams are at Level 1 or 2. The best are at 3 or 4.

Green field. The PM writes the PRD, engineering writes the prompt. There's no evaluation baseline; the only quality signal is user thumbs up/down, which barely anyone clicks. Nobody has read the AI's actual outputs systematically. You don't know what you don't know yet.

Early instincts. The PM tests a few happy-path cases manually the first time around. But there's no iteration on the prompt based on what they find; the task gets passed to QA. Quality checks never fail, because they're not systematic and not data-based. Whatever the AI quality is right now gets accepted and shipped.

Building structure. The PM can see the prompt, but can't directly iterate on it. Improvements still go through engineering, and there's a days-long delay between identifying an issue and seeing the fix in production. The instincts and some infrastructure are there. What's missing: a way to run experiments without engineering becoming the bottleneck every time someone wants to test a prompt change.

Systematic. The PM writes the prompt and defines eval criteria based on observed failures. Evals run automatically. You're ahead of 90% of teams. But the risks at this stage are subtle: cost drift, silent regressions that compound, and knowledge that hasn't made it into the eval criteria yet.

Evaluation-native. Prompt ownership, continuous evals, systematic iteration: the full loop is running. The question isn't whether you have a process. It's where your specific blind spots are, and whether there's a faster path to your quality targets than what you're currently running.

Want to know exactly where your team falls? We built a free AI product maturity assessment: fill in the details of your current process and get a custom report based on what we've seen across dozens of teams. Take the assessment →

Bottom line

The teams that get AI right aren't the ones with the best models or the most sophisticated infrastructure. They're the ones where the person who understands the user is closest to the AI's output, not three handoffs away.

The prompt is the PRD. Evals are the quality gate. And the PM should own both.

Not because PMs need to become engineers. But because AI fails in the translation between people, not in the technology.

— Madalina


Lovelaice is the AI feature validation platform built for product teams who want to prove their AI works, before users churn. Try it free →

And if you want to learn this methodology step by step, with hands-on practice and feedback, my Maven workshop on AI Evals for Product Managers covers exactly this process, from first test case to production-ready evaluation. Learn more →