Building an Expense-Policy AI Agent: What We Learned Reverse-Engineering Ramp

By Madalina Turlea · 15 Jan 2026

Ramp, the fintech app for business expense management, launched a feature where an employee can text "can I expense this?" and an AI agent answers based on their company's policy. The feature drew very positive feedback online, in reviews and LinkedIn posts. We tried to reverse-engineer it to see what accuracy you could expect if you built the same thing from scratch, which instructions work best, and where it fails.

The setup

We prepared three versions of the system prompt, all referencing the same expense policy. We did not have a real policy on hand, so we generated one with AI and used that same policy across all three prompts, formatted differently each time.

The first version was basic. It used a placeholder for the company name and a placeholder for the policy, with a simple instruction: when an employee asks about an expense, determine if it is allowed based on the policy.
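As a rough illustration of what a basic placeholder prompt like this could look like, here is a minimal sketch. The exact wording, the placeholder names `company_name` and `policy`, and the sample values are assumptions; the article describes the prompt but does not quote it.

```python
from string import Template

# Hypothetical wording: only the two placeholders and the single instruction
# come from the description above.
BASIC_PROMPT = Template(
    "You are an expense assistant for $company_name.\n"
    "Company expense policy:\n$policy\n\n"
    "When an employee asks about an expense, determine if it is allowed "
    "based on the policy."
)

def build_basic_prompt(company_name: str, policy: str) -> str:
    """Fill both placeholders; substitute() raises KeyError if one is missing."""
    return BASIC_PROMPT.substitute(company_name=company_name, policy=policy)

# Illustrative values only.
prompt = build_basic_prompt(
    "Acme Corp", "Office furniture up to 300 per item is allowed."
)
print(prompt)
```

Using `substitute()` rather than `safe_substitute()` means a forgotten placeholder fails loudly instead of reaching the model as a literal `$policy`.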

The second version formatted the policy as XML. We asked a chat model to convert the document into XML tags, because Anthropic's prompting guide recommends putting domain knowledge in XML tags so the model reads it better. It also added more instructions: analyse the request; respond with approved, not approved, or needs review; give a brief explanation; and reference the specific policy sections. It did not say when something should be "needs review," leaving the model to figure that out.

The third version added more structure: a role, more detail, the same XML policy, and detailed instructions for when to escalate to a human as "needs review." Giving the model a way out like this is meant to reduce hallucinations, since without one it tries to comply with your instructions anyway, and that is when it makes things up.
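The sketch below shows one way a structured prompt of this kind might be assembled: a role, the policy wrapped in XML tags, a fixed set of decisions, and explicit escalation rules. The tag names, decision labels, and the escalation rules themselves are hypothetical, since the article does not list the real ones.

```python
from xml.sax.saxutils import escape, quoteattr

def policy_to_xml(sections: dict[str, str]) -> str:
    """Wrap each policy section in a tag, escaping &, < and > in the text."""
    body = "\n".join(
        f"  <section name={quoteattr(name)}>{escape(text)}</section>"
        for name, text in sections.items()
    )
    return f"<expense_policy>\n{body}\n</expense_policy>"

# Hypothetical escalation rules, written only to illustrate the "way out".
ESCALATION_RULES = [
    "the policy does not clearly cover the expense",
    "the request is missing a detail needed to apply a rule",
    "the expense breaks a limit but the employee gives a plausible reason",
]

def build_structured_prompt(company_name: str, sections: dict[str, str]) -> str:
    rules = "\n".join(f"- {rule}" for rule in ESCALATION_RULES)
    return (
        f"You are the expense-policy assistant for {company_name}.\n\n"
        f"{policy_to_xml(sections)}\n\n"
        "Answer with exactly one decision: APPROVED, NOT_APPROVED, "
        "or NEEDS_REVIEW.\n"
        "Give a brief explanation and cite the policy section you relied on.\n"
        "Answer NEEDS_REVIEW instead of guessing when:\n"
        f"{rules}"
    )

# Illustrative policy text only.
prompt = build_structured_prompt(
    "Acme Corp", {"Travel & lodging": "Hotels < 200 per night."}
)
print(prompt)
```

Escaping the policy text keeps characters such as `&` and `<` from producing malformed tags inside the prompt.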

We ran all three across six models (Claude Sonnet 4.5, Claude Opus, Gemini, two OpenAI models, and DeepSeek) on seven test cases. The cases ranged from an obvious approval, like a desk chair, to ones designed to need human review, like a conference hotel at $280 a night when other options are 30 minutes away, or an expense submitted outside its allowed time window. In total that was 126 runs: 3 prompts × 6 models × 7 test cases.
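The run count follows from crossing every prompt with every model and every test case. A minimal sketch of that grid is below; the prompt, model, and case labels are placeholders rather than real API model names.

```python
from itertools import product

# Placeholder labels, not real API identifiers.
PROMPTS = ["basic", "xml", "xml_with_escalation"]
MODELS = [
    "claude-sonnet-4.5", "claude-opus", "gemini-2.5",
    "gpt-5", "openai-other", "deepseek",
]
CASES = [f"case_{i}" for i in range(1, 8)]  # seven test cases

# Every (prompt, model, case) combination is one run.
runs = list(product(PROMPTS, MODELS, CASES))
print(len(runs))  # 126
```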

How we scored it

We evaluated test case by test case. An automatic check looked at the decision: approved, denied, or needs review. Then we manually evaluated whether the reasoning and the policy reference were correct, because a right decision with wrong reasoning is still a problem. We also marked answers as incorrect when the response was very long, since this is meant to be a text-message interaction and a long answer does not fit that format.
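The scoring rule described above can be sketched as a single function that combines the automatic decision check, the two manual judgments, and a length cap. The `Run` fields and the 320-character limit are assumptions made for illustration; the article does not state the length threshold it used.

```python
from dataclasses import dataclass

MAX_CHARS = 320  # assumed cap for a text-message-sized answer

@dataclass
class Run:
    expected: str       # "approved", "denied", or "needs review"
    decision: str       # decision extracted from the model's answer
    response: str       # full text sent back to the employee
    reasoning_ok: bool  # manual judgment: is the reasoning correct?
    reference_ok: bool  # manual judgment: is the cited policy section correct?

def is_correct(run: Run) -> bool:
    """A run passes only if every check passes."""
    decision_ok = run.decision.strip().lower() == run.expected.strip().lower()
    short_enough = len(run.response) <= MAX_CHARS
    return decision_ok and run.reasoning_ok and run.reference_ok and short_enough

good = Run("approved", "Approved", "Approved under section 3.1.", True, True)
wrong_reason = Run("approved", "approved", "Approved under section 9.", False, True)
too_long = Run("denied", "denied", "x" * 1000, True, True)
print(is_correct(good), is_correct(wrong_reason), is_correct(too_long))
# True False False
```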

The results

Across all runs, roughly half of the responses were correct. The best combination was GPT-5 with the structured XML prompt, at 86% accuracy (six of the seven cases). The same model with the basic placeholder prompt dropped to 43% (three of seven). Structured XML formatting clearly outperformed the placeholders, and Claude Opus and Gemini 2.5 also reached good accuracy with the structured prompt.
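For reference, percentages like 86% and 43% fall out of a simple per-combination tally over seven cases. The sketch below assumes each run is recorded as a `(model, prompt, correct)` tuple; which individual cases passed is made up here, and only the totals match the figures above.

```python
from collections import defaultdict

def accuracy_by_combo(results):
    """Map (model, prompt) to the fraction of its runs marked correct."""
    totals = defaultdict(lambda: [0, 0])  # [correct, total]
    for model, prompt, correct in results:
        totals[(model, prompt)][0] += int(correct)
        totals[(model, prompt)][1] += 1
    return {combo: c / n for combo, (c, n) in totals.items()}

# Illustrative pass/fail pattern: 6 of 7 and 3 of 7.
results = (
    [("gpt-5", "structured", ok) for ok in [True] * 6 + [False]]
    + [("gpt-5", "basic", ok) for ok in [True] * 3 + [False] * 4]
)
for combo, acc in accuracy_by_combo(results).items():
    print(combo, f"{acc:.0%}")
# ('gpt-5', 'structured') 86%
# ('gpt-5', 'basic') 43%
```

With only seven cases per combination, a single case moves accuracy by about 14 percentage points, which is worth keeping in mind when comparing rows.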

Cost told its own story. For a successful response, GPT-5 cost 0.028. Claude Opus cost almost three times as much. Gemini 2.5 had a very low cost with similar accuracy. At scale this is not a rounding error: Ramp has thousands of customers, each with tens or hundreds of employees, so a feature like this could be called millions of times a month, and the difference between the top models could run into hundreds of thousands or millions.
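To make the scale argument concrete, here is a back-of-the-envelope sketch. Only the 0.028 per successful response comes from the text; the call volume of one million per month is an assumption, and the 3x multiplier treats "almost three times" as an upper bound. Amounts are in the same unit as the per-response cost.

```python
GPT5_COST_PER_RESPONSE = 0.028  # from the article
OPUS_MULTIPLIER = 3.0           # assumption: upper bound for "almost three times"
CALLS_PER_MONTH = 1_000_000     # assumption: one million calls a month

def monthly_cost(cost_per_call: float, calls: int) -> float:
    """Total monthly spend for a given per-call cost, rounded to 2 decimals."""
    return round(cost_per_call * calls, 2)

gpt5_monthly = monthly_cost(GPT5_COST_PER_RESPONSE, CALLS_PER_MONTH)
opus_monthly = monthly_cost(GPT5_COST_PER_RESPONSE * OPUS_MULTIPLIER, CALLS_PER_MONTH)
monthly_gap = round(opus_monthly - gpt5_monthly, 2)
yearly_gap = round(monthly_gap * 12, 2)

print(gpt5_monthly, opus_monthly, monthly_gap, yearly_gap)
# about 28,000 vs 84,000 per month: a gap of about 56,000 a month,
# or about 672,000 a year under these assumptions
```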

The surprising part

The detailed edge-case instructions in the third prompt made GPT-5 worse, not better: once the edge cases were added, its accuracy dropped. The agent also struggled most on travel and work-from-home expenses, which is where extra instructions could help in a future iteration.

The sample was small, only seven test cases, so these numbers need more thorough testing before you can be confident in them. But the pattern is the kind of learning you only get by experimenting systematically on the exact problem: structure helped, the most expensive model was not the best value for the same accuracy, and an instruction added to help can quietly hurt.
