We Tested the Viral Prompt Tricks. Most of Them Do Nothing.

Written by Madalina Turlea
15 Jan 2026
There is a lot of advice online about prompting techniques that supposedly unlock better answers. Threaten the model. Write in all caps. Tell it your life depends on it. We ran an experiment to see whether any of it actually makes a difference when you are building AI products, or whether it is just clickbait.
The setup
The task was invoice data extraction: pulling structured fields like vendor name, service, unit price, and total price out of invoices that come in different formats and layouts. We took one baseline prompt and applied different prompting techniques to the exact same text, so the only thing changing was the technique.
We tested six versions: a baseline; a structured version with clear Markdown sections; a version using XML tags; a version using all caps to highlight important instructions; a high-stakes, threatening version; and an all-in-one version combining everything. We ran them across nine models from the major providers on ten tricky invoices, including one with a missing unit price, one whose total is split between an advance and a remaining amount, one in another language, and a credit note with negative amounts.
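To make the grid concrete, here is a minimal sketch of how such a comparison could be assembled. The baseline wording, the helper functions, and the placeholder model and invoice names are all hypothetical; they are not the actual prompts or harness from the experiment.

```python
from itertools import product

# Hypothetical baseline; the real prompt text is not reproduced here.
BASELINE = (
    "Extract the vendor name, service, unit price, and total price "
    "from the invoice. Return the response only in JSON."
)

def as_markdown(text: str) -> str:
    """Put the same instructions under a Markdown heading."""
    return "## Instructions\n" + text

def as_xml(text: str) -> str:
    """Wrap the same instructions in an XML tag."""
    return f"<instructions>\n{text}\n</instructions>"

def with_caps(text: str) -> str:
    """Upper-case one key instruction and leave the rest untouched."""
    target = "Return the response only in JSON."
    return text.replace(target, target.upper())

def high_stakes(text: str) -> str:
    """Append pressure about the consequences of a wrong answer."""
    return (
        text
        + "\n\nThis extraction feeds directly into payments. "
        + "An incorrect result would cause overpayment."
    )

# One baseline, six versions. The all-in-one version here stacks the three
# formatting techniques; whether the real one also included the high-stakes
# text is not stated, so that part is an assumption.
VARIANTS = {
    "baseline": BASELINE,
    "markdown": as_markdown(BASELINE),
    "xml": as_xml(BASELINE),
    "all_caps": with_caps(BASELINE),
    "high_stakes": high_stakes(BASELINE),
    "all_in_one": as_xml(as_markdown(with_caps(BASELINE))),
}

MODELS = [f"model_{i}" for i in range(1, 10)]          # nine placeholder models
INVOICES = [f"invoice_{i:02d}" for i in range(1, 11)]  # ten placeholder invoices

# Every (variant, model, invoice) combination, assuming each is run once.
RUNS = list(product(VARIANTS, MODELS, INVOICES))
print(len(RUNS))  # 540
```

Because every variant is derived from the same `BASELINE` string, any difference in results can be attributed to the technique rather than to a change in wording.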
Threatening the model hurt it
The high-stakes prompt added pressure: this extraction feeds directly into payments, an incorrect result would cause overpayment, and a mistake would cost us this much money. The idea is to scare the model into being careful.
It did not work. The threatening prompts consistently underperformed. They did not just fail to improve accuracy; they actively hurt it. We saw the same pattern in a second, harder experiment, extracting product pain points from interview transcripts: the prompting hacks underperformed again, and many of the threatening runs returned zero.
All caps does nothing extra
The all-caps version put key instructions in capitals, such as returning the response only in JSON. Compared with the version that simply had clear, structured sections, it achieved the same performance. Capitals added nothing on top of clear, structured instructions.
Combining everything was not the answer either. Putting Markdown, XML tags, and all caps together did not help, and for one model it made things worse. Gemini, already near the bottom of the ranking, degraded further when all the techniques were combined, because the stacked formatting confused it even more.
What actually worked
The conclusion across both experiments was the same: focus on structure and clarity over tricks or hacks. Both GPT-4.1 and Claude Opus 4.1 reached the top accuracy of 90% with the structured prompt, and the structured prompt also delivered the best results on the harder task.
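How an accuracy figure like this is computed depends on the scoring rule, which is not spelled out here. One plausible rule, sketched below purely as an assumption, is all-or-nothing per invoice: an extraction counts as correct only if every field matches the expected value. Under that rule, nine correct invoices out of ten would give 90%. The field names and sample values are invented for illustration.

```python
# Assumed field names, based on the fields listed in the setup.
FIELDS = ("vendor_name", "service", "unit_price", "total_price")

def invoice_correct(predicted: dict, expected: dict) -> bool:
    """True only if every field matches exactly (assumed scoring rule)."""
    return all(predicted.get(f) == expected.get(f) for f in FIELDS)

def accuracy(predictions: list, expectations: list) -> float:
    """Share of invoices where all fields were extracted correctly."""
    if len(predictions) != len(expectations):
        raise ValueError("predictions and expectations must align one to one")
    if not expectations:
        return 0.0
    correct = sum(
        invoice_correct(p, e) for p, e in zip(predictions, expectations)
    )
    return correct / len(expectations)

# Invented examples: a credit note with negative amounts, and an invoice
# whose unit price is missing (expected value None).
expected = [
    {"vendor_name": "Acme", "service": "Refund",
     "unit_price": -50.0, "total_price": -50.0},
    {"vendor_name": "Globex", "service": "Consulting",
     "unit_price": None, "total_price": 450.0},
]
predicted = [
    {"vendor_name": "Acme", "service": "Refund",
     "unit_price": -50.0, "total_price": -50.0},
    # The model invented a unit price instead of leaving it empty.
    {"vendor_name": "Globex", "service": "Consulting",
     "unit_price": 450.0, "total_price": 450.0},
]

print(accuracy(predicted, expected))  # 0.5
```

A stricter or looser rule, such as per-field scoring, would produce different numbers from the same outputs, so the rule should be fixed before comparing prompts.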
A few techniques show up again and again as the ones that consistently help: formatting the prompt with Markdown so the information has a hierarchy, giving the model examples of good answers (few-shot prompting), specifying clearly what the input and output format should be, and giving the model a way out for when it does not know what to do. That last one matters, because without a way to say it does not know, the model tries to comply anyway, and that is when it starts hallucinating.
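As an illustration, the four techniques can be combined in one prompt template. This is a sketch under stated assumptions, not the prompt used in the experiment: the section names, field names, example invoice, and the `build_prompt` and `parse_response` helpers are all hypothetical.

```python
import json

# Assumed field names, based on the fields listed in the setup.
FIELDS = ("vendor_name", "service", "unit_price", "total_price")

# Markdown headings give the prompt a hierarchy; each section covers one
# of the techniques: input/output format, a way out, and a worked example.
PROMPT_TEMPLATE = """\
## Task
Extract the fields listed below from the invoice text.

## Input
Plain text of a single invoice, in any language or layout.

## Output format
Return only a JSON object with exactly these keys:
vendor_name, service, unit_price, total_price.

## When you are unsure
If a field is missing or ambiguous, set it to null. Do not guess.

## Example
Invoice:
{example_invoice}

Answer:
{example_answer}

## Invoice to process
{invoice_text}
"""

# Invented few-shot example: the unit price is not stated, so the
# expected answer uses null instead of a made-up number.
EXAMPLE_INVOICE = "ACME GmbH\nConsulting services\nTotal due: 450.00 EUR"
EXAMPLE_ANSWER = {
    "vendor_name": "ACME GmbH",
    "service": "Consulting services",
    "unit_price": None,
    "total_price": 450.0,
}

def build_prompt(invoice_text: str) -> str:
    """Fill the template with the few-shot example and the target invoice."""
    return PROMPT_TEMPLATE.format(
        example_invoice=EXAMPLE_INVOICE,
        example_answer=json.dumps(EXAMPLE_ANSWER),
        invoice_text=invoice_text,
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON; absent or null fields come back as None."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {key: data.get(key) for key in FIELDS}

prompt = build_prompt("Globex Ltd\nHosting, March\nTotal: 120.00 USD")
reply = '{"vendor_name": "Globex Ltd", "service": "Hosting", "unit_price": null, "total_price": 120.0}'
print(parse_response(reply)["unit_price"])  # None
```

The few-shot example deliberately shows a `null` field, so the model sees the way out being used rather than only being told it exists.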
The flashy tricks are easy to share. The boring work of writing clear, structured instructions is what moves the numbers.