The Model Selection Blind Spot: Why the Newest Model Is Not Always the Best

By Madalina Turlea·
The Model Selection Blind Spot: Why the Newest Model Is Not Always the Best

Written by Madalina Turlea

15 Jan 2026

In a lot of teams, the pattern looks the same. Someone writes a prompt, picks the first model that comes to mind, usually something like GPT-5, and runs it there. The model gets chosen before anyone checks whether it is the right one for the task.

This matters because each model has its own strengths and weaknesses. They are better at certain tasks than others. And the frontier models, the latest and most expensive ones, do not win every time. Even though they perform better at deep reasoning and research tasks, on concrete problems they sometimes perform worse than lower-level models.

The only way to know is to test your instructions on multiple models from the beginning, to see which one is actually more capable for the problem you are trying to solve.

What the experiments showed

In a data extraction experiment pulling structured fields out of invoices, the clear winners were Claude Opus 4.1 and GPT-4.1, both reaching 90% accuracy. GPT-5, the newest and most expensive model, produced a lower accuracy at the highest cost. On a harder task, extracting product opportunities from customer interview transcripts, GPT-5 was again a massive cost outlier with lower accuracy, consuming up to 30,000 output tokens where other models used a fraction of that. The summary line for that task was blunt: avoid GPT-5 for this one.

In an expense-policy experiment, GPT-5 did reach 86% accuracy, but only with a structured prompt, and the same model with a basic prompt dropped to 43%. The model alone did not decide the outcome.

Accuracy is only one of three things

When you decide how good you want your AI to be, there are at least three dimensions: accuracy, cost, and latency. Most of the time there will not be one model that has all three at the top.

In a fitness-app experiment, GPT-5 took close to one minute on average to answer. A smaller Claude model answered in around four seconds, and a smaller OpenAI model in around eleven seconds. GPT-5 used roughly six times more tokens and worked out to about thirty times more expensive. If a user clicked a button and had to wait a full minute for a suggestion, that is not a good experience.

In an evaluation experiment, DeepSeek was actually slower than GPT-4.1, but more than eleven times cheaper. Which one makes sense depends on how the feature is used, how often it runs, and how fast you need the answer.

The lock-in trap

Every model provider gives you an API, but how it looks is a little different for each one. The parameters are named differently, the data is sent differently. Because of that, many teams pick one model, wire it into their code, and stay locked into it, simply because they have already implemented the way of calling that one model.

A unified interface like OpenRouter lets you call many models by changing only the model name, so you are not stuck with whichever one you happened to integrate first.

One warning: you generally cannot assume what another model will do, especially if you also change the prompt. The newer GPT versions tend to become more efficient with tokens, and they may give you similar accuracy at lower cost, but you still have to test it rather than guess.