Temperature, Max Tokens, and Streaming: The LLM Settings That Quietly Change Your Output

By Madalina Turlea·
Temperature, Max Tokens, and Streaming: The LLM Settings That Quietly Change Your Output

Written by Madalina Turlea

15 Jan 2026

To build with a large language model, you call it through its API, and you send more than just your prompt. You send parameters. Not all models have the same ones, but most have temperature, max tokens, and top-P. These settings change how the model behaves, and they are easy to overlook.

Temperature

Temperature is how creative the model should be. At zero, it is more deterministic. When it calculates the probabilities of the next token, it takes the one with the highest probability, so you are much more likely to get consistent answers over time. The trade-off is that the answers are not creative. For something like brainstorming content for stories, LinkedIn, or emails, a low temperature can make the output sound blank, with no personality.

A higher temperature means more creative answers, but it can also mean hallucinations and wrong information, because the model can match on lower probabilities. It produces something that sounds like a plausible train of thought but actually makes no sense.

Temperature also matters for reliability. If you have a task that categorises items, like customer support tickets, you want the same problem reported by different customers in different languages to always be classified in the same category, rather than the model deciding one is a product review and another is a refund when they are really the same issue. How reliable you need the model to be on the same problem, over and over, is a temperature decision.

Max tokens

Max tokens puts a limit on how many tokens the model outputs. It is a way of keeping cost in check.

It is also the reason an answer sometimes seems to stop in the middle. The model has not finished its thought, it has hit the maximum token limit, and that is a hard stop. It cannot go beyond it.

Reasoning and tools

Newer models add more parameters. With models like GPT-5, GPT-5.1, or Claude Opus, you can give direction on how much the model should think and how much reasoning it should do. You can also specify which tools it is allowed to use, like access to the web, or to an MCP server connecting to something like Figma or Google Sheets.

Streaming

When a chat model answers as if someone is typing, that is streaming. It works because the model always calculates the next token from everything that came before, so it can output the answer token by token. It does not look at the whole answer first and decide it makes sense, which is part of why it sometimes hallucinates: it just keeps adding the next word based on what it has generated so far.

In the API, you can enable or disable streaming. With streaming on, you take each token as it comes and show it to the user or save it. With streaming off, you send your request, the model finishes generating all the tokens, and then returns the whole answer at once. Once you go past a certain max output token level, some models actually require you to use streaming.

The reason is technical. Traditional web applications expect fast communication between front end and back end, and LLMs are slow. The existing infrastructure tends to time out on long responses. Streaming avoids that by sending one piece of the answer at a time, and it has the side benefit of better experience for the user, who starts seeing something quickly instead of staring at a blank screen for a minute. Streaming has no impact on cost.

To make any of these calls, you need a backend, or a tool that acts as the backend for you.