LLM & AI Agent Applications with LangChain and LangGraph — Part 8 — Temperature, Top-p, Top-k and Max Tokens: How to Shape Model Behavior
Last Updated on January 2, 2026 by Editorial Team
Author(s): Michalzarnecki
Originally published on Towards AI.

Welcome back to another article in this series on building LLM-driven applications.
In this part of the course I want to focus on something very practical: the main generation parameters you can control when working with large language models.
These settings decide a lot about the style, quality and predictability of the answers you get. If you understand them well, you can tune the model depending on what you need: a precise and repeatable assistant, or a more creative partner that explores different possibilities.
Temperature — from analyst to poet
The first parameter is temperature.
You can think of it as a kind of entropy control, a measure of “chaos” in the model’s choices. Technically it is a number, usually between 0 and 1, sometimes up to 2 depending on the API.
At low values, for example temperature = 0, the model becomes very predictable. It always picks the most likely next token. If you send the same prompt ten times, you will get almost identical answers. That is perfect for tasks where you want stability and reproducibility: many business applications, structured data extraction, deterministic analysis.
When you increase the temperature, say to 1.0, the model starts to explore more of the probability distribution. It becomes more creative. Answers are more varied, less repetitive, occasionally a bit surprising. This is useful for brainstorming, creative writing, idea generation.
You can imagine temperature as a slider running from “strict analyst” on one side to “inspired poet” on the other.
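To make the mechanics concrete, here is a toy sketch in plain Python (not the model's actual implementation): temperature divides the raw next-token scores (logits) before the softmax, so a low temperature sharpens the distribution toward the top candidate and a high temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # toy next-token scores

cold = softmax_with_temperature(logits, 0.1)  # near-greedy, "strict analyst"
warm = softmax_with_temperature(logits, 1.5)  # flatter, "inspired poet"

print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

With temperature 0.1 the top token gets essentially all of the probability mass; with 1.5 the other candidates get a realistic chance of being sampled.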
Top-p — focusing on the main part of the distribution
The next parameter is top-p, also known as nucleus sampling.
Top-p is an alternative way of controlling randomness. Instead of stretching or compressing the whole distribution like temperature does, it focuses on a subset of tokens that together cover a certain share of the probability mass.
For example, if top_p = 0.9, the model will look at the list of possible next tokens sorted by probability, take just enough of them so that their cumulative probability reaches 90 percent, and only then sample from that reduced group.
The effect is that the model ignores the very long tail of unlikely options, but still has some freedom within the “reasonable” candidates. You control diversity, but in a way that stays focused on the most plausible region. It is a bit like saying: “Stay within this main group of candidates, but do not always pick the same one.”
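The selection step can be sketched in a few lines of plain Python (the token names and probabilities below are made up for illustration): sort the candidates by probability and keep just enough of them to cover the top_p mass.

```python
def nucleus_candidates(token_probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break  # the nucleus is complete; ignore the long tail
    return kept

probs = {"cake": 0.50, "pie": 0.25, "tart": 0.15, "soup": 0.07, "brick": 0.03}

print(nucleus_candidates(probs, 0.9))  # ['cake', 'pie', 'tart'] covers 90% of the mass
print(nucleus_candidates(probs, 0.5))  # ['cake'] alone already covers 50%
```

The model then samples only from the returned nucleus, with the kept probabilities renormalized.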
Top-k — choosing from the best k candidates
A related parameter is top-k.
Here we do not look at percentages, but at a fixed number. We tell the model: “Only consider the top k most probable tokens at each step.”
If top_k = 1, the model will always pick the single most likely token, which makes it extremely deterministic. If top_k = 50, it has a much larger pool of good candidates, so generations will be more diverse.
Top-k is especially common in open source models and in libraries that expose lower level sampling options. It gives you a simple mental model: small k means safe and predictable, large k means more variety and more room for creativity.
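A minimal top-k sketch in plain Python (again with made-up tokens and probabilities): truncate the ranked candidate list to k entries, then sample from what remains.

```python
import random

def top_k_sample(token_probs, k, rng=random):
    """Sample one token from the k most probable candidates."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]  # relative weights; renormalization is implicit
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cake": 0.50, "pie": 0.25, "tart": 0.15, "soup": 0.07, "brick": 0.03}

print(top_k_sample(probs, k=1))  # always 'cake': greedy decoding
print(top_k_sample(probs, k=3))  # one of 'cake', 'pie', 'tart'
```

Note that the hosted OpenAI chat API does not expose top_k; you will mostly meet this parameter in open source inference libraries.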
You can also combine temperature, top-p and top-k, but often it is enough to pick one approach and tune it for your use case.

Max tokens — how much the model is allowed to say
The last parameter I will mention here is max tokens.
This one is conceptually simple: it is the limit on how long the model’s answer can be in a single call. The value is expressed in tokens, not in words, but the idea is straightforward. Once the model reaches that limit, it has to stop, even if it would happily continue the explanation or story.
If you set a low max_tokens, you will get short, concise answers. That is useful when you want summaries, brief status messages or when you need to keep costs and latency under tight control.
If you set a high value, the model is allowed to write longer narratives, detailed analyses and multi-step reasoning. For some tasks that is exactly what you want: enough space for the model to lay out its thinking and cover edge cases.
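The cutoff behaves like a hard budget on the generation loop. Here is a toy sketch (a list of words standing in for real tokens) that shows the truncation:

```python
def generate(token_stream, max_tokens):
    """Emit tokens until the stream ends or the max_tokens budget is hit."""
    out = []
    for token in token_stream:
        if len(out) >= max_tokens:
            break  # hard stop: the answer is cut off, possibly mid-thought
        out.append(token)
    return out

answer = "Preheat the oven to 160 C and line the pan with parchment".split()

print(generate(answer, max_tokens=5))    # truncated after 5 tokens
print(generate(answer, max_tokens=100))  # budget larger than the answer: no truncation
```

Real tokenizers split text into subword units rather than whole words, so the same character count can cost a different number of tokens depending on the model.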
Putting it together
As you can see, with just a few parameters you can dramatically change the character of the model:
- temperature and top-p / top-k steer how predictable or creative it is
- max tokens controls how much it can say in a single run
In real projects there is rarely a single “perfect” setting. You experiment, look at the outputs, adjust, and eventually settle on a configuration that fits your specific task.
In the next step we will move to a notebook and see how these parameters behave in practice. We will call the same model with different temperature, top-p and top-k values and compare the answers side by side, so you can directly feel the impact of each setting.
Install libraries and load environment variables
!pip install -q langchain-openai python-dotenv
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
load_dotenv()
Helper function that calls the LLM API with the parameter under test
def test_generation(param_name, values, prompt="Generate a new, unknown recipe for cheesecake."):
    for v in values:
        print("=" * 60)
        print(f"{param_name} = {v}")
        # build the model with only the parameter under test; everything else stays at defaults
        llm = ChatOpenAI(model="gpt-4", **{param_name: v})
        response = llm.invoke(prompt)
        print(response.content, "\n")
Experiment with temperature
test_generation("temperature", [0.0, 0.7, 1.2])
output:
============================================================
temperature = 0.0
Recipe Name: Tropical Coconut Mango Cheesecake
Ingredients:
For the crust:
1. 1 1/2 cups graham cracker crumbs
With temperature = 0 the responses are nearly identical across runs. With temperature = 1.2 they become noticeably more varied and occasionally erratic.
Experiment with top_p
test_generation("top_p", [0.2, 0.7, 1.0])
output:
============================================================
top_p = 0.2
Recipe Name: Tropical Coconut Mango Cheesecake
Ingredients:
For the crust:
1. 1 1/2 cups graham cracker crumbs
With a low top_p the model samples only from the few most probable tokens, so the output stays conservative and repetitive.
Experiment with max_tokens
test_generation("max_tokens", [30, 100, 300])
output:
============================================================
max_tokens = 30
Recipe: Tropical Passion Fruit Cheesecake
Ingredients:
For the crust:
1. 2 cups graham cracker crumbs
The value of max_tokens decides whether the answer is two sentences or a whole page.
That is all for this chapter. In the next one I will show how to build a conversation with memory, so the model keeps the context of previous messages.
see next chapter
see previous chapter
see the full code from this article in the GitHub repository
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.