
Large Language Models (LLMs): The Confidence Conundrum
Last Updated on April 17, 2025 by Editorial Team
Author(s): Nikhilesh Pandey
Originally published on Towards AI.

Introduction
LLMs have revolutionized the way we interact with AI, dazzling us with their ability to understand and generate human-like text. But there’s a catch: confidence calibration. Despite their brilliance, LLMs often struggle to accurately gauge how confident they should be in their answers. This isn’t just a technical hiccup — it’s a critical flaw, especially in high-stakes fields like medicine or law, where overconfidence can lead to disastrous outcomes.
Two recent papers address this challenge using different approaches: “Rewarding Doubt” and “Thermometer”. While both aim to improve the confidence calibration of LLMs, they take distinct paths to achieve this goal. In this article, we explore the details of each approach, compare their pros and cons, and discuss their implications for the future of LLM calibration.
Rewarding Doubt: A Reinforcement Learning Approach
The first paper, “Rewarding Doubt”, introduces a Reinforcement Learning (RL) approach to fine-tune LLMs for better confidence calibration. The authors model the problem as a betting game, where the model predicts a confidence score along with its answer. The reward function is designed to penalize both overconfidence and underconfidence, encouraging the model to align its confidence with the actual probability of being correct.
How do they do this?

- The Betting Game: The model gives an answer and a confidence score (like “I’m 70% sure this is correct”). If the answer is correct, the model gets a reward based on how confident it was. If the answer is wrong, it gets penalized, especially if it was very confident.
- Reward Function: They design a special reward system to encourage the model to be honest about its confidence. If the model is overconfident (too sure) or under-confident (not sure enough), it gets penalized. The goal is to make the model’s confidence match the actual probability that its answer is correct.
- Training: They fine-tune the model using this reward system. Over time, the model learns to give better confidence scores that match how often it’s actually right.
1. Confidence Calibration Problem
Given a question q, the model generates:
- An answer a.
- A confidence score p̂, where 0 ≤ p̂ ≤ 1.
A model is perfectly calibrated if: P(j(a) = 1 | p̂ = x) = x, ∀x∈[0,1]
where j(a) is the correctness function:
- j(a)=1 if a is correct.
- j(a)=0 if a is incorrect.
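To make this definition concrete, here is a minimal sketch (my own illustration, not from either paper) that bins a set of answers by their stated confidence and compares each bin’s average confidence to its empirical accuracy, in the style of an expected calibration error (ECE) check:

```python
import numpy as np

def calibration_report(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    mean confidence to its empirical accuracy (ECE-style check)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (confidences >= low) & (confidences < high)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - accuracy)
        print(f"bin [{low:.1f}, {high:.1f}): conf={avg_conf:.2f}, acc={accuracy:.2f}")
    print(f"Expected Calibration Error: {ece:.3f}")

# Toy example: an overconfident model that says 0.9 but is right only ~60% of the time.
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
hits = rng.random(1000) < 0.6
calibration_report(conf, hits)
```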
2. Reward Function for Confidence Calibration
The reward function is designed to encourage accurate confidence expression. It follows a logarithmic scoring rule: for an answer a with stated confidence p̂,
R(a, p̂) = log(p̂) if j(a) = 1 (the answer is correct), and R(a, p̂) = log(1 − p̂) if j(a) = 0 (the answer is incorrect).
Why Use Logarithmic Scaling?

- Encourages high confidence for correct answers, since log(p̂) approaches its maximum as p̂→1.
- Penalizes overconfidence in wrong answers, as log(1 − p̂) becomes highly negative if p̂→1 and the answer is incorrect.
- Ensures the best reward is achieved when p̂ = p*, the actual probability of correctness.
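As a rough illustration of the rule described above (my own sketch, not the authors’ code), a minimal reward function might look like this, with a small epsilon to keep the logarithm finite at confidences of exactly 0 or 1:

```python
import math

def confidence_reward(correct: bool, confidence: float, eps: float = 1e-6) -> float:
    """Log-score reward: log(p̂) for a correct answer, log(1 - p̂) for an
    incorrect one. Clamping avoids -inf at confidence 0 or 1."""
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if correct else math.log(1.0 - p)

# Overconfidence in a wrong answer is punished far more than honest doubt.
print(confidence_reward(correct=False, confidence=0.99))  # ~ -4.61
print(confidence_reward(correct=False, confidence=0.50))  # ~ -0.69
print(confidence_reward(correct=True,  confidence=0.90))  # ~ -0.11
```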
3. Expected Reward Maximization
The model optimizes its expected reward. If p* denotes the true probability that its answer is correct, the expected reward of stating confidence p̂ is:

E[R] = p*·log(p̂) + (1 − p*)·log(1 − p̂)

Taking the derivative with respect to p̂ and setting it to zero yields the optimal confidence: p̂ = p*.
Thus, the model achieves maximum reward when its confidence matches the true correctness probability.
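A quick numerical sanity check of this result (again just a sketch): sweep p̂ over a grid for a fixed p* and confirm that the expected reward peaks where the stated confidence equals the true correctness probability.

```python
import numpy as np

p_star = 0.7  # true probability of the answer being correct
p_hat = np.linspace(0.01, 0.99, 99)  # candidate stated confidences

# Expected reward: p* log(p̂) + (1 - p*) log(1 - p̂)
expected_reward = p_star * np.log(p_hat) + (1 - p_star) * np.log(1 - p_hat)

best = p_hat[np.argmax(expected_reward)]
print(f"Expected reward is maximized at p̂ ≈ {best:.2f} (p* = {p_star})")
```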
4. Reinforcement Learning Framework
The calibration process is structured as a Markov Decision Process (MDP):
- State Space:
(q: question, a: answer, c₁:ₜ₋₁: previously generated confidence tokens).
- Action Space: Selecting the next confidence token.
- Transition Function: Updates confidence tokens step-by-step.
- Reward Function: Encourages calibrated confidence using the logarithmic formulation.
This reinforcement learning setup enables the model to adaptively adjust its confidence based on past performance.
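To make the MDP concrete, here is a toy environment sketch (a deliberate simplification of the setup described above, not the authors’ implementation) in which each action appends one digit of a two-digit confidence and the terminal reward is the logarithmic score from earlier:

```python
import math

class ConfidenceMDP:
    """Toy MDP: state = (question, answer, confidence tokens so far);
    action = next digit of a two-digit confidence (00-99 percent);
    terminal reward = log-score against the answer's correctness."""

    def __init__(self, question: str, answer: str, answer_is_correct: bool):
        self.state = (question, answer, "")           # c_1:t-1 starts empty
        self.answer_is_correct = answer_is_correct

    def step(self, digit: str):
        q, a, conf_tokens = self.state
        conf_tokens += digit                           # transition: append a confidence token
        self.state = (q, a, conf_tokens)
        done = len(conf_tokens) == 2                   # two digits -> episode ends
        reward = 0.0
        if done:
            p_hat = max(min(int(conf_tokens) / 100, 0.999), 0.001)
            reward = math.log(p_hat) if self.answer_is_correct else math.log(1 - p_hat)
        return self.state, reward, done

# Example: the model states "75" (percent confidence) for a wrong answer.
env = ConfidenceMDP("Capital of Australia?", "Sydney", answer_is_correct=False)
for d in "75":
    state, reward, done = env.step(d)
print(reward)  # log(1 - 0.75) ≈ -1.39
```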
Key Strengths:
- Encourages calibrated confidence: The model learns to express doubt when unsure and certainty when correct.
- Prevents overconfidence: High confidence in incorrect answers is heavily penalized.
- Generalizes across tasks: Works effectively on unseen datasets without retraining.
Limitations:
- RL-based methods can be computationally expensive and harder to implement compared to simpler calibration techniques.
- The approach is primarily tested on factual question-answering tasks, and its effectiveness on more complex tasks (e.g., creative writing) is not explored.
- Training can sometimes collapse to a degenerate policy in which the model always predicts the same confidence score (for example, always saying 50%), a training issue that could be improved in future work.
Thermometer
The second paper, “Thermometer”, takes a different approach by proposing an auxiliary model that predicts a dataset-specific temperature to calibrate the LLM’s uncertainties. The method uses temperature scaling, a simple yet effective technique to adjust the model’s confidence scores without changing the underlying predictions.
How Thermometer Works
LLMs often provide confidence scores for their predictions, but these scores may not be well-calibrated, meaning they do not accurately reflect the true likelihood of correctness. Thermometer addresses this issue by learning an auxiliary model that maps an LLM’s output probabilities to more reliable confidence estimates.
Instead of performing traditional temperature scaling (which requires labeled data), Thermometer predicts a dataset-specific temperature without requiring labels. This allows calibration for unseen tasks without retraining the model.
Let’s understand it with the analogy of an actual thermometer.
Imagine you’re baking cookies in an oven. The recipe says to bake them at 350°F. However, your oven’s built-in temperature display might not be accurate. It could be too hot or too cold, meaning your cookies could end up burnt or undercooked. To fix this, you use a thermometer to measure the actual temperature inside the oven. If the thermometer shows 375°F instead of 350°F, you adjust the temperature down to avoid burning the cookies. Similarly, if it reads 325°F, you increase the heat so your cookies bake properly.
How This Relates to Calibrating an LLM
Now, think of an LLM (like ChatGPT) as the oven, and its confidence in its answers as the temperature. The problem is that sometimes, the LLM feels too confident in wrong answers (like an oven that’s too hot) or not confident enough in correct answers (like an oven that’s too cold).
To fix this, Thermometer acts like a real thermometer for LLMs. It measures how well-calibrated (accurate in confidence) the model is and then adjusts its confidence to be more reliable.
Mathematical Formulation
1. Base LLM Prediction
Given an input prompt xₙ, an LLM generates a probability distribution over possible next tokens via a softmax over its vocabulary:
p_M(y = v | xₙ; W) = exp(wᵥᵀ ϕ(xₙ; W)) / Σᵥ′ exp(wᵥ′ᵀ ϕ(xₙ; W))
where:
- M represents the LLM,
- W are model parameters,
- ϕ(xₙ; W) is a feature extractor that maps input xₙ to a representation,
- wᵥ is the weight vector for token v (with v′ ranging over the vocabulary in the denominator),
- V is the vocabulary size.
2. Temperature Scaling
Temperature scaling adjusts the probabilities to be better calibrated by dividing the logits by a temperature parameter τ before the softmax:
p_M(y = v | xₙ, τ; W) = exp(wᵥᵀ ϕ(xₙ; W) / τ) / Σᵥ′ exp(wᵥ′ᵀ ϕ(xₙ; W) / τ)
Here τ controls how “sharp” or “flat” the probability distribution is: a lower τ sharpens the distribution, making the model more confident in its predictions, while a higher τ smooths it.
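A minimal sketch of temperature scaling applied to a vector of raw logits (the standard technique itself, independent of the paper’s specific implementation):

```python
import numpy as np

def softmax_with_temperature(logits, tau: float = 1.0):
    """Temperature-scaled softmax: divide the logits by tau before normalizing."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                      # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [3.0, 1.0, 0.5]
print(softmax_with_temperature(logits, tau=1.0))  # unscaled baseline
print(softmax_with_temperature(logits, tau=2.0))  # higher tau: flatter, less confident
print(softmax_with_temperature(logits, tau=0.5))  # lower tau: sharper, more confident
```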
3. Learning Dataset-Specific Temperatures
Thermometer predicts the optimal temperature for a dataset k by passing each example through a recognition network ψθ and averaging its predictions:
τₖ = (1/Nₖ) Σₙ ψθ(ϕ(xₙ; W))
where:
- ψθ is a neural network mapping input features to temperature values,
- Nₖ is the number of examples in the dataset.
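A sketch of what such a recognition network could look like in PyTorch (the layer sizes and feature dimension here are placeholders of my own choosing, not the architecture from the paper):

```python
import torch
import torch.nn as nn

class TemperatureNet(nn.Module):
    """Maps an LLM feature vector to a positive scalar temperature."""
    def __init__(self, feature_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),            # keep the predicted temperature positive
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features).squeeze(-1)

# Dataset-level temperature: average the per-example predictions.
psi = TemperatureNet()
features = torch.randn(32, 768)       # dummy feature vectors for N_k examples
tau_k = psi(features).mean()
print(tau_k.item())
```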
4. Optimization
The objective function for training Thermometer is a variational lower bound, maximized over the parameters θ of the recognition network:
Σₖ [ E_q(τₖ;θ)[ log p(Dₖ | τₖ; M) ] − KL( q(τₖ; θ) ∣∣ p(τₖ) ) ]
where p(τₖ) is a prior over the temperature, and:
- q(τₖ;θ) is the variational approximation for the temperature posterior
- p(Dₖ|τₖ;M) is the likelihood of the dataset given the LLM,
- KL(⋅∣∣⋅) is the Kullback-Leibler divergence ensuring smoothness in temperature predictions.
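Below is a deliberately simplified training sketch of this idea: the KL term is replaced by a plain squared penalty that pulls the temperature toward a prior value of 1, and the features, logits, and labels are dummy tensors standing in for a labeled calibration dataset. The paper’s full variational treatment is richer than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in for the recognition network psi_theta (see the sketch above):
# it maps feature vectors to a positive scalar temperature.
psi = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1), nn.Softplus())

def thermometer_loss(features, llm_logits, labels, prior_tau=1.0, beta=0.1):
    """Simplified objective: NLL of the labels under temperature-scaled logits,
    plus a squared penalty standing in for the paper's KL regularizer."""
    tau = psi(features).mean()                        # dataset-level temperature
    nll = F.cross_entropy(llm_logits / tau, labels)   # likelihood term
    reg = beta * (tau - prior_tau) ** 2               # crude stand-in for the KL term
    return nll + reg

# One optimization step on dummy data (feature dim 768, 100 answer options).
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)
features = torch.randn(32, 768)
llm_logits = torch.randn(32, 100)
labels = torch.randint(0, 100, (32,))
loss = thermometer_loss(features, llm_logits, labels)
loss.backward()
opt.step()
print(loss.item())
```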
5. Applying Thermometer at Inference Time
For a new test task with unlabeled data, the temperature is estimated in the same way, by averaging the recognition network’s output over the test examples:
τ̂ = (1/N) Σₙ ψθ(ϕ(xₙ; W)), where N is the number of unlabeled test examples.
This allows the model to self-adjust confidence levels without requiring additional labeled data.
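At inference time, the same kind of recognition network is applied to the unlabeled test examples and the resulting temperature rescales the LLM’s logits. A sketch under stated assumptions (an untrained stand-in network and dummy tensors; in practice ψθ would be the trained network from above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same stand-in recognition network as in the training sketch above.
psi = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1), nn.Softplus())

with torch.no_grad():
    test_features = torch.randn(64, 768)       # unlabeled test examples (dummy features)
    test_logits = torch.randn(64, 100)         # the LLM's logits for those examples
    tau_test = psi(test_features).mean()       # predicted temperature, no labels used
    calibrated = F.softmax(test_logits / tau_test, dim=-1)

print(f"predicted temperature: {tau_test.item():.2f}")
print(f"top confidence of first example: {calibrated[0].max().item():.2f}")
```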
Key Strengths:
- Thermometer is highly efficient, requiring only a small overhead during inference (∼0.5% slower than the uncalibrated LLM).
- The auxiliary model can generalize to new tasks without retraining, making it highly versatile.
- The method works well on both multiple-choice and free-form question-answering tasks, showing its broad applicability.
Limitations:
- While effective, temperature scaling is a relatively simple method and may not capture more complex aspects of uncertainty in LLMs.
- The approach is limited to tasks where a binary correctness measure can be established, which may not apply to more subjective tasks like creative writing.
- Although it doesn’t require labeled data for new tasks, it still needs a sufficient amount of unlabeled data to predict the temperature accurately.
Comparison and Conclusion
Both papers make significant contributions to the field of LLM calibration, but they differ in their approaches and scope. Rewarding Doubt leverages Reinforcement Learning to fine-tune the model’s confidence calibration, making it particularly useful for high-stakes applications. On the other hand, Thermometer uses temperature scaling and an auxiliary model to predict dataset-specific temperatures, offering a more computationally efficient and broadly applicable solution.
Future Directions:
- Combining the strengths of both approaches could lead to even more robust calibration methods. For example, integrating RL-based fine-tuning with temperature scaling could capture both complex uncertainty patterns and maintain computational efficiency.
- Extending these methods to more subjective tasks, such as creative writing or summarization, could further enhance the trustworthiness of LLMs in diverse applications.
References
- “Rewarding Doubt” (arXiv): https://arxiv.org/abs/2503.02623
- “Thermometer” (arXiv): https://arxiv.org/pdf/2403.08819
- MIT News on Thermometer: https://news.mit.edu/2024/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731