
Large Language Models (LLMs): The Confidence Conundrum
Last Updated on April 17, 2025 by Editorial Team
Author(s): Nikhilesh Pandey
Originally published on Towards AI.

Introduction
LLMs have revolutionized the way we interact with AI, dazzling us with their ability to understand and generate human-like text. But there’s a catch: confidence calibration. Despite their brilliance, LLMs often struggle to accurately gauge how confident they should be in their answers. This isn’t just a technical hiccup — it’s a critical flaw, especially in high-stakes fields like medicine or law, where overconfidence can lead to disastrous outcomes.
Two recent papers address this challenge using different approaches: “Rewarding Doubt” and “Thermometer”. While both aim to improve the confidence calibration of LLMs, they take distinct paths to achieve this goal. In this article, we explore the details of each approach, compare their pros and cons, and discuss their implications for the future of LLM calibration.
Rewarding Doubt: A Reinforcement Learning Approach
The first paper, “Rewarding Doubt”, introduces a Reinforcement Learning (RL) approach to fine-tune LLMs for better confidence calibration. The authors model the problem as a betting game, where the model predicts a confidence score along with its answer. The reward function is designed to penalize both overconfidence and underconfidence, encouraging the model to align its confidence with the actual probability of being correct.
How do they do this?

- The Betting Game: The model gives an answer and a confidence score (like “I’m 70% sure this is correct”). If the answer is correct, the model gets a reward based on how confident it was. If the answer is wrong, it gets penalized, especially if it was very confident.
- Reward Function: They design a special reward system to encourage the model to be honest about its confidence. If the model is overconfident (too sure) or under-confident (not sure enough), it gets penalized. The goal is to make the model’s confidence match the actual probability that its answer is correct.
- Training: They fine-tune the model using this reward system. Over time, the model learns to give better confidence scores that match how often it’s actually right.
1. Confidence Calibration Problem
Given a question q, the model generates:
- An answer a.
- A confidence score p̂, where 0 ≤ p̂ ≤ 1.
A model is perfectly calibrated if: P(j(a) = 1 | p̂ = x) = x, ∀x∈[0,1]
where j(a) is the correctness function:
- j(a)=1 if a is correct.
- j(a)=0 if a is incorrect.
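To make this definition concrete, here is a minimal sketch (my own illustration, not from either paper) that bins a set of answers by their stated confidence and compares each bin’s average confidence to its empirical accuracy, in the style of an expected calibration error (ECE) check:

```python
import numpy as np

def calibration_report(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    mean confidence to its empirical accuracy (ECE-style check)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (confidences >= low) & (confidences < high)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - accuracy)
        print(f"bin [{low:.1f}, {high:.1f}): conf={avg_conf:.2f}, acc={accuracy:.2f}")
    print(f"Expected Calibration Error: {ece:.3f}")

# Toy example: an overconfident model that says 0.9 but is right only ~60% of the time.
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
hits = rng.random(1000) < 0.6
calibration_report(conf, hits)
```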
2. Reward Function for Confidence Calibration
The reward function is designed to encourage accurate confidence expression. It follows a logarithmic scoring rule: for an answer a with stated confidence p̂,
R(a, p̂) = log(p̂) if j(a) = 1 (the answer is correct), and R(a, p̂) = log(1 − p̂) if j(a) = 0 (the answer is incorrect).
Why Use Logarithmic Scaling?

- Encourages high confidence for correct answers, since log(p̂) approaches its maximum as p̂→1.
- Penalizes overconfidence in wrong answers, as log(1 − p̂) becomes highly negative if p̂→1 and the answer is incorrect.
- Ensures the best reward is achieved when p̂ = p*, the actual probability of correctness.
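As a rough illustration of the rule described above (my own sketch, not the authors’ code), a minimal reward function might look like this, with a small epsilon to keep the logarithm finite at confidences of exactly 0 or 1:

```python
import math

def confidence_reward(correct: bool, confidence: float, eps: float = 1e-6) -> float:
    """Log-score reward: log(p̂) for a correct answer, log(1 - p̂) for an
    incorrect one. Clamping avoids -inf at confidence 0 or 1."""
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if correct else math.log(1.0 - p)

# Overconfidence in a wrong answer is punished far more than honest doubt.
print(confidence_reward(correct=False, confidence=0.99))  # ~ -4.61
print(confidence_reward(correct=False, confidence=0.50))  # ~ -0.69
print(confidence_reward(correct=True,  confidence=0.90))  # ~ -0.11
```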
3. Expected Reward Maximization
The model optimizes its expected reward. If p* denotes the true probability that its answer is correct, the expected reward of stating confidence p̂ is:

E[R] = p*·log(p̂) + (1 − p*)·log(1 − p̂)

Taking the derivative with respect to p̂ and setting it to zero yields the optimal confidence: p̂ = p*.
Thus, the model achieves maximum reward when its confidence matches the true correctness probability.
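A quick numerical sanity check of this result (again just a sketch): sweep p̂ over a grid for a fixed p* and confirm that the expected reward peaks where the stated confidence equals the true correctness probability.

```python
import numpy as np

p_star = 0.7  # true probability of the answer being correct
p_hat = np.linspace(0.01, 0.99, 99)  # candidate stated confidences

# Expected reward: p* log(p̂) + (1 - p*) log(1 - p̂)
expected_reward = p_star * np.log(p_hat) + (1 - p_star) * np.log(1 - p_hat)

best = p_hat[np.argmax(expected_reward)]
print(f"Expected reward is maximized at p̂ ≈ {best:.2f} (p* = {p_star})")
```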
4. Reinforcement Learning Framework
The calibration process is structured as a Markov Decision Process (MDP):
- State Space:
(q: question, a: answer, c₁:ₜ₋₁: previously generated confidence tokens).
- Action Space: Selecting the next confidence token.
- Transition Function: Updates confidence tokens step-by-step.
- Reward Function: Encourages calibrated confidence using the logarithmic formulation.
This reinforcement learning setup enables the model to adaptively adjust its confidence based on past performance.
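To make the MDP concrete, here is a toy environment sketch (a deliberate simplification of the setup described above, not the authors’ implementation) in which each action appends one digit of a two-digit confidence and the terminal reward is the logarithmic score from earlier:

```python
import math

class ConfidenceMDP:
    """Toy MDP: state = (question, answer, confidence tokens so far);
    action = next digit of a two-digit confidence (00-99 percent);
    terminal reward = log-score against the answer's correctness."""

    def __init__(self, question: str, answer: str, answer_is_correct: bool):
        self.state = (question, answer, "")           # c_1:t-1 starts empty
        self.answer_is_correct = answer_is_correct

    def step(self, digit: str):
        q, a, conf_tokens = self.state
        conf_tokens += digit                           # transition: append a confidence token
        self.state = (q, a, conf_tokens)
        done = len(conf_tokens) == 2                   # two digits -> episode ends
        reward = 0.0
        if done:
            p_hat = max(min(int(conf_tokens) / 100, 0.999), 0.001)
            reward = math.log(p_hat) if self.answer_is_correct else math.log(1 - p_hat)
        return self.state, reward, done

# Example: the model states "75" (percent confidence) for a wrong answer.
env = ConfidenceMDP("Capital of Australia?", "Sydney", answer_is_correct=False)
for d in "75":
    state, reward, done = env.step(d)
print(reward)  # log(1 - 0.75) ≈ -1.39
```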
Key Strengths:
- Encourages calibrated confidence: The model learns to express doubt when unsure and certainty when correct.
- Prevents overconfidence: High confidence in incorrect answers is heavily penalized.
- Generalizes across tasks: Works effectively on unseen datasets without retraining.
Limitations:
- RL-based methods can be computationally expensive and harder to implement compared to simpler calibration techniques.
- The approach is primarily tested on factual question-answering tasks, and its effectiveness on more complex tasks (e.g., creative writing) is not explored.
- Training can sometimes collapse to a degenerate policy in which the model always predicts the same confidence score (for example, always saying 50%), a training issue that could be improved in future work.
Thermometer
The second paper, “Thermometer”, takes a different approach by proposing an auxiliary model that predicts a dataset-specific temperature to calibrate the LLM’s uncertainties. The method uses temperature scaling, a simple yet effective technique to adjust the model’s confidence scores without changing the underlying predictions.
How Thermometer Works
LLMs often provide confidence scores for their predictions, but these scores may not be well-calibrated, meaning they do not accurately reflect the true likelihood of correctness. Thermometer addresses this issue by learning an auxiliary model that maps an LLM’s output probabilities to more reliable confidence estimates.
Instead of performing traditional temperature scaling (which requires labeled data), Thermometer predicts a dataset-specific temperature without requiring labels. This allows calibration for unseen tasks without retraining the model.
Let’s understand it with the analogy of an actual thermometer.
Imagine you’re baking cookies in an oven. The recipe says to bake them at 350°F. However, your oven’s built-in temperature display might not be accurate. It could be too hot or too cold, meaning your cookies could end up burnt or undercooked. To fix this, you use a thermometer to measure the actual temperature inside the oven. If the thermometer shows 375°F instead of 350°F, you adjust the temperature down to avoid burning the cookies. Similarly, if it reads 325°F, you increase the heat so your cookies bake properly.
How This Relates to Calibrating an LLM
Now, think of an LLM (like ChatGPT) as the oven, and its confidence in its answers as the temperature. The problem is that sometimes, the LLM feels too confident in wrong answers (like an oven that’s too hot) or not confident enough in correct answers (like an oven that’s too cold).
To fix this, Thermometer acts like a real thermometer for LLMs. It measures how well-calibrated (accurate in confidence) the model is and then adjusts its confidence to be more reliable.
Mathematical Formulation
1. Base LLM Prediction
Given an input prompt xₙ, an LLM generates a probability distribution over possible next tokens via a softmax over its vocabulary:
p_M(y = v | xₙ; W) = exp(wᵥᵀ ϕ(xₙ; W)) / Σᵥ′ exp(wᵥ′ᵀ ϕ(xₙ; W))
where:
- M represents the LLM,
- W are model parameters,
- ϕ(xₙ; W) is a feature extractor that maps input xₙ to a representation,
- wᵥ is the weight vector for token v (with v′ ranging over the vocabulary in the denominator),
- V is the vocabulary size.
2. Temperature Scaling
Temperature scaling adjusts the probabilities to be better calibrated by dividing the logits by a temperature parameter τ before the softmax:
p_M(y = v | xₙ, τ; W) = exp(wᵥᵀ ϕ(xₙ; W) / τ) / Σᵥ′ exp(wᵥ′ᵀ ϕ(xₙ; W) / τ)
Here τ controls how “sharp” or “flat” the probability distribution is: a lower τ sharpens the distribution, making the model more confident in its predictions, while a higher τ smooths it.
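A minimal sketch of temperature scaling applied to a vector of raw logits (the standard technique itself, independent of the paper’s specific implementation):

```python
import numpy as np

def softmax_with_temperature(logits, tau: float = 1.0):
    """Temperature-scaled softmax: divide the logits by tau before normalizing."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                      # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [3.0, 1.0, 0.5]
print(softmax_with_temperature(logits, tau=1.0))  # unscaled baseline
print(softmax_with_temperature(logits, tau=2.0))  # higher tau: flatter, less confident
print(softmax_with_temperature(logits, tau=0.5))  # lower tau: sharper, more confident
```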
3. Learning Dataset-Specific Temperatures
Thermometer predicts the optimal temperature for a dataset k by passing each example through a recognition network ψθ and averaging its predictions:
τₖ = (1/Nₖ) Σₙ ψθ(ϕ(xₙ; W))
where:
- ψθ is a neural network mapping input features to temperature values,
- Nₖ is the number of examples in the dataset.
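A sketch of what such a recognition network could look like in PyTorch (the layer sizes and feature dimension here are placeholders of my own choosing, not the architecture from the paper):

```python
import torch
import torch.nn as nn

class TemperatureNet(nn.Module):
    """Maps an LLM feature vector to a positive scalar temperature."""
    def __init__(self, feature_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),            # keep the predicted temperature positive
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features).squeeze(-1)

# Dataset-level temperature: average the per-example predictions.
psi = TemperatureNet()
features = torch.randn(32, 768)       # dummy feature vectors for N_k examples
tau_k = psi(features).mean()
print(tau_k.item())
```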
4. Optimization
The objective function for training Thermometer is a variational lower bound, maximized over the parameters θ of the recognition network:
Σₖ [ E_q(τₖ;θ)[ log p(Dₖ | τₖ; M) ] − KL( q(τₖ; θ) ∣∣ p(τₖ) ) ]
where p(τₖ) is a prior over the temperature, and:
- q(τₖ;θ) is the variational approximation for the temperature posterior
- p(Dₖ|τₖ;M) is the likelihood of the dataset given the LLM,
- KL(⋅∣∣⋅) is the Kullback-Leibler divergence ensuring smoothness in temperature predictions.
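Below is a deliberately simplified training sketch of this idea: the KL term is replaced by a plain squared penalty that pulls the temperature toward a prior value of 1, and the features, logits, and labels are dummy tensors standing in for a labeled calibration dataset. The paper’s full variational treatment is richer than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in for the recognition network psi_theta (see the sketch above):
# it maps feature vectors to a positive scalar temperature.
psi = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1), nn.Softplus())

def thermometer_loss(features, llm_logits, labels, prior_tau=1.0, beta=0.1):
    """Simplified objective: NLL of the labels under temperature-scaled logits,
    plus a squared penalty standing in for the paper's KL regularizer."""
    tau = psi(features).mean()                        # dataset-level temperature
    nll = F.cross_entropy(llm_logits / tau, labels)   # likelihood term
    reg = beta * (tau - prior_tau) ** 2               # crude stand-in for the KL term
    return nll + reg

# One optimization step on dummy data (feature dim 768, 100 answer options).
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)
features = torch.randn(32, 768)
llm_logits = torch.randn(32, 100)
labels = torch.randint(0, 100, (32,))
loss = thermometer_loss(features, llm_logits, labels)
loss.backward()
opt.step()
print(loss.item())
```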
5. Applying Thermometer at Inference Time
For a new test task with unlabeled data, the temperature is estimated in the same way, by averaging the recognition network’s output over the test examples:
τ̂ = (1/N) Σₙ ψθ(ϕ(xₙ; W)), where N is the number of unlabeled test examples.
This allows the model to self-adjust confidence levels without requiring additional labeled data.
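At inference time, the same kind of recognition network is applied to the unlabeled test examples and the resulting temperature rescales the LLM’s logits. A sketch under stated assumptions (an untrained stand-in network and dummy tensors; in practice ψθ would be the trained network from above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same stand-in recognition network as in the training sketch above.
psi = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1), nn.Softplus())

with torch.no_grad():
    test_features = torch.randn(64, 768)       # unlabeled test examples (dummy features)
    test_logits = torch.randn(64, 100)         # the LLM's logits for those examples
    tau_test = psi(test_features).mean()       # predicted temperature, no labels used
    calibrated = F.softmax(test_logits / tau_test, dim=-1)

print(f"predicted temperature: {tau_test.item():.2f}")
print(f"top confidence of first example: {calibrated[0].max().item():.2f}")
```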
Key Strengths:
- Thermometer is highly efficient, requiring only a small overhead during inference (∼0.5% slower than the uncalibrated LLM).
- The auxiliary model can generalize to new tasks without retraining, making it highly versatile.
- The method works well on both multiple-choice and free-form question-answering tasks, showing its broad applicability.
Limitations:
- While effective, temperature scaling is a relatively simple method and may not capture more complex aspects of uncertainty in LLMs.
- The approach is limited to tasks where a binary correctness measure can be established, which may not apply to more subjective tasks like creative writing.
- Although it doesn’t require labeled data for new tasks, it still needs a sufficient amount of unlabeled data to predict the temperature accurately.
Comparison and Conclusion
Both papers make significant contributions to the field of LLM calibration, but they differ in their approaches and scope. Rewarding Doubt leverages Reinforcement Learning to fine-tune the model’s confidence calibration, making it particularly useful for high-stakes applications. On the other hand, Thermometer uses temperature scaling and an auxiliary model to predict dataset-specific temperatures, offering a more computationally efficient and broadly applicable solution.
Future Directions:
- Combining the strengths of both approaches could lead to even more robust calibration methods. For example, integrating RL-based fine-tuning with temperature scaling could capture both complex uncertainty patterns and maintain computational efficiency.
- Extending these methods to more subjective tasks, such as creative writing or summarization, could further enhance the trustworthiness of LLMs in diverse applications.
References
- “Rewarding Doubt” (arXiv): https://arxiv.org/abs/2503.02623
- “Thermometer” (arXiv): https://arxiv.org/pdf/2403.08819
- MIT News on Thermometer: https://news.mit.edu/2024/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731