Quantifying Uncertainty in Large Language Models

Understanding when AI systems should trust their own answers

Large language models (LLMs) are powerful tools that can generate fluent and informative responses across a wide range of tasks. However, these models can sometimes produce incorrect answers while appearing highly confident. Our project explores how to measure and understand the uncertainty behind these predictions.


Fong Clement Vo • Brooklyn Pagaza • Pranav Singh • Calwin Li • Kening Li
UC San Diego – Data Science Capstone

Why This Matters

As large language models are increasingly deployed in real-world applications such as education, finance, and healthcare, understanding their reliability becomes critical. While these systems often perform impressively, their probabilistic nature means they can still produce incorrect answers.

A key challenge is that language models may appear confident even when they are wrong. This makes it difficult for users to determine when a model’s output should be trusted. Quantifying uncertainty helps identify when models are likely to make reliable predictions and when caution is required.

In this project, we analyze several uncertainty quantification techniques and evaluate their performance across multiple types of tasks, including multiple-choice question answering, open-ended responses, and interactive reasoning problems.

Datasets

To evaluate how uncertainty behaves across different problem types, we tested models on several benchmark datasets. These datasets span increasingly complex reasoning environments, from multiple-choice question answering to open-ended responses and interactive reasoning problems, allowing us to study how uncertainty methods behave across different levels of task difficulty and answer structure.

Uncertainty Quantification Methods

We evaluate four approaches to estimating uncertainty in large language models. Each method captures a different signal indicating when model predictions may be unreliable.

Quantile Risk Control (QRC)

Quantile Risk Control evaluates whether a model’s most confident predictions are actually reliable. Instead of evaluating all outputs equally, QRC focuses on the highest-confidence predictions and measures how often those predictions are correct.

Technical Details
  • QRC90 – accuracy among the top 10% most confident predictions
  • CVaR90 – mean confidence within that quantile
  • ECE90 – calibration error restricted to high-confidence predictions
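The three metrics above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name is ours, and ECE90 is shown as a single-bin calibration gap over the high-confidence subset rather than a full binned estimate.

```python
import numpy as np

def high_confidence_metrics(confidences, correct, quantile=0.90):
    """Evaluate only the most confident predictions (sketch of QRC90 /
    CVaR90 / ECE90 as described above; names and details are illustrative)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Keep predictions at or above the 90th percentile of confidence.
    threshold = np.quantile(confidences, quantile)
    mask = confidences >= threshold
    conf_top, corr_top = confidences[mask], correct[mask]

    qrc = corr_top.mean()    # accuracy among the top-10% most confident predictions
    cvar = conf_top.mean()   # mean confidence within that quantile
    ece = abs(conf_top.mean() - corr_top.mean())  # single-bin calibration gap
    return qrc, cvar, ece

# Toy usage: well-calibrated synthetic predictions.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
corr = rng.uniform(size=1000) < conf
qrc90, cvar90, ece90 = high_confidence_metrics(conf, corr)
```

A well-calibrated model should show QRC90 close to CVaR90, so the single-bin gap stays small.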

Sampling with Perturbations (SPUQ)

This method estimates uncertainty by introducing small variations to the input prompt. If the model produces significantly different outputs when the prompt is slightly modified, it suggests that the model may be uncertain about its answer.

Technical Details
  • Prompt paraphrasing
  • Adding dummy tokens
  • Modifying system prompts
Output variability is used as a signal for uncertainty.
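The perturbation loop above can be sketched in a few lines. This is a hedged illustration of the idea, not the SPUQ implementation: `ask_model` is a hypothetical callable standing in for an LLM call, and the perturbations shown are toy versions of paraphrasing and dummy-token insertion.

```python
from collections import Counter

def perturbation_uncertainty(ask_model, prompt, perturbations):
    """Estimate uncertainty from answer variability under prompt perturbations.
    `ask_model` is a hypothetical prompt -> answer callable (an assumption)."""
    prompts = [prompt] + [perturb(prompt) for perturb in perturbations]
    answers = [ask_model(p) for p in prompts]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / len(answers)   # high agreement -> low uncertainty
    return top_answer, 1.0 - agreement     # uncertainty score in [0, 1)

# Toy usage with a deterministic stand-in model.
perturbs = [
    lambda p: p + " Please answer concisely.",  # dummy-token-style perturbation
    lambda p: "Question: " + p,                 # system-prompt-style rewrap
]
answer, u = perturbation_uncertainty(lambda p: "Paris", "Capital of France?", perturbs)
```

When every perturbed prompt yields the same answer, the uncertainty score is zero; disagreement across paraphrases pushes it toward one.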

Bayesian LoRA (BLoB)

Bayesian LoRA introduces probabilistic modeling into parameter-efficient fine-tuning. Instead of using a single fixed set of weights, the model samples multiple weight configurations and measures how predictions vary across them.

Technical Details
LoRA adapter weights are treated as random variables sampled from Gaussian or Laplace distributions. Prediction variance across samples is used to estimate uncertainty.
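The sampling step can be illustrated on a toy linear model. This is a sketch of the idea only, assuming Gaussian adapter factors; the shapes, names, and sigmoid head are ours, not the BLoB API, and a real model would sample adapters inside a transformer.

```python
import numpy as np

def blob_predictive_uncertainty(x, W, A_mu, B_mu, sigma=0.1, n_samples=32, seed=0):
    """Bayesian-LoRA-style sketch: treat low-rank adapter factors A and B as
    Gaussians, sample them, and measure how predictions vary across samples."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        A = A_mu + sigma * rng.standard_normal(A_mu.shape)  # sampled adapter factor
        B = B_mu + sigma * rng.standard_normal(B_mu.shape)  # sampled adapter factor
        logits = x @ (W + B @ A)              # frozen base weights + sampled LoRA delta
        probs.append(1.0 / (1.0 + np.exp(-logits)))  # sigmoid for a binary toy task
    probs = np.stack(probs)
    # Mean is the ensemble prediction; std across samples is the uncertainty signal.
    return probs.mean(axis=0), probs.std(axis=0)
```

Inputs with high standard deviation across sampled adapters are the ones the model is least certain about.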

Input Clarification Ensembling (ICE)

ICE addresses uncertainty caused by ambiguous questions. The method generates clarification prompts that interpret the original question in slightly different ways and aggregates the resulting answers.

Technical Details
Model responses to clarification prompts are combined to evaluate confidence and calibration using metrics such as Expected Calibration Error (ECE).
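The aggregation and calibration steps can be sketched as below. This is an illustration under stated assumptions: `ask_model` and `clarify` are hypothetical callables (an LLM call and a clarification generator), and the ensemble's vote fraction stands in for confidence.

```python
from collections import Counter

def ice_aggregate(ask_model, question, clarify):
    """ICE-style sketch: answer several clarified rewrites of a question and
    pool the results. `ask_model` and `clarify` are assumed callables."""
    clarified = clarify(question)            # e.g. ["Did you mean X?", "Did you mean Y?"]
    answers = [ask_model(q) for q in clarified]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)      # ensemble agreement as confidence

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE over (confidence, correctness) pairs."""
    total, n = 0.0, len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            total += len(idx) / n * abs(acc - conf)
    return total
```

Running `ice_aggregate` over a dataset yields (confidence, correctness) pairs, which `expected_calibration_error` then summarizes into a single calibration score.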

Results

Our experiments evaluate how different uncertainty quantification methods perform across multiple datasets and reasoning settings. We measure both predictive accuracy and calibration quality to understand when model confidence aligns with correctness.

Overall, the results reveal important differences in how uncertainty behaves across task types. Multiple-choice tasks tend to produce more stable predictions, while open-ended and interactive reasoning tasks introduce greater variability and uncertainty.

Key Takeaways

Team