Quantifying Uncertainty in Large Language Models

Understanding when AI systems should trust their own answers

Large language models (LLMs) are powerful tools that can generate fluent and informative responses across a wide range of tasks. However, these models can sometimes produce incorrect answers while appearing highly confident. Our project explores how to measure and understand the uncertainty behind these predictions.


Fong Clement Vo • Brooklyn Pagaza • Pranav Singh • Calwin Li • Kening Li
UC San Diego – Data Science Capstone

Why This Matters

As large language models are increasingly deployed in real-world applications such as education, finance, and healthcare, understanding their reliability becomes critical. While these systems often perform impressively, their probabilistic nature means they can still produce incorrect answers.

A key challenge is that language models may appear confident even when they are wrong. This makes it difficult for users to determine when a model’s output should be trusted. Quantifying uncertainty helps identify when models are likely to make reliable predictions and when caution is required.

In this project, we analyze several uncertainty quantification techniques and evaluate their performance across multiple types of tasks, including multiple-choice question answering, open-ended responses, and interactive reasoning problems.

Datasets

To evaluate how uncertainty behaves across different problem types, we tested models on several benchmark datasets. These datasets span increasingly complex reasoning environments, from multiple-choice question answering to open-ended responses and interactive reasoning problems, allowing us to study how uncertainty methods behave across different levels of task difficulty and answer structure.

Uncertainty Quantification Methods

We evaluate four approaches to estimating uncertainty in large language models. Each method captures a different signal indicating when model predictions may be unreliable.

Quantile Risk Control (QRC)

Quantile Risk Control evaluates whether a model’s most confident predictions are actually reliable. Instead of evaluating all outputs equally, QRC focuses on the highest-confidence predictions and measures how often those predictions are correct.

Technical Details
  • QRC90 – accuracy among the top 10% most confident predictions
  • CVaR90 – mean confidence within that quantile
  • ECE90 – calibration error restricted to high-confidence predictions
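The three metrics above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name is ours, and ECE90 is shown as a single-bin calibration gap over the high-confidence subset rather than a full binned estimate.

```python
import numpy as np

def high_confidence_metrics(confidences, correct, quantile=0.90):
    """Evaluate only the most confident predictions (sketch of QRC90 /
    CVaR90 / ECE90 as described above; names and details are illustrative)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Keep predictions at or above the 90th percentile of confidence.
    threshold = np.quantile(confidences, quantile)
    mask = confidences >= threshold
    conf_top, corr_top = confidences[mask], correct[mask]

    qrc = corr_top.mean()    # accuracy among the top-10% most confident predictions
    cvar = conf_top.mean()   # mean confidence within that quantile
    ece = abs(conf_top.mean() - corr_top.mean())  # single-bin calibration gap
    return qrc, cvar, ece

# Toy usage: well-calibrated synthetic predictions.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
corr = rng.uniform(size=1000) < conf
qrc90, cvar90, ece90 = high_confidence_metrics(conf, corr)
```

A well-calibrated model should show QRC90 close to CVaR90, so the single-bin gap stays small.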

Sampling with Perturbations (SPUQ)

This method estimates uncertainty by introducing small variations to the input prompt. If the model produces significantly different outputs when the prompt is slightly modified, it suggests that the model may be uncertain about its answer.

Technical Details
  • Prompt paraphrasing
  • Adding dummy tokens
  • Modifying system prompts
Output variability is used as a signal for uncertainty.
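The perturbation loop above can be sketched in a few lines. This is a hedged illustration of the idea, not the SPUQ implementation: `ask_model` is a hypothetical callable standing in for an LLM call, and the perturbations shown are toy versions of paraphrasing and dummy-token insertion.

```python
from collections import Counter

def perturbation_uncertainty(ask_model, prompt, perturbations):
    """Estimate uncertainty from answer variability under prompt perturbations.
    `ask_model` is a hypothetical prompt -> answer callable (an assumption)."""
    prompts = [prompt] + [perturb(prompt) for perturb in perturbations]
    answers = [ask_model(p) for p in prompts]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / len(answers)   # high agreement -> low uncertainty
    return top_answer, 1.0 - agreement     # uncertainty score in [0, 1)

# Toy usage with a deterministic stand-in model.
perturbs = [
    lambda p: p + " Please answer concisely.",  # dummy-token-style perturbation
    lambda p: "Question: " + p,                 # system-prompt-style rewrap
]
answer, u = perturbation_uncertainty(lambda p: "Paris", "Capital of France?", perturbs)
```

When every perturbed prompt yields the same answer, the uncertainty score is zero; disagreement across paraphrases pushes it toward one.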

Bayesian LoRA (BLoB)

Bayesian LoRA introduces probabilistic modeling into parameter-efficient fine-tuning. Instead of using a single fixed set of weights, the model samples multiple weight configurations and measures how predictions vary across them.

Technical Details
LoRA adapter weights are treated as random variables sampled from Gaussian or Laplace distributions. Prediction variance across samples is used to estimate uncertainty.
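The sampling step can be illustrated on a toy linear model. This is a sketch of the idea only, assuming Gaussian adapter factors; the shapes, names, and sigmoid head are ours, not the BLoB API, and a real model would sample adapters inside a transformer.

```python
import numpy as np

def blob_predictive_uncertainty(x, W, A_mu, B_mu, sigma=0.1, n_samples=32, seed=0):
    """Bayesian-LoRA-style sketch: treat low-rank adapter factors A and B as
    Gaussians, sample them, and measure how predictions vary across samples."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        A = A_mu + sigma * rng.standard_normal(A_mu.shape)  # sampled adapter factor
        B = B_mu + sigma * rng.standard_normal(B_mu.shape)  # sampled adapter factor
        logits = x @ (W + B @ A)              # frozen base weights + sampled LoRA delta
        probs.append(1.0 / (1.0 + np.exp(-logits)))  # sigmoid for a binary toy task
    probs = np.stack(probs)
    # Mean is the ensemble prediction; std across samples is the uncertainty signal.
    return probs.mean(axis=0), probs.std(axis=0)
```

Inputs with high standard deviation across sampled adapters are the ones the model is least certain about.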

Input Clarification Ensembling (ICE)

ICE addresses uncertainty caused by ambiguous questions. The method generates clarification prompts that interpret the original question in slightly different ways and aggregates the resulting answers.

Technical Details
Model responses to clarification prompts are combined to evaluate confidence and calibration using metrics such as Expected Calibration Error (ECE).
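The aggregation and calibration steps can be sketched as below. This is an illustration under stated assumptions: `ask_model` and `clarify` are hypothetical callables (an LLM call and a clarification generator), and the ensemble's vote fraction stands in for confidence.

```python
from collections import Counter

def ice_aggregate(ask_model, question, clarify):
    """ICE-style sketch: answer several clarified rewrites of a question and
    pool the results. `ask_model` and `clarify` are assumed callables."""
    clarified = clarify(question)            # e.g. ["Did you mean X?", "Did you mean Y?"]
    answers = [ask_model(q) for q in clarified]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)      # ensemble agreement as confidence

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE over (confidence, correctness) pairs."""
    total, n = 0.0, len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            total += len(idx) / n * abs(acc - conf)
    return total
```

Running `ice_aggregate` over a dataset yields (confidence, correctness) pairs, which `expected_calibration_error` then summarizes into a single calibration score.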

Results

Our experiments evaluate how different uncertainty quantification methods perform across multiple datasets and reasoning settings. We measure both predictive accuracy and calibration quality to understand when model confidence aligns with correctness.

Overall, the results reveal important differences in how uncertainty behaves across task types. Multiple-choice tasks tend to produce more stable predictions, while open-ended and interactive reasoning tasks introduce greater variability and uncertainty.

Key Takeaways

Team