Large language models (LLMs) are powerful tools that can generate fluent and informative responses across a wide range of tasks. However, these models can sometimes produce incorrect answers while appearing highly confident. Our project explores how to measure and understand the uncertainty behind these predictions.
Fong Clement Vo • Brooklyn Pagaza • Pranav Singh • Calwin Li • Kening Li
UC San Diego – Data Science Capstone
As large language models are increasingly deployed in real-world applications such as education, finance, and healthcare, understanding their reliability becomes critical. While these systems often perform impressively, their probabilistic nature means they can still produce incorrect answers.
A key challenge is that language models may appear confident even when they are wrong. This makes it difficult for users to determine when a model’s output should be trusted. Quantifying uncertainty helps identify when models are likely to make reliable predictions and when caution is required.
In this project, we analyze several uncertainty quantification techniques and evaluate their performance across multiple types of tasks, including multiple-choice question answering, open-ended responses, and interactive reasoning problems.
To evaluate how uncertainty behaves across different problem types, we tested models on several benchmark datasets.
These datasets represent increasingly complex reasoning environments, allowing us to study how uncertainty methods behave across different levels of task difficulty and answer structure.
We evaluate four approaches to estimating uncertainty in large language models. Each method captures a different signal indicating when model predictions may be unreliable.
Quantile Risk Control evaluates whether a model’s most confident predictions are actually reliable. Instead of evaluating all outputs equally, QRC focuses on the highest-confidence predictions and measures how often those predictions are correct.
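As a minimal sketch of this idea (with hypothetical confidence scores and labels, not the project's actual data), the code below keeps only predictions above a confidence quantile and measures the error rate within that high-confidence set:

```python
import numpy as np

# Hypothetical confidence scores and correctness labels for 8 predictions.
confidences = np.array([0.95, 0.91, 0.88, 0.80, 0.72, 0.65, 0.55, 0.40])
correct     = np.array([1,    1,    1,    0,    1,    0,    0,    0])

def risk_at_quantile(conf, correct, q):
    """Error rate among predictions whose confidence sits at or above
    the q-th quantile of all confidence scores."""
    threshold = np.quantile(conf, q)
    selected = conf >= threshold
    return 1.0 - correct[selected].mean()

# Risk restricted to the most confident quarter of predictions.
print(risk_at_quantile(confidences, correct, 0.75))
```

If this risk stays low, the model's top-confidence outputs can be trusted more than its outputs overall.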
Prompt perturbation estimates uncertainty by introducing small variations to the input prompt. If the model produces significantly different outputs when the prompt is slightly modified, this suggests that the model is uncertain about its answer.
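A toy sketch of this procedure is shown below; the `perturb` rewordings and the stub `model` function are stand-ins for a real paraphraser and a real LLM call:

```python
from collections import Counter

def perturb(prompt):
    """Toy perturbations: simple rewordings of the same question.
    (In practice these would come from paraphrasing or token-level edits.)"""
    return [prompt,
            prompt.replace("What is", "Tell me"),
            "Please answer: " + prompt]

def model(prompt):
    """Stand-in for an LLM call, assumed to return a short answer string."""
    return "Paris"  # a real model's answer may shift with the wording

def perturbation_uncertainty(prompt, model):
    """Disagreement rate across perturbed prompts: 0.0 means all agree."""
    answers = [model(p) for p in perturb(prompt)]
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

print(perturbation_uncertainty("What is the capital of France?", model))
```

High disagreement across perturbations flags an answer that depends on surface wording rather than the underlying question.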
Bayesian LoRA introduces probabilistic modeling into parameter-efficient fine-tuning. Instead of using a single fixed set of weights, the model samples multiple weight configurations and measures how predictions vary across them.
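The sampling step can be sketched with NumPy as below. The shapes, rank, and Gaussian noise scale are illustrative assumptions, not the project's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen base weight and low-rank (LoRA) factor means, rank r = 2.
W = rng.normal(size=(4, 4))
A_mean = 0.1 * rng.normal(size=(4, 2))
B_mean = 0.1 * rng.normal(size=(2, 4))
x = rng.normal(size=4)  # one input example

def sample_prediction(scale=0.05):
    """Draw one weight configuration from a Gaussian over the LoRA factors
    and compute the corresponding output."""
    A = A_mean + scale * rng.normal(size=A_mean.shape)
    B = B_mean + scale * rng.normal(size=B_mean.shape)
    return (W + A @ B) @ x

preds = np.stack([sample_prediction() for _ in range(100)])
# Spread across sampled weight configurations signals model uncertainty.
print(preds.std(axis=0))
```

Because only the small LoRA factors are treated probabilistically, the extra sampling cost stays modest relative to the full model.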
ICE addresses uncertainty caused by ambiguous questions. The method generates clarification prompts that interpret the original question in slightly different ways and aggregates the resulting answers.
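A minimal sketch of clarification ensembling follows; the `clarify` rewrites and the stub `model` are hypothetical stand-ins for LLM calls:

```python
from collections import Counter

def clarify(question):
    """Hypothetical clarifier: rewrites an ambiguous question under
    different interpretations (a real system would use an LLM here)."""
    return [question + " (interpreting 'bank' as a riverbank)",
            question + " (interpreting 'bank' as a financial institution)"]

def model(prompt):
    # Stand-in answers keyed on the clarified interpretation.
    return "the shore" if "riverbank" in prompt else "a building"

def ice_answer(question, model):
    """Answer each clarified interpretation, then aggregate by majority."""
    answers = [model(c) for c in clarify(question)]
    top, n = Counter(answers).most_common(1)[0]
    # Confidence = fraction of interpretations agreeing on the top answer.
    return top, n / len(answers)

answer, confidence = ice_answer("Where is the bank?", model)
print(answer, confidence)
```

A low agreement fraction indicates that much of the uncertainty comes from the question's ambiguity rather than from the model itself.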
Our experiments evaluate how different uncertainty quantification methods perform across multiple datasets and reasoning settings. We measure both predictive accuracy and calibration quality to understand when model confidence aligns with correctness.
Overall, the results reveal important differences in how uncertainty behaves across task types. Multiple-choice tasks tend to produce more stable predictions, while open-ended and interactive reasoning tasks introduce greater variability and uncertainty.