Description
\acp{LLM} achieve state-of-the-art performance across diverse tasks such as question answering, reasoning, and summarization.
Despite these successes, their reliability is undermined by hallucinations, where the model produces fluent but factually incorrect
outputs. These errors can range from subtle factual slips to outright fabrications, which are particularly problematic in high-stakes
domains. Existing detection approaches largely rely on supervised learning with labeled data; however, such datasets are
expensive to construct, incomplete in coverage, and encourage detectors that overfit to specific error types.
This thesis explores an alternative unsupervised approach to detect hallucinations based on anomalies in the residual stream, a
structured internal representation of transformer models that has been shown to encode truth-related signals. We propose an
\ac{AE} framework trained exclusively on residual-stream activations from truthful responses, with the goal of learning the distribution of normal
activations. Deviations from this distribution surface as elevated reconstruction errors, which we interpret as potential indicators of hallucinations.
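Concretely, and as a minimal sketch of this scoring idea (the notation $E_\phi$, $D_\theta$, and the squared $\ell_2$ error are illustrative choices rather than the exact formulation used later), the anomaly score for a residual-stream activation $h_{\ell,t}$ at layer $\ell$ and token position $t$ is its reconstruction error
\begin{equation*}
    s(h_{\ell,t}) = \bigl\lVert h_{\ell,t} - D_\theta\bigl(E_\phi(h_{\ell,t})\bigr) \bigr\rVert_2^2,
\end{equation*}
where the encoder $E_\phi$ and decoder $D_\theta$ are fitted only to activations from truthful responses, and a token is flagged when $s$ exceeds a suitable threshold.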
Building on this idea, we address three research questions. First, we examine the feasibility of residual-based anomaly detection and characterize
how anomaly signals manifest across layers and error categories. Second, we investigate which modeling choices, including architecture, bottleneck
size, kernel scale, and scoring metric, most strongly affect performance. Third, we assess the generalization and robustness of the approach across
tasks, datasets, and model architectures.
Our experiments show that hallucinations consistently appear as localized token-level anomalies, especially in mid-to-late transformer layers.
A Single-Token \ac{C1AE} trained on truthful activations achieves strong performance, outperforming sequence-level baselines and transferring
effectively across tasks and models. Detection is most reliable for severe factual errors, while minor slips and incomplete answers remain harder to
capture. These findings demonstrate that \ac{LLM} residual streams provide a powerful signal for unsupervised hallucination detection, while
also highlighting open challenges in thresholding, layer selection, and cross-model transfer.
Keywords: Large Language Models, Hallucination Detection, Anomaly Detection, Residual Stream Analysis, Unsupervised Learning