Unmasking Hidden Defects in AI with Anomaly Detection in LLM Representation Space

Supervisor(s): Daniel Kowatsch, Maximilian Wendlinger
Status: finished
Topic: Others
Author: Alexander Wagner
Submission: 2025-09-17
Type of Thesis: Master's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Large language models (LLMs) achieve state-of-the-art performance across diverse tasks such as question answering, reasoning, and summarization. Despite these successes, their reliability is undermined by hallucinations, in which the model produces fluent but factually incorrect outputs. These errors range from subtle factual slips to outright fabrications and are particularly problematic in high-stakes domains. Existing detection approaches largely rely on supervised learning with labeled data; however, the necessary datasets are expensive to construct, incomplete in coverage, and encourage overfitting to specific error types.
This thesis explores an alternative, unsupervised approach to detecting hallucinations based on anomalies in the residual stream, a structured internal representation of transformer models that has been shown to encode truth-related signals. We propose an autoencoder (AE) framework trained exclusively on truthful responses, with the goal of learning the distribution of normal activations. Deviations from this distribution are captured as reconstruction errors, which we interpret as potential indicators of hallucinations.
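To make this concrete, the following is a minimal, illustrative sketch in PyTorch, not the thesis implementation: an autoencoder is trained only on residual-stream activations collected from truthful responses, so that a high reconstruction error later signals an out-of-distribution activation. The class name ResidualAE, the layer sizes, and all hyperparameters are assumptions chosen for illustration.

import torch
import torch.nn as nn

class ResidualAE(nn.Module):
    # Autoencoder over single-token residual-stream activations.
    def __init__(self, hidden_dim: int = 4096, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(hidden_dim, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 1024), nn.ReLU(),
            nn.Linear(1024, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def train_on_truthful(model: ResidualAE, truthful_acts: torch.Tensor,
                      epochs: int = 10, lr: float = 1e-3) -> None:
    # truthful_acts: (num_tokens, hidden_dim) activations taken from responses
    # known to be correct; the model only ever sees "normal" data.
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(truthful_acts),
        batch_size=256, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for (batch,) in loader:
            opt.zero_grad()
            loss = loss_fn(model(batch), batch)  # reconstruct normal activations
            loss.backward()
            opt.step()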
Building on this idea, we address three research questions. First, we examine the feasibility of residual-based anomaly detection and characterize how anomaly signals manifest across layers and error categories. Second, we investigate which modeling choices, including architecture, bottleneck size, kernel scale, and scoring metric, most strongly affect performance. Third, we assess the generalization and robustness of the approach across tasks, datasets, and model architectures.
Our experiments show that hallucinations consistently appear as localized token-level anomalies, especially in mid-to-late transformer layers. A single-token C1AE trained on truthful activations achieves strong performance, outperforming sequence-level baselines and transferring effectively across tasks and models. Detection is most reliable for severe factual errors, while minor or incomplete deviations remain harder to capture. These findings demonstrate that LLM residual streams provide a powerful signal for unsupervised hallucination detection, while also highlighting open challenges in thresholding, layer selection, and cross-model transfer.
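As a rough illustration of the token-level scoring this implies (again a sketch under assumptions, reusing the ResidualAE defined above; the quantile-based threshold is a stand-in heuristic, not necessarily the metric used in the thesis), each token's activation at a chosen mid-to-late layer is scored by its reconstruction error and flagged when it exceeds a threshold calibrated on held-out truthful activations.

import torch

@torch.no_grad()
def token_anomaly_scores(model, layer_acts: torch.Tensor) -> torch.Tensor:
    # layer_acts: (seq_len, hidden_dim) residual-stream activations of one
    # response at a fixed layer; returns one reconstruction-error score per token.
    model.eval()
    recon = model(layer_acts)
    return ((recon - layer_acts) ** 2).mean(dim=-1)

@torch.no_grad()
def calibrate_threshold(model, heldout_truthful_acts: torch.Tensor,
                        quantile: float = 0.99) -> float:
    # Threshold = high quantile of scores on held-out truthful tokens.
    scores = token_anomaly_scores(model, heldout_truthful_acts)
    return torch.quantile(scores, quantile).item()

def flag_tokens(model, layer_acts: torch.Tensor, threshold: float) -> torch.Tensor:
    # Indices of tokens whose anomaly score exceeds the calibrated threshold.
    scores = token_anomaly_scores(model, layer_acts)
    return (scores > threshold).nonzero(as_tuple=True)[0]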
Keywords: Large Language Models, Hallucination Detection, Anomaly Detection, Residual Stream Analysis, Unsupervised Learning