
Dataset Generation for Software Vulnerability Detection with LLMs

Supervisor(s): Tobias Specht, Alexander Wagner
Status: finished
Topic: Others
Author: Dogan Can Hasanoglu
Submission: 2025-11-05
Type of Thesis: Master's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Research on software vulnerability detection is constrained by datasets that either rely on
simplified synthetic samples or suffer from noisy labels in real code. This thesis introduces a
practical framework that produces package-level, traceable vulnerability variants from Debian
sources. The pipeline constrains large language models to emit strict unified diffs, sanitizes
outputs, and applies patches to selected C/C++ files while preserving build systems and
packaging metadata. Validation is containerized end-to-end: baseline rebuilds establish the
ground truth; modified packages are rebuilt and installed in isolation; smoke tests and, when
available, as-installed tests provide conservative evidence that base functionality is retained.
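
To make the diff interface concrete, the following is a minimal sketch of the sanitization and patch-application step, assuming a Python pipeline; the helper names, the fence-stripping heuristic, and the use of GNU patch are illustrative assumptions, not the thesis implementation.

```python
import re
import subprocess
import tempfile

FENCE = re.compile(r"^```[a-zA-Z]*\s*$")

def sanitize_llm_diff(raw: str) -> str | None:
    """Strip markdown fences and leading prose; keep only the unified diff.

    Returns None when no '---'/'+++' header pair survives, so malformed
    model output is rejected before any patch is attempted.
    """
    lines = [ln for ln in raw.splitlines() if not FENCE.match(ln)]
    for i, ln in enumerate(lines):
        if ln.startswith("--- "):  # a strict unified diff starts at this header
            diff = "\n".join(lines[i:]) + "\n"
            return diff if "\n+++ " in "\n" + diff else None
    return None

def apply_diff(diff: str, source_dir: str) -> bool:
    """Dry-run first, then apply; a failing dry run filters the variant out."""
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(diff)
        patch_file = f.name
    dry = subprocess.run(["patch", "-p1", "--dry-run", "-i", patch_file],
                         cwd=source_dir, capture_output=True)
    if dry.returncode != 0:
        return False
    done = subprocess.run(["patch", "-p1", "-i", patch_file],
                          cwd=source_dir, capture_output=True)
    return done.returncode == 0
```

A failed sanitization or dry run simply drops the candidate, mirroring the conservative, filter-first design described above.
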
The approach moves beyond simplified single-file edits by operating at the package level
with pinned environments and machine-readable manifests. Every run records versions,
parameters, logs, and outcomes, supporting auditing and reproducibility. The framework is
configurable by package set, vulnerability theme, and prompt strategy, enabling controlled
variation without hand-tuning for each project. Ethics and safety are integral to the design:
all injections are synthetic, execution is contained, and releases favor documentation and diffs
over installable binaries.
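
As a rough illustration of what such a per-run manifest could look like, the sketch below writes one run record as JSON; all field names and example values are assumptions for illustration, not the thesis schema.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def write_run_manifest(out_dir: Path, package: str, theme: str,
                       prompt_strategy: str, model: str,
                       outcome: str, logs: list[str]) -> Path:
    """Record one pipeline run as a machine-readable manifest."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "package": package,              # e.g. a pinned Debian source package
        "vulnerability_theme": theme,    # e.g. "buffer-overflow"
        "prompt_strategy": prompt_strategy,
        "model": model,
        "host": platform.platform(),
        "outcome": outcome,              # e.g. "rebuild-ok" or "patch-rejected"
        "logs": logs,                    # paths to build/test logs for audit
    }
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```
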
This advances evaluation beyond artificial examples by supplying real, buildable contexts
with consistent provenance, improving detector training and enabling reproducible benchmarking
across varied packages. The corpus also serves research and education settings that
require real code, and it provides a practical baseline for future work across ecosystems.

The resulting dataset couples real-world code with ground-truth diffs and reproducible
execution signals. This gives vulnerability detection tools clearer training and evaluation
signals and establishes a consistent basis for comparison across packages. The contributions
include a strict diff interface with output sanitization, a rebuild gate that filters out
misapplied changes (see the sketch below), and a containerized validation path that yields
comparable results across heterogeneous projects. The framework offers a reproducible
foundation that can be extended with automated trigger generation and broader ecosystem
support, aiming to enhance the realism and utility of datasets used for secure software
development.
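
As an illustration of the rebuild gate, the sketch below rebuilds a modified package inside a network-isolated container and treats any build failure as a rejection; the image name and build command are assumptions (a pinned image with build dependencies pre-installed), not the thesis setup.

```python
import subprocess
from pathlib import Path

def rebuild_gate(source_dir: Path,
                 image: str = "vuln-dataset/build-env:pinned") -> bool:
    """Rebuild a (possibly modified) Debian source package in isolation.

    Assumes `image` already contains all build dependencies, so the
    container can run with networking disabled for containment.
    A non-zero exit code filters the variant out of the dataset.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{source_dir.resolve()}:/src", "-w", "/src",
         image,
         "dpkg-buildpackage", "-us", "-uc", "-b"],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

Gating on the container's exit code keeps the check conservative: only variants that still build in the pinned environment can proceed to installation and smoke testing.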