
Dataset Generation for Software Vulnerability Detection with LLMs

Supervisor(s): Tobias Specht, Alexander Wagner
Status: finished
Topic: Others
Author: Dogan Can Hasanoglu
Submission: 2025-11-05
Type of Thesis: Master's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Research on software vulnerability detection is constrained by datasets that either rely on
simplified synthetic samples or suffer from noisy labels in real code. This thesis introduces a
practical framework that produces package-level, traceable vulnerability variants from Debian
sources. The pipeline constrains large language models to emit strict unified diffs, sanitizes
outputs, and applies patches to selected C/C++ files while preserving build systems and
packaging metadata. Validation is containerized end-to-end: baseline rebuilds establish the
ground truth; modified packages are rebuilt and installed in isolation; smoke tests and, when
available, as-installed tests provide conservative evidence that base functionality is retained.
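
To make the diff interface concrete, the following is a minimal sketch of the sanitization and patch-application step, assuming a Python pipeline; the helper names, the fence-stripping heuristic, and the use of GNU patch are illustrative assumptions, not the thesis implementation.

```python
import re
import subprocess
import tempfile

FENCE = re.compile(r"^```[a-zA-Z]*\s*$")

def sanitize_llm_diff(raw: str) -> str | None:
    """Strip markdown fences and leading prose; keep only the unified diff.

    Returns None when no '---'/'+++' header pair survives, so malformed
    model output is rejected before any patch is attempted.
    """
    lines = [ln for ln in raw.splitlines() if not FENCE.match(ln)]
    for i, ln in enumerate(lines):
        if ln.startswith("--- "):  # a strict unified diff starts at this header
            diff = "\n".join(lines[i:]) + "\n"
            return diff if "\n+++ " in "\n" + diff else None
    return None

def apply_diff(diff: str, source_dir: str) -> bool:
    """Dry-run first, then apply; a failing dry run filters the variant out."""
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(diff)
        patch_file = f.name
    dry = subprocess.run(["patch", "-p1", "--dry-run", "-i", patch_file],
                         cwd=source_dir, capture_output=True)
    if dry.returncode != 0:
        return False
    done = subprocess.run(["patch", "-p1", "-i", patch_file],
                          cwd=source_dir, capture_output=True)
    return done.returncode == 0
```

A failed sanitization or dry run simply drops the candidate, mirroring the conservative, filter-first design described above.
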
The approach moves beyond simplified single-file edits by operating at the package level
with pinned environments and machine-readable manifests. Every run records versions,
parameters, logs, and outcomes, supporting auditing and reproducibility. The framework is
configurable by package set, vulnerability theme, and prompt strategy, enabling controlled
variation without hand-tuning for each project. Ethics and safety are integral to the design:
all injections are synthetic, execution is contained, and releases favor documentation and diffs
over installable binaries.
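
As a rough illustration of what such a per-run manifest could look like, the sketch below writes one run record as JSON; all field names and example values are assumptions for illustration, not the thesis schema.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def write_run_manifest(out_dir: Path, package: str, theme: str,
                       prompt_strategy: str, model: str,
                       outcome: str, logs: list[str]) -> Path:
    """Record one pipeline run as a machine-readable manifest."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "package": package,              # e.g. a pinned Debian source package
        "vulnerability_theme": theme,    # e.g. "buffer-overflow"
        "prompt_strategy": prompt_strategy,
        "model": model,
        "host": platform.platform(),
        "outcome": outcome,              # e.g. "rebuild-ok" or "patch-rejected"
        "logs": logs,                    # paths to build/test logs for audit
    }
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```
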
This advances evaluation beyond artificial examples by supplying real, buildable contexts
with consistent provenance, improving detector training and enabling reproducible benchmarking
across varied packages. The corpus also serves research and education settings that
require real code, and it provides a practical baseline for future work across ecosystems.

The resulting dataset couples real-world code with ground-truth diffs and reproducible
execution signals. This gives vulnerability detection tools clearer training and evaluation
signals and establishes a consistent basis for comparison across packages. The contributions
include a strict diff interface with output sanitization, a rebuild gate that filters out
misapplied changes (see the sketch below), and a containerized validation path that yields
comparable results across heterogeneous projects. The framework offers a reproducible
foundation that can be extended with automated trigger generation and broader ecosystem
support, aiming to enhance the realism and utility of datasets used for secure software
development.
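
As an illustration of the rebuild gate, the sketch below rebuilds a modified package inside a network-isolated container and treats any build failure as a rejection; the image name and build command are assumptions (a pinned image with build dependencies pre-installed), not the thesis setup.

```python
import subprocess
from pathlib import Path

def rebuild_gate(source_dir: Path,
                 image: str = "vuln-dataset/build-env:pinned") -> bool:
    """Rebuild a (possibly modified) Debian source package in isolation.

    Assumes `image` already contains all build dependencies, so the
    container can run with networking disabled for containment.
    A non-zero exit code filters the variant out of the dataset.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{source_dir.resolve()}:/src", "-w", "/src",
         image,
         "dpkg-buildpackage", "-us", "-uc", "-b"],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

Gating on the container's exit code keeps the check conservative: only variants that still build in the pinned environment can proceed to installation and smoke testing.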