Description
This thesis re-evaluates the effectiveness of modern static application security testing (SAST)
tools on source code and decompiled binaries of ten real-world C/C++ vulnerabilities
originally studied by Mantovani et al. [35]. The goal was to measure current detection
rates and quantify the effect of decompilation on practical vulnerability discovery. Seven
tools and frameworks (CodeQL, Joern, the Fraunhofer CPG, CPPCheck, IKOS, Infer, and
Clang’s scan-build) and two decompilers (Ghidra and RetDec) were tested across three
scenarios: the original source, the decompiled snapshots from the prior study (repaired to
be compilable), and new decompilations produced and repaired with a hybrid per-function
patching workflow. Repairing and integrating decompiled functions required on average
20-30 minutes of manual editing per case.
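As a rough illustration of what this repair step involves, the following sketch shows the kind of scaffolding Ghidra-style pseudocode typically needs before a C compiler accepts it; the function, the callee name, and the types are constructed for this description and are not taken from the ten studied cases.

    /* Illustrative sketch only: simplified scaffolding of the kind needed to
     * make Ghidra-style pseudocode compile as a stand-alone translation unit. */
    #include <string.h>

    /* Map the decompiler's synthetic types onto standard C types. */
    typedef unsigned int undefined4;

    /* Stub prototype for a callee the decompiler references but does not
     * define (hypothetical name). */
    int process_record(char *buf, undefined4 len);

    /* A Ghidra-style function after repair: parameter and return types fixed
     * and missing prototypes supplied so that the file compiles. */
    undefined4 handle_input(char *param_1, undefined4 param_2)
    {
        char local_108[256];
        undefined4 uVar1;

        strcpy(local_108, param_1);           /* decompiled call kept as-is */
        uVar1 = (undefined4)process_record(local_108, param_2);
        return uVar1;
    }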
The results show that source-level analysis remains the most effective approach. Across
tools, we observed roughly 57% recall and 19% precision on the source baseline, where the
Fraunhofer CPG achieved the best balance with 62.5% precision and 90% recall. CodeQL
and Joern achieved good recall but suffered from query brittleness and API-compatibility
issues: many legacy CodeQL and Joern queries from the previous study either failed to run or
returned different results under modern releases, contributing to apparent regressions.
Query maintenance is therefore a non-trivial operational cost.
Running the tools on decompiled pseudocode substantially degrades performance. Average
recall drops to around 31% on Ghidra output, with a precision of 12%. This means that
approximately 55% of vulnerabilities detected at source level remain detectable after Ghidra
decompilation. RetDec results are worse: average recall is around 14% and precision only 4%.
On decompiled input, the Fraunhofer CPG and Joern again achieved the best results, while
the other tools lost most of their detections.
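To make the degradation mechanism concrete, a constructed example (not one of the studied cases; all names are hypothetical) contrasts a source-level pattern with a plausible decompiled rendering of the same logic:

    #include <string.h>

    #define NAME_LEN 64

    /* Source form: the fixed-size destination and the unbounded copy are
     * directly visible to length- or taint-based queries. */
    void copy_name(const char *name)
    {
        char buf[NAME_LEN];
        strcpy(buf, name);            /* classic unbounded-copy pattern */
    }

    /* Decompiler-style form (Ghidra-like, simplified): names and macros are
     * gone, the buffer survives only as a raw stack local, and the copy may
     * be rendered as an explicit loop, so rules that match a strcpy() call
     * or reason about the declared buffer size no longer fire. */
    void FUN_00101234(char *param_1)
    {
        char local_48[64];
        int iVar1;

        iVar1 = 0;
        do {
            local_48[iVar1] = param_1[iVar1];
            iVar1 = iVar1 + 1;
        } while (param_1[iVar1 - 1] != '\0');
    }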
Roughly 43% of the decompiled code that Mantovani et al. [35] manually repaired to be
"recompilable" no longer builds with modern toolchains. This makes it difficult to determine
whether the differences are due to tool regressions or changes in the environment and
complicates direct comparisons with the original results. It also underscores portability
problems. Analyses are sensitive to differences in tools, compilers, and build environments,
so results may not generalize across platforms or over time.
False-positive behavior is tool-specific. IKOS reports a very high number of false positives
(FPs) under our configurations, whereas the query-based tools produce fewer FPs when
queries execute correctly. Importantly, choices in tool configuration and subjective labeling
criteria, i.e., what constitutes a true positive, can alter reported detection and false-positive
rates as much as the tools themselves.
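For reference, the reported rates follow the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) over the labeled findings, so moving even a few borderline findings between the true-positive and false-positive categories is enough to shift both figures noticeably.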