Description
This thesis re-evaluates the effectiveness of modern static application security testing (SAST)
tools on source code and decompiled binaries of ten real-world C/C++ vulnerabilities
originally studied by Mantovani et al. [35]. The goal was to measure current detection
rates and quantify the effect of decompilation on practical vulnerability discovery. Seven
tools and frameworks (CodeQL, Joern, the Fraunhofer CPG, CPPCheck, IKOS, Infer, and
Clang’s scan-build) and two decompilers (Ghidra and RetDec) were tested across three
scenarios: the original source, the decompiled snapshots from the prior study (repaired to
be compilable), and new decompilations produced and repaired with a hybrid per-function
patching workflow. Repairing and integrating decompiled functions required on average
20-30 minutes of manual editing per case.
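As a rough illustration of what this repair step involves, the following sketch shows the kind of scaffolding Ghidra-style pseudocode typically needs before a C compiler accepts it; the function, the callee name, and the types are constructed for this description and are not taken from the ten studied cases.

    /* Illustrative sketch only: simplified scaffolding of the kind needed to
     * make Ghidra-style pseudocode compile as a stand-alone translation unit. */
    #include <string.h>

    /* Map the decompiler's synthetic types onto standard C types. */
    typedef unsigned int undefined4;

    /* Stub prototype for a callee the decompiler references but does not
     * define (hypothetical name). */
    int process_record(char *buf, undefined4 len);

    /* A Ghidra-style function after repair: parameter and return types fixed
     * and missing prototypes supplied so that the file compiles. */
    undefined4 handle_input(char *param_1, undefined4 param_2)
    {
        char local_108[256];
        undefined4 uVar1;

        strcpy(local_108, param_1);           /* decompiled call kept as-is */
        uVar1 = (undefined4)process_record(local_108, param_2);
        return uVar1;
    }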
The results show that source-level analysis remains the most effective approach. Across
tools, we observed roughly 57% recall and 19% precision on the source baseline, where the
Fraunhofer CPG achieved the best balance with 62.5% precision and 90% recall. CodeQL
and Joern achieved good recall but suffered from query brittleness and API-compatibility
issues: many legacy CodeQL and Joern queries from the previous study either failed to run or
returned different results under modern releases, contributing to apparent regressions.
Query maintenance is therefore a non-trivial operational cost.
Running the tools on decompiled pseudocode substantially degrades performance. Average
recall drops to around 31% on Ghidra output, with a precision of 12%. This means that
approximately 55% of vulnerabilities detected at source level remain detectable after Ghidra
decompilation. RetDec results are worse: average recall is around 14% and precision only 4%.
On decompiled input, the Fraunhofer CPG and Joern again achieved the best results, while
the other tools lost most of their detections.
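To make the degradation mechanism concrete, a constructed example (not one of the studied cases; all names are hypothetical) contrasts a source-level pattern with a plausible decompiled rendering of the same logic:

    #include <string.h>

    #define NAME_LEN 64

    /* Source form: the fixed-size destination and the unbounded copy are
     * directly visible to length- or taint-based queries. */
    void copy_name(const char *name)
    {
        char buf[NAME_LEN];
        strcpy(buf, name);            /* classic unbounded-copy pattern */
    }

    /* Decompiler-style form (Ghidra-like, simplified): names and macros are
     * gone, the buffer survives only as a raw stack local, and the copy may
     * be rendered as an explicit loop, so rules that match a strcpy() call
     * or reason about the declared buffer size no longer fire. */
    void FUN_00101234(char *param_1)
    {
        char local_48[64];
        int iVar1;

        iVar1 = 0;
        do {
            local_48[iVar1] = param_1[iVar1];
            iVar1 = iVar1 + 1;
        } while (param_1[iVar1 - 1] != '\0');
    }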
Roughly 43% of the decompiled code that Mantovani et al. [35] manually repaired to be
"recompilable" no longer builds with modern toolchains. This makes it difficult to determine
whether the differences are due to tool regressions or changes in the environment and
complicates direct comparisons with the original results. It also underscores portability
problems. Analyses are sensitive to differences in tools, compilers, and build environments,
so results may not generalize across platforms or over time.
False-positive behavior is tool-specific. IKOS reports a very high number of false positives
(FPs) under our configurations, whereas the query-based tools produce fewer FPs when
queries execute correctly. Importantly, choices in tool configuration and subjective labeling
criteria, i.e., what constitutes a true positive, can alter reported detection and false-positive
rates as much as the tools themselves.
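For reference, the reported rates follow the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) over the labeled findings, so moving even a few borderline findings between the true-positive and false-positive categories is enough to shift both figures noticeably.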