
Fine-Grained Evaluation of Type Recovery Systems


Supervisor(s): Fabian Kilger
Status: finished
Topic: Others
Author: Simon Hanssen
Submission: 2025-11-21
Type of Thesis: Master's Thesis

Description

Decompilers are important tools in reverse engineering and security applications. One of a decompiler's key tasks is recovering function parameters, variables, and their respective types. Because this task is crucial yet one that decompilers still struggle with, many tools aiming to improve type recovery have been developed and presented in past literature. However, these tools cannot be clearly compared with respect to their type-inference accuracy: published evaluations are hard to reproduce and compare, and tool implementations are often inaccessible.
In this work, we investigate and systematise past publications and the evaluations they conducted. To address the lack of reproducibility, we present a framework designed to enable general evaluation and unified comparison of type recovery tools. It works by converting the output of each type recovery tool into a unified format and performing all evaluation on that format alone. Additionally, we propose new metrics for primitive and structural type evaluation, respectively, fixing suspected flaws in previously used metrics.
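The idea of a unified format with metrics computed on top of it can be illustrated with a minimal sketch. The representation and the partial-credit scoring below are illustrative assumptions, not the thesis's actual format or metric definitions:

```python
# Hypothetical sketch: normalize each tool's recovered primitive type
# into a common tuple (kind, size_in_bytes, signed), then score
# predictions against ground truth on that representation only.
# The scoring scheme is an illustrative assumption.

def normalize(kind, size, signed=True):
    """Map a tool-specific type name to the unified representation."""
    return (kind, size, signed)

def primitive_score(predicted, ground_truth):
    """Partial-credit comparison: 1.0 for an exact match, 0.5 if only
    the kind matches (e.g. an integer of the wrong width), else 0.0."""
    if predicted == ground_truth:
        return 1.0
    if predicted[0] == ground_truth[0]:
        return 0.5
    return 0.0

# Example: a tool recovers a 4-byte int where the source had an
# 8-byte integer, so only the kind matches.
pred = normalize("int", 4)
truth = normalize("int", 8)
print(primitive_score(pred, truth))  # kind matches, width differs -> 0.5
```

Because every tool's output is mapped into the same representation first, any metric defined on that representation applies to all tools uniformly, which is what makes evaluations comparable.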
To work towards a standard benchmark, we then evaluate these metrics by measuring how well each captures a single aspect of type recovery. We also examine the benchmark suite most commonly used in past work, the GNU coreutils, for its representativeness. Our results show that the coreutils introduce significant biases into results regardless of the evaluation metric used, especially when paired with popular tools such as Ghidra or IDA as references. Our evaluation also revealed methodological issues in past evaluations that include Ghidra and IDA, for which we suggest and implement a mitigation. Overall, we find it urgently necessary to agree on a larger and more diverse set of benchmark programs: establishing a representative benchmark will enable more expressive and more comparable results in future evaluations.