Description
Decompilers are important tools in reverse engineering and security applications. One of a decompiler's key tasks is recovering function parameters, variables, and their respective types. Since this is a crucial task that decompilers still struggle with, many tools aiming to improve type recovery have been developed and presented in past literature. However, it is currently impossible to clearly compare the accuracy of these tools at type inference: past evaluations are hard to reproduce and compare, and tool implementations are often inaccessible. In this work, we investigate and systematise past publications and the evaluations they conducted. To address the lack of reproducibility, we present a framework designed to allow a general evaluation and unified comparison of type recovery tools. We do this by converting the output produced by type recovery tools into a unified format and performing all evaluation on this format only. Additionally, we propose new metrics for primitive and structural type evaluation respectively, fixing suspected issues in previously used metrics. To work towards a standard benchmark, we then evaluate these metrics by measuring how well they capture individual aspects of type recovery. We also examine the representativeness of the most common benchmark suite used in past work, the GNU coreutils. Our results show that the coreutils introduce significant biases in results regardless of the evaluation metric used, especially when paired with popular tools like Ghidra or IDA as references in evaluations. Our evaluation also revealed methodological issues in past evaluations that include Ghidra and IDA, for which we suggest and implement a mitigation. Overall, we find that it is urgently necessary to agree on a larger and more diverse set of benchmark programs for expressive results in future evaluations, as establishing a representative benchmark will allow for better and more comparable results.