LLVM-Based Generic Deobfuscation to Enhance Analysis of Machine Code

As soon as new malware samples are encountered, they need to be analyzed and understood thoroughly in order to deploy efficient countermeasures that prevent their further spread and undo the inflicted damage. Therefore, malware authors often use various obfuscation techniques to delay or prevent this behavioral analysis.

This work gives an overview of many widespread obfuscation techniques and how they complement each other. It then presents an approach to automatic deobfuscation using dynamically extracted execution traces. The central component of the presented architecture is LLVM, with its appropriate level of abstraction and its comprehensive infrastructure for program analysis.

It is shown which challenges stemming from obfuscation can already be handled by existing LLVM optimization passes and how others can be addressed.

It is suggested to use a concolic execution engine to generate sufficiently divergent program inputs. A stealth debugger then records all executions resulting from this set of inputs. From those traces, a meaningful control flow graph is recovered using a novel algorithm. After all instructions have been lifted into LLVM intermediate representation, the entire program can be handed over to the augmented LLVM infrastructure to perform the actual program simplifications.

Finally, an evaluation of this approach against modern obfuscation suites shows high recovery rates for very different kinds of obfuscation, even when combined and stacked on top of each other.
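The step from recorded traces to a control flow graph can be illustrated with a minimal sketch. This is not the novel algorithm of the thesis, which is not described here; it is only the textbook idea of merging several traces into one graph and collapsing straight-line runs into basic blocks. It assumes each trace is a list of executed instruction addresses, and the function names `build_cfg` and `basic_blocks` are hypothetical.

```python
from collections import defaultdict


def build_cfg(traces):
    """Merge address traces into successor/predecessor sets (illustrative only)."""
    succs = defaultdict(set)
    preds = defaultdict(set)
    entries = set()
    for trace in traces:
        if not trace:
            continue
        entries.add(trace[0])
        succs[trace[0]]  # make sure the entry address exists as a node
        for src, dst in zip(trace, trace[1:]):
            succs[src].add(dst)
            preds[dst].add(src)
            succs[dst]  # make sure the target exists as a node
    return succs, preds, entries


def basic_blocks(traces):
    """Collapse linear chains of the merged graph into basic blocks.

    Returns (blocks, edges): blocks maps a leader address to the list of
    addresses in its block, edges maps a leader to its successor leaders.
    """
    succs, preds, entries = build_cfg(traces)

    def is_leader(addr):
        # A block starts at a trace entry, at a join point, or after a branch.
        if addr in entries:
            return True
        incoming = preds[addr]
        if len(incoming) != 1:
            return True
        (only_pred,) = incoming
        return len(succs[only_pred]) != 1

    blocks = {}
    for addr in sorted(succs):
        if not is_leader(addr):
            continue
        block = [addr]
        cur = addr
        # Extend the block while control flow stays strictly linear.
        while len(succs[cur]) == 1:
            (nxt,) = succs[cur]
            if is_leader(nxt):
                break
            block.append(nxt)
            cur = nxt
        blocks[addr] = block

    edges = {leader: sorted(succs[block[-1]]) for leader, block in blocks.items()}
    return blocks, edges


# Two hypothetical traces of the same program that diverge at address 0x12.
trace_a = [0x10, 0x11, 0x12, 0x13, 0x16, 0x17]
trace_b = [0x10, 0x11, 0x12, 0x14, 0x15, 0x16, 0x17]
blocks, edges = basic_blocks([trace_a, trace_b])
```

With only one of the two traces, the branch at `0x12` would be invisible and the whole program would look like a single block, which is why sufficiently divergent inputs matter for this kind of recovery.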

Supervisor(s): Julian Kirsch, Bruno Bierbaumer
Status: finished
Topic: Anomaly Detection
Author: Markus Blöchl
Submission: 2017-03-15
Type of Thesis: Master's thesis
Proof of Concept: No
