TUM Logo

Anomaly Detection with Graph Structure

Nowadays, the Control Flow Graph (CFG) is widely utilized in the areas of static code analysis of software applications, as it is able to correctly express the flow inside of a program unit. Further, it is considered to be an effective technique to mitigate software vulnerabilities, particularly for code reuse attacks. Yet, there is an open question that can arise: How can we leverage CFG, or graph structure in general to detect malware? What are the pros and cons of this methodology? And How about the robustness of the graph-based anomaly detection system under the influence of the adversarial samples?
In these research topics, we introduce malware detection systems using graphs data on DEX files and native code levels for both Android and Desktop. To this end, we use Natural Language Processing (NLP) concepts, particularly, embedding techniques to transform graphs into numerical vectors to feed our classifiers. In a nutshell, our research direction is associated with machine learning as well as natural language processing.

Researcher: Peng Xu

  • Detecting and Categorizing Android Malware with Graph Neural Network(Under submission S&P2021)
    In this project, we present a new NLP-inspired Android malware detection and categorization technique based on Function Call Graph Embedding. Using Natural Language Processing technology, we treat the opcodes of Dalvik instruction as words, functions as sentences or paragraphs, and the whole application as a document. Our goal is to detect and identify Android malware families. Yet, there exist two significant differences in the malware detection field compared to the typical natural language processing techniques. First, in malware detection, we have multiple functions’ jumps, while this is not the case when dealing with typical NLP applications. Second, inside the code of each Android apps, there are multiple branches that are connected by the functions, which are unsolved by the typical NLP method. In our work, we design a two-layer, intra-function(function embedding) and inter-function graph neural network (graph embedding) based-approaches to convert the whole graph structure of an Android app to a vector. Consequently, we utilize the graphs’ vectors to detect and categorize malware families. Our results reveal that the graph embedding technique yields a better result: we get 99.6% accuracy on average for malware detection and 98.7% accuracy for malware categorization. Our work is the first in adopting the graph embedding technique for malware detection and categorization.
  • Multi-Platform Malware Detection with Graph Neural Networks (Resubmission)
    In this paper, we design and implement a control flow graph based multi-platform malware detection system to tackle the problems mentioned above. In more detail, we utilize a graph neural network method to convert the control flow graphs of executables to vectors and then use a machine learning-based classifier to create a malware detection system. We evaluate our framework by testing real samples on multi-platforms, including Linux (x86, x64, and ARM-32) and Windows (x86 and x64). Our results outperform most of the existing works with accuracy 96.8% on Linux and 93.9% on Windows. To the best of our knowledge, our work is the first to consider graph neural networks in the malware detection field.
  • Function Embedding for Binary Similarity (Resubmission)
    In this paper, we present three types of function embedding methods: (i) simple function embedding, (ii) ecall function embedding, and (iii) graph-based function embedding. These methods convert binaries’ functions to vectors in order to measure the similarities in the binary executable level. We consider the feature of disassembled binaries code, leverage the natural language processing inspired methods to convert opcodes, instructions as well as control flow graph of functions to vectors and get opcode2vec, instruction2vec, and function2vec representations. After getting function vectors, we utilize them to additional tasks such as code similarity, vulnerability searching, and semantic classification. We evaluate our function embedding framework to code similarity, function searching, vulnerability searching, semantic classifying and malware detection tasks. In contrast to existing works, our results outperform nearly all of them.
  • Layered Android Malware Detection Using Program Dependence Graph Embedding and Manifest Features(Under submission DIMVA2020)
    present a multi-layer approach that utilizes machine learning, natural language processing (NLP), as well as graph embedding techniques to handle the threats of Android malware. To be specific, the first layer of our detection approach acts on the application’s properties declared in the Manifest file, whereas the second layer operates on the application code’s structural relationships. Large-scale experiments on 30,113 malware samples show that the context-based approach yields an accuracy of 91%, which is nearly comparable to state-of-the-art techniques, while the structure-based method attains an accuracy of 99% which outperforms various related works. Further, for optimum Android malware detection, we introduce a hook-based anti-malware application that utilizes the complementary strengths of our multi-layer approach to scan applications before installation.
  • MANIS: Evading Malware Detection System on Graph Structure(SAC 2020, 24.48%) slides
    Adversarial machine learning has attracted attention because it makes classifiers vulnerable to attacks. Meanwhile, machine learning on graph-structured data makes great achievements in many fields like social networks, recommendation systems, molecular structure prediction, and malware detection. Unfortunately, although the malware graph structure enables effective detection of malicious code and activity, it is still vulnerable to adversarial data manipulation. However, adversarial example crafting for machine learning systems that utilize the graph structure, especially taking the entire graph as an input, is very little noticed. In this paper, we advance the field of adversarial machine learning by designing an approach to evade machine learning-based classification systems, which takes the whole graph structure as input through adversarial example crafting. We derive such an attack and demonstrate it by constructing MANIS, a system that can evade graph-based malware detection with two attacking approaches: the n-strongest nodes and the gradient sign method. We evaluate our adversarial crafting techniques utilizing the Drebin malicious dataset. Under the white-box attack, we get a 72.2% misclassification rate only by injecting 22.7% nodes with the n-strongest node. For the gradient sign method, we obtain a 33.4% misclassification rate with 36.34% node injection. Under the gray-box attack, the performance of our adversarial examples is evenly significant, although attackers may not have the complete knowledge of the classifiers’ mechanisms.