Anomaly Detection with Graph Structure

The Control Flow Graph (CFG) is widely used in static code analysis of software applications because it accurately expresses the flow of control inside a program unit. It is also considered an effective basis for mitigating software vulnerabilities, particularly code reuse attacks. Yet open questions remain: How can we leverage the CFG, or graph structure in general, to detect malware? What are the pros and cons of this methodology? And how robust is a graph-based anomaly detection system against adversarial samples?
In these research topics, we introduce malware detection systems that use graph data at the DEX file and native code levels, for both Android and desktop platforms. To this end, we use Natural Language Processing (NLP) concepts, in particular embedding techniques, to transform graphs into numerical vectors that we feed to our classifiers. In a nutshell, our research direction combines machine learning and natural language processing.
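To make the idea of turning a graph into a numerical vector concrete, here is a minimal sketch in plain Python. It is not the implementation from any of the papers listed below. It assumes a tiny hypothetical opcode vocabulary (`VOCAB`), uses bag-of-opcodes frequencies as a stand-in for learned embeddings, and replaces a trained graph neural network with a single round of neighbor averaging over the call graph. The function and variable names are illustrative only.

```python
from collections import Counter

# Hypothetical opcode vocabulary; a real system would learn embeddings instead.
VOCAB = ["invoke", "move", "const", "if", "return"]


def function_vector(opcodes):
    """Bag-of-opcodes frequencies over VOCAB (stand-in for a learned function embedding)."""
    counts = Counter(opcodes)
    total = sum(counts[op] for op in VOCAB) or 1
    return [counts[op] / total for op in VOCAB]


def propagate(vectors, edges):
    """One round of neighbor averaging over call-graph edges (treated as undirected)."""
    neighbors = {name: [] for name in vectors}
    for caller, callee in edges:
        neighbors[caller].append(callee)
        neighbors[callee].append(caller)
    updated = {}
    for name, vec in vectors.items():
        nbrs = neighbors[name]
        if not nbrs:
            # Isolated function: keep its own vector unchanged.
            updated[name] = list(vec)
            continue
        mean = [sum(vectors[n][i] for n in nbrs) / len(nbrs) for i in range(len(vec))]
        # Mix the function's own vector with the mean of its neighbors.
        updated[name] = [(a + b) / 2 for a, b in zip(vec, mean)]
    return updated


def graph_vector(functions, edges):
    """Embed each function, propagate once, then mean-pool into one graph vector."""
    vectors = {name: function_vector(ops) for name, ops in functions.items()}
    vectors = propagate(vectors, edges)
    return [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(len(VOCAB))]


# Toy application with two functions and one call edge.
functions = {
    "main": ["const", "invoke", "return"],
    "helper": ["move", "if", "return"],
}
edges = [("main", "helper")]
vec = graph_vector(functions, edges)
```

The resulting `vec` has one entry per vocabulary item and could be passed to any off-the-shelf classifier. The two stages (per-function vector, then aggregation over the graph) loosely mirror the function-level and graph-level split described in the projects below.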

Researcher: Peng Xu

  • Detecting and Categorizing Android Malware with Graph Neural Network (SAC2021)    paper  slides video
    In this project, we present a new NLP-inspired Android malware detection and categorization technique based on function call graph embedding. Borrowing from Natural Language Processing, we treat the opcodes of Dalvik instructions as words, functions as sentences or paragraphs, and the whole application as a document. Our goal is to detect Android malware and identify malware families. Yet malware detection differs from typical natural language processing in two significant ways. First, program code contains jumps between multiple functions, which is not the case in typical NLP applications. Second, the code of each Android app contains multiple branches connected through functions, which typical NLP methods do not handle. In our work, we design a two-layer graph neural network approach, consisting of an intra-function layer (function embedding) and an inter-function layer (graph embedding), to convert the whole graph structure of an Android app to a vector. We then use the graphs’ vectors to detect malware and categorize malware families. Our results show that the graph embedding technique is effective: we obtain 99.6% accuracy on average for malware detection and 98.7% accuracy for malware categorization. Our work is the first to adopt the graph embedding technique for malware detection and categorization.
  • Multi-Platform Malware Detection with Graph Neural Networks (ICANN21) paper slides video
    In this paper, we design and implement a control flow graph-based multi-platform malware detection system. In more detail, we use a graph neural network method to convert the control flow graphs of executables to vectors and then use a machine learning-based classifier to build a malware detection system. We evaluate our framework on real samples from multiple platforms, including Linux (x86, x64, and ARM-32) and Windows (x86 and x64). Our results outperform most existing works, with an accuracy of 96.8% on Linux and 93.9% on Windows. To the best of our knowledge, our work is the first to consider graph neural networks in the malware detection field.
  • Malware Detection and Categorization with Network Traffic Images (ICANN21) paper slides video
    Android is the most dominant operating system in the mobile ecosystem, and it has become one of the favorite platforms for adversaries to find new victims through malicious apps. Traditional anti-malware techniques have become cumbersome, creating the need for an efficient way to detect Android malware. In this paper, we present Falcon, an Android malware detection and categorization framework. We treat the network traffic classification task as a 2D image sequence classification problem and handle each network packet as a 2D image. Furthermore, we use a bidirectional LSTM network to process those 2D images and obtain vectors representing the network traffic. We then use those vectors to detect and categorize the malware. Our results reveal that Falcon yields better results than other systems, as we get 97.16% accuracy on average for malware detection and 88.32% accuracy for malware categorization.

  • Hybroid: Toward Android Malware Detection and Categorization with Program Code and Network Traffic (ISC2021)  paper slides

    Malicious Android applications have become so sophisticated that they can bypass endpoint protection measures. Traditional anti-malware techniques have become cumbersome, raising the need for efficient ways to detect Android malware. In this paper, we present Hybroid, a hybrid Android malware detection and categorization solution that uses program code structures as static behavioral features and network traffic as dynamic behavioral features for detection (binary classification) and categorization (multi-label classification). For static analysis, we introduce a natural language processing-inspired technique based on function call graph embeddings and design a graph neural network-based approach to convert the whole graph structure of an Android app to a vector. For dynamic analysis, we extract network flow features from the raw network traffic by capturing each application's network flow. Finally, Hybroid combines the network flow features with the graphs' vectors to detect and categorize the malware. Our solution demonstrates 97.0% accuracy on average for malware detection and 94.0% accuracy for malware categorization. We also report strong results on other performance metrics such as F1-score, precision, recall, and AUC.

  • Hybrid Pattern Malware Detection and Categorization with Network Traffic and Program Code (Resubmission)
    Android is the most dominant operating system in the mobile ecosystem, with billions of people using its apps daily. As expected, this trend did not go unnoticed by miscreants, and Android became the favorite platform for finding new victims through malicious apps. Moreover, these apps have become so sophisticated that they can bypass the anti-malware measures meant to protect users. Traditional anti-malware techniques have become cumbersome, creating the need for an efficient way to detect Android malware. This paper presents hybrid-Falcon, a hybrid pattern Android malware detection and categorization framework. It combines dynamic and static features of Android malware, taken from network traffic and code graph structure respectively. In hybrid-Falcon, we treat network traffic as a dynamic feature and process it as a 2D-image sequence. Meanwhile, hybrid-Falcon handles each network flow in the packet as a 2D image and uses a bidirectional LSTM network to process those 2D-image sequences to obtain vectors representing network packets. We use the program code graph as a static feature and introduce natural language processing (NLP) inspired techniques on the function call graph (FCG). We design a graph neural network-based approach to convert the whole graph structure of Android apps to vectors. Finally, we concatenate the resulting vectors, covering both network and program code features, to detect and categorize the malware. Our results reveal that hybrid-Falcon yields better results, as we get 97.16% accuracy on average for malware detection and 88.32% accuracy for malware categorization. Additionally, we release a dataset, AndroNetMnist, which converts network traffic to 2D-image sequences and supports malware detection on such sequences.
  • Function Embedding for Binary Similarity (Resubmission)
    In this paper, we present three types of function embedding methods: (i) simple function embedding, (ii) ecall function embedding, and (iii) graph-based function embedding. These methods convert the functions of binaries to vectors in order to measure similarity at the binary executable level. We consider the features of disassembled binary code and leverage natural language processing-inspired methods to convert opcodes, instructions, and the control flow graphs of functions to vectors, obtaining opcode2vec, instruction2vec, and function2vec representations. After obtaining function vectors, we apply them to downstream tasks such as code similarity, vulnerability searching, and semantic classification. We evaluate our function embedding framework on code similarity, function searching, vulnerability searching, semantic classification, and malware detection tasks. Our results outperform nearly all existing works.
  • Layered Android Malware Detection Using Program Dependence Graph Embedding and Manifest Features (Resubmission)
    This paper presents a multi-layer approach that uses machine learning, natural language processing (NLP), and graph embedding techniques to handle the threat of Android malware. Specifically, the first layer of our detection approach acts on the application’s properties declared in the Manifest file, whereas the second layer operates on the structural relationships in the application code. Large-scale experiments on 30,113 malware samples show that the context-based approach yields an accuracy of 91%, which is nearly comparable to state-of-the-art techniques, while the structure-based method attains an accuracy of 99%, which outperforms various related works. Further, for optimum Android malware detection, we introduce a hook-based anti-malware application that uses the complementary strengths of our multi-layer approach to scan applications before installation.
  • MANIS: Evading Malware Detection System on Graph Structure (SAC 2020, 24.48%) slides paper video
    Adversarial machine learning has attracted attention because it shows that classifiers are vulnerable to attacks. Meanwhile, machine learning on graph-structured data has made great achievements in many fields such as social networks, recommendation systems, molecular structure prediction, and malware detection. Unfortunately, although the malware graph structure enables effective detection of malicious code and activity, it is still vulnerable to adversarial data manipulation. However, adversarial example crafting for machine learning systems that use the graph structure, especially those taking the entire graph as input, has received little attention. In this paper, we advance the field of adversarial machine learning by designing an approach that uses adversarial example crafting to evade machine learning-based classification systems that take the whole graph structure as input. We derive such an attack and demonstrate it by constructing MANIS, a system that can evade graph-based malware detection with two attack approaches: the n-strongest nodes method and the gradient sign method. We evaluate our adversarial crafting techniques on the Drebin malware dataset. Under the white-box attack, we get a 72.2% misclassification rate by injecting only 22.7% nodes with the n-strongest nodes method. For the gradient sign method, we obtain a 33.4% misclassification rate with 36.34% node injection. Under the gray-box attack, the performance of our adversarial examples is equally significant, although attackers may not have complete knowledge of the classifiers’ mechanisms.
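
The gradient sign idea mentioned for MANIS can be illustrated with a small sketch in plain Python. This is not the MANIS implementation. It assumes a hypothetical logistic-regression detector over a fixed-length graph feature vector (for example, node counts per category), with made-up weights, and it only allows features to increase, as a rough stand-in for the constraint that an attacker injects nodes rather than removing existing code.

```python
import math


def malware_score(weights, bias, features):
    """Probability of 'malware' under a hypothetical logistic-regression detector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def gradient_sign_injection(weights, features, epsilon):
    """Step against the gradient sign of the malware score, keeping only additive changes.

    For a logistic model, the gradient of the score with respect to feature i
    has the same sign as weights[i], so no automatic differentiation is needed.
    """
    perturbed = []
    for w, x in zip(weights, features):
        step = -epsilon * sign(w)             # direction that lowers the malware score
        perturbed.append(x + max(step, 0.0))  # injection can only add, never remove
    return perturbed


# Illustrative values only; none of these numbers come from the paper.
weights = [2.0, -1.5, 0.5]
bias = -0.5
features = [1.0, 0.0, 1.0]

before = malware_score(weights, bias, features)
adversarial = gradient_sign_injection(weights, features, epsilon=1.0)
after = malware_score(weights, bias, adversarial)
```

In this toy setting only the second feature changes, because it is the only one whose increase lowers the score. A real attack on a graph classifier would additionally have to map the perturbed vector back to concrete injected nodes that keep the program functional.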