TUM Logo

Incremental Construction of Code Property Graphs

Incremental Construction of Code Property Graphs

Supervisor(s): Konrad Weiss
Status: finished
Topic: Others
Author: Samuel Hopstock
Submission: 2021-09-15
Type of Thesis: Masterthesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Graph-based code analysis systems are a versatile tools for reasoning
about the correctness of complex software projects. One area in which
they are widely used is in source code auditing: Security
vulnerabilities, for example using cryptographic functions with insecure
algorithms, can be introduced by coding patterns that spread over the
boundaries of several methods, classes or even files in the project.
This is where graph-based analysis makes finding these vulnerabilities
easier, by creating a framework where the source code can be represented
as a graph and vulnerable patterns can be found by performing efficient
graph-based pattern matching queries.

There are several graph representations for source code, but this thesis
is based on the Code Property Graph (CPG) approach, as it conveniently
combines a program's Abstract Syntax Tree (AST), Control Flow Graph
(CFG) and Data Flow Graph (DFG) in a single data structure. One drawback
of this representation is that its generation is a complex task that can
be very time consuming for larger projects. Considering that security
analyses are desirably carried out repeatedly, to find out if recent
changes introduced new vulnerabilities, long graph generation times may
slow down the feedback cycle.

This is why this thesis presents an improved CPG generation process that
can take advantage of the fact that most of the changes to a project
will not affect the whole program, but will rather have a limited scope:
Portions of the CPG generated for the previous code version can be
reused, if the corresponding code did not change. The actual file
changes are then modeled as incremental updates to the graph, which
avoids having to re-generate the CPG for all files in case only a few
have been modified.

This technique has been evaluated by analyzing the source code of the
CPG implementation that serves as the basis for this thesis, an actively
maintained project containing over 200 Java files: While the previous
implementation took 60 minutes to sequentially parse 50 commits in this
repository, the new incremental approach showed a performance
improvement of nearly 80%, taking only 13 minutes for the same commit range.