On the Security of Code produced by Code Generation Models

Supervisor(s): Ludwig Peuckert
Status: finished
Topic: Others
Author: Mariam Labib
Submission: 2024-05-15
Type of Thesis: Master's thesis

Description

Within the evolving field of software development, Large Language Models (LLMs) have
had a major influence on applications in sectors such as automotive and information
technology. Although they have greatly streamlined coding processes, code generators
built on LLMs have also introduced a marked rise in security vulnerabilities. This
work aims to clarify the security problems that arise from depending on code generated by
these models.
This research conducted a thorough investigation of code produced by multiple code
generation models in response to a range of prompts, passing the outputs through a
pipeline of static analysis and security scanners. This process allows for a nuanced
understanding of code generators and the potential security threats they can introduce. At
the core of this inquiry were two key tools: PolyCoder, a well-known open-source model in
this field, and GitHub Copilot, which is built around OpenAI's Codex, a member of the
Generative Pre-trained Transformer (GPT) family. As a distinct contribution, this thesis
retrains the PolyCoder model on secure code snippets, with the objective of improving the
security of the code it generates. A comparative analysis of the original PolyCoder model
and its retrained iteration, trained on sanitised input data, verified
the efficacy of this strategy in enhancing code security. The findings highlight the potential to
mitigate security risks through deliberate training methods.
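The pipeline described above (generate code from prompts, scan the outputs, and keep only findings-free snippets as retraining input) can be sketched as follows. The pattern-based rules, rule names, and sample snippets here are illustrative assumptions standing in for the thesis's actual scanners and model outputs, not its real tooling:

```python
import re

# Toy "security scanner": each rule flags a pattern that real static
# analysis tools commonly report (dynamic evaluation, shell command
# injection, hard-coded credentials). Illustrative rules only.
SCANNER_RULES = {
    "use-of-eval": re.compile(r"\beval\s*\("),
    "shell-injection": re.compile(r"\bos\.system\s*\("),
    "hardcoded-password": re.compile(r"password\s*=\s*['\"]"),
}


def scan(snippet: str) -> list[str]:
    """Return the names of all rules a generated snippet violates."""
    return [name for name, pattern in SCANNER_RULES.items()
            if pattern.search(snippet)]


def build_sanitised_corpus(generated: list[str]) -> list[str]:
    """Keep only snippets with no findings, mimicking the step of
    selecting secure code snippets as retraining input."""
    return [s for s in generated if not scan(s)]


# Hypothetical model outputs standing in for PolyCoder/Copilot generations.
generated = [
    "result = eval(user_input)",           # flagged: use-of-eval
    "import os\nos.system('rm ' + path)",  # flagged: shell-injection
    "total = sum(values)",                 # clean
]

secure = build_sanitised_corpus(generated)
```

In a real pipeline the regex rules would be replaced by the scanners' actual reports, but the filtering step that turns raw generations into a sanitised retraining corpus has the same shape.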
This study has several implications, chief among them the need to integrate security
measures at every stage of code generator development. By methodically identifying
common security vulnerabilities in generated code and assessing the effect of model
retraining as a mitigation, we offer practical insights to developers and security professionals.
As the use of code generators in software development grows, it becomes imperative
to strike a balance between their utility and the requirement for secure coding practices.
We call for a progressive philosophy that treats efficiency and security as mutually
reinforcing, in order to promote responsible growth in software development.