On the Security of Code produced by Code Generation Models

Supervisor(s): Ludwig Peuckert
Status: finished
Topic: Others
Author: Mariam Labib
Submission: 2024-05-15
Type of Thesis: Master's thesis

Description

Within the evolving field of software development, Large Language Models (LLMs) have
had a major influence on applications in sectors such as automotive and information
technology. Although they have greatly streamlined coding processes, code generators
built on LLMs have also introduced a marked rise in security vulnerabilities. This
work aims to clarify the security problems that arise from depending on code generated by
these models.
This research conducted a thorough investigation of code produced by multiple code
generation models in response to a range of prompts, passing the outputs through a
pipeline of static analysis and security scanners. This process allows for a nuanced
understanding of code generators and the potential security threats they can introduce. At
the core of this inquiry were two key tools: PolyCoder, a well-known open-source model in
this field, and GitHub Copilot, which is built around OpenAI's Codex, a member of the
Generative Pre-trained Transformer (GPT) family. As a distinct contribution, this thesis
retrains the PolyCoder model on secure code snippets, with the objective of improving the
security of the code it generates. A comparative analysis of the original PolyCoder model
and its retrained iteration, trained on sanitised input data, verified
the efficacy of this strategy in enhancing code security. The findings highlight the potential to
mitigate security risks through deliberate training methods.
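The pipeline described above (generate code from prompts, scan the outputs, and keep only findings-free snippets as retraining input) can be sketched as follows. The pattern-based rules, rule names, and sample snippets here are illustrative assumptions standing in for the thesis's actual scanners and model outputs, not its real tooling:

```python
import re

# Toy "security scanner": each rule flags a pattern that real static
# analysis tools commonly report (dynamic evaluation, shell command
# injection, hard-coded credentials). Illustrative rules only.
SCANNER_RULES = {
    "use-of-eval": re.compile(r"\beval\s*\("),
    "shell-injection": re.compile(r"\bos\.system\s*\("),
    "hardcoded-password": re.compile(r"password\s*=\s*['\"]"),
}


def scan(snippet: str) -> list[str]:
    """Return the names of all rules a generated snippet violates."""
    return [name for name, pattern in SCANNER_RULES.items()
            if pattern.search(snippet)]


def build_sanitised_corpus(generated: list[str]) -> list[str]:
    """Keep only snippets with no findings, mimicking the step of
    selecting secure code snippets as retraining input."""
    return [s for s in generated if not scan(s)]


# Hypothetical model outputs standing in for PolyCoder/Copilot generations.
generated = [
    "result = eval(user_input)",           # flagged: use-of-eval
    "import os\nos.system('rm ' + path)",  # flagged: shell-injection
    "total = sum(values)",                 # clean
]

secure = build_sanitised_corpus(generated)
```

In a real pipeline the regex rules would be replaced by the scanners' actual reports, but the filtering step that turns raw generations into a sanitised retraining corpus has the same shape.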
This study has several implications, chief among them the need to integrate security
measures at every stage of code generator development. By methodically identifying
common security vulnerabilities in generated code and assessing the effect of model
retraining as a mitigation, we offer practical insights to developers and security professionals.
As the use of code generators in software development grows, it becomes imperative
to strike a balance between their utility and the requirement for secure coding practices.
We call for a progressive philosophy that treats efficiency and security as mutually
reinforcing, in order to promote responsible growth in software development.