Publication:

Scaling Code Repair with LLM-Generated Synthetic Bugs


Files

ls3841_written_final_report-2.pdf (335.28 KB)

Date

2025


Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation and repair, yet their performance is often constrained by the availability of high-quality training data, which is expensive to curate. We present a scalable pipeline that generates synthetic data to augment code repair datasets by using an LLM to perturb instances of correct code. Similar existing methods produce unrealistic examples and rely on imprecise, human-defined bug taxonomies. Our method instead learns a well-defined taxonomy of bugs from existing data and perturbs correct code snippets by retrieving the most relevant bug categories. We use few-shot learning with code diffs to demonstrably improve the realism and complexity of the inserted bugs. When used to train a downstream Llama-3.1-8B-Instruct model for code repair, the synthetic buggy-to-repaired code examples yield performance that far exceeds the previously published baseline and nearly matches that of a model trained on an equally sized human-curated dataset. Our method is also cost-effective: a million-example synthetic dataset can be created for under $1,000.
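The pipeline the abstract describes (retrieve the most relevant bug categories for a snippet, then prompt an LLM with few-shot diff examples to insert a bug) can be sketched roughly as below. This is an illustrative assumption, not the thesis code: the taxonomy entries, keyword-overlap retrieval (a real pipeline would likely use embedding similarity), and prompt wording are all hypothetical stand-ins, and the final LLM call is left as a stub.

```python
from difflib import unified_diff

# Hypothetical bug taxonomy; the learned taxonomy from the thesis is not
# reproduced here, so these categories and keywords are illustrative only.
BUG_TAXONOMY = {
    "off-by-one": {"keywords": {"range", "len", "index", "loop"}},
    "wrong-operator": {"keywords": {"if", "while", "==", "comparison"}},
    "missing-return": {"keywords": {"def", "return", "function"}},
}

def retrieve_categories(code: str, k: int = 2) -> list[str]:
    """Rank bug categories by naive token overlap with the snippet.
    Stand-in for the embedding-based retrieval the method implies."""
    tokens = set(code.replace("(", " ").replace(")", " ").split())
    scored = sorted(
        BUG_TAXONOMY,
        key=lambda c: len(BUG_TAXONOMY[c]["keywords"] & tokens),
        reverse=True,
    )
    return scored[:k]

def format_diff(correct: str, buggy: str) -> str:
    """Render a correct->buggy pair as a unified diff for few-shot prompting."""
    return "\n".join(unified_diff(
        correct.splitlines(), buggy.splitlines(),
        fromfile="correct", tofile="buggy", lineterm=""))

def build_prompt(code: str, categories: list[str],
                 examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt: retrieved categories plus diff examples.
    The resulting string would be sent to an LLM (call not shown)."""
    shots = "\n\n".join(format_diff(c, b) for c, b in examples)
    return (f"Insert a realistic bug from one of these categories: "
            f"{', '.join(categories)}.\n\nExamples:\n{shots}\n\nCode:\n{code}")
```

For a loop-heavy snippet, `retrieve_categories` ranks "off-by-one" first, and `build_prompt` embeds the retrieved categories and diff-formatted exemplars into a single perturbation request.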
