Publication:

Scaling Code Repair with LLM-Generated Synthetic Bugs


Files

ls3841_written_final_report-2.pdf (335.28 KB)

Date

2025


Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation and repair, yet their performance is often constrained by the availability of high-quality training data, which is expensive to curate. We present a scalable pipeline that generates synthetic data to augment code repair datasets by using an LLM to perturb instances of correct code. Similar existing methods produce unrealistic examples and rely on imprecise, human-defined bug taxonomies. Our method instead learns a well-defined taxonomy of bugs from existing data and perturbs correct code snippets by retrieving the most relevant bug categories. We use few-shot learning with code diffs to demonstrably improve the realism and complexity of the inserted bugs. When used to train a downstream Llama-3.1-8B-Instruct model for code repair, the synthetic buggy-to-repaired code examples yield performance that far exceeds the previously published baseline and nearly matches that of a model trained on an equally sized human-curated dataset. Our method is also cost-effective: a million-example synthetic dataset can be created for under $1,000.
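The pipeline the abstract describes (retrieve the most relevant bug categories for a snippet, then prompt an LLM with few-shot diff examples to insert a bug) can be sketched roughly as below. This is an illustrative assumption, not the thesis code: the taxonomy entries, keyword-overlap retrieval (a real pipeline would likely use embedding similarity), and prompt wording are all hypothetical stand-ins, and the final LLM call is left as a stub.

```python
from difflib import unified_diff

# Hypothetical bug taxonomy; the learned taxonomy from the thesis is not
# reproduced here, so these categories and keywords are illustrative only.
BUG_TAXONOMY = {
    "off-by-one": {"keywords": {"range", "len", "index", "loop"}},
    "wrong-operator": {"keywords": {"if", "while", "==", "comparison"}},
    "missing-return": {"keywords": {"def", "return", "function"}},
}

def retrieve_categories(code: str, k: int = 2) -> list[str]:
    """Rank bug categories by naive token overlap with the snippet.
    Stand-in for the embedding-based retrieval the method implies."""
    tokens = set(code.replace("(", " ").replace(")", " ").split())
    scored = sorted(
        BUG_TAXONOMY,
        key=lambda c: len(BUG_TAXONOMY[c]["keywords"] & tokens),
        reverse=True,
    )
    return scored[:k]

def format_diff(correct: str, buggy: str) -> str:
    """Render a correct->buggy pair as a unified diff for few-shot prompting."""
    return "\n".join(unified_diff(
        correct.splitlines(), buggy.splitlines(),
        fromfile="correct", tofile="buggy", lineterm=""))

def build_prompt(code: str, categories: list[str],
                 examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt: retrieved categories plus diff examples.
    The resulting string would be sent to an LLM (call not shown)."""
    shots = "\n\n".join(format_diff(c, b) for c, b in examples)
    return (f"Insert a realistic bug from one of these categories: "
            f"{', '.join(categories)}.\n\nExamples:\n{shots}\n\nCode:\n{code}")
```

For a loop-heavy snippet, `retrieve_categories` ranks "off-by-one" first, and `build_prompt` embeds the retrieved categories and diff-formatted exemplars into a single perturbation request.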
