Narayanan, Arvind
Stepanewk, Leo
2026-01-05
2026-01-05
2025
https://theses-dissertations.princeton.edu/handle/88435/dsp01765374831

Large language models (LLMs) have demonstrated remarkable capabilities in code generation and repair, yet their performance is often constrained by the availability of high-quality training data, which is expensive to curate. We present a scalable pipeline that generates synthetic data to augment code repair datasets by using an LLM to perturb instances of correct code. Prior methods of this kind produce unrealistic examples and rely on imprecise, human-defined bug taxonomies. Our method learns a well-defined taxonomy of bugs from existing data and creates perturbations in correct code snippets by retrieving the most relevant bug categories. We use few-shot learning with code differentials to demonstrably improve the realism and complexity of the inserted bugs. When used to train a downstream Llama-3.1-8B-Instruct model for code repair, the synthetic buggy-to-repaired code examples yield performance that far exceeds the previously published baseline and nearly matches that of a model trained on an equally sized human-curated dataset. Our method is also cost-effective: a million-example synthetic dataset can be created for under $1,000.

en-US
Scaling Code Repair with LLM-Generated Synthetic Bugs
Princeton University Senior Theses
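The abstract describes a pipeline that retrieves the bug categories most relevant to a correct snippet and then prompts an LLM, few-shot with code diffs, to insert a realistic bug. Below is a minimal sketch of that prompt-construction step under stated assumptions: the category names, the keyword-overlap retrieval, and the prompt wording are all illustrative stand-ins, not the thesis's actual learned taxonomy or retriever.

```python
# Hypothetical sketch of the buggy-code synthesis step: rank bug
# categories for a correct snippet, then build a few-shot prompt whose
# demonstrations are diff-formatted buggy-vs-correct examples. The
# resulting prompt would be sent to an LLM (not shown here).
from dataclasses import dataclass

@dataclass
class BugCategory:
    name: str
    keywords: tuple      # crude relevance signal; the thesis learns this
    example_diff: str    # buggy-vs-correct diff used as a few-shot demo

TAXONOMY = [
    BugCategory("off-by-one", ("range", "len", "index"),
                "- for i in range(len(xs)):\n+ for i in range(len(xs) - 1):"),
    BugCategory("wrong-operator", ("==", "<", ">"),
                "- if a <= b:\n+ if a < b:"),
    BugCategory("mutable-default", ("def", "list", "dict"),
                "- def f(acc=None):\n+ def f(acc=[]):"),
]

def retrieve_categories(snippet: str, k: int = 2):
    """Rank categories by naive keyword overlap (a stand-in for the
    learned retrieval described in the abstract)."""
    scored = [(sum(kw in snippet for kw in c.keywords), c) for c in TAXONOMY]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

def build_perturbation_prompt(snippet: str) -> str:
    """Assemble a few-shot prompt with diff-formatted demonstrations."""
    demos = "\n\n".join(
        f"Bug type: {c.name}\nExample diff:\n{c.example_diff}"
        for c in retrieve_categories(snippet)
    )
    return (
        "Insert one realistic bug of the types shown below into the code.\n\n"
        f"{demos}\n\nCorrect code:\n{snippet}\n\nBuggy code:"
    )

prompt = build_perturbation_prompt(
    "def total(xs):\n"
    "    s = 0\n"
    "    for i in range(len(xs)):\n"
    "        s += xs[i]\n"
    "    return s"
)
```

In this sketch, the loop-heavy example snippet matches the "off-by-one" category most strongly, so its diff appears first among the few-shot demonstrations; the thesis's learned taxonomy and retriever would make this selection far more precise.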