Publication:

Scaling Code Repair with LLM-Generated Synthetic Bugs

datacite.rights: restricted
dc.contributor.advisor: Narayanan, Arvind
dc.contributor.author: Stepanewk, Leo
dc.date.accessioned: 2026-01-05T22:18:14Z
dc.date.available: 2026-01-05T22:18:14Z
dc.date.issued: 2025
dc.description.abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation and repair, yet their performance is often constrained by the availability of high-quality training data, which is expensive to curate. We present a scalable pipeline that generates synthetic data to augment code repair datasets by using an LLM to perturb instances of correct code. Similar existing methods produce unrealistic examples and rely on imprecise, human-defined bug taxonomies. Our method learns a well-defined taxonomy of bugs from existing data and creates perturbations in correct code snippets by retrieving the most relevant bug categories. We use few-shot learning with code differentials to demonstrably improve the realism and complexity of the inserted bugs. When used to train a downstream Llama-3.1-8B-Instruct model for code repair, the synthetic buggy-to-repaired code examples lead to performance that far exceeds the previously published baseline and nearly matches that of a model trained on an equally sized human dataset. Our method is also inexpensive: a million-example synthetic dataset can be created for under $1000.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01765374831
dc.language.iso: en_US
dc.title: Scaling Code Repair with LLM-Generated Synthetic Bugs
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-12-15T15:42:25.295Z
pu.contributor.authorid: 920286857
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name: ls3841_written_final_report-2.pdf
Size: 335.28 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission