
Thesis Central


Browsing by Author "Stepanewk, Leo"


    Scaling Code Repair with LLM-Generated Synthetic Bugs

    (2025) Stepanewk, Leo; Narayanan, Arvind

    Large language models (LLMs) have demonstrated remarkable capabilities in code generation and repair, yet their performance is often constrained by the availability of high-quality training data, which is expensive to curate. We present a scalable pipeline that augments code repair datasets with synthetic data by using an LLM to perturb instances of correct code. Similar existing methods produce unrealistic examples and rely on imprecise, human-defined bug taxonomies. Our method learns a well-defined taxonomy of bugs from existing data and creates perturbations in correct code snippets by retrieving the most relevant bug categories. We use few-shot learning with code differentials to demonstrably improve the realism and complexity of the inserted bugs. When used to train a downstream Llama-3.1-8B-Instruct model for code repair, the synthetic buggy-to-repaired code examples lead to performance that far exceeds the previously published baseline and nearly matches that of a model trained on an equally sized human dataset. Our method is also cost-effective: a million-example synthetic dataset can be created for under $1,000.
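    Below is a minimal, hypothetical sketch in Python of the kind of bug-injection loop the abstract describes: retrieve relevant bug categories for a correct snippet, build a few-shot prompt from example code diffs, and ask an LLM to produce a buggy variant. The BugCategory structure, the keyword-overlap retriever, and the call_llm stub are illustrative assumptions, not the thesis implementation.

    # Hypothetical sketch of the bug-injection pipeline; names and heuristics are assumptions.
    from dataclasses import dataclass

    @dataclass
    class BugCategory:
        name: str
        description: str
        few_shot_diffs: list[str]   # example (correct -> buggy) code diffs for this category

    def retrieve_categories(snippet: str, taxonomy: list[BugCategory], k: int = 3) -> list[BugCategory]:
        """Rank bug categories by naive keyword overlap with the snippet
        (a placeholder for whatever learned retriever is actually used)."""
        tokens = set(snippet.split())
        scored = sorted(taxonomy, key=lambda c: -len(tokens & set(c.description.split())))
        return scored[:k]

    def build_prompt(snippet: str, categories: list[BugCategory]) -> str:
        """Assemble a few-shot prompt: category descriptions plus example diffs,
        followed by the correct snippet to perturb."""
        shots = "\n\n".join(d for c in categories for d in c.few_shot_diffs)
        header = "\n".join(f"- {c.name}: {c.description}" for c in categories)
        return (
            "Introduce a realistic bug of one of these kinds:\n"
            f"{header}\n\nExample diffs:\n{shots}\n\n"
            f"Correct code:\n{snippet}\n\nBuggy code:"
        )

    def call_llm(prompt: str) -> str:
        """Stand-in for an LLM call; replace with your provider's client."""
        raise NotImplementedError

    def make_synthetic_pair(snippet: str, taxonomy: list[BugCategory]) -> tuple[str, str]:
        """Return a (buggy, repaired) training pair, where 'repaired' is the original snippet."""
        prompt = build_prompt(snippet, retrieve_categories(snippet, taxonomy))
        return call_llm(prompt), snippet

    Pairs produced this way could then be formatted as buggy-to-repaired examples for fine-tuning a downstream repair model such as Llama-3.1-8B-Instruct.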

