Publication: We Need More Data: The Promise and Peril of Training Large Language Models on Synthetically Generated Text
Abstract
Our research investigates the viability of using Large Language Models (LLMs) for natural language text augmentation. One major factor behind the significant improvement in LLM performance in recent years has been the increased volume of data used to train models. However, state-of-the-art models have already been trained on nearly the entire internet, effectively exhausting the supply of unique, human-generated data. As a result, the availability of unique data is emerging as a significant bottleneck for further advances in model performance. To address this issue, we explored data augmentation as a method of synthetic data generation for expanding existing training corpora. While data augmentation has proven effective for expanding dataset sizes and improving model performance in domains such as image classification, robust methods for text data remain underdeveloped due to the complex structure of natural language. In our study, we used the Gutenberg English dataset to generate augmented versions of long-form passages with a state-of-the-art large language model. We then trained three identical ~124M-parameter GPT-2-style models to convergence: one on the original dataset, one on the synthetic dataset, and one on a combination of both. Across nearly all evaluation benchmarks, including in-distribution and zero-shot tasks, the model trained solely on human-generated data outperformed the others. These findings highlight the importance of data quality in pretraining, underscoring not only its role in improving model performance but also the potential risks of relying on synthetically generated data, even though past gains have largely been driven by data volume. Our work also exposes limitations in current approaches to assessing text data quality, such as the inadequacy of cosine similarity as a proxy for quality. While our results tell a cautionary tale about the risks of training LLMs on synthetic data, we also suggest potential directions for future work, particularly in refining synthetic data generation and filtering strategies.
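
To make the augmentation and quality-proxy steps described in the abstract concrete, the sketch below generates a paraphrased version of a passage with an off-the-shelf LLM API and then scores the rewrite against the original using cosine similarity over embeddings. The model names, prompt, and comparison step are illustrative assumptions, not the exact pipeline, models, or thresholds used in the study.

```python
"""Illustrative sketch (not the paper's actual pipeline): rewrite a passage
with an LLM, then score it against the original with cosine similarity.
Model names, the prompt, and the usage example are placeholder assumptions."""

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rewrite_passage(passage: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM for an augmented (paraphrased) version of a passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the passage in different words while "
                    "preserving its meaning, style, and length."
                ),
            },
            {"role": "user", "content": passage},
        ],
    )
    return response.choices[0].message.content


def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a passage so the original and its rewrite can be compared."""
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    original = "It was the best of times, it was the worst of times..."
    synthetic = rewrite_passage(original)
    # A high score does not guarantee the rewrite is good training data,
    # which is one reason the abstract calls cosine similarity an
    # inadequate proxy for text quality.
    print(cosine_similarity(embed(original), embed(synthetic)))
```

A filtering strategy of the kind mentioned at the end of the abstract could, for instance, discard rewrites whose similarity to the source falls outside some chosen band; the abstract's results suggest such a proxy alone is not sufficient to guarantee useful pretraining data.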