Publication: We Need More Data: The Promise and Peril of Training Large Language Models on Synthetically Generated Text
Abstract
Our research investigates the viability of using Large Language Models (LLMs) for natural language text augmentation. One major factor behind the significant improvement in LLM performance in recent years has been the increased volume of data used to train models. However, state-of-the-art models have already been trained on nearly the entire internet, effectively exhausting the supply of unique, human-generated data. As a result, the availability of unique data is emerging as a significant bottleneck for further advances in model performance. To address this issue, we explored data augmentation as a method of synthetic data generation for expanding existing training corpora. While data augmentation has proven effective for expanding dataset sizes and improving model performance in domains such as image classification, robust methods for text data remain underdeveloped due to the complex structure of natural language. In our study, we used the Gutenberg English dataset to generate augmented versions of long-form passages with a state-of-the-art large language model. We then trained three identical ~124M-parameter GPT-2-style models to convergence: one on the original dataset, one on the synthetic dataset, and one on a combination of both. Across nearly all evaluation benchmarks, including in-distribution and zero-shot tasks, the model trained solely on human-generated data outperformed the others. These findings highlight the importance of data quality in pretraining, underscoring not only its role in improving model performance but also the potential risks of relying on synthetically generated data, even though past gains have largely been driven by data volume. Our work also exposes limitations in current approaches to assessing text data quality, such as the inadequacy of cosine similarity as a proxy for quality. While our results tell a cautionary tale about the risks of training LLMs on synthetic data, we also suggest potential directions for future work, particularly in refining synthetic data generation and filtering strategies.
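
To make the augmentation and quality-proxy steps described in the abstract concrete, the sketch below generates a paraphrased version of a passage with an off-the-shelf LLM API and then scores the rewrite against the original using cosine similarity over embeddings. The model names, prompt, and comparison step are illustrative assumptions, not the exact pipeline, models, or thresholds used in the study.

```python
"""Illustrative sketch (not the paper's actual pipeline): rewrite a passage
with an LLM, then score it against the original with cosine similarity.
Model names, the prompt, and the usage example are placeholder assumptions."""

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rewrite_passage(passage: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM for an augmented (paraphrased) version of a passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the passage in different words while "
                    "preserving its meaning, style, and length."
                ),
            },
            {"role": "user", "content": passage},
        ],
    )
    return response.choices[0].message.content


def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a passage so the original and its rewrite can be compared."""
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    original = "It was the best of times, it was the worst of times..."
    synthetic = rewrite_passage(original)
    # A high score does not guarantee the rewrite is good training data,
    # which is one reason the abstract calls cosine similarity an
    # inadequate proxy for text quality.
    print(cosine_similarity(embed(original), embed(synthetic)))
```

A filtering strategy of the kind mentioned at the end of the abstract could, for instance, discard rewrites whose similarity to the source falls outside some chosen band; the abstract's results suggest such a proxy alone is not sufficient to guarantee useful pretraining data.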