Authors: Narayanan, Arvind; Jerdee, Alexandra
Date available: 2025-08-06
Date issued: 2025-04-27
URI: https://theses-dissertations.princeton.edu/handle/88435/dsp01m039k836r
Abstract: Machine learning (ML) techniques have been increasingly adopted across diverse fields, and many of these applications suffer from a set of methodological errors known as data leakage. Some scholars describe this wave of flawed ML practice as a "reproducibility crisis." However, the pervasiveness of machine learning pitfalls has not been robustly measured, and finding erroneous papers is difficult because of the diverse language used to describe ML across disciplines. This thesis project leverages large language models (LLMs) to build a systematic search pipeline that finds papers with data leakage and helps quantify the scale of erroneous ML practice. The pipeline uses LLMs to answer screening questions about the abstracts and full text of academic papers, filtering a set of 5 million papers down to 1,000. In this process, we double the number of known papers affected by data leakage and point toward thousands more. This provides a proof of concept for large-scale LLM-based search pipelines and contributes substantial evidence for the existence of a "reproducibility crisis" in machine learning.
Language: en-US
Title: The Hunt for Data Leakage Reviews: Using LLMs to Automate Academic Paper Screening
Collection: Princeton University Senior Theses
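The abstract describes a staged filter: cheap LLM questions over abstracts first, then full-text screening for the survivors. As a rough illustration only, the sketch below shows what a single abstract-level screening call might look like; the client library (OpenAI Python SDK), model name, and prompt wording are assumptions for the sketch, not details taken from the thesis.

```python
"""Minimal sketch of one abstract-screening stage in an LLM search
pipeline. The model, prompt, and yes/no framing are illustrative
assumptions, not the thesis's actual implementation."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREENING_QUESTION = (
    "Does this abstract describe an application of machine learning "
    "that may be affected by data leakage (e.g., test data influencing "
    "training or feature selection)? Answer YES or NO."
)

def screen_abstract(abstract: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the LLM flags the paper for full-text review."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You screen academic papers."},
            {"role": "user",
             "content": f"{SCREENING_QUESTION}\n\nAbstract:\n{abstract}"},
        ],
        temperature=0,  # deterministic answers for repeatable screening
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("YES")

# Papers flagged at this cheap abstract stage would proceed to a
# second, more expensive screening pass over their full text.
```

Running a question like this over millions of abstracts and only promoting the positives to full-text analysis is one plausible way to realize the 5-million-to-1,000 funnel the abstract describes, since full-text retrieval and long-context LLM calls dominate the cost.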