
Publication:

The Hunt for Data Leakage Reviews: Using LLMs to Automate Academic Paper Screening

datacite.rights: restricted
dc.contributor.advisor: Narayanan, Arvind
dc.contributor.author: Jerdee, Alexandra
dc.date.accessioned: 2025-08-06T15:39:10Z
dc.date.available: 2025-08-06T15:39:10Z
dc.date.issued: 2025-04-27
dc.description.abstract: Machine learning (ML) techniques have been increasingly implemented across diverse fields, and many suffer from a set of methodological errors called data leakage. Some scholars describe this wave of incorrect ML executions as a "reproducibility crisis." However, the pervasiveness of machine learning pitfalls has not been robustly measured, and the task of finding erroneous papers is difficult due to the diverse language used to describe ML across disciplines. This thesis project leverages large language models (LLMs) to build a systematic search pipeline to find papers with data leakage and help to quantify the scale of erroneous ML practices. The pipeline uses LLMs to answer questions using the abstract text and full text of academic papers, filtering from a set of 5 million papers down to 1000 papers. In this process, we double the number of known papers affected by data leakage, and point towards thousands more. This provides a proof of concept of large-scale LLM-based search pipelines, and contributes substantial evidence for the existence of a "reproducibility crisis" in machine learning.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01m039k836r
dc.language.iso: en_US
dc.title: The Hunt for Data Leakage Reviews: Using LLMs to Automate Academic Paper Screening
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-04-28T01:31:56.804Z
pu.contributor.authorid: 920280870
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name: aj3564_written_final_report (3).pdf
Size: 1.65 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission