Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the Global Protect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

The Hunt for Data Leakage Reviews: Using LLMs to Automate Academic Paper Screening


Files

aj3564_written_final_report (3).pdf (1.65 MB)

Date

2025-04-27

Abstract

Machine learning (ML) techniques have been increasingly adopted across diverse fields, and many applications suffer from a set of methodological errors called data leakage. Some scholars describe this wave of flawed ML studies as a "reproducibility crisis." However, the pervasiveness of machine learning pitfalls has not been robustly measured, and finding erroneous papers is difficult because the language used to describe ML varies across disciplines. This thesis leverages large language models (LLMs) to build a systematic search pipeline that finds papers with data leakage and helps quantify the scale of erroneous ML practices. The pipeline uses LLMs to answer screening questions about the abstracts and full text of academic papers, filtering a set of 5 million papers down to 1,000. In this process, we double the number of known papers affected by data leakage and point toward thousands more. This provides a proof of concept for large-scale LLM-based search pipelines and contributes substantial evidence for the existence of a "reproducibility crisis" in machine learning.
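The staged filtering the abstract describes — posing yes/no questions to an LLM about a paper's text and passing only positive answers to the next stage — can be sketched as follows. This is a minimal illustration, not the thesis's pipeline: `ask_llm` is a placeholder for a real LLM call (here a crude keyword heuristic), and the questions and keywords are invented for the example.

```python
def ask_llm(question: str, text: str) -> bool:
    """Stand-in for an LLM yes/no judgment; here, a keyword heuristic."""
    keywords = {
        "Does this paper use machine learning?": ["machine learning", "classifier"],
        "Might this paper suffer from data leakage?": ["leakage", "test set"],
    }
    return any(k in text.lower() for k in keywords.get(question, []))

def screen(papers: list[dict], questions: list[str]) -> list[dict]:
    """Filter papers through successive yes/no questions on their text.

    Each stage keeps only papers the model answers "yes" for, so cheap
    questions can prune the corpus before more expensive full-text ones.
    """
    survivors = papers
    for q in questions:
        survivors = [p for p in survivors if ask_llm(q, p["abstract"])]
    return survivors

papers = [
    {"id": 1, "abstract": "We train a classifier but normalize with the test set, risking leakage."},
    {"id": 2, "abstract": "A survey of medieval manuscripts."},
]
flagged = screen(papers, [
    "Does this paper use machine learning?",
    "Might this paper suffer from data leakage?",
])
# flagged retains only paper 1
```

Swapping the heuristic for an actual model call leaves the cascade structure unchanged; the key design point from the abstract is that cheap abstract-level questions prune millions of papers before full-text screening.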
