Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the Global Protect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

The Hunt for Data Leakage Reviews: Using LLMs to Automate Academic Paper Screening


Files

aj3564_written_final_report (3).pdf (1.65 MB)

Date

2025-04-27

Abstract

Machine learning (ML) techniques have been increasingly adopted across diverse fields, and many applications suffer from a set of methodological errors called data leakage. Some scholars describe this wave of flawed ML studies as a "reproducibility crisis." However, the pervasiveness of machine learning pitfalls has not been robustly measured, and finding erroneous papers is difficult because the language used to describe ML varies across disciplines. This thesis leverages large language models (LLMs) to build a systematic search pipeline that finds papers with data leakage and helps quantify the scale of erroneous ML practices. The pipeline uses LLMs to answer screening questions about the abstracts and full text of academic papers, filtering a set of 5 million papers down to 1,000. In this process, we double the number of known papers affected by data leakage and point toward thousands more. This provides a proof of concept for large-scale LLM-based search pipelines and contributes substantial evidence for the existence of a "reproducibility crisis" in machine learning.
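The staged filtering the abstract describes — posing yes/no questions to an LLM about a paper's text and passing only positive answers to the next stage — can be sketched as follows. This is a minimal illustration, not the thesis's pipeline: `ask_llm` is a placeholder for a real LLM call (here a crude keyword heuristic), and the questions and keywords are invented for the example.

```python
def ask_llm(question: str, text: str) -> bool:
    """Stand-in for an LLM yes/no judgment; here, a keyword heuristic."""
    keywords = {
        "Does this paper use machine learning?": ["machine learning", "classifier"],
        "Might this paper suffer from data leakage?": ["leakage", "test set"],
    }
    return any(k in text.lower() for k in keywords.get(question, []))

def screen(papers: list[dict], questions: list[str]) -> list[dict]:
    """Filter papers through successive yes/no questions on their text.

    Each stage keeps only papers the model answers "yes" for, so cheap
    questions can prune the corpus before more expensive full-text ones.
    """
    survivors = papers
    for q in questions:
        survivors = [p for p in survivors if ask_llm(q, p["abstract"])]
    return survivors

papers = [
    {"id": 1, "abstract": "We train a classifier but normalize with the test set, risking leakage."},
    {"id": 2, "abstract": "A survey of medieval manuscripts."},
]
flagged = screen(papers, [
    "Does this paper use machine learning?",
    "Might this paper suffer from data leakage?",
])
# flagged retains only paper 1
```

Swapping the heuristic for an actual model call leaves the cascade structure unchanged; the key design point from the abstract is that cheap abstract-level questions prune millions of papers before full-text screening.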
