Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the GlobalProtect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models


Files

written_final_report.pdf (14.59 MB)

Date

2025-04-10


Abstract

My research applies Item Response Theory (IRT) to systematically evaluate the judgment capabilities of Large Language Models (LLMs). By jointly modeling model abilities and task characteristics, this approach provides an interpretable framework that takes LLM evaluation beyond traditional accuracy metrics. Analysis of multiple benchmark datasets, including MT-Bench, JudgeBench, and LLMBar, reveals weak correlations between performance on different benchmarks, challenging the assumption that judgment capability is unidimensional. Additionally, I evaluate models on essay grading tasks using both pairwise and pointwise judgment methodologies, and demonstrate that comparative judgment and single-answer evaluation represent distinct skills rather than manifestations of a single trait. My findings indicate that while model size generally correlates with judgment capability, this relationship varies substantially across evaluation contexts. My research contributes methodological innovations for benchmark characterization and use, with implications for responsible AI deployment and the development of more targeted evaluation methodologies for LLM-based assessment systems.
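
The IRT framework described in the abstract can be illustrated with the standard two-parameter logistic (2PL) model, in which the probability that judge model i answers item j correctly is sigmoid(a_j * (theta_i - b_j)), with theta_i the model's latent ability, b_j the item's difficulty, and a_j its discrimination. The sketch below fits this model by joint maximum likelihood on synthetic 0/1 judgment data; the model and item counts, parameter values, and fitting routine are illustrative assumptions, not the thesis's actual pipeline.

    # Minimal 2PL IRT sketch on synthetic judge-vs-item data (illustrative only).
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # logistic sigmoid

    rng = np.random.default_rng(0)
    n_models, n_items = 8, 40                      # hypothetical: LLM judges x benchmark items

    # Simulated ground-truth latent parameters
    theta_true = rng.normal(0, 1, n_models)        # model ability
    b_true = rng.normal(0, 1, n_items)             # item difficulty
    a_true = rng.lognormal(0, 0.3, n_items)        # item discrimination

    # P(model i judges item j correctly) = sigmoid(a_j * (theta_i - b_j))
    p = expit(a_true * (theta_true[:, None] - b_true))
    responses = rng.binomial(1, p)                 # observed 0/1 correctness matrix

    def neg_log_lik(params):
        # Parameter vector: abilities, log-discriminations, difficulties
        theta = params[:n_models]
        a = np.exp(params[n_models:n_models + n_items])   # keep discrimination positive
        b = params[n_models + n_items:]
        logits = a * (theta[:, None] - b)
        # Negative Bernoulli log-likelihood of the observed judgments
        return -np.sum(responses * logits - np.logaddexp(0, logits))

    x0 = np.zeros(n_models + 2 * n_items)
    fit = minimize(neg_log_lik, x0, method="L-BFGS-B")
    theta_hat = fit.x[:n_models]
    print("Estimated model abilities:", np.round(theta_hat, 2))

Fitting abilities and item parameters jointly is what lets the approach separate "this model is weak" from "this item is hard or poorly discriminating", which is the interpretability gain over raw accuracy that the abstract emphasizes.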
