
Publication:

Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models


Files

written_final_report.pdf (14.59 MB)

Date

2025-04-10


Abstract

My research applies Item Response Theory (IRT) to systematically evaluate the judgment capabilities of Large Language Models (LLMs). By simultaneously modeling model abilities and task characteristics, this approach provides an interpretable framework that takes LLM evaluation beyond traditional accuracy metrics. Analysis of multiple benchmark datasets, including MT-Bench, JudgeBench, and LLMBar, reveals weak correlations between performance on different benchmarks, challenging the assumption that judgment capability is unidimensional. Additionally, I evaluate models on essay grading tasks using both pairwise and pointwise judgment methodologies, demonstrating that comparative judgment and single-answer evaluation represent distinct skills rather than manifestations of a single trait. My findings indicate that while model size generally correlates with judgment capability, this relationship varies substantially across evaluation contexts. My research contributes methodological innovations for benchmark characterization and usage. These findings have important implications for responsible AI deployment and for the development of more targeted evaluation methodologies for LLM-based assessment systems.
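The core idea of the abstract, jointly estimating judge abilities and item difficulties from a response matrix, can be sketched with the simplest IRT variant, a Rasch (1PL) model fit by gradient ascent. This is an illustrative sketch only: the function name `fit_rasch`, the optimizer, and the synthetic data are my assumptions, not the thesis's actual estimation procedure (which may use a 2PL model or a dedicated IRT library).

```python
import numpy as np

def fit_rasch(responses, n_iters=500, lr=0.5):
    """Jointly estimate judge abilities (theta) and item difficulties (b)
    for a binary response matrix under the Rasch model:
    P(correct_ij) = sigmoid(theta_i - b_j).

    responses: (n_models, n_items) array of 0/1 outcomes.
    """
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)  # judge abilities
    b = np.zeros(n_items)       # item difficulties
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p                    # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        # Only differences theta_i - b_j are identified; shifting both by the
        # same constant fixes the scale (mean ability = 0) without changing fit.
        shift = theta.mean()
        theta -= shift
        b -= shift
    return theta, b

# Synthetic demo (assumed data): 3 judges of increasing ability, 40 items.
rng = np.random.default_rng(0)
true_theta = np.array([-1.0, 0.0, 1.0])
true_b = rng.normal(0.0, 1.0, 40)
probs = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random((3, 40)) < probs).astype(float)
theta_hat, b_hat = fit_rasch(responses)
```

Because ability and difficulty are estimated on a common scale, the fitted parameters support exactly the kind of benchmark characterization the abstract describes: items with extreme difficulty (or poor fit) flag benchmarks that discriminate weakly between judges.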
