Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the GlobalProtect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models

datacite.rights: restricted
dc.contributor.advisor: Liu, Lydia Tingruo
dc.contributor.author: Shukla, Tara H.
dc.date.accessioned: 2025-08-06T15:00:59Z
dc.date.available: 2025-08-06T15:00:59Z
dc.date.issued: 2025-04-10
dc.description.abstract: My research applies Item Response Theory (IRT) to systematically evaluate Large Language Models' (LLMs) capabilities in judgment tasks. By modeling model abilities and task characteristics simultaneously, this approach provides an interpretable framework that takes LLM evaluation beyond traditional accuracy metrics. Analysis of multiple benchmark datasets, including MT-Bench, JudgeBench, and LLMBar, reveals weak correlations between performance on different benchmarks, challenging the assumption that judgment capability is unidimensional. Additionally, I evaluate models on essay grading tasks using both pairwise and pointwise judgment methodologies. Through this, I demonstrate that comparative judgment and single-answer evaluation represent distinct skills rather than manifestations of a single trait. My findings indicate that while model size generally correlates with judgment capability, this relationship varies substantially across evaluation contexts. My research contributes methodological innovations for benchmark characterization and usage. These findings have important implications for responsible AI deployment and the development of more targeted evaluation methodologies for LLM-based assessment systems.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01k643b4633
dc.language.iso: en_US
dc.title: Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-04-13T23:26:28.260Z
pu.contributor.authorid: 920289066
pu.date.classyear: 2025
pu.department: Computer Science
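
The abstract above describes modeling LLM judge abilities and task characteristics jointly with IRT. As a purely illustrative, minimal sketch (not the thesis code), the Python below fits a two-parameter logistic (2PL) IRT model by joint maximum likelihood on a hypothetical binary judge-by-item outcome matrix; the function name `fit_2pl`, the synthetic data, and the optimization details are assumptions introduced here for illustration.

```python
# Minimal illustrative sketch (not the thesis code): joint maximum-likelihood
# fit of a two-parameter logistic (2PL) IRT model on a hypothetical binary
# judge-by-item matrix. Rows = LLM judges, columns = judgment items;
# entry 1 = the judge's verdict on that item was correct.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_iters=2000, lr=0.05, seed=0):
    """responses: (n_models, n_items) array of 0/1 outcomes."""
    rng = np.random.default_rng(seed)
    n_models, n_items = responses.shape
    theta = rng.normal(0, 0.1, n_models)   # latent ability per judge model
    b = rng.normal(0, 0.1, n_items)        # difficulty per item
    a = np.ones(n_items)                   # discrimination per item

    for _ in range(n_iters):
        # P(model j answers item i correctly) = sigmoid(a_i * (theta_j - b_i))
        logits = a[None, :] * (theta[:, None] - b[None, :])
        p = sigmoid(logits)
        resid = responses - p              # gradient of the Bernoulli log-likelihood

        # Gradient-ascent updates for each parameter block
        theta += lr * (resid * a[None, :]).sum(axis=1) / n_items
        b     -= lr * (resid * a[None, :]).sum(axis=0) / n_models
        a     += lr * (resid * (theta[:, None] - b[None, :])).sum(axis=0) / n_models
        theta -= theta.mean()              # pin the location of the latent scale

    return theta, a, b

# Toy usage: 6 hypothetical judge models on 40 synthetic items.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_theta = np.linspace(-1.5, 1.5, 6)
    true_b = rng.normal(0, 1, 40)
    probs = sigmoid(true_theta[:, None] - true_b[None, :])
    data = (rng.random(probs.shape) < probs).astype(float)
    theta, a, b = fit_2pl(data)
    print("estimated abilities:", np.round(theta, 2))
```

In this kind of setup, the fitted abilities place judge models on a common latent scale while the item difficulty and discrimination parameters characterize the benchmark items themselves, which is the sense in which an IRT framing goes beyond a single accuracy number.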

Files

Original bundle

Name: written_final_report.pdf
Size: 14.59 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission