Publication: Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models
| datacite.rights | restricted | |
| dc.contributor.advisor | Liu, Lydia Tingruo | |
| dc.contributor.author | Shukla, Tara H. | |
| dc.date.accessioned | 2025-08-06T15:00:59Z | |
| dc.date.available | 2025-08-06T15:00:59Z | |
| dc.date.issued | 2025-04-10 | |
| dc.description.abstract | My research applies Item Response Theory (IRT) to systematically evaluate Large Language Models’ (LLMs) capabilities in judgment tasks. By modeling model abilities and task characteristics simultaneously, this approach provides an interpretable framework that takes LLM evaluation beyond traditional accuracy metrics. Analysis of multiple benchmark datasets, including MT-Bench, JudgeBench, and LLMBar, reveals weak correlations between performance on different benchmarks, challenging the unidimensionality assumption of judgment capability. Additionally, I evaluate models on essay grading tasks using both pairwise and pointwise judgment methodologies. Through this comparison, I demonstrate that comparative judgment and single-answer evaluation represent distinct skills rather than manifestations of a single trait. My findings indicate that while model size generally correlates with judgment capability, this relationship varies substantially across evaluation contexts. My research contributes methodological innovations for benchmark characterization and usage. These findings have important implications for responsible AI deployment and the development of more targeted evaluation methodologies for LLM-based assessment systems. | |
| dc.identifier.uri | https://theses-dissertations.princeton.edu/handle/88435/dsp01k643b4633 | |
| dc.language.iso | en_US | |
| dc.title | Statistical Assessment for LLM-as-a-Judge: Leveraging Item Response Theory for Meta-Evaluation of Large Language Models | |
| dc.type | Princeton University Senior Theses | |
| dspace.entity.type | Publication | |
| dspace.workflow.startDateTime | 2025-04-13T23:26:28.260Z | |
| pu.contributor.authorid | 920289066 | |
| pu.date.classyear | 2025 | |
| pu.department | Computer Science | |
Files
Original bundle
- Name: written_final_report.pdf
- Size: 14.59 MB
- Format: Adobe Portable Document Format
License bundle
- Name: license.txt
- Size: 100 B
- Format: Item-specific license agreed to upon submission