
Publication:

Fact or Fiction? Evaluating the Ability of Large Language Models to Detect Legal Hallucinations


Files

lm8183_written_final_report-3.pdf (1.64 MB)

Date

2025


Abstract

As large language models (LLMs) become increasingly integrated into legal research tools, concerns about their tendency to “hallucinate”—generate factually incorrect or unsupported content—have grown. This paper investigates whether LLMs can also serve as factual consistency checkers in legal question-answering: given a legal query, an AI-generated answer, and its cited sources, can the model assess whether the response contains hallucinated information?

To evaluate this approach, we construct two datasets: one comprising AI-generated question–answer pairs with controlled hallucinations, and another based on real outputs from Westlaw’s AI-Assisted Research (AI-AR) tool. We assess five models—GPT-4o, DeepSeek-R1, and three LLaMA variants—on their ability to detect and classify hallucinations. Results show that larger models, particularly GPT-4o and DeepSeek-R1, significantly outperform smaller alternatives and can reliably serve as automated evaluators of legal content. Although Westlaw AI-AR has improved since prior benchmarks, hallucinations remain a recurring issue. These findings suggest that LLMs hold promise not only as content generators, but also as scalable evaluators for legal AI systems.
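The abstract only outlines the checker setup, so the following is a minimal, hypothetical sketch of how an LLM could be queried as a factual consistency checker over a (query, answer, cited sources) triple. It assumes the OpenAI Python client and the GPT-4o model named above; the prompt wording, the SUPPORTED/HALLUCINATED label scheme, and the function name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): using GPT-4o as a factual
# consistency checker for a legal Q&A triple. The prompt wording and
# the SUPPORTED / HALLUCINATED label scheme are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def check_hallucination(query: str, answer: str, sources: str) -> str:
    """Ask the model whether `answer` is supported by `sources` for `query`."""
    prompt = (
        "You are verifying a legal research answer.\n"
        f"Question: {query}\n"
        f"Answer: {answer}\n"
        f"Cited sources:\n{sources}\n\n"
        "Reply with exactly one label: SUPPORTED if every claim in the answer "
        "is backed by the cited sources, or HALLUCINATED if any claim is "
        "unsupported or contradicted."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels make evaluation repeatable
    )
    return resp.choices[0].message.content.strip()


# Example usage with a made-up triple:
# label = check_hallucination(
#     "Does the statute of limitations toll during minority?",
#     "Yes, under Smith v. Jones (2010) it tolls until age 18.",
#     "Relevant excerpt of the cited case or statute goes here.",
# )
# print(label)
```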
