Publication: Fact or Fiction? Evaluating the Ability of Large Language Models to Detect Legal Hallucinations
Abstract
As large language models (LLMs) become increasingly integrated into legal research tools, concerns about their tendency to “hallucinate”—generate factually incorrect or unsupported content—have grown. This paper investigates whether LLMs can also serve as factual consistency checkers in legal question-answering: given a legal query, an AI-generated answer, and its cited sources, can the model assess whether the response contains hallucinated information?
To evaluate this approach, we construct two datasets: one comprising AI-generated question–answer pairs with controlled hallucinations, and another based on real outputs from Westlaw’s AI-Assisted Research (AI-AR) tool. We assess five models—GPT-4o, DeepSeek-R1, and three LLaMA variants—on their ability to detect and classify hallucinations. Results show that larger models, particularly GPT-4o and DeepSeek-R1, significantly outperform smaller alternatives and can reliably serve as automated evaluators of legal content. Although Westlaw AI-AR has improved since prior benchmarks, hallucinations remain a recurring issue. These findings suggest that LLMs hold promise not only as content generators, but also as scalable evaluators for legal AI systems.
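To make the evaluation setup concrete, the sketch below shows one way an LLM-based consistency check of the kind described in the abstract could be framed: the judge model receives the legal question, the AI-generated answer, and the cited source text, and returns a support label. This is an illustrative assumption, not the authors' pipeline; the prompt wording, the label set, the `classify_hallucination` helper, and the choice of the OpenAI chat API with a GPT-4o judge are all hypothetical stand-ins.

```python
# Minimal sketch of an LLM-as-judge hallucination check (illustrative only;
# the prompt, labels, and judge model are assumptions, not the paper's method).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are checking a legal research answer for hallucinations.
Question: {question}
Answer: {answer}
Cited sources: {sources}

Reply with exactly one label:
SUPPORTED    - every factual and legal claim is backed by the cited sources
HALLUCINATED - at least one claim is unsupported by or contradicts the sources
"""

def classify_hallucination(question: str, answer: str, sources: str) -> str:
    """Ask the judge model whether the answer is grounded in its cited sources."""
    response = client.chat.completions.create(
        model="gpt-4o",   # any sufficiently capable judge model could be substituted
        temperature=0,    # deterministic output for evaluation runs
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer=answer, sources=sources
            ),
        }],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    label = classify_hallucination(
        question="How long is the statute of limitations for written contracts in California?",
        answer="Written contracts in California are subject to a four-year limitations period.",
        sources="Cal. Code Civ. Proc. § 337 (four-year limit for actions on written contracts).",
    )
    print(label)  # expected: SUPPORTED
```

In practice such a check would be run over each question-answer-source triple in the benchmark datasets, and the judge's labels compared against the controlled or manually annotated hallucinations to score detection accuracy.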