Publication: LLMs for Legal and Linguistic Accessibility: Toward Task-Aligned Interpretable Evaluation in Low-Resource Languages
dc.contributor.advisor | Kshirsagar, Mihir | |
dc.contributor.author | Agarwal, Tara | |
dc.date.accessioned | 2025-08-06T15:54:47Z | |
dc.date.available | 2025-08-06T15:54:47Z | |
dc.date.issued | 2025-05-04 | |
dc.description.abstract | The Constitution of India designates English as the language of jurisprudence in the Supreme Court and High Courts, thereby excluding those without English proficiency—the majority of the country’s population—from meaningful participation in legal discourse. To address the dual challenges of linguistic inaccessibility and legal complexity, this thesis investigates a range of large language model (LLM) configurations for summarizing English-language legal judgments into Hindi, India’s most widely spoken language. Recognizing the limitations of traditional rule-based evaluation metrics for cross-lingual tasks, we introduce a task-aligned, interpretable evaluation suite. Key components include pairwise BERTScores to assess output homogeneity, a question-answering framework (LLM-as-a-judge) to measure faithfulness, and named-entity preservation metrics as proxies for legal precision. Consistent with prior work, we observe that the summarize-then-translate pipeline outperforms direct end-to-end generation. Surprisingly, across both paradigms, one-shot prompting degrades performance relative to zero-shot prompting for LLaMA 3.1 8B and Qwen 2.5 7B alike. Accompanied by increased homogeneity and extractiveness, this behavior indicates that providing a judgment-summary example encourages stylistic imitation at the cost of information coverage. We also find that decoder-only models, despite generating less extractive summaries, achieve substantial gains over baselines in ROUGE and BERTScore, contrary to the presumed trade-off between abstractiveness and faithfulness. Finally, our question-answering framework reveals that models are prone to error when reproducing the court’s reasoning, demonstrating that, despite favorable scores from embedding- and overlap-based metrics, current LLMs fall short of the factuality required for high-stakes legal summarization. Nonetheless, our task-aligned evaluation suite serves as an important institutional readiness check for public-facing deployment, mirrors how humans evaluate summaries, and yields deeper insight into model behavior. | |
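For context on the homogeneity metric named in the abstract, the following is a minimal illustrative sketch, not the thesis's actual code: it assumes the open-source `bert-score` package and treats the mean BERTScore F1 over all unordered pairs of a model's summaries as the homogeneity proxy (the function name and the averaging choice are assumptions for illustration).

```python
# Illustrative sketch (assumed, not from the thesis): mean pairwise
# BERTScore F1 across a set of candidate summaries as a homogeneity
# proxy -- higher values mean the outputs are more similar to one another.
from itertools import combinations

from bert_score import score  # pip install bert-score


def pairwise_bertscore_f1(summaries: list[str], lang: str = "hi") -> float:
    """Average BERTScore F1 over all unordered pairs of summaries."""
    pairs = list(combinations(summaries, 2))
    candidates = [a for a, _ in pairs]
    references = [b for _, b in pairs]
    # lang="hi" makes bert-score select a multilingual encoder for Hindi.
    _, _, f1 = score(candidates, references, lang=lang, verbose=False)
    return f1.mean().item()
```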
dc.identifier.uri | https://theses-dissertations.princeton.edu/handle/88435/dsp0112579w71w | |
dc.language.iso | en_US | |
dc.title | LLMs for Legal and Linguistic Accessibility: Toward Task-Aligned Interpretable Evaluation in Low-Resource Languages | |
dc.type | Princeton University Senior Theses | |
dspace.entity.type | Publication | |
dspace.workflow.startDateTime | 2025-05-04T14:30:08.840Z | |
pu.contributor.authorid | 920282230 | |
pu.date.classyear | 2025 | |
pu.department | Computer Science | |
Files
Original bundle
- Name: Thesis_vFINAL.pdf
- Size: 3.79 MB
- Format: Adobe Portable Document Format
License bundle
- Name: license.txt
- Size: 100 B
- Description: Item-specific license agreed to upon submission