Publication: Design and Evaluation of a Modular Architecture To Assess LLMs in Summarizing Electronic Health Records
Abstract
Low health literacy affects nearly half of Americans and poses a major barrier to effective healthcare, particularly when interpreting complex electronic health records (EHRs). While large language models (LLMs) offer promising capabilities for simplifying medical information, little research has explored their performance on personalized patient data. This thesis presents a modular framework for evaluating five state-of-the-art LLMs—GPT-4o-mini, Gemini 2.0 Flash, Claude 3.7 Sonnet, DeepSeek V3, and MiniMax-01 Text—on their ability to generate readable, patient-friendly summaries of structured EHRs. Using synthetic data from the Synthea dataset and prompts targeting sixth-grade (AMA) and eighth-grade (NIH) reading levels, the framework evaluates model outputs with quantitative readability metrics, including Flesch-Kincaid, SMOG, and Gunning Fog scores. The results reveal significant variation in tone, complexity, and prompt adherence across models. GPT-4o-mini consistently produced the most readable summaries, while Claude 3.7 Sonnet showed greater prompt sensitivity and lower cost-effectiveness. The findings highlight the importance of prompt engineering, context length, and model choice in improving health communication. This work contributes a replicable evaluation pipeline and underscores the potential of LLMs to enhance health literacy and patient empowerment.
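As a rough illustration of the quantitative scoring step described in the abstract, the sketch below computes the three named readability metrics with the open-source textstat package. The function name, example text, and overall structure are illustrative assumptions and are not taken from the thesis's actual pipeline.

```python
# Minimal sketch of a readability-scoring step, assuming the `textstat` package.
# Names here are illustrative, not the thesis's evaluation code.
import textstat


def score_summary(summary_text: str) -> dict:
    """Compute the three readability metrics mentioned in the abstract."""
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(summary_text),
        "smog_index": textstat.smog_index(summary_text),
        "gunning_fog": textstat.gunning_fog(summary_text),
    }


if __name__ == "__main__":
    example = (
        "Your blood pressure was a little high at your last visit. "
        "Your doctor wants you to take your medicine every day and "
        "come back in three months for a check-up."
    )
    for metric, value in score_summary(example).items():
        # Lower grade-level scores indicate easier-to-read text.
        print(f"{metric}: {value:.1f}")
```

In a pipeline like the one described, such scores could be collected for each model and prompt condition and compared against the targeted sixth-grade (AMA) and eighth-grade (NIH) reading levels.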