Publication: Optical Document Recognition (ODR) with Large Vision-Language Models: Enhancing Metadata Creation and Digitization in Libraries
Abstract
As libraries and archives digitize their collections, a familiar challenge persists: making them searchable and accessible. Manual metadata creation is infeasible at scale, and legacy OCR systems often falter on complex, symbol-rich pages. This thesis introduces MetaScribe, a flexible system that uses recent advances in large vision-language models (LVLMs) to support metadata generation at scale. Tested on materials from the Princeton Prosody Archive (PPA), MetaScribe improved character recognition accuracy by over 20 percentage points and produced field-level metadata with promising reliability (average F1 score of 0.72). Yet the aim is not automation for its own sake. MetaScribe is designed to work alongside archivists and librarians, not in place of them. Through this thesis, we offer a modular, transparent framework that preserves human judgment while extending institutional capacity. As AI capabilities grow, tools like MetaScribe are a practical path forward: adaptable, accountable, and grounded in the needs of cultural stewardship.