Publication:

Optical Document Recognition (ODR) with Large Vision-Language Models: Enhancing Metadata Creation and Digitization in Libraries

Files

James_Zhang_COS_Thesis.pdf (11.89 MB)

Date

2025-04-10

Abstract

As libraries and archives digitize their collections, a familiar challenge persists: making them searchable and accessible. Manual metadata creation is infeasible at scale, and legacy OCR systems often falter on complex, symbol-rich pages. This thesis introduces MetaScribe, a flexible system that uses recent advances in large vision-language models (LVLMs) to support metadata generation at scale. Tested on materials from the Princeton Prosody Archive (PPA), MetaScribe improved character recognition accuracy by over 20 percentage points and produced field-level metadata with promising reliability (average F1 score of 0.72). Yet the aim is not automation for its own sake. MetaScribe is designed to work alongside archivists and librarians, not in place of them. Through this thesis, we offer a modular, transparent framework that preserves human judgment while extending institutional capacity. As AI capabilities grow, tools like MetaScribe are a practical path forward: adaptable, accountable, and grounded in the needs of cultural stewardship.
