Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the Global Protect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

Evaluating Domain-Specific Topic Reduction for Sparse Vector Document Retrieval

Files

Thesis (2).pdf (769.48 KB)

Date

2025-05-10

Abstract

This thesis investigates the limitations of current document retrieval systems and introduces an alternative architecture that leverages topic-level sparse indexing of contextual embeddings. The proposed retrieval system, which remains theoretical, aims for high computational efficiency (low latency and low indexing overhead) together with strong semantic understanding that respects local meaning and document cohesion. It also supports scalable, context-aware document matching without relying on user interaction data.

In pursuit of these objectives, the system makes two key assumptions about the structure and content of documents within a chosen application domain. The first is that documents can be broken into self-contained semantic components; the second is that the application domain's distinct meanings can be represented as a finite, discrete set of topics.

At a high level, the proposed system aims to represent a document as a bag of topics, then apply sparse vector ranking algorithms at retrieval time. Topics are inferred by clustering the contextualized embeddings of semantic components within a learned embedding space.
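The flow described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy 2-D embeddings, the fixed centroids, the nearest-centroid assignment, and the smoothed TF-IDF scoring are all assumptions chosen for brevity, and every name (`CENTROIDS`, `nearest_topic`, `bag_of_topics`, `rank`) is hypothetical. In practice the centroids would come from clustering contextual embeddings, and the ranking function could be any sparse vector scheme.

```python
import math
from collections import Counter

# Hypothetical topic centroids in a toy 2-D embedding space (assumed, not learned).
CENTROIDS = {"pricing": (1.0, 0.0), "shipping": (0.0, 1.0), "returns": (-1.0, 0.0)}


def nearest_topic(vec):
    """Assign an embedding to the topic whose centroid is closest (Euclidean)."""
    return min(CENTROIDS, key=lambda topic: math.dist(vec, CENTROIDS[topic]))


def bag_of_topics(component_vecs):
    """Represent a document as topic counts over its semantic components."""
    return Counter(nearest_topic(v) for v in component_vecs)


def rank(query_vecs, docs):
    """Order document ids by TF-IDF-weighted topic overlap with the query."""
    bags = {doc_id: bag_of_topics(vecs) for doc_id, vecs in docs.items()}
    n_docs = len(bags)
    query_bag = bag_of_topics(query_vecs)
    scores = {}
    for doc_id, bag in bags.items():
        score = 0.0
        for topic, q_count in query_bag.items():
            # Document frequency: how many documents mention this topic.
            df = sum(1 for b in bags.values() if topic in b)
            idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed IDF (assumed)
            score += q_count * bag[topic] * idf  # Counter yields 0 if topic absent
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Each document is a list of component embeddings (toy values).
DOCS = {
    "doc_a": [(0.9, 0.1), (0.8, -0.2)],
    "doc_b": [(0.1, 0.9), (-0.9, 0.1)],
}
```

Calling `rank([(0.1, 1.0)], DOCS)` scores each document only on the topics it shares with the query, which is what keeps the index sparse: a document stores a handful of topic counts rather than a dense vector per component.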

The contributions of this thesis are a review of existing retrieval methods, an outline of the proposed system's intuition and architecture, and an exploratory implementation against a strategically chosen application domain. The thesis finds that standard embedding models (SBERT in this case) are insufficient for identifying application-specific topics. Future work will focus on fine-tuning embedding models to better capture domain-specific semantics and on fully evaluating the potential of this topic-based retrieval framework.

The thesis also provides the tooling necessary to extend and modify the retrieval pipeline. Namely, it supports training and querying the proposed retrieval system while accepting custom implementations at each step.
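One way such a pluggable pipeline could be structured is sketched below. This is an assumption about the general shape only, not the thesis's actual tooling: the class `RetrievalPipeline`, its fields `split`, `embed`, and `assign_topic`, and the toy step implementations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class RetrievalPipeline:
    """Hypothetical pipeline in which each step is a swappable callable."""

    split: Callable[[str], List[str]]      # document -> semantic components
    embed: Callable[[str], Vector]         # component -> embedding
    assign_topic: Callable[[Vector], str]  # embedding -> topic label

    def topics(self, text: str) -> List[str]:
        """Run every step in order and return one topic per component."""
        return [self.assign_topic(self.embed(part)) for part in self.split(text)]


# Toy step implementations, standing in for a real splitter, embedder, and clusterer.
pipeline = RetrievalPipeline(
    split=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    embed=lambda part: (float(len(part.split())),),
    assign_topic=lambda vec: "long" if vec[0] > 3 else "short",
)
```

Replacing any one field, for example passing a fine-tuned embedding model's encode function as `embed`, leaves the other steps untouched.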
