Evaluating Domain-Specific Topic Reduction for Sparse Vector Document Retrieval

Irons, Carson P.

datacite.rights	restricted
dc.contributor.advisor	Hanin, Boris
dc.contributor.author	Irons, Carson P.
dc.date.accessioned	2025-08-07T12:44:34Z
dc.date.available	2025-08-07T12:44:34Z
dc.date.issued	2025-05-10
dc.description.abstract	This thesis investigates the limitations of current document retrieval systems and introduces an alternative architecture that leverages topic-level sparse indexing of contextual embeddings. This theoretical retrieval system seeks high computational efficiency, through low latency and low indexing overhead, together with strong semantic understanding that respects local meaning and document cohesion. Additionally, the system supports scalable and context-aware document matching without reliance on user interaction data. In pursuit of these objectives, the system makes two key assumptions about the structure and content of documents within a chosen application domain. The first is that documents can be broken into self-contained semantic components; the second is that the application domain's distinct meanings can be represented as a finite, discrete set of topics. At a high level, the proposed system aims to represent a document as a bag of topics, then apply sparse vector ranking algorithms at retrieval time. Topics are inferred by clustering the contextualized embeddings of semantic components within a learned embedding space. The contributions of this thesis are a review of existing retrieval methods, an outline of the proposed system's intuition and architecture, and an exploratory implementation against a strategically chosen application domain. The thesis finds that standard embedding models (SBERT in this case) are insufficient for identifying application-specific topics. Future work will focus on fine-tuning embedding models to better capture domain-specific semantics and on fully evaluating the potential of this topic-based retrieval framework. The thesis also provides the necessary tooling for extension and modification of the retrieval pipeline: it supports training and querying the proposed retrieval system while accepting custom implementations at each step.
dc.identifier.uri	https://theses-dissertations.princeton.edu/handle/88435/dsp01d504rp782
dc.language.iso	en_US
dc.title	Evaluating Domain-Specific Topic Reduction for Sparse Vector Document Retrieval
dc.type	Princeton University Senior Theses
dspace.entity.type	Publication
dspace.workflow.startDateTime	2025-04-10T19:56:29.821Z
pu.contributor.authorid	920253093
pu.date.classyear	2025
pu.department	Ops Research & Financial Engr
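
The pipeline summarized in the abstract (semantic components mapped to topics, documents stored as bags of topics, sparse ranking at query time) can be illustrated with a minimal sketch. This is not the thesis's implementation. The 2-D vectors stand in for contextual embeddings, the fixed `centroids` stand in for cluster centers that would be learned by clustering, and the TF-IDF-style weighting is only one assumed choice of sparse ranking scheme. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def nearest_topic(vec, centroids):
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])),
    )


def bag_of_topics(component_vecs, centroids):
    """Assign each component embedding to a topic id and count topic occurrences."""
    return Counter(nearest_topic(v, centroids) for v in component_vecs)


def idf_weights(bags, n_topics):
    """Smoothed inverse document frequency per topic (an assumed weighting)."""
    n = len(bags)
    return [
        math.log((n + 1) / (1 + sum(1 for b in bags if k in b))) + 1.0
        for k in range(n_topics)
    ]


def score(query_bag, doc_bag, idf):
    """Sparse dot product: only topics present in both bags contribute."""
    return sum(
        query_bag[k] * doc_bag[k] * idf[k] ** 2 for k in query_bag if k in doc_bag
    )


def rank(query_bag, doc_bags, idf):
    """Return document indices ordered by descending score (ties broken by index)."""
    scored = [(score(query_bag, bag, idf), i) for i, bag in enumerate(doc_bags)]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy stand-ins: three topic centroids and three documents, each a list of
# component embeddings (hand-written here instead of produced by a model).
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
docs = [
    [(0.1, 0.2), (0.3, -0.1), (4.8, 5.1)],
    [(9.7, 0.2), (10.1, -0.3)],
    [(5.2, 4.9), (9.9, 0.1)],
]

doc_bags = [bag_of_topics(d, centroids) for d in docs]
idf = idf_weights(doc_bags, len(centroids))

query_bag = bag_of_topics([(10.2, 0.1)], centroids)
ranking = rank(query_bag, doc_bags, idf)
print(ranking)
```

In this toy setup the query falls in the third topic, so the document composed entirely of that topic ranks first and the document that never mentions it ranks last. In a fuller version, the embeddings would come from a model such as SBERT and the centroids from a clustering step, which is where the abstract reports that off-the-shelf embeddings fall short.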

Files

Original bundle

Name: Thesis (2).pdf
Size: 769.48 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission

Collections

Operations Research and Financial Engineering, 2000-2025
