Princeton University users: to view a senior thesis while away from campus, connect to the campus network via the Global Protect virtual private network (VPN). Unaffiliated researchers: please note that requests for copies are handled manually by staff and require time to process.
 

Publication:

Evaluating Domain-Specific Topic Reduction for Sparse Vector Document Retrieval

datacite.rightsrestricted
dc.contributor.advisorHanin, Boris
dc.contributor.authorIrons, Carson P.
dc.date.accessioned2025-08-07T12:44:34Z
dc.date.available2025-08-07T12:44:34Z
dc.date.issued2025-05-10
dc.description.abstractThis thesis investigates the limitations of current document retrieval systems and introduces an alternative architecture leveraging topic-level sparse indexing of contextual embeddings. This theoretical retrieval system seeks to achieve high computational efficiency through low latency and indexing overhead, while also achieving high semantic understanding and respecting local meaning and document cohesion. Additionally, the system supports scalable and context-aware document matching without reliance on user interaction data In pursuit of these objectives, the system makes 2 key assumptions on the structure and content of documents within a chosen application domain. The first assumption is that documents can be broken into self-contained semantic components, the second assumes an ability to represent the application domain's distinct meanings as a finite, discrete set of topics. At a high level, the proposed system aims to represent a document as a bag of topics, then apply sparse vector ranking algorithms at retrieval time. Topics are inferred by clustering the contextualized embeddings of semantic components within a learned embedding space. The contributions of this thesis involve a review of existing retrieval methods, an outline of the proposed system's intuition and architecture, and an explorative implementation against a strategically chosen application domain. The thesis finds that standard embedding models (SBERT in this case) are insufficient for identifying application specific topics. Future work will focus on fine-tuning embedding models to better capture domain-specific semantics and fully evaluate the potential of this topic-based retrieval framework. The thesis also provides the necessary tooling, for extension and modification of the retrieval pipeline. Namely, it supports the training and querying of the proposed retrieval system, while accepting custom implementations at each step.
dc.identifier.urihttps://theses-dissertations.princeton.edu/handle/88435/dsp01d504rp782
dc.language.isoen_US
dc.titleEvaluating Domain-Specific Topic Reduction for Sparse Vector Document Retrieval
dc.typePrinceton University Senior Theses
dspace.entity.typePublication
dspace.workflow.startDateTime2025-04-10T19:56:29.821Z
pu.contributor.authorid920253093
pu.date.classyear2025
pu.departmentOps Research & Financial Engr

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Thesis (2).pdf
Size:
769.48 KB
Format:
Adobe Portable Document Format
Download

License bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
100 B
Format:
Item-specific license agreed to upon submission
Description:
Download