Publication:

Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP

datacite.rights: restricted
dc.contributor.advisor: Dieng, Adji Bousso
dc.contributor.author: Aliu, Aminah O.
dc.date.accessioned: 2025-08-06T15:52:41Z
dc.date.available: 2025-08-06T15:52:41Z
dc.date.issued: 2025-04-27
dc.description.abstract: Embedding models, which map text or other data to a point in vector space, form the backbone of many modern Natural Language Processing (NLP) tasks, including Machine Translation (MT), Question-Answering (QA), and Named Entity Recognition (NER). While an abundance of data and Machine Learning (ML) tools exists for NLP tasks in English, the same cannot be said for low-resource languages. A low-resource language is one that lacks the online data or technical-linguistic tools necessary to effectively train ML models. In particular, Yorùbá is a low-resource African language for which embedding model availability is limited. This scarcity presents a bottleneck across African NLP development efforts, as access to quality embeddings affects multiple downstream tasks. Through application of the Vendiscope, a tool capable of analyzing the composition of data at scale, I uncover insight into presently available Yorùbá-friendly embedding models. Further analysis reveals implicit assumptions within ML development that should be mitigated in future African NLP work.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01db78tg48g
dc.language.iso: en_US
dc.title: Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-04-26T23:58:36.548Z
pu.contributor.authorid: 920249842
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name:
Aminah Aliu Written Report.pdf
Size:
1.57 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
100 B
Format:
Item-specific license agreed to upon submission