
Publication:

Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP

dc.contributor.advisor: Dieng, Adji Bousso
dc.contributor.author: Aliu, Aminah O.
dc.date.accessioned: 2025-08-06T15:52:41Z
dc.date.available: 2025-08-06T15:52:41Z
dc.date.issued: 2025-04-27
dc.description.abstract: Embedding models, which map text or other data to a point in vector space, form the backbone of many modern Natural Language Processing (NLP) tasks, including Machine Translation (MT), Question-Answering (QA), and Named Entity Recognition (NER). While an abundance of data and Machine Learning (ML) tools exists for NLP tasks in English, the same cannot be said for low-resource languages. A low-resource language is one that lacks the online data or technical-linguistic tools necessary to effectively train ML models. In particular, Yorùbá is a low-resource African language for which embedding model availability is limited. This scarcity presents a bottleneck across African NLP development efforts, as access to quality embeddings affects multiple downstream tasks. Through application of the Vendiscope, a tool capable of analyzing the composition of data at scale, I uncover insights into presently available Yorùbá-friendly embedding models. Further analysis reveals implicit assumptions within ML development which should be mitigated in future African NLP work.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01db78tg48g
dc.language.iso: en_US
dc.title: Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-04-26T23:58:36.548Z
pu.contributor.authorid: 920249842
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name: Aminah Aliu Written Report.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission