Publication: Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP
dc.contributor.advisor | Dieng, Adji Bousso | |
dc.contributor.author | Aliu, Aminah O. | |
dc.date.accessioned | 2025-08-06T15:52:41Z | |
dc.date.available | 2025-08-06T15:52:41Z | |
dc.date.issued | 2025-04-27 | |
dc.description.abstract | Embedding models, which map text or other data to a point in vector space, form the backbone of many modern Natural Language Processing (NLP) tasks, including Machine Translation (MT), Question-Answering (QA), and Named Entity Recognition (NER). While an abundance of data and Machine Learning (ML) tools exists for NLP tasks in English, the same cannot be said for low-resource languages. A low-resource language is one that lacks the online data or technical-linguistic tools necessary to effectively train ML models. In particular, Yorùbá is a low-resource African language for which embedding model availability is limited. This scarcity presents a bottleneck across African NLP development efforts, as access to quality embeddings affects multiple downstream tasks. Through application of the Vendiscope, a tool capable of analyzing the composition of data at scale, I uncover insights into currently available Yorùbá-friendly embedding models. Further analysis reveals implicit assumptions within ML development which should be mitigated in future African NLP work. | |
dc.identifier.uri | https://theses-dissertations.princeton.edu/handle/88435/dsp01db78tg48g | |
dc.language.iso | en_US | |
dc.title | Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP | |
dc.type | Princeton University Senior Theses | |
dspace.entity.type | Publication | |
dspace.workflow.startDateTime | 2025-04-26T23:58:36.548Z | |
pu.contributor.authorid | 920249842 | |
pu.date.classyear | 2025 | |
pu.department | Computer Science |
Files
Original bundle
- Name: Aminah Aliu Written Report.pdf
- Size: 1.57 MB
- Format: Adobe Portable Document Format
License bundle
- Name: license.txt
- Size: 100 B
- Format: Item-specific license agreed to upon submission