Publication: Garbage Upstream, Garbage Downstream: Diagnosing Embedding Model Failures in Yorùbá NLP
Abstract
Embedding models, which map text or other data to points in a vector space, form the backbone of many modern Natural Language Processing (NLP) tasks, including Machine Translation (MT), Question Answering (QA), and Named Entity Recognition (NER). While an abundance of data and Machine Learning (ML) tools exists for NLP tasks in English, the same cannot be said for low-resource languages. A low-resource language is one that lacks the online data or the technical-linguistic tools necessary to train ML models effectively. In particular, Yorùbá is a low-resource African language for which embedding model availability is limited. This scarcity presents a bottleneck across African NLP development efforts, as access to quality embeddings affects multiple downstream tasks. By applying the Vendiscope, a tool for analyzing the composition of data at scale, I uncover insights into the presently available Yorùbá-friendly embedding models. Further analysis reveals implicit assumptions within ML development that should be mitigated in future African NLP work.
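
To make the measurement concrete, the sketch below illustrates the kind of diversity score the Vendiscope builds on, the Vendi Score: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix, so a set of near-duplicate embeddings scores near 1 while a maximally diverse set scores near its size. This is a minimal NumPy illustration under stated assumptions (a cosine-similarity kernel, and randomly generated stand-ins for real Yorùbá sentence embeddings), not the pipeline used in the work itself.

    import numpy as np

    def vendi_score(K: np.ndarray) -> float:
        # Vendi Score: exp of the Shannon entropy of the eigenvalues of K/n,
        # where K is an n x n similarity (Gram) matrix with K[i, i] = 1.
        n = K.shape[0]
        eigvals = np.linalg.eigvalsh(K / n)    # eigenvalues sum to 1 since trace(K) = n
        eigvals = eigvals[eigvals > 1e-12]     # drop numerical zeros / round-off negatives
        return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

    # Toy run with stand-in embeddings; in practice these would come from a
    # Yorùbá-capable sentence encoder.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 64))                  # hypothetical embedding matrix
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    K = X @ X.T                                     # cosine-similarity kernel, PSD
    print(f"Vendi Score: {vendi_score(K):.2f} (maximum possible: {len(X)})")

Applied to the embedding matrix of a Yorùbá corpus, a score far below the corpus size would flag heavy redundancy in the data feeding downstream MT, QA, and NER systems, in line with the upstream-to-downstream framing of the title.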