Publication: InvestiHate: How Hate Speech Detection Models Identify Language Targeting Different Social Demographics
Abstract
In recent years, online hate speech has risen at an alarming rate, underscoring the urgent need for effective content moderation systems to keep online spaces safe. However, mounting political pressure from the new Trump administration, coupled with widespread skepticism about the reliability of hate speech classification models, has led many social media platforms to significantly scale back their moderation efforts. This thesis investigates the weaknesses and vulnerabilities of three hate speech detection models (logistic regression, SVM, and BERT) on Twitter posts. It examines how these models distinguish hate speech from offensive or neutral language, with particular attention to how slurs and references to gender, race, and sexuality affect classification outcomes. The findings reveal three key insights: (1) while BERT achieves the highest overall accuracy (82%), all models struggle to differentiate hate speech from offensive language; (2) all models exhibit a clear bias against classifying even blatant misogyny as hate speech; and (3) model performance deteriorates sharply on text that differs from the training distribution. As the problem of online hate speech continues to escalate, it is crucial to improve the ability of detection systems to identify and mitigate the most harmful online discourse.
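To make the task concrete, the following is a minimal sketch of one of the three classifiers the abstract names: a TF-IDF plus logistic regression pipeline for the three-way distinction between hate speech, offensive language, and neither. The toy texts, labels, and parameter choices here are illustrative assumptions, not the thesis's actual data or configuration; the real experiments are assumed to supply (text, label) pairs of the same shape from Twitter.

```python
# Minimal sketch of a three-class hate speech baseline using scikit-learn.
# The tiny inline dataset below is a placeholder for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder tweets and labels; the thesis's dataset is assumed to use the
# same three classes: "hate", "offensive", and "neither".
texts = [
    "placeholder tweet containing targeted hateful language",
    "placeholder tweet with profanity but no targeted group",
    "placeholder tweet about the weather",
]
labels = ["hate", "offensive", "neither"]

# TF-IDF turns each tweet into a sparse weighted bag-of-words vector;
# logistic regression then learns per-token weights for each class.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Predict the class of an unseen (placeholder) tweet.
print(model.predict(["another placeholder tweet with profanity"]))
```

Because such a model keys on individual tokens, a slur strongly weighted toward the "hate" class can dominate the prediction regardless of context, which is one mechanism behind the confusion between hate speech and merely offensive language that the thesis reports.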