
Publication:

InvestiHate: How Hate Speech Detection Models Identify Language Targeting Different Social Demographics


Files

written_final_report.pdf (6.39 MB)

Date

2025-04-07

Abstract

In recent years, hate speech has risen at an alarming rate, underscoring the urgent need for effective content moderation systems to keep online spaces safe. However, mounting political pressure from the new Trump administration, coupled with widespread skepticism about the reliability of hate speech classification models, has led many social media platforms to significantly scale back their moderation efforts. This thesis investigates the weaknesses and vulnerabilities of three hate speech detection models (logistic regression, SVM, and BERT) on Twitter posts. It explores how these models distinguish hate speech from offensive or neutral language, with a particular focus on how slurs and references to gender, race, and sexuality affect classification outcomes. The findings reveal three key insights: (1) while BERT achieves the highest overall accuracy (82%), all models struggle to differentiate hate speech from offensive language; (2) all models exhibit a clear bias against classifying even blatant misogyny as hate speech; and (3) model performance deteriorates significantly on text that differs from the training data. As online hate speech continues to escalate, it is crucial that we improve the ability of hate speech detection systems to identify and mitigate the most harmful online discourse.
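To make the classification task concrete, the sketch below shows a minimal from-scratch version of the simplest baseline the abstract names: a bag-of-words logistic regression (softmax) classifier over the three classes hate / offensive / neutral. The tiny inline corpus, the label names, and the stand-in insults are illustrative placeholders only; they are not the thesis's actual Twitter dataset, features, or training setup.

```python
import math
from collections import Counter

CLASSES = ["hate", "offensive", "neutral"]

# Toy stand-in corpus (mild phrasing instead of real slurs); the
# thesis's real data is labeled Twitter posts, not these sentences.
train = [
    ("i despise that group of people", "hate"),
    ("those people are vermin and should leave", "hate"),
    ("you are such an idiot", "offensive"),
    ("what a stupid take", "offensive"),
    ("the weather is lovely today", "neutral"),
    ("i enjoyed the game last night", "neutral"),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    # Bag-of-words count vector over the training vocabulary;
    # words unseen in training are simply ignored.
    counts = Counter(text.split())
    return [counts.get(w, 0) for w in vocab]

# One weight vector (plus a trailing bias term) per class.
W = [[0.0] * (len(vocab) + 1) for _ in CLASSES]

def scores(x):
    return [w[-1] + sum(wi * xi for wi, xi in zip(w, x)) for w in W]

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Train with softmax cross-entropy and plain per-example gradient descent.
X = [featurize(t) for t, _ in train]
Y = [CLASSES.index(lab) for _, lab in train]
lr = 0.5
for _ in range(200):
    for x, y in zip(X, Y):
        p = softmax(scores(x))
        for c in range(len(CLASSES)):
            grad = p[c] - (1.0 if c == y else 0.0)  # dL/dz_c
            for j, xj in enumerate(x):
                W[c][j] -= lr * grad * xj
            W[c][-1] -= lr * grad  # bias update

def predict(text):
    p = softmax(scores(featurize(text)))
    return CLASSES[max(range(len(CLASSES)), key=p.__getitem__)]
```

A classifier this simple memorizes its tiny training set but, as the abstract's third finding notes, such models degrade sharply on text that differs from what they were trained on; the thesis's stronger baselines (SVM, BERT) replace the count features and linear model but keep the same three-way prediction task.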
