Computer Science, 1987-2025
Permanent URI for this collection: https://theses-dissertations.princeton.edu/handle/88435/dsp01mp48sc83w
Browsing Computer Science, 1987-2025 by Issue Date
Modeling Gendered Semantic Differences in English-Language Poetry
(2025) Choi, Melody; Fellbaum, Christiane Dorothea
This thesis presents a computational exploration of gendered semantic differences in English-language poetry. By training separate Word2Vec and FastText embedding models on collections of male- and female-authored poems across multiple time periods, we investigate how poets of different genders use language in quantitatively distinct ways. Nearest-neighbor embedding visualizations and n-gram-based co-occurrence networks were used to identify meaningful semantic relationships between words, which often reflected broader sociocultural narratives around gender.
The findings suggest that while semantic distinctions between male and female poets have become less pronounced in more contemporary periods, they remain detectable, especially around themes of identity, embodiment, and domesticity. Co-occurrence networks further illustrate thematic clustering and distinct community structures that vary across gender and era. While computational tools cannot fully capture the metaphorical ambiguity or emotional content of poetry, they offer new modes of inquiry, enabling the analysis of linguistic patterns that poetry has long conveyed through style and form. This work contributes to the growing field of computational text analysis, demonstrating that even through quantitative frameworks, language continues to carry the nuance of human thought and experience.
Billboard Hot 100 Chart-Toppers Understood: A Comprehensive Analysis of Popular Music in the 21st Century
(2025) Gomez, Richard; Li, Xiaoyan
This paper delves into the audio and lyrical features of popular music in the 21st century, primarily focusing on hit songs in the United States that charted on the Billboard Hot 100. Historical Billboard Hot 100 charts, lyric data from Genius Lyrics, and Spotify audio features are the three primary datasets used to construct a snapshot of contemporary popular music. Exploratory data analysis and clustering techniques highlight changes and continuities within the data, while latent Dirichlet allocation (LDA) is used to discover the thematic topics of hit and non-hit music. The overarching goal is to classify songs as hits and non-hits based on their underlying audio and lyrical features. To achieve this, a support vector machine (SVM) model is trained and optimized. The SVM achieves an accuracy rate of 82%, mirroring the successes of other papers in the field, while adding a new dimension to the data. Beyond the core features of the project, this paper contributes to the field of hit song science (HSS) and offers a new framework to study Billboard Hot 100 hits.
On the Path Dependence of Infrastructure Logic in Transit Planning
(2025) Hu, Daniel Y.; Fish, Robert S.
This project investigates the Metropolitan Transportation Authority’s (MTA) 2025-2029 Operational Budget and Capital Plan to examine how transit investment decisions in New York City reflect shifting patterns in ridership, demographics, and accessibility. Specifically, it uses the lens of path dependence to investigate how historical infrastructure logic continues to strongly influence present-day planning despite undeniable changes in commuter behavior and urban growth. Using a combination of ridership forecasting models, borough-level investment equity analysis, and inflation-adjusted efficiency metrics, the thesis finds that capital investments remain disproportionately allocated and accessibility investments remain skewed towards Manhattan, even though boroughs like Brooklyn and Queens are experiencing both faster population growth and stronger ridership recovery. Such findings suggest inertia in the MTA’s planning process, raising concerns about both its ability to ensure equity among the people it serves and how well it adapts to an ever-transforming metropolis.
PaceVision: Augmented Reality Sunglasses for Real-Time Running Metrics and Performance
(2025) Peixoto, Jonathan N.; Jamieson, Kyle Andrew
PaceVision addresses a gap that runners face in accessing real-time performance metrics without disturbing running form. This project presents an augmented reality (AR) running assistant that uses Engo 2 AR sunglasses to display real-time pace data in the runner’s field of view. The system’s key innovation is a dynamic pacing line that provides visual feedback on the current pace in relation to the target pace, thus removing the need to glance at wrist-worn devices. PaceVision uses adaptive algorithms that balance responsiveness and stability in pace calculations from GPS data. Furthermore, the system includes an interval training mode that automatically manages workout segments with visual cues. Evaluation results show that the system achieved an average response time under 3 seconds for pace changes. Although traditional watch methods provide slightly better average pace accuracy, PaceVision offers greater convenience for runners. The deviation was 4.4 versus 6.6 seconds across controlled one-mile trials, 7.5 versus 8.6 seconds in long-distance runs, and 6.4 versus 7.0 seconds during interval workouts. This research demonstrates the effectiveness of AR for enhancing a user’s running experience.
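The abstract does not say which adaptive algorithm PaceVision uses. As a rough illustration of the responsiveness-versus-stability trade-off it mentions, the sketch below smooths raw GPS pace samples with an exponential moving average. The function names, the `alpha` parameter, and the sample values are all hypothetical and are not taken from the thesis.

```python
def pace_seconds_per_km(distance_m, elapsed_s):
    """Convert a GPS segment (metres, seconds) into a pace in seconds per kilometre."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return elapsed_s / (distance_m / 1000.0)


def smooth_pace(samples, alpha):
    """Exponentially smooth pace samples (an assumed stand-in technique).

    alpha close to 1 reacts quickly to pace changes but passes GPS noise through;
    alpha close to 0 is stable but slow to respond.
    """
    smoothed = []
    current = None
    for sample in samples:
        if current is None:
            current = sample  # first sample seeds the average
        else:
            current = alpha * sample + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def pace_offset(current_pace, target_pace):
    """Positive result means the runner is slower than the target pace."""
    return current_pace - target_pace


# Hypothetical samples: two segments at 300 s/km, then a surge to 240 s/km.
raw = [pace_seconds_per_km(1000.0, 300.0), 300.0, 240.0]
smoothed = smooth_pace(raw, alpha=0.5)
offset = pace_offset(smoothed[-1], target_pace=280.0)
```

With `alpha=0.5` the smoothed pace moves only halfway toward each new sample, so a display driven by `offset` would lag a sudden surge but would also ignore a single noisy GPS reading.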
A Bioinformatics Approach to Information-Driven Folding and Docking of Antibody-Antigen Complexes
(2025) Burbank-Embry, Sarah H.; Dieng, Adji Bousso
This thesis presents a user-friendly approach to information-driven antibody-antigen folding and docking.
Enhancing Low-Resource Language Modeling Through Synthetic Text Generation: A Case Study on Swahili, Haitian Creole, and Yoruba
(2025) Singh, Divraj; Petras, Iasonas
Despite the impressive capabilities of large language models, low-resource languages (LRLs) such as Swahili, Haitian Creole, and Yoruba remain significantly underserved due to a lack of training data. This thesis explores a text-only approach to addressing this gap by leveraging back-translation to generate synthetic data. Using pre-trained multilingual models like mT5, original sentences in each target language are translated into English and then back into their original form to produce varied and contextually rich text pairs. These pairs are used to fine-tune LLMs, enhancing their fluency and generalization in low-resource settings. The results show measurable improvements in output diversity and translation quality, demonstrating that synthetic data augmentation can play a key role in advancing equitable language technology.
Right Place, Right Time: Computer Vision Tools for Analysis of Defensive Positioning in NCAA Men’s Volleyball
(2025) Tate, Mason J.; Moretti, Christopher M.
In the fast-growing and fast-paced sport of collegiate men’s volleyball, a strong defensive system is critical to team success. Central to this defense is the positioning of backcourt players, who attempt to “dig” the opponent’s attacks and prolong rallies by transitioning possession to their own team. While sports analytics has advanced rapidly in recent decades, men’s collegiate volleyball has seen relatively limited development in data-driven performance analysis, particularly in the area of defensive positioning. This project aims to bridge that gap by using computer vision, specifically object classification models, to detect and map defender positions from match footage. These spatial coordinates are then paired with outcome-based statistics to explore the relationship between positioning and dig success. Using homographic transformation techniques, player locations are projected onto a standardized 2D court model to enable comparison across venues. The resulting dataset is visualized through both frame-level position plots and aggregate heatmaps filtered by team, play outcome, or attack location. The data collection process achieved a usable frame conversion rate of 69.9%, indicating that the proposed methodology is viable for scalable, automated defensive analysis. This work demonstrates the potential of computer vision in volleyball analytics and provides a foundation for future research into tactical trends and optimization.
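The projection step described above can be sketched as follows. This is a minimal illustration of applying a planar homography to one detected player position. The matrix values and pixel coordinates are made up for the example, and the thesis's actual calibration procedure is not described in the abstract.

```python
def project_point(H, x, y):
    """Map image pixel (x, y) to court coordinates using a 3x3 homography H.

    H is a row-major list of three rows. The point is lifted to homogeneous
    coordinates (x, y, 1), multiplied by H, and divided by the third component.
    """
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return xh / w, yh / w


# Hypothetical homography, assumed to have been estimated beforehand from
# known court landmarks (for example, the four court corners) in one venue.
H_VENUE = [
    [0.01, 0.0, 0.0],
    [0.0, 0.02, 0.0],
    [0.0, 0.001, 1.0],
]

# Hypothetical pixel location of a defender's feet in one video frame.
court_x, court_y = project_point(H_VENUE, 900.0, 500.0)
```

Because each venue gets its own matrix while the output coordinate system stays fixed, positions from different camera angles become directly comparable, which is the reason the abstract gives for projecting onto a standardized court model.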
A Corpus-Based Approach to English Adversative Coordination
(2025) Weizel, Oliver L.; Fellbaum, Christiane Dorothea
What is the difference between “and” and “but”? In many sentences, they can be freely interchanged; consider that both “the weather is sunny and cold” and “the weather is sunny but cold” are true if and only if the weather is sunny and the weather is cold. What then causes speakers to choose “and” over “but” and vice versa? To answer this question, I gather data from the Corpus of Contemporary American English and investigate properties of the distributions of the two conjunctions. I find that “and” is the more unmarked and neutral of the two, while “but” is more likely to appear when greater contrasts exist between the two conjuncts themselves, or more broadly in more salient contexts. Along the way, novel analyses for the underlying syntactic structure of certain uses of “but” are proposed.
RadSched: A Latency Optimizing Scheduler for Stateful Serverless Edge Computing
(2025) Mindel, Jonathan; Lloyd, Wyatt A.
This thesis presents and evaluates RadSched, a latency-optimizing scheduler designed for stateful, per-function execution in a serverless edge computing environment. In distributed systems, where data consistency and latency vary by time and location, selecting the optimal edge location for function execution becomes a complex decision. Built on top of the Radical framework, RadSched maintains records of network conditions and learns from past data consistency outcomes to automatically route function requests to the optimal edge location. The system employs an ϵ-greedy exploration strategy to adapt to shifting network conditions and data availability, thereby ensuring responsiveness. Through empirical evaluation across multiple AWS regions, this thesis demonstrates that RadSched maintains median latency comparable to baseline systems in a stable environment, though with higher tail latency, a tradeoff that allows the system to route functions to a shifting optimal edge in volatile environments. Ultimately, by abstracting edge selection away from the client, RadSched both improves performance and simplifies developer interaction with stateful serverless functions on the edge.
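As a rough illustration of the ϵ-greedy strategy mentioned above, the sketch below picks an edge location from a table of latency estimates and updates that table after each observation. It is not RadSched's implementation: the function names, the moving-average update, the region names, and the latency values are all assumptions made for the example.

```python
import random


def choose_edge(latency_estimates, epsilon, rng):
    """Pick an edge location with an epsilon-greedy rule.

    With probability epsilon, explore a uniformly random edge; otherwise
    exploit the edge with the lowest estimated latency.
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(latency_estimates))
    return min(latency_estimates, key=latency_estimates.get)


def update_estimate(latency_estimates, edge, observed_ms, alpha=0.2):
    """Blend a new latency observation into the running estimate (assumed update rule)."""
    old = latency_estimates[edge]
    latency_estimates[edge] = (1 - alpha) * old + alpha * observed_ms


# Hypothetical starting estimates in milliseconds.
estimates = {"us-east-1": 40.0, "us-west-2": 55.0, "eu-west-1": 90.0}
rng = random.Random(0)

# epsilon=0.0 disables exploration so the choices here are deterministic.
greedy_pick = choose_edge(estimates, epsilon=0.0, rng=rng)

# One slow response from the current best edge shifts its estimate upward.
update_estimate(estimates, "us-east-1", observed_ms=140.0)
pick_after_shift = choose_edge(estimates, epsilon=0.0, rng=rng)
```

In practice a small nonzero `epsilon` keeps occasional traffic flowing to edges that currently look worse, which is how the estimates stay fresh when conditions change. That exploration is also one plausible source of the higher tail latency the abstract reports.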
Estimating Displacement via Nighttime Satellite Imagery
(2025) Aziz, Rawand D.; Weinberg, Matt
Nighttime satellite imagery provides data that may be useful in addressing the needs of people affected by war. Using Syria as a case study, I seek to build a tool for estimating the number of people who flee a given city after a battle. To do this, I build a machine learning model that estimates a given city’s population, evaluating several candidate models that analyze different features of the lights over the city. With the results achieved, I hope to provide humanitarian organizations that operate in conflict zones with actionable information that is useful for planning their crisis response and relief efforts.
Fine-tuning Small Language Models for Javanese Translation
(2025) Menezes, Trivan T.; Fellbaum, Christiane Dorothea
Natural language processing tools remain scarce for low-resource languages, even those with large speaker populations. This research investigates the potential for improving machine translation for Javanese, an Indonesian language with over 80 million speakers but limited digital resources. We analyze how well three fine-tuning techniques (supervised fine-tuning (SFT), model distillation, and Chain-of-Thought (CoT) distillation) improve Javanese-to-Indonesian translation quality using open-weight models (T5, mT5, Gemma 3 4B, Aya 8B, Aya 32B), comparing them against zero-shot and many-shot baselines from larger proprietary models.
Evaluation using BLEU, TER, chrF, and BERTScore reveals that while large models like Gemini 2.0 Flash achieve top performance, fine-tuning significantly boosts the performance of smaller models. The first stage of SFT on the 500-example NusaX dataset produced a significant improvement in translation quality. Subsequent model distillation and CoT distillation yielded only marginal improvements over SFT, suggesting diminishing returns potentially limited by pre-training knowledge. The improvements were still tangible, with the fine-tuned 4-billion-parameter Gemma 3 4B model achieving performance comparable to, and sometimes exceeding, much larger models like GPT-4o in a zero-shot setup.
The results show that fine-tuning smaller, accessible models offers a resource-efficient path to high-quality translation for low-resource languages like Javanese, potentially enabling deployment on edge devices and broadening access to NLP technologies for underserved linguistic communities.
ReRoute: A Consistent Serverless Map Application Running Near Users for Lower Latencies
(2025) Wang, Donna; Lloyd, Wyatt A.
Map applications, which require low-latency responses for a seamless global user experience, are well-suited for cloud deployment. However, their stateful nature and the need for strong consistency present significant challenges in fully leveraging the low-latency advantages of running computations at edge locations, or data centers near users. To overcome these limitations, we introduce ReRoute, a serverless, cloud-based map application that integrates Radical, a system designed to preserve consistency guarantees from primary data centers while enabling reduced latencies at edge locations. Our evaluation demonstrates that integrating with Radical substantially lowers median end-to-end latency, particularly for clients near edge locations, highlighting its potential to enhance the performance of map applications.
A Computer Vision Approach to Analyzing Player Movement
(2025) Aguirre, Maria F.; Heide, Felix
This thesis provides a new resource for squash performance analysis by developing a computer vision system that integrates advanced object detection and tracking techniques. By combining YOLOv8 for precise player detection with StrongSORT for multi-object tracking, the system accurately processes game footage to collect player data. A user-assisted manual court-mapping interface developed in this project corrects perspective distortions, making it possible to generate movement-based analytics that reflect on-court dynamics, such as control of the critical ‘T’ position. The adaptability of the technologies used creates the opportunity to expand this project. Further development offers valuable insight for coaching and performance improvement, and further refinements are expected to enhance the accuracy of detection and tracking.
On the Value of Coronal Magnetic Field Data from NLFFF Extrapolations for Predicting Solar Energetic Particle Events Using Machine Learning Methods
(2025) Fernandes, Ian; Dieng, Adji Bousso
Solar energetic particle (SEP) prediction approaches based on the physical characteristics of solar active regions have predominantly relied on proxies of the free magnetic energy of a region calculated from photospheric magnetograms, despite SEPs forming higher in the corona; this data mismatch is largely due to the limited availability of magnetic vector data in the corona. In this work, we generate approximations of the coronal magnetic field of solar active regions by employing a non-linear force-free field (NLFFF) method that extrapolates magnetic field data from photospheric vector magnetograms upward into the corona. To help make the analysis of these coronal volumes more tractable for lower-complexity models, we develop an approach that estimates the most relevant areas of the volume for SEP prediction purposes and extracts the corresponding cutouts; in this research, we mainly focus on analyzing the mini-cube transversely centered at the polarity inversion region of the volume. Moreover, we parameterize the volumes in a multitude of ways, e.g., by generating several proxies of the free magnetic energy in the corona. To determine the value of this coronal data compared to the typically used photospheric data, we conduct several rounds of grid searches that attempt to find the highest-performing ML models and their hyperparameters for each subset of data. We find that the numeric coronal proxies generated from the volumes do not improve SEP prediction for the models we test compared to numeric photospheric proxies, even when used in combination. We also find evidence suggesting that the coronal volume mini-cubes themselves do not provide enough signal for the convolutional models used in our experiments.
Thus, we emphasize the importance of future work that explores different approaches that both numerically parameterize and natively process coronal volumes for SEP prediction and furthermore suggest the usage of such data in a wide variety of other space weather modeling and prediction tasks (like flare and CME prediction) that may be able to utilize the signal provided by these coronal volumes more efficiently and robustly than the super-rare event prediction task of SEP forecasting.
And the Grammy Goes To...: A Predictive Analysis of Grammy Award Outcomes
(2025) Elsharkawi, Sarah A.; Wayne, Kevin
This project aims to build a predictive model that effectively ranks the winners of the Grammys, a prestigious music award. The project takes a data-driven approach to analyze the factors influencing Grammy recognition, focusing on both commercial success and artistic merit. By integrating data from sources such as Spotify audio features, Billboard chart performance, and Genius lyrics, the model aims to predict Grammy winners in three major categories: Song of the Year, Record of the Year, and Best Rap Song. The results show that a hybrid approach, using both a global model and category-specific models, offers the best performance by capturing both broad trends and category-specific nuances. Although the model demonstrates strong predictive capability, the project highlights areas for improvement, including the need for additional features such as genre, record label information, and artist social media engagement. Future work will focus on expanding the dataset, incorporating Grammy history, and refining the model to provide even more accurate predictions, moving us closer to understanding the factors that drive Grammy success.
A User-Centric Approach to Content Curation in Pantry
(2025) Marin Carabajo, Gabriel S.; Monroy-Hernandez, Andres
The shift towards personalized content consumption in social media has driven the rise of black-box algorithms, which are foundational in delivering tailored experiences. These algorithms are highly effective at delivering tailored experiences with little effort from the user, but they often rely on intrusive signals such as location, watch time, and scrolling behavior. At the same time, alternative social networks have emerged with an added emphasis on decentralization, transparency, and privacy. However, a significant gap remains in this space: the absence of alternative social media applications that provide users with feed curation tools, make these tools accessible to all end-users, and unify fragmented user communities across diverse networks. This paper introduces Pantry, a social media reader application designed to address this gap through the idea of teachable feeds, inspired by existing literature, and powered by an on-device machine learning model. Evaluation through a user study reveals that Pantry succeeds in delivering user-driven feed curation tools and provides several key insights. The results of the user study help inform and advance the future design space.
OpenDeli: Designing a Decentralized Food Delivery Protocol
(2025) Tan, Libo; Monroy-Hernandez, Andres
This study explores how participatory algorithm design can empower gig workers by introducing a configurable Hungarian matching system on OpenDeli, a localized food delivery platform. We implement a web-based prototype that allows couriers to adjust and visualize preference-based matching configurations in real time. Through live study sessions with potential couriers, we found that preference-based matching (e.g., location, food type, compensation) enhanced perceptions of fairness and alignment, even among participants without algorithmic expertise. The study revealed trade-offs between flexibility and information overload, highlighting the need for intuitive preference elicitation. Overall, our findings suggest that transparent, collaborative systems can better reflect worker values and support more equitable gig work experiences.
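To make the idea of preference-based matching concrete, the sketch below builds a courier-by-order cost matrix from per-courier weights and finds the minimum-cost one-to-one assignment. For brevity it searches all permutations by brute force rather than implementing the Hungarian algorithm. Both return an optimal assignment, but the Hungarian algorithm does so in polynomial time. The weights, distances, and pay values are invented for illustration and do not come from the study.

```python
from itertools import permutations


def build_cost(distance_km, pay_usd, weights):
    """Cost of giving order o to courier c: weighted distance minus weighted pay.

    weights[c] is a hypothetical (w_distance, w_pay) pair set by courier c.
    """
    return [
        [w_dist * distance_km[c][o] - w_pay * pay_usd[o] for o in range(len(pay_usd))]
        for c, (w_dist, w_pay) in enumerate(weights)
    ]


def best_assignment(cost):
    """Return the tuple p minimizing total cost, where courier c gets order p[c].

    Brute force over all permutations; fine for a handful of couriers only.
    """
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda p: sum(cost[c][o] for c, o in enumerate(p)),
    )


distance_km = [[1, 4, 5], [2, 1, 6], [3, 3, 1]]  # courier x order
pay_usd = [6, 8, 7]                              # per order

# Every courier cares only about distance.
distance_only = best_assignment(
    build_cost(distance_km, pay_usd, [(1.0, 0.0)] * 3)
)

# Courier 0 reconfigures their preferences to care only about pay.
pay_focused = best_assignment(
    build_cost(distance_km, pay_usd, [(0.0, 1.0), (1.0, 0.0), (1.0, 0.0)])
)
```

Changing one courier's weights changes the whole assignment, not just that courier's order, which is the kind of effect a real-time visualization of matching configurations would need to surface.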
Characterizing National Soccer Identity via K-means Clustering of World Cup Match Performances
(2025) Steinert, Max; Moretti, Christopher M.
This study investigates national playing style identity in professional soccer by applying unsupervised machine learning techniques to match statistics from the 2018 and 2022 FIFA World Cups. Motivated by countries like Spain and Brazil with well-known, signature playing styles, we aim to explore whether other countries exhibit national playing styles in the World Cup and to what extent these styles have cultural and historical ties. Our study uses a dataset of 200 match performances from 24 countries with 21 features that represent in-depth match statistics relating to possession, passing, defensive actions, goalkeeping, and shooting from FBRef.com. We implement four variations of k-means clustering assisted by principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) for clustering and visualization. Our results show seven clusters in the match data, each corresponding to a well-known playing style or strategic approach. We find that countries with strong national soccer identity more frequently use one playing style while other countries vary their playing styles between matches. While we observe some correlation between chosen playing style and geopolitical factors like income, population size, and geographical region, the globalization of soccer markets appears to have diminished these effects. This study demonstrates how national playing styles can be quantitatively identified and used to understand how countries express their identity through professional soccer.
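The core clustering step can be illustrated with a small, dependency-free version of Lloyd's k-means algorithm. This sketch omits the PCA, t-SNE, and UMAP stages the study pairs with k-means, and it uses two made-up features and four made-up match performances rather than the 21 features and 200 performances in the actual dataset.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point in Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Hypothetical (possession share, pass completion rate) per match performance.
matches = [(0.65, 0.90), (0.62, 0.88), (0.35, 0.70), (0.38, 0.72)]
centroids, labels = kmeans(matches, centroids=[matches[0], matches[2]])
```

Here the two possession-heavy performances share one label and the two low-possession performances share the other. With real data, features measured on different scales would typically be standardized first so that no single statistic dominates the distance.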
How to Reignite Suns: A Novel, Digital Way to Experience the Album
(2025) Shin, Claire; Dall'Agnol, Marcel
This thesis presents Remixer Reloaded, a novel digital interface designed to let listeners interact with the author’s album at a granular level. Inspired by traditional digital audio workstations (DAWs) like Logic Pro, but reimagined for non-musicians, the tool enables users to explore individual audio and MIDI tracks from the album How to Reignite Suns through a web interface. Built with React and Tone.js, the application visualizes notes, waveforms, and timing grids, offering intuitive tools like zooming, track isolation, and note labeling. Unlike standard DAWs, the project emphasizes read-only exploration over editing, centering accessibility and music education. Through user testing, the tool was refined for clarity, performance, and engagement. This project merges the author’s identities as a musician and computer scientist to create an original, listener-centered musical experience.
Pioneering High Entropy Alloy Superconductors for Next Generation Qubit Design
(2025) Miryala, Sushma; Adams, Ryan P.
Superconducting high-entropy alloys (HEAs) have recently garnered significant attention across numerous fields due to their unique blend of properties such as increased mechanical strength, structural stability, and tunable electronic properties. These attractive features position HEAs as strong candidates for multiple real-world applications, especially as next-generation superconducting qubit materials, considering their robust performance under extreme conditions such as low temperatures and high magnetic fields. However, HEAs span a vast compositional space, since researchers can choose from a wide range of elements in different proportions, heated over multiple cycles. To navigate this complex space, this study uses Bayesian optimization as a data-driven strategy to expedite the process of discovering and optimizing HEAs with high superconducting performance. Because laboratory experiments are often costly and time-consuming, this approach of iteratively updating a probabilistic model, starting from an initial set of training data, proves beneficial in focusing efforts on only the most promising configurations. It is also crucial to note that this study is the first in the literature to explore and computationally optimize a novel composition of seven specific elements: gold, tin, antimony, palladium, silver, tellurium, and indium. This combination of Bayesian optimization and superconducting HEAs demonstrates a dynamic convergence between machine learning and materials innovation, broadening research horizons for quantum technology and engineering.