Operations Research and Financial Engineering, 2000-2025
Permanent URI for this collection: https://theses-dissertations.princeton.edu/handle/88435/dsp011r66j119j
Browsing Operations Research and Financial Engineering, 2000-2025 by Author "Cattaneo, Matias Damian"
A Sustainable Extension of the Fama-French Factor Models: The Role of Carbon Emissions-Based Factors in Describing U.S. Stock Returns
(2025-04-10) Huang, Elaine L.; Cattaneo, Matias Damian
Amidst climate change concerns, many investors are incorporating climate-related considerations, such as a company's carbon dioxide (CO2) emissions, into their investment decisions. Unfortunately, CO2 data is often missing or estimated. We therefore aim to understand how companies' carbon emissions can describe excess stock returns, and how sector membership and carbon disclosure affect that relationship. We extend the Fama-French (FF) three-factor and five-factor models, which describe stock returns using financial metrics, to also include our constructed "Green-Minus-Brown" (GMB) factors: GMB_U (based on log(CO2) emissions) and GMB_S (based on CO2 intensity). Our results show that (1) both GMB_U and GMB_S are statistically significant and negatively associated with excess stock returns; (2) stocks in greener sectors interact more positively with the GMB factors, stocks in browner sectors interact more negatively, and sectors with less polarizing CO2 emissions tend to have statistically insignificant interactions; and (3) the returns of companies with reported CO2 data are more sensitive to changes in the GMB factors than those with estimated CO2 data. Our research supports the existing literature showing that carbon emissions can describe stock returns, while being the first to build factors from both unscaled and scaled carbon emissions and to analyze performance across sectors and CO2 data sources (i.e., estimated vs. reported). In addition, our GMB factors can be used by companies and investors alike to track the monthly spreads between the excess returns of green stocks and those of brown stocks.
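To make the extension concrete, here is a minimal sketch of the kind of time-series regression an added GMB factor implies: the standard FF three-factor regression with one extra carbon-based regressor. The file name and column names are hypothetical placeholders; this is not the thesis's actual factor construction or code.

```python
# Minimal sketch of a Fama-French three-factor regression extended with a
# GMB (Green-Minus-Brown) factor. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

factors = pd.read_csv("factors_monthly.csv")  # hypothetical monthly panel

# Excess portfolio return regressed on market, size, value, and GMB_U.
y = factors["ret"] - factors["rf"]
X = sm.add_constant(factors[["mkt_rf", "smb", "hml", "gmb_u"]])

model = sm.OLS(y, X).fit()
print(model.summary())  # a significant, negative gmb_u coefficient would
                        # mirror the association reported in the abstract
```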
From Policy to Patient: A Finite-Horizon Markov Decision Process for Optimizing Non-Small Cell Lung Cancer Treatment
(2025-04-10) Parikh, Krishna V.; Cattaneo, Matias Damian
Advancements in immunotherapy have transformed treatment for advanced-stage non-small cell lung cancer (NSCLC). However, the optimal sequencing of chemotherapy, immunotherapy, and combination chemoimmunotherapy remains an open question. Chemotherapy may prime the tumor microenvironment, enhancing immune activation and, as a result, immunotherapy's effectiveness. To explore this timing advantage, we develop a finite-horizon Markov Decision Process (MDP) to model treatment selection over a course of ten cycles. The model incorporates four clinical variables to guide decision making: toxicity, PD-L1 expression (as a proxy for immune activation), disease progression, and overall survival. Transition probabilities and survival outcomes are derived from clinical trial data, and cost is defined as a normalized ratio of burden (toxicity and disease progression) to survival. The results indicate that chemotherapy alone is optimal only under extreme exaggeration of its role in immune activation, or when parameters such as progression are removed from the model. Combined regimens, however, show clear benefit: chemoimmunotherapy followed by immunotherapy proves optimal in all initial states without toxicity or disease progression. Compared with each of the three therapies on its own, the cost of the optimal policy is significantly lower in all cases, highlighting the benefit of an adaptive treatment plan. These findings can inform future clinical trial planning for NSCLC. This work is the first of its kind to integrate immunotherapy and account for dynamic immune activation, providing a novel starting point for more complex treatment optimization.
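For readers unfamiliar with finite-horizon MDPs, the sketch below shows the backward-induction structure such a model uses: at each of the ten cycles, choose the action minimizing immediate cost plus expected cost-to-go. The states, transition probabilities, and cost values here are illustrative placeholders, not the clinically derived values used in the thesis.

```python
# Backward-induction sketch for a finite-horizon treatment MDP.
# All dynamics and costs below are toy placeholders, not clinical data.
import itertools

CYCLES = 10
ACTIONS = ["chemo", "immuno", "chemoimmuno"]
# Toy state: (toxicity, progression), each 0 or 1.
STATES = list(itertools.product([0, 1], repeat=2))

def transition(state, action):
    """Return [(next_state, probability), ...] -- placeholder dynamics."""
    tox, prog = state
    worse = (min(tox + 1, 1), prog)
    p_worse = {"chemo": 0.4, "immuno": 0.2, "chemoimmuno": 0.5}[action]
    return [(worse, p_worse), (state, 1.0 - p_worse)]

def cost(state, action):
    """Placeholder burden-to-survival ratio."""
    tox, prog = state
    burden = 1.0 + tox + prog
    survival = {"chemo": 1.0, "immuno": 1.2, "chemoimmuno": 1.4}[action]
    return burden / survival

# V[t][s] is the minimal expected cost-to-go from state s at cycle t.
V = {CYCLES: {s: 0.0 for s in STATES}}
policy = {}
for t in range(CYCLES - 1, -1, -1):
    V[t], policy[t] = {}, {}
    for s in STATES:
        q = {a: cost(s, a) + sum(p * V[t + 1][s2] for s2, p in transition(s, a))
             for a in ACTIONS}
        policy[t][s] = min(q, key=q.get)
        V[t][s] = q[policy[t][s]]

print(policy[0])  # optimal first-cycle action for each starting state
```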
We Need More Data: The Promise and Peril of Training Large Language Models on Synthetically Generated Text
(2025-04-09) Lam, Gordon K.; Cattaneo, Matias Damian
Our research investigates the viability of using Large Language Models (LLMs) for natural language text augmentation. One major factor behind the significant improvement in LLM performance in recent years has been the increased volume of data used to train models. However, state-of-the-art models have already been trained on nearly the entire internet, effectively exhausting the supply of unique, human-generated data, and that scarcity is emerging as a significant bottleneck for further gains in model performance. To address this issue, we explored data augmentation as a method of synthetic data generation aimed at expanding the size of existing training corpora. While data augmentation has proven effective for expanding dataset sizes and improving model performance in domains like image classification, robust methods for text data remain underdeveloped due to the complex structure of natural language. In our study, we used the Gutenberg English dataset to generate augmented versions of long-form passages with a state-of-the-art large language model. We then trained three identical ~124M-parameter GPT-2-style models to convergence: one on the original dataset, one on the synthetic dataset, and one on a combination of both. Across nearly all evaluation benchmarks, including in-distribution and zero-shot tasks, the model trained solely on human-generated data outperformed the others. These findings underscore not only the role data quality plays in improving model performance during pretraining, but also the risks of relying on synthetically generated data, even though past gains have largely been driven by data volume. Our work also highlights limitations in current approaches to assessing text data quality, such as the inadequacy of cosine similarity as a proxy. While our results tell a cautionary tale about the risks of training LLMs on synthetic data, we also suggest directions for future work, particularly in refining synthetic data generation and filtering strategies.
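The claim that cosine similarity is a weak quality proxy can be illustrated with a small sketch: a word-scrambled, grammatically useless variant of a passage scores as high as (or higher than) a faithful paraphrase, because bag-of-words vectors ignore word order. The texts and vectorizer choice below are toy examples, not the thesis's evaluation pipeline.

```python
# Why cosine similarity is a weak proxy for text quality: a shuffled,
# ungrammatical variant can outscore a faithful paraphrase. Toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

original = "the quick brown fox jumps over the lazy dog near the river"
paraphrase = "a fast brown fox leaps over a lazy dog beside the river"
shuffled = "dog the river fox lazy the over jumps quick brown near the"

vecs = TfidfVectorizer().fit_transform([original, paraphrase, shuffled])
sims = cosine_similarity(vecs[0], vecs[1:])
print(f"paraphrase: {sims[0, 0]:.2f}  shuffled: {sims[0, 1]:.2f}")
# The scrambled text gets a perfect score (identical vocabulary) even
# though it would be worthless as training data.
```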