
Publication:

Punishing Memory, Rewarding Amnesia: Direct Preference Optimization as a Framework for Mitigating Undesirable LLM Memorization


Files

written_final_report.pdf (2.84 MB)

Date

2025-04-10


Abstract

Large language models (LLMs) have revolutionized the fields of machine learning and artificial intelligence. Since the advent of the transformer architecture in 2017, companies and governments around the world have invested enormous sums in language models, and with that investment has come rapid scaling of model sizes. While the numbers are not official, GPT-3.5, released by OpenAI, is reported to be around 3.5 billion parameters, whereas the more recent Llama 4 Behemoth is reported to have more than 2 trillion parameters [17]. Although language models have become larger, more efficient, and higher performing, memorization remains a persistent problem: language models of all sizes memorize and regurgitate their training data. While some memorization is necessary for language models to learn basic facts and reasoning, there are concerns that undesirable memorization hurts data privacy, creates copyright issues, and compromises output quality. Many papers have attempted to mitigate LLM memorization, but most efforts focus on extensive preprocessing of datasets or postprocessing of outputs, and methods that alter training itself often degrade generation performance significantly. We propose mitigating LLM memorization using Direct Preference Optimization (DPO), a newer alternative to the standard Reinforcement Learning from Human Feedback (RLHF) and Proximal Policy Optimization (PPO) framework. Under this framework, we reward non-memorized outputs and punish memorized outputs. We hypothesize that this preference framework will incentivize LLMs to generalize rather than rely on verbatim reproduction of training data, mitigating memorization concerns without significantly degrading overall performance. We find that Direct Preference Optimization is a viable framework, reducing the memorization rate of a fine-tuned language model by an average of 88.46% across different sampling temperatures, with no significant degradation in generation performance after DPO training.
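
For readers who want a concrete picture of the objective described in the abstract, the sketch below is a minimal, hypothetical illustration of the standard DPO loss applied to preference pairs in which a non-memorized completion is treated as "chosen" and a memorized, verbatim completion as "rejected." It is not taken from the thesis: the function and variable names are assumptions, and the actual models, data construction, and hyperparameters are detailed in the attached PDF.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a tensor of summed token log-probabilities (one value
    # per prompt) under the trainable policy or a frozen reference model.
    # In the setup described in the abstract, "chosen" would be a
    # non-memorized completion and "rejected" a memorized one.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the policy to prefer the chosen (non-memorized) completion
    # over the rejected (memorized) one, relative to the reference model.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Example with dummy log-probabilities (real values would come from model
# forward passes over prompt/completion pairs):
loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-18.0]),
                torch.tensor([-40.0]), torch.tensor([-20.0]))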
