 

Publication:

Punishing Memory, Rewarding Amnesia: Direct Preference Optimization as a Framework for Mitigating Undesirable LLM Memorization

datacite.rights: restricted
dc.contributor.advisor: Stewart, Brandon Michael
dc.contributor.author: Jeong, Jonathan J.
dc.date.accessioned: 2025-08-06T15:39:57Z
dc.date.available: 2025-08-06T15:39:57Z
dc.date.issued: 2025-04-10
dc.description.abstract: Large language models have revolutionized the fields of machine learning and artificial intelligence. Since the advent of the transformer architecture in 2017, companies and governments across the world have invested trillions of dollars into language models, and model sizes have scaled rapidly alongside that investment. While the numbers are not official, GPT-3.5, released by OpenAI, is reported to be around 3.5 billion parameters, whereas the more recent Llama 4 Behemoth is reported to have more than 2 trillion parameters. [17] Although language models have become larger, more efficient, and higher performing, memorization remains a persistent problem: language models of all sizes memorize and regurgitate their training data. While some memorization is necessary for language models to learn basic facts and reasoning, there are concerns that undesirable memorization hurts data privacy, creates copyright issues, and compromises output quality. Many papers have attempted to mitigate LLM memorization, but most efforts focus on extensive preprocessing of datasets or postprocessing of outputs, and methods that alter the training itself often degrade generation performance significantly. We propose mitigating LLM memorization using Direct Preference Optimization (DPO), a newer alternative to the standard Reinforcement Learning from Human Feedback (RLHF) and Proximal Policy Optimization (PPO) framework. Under this framework, we reward non-memorized outputs and punish memorized ones. We hypothesize that this preference framework will incentivize LLMs to generalize rather than rely on verbatim reproduction of training data, mitigating memorization concerns without significantly degrading overall performance.
We find that Direct Preference Optimization is a viable framework, reducing the memorization rate of a fine-tuned language model by an average of 88.46% across different temperatures. We also find no significant degradation in generation performance after DPO training.
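The preference framework the abstract describes can be illustrated with the standard per-pair DPO objective, where the "chosen" completion is a non-memorized continuation and the "rejected" completion is a verbatim, memorized one. The following is a minimal sketch, not the thesis's implementation; the function name, the beta value, and the example log-probabilities are all illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin).

    Illustrative sketch: "chosen" stands for a non-memorized continuation
    and "rejected" for a memorized (verbatim) one, so minimizing this loss
    pushes the policy away from regurgitating training data relative to
    the frozen reference model.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written out with math.exp
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities: a policy that favors the non-memorized
# continuation (relative to the reference) incurs a lower loss than one
# that favors the memorized continuation.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)   # prefers non-memorized
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)  # prefers memorized
```

In practice the log-probabilities would come from summing token log-likelihoods of each continuation under the policy and reference models; rewarding non-memorized outputs and punishing memorized ones then reduces to minimizing this loss over a dataset of such pairs.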
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01g732dd44h
dc.language.iso: en_US
dc.title: Punishing Memory, Rewarding Amnesia: Direct Preference Optimization as a Framework for Mitigating Undesirable LLM Memorization
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-04-10T21:35:04.737Z
pu.contributor.authorid: 920281839
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name: written_final_report.pdf
Size: 2.84 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission