Punishing Memory, Rewarding Amnesia: Direct Preference Optimization as a Framework for Mitigating Undesirable LLM Memorization
Abstract
Large language models (LLMs) have revolutionized the fields of machine learning and artificial intelligence. Since the advent of the transformer architecture in 2017, companies and governments around the world have invested enormous sums in language models, and model sizes have scaled rapidly alongside this investment. While official figures are scarce, GPT-3.5, released by OpenAI, is reported to have on the order of 175 billion parameters, whereas the more recent Llama 4 Behemoth is reported to have more than 2 trillion parameters [17]. Although language models have become larger, more efficient, and higher performing, memorization remains a persistent problem: models of all sizes memorize and regurgitate their training data. While some memorization is necessary for language models to learn basic facts and reasoning, undesirable memorization raises data privacy concerns, creates copyright issues, and compromises output quality. Many approaches to mitigating LLM memorization have been proposed, but most rely on extensive preprocessing of datasets or postprocessing of outputs, and methods that alter training often degrade generation performance significantly. We propose mitigating LLM memorization using Direct Preference Optimization (DPO), a newer alternative to the standard Reinforcement Learning from Human Feedback (RLHF) framework built on Proximal Policy Optimization (PPO). Within this framework, we reward non-memorized outputs and punish memorized outputs. We hypothesize that this preference framework incentivizes LLMs to generalize rather than rely on verbatim reproduction of training data, mitigating memorization concerns without significantly degrading overall performance. We find that Direct Preference Optimization is a viable framework, reducing the memorization rate of a fine-tuned language model by an average of 88.46% across different sampling temperatures. We also find that DPO training causes no significant degradation in generation performance.
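To make the preference signal concrete, the sketch below shows one way the DPO objective can be written when the "chosen" completion is a non-memorized continuation (rewarded) and the "rejected" completion is a verbatim training-data continuation (punished). This is a minimal illustration, not the paper's implementation: the function and variable names are hypothetical, and it assumes sequence-level log-probabilities from the policy and a frozen reference model have already been computed.

```python
# Minimal sketch of the DPO loss for memorization mitigation (illustrative only).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss where 'chosen' = non-memorized output and
    'rejected' = memorized (verbatim training-data) output."""
    # Implicit rewards: log-prob ratio of policy vs. frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between non-memorized and memorized completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```

Because DPO needs only these paired log-probabilities rather than a separately trained reward model, the anti-memorization preference can be applied directly during fine-tuning.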