Data Repetition Significantly Harms Language Model Performance

Jessica Chudnovsky, Joshua Kazdan, Noam Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho · June 25, 2026

Summary

New research reveals that internal data repetition systematically damages language model performance, leading to substantial compute-equivalent loss. The study quantifies this damage using a modernized scaling law, showing that even moderate repetition can be highly detrimental.

As language models grow, high-quality training data is becoming a bottleneck, and even carefully deduplicated corpora still contain some repetition. This research revisits the impact of data repetition on language model performance, using a modern scaling-law framework to quantify the "Compute-Equivalent Loss": the degradation expressed as the compute budget at which a model trained without repetition would reach the same loss. This allows performance degradation to be compared directly against the compute that is effectively wasted.

The study identifies three systematic ways repetition damages models. First, there is a worst-case "repeat count" at which evaluation loss peaks: repeating a moderately sized subset a moderate number of times causes more damage than either extreme. Second, the location of this peak scales with model size, indicating that the most damaging repetition grows faster than compute. Third, the compute-equivalent loss is substantial: if repeated documents consume just 10% of the FLOPs budget, the resulting performance can be equivalent to that of a no-repetition model trained with only 67% of the FLOPs.

These findings are not exclusive to language models. They can be explained by a statistical tradeoff between memorization and generalization, in which repetition leads to a misspecified statistical model.
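To make the compute-equivalent figure concrete, here is a minimal Python sketch of how such a number could be derived. It assumes a simple power-law loss curve in compute, L(C) = E + A * C^(-alpha); that functional form, the function names, and every constant below are illustrative assumptions, not the scaling law or fitted values from the paper.

```python
def loss_from_compute(compute, irreducible, coeff, exponent):
    """Assumed no-repetition loss curve: L(C) = E + A * C**(-alpha)."""
    return irreducible + coeff * compute ** (-exponent)


def compute_equivalent_fraction(observed_loss, budget, irreducible, coeff, exponent):
    """Fraction of `budget` a no-repetition run would need to reach `observed_loss`.

    Inverts the assumed curve: C_eq = (A / (L - E)) ** (1 / alpha).
    """
    if observed_loss <= irreducible:
        raise ValueError("observed loss must exceed the irreducible loss")
    equivalent_compute = (coeff / (observed_loss - irreducible)) ** (1.0 / exponent)
    return equivalent_compute / budget


if __name__ == "__main__":
    # Hypothetical constants, chosen only to exercise the functions.
    E, A, ALPHA = 1.7, 100.0, 0.15
    budget = 1e20  # FLOPs spent on the run that contained repeated documents

    # Pretend the run with repetition ended at the loss that a clean run
    # would have reached with 67% of the budget.
    observed = loss_from_compute(0.67 * budget, E, A, ALPHA)
    fraction = compute_equivalent_fraction(observed, budget, E, A, ALPHA)
    print(f"compute-equivalent fraction: {fraction:.2f}")
```

Read against the figures quoted above, a fraction of 0.67 means roughly a third of the budget is effectively lost even though only 10% of it was spent on repeated documents.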

Why it matters

For professionals involved in training large language models, understanding the precise impact of data repetition is critical for optimizing resource allocation and model performance. This research provides quantifiable insights to guide data curation strategies, ensuring more efficient compute usage and better model generalization.

How to implement this in your domain

  1. Implement aggressive and sophisticated deduplication techniques during the data curation phase for large language models.
  2. Develop tools to analyze and quantify the "repeat structure" within training corpora to identify potential performance bottlenecks.
  3. Adjust training budgets and model architectures based on the identified compute-equivalent loss from data repetition.
  4. Prioritize the acquisition of novel, high-quality data over simply expanding existing datasets with potentially redundant information.
  5. Educate data scientists and ML engineers on the systematic damage caused by internal data repetition and best practices for mitigation.
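As a starting point for the repeat-structure analysis in the steps above, the following Python sketch profiles exact duplicates in a small in-memory corpus. It is a simplification under stated assumptions: it catches only exact matches after lowercasing and whitespace normalization (not near-duplicates), it uses whitespace-separated words as a stand-in for tokens, and the names `normalize` and `repeat_profile` are hypothetical helpers, not tools from the paper.

```python
import hashlib
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def repeat_profile(documents):
    """Summarize exact-duplicate structure in an iterable of document strings."""
    copies = Counter()   # content hash -> number of copies seen
    doc_tokens = {}      # content hash -> whitespace-token count (proxy for tokens)
    for doc in documents:
        norm = normalize(doc)
        key = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        copies[key] += 1
        doc_tokens[key] = len(norm.split())

    total = sum(copies[k] * doc_tokens[k] for k in copies)
    # Counts every copy of a duplicated document, including the first one.
    repeated = sum(copies[k] * doc_tokens[k] for k in copies if copies[k] > 1)
    return {
        "unique_documents": len(copies),
        # repeat count -> number of distinct documents with that many copies
        "repeat_histogram": dict(Counter(copies.values())),
        "repeated_token_fraction": repeated / total if total else 0.0,
    }


if __name__ == "__main__":
    corpus = ["The cat sat.", "the  cat sat.", "A dog ran fast today.", "Hello"]
    print(repeat_profile(corpus))
```

The histogram shows how many distinct documents occur at each repeat count, and the token fraction gives a rough proxy for the share of training compute spent on repeated documents. A production corpus would typically need streaming and near-duplicate detection, which this sketch does not attempt.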

Who benefits

AI Development · Cloud Computing · Data Science · Research & Development

Key takeaways

  • Internal data repetition systematically degrades language model performance.
  • The damage can be quantified as significant compute-equivalent loss.
  • An intermediate repeat count often causes the most severe performance degradation.
  • Aggressive deduplication and careful data curation are crucial for efficient model training.

Original post by Jessica Chudnovsky, Joshua Kazdan, Noam Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

"arXiv:2606.24998v1 Announce Type: new Abstract: Language models are running out of high-quality training data, and even aggressively deduplicated corpora retain some amount of repetition. Earlier controlled studies predated Chinchilla-style scaling laws and could only measure the…"
