Sample Selection Bias Accelerates AI Model Collapse

Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang · June 15, 2026

Summary

This research demonstrates that sample selection bias in recursive training on synthetic data can precipitate model collapse, especially in low-resource verification regimes. It shows that siloed data selection prunes globally relevant tail modes, and proposes collaborative proxy references as a mitigation.

Recursive training on synthetic data is increasingly used to address data scarcity, but it carries a significant risk: model collapse. Collapse occurs when repeated training erodes data diversity, homogenizing outputs and stripping away distributional tails. Data selection is often proposed as a remedy, but its effectiveness depends on the quality and completeness of the reference distribution used for verification. This paper identifies a vulnerability in low-resource verification environments, such as healthcare consortia or financial institutions with proprietary data, where raw data cannot be pooled: each verifier observes only a small, fragmented, and biased slice of the target data manifold. The selection process therefore becomes biased itself, preferentially retaining samples that match the local distribution while discarding globally relevant, diverse data points.

The authors prove theoretically that this 'siloed selection' accelerates model collapse and induces a power-law decay in data diversity. As an initial mitigation, they propose constructing Wasserstein proxy references from multiple silos without sharing raw, sensitive data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, underscoring the need for caution in recursive synthetic-data pipelines when real-data coverage is fragmented.
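The siloed-selection mechanism can be illustrated with a toy simulation. The sketch below is not the paper's method or experimental setup: the three-mode 1-D target, the resample-plus-noise "generator", the nearest-reference acceptance rule, and every name and parameter (`sample_target`, `accepted_by`, `radius`, and so on) are assumptions chosen only to make the effect visible.

```python
import random

CENTERS = [-4.0, 0.0, 4.0]    # assumed toy target: one head mode, two tail modes
WEIGHTS = [0.15, 0.70, 0.15]

def sample_target(rng, n):
    """Draw n points from the toy three-mode mixture."""
    return [rng.gauss(rng.choices(CENTERS, weights=WEIGHTS)[0], 0.5) for _ in range(n)]

def accepted_by(reference, x, radius=0.3):
    """Assumed verifier rule: accept x if some reference point lies within radius."""
    return any(abs(x - r) <= radius for r in reference)

def recursive_selection(reference, start, rng, generations=5, keep=500, candidates=1000):
    """Repeat generate -> verify against reference -> retrain on what survived."""
    data = list(start)
    for _ in range(generations):
        # Stand-in generator: resample the current data with a little noise.
        synthetic = [rng.choice(data) + rng.gauss(0.0, 0.3) for _ in range(candidates)]
        accepted = [x for x in synthetic if accepted_by(reference, x)]
        if not accepted:
            break  # nothing passed verification; keep the previous generation
        data = rng.sample(accepted, keep) if len(accepted) > keep else accepted
    return data

def tail_fraction(data, threshold=2.0):
    """Share of points lying outside the head mode."""
    return sum(abs(x) > threshold for x in data) / len(data)

def run_demo(seed=0):
    rng = random.Random(seed)
    start = sample_target(rng, 500)
    pooled_reference = sample_target(rng, 300)
    # The silo only ever observes the head mode of the target.
    silo_reference = [x for x in sample_target(rng, 300) if abs(x) < 1.5]
    local = recursive_selection(silo_reference, start, rng)
    pooled = recursive_selection(pooled_reference, start, rng)
    return tail_fraction(local), tail_fraction(pooled)

if __name__ == "__main__":
    local_tail, pooled_tail = run_demo()
    print(f"tail fraction, silo reference:   {local_tail:.3f}")
    print(f"tail fraction, pooled reference: {pooled_tail:.3f}")
```

Under these assumptions the silo-verified run ends with no tail samples at all, because every accepted point must lie within 0.3 of a reference point whose absolute value is below 1.5. The pooled reference covers all three modes, so the tail modes can survive verification.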

Why it matters

Professionals developing and deploying AI models, particularly in data-sensitive or resource-constrained sectors, must understand how biased data selection in synthetic data pipelines can lead to model collapse. Recognizing this risk is crucial for designing robust training strategies that preserve data diversity and ensure reliable model performance.

How to implement this in your domain

  1. Assess the completeness and representativeness of reference distributions used for data verification in AI training pipelines.
  2. Be aware of potential sample selection biases when generating or selecting synthetic data, especially in low-resource environments.
  3. Explore and implement collaborative data reference methods, such as Wasserstein proxy references, to mitigate diversity degradation without sharing raw data.
  4. Establish monitoring mechanisms to detect early signs of model collapse, such as homogenization of outputs or loss of distributional tails.
  5. Prioritize data diversity and representativeness in synthetic data generation to prevent unintended biases and avoid accelerating model collapse.
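The monitoring step can be sketched for a single scalar feature as follows. This is a minimal illustration, not something taken from the paper: the function names, the 5th/95th percentile tail cut-offs, and the `tolerance` threshold are all assumptions. The only standard fact relied on is that the 1-D Wasserstein-1 distance between two equal-size samples is the mean absolute gap between their sorted values.

```python
import statistics

def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D samples: mean absolute gap of sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def tail_mass(samples, low, high):
    """Fraction of samples falling outside the interval [low, high]."""
    return sum(x < low or x > high for x in samples) / len(samples)

def collapse_report(real, synthetic, tolerance=0.5):
    """Compare synthetic data against held-out real data (hypothetical helper)."""
    # Tail cut-offs come from the real data: its 5th and 95th percentiles.
    cuts = statistics.quantiles(real, n=100)   # 99 cut points
    low, high = cuts[4], cuts[94]
    real_tail = tail_mass(real, low, high)
    synth_tail = tail_mass(synthetic, low, high)
    return {
        "real_tail_mass": real_tail,
        "synthetic_tail_mass": synth_tail,
        "wasserstein_1d": wasserstein_1d(real, synthetic),
        # Assumed alarm rule: synthetic tails hold under half the real tail mass.
        "tail_loss_flag": synth_tail < tolerance * real_tail,
    }
```

Calling `collapse_report(real, synthetic)` once per training generation and tracking the tail mass and distance over time is one way to surface tail loss early. Note that if `real` comes from a single silo, the report inherits the same coverage bias the paper warns about.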

Who benefits

Healthcare, BFSI, AI/ML Engineering, Data Science, Pharmaceuticals

Key takeaways

  • Recursive training on synthetic data risks model collapse due to diversity erosion.
  • Sample selection bias, especially in low-resource settings, accelerates model collapse.
  • Siloed data selection preferentially prunes globally relevant data, leading to diversity decay.
  • Collaborative proxy references can mitigate diversity degradation without sharing raw data.

Original post by Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang

"arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy…"

