Scalable Algorithm Boosts Diversity-Aware Data Selection for ML

Richard Yi Da Xu · June 19, 2026

Summary

This paper introduces a scalable continuous relaxation of Determinantal Point Processes (DPPs) for diversity-aware data selection, crucial for large-scale machine learning tasks. The new algorithm, OurMethod, recasts DPP-MAP as a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv), enabling near-linear scaling in ground-set size.
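For orientation, the discrete DPP-MAP problem and the generic shape of an NEPv can be written in standard notation. The symbols below (`L_S` for the kernel submatrix indexed by `S`, `H(X)` for the eigenvector-dependent operator, `\Lambda` for the matrix of eigenvalues) are conventional choices assumed here, not notation taken from the paper, and the specific operator the paper derives is not given in this summary, so it is left abstract.

```latex
% DPP-MAP: choose k of n items maximizing the log-determinant of the kernel submatrix L_S
\max_{S \subseteq \{1,\dots,n\},\; |S| = k} \; \log\det(L_S)

% Stiefel manifold: n-by-k matrices with orthonormal columns
\mathrm{St}(n,k) = \{\, X \in \mathbb{R}^{n \times k} : X^\top X = I_k \,\}

% Generic NEPv: the operator H depends on the eigenvector matrix X itself
H(X)\, X = X \Lambda, \qquad X \in \mathrm{St}(n,k)

% Generic SCF step: freeze H at the current iterate, then solve a linear eigenproblem
H(X_t)\, X_{t+1} = X_{t+1} \Lambda_{t+1}
```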

This research addresses a central challenge in modern machine learning: efficiently selecting a small, diverse, high-quality subset from an enormous pool of candidates. The task underlies data curation, coreset selection for model training, active learning, prompt selection, and retrieval diversification. Determinantal Point Processes (DPPs) offer a principled framework for defining diversity, but their Maximum A Posteriori (MAP) objective, which maximizes the log determinant of a kernel submatrix, is NP-hard to solve exactly. Existing greedy and sampling algorithms scale superlinearly in the ground-set size and become prohibitive for millions or billions of candidates.

The paper reformulates DPP-MAP as a continuous optimization problem on the Stiefel manifold. Its first-order optimality conditions turn out to be a previously unstudied type of Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv). To solve it, the authors introduce a self-consistent field (SCF) iteration with a local contraction guarantee based on the spectral gap. The resulting algorithm, referred to as OurMethod, is an iterative solver in which the diversity objective directly shapes an eigenvector-dependent operator.

OurMethod scales near-linearly in the ground-set size: it runs in `O((ndk+nk^2)t)` time for a small number of iterations `t`, where `n` is the ground-set size, `k` is the subset size, and `d` is the feature dimension. It needs only matrix-vector products with the kernel, which makes it compatible with the low-rank and feature-map kernels common in machine learning. The work is limited to the theoretical relaxation, the solver, and the scaling analysis; empirical validation is left to future studies.
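The per-iteration cost pattern behind the `O((ndk+nk^2)t)` bound can be illustrated with a minimal Python sketch. It is not OurMethod: the paper's eigenvector-dependent operator is not given in this summary, so the sketch assumes a low-rank kernel `L = B @ B.T` and runs a plain orthogonal (subspace) iteration with that fixed operator. The function name `low_rank_subspace_iteration` and all parameters are hypothetical. The point is only that each step touches the kernel through thin matrix products and keeps the iterate on the Stiefel manifold.

```python
import numpy as np

def low_rank_subspace_iteration(B, k, num_iters=20, seed=0):
    """Orthogonal iteration on the Stiefel manifold for L = B @ B.T.

    Illustrative only: uses the fixed operator L, whereas the paper's SCF
    iteration uses an eigenvector-dependent operator not given in this summary.
    """
    n, d = B.shape
    rng = np.random.default_rng(seed)
    # Random orthonormal start: X has shape (n, k) with X.T @ X = I_k.
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(num_iters):
        # Apply L without forming the n x n kernel: two thin products, O(ndk).
        Y = B @ (B.T @ X)
        # Re-orthonormalize the k columns: O(nk^2).
        X, _ = np.linalg.qr(Y)
    return X

rng = np.random.default_rng(1)
B = rng.standard_normal((2000, 16))   # n = 2000 items, d = 16 features (toy sizes)
X = low_rank_subspace_iteration(B, k=4)
```

An SCF-style solver would rebuild the operator from the current `X` inside the loop; as long as that operator can also be applied through products with `B`, the per-iteration cost keeps the same shape.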

Why it matters

If the near-linear scaling carries over to practice, this approach would make principled diversity-aware data selection feasible for massive datasets, which matters for both the quality and the cost of training and fine-tuning large AI models.

How to implement this in your domain

1. Explore integrating OurMethod into data curation pipelines for large-scale model training and fine-tuning.
2. Apply this scalable DPP approach for active learning batch acquisition to select diverse and informative samples.
3. Utilize the algorithm for prompt and exemplar selection in in-context learning to enhance model performance.
4. Implement diversity-aware retrieval systems to provide more varied and relevant results to users.
5. Investigate the use of OurMethod in experimental design to select diverse sets of experiments or features.
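Before wiring any DPP-based selector into a pipeline, it can help to have a small reference baseline to compare against. The sketch below is the classical naive greedy heuristic for maximizing `log det(L_S)`, not OurMethod, and it assumes a linear kernel `L = features @ features.T` over item embeddings. The function name `greedy_dpp_map`, the `jitter` parameter, and the toy embeddings are all hypothetical. Its cost grows quickly with pool size, which is the kind of scaling the paper aims to avoid.

```python
import numpy as np

def greedy_dpp_map(features, k, jitter=1e-6):
    """Naive greedy maximization of log det(L_S) with L = features @ features.T.

    Classical greedy baseline, not OurMethod. Every candidate is rescored at
    every step, so this is only suitable for small pools.
    """
    n = features.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            F = features[idx]
            # Kernel submatrix L_S; jitter keeps it numerically positive definite.
            L_S = F @ F.T + jitter * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(L_S)
            if sign > 0 and logdet > best_score:
                best_item, best_score = i, logdet
        selected.append(best_item)
    return selected

# Toy pool: items 0 and 1 are near-duplicates, item 2 is orthogonal to item 0.
features = np.array([
    [1.00, 0.00],
    [0.99, 0.01],
    [0.00, 1.00],
    [0.70, 0.70],
])
picked = greedy_dpp_map(features, k=2)
```

On this toy pool the log-determinant rewards the orthogonal pair and skips the near-duplicate, which is the diversity behavior a scalable solver would need to preserve.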

Who benefits

AI/ML Development, Data Science, Cloud Computing, E-commerce, Research & Development

Key takeaways

A new algorithm, OurMethod, offers scalable diversity-aware data selection using DPPs.
It reformulates DPP-MAP as a Nonlinear Eigenvalue Problem, solvable with an SCF iteration.
The algorithm achieves near-linear scaling, crucial for massive datasets.
This enables more efficient data curation, active learning, and prompt selection in ML.

Original post by Richard Yi Da Xu on X

"arXiv:2606.19411v1 Announce Type: new Abstract: Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning -- data curation and coreset selection for training and fine-tuning large models, active-learning…"
