BLADE Boosts LLM Training Efficiency with Adaptive Data Selection

Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang · June 18, 2026

Summary

Researchers introduced BLADE, a Hessian-free framework for scalable bi-level adaptive data selection in Large Language Model (LLM) training. It reformulates influence-based optimization as a penalized single-level objective and keeps a reference model synchronized with training, so that uninformative data can be filtered efficiently and learning trajectories improved.

A new framework called BLADE (Bi-Level Adaptive Data sElection) aims to make Large Language Model (LLM) training more efficient as datasets grow to trillions of tokens. Data selection filters out noise and builds adaptive learning trajectories, but existing methods have limitations. Influence-based methods are principled but require inverse-Hessian computations that are intractable at LLM scale. Excess-loss methods are efficient but rely on a static reference model that can become misaligned with the model being trained.

BLADE addresses both issues by using Lagrange multipliers to reformulate the bi-level optimization problem behind influence-based methods as a penalized single-level objective. The reformulation avoids inverse-Hessian computations and yields a dynamic reference model that stays synchronized with the evolving proxy model during training. The framework comes with a first-order convergence guarantee and is instantiated as a memoryless randomized block-coordinate Frank-Wolfe algorithm for efficient online batch selection. According to the authors, extensive experiments show BLADE consistently outperforming state-of-the-art data selection baselines, which they present as a practical solution for LLM training.
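The step from a bi-level problem to a penalized single-level objective can be illustrated with a generic value-function penalty, a standard way to collapse two levels into one. This is a sketch under assumed notation, not the paper's exact objective: the selection weights w, per-sample losses \ell_i, validation loss L_val, penalty coefficient \lambda, feasible set \mathcal{W}, and inner minimizer \hat{\theta}(w) are all names introduced here for illustration.

```latex
% Generic bi-level data selection (assumed notation, not taken from the paper)
\min_{w \in \mathcal{W}} \; L_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta} L_{\mathrm{train}}(\theta, w),
\qquad
L_{\mathrm{train}}(\theta, w) = \sum_{i} w_i \, \ell_i(\theta)

% Value-function penalty: the bracket is nonnegative and vanishes
% exactly when \theta solves the lower-level problem
\min_{w \in \mathcal{W},\, \theta} \;
L_{\mathrm{val}}(\theta)
+ \lambda \Bigl( L_{\mathrm{train}}(\theta, w)
               - \min_{\theta'} L_{\mathrm{train}}(\theta', w) \Bigr)

% Derivative of the penalty in w_i (envelope theorem, assuming a unique
% inner minimizer \hat{\theta}(w)); no Hessian appears
\frac{\partial}{\partial w_i}
\lambda \Bigl( L_{\mathrm{train}}(\theta, w)
             - \min_{\theta'} L_{\mathrm{train}}(\theta', w) \Bigr)
= \lambda \bigl( \ell_i(\theta) - \ell_i(\hat{\theta}(w)) \bigr)
```

Under this generic form, the signal for each sample is a difference of its loss at two parameter vectors. That has the same shape as an excess-loss score, with the inner minimizer playing the role of a reference model that moves as the weights change. Whether this matches BLADE's derivation in detail is not stated in the summary.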

Why it matters

AI engineers and researchers could use BLADE to choose training data more selectively during LLM training. If the reported gains hold, that would mean less compute spent on uninformative data, faster development cycles, and more capable models.

How to implement this in your domain

  1. Integrate BLADE into LLM training pipelines to optimize data selection and reduce computational costs.
  2. Apply BLADE to large-scale datasets to filter uninformative tokens and improve learning trajectories.
  3. Experiment with BLADE's dynamic reference model to maintain synchronization during long training runs.
  4. Utilize the memoryless randomized Frank-Wolfe algorithm for efficient online batch selection in LLM pre-training.
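The online batch selection mentioned in the steps above can be sketched roughly as follows. This is an illustrative toy, not BLADE's implementation: it assumes the score for each sample is its excess loss (proxy loss minus reference loss, with larger meaning more worth training on), that the feasible set is "pick k of n samples" relaxed to weights in [0, 1] summing to k, and that "memoryless" means taking a full Frank-Wolfe step so no weights are carried between batches. The function names are hypothetical.

```python
def lmo_capped_simplex(grad, k):
    """Linear minimization oracle over {w in [0,1]^n : sum(w) = k}.

    Minimizing <grad, w> over this set puts weight 1 on the k
    coordinates with the smallest gradient entries.
    """
    order = sorted(range(len(grad)), key=lambda i: grad[i])
    vertex = [0.0] * len(grad)
    for i in order[:k]:
        vertex[i] = 1.0
    return vertex


def frank_wolfe_step(weights, grad, k, step):
    """One Frank-Wolfe update: move from `weights` toward the LMO vertex."""
    vertex = lmo_capped_simplex(grad, k)
    return [(1.0 - step) * w + step * s for w, s in zip(weights, vertex)]


def select_batch(proxy_losses, ref_losses, k):
    """Pick k samples from one incoming batch (one random block of coordinates).

    Assumption: we want to maximize the selected excess loss, so the
    gradient with respect to each weight is minus that sample's excess loss.
    """
    grad = [-(p - r) for p, r in zip(proxy_losses, ref_losses)]
    # Full step (step = 1.0): the result is exactly the LMO vertex, so the
    # starting weights do not matter and nothing is stored between batches.
    weights = frank_wolfe_step([0.0] * len(grad), grad, k, step=1.0)
    return [i for i, w in enumerate(weights) if w > 0.5]


# Toy batch of four samples with made-up per-sample losses.
proxy = [2.0, 0.5, 1.5, 3.0]  # losses under the model being trained
ref = [1.0, 0.4, 1.4, 1.0]    # losses under the reference model
print(select_batch(proxy, ref, k=2))  # -> [0, 3]
```

With a full step the procedure reduces to a top-k selection by excess loss, which is what keeps it cheap per batch. In a real pipeline the two loss lists would come from forward passes of the proxy and reference models, and how the reference model is kept synchronized is specific to the paper and not shown here.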

Who benefits

AI Development, Cloud Computing, Data Science, Software Engineering

Key takeaways

  • BLADE offers a scalable, Hessian-free approach for adaptive data selection in LLM training.
  • It dynamically synchronizes a reference model, overcoming limitations of static methods.
  • The framework guarantees first-order convergence for efficient optimization.
  • BLADE consistently outperforms existing data selection baselines, improving LLM performance.

Original post by Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

"arXiv:2606.18650v1. Abstract: As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering…"
