Online Algorithm Optimizes LLM Selection Under Dynamic Constraints

Yin Huang, Qingsong Liu, Jie Xu · June 17, 2026

Summary

This paper presents a novel online learning algorithm for selecting Large Language Models (LLMs) in edge-cloud inference systems, addressing challenges like model heterogeneity, stochastic performance, and time-varying demand. The algorithm uses confidence-bound estimates and demand predictions to balance reward maximization with hard resource budgets and soft service-level requirements.

Deploying Large Language Models (LLMs) in edge-cloud inference systems is difficult for three reasons: candidate models have very different accuracy, latency, and cost profiles; their observed performance is stochastic; and user demand changes over time. Choosing the right LLM for each incoming task is crucial for maintaining service quality and using resources well, especially under strict monetary budgets and latency guarantees.

The authors formulate this as a constrained stochastic bandit learning problem. The system chooses models sequentially while respecting both hard resource limits (packing-type constraints) and soft service-level agreements (covering-type constraints), without prior knowledge of the underlying reward, cost, or latency distributions. It must also adapt to fluctuating task demand.

To solve it, they propose an online learning algorithm that combines confidence-bound estimates with predictions of future demand, trading off reward maximization against long-term satisfaction of all constraints. The reported theoretical guarantees are sublinear regret and sublinear constraint violation relative to an ideal offline benchmark. Experiments on synthetic workloads further support its robustness in dynamic, resource-constrained environments.
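To make the idea of confidence-bound selection under constraints concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it is a generic primal-dual UCB heuristic, it omits the demand-prediction component, and every name in it (`ArmStats`, `select_model`, `run`, the three hypothetical models and their parameters) is an assumption made for illustration. Cost stands in for the packing-type constraint and an on-time indicator stands in for the covering-type constraint; all signals are assumed to be normalized to [0, 1].

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ArmStats:
    """Running means for one candidate model; all signals assumed in [0, 1]."""
    pulls: int = 0
    reward: float = 0.0
    cost: float = 0.0
    on_time: float = 0.0

    def update(self, reward, cost, on_time):
        self.pulls += 1
        n = self.pulls
        self.reward += (reward - self.reward) / n
        self.cost += (cost - self.cost) / n
        self.on_time += (on_time - self.on_time) / n


def radius(pulls, t):
    """Hoeffding-style confidence radius for a mean of `pulls` samples."""
    return math.sqrt(2.0 * math.log(max(t, 2)) / pulls)


def select_model(stats, t, lam_cost, lam_sla):
    """Return the index with the highest optimistic Lagrangian score."""
    best, best_score = 0, -math.inf
    for i, s in enumerate(stats):
        if s.pulls == 0:
            return i  # try every model once before trusting estimates
        r = radius(s.pulls, t)
        score = (
            min(1.0, s.reward + r)                 # optimistic reward
            - lam_cost * max(0.0, s.cost - r)      # optimistic (low) cost
            + lam_sla * min(1.0, s.on_time + r)    # optimistic SLA credit
        )
        if score > best_score:
            best, best_score = i, score
    return best


def run(models, horizon, cost_per_task, sla_target, step=0.05, seed=0):
    """Simulate selection over `horizon` tasks.

    `models` is a list of hypothetical (mean_reward, mean_cost, on_time_prob)
    triples; the learner never reads them directly, only sampled feedback.
    """
    rng = random.Random(seed)
    stats = [ArmStats() for _ in models]
    lam_cost = lam_sla = 0.0
    for t in range(1, horizon + 1):
        i = select_model(stats, t, lam_cost, lam_sla)
        mean_reward, mean_cost, on_time_prob = models[i]
        reward = 1.0 if rng.random() < mean_reward else 0.0
        cost = min(1.0, max(0.0, rng.gauss(mean_cost, 0.05)))
        on_time = 1.0 if rng.random() < on_time_prob else 0.0
        stats[i].update(reward, cost, on_time)
        # Dual ascent: a constraint's price rises while it is being violated.
        lam_cost = max(0.0, lam_cost + step * (cost - cost_per_task))
        lam_sla = max(0.0, lam_sla + step * (sla_target - on_time))
    return stats, lam_cost, lam_sla


if __name__ == "__main__":
    # Hypothetical models: small edge, mid-size, large cloud.
    candidates = [(0.55, 0.10, 0.95), (0.70, 0.35, 0.85), (0.90, 0.80, 0.60)]
    stats, lam_cost, lam_sla = run(candidates, 2000, cost_per_task=0.4, sla_target=0.8)
    for i, s in enumerate(stats):
        print(i, s.pulls, round(s.reward, 2), round(s.cost, 2), round(s.on_time, 2))
```

In this sketch the budget is enforced only on average, through the dual price `lam_cost`. A system with a truly hard budget would also need an explicit stop or fallback rule once the budget is exhausted.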

Why it matters

This research offers a principled way to route tasks across LLMs in resource-constrained deployments, where budgets are fixed, service levels must be met over time, and model performance is not known in advance. Professionals involved in MLOps, cloud infrastructure, and AI service delivery can draw on it to build more resilient and economical LLM inference systems.

How to implement this in your domain

  1. Implement dynamic LLM selection strategies in edge-cloud inference systems using constrained bandit algorithms.
  2. Integrate demand prediction models to inform real-time resource allocation and model switching for AI services.
  3. Develop monitoring systems to track confidence-bound estimates for LLM performance metrics (accuracy, latency, cost).
  4. Define clear packing-type (e.g., budget) and covering-type (e.g., latency SLA) constraints for LLM deployment.
  5. Explore applying similar online learning techniques to other resource management problems in distributed AI systems.
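Steps 2 and 4 above can be sketched in a few lines of Python. This is one possible starting point, not a method taken from the paper. The class names, the exponentially weighted moving average (EWMA) predictor, and the budget-pacing rule are all assumptions chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackingConstraint:
    """Hard limit: cumulative usage must stay at or below `budget`."""
    name: str
    budget: float


@dataclass(frozen=True)
class CoveringConstraint:
    """Soft requirement: the average service level should reach `target`."""
    name: str
    target: float


class EwmaDemandPredictor:
    """Exponentially weighted moving average of tasks per time slot."""

    def __init__(self, alpha=0.3, initial=1.0):
        self.alpha = alpha
        self.estimate = initial

    def observe(self, arrivals):
        # Blend the newest arrival count into the running estimate.
        self.estimate = self.alpha * arrivals + (1 - self.alpha) * self.estimate

    def predict_remaining(self, slots_left):
        # Naive forecast: assume the current rate holds for all remaining slots.
        return self.estimate * slots_left


def per_task_allowance(constraint, spent, predicted_remaining_tasks):
    """Spread the unspent budget evenly over the tasks still expected."""
    remaining = max(0.0, constraint.budget - spent)
    if predicted_remaining_tasks <= 0:
        return remaining
    return remaining / predicted_remaining_tasks


if __name__ == "__main__":
    budget = PackingConstraint("spend_usd", 100.0)           # hypothetical values
    sla = CoveringConstraint("on_time_rate", 0.9)
    predictor = EwmaDemandPredictor(alpha=0.5, initial=10.0)
    predictor.observe(20)
    expected_tasks = predictor.predict_remaining(slots_left=4)
    print(per_task_allowance(budget, spent=40.0, predicted_remaining_tasks=expected_tasks))
```

The per-task allowance could serve as the cost threshold that a selection policy compares against when demand is expected to rise or fall. A production system would likely replace the EWMA with a forecasting model fitted to its own traffic.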

Who benefits

Cloud Computing, Edge AI, Telecommunications, SaaS, MLOps

Key takeaways

  • A new algorithm optimizes LLM selection under dynamic constraints and time-varying demand.
  • It balances reward maximization with hard resource budgets and soft service-level requirements.
  • The method operates without prior knowledge of model performance distributions.
  • Theoretical guarantees and experimental results confirm its effectiveness and robustness.

Original post by Yin Huang, Qingsong Liu, Jie Xu

"arXiv:2606.17489v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is cri…"
