Tail-Aware Scheduling Optimizes LLM Inference Latency Without Prediction

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta · June 18, 2026

Summary

This paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that significantly reduces tail latency (P90-P99) and time-to-first-token (TTFT). It replaces explicit length prediction with soft priority boosting driven by statistical signals and co-optimizes scheduling with cache-aware preemption.

Serving Large Language Models (LLMs) is challenging because output length varies so widely that size-based scheduling becomes difficult in practice. Current LLM schedulers often approximate Shortest Job First (SJF) or Shortest Remaining Processing Time (SRPT) by predicting decode lengths or ranks, and they primarily report mean-centric metrics such as Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT). These prediction-driven policies can be unstable under shifts in data distribution, bursty arrival patterns, and GPU memory pressure. Even with perfect knowledge of decode lengths, they offer limited control over tail latency (P90-P99), which is a dominant factor in user experience.

To address these limitations, the authors propose a distribution-aware, prediction-free scheduling framework. It replaces explicit length prediction with a soft priority boosting mechanism driven by lightweight statistical signals. The design also co-optimizes scheduling with cache-aware preemption, accounting for memory-coupled decode dynamics across workload mixes.

Evaluations on both production and open-source traces show that the method reduces P99 Time-To-Last-Token (TTLT) by up to 35-50% relative to SRPT, even when SRPT has perfect length knowledge. It also reduces TTFT by 34-47% across diverse workloads, including reasoning-heavy and chat-heavy ones. These results point to a robust alternative for optimizing tail latency in online LLM serving.
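To make the distinction between mean-centric and tail metrics concrete, here is a minimal Python sketch that computes mean and P90/P99 values of TTFT and TTLT from per-request timestamps. The `RequestTrace` record, its field names, and the nearest-rank percentile definition are assumptions made for illustration; the summary does not say how the paper computes its percentiles.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one served request (hypothetical record layout)."""
    arrival: float
    first_token: float
    last_token: float


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(traces):
    """Summarize mean and tail latency for a list of RequestTrace records."""
    ttft = [t.first_token - t.arrival for t in traces]  # time to first token
    ttlt = [t.last_token - t.arrival for t in traces]   # time to last token
    return {
        "mean_ttft": sum(ttft) / len(ttft),
        "p99_ttft": percentile(ttft, 99),
        "mean_ttlt": sum(ttlt) / len(ttlt),
        "p90_ttlt": percentile(ttlt, 90),
        "p99_ttlt": percentile(ttlt, 99),
    }


if __name__ == "__main__":
    # Synthetic data: 97 short requests and 3 very long ones.
    traces = [RequestTrace(0.0, 0.2, 1.0) for _ in range(97)]
    traces += [RequestTrace(0.0, 0.2, 30.0) for _ in range(3)]
    print(latency_report(traces))
```

On this synthetic input, mean TTLT is 1.87 s while P99 TTLT is 30 s, which shows how a scheduler can look healthy on the mean and still leave a small fraction of users with a very slow response.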

Why it matters

For companies deploying LLMs in production, this scheduling framework offers a significant improvement in user experience by drastically reducing tail latency, leading to more responsive and reliable AI services without relying on fragile prediction models.

How to implement this in your domain

1. Re-evaluate your current LLM inference scheduling strategies, especially concerning tail latency.
2. Explore implementing prediction-free, distribution-aware scheduling frameworks for LLM serving.
3. Integrate soft priority boosting driven by statistical signals instead of explicit length predictions.
4. Co-optimize scheduling with cache-aware preemption to manage GPU memory effectively for diverse workloads.
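Steps 3 and 4 can be prototyped with a toy scheduler like the one below. This is a sketch under stated assumptions, not the paper's algorithm: the summary does not specify which statistical signals or formulas the authors use. Here the base priority is attained service (tokens generated so far, which needs no length prediction), the statistical signal is an empirical latency quantile over recently completed requests, and the names `Request`, `age_threshold`, `score`, `pick_batch`, and `pick_victim`, along with the weights, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival: float        # arrival time in seconds
    tokens_done: int = 0  # decode tokens generated so far
    kv_blocks: int = 0    # KV-cache blocks currently held


def age_threshold(recent_latencies, q_percent=90):
    """Assumed statistical signal: empirical quantile of recently completed request latencies."""
    if not recent_latencies:
        raise ValueError("no latency samples")
    ordered = sorted(recent_latencies)
    index = min(len(ordered) - 1, q_percent * len(ordered) // 100)
    return ordered[index]


def score(req, now, threshold, boost_weight=50.0):
    """Lower score runs first.

    The base term is attained service. The boost is soft: it grows linearly
    once the request's age exceeds the threshold, rather than forcing the
    request to the front of the queue.
    """
    overdue = max(0.0, (now - req.arrival) - threshold)
    return req.tokens_done - boost_weight * overdue


def pick_batch(requests, now, threshold, batch_size, boost_weight=50.0):
    """Select the batch_size requests with the lowest scores."""
    ranked = sorted(requests, key=lambda r: score(r, now, threshold, boost_weight))
    return ranked[:batch_size]


def pick_victim(running, now, threshold, cache_weight=2.0):
    """Pick a request to preempt under memory pressure.

    Prefers low-priority (high-score) requests, but discounts those holding
    many KV-cache blocks, since evicting them implies more recomputation.
    """
    return max(running, key=lambda r: score(r, now, threshold) - cache_weight * r.kv_blocks)


if __name__ == "__main__":
    threshold = age_threshold([float(x) for x in range(1, 11)])
    reqs = [
        Request("a", arrival=0.0, tokens_done=400, kv_blocks=40),
        Request("b", arrival=18.0, tokens_done=10, kv_blocks=2),
        Request("c", arrival=15.0, tokens_done=200, kv_blocks=20),
    ]
    print([r.req_id for r in pick_batch(reqs, 20.0, threshold, 2)])
    print([r.req_id for r in pick_batch(reqs, 20.0, threshold, 2, boost_weight=0.0)])
    print(pick_victim(reqs, 20.0, threshold).req_id)
```

In this example the long-running request `a` has waited past the threshold, so the boost moves it into the batch; with `boost_weight=0.0` it would be left behind in favor of `b` and `c`. The weights are arbitrary and would need tuning against real traces before any conclusions are drawn.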

Who benefits

Cloud Computing, SaaS, Telecommunications, Gaming, Customer Service

Key takeaways

LLM inference tail latency is a critical user experience factor often overlooked by prediction-based schedulers.
A new prediction-free, distribution-aware framework significantly reduces P99 TTLT and TTFT.
It uses soft priority boosting from statistical signals and cache-aware preemption.
The framework offers a robust alternative for optimizing online LLM serving performance.

Original post by Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

"arXiv:2606.18431v1 Announce Type: new Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such a…"
