CONCORD Boosts RAG Throughput in Device-Cloud Private Data Settings

Xuedong Hu, Zhiqing Tang, Zhi Yao, Tian Wang, Weijia Jia · June 16, 2026

Summary

CONCORD is a new framework designed to enhance Retrieval-Augmented Generation (RAG) performance in scenarios where private documents reside on edge devices and public knowledge is in the cloud. It achieves significant throughput improvements and reduced communication by using asynchronous sparse aggregation, addressing privacy and latency constraints.

Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge at inference time. Device-cloud collaborative inference raises a new challenge: small language models on edge devices need to read private documents locally while also drawing on public knowledge held in the cloud. Privacy and policy constraints often prevent the two sides from exchanging documents directly, which creates a "document-isolated dual-end RAG" setting. Existing RAG methods in this setting suffer from high latency and bandwidth usage because they synchronize frequently and transfer dense evidence.

To overcome this, the authors propose CONCORD, an asynchronous sparse aggregation framework. CONCORD treats the cloud as an asynchronous evidence source. It uses "waiting debt control" to decide when to wait for remote input, and a "certificate-guided minimal supplementation" mechanism to request only the remote evidence that is essential. As a result, many decoding steps commit locally without any remote evidence, while the steps that do consult the cloud preserve the greedy token decisions of dense dual-end aggregation.

Experiments show that CONCORD improves end-to-end throughput by 1.66x to 2.15x and reduces per-token communication by over two orders of magnitude, while maintaining comparable answer quality.
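To make the certificate idea concrete, here is a minimal Python sketch. It is not CONCORD's actual algorithm, which this summary does not spell out. It assumes, hypothetically, that dense aggregation adds one remote score to each token's local score, and that every remote score is known to lie in the interval [remote_lo, remote_hi]. Under that assumption, a token whose local score trails the local leader by more than remote_hi - remote_lo can never win, so only the remaining candidates need remote evidence. The names tokens_needing_remote, decide_token and fetch_remote are illustrative only.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def tokens_needing_remote(
    local_scores: Sequence[float], remote_lo: float, remote_hi: float
) -> List[int]:
    """Return token ids that could still be the aggregated argmax.

    ASSUMPTION: aggregated score = local score + remote score, and every
    remote score lies in [remote_lo, remote_hi].
    """
    slack = remote_hi - remote_lo  # largest possible swing between two tokens
    best = max(local_scores)
    return [i for i, s in enumerate(local_scores) if best - s <= slack]


def decide_token(
    local_scores: Sequence[float],
    remote_lo: float,
    remote_hi: float,
    fetch_remote: Callable[[List[int]], Dict[int, float]],
) -> Tuple[int, int]:
    """Pick the greedy token and report how many remote scores were requested."""
    candidates = tokens_needing_remote(local_scores, remote_lo, remote_hi)
    if len(candidates) == 1:
        # Certified: no remote scores within the bounds can change the winner,
        # so the step commits locally with zero communication.
        return candidates[0], 0
    # Sparse supplementation: ask the cloud only about the contested tokens.
    remote = fetch_remote(candidates)
    winner = max(candidates, key=lambda i: local_scores[i] + remote[i])
    return winner, len(candidates)
```

Under the stated bound, the token returned here matches the argmax of the full dense sum, because every excluded token's aggregated score is strictly below that of the local leader. A real system would need a sound way to obtain such bounds, which this sketch simply takes as given.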

Why it matters

This framework offers a practical solution for deploying RAG systems in privacy-sensitive and resource-constrained environments, enabling efficient use of LLMs on edge devices without compromising data security or performance.

How to implement this in your domain

  1. Evaluate CONCORD's architecture for potential integration into existing device-cloud RAG deployments.
  2. Implement the asynchronous sparse aggregation techniques to optimize communication between edge devices and cloud services.
  3. Develop "waiting debt control" mechanisms to intelligently manage remote evidence requests based on latency and bandwidth.
  4. Design "certificate-guided minimal supplementation" to reduce the volume of data transferred for RAG queries.
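As a starting point for step 3, the sketch below shows one plausible shape for a waiting controller. The summary does not describe CONCORD's actual rule, so this is a hypothetical budget-style policy: waiting on the cloud adds to a running "debt" in milliseconds, each locally committed token pays some of it back, and the device waits only if the debt would stay under a cap. The class name WaitingDebt and both parameters are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WaitingDebt:
    """Hypothetical waiting-budget controller (not CONCORD's published rule)."""

    allowance_per_token_ms: float  # ASSUMED: debt repaid per committed token
    max_debt_ms: float             # ASSUMED: cap on accumulated waiting debt
    debt_ms: float = 0.0

    def should_wait(self, expected_wait_ms: float) -> bool:
        """Wait for remote evidence only if the cap would not be exceeded."""
        return self.debt_ms + expected_wait_ms <= self.max_debt_ms

    def record_wait(self, actual_wait_ms: float) -> None:
        """Charge the time actually spent blocked on the cloud."""
        self.debt_ms += actual_wait_ms

    def on_token_committed(self) -> None:
        """Repay part of the debt each time a token is committed."""
        self.debt_ms = max(0.0, self.debt_ms - self.allowance_per_token_ms)
```

In a decoding loop, the device would call should_wait with its current latency estimate before blocking on the cloud, and fall back to a local commit when the answer is False. Tuning the two parameters trades throughput against how often remote evidence is consulted.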

Who benefits

Healthcare, Finance, Edge Computing, Telecommunications, IoT

Key takeaways

  • CONCORD optimizes RAG for device-cloud settings with document isolation.
  • It uses asynchronous sparse aggregation to reduce communication and improve throughput.
  • The framework maintains comparable answer quality while raising end-to-end throughput by 1.66x to 2.15x and cutting per-token communication by over two orders of magnitude.
  • It addresses privacy concerns by keeping sensitive documents on edge devices.

Original post by Xuedong Hu, Zhiqing Tang, Zhi Yao, Tian Wang, Weijia Jia

"arXiv:2606.15179v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small l…"

