SPSD Compresses LLM Prompts at Edge, Reduces Cloud Energy Costs

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan · June 19, 2026

Summary

SPSD (Sentiment Preserving Semantic Distillation) is an edge-based pipeline that uses a 4-bit quantized Small Language Model to compress user prompts before sending them to a cloud LLM, significantly reducing input token costs and cloud energy consumption. It achieves this by removing "social scaffolding" while largely preserving response quality and sentiment.

The "prefill" stage of Large Language Model (LLM) inference is a major contributor to the energy consumption of cloud-scale AI. Many user prompts, especially in conversational or customer-support contexts, contain "social scaffolding" (polite phrases, apologies, and repetitions) that matters for human interaction but carries little semantic information for machine reasoning. The researchers call this disparity the "Social-Semantic Gap."

To address it, they propose SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline. A 4-bit quantized Small Language Model (SLM) compresses each prompt on the user's device before it is transmitted to a larger, cloud-deployed LLM. The goal is to remove non-essential social scaffolding while retaining the core semantic and emotional content.

In evaluations with Gemma-2-2B-Instruct as the SLM and Llama-3.1-8B-Instruct as the cloud model, the system saved a mean of 99.9 input tokens per distilled call, and every tested distilled call yielded positive savings. Response quality, assessed by a blind LLM-as-judge comparison, was non-inferior to raw prompts within a specified margin, with outcomes spread across ties, distilled wins, and raw wins. Estimated net energy savings range from 70 to 270 µWh per call.

For safety-critical domains, a rule-based gate conservatively routes prompts to passthrough, so they reach the cloud model uncompressed. Together, these results point to a practical way to reduce cloud LLM costs while maintaining quality.
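To put the per-call figures in perspective, the short sketch below scales them linearly to a larger number of distilled calls. The per-call numbers come from the summary above; the call volume of one million and the assumption that savings scale linearly are illustrative and are not taken from the paper.

```python
# Per-call figures reported in the summary above.
SAVINGS_UWH_LOW = 70.0             # microwatt-hours per distilled call (low estimate)
SAVINGS_UWH_HIGH = 270.0           # microwatt-hours per distilled call (high estimate)
MEAN_TOKENS_SAVED_PER_CALL = 99.9  # mean input tokens saved per distilled call

UWH_PER_WH = 1_000_000  # 1 Wh = 1,000,000 microwatt-hours


def aggregate_savings(calls: int) -> dict:
    """Scale the reported per-call figures linearly to `calls` distilled calls.

    Linear scaling is an assumption made for illustration only.
    """
    return {
        "tokens_saved": calls * MEAN_TOKENS_SAVED_PER_CALL,
        "energy_wh_low": calls * SAVINGS_UWH_LOW / UWH_PER_WH,
        "energy_wh_high": calls * SAVINGS_UWH_HIGH / UWH_PER_WH,
    }


if __name__ == "__main__":
    # Hypothetical volume: one million distilled calls.
    print(aggregate_savings(1_000_000))
```

Under these assumptions, one million distilled calls would correspond to roughly 70 to 270 Wh of net energy saved, which suggests that the benefit depends on very high call volumes.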

Why it matters

This innovation offers a practical solution for reducing the operational costs and environmental impact of cloud-based LLM inference by optimizing prompt transmission, making LLM deployment more efficient and sustainable.

How to implement this in your domain

  1. Assess current LLM inference costs, particularly for prompt prefill, to identify potential savings.
  2. Investigate deploying a small, quantized language model on edge devices for prompt compression.
  3. Implement a prompt distillation pipeline to remove "social scaffolding" while preserving core semantics.
  4. Establish evaluation metrics for response quality and sentiment preservation after prompt compression.
  5. Develop rule-based gates for safety-critical applications to ensure uncompressed prompt passthrough.
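The routing logic in steps 3 and 5 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the keyword patterns, the `RoutedPrompt` type, the whitespace token counter, and the regex-based `rule_based_distill` function (a stand-in for the on-device SLM) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical safety patterns; a real deployment would need a vetted list.
SAFETY_PATTERNS = [r"\bchest pain\b", r"\boverdose\b", r"\bemergency\b"]

# Hypothetical scaffolding phrases used by the stand-in distiller below.
SCAFFOLDING_PATTERNS = [
    r"\bhello\b[,!.]?",
    r"\bi am so sorry to bother you\b[,!.]?",
    r"\bthank you so much\b[,!.]?",
    r"\bplease\b[,!.]?",
]


@dataclass
class RoutedPrompt:
    text: str          # what is sent to the cloud LLM
    route: str         # "distilled" or "passthrough"
    tokens_saved: int  # estimated input tokens saved


def count_tokens(text: str) -> int:
    """Whitespace token count; a placeholder for the cloud model's tokenizer."""
    return len(text.split())


def is_safety_critical(prompt: str) -> bool:
    """Rule-based gate: any pattern match forces passthrough."""
    return any(re.search(p, prompt, flags=re.IGNORECASE) for p in SAFETY_PATTERNS)


def rule_based_distill(prompt: str) -> str:
    """Stand-in for the edge SLM: strips a few fixed phrases, nothing more."""
    out = prompt
    for pattern in SCAFFOLDING_PATTERNS:
        out = re.sub(pattern, " ", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip(" ,")


def route_prompt(prompt: str,
                 distill: Callable[[str], str] = rule_based_distill) -> RoutedPrompt:
    if is_safety_critical(prompt):
        return RoutedPrompt(prompt, "passthrough", 0)
    candidate = distill(prompt)
    saved = count_tokens(prompt) - count_tokens(candidate)
    # Fall back to the raw prompt if distillation emptied it or saved nothing.
    if not candidate or saved <= 0:
        return RoutedPrompt(prompt, "passthrough", 0)
    return RoutedPrompt(candidate, "distilled", saved)


if __name__ == "__main__":
    print(route_prompt("Hello, I am so sorry to bother you, but my router "
                       "keeps dropping the connection. Thank you so much!"))
    print(route_prompt("Hello, I have chest pain, please help."))
```

Passing the distiller in as a callable keeps the gate and fallback logic independent of the model, so the regex stand-in could be swapped for a call into a quantized SLM without changing the routing. Note that the stand-in makes no attempt to preserve sentiment, which is the harder part of what SPSD is described as doing.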

Who benefits

Customer Service, Telecommunications, Cloud Computing, Mobile Applications, IoT

Key takeaways

  • "Social scaffolding" in prompts adds input tokens, and therefore prefill energy cost, while carrying little semantic information.
  • SPSD uses edge-based SLMs to compress prompts, reducing input tokens and energy.
  • Prompt compression can maintain response quality within practical non-inferiority margins.
  • Estimated net energy savings are 70 to 270 µWh per call for cloud LLM inference.
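One way to make the non-inferiority takeaway concrete is sketched below. It scores each paired judge verdict as +1 (distilled win), 0 (tie), or -1 (raw win) and checks whether a one-sided lower confidence bound on the mean score stays above the negative margin. This is a generic normal-approximation check, assumed here for illustration; the summary does not state which test or margin the paper uses, and the counts in the usage example are made up.

```python
import math


def non_inferiority_check(distilled_wins: int, ties: int, raw_wins: int,
                          margin: float, z: float = 1.645) -> dict:
    """Normal-approximation non-inferiority check on paired judge verdicts.

    Scores: +1 distilled win, 0 tie, -1 raw win.
    z = 1.645 corresponds to a one-sided 95% confidence bound.
    """
    n = distilled_wins + ties + raw_wins
    if n == 0:
        raise ValueError("need at least one comparison")
    mean = (distilled_wins - raw_wins) / n
    # For scores in {-1, 0, +1}, E[x^2] is the fraction of non-tie verdicts.
    second_moment = (distilled_wins + raw_wins) / n
    variance = max(second_moment - mean ** 2, 0.0)
    stderr = math.sqrt(variance / n)
    lower_bound = mean - z * stderr
    return {
        "mean_diff": mean,
        "lower_bound": lower_bound,
        "non_inferior": lower_bound > -margin,
    }


if __name__ == "__main__":
    # Hypothetical verdict counts and margin, for illustration only.
    print(non_inferiority_check(distilled_wins=30, ties=40, raw_wins=30, margin=0.2))
```

A balanced split of wins with many ties yields a mean score near zero, so the verdict then depends mainly on the sample size and the chosen margin.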

Original post by Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

"arXiv:2606.19364v1 Announce Type: new Abstract: The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, rep…"
