Lightweight Transformers Benchmarked for On-Device Fault Detection

Disha Patel · June 24, 2026

Summary

This study benchmarks lightweight transformer models against traditional ML for on-device fault detection on resource-constrained hardware. It evaluates accuracy, model size, and latency across three public datasets and finds that transformers can match traditional ML on well-separated sensor data, but only at a much higher resource cost. It also proposes a two-stage adaptive inference pipeline to recover efficiency.

Researchers benchmarked lightweight transformer architectures against traditional machine learning methods for on-device fault detection. The goal was to assess their suitability for resource-constrained hardware, weighing accuracy against latency and model size. The evaluation covered transformers such as DistilBERT, TinyBERT, and MobileBERT alongside traditional algorithms such as Random Forest and XGBoost, on three public datasets: NASA C-MAPSS, SECOM, and UCI AI4I 2020 (referred to below as UCI-PM).

On well-separated sensor data (C-MAPSS), lightweight transformers matched traditional ML in F1-score (87.8%), but at the cost of a 100x larger model and 9000x higher latency. TinyBERT-4L emerged as the most deployment-friendly transformer, balancing size (55 MB) and CPU latency (18 ms). INT8 dynamic quantization reduced model size by a further 25% while largely preserving accuracy.

The study also proposed a two-stage adaptive inference pipeline that routed 97.9% of predictions through a quantized triage model and only 2.1% to a larger expert model. This achieved a comparable F1-score (87.6%) at an average latency of 19.5 ms. However, both traditional and transformer methods struggled significantly on the severely imbalanced datasets (SECOM and UCI-PM), which points to a fundamental limitation of these approaches under extreme class imbalance.
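The two-stage pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the confidence-margin routing rule, the `margin` value, the `two_stage_predict` and `expected_latency_ms` names, and the assumption that every sample runs through the triage model before a possible escalation are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model is any callable that maps a feature
# vector to the estimated probability that the sample is a fault.
Model = Callable[[Sequence[float]], float]


@dataclass
class RoutingStats:
    """Counts how many samples each stage ended up deciding."""
    triage: int = 0
    expert: int = 0

    @property
    def expert_fraction(self) -> float:
        total = self.triage + self.expert
        return self.expert / total if total else 0.0


def two_stage_predict(
    samples: List[Sequence[float]],
    triage_model: Model,
    expert_model: Model,
    margin: float = 0.2,  # assumed uncertainty band around 0.5, not from the study
) -> Tuple[List[int], RoutingStats]:
    """Label each sample 0 (healthy) or 1 (fault), escalating uncertain cases."""
    labels: List[int] = []
    stats = RoutingStats()
    for x in samples:
        p = triage_model(x)
        if abs(p - 0.5) < margin:
            # The triage model is unsure, so pay for the larger model.
            p = expert_model(x)
            stats.expert += 1
        else:
            stats.triage += 1
        labels.append(1 if p >= 0.5 else 0)
    return labels, stats


def expected_latency_ms(triage_ms: float, expert_ms: float,
                        expert_fraction: float) -> float:
    """Average latency when every sample runs triage and a fraction also runs the expert."""
    return triage_ms + expert_fraction * expert_ms
```

With this structure, the average latency stays close to the triage model's latency as long as the escalated fraction is small, which is the effect the 97.9% / 2.1% split relies on. In practice the margin would be tuned on validation data to trade escalation rate against F1-score.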

Why it matters

For professionals in industrial IoT, manufacturing, and edge computing, this benchmark provides crucial insights into selecting appropriate models for on-device fault detection. Understanding the trade-offs between model complexity, resource consumption, and performance is vital for deploying effective and efficient predictive maintenance solutions.

How to implement this in your domain

  1. Evaluate the resource constraints of your target edge devices for fault detection applications.
  2. Consider lightweight transformer models like TinyBERT-4L for well-separated sensor data, balancing accuracy with deployment feasibility.
  3. Implement INT8 dynamic quantization to reduce model size and improve inference speed on edge devices.
  4. Explore a two-stage adaptive inference pipeline to optimize latency and resource usage by routing simpler cases to smaller models.
  5. Address extreme class imbalance in your datasets, as both traditional ML and transformers struggle in such scenarios.
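For the first step, a small timing harness is often enough to check whether a candidate model fits a device budget. The sketch below uses only the Python standard library. The `benchmark_latency` and `fits_budget` helpers, the warm-up and run counts, and the choice of p95 as the latency criterion are assumptions for illustration, and `predict` stands in for whatever inference call your model exposes.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark_latency(
    predict: Callable[[Sequence[float]], object],
    sample: Sequence[float],
    warmup: int = 10,
    runs: int = 100,
) -> Dict[str, float]:
    """Time single-sample inference and return latency statistics in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm caches and lazy initialisation before timing
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def fits_budget(stats: Dict[str, float], model_size_mb: float,
                max_latency_ms: float, max_size_mb: float) -> bool:
    """Check a model against a device budget (both limits are chosen by the deployer)."""
    return stats["p95_ms"] <= max_latency_ms and model_size_mb <= max_size_mb
```

Run the harness on the target device itself rather than a development machine, since CPU latency figures such as the 18 ms reported for TinyBERT-4L depend heavily on the hardware.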

Who benefits

Manufacturing · Industrial IoT · Automotive · Aerospace · Energy

Key takeaways

  • Lightweight transformers can achieve high accuracy for on-device fault detection but demand significantly more resources than traditional ML.
  • TinyBERT-4L offers a good balance of size and latency for deployment-friendly transformer models.
  • INT8 dynamic quantization effectively reduces model size while largely preserving performance.
  • An adaptive inference pipeline can optimize latency by routing predictions through a triage model.
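To make the quantization takeaway concrete, here is a minimal sketch of the weight side of INT8 quantization, written in plain Python so the arithmetic is visible. It assumes symmetric per-tensor scaling, which is one common scheme and not necessarily the one used in the study. The `quantize_int8` and `dequantize` names are hypothetical. Dynamic quantization additionally quantizes activations at run time, which this sketch does not cover.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid division by zero for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each quantized weight needs one byte instead of the four used by a 32-bit float, so a fully quantized tensor shrinks by about 4x. The overall saving for a whole model depends on which layers are actually quantized.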

Original post by Disha Patel

"arXiv:2606.24173v1 Announce Type: new Abstract: On-device fault detection enables real-time diagnostics without cloud dependency, but deploying machine learning models on resource-constrained hardware demands careful tradeoffs between accuracy, latency, and model size. We present…"

Originally posted by Disha Patel on X
