Improved Quranic ASR with Fine-Tuned Transformer Models

Nabil Mosharraf Hossain (Greentech Apps Foundation, United Kingdom), Riasat Islam (Greentech Apps Foundation, United Kingdom, Queen Mary University of London, United Kingdom), Unaizah Obaidellah (University of Malaya, Malaysia) · June 19, 2026

Summary

This study systematically evaluates the fine-tuning of pretrained Transformer models (Wav2Vec2.0, HuBERT, XLS-R) for Quranic Automatic Speech Recognition (ASR). It identifies the factors that most affect transcription accuracy, reports a best Word Error Rate (WER) of 0.08 on the EveryAyah subset, and cuts training time from 140 hours to 40 hours.

Automatic Speech Recognition (ASR) for Quranic recitation suffers from high error rates on user-recited verses and from incomplete coverage of the Quranic corpus. This research presents an empirical study of fine-tuning pretrained Transformer-based models to improve Quranic ASR performance. It investigates three self-supervised speech feature extractors, Wav2Vec2.0, HuBERT, and XLS-R, which learn context-aware speech features from unlabeled audio. The pretrained models were fine-tuned on more than 870 hours of professional and user Quranic recitations.

Through ablation studies, the researchers analyzed how feature extractors, output label formats, training strategies, and audio clip durations affect transcription accuracy. The most effective configuration achieved a Word Error Rate (WER) of 0.08 on the EveryAyah subset and 0.11 on a combined dataset, a five-percentage-point improvement over the Citrinet baseline. It also reduced training time from 140 hours to 40 hours. Key findings are that Arabic text without diacritics yields the best fine-tuning results and that Wav2Vec2-XLSR-53 provides the strongest overall speech representation.
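To make the reported figures concrete, a WER of 0.08 corresponds to roughly 8 word-level errors per 100 reference words. The sketch below uses the standard definition of WER (word-level edit distance divided by the number of reference words). It is an illustration only, not the authors' evaluation code; the function name `word_error_rate` and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Tokenization by whitespace is an assumption made for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word in a ten-word reference gives a WER of 0.1.
example = word_error_rate("a b c d e f g h i j", "a b c d e f g h i x")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.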

Why it matters

More accurate and faster-to-train Quranic ASR supports better tools for memorization, search, and religious education, and the approach may carry over to ASR in other specialized language domains.

How to implement this in your domain

  1. Adopt pretrained Transformer models like Wav2Vec2.0 or XLS-R for domain-specific ASR tasks.
  2. Conduct ablation studies on speech feature extractors, label formats, and training strategies to optimize ASR performance.
  3. Curate high-quality, domain-specific datasets, including both professional and user-generated content, for fine-tuning.
  4. Consider using simplified text representations (e.g., without diacritics) for improved ASR fine-tuning in certain languages.
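As an illustration of the last step, the sketch below strips Arabic diacritics from transcript labels before fine-tuning. The summary does not say how the authors normalized their text, so the set of code points removed here (the harakat range U+064B to U+0652 plus the superscript alef U+0670) and the function name `strip_diacritics` are assumptions. Quranic annotation marks in other Unicode ranges are deliberately left untouched.

```python
import re

# Assumed set of marks to remove: fathatan through sukun (U+064B-U+0652)
# and the superscript alef (U+0670). A real pipeline may need a wider set.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed Arabic diacritic marks removed."""
    return _ARABIC_DIACRITICS.sub("", text)


# Fully vocalized "bismi llahi" written with escapes so the marks are explicit.
vocalized = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0627\u0644\u0644\u0651\u064E\u0647\u0650"
)
plain = strip_diacritics(vocalized)
```

Removing the marks shrinks the output vocabulary the model has to predict, which is one plausible reason undiacritized labels are easier to fine-tune on, though the summary itself does not give the cause.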

Who benefits

EdTech, Religious Services, Language Learning, AI Research, Media & Entertainment

Key takeaways

  • Fine-tuning pretrained Transformer models improves Quranic ASR accuracy, reaching a WER of 0.08 on the EveryAyah subset and 0.11 on the combined dataset.
  • Wav2Vec2-XLSR-53 provides the strongest speech representation for this domain.
  • Arabic text without diacritics yields the best fine-tuning results.
  • The best configuration reduces training time from 140 hours to 40 hours while also lowering WER.

Original post by Nabil Mosharraf Hossain (Greentech Apps Foundation, United Kingdom), Riasat Islam (Greentech Apps Foundation, United Kingdom, Queen Mary University of London, United Kingdom), Unaizah Obaidellah (University of Malaya, Malaysia)

"arXiv:2606.19747v1 Announce Type: new Abstract: Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rat…"

