LLMs Show Promise, Limitations in Aphasia Discourse Assessment

Jason M Pittman, Yesenia Medina-Santos, Anton Phillips Jr., Brielle C. Stark · June 16, 2026

Summary

A study investigated whether instruction-tuned large language models can reliably identify Correct Information Units (CIUs) in aphasic discourse transcripts, a time-intensive task for human raters. While zero-shot prompting was insufficient, few-shot prompting enabled models like Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B to achieve competitive F1 scores, though their agreement with human annotation is not yet sufficient for fully autonomous use.

Researchers explored whether instruction-tuned large language models (LLMs) can automatically classify Correct Information Units (CIUs) in transcripts of discourse from individuals with aphasia. CIUs are a critical metric for assessing communicative informativeness in aphasia, but manual scoring is labor-intensive and requires specialized training. The study benchmarked several publicly available LLMs under both zero-shot and few-shot prompting conditions. Zero-shot prompting was ineffective. Few-shot prompting improved performance significantly, with models such as Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B achieving F1 scores between 0.776 and 0.817. These models showed high recall but lower precision, which points to a tendency to over-classify tokens as CIUs. Performance also varied with aphasia severity and was weakest in more severe cases. LLMs show promise for supporting CIU identification, but their current reliability is not yet sufficient for fully autonomous clinical application, so a human-in-the-loop approach is still necessary.
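
To make the link between high recall, lower precision, and over-classification concrete, here is a minimal sketch of token-level precision, recall, and F1 for the CIU class. The function name and the toy labels are invented for illustration. They are not the study's evaluation code or data.

```python
from typing import Sequence


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int]
) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 for the CIU class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Invented toy labels (1 = CIU, 0 = not a CIU); not data from the study.
gold = [1, 1, 0, 0, 1, 0, 1, 0]
pred = [1, 1, 1, 0, 1, 1, 1, 0]  # two non-CIU tokens are labeled as CIUs

p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the model finds every true CIU, so recall is perfect, but the two extra positive labels pull precision down. That is the over-classification pattern the summary describes.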

Why it matters

Automating CIU identification could significantly reduce the time and resources required for aphasia assessment, making it more accessible and efficient for clinicians. Professionals in healthcare and AI development should note the potential for LLMs to assist in complex clinical tasks, while also recognizing the current need for human oversight.

How to implement this in your domain

  1. Explore integrating few-shot prompted LLMs into existing clinical workflows for preliminary CIU scoring to assist human raters.
  2. Develop user interfaces that allow clinicians to easily review and correct LLM-generated CIU classifications, ensuring human-in-the-loop validation.
  3. Investigate fine-tuning smaller, specialized models on larger aphasic discourse datasets to improve precision and reduce over-classification.
  4. Collaborate with speech-language pathologists to refine LLM prompting strategies and evaluation metrics for better alignment with clinical needs.
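
As a rough illustration of the first two steps, the sketch below assembles a few-shot prompt from labeled examples and parses the model's reply, returning None for malformed output so the transcript can go to a human rater. The instruction wording, the label format, and both function names are assumptions. The study's actual prompts are not reproduced here, and no particular model API is called.

```python
from typing import Optional

# Hypothetical instruction text; the study's actual prompt wording is not known here.
INSTRUCTION = (
    "Label each word in the transcript as a CIU (1) or not a CIU (0). "
    "Answer with one 0 or 1 per word, separated by spaces."
)


def build_few_shot_prompt(
    examples: list[tuple[list[str], list[int]]], target: list[str]
) -> str:
    """Assemble the instruction, labeled examples, and the unlabeled target."""
    parts = [INSTRUCTION, ""]
    for words, labels in examples:
        if len(words) != len(labels):
            raise ValueError("each example needs one label per word")
        parts.append("Transcript: " + " ".join(words))
        parts.append("Labels: " + " ".join(str(label) for label in labels))
        parts.append("")
    parts.append("Transcript: " + " ".join(target))
    parts.append("Labels:")
    return "\n".join(parts)


def parse_labels(response: str, n_words: int) -> Optional[list[int]]:
    """Return per-word labels, or None if the reply is malformed."""
    tokens = response.split()
    if len(tokens) != n_words or any(t not in ("0", "1") for t in tokens):
        return None  # send to a human rater instead of guessing
    return [int(t) for t in tokens]


# Invented example words and labels, for illustration only.
examples = [(["um", "the", "boy", "kicks"], [0, 1, 1, 1])]
prompt = build_few_shot_prompt(examples, ["he", "uh", "runs"])
print(prompt)
```

Treating any reply that does not match the expected format as "needs review" keeps the human rater in control of every final score, in line with the human-in-the-loop recommendation above.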

Who benefits

Healthcare, Medical Technology, AI Development, Education, Research

Key takeaways

  • LLMs can assist in identifying Correct Information Units (CIUs) in aphasic discourse with few-shot prompting.
  • Few-shot prompting significantly outperforms zero-shot for this specialized clinical task.
  • Current LLM performance is not yet sufficient for fully autonomous CIU scoring, requiring human oversight.
  • LLMs show high recall but lower precision, indicating a tendency to over-classify CIUs.

Original post by Jason M Pittman, Yesenia Medina-Santos, Anton Phillips Jr., Brielle C. Stark

"arXiv:2606.15696v1 Announce Type: new Abstract: Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained rater…"

