LLMs Show Promise, Limits in Coding Humanitarian Data.

Jerome Marston, Tino Kreutzer, Salomé Garnier, Ella Boone, Phuong N Pham, Patrick Vinck · June 26, 2026

Summary

A benchmark study compared 46 LLMs against human experts for coding qualitative humanitarian data, finding that some LLMs can achieve comparable reliability with structured prompts. However, models struggle with nuanced needs, indirect expressions, and protection-relevant concerns, highlighting the need for human oversight.

Humanitarian organizations struggle to interpret qualitative data from affected populations quickly and consistently, and often lack the resources for large-scale analysis. Large Language Models (LLMs) offer a potential solution, but their reliability in coding such nuanced data has been largely unproven. This study addresses that gap by benchmarking 46 LLMs against human expert adjudication. It used 150 high-fidelity synthetic humanitarian transcripts and evaluated the models with inter-rater reliability (Krippendorff's alpha), discrepancy analysis, and qualitative assessment against humanitarian-specific criteria.

The findings indicate that several LLMs can perform deductive coding at reliability levels similar to experienced human coders, especially when guided by structured prompts and reasoning-enabled configurations. The study also revealed significant limitations: LLMs struggled to recognize needs expressed indirectly, to identify needs outside the predefined categories, and to discern protection-relevant concerns such as physical safety and discrimination.

This suggests that LLMs can augment analytical capacity but are not substitutes for human judgment. Effective deployment requires structured codebooks, reasoning-enabled models, careful attention to theme-specific performance, and tiered human oversight, particularly for sensitive data where miscoding could have severe programmatic consequences. The study also suggests running open-weight models on self-hosted infrastructure for better data governance.
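The summary names Krippendorff's alpha as the inter-rater reliability measure but does not show how it is computed. The sketch below is a minimal pure-Python version for nominal (categorical) codes, assuming each transcript receives one label per coder and that a missing label is recorded as None. The function name and the toy labels are illustrative assumptions, not taken from the study, which may have used a different variant or tooling.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: one inner list per coded unit, holding the
    label each coder assigned (None marks a missing label).
    """
    # Build the coincidence matrix: every ordered pair of labels within a
    # unit contributes 1 / (m - 1), where m is the number of labels present.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (label, _other), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (off-diagonal mass only).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Toy example (hypothetical labels): a human coder and an LLM on four units.
pairs = [["food", "food"], ["food", "shelter"],
         ["shelter", "shelter"], ["shelter", "shelter"]]
alpha = krippendorff_alpha_nominal(pairs)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Computing alpha separately per theme is one way to surface the theme-specific weaknesses the study describes.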

Why it matters

This study provides crucial insights for humanitarian organizations and AI developers on the practical applicability and limitations of LLMs for sensitive data analysis, guiding responsible and effective AI integration in critical aid efforts.

How to implement this in your domain

1. Pilot LLM-assisted coding for qualitative data in non-critical humanitarian contexts.
2. Develop structured codebooks and detailed prompting strategies for LLM deployment.
3. Implement a tiered human oversight system, focusing on high-risk categories for review.
4. Evaluate open-weight LLMs and self-hosted infrastructure for enhanced data governance.
5. Train staff on the capabilities and limitations of LLMs in data analysis to ensure responsible use.
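The structured codebook and tiered oversight steps above can be sketched in a few lines. Everything in this example is a hypothetical illustration: the theme names, their definitions, the confidence field, the 0.8 threshold, and the tier labels are assumptions and do not come from the study. The sketch only builds a prompt and routes already-coded output, so it makes no call to any model.

```python
from dataclasses import dataclass

# Hypothetical codebook: code name -> short operational definition.
CODEBOOK = {
    "food": "Respondent reports insufficient food or inability to afford it.",
    "shelter": "Respondent reports inadequate, unsafe, or no housing.",
    "physical_safety": "Respondent reports violence, threats, or fear of harm.",
    "discrimination": "Respondent reports unequal treatment or exclusion.",
}

# Assumption: protection-relevant themes always go to the strictest tier.
HIGH_RISK_THEMES = {"physical_safety", "discrimination"}


def build_prompt(transcript: str) -> str:
    """Assemble a structured deductive-coding prompt from the codebook."""
    lines = [f"- {code}: {definition}" for code, definition in CODEBOOK.items()]
    return (
        "You are coding a humanitarian interview transcript.\n"
        "Apply only the codes defined below, and explain your reasoning "
        "before giving the final list of codes.\n\n"
        "Codebook:\n" + "\n".join(lines) + "\n\n"
        "Transcript:\n" + transcript
    )


@dataclass
class CodedTranscript:
    transcript_id: str
    themes: list[str]   # codes the model assigned
    confidence: float   # hypothetical self-reported score in [0, 1]


def review_tier(coded: CodedTranscript, min_confidence: float = 0.8) -> str:
    """Route model output to a human review tier."""
    if HIGH_RISK_THEMES & set(coded.themes):
        return "expert_review"   # protection concerns: always expert-checked
    if not coded.themes or coded.confidence < min_confidence:
        return "coder_review"    # possible missed or out-of-codebook need
    return "sample_audit"        # routine output: audited by random sample


example = CodedTranscript("t-001", ["food", "physical_safety"], 0.95)
tier = review_tier(example)
```

Routing on the model's own output cannot catch protection concerns the model failed to flag, which is the weakness the study highlights, so the sample audit tier should be sized with that in mind.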

Who benefits

Humanitarian Aid, Non-profit, Social Sciences, Public Policy, Healthcare

Key takeaways

LLMs can achieve human-comparable reliability for deductive coding with structured prompts.
Models struggle with nuanced, indirect, and protection-relevant humanitarian data.
Human judgment and oversight remain critical for sensitive data analysis.
Structured codebooks and reasoning-enabled models are essential for effective LLM use.

Original post by Jerome Marston, Tino Kreutzer, Salomé Garnier, Ella Boone, Phuong N Pham, Patrick Vinck

"arXiv:2606.26541v1 Announce Type: new Abstract: Data from affected populations are crucial for informing humanitarian response, but their value depends on timely and consistent interpretation of nuanced accounts of need. Humanitarian organizations often lack the staff, time, and…"

