HistoriQA: New Multi-Hop QA Dataset for French History

Aurélien Pellet (LRE), Julien Perez (EPITA, LRE), Marie Puren (LRE, CJM) · July 1, 2026

Summary

HistoriQA-ThirdRepublic is a new French-language multi-hop question answering dataset derived from parliamentary debates and newspapers of the French Third Republic (1870-1940). Developed in collaboration with a historian, it captures complex reasoning patterns such as cross-source synthesis and temporal reasoning, and provides a resource for evaluating retrieval-augmented and large language models in domain-specific historical contexts.

Historical research often requires synthesizing information from multiple heterogeneous sources and reasoning about temporal relationships. Traditional NLP benchmarks rarely capture these patterns, which leaves a gap between general language model capabilities and the needs of historical scholarship. To address this, the authors introduce HistoriQA-ThirdRepublic, a French-language dataset designed for multi-hop historical question answering. The corpus is built from parliamentary debates and newspapers of the French Third Republic (1870-1940) and was developed in close collaboration with a historian to ensure its relevance and accuracy. It comprises 1782 questions that emphasize multi-hop connections across diverse historical documents and call for cross-source synthesis, temporal reasoning, and the integration of sparse evidence. The dataset is intended as a resource for evaluating retrieval-augmented and large language models in domain-specific contexts. The authors also present the methodology as adaptable to other languages and national corpora, which would help connect advanced NLP with the practical demands of historical inquiry.
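To make the terms multi-hop, cross-source, and temporal concrete, here is a minimal Python sketch of how one question and its supporting evidence could be represented and checked. The record layout, the field names (`doc_id`, `source_type`, `published`), and the example values are all assumptions made for illustration; the summary does not describe the dataset's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Evidence:
    """One supporting document (hypothetical layout, not the dataset's schema)."""
    doc_id: str
    source_type: str  # assumed labels: "debate" or "newspaper"
    published: date


@dataclass(frozen=True)
class Question:
    text: str
    answer: str
    evidence: Tuple[Evidence, ...]


def is_multi_hop(q: Question) -> bool:
    """A question is multi-hop here if its evidence spans two or more documents."""
    return len({e.doc_id for e in q.evidence}) >= 2


def is_cross_source(q: Question) -> bool:
    """Cross-source: evidence comes from more than one source type."""
    return len({e.source_type for e in q.evidence}) >= 2


def evidence_span_days(q: Question) -> int:
    """Days between the earliest and latest evidence document."""
    dates = [e.published for e in q.evidence]
    return (max(dates) - min(dates)).days


# Placeholder record: identifiers, dates, and text are invented for the example.
example = Question(
    text="(placeholder question)",
    answer="(placeholder answer)",
    evidence=(
        Evidence("debate-001", "debate", date(1881, 6, 1)),
        Evidence("press-001", "newspaper", date(1881, 6, 15)),
    ),
)

print(is_multi_hop(example), is_cross_source(example), evidence_span_days(example))
```

Checks like these are mainly useful for slicing evaluation results, for example reporting accuracy separately on questions whose evidence is spread across both source types or across a long time span.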

Why it matters

For professionals in AI/NLP development, digital humanities, and government archives, this dataset offers a resource for training and evaluating language models on complex, multi-hop historical reasoning in a specialized, French-language domain.

How to implement this in your domain

  1. Utilize HistoriQA-ThirdRepublic to benchmark and fine-tune retrieval-augmented and large language models for domain-specific historical inquiry.
  2. Adapt the methodology for constructing multi-hop QA datasets to other languages or national historical corpora.
  3. Collaborate with historians or domain experts to design datasets that capture complex reasoning patterns relevant to specific fields.
  4. Explore the application of multi-hop QA systems for internal knowledge management or archival research within organizations.
  5. Develop AI tools that can synthesize information from heterogeneous sources, including text and potentially other media, for comprehensive analysis.
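The benchmarking step in the list above can be sketched with three common QA measures: exact match, token-level F1, and recall over the supporting documents a retriever returns. This is a generic sketch, not the evaluation protocol of the paper, which the summary does not describe. The function names are hypothetical, and the example strings and document identifiers are illustrative only and not drawn from the dataset.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, unify Unicode forms, replace punctuation with spaces, squeeze whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so accented letters survive
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evidence_recall(retrieved_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Fraction of gold supporting documents that the retriever returned."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


# Illustrative strings and identifiers only.
print(exact_match("Jules Ferry.", "jules ferry"))
print(round(token_f1("Jules Ferry", "Ferry"), 3))
print(evidence_recall(["doc-a", "doc-b", "doc-x"], ["doc-a", "doc-c"]))
```

Evidence recall is reported separately from answer quality because a multi-hop question can fail at either stage: the retriever may miss one of the required documents, or the model may fail to combine documents it was given.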

Who benefits

Academia, Government, Media, Consulting, EdTech

Key takeaways

  • Historical research requires complex multi-hop reasoning across diverse sources.
  • HistoriQA-ThirdRepublic is a new dataset for French historical QA.
  • It supports evaluating LLMs and retrieval-augmented models on cross-source synthesis and temporal reasoning.
  • The methodology is adaptable for other languages and historical contexts.

Original post by Aurélien Pellet (LRE), Julien Perez (EPITA, LRE), Marie Puren (LRE, CJM)

"arXiv:2606.31325v1 Announce Type: new Abstract: We present HistoriQA-ThirdRepublic: a French-language dataset of multi-hop historical questions derived from parliamentary debates and newspapers of the French Third Republic. Designed in collaboration with a historian, the corpus c…"
