AI Forecasting Improves with Diverse Model Ensembles

Matthew Aitchison, Scott Jeen, Toby Shevlane, Ben Day · June 30, 2026

Summary

This research finds that combining forecasts from diverse large language models (LLMs), rather than just highly accurate ones, significantly improves the accuracy of AI forecasting systems. Models like Grok 4 contribute disproportionately due to their less correlated predictions, highlighting the importance of complementary errors in ensemble design.

New research on AI forecasting systems finds that combining predictions from only the most accurate large language models (LLMs) is not the best way to maximize ensemble accuracy. Diversity among the ensemble's members matters as well. On binary questions from the Metaculus AI Benchmark, many frontier LLMs produced highly correlated predictions, and this redundancy limits the value of adding more forecasts from similar models. The most effective ensembles combined models that were both accurate and diverse, so that their errors were complementary rather than overlapping. Grok 4, for example, contributed disproportionately to ensemble performance because its predictions were less correlated with those of other leading LLMs. The value of an "AI crowd" therefore comes from deliberately selecting models with distinct error profiles, rather than sampling indiscriminately from a pool of high-performing but similar models.
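To make the idea of complementary errors concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three forecasters (`model_a`, `model_b`, `model_c`), their probabilities, and the outcomes are made up and are not data from the study. The Brier score (mean squared error between forecast probabilities and 0/1 outcomes) is a common scoring rule for binary forecasts and is used here as a stand-in for whatever metric the authors report.

```python
# Toy, made-up data for illustration only; not results from the study.
outcomes = [1, 0, 1, 0]  # resolved binary questions (1 = yes, 0 = no)

# Hypothetical forecasters; each list holds P(yes) per question.
model_a = [0.9, 0.4, 0.6, 0.1]
model_b = [0.9, 0.4, 0.6, 0.1]  # copy of A: perfectly correlated with it
model_c = [0.6, 0.1, 0.9, 0.4]  # same accuracy as A, but weak on different questions


def brier(forecasts, outcomes):
    """Mean squared error between probabilities and outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def average(*models):
    """Unweighted mean forecast per question across the given models."""
    return [sum(ps) / len(ps) for ps in zip(*models)]


single = brier(model_a, outcomes)
duplicate_pair = brier(average(model_a, model_b), outcomes)
diverse_pair = brier(average(model_a, model_c), outcomes)

print(round(single, 4), round(duplicate_pair, 4), round(diverse_pair, 4))
```

In this toy setup, averaging the two identical models scores about 0.085, no better than either model alone, while averaging the two equally accurate models with different weak spots scores about 0.0625. Real forecasts will not separate this cleanly, but the mechanism is the same: averaging only cancels errors that do not overlap.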

Why it matters

Professionals relying on AI for forecasting and strategic decision-making can significantly improve prediction accuracy by focusing on model diversity in their ensemble approaches. This insight can lead to more robust and reliable AI-driven predictions for future events.

How to implement this in your domain

  1. Assess: Evaluate the correlation of predictions from different LLMs before forming an ensemble.
  2. Prioritize: Select LLMs for ensembles based on both individual accuracy and the diversity of their error patterns.
  3. Experiment: Test various ensemble weighting strategies that account for model diversity, not just individual performance.
  4. Integrate: Incorporate diversity metrics into the model selection process for AI forecasting systems.
  5. Explore: Investigate less common or emerging LLMs that might offer unique predictive perspectives.
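The first two steps can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not the authors' method: the model names and forecasts are hypothetical, Pearson correlation stands in for whichever diversity measure you prefer, and the greedy forward selection in `greedy_select` is a generic technique chosen for brevity.

```python
from itertools import combinations
from math import sqrt

# Hypothetical model names and made-up forecasts, for illustration only.
outcomes = [1, 0, 1, 0, 1, 0]
forecasts = {
    "model_a": [0.9, 0.4, 0.6, 0.1, 0.8, 0.3],
    "model_b": [0.9, 0.4, 0.7, 0.1, 0.8, 0.3],
    "model_c": [0.6, 0.1, 0.9, 0.4, 0.7, 0.2],
}


def brier(probs, outcomes):
    """Mean squared error between probabilities and outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length forecast lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ensemble(names):
    """Unweighted mean forecast of the named models."""
    return [sum(ps) / len(ps) for ps in zip(*(forecasts[n] for n in names))]


def greedy_select(max_size):
    """Start from the best single model, then repeatedly add the model that
    lowers the ensemble Brier score most; stop when no addition helps."""
    chosen = [min(forecasts, key=lambda n: brier(forecasts[n], outcomes))]
    while len(chosen) < max_size:
        candidates = [n for n in forecasts if n not in chosen]
        if not candidates:
            break
        best = min(candidates, key=lambda n: brier(ensemble(chosen + [n]), outcomes))
        if brier(ensemble(chosen + [best]), outcomes) >= brier(ensemble(chosen), outcomes):
            break
        chosen.append(best)
    return chosen


# Step 1 (Assess): pairwise correlation between model forecasts.
pairwise = {(a, b): pearson(forecasts[a], forecasts[b])
            for a, b in combinations(forecasts, 2)}

# Step 2 (Prioritize): pick members by their effect on ensemble accuracy.
selected = greedy_select(max_size=2)
print(pairwise)
print(selected)
```

With this toy data, the selection keeps `model_b` (the best single model) and adds `model_c` rather than the near-duplicate `model_a`, even though `model_a` and `model_c` have the same individual score. In practice, selection should be run on one set of resolved questions and checked on a held-out set, since choosing members on the same questions used for evaluation can overfit.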

Who benefits

Financial Services, Consulting, Market Research, Government, Strategic Planning

Key takeaways

  • Ensembling diverse LLMs improves forecasting accuracy more than just combining accurate ones.
  • Many frontier LLMs make highly correlated predictions, limiting ensemble value.
  • Models with less correlated predictions, like Grok 4, are disproportionately valuable.
  • Optimizing for both model quality and diversity is crucial for robust AI forecasting.

Original post by Matthew Aitchison, Scott Jeen, Toby Shevlane, Ben Day

"arXiv:2606.29661v1 Announce Type: new Abstract: Top AI forecasting systems are approaching superforecaster-level accuracy on future world events, but still rely primarily on off-the-shelf LLMs combined with forecasting-specific context gathering and scaffolding. We study how to i…"
