Causal Direction Benchmarks Re-evaluated with New Parameter-Free Baseline

Wietse Stienstra · June 24, 2026

Summary

A re-evaluation of bivariate causal direction methods on the Tuebingen cause-effect pairs finds that published accuracy figures are often inflated by inconsistent evaluation protocols. Under a single standardized evaluation, a new, simple, parameter-free compression baseline performs comparably to more complex methods.

Work on causal direction inference often compares methods by their headline accuracies on the Tuebingen cause-effect pairs, but these comparisons are flawed. Each method is typically evaluated under its authors' own protocol, and protocols differ in pair subsets, weightings, model selection, and decision rates. This inconsistency makes direct comparison unreliable and can inflate reported performance.

The researchers conducted a "same-hands" re-evaluation, running all methods on an identical set of 102 pairs under a strict rule: no tuning, and a forced decision for every pair. They also introduced a minimal, parameter-free baseline, sorted-conditional compression, which quantizes, sorts, and differences the data and then passes it to a standard compressor. Under this "common ruler," the ranking of methods differs substantially from the one implied by the published literature. The compression baseline achieved 74.7% weighted accuracy, comparable to more complex, tuned methods. The study also identifies mechanisms that inflate published figures, such as test-set model selection (choosing a model or its settings by its score on the evaluation pairs themselves) and significance-gated abstention (deciding only on pairs where a significance test passes), and offers a more consistent benchmark for future causal inference research.
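The summary describes the baseline only as compressing quantized, sorted, and differenced data with a standard compressor. The Python sketch below is one possible reading of that description, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the fixed 256 quantization levels, the choice of zlib as the compressor, the rule of picking the direction with the smaller compressed size, and the tie-break toward "x->y".

```python
import zlib
import numpy as np

def _quantize(v, levels=256):
    """Map values to integer bins 0..levels-1 by min-max scaling (assumed scheme)."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros(len(v), dtype=np.int64)
    return np.floor((v - v.min()) / span * (levels - 1)).astype(np.int64)

def conditional_cost(cause, effect, levels=256):
    """Compressed size in bytes of the effect, ordered by the cause and differenced."""
    order = np.argsort(cause, kind="stable")   # sort pairs by the candidate cause
    q = _quantize(effect, levels)[order]       # quantized effect in that order
    diffs = np.diff(q)                         # successive differences
    # Shift differences into a non-negative range so they fit in uint16.
    payload = (diffs + levels).astype(np.uint16).tobytes()
    return len(zlib.compress(payload, 9))

def infer_direction(x, y):
    """Forced decision: always returns a label; ties go to 'x->y' (arbitrary choice)."""
    cost_xy = conditional_cost(x, y)   # cost of describing y given the order of x
    cost_yx = conditional_cost(y, x)   # cost of describing x given the order of y
    return "x->y" if cost_xy <= cost_yx else "y->x"
```

Calling `infer_direction(x, y)` on two equal-length one-dimensional arrays always returns one of the two labels, which matches the forced-decision rule described above. Nothing is tuned per pair, but the constants are this sketch's choices and should not be expected to reproduce the reported 74.7%.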

Why it matters

For professionals relying on causal inference in data analysis, understanding the true performance of methods is critical. This re-evaluation exposes potential overestimations in published results and provides a more reliable benchmark, promoting more rigorous and trustworthy causal discovery.

How to implement this in your domain

  1. Critically assess reported accuracies of causal inference methods, considering the evaluation protocols used.
  2. Prioritize methods evaluated under standardized, "same-hands" conditions to ensure fair comparisons.
  3. Consider using simple, parameter-free baselines as a reference point when developing or evaluating new causal inference techniques.
  4. Adopt rigorous evaluation practices, including forced decisions and consistent datasets, to avoid inflated performance metrics.
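To make the forced-decision point concrete, the sketch below contrasts weighted accuracy over all pairs with accuracy computed only over the pairs a method chose to decide. The function names and the toy labels and weights are hypothetical and are not taken from the study. The second function reports the decision rate alongside the accuracy, since an accuracy quoted without it cannot be compared across methods.

```python
def forced_weighted_accuracy(predictions, truths, weights):
    """Weighted accuracy when every pair must receive a decision."""
    if any(p is None for p in predictions):
        raise ValueError("forced-decision protocol: abstentions are not allowed")
    hit = sum(w for p, t, w in zip(predictions, truths, weights) if p == t)
    return hit / sum(weights)

def gated_weighted_accuracy(predictions, truths, weights):
    """Accuracy over decided pairs only (None = abstain), plus the decision rate."""
    decided = [(p, t, w) for p, t, w in zip(predictions, truths, weights)
               if p is not None]
    decided_weight = sum(w for _, _, w in decided)
    hit = sum(w for p, t, w in decided if p == t)
    accuracy = hit / decided_weight if decided_weight else float("nan")
    return accuracy, decided_weight / sum(weights)

# Toy example with made-up labels and weights.
truths  = ["x->y", "y->x", "x->y", "x->y"]
weights = [1.0, 1.0, 0.5, 0.5]
forced  = ["x->y", "x->y", "x->y", "y->x"]   # a decision on every pair
gated   = ["x->y", None,   "x->y", None]     # abstains on the two pairs it gets wrong

forced_acc = forced_weighted_accuracy(forced, truths, weights)
gated_acc, decision_rate = gated_weighted_accuracy(gated, truths, weights)
```

In this toy case the same underlying method scores 0.5 under forced decisions but 1.0 once it abstains on the pairs it would have gotten wrong, with a decision rate of only 0.5.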

Who benefits

Data Science · Research & Development · Academia · Finance · Healthcare

Key takeaways

  • Published causal inference accuracies are often inflated due to inconsistent evaluation protocols.
  • A standardized re-evaluation reveals a different ranking of methods.
  • A simple, parameter-free compression baseline performs comparably to complex methods.
  • Rigorous evaluation protocols are essential for reliable causal inference research.

Original post by Wietse Stienstra

"arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol -- different pair subsets, weightings, model-selection, and decision rates. We…"

Originally posted by Wietse Stienstra on X