Counterfactual Data Augmentation Boosts Regression Model Accuracy

Hossein Mohebbi, Oliver Schulte, Ke Li, Pascal Poupart · June 30, 2026

Summary

Counterfactual Residual Data Augmentation (CRDA) is a model-agnostic technique for tabular regression that generates new training samples by treating noise residuals as invariant under small feature perturbations. In the reported experiments, it reduced MSE by an average of 22.9% for MLP regressors and 6.4% for XGBoost regressors across benchmark datasets.

Regression models trained on real-world data often suffer from limited training samples, high collection costs, and noisy observations. Inspired by the success of data augmentation in computer vision and natural language processing, the authors propose Counterfactual Residual Data Augmentation (CRDA), a technique designed specifically for tabular regression. The core insight is that once a regressor has captured the systematic patterns in the data, what remains is a noise residual that can be treated as invariant under small, carefully chosen feature perturbations. CRDA uses this invariance to generate new, realistic training samples, expanding the dataset without additional real-world data collection. The method is model-agnostic, so it can be applied to a range of regressors.

In experiments across diverse benchmark datasets, CRDA reduced Mean Squared Error (MSE) by an average of 22.9% for MLP regressors and 6.4% for XGBoost regressors. The authors report that it consistently outperformed existing state-of-the-art data generators and augmentation techniques in MSE reduction, offering a simple and efficient option for noise-prone, small-sample regression problems.
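The residual-invariance idea can be illustrated with a short sketch. This is one plausible reading of the summary, not the authors' published procedure: it fits a base regressor, records each row's residual, perturbs the features with Gaussian noise, and builds new targets as the prediction at the perturbed point plus the original residual. The function names `fit_linear` and `residual_augment`, the Gaussian perturbation, and the `n_copies` and `scale` parameters are all assumptions made for illustration, and the least-squares model is only a stand-in for whatever regressor you would actually use.

```python
import numpy as np

def fit_linear(X, y):
    """Fit ordinary least squares with an intercept; return a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

    return predict

def residual_augment(X, y, predict, n_copies=2, scale=0.05, seed=0):
    """Hypothetical residual-invariance augmentation (not the paper's exact method)."""
    rng = np.random.default_rng(seed)
    residuals = y - predict(X)        # noise left over after the systematic fit
    feature_std = X.std(axis=0)       # keeps perturbations small relative to each feature
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_new = X + rng.normal(0.0, scale, size=X.shape) * feature_std
        # Assumption: the residual of each row carries over unchanged to its perturbed copy.
        y_new = predict(X_new) + residuals
        X_parts.append(X_new)
        y_parts.append(y_new)
    return np.vstack(X_parts), np.concatenate(y_parts)

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=50)

predict = fit_linear(X, y)
X_aug, y_aug = residual_augment(X, y, predict)
print(X_aug.shape, y_aug.shape)  # (150, 3) (150,)
```

The original rows are kept and each perturbed copy is appended, so the training set triples in this example. How the perturbations are "carefully chosen" in the paper is not described in this summary, so the noise model here should be treated as a placeholder.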

Why it matters

For data scientists and ML engineers, CRDA offers a model-agnostic way to improve the accuracy of regression models trained on limited or noisy tabular data, which may reduce the need for costly data collection.

How to implement this in your domain

  1. Integrate CRDA into your data preprocessing pipeline for tabular regression tasks with limited data.
  2. Benchmark CRDA against existing data augmentation techniques to quantify performance improvements on your specific datasets.
  3. Apply CRDA to improve the robustness of models in noise-prone environments, such as sensor data or financial forecasting.
  4. Explore using CRDA to reduce the need for expensive data collection in new regression projects.
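For the benchmarking step, a small harness along the following lines could compare held-out MSE with and without augmentation. It is a sketch under stated assumptions: `compare_on_holdout`, `fit_linear`, and `no_augment` are hypothetical names, the least-squares model stands in for your own regressor, and `no_augment` is an identity placeholder that you would replace with a CRDA implementation or any other augmenter taking `(X, y)` and returning an enlarged `(X, y)`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error as a plain float."""
    return float(np.mean((y_true - y_pred) ** 2))

def fit_linear(X, y):
    """Stand-in regressor: least squares with an intercept; returns a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ coef

def no_augment(X, y):
    """Identity placeholder; swap in an augmenter that returns an enlarged (X, y)."""
    return X, y

def compare_on_holdout(fit, augment, X, y, test_fraction=0.25, seed=0):
    """Return held-out MSE for models trained without and with augmentation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    # Augment the training split only, so the test rows remain untouched real data.
    X_aug, y_aug = augment(X[train], y[train])
    baseline = fit(X[train], y[train])
    augmented = fit(X_aug, y_aug)
    return {
        "baseline_mse": mse(y[test], baseline(X[test])),
        "augmented_mse": mse(y[test], augmented(X[test])),
    }

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(80, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.2, size=80)

result = compare_on_holdout(fit_linear, no_augment, X, y)
print(result)
```

With the identity placeholder the two numbers are identical by construction, which makes a useful sanity check before plugging in a real augmenter. Repeating the comparison over several seeds and datasets gives a more reliable picture than a single split.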

Who benefits

BFSI, Healthcare, Manufacturing, Retail, Energy

Key takeaways

  • CRDA is a data augmentation technique for tabular regression.
  • It generates new training samples by exploiting residual invariance under small feature perturbations.
  • CRDA is model-agnostic; in the reported experiments it reduced MSE by an average of 22.9% for MLP regressors and 6.4% for XGBoost regressors.
  • It offers a simple, efficient option for small-sample, noise-prone regression problems.

Original post by Hossein Mohebbi, Oliver Schulte, Ke Li, Pascal Poupart

"arXiv:2606.28460v1 Announce Type: new Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel C…"
