Synthetic Data Filtering Boosts Survival Model Training

Niccolò Maria Rizzi, Eugenio Lomurno, Alberto Archetti, Matteo Matteucci · July 2, 2026

Summary

This paper introduces FoGS (Filtered Mixture-of-Generators for Survival analysis), a novel method that reframes synthetic data construction as sample selection rather than generation. FoGS draws from multiple generators and filters samples using an ensemble of survival models, significantly improving downstream survival model performance when training on synthetic data in privacy-restricted clinical settings.

Survival analysis, which models time-to-event data, faces significant challenges in clinical settings due to the scarcity and high cost of training data. Events accrue over long periods, cohorts are often small, and strict privacy regulations limit data sharing across institutions. While tabular generative models offer potential solutions for data augmentation and privacy-preserving sharing, a single generator typically struggles to adequately characterize small cohorts, leading to suboptimal performance when downstream models are trained on its synthetic output.

FoGS (Filtered Mixture-of-Generators for Survival analysis) addresses this by treating synthetic data construction as a sample selection problem. Instead of relying on a single generator, FoGS creates a candidate pool from four architecturally distinct tabular generators. Each synthetic sample is then scored for plausibility, using proper scoring rules, by an ensemble of seven survival models that were trained on real data.

The method employs a two-level optimization pipeline. The outer loop optimizes a selection policy (generator quotas, scorer weights, a random complement, and stratified balancing on event time and censoring) against the performance of a held-out downstream model (XGBoost-Cox). The inner loop tunes this downstream model. Experiments on 16 public datasets show that FoGS significantly improves C-index and IBS metrics when training on synthetic data and testing on real data, often matching or exceeding real-data training performance without compromising privacy margins.
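The selection step described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_samples`, its parameters, and the "higher score means more plausible" convention are all assumptions made for illustration, and the stratified balancing on event time and censoring is left out. It only shows how per-generator quotas, weighted scorer outputs, and a random complement might combine into one selection.

```python
import numpy as np

def select_samples(scores, generator_ids, quotas, scorer_weights,
                   random_fraction=0.1, seed=0):
    """Hypothetical selection policy: top-scored rows plus a random complement.

    scores         : (n_samples, n_scorers) array, higher = more plausible (assumed)
    generator_ids  : (n_samples,) labels naming the generator of each row
    quotas         : dict mapping generator label -> number of rows to keep
    scorer_weights : (n_scorers,) non-negative weights with a positive sum
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    generator_ids = np.asarray(generator_ids)
    weights = np.asarray(scorer_weights, dtype=float)
    if weights.sum() <= 0:
        raise ValueError("scorer_weights must have a positive sum")
    # Weighted average of the per-scorer plausibility scores.
    combined = scores @ (weights / weights.sum())

    selected = []
    for gen, quota in quotas.items():
        idx = np.flatnonzero(generator_ids == gen)
        quota = min(quota, idx.size)
        n_random = int(round(random_fraction * quota))
        n_top = quota - n_random
        # Rank this generator's rows from most to least plausible.
        order = idx[np.argsort(-combined[idx], kind="stable")]
        top, rest = order[:n_top], order[n_top:]
        # Random complement drawn from the rows that were not top-ranked.
        extra = rng.choice(rest, size=n_random, replace=False) if n_random else rest[:0]
        selected.extend(top.tolist())
        selected.extend(extra.tolist())
    return np.array(sorted(selected), dtype=int)
```

In a two-level setup like the one described, an outer search would vary `quotas`, `scorer_weights`, and `random_fraction`, retrain the downstream model on the selected rows, and keep the policy that scores best on held-out real data.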

Why it matters

Healthcare professionals, pharmaceutical researchers, and data scientists can leverage FoGS to overcome data scarcity and privacy concerns in survival analysis, enabling the development of more robust and accurate predictive models for patient outcomes, drug efficacy, and disease progression using fully synthetic data.

How to implement this in your domain

1. Assess current data privacy challenges and data scarcity issues in your survival analysis projects.
2. Explore implementing a mixture-of-generators approach for synthetic data creation.
3. Develop or integrate a sample filtering mechanism based on plausibility scoring using an ensemble of models.
4. Pilot FoGS or similar synthetic data generation and filtering techniques for specific clinical or research cohorts.
5. Collaborate with data privacy experts to ensure synthetic data generation methods meet regulatory compliance.
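A pilot of this kind needs a way to compare a model trained on synthetic data against one trained on real data, both evaluated on real held-out patients. The summary names the C-index as one such metric. Below is a minimal pure-NumPy sketch of Harrell's concordance index; the function name and argument layout are illustrative choices, and a production pipeline would more likely rely on a tested library implementation.

```python
import numpy as np

def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs that the risk scores rank correctly.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter time than subject j. Higher risk should mean an earlier
    event. Tied risk scores count as half-concordant.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Computing this on the same real test set for both the synthetic-trained and the real-trained model gives a like-for-like comparison; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.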

Who benefits

Healthcare, Pharmaceuticals, Life Sciences, Insurance, Clinical Research

Key takeaways

FoGS improves survival model performance by filtering synthetic data from a mixture of generators.
It addresses data scarcity and privacy concerns in clinical settings by enabling fully synthetic training.
The method uses an ensemble of survival models to score and select plausible synthetic samples.
FoGS often matches or exceeds real-data training performance without compromising privacy.

Original post by Niccolò Maria Rizzi, Eugenio Lomurno, Alberto Archetti, Matteo Matteucci

"arXiv:2607.00127v1 Announce Type: new Abstract: Survival analysis models time-to-event data, but in clinical settings training data are costly and scarce: events accrue over years of follow-up, cohorts are small, and privacy regulations restrict sharing across institutions. Tabul…"
