VAEs Uncover Latent Structures in Large-Scale X-ray Scattering Data.

Monika Choudhary, Xiaoya Chong, Runbo Jiang, Wiebke Koepp, Petrus H. Zwart, Damon English, Gregory M. Su, Eric Schaible, Chenhui Zhu, Mostafa Nassr, Noah P. Wamble, Kelvin Kam-Yun Li, Jonathan M. Chan, Jose Carlos Diaz, Cameron McKay, Lynn Katz, Benny Freeman, Guillaume Freychet, Yevgen Matviychuk, Eliot Gann, Daniel B. Allan, Benedikt Sochor, Frank Schluenzen, Stephan V. Roth, Ethan Crumlin, Dylan McReynolds, Tanny Chavez, Alexander Hexemer · June 16, 2026

Summary

Researchers developed a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) for large-scale X-ray scattering data. The model learns low-dimensional representations that reveal structural variations, supporting both exploration of archived datasets and analysis of live experiments. In a benchmark against DINOv3, a general-purpose vision foundation model, it produced a more interpretable latent organization for this domain.

Scientific user facilities generate X-ray scattering data faster than traditional workflows can process it. To address this, the authors train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images. The model learns compact, low-dimensional representations that capture structural variations under diverse experimental conditions. Its latent space organizes data into well-defined clusters and smooth trajectories that reflect the progression of experiments. The model can also generate synthetic scattering images that represent various structural states. It was deployed without retraining to analyze time-resolved film formation experiments at two synchrotron facilities, where it revealed interpretable latent structures. In a benchmark against DINOv3, a general-purpose vision foundation model, the C-VAE's domain-specific training yielded a more interpretable latent organization for X-ray scattering data. Both the offline exploration and live analysis workflows are integrated into the Latent Space Explorer, part of the MLExchange platform, which provides interactive tools for structural investigation.
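For readers unfamiliar with how a VAE arrives at such a latent space, the sketch below shows the two terms of the standard VAE training objective (a reconstruction error plus a KL penalty toward a standard normal prior) and the reparameterization step used to sample a latent vector. This is the textbook formulation in plain Python, not the authors' implementation: the summary does not describe their loss, attention layers, or latent dimensionality, and the function names and the `beta` weight here are illustrative assumptions.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu and log_var are the encoder outputs for one image (hypothetical here);
    sigma is recovered as exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def reconstruction_error(x, x_hat):
    """Sum of squared differences between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Standard VAE objective: reconstruction error + beta * KL penalty.

    beta=1.0 gives the plain VAE; other values are an assumption for
    illustration, not something stated in the source.
    """
    return reconstruction_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)
```

The KL term is what pulls encodings toward a shared, smooth prior, which is one reason VAE latent spaces tend to support the kind of continuous trajectories described above.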

Why it matters

Professionals in materials science, chemistry, and advanced manufacturing can leverage this technology to accelerate scientific discovery and process optimization by rapidly interpreting complex experimental data. It enables faster insights from high-throughput experiments, reducing bottlenecks in research and development.

How to implement this in your domain

  1. Investigate applying Variational Autoencoders (VAEs) or similar unsupervised learning techniques to high-throughput experimental data in your domain.
  2. Develop domain-specific training datasets for foundation models to enhance their performance and interpretability for specialized tasks.
  3. Implement interactive data exploration tools that visualize latent spaces to help scientists uncover hidden patterns and relationships in complex datasets.
  4. Explore integrating real-time machine learning models into live experimental setups for on-the-fly data analysis and feedback.
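As one possible starting point for the latent-space exploration and live-analysis steps above, the sketch below takes a time-ordered sequence of latent vectors (one per frame, as produced by any trained encoder) and flags frames where the latent trajectory jumps unusually far. The median-based threshold, the `factor` parameter, and the function names are assumptions made for illustration; the source does not say how the Latent Space Explorer detects or displays transitions.

```python
import math
import statistics


def step_sizes(latents):
    """Euclidean distance between each consecutive pair of latent vectors."""
    return [math.dist(a, b) for a, b in zip(latents, latents[1:])]


def flag_transitions(latents, factor=3.0):
    """Return indices of frames whose step from the previous frame is large.

    A step counts as large when it exceeds `factor` times the median step
    of the sequence (a simple heuristic, chosen here for illustration).
    """
    steps = step_sizes(latents)
    if not steps:
        return []
    threshold = factor * statistics.median(steps)
    # steps[i] is the move from frame i to frame i + 1, so report i + 1.
    return [i + 1 for i, s in enumerate(steps) if s > threshold]


# Hypothetical 2-D latent trajectory: slow drift, one abrupt change, slow drift.
trajectory = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [2.0, 0.0], [2.1, 0.0]]
flagged = flag_transitions(trajectory)
```

In a live setting the same check could be run on a sliding window of recent frames, so that each new image is compared against the typical step size seen so far.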

Who benefits

Materials Science, Pharmaceuticals, Chemical Manufacturing, Scientific Research, Advanced Manufacturing

Key takeaways

  • Domain-specific VAEs can efficiently process and interpret large-scale X-ray scattering data.
  • The learned latent spaces reveal interpretable structural variations and experimental progressions.
  • This approach supports both offline dataset exploration and live, on-the-fly analysis.
  • In a benchmark against DINOv3, a general-purpose vision foundation model, domain-specific training yielded a more interpretable latent organization for X-ray scattering data.

Original post

"arXiv:2606.14999v1. Abstract: Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain…"
