New Method Infers Dataset Usage Without Shadow Models or Held-out Data

Wojciech Łapacz, Stanisław Pawlak, Jan Dubiński, Franziska Boenisch, Adam Dziedzic · June 26, 2026

Summary

This research introduces a practical Dataset Usage Inference (DUI) framework that estimates what fraction of a dataset contributed to a model's training without requiring expensive shadow models or a held-out dataset. It achieves this by generating synthetic non-member samples and using diverse membership signals for mixture proportion estimation.

Determining the extent to which a specific dataset was used to train a machine learning model, known as Dataset Usage Inference (DUI), is a critical challenge. Existing DUI methods are often impractical because they demand expensive shadow models and access to both known training samples and a confirmed held-out dataset. These requirements are rarely met, especially for large, modern models and real-world data ownership disputes. A new, practical DUI framework has been developed to overcome these limitations. This method eliminates the need for shadow models and actual held-out data. Instead, it generates synthetic non-member samples, extracts various membership signals, and frames DUI as a mixture proportion estimation problem. Experiments with large image generative models demonstrate that this approach reliably quantifies dataset usage, providing a valuable tool for data owners to ascertain how their data was utilized in model training.
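To make the mixture proportion estimation step concrete, here is a minimal sketch in Python. It is not the authors' estimator. It assumes each sample has already been reduced to a single membership score where higher means "more member-like", and it uses a simple textbook-style bound: the suspect dataset is a mixture of members and non-members, so the ratio of its score CDF to the CDF of the synthetic non-members approaches the non-member proportion in the low-score tail. The function name, the `min_mass` and `grid_points` parameters, and the Gaussian toy scores are all hypothetical and chosen for illustration only.

```python
import bisect
import random


def estimate_member_fraction(suspect_scores, nonmember_scores,
                             min_mass=0.2, grid_points=50):
    """Estimate the fraction of training members among suspect_scores.

    Assumption (not from the paper): higher score = more member-like, so
    low-score regions of the suspect set are dominated by non-members.
    """
    suspect = sorted(suspect_scores)
    reference = sorted(nonmember_scores)
    best_ratio = 1.0
    for i in range(grid_points + 1):
        # Sweep thresholds over reference quantiles from min_mass up to 1.
        q = min_mass + (1.0 - min_mass) * i / grid_points
        idx = min(len(reference) - 1, int(q * len(reference)))
        threshold = reference[idx]
        # Empirical CDFs of both sets at this threshold.
        cdf_ref = bisect.bisect_right(reference, threshold) / len(reference)
        cdf_sus = bisect.bisect_right(suspect, threshold) / len(suspect)
        # The smallest ratio bounds the non-member share of the mixture.
        best_ratio = min(best_ratio, cdf_sus / cdf_ref)
    return max(0.0, 1.0 - best_ratio)


# Toy data: 30% of the suspect set are "members" with shifted scores.
rng = random.Random(0)
true_fraction = 0.3
n = 20000
nonmembers = [rng.gauss(0.0, 1.0) for _ in range(n)]
suspect = [
    rng.gauss(3.0, 1.0) if rng.random() < true_fraction else rng.gauss(0.0, 1.0)
    for _ in range(n)
]
estimate = estimate_member_fraction(suspect, nonmembers)
print(f"true fraction: {true_fraction}, estimated: {estimate:.3f}")
```

Taking the minimum over noisy ratios tends to slightly overestimate the member fraction, which is why the sketch ignores thresholds below `min_mass`. In practice the quality of the estimate depends on how well the synthetic non-members match real non-members and on how separable the membership signals are.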

Why it matters

Data owners and organizations concerned with intellectual property, data privacy, and compliance gain a practical way to estimate whether, and how much of, their data was used to train AI models, including large generative ones, without training shadow models or supplying a confirmed held-out dataset.

How to implement this in your domain

  1. Adopt this new DUI framework to audit the training data usage of third-party AI models.
  2. Develop internal tools based on this method to verify compliance with data licensing agreements.
  3. Integrate DUI capabilities into data governance and intellectual property protection strategies.
  4. Educate legal and compliance teams on the feasibility of dataset usage inference for dispute resolution.
  5. Explore applying this technique to different data modalities beyond image generative models.

Who benefits

Legal, Media & Entertainment, Software, Data Governance, Intellectual Property

Key takeaways

  • A new DUI framework eliminates the need for shadow models and held-out data.
  • It uses synthetic non-member samples and membership signals to estimate dataset usage.
  • The method is practical for large generative models and real-world data ownership disputes.
  • This provides a crucial tool for data owners to verify how their data was used in AI training.

Original post by Wojciech Łapacz, Stanisław Pawlak, Jan Dubiński, Franziska Boenisch, Adam Dziedzic

"arXiv:2606.26257v1 Announce Type: new Abstract: How much of my data was used to train a machine learning model? Dataset Usage Inference (DUI) aims to answer this by estimating what fraction of a dataset contributed to a model's training. However, existing DUI methods rely on assu…"
