Atlantic Creates Searchable Database of Music Used in AI Training

AI | The Verge · June 20, 2026

Summary

The Atlantic's Alex Reisner uncovered and made public four large datasets of music used to train AI models, some containing millions of tracks. These datasets have been downloaded thousands of times, with companies like Google and Stability confirming their use in research.

A reporter from The Atlantic, Alex Reisner, has compiled and released a publicly searchable database detailing music datasets utilized in the training of artificial intelligence models. This initiative revealed four distinct collections, two of which are exceptionally vast, comprising 12 million and 9 million tracks respectively, alongside two smaller but still substantial sets each exceeding 100,000 songs. These datasets have seen widespread distribution, with thousands of downloads recorded. While the full extent of their usage by AI developers remains unclear, major players such as Google and Stability have acknowledged employing these resources in their research, as evidenced in their published papers. The origins of some music, like that from the Free Music Archive, permit personal streaming but restrict commercial re-use, raising questions about licensing in AI training.

Why it matters

Professionals in AI development, law, and content creation need to understand where training data comes from in order to follow ethical practices and avoid potential copyright infringement. This database makes a critical and usually opaque part of AI model development visible.

How to implement this in your domain

1. Review the database to identify if your organization's content is present in AI training datasets.
2. Assess potential copyright implications for AI models trained on these publicly identified datasets.
3. Develop internal guidelines for sourcing and licensing training data to mitigate legal risks.
4. Engage with legal counsel to understand the evolving landscape of AI and intellectual property rights.
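
The first step above can be partly automated if you have a track listing to compare against. The following is a minimal sketch, not a description of how The Atlantic's database works: it assumes you have your own catalog and a dataset listing as CSV text with `artist` and `title` columns. Those column names, the sample rows, and the helper names `normalize` and `find_matches` are all hypothetical. It compares tracks after normalizing case, accents, and punctuation, so it will miss retitled tracks and may flag unrelated songs that share a name.

```python
import csv
import io
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents/punctuation so near-identical names compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = (c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in without_accents)
    return " ".join(cleaned.lower().split())


def find_matches(catalog_rows, dataset_rows):
    """Return catalog rows whose (artist, title) pair also appears in the dataset listing."""
    dataset_keys = {
        (normalize(row["artist"]), normalize(row["title"])) for row in dataset_rows
    }
    return [
        row
        for row in catalog_rows
        if (normalize(row["artist"]), normalize(row["title"])) in dataset_keys
    ]


# Hypothetical sample data; real listings would be read from files instead.
CATALOG_CSV = """artist,title
Example Artist,First Song
Another Band,Café Nights
"""
DATASET_CSV = """artist,title,source
another band,Cafe Nights!,hypothetical-dataset-a
Someone Else,Unrelated Track,hypothetical-dataset-b
"""

catalog = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
dataset = list(csv.DictReader(io.StringIO(DATASET_CSV)))
matches = find_matches(catalog, dataset)

for row in matches:
    print(row["artist"], "-", row["title"])
```

A match from a script like this is only a lead for manual review; whether a track's presence in a dataset has legal consequences is a question for counsel, as step 4 notes.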

Who benefits

Legal, Music, AI Development, Media & Entertainment

Key takeaways

A new searchable database reveals music datasets used for AI training.
Millions of tracks from various sources are included, some with unclear usage rights.
Major AI companies have confirmed using these datasets in their research.
Transparency in AI training data is crucial for addressing copyright and ethical concerns.

Original post by AI | The Verge

"Atlantic reporter Alex Reisner recently uncovered four datasets of music being used to train AI models and made them fully searchable for the public. Two of the sets are absolutely enormous at 12 million and 9 million tracks. The other two are much smaller, but still represent a…"

Originally posted by AI | The Verge on X
