AI Training Data 2026: The Atlantic Reveals Whose Music Ended Up in Suno and Udio — Yours Included! - gearnews.com


24 Jun by Marcus Schmahl



Source: The Atlantic / Marcus Schmahl


The Atlantic just published four searchable databases as part of its “AI Watchdog” project, together covering more than 20 million songs floating around somewhere in the orbit of AI music generators like Suno, Udio, and Google’s Lyria. If you want to know whether your own name or your band shows up anywhere in this AI training data, you can check for free right now. Sounds like good news for transparency at first. It is, just not quite as reassuring as you might think.

The Secret of the AI Training Data

The Atlantic published four searchable databases containing more than 20 million songs as part of its “AI Watchdog” project on AI training data
The nonprofit LAION assembled the largest dataset, 12.3 million tracks pulled from YouTube
Most of these datasets only contain links to songs, not the actual audio files
Big names like Taylor Swift and Bad Bunny show up right alongside independent electronic producers in the AI training data pool
The major labels have been suing Suno and Udio since June 2024 over mass copyright infringement
Google points to YouTube’s terms of service to justify its own AI models
A key hearing on AI training data in the Suno case is scheduled for July 2026

AI Training Data: Everything on The Atlantic’s Suno, Udio, and Google Investigation

Over 20 Million Songs in the AI Training Data

The reporter behind all this is Alex Reisner, who has been tracking down AI training data for The Atlantic for a while now: first books and research papers, now music. The result is four databases of very different sizes. The biggest one comes from the German nonprofit LAION and holds 12,320,916 tracks pulled from YouTube, adding up to 91 years of music. A second dataset clocks in at around 9 million tracks, and two smaller ones sit at roughly 100,000 songs each. One of those is built on the Free Music Archive, a project originally launched by the radio station WFMU.
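If you want to sanity-check those numbers yourself, here is a quick back-of-the-envelope sketch. It only uses the figures quoted above; the function names are made up for illustration, the 9 million and 100,000 values are the rounded ones from the text, and the year length (365.25 days) is my own assumption. It also ignores any overlap between the four datasets, so the combined figure is a rough upper bound and not a count of unique songs.

```python
# Figures quoted in the article. The ~9 million and ~100,000 values are rounded.
LAION_TRACKS = 12_320_916
LAION_YEARS_OF_AUDIO = 91
OTHER_DATASETS = [9_000_000, 100_000, 100_000]

# Assumption: one year = 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def average_track_minutes(tracks: int, years: float) -> float:
    """Average track length in minutes, given a total duration in years."""
    return years * MINUTES_PER_YEAR / tracks


def total_tracks(*counts: int) -> int:
    """Naive sum of dataset sizes; does not remove duplicates across datasets."""
    return sum(counts)


avg = average_track_minutes(LAION_TRACKS, LAION_YEARS_OF_AUDIO)
total = total_tracks(LAION_TRACKS, *OTHER_DATASETS)

print(f"Average LAION track length: {avg:.2f} minutes")
print(f"Approximate combined size: {total:,} tracks")
```

The average works out to just under four minutes per track, which is about what you would expect for ordinary songs, and the naive sum lands a bit above 21 million, in line with the "more than 20 million" headline figure.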

Here’s the important part: most of these datasets don’t actually contain the audio files themselves, just links to YouTube or Spotify. Automated tools still use those links to pull the actual songs, sometimes bypassing logins, ads, and pretty much everything that’s supposed to pay artists. And there’s still not much in the way of lawsuits or clear rules against this kind of scraping.

What makes Reisner’s work stand out is that the search itself is completely free, no paywall involved. Type in an artist name, and that’s it. That alone sets this investigation apart from the previous debate, which mostly stayed pretty abstract.


Who’s Already Showing Up in the Datasets

Sure, the headlines around this AI training data lead with the big names. Taylor Swift and Bad Bunny show up, which isn’t exactly shocking given the size of these datasets. What’s more interesting to me, and probably to you too, are the cases coming out of the electronic music world.

Berlin-based musician Hainbach says he found 151 of his own songs in just one of the datasets. Breakcore producer sophia_hjkl posted that 138 of her tracks turned up spread across two datasets, by her own account basically her entire output between 2017 and 2024.

Typing in your own artist name takes less than two minutes and gives you an answer right away. I’d give it a shot just out of curiosity. If you don’t find anything, that’s not automatic peace of mind either; more on that in a bit. Spoiler: a few of my own music projects turned up there too, with a fair number of tracks.

The Lawsuits Against Suno and Udio Over AI Training Data

Back in June 2024, we covered the major labels weighing a lawsuit against Suno and Udio, and that’s since turned into a whole bundle of legal proceedings. The RIAA filed at the time on behalf of Universal Music Group, Sony Music, and Warner Music Group over mass copyright infringement, and trade press now puts the number at a dozen lawsuits or more against the two companies. Just a few weeks ago, Universal and Sony moved to add more than 61,000 additional recordings to their case against Suno. As of right now, though, Warner has already settled out of court with Suno, and Universal has done the same with Udio.

Suno is defending itself with the fair use argument, claiming that training a generative model on copyrighted recordings counts as transformative use under 17 U.S.C. § 107. The key hearing in the case is set for July 2026 in front of Judge Denise Casper in federal court in Massachusetts. Udio has acknowledged in the Sony filings that it used publicly available audio for training, while disputing that this amounts to infringement. None of it has actually been decided yet.

Google’s Answer: AI Training Data Through Its Own Terms of Service

Google’s argument runs a little differently. When Lyria 3 launched, the company put out a statement on responsible use, complete with safeguards against directly mimicking a specific artist, a watermarking technique called SynthID, and the usual broad language about intellectual property and privacy. So far, so expected.

The real catch is a different line from that same statement. Google said it trains on “materials that YouTube and Google has a right to use under our terms of service, partner agreements, and applicable law.” Translated, that pretty much means: music that you or your label uploaded to YouTube at some point, maybe even through a distributor, without ever thinking about AI training when you clicked “agree.”

When musicians sued over exactly this, Google moved to dismiss the case without confirming or denying whether specific songs were used, arguing that its terms of service already cover it either way. In fairness, Google’s separate Magenta team says its Magenta RealTime 2 project was trained on licensed stock audio and MIDI data, not scraped user content. So not every Google model takes the same route.

My Take: Transparency Only Where It Wasn’t Voluntary

Let’s be honest, all of this only became visible because these four datasets were floating around publicly and researchers got their hands on them. The companies that keep their training sources completely secret don’t show up in this investigation at all, simply because nobody could look inside. Reisner himself rightly calls this just the tip of the iceberg.

Google’s terms-of-service argument is legally clever, but it still feels pretty thin to me. Nobody uploading a demo to YouTube back in 2015 was thinking “training material for an AI music model.” There’s also another point worth raising here: which genres show up heaviest in these training sets has a lot to do with whose cultural labor ends up supplying the emotional substance of these models in the first place, a debate that artists like SZA have publicly pushed into the spotlight.

And then there’s the quality question. When you look at what all this training material is actually producing, the whole effort starts to feel absurd. TechCrunch reported in February on Czech figure skaters dancing to an AI-generated cover of “You Get What You Give” where the lyrics and phrasing barely lined up at all. Twenty million scraped songs, for that. It just doesn’t add up for me. What do you think?

What Producers and Musicians Can Do Right Now

The obvious first step: search your own name or your band’s name in Reisner’s tool. It’s free and takes less than a minute. Coming up empty doesn’t mean you’re automatically in the clear either, since these four datasets only cover what researchers managed to track down publicly. Nobody really knows how much hidden, undisclosed training data is out there on top of that.

Legally, things stay up in the air as long as the cases against Suno and Udio drag on. If you’re using AI music tools yourself, it’s also worth knowing that earlier research found these models can sometimes produce results that get uncomfortably close to protected music in melody, chords, or style, something we already touched on when covering the first lawsuits against Suno.

Bottom Line on AI Training Data

The Atlantic’s four datasets turn what used to be an abstract debate into something pretty concrete. Instead of “AI companies might be using protected music,” you now get: these songs, these artists, searchable in seconds, by anyone who types in a name. Legally, nothing has actually changed yet: the lawsuits keep grinding on, and the Suno hearing isn’t until July 2026. We’ll obviously keep watching how the whole thing plays out.

But for every single musician who types in their name and suddenly sees a list full of their own songs, this stops being some abstract industry story real fast. What do you think, would you actually go check whether your music ended up in there? Let us know in the comments.
