AI training dataset used by tech giants allegedly created by scraping YouTube videos in violation of terms

Non-profit AI research group EleutherAI created the dataset called "the Pile."

Mike Dalton

Jul. 16, 2024 at 9:00 pm UTC

Updated: Jul. 17, 2024 at 12:36 am UTC

Non-profit AI research group EleutherAI scraped YouTube subtitles to create a dataset in violation of YouTube's terms of service, ProofNews said on July 16.

The dataset, called the Pile, allegedly includes subtitles from 173,536 YouTube videos across more than 48,000 channels. About 12,000 of those videos have since been deleted.

Several top tech and AI firms, including Anthropic, have since used the Pile for training. Anthropic spokesperson Jennifer Martinez said the dataset includes “a very small subset of YouTube subtitles” but declined to comment on possible violations of YouTube's terms of service.

Business software firm Salesforce also used the dataset. Salesforce VP of AI research Caiming Xiong said the dataset was “publicly available” and that Salesforce used it for academic and research purposes. ProofNews said Salesforce eventually released the same dataset publicly.

Apple used the Pile to train OpenELM, an efficient language model for on-device AI. Nvidia, Bloomberg, and Databricks also used the Pile for AI training.

ProofNews said its list of companies that used the dataset is not comprehensive, as companies do not always disclose which datasets they use in AI training.

Dataset contains crypto channels, more

ProofNews' search tool indicates that the Pile includes videos from crypto channels and creators, including Coinbase, Cointelegraph, Bitcoin Magazine, BitBoy Crypto, 99Bitcoins, Ivan On Tech, and Andreas Antonopoulos.

ProofNews highlighted that the dataset includes transcripts from major news channels, education channels, late-night shows, popular YouTube hosts, and other categories. The Pile dataset extends beyond YouTube to other websites and online content.

ProofNews noted an earlier report from the New York Times, which said OpenAI and Google had previously harvested YouTube text. Google, which owns YouTube, said the action was permissible due to its agreement with users. OpenAI did not confirm or deny the report.

AI copyright disputes are far-reaching. Law firm BakerHostetler lists at least fifteen lawsuits involving tech firms such as Anthropic, Meta, GitHub, Stability AI, Nvidia, and Google. OpenAI faces high-profile lawsuits from Mother Jones' parent company and The New York Times.

Author

Mike Dalton

Journalist, CryptoSlate

Before transitioning to crypto writing in 2018, Mike studied library and information sciences. Currently, he resides on Canada's West Coast.

Editor

Assad Jafri

Editor & Reporter, CryptoSlate

AJ, a passionate journalist since Yemen's 2011 Arab Spring, has honed his skills worldwide for over a decade. Specializing in financial journalism, he now focuses on crypto reporting.
