Top AI dataset pulls data from BitcoinTalk, Steemit, and U.S. SEC

The dataset is used by companies like Facebook and Google for AI training.

Former Journalist CryptoSlate

Published Apr. 21, 2023 at 10:17 pm GMT




Colossal Clean Crawled Corpus (C4), an AI dataset used by major tech companies, contains data from various crypto-related websites.

C4 dataset draws from crypto sites

The Washington Post and the Allen Institute for AI recently analyzed the C4 dataset, ranking websites by the number of “tokens” or text snippets taken from each source.

The U.S. Securities and Exchange Commission, whose website includes content on cryptocurrency regulation, was among the dataset's largest sources. Its website (sec.gov) ranked #39 and accounted for 36 million, or 0.02%, of C4's tokens.


Bitcointalk.org, a blockchain discussion board created by Satoshi Nakamoto, ranked at #780. It accounted for 6.1 million, or 0.004%, of C4's tokens.
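As a rough consistency check on the figures quoted for sec.gov and Bitcointalk.org, each token count and percentage pair implies a total size for C4. The sketch below is illustrative arithmetic only: the function and variable names are assumptions, and the inputs are the rounded numbers reported in this article.

```python
# Figures quoted in the article: (tokens, share of C4 in percent).
SOURCES = {
    "sec.gov": (36_000_000, 0.02),
    "bitcointalk.org": (6_100_000, 0.004),
}


def implied_total_tokens(tokens: int, share_percent: float) -> float:
    """Back out the corpus size implied by a token count and its percentage share."""
    return tokens / (share_percent / 100)


def implied_totals(sources=SOURCES):
    """Map each site to the total corpus size its quoted figures imply."""
    return {site: implied_total_tokens(t, p) for site, (t, p) in sources.items()}


if __name__ == "__main__":
    for site, total in implied_totals().items():
        print(f"{site}: ~{total / 1e9:.1f} billion tokens implied")
```

The two pairs imply totals of roughly 180 billion and 150 billion tokens. The gap is consistent with the percentages having been rounded to a single significant figure.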

Cryptocurrency news and aggregation sites such as Cointelegraph and Coinmarketcap.com were also represented. Eight such sites collectively accounted for at least 0.008% of C4's tokens, though other sites likely increase the true total.

Websites related to specific cryptocurrencies and exchanges were also represented in the dataset but accounted for a negligible number of tokens.

Two crypto-adjacent sites also ranked highly. IPFS (ipfs.io) ranked at #16 while Steemit (steemit.com) ranked at #594. IPFS is a distributed file network from the blockchain firm Protocol Labs, while Steemit makes direct use of blockchain. However, these sites do not necessarily contain content related to cryptocurrency.

Mainstream sites topped the list

The C4 dataset is used in AI language models from major tech companies including Google's T5 and Facebook's LLaMA, according to the Washington Post.

Though the above sites are among C4's most significant crypto-related websites, they are outranked by mainstream websites and news sources, which often cover cryptocurrency topics and are likely the primary source of crypto-related data in the dataset.

C4 has also been criticized for containing hate speech and pirated data. Though the dataset's name suggests that it has been “cleaned,” its assemblers used only a list of 400 words to censor specific content, meaning that controversial content remains intact.
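To illustrate why a word list alone leaves gaps, here is a minimal sketch of blocklist filtering. It assumes pages are dropped when they contain any listed word. The two-entry `BLOCKLIST` is a hypothetical stand-in for the roughly 400-word list, and the function names are invented for this example, not taken from C4's actual pipeline.

```python
import re

# Hypothetical stand-in for the blocklist; the real list is not reproduced here.
BLOCKLIST = {"badword", "slur"}


def passes_filter(page_text: str, blocklist=BLOCKLIST) -> bool:
    """Return True if no blocklisted word appears as a whole word in the page."""
    words = set(re.findall(r"[a-z']+", page_text.lower()))
    return words.isdisjoint(blocklist)


def clean(pages, blocklist=BLOCKLIST):
    """Keep only pages that contain none of the blocklisted words."""
    return [page for page in pages if passes_filter(page, blocklist)]
```

A page that expresses hostile ideas without using any listed word passes such a filter unchanged, which is one way controversial content can survive this kind of cleaning.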

The presence of crypto sites, as well as the presence of controversial data, could affect the level of bias seen in content produced by AI chatbots.


Mike Dalton


Before transitioning to crypto writing in 2018, Mike studied library and information sciences. Currently, he resides on Canada's West Coast.


