AI companies are usually secretive about their training data sources, but a Proof News investigation found that some of the wealthiest companies in the world have used material from thousands of YouTube videos to train AI. The companies did so despite YouTube's rules against harvesting material from the platform without permission. The investigation found that subtitles from 173,536 YouTube videos, drawn from more than 48,000 channels, were used by Silicon Valley heavyweights including Anthropic, Nvidia, Apple, and Salesforce.
The dataset, called YouTube Subtitles, contains transcripts of videos from educational and online learning channels such as Khan Academy, MIT, and Harvard. Videos from The Wall Street Journal, NPR, and the BBC were also used to train AI, as were The Late Show with Stephen Colbert, Last Week Tonight with John Oliver, and Jimmy Kimmel Live.
Proof News also found material from YouTube superstars, including MrBeast (289 million subscribers, two videos taken), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the "flat-earth theory."
Proof News created a tool to search for creators in the YouTube AI training dataset.
"No one came to me and said, 'We would like to use this,'" said David Pakman, host of The David Pakman Show, a left-leaning politics channel with more than 2 million subscribers and more than 2 billion views. Nearly 160 of his videos were swept up into the YouTube Subtitles training dataset.
Pakman's show is a full-time operation that employs four people. If AI companies are getting paid, Pakman said, he should be compensated for the use of his data. He pointed out that some media companies have recently signed deals to be paid for the use of their work to train AI. "This is my livelihood, and I put time, resources, money, and staff time into creating this content," Pakman said. "There's really no shortage of work."
"It's a steal," said Nebula CEO Dave Viscus. Stolen from YouTube and used to train artificial intelligence. Viscus said using a creator's work without their consent is "disrespectful," especially since studios can use "generative artificial intelligence to replace artists whenever possible."
"Could it be used to exploit and harm artists? Yes, absolutely," Viscus said. Representatives for EleutherAI, the creator of the dataset, did not respond to a request for comment on Proof's findings, including allegations that the footage was used without permission. The company's website says its overall goal is to lower the barriers to AI development for people outside of large tech companies, and that the company has historically provided "access to cutting-edge AI technology through training and release models."
YouTube Subtitles does not include video imagery. Instead, it consists of the plain text of videos' subtitles, often along with translations into languages such as Japanese, German, and Arabic. According to research published by EleutherAI, the dataset is part of a compilation the nonprofit released called the Pile. The Pile was built from material taken not only from YouTube but also from the European Parliament, English Wikipedia, and a trove of emails from Enron employees that was released as part of a federal investigation into the company.

Most of the Pile's datasets are accessible and open to anyone on the internet with enough storage and computing power to use them. Academics and other developers outside of Big Tech have made use of the dataset, but they are not the only ones. Companies valued in the hundreds of billions and trillions of dollars, including Apple, Nvidia, and Salesforce, describe in their research papers and posts how they used the Pile to train AI. Documents also show that Apple used the Pile to train OpenELM, a high-profile model released in April, weeks before the company announced it would add new AI capabilities to iPhones and MacBooks. Bloomberg and Databricks also trained models on the Pile, the companies' publications show.
Anthropic, a leading AI maker that has received a $4 billion investment from Amazon and promotes its focus on "AI safety," also trained on the Pile.
"The Pile includes a very small subset of YouTube subtitles," Anthropic spokesperson Jennifer Martinez said in a statement. "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point of potential violations of YouTube's terms of service, we'd have to refer you to the Pile's authors."
Salesforce also confirmed that it used the Pile to build an AI model for "academic and research purposes." Caiming Xiong, the company's vice president of AI research, emphasized in a statement that the dataset is "public."
Salesforce later released that AI model for public use in 2022; it has since been downloaded at least 86,000 times, according to its Hugging Face page. In their research paper, Salesforce developers flagged that the Pile also contained profanity as well as "biases against gender and certain religious groups," warning that this could lead to "vulnerabilities and safety concerns." Proof News found thousands of examples of profanity in YouTube Subtitles, as well as instances of racial and gender slurs. Salesforce representatives did not respond to questions about the safety concerns. Nvidia representatives declined to comment. Representatives for Apple, Databricks, and Bloomberg did not respond to requests for comment.