An investigation by Proof News found that some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Creators say their videos were used without their knowledge.
As shown in the graph above, published by MIT Technology Review, YouTube videos and “synthetic conversations” have been used to train large language models (LLMs). “Synthetic conversations” are dialogues generated by artificial intelligence or written by humans specifically to train AI models. Tech companies such as Apple, Nvidia, Anthropic, and Salesforce have used subtitles from more than 173,000 YouTube videos, drawn from over 48,000 channels, to train their AI models without explicit consent from the content creators. YouTube was not the only source mined for training data: the datasets also include European Parliament proceedings, English Wikipedia, and a trove of Enron Corporation employees’ emails released as part of a federal investigation into the firm. This practice has raised significant ethical concerns, as it potentially violates YouTube’s policies and exploits creators’ intellectual property without compensation.
In response to these concerns, YouTube introduced a setting in December 2024 that lets creators opt in to allowing third-party companies to use their videos for AI training. The setting is disabled by default, giving creators the choice of whether their content contributes to AI model development and aiming to balance innovation with respect for creators’ rights.
Additionally, the reliance on synthetic conversations for training LLMs has sparked discussion about the quality and authenticity of AI-generated content. An article in The Guardian highlighted the experiences of writers who produce example responses used to train AI models like ChatGPT. These writers supply “gold standard” material to help AI systems generate accurate outputs and avoid the inaccuracies known as “hallucinations.” This work underscores the ongoing need for human input in refining AI capabilities, even as these systems evolve.
These developments emphasize the importance of ethical considerations in AI training methodologies, particularly concerning content creators’ rights and the authenticity of AI-generated information.
SOURCE MATERIAL
Proof News: https://www.proofnews.org/apple-nvidia-anthropic-used-thousands-of-swiped-youtube-videos-to-train-ai/ – by Annie Gilbertson and Alex Reisner, July 16, 2024
Proof News: “Search the YouTube Videos Secretly Powering Generative AI,” https://www.proofnews.org/youtube-ai-search/ – by Alex Reisner, July 16, 2024
The Verge: https://www.theverge.com/2024/12/16/24322732/youtube-creators-opt-in-third-party-ai-training-videos – by Jay Peters, December 16, 2024
MIT Technology Review: https://www.technologyreview.com/2024/12/18/1108796/this-is-where-the-data-to-build-ai-comes-from – December 18, 2024
The Guardian: https://www.theguardian.com/technology/article/2024/sep/07/if-journalism-is-going-up-in-smoke-i-might-as-well-get-high-off-the-fumes-confessions-of-a-chatbot-helper – by Jack Apollo George, September 7, 2024