Executive Summary:
The Washington Post analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that has been used to instruct some high-profile English-language AIs, to reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data. The analysis found that the data set was dominated by websites from industries including journalism, entertainment, software development, medicine, and content creation, but it also included at least 27 sites identified by the U.S. government as markets for piracy and counterfeits. The filters used to limit a model’s exposure to racial slurs and obscenities during training have been shown to eliminate some nonsexual LGBTQ content while failing to remove troubling content, including white supremacist sites, anti-trans sites, and anonymous message boards known for organizing targeted harassment campaigns against individuals.
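For readers who want to poke at the data themselves, the following is a minimal Python sketch of a domain tally in the spirit of the Post’s analysis. The dataset name ("allenai/c4"), the "en" configuration, and the "url" record field are assumptions about the public Hugging Face mirror of C4, not details from the article, and the sketch streams a small sample rather than the full corpus.

# A rough domain tally over a streamed sample of C4. The dataset name
# ("allenai/c4"), the "en" config, and the "url" field are assumptions
# about the public Hugging Face mirror, not details from the article.
from collections import Counter
from itertools import islice
from urllib.parse import urlparse

from datasets import load_dataset

# Stream the English split so nothing is downloaded up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

domains = Counter()
for record in islice(c4, 100_000):  # small sample; the full split is far larger
    domains[urlparse(record["url"]).netloc] += 1

for domain, count in domains.most_common(20):
    print(f"{count:6d}  {domain}")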
Key Insights:
1. The data set used to train AI chatbots is sourced from a wide range of websites, including proprietary, personal, and often offensive sites.
2. The data set included at least 27 sites identified by the U.S. government as markets for piracy and counterfeits, raising concerns about the use of copyrighted material in AI training data.
3. The filters used to limit a model’s exposure to racial slurs and obscenities during training have been shown to eliminate some nonsexual LGBTQ content while failing to remove troubling content, including white supremacist sites, anti-trans sites, and anonymous message boards known for organizing targeted harassment campaigns against individuals (a sketch of why appears after this list).
4. AI chatbots have the potential to revolutionize many industries, yet the training data that powers them is rarely disclosed, leaving users unable to trace bias, propaganda, or misinformation back to its source.
5. Transparency about the sources of AI training data is crucial to building trust with users and avoiding legal challenges.
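Insight 3 follows directly from how document-level blocklist filtering behaves: any page containing a banned word is dropped wholesale, so benign pages about sexual health or LGBTQ identity disappear, while hateful pages that avoid the listed words pass through. Below is a minimal Python sketch of that mechanism; the two-word blocklist and the example documents are illustrative assumptions, though C4 is reported to filter against the much longer “List of Dirty, Naughty, Obscene, and Otherwise Bad Words.”

# Minimal sketch of document-level blocklist filtering, the mechanism
# behind insight 3. The tiny blocklist and sample documents below are
# illustrative stand-ins, not the actual C4 pipeline.
import re

BLOCKLIST = {"sex", "sexual"}  # stand-in for the real, much longer list

def keep(document: str) -> bool:
    """Keep a document only if none of its words is blocklisted."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return words.isdisjoint(BLOCKLIST)

docs = [
    "A clinic's guide to safer sex for LGBTQ teens.",  # benign, yet dropped
    "Join our movement to defend the white race.",     # troubling, yet kept
]
for doc in docs:
    print(keep(doc), "-", doc)

Because matching happens at the level of individual words, the filter has no notion of context or intent, which is why it over-removes and under-removes at the same time.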
Business Impact:
The Post’s analysis shows that the training data behind AI chatbots is often scraped from proprietary, personal, and sometimes offensive websites. This raises concerns about the use of copyrighted material in training data and about chatbots spreading bias, propaganda, and misinformation that users cannot trace back to an original source. Because companies already stress how hard it is to explain how chatbots make decisions, transparency about the sources of training data is crucial to building trust with users and avoiding legal challenges. Companies that deploy AI chatbots should understand the risks carried by the underlying training data and take steps to ensure it comes from reputable, reliable sources.
—
Read the full article:
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/