OpenAI, Google, Meta, and Anthropic all rely heavily on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, although these companies have regularly downplayed their use of such copyrighted content, according to new research published this week by online publishing giant Ziff Davis.
Ziff Davis owns CNET, as well as a host of other brands, including IGN, PCMag, Mashable and Everyday Health.
A paper detailing the research, authored by George Wukoson, Ziff Davis’ lead AI advocate, and Chief Technology Officer Joey Fortuna, reports that AI companies deliberately filtered out low-quality content in favor of high-quality, human-made content to train their models. Given that AI companies want their models to perform well, it makes sense that they favor quality content in their training data. AI companies used websites’ domain authority, or essentially their Google search rankings, to make these distinctions. In general, sources that rank higher on Google tend to be of higher quality and credibility.
The companies behind popular AI chatbots like ChatGPT and Gemini have been secretive about where they’re getting the information that powers the responses the bots are giving you. This is not helpful for consumers, who do not get visibility into the sources, their credibility, and whether the training data may be biased or perpetuate harmful stereotypes.
But it’s also a major point of contention with publishers, who say AI companies are essentially pirating the copyrighted work they own, without permission or compensation. Although OpenAI has licensed content from several publishers as it transitions from a non-profit to a for-profit company, other media companies are suing ChatGPT’s creator for copyright infringement.
“Leading LLM developers no longer disclose their training data like they used to. They are now more commercial and less transparent,” Wukoson and Fortuna write.
OpenAI, Google, Meta and Anthropic did not immediately respond to requests for comment.
Publishers including The New York Times have sued Microsoft and OpenAI for copyright infringement, while Wall Street Journal and New York Post publisher Dow Jones is suing Perplexity, another generative AI startup, on similar grounds.
Big Tech has reaped tremendous rewards amid the AI revolution. Google is currently valued at around $2.2 trillion and Meta is valued at around $1.5 trillion, due in part to their work with generative AI. Investors currently value startups OpenAI and Anthropic at $157 billion and $40 billion, respectively. Meanwhile, news publishers have been forced into waves of layoffs in recent years. They are struggling in a highly competitive online media environment, trying to cut through the noise of online search, AI-generated “disruption” and social media to find an audience.
Meta CEO Mark Zuckerberg said that creators and publishers “overestimate the value of their specific content,” in an interview with The Verge earlier this year.
Several AI companies have nonetheless signed licensing agreements with publishers to feed their LLMs with up-to-date news articles. OpenAI signed deals with the Financial Times, DotDash Meredith, Vox and others earlier this year. Meta and Microsoft have also cut deals with publishers. Ziff Davis has not signed a similar agreement.
Based on an analysis of disclosures made by AI companies for their older models, Wukoson and Fortuna found that URLs from top publishers such as Axel Springer (Business Insider, Politico), Future PLC (TechRadar, Tom’s Guide), Hearst (San Francisco Chronicle, Men’s Health), News Corp (The Wall Street Journal), The New York Times Company, The Washington Post, and others accounted for 12.04% of the training data, at least for the dataset OpenWebText2. OpenWebText2 was used to train GPT-3, which is the underlying technology for ChatGPT, although the latest version of ChatGPT is a separate model that is not built directly on top of GPT-3.
None of OpenAI, Google, Anthropic or Meta has disclosed the training data used to train its latest models.
Each of the several trends discussed in the research paper “reflects decisions made by LLM companies to prioritize high-quality Internet text datasets in LLM training, resulting in revolutionary technological advances that bring great value for those companies,” Wukoson and Fortuna write.