OpenAI’s GPT-4o likely trained on paywalled books, new research paper claims

OpenAI has been accused of likely training its GPT-4o model on paywalled material, without permission from the publisher.
Researchers at the AI Disclosures Project, a non-profit AI watchdog organisation founded in 2024, have published a study stating that OpenAI increasingly relied on paywalled books published by O’Reilly Media to train its GPT-4o model.
“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples,” read the research paper.
“GPT-4o [likely] recognises, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” the co-authors of the research paper added. There is no content licensing arrangement between OpenAI and O’Reilly Media, as per the research paper.
The fresh allegations detailed in the research paper come as the Microsoft-backed AI startup battles several lawsuits filed by many parties alleging that its training data practices amount to copyright infringement.
To determine whether copyrighted content was included in the training datasets used to develop GPT-4o, the researchers used a “membership inference attack” method called DE-COP.
This technique lets researchers test whether a large language model (LLM) can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text, according to a report by TechCrunch. If an LLM can make the distinction, it suggests that the AI model might have prior knowledge of the text from its training data.
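The idea behind this kind of test can be illustrated with a small Python sketch. It is a simplified illustration only, not the researchers' code: the function `quiz_accuracy`, the stand-in model `memorising_pick`, the `SEEN` set, and the sample passages are all hypothetical. Each quiz mixes one verbatim passage with paraphrases, and a model that picks the verbatim passage far more often than chance is taken as a sign that it may have seen the text before.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One quiz item: (verbatim passage, paraphrased versions of it).
Quiz = Tuple[str, List[str]]


def quiz_accuracy(
    items: Sequence[Quiz],
    pick: Callable[[List[str]], int],
    seed: int = 0,
) -> float:
    """Return the fraction of quizzes where `pick` selects the verbatim passage."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original] + list(paraphrases)
        rng.shuffle(options)  # hide the position of the verbatim passage
        if options[pick(options)] == original:
            correct += 1
    return correct / len(items)


# Hypothetical stand-in for a model: it "recognises" passages in SEEN.
SEEN = {"The quick brown fox jumps over the lazy dog."}


def memorising_pick(options: List[str]) -> int:
    for i, text in enumerate(options):
        if text in SEEN:
            return i
    return 0  # no recognition: fall back to an arbitrary fixed guess


seen_items: List[Quiz] = [
    (
        "The quick brown fox jumps over the lazy dog.",
        [
            "A fast brown fox leaps over a sleepy dog.",
            "The speedy fox, brown in colour, hops over the idle dog.",
            "Over the lazy dog jumps a quick fox with brown fur.",
        ],
    ),
]

unseen_items: List[Quiz] = [
    (
        "Pack my box with five dozen liquor jugs.",
        [
            "Fill my box with sixty jugs of liquor.",
            "Put five dozen liquor jugs into my box.",
            "My box should be packed with sixty liquor jugs.",
        ],
    ),
]

# With four options per quiz, pure guessing is right about 25% of the time.
print("seen passages:  ", quiz_accuracy(seen_items, memorising_pick))
print("unseen passages:", quiz_accuracy(unseen_items, memorising_pick))
```

In a real study, `pick` would wrap a call to the model under test, and accuracy would be compared across many excerpts and against a guessing baseline rather than read off a single quiz.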
The researchers focused on GPT-4o, GPT-3.5 Turbo, and other OpenAI models for their study. They estimated the probability that a particular excerpt had been included in a model’s training dataset, relying on 13,962 paragraph excerpts from 34 books published by O’Reilly Media.
Based on the findings, GPT-4o “recognised” more paywalled book content than GPT-3.5 Turbo and older OpenAI models. This was observed even after accounting for improvements in the capabilities of OpenAI’s newer models.
However, the paper notes limitations in the research methodology, such as the possibility that users fed paywalled book excerpts into ChatGPT as part of their prompts.
OpenAI and Google have lobbied the Trump administration to codify the training of AI models on copyrighted works under the fair use exception. Meanwhile, OpenAI has also struck licensing deals with news publishers, social networks, stock media libraries, and others to secure data for AI training purposes.
Furthermore, it has reportedly hired journalists to help fine-tune its models’ outputs.
© IE Online Media Services Pvt Ltd