OpenAI's recent moves to trademark its "o1" reasoning models highlight both technological advancement and the contradictions at the heart of AI development and intellectual property strategy.

Unlike traditional large language models (LLMs), reasoning models 'fact-check' their responses, spending additional time considering a question before generating an answer. OpenAI calls the process 'simulated reasoning,' which potentially offers more reliable and considered responses.

On November 27, OpenAI filed a trademark application with the U.S. Patent and Trademark Office for "OpenAI o1." The move follows an earlier filing in Jamaica in May of this year, months before o1 was publicly announced. OpenAI, like any sensible business, is getting its intellectual property house in order long before public announcement and launch.

While actively seeking legal protection for its innovations, OpenAI has simultaneously built its technology using vast amounts of data, including copyrighted materials. So has every other LLM. The debate of fair use vs. theft will no doubt rage on.

In a statement to the British Parliament back in January, OpenAI acknowledged that "it would be impossible to train today's leading AI models without using copyrighted materials." The company further explained that limiting training data to public domain materials "would not provide AI systems that meet the needs of today's citizens." I don't know what the criterion for 'citizen' was back in January, or whether the majority of the world (for 99.1%* of people are citizens of somewhere) knew what an LLM was or how it could benefit them. I'm still not convinced as I write this in the ember part of the year. But that's an aside to the main ongoing 'heated discussions.'

OpenAI's approach to data collection has attracted numerous legal challenges, most notably from The New York Times, which sued OpenAI and Microsoft for "massive copyright infringement" of its work. Additional lawsuits have come from collectives of authors, media companies, and publishers.

It's not just OpenAI getting caught up in these 'heated discussions'. Novelist Christopher Farnsworth filed a class-action lawsuit against Meta, accusing the company of using pirated copies of his books to train its Llama AI model without permission. The lawsuit alleges that Meta used a dataset called "Books3" containing nearly 200,000 pirated books.

While OpenAI defends its practices by claiming "fair use," it has simultaneously fought to prevent the examination of its training data. And the unfortunate accidental deletion of potential evidence in the NY Times case probably hasn't helped matters.

As the intellectual property game of table tennis continues, two distinct approaches are emerging in the publishing industry. Some are choosing a collaborative path, with OpenAI establishing content licensing partnerships with media outlets like the Financial Times, Axel Springer, Le Monde, and Prisa.

Meanwhile, others are taking protective measures. Penguin Random House, the world's largest trade publisher, has amended its copyright wording across all imprints globally to explicitly prohibit the use of its books for AI training purposes: "No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems."

Governments are also grappling with these challenges. In the United Kingdom, the Minister for AI and IP, Viscount Camrose, has suggested that modifications to UK copyright law will likely be part of the solution. A Commons Culture, Media and Sport Committee report has called on the UK government to "ensure that creators have proper mechanisms to enforce their consent and receive fair compensation when their works are used by AI systems."

In Germany, the Hamburg Regional Court provided one possible approach in a landmark decision, dismissing a copyright infringement case against LAION. The court ruled that LAION's use of images in its dataset was protected under German copyright law, which allows text and data mining for scientific purposes.

Amid this intellectual property debate, competitors are not standing still. On the same day OpenAI filed its trademark application for o1, Alibaba released QwQ-32B-Preview, its own reasoning model. Unlike OpenAI's models, QwQ-32B-Preview is "openly" available under an Apache 2.0 license for commercial applications.

The apparent contradiction of companies built on fair use doctrine now wielding copyright and IP laws to protect themselves illustrates the industry's challenges.

If the lawsuit approach wins, large language models may become increasingly outdated and less useful as access to training data is restricted. If the partnership approach wins out, it could open substantial revenue streams for a while, but too many exclusive partnership arrangements would eventually lead to the same problem as the lawsuit approach.

What remains clear is that balancing innovation, protection, and fair compensation presents an awfully large issue for the LLM Famous Five, and indeed others with deep pockets who may enter the fray. It'll keep the lawyers happy though.
