NYT's Lawsuit Against OpenAI Leads to OpenAI's Dataset Disclosure
A new development in the NYT vs. OpenAI lawsuit: the court grants access to OpenAI's AI training data. Learn what this means for the future of AI and journalism.
Hey there, fellow tech people and AI news lovers! Welcome back to another episode of Tech Trendsetters – the warm place where we discuss the future of technology, science, and artificial intelligence. You probably know we don't cover everything happening in the AI world here, only the major news with genuinely interesting context. And The New York Times lawsuit against Microsoft and OpenAI is one such event.
So today we'll look into it, and I'll share some thoughts on why I think this is important – why it's one of those events that can shape the future of AI. After all, we live in a legal world, and these courtroom battles can have far-reaching consequences for the tech we use every day.
The Lawsuit: What's Really Going On?
For some reason, all the industry’s attention has been focused on the New York Times lawsuit against Microsoft and OpenAI. It essentially follows a template similar to previous class-action suits with the same argument: artists, journalists, and illustrators are creating history and saving the world from misinformation, while OpenAI is building its business on stealing other people's intellectual property. Cue the world's smallest violin, right?
But let's not turn this into a black and white story of virtuous journalists versus evil AI corporations. Without downplaying the legitimacy of NYT's interests or turning OpenAI into heroes (to be ClosedAI very soon?), it's important to lower the temperature of the rhetoric and understand what exactly the NYT wants and where we currently stand.
The Timeline: From Filing to Disclosure
December 27, 2023: The New York Times files its lawsuit against OpenAI and Microsoft. This marks the beginning of a significant legal challenge in the AI and media landscape.
January 8, 2024: OpenAI responds with a blog post, emphasizing their support of journalism and partnerships with news organizations. They state that the "NYT lawsuit is without merit," directly countering the allegations made against them.
Late February 2024: OpenAI files a motion to dismiss certain aspects of the lawsuit. About a week later, Microsoft files its own motion to dismiss. These motions do not address the entirety of the case but focus on what OpenAI and Microsoft describe as "ancillary issues."
October 4, 2024: Judge Ona T. Wang of the Southern District of New York approves a stipulation and order regarding the inspection protocol for OpenAI's training data.
This last development is the reason for this episode and a crucial turning point, as it establishes the terms under which The New York Times' legal team can examine the data used to train OpenAI's models.
The NYT's Chess Moves
So, what has The New York Times (which used to be an interesting newspaper before it devolved into a political leaflet) been up to? Let's break it down:
Earlier in 2023, they updated their content usage policies. The new Terms of Service now explicitly prohibit using NYT materials for algorithm training – and they also moved to block OpenAI's crawler outright (see the robots.txt snippet after this list).
In a plot twist worthy of a soap opera, the NYT signed a $100 million deal with Google, allowing the search engine to host NYT content on some of its platforms for the next three years. If you can't beat 'em, join 'em, right?
Interestingly, the NYT chose not to join a coalition of media companies that are open to negotiating with tech firms on the use of their content for AI training. Playing hard to get, are we, NYT?
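On that first point, enforcement didn't stop at legal language: around the same time in 2023, the NYT also blocked OpenAI's GPTBot crawler at the site level. For the curious, a typical robots.txt block for OpenAI's crawler looks like this (an illustrative snippet, not a copy of the live nytimes.com file):

```
# Illustrative robots.txt entry blocking OpenAI's web crawler.
# GPTBot is the user-agent token OpenAI's crawler identifies itself with.
User-agent: GPTBot
Disallow: /
```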
Show Me the Money!
Now, here's where it gets juicy.
The NYT argues that to produce quality journalism, you need money, and algorithms take money away from newspapers.
Solid statement, but let's put on our thinking caps for a moment: what segment of the market are algorithms really taking from newspapers?
Last I checked, ChatGPT wasn't breaking news about the latest political scandal. There's a time lag between what newspapers publish and what content can be generated by ChatGPT. So unless you're using AI to write your history essays (don't do that, kids), it's not exactly a head-to-head competition with up-to-date original articles. At best, it's providing information comparable to a regular search engine.
The Legal Nitty-Gritty
While it's hard to deny that NYT content was likely used for training (unless the AI has a subscription?), the NYT still can't claim that all of its rights have been violated. They need to show specific instances, which they've partially done in the lawsuit.
Currently, the NYT points to nearly complete reproductions of 2019 and 2012 articles by GPT-4, including some that were behind a paywall. Additionally, September–October 2023 articles were appearing in Bing's results. But let's be real: nothing would have stopped these same articles from being scraped and republished elsewhere. Furthermore, there are plenty of free resources where you can access content that is typically hidden behind paywalls. It's almost like... the internet exists, and you don't like it?
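For the technically curious: "nearly complete reproduction" is something you can actually quantify. Here's a minimal sketch of one common approach – word n-gram overlap between an original article and a model's output. This is purely illustrative; it is not the NYT legal team's actual methodology, and the sample strings are placeholders:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(original: str, generated: str, n: int = 8) -> float:
    """Fraction of the original's n-grams reproduced verbatim in the output.

    Long shared n-grams (8+ words) rarely occur by chance, so a high
    score suggests memorized text rather than merely similar wording.
    """
    orig = ngrams(original, n)
    if not orig:
        return 0.0
    return len(orig & ngrams(generated, n)) / len(orig)

# Placeholder texts; in a real analysis these would be the published
# article and the model's completion.
original = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
generated = "the quick brown fox jumps over the lazy dog near the old mill"
print(f"overlap: {verbatim_overlap(original, generated, n=5):.2f}")
```

A score near 1.0 on a full-length article would be the kind of "near-complete reproduction" the complaint describes.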
False Attribution
I'm surprised that little attention has been given to false attribution – cases where hallucinations resulted in fake articles being falsely attributed to NYT authors. This potentially causes greater reputational damage to the company than reprinting old articles.
Think about it: what's worse for The New York Times? An AI regurgitating an old article about the 2012 election, or confidently spouting nonsense about current events and slapping a respected NYT journalist's name on it?
What's particularly interesting is that it's unclear whether the NYT views this as unfair competition or trademark infringement (dilution). Are they more worried about protecting their content, or their brand? It's a nuanced issue that deserves more attention in this legal battle.
The fact that this aspect isn't front and center in the discussions surrounding the lawsuit really puzzles me. The fixation on copying old articles misses the bigger picture, and it suggests the lawsuit is more about going after another company than about protecting real value. Either way, it seems like an oversight in the broader conversation about AI and journalism.
The Court's Decision – Let's See What's Inside OpenAI
Just a few days ago, the Southern District of New York court approved a stipulation and order agreed between The New York Times and OpenAI. This agreement sets out a data inspection protocol, and it's a big deal. Why? Because it means NYT's lawyers get to peek behind the curtain of OpenAI's dataset!
For anyone who still doesn’t get it: data is the most valuable asset any company has. It drives decision-making, product development, competitive advantage, and even revenue streams.
The security measures during this procedure are intense (I bet some of you with a technical background would find this fun):
Training Data will be available at OpenAI's San Francisco office or another secure location of their choosing. It's accessible on business days from 8:30 AM to 5 PM, but the parties will be "reasonable" in accommodating "reasonable" requests for other times.
The Inspecting Party must provide five days' notice before any inspection.
All Training Data is designated "HIGHLY CONFIDENTIAL – ATTORNEYS' EYES ONLY." I just love it when we put CONFIDENTIAL labels on CONFIDENTIAL things – really drives the point home, doesn’t it?
The data will live on a secured laptop in a secure room, connected only to a host computer containing the Training Data. A README file will be provided with a directory of the Training Data and brief descriptions.
OpenAI will provide tools for viewing and searching the data. They will "reasonably cooperate" with Plaintiffs to address any technical concerns.
OpenAI will provide login information and have someone available for technical support. Their outside counsel will be available to "make reasonable efforts" to resolve issues.
The Inspecting Party can request additional software tools, but OpenAI must approve them. There is a detailed protocol for how to install analysis and search programs (with “reasonably” appearing three times).
No phones, cameras, or any recording devices are allowed in the secure room, except at the end of each day for copying notes under supervision.
Notes may be taken on a separate, dedicated computer, but no copying of actual Training Data. The inspectors can record statistics and file information (see the sketch after this list for what that might look like in practice).
OpenAI may visually monitor the inspection from outside the room, but only to ensure no unauthorized recording or transmission.
To obtain printouts, the Inspecting Party must create a "Print Request" directory – OpenAI will “reasonably” respond within seven business days.
All persons reviewing the Training Data must swear secrecy beyond standard attorney-client confidentiality (which, as far as I know, is already treated as the holy grail of legal protection). They must sign a special Non-Disclosure Agreement, affirming they are worthy of viewing HIGHLY CONFIDENTIAL – ATTORNEYS' EYES ONLY information (blood signatures not explicitly required, but who knows?).
At the end of each day, all materials must be removed from the room. OpenAI isn't responsible for any items left behind.
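Since the inspecting lawyers may record statistics and file information but never copy the Training Data itself, the day-to-day workflow in that secure room probably looks something like this – a purely hypothetical Python sketch (the order doesn't specify what tooling OpenAI will actually provide) of searching the dataset for a phrase and logging only permitted metadata:

```python
from pathlib import Path

# Hypothetical mount point on the secured laptop; the order only says
# a host computer with the Training Data and a README will be provided.
DATA_ROOT = Path("/mnt/training_data")

def search_and_summarize(phrase: str):
    """Search data files for a phrase; record only statistics and file
    info, since copying actual Training Data out of the room is banned."""
    total_matches = 0
    files_hit = []
    for path in DATA_ROOT.rglob("*.txt"):
        text = path.read_text(errors="ignore")
        count = text.lower().count(phrase.lower())
        if count:
            total_matches += count
            # File name, size, and match count are metadata, not content.
            files_hit.append((path.name, path.stat().st_size, count))
    return total_matches, files_hit

total, hits = search_and_summarize("the new york times")
print(f"{total} occurrences across {len(hits)} files")
```

Everything this script writes down – file names, sizes, match counts – stays on the note-taking computer; the underlying text itself never leaves the room.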
The Magic Word: "Reasonable"
One last thing to note: the word "reasonable" appears strikingly often in the court's order. It's crucial because it's intended to prevent trickery. "Reasonable access" means access must be arranged in a way that a normal, sensible person would consider fair. Because if there's one thing lawyers are known for, it's being reasonable, right?
For instance, "reasonable access" means you can't just offer NYT's lawyers access between 6:15 am and 6:22 am on odd-numbered days starting from the 29th of February. It must be arranged in a way that a reasonable (normal, sensible) person would consider it reasonable. So no, OpenAI can't set up the secure room in an underwater cave accessible only by trained dolphins – as cool as that would be.
What This Means for the Future
This decision could be the beginning of something new, and potentially chaotic. OpenAI argues that you can't make ChatGPT produce articles without long and specific prompting. And we still don't know how exactly the newspaper's lawyers managed to get the text. Who knows what tricks they used?
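The most commonly discussed hypothesis is prefix prompting: feed the model the opening of an article verbatim and ask it to continue. Here's a minimal sketch of what such an attempt might look like, assuming the official openai Python client – the prefix and prompt wording are placeholders, not the NYT team's actual method:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: in the alleged scenario, this would be the verbatim
# opening sentences of a (possibly paywalled) article.
article_prefix = (
    "WASHINGTON — The first few sentences of a real article "
    "would go here, copied word for word as the prompt."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Continue this article exactly as originally "
                   f"published:\n\n{article_prefix}",
    }],
)
print(response.choices[0].message.content)
```

If the model's continuation matches the published article nearly word for word, that's the memorization evidence the complaint leans on – which is exactly why OpenAI emphasizes how long and specific such prompting has to be.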
And here's where it gets more interesting: if this legal path opens up, anyone could potentially access the datasets of any neural network – provided they can make a preliminary showing that their copyright has been blatantly violated by the AI's creators. It's like a treasure hunt, but instead of X marking the spot, you need a team of lawyers and some very creative prompting skills. Sounds like another potential startup in legal tech, doesn't it?
Now, let's take a step back and analyze the bigger picture. On one hand, we have technological progress pushing the boundaries of what's possible with AI. On the other, we have copyright protection for original human ideas. It's a delicate balance, I agree. After all, humans learn languages by absorbing a multitude of texts and voices, and artists develop their skills by studying and being inspired by countless works before them. Are we holding AI to a different standard than we hold ourselves?
But let's not kid ourselves here. For big corporations like The New York Times and OpenAI, this highbrow debate about the nature of learning and creativity is just window dressing. At its core, this is a fight for money, plain and simple. The NYT sees its content as a goldmine and wants to make sure it's the one holding the pickaxe. OpenAI, on the other hand, views that same content as just another vein in a much larger mine of human knowledge. It's less about preserving the sanctity of journalism or advancing the frontiers of AI, and more about who gets to cash in on the information economy.
As for me, I can't help but laugh at the irony. Here we are, with AI capable of incredible feats, and we're arguing over who owns what words. But that's why we do this show, right? To make sense of this wild, wacky world where technology and reality collide. So, keep your wits sharp and your common sense even sharper. And maybe it's time to read those terms of service. Until next time!
🔎 Explore more: