NYT vs GPT

On December 27th, The New York Times filed a lawsuit against OpenAI, the company responsible for ChatGPT. Their dispute is fundamental to the creation of “large language models,” or “LLMs” (complex AI systems capable of mimicking human writing in a wide range of situations). Any such system is created by “training” a collection of statistical models and programs on a large body of text, usually culled from the internet.

OpenAI has used copyrighted materials in training its AIs, but claims that this qualifies as “fair use.” Fair use is an allowance within U.S. copyright law that permits certain uses of copyrighted works without permission. It is usually applied in contexts like education, journalism, and criticism; whether AI training should be similarly unfettered is controversial and as yet undecided. The four factors a court weighs in deciding a fair use question are:

1. The purpose and character of the use, including whether it is commercial and whether it is “transformative.”
2. The nature of the copyrighted work.
3. The amount and substantiality of the portion used relative to the whole work.
4. The effect of the use on the potential market for, or value of, the work.

Dr. Gabriel Ferrer, a Hendrix professor of Computer Science who specializes in AI, noted the complexity of the issue from a legal perspective, saying, “I think it comes down to what kind of output you expect from the algorithm, because ultimately, the issue isn’t so much about the inputs, it’s about the scope of reuse within the outputs.”

The copyrighted work in this case is the entire back catalog of the New York Times. The paper has been diligent about creating digital backups of its articles throughout its almost two-century history, preserving a wealth of information on a wide variety of topics. All of this data was used in training ChatGPT, the flagship product of OpenAI and the predominant cause of the company’s current valuation in the neighborhood of 100 billion dollars. OpenAI began as a nonprofit, and maintains some policies and ethics that reflect that structure, but has mostly shifted to a for-profit model over the last couple of years.

Common Crawl, the open-source web crawl that provides the most influential dataset in GPT’s training, counts the New York Times as its third-largest source of text, after Wikipedia and Google’s database of patents. Very little of that material is reproduced verbatim, however; OpenAI has responded to the NYT’s claims of “regurgitation” by calling it “a rare bug that we are working to drive to zero.” Some of it is, though: the filing features many examples of responses that copy NYT articles almost word for word. In most cases, LLMs are trained on so much data (essentially the entire internet) that even the massive NYT archive is a drop in the bucket, and the system usually cobbles together its answers to queries from far more than one source, similarly to a human brain.

This lawsuit is a critical point for both the New York Times and OpenAI. If the NYT wins, it wants all of its data removed from GPT’s training set, which would mean building a new model from scratch and leaving billions on the table. If OpenAI wins, it is easy to imagine a world where such easy access to information makes people far less likely to pay for subscriptions to the New York Times and other newspapers with similar business models, a shift that would hit the already-withering smaller institutions first.

Of course, humans have had access to massive wells of information culled from across the internet for the average Hendrix student’s entire lifetime, through search engines like Google. Nevertheless, GPT and other generative AI models have prompted far more fear of automation and other social ramifications than their forerunners. Dr. Ferrer said, “I think [LLMs] are feared more than Google because of, for lack of a better phrase, the way it stitches together search results into a larger narrative. In the past, if I was doing a research paper, I might use Google to find a bunch of sources. Then I would personally read them, apprehend them, and construct a summary. This seems to be able to bypass that stage.”

I asked Dr. Ferrer how people who sell information, whether journalists or AI developers, can generate money ethically. “Advertising seems to be a ubiquitous solution that isn’t, at least superficially, ethically compromised. There’s a little bit of it, though, with the tracking and storage of user data, which could be seen as a concern. The original vision of hypertext used micro-payments as the means of promoting information. You could say that medium.com is an example of trying to do this, where you have to pay a membership fee, which then gets apportioned based on which writers you read.”

Milo Strain, a student journalist for UCA’s Echo, also noted the complexity of the case, but ultimately said, “I think GPT shouldn’t fall under fair use, because part of what makes use fair is it being transformative. It does not seem like the large language models are really doing that. From what I’ve seen, they’re just spitting out the work that other humans have actually done.”

Ultimately, Dr. Ferrer said, “The technician in me wants to see OpenAI win, but my concern for civil liberties might want to see the New York Times win. I’m not sure which of those aspects of myself is dominant. We have a technical conceit where we sometimes don’t see the need for moral limitations on what we’re doing. ‘If we can do it, it’s okay’ is way too dominant a means of thinking. At the same time, we’re in a really fragile society that doesn’t really seem to have a moral center (by which I mean, different people have different notions of moral center). So, it has become really difficult to work out a society-wide consensus on what’s acceptable. I think that’s making it easier for pure technical utilitarians to do whatever they want. I have no idea what a good solution to that actually looks like: I’m just identifying the problem.”