Fair Use to the Rescue?
The many AI copyright cases working their way through the courts are unlikely to reach final resolution anytime soon. My current reading of the fair use tea leaves is that the AI companies are likely to lose on a number of these issues. However, for those who are interested, OpenAI submitted a response to the Copyright Office’s request for comments on Artificial Intelligence and Copyright. It is well written and a great summary of OpenAI’s view on the topic.
Replicating copyright-protected content on the output side seems like a clear case of infringement, and fair use affirmative defenses seem less likely to succeed there (but see OpenAI’s response). The input side is more interesting in this regard: there is clearly copying going on, but there are also much more compelling fair use arguments.
I’d like to address one of the oft-repeated arguments: that ingestion is a “non-expressive” use. Versions of this argument have been made by many commentators and by some of the AI companies themselves. Based on my understanding of how LLMs are trained, these arguments fall short, and many of them seem to assume their own conclusion. Under the idea-expression dichotomy, ideas are not copyrightable, but expression is. The argument made by many is that ingesting copyrighted works is non-expressive because these models simply take in the “ideas” and the statistical correlations between words or tokens in the works, without copying the expression. But isn’t the correlation between words in a particular work based, at least in part, on the author’s expressive choices? The assertion that LLMs are not making use of this expression seems unproven to me.
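To see why, consider how next-token training examples are constructed. The sketch below is a minimal, hypothetical illustration (the sentence, the whitespace tokenizer, and the pair construction are simplifications of mine, not any company’s actual pipeline): every training example is derived from the author’s verbatim word order, so the “statistical correlations” being learned are correlations of that particular expression.

```python
# Minimal sketch of next-token training pairs, assuming a standard
# autoregressive setup. The sentence and the whitespace "tokenizer"
# are illustrative simplifications, not any vendor's actual pipeline.

# An author's exact word order: the expression, not just the idea.
text = "It was the best of times, it was the worst of times"
tokens = text.split()  # real systems use subword tokenizers (e.g., BPE)

# Each training example is (context, next token), built directly from
# the verbatim sequence. Change the author's word order and every
# example changes: the statistics being learned are statistics of
# this particular expression.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples[:3]:
    print(f"context={context!r} -> predict {target!r}")
```

Swap any two of the author’s words and the training set itself changes; whatever else the model learns, it is fit to the expression as written.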
Furthermore, one strong indication that these models are in fact learning authors’ expressive choices is that they have to be filtered at the output to prevent them from producing exact copies of the works they ingested. This is a clear sign that the authors’ creative choices are in the models themselves; suppressing exact output copies merely masks the easiest way of seeing that this is true.
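To make the mechanism concrete, here is a hedged sketch of one plausible form such a filter could take: an n-gram index built over the ingested works, used to flag generations that reproduce long verbatim spans. The corpus, the 8-token threshold, and the function names are hypothetical illustrations of mine; deployed filters are undoubtedly more elaborate.

```python
# Illustrative sketch of output-side filtering: flag generations that
# reproduce long verbatim spans of ingested works. The n-gram index,
# the 8-token threshold, and the tiny corpus are assumptions.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus_texts, n=8):
    """Collect every n-gram that appears verbatim in the ingested works."""
    index = set()
    for text in corpus_texts:
        index |= ngrams(text, n)
    return index

def contains_verbatim_copy(output_text, index, n=8):
    """True if any length-n span of the output matches the corpus."""
    return any(g in index for g in ngrams(output_text, n))

# A model can emit such spans at all only because the works' exact
# wording is reflected in its weights; the filter masks that fact
# rather than removing it.
corpus = ["It was the best of times, it was the worst of times"]
index = build_index(corpus)
print(contains_verbatim_copy(
    "he wrote: it was the best of times, it was the worst of times",
    index))  # -> True
```

If the works’ exact wording were not represented in the model, there would be nothing for a filter like this to catch.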
Leveraging the above discussion, I also think a “bundle of sticks” type of argument could be made that ingesting an author’s creative style into a model cannot be fair use under current copyright law. In this argument, all of the creative content of the work is taken into the model precisely because it is that creative arrangement of words that the AI companies want in order to make their models better. This would be a very interesting factual issue to test: I suspect that training models on just “ideas” would make them perform more poorly on many tasks (a toy version of such a test is sketched below). Admittedly, this is a continuum, and the line between “ideas” and “expression” can be blurry, but I don’t think these models make that distinction. They take in ideas and expression alike because they have no way to distinguish between them.
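Here is a purely hypothetical sketch of that test: train the same simple model once on an author’s exact word order and once on the same words with the order destroyed, then compare how well each predicts held-out text. A toy bigram model stands in for an LLM, and the texts, seed, and smoothing are illustrative assumptions of mine, not a real study; the point is only that word order (expression) carries predictive signal beyond word frequencies (“ideas”).

```python
# Hypothetical ablation sketch: does a model trained on the author's
# exact word order (expression) predict held-out text better than one
# trained on the same words with the order destroyed? A toy bigram
# model stands in for an LLM; all details here are illustrative.
import math
import random
from collections import Counter, defaultdict

def train_bigram(words):
    counts = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def avg_log_likelihood(counts, words, vocab_size, alpha=1.0):
    """Mean log-probability per token under add-alpha smoothing."""
    total = 0.0
    for a, b in zip(words, words[1:]):
        c = counts[a]
        total += math.log((c[b] + alpha) / (sum(c.values()) + alpha * vocab_size))
    return total / (len(words) - 1)

train = ("it was the best of times it was the worst of times "
         "it was the age of wisdom it was the age of foolishness").split()
held_out = "it was the age of wisdom".split()
vocab_size = len(set(train))

random.seed(0)
shuffled = train[:]
random.shuffle(shuffled)  # same word counts ("ideas"), order ("expression") destroyed

for name, data in [("original order", train), ("shuffled order", shuffled)]:
    model = train_bigram(data)
    print(name, round(avg_log_likelihood(model, held_out, vocab_size), 3))
```

On this toy setup the order-preserving model scores markedly higher, which is just a small-scale echo of my suspicion above; whether it holds at LLM scale is exactly the factual question a court might want answered.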
I really have very little doubt that AI companies can figure out (and maybe already are figuring out) how to train their models while preparing for the possibility that they may lose these cases. I have already seen some great discussions of technical and legal fixes that could work for both sides.