There's a Difference between "Publicly Available" and "Free as in Speech"
A couple of weeks ago, the Wall Street Journal published a 10-minute interview with Mira Murati, CTO of OpenAI, in which Murati said that their new "Sora" video generator model was trained on "publicly available or licensed data" (emphasis added by Ed Zitron). Ed wrote a longer piece that is worth reading, but I want to focus on this one point and clarify what Murati and Altman would rather stay obscured: a piece of content being "publicly available" does not mean you can do anything you'd like with it.
I am not a lawyer—just an open source developer—and this doesn't constitute legal advice. It is an overview of why it is incorrect to conflate "publicly available" and "licensed." If you have specific questions related to copyright, find a real lawyer.
The License Clause of Every Terms of Service
Nearly every website that allows you to upload content you create will have something like this snippet from Instagram's Terms of Use:
you hereby grant to us a non-exclusive, royalty-free, transferable, sub-licensable, worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content (consistent with your privacy and application settings).
Sometimes these snippets cause a round of fear-mongering chain posts, but they are (mostly) essential to operating a service like Instagram or Substack or TikTok or any other service where you share something you created. That is because the instant you create a piece of text, code, art, sound recording, map, video, or more, you own the copyright on that piece of content (provided it is protectable). Copyright applies to the concrete ("fixed") expression of an idea, not to the idea itself; that is the domain of other areas of Intellectual Property (IP) law, like patents. That means that copyright doesn't apply to the concept of a selfie, but it applies to any specific selfie.
By default, you as the copyright holder have certain rights, enumerated in Title 17 of the United States Code, § 106 (17 USC § 106). Those rights include the right to copy, modify, perform, and distribute the work, and you can sell those rights or license them. These rights are exclusive: you are the only person with them. If someone takes your photo and shares it without your permission, they have infringed on your copyright, and you can sue them (though in the US you generally have to register the work with the Copyright Office before you can file suit, and registering late limits the damages you can recover).
(Software code is also protected by copyright! Instagram has to grant Apple and Google the right to copy and distribute their app, and you the right to have and use it.)
For Instagram et al. to take your photo and its caption (both protected works), resize the photo ("modify"), and show it to your followers ("distribute" and "publicly perform or display"), you need to grant them the right to do so. You could try to charge them, but they probably wouldn't pay up. They use most of the rights in the license clause simply to do what Instagram does: "copy" allows them to keep your photo and caption on more than one computer for backup and scale; "translate" and "create derivative works" allow them to provide that little "see translation" button. "Host," "use," and "run" aren't defined in 17 USC but can be seen as overarching categories that lump together the other defined rights.
Having a Copy is Not a License
Owning a copy of a book does not grant you the right to adapt that book into a movie. Such an adaptation would be a "derivative work" (17 USC § 101), and "create derivative works" is one of the exclusive rights of a copyright holder. Neither does watching a performance (including a projection or rendering of a movie or video) or listening to a recording of a song. None of those, in fact, grant you as the reader/watcher/listener any rights to the work at all. If you own a physical copy, you can sell that copy. That's it.
When you upload a photo to Instagram, you're granting Instagram a set of rights, but you are not granting any rights to anyone who views the photo. When you scroll through a timeline or watch a YouTube video, Instagram and YouTube are only exercising the rights they have from the author. (There are more complex terms of the license that allow for things like stitches and duets on TikTok or Instagram Reels.)
As a reader of this blog, you don't have any rights to the content. I, the copyright holder, am choosing to distribute it (or rather, grant a company permission to distribute it) but I retain all of my exclusive rights under copyright law (17 USC § 106).
Derivative Works are Intentionally Fuzzy
The right to create derivative works is exclusive to a copyright owner, but the definition of "derivative work" is not perfectly clear. Here it is, in full, from 17 USC § 101:
A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.
There are two important phrases here: "such as," which has its own definition to clarify that the following are examples, not a complete list; and "any other form."
You do not have the right to copy the text of this blog post and put it on your own website. What if you wrote your own blog post on the same topic? Would that qualify as a "form in which [the] work [is] ... adapted?" Probably not, but it's intentionally a little vague.
The main reason for this vagueness is that questions like this are judgment calls. The intent is for judges and juries to decide on a case-by-case basis. There are a variety of ways courts look for "substantial similarity." In general, if the exact words (or images or sounds) aren't literally copied, there are multiple analytical steps to perform, including looking at specific elements of the work and at the overall (for lack of a better word) vibe: would an ordinary person recognize this as a copy?
(Another reason is to avoid needing to update the definition constantly as technology changes.)
Fair Use is Intentionally Fuzzy
"Fair use" is a limitation on a copyright holder's exclusive rights that permits certain types of limited copying (17 USC § 107). Quoting a short passage of a book in a review, or using a movie poster or still on a news segment talking about the movie, or parodying another work are all the sorts of activities that are enabled by fair use. Much like "derivative work," the exact definitions are intentionally a little fuzzy. § 107 lists four factors to weigh when deciding whether a possibly infringing use is fair.
In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The intent here is, again, that in cases at the boundary of what is acceptable, a judge or jury will weigh these factors and decide if the use is "fair" or not.
Is it OK to Train a Model on "Available" Data?
I don't know—and I am not going to try to answer.
Judicial precedent is actively evolving. Is the model itself a derivative work? Is the model protected, or is it a "fact" of the training set? Is training the model fair use? Are the outputs of generative AI protected? These are all questions that either have come or may reasonably come before courts soon.
"Publicly Available" is not "Licensed"
My point with this whole post is to underscore the difference between "publicly available" content and "licensed" content. Murati and Altman and the rest of OpenAI have a material interest in blurring that line. And they have the open legal questions and legal resources to defend their behavior at length. For the rest of us, we should be careful not to believe what OpenAI would like people to think is true: that if something is available to the public it must be fine to do whatever you want with it.
Exclusive rights under copyright do not evaporate because a work is published, even if it's published to social media. If you're considering doing something with published text, video, sound, etc., that you don't own, talk to an IP lawyer first.