This is excellent, because hopefully this goes to court (please!!!) and a judge can rule and tell us what the law actually is. There's a ton of legal LARPing that people in general like to do and it's all pretty meaningless to me.
A lawyer's opinion is much better, but ultimately not where the real value lies, since you'll always find lawyers on both sides of unruled cases; otherwise, how would you get plaintiffs vs. defendants? :) Unless the lawyer brings up actual case law that sets pretty clear precedent, of course!
We need this kind of stuff to go to court and get a true judgement. Only then is the law actually clear.
I hope this can be it. If not this, I hope something else soon. I don't really care which way it falls (I mean, I have opinions of course, but my opinion doesn't matter in the face of the law), I just want to be given guidance on what is/isn't legal.
(Also note, this isn't a moral argument. What is/isn't moral isn't simply what the law is. This is just a legal argument. And... addendum number two, I suppose there are philosophical arguments that morals are defined by laws but I disagree with that. In any case, either opinion doesn't matter, what I'm looking for is law.)
My hunch is still that people will not want to bring it to court, given the significant implications and potential consequences. So far, my impression is that ambiguity in the interim is better for everybody with vested interests.
I would love stuff like this to go to court though.
It's clearly a copyright violation, the way taking a blurry photo of a painting is. But if it's not, then can we relicense proprietary code this way? Can the AI finish ReactOS?
The stance of the industry right now seems to be that AI output is not subject to copyright even if it's an almost verbatim copy of copyrighted works. We don't know how much of the code it produces is identical to existing code and nobody cares to check. So I don't see why this should be any different?
The whole thing should be illegal on moral grounds, but the next best thing would be for the whole thing to be illegal on copyright grounds.
even if it's an almost verbatim copy of copyrighted works
This isn't true, and the people in the industry saying it don't actually believe it, they're just trying to make a legal smokescreen. If you get copilot to hallucinate a 1:1 copy of a chunk of Quake, and it does it without the correct license and attributions, and Quake was in its training data, then you just made it do copyright infringement.
A whole lot of the world's software is in the training data. When Claude writes hundreds of thousands of lines of code for you, how do you know that none of it is an almost verbatim replica found in its training data? Do you check? Can you check?
I meant: "the stance (aka position held publicly, for political convenience) of the industry is that AI output can't be a copyright violation".
You meant: "the stance (aka privately held belief) of industry leaders is that AI output can be a copyright violation, even if they don't admit it publicly".
The two are compatible and I apologize for taking a hostile tone.
The training set of LLMs is so vast. You could probably run some search over them (if you can somehow acquire it in the first place - can you?) to find cases where the output was copied verbatim, but if the output has been even slightly altered that won't really work anymore.
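To make that limitation concrete, here is a toy sketch of the kind of search being described: matching token n-gram "shingles" between a model output and a known source. Everything here (function names, the snippets compared) is invented for illustration; real deduplication over a training corpus would need hashing and indexing at enormous scale, but the failure mode is the same: exact copies light up, while even a light rewording breaks nearly every shingle.

```python
# Toy sketch: flag near-verbatim overlap between a candidate text and a
# known source using token n-gram "shingles". Exact copies share every
# shingle; renaming variables defeats the check, which is exactly the
# weakness described above. All names here are made up for illustration.
import re

def shingles(text, n=5):
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(candidate, source, n=5):
    a, b = shingles(candidate, n), shingles(source, n)
    # fraction of the candidate's shingles that also appear in the source
    return len(a & b) / max(len(a), 1)

original = "int clamp(int x, int lo, int hi) { return x < lo ? lo : x > hi ? hi : x; }"
verbatim = original
reworded = "int clip(int v, int low, int high) { if (v < low) return low; return v > high ? high : v; }"

print(overlap(verbatim, original))   # 1.0: an exact copy shares every shingle
print(overlap(reworded, original))   # ~0: renamed identifiers defeat the check
```

A smarter fingerprinting scheme (winnowing, normalized identifiers) narrows the gap a bit, but any scheme that tolerates paraphrase starts producing false positives on independently written code, which is the crux of the problem.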
What about those of us who feel current copyright laws are immoral? Are the moral grounds based on the supposition that current copyright laws are moral? At least in the US, our copyright laws are absolutely not moral by any standard I'd use to measure. They are driven by greed and no longer serve the original stated purpose of copyright. I suppose it depends on whether you mean "moral" to be "lawful" or "good".
make no mistake. regardless of the outcome of this, copyright laws will still be applicable to you and me. whatever loopholes are found are only going to benefit those with the most capital
Sure, which is why I say the current copyright laws in the US aren't moral. If they are only to the benefit of those with the most capital, then we need to get rid of them. I'm not saying that is likely to happen, but I am saying we shouldn't let the rich and powerful dictate what is right and wrong or good and bad.
Those of us who recognize that current copyright laws are immoral (myself among them) still typically recognize that there's a need for something to protect some form of intellectual property. If I as an individual compose and produce a song, and Warner Brothers wants to use that song as the theme song for an upcoming blockbuster, I want some say in what the terms are and I want to get something back from it. If I write a piece of software, and Google wants to build a billion dollar business on top of it, I want to have a say in that.
That’s not clear at all. You can trivially make a clean room reimplementation of libraries these days with AI. The only argument against it hinges on “but the LLM was probably trained on it” and I think this doesn’t hold a lot of water because they usually come up with other implementations.
I have done this a few times now and I’m fairly sure that these reimplementations would hold up in court if given a chance. There is a separate question of if LLM generated code can be copyrighted but that’s an independent question. It might weaken copyright and thus copyleft but that might also be a good thing.
It might weaken copyright and thus copyleft but that might also be a good thing.
This seems like it would only really benefit proprietary software, and harm FOSS. Microsoft could take any GPL-licensed library, "rewrite" it using an LLM, and then use it freely. On the other hand, I couldn't do the same with e.g. the Windows source code, because Microsoft doesn't publish it.
Humans have been clean-room reverse engineering compiled binaries for decades; in fact, it's what clean-room RE was invented for. That's not to say it's equally difficult (it's definitely harder), but it's where this whole idea started in the first place, and it's still done. Similarly, the reason projects like WINE have a zero-internal-knowledge-of-Windows-code policy for contributions is to play it maximally safe, not because looking at machine code poisons the legal state of your brain (it doesn't).
If you dump a windows binary into Ghidra, have a future LLM clean up the output pseudo-C code into real C, and then try to automatically clean room RE it, you've fixed the "asymmetry" problem and ended up in an equally legally dubious place (rather than more dubious) vs proprietary shops stealing FOSS code with LLM-driven pseudo-clean-room-RE.
My point is that now it's effectively free for companies to use GPL code without submitting to the terms of the license; before this they would need to actually spend resources on a clean room rewrite - which would be an amount of effort comparable to just writing the code from scratch.
In other words, companies just weren't able to benefit from GPL-licensed code unless they used open licenses for their own code too.
If that proprietary code gets released, which the companies would no longer have to do.
Even if it were (or if it would be reversed from the binary)... this still feels bad for FOSS as a whole.
A lot of FOSS maintainers do their work entirely for free, in their spare time - and then their work ends up foundational to multibillion corporations, while they continue not to see a dime for their work. That's obviously fucked up. Licenses such as AGPL have been one solution to this; they've made dual licensing a viable business model.
Except, well, apparently now that went out of the window. If using LLMs to launder licenses is legal, then any developer whose work is published online[1] now has no recourse against it being exploited by large corporations.
edit: I agree with the broad idea of "we should be able to freely reuse each other's code", but I also think it'd be great for everyone (even developers) to have free food and shelter. Alas, we don't live in a utopia, and copyright (along with copyleft licenses) provides developers with some leverage against that inequality.
[1] Notice how I'm not even saying "FOSS developers" - it seems that the license of the original code is irrelevant, so this could be done to code under restrictive licenses.
I couldn't do the same with e.g. the Windows source code, because Microsoft doesn't publish it.
Isn't the Windows source code floating around enough places that it's not too hard to find if you look? And then you wouldn't need to openly publish it yourself, just privately feed it to an LLM to rewrite it.
First off, I half expect this to go the way of mechanical licenses. For example, there could be two distinct copyright elements: the API and the implementation. This might already be true, as both fall under copyright.
Second off, I disagree, it seems pretty clear that LLMs can't make a "clean room reimplementation." Regardless of whether LLM generated code can be copyrighted, the use of an LLM is a mechanical process. If you feed the model source code and it spits out an output— spec or rewrite— I struggle to see a world in which that output isn't a derivative work. The key question is whether it falls under fair use:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.
There's an edge case in your description of the spec generation process. If I feed a graphics shader to an LLM asking it to give me a functional specification, and it gives me, verbatim, the following description:
This is a fragment shader for a game engine with PBR lighting. It implements Lambert diffuse lighting and GLTF-compliant specular lighting. Lambert diffuse lighting is the cosine (dot product) term of light vs surface normal. GLTF specular lighting is defined in the GLTF spec. Lights are integrated over a uniform array buffer and the illuminated surface data is put through a simple x/(x+1) pseudo-tonemapper before being output.
...Then that only contains fully uncopyrightable information, even though the process was entirely mechanical. And, containing no copyrightable information from the original data, it's not a "derivative work", despite being mechanical.
It's likely that an LLM is going to poison its spec with copyrightable details in practice, but it's not intrinsic to being a mechanical process.
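For what it's worth, the math that hypothetical spec names really is generic. A sketch in plain Python (standing in for shader code; the function names are mine, not from any real shader) shows that the Lambert term and the pseudo-tonemapper are one-liners you'd find in any graphics textbook:

```python
# Minimal sketch of the generic math the spec names. Lambert diffuse is
# the clamped cosine (dot product) of the light direction vs the surface
# normal; the pseudo-tonemapper is x/(x+1). Both are textbook formulas,
# not expression copied from any particular shader.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lambert_diffuse(normal, light_dir):
    # cosine term, clamped so back-facing light contributes nothing
    # (assumes both vectors are already unit length)
    return max(dot(normal, light_dir), 0.0)

def pseudo_tonemap(x):
    # maps [0, inf) into [0, 1), compressing highlights
    return x / (x + 1.0)

n = (0.0, 1.0, 0.0)           # surface facing straight up
l = (0.0, 1.0, 0.0)           # light from directly above
print(lambert_diffuse(n, l))  # 1.0
print(pseudo_tonemap(3.0))    # 0.75
```

Anyone could write these two functions from the spec sentence alone, which is the point: the sentence carries ideas and math, not the original shader's expression.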
How is that description "fully uncopyrightable" information?
Are you familiar with Conceptual Art? This article might tickle you.
You feed the shader to an LLM, it generates a spec, you feed the spec to an LLM, it generates some code. I strongly suspect the intermediate spec— just like the LLM transcripts and thinking tokens— would be irrelevant to whether the final work would be considered a derivative and releasing it as fair use.
You, the human, contributed little to no creative expression. "give me a functional specification for [X]" from an LLM is a derivative work of X. Had you written the spec yourself, we'd be having a different conversation. Accordingly, I suspect "make an html page that renders [spec] via webgl" will be, at best, akin to a translation.
re: 1: That description is like a textbook-perfect case of not containing copyrightable information from the original software. If you don't accept that then I don't know what to tell you.
re: 3: I am not commenting on that in this post. Don't twist my words. It's upsetting.
re: 4: If the LLM's output doesn't contain any copyrightable expression from the original code then it's not a derivative work. Being mechanically linked doesn't cause a derivative work. This is a fundamental misunderstanding of copyright law.
I don't accept your textbook perfect case. Would you point me at a textbook so I can correct myself? I would be surprised if a functional spec was uncopyrightable. Now, whether it contains copyrightable information… well, copyright doesn't cover information. And that (imho copyrightable) description is presenting ideas that would fall under patent.
I'm sorry for upsetting you. I believe that I have misunderstood you. You're not commenting on the context of the AI rewrite of chardet; rather, making a point only about an LLM generating a functional spec?
And, yeah, I agree that the example you gave is probably fine.
My point about the mechanical nature was about where and how much human contribution is within the end result. Basically, it would be the prompts, no?
I don't accept your textbook perfect case. Would you point me at a textbook so I can correct myself?
Copyright doesn't extend to pure math, industrial practice, general truths, etc. The only pieces of information from the original code contained in that sentence are only of this sort. You could drop this in a textbook when it starts talking about copyright and it would be a valid example of distilling out only uncopyrightable parts of the original thing, because it limits itself to those kinds of things that are categorically excluded from copyright protection in copyright law as it's written.
US law says:
"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."
Copyright can cover information, when that information is (or is an encoding of) creative expression. As a concrete example, with typefaces, the data (binary information) of a font file is copyrighted, but not the idea of the shape, even if you make a near 1:1 perfect replica by hand (in the US or Japan).
And that (imho copyrightable) description is presenting ideas that would fall under patent.
In this specific case, the English statements in the spec aren't encoding copyrighted expression as information; everything taken from the original shader is uncopyrightable. And AI outputs on their own are not copyrightable; they're only governed by copyright when they contain copyrighted expression copied from elsewhere, so the text itself doesn't carry its own new copyright. So it doesn't contain copyrightable information (when produced by an AI, as it would have been in my example).
My point about the mechanical nature was about where and how much human contribution is within the end result. Basically, it would be the prompts, no?
Reading your original post, you said "Regardless of whether LLM generated code can be copyrighted, the use of an LLM is a mechanical process. If you feed the model source code and it spits out an output— spec or rewrite— I struggle to see a world in which that output isn't a derivative work.". That was what I was responding to.
Yes, I'm familiar with this common basic knowledge about copyright.
wrt. the typefaces, the fixation of the font file is copyrighted. Moreover, at least in the US, they were ruled as copyrightable because they're computer programs. The data (binary information) is emphatically not what "information" means in the context of copyright. Feist v. Rural (1991) is the textbook case.
In your specific example spec, I agree that it would be hard to argue. But, if I were to argue it, the creative choice of which shaders and the use of a pseudo-tonemapper are a clear set of instructions. Instructions themselves aren't copyrightable but the specific original wording when fixed in a medium are.
I stand by my original statement about the mechanical process because of the absence of additional human creative expression. But yes, your point stands that if we slice thin enough to have an LLM generate something uncopyrightable, then we're good. Very thin is "port 1+1 to Forth." Ya got me! Meanwhile, @pmarreck clean-room reimplemented a bunch of compression algorithms. That's more interesting to me. Compression algorithms are where patent law is more often wielded than copyright.
chardet 7.0, however, has the same API and the same package name. Google v. Oracle (2021) was unambiguous. And, at least to me, the idea of this being fair use is wild.
In your reference, "information" isn't bare information, but factual information about reality (they literally had to go out of their way to express it in long form as "raw data—i.e. wholly factual information not accompanied by any original expression", and were not able to rely on just saying "information" on its own). Factual information is indeed uncopyrightable, because facts are uncopyrightable. But information in general can either be copyrightable or not. Sometimes information is expression. Just the word "information" is emphatically not a singularly-defined term of art in copyright law; it has multiple definitions, even in copyright law.
But, if I were to argue it, the creative choice of which shaders and the use of a pseudo-tonemapper are a clear set of instructions. Instructions themselves aren't copyrightable but the specific original wording when fixed in a medium are.
Yes, if a human wrote the spec then the particular sequence of words they chose to write the spec would be copyrighted.
I stand by my original statement about the mechanical process because of the absence of additional human creative expression.
This does cause it to not have additional copyright of its own, but it doesn't say anything, one way or the other, about the copyright relations between it and anything that went into it. It's not related.
chardet 7.0, however, has the same API and the same package name. Google v. Oracle (2021) was unambiguous. And, at least to me, the idea of this being fair use is wild.
Ugh, I'm referencing your own words. This is dumb.
As of now, the output from LLMs are both able to violate copyright and unable to be copyrighted due to their mechanical (inhuman) nature.
Yes, if you prompt the LLM to generate anything sans "copyrightable details" then… it won't violate copyright. Excellent point, well made.
To the original point, the mechanical (inhuman) nature matters because of fair use.
Finally, I'd love to see a big studio steal a cool shadertoy by running it through an LLM. Claim in court their use of a functional specification intermediate step and all. No creative expression, just basic math! Judges love gotchas and totally don't consider things holistically.
You're going off the rails. The point of contention is whether or not there is a connection between it being a mechanistic process and the output being a derivative work. There isn't. The fair use question is worth thinking about but wasn't something that I was ever arguing one way or the other over, and arguing about it is pointless either way because fair use is almost entirely determined by court precedent.
EDIT:
Finally, I'd love to see a big studio steal a cool shadertoy by running it through an LLM. Claim in court their use of a functional specification intermediate step and all. No creative expression, just basic math! Judges love gotchas and totally don't consider things holistically.
This is dishonest framing of my argument. The example I gave was something where there just isn't enough going on to cross that line.
I think the current law and its definitions are unable to accurately describe what is happening. It's not clearly any of the problems/uses we've had in the past.
Laws written in a world without GenAI haven't fully anticipated how it will complicate the situation, so not only the laws don't clearly describe the legality of it, the consequences of deciding either way couldn't have been properly considered.
Existing copyright assumes it's possible to identify what has been copied and therefore who suffered a loss from the infringement. A model can copy a tiny bit from everyone, and it's debatable (rather than explicitly codified) whether that means everyone or no one has been wronged.
We have the paradox that the training process seems to dilute and mix the inputs so much that you could argue nothing has been copied, and each work contributed less than would be allowable under fair use if you quoted or remixed a single piece, and yet the models are able to produce works that are eerily similar to pre-existing works (copies you probably wouldn't get away with making yourself).
Let's keep in mind that copyright law isn't about who owns what in some information-theoretical sense, but (greatly simplifying) about balancing the rights of artists vs public's access to culture. GenAI upsets this balance, so it's not even a question where it fits in the old copyright framework, but what even is the role of copyright now and where's the new balance?
The only argument against it hinges on “but the LLM was probably trained on it”
You say those things as if you were pretending not to know how this tech works. And as if it would be very simple to open an agent and prompt: "Hey! Let's change the whole file structure plus all the variable names so the lawyers will get even more confused muhehehe."
The code in the codebase right now resembles the earlier version very little. Probably about as much as any other library. So the argument should then be had about LLMs in general.
Temporarily assuming AIs are analogous to humans here... (aside: I don't believe they are, and if the human operator of AI 1 and AI 2 has studied the foobar implementation, then foobar was input to the creative process, and bazqux is derivative, IMO, but for the sake of discussion, let's assume that they're analogous.)
No. If you replace the AI with a human, this is perfectly acceptable.
Unless both human 1 and human 2 were trained on the implementation of foobar.
And in the case of the AIs, if foobar is public code, as in the context of this overall discussion, then both AIs were trained on the implementation of foobar. With that as an input on both sides, bazqux sure seems derivative, even if the end result looks superficially different.
And in the case of the AIs, if foobar is public code, as in the context of this overall discussion, then both AIs were trained on the implementation of foobar. With that as an input on both sides, bazqux sure seems derivative, even if the end result looks superficially different.
While there are some arguments in favor of that, it has never been so clear cut. A famous case here is Sega v. Accolade, where the court ruled in favor of interoperability even though some of the people involved had internal information.
In the end, a court would have to decide this, but given that this is a from-scratch implementation, I think we might as well argue that the mere presence of chardet code within Claude would effectively disallow any use of Claude to write any software without attribution to chardet, which seems unreasonable.
We're not arguing about any software, though. While I do have concerns that Claude could emit software with an inappropriate license anywhere, we're talking about using something that was trained on chardet to "reimplement" chardet. That's different.
Hard disagree. If I took a pile of C code, used a compiler to translate it to assembly language, and then used an LLM to convert that to Pascal, it'd look like it had nothing to do with the original C. But it'd still be, obviously, a derivative work. That's pretty much what the difference in the new code looks like to me.
If I took a pile of C code, and used a compiler to translate it to assembly language, then used an LLM to convert that to pascal, it'd look pretty clear that it had nothing to do with the original C. But it'd still be, obviously, a derivative work.
But that would most likely still use the same algorithms or approaches. That is not the case with this implementation.
I am not a lawyer, but I've read the introductory paragraph of a bunch of Wikipedia articles so you decide whether this counts as an informed opinion:
My understanding is that when you copyright code, you're copyrighting the specific code you've written, not the algorithm or logic behind it. So when you're trying to decide if Project B violates the copyright of Project A, you're trying to figure out these two things:
Does the code in Project B look like the code in Project A?
If so, is it likely that, just by chance, two independent developers would write code that similar?
The "clean room" rewrite process is not legally necessary, but it does have the advantage that it makes it easier to argue point (2) — if there are commonalities between the codebases, then we can be very confident that this is just chance because the developers who wrote the code in Project B never saw the code in Project A.
So in theory, a single human could have rewritten chardet by themselves, and it wouldn't have been a copyright violation, as long as the code they wrote was substantially different from the original project. It wouldn't even matter that they'd seen the original code, as long as the new code clearly is materially different. (Whether this is feasible in practice or recommended by lawyers are two very different questions, but in theory this could happen.)
Therefore getting an AI to do the same operation seems broadly similar. As long as the resulting code is materially different from the original code, it is not subject to copyright protection. On the other hand, if the AI code looks substantially similar to the original code, then we need to go back to point (2) above — is it really a meaningful similarity, or is it just that there are only so many ways of writing the same algorithm?
(A corollary of this is that any AI-generated code can have the same problem, even if it's not a deliberate rewrite. If the AI generates code that looks similar to any code that it was trained on, then we have to go back to point (2) above — is this code so simple that there's not really any other way to express it, or is this explicitly a copy of someone else's code that's been hidden in the weights?)
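As a toy illustration of question (1), here is a purely textual similarity check in Python. Everything in it (the snippets, the threshold-free printout) is invented for illustration; real infringement analysis is a legal judgment about protected expression, not a ratio, but this shows why "does B look like A?" is even mechanically askable:

```python
# Crude textual similarity between two implementations, as a stand-in
# for question (1) above. A verbatim copy scores 1.0; an independent
# rewrite of the same trivial algorithm scores noticeably lower.
from difflib import SequenceMatcher

project_a = """
def is_even(n):
    return n % 2 == 0
"""

project_b_copy = """
def is_even(n):
    return n % 2 == 0
"""

project_b_fresh = """
def even(value):
    remainder = value - (value // 2) * 2
    return remainder == 0
"""

def similarity(a, b):
    # 0.0..1.0 ratio of matching character runs
    return SequenceMatcher(None, a, b).ratio()

print(similarity(project_a, project_b_copy))   # 1.0: identical text
print(similarity(project_a, project_b_fresh))  # noticeably lower
```

Note that for a trivial function like this, even independent rewrites will share fragments, which is exactly question (2): is the similarity meaningful, or is there just no other way to say it?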
Orthogonal to this is the question of what copyright one can claim over AI-generated code, assuming that code is not a copy of existing code. There, it looks like the answer is "AI code is public domain", at least for now, although I suspect there's enough money riding on AI that this could well change.
So my interpretation of the legal situation is that, assuming the AI-generated version is substantially different, the new project is not a copyright infringement, but because the majority of the code was AI-generated, the new project must be considered public domain.
I'm not a lawyer, but I have read a lot about copyright. Copyright of derivative works is much more about process than content: two identical maps can each be copyrighted, because they shared no process.
And two completely separate works share copyright because one derives from the other. Like how much code in Emacs is still written by Stallman? Doesn't remove his copyright.
I don't quite see how that case is relevant, given that was a very clear case of copying the details from the original art into the poster. That would be like copying code from a copyrighted project and renaming variables, or wrapping it in a GUI or something. The result will still contain the original code in a form that is not substantially different, so it is copyrighted.
The last paragraph is exactly the point under contention. What counts as deriving a work?
My understanding is that when you copyright code, you're copyrighting the specific code you've written
That's not exactly correct. There's also fair use to consider, because if Project B is only possible because the author leveraged the work of Project A without respect for their right to control redistribution, then merely translating any work from English to French would be sufficient to escape copyright.
In the US, the fair-use test most judges will also consider include how Project B uses/monetises that work (i.e. specifically commercially), and if Project B claims it is independent, if they thought they could outsource responsibilities to another company/contractor (or AI).
Also: My experience is that judges are much less entertained by technicalities than they are on TV, and that lawyers know this, so I would not expect anyone serious to go to court with this defence with someone competent.
I don't think fair use is relevant here because the point of fair use is to be a defence in the case that copyright infringement has occurred (i.e. some forms of what would otherwise be copyright infringement are allowed as long as they pass the fair use test). But here, the claim is that the new codebase doesn't infringe on the original codebase's copyright at all, so fair use doesn't apply.
With regards to translation, I agree that a direct translation is copyright infringement (EDIT: fixed a typo, this originally said "is not"), and the same would be true of code — writing a literal function-by-function translation of C into Python would be copyright infringement. But a paraphrase would be allowed. So there's clearly a line somewhere where you can adapt a work enough but still take inspiration from the original work. What makes this comparison more complicated is that when writing a book, say, you have more ability to copyright the characters and events of that book. I guess an equivalent would be if you tried to reuse the same architecture as the original codebase?
I agree with your last paragraph entirely, though. There's a reason that clean room implementations are preferred for things like this, even if they're not technically necessary: it's better to be whiter than white, than to leave any ambiguity. Similarly to how movies that have vague similarities to a book will often buy the rights for that book, even if the end result is very different, because that's a safer bet than trying to argue the case after the fact.
With regards to translation, I agree that a direct translation is not copyright infringement (and the same would be true of code — writing a literal function-by-function translation of C into Python would be copyright infringement).
These lawyers explain why translations are considered derivative works. I read this to mean that a translation from one computer language to another would need to be licensed by the original author. I'm not a lawyer, but their piece seems clear on the point.
Whoops, I said the opposite of what I meant, which is that direct translation is copyright infringement.
I think part of the issue is the similarity between what translation of text is, and what translation of code is. If I translate text, I'm clearly trying to convey some part of the original text. (And if I directly translate code — like the function-by-function approach I mentioned above, that feels similar to me.) But the interface to the code (i.e. the external API) is not copyrightable (at least in the EU, in the US this is IIRC still an open question but it may fall under fair use). So if the only thing I'm translating are the external interfaces, and everything within those interfaces is new code, then that would surely not be a derivative work?
It's surprising to me that the external API is not copyrightable, since developing a good API is half the battle for a lot of software projects. But you must be right, that's why there are so many projects trying to emulate proprietary software. Thinking about it, I'm glad I won't get sued for doing that myself.
If this is the "end of open source" because any API can be implemented easily and becomes automatically public domain, is that ... good? We'll have software abundance? No more IP hoarding?
In my very much non-lawyer opinion, it could be a derivative work. I've done large amounts of translation before. When you translate a book or an article, you're very much adding your own creative input to the process. You decide what words or phrases best express the author's ideas to speakers of the target language. But the argument and its structure/organization are still very much the original author's. The resulting work, as a derivative, contains both of your creative inputs and belongs to you both.
So to your question, I'd say if all you're using are the external interfaces, and you're not taking the original author's implementation as input, that implementation is your creative work, assuming we agree that the interfaces are not their own creative work. (I'm not sure about that, don't know much about what the courts have said about it, and am unwilling to go look.)
BUT: if you feed the original author's implementation into your creative process, I think your implementation is derivative. And that goes double if your process is "feed the original implementation into an LLM and tell it to generate a new implementation."
That definitely all feels true to me for translating a text, but for some reason it feels less true to me for writing code? I think because there's less creative input in code in general, although there's clearly still some creative input there.
That said, my feelings are definitely not lawyer-approved!
I guess the other interesting aspect to me is whether there's a distinction between giving an agent access to the original codebase and saying "port this code", vs the underlying LLM having previously consumed the codebase as part of training but not having access to the codebase while doing the port. The two cases feel different to me somehow, but I can't quite put my finger on why.
No. Even commentary can potentially violate copyright, as in Warner Bros. Entertainment Inc. v. RDR Books, so there should be no doubt about whether paraphrasing someone is a derivative work.
A paraphrase may be allowed under fair use, but a judge will look at what you are doing with your paraphrasing, so paraphrasing Harry Potter will still get you sued if you put it on Amazon.
...where in the new test data repo even he recognizes the licensing implications of the data by moving it out to a separate repo. I wonder if that's actually a damning admission.
While data itself isn't generally copyrightable, a specific compilation of data as a whole can be if sufficient original, creative, and deliberate judgment went into the compilation. It seems to me that a project's test data would be exactly the type of thing that meets that test, making the test data, which was the foundation of everything the agent did, LGPL-licensed unless otherwise specified, as it was originally part of the existing repo (though perhaps there are issues with [L]GPL applied to non-code assets).
...where in the new test data repo even he recognizes the licensing implications of the data by moving it out to a separate repo. I wonder if that's actually a damning admission.
If we take the AI out of the picture and we just assume the author did a clean room implementation against a shared test suite it’s legally sound. You are allowed to do that.
My understanding is that's typically done in the human case by either having the test suite be rewritten from specs, or by using it as a black box solely to verify the new code... and specifically not as a case of "read these tests and write code from them."
If it's not a copyright violation, then yes, AI can probably finish ReactOS. But they'd also get sued for it. Proving things is hard. Also, some of the important parts of Windows are partially covered by patents, oops.
I am not qualified to say whether it's legal or not, but it's absolutely a dick move. I would like to take note of who did this so that I don't accidentally find myself collaborating with them at some future time.
I looked briefly at the code before and after the rewrite happened, and it is plausible to me that the two versions share no copyright. (I believe that it would have been much more reasonable for the maintainer interested in the rewrite to explicitly start a new project with a new license, rather than attempting a complete rewrite of an existing codebase inherited from other people.)
These programs are very specific in that a lot of the value is in bunch of classification data (in the previous version there were various state machines encoded as lists of numbers), and relatively little actual logic. The rewrite seems to have produced entirely new decision data and clearly-different driver logic.
It would be wildly incorrect to generalize this across any kind of AI-assisted code transformation, or to extrapolate without looking at the actual changes on that specific project.
This is what I don't get. Why not name it at minimum "chardet2"? What is the motivation for trying to keep the name? Do they have upload rights to the Python package & want folks to 'automatically' get the update (maliciously, like xz, hopefully not)? Is it just to keep the market saturation of the original package (LLMs do not seem to care for things new or not established before their training, so maybe that helps uptake)? But even then, what was wrong with the LGPL license that it "must be changed", as it is highlighted in the changelog (not just a regression to the mean of the LLM choosing it, as 'forces' have been pushing towards permissive licenses in recent years)? Now instead it's going to spiral out as a legal/philosophical/ethical choice, one that could have some heavy implications (not that this wasn't inevitable).
This new version was created by the same person who has been the primary maintainer of this package for many years. It is not a new package. It is a new version of the same package in the same repo.
A plausible explanation is that the current maintainer (who directed the rewrite and decided the license change) prefers MIT to LGPL for whatever reason, and took the opportunity of their complete-rewrite plan to change the license, without particularly thinking about the preferences/opinions of previous caretakers of the package. Now that a former maintainer has spoken up against the change, I believe the reasonable thing for the current maintainer to do would indeed be to publish the new version as a separate package or give up on the relicensing.
(Even if the new maintainer posts a separate package with a different license, the previous contributor could come and mention that this is derivative work and ask for LGPL again. But my guess is that the new maintainer would take this opinion into consideration less if it is a separate project, as long as they themselves are convinced that it is not derivative work.)
Do they have upload rights to the Python package & want folks to 'automatically' get the update
The current chardet maintainers, who do publish it to pypi, did this for version 7 of the library. They are not the original authors but have been maintaining it for a while.
I started a re-implementation of readline a while ago for similar reasons, because of its GPL license. There is an obvious moral question here, but that isn’t necessarily what I’m interested in.
...but like, you're doing this, and as you've acknowledged, obviously the ethics are muddy. Why aren't you taking some time to decide whether what you're doing is moral?
Why aren't you taking some time to decide whether what you're doing is moral?
From where I stand, it's moral. The world will be better off with a non-GPL version of readline that supports Unicode well, and a clean room implementation of that is not any worse than using AI for any other form of code generation.
I personally think all of this is exciting. I’m a strong supporter of putting things in the open with as little license enforcement as possible. I think society is better off when we share, and I consider the GPL to run against that spirit by restricting what can be done with it.
This reminds me of the paradox of tolerance. (Roughly: if a society tolerates everything, including radical intolerance, then it may end up in an overall less tolerant state than if it places some limits on toleration.)
In this case, if we take a radical stance about the openness of software and don't allow any limits on how it is shared, then we also seem to allow people to take open software and put it in things that are not open and not shared. As I understand it, the GPL means to give openness (and sharing) real teeth by saying, "This is shared and open, but its openness is viral. You can't take this and hide it away in a thing that is closed and not shared."
It's not clear to me that preserving different kinds of sharing, with different rules and limits, means not sharing. That is, I also "think society is better off when we share," but I definitely don't think that the GPL runs against that spirit. The person who initially shares (in this case it was Mark Pilgrim) has the freedom to decide on what terms they want to share. Forcing all sharing to be of one type doesn't strike me as better (unless we stipulate that that one type of sharing is the best or only true way to share.)
For whatever it's worth, I say this as someone who releases all their software under the BSD 3-clause license. But that is my choice.
I have released software under BSD, MIT and Apache 2 licenses for years precisely because you can do with it what you want. For me that is an important feature because I do not agree with how long copyrights last.
It was also a reason why I very much stood behind our FSL license, which turns open source after two years.
When the App Store came around a lot of GPL software was incompatible with it and I think that was a bad thing for society overall. It’s hard to put restrictions on licenses when you do not know what the future looks like.
do we have to take it for granted that interoperability with the App Store is a good thing, full stop? maybe, and i get this is not really a world we ever lived in, but maybe things would be Better if the App Store had been under sufficient pressure to ensure that its model of software distribution played nice with copyleft
It's not a good thing, but it's not necessarily the worst thing in the world, either.
The situation with game consoles is probably easier to reason about: the platform and the various bits of middleware need some basic amount of opaqueness or else cheating in online games would be horribly rampant (rather than an annoying amount of background noise). It's not hypothetical, we saw TF2 on PC get rendered almost unplayable by automated cheat bots for multiple years in a row. In game dev, the line between OS and middleware is slightly fuzzy, because you often pull things that ought to be part of the graphics driver or whatever into middleware for customization and performance reasons, so the GPL's provisions for letting you link with the OS don't apply.
I don't like this, but when what you're selling is the ability to play by fair rules with other humans, you need at least a basic amount of opaqueness, and the licensing situation with all the individual parts is already murky even BEFORE you start throwing the GPL or LGPL at things.
When the App Store came around a lot of GPL software was incompatible with it and I think that was a bad thing for society overall.
I don't think I understand your response. Are you saying that it "was a bad thing for society" that Apple declared GPL software incompatible with their store? If so, why does that count against the GPL? Apple made that choice, right?
Are you saying that it "was a bad thing for society" that Apple declared GPL software incompatible with their store?
I generally think that the GPL (particularly v3) is not a license that found a good balance because of the many quirks it has. In my book it's a fault of the GPL license that it does not work for something like the app store and not the fault of Apple.
Again, I don't understand how this connects to the part of your original article that I quoted. To repeat: "I’m a strong supporter of putting things in the open with as little license enforcement as possible. I think society is better off when we share."
These two sentences are in tension with each other. (That's why I mentioned the paradox of tolerance.) The Apple app store has very little to do with sharing. People overwhelmingly sell things there, and Apple makes a lot of money in the process.
To put this another way, if the point of all this is to create a "society...[where] we share", why would we choose licenses based on what a commercial company says is compatible with their marketplace?
What you really seem to be arguing is not that society is better off when we share but that society is better off when people are free to make money using things that are available to them with no restrictions. People can agree or disagree, but that's consistent. I have a hard time seeing a coherent position in what you've said so far.
You can release things on the app store for free. If you cannot release GPL code on there for free, then you have, in one regard, been made less free.
Of course, perhaps you think the GPL is good on the whole, that you're more free with it, overall. You can even put the entire blame on Apple. That's a coherent thing to think. But you should acknowledge that there is something you lost.
If you cannot release GPL code on there for free, then you have, in one regard, been made less free.
Yes, absolutely fair. There are two freedoms in competition here, and one has to give. That said, I'd argue (again) that the app store is overwhelmingly (and by clear design) not about releasing things for free. (I'd go further and argue that the free releases are a fig-leaf for Apple and a good sales technique for developers. Free releases help Apple precisely because they and others can remind people that some things are free. And free releases help developers who can then do in-app sales and upsell people to the "Pro" edition.)
Do you think that this helps the OP's overall argument? (Genuine question: no snark.)
These two sentences are in tension with each other. (That's why I mentioned the paradox of tolerance.)
From my standpoint they are not in tension with each other. I believe in sharing; that does not mean it should apply to everything. Using something shared to do something that is not shared is, for me, the entire point of it.
why would we choose licenses based on what a commercial company says is compatible with their marketplace?
I think we should use licenses that place as few restrictions as possible, because we do not know in which environment we will find each other in the future. The GPL placed requirements which were already hard to fulfill on the app store, and some of the clauses in the GPLv2 in particular are really quite tricky to fulfill in practice for older projects.
For me the question is not what a commercial company says can or cannot go on their marketplace; the issue is a license that has terms in it that create these types of situations.
What you really seem to be arguing is not that society is better off when we share but that society is better off when people are free to make money using things that are available to them with no restrictions.
My world view is very simple: we should share, we should not have non-competes, but simultaneously we should all work for commercial enterprises that deliver win-win transactions for their customers. That involves taking open source software and placing it in a commercial context. And I want this to work for both the original author and anyone else.
It was the other way around actually: the FSF declared the App Store incompatible with their license. Apple couldn't care less and I am 99.9% sure they have never ever spoken about which open source licenses are compatible with their App Store. The App Store guidelines certainly do not mention it.
The paradox is of course that the most logical reading kills open source dead, by basically putting all code in the public domain after a brief period of mining to remove all copyrights.
See, I don't think so. I know exactly what you're saying, and it's an interesting and complex question.
That would really be a communist world though. If I cannot even gain respect or create community in exchange for my open work, it is obvious to me that I would no longer do my work in the open
Yeah, that's fair. It would be a massive current boon to open source, but the long-term consequences would be debatable.
I think in a world where code is cheap, what you get respect for no longer is the code, but the work done solving problems, and thinking hard about what to build, not the implementation. The same urges that drive open source would motivate people to build things and share them.
I guess this is as good a time to announce it as any...
I have recently cleanroom-reimplemented (with LLM assistance and my guidance) par2 (par2z), bzip2 (bzip2z), rar (rarz), and 7zip (z7z), so maybe I am a good test case for this. (I haven't announced this anywhere until now.)
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language (Zig), but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I think they are sufficiently dissimilar for relicensing (these are all BSD or MIT)
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
The motivation for this was that I am building a for-sale app that cannot initially be open-source (although I plan to make it so eventually). (Maybe there's room for a license type that has a time-limited closed-source clause?)
And yes, I believe the power of LLMs these days makes GPL and copyleft far less of a wall.
That doesn't mean there is a copy of the original code in the LLM. It knows "something" but it can't reproduce the original source code. And if it generates code compatible with 7zip it could as well be because it "knows" about discussions about the format or maybe some random person published a document. Or it pieced it together from other data sources.
It is complicated. And not as simple as "oh it was trained so now it can reproduce whatever it was trained on".
the alleged "complication" is nothing but an attempt to obfuscate the theft of labor and laundering of authorship that is happening before our eyes
the original code went through a lossy compression process. so what? it was still part of the training set. the only way you can clean room implement something this way is by using a model that was not trained on the original source, or any code that was derived from it
A human having seen copyrighted code before doesn't instantly disqualify them from being on Team B. The assembly code for the IBM BIOS was literally printed in the manual, with comments and everything. It's certain that some of the people working on Team B when cloning it had seen bits and pieces of it before, or of code that had been written by other people who had, and were copying patterns from it. But they didn't have access to it specifically during the clean room process, and they hadn't memorized it, and that was enough.
I think doing a clean room implementation is nothing more than a choice that you can make to strengthen your legal position. It is not a requirement for this kind of work.
i think asking for a clean room impl is a reasonable demand here, given that llms have been shown to be able to regurgitate parts of their training set
In general LLMs absolutely can reproduce training data verbatim (think the NYT suit; there's also been a bunch of more recent studies on this). They've even been known to do so unprompted [1, 2].
I actually came across a decent collection of recent studies on this recently, but I can't find it right now. Someone should start a website cataloguing all of this.
This only matters logically, I think, if the LLM can reproduce, verbatim, the source code of every project that is public, in a working version.
It cannot.
This would be like accusing a person who is well-versed in compression algorithms and reimplemented one from scratch in a different language (from memory and maybe some written notes) of violating copyright, just because he understood how it worked, even though he used different code, different functions, and a different architecture entirely, while still sticking to the spec, which is allowed (and necessary, if one wants an interoperable cleanroom reimplementation).
An LLM is a mechanical process. Copyright protects human creative expression.
So, uh, you fed in some markdown (your work) and the source code of 7zip into a mechanical pipeline and got out some new source code. Do you think your contribution to the process was transformative? Because, otherwise, this looks a lot like a derivative work of 7zip to me.
The lawyer-created PolyForm Project also contains the PolyForm Countdown License Grant, which can be used to change to any license. It may be harder to use because you have to figure out how to combine it with the initial closed-source license, and it only supports hard-coding a date for the license change rather than letting you specify a number of years after you publish the software.
Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
That looks like the spec for the file format minus at least the LZMA2 part which is not very interesting IMO.
There's one interesting bit about LZMA: you can't parse it without decoding it. In other words, in order to parse it, you have to implement 99% of decompression. This means the spec isn't specifying much of the parsing of the 7z format.
This is showing the opposite of what you've stated: the implementation at https://github.com/pmarreck/z7z is NOT derived from the specification and MUST be derived from prior knowledge.
Also, even without knowing anything about this, the specification is so short that it should be obvious such a large implementation cannot be derived much from it.
The way I see this going in the future is going to have humans involved.
You really don't want to gamble on an LLM producing a clean functional specification with no copyrightable material in it, even if you're pro-LLM. You need humans to make sure it's not including code snippets or attempting to decompile code into step-by-step English.
And for the Team B phase, you would for example DEFINITELY not want to use Copilot in particular, ESPECIALLY if the thing you're CRREing is Windows. Copilot has a track record of being overfit, has been trained on internal private MS code, and you would be hurting your chances in court for no reason. If you're trying as hard as possible to definitely be legal, you wouldn't use LLMs in the Team B phase at all, not even a cleaner LLM. But in practice you're going to have to wait until precedent is set for how copyright-tainted LLM outputs are in practice and under what conditions they're treated as such.
Having said that, current copyright law isn't strong enough to entirely protect against machines being used for clean room RE (in particular the spec generation part, if you have a human review it and delete dirty parts). The law on the books doesn't line up with what most programmers' intuition says about how copying does or doesn't work; it's almost entirely about the historical facts behind the creation of the second thing, and not what it looks like or what physical tools were used to get there. If you disagree in principle instead of in the edge cases, you should look into lobbying for changing the copyright laws (serious, not being rhetorical).
you would for example DEFINITELY not want to use Copilot in particular, ESPECIALLY if the thing you're CRREing is Windows. Copilot has a track record of being overfit, has been trained on internal private MS code, and you would be hurting your chances in court for no reason.
On the contrary - do you think Microsoft would argue that Copilot, when used as intended, has output infringing content? That's not exactly in their self-interest.
I think that a developer using an AI tool to rewrite a codebase is pretty obviously making a derivative work of the original codebase (I am not a lawyer). AI is — as so often happens to be the case — a red herring.
An AI-assisted cleanroom implementation, though, would pretty obviously not be a derivative work. In that case, a developer uses a tool to write a complete specification of the behaviour of the original codebase, and then uses a tool to write code implementing that behaviour, with none of the original code in the context. There is a longstanding principle that writing an exhaustive natural-language description of a codebase is not creating a derived work.
This post also misunderstands the principle of human authorship. The case in question was about someone who wanted his software to own a copyright, which doesn't make sense: a piece of software can't own anything because it's not a legal person. There are other cases which challenge the idea that people using AI tools own a copyright on the output of those tools, but in my opinion (again, I am not a lawyer) this makes as much sense as saying that an author who writes in Word instead of longhand doesn't have a copyright to his work.
An AI-assisted cleanroom implementation, though, would pretty obviously not be a derivative work. In that case, a developer uses a tool to write a complete specification of the behaviour of the original codebase, and then uses a tool to write code implementing that behaviour, with none of the original code in the context. There is a longstanding principle that writing an exhaustive natural-language description of a codebase is not creating a derived work.
If the tool that writes the code implementing that described behavior was trained on the original code, can you really say none of the original code was in the context? I don't think you can. And it's very clear that in this case, all of the frontier models were trained on the original code.
Posting this to see things from Claude's perspective.
"What does the Python chardet package do?"
"Can you reproduce the detect() function without looking at the original source code?"
"Do you have the original source code of chardet in your memory?"
I can't tell if it's an artifact of the "chat" nature of that link, or what, but it has made the same mistake all over the place; I just happened to spot the error in _score_utf16 because it was near the start of the next paragraph.
mitchellh | 6 hours ago
This is excellent, because hopefully this goes to court (please!!!) and a judge can rule and tell us what the law actually is. There's a ton of legal LARPing that people in general like to do and it's all pretty meaningless to me.
A lawyer's opinion is much better, but ultimately not where the real value lies, since you'll always find lawyers on both sides of unruled cases; otherwise how would you get plaintiffs vs defendants? :) Unless the lawyer brings up actual case law that gives pretty clear precedent, of course!
We need this kind of stuff to go to court and get a true judgement. Only then is the law actually clear.
I hope this can be it. If not this, I hope something else soon. I don't really care which way it falls (I mean, I have opinions of course, but my opinion doesn't matter in the face of the law), I just want to be given guidance on what is/isn't legal.
(Also note, this isn't a moral argument. What is/isn't moral isn't simply what the law is. This is just a legal argument. And... addendum number two, I suppose there are philosophical arguments that morals are defined by laws but I disagree with that. In any case, either opinion doesn't matter, what I'm looking for is law.)
mitsuhiko | 5 hours ago
My hunch is still that people will not want to bring it to court given the significant implications and potential consequences. So far what I've got is that ambiguity is better in the interim for everybody with vested interests.
I would love stuff like this to go to court though.
timthelion | 14 hours ago
It's clearly a copyright violation, like taking a blurry photo of a painting is. But if it's not, then can we relicense proprietary code this way? Can the AI finish ReactOS?
mort | 14 hours ago
The stance of the industry right now seems to be that AI output is not subject to copyright even if it's an almost verbatim copy of copyrighted works. We don't know how much of the code it produces is identical to existing code and nobody cares to check. So I don't see why this should be any different?
The whole thing should be illegal on moral grounds, but the next best thing would be for the whole thing to be illegal on copyright grounds.
wareya | 4 hours ago
This isn't true, and the people in the industry saying it don't actually believe it, they're just trying to make a legal smokescreen. If you get copilot to hallucinate a 1:1 copy of a chunk of Quake, and it does it without the correct license and attributions, and Quake was in its training data, then you just made it do copyright infringement.
mort | 3 hours ago
A whole lot of the world's software is in the training data. When Claude writes hundreds of thousands of lines of code for you, how do you know that none of it is an almost verbatim replica found in its training data? Do you check? Can you check?
wareya | 2 hours ago
You might want to reread what you're responding to?
mort | 2 hours ago
Oh I think I understand what's going on.
I meant: "the stance (aka position held publicly, for political convenience) of the industry is that AI output can't be a copyright violation".
You meant: "the stance (aka privately held belief) of industry leaders is that AI output can be a copyright violation, even if they don't admit it publicly".
The two are compatible and I apologize for taking a hostile tone.
wareya | 2 hours ago
No problem, yeah that more or less sums it up.
singpolyma | 8 hours ago
People do check all the time and verbatim copying is basically unheard of.
However in this case it's clear that the original codebase was probably used as input which is a whole different legal ball of wax.
mort | 3 hours ago
When Anthropic advertised that Claude had made a C compiler, weren't GCC and Clang used as input?
dzwdz | 7 hours ago
How do you check that?
The training set of LLMs is so vast. You could probably run some search over them (if you can somehow acquire it in the first place - can you?) to find cases where the output was copied verbatim, but if the output has been even slightly altered that won't really work anymore.
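One hedged sketch of how such a check could look in practice, purely as an illustration (the tokenizer, window size, and threshold here are my assumptions, not anything a vendor actually uses): exact-substring search misses lightly edited copies, but comparing overlapping token n-grams ("shingles") between generated code and a reference corpus still catches many near-verbatim matches.

```python
# Illustrative sketch: flag near-verbatim overlap between a generated
# snippet and a reference file using token n-gram "shingles".
# All names and the n=8 window size are arbitrary choices for the example.

import re

def shingles(code: str, n: int = 8) -> set:
    # Normalize by splitting into rough tokens (identifiers/numbers and
    # punctuation), then take every overlapping n-token window.
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    # Fraction of the generated code's shingles that also occur in the
    # reference; 1.0 means every window appears verbatim in the reference.
    gen = shingles(generated, n)
    if not gen:
        return 0.0
    return len(gen & shingles(reference, n)) / len(gen)

reference = "for (int i = 0; i < len; i++) { total += buf[i]; }"
copied    = "for (int i = 0; i < len; i++) { total += buf[i]; }"
fresh     = "while n: acc, n = acc + n % 10, n // 10"

print(overlap_ratio(copied, reference))  # 1.0 for an exact copy
print(overlap_ratio(fresh, reference))   # near 0 for unrelated code
```

Note this only addresses the easy half of the problem: it needs access to the corpus being checked against, which, as the comment above points out, is exactly what outsiders don't have for LLM training sets.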
gnafuthegreat | 4 hours ago
What about those of us who feel current copyright laws are immoral? Are the moral grounds based on the supposition that current copyright laws are moral? At least in the US, our copyright laws are absolutely not moral by any standard I'd use to measure. They are driven by greed and no longer serve the original stated purpose of copyright. I suppose it depends on whether you mean "moral" to be "lawful" or "good".
sarah-quinones | 4 hours ago
make no mistake. regardless of the outcome of this, copyright laws will still be applicable to you and me. whatever loopholes are found are only going to benefit those with the most capital
gnafuthegreat | 4 hours ago
Sure, which is why I say the current copyright laws in the US aren't moral. If they are only to the benefit of those with the most capital, then we need to get rid of them. I'm not saying that is likely to happen, but I am saying we shouldn't let the rich and powerful dictate what is right and wrong or good and bad.
sarah-quinones | 4 hours ago
i agree, but until such a thing is possible, we can still use them to fight back where we can
mort | 3 hours ago
Those of us who recognize that current copyright laws are immoral (myself among them) still typically recognize that there's a need for something to protect some form of intellectual property. If I as an individual compose and produce a song, and Warner Brothers wants to use that song as the theme song for an upcoming blockbuster, I want some say in what the terms are and I want to get something back from it. If I write a piece of software, and Google wants to build a billion dollar business on top of it, I want to have a say in that.
mitsuhiko | 12 hours ago
That’s not clear at all. You can trivially make a clean room reimplementation of libraries these days with AI. The only argument against it hinges on “but the LLM was probably trained on it” and I think this doesn’t hold a lot of water because they usually come up with other implementations.
I have done this a few times now and I’m fairly sure that these reimplementations would hold up in court if given a chance. There is a separate question of if LLM generated code can be copyrighted but that’s an independent question. It might weaken copyright and thus copyleft but that might also be a good thing.
dzwdz | 10 hours ago
This seems like it would only really benefit proprietary software, and harm FOSS. Microsoft could take any GPL-licensed library, "rewrite" it using an LLM, and then use it freely. On the other hand, I couldn't do the same with e.g. the Windows source code, because Microsoft doesn't publish it.
wareya | 9 hours ago
Humans have been clean room reverse engineering compiled binaries for decades. In fact it's what clean room RE was invented for. That's not to say that it's the same amount of difficult, it's definitely harder, but it's where this whole idea started in the first place and it's still done. Similarly, the reason projects like WINE have a zero-internal-knowledge-of-windows-code policy for contribution is to play it maximally safe, not because looking at machine code poisons the legal state of your brain (because it doesn't).
If you dump a windows binary into Ghidra, have a future LLM clean up the output pseudo-C code into real C, and then try to automatically clean room RE it, you've fixed the "asymmetry" problem and ended up in an equally legally dubious place (rather than more dubious) vs proprietary shops stealing FOSS code with LLM-driven pseudo-clean-room-RE.
dzwdz | 9 hours ago
My point is that now it's effectively free for companies to use GPL code without submitting to the terms of the license; before this they would need to actually spend resources on a clean room rewrite - which would be an amount of effort comparable to just writing the code from scratch.
In other words, companies just weren't able to benefit from GPL-licensed code unless they used open licenses for their own code too.
mitsuhiko | 7 hours ago
This cuts both ways. The same is now true for open source implementations of proprietary code.
dzwdz | 7 hours ago
If that proprietary code gets released, which the companies would no longer have to do.
Even if it were (or if it were reverse-engineered from the binary)... this still feels bad for FOSS as a whole.
A lot of FOSS maintainers do their work entirely for free, in their spare time - and then their work ends up foundational to multibillion corporations, while they continue not to see a dime for their work. That's obviously fucked up. Licenses such as AGPL have been one solution to this; they've made dual licensing a viable business model.
Except, well, apparently now that went out the window. If using LLMs to launder licenses is legal, then any developer whose work is published online[1] now has no recourse against it being exploited by large corporations.
edit: I agree with the broad idea of "we should be able to freely reuse each other's code", but I also think it'd be great for everyone (even developers) to have free food and shelter. Alas, we don't live in a utopia, and copyright (along with copyleft licenses) provides developers with some leverage against that inequality.
[1] Notice how I'm not even saying "FOSS developers" - it seems that the license of the original code is irrelevant, so this could be done to code under restrictive licenses.
mitsuhiko | 6 hours ago
I do not need the code to make a clean room reimplementation.
Bernerd | 45 minutes ago
Maybe not in webdev. But in hardware driver world...
sarah-quinones | 7 hours ago
one is publicly available for model training. the other is not
mitsuhiko | 6 hours ago
That’s not really relevant to that discussion about reimplementing something. I can reimplement proprietary software with an agent just fine.
IohannesArnold | 3 hours ago
Isn't the Windows source code floating around enough places that it's not too hard to find if you look? And then you wouldn't need to openly publish it yourself, just privately feed it to an LLM to rewrite it.
quad | 8 hours ago
First off, I lightly expect this to go the way of mechanical licenses. For example, there could be two distinct copyright elements: the API and the implementation. This might already be true, as both fall under copyright.
Second off, I disagree: it seems pretty clear that LLMs can't make a "clean room reimplementation." Regardless of whether LLM-generated code can be copyrighted, the use of an LLM is a mechanical process. If you feed the model source code and it spits out an output— spec or rewrite— I struggle to see a world in which that output isn't a derivative work. The key question is whether it falls under fair use:
As this rewrite was explicitly done to supersede the original work, I have my doubts.
But, the court does love things that benefit the public. So, who knows. Exciting times we live in.
wareya | 8 hours ago
There's an edge case in your description of the spec generation process. If I feed a graphics shader to an LLM asking it to give me a functional specification, and it gives me, verbatim, the following description:
...Then that only contains fully uncopyrightable information, even though the process was entirely mechanical. And, containing no copyrightable information from the original data, it's not a "derivative work", despite being mechanical.
It's likely that an LLM is going to poison its spec with copyrightable details in practice, but it's not intrinsic to being a mechanical process.
quad | 7 hours ago
wareya | 6 hours ago
re: 1: That description is like a textbook-perfect case of not containing copyrightable information from the original software. If you don't accept that then I don't know what to tell you.
re: 3: I am not commenting on that in this post. Don't twist my words. It's upsetting.
re: 4: If the LLM's output doesn't contain any copyrightable expression from the original code then it's not a derivative work. Being mechanically linked doesn't cause a derivative work. This is a fundamental misunderstanding of copyright law.
quad | 5 hours ago
I don't accept your textbook perfect case. Would you point me at a textbook so I can correct myself? I would be surprised if a functional spec was uncopyrightable. Now, whether it contains copyrightable information… well, copyright doesn't cover information. And that (imho copyrightable) description is presenting ideas that would fall under patent.
I'm sorry for upsetting you. I believe that I have misunderstood you. You're not commenting on the context of the AI rewrite of chardet; rather, making a point only about an LLM generating a functional spec?
And, yeah, I agree that the example you gave is probably fine.
My point about the mechanical nature was about where and how much human contribution is within the end result. Basically, it would be the prompts, no?
wareya | 4 hours ago
Copyright doesn't extend to pure math, industrial practice, general truths, etc. The only pieces of information from the original code contained in that sentence are only of this sort. You could drop this in a textbook when it starts talking about copyright and it would be a valid example of distilling out only uncopyrightable parts of the original thing, because it limits itself to those kinds of things that are categorically excluded from copyright protection in copyright law as it's written.
US law says:
"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."
For example, basic math (minus its exact typesetting) is excluded by this. This is common basic knowledge about copyright. The copyright office has restated this in clearer terms: https://lists.gnu.org/archive/html/guile-user/2021-06/msg00073.html
Copyright can cover information, when that information is (or is an encoding of) creative expression. As a concrete example, with typefaces, the data (binary information) of a font file is copyrighted, but not the idea of the shape, even if you make a near 1:1 perfect replica by hand (in the US or Japan).
In this specific case, the english statements in the spec aren't encoding copyrighted expression as information; the only anything from the original shader is all uncopyrightable. And AI outputs on their own are not copyrightable, they're only governed by copyright when they contain copyrighted expression copied from elsewhere, so the text itself doesn't have its own new copyright. So it doesn't contain copyrightable information (when given by an AI as it would have been in my example).
Reading your original post, you said "Regardless of whether LLM generated code can be copyrighted, the use of an LLM is a mechanical process. If you feed the model source code and it spits out an output— spec or rewrite— I struggle to see a world in which that output isn't a derivative work.". That was what I was responding to.
quad | 4 hours ago
Yes, I'm familiar with this common basic knowledge about copyright.
wrt. the typefaces, the fixation of the font file is copyrighted. Moreover, at least in the US, they were ruled as copyrightable because they're computer programs. The data (binary information) is emphatically not what "information" means in the context of copyright. Feist v. Rural (1991) is the textbook case.
In your specific example spec, I agree that it would be hard to argue. But, if I were to argue it, the creative choice of which shaders and the use of a pseudo-tonemapper are a clear set of instructions. Instructions themselves aren't copyrightable but the specific original wording when fixed in a medium are.
I stand by my original statement about the mechanical process because of the absence of additional human creative expression. But, yes, I take your point that if we slice thin enough to have an LLM generate something uncopyrightable, then we're good. Very thin is "port `1+1` to Forth." Ya got me! Meanwhile, @pmarreck clean-room reimplemented a bunch of compression algorithms. That's more interesting to me. Compression algorithms are where patent law is more often wielded than copyright.
chardet 7.0, however, has the same API and the same package name. Google v. Oracle (2021) was unambiguous. And, at least to me, the idea of this being fair use is wild.
wareya | 4 hours ago
In your reference, "information" isn't bare information, but factual information about reality (they literally had to go out of their way to express it in long form as "raw data—i.e. wholly factual information not accompanied by any original expression", and were not able to rely on just saying "information" on its own). Factual information is indeed uncopyrightable, because facts are uncopyrightable. But information in general can either be copyrightable or not. Sometimes information is expression. Just the word "information" is emphatically not a singularly-defined term of art in copyright law; it has multiple definitions, even in copyright law.
Yes, if a human wrote the spec then the particular sequence of words they chose to write the spec would be copyrighted.
This does cause it to not have additional copyright of its own, but it doesn't say anything, one way or the other, about the copyright relations between it and anything that went into it. It's not related.
I'm not talking about this.
quad | 3 hours ago
Ugh, I'm referencing your own words. This is dumb.
As of now, the output from LLMs are both able to violate copyright and unable to be copyrighted due to their mechanical (inhuman) nature.
Yes, if you prompt the LLM to generate anything sans "copyrightable details" then… it won't violate copyright. Excellent point, well made.
To the original point, the mechanical (inhuman) nature matters because of fair use.
Finally, I'd love to see a big studio steal a cool shadertoy by running it through an LLM. Claim in court their use of a functional specification intermediate step and all. No creative expression, just basic math! Judges love gotchas and totally don't consider things holistically.
wareya | 2 hours ago
You're going off the rails. The point of contention is whether or not there is a connection between it being a mechanistic process and the output being a derivative work. There isn't. The fair use question is worth thinking about but wasn't something that I was ever arguing one way or the other over, and arguing about it is pointless either way because fair use is almost entirely determined by court precedent.
EDIT:
This is dishonest framing of my argument. The example I gave was something where there just isn't enough going on to cross that line.
kornel | an hour ago
I think the current law and its definitions are unable to accurately describe what is happening. It's not clearly any of the problems/uses we've had in the past.
Laws written in a world without GenAI haven't fully anticipated how it would complicate the situation, so not only do the laws fail to clearly describe the legality of it, the consequences of deciding either way couldn't have been properly considered.
Existing copyright assumes it's possible to identify what has been copied and therefore who suffered a loss from the infringement. A model can copy a tiny bit from everyone, and it's debatable (rather than explicitly codified) whether that means everyone or no one has been wronged.
We have the paradox that the training process seems to dilute and mix the inputs so much that you could argue nothing has been copied, and each work contributed less than would be allowable under fair use if you quoted or remixed one piece, and yet the models are able to produce works that are eerily similar to pre-existing works (which you probably wouldn't get away with if you made such a copy yourself).
Let's keep in mind that copyright law isn't about who owns what in some information-theoretical sense, but (greatly simplifying) about balancing the rights of artists vs public's access to culture. GenAI upsets this balance, so it's not even a question where it fits in the old copyright framework, but what even is the role of copyright now and where's the new balance?
yasser | 10 hours ago
You say those things as if you didn't know how this tech works, and as if it would be very simple to go into an agent and prompt: "Hey! Let's change the whole file structure plus all the variable names so the lawyers will get even more confused, muhehehe."
mitsuhiko | 9 hours ago
The code in the codebase right now resembles the earlier version very little. Probably about as much as any other library. So the argument should then be had about LLMs in general.
toastal | 9 hours ago
AI-1, please make a spec of foobar. AI-2, please make bazqux using AI-1’s spec. Oh, no.
mitsuhiko | 8 hours ago
There is no "Oh, no". If you replace the AI with a human this is perfectly acceptable.
hoistbypetard | 8 hours ago
Temporarily assuming AIs are analogous to humans here... (aside: I don't believe they are, and if the human operator of AI 1 and AI 2 has studied the foobar implementation, then foobar was input to the creative process, and bazqux is derivative, IMO, but for the sake of discussion, let's assume that they're analogous.)
Unless both human 1 and human 2 were trained on the implementation of foobar.
And in the case of the AIs, if foobar is public code, as in the context of this overall discussion, then both AIs were trained on the implementation of foobar. With that as an input on both sides, bazqux sure seems derivative, even if the end result looks superficially different.
mitsuhiko | 7 hours ago
While there are some arguments in that favor, that has never been so clear cut. A famous case here is Sega v. Accolade where even though some people had internal information, the court ruled in favor of interoperability.
hoistbypetard | 7 hours ago
In this case all the people and all the AIs have internal information. That strikes me as qualitatively different.
mitsuhiko | 7 hours ago
In the end a court would have to decide this, but given that this is a from-scratch implementation, the counter-argument would effectively be that chardet code within Claude disallows any use of Claude to write any software without attribution to chardet, which seems unreasonable.
hoistbypetard | 6 hours ago
We're not arguing about any software, though. While I do have concerns that Claude could emit software with an inappropriate license anywhere, we're talking about using something that was trained on chardet to "reimplement" chardet. That's different.
mitsuhiko | 5 hours ago
But if you look at the new code it’s pretty clear that it has nothing to do with old chardet.
hoistbypetard | 5 hours ago
Hard disagree. If I took a pile of C code, and used a compiler to translate it to assembly language, then used an LLM to convert that to Pascal, it'd look pretty clear that it had nothing to do with the original C. But it'd still be, obviously, a derivative work. That's pretty much what the difference in the new code looks like to me.
mitsuhiko | 4 hours ago
But that would most likely still use the same algorithms or approaches. That is not the case with this implementation.
toastal | 5 hours ago
I just found a thought-provoking FOSDEM talk on this concept: https://fosdem.org/2026/schedule/event/SUVS7G-lets_end_open_source_together_with_this_one_simple_trick/
Johz | 13 hours ago
I am not a lawyer, but I've read the introductory paragraph of a bunch of Wikipedia articles so you decide whether this counts as an informed opinion:
My understanding is that when you copyright code, you're copyrighting the specific code you've written, not the algorithm or logic behind it. So when you're trying to decide if Project B violates the copyright of Project A, you're trying to figure out these two things: (1) is the code in Project B substantially similar to the code in Project A, and (2) if it is, is that a meaningful similarity, or are there just only so many ways of writing the same algorithm?
The "clean room" rewrite process is not legally necessary, but it does have the advantage that it makes it easier to argue point (2) — if there are commonalities between the codebases, then we can be very confident that this is just chance because the developers who wrote the code in Project B never saw the code in Project A.
So in theory, a single human could have rewritten chardet by themselves, and it wouldn't have been a copyright violation, as long as the code they wrote was substantially different from the original project. It wouldn't even matter that they'd seen the original code, as long as the new code clearly is materially different. (Whether this is feasible in practice or recommended by lawyers are two very different questions, but in theory this could happen.)
Therefore getting an AI to do the same operation seems broadly similar. As long as the resulting code is materially different from the original code, it does not infringe the original's copyright. On the other hand, if the AI code looks substantially similar to the original code, then we need to go back to point (2) above — is it really a meaningful similarity, or is it just that there are only so many ways of writing the same algorithm?
(A corollary of this is that any AI-generated code can have the same problem, even if it's not a deliberate rewrite. If the AI generates code that looks similar to any code that it was trained on, then we have to go back to point (2) above — is this code so simple that there's not really any other way to express it, or is this explicitly a copy of someone else's code that's been hidden in the weights?)
Orthogonal to this is the question of what copyright one can claim over AI-generated code, assuming that code is not a copy of existing code. There, it looks like the answer is "AI code is public domain", at least for now, although I suspect there's enough money riding on AI that this could well change.
So my interpretation of the legal situation is that, assuming the AI-generated version is substantially different, the new project is not a copyright infringement, but because the majority of the code was AI-generated, the new project must be considered public domain.
timthelion | 12 hours ago
https://en.wikipedia.org/wiki/Steinberg_v._Columbia_Pictures_Industries,_Inc.
I'm not a lawyer but I have read a lot about copyright. Copyright of derivative works is much more about process than content, to the extent that two identical maps can both be copyrighted because they shared no process.
And two completely separate works can share copyright because one derives from the other. Like, how much code in Emacs is still written by Stallman? That doesn't remove his copyright.
Johz | 11 hours ago
I don't quite see how that case is relevant, given that was a very clear case of copying the details from the original art into the poster. That would be like copying code from a copyrighted project and renaming variables, or wrapping it in a GUI or something. The result will still contain the original code in a form that is not substantially different, so it is copyrighted.
The last paragraph is exactly the point under contention. What counts as deriving a work?
geocar | 12 hours ago
That's not exactly correct. There's also fair use to consider, because if Project B is only possible because the author leveraged the work of Project A without respecting their rights to that work, that matters; otherwise merely translating any work from English to French would be sufficient to escape copyright.
In the US, the fair-use factors most judges will consider include how Project B uses/monetises that work (i.e. whether it's commercial), whether Project B claims it is independent, and whether its authors thought they could outsource responsibility to another company/contractor (or AI).
Also: My experience is that judges are much less entertained by technicalities than they are on TV, and that lawyers know this, so I would not expect anyone serious to go to court with this defence with someone competent.
Johz | 11 hours ago
I don't think fair use is relevant here because the point of fair use is to be a defence in the case that copyright infringement has occurred (i.e. some forms of what would otherwise be copyright infringement are allowed as long as they pass the fair use test). But here, the claim is that the new codebase doesn't infringe on the original codebase's copyright at all, so fair use doesn't apply.
With regards to translation, I agree that a direct translation is (EDIT: typo, originally said "is not") copyright infringement (and the same would be true of code — writing a literal function-by-function translation of C into Python would be copyright infringement). But a paraphrase would be allowed. So there's clearly a line somewhere where you can adapt a work enough but still take inspiration from the original work. What makes this comparison more complicated is that when writing a book, say, you have more ability to copyright the characters and events of that book. I guess an equivalent would be if you tried to reuse the same architecture as the original codebase?
I agree with your last paragraph entirely, though. There's a reason that clean room implementations are preferred for things like this, even if they're not technically necessary: it's better to be whiter than white than to leave any ambiguity. Similarly to how movies that have vague similarities to a book will often buy the rights for that book, even if the end result is very different, because that's a safer bet than trying to argue the case after the fact.
hoistbypetard | 10 hours ago
These lawyers explain why translations are considered derivative works. I understand this to mean that a translation from one computer language to another would need to be licensed by the original author. I'm not a lawyer, but it seems clear to me in their piece.
Johz | 9 hours ago
Whoops, I said the opposite of what I meant, which is that direct translation is copyright infringement.
I think part of the issue is the similarity between what translation of text is, and what translation of code is. If I translate text, I'm clearly trying to convey some part of the original text. (And if I directly translate code — like the function-by-function approach I mentioned above, that feels similar to me.) But the interface to the code (i.e. the external API) is not copyrightable (at least in the EU, in the US this is IIRC still an open question but it may fall under fair use). So if the only thing I'm translating are the external interfaces, and everything within those interfaces is new code, then that would surely not be a derivative work?
ajessejiryudavis | 7 hours ago
It's surprising to me that the external API is not copyrightable, since developing a good API is half the battle for a lot of software projects. But you must be right, that's why there are so many projects trying to emulate proprietary software. Thinking about it, I'm glad I won't get sued for doing that myself.
If this is the "end of open source" because any API can be implemented easily and becomes automatically public domain, is that ... good? We'll have software abundance? No more IP hoarding?
hoistbypetard | 9 hours ago
In my very much non-lawyer opinion, it could be a derivative work. I've done large amounts of translation before. When you translate a book or an article, you're very much adding your own creative input to the process. You decide what words or phrases best express the author's ideas to speakers of the target language. But the argument and its structure/organization are still very much the original author's. The resulting work, as a derivative, contains both of your creative inputs and belongs to you both.
So to your question, I'd say if all you're using are the external interfaces, and you're not taking the original author's implementation as input, that implementation is your creative work, assuming we agree that the interfaces are not their own creative work. (I'm not sure about that, don't know much about what the courts have said about it, and am unwilling to go look.)
BUT: if you feed the original author's implementation into your creative process, I think your implementation is derivative. And that goes double if your process is "feed the original implementation into an LLM and tell it to generate a new implementation."
Johz | 8 hours ago
That definitely all feels true to me for translating a text, but for some reason it feels less true to me for writing code? I think because there's less creative input in code in general, although there's clearly still some creative input there.
That said, my feelings are definitely not lawyer-approved!
I guess the other interesting aspect to me is whether there's a distinction between giving an agent access to the original codebase and saying "port this code", vs the underlying LLM having previously consumed the codebase as part of training but not having access to the codebase while doing the port. The two cases feel different to me somehow, but I can't quite put my finger on why.
geocar | 7 hours ago
No. Even commentary can potentially violate copyright, as in Warner Bros. Entertainment Inc. v. RDR Books, so there should be no doubt about whether paraphrasing someone creates a derivative work.
A paraphrase may be allowed under fair-use, but a judge will look at what you are doing with your paraphrasing, so paraphrasing Harry Potter will still get you sued if you put it on Amazon.
mitsuhiko | 7 hours ago
It's a fresh implementation. You can see this comment to see how much code carried over: https://github.com/chardet/chardet/issues/327#issuecomment-4005195078 (basically nothing)
abeyer | 5 hours ago
The maintainer says in that post:
...where in the new test data repo even he recognizes the licensing implications of the data by moving it out to a separate repo. I wonder if that's actually a damning admission.
While data itself isn't generally copyrightable, a specific compilation of data can be, if there is sufficient original/creative/deliberate judgement put into the compilation. It seems to me that a project's test data would be exactly the type of thing that would meet that test, making the test data which was the foundation of everything the agent did LGPL-licensed unless otherwise specified, as it was originally part of the existing repo (though perhaps there are issues with [L]GPL applied to non-code assets).
mitsuhiko | 5 hours ago
If we take the AI out of the picture and we just assume the author did a clean room implementation against a shared test suite it’s legally sound. You are allowed to do that.
abeyer | 5 hours ago
My understanding is that's typically done in the human case by either having the test suite be rewritten from specs, or by using it as a black box solely to verify the new code... and specifically not as a case of "read these tests and write code from them."
mitsuhiko | 4 hours ago
What we know is that a test suite does not influence the license of the implementation that targets them.
wareya | 9 hours ago
If it's not a copyright violation, then yes, AI can probably finish ReactOS. But they'd also get sued for it. Proving things is hard. Also some of the important parts of windows are partially covered by patents, oops.
hoistbypetard | 11 hours ago
The original author of chardet objects to this, and believes that it infringes on his copyright.
I am not qualified to say whether it's legal or not, but it's absolutely a dick move. I would like to take note of who did this so that I don't accidentally find myself collaborating with them at some future time.
gasche | 13 hours ago
I looked briefly at the code before and after the rewrite happened, and it is plausible to me that the two versions share no copyright. (I believe that it would have been much more reasonable for the maintainer interested in the rewrite to explicitly start a new project with a new license, rather than attempting a complete rewrite of an existing codebase inherited from other people.)
These programs are very specific in that a lot of the value is in a bunch of classification data (in the previous version there were various state machines encoded as lists of numbers), and relatively little actual logic. The rewrite seems to have produced entirely new decision data and clearly different driver logic.
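To make the "state machines encoded as lists of numbers" shape concrete, here's a toy sketch of my own (not chardet's actual tables, and deliberately simplified): a table-driven checker that asks whether a byte stream is structurally valid UTF-8. In this style the transition table itself carries most of the value; the driver loop is trivial.

```python
# Toy illustration of a table-driven detector. It over-accepts some
# invalid UTF-8 (overlong encodings, surrogates) to stay small; real
# detectors use bigger tables plus statistical frequency data.

def byte_class(b):
    if b < 0x80: return 0            # ASCII
    if b < 0xC0: return 1            # continuation byte (10xxxxxx)
    if 0xC2 <= b <= 0xDF: return 2   # lead byte of a 2-byte sequence
    if 0xE0 <= b <= 0xEF: return 3   # lead byte of a 3-byte sequence
    if 0xF0 <= b <= 0xF4: return 4   # lead byte of a 4-byte sequence
    return 5                         # never valid in UTF-8

OK, NEED1, NEED2, NEED3, ERR = range(5)
# TRANS[state][byte_class] -> next state; this table *is* the detector.
TRANS = [
    #  ascii  cont   lead2  lead3  lead4  bad
    [  OK,    ERR,   NEED1, NEED2, NEED3, ERR],  # OK: between characters
    [  ERR,   OK,    ERR,   ERR,   ERR,   ERR],  # NEED1: one continuation left
    [  ERR,   NEED1, ERR,   ERR,   ERR,   ERR],  # NEED2: two left
    [  ERR,   NEED2, ERR,   ERR,   ERR,   ERR],  # NEED3: three left
    [  ERR,   ERR,   ERR,   ERR,   ERR,   ERR],  # ERR: sticky failure
]

def looks_like_utf8(data: bytes) -> bool:
    state = OK
    for b in data:
        state = TRANS[state][byte_class(b)]
    return state == OK

print(looks_like_utf8("héllo".encode("utf-8")))  # True
print(looks_like_utf8(b"h\xe9llo"))              # False: latin-1 é breaks the DFA
```

The interesting copyright question is then about the numbers in `TRANS`, not the loop: a rewrite that regenerates such tables from scratch shares essentially no expression with the original even though both implement the same idea.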
It would be wildly incorrect to generalize this across any kind of AI-assisted code transformation, or to extrapolate without looking at the actual changes on that specific project.
toastal | 10 hours ago
This is what I don’t get. Why not name it at minimum “chardet2”? What is the motivation for trying to keep the name? Do they have upload rights to the Python package and want folks to ‘automatically’ get the update (maliciously, like xz? hopefully not)? Is it just to keep the market saturation of the original package (LLMs do not seem to care for things not established before their training, so maybe that gets the uptake)? But even then, what was wrong with the LGPL license that it “must be changed”, as it is highlighted in the changelog (not just a regression to the mean of the LLM choosing it, as ‘forces’ have been pushing towards permissive licenses in recent years)? Now instead it’s going to spiral out as a legal/philosophical/ethical choice made, one that could have some heavy implications (not that this wasn’t inevitable).
st3fan | 10 hours ago
This new version was created by the same person who has been the primary maintainer of this package for many years. It is not a new package. It is a new version of the same package in the same repo.
gasche | 9 hours ago
A plausible explanation is that the current maintainer (who directed the rewrite and decided the license change) prefers MIT to LGPL for whatever reason, and took the opportunity of their complete-rewrite plan to change the license, without particularly thinking about the preferences/opinions of previous caretakers of the package. Now that a former maintainer has spoken up against the change, I believe the reasonable thing for the current maintainer to do would indeed be to publish the new version as a separate package, or give up on the relicensing.
(Even if the new maintainer posts a separate package with a different license, the previous contributor could come and mention that this is derivative work and ask for LGPL again. But my guess is that the new maintainer would take this opinion into consideration less if it is a separate project, as long as they themselves are convinced that it is not derivative work.)
hoistbypetard | 10 hours ago
The current chardet maintainers, who do publish it to pypi, did this for version 7 of the library. They are not the original authors but have been maintaining it for a while.
dzwdz | 6 hours ago
...but like, you're doing this, and as you've acknowledged, obviously the ethics are muddy. Why aren't you taking some time to decide whether what you're doing is moral?
mitsuhiko | 4 hours ago
From where I stand, it's moral. The world will be better off with a non-GPL version of readline that supports unicode well, and a clean room implementation of that is not any worse than using AI for any other form of code generation.
telemachus | 4 hours ago
This reminds me of the paradox of tolerance. (Roughly: if a society tolerates everything, including radical intolerance, then it may end up in an overall less tolerant state than if it places some limits on toleration.)
In this case, if we take a radical stance about the openness of software and don't allow any limits on how it is shared, then we also seem to allow people to take open software and put it in things that are not open and not shared. As I understand it, the GPL means to give openness (and sharing) real teeth by saying, "This is shared and open, but its openness is viral. You can't take this and hide it away in a thing that is closed and not shared."
It's not clear to me that preserving different kinds of sharing, with different rules and limits, means not sharing. That is, I also "think society is better off when we share," but I definitely don't think that the GPL runs against that spirit. The person who initially shares (in this case it was Mark Pilgrim) has the freedom to decide on what terms they want to share. Forcing all sharing to be of one type doesn't strike me as better (unless we stipulate that that one type of sharing is the best or only true way to share.)
For whatever it's worth, I say this as someone who releases all their software under the BSD 3-clause license. But that is my choice.
mitsuhiko | 3 hours ago
I have released software under BSD, MIT and Apache 2 licenses for years precisely because you can do with it what you want. For me that is an important feature because I do not agree with how long copyrights last.
It was also one of the reasons I very much stood behind our FSL license, which turns open source after two years.
When the App Store came around a lot of GPL software was incompatible with it and I think that was a bad thing for society overall. It’s hard to put restrictions on licenses when you do not know what the future looks like.
sloane | 2 hours ago
do we have to take it for granted that interoperability with the App Store is a good thing, full stop? maybe, and i get this is not really a world we ever lived in, but maybe things would be Better if the App Store had been under sufficient pressure to ensure that its model of software distribution played nice with copyleft
wareya | 2 hours ago
It's not a good thing, but it's not necessarily the worst thing in the world, either.
The situation with game consoles is probably easier to reason about: the platform and the various bits of middleware need some basic amount of opaqueness or else cheating in online games would be horribly rampant (rather than an annoying amount of background noise). It's not hypothetical, we saw TF2 on PC get rendered almost unplayable by automated cheat bots for multiple years in a row. In game dev, the line between OS and middleware is slightly fuzzy, because you often pull things that ought to be part of the graphics driver or whatever into middleware for customization and performance reasons, so the GPL's provisions for letting you link with the OS don't apply.
I don't like this, but when what you're selling is the ability to play by fair rules with other humans, you need at least a basic amount of opaqueness, and the licensing situation with all the individual parts is already murky even BEFORE you start throwing the GPL or LGPL at things.
telemachus | 2 hours ago
I don't think I understand your response. Are you saying that it "was a bad thing for society" that Apple declared GPL software incompatible with their store? If so, why does that count against the GPL? Apple made that choice, right?
mitsuhiko | 2 hours ago
I generally think that the GPL (particularly v3) is not a license that found a good balance because of the many quirks it has. In my book it's a fault of the GPL license that it does not work for something like the app store and not the fault of Apple.
telemachus | 2 hours ago
Again, I don't understand how this connects to the part of your original article that I quoted. To repeat: "I’m a strong supporter of putting things in the open with as little license enforcement as possible. I think society is better off when we share."
These two sentences are in tension with each other. (That's why I mentioned the paradox of intolerance.) The Apple app store has very little to do with sharing. People overwhelmingly sell things there, and Apple makes a lot of money in the process.
To put this another way, if the point of all this is to create a "society...[where] we share", why would we choose licenses based on what a commercial company says is compatible with their marketplace?
What you really seem to be arguing is not that society is better off when we share but that society is better off when people are free to make money using things that are available to them with no restrictions. People can agree or disagree, but that's consistent. I have a hard time seeing a coherent position in what you've said so far.
hyperpape | an hour ago
You can release things on the app store for free. If you cannot release GPL code on there for free, then you have, in one regard, been made less free.
Of course, perhaps you think the GPL is good on the whole, that you're more free with it, overall. You can even put the entire blame on Apple. That's a coherent thing to think. But you should acknowledge that there is something you lost.
telemachus | an hour ago
Yes, absolutely fair. There are two freedoms in competition here, and one has to give. That said, I'd argue (again) that the app store is overwhelmingly (and by clear design) not about releasing things for free. (I'd go further and argue that the free releases are a fig-leaf for Apple and a good sales technique for developers. Free releases help Apple precisely because they and others can remind people that some things are free. And free releases help developers who can then do in-app sales and upsell people to the "Pro" edition.)
Do you think that this helps the OP's overall argument? (Genuine question: no snark.)
mitsuhiko | an hour ago
From my standpoint they are not in tension with each other. I believe in sharing, but that does not mean it should apply to everything. Using something shared to do something that is not shared is, for me, the entire point of it.
I think we should use licenses that place as few restrictions as possible because we do not know in which environment we will find ourselves in the future. The GPL placed requirements which were already hard to fulfill on the app store, and particularly some of the clauses in the GPLv2 are really quite tricky to fulfill in practice for older projects.
For me the question is not what a commercial company says can or cannot go on their marketplace; the issue is a license that has terms in it that create these types of situations.
My world view is very simple: we should share, we should not have non-competes, but simultaneously we should all work for commercial enterprises that deliver win-win transactions for their customers. That involves taking open source software and placing it in a commercial context. And I want this to work for both the original author and anyone else.
If you are curious about my stance here, see my thoughts on the FSL and mixing money and open source.
st3fan | 37 minutes ago
It was the other way around actually: the FSF declared the App Store incompatible with their license. Apple couldn't care less and I am 99.9% sure they have never ever spoken about which open source licenses are compatible with their App Store. The App Store guidelines certainly do not mention it.
conartist6 | 11 hours ago
I couldn't have written this better myself.
The paradox is of course that the most logical reading kills open source dead, by basically putting all code in the public domain after a brief period of mining to remove all copyrights.
shapr | 8 hours ago
This puts everything in the training set into the public domain.
Research papers, Windows source code, fintech algorithms and all.
I predict a return to trade secrets.
jaredkrinke | 6 hours ago
Patents, too!
hyperpape | 2 hours ago
Perhaps it kills copyleft, but it would be a massive boon to open source. They're quite different concepts.
conartist6 | 2 hours ago
See, I don't think so. I know exactly what you're saying, and it's an interesting and complex question.
That would really be a communist world though. If I cannot even gain respect or create community in exchange for my open work, it is obvious to me that I would no longer do my work in the open
hyperpape | an hour ago
Yeah, that's fair. It would be a massive current boon to open source, but the long-term consequences would be debatable.
I think in a world where code is cheap, what you get respect for is no longer the code itself, but the work done solving problems and thinking hard about what to build, not the implementation. The same urges that drive open source would motivate people to build things and share them.
pmarreck | 9 hours ago
I guess this is as good a time to announce it as any...
I have recently cleanroom-reimplemented (with LLM assistance and my guidance) par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now)
https://github.com/pmarreck?tab=repositories&type=source
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language (Zig), but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I think they are sufficiently dissimilar for relicensing (these are all BSD or MIT)
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
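A two-way interoperation test of that kind could look roughly like the sketch below. For a self-contained example, Python's stdlib bz2 stands in for BOTH sides; in the real tests one side would be the original bzip2 tooling and the other the reimplementation. The function names are placeholders, not pmarreck's actual test code.

```python
# Sketch of a two-way interoperability test: compress with
# implementation A, verify the container format against the spec,
# then decompress with implementation B and compare. Here stdlib bz2
# plays both roles purely so the example runs anywhere.
import bz2

def compress_a(data: bytes) -> bytes:
    return bz2.compress(data)       # stand-in for the original tool

def decompress_b(blob: bytes) -> bytes:
    return bz2.decompress(blob)     # stand-in for the reimplementation

def interop_roundtrip(data: bytes) -> bool:
    blob = compress_a(data)
    # Format-level check straight from the spec: a bzip2 stream starts
    # with the magic "BZh" followed by a block-size digit '1'..'9'.
    assert blob[:3] == b"BZh" and blob[3:4].isdigit()
    return decompress_b(blob) == data

assert interop_roundtrip(b"hello " * 1000)
```

The design point is that each direction (A-compress/B-decompress and B-compress/A-decompress) exercises a different half of the spec, which is why both directions are needed to claim compatibility.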
The motivation for this was that I am building a for-sale app that cannot initially be open-source (although I plan to make it so eventually). (Maybe there's room for a license type that has a time-limited closed-source clause?)
And yes, I believe the power of LLMs these days makes GPL and copyleft far less of a wall.
dzwdz | 9 hours ago
...wasn't the LLM trained on that source code?
st3fan | 8 hours ago
That doesn't mean there is a copy of the original code in the LLM. It knows "something" but it can't reproduce the original source code. And if it generates code compatible with 7zip it could as well be because it "knows" about discussions about the format or maybe some random person published a document. Or it pieced it together from other data sources.
It is complicated. And not as simple as "oh it was trained so now it can reproduce whatever it was trained on".
sarah-quinones | 7 hours ago
the alleged "complication" is nothing but an attempt to obfuscate the theft of labor and laundering of authorship that is happening before our eyes
the original code went through a lossy compression process. so what? it was still part of the training set. the only way you can clean room implement something this way is by using a model that was not trained on the original source, or any code that was derived from it
wareya | 7 hours ago
A human having seen copyrighted code before doesn't instantly disqualify them from being on Team B. The assembly code for the IBM BIOS was literally printed in the manual, with comments and everything. It's certain that some of the people working on Team B when cloning it had seen bits and pieces of it before, or of code that had been written by other people who had, and were copying patterns from it. But they didn't have access to it specifically during the clean room process, and they hadn't memorized it, and that was enough.
st3fan | 7 hours ago
I think doing a clean room implementation is nothing more than a choice that you can make to strengthen your legal position. It is not a requirement for this kind of work.
IANAL
sarah-quinones | 7 hours ago
i think asking for a clean room impl is a reasonable demand here, given that llms have been shown to be able to regurgitate parts of their training set
dulaku | 4 hours ago
That seems irrelevant when the original claim is that a clean room implementation was done.
dzwdz | 7 hours ago
In general LLMs absolutely can reproduce training data verbatim (think the NYT suit; there's also been a bunch of more recent studies on this). They've even been known to do so unprompted [1, 2].
I actually came across a decent collection of recent studies on this recently, but I can't find it right now. Someone should start a website cataloguing all of this.
pmarreck | 3 hours ago
This only matters logically, I think, if the LLM can reproduce, verbatim, the source code of every project that is public, in a working version.
It cannot.
This would be like accusing a person who is well-versed in compression algorithms and reimplemented one from scratch in a different language (from memory and maybe some written notes) of violating copyright just because he understood how it worked, even though he used different code, different functions, and an entirely different architecture, while still sticking to the spec, which is allowed (and necessary, if one wants an interoperable cleanroom reimplementation).
sarah-quinones | 9 hours ago
was the llm not trained on the public source to begin with?
quad | 8 hours ago
An LLM is a mechanical process. Copyright protects human creative expression.
So, uh, you fed in some markdown (your work) and the source code of 7zip into a mechanical pipeline and got out some new source code. Do you think your contribution to the process was transformative? Because, otherwise, this looks a lot like a derivative work of 7zip to me.
pmarreck | 3 hours ago
Wrong. There was no "source code of 7zip" used. At all. Only the Markdown spec was utilized by the implementing LLM. Nothing else.
quad | 3 hours ago
Oh, I misunderstood! I thought the spec was generated via LLM from previously existing source code. Very cool!
Where are the interoperation tests?
roryokane | 7 hours ago
There are already licenses like these. The OSI has a page about the history of Delayed Open Source Publication. I think the Business Source License is the most famous such license. Fair Source Licenses lists two other such licenses, such as the Functional Source License.
The lawyer-created PolyForm Project also contains the PolyForm Countdown License Grant, which can be used to change to any license. It may be harder to use because you have to figure out how to combine it with the initial closed-source license, and it only supports hard-coding a date for the license change rather than letting you specify a number of years after you publish the software.
pmarreck | 3 hours ago
ah, very cool, thanks for the heads-up!
adrien | 13 minutes ago
That looks like the spec for the file format minus at least the LZMA2 part which is not very interesting IMO.
There's one interesting bit about LZMA: you can't parse it without decoding it. In other words, in order to parse it, you have to implement 99% of decompression. This means the spec isn't specifying much of the parsing of the 7z format.
This is showing the opposite of what you've stated: the implementation at https://github.com/pmarreck/z7z is NOT derived from the specification and MUST be derived from prior knowledge.
Also, even without knowing anything about this, the specification is so short that it should be obvious such a large implementation cannot be derived much from it.
telemachus | 4 hours ago
[Sorry: I put this in the wrong part of the thread. I moved it.]
wareya | 7 hours ago
The way I see this going in the future is going to have humans involved.
You really don't want to gamble on an LLM producing a clean functional specification with no copyrightable material in it, even if you're pro-LLM. You need humans to make sure it's not including code snippets or attempting to decompile code into step-by-step English.
And for the Team B phase, you would for example DEFINITELY not want to use Copilot in particular, ESPECIALLY if the thing you're CRREing is Windows. Copilot has a track record of being overfit, has been trained on internal private MS code, and you would be hurting your chances in court for no reason. If you're trying as hard as possible to definitely be legal, you wouldn't use LLMs in the Team B phase at all, not even a cleaner LLM. But in practice you're going to have to wait until precedent is set for how copyright-tainted LLM outputs are in practice and under what conditions they're treated as such.
Having said that, current copyright law isn't strong enough to entirely protect against machines being used for clean room RE (in particular the spec generation part, if you have a human review it and delete dirty parts). The law on the books doesn't line up with what most programmers' intuition says about how copying does or doesn't work; it's almost entirely about the historical facts behind the creation of the second thing, and not what it looks like or what physical tools were used to get there. If you disagree in principle instead of in the edge cases, you should look into lobbying for changing the copyright laws (serious, not being rhetorical).
dzwdz | 7 hours ago
On the contrary - do you think Microsoft would argue that Copilot, when used as intended, has output infringing content? That's not exactly in their self-interest.
Someone should try this, it'd be pretty funny.
gir | 4 hours ago
someone already did: https://github.com/reactos/reactos/issues?q=is%3Apr+author%3A%40copilot
rau | 12 hours ago
I think that a developer using an AI tool to rewrite a codebase is pretty obviously making a derivative work of the original codebase (I am not a lawyer). AI is — as so often happens to be the case — a red herring.
An AI-assisted cleanroom implementation, though, would pretty obviously not be a derivative work. In that case, a developer uses a tool to write a complete specification of the behaviour of the original codebase, and then uses a tool to write code implementing that behaviour, with none of the original code in the context. There is a longstanding principle that writing an exhaustive natural-language description of a codebase is not creating a derived work.
This post also misunderstands the principle of human authorship. The case in question was about someone who wanted his software to own a copyright, which doesn’t make sense: a piece of software can’t own anything because it’s not a legal person. There are other cases which challenge the idea that people using AI tools own a copyright on the output of those tools, but in my opinion (again, I am not a lawyer) this makes as much sense as saying that an author who writes in Word instead of longhand doesn’t have a copyright to his work.
hoistbypetard | 7 hours ago
If the tool that writes the code implementing that described behavior was trained on the original code, can you really say none of the original code was in the context? I don't think you can. And it's very clear that in this case, all of the frontier models were trained on the original code.
st3fan | 7 hours ago
Posting this to see things from Claude's perspective.
"What does the Python chardet package do?" "Can you reproduce the detect() function without looking at the original source code?" "Do you have the original source code of chardet in your memory?"
https://claude.ai/share/9dc009c1-d23d-4776-8ef9-c923fc4065ee
mdaniel | 6 hours ago
I can't tell if it's an artifact of the "chat" nature of that link, or what, but it has made the same mistake all over the place; I just happened to spot the error in _score_utf16 because it was near the start of the next paragraph.