This seems like a really interesting path for interpretability, especially if a big chunk of a model's behavior occurs pseudo-symbolically. This is an idea I had thought about, integrating tools into the main computation path of a model, but I never imagined it could be done efficiently with just a vanilla transformer.
One of the most interesting pieces I've read recently. Not sure I agree with all the statements there (e.g. that without execution the system has no comprehension), but extremely cool.
I'd like to see this combined with reinforcement learning to optimize models to think computationally. Generating ideas with hypothetical results and then running them in the same thought. Their solution sounded like a lot of tokens though.
This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.
> This works, but the actual execution happened outside the model. The model specified the computation, then waited for an external system to carry it out.
> Our transformer also emits a program, but instead of pausing for an external tool, it executes that program itself, step by step, within the same transformer.
What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
Why is it good that it's "inside" the model? Just making it more elegant and nice? The tool was already "inside" the overall hybrid system. What's the actual problem?
>This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.
Not really sure what this obsession with calling things you don't like AI generated is but it's poor form. If you have something to say about the text then say it. Otherwise leave baseless accusations out of it.
>What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?....
It's pretty clearly an ideological thing. Some people are firmly on the 'some sort of symbolic logic is necessary' camp. From the article, 'A system that cannot compute cannot truly internalize what computation is.'
Some things are just interesting for the sake of it. This is one of those things. I don't agree with the authors on the above and I'm still glad they shared. It's a very interesting read regardless.
I got the same impression as the parent post. Even if it's not AI-generated, the text reads like a politician's speech in a lot of places. Talks a lot, says little.
The idea itself was very cool, so I endured it. But it was not a pleasant read.
> If you have something to say about the text then say it.
I could point out the individual phrases and describe the overall impression in detail, or I can just compactly communicate that by using the phrase "AI". If it bothers you, read it as "AI-like", so there's no pretension of certainty.
I have no problem with using AI for writing. I do it too, especially for documentation. But you need to read it, iterate with it, and give it enough raw input context. If you don't give it info about your actual goals, intentions, judgments, etc., the AI will substitute washed-out, averaged-out, no-meat-on-the-bone fluff. It may sound good at first read and give you a warm wow-effect that makes you hit publish, because you read into it all the context you have in your head; readers don't have that context.
Formatting and language are cheap now. We need a new culture around calling out sloppy work. You would have had no problem calling out a badly composed, rambling article 5 years ago. But today you can easily slap an AI filter on it that makes it look grammatical and feel narratively engaging, so now it's all about the deeper content. And if one points that out, replies can always say "oh, you can't prove that, can you?"
>"This shows the downside of using AI to write up your project."
I just find phrases like this a bit obnoxious at times.
>You would not have had a problem with calling out a badly composed rambling article 5 years ago.
Then why not just say that? "It's rambling", etc. What's so hard about that? Why invent a reason for the issues, as if rambling articles didn't get written 5 years ago?
Like, no, being written by an LLM or not is not the reason the article has no benchmarks or interpretability results. Those things would or wouldn't be there regardless, depending on whether the author was interested in them. So again, there seems to be little point in making such assertions.
It's very hard to discuss this. To some people it's obvious, to some it isn't. To me, every single paragraph is obvious fluff AI writing. One problem with it is the repetitiveness and the schmoozing-salesman feel. The other is the lack of benchmarks and such. It's both, and the two are connected: the AI has to lean into its bullshitter persona when it's not given enough raw material to write up something strong. And whenever an AI writes in its default voice like this, it also indicates that the context was not well curated.
But anyway, yes, I can also just move on to the next article. Most of the time I indeed do that.
For what it’s worth, I agree with you; the article is LLM written although not with the usual gotchas, so they’re more subtle.
The subtle ones like this I don’t mind too much, as long as they get the content correct, which in this case leaves quite a bit to be desired.
I’m also noticing that some people around me appear to just be oblivious to some LLM signals that bother me a lot, so people consume media differently.
I absolutely do believe that AI generated content needs to be called out, although at this point it’s safe to say that pretty much all online content is LLM written.
I'm glad they shared too! I just wish they'd shared it without letting the LLM process it so heavily. It makes it hard to read: it gives monotone importance to every piece of text, mostly by inflating everything to a slight over-importance with tone and fluff language, and by turning everything into dry statements of fact.
As to why people call this out without going into great detail about the problems with the actual text, it's because this is happening all over the place and it's very disrespectful to readers, who dig into an article that looks very well written on the surface, only to discover it's a lot of labor to decode and often (but not always) a total waste of time. Asking for a critical report of the text is asking even more of a reader who already feels duped.
This is a nice case study of the downside of creating explicit policies of "no AI comments" without a technical method of enforcing them. I am sure Hacker News comment quality will suffer almost as much from an escalating culture of accusation and paranoia as it will from LLM comments themselves.
Agreeing first that it is genuinely interesting, let me make a constructive comment on the text: Early on, there are too many small paragraphs that don't on their own make a cogent argument. That important but easily overlooked structural work is pushed back to the reader. I felt rewarded in pushing past that though. Bravo.
> Not really sure what this obsession with calling things you don't like AI generated is but it's poor form
Admonishing someone for correctly identifying AI-written or AI-edited blog posts is poor form, friend.
It is without a doubt written by an LLM. All of the telltale signs are there. I work with these tools 8-20 hours a day and after a while the verbiage and grammatical structures stick out like a sore thumb.
Get off the high horse. I too think this is a very interesting read. I was fascinated by the subject, but the presentation was nauseatingly distracting and immediately raises yellow flags about how Percepta operates and what kind of quality they're willing to settle for. It tells me they are more interested in appearances and superficiality.
The numbers that are there categorically cannot be trusted, because hallucinating those details is quite common for models. There is simply no indication that a human adequately proof-read this and therefore any of its claims must be taken with a grain of salt. Don't forget the recent Cloudflare+Matrix debacle: https://news.ycombinator.com/item?id=46781516
I share the same concerns as OP; this post lacks metrics and feels like someone did something cool and raced to get an AI to post about it, instead of giving it a proper treatment.
I don't care how sure you are. Honestly, it's irrelevant. 99% of the time, it's a more pleasant and productive conversation for everyone involved if you just focus on issues you had with the text itself than any nebulous AI involvement.
From my point of view, all you've done is said a lot of nonsense and fabricated a convoluted explanation for why you think the text is bad. I'm fine on my horse thanks.
So people can no longer freely point out that a piece of work being automated, and its lack of meat, are red flags for the veracity of its content, but your antagonistic metacommentary toward the people pointing out factual information is welcome discourse?
You claimed "this obsession with calling things you don't like AI generated" is "poor form", attacking the parent commenter by claiming they are lying about the nature of the content. However, multiple people have pointed out the clear signs which you missed, and the consensus is that you were wrong. Now you suddenly don't care about this point, and have introduced a new argument instead.
"From my point of view, all you've done is said a lot of nonsense and fabricated a convoluted explanation for why you think the text is bad"
What a bad-faith response. Categorically dismissive, vague, antagonistic and ultimately failing to critically engage with anything I said.
Whether a piece of work is automated and "lacks meat" is ultimately not something you can know for sure as a reader. Articles like this existed plenty pre-AI and will exist plenty post-AI, LLM involvement or not, so yeah, it's pretty pointless to focus on that. It adds nothing, and all we have to go on is your own surety, which is fallible. If you can't recognize that, there's not much to say.
I didn't miss anything. I never cared about it one way or the other. What clear signs have people pointed out? This is the problem. It's apparently so obvious, yet even the original commenter admits "it's things humans do too". What is clear about that?
Your inability to recognize the clear imprint of current-generation language models on this article doesn't mean they're not present.
All knowledge is ultimately fallible, but ignoring or not being able to appreciate the high statistical likelihood of this article being LLM edited/generated doesn't change reality.
You're asking me to share my expertise with you so that you can understand, but your antagonistic overtones make it not feel worth the time and effort. Other readers have also pointed out that it has characteristic idiosyncrasies. Feel free to look into it yourself, but it would also be wise to learn to defer these kinds of attacks until you have all the information.
The post is the perfect example of the kind of writing about AI that dupes people that don't really understand how things like LLMs actually work and are actually trained. Anyone who properly understands these things finds the complete and total lack of detail about training and the loss function (and of course real metrics / benchmarks) to be a monstrous red flag here.
Especially egregious to me is the claim "Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself". This is total weasel language: we can propagate gradients through any transformer architecture and all sorts of other much more insane architectural designs, but that is irrelevant if you don't have a continuous, differentiable loss function that can properly weight partially correct solutions or the likelihood/plausibility of arbitrary model outputs. You also need a clear source of training data (or a way to generate synthetic data).
So for, e.g., AlphaFold, we needed to figure out a loss function that continuously approximated the energy of various molecular configurations, and this is what really allowed it to actually do something. Otherwise, you are stuck with slow and expensive reinforcement-based systems.
The other tells are garbage analogies ("Humans cannot fly. Building airplanes does not change that; it only means we built a machine that flies for us"). Such analogies add nothing to understanding, and indeed distract from serious, real understanding. Only dupes and fools think you can gain any meaningful understanding of mathematics and computer science through simplistic linguistic analogies and metaphors without learning the proper actual (visuospatial, logical, etc.) models and understanding. Thus, people with real and serious mathematical understanding despise such trite metaphors.
But then, since understanding something like this properly requires serious mathematical understanding, copy like that is a huge tell that the authors / company / platform puts bullshitting and sales above truth and correctness. I.e., yes, a huge yellow flag.
Honestly, the most interesting thing here is definitely that just 2D heads are enough to do useful computation (at least they are enough to simulate an interpreter) and that there is an O(log n) algorithm to compute argmax attention with 2D heads. It seems that you could make an efficient pseudosymbolic LLM with some frozen layers that perform certain deterministic operations, but also other layers that are learned.
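To make the "frozen deterministic heads" point concrete, here's a toy sketch (my own construction, not the article's O(log n) convex-hull one) of how a saturated attention head with 2-D keys acts as a deterministic argmax/copy operation:

```python
import numpy as np

def hard_argmax_attention(q, K, V, scale=1e4):
    # With a large enough scale, softmax(scale * K q) saturates into a
    # one-hot over the best-matching key, so the head deterministically
    # *copies* the corresponding value row: output ~= V[argmax(K @ q)].
    scores = scale * (K @ q)           # (n,)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V

# 2-D keys/queries suffice for selection: place keys on distinct
# directions and point the query along the direction you want to read.
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])
q = np.array([0.0, 1.0])  # points at key 1

out = hard_argmax_attention(q, K, V)
assert np.allclose(out, V[1])
```

Frozen heads like this never need training; their weights can be written down by hand, which is presumably in the spirit of what the article compiles.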
I read a lot of LLM text every day, so I'm quite good at seeing the cadence, the narrative structures and the phrasing styles. It's not just "it's not just X but Y" or emdashes. I could point them out and you would say oh humans use this trope or phrasing style too, and of course that's true. It's still a tell. But it's pointless to argue about this.
Here is a rough list, some may be contentious individually, but the more of these appear, the more you should suspect an LLM:
Cadence and rhythm: LLMs produce sentences with an extremely low variability in the number of clauses. Normal people run on from time to time, (bracket in lots of asides), or otherwise vary their cadence and rhythm within clauses more than LLMs tend to.
Section headings that are intended to be "cute" and "snappy" or "impactful" rather than technically correct or compact: this is especially a tell when the cuteness/impactfulness is deeply mismatched with the seriousness or technical depth of the subject matter.
Horrible trite analogies that show no actual real understanding of the actual logical, mathematical, or visuo-spatial relationships involved. I.e. analogies are based on linguistic semantics, and not e.g. mathematical isomorphism or core dynamics. "Humans cannot fly. Building airplanes does not change that; it only means we built a machine that flies for us". Can't imagine a more retarded and useless analogy for something as complex as the article topic.
Verbose repetition: The article defines two workarounds, "tool use" and "agentic" orchestration, and then in the paragraph immediately following says the exact same thing again. There are multiple small paragraphs that say nothing more than the single sentence "LLMs do not reliably perform long, exact computations on their own, so in practice we often delegate the execution to external tools or orchestration systems".
Pseudo-profound bullshit: (https://doi.org/10.1017/S1930297500006999). E.g. "A system that cannot compute cannot truly internalize what computation is." There is thankfully not too much of this in the article, and it appears mostly early on.
Missing key/basic logic (or failing to mention such points clearly) when this would be strongly expected by any serious practitioner or expert: e.g. in this article, we should have seen some simple, nicely centered LaTeX showing the scaled dot-product self-attention equation, and then some simple notation for the `.chunk` call and subsequent linear projection, something like H = [H1 | H2]. I shouldn't have to squint at two small lines of PyTorch code to find this. It should be clear immediately that this model is not trained and that this is essentially just compiling a VM into a transformer, not something revealed only at the end.
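For what it's worth, the missing math is just standard scaled dot-product attention, softmax(QK^T / sqrt(d)) V, with the feature dimension chunked into heads and the head outputs concatenated as H = [H1 | H2] before a final linear projection. A minimal numpy sketch (toy shapes of my choosing, not the article's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_head_attention(X, Wq, Wk, Wv, Wo):
    # Project, then split ("chunk") the feature dimension into two heads.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    (Q1, Q2), (K1, K2), (V1, V2) = (np.split(M, 2, axis=-1) for M in (Q, K, V))

    def head(q, k, v):
        d = q.shape[-1]
        return softmax(q @ k.T / np.sqrt(d)) @ v  # softmax(QK^T / sqrt(d)) V

    H = np.concatenate([head(Q1, K1, V1), head(Q2, K2, V2)], axis=-1)  # H = [H1 | H2]
    return H @ Wo  # final linear projection

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                                # 5 tokens, width 4
Wq, Wk, Wv, Wo = (rng.normal(size=(4, 4)) for _ in range(4))
Y = two_head_attention(X, Wq, Wk, Wv, Wo)
assert Y.shape == (5, 4)
```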
The key difference is that the model is able to write the program as it’s executing it.
Before, it needed to write the code and have an external program execute it. Here it can change its mind mid-execution. Kind of like what was observed in the CoT "aha moment".
> Is it that you can backprop through this computation? Do you do so?
With respect, I feel that you may not have read the article.
> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
and,
> By storing points across nested convex hulls, this yields a decoding cost of O(k+log n).
and,
> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
So yes, and yes.
> Where are the benchmarks?
Not clear what they should benchmark it against. They do compare speed to a normal KV cache. As for performance: if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find a model doing < 100%. Sure, it would be nice to see the data here; agree with you there.
Personally I think it would be really interesting to see if this method can be combined with a normal model, MoE-style. It is likely possible; the router module should learn quite quickly that it predicts the right tokens deterministically for some subset of problems. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact it never would have occurred to me to go straight for WASM; directly embedding a VM is a pretty interesting choice. But it makes me wonder what "smaller" interpreters could be useful in this context.
I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.
The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.
I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is" which doesn't say anything to me.
Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it, but, ok, backprop is for cases where we don't know how to encode a function using strict logic. Why would you try and backprop through a Sudoku solver? It's probably my imagination is just limited but I could have used more on that.
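For anyone else wondering what backprop "through" a hard solver would even buy you: the usual answer is that you replace hard discrete choices with soft relaxations, so that a learning signal exists at all. A toy illustration of that general trick (mine, not from the article):

```python
import numpy as np

def hard_pick(scores, values):
    return values[np.argmax(scores)]           # gradient w.r.t. scores is 0 almost everywhere

def soft_pick(scores, values, temp=1.0):
    w = np.exp(scores / temp)
    w /= w.sum()                               # softmax: a differentiable relaxation of argmax
    return w @ values

scores = np.array([1.0, 2.0, 0.5])
values = np.array([10.0, 20.0, 30.0])

# Finite-difference "gradient" with respect to scores[0]:
eps = 1e-5
def fd(f):
    bumped = scores.copy()
    bumped[0] += eps
    return (f(bumped, values) - f(scores, values)) / eps

assert abs(fd(hard_pick)) < 1e-9   # hard choice: no learning signal
assert abs(fd(soft_pick)) > 1e-3   # soft choice: gradients flow
```

Why you'd want that for a Sudoku-like step is exactly the open question: presumably so the surrounding model can be trained end-to-end to decide *when and how* to invoke the compiled computation, not to relearn the solver itself. But the article doesn't demonstrate that.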
Did you read the post you are responding to? It says:
> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"
There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here.
My understanding was that they are not training at all, which would explain that. They are compiling an interpreter down to a VM that has the shape of a transformer.

I.e., they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
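A toy version of "calculating the weights" rather than learning them: a fixed increment-mod-10 operation written directly into a matrix as a permutation over one-hot tokens (my own illustration, obviously far simpler than their construction):

```python
import numpy as np

# "Compiling" a fixed operation into weights: an increment-mod-10 lookup
# table written directly as a 10x10 permutation matrix. No training, no
# gradients; the weights are calculated, not learned.
VOCAB = 10
W = np.zeros((VOCAB, VOCAB))
for d in range(VOCAB):
    W[d, (d + 1) % VOCAB] = 1.0   # one_hot(d) @ W == one_hot((d + 1) % 10)

def step(digit):
    one_hot = np.eye(VOCAB)[digit]
    return int(np.argmax(one_hot @ W))

assert [step(d) for d in (0, 7, 9)] == [1, 8, 0]
```

Chain enough hand-built operations like this (plus attention for data movement) and you get a fixed-function machine in transformer clothing, which is how I read the article's claim.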
EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and IMO outright misleading / deliberately deceptive stuff early on (e.g. about potential differentiability of their approach, even though they admit later they aren't sure if the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read this autistically / like a lawyer, but that's likely due to a lack of human editing and too much AI assistance.
Well, for one, by eliminating external tool calling, the model gains a measure of security. The tools called by an LLM can be corrupted, and in this scenario no external tools are called at all.
I wish people put half as much energy into actually doing things as they did to complaining about AI generated text. We'd have ascended to energy based being about 18 months ago.
Interesting... But why? What is the benefit, other than increasing our understanding of model architectures?
Our brains can also simulate turing machines, slowly. We automated that with computers that are faster and more reliable. So why not allow a model to use external much faster and reliable tools, just as we do?
Why must models be analogous to humans using tools? Or, to take the analogy further, wouldn't it be better if humans had calculators built into their brains, provided they are deterministic and reduce latency?
Because it is directly analogous. Neural nets (whether biological or artificial) are not the best way to execute lots of deterministic computations quickly and reliably. That's why we invented computers.
I'm not convinced at all that this is the best way to reduce latency; there are many other ways of doing that.
Having a calculator in our brains would be handy of course, but a gigahertz multi-core computer is still going to be better at anything that needs a lot of computation or a lot of data.
Exactly. They've implemented a VM inside a transformer, turned an O(1) memory access call into O(n), optimized it down to O(log n) and wrote a post about how smart they are.
It's a nice bit of engineering, if you don't subscribe to YAGNI. If you do, you must ask the obvious question of what capability this delivers that wasn't available before. The only answer I've got is that someone must have been a bit chilly and couldn't figure out the thermostat
I spent the entire time reading it pondering the same thing.
1. The article presents calling out to a tool like Python as "expensive" because of the overhead of forking a process, loading up the Python env, etc. But why not just eliminate that overhead and embed WebAssembly so the "tool call" cost is near zero? This feels very similar to the discussion in the 90s around the overhead of threads vs. processes, or kernel space vs. user space. You could even go further and keep a BEAM VM running so the LLM can write Elixir, which is ideal for LLMs that stream out code. Elixir programs will be a lot shorter than WebAssembly.
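On point 1, the process-startup overhead is easy to measure yourself. A rough sketch (timings are obviously machine-dependent; the gap is typically several orders of magnitude):

```python
import subprocess
import sys
import time

# Cost of "fork a process + start an interpreter" per tool call,
# versus evaluating code that is already resident in-process.
def out_of_process():
    subprocess.run([sys.executable, "-c", "print(2 + 2)"], capture_output=True)

def in_process():
    eval("2 + 2")

t0 = time.perf_counter()
out_of_process()
t_proc = time.perf_counter() - t0          # typically tens of milliseconds

t0 = time.perf_counter()
for _ in range(1000):
    in_process()
t_in = (time.perf_counter() - t0) / 1000   # typically microseconds

assert t_proc > t_in  # interpreter startup dominates the call itself
```

This is essentially the case for an embedded WASM (or BEAM) runtime: amortize the startup once and every subsequent "tool call" is just an in-process function call.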
2. The core argument stated is "A system that cannot compute cannot truly internalize what computation is." The idea being that it could write a program, execute it and by seeing all of the steps maybe even part way through stop and change its mind or when writing new programs write them better, aka be able to debug on the fly?
3. Not mentioned, but there is a 3rd x factor that LLM's will use this new found computation engine to do overall better at "thinking". Computing in very unexpected ways and to unexpected problems. Maybe it would do dramatically better at some benchmark because of this?
Unfortunately these are not explored, and it is just an execution engine, even concluding that "arbitrary programs can be compiled directly into the transformer weights, bypassing the need to represent them as token sequences at all", which goes back to point 1: if we are compiling to weights anyway, why not just optimize the tool calling?
> "A system that cannot compute cannot truly internalize what computation is."
The way this is formulated, it almost sounds like they think giving LLMs this ability will bring them closer to having experiences of computation or something? Weird.
I really liked the article, but food for thought: is a transformer that offloads computation to Python really that different from Python code being read and then executed by an interpreter?
Both examples are of a system we created to abstract most of the hard work.
I think a more important concept here is that the term "AI" has a lot of built-in assumptions, one of which being that it is (or will be) super intelligent, and so folks like the author here think (correctly) that it's important for the AI to be actually doing the work itself.
This seems way cooler than just computation (which is easy to hand off to a tool, and arguably more predictable that way). The broader point here is that you can have your model switch dynamically to/from a kind of attention that scales with the log of the token count, by only exploring the convex hull in a 2D space. A less capable version of attention, to be sure, but one capable of tracing a program’s execution with text representations of registers and stack - which is a meaningful level of flexibility, and one many humans would find difficult to do reliably!
What could you do with an LLM that can go into “focus mode” and generate tokens extremely rapidly? How much more powerful would a reasoning-token-generation phase be that can explore and cull large numbers of paths/hypotheses, so long as they are well defined? Does this have implications for multi-modal models and spatial reasoning?
As the paper suggests:
> These models could be useful in several modes: as a dedicated fast path paired with a slower, more general model; as part of a fast/slow hybrid architecture inside a single system; or as a speculative execution model that proposes tokens quickly while a regular-attention model verifies and accepts them. Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
The speculation about fast/slow hybrid architectures is the most interesting part of the paper to me. We're already seeing this pattern work at the inference pipeline level -- you can route easy tasks to quick single-shot generation and send hard tasks through multi-candidate generation with iterative self-repair, allocating compute proportionally to difficulty. The cost difference is dramatic: a simple knowledge query might take 30 seconds, while a hard coding problem might take 20 minutes through the full pipeline.
What's exciting about this paper is the possibility of that routing happening within a single forward pass rather than requiring external orchestration. Though honestly, even external orchestration with a good confidence scorer gets you most of the way there today.
Very cool idea. But the time savings don't hold for every tool call, and it's not clear to me yet whether this is batchable. Also, intuitively, for most models that run on GPU, you'd still want to offload the tool-exec part to CPU since it's much cheaper...
If you push tool execution into the model itself, you inherit all the I/O unpredictability and error handling baggage, but now inside a GPU context that's allergic to latency. Inference throughput tanks if external calls start blocking, and A100s make expensive waiters. Batching is fantasy unless you know up front exactly what gets executed, which is the opposite of dynamic tools. If you want "faster" here, the trade is reliable deterministic compute versus the usual Wild West of system calls and side effects.
This seems like it has some potential, but is pretty much useless as it is.
Shame there are no weights released - let alone the "compiler" tool they used to actually synthesize computational primitives into model weights. It seems like a "small model" system that's amenable to low budget experiments, and I would love to see what this approach can be pushed towards.
I disagree with the core premise, it's basically the old neurosymbolic garbage restated, but embedding predefined computational primitives into LLMs could have some uses nonetheless.
LLMs are not deterministic per my understanding. A program always produces the same output for the same input and instructions (ignore FP accuracy for now). How is determinism achieved here?
LLMs may be deterministic for a subset of inputs, if one output (or intermediate-layer) token probability is significantly higher than the rest. My understanding is that when probabilities are close, outputs diverge.
LLMs produce a distribution of token probabilities which is then sampled. This sampling is the only random part of the system.
If you just take the most probable token every time, the system becomes fully deterministic. We don't do this as the output becomes more stiff and less creative.
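A toy sketch of that point: greedy (temperature-0) decoding of the same logits is fully deterministic, while the sampling step is the only stochastic part:

```python
import numpy as np

def greedy(logits):
    return int(np.argmax(logits))              # temperature 0: always the same token

def sample(logits, rng, temp=1.0):
    p = np.exp(logits / temp)
    p /= p.sum()                               # softmax into a distribution, then draw
    return int(rng.choice(len(logits), p=p))

logits = np.array([1.0, 3.0, 2.0])

# Greedy decoding gives the same token every time:
assert all(greedy(logits) == 1 for _ in range(100))

# Sampling does not; repeated draws land on different tokens:
rng = np.random.default_rng(0)
draws = {sample(logits, rng) for _ in range(100)}
assert len(draws) > 1
```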
LLMs (or at least transformer-based LLMs) are effectively almost entirely deterministic, the randomness being largely only present due to (unnecessary) optimizations and other tweaks.
Temperature is not at all core to LLMs, it is something that rather makes the outputs more varied and desirable for human consumption generally. It is trivial to set to zero for applications like this.
On CPUs, the models are essentially fully deterministic, even with FP accuracy, and most common kernels have reproducible (albeit slower) variants even on GPUs. Otherwise, yes, FP non-associativity on GPUs is the only real source of randomness in inference.
The other issue is batch invariance, but this is a problem that occurs only at scale, where serving multiple users means batch composition introduces some randomness too. You can (usually) trivially eliminate this by controlling what goes into the batch or setting the batch size to one. There are also other, more clever mitigations, none of which are secrets.
> the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
IMHO the key point at which this technique has an unfair advantage vs a traditional interpreter is here.
How disruptive is it to have differentiability? To me it would mean that some tweaking can happen in an LLM-program at train time, like changing a constant, or switching from one function call to another. Can we gradient-descend effectively inside this huge space? How different is it from tool-calling from a pool of learned programs (think GitHub, but for LLM programs written in classic languages)?
The Percepta stuff would seem to demonstrate a mechanism for implementing "thinking". I don't understand how foundation models implement "thinking", but my intuition is that models are specifically trained for matching on and following procedural patterns. A task in a given domain can be performed through an associated and encoded procedure. The model holds all the linkages, as weights, that allows a procedure to be conditionally incrementally generated and performed. Does anyone have any insights about how LLM "thinking" is trained and coded?
Basically just madlibs: the models generate intermediate tokens that help predict a better answer, based on training (RLHF and otherwise). They tend to look like "reasoning" because those tokens correlated with accepted answers during training.
Extended thinking passes are just more of the same. The entire methodology exists merely to provide additional context for the autoregression process. There is no traditional computation occurring
Early thoughts - this is very interesting and quite possibly revolutionary. If they have legitimately emulated a computer with memory reliably inside a transformer - that will open up an entirely new world for research.
I don’t want to say too much too soon, but I am pretty excited about this.
I love how this paper describes what actually happens and what the current tradeoffs are.
That having been said, many LLMs run on SIMD GPUs, in warps; basically they are just doing a lot of vector multiplications, activation functions, and KV self-attention (the expensive step).
The issue is that we want LLMs to be one-way through the layers, whereas Turing-complete programming languages support loops with no well-defined stopping time. You can stick a simple computer into an LLM, but it won't be able to do long loops.
However, for these specific workloads, the need to attend only to the latest state is indeed a huge optimization! Gone is the need for the n^2 complexity that dominates the cost; now it is (log n)^2 attention, which is far smaller.
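The "no long loops" point can be sketched as follows; the toy machine and step function are purely illustrative (a Collatz iterator, nothing from the article). A feed-forward stack can only unroll a loop a fixed number of times, so any input needing more steps than there are "layers" cannot finish.

```python
# Each "layer" performs one step of a toy machine; depth is fixed at
# build time, so inputs needing more steps than layers never halt here.
def step(state):
    n, done = state
    if done or n == 1:
        return (n, True)
    return (n // 2 if n % 2 == 0 else 3 * n + 1, False)

def run_fixed_depth(n, layers):
    state = (n, False)
    for _ in range(layers):  # fixed unrolling, like a layer stack
        state = step(state)
    return state

print(run_fixed_depth(6, 10))   # 6 reaches 1 in 8 steps: fits in depth 10
print(run_fixed_depth(27, 10))  # 27 needs 111 steps: cannot finish in depth 10
```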
If you understood the article, please correct my understanding:
They created a new training dataset which also includes step-by-step computation (multiplying two numbers or playing Sudoku) and then trained a transformer on it. As a result, the model performs the computation (multiplying two numbers) "inside" itself instead of calling a calculator (or Python)?
++ And they also figured out how to make attention faster?
I can't see anything about "training a transformer". I'm trying to understand if e.g. the Sudoku solver was learned from examples (in which case, what examples?) or whether it was manually coded and then "compiled" into weights.
There is no training in the usual sense of the term, i.e. no gradient descent, no differentiable loss function. They use deceptive language early on to make it sound this way, but near the end make it clear their model as is isn't actually differentiable, and in theory might still work if made differentiable. But they don't actually know.
But IMO this is BS because I don't know how one would get or generate training data, or how one would define a continuous loss function that scores partially-correct / plausible outputs (e.g. is a "partially correct" program / algorithm / code even coherent, conceptually).
Curious how this handles non-determinism. Most transformer inference has temperature > 0, which means the "program execution" is probabilistic. The interesting question is whether the speedup holds when you need consistent outputs across multiple calls.
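That distinction can be sketched with a hypothetical `sample` function standing in for one decoder step: at temperature 0 (greedy argmax) the chosen token is a pure function of the logits, while at temperature > 0 it is a draw from a distribution, so repeated "program execution" can diverge.

```python
import math, random

def sample(logits, temperature, rng):
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / z
        if r < acc:
            return i
    return len(logits) - 1

logits = [2.0, 1.9, 0.1]
greedy = {sample(logits, 0.0, random.Random(s)) for s in range(100)}
warm = {sample(logits, 1.0, random.Random(s)) for s in range(100)}
print(greedy)  # always the same token
print(warm)    # several tokens: execution becomes probabilistic
```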
So, what I'm trying to understand, and I can't find any clear information about that in the article, is how they "compiled" e.g. the Sudoku solver into a Transformer's weights. Did they do it manually? Say, they took the source of a hand-coded Sudoku solver and put it through their code-to-weight compiler, and thus compiled the code to the Transformer weights? Or did they go the Good, Old-Fashioned, Deep Learning way and train their Transformer to learn a ("100% correct"!) Sudoku solver from examples? And, if the latter, where's the details of the training? What did they train with? What did they train on? How did they train? etc etc.
My interpretation is that they built a simple virtual machine directly into the weights, then compiled a WASM runtime for that machine, then compiled the solver to that runtime.
Nope, they encoded or compiled in a simple VM / WASM interpreter to the transformer weights, there is no training. You'd be forgiven for this misreading, as they deliberately mislead early on that their model is (in principle) trainable, but later admit that their actual model is not actually differentiable, but that a differentiable approximation "should" still work (despite no info about what loss function or training data could allow scoring partially correct / incomplete program outputs).
Is their convex hull attention mechanism new and generally usable? I mean, it substantially restricts the shape of the model, so it isn't a universal solution of course, but it does seem to overcome a pretty annoying limitation.
If you read the section "Richer attention mechanisms", you can see, no, the mechanism is not generally usable (it requires significant modification to become differentiable). They later speculate:
> While we do not yet know whether exact softmax attention can be maintained with the same efficiency, it is easy to approximate it with k-sparse softmax attention: retrieve the top-k keys and perform the softmax only over those
But if you have played around with training models that use top-k or other hard-thresholding operations (e.g. in PyTorch), or just think about how many gradients become zero under such an operation, you know that these tend to work only in extremely limited / specific cases, and make training even more finicky than it already is.
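A minimal sketch of the k-sparse softmax idea quoted above: take the top-k scores and softmax only over those. Note the hard selection: keys outside the top-k get exactly zero weight, so in training they would also get exactly zero gradient, which is the finicky part being described.

```python
import math

def k_sparse_attention(scores, values, k):
    # Hard top-k selection, then softmax restricted to the selected keys.
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    weights = [exps.get(i, 0.0) / z for i in range(len(scores))]  # hard zeros
    out = sum(w * v for w, v in zip(weights, values))
    return out, weights

scores = [3.0, 1.0, 2.5, -1.0]
values = [10.0, 20.0, 30.0, 40.0]
out, w = k_sparse_attention(scores, values, k=2)
print(w)    # nonzero only at the two largest scores (indices 0 and 2)
print(out)
```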
This has a lot of potential. Especially if the compiled "code" can be efficiently shared between models of the same architecture. That would easily overshadow LoRA and finetuning in general.
> The key technical unlock is to restrict lookup heads to head dimension 2, which enables a decoding path where the dominant retrieval/update operations can be computed in log time in the sequence length (for this structured executor regime), rather than by a full prefix-sized attention sweep.
Edit: I understand how hullkv works now. Very clever.
I don't understand why this strategy is applicable only to "code tokens".
Lastly, I'm not sure why WASM is a good target; IIRC WASM is really inefficient (not so much in code as in expressivity). I wonder if that curtails the LLM's ability to plan higher-order stuff (since it's always forced to think in the small).
> i have a pretty good understanding of how transformers work but this did not make sense to me. also i dont understand why this strategy is applicable only to "code tokens"
Yes, there is a monstrous lack of detail here and you should be skeptical of most of the article's claims. The language is also IMO non-standard (serious people don't talk about self-attention as lookup tables anymore; that was never a good analogy in the first place). No good write-up would express this in language alone: there would also be a simple equation showing the typical scaled dot-product attention formula, and then some dimension notation/details indicating which matrix (or inserted projection matrix) got a dimension of two somewhere. Otherwise, the claims are inscrutable (EDIT: see edit below).
There are also no training details or loss function details, both of which would be necessary (and almost certainly highly novel) to make this kind of thing end-to-end trainable, which is another red flag.
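For reference, since the comment notes the article never writes it out, the standard scaled dot-product attention against which such claims should be stated is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where $d_k$ is the per-head key/query dimension; the "head dimension 2" claim quoted elsewhere in the thread amounts to $d_k = 2$ for the lookup heads.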
EDIT: The key line seems to be around:
gate, val = ff_in(x).chunk(2, dim=-1)
and related code, plus the lines "Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head" but, again, this is very unclear and non-standard.
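To make the quoted dimension claim concrete, here is a sketch of just the arithmetic, with a pure-Python stand-in for the `.chunk(2, dim=-1)` call (the names `gate` and `val` follow the article's snippet; everything else is illustrative):

```python
# With d_model = 36 and n_heads = 18, each head works in exactly 2
# dimensions -- points in the plane, which is what a convex-hull-based
# lookup would need.
d_model, n_heads = 36, 18
head_dim = d_model // n_heads
assert head_dim == 2  # "exactly 2D per head"

def chunk2(vec):
    """Mimic tensor.chunk(2, dim=-1) on a flat feature vector."""
    half = len(vec) // 2
    return vec[:half], vec[half:]

x = list(range(2 * d_model))  # an ff_in output of width 2 * d_model
gate, val = chunk2(x)
print(len(gate), len(val))    # 36 36
```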
Well, one can never be sure what the real motivation is for a lot of DL advances, as most papers are post-hoc obscurantism / hand-waving or even just outright nonsense (see: internal covariate shift explanations for batch norm, which arguably couldn't be more wrong https://arxiv.org/pdf/1805.11604).
When you really get into this stuff, you tend to see the real motivations as either e.g. kernel smoothing (see comments / discussion at https://news.ycombinator.com/item?id=46357675#46359160) or as encoding correlations / feature similarities / multiplicative interactions (see e.g. broad discussion at https://news.ycombinator.com/item?id=46523887). IMO most insights into LLM architectures and layers tend to come from intuitions about projections, manifolds, dimensionality, smoothing/regularization, overparameterization, matrix conditioning, manifold curvature, etc.
There are almost zero useful understandings or insights to be gained from the lookup-table analogy, and most statistical explanations in papers are also post-hoc and require assumptions (convergence rates, infinite layers, etc) that are never shown to clearly hold for actual models that people use. Obviously these AI models work very well for a lot of tasks, but our understanding of why they do is incredibly poor and simplistic, for the most part.
Of course, this is just IMO, and you can see some people in the linked threads do seem to find the lookup table analogies useful. I doubt such people have spent much time building novel architectures, experimenting with different layers, or training such models.
Treating attention as a lookup operation is popular among computational complexity theorists (e.g. https://arxiv.org/abs/2310.03817 ) because it's easier to work with when you're explicitly constructing a transformer to perform a particular computation, just to demonstrate that tranformers can, in theory, perform it. That's also why there are no training details: the weights are computed directly and not trained.
This is a good link and important (albeit niche) qualification.
It is hard to square with the article's claims about differentiability and otherwise lack of clarity / obscurantism about what they are really doing here (they really are just compiling / encoding a simple computer / VM into a slightly-modified transformer, which, while cool, is really not what they make it sound like at all).
If we take the human brain as an example, it's pretty bad at computation. Multiplying two 10-digit numbers takes forever, despite the enormous size of its neural network. It's not the right tool for the job - a few deterministic logic gates could do that much more efficiently. That same circuit can't do much else, but multiplying, oh boy, it's good at that! Why do we think that artificial neural nets would be the right tool for that job? What's wrong with letting the LLM reach out to an ALU to do the calculation, just like a human would do? It's surely going to be quicker and require less energy.
The embedded programs can be connected to the other weights during training, in whatever way the training process finds useful. It doesn't just have to be arithmetic calculation. You can put any hard-coded algorithm in there, make the weights for that algorithm static, and let the training process figure out how to connect the other trillion weights to it.
If we never try, we'll never know. I wouldn't be surprised if there is something to gain from a form of deterministic computation which is still integrated with the NN architecture. After all, tool calls have their own non-trivial overhead.
The real answer is somewhere in between. You don't necessarily need computation inside the weights, but you do want it tightly integrated with the model's inference loop rather than as a disconnected tool call.
The interesting middle ground -- which I think is more practical near-term -- is building verification and repair loops around a frozen model. Let the model generate code, then execute it in a sandbox, then feed failures back into the model for self-repair using its own generated test cases. The model never "internalizes" computation in the weight sense, but the computation becomes part of a tight feedback loop that dramatically improves output quality.
The tool-calling overhead isn't really about process forking latency -- it's about the cognitive overhead of the model having no feedback signal until much later. Tighter loops, whether through approaches like this paper or through external pipelines with fast verification, are where the real wins are.
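The generate-execute-repair loop described above can be sketched in a few lines, with `model_generate` as a hypothetical stand-in for the model call and a deliberately buggy first draft (a real version would sandbox the `exec`):

```python
def model_generate(task, feedback):
    # Hypothetical model: first attempt is buggy; repair uses the feedback.
    if feedback is None:
        return "def add(a, b):\n    return a - b"  # buggy draft
    return "def add(a, b):\n    return a + b"      # "repaired"

def verify(src):
    ns = {}
    exec(src, ns)  # use a real sandbox in practice!
    try:
        assert ns["add"](2, 3) == 5  # the model's own test case
        return None
    except AssertionError:
        return "add(2, 3) != 5"

feedback, src = None, ""
for _ in range(3):  # tight feedback loop
    src = model_generate("add two numbers", feedback)
    feedback = verify(src)
    if feedback is None:
        break
print("passed" if feedback is None else "failed")
```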
This is neat, but to me it seems like the circuitous path to just skipping autoregression, whereas the direct path is simply not to do autoregression: get your answers from the one forward pass, and instead of backprop just do lookups and updates as the same operation.
I couldn't tell from the article whether this works as a language model or not. Can it read and write English or is it just a weird program interpreter? If it switches between modes, how do they interact?
This sounds so cool but I can’t tell if it’s a practical joke, even after sitting on it for 2-3 hours. Key points where I lose understanding/trust are when a WASM interpreter suddenly appears in the model, and when we’re representing code in weights.
It is unclear to me how this WASM interpreter is / could be deterministic.
The model isn't trained, it isn't differentiable (read carefully to the end: they say their model might still work if they made it differentiable, but they don't know), and it isn't clear IMO it could ever be made trainable (what is your loss function that scores a "partially correct" program / compiler, and how are you getting such training data?).
You need non-linearity in self-attention because it encodes feature / embedding similarities / correlations (e.g. self-attention is kernel smoothing) and/or multiplicative interactions, it has nothing to do with determinism/indeterminism. Also, LLMs are not really nondeterministic in any serious way, that all just comes from tweaks and optimizations that are not at all core to the architecture.
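The "self-attention is kernel smoothing" view mentioned here can be sketched in a few lines: softmax attention over (key, value) pairs is a Nadaraya-Watson estimator with an exponential kernel on query-key scores (toy scalars here; real attention uses dot products of vectors and a 1/sqrt(d_k) scale).

```python
import math

def nadaraya_watson(query, keys, values):
    # Exponential kernel weights = softmax numerators over query-key scores.
    weights = [math.exp(query * k) for k in keys]
    z = sum(weights)
    return sum(w / z * v for w, v in zip(weights, values))

keys = [0.0, 1.0, 2.0]
values = [10.0, 20.0, 30.0]
out = nadaraya_watson(1.9, keys, values)
print(out)  # pulled toward the value whose key best matches the query
```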
I initially agreed with a lot of the sentiment that asks "why," but have reframed my opinion. Instead of seeing this as a way to run programs via inference, I'm now seeing this as a way to bootstrap training. Think about the task of classification. If I have an expert system that classifies correctly 80% of the time, now I can embed it into a model and train the model to try to raise the success rate. The lower we can make the cost of training on various tasks, the better it levels the playing field of who can compete in the AI landscape.
The approach here is very bad for training though, because unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries, and if you try to fix that e.g. with straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
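For readers unfamiliar with straight-through estimation, here is a scalar sketch of the mismatch being described: the forward pass is the hard argmax attention, while the surrogate backward pass differentiates the softmax-weighted average instead (derivatives written by hand; no autograd framework assumed).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

scores = [2.0, 1.0]
values = [5.0, -5.0]

# Forward: hard (average-hard / argmax) attention -- picks one value.
hard_out = values[max(range(2), key=lambda i: scores[i])]

# Backward (STE): gradient w.r.t. scores as if softmax had been used.
# For soft_out = sum_j p_j v_j, d(soft_out)/d(s_i) = p_i * (v_i - soft_out).
p = softmax(scores)
soft_out = sum(pi * v for pi, v in zip(p, values))
grad_scores = [p[i] * (values[i] - soft_out) for i in range(2)]

print(hard_out)     # hard forward result
print(grad_scores)  # nonzero surrogate gradients for both scores
```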
YAY! This is exactly what I wanted as the final step of some agent batching prompts to sub-agents, but seeing it in action made me realize: wow, being able to talk to any program during runtime, including the OS, because an LLM is your CPU! What a concept!
Computing is going to be so weird in a few decades, writing programs faster than I can speak with full semantic introspection into every byte of code.
andy12_ | a day ago
Truly, attention is all you need (I guess).
mirekrusin | 11 hours ago
Hey, give it also access to a dump of its weights and a way to propose updates, so it can see and tinker with its brain directly.
bonoboTP | 11 hours ago
This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.

> This works, but the actual execution happened outside the model. The model specified the computation, then waited for an external system to carry it out.

> Our transformer also emits a program, but instead of pausing for an external tool, it executes that program itself, step by step, within the same transformer.
What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
Why is it good that it's "inside" the model? Just making it more elegant and nice? The tool was already "inside" the overall hybrid system. What's the actual problem?
famouswaffles | 11 hours ago
Not really sure what this obsession with calling things you don't like AI generated is but it's poor form. If you have something to say about the text then say it. Otherwise leave baseless accusations out of it.
>What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?....
It's pretty clearly an ideological thing. Some people are firmly on the 'some sort of symbolic logic is necessary' camp. From the article, 'A system that cannot compute cannot truly internalize what computation is.'
Some things are just interesting for the sake of it. This is one of those things. I don't agree with the authors on the above and I'm still glad they shared. It's a very interesting read regardless.
entropi | 10 hours ago
The idea itself was very cool, so I endured it. But it was not a pleasant read.
bonoboTP | 10 hours ago
I could point out the individual phrases and describe the overall impression in detail, or I can just compactly communicate that by using the phrase "AI". If it bothers you, read it as "AI-like", so there is a pretension.
I have no problem with using AI for writing. I do it too, especially for documentation. But you need to read it, iterate with it, and give it enough raw input context. If you don't give it info about your actual goals, intentions, judgments, etc., the AI will substitute washed-out, averaged-out, no-meat-on-the-bone fluff that may sound good at first read and give you a warm wow-effect that makes you hit publish. You then read into it all the context you have in your head, but readers don't have that context.
Formatting and language are cheap now. We need a new culture around calling out sloppy work. You would not have had a problem with calling out a badly composed rambling article 5 years ago. But today you can easily slap an AI filter on it that will make it look grammatical and feel narratively engaging; now it's all about the deeper content. But if one points that out, replies can always say "oh, you can't prove that, can you?"
famouswaffles | 10 hours ago
I just find phrases like this a bit obnoxious at times.
>You would not have had a problem with calling out a badly composed rambling article 5 years ago.
Then why not just say that? It's rambling, bla bla bla. What's so hard about that? Why invent a reason for the issues, as if rambling articles didn't get written 5 years ago?
Like, no: being written by an LLM or not is not the reason the article has no benchmarks or interpretability results. Those things would be there regardless, if the author were interested in them, so again, there seems little point in making such assertions.
bonoboTP | 9 hours ago
But anyway, yes, I can also just move on to the next article. Most of the time I indeed do that.
stingraycharles | 9 hours ago
The subtle ones like this I don’t mind too much, as long as they get the content correct, which in this case leaves quite a bit to be desired.
I’m also noticing that some people around me appear to just be oblivious to some LLM signals that bother me a lot, so people consume media differently.
I absolutely do believe that AI generated content needs to be called out, although at this point it’s safe to say that pretty much all online content is LLM written.
furyofantares | 10 hours ago
As to why people call this out without going into great detail about the problems with the actual text, it's because this is happening all over the place and it's very disrespectful to readers, who dig into an article that looks very well written on the surface, only to discover it's a lot of labor to decode and often (but not always) a total waste of time. Asking for a critical report of the text is asking even more of a reader who already feels duped.
soulofmischief | 5 hours ago
Admonishing someone for correctly identifying AI-written or AI-edited blog posts is poor form, friend.
It is without a doubt written by an LLM. All of the telltale signs are there. I work with these tools 8-20 hours a day and after a while the verbiage and grammatical structures stick out like a sore thumb.
Get off the high horse. I too think this is a very interesting read. I was fascinated with the subject, but the presentation was nauseatingly distracting and immediately sets off yellow flags about how Percepta operates, and what kind of quality they're willing to settle with. It tells me they are more interested in appearances and superficiality.
The numbers that are there categorically cannot be trusted, because hallucinating those details is quite common for models. There is simply no indication that a human adequately proof-read this and therefore any of its claims must be taken with a grain of salt. Don't forget the recent Cloudflare+Matrix debacle: https://news.ycombinator.com/item?id=46781516
I share the same concerns as OP; this post lacks metrics and feels like someone did something cool and raced to get an AI to post about it, instead of giving it a proper treatment.
famouswaffles | 4 hours ago
From my point of view, all you've done is said a lot of nonsense and fabricated a convoluted explanation for why you think the text is bad. I'm fine on my horse thanks.
soulofmischief | 4 hours ago
You claimed "this obsession with calling things you don't like AI generated" is "poor form", attacking the parent commenter by claiming they are lying about the nature of the content. However, multiple people have pointed out the clear signs which you missed, and the consensus is that you were wrong. Now you suddenly don't care about this point, and have introduced a new argument instead.
"From my point of view, all you've done is said a lot of nonsense and fabricated a convoluted explanation for why you think the text is bad"
What a bad-faith response. Categorically dismissive, vague, antagonistic and ultimately failing to critically engage with anything I said.
famouswaffles | 4 hours ago
I didn't miss anything. I never cared about it one way or another. What clear signs have people pointed out? This is the problem. It's apparently so obvious, yet even the original commenter admits "It's things humans do too". What is clear about that?
soulofmischief | 2 hours ago
All knowledge is ultimately fallible, but ignoring or not being able to appreciate the high statistical likelihood of this article being LLM edited/generated doesn't change reality.
You're asking me to share my expertise with you so that you can understand, but your antagonistic overtones make it not feel worth the time and effort. Other readers have also pointed out that it has characteristic idiosyncrasies. Feel free to look into it yourself, but it would also be wise to learn to defer these kinds of attacks until you have all the information.
D-Machine | 4 hours ago
Especially egregious to me is the claim "Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself". This is total weasel-language: we can propagate gradients through any transformer architecture and all sorts of much more insane architectural designs, but that is irrelevant if you don't have a continuous, differentiable loss function that can properly weight partially correct solutions or the likelihood / plausibility of arbitrary model outputs. You also need a clearer source of training data (or a way to generate synthetic data).
So for e.g. AlphaFold, we needed to figure out a loss function that continuously approximated the energy configuration of various molecular configurations, and this is what really allowed it to actually do something. Otherwise, you are stuck with slow and expensive reinforcement-based systems.
The other tells are garbage analogies ("Humans cannot fly. Building airplanes does not change that; it only means we built a machine that flies for us"). Such analogies add nothing to understanding, and indeed distract from serious/real understanding. Only dupes and fools think you can gain any meaningful understanding of mathematics and computer science through simplistic linguistic analogies and metaphors without learning the proper actual (visuospatial, logical, etc) models and understanding. Thus, people with real and serious mathematical understanding despise such trite metaphors.
But then, since understanding something like this properly requires serious mathematical understanding, copy like that is a huge tell that the authors / company / platform puts bullshitting and sales above truth and correctness. I.e., yes, a huge yellow flag.
armchairhacker | 9 hours ago
Like, you have a great point (the benefit of this approach isn't explained), but that's a mistake humans frequently make.
D-Machine | an hour ago
Cadence and rhythm: LLMs produce sentences with an extremely low variability in the number of clauses. Normal people run on from time to time, (bracket in lots of asides), or otherwise vary their cadence and rhythm within clauses more than LLMs tend to.
Section headings that are intended to be "cute" and "snappy" or "impactful" rather than technically correct or compact: this is especially a tell when the cuteness/impactfulness is deeply mismatched with the seriousness or technical depth of the subject matter.
Horrible trite analogies that show no actual real understanding of the actual logical, mathematical, or visuo-spatial relationships involved. I.e. analogies are based on linguistic semantics, and not e.g. mathematical isomorphism or core dynamics. "Humans cannot fly. Building airplanes does not change that; it only means we built a machine that flies for us". I can't imagine a more useless analogy for something as complex as the article's topic.
Verbose repetition: The article names two workarounds, "tool use" and "agentic" orchestration, defines them, and then in the paragraph immediately following says the exact same thing. There are multiple small paragraphs that say nothing more than the single sentence "LLMs do not reliably perform long, exact computations on their own, so in practice we often delegate the execution to external tools or orchestration systems".
Pseudo-profound bullshit: (https://doi.org/10.1017/S1930297500006999). E.g. "A system that cannot compute cannot truly internalize what computation is." There is thankfully not too much of this in the article, and it appears mostly early on.
Missing key / basic logic (or failing to mention such points clearly) when any serious practitioner or expert would strongly expect it. E.g. in this article, we should have seen some simple centered LaTeX showing the scaled dot-product self-attention equation, and then some simple notation representing the `.chunk` call and subsequent linear projection, something like H = [H1 | H2]; I shouldn't have to squint at two small lines of PyTorch code to find this. It should be clear immediately that this model is not trained and that this is essentially just compiling a VM into a transformer, not something revealed more clearly only at the end.
maytc | 9 hours ago
Before, it needed to write the code and have an external program execute it. Here it can change its mind mid-execution. Kinda like what was observed in CoT's "aha moment".
radarsat1 | 9 hours ago
> Is it that you can backprop through this computation? Do you do so?
With respect, I feel that you may not have read the article.
> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
and,
> By storing points across nested convex hulls, this yields a decoding cost of O(k+log n).
and,
> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
So yes, and yes.
> Where are the benchmarks?
Not clear what they should benchmark it against. They do compare speed to a normal KV cache. As for performance... if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find any model doing < 100%. Sure, it would be nice to see the data here; I agree with you there.
Personally I think it would be really interesting to see if this method can be combined with a normal model MoE-style. It is likely possible; the router module should pick up quite quickly that it predicts the right tokens for some subset of problems deterministically. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact it never would have occurred to me to just go straight for WASM, pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.
mike_hearn | 6 hours ago
The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.
I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is" which doesn't say anything to me.
Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it. But backprop is for cases where we don't know how to encode a function using strict logic; why would you try to backprop through a Sudoku solver? Probably my imagination is just limited, but I could have used more on that.
D-Machine | 4 hours ago
> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?
The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"
There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here.
radarsat1 | 2 hours ago
My understanding was that they are not training at all, which would explain that. They are compiling an interpreter down to a VM that has the shape of a transformer.
I.e. they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
D-Machine | an hour ago
EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and IMO outright misleading / deliberately deceptive stuff early on (e.g. about the potential differentiability of their approach, even though they admit later they aren't sure the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read it like a lawyer, but that's likely due to a lack of human editing and too much AI assistance.
MattPalmer1086 | 11 hours ago
Our brains can also simulate turing machines, slowly. We automated that with computers that are faster and more reliable. So why not allow a model to use external much faster and reliable tools, just as we do?
MattPalmer1086 | 7 hours ago
I'm not convinced at all that this is the best way to reduce latency; there are many other ways of doing that.
Having a calculator in our brains would be handy of course, but a gigahertz multi core computer is still going to be better at anything that needs to do a lot of computation and or a lot of data.
graemefawcett | 6 hours ago
It's a nice bit of engineering, if you don't subscribe to YAGNI. If you do, you must ask the obvious question of what capability this delivers that wasn't available before. The only answer I've got is that someone must have been a bit chilly and couldn't figure out the thermostat.
mobilejdral | 8 hours ago
1. The article presents calling out to a tool like Python as "expensive" because of the overhead of forking a process, loading up the Python env, etc. But why not just eliminate that overhead and embed WebAssembly so this "tool call" is near-zero cost? This feels very similar to the 90's discussions around the overhead of threads vs. processes, or kernel space vs. user space. You could even go further and keep a running BEAM VM so the LLM can write Elixir, which is ideal for LLMs that stream out code; Elixir programs will be a lot shorter than WebAssembly.
2. The core argument stated is "A system that cannot compute cannot truly internalize what computation is." The idea being that it could write a program, execute it, and by seeing all of the steps maybe even stop partway through and change its mind, or write new programs better, i.e. be able to debug on the fly?
3. Not mentioned, but there is a third X factor: that LLMs will use this newfound computation engine to do better at "thinking" overall, computing in very unexpected ways on unexpected problems. Maybe it would do dramatically better at some benchmark because of this?
Unfortunately these are not explored, and it remains just an execution engine, even ending with the conclusion that "arbitrary programs can be compiled directly into the transformer weights, bypassing the need to represent them as token sequences at all," which goes back to point 1: if we are compiling to weights anyway, why not just optimize the tool calling?
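For what it's worth, point 1 is easy to sanity-check: the cost being avoided is mostly interpreter startup, not the computation itself. A rough, machine-dependent sketch (the snippet and all numbers are illustrative only):

```python
import subprocess
import sys
import time

def time_subprocess_call(code: str) -> float:
    """Spawn a fresh Python interpreter to run `code` (the 'fork a process' path)."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start

def time_in_process(code: str) -> float:
    """Run the same code inside the already-loaded interpreter (the 'embedded runtime' path)."""
    start = time.perf_counter()
    exec(compile(code, "<tool>", "exec"), {})
    return time.perf_counter() - start

snippet = "x = sum(i * i for i in range(10_000))"
spawn = time_subprocess_call(snippet)
embedded = time_in_process(snippet)
# Interpreter startup dominates the subprocess path; the embedded path skips it entirely.
```

An embedded WASM runtime would sit close to the `time_in_process` case: the runtime is already resident, so the per-call cost collapses to dispatch.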
vuciuc | 4 hours ago
The way this is formulated, it almost sounds like they think that giving LLMs this ability will bring them closer to having experiences of computation or something? Weird?
D-Machine | 13 minutes ago
jadbox | 2 hours ago
hedgehog | an hour ago
deviation | 10 hours ago
Both examples are of a system we created to abstract most of the hard work.
I think a more important concept here is that the term "AI" has a lot of built-in assumptions, one of which is that it is (or will be) superintelligent, and so folks like the author here think (correctly) that it's important for the AI to actually be doing the work itself.
btown | 10 hours ago
What could you do with an LLM that can go into “focus mode” and generate tokens extremely rapidly? How much more powerful would a reasoning-token-generation phase be that can explore and cull large numbers of paths/hypotheses, so long as they are well defined? Does this have implications for multi-modal models and spatial reasoning?
As the paper suggests:
> These models could be useful in several modes: as a dedicated fast path paired with a slower, more general model; as part of a fast/slow hybrid architecture inside a single system; or as a speculative execution model that proposes tokens quickly while a regular-attention model verifies and accepts them. Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
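For intuition, the speculative-execution mode can be sketched in a few lines (a toy with stand-in "models" as plain functions, not the paper's method):

```python
def speculative_decode(target, draft, prompt, k=4, steps=8):
    """Toy greedy speculative decoding: `draft` proposes k tokens, `target`
    accepts the longest agreeing prefix, then contributes one token itself."""
    out = list(prompt)
    while len(out) < len(prompt) + steps:
        # draft phase: propose k tokens cheaply
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # verify phase: accept draft tokens while the (greedy) target agrees
        for t in proposal:
            if target(out) == t:
                out.append(t)
            else:
                break
        out.append(target(out))  # the target always adds one token per round
    return out

# Identical draft and target models: every proposal verifies, so tokens
# arrive in chunks of k+1 instead of one at a time.
next_digit = lambda ctx: (ctx[-1] + 1) % 10
out = speculative_decode(next_digit, next_digit, [0], k=4, steps=8)
```

A fast structured-attention model in the `draft` slot would give exactly the "proposes tokens quickly while a regular-attention model verifies" setup the paper describes.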
itigges22 | an hour ago
What's exciting about this paper is the possibility of that routing happening within a single forward pass rather than requiring external orchestration. Though honestly, even external orchestration with a good confidence scorer gets you most of the way there today.
yalok | 9 hours ago
hrmtst93837 | 8 hours ago
rebolek | 9 hours ago
But the right question is, should they?
RagnarD | 9 hours ago
plaidfuji | 8 hours ago
TedHerman | 8 hours ago
akshaysasi | 8 hours ago
j45 | 8 hours ago
ACCount37 | 7 hours ago
Shame there are no weights released - let alone the "compiler" tool they used to actually synthesize computational primitives into model weights. It seems like a "small model" system that's amenable to low budget experiments, and I would love to see what this approach can be pushed towards.
I disagree with the core premise, it's basically the old neurosymbolic garbage restated, but embedding predefined computational primitives into LLMs could have some uses nonetheless.
yorwba | 7 hours ago
ACCount37 | 7 hours ago
Which the blog post brings up as a research direction, but never actually elaborates upon. And the interface between the two is a hard problem.
I'll check out the link though, thanks.
YeGoblynQueenne | 5 hours ago
moktonar | 7 hours ago
manas96 | 7 hours ago
armchairhacker | 7 hours ago
Hugsun | 3 hours ago
If you just take the most probable token every time, the system becomes fully deterministic. We don't do this because the output becomes stiffer and less creative.
D-Machine | an hour ago
Temperature is not at all core to LLMs; it is something that makes the outputs more varied and generally more desirable for human consumption. It is trivial to set to zero for applications like this.
On CPUs, the models are essentially fully deterministic, even accounting for FP accuracy, and most common kernels have reproducible (albeit slower) variants even on GPUs. Otherwise, yes, FP non-associativity on GPUs is the only real source of randomness in inference.
The other issue arises from a lack of batch invariance, but that is a problem only at scale, when serving multiple users / when inputs themselves have some randomness. You can (usually) trivially eliminate this by controlling what goes into the batch or by making the batch size one. There are also other, cleverer mitigations for this, none of which are secrets.
EDIT - Forgot reference: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
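To make the temperature point concrete, here is a minimal sampler sketch (illustrative, not any particular library's implementation); at temperature zero it degenerates to argmax and is fully deterministic:

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Temperature-scaled sampling over raw logits; temperature == 0 is plain argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # deterministic
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # inverse-CDF sampling over the softmax distribution
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1
```

Every call at temperature zero returns the same index for the same logits; the variety people associate with LLM output lives entirely in the `temperature > 0` branch.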
ontouchstart | 7 hours ago
I am talking strictly about computing, not garbage in garbage out IO.
BenoitP | 6 hours ago
IMHO the key point at which this technique has an unfair advantage vs a traditional interpreter is here.
How disruptive is it to have differentiability? To me it would mean that some tweaking-around can happen in an LLM-program at train-time; like changing a constant, or switching from a function call to another function. Can we gradient-descent effectively inside this huge space? How different is it from tool-calling from a pool of learned programs (think github but for LLM programs written in classic languages)?
troelsSteegin | 6 hours ago
graemefawcett | 6 hours ago
Extended thinking passes are just more of the same. The entire methodology exists merely to provide additional context for the autoregression process. There is no traditional computation occurring.
sheepscreek | 6 hours ago
I don’t want to say too much too soon, but I am pretty excited about this.
EGreg | 6 hours ago
That having been said, many LLMs are run on SIMD GPUs, in warps; basically they are just doing a lot of vector multiplications, activation functions, and KV self-attention (the expensive step).
The issue is that we want LLMs to be one-way through the layers, whereas Turing-complete programming languages support loops with no well-defined stopping time. You can stick a simple computer into an LLM, but it won't be able to do long loops.
However, for these specific workloads, needing to attend only to the latest state is indeed a huge optimization! Gone is the need for the n^2 complexity that dominates the cost; now it is (log n)^2 attention, which is far smaller.
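The scaling gap is easy to eyeball with a toy per-token cost model (a rough illustration, not the paper's actual accounting):

```python
import math

def per_token_attention_ops(n: int, structured: bool) -> float:
    """Rough per-token cost: full causal attention scans all n prior positions,
    while the claimed structured-executor path does ~(log2 n)^2 retrieval/update steps."""
    return (math.log2(n) ** 2) if structured else float(n)

for n in (1_024, 65_536, 1_048_576):
    full = per_token_attention_ops(n, structured=False)
    fast = per_token_attention_ops(n, structured=True)
    print(f"n={n}: full={full:.0f}, structured={fast:.0f}, speedup={full / fast:.0f}x")
```

At a million tokens of context the toy model puts the gap around three orders of magnitude, which is why the polylog claim matters if it holds up.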
dwa3592 | 5 hours ago
They created a new training dataset which also includes solving the computation step by step (multiplying two numbers or playing sudoku) and then trained a transformer on it. As a result, the model performs the computation (multiplying two numbers) "inside" itself instead of calling a calculator (or Python)?
++ And they also figured out how to make attention faster?
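If that reading is right, the training data would look something like step-by-step execution traces; a hypothetical generator for long multiplication (my guess at the format, not the article's actual dataset):

```python
def multiplication_trace(a: int, b: int) -> list[str]:
    """Emit a long-multiplication problem as a step-by-step trace:
    one partial product per digit of b, then the final sum."""
    steps = [f"{a} * {b} ="]
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place
        total += partial
        steps.append(f"{a} * {digit} * 10^{place} = {partial}")
    steps.append(f"sum = {total}")
    return steps
```

Training on sequences like this supervises the intermediate states, not just the final answer, which is what would let the model "perform" the computation token by token rather than memorize input-output pairs.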
YeGoblynQueenne | 5 hours ago
dwa3592 | 4 hours ago
I also get a bit of a bad smell from the article: it sounds revolutionary but offers no details or clear explanation.
D-Machine | an hour ago
But IMO this is BS because I don't know how one would get or generate training data, or how one would define a continuous loss function that scores partially-correct / plausible outputs (e.g. is a "partially correct" program / algorithm / code even coherent, conceptually).
jamilton | 4 hours ago
Felixbot | 5 hours ago
YeGoblynQueenne | 5 hours ago
Very light on details that article is.
MadnessASAP | 3 hours ago
gavinray | an hour ago
D-Machine | 40 minutes ago
bee_rider | 5 hours ago
D-Machine | 7 minutes ago
clarionbell | 5 hours ago
dnautics | 4 hours ago
> The key technical unlock is to restrict lookup heads to head dimension 2, which enables a decoding path where the dominant retrieval/update operations can be computed in log time in the sequence length (for this structured executor regime), rather than by a full prefix-sized attention sweep.
edit: i understand how hullkv works now. very clever.
I don't understand why this strategy is applicable only to "code tokens".
Lastly, I'm not sure why WASM is a good target; IIRC WASM seems to be really inefficient (not so much in code as in expressivity). I wonder if that curtails the LLM's ability to plan higher-order stuff (since it's always forced to think in the small).
D-Machine | 4 hours ago
Yes, there is a monstrous lack of detail here, and you should be skeptical of most of the article's claims. The language is also IMO non-standard (serious people don't talk about self-attention as lookup tables anymore; that was never a good analogy in the first place), and no good work would express this in language alone: there would also be a simple equation showing the typical scaled dot-product attention formula, and then e.g. some dimension notation/details indicating which matrix (or inserted projection matrix) got a dimension of two somewhere. Otherwise, the claims are inscrutable (EDIT: see edit below).
There are also no training details or loss function details, both of which would be necessary (and almost certainly highly novel) to make this kind of thing end-to-end trainable, which is another red flag.
EDIT: The key line seems to be around:
and related code, plus the line "Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head", but, again, this is very unclear and non-standard.
dnautics | 3 hours ago
Good analogy otherwise; weren't hash tables the motivation for the KV tables?
D-Machine | 2 hours ago
When you really get into this stuff, you tend to see the real motivations as either e.g. kernel smoothing (see comments / discussion at https://news.ycombinator.com/item?id=46357675#46359160) or as encoding correlations / feature similarities / multiplicative interactions (see e.g. broad discussion at https://news.ycombinator.com/item?id=46523887). IMO most insights into LLM architectures and layers tend to come from intuitions about projections, manifolds, dimensionality, smoothing/regularization, overparameterization, matrix conditioning, manifold curvature, etc.
There is almost zero useful understanding or insight to be gained from the lookup-table analogy, and most statistical explanations in papers are also post-hoc and require assumptions (convergence rates, infinite layers, etc.) that are never shown to actually hold for the models people use. Obviously these AI models work very well for a lot of tasks, but our understanding of why they do is incredibly poor and simplistic, for the most part.
Of course, this is just IMO, and you can see some people in the linked threads do seem to find the lookup table analogies useful. I doubt such people have spent much time building novel architectures, experimenting with different layers, or training such models.
yorwba | 2 hours ago
D-Machine | an hour ago
It is hard to square with the article's claims about differentiability, and with its general lack of clarity / obscurantism about what they are really doing here (they really are just compiling / encoding a simple computer / VM into a slightly-modified transformer, which, while cool, is not at all what they make it sound like).
teiferer | 4 hours ago
If we take the human brain as an example, it's pretty bad at computation. Multiplying two 10-digit numbers takes forever, despite the enormous size of its neural network. It's not the right tool for the job: a few deterministic logic gates could do that much more efficiently. That same circuit can't do much else, but multiplying, oh boy, it's good at that! Why do we think that artificial neural nets would be the right tool for that job? What's wrong with letting the LLM reach out to an ALU to do the calculation, just like a human would? It's surely going to be quicker and require less energy.
soerxpso | 3 hours ago
pegasus | 3 hours ago
If we never try, we'll never know. I wouldn't be surprised if there is something to gain from a form of deterministic computation which is still integrated with the NN architecture. After all, tool calls have their own non-trivial overhead.
teiferer | 3 hours ago
I'm asking whether it's a desirable end state.
OneDeuxTriSeiGo | 3 hours ago
Not necessarily pure number crunching, but the boundary between rote algorithms and the fuzzy, intuition-based models that humans in particular excel at.
itigges22 | an hour ago
The interesting middle ground -- which I think is more practical near-term -- is building verification and repair loops around a frozen model. Let the model generate code, then execute it in a sandbox, then feed failures back into the model for self-repair using its own generated test cases. The model never "internalizes" computation in the weight sense, but the computation becomes part of a tight feedback loop that dramatically improves output quality.
The tool-calling overhead isn't really about process forking latency -- it's about the cognitive overhead of the model having no feedback signal until much later. Tighter loops, whether through approaches like this paper or through external pipelines with fast verification, are where the real wins are.
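The loop itself is almost trivial to write down; a sketch with stub stand-ins for the model and sandbox (`fake_model` and `run_tests` are hypothetical placeholders for a real model call and a real sandboxed runner):

```python
def repair_loop(generate, run_tests, max_attempts=3):
    """Generate code, execute it against tests, and feed failures back for repair.
    `generate(feedback)` stands in for a model call; `run_tests` for a sandbox."""
    feedback = None
    for attempt in range(max_attempts):
        code = generate(feedback)
        ok, feedback = run_tests(code)
        if ok:
            return code, attempt + 1
    return None, max_attempts

def fake_model(feedback):
    # Stand-in for an LLM: the first draft is off by one; it is "repaired"
    # once failure feedback arrives.
    if feedback:
        return "def double(x): return 2 * x"
    return "def double(x): return 2 * x + 1"

def run_tests(code):
    # Stand-in for a sandboxed executor running the model's own test cases.
    ns = {}
    exec(code, ns)
    if ns["double"](3) == 6:
        return True, None
    return False, "double(3) should be 6"
```

The entire "tight loop" argument is about how quickly `feedback` gets back into `generate`, whether that round trip happens inside the weights or in an external pipeline.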
hashmap | 4 hours ago
skybrian | 4 hours ago
D-Machine | 3 hours ago
I don't see how this could work as an LLM given that, but the article is missing a huge amount of other crucial details too.
refulgentis | 4 hours ago
It is unclear to me how this WASM interpreter is / could be deterministic.
SPascareli13 | 3 hours ago
Also, if its execution is purely deterministic, you probably don't need non-linearity in the layers, right?
D-Machine | 43 minutes ago
You need non-linearity in self-attention because it encodes feature / embedding similarities / correlations (e.g. self-attention is kernel smoothing) and/or multiplicative interactions; it has nothing to do with determinism vs. nondeterminism. Also, LLMs are not really nondeterministic in any serious way; that all comes from tweaks and optimizations that are not at all core to the architecture.
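To see where the non-linearity sits, here is unprojected scaled dot-product self-attention in a few lines of NumPy (a stripped-down sketch without learned Q/K/V projections); the softmax is the kernel-smoothing step:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over the rows of X, with no learned
    projections: the softmax turns similarities into a normalized kernel, so
    each output row is a weighted (smoothed) average of the input rows."""
    d = X.shape[-1]
    scores = (X @ X.T) / np.sqrt(d)                  # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: the non-linearity
    return weights @ X                               # kernel-smoothed values
```

Everything here is deterministic; the softmax is needed not for randomness but so that similar rows get up-weighted multiplicatively, which a purely linear layer cannot do.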
derangedHorse | 3 hours ago
yorwba | 2 hours ago
refulgentis | 2 hours ago
casey2 | 51 minutes ago
Computing is going to be so weird in a few decades, writing programs faster than I can speak with full semantic introspection into every byte of code.