thesnarky1 | 20 hours ago
This is a fascinating read about how the models are structured. Even if you are tired of all the vibecoding articles lately, this one is correctly tagged as AI, because it gets much deeper into how things work and what structural changes to a model ended up doing to it.
kornel | 19 hours ago
It bothers me that the Transformer architecture spends the same amount of compute on a yes/no answer to an arbitrarily complex riddle as it does on "The " in any other message.
It's fascinating that layers can be looped. Perhaps the next step would be to have a model dynamically select the number of loops, or choose to skip groups of layers MoE-style?
vpr | 17 hours ago
Inference-time compute (chain-of-thought) lets models dynamically spend more compute on harder problems. There's an active area of research around whether making them write down their "thoughts" is actually useful (i.e. it potentially recompresses a rich representation down into token space). With Coconut, Meta looked at feeding the last hidden layer back into the first layers of the model, allowing the model to "think" in latent space rather than in words.
kornel | 4 hours ago
CoT is orthogonal to this, and it doesn't remove the inference bottleneck (inference is memory-bound by a ridiculous factor).
Using a smaller set of layers in a loop could make them cacheable, and skipping layers conditionally could save memory bandwidth on easy tokens: https://arxiv.org/html/2507.10524v1
gnyeki | 15 hours ago
Why hasn’t this approach become more widespread?
vpr | 14 hours ago
I suspect a few reasons:
- a fair bit of optimization for inference and training relies on token CoT
- loss of interpretability, and potentially lower steerability
- token CoT is still improving and has proven scaling dynamics; latent-space reasoning has not been benchmarked at frontier scale
- it's not necessarily clear how to use RL to reward better thinking traces
- with tool calls and reasoning that "interacts" with external sources, the benefits of latent space might be a lot smaller: part of the intuition for why latent space might be better is that it allows "parallel" exploration of various thought paths without collapsing to the most likely token; however, you cannot continuously call a tool, and multiple parallel (but discrete) thought paths are already an active area of research (cf. Tree of Thoughts, 2023)
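A toy sketch of the distinction vpr is drawing, with a made-up miniature "model" (none of these names or shapes come from Coconut itself): token CoT collapses the hidden state to one discrete token and re-embeds it each step, discarding everything else, while latent-space feedback passes the full hidden vector straight back in.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5                      # hidden size, vocabulary size
W = rng.normal(size=(d, d)) * 0.1    # toy stand-in for the transformer stack
U = rng.normal(size=(vocab, d))      # unembedding: hidden -> token logits
E = rng.normal(size=(vocab, d))      # embedding:   token  -> hidden

def step(h):
    """One toy forward pass through the stack."""
    return np.tanh(W @ h)

def token_cot(h, n):
    # Collapse to the argmax token and re-embed every step:
    # all information beyond the chosen token is thrown away.
    for _ in range(n):
        tok = int(np.argmax(U @ step(h)))
        h = E[tok]
    return h

def latent_cot(h, n):
    # Coconut-style: feed the full hidden state back in, no collapse.
    for _ in range(n):
        h = step(h)
    return h

h0 = rng.normal(size=d)
print(token_cot(h0, 3).shape, latent_cot(h0, 3).shape)  # both (8,)
```

After `token_cot`, the state is always exactly one of the `vocab` embedding rows; after `latent_cot` it can be any point in the hidden space, which is the "parallel exploration" intuition in miniature.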
kornel | 4 hours ago
My guess is that going through human-language tokens isn't lossy enough to be a problem, and the models can recover the fuzzy latent space meaning with a few layers (as demonstrated by conversations in base64), so the gain from using latent space directly is incremental.
We don't have any training data in that latent space, and it would be freaky if LLMs developed a "neuralish" language we don't understand.
Ameo | 15 hours ago
I'd guess you could get even better results by including this layer duplication at training time. Let the model optimize what params/circuits it puts in this explicitly repeated section. Hard to imagine that wouldn't be even more effective.
Seems to somewhat correspond to the way that thought tokens are used in modern models. I remember reading somewhere that a lot of the benefit from those might just be that they give the model more trips through the transformer to process advanced concepts before generating its output while keeping param count constant. This is a pretty compelling idea in that it can make that extra compute power more explicit.
It seems like a simple idea in retrospect, but a lot of things in AI tend to be, I guess. I wouldn't be surprised if this or something similar was already in use at some of the big labs, though.
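Ameo's suggestion of baking the repetition in at training time is essentially weight tying: one block's parameters are reused for several passes, so effective depth grows while the parameter count stays fixed. A minimal sketch (toy NumPy block, hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# One shared block: these parameters exist once, however often we loop.
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

def shared_block(h):
    # Residual MLP-ish block standing in for a transformer layer.
    return h + W2 @ np.tanh(W1 @ h)

def forward(h, loops):
    for _ in range(loops):           # effective depth = loops
        h = shared_block(h)
    return h

h = rng.normal(size=d)
params = W1.size + W2.size           # 512, constant regardless of `loops`
print(forward(h, 1).shape, forward(h, 8).shape, params)
```

If the model is trained with the loop in place, the optimizer can put whatever iterative circuit it finds useful into `shared_block`, which is Ameo's point.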
mordae | 4 hours ago
I think first training with sparse looping, then progressively looping more and more of the chunk would improve trainability.
Also, initially loop it once, then add more iterations as long as they prove beneficial, all while training.
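mordae's curriculum (start with one iteration, grow the loop count during training) can be written as a simple schedule; the step counts and cap here are invented for illustration:

```python
def loop_count(step, warmup=1000, max_loops=8):
    """Start at 1 loop, add one more every `warmup` training steps, cap at max.
    A real version would only add a loop while validation loss keeps improving."""
    return min(1 + step // warmup, max_loops)

assert loop_count(0) == 1
assert loop_count(2500) == 3
assert loop_count(10**6) == 8
```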
But plausibly one could go the other way, too. Try to figure out exactly what circuits were involved and add them as hardened units.
I still think that in the upcoming decades we will see more and more of this, even hardening proven nets as semi-analog circuits, especially since we see quantization working out decently. Once we stop streaming weights and they just sit there as unclocked transistors, the inference speeds will be incredible.
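The quantization point is easy to see in miniature: weights survive being squeezed to 8-bit integers with only a small, bounded error, which is part of why fixed low-precision circuits seem plausible. A toy symmetric int8 round-trip (not any production scheme):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=(64, 64)).astype(np.float32)

scale = np.abs(w).max() / 127.0          # symmetric per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale     # dequantize

err = np.abs(w - w_hat).max()
print(err <= scale / 2 + 1e-6)           # True: error is at most half a quantization step
```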
th0ma5 | 15 hours ago
I personally don't think so. The anatomy described here, like previous studies that try to get at truthiness or the structure of intermediate layers, rests on a fragile signal. There are many unfalsifiable conjectures, communicated with an exuberance that makes it hard to tell what is inferred versus what is actually measured.
skycam | 18 hours ago
I'm not particularly familiar with how transformer models are actually structured beyond what the article described, but I can't help but wonder if it would be possible, by a similar method, to identify layers that all serve the same or similar purposes and replace all but one of them with pointers. It seems that this could reduce memory requirements pretty dramatically, depending on the number of "redundant" layers. I know pruning layers is already relatively common. This feels similar, though maybe without some of the potential performance sacrifices.
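skycam's "pointers" idea amounts to weight sharing after the fact: measure how similar two layers' weights are and, when nearly identical, make both slots reference one tensor so memory is paid once. A hypothetical sketch of the bookkeeping (the layers, threshold, and similarity measure are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=(32, 32))
layers = [base,
          base + rng.normal(size=(32, 32)) * 1e-4,  # near-duplicate of base
          rng.normal(size=(32, 32))]                # genuinely different

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Replace any layer that is ~identical to an earlier one with a reference
# (a "pointer") to that earlier tensor.
deduped = []
for w in layers:
    match = next((u for u in deduped if cosine(u, w) > 0.999), None)
    deduped.append(match if match is not None else w)

unique = {id(w) for w in deduped}
print(len(layers), len(unique))   # 3 layer slots, 2 unique tensors
```

In a real model you would also need to check that the tied layers behave the same on activations, not just that their weights look alike.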
kornel | 3 hours ago
It sounds similar to the Mixture of Experts architecture, but with experts stacked serially instead of in parallel.
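One way to picture "experts stacked serially": a gate decides, per input, which blocks in the stack to apply and which to skip, so easy inputs use fewer blocks. Everything here (the gate, blocks, and threshold) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_blocks = 8, 4
blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_blocks)]
gates = rng.normal(size=(n_blocks, d))   # one scoring vector per block

def forward(h, threshold=0.0):
    used = 0
    for g, W in zip(gates, blocks):
        if g @ h > threshold:            # gate: apply this block or skip it
            h = h + np.tanh(W @ h)       # residual makes skipping safe
            used += 1
    return h, used

h = rng.normal(size=d)
out, used = forward(h)
print(out.shape, used)                   # `used` is input-dependent, between 0 and 4
```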
Corbin | 12 hours ago
Any gene that can be duplicated once, can be duplicated many times. What's the marginal improvement from one more duplication?
TyberiusPrime | 7 hours ago
Depends on the selection pressure. Elephants have ~40 copies of p53, the "guardian of the genome", versus the two that Homo sapiens has.