hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences:
1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky changes that will get lower loss, but when compared to general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods from BabyLM!
2. our reasoning behind the repo is not anything to do with how much data a child sees. and our dataset is not tailored towards that either. it's simple pretraining on a random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.
Very cool idea. Interested to see how this progresses.
One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.
yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb or even entirely OOD datasets!
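To make the selection effect in this exchange concrete, here is a toy simulation. It is not from the Slowrun repo: the loss value, the noise level, and the number of candidates are all made-up assumptions. Every candidate has the same true loss, so any apparent winner on the validation set is pure noise, and re-scoring it on a fresh subset removes that optimism.

```python
import random

random.seed(0)

# All three constants are hypothetical, chosen only for illustration.
TRUE_LOSS = 3.30     # identical true loss for every candidate method
NOISE_STD = 0.01     # noise from evaluating on a finite validation set
N_CANDIDATES = 200   # number of methods tried against the same validation set


def noisy_eval(true_loss):
    """Simulate one evaluation on a finite validation set."""
    return true_loss + random.gauss(0.0, NOISE_STD)


# Score every candidate on the same (simulated) validation set and keep the best.
val_scores = [noisy_eval(TRUE_LOSS) for _ in range(N_CANDIDATES)]
best_idx = min(range(N_CANDIDATES), key=lambda i: val_scores[i])
selected_val_loss = val_scores[best_idx]

# Re-score the winner on a fresh subset: the selection bias does not carry over.
fresh_val_loss = noisy_eval(TRUE_LOSS)

# How much the selected score flatters the winner relative to its true loss.
optimism = TRUE_LOSS - selected_val_loss

print(f"selected on reused val set: {selected_val_loss:.4f}")
print(f"same candidate, fresh set:  {fresh_val_loss:.4f}")
print(f"optimism from selection:    {optimism:.4f}")
```

The optimism grows slowly with the number of candidates tried, which fits the point that each submission extracts only a little information from the validation set.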
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.
If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
Modded-nanogpt is also much more data efficient than vanilla nanogpt, even if some of the individual optimizations trade off higher throughput for worse data efficiency.
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. just that the kinds of algorithms it allows are somewhat constrained because it optimizes for wall-clock time.
There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm[0]. Pretty much exactly the same thing but with ~200M training tokens instead.
I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
diffusion is promising, but it's still an open question how much more data efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.
This looks awesome!!! I’m curious on the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but that also seems wrong, because I assume we could just keep increasing the number of different models you train to get a win.
It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.
Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.
Reducing computation vs. increasing it; operating at per-layer granularity vs. whole model; specialization vs. redundancy.
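A minimal sketch of the ensembling described above, assuming the copies are combined by averaging their next-token probabilities (the thread does not say whether Slowrun averages probabilities or logits). The three distributions are made-up numbers over a toy 4-token vocabulary. By Jensen's inequality, the loss of the averaged distribution is never worse than the average of the individual losses, which is why every copy contributes instead of one being picked.

```python
import math


def ensemble_probs(per_model_probs):
    """Average the next-token distributions of several models."""
    n_models = len(per_model_probs)
    vocab_size = len(per_model_probs[0])
    return [
        sum(probs[i] for probs in per_model_probs) / n_models
        for i in range(vocab_size)
    ]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])


# Hypothetical outputs of three independently trained copies (illustration only).
model_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
    [0.55, 0.05, 0.30, 0.10],
]
target = 0  # index of the correct next token

individual_losses = [cross_entropy(p, target) for p in model_probs]
mean_individual_loss = sum(individual_losses) / len(individual_losses)
ensemble_loss = cross_entropy(ensemble_probs(model_probs), target)

print(f"mean individual loss: {mean_individual_loss:.4f}")
print(f"ensemble loss:        {ensemble_loss:.4f}")
```

Note that on a single example the best individual copy can still beat the ensemble; the guarantee is only relative to the average copy, and the practical gain comes from the copies making different mistakes.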
Maybe some newer references are better, but my mind went to the Model Soups paper[1]:
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups."
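A rough sketch of the distinction the abstract draws, using a toy one-neuron model rather than anything from the paper. The parameter values and the `predict`, `make_soup`, and `ensemble_predict` helpers are hypothetical. A soup averages the weights once and then runs a single model; an ensemble keeps every model and averages their outputs.

```python
import math


def predict(params, x):
    """Toy nonlinear model: tanh(w * x + b)."""
    w, b = params
    return math.tanh(w * x + b)


def make_soup(param_sets):
    """Average the parameters element-wise across models (one model results)."""
    n = len(param_sets)
    n_params = len(param_sets[0])
    return tuple(sum(p[i] for p in param_sets) / n for i in range(n_params))


def ensemble_predict(param_sets, x):
    """Average the outputs: one forward pass per model at inference time."""
    return sum(predict(p, x) for p in param_sets) / len(param_sets)


# Hypothetical (w, b) pairs standing in for fine-tuned variants of one model.
fine_tuned = [(1.0, 0.1), (1.2, -0.1), (0.8, 0.0)]

soup = make_soup(fine_tuned)
x = 0.5
soup_out = predict(soup, x)               # 1 forward pass
ens_out = ensemble_predict(fine_tuned, x)  # 3 forward passes

print(f"soup params:     {soup}")
print(f"soup output:     {soup_out:.5f}")
print(f"ensemble output: {ens_out:.5f}")
```

The two outputs are close but not identical because the model is nonlinear. The paper's claim concerns fine-tuned models that share a low-error basin; independently pretrained models, as in the ensembling discussed in this thread, are not covered by that claim.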
> Second-order optimizers and natural gradient methods
Do second order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be absolutely true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
Fundamentally I don't believe second-order methods get better data efficiency by themselves, but changes to the optimizer can because the convergence behavior changes. ML theory lags behind the results in practice.
I think there will be good headway in using the part-trained model to generate itself more training data in the form of making itself tasks, completing those tasks with many different approaches, evaluating which solution is best (using the same LLM as judge), and then differentially training on the best solutions vs the worst ones.
The challenge is that such an approach almost certainly requires a model with RLHF post-training, but this needs to be done in the pre-training phase. But with infinite compute, this isn't an issue - you simply do the post-training many times.
Very interesting benchmark, excited to see what comes out of this. Considering humans are enormously more sample efficient compared to today's models, it seems clear there's a lot of room to close that gap. The fact that they hit 5.5x in the first week with relatively straightforward changes suggests we're nowhere near the ceiling for data efficiency.
suddenlybananas | a day ago
[OP] sdpmas | a day ago
soraki_soladead | a day ago
Mumps | 7 hours ago
> Directions we think are wide open ... Curriculum learning
BabyLM and its offshoots published a pretty convincing body of work on exactly that (which suggests it's not particularly relevant to LM training).
As I read your page, I really felt like the brevity-thoroughness tradeoff went the wrong way.
archermarks | a day ago
[OP] sdpmas | a day ago
lzaborowski | a day ago
navvyeanand | a day ago
kseniamorph | a day ago
timshel1 | a day ago
[OP] sdpmas | 23 hours ago
linolevan | 23 hours ago
[0] https://www.alphaxiv.org/abs/2509.14786
[OP] sdpmas | 23 hours ago
_0ffh | 21 hours ago
[OP] sdpmas | 21 hours ago
_0ffh | 21 hours ago
Still, just for reference, here's the paper I remembered: https://arxiv.org/pdf/2507.15857
[OP] sdpmas | 21 hours ago
refulgentis | 22 hours ago
[OP] sdpmas | 21 hours ago
jiggawatts | 14 hours ago
yorwba | 13 hours ago
magicalhippo | 7 hours ago
[1]: https://arxiv.org/abs/2203.05482
bee_rider | 22 hours ago
[OP] sdpmas | 21 hours ago
alyxya | 20 hours ago
vladf | 19 hours ago
https://arxiv.org/abs/2006.10732
The above provides a nuanced theoretical view. GD's inductive bias is probably better unless your model is misspecified.
rcarmo | 14 hours ago
londons_explore | 14 hours ago
jbergqvist | 13 hours ago
[OP] sdpmas | 13 hours ago