> 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.
Since it seems to just produce broken and nonsensical sentences (at least based on the one example given) I'm not sure if it does work at this scale.
Anyway, as written this passage doesn't really make a whole lot of sense (the point is that it produces broken sentences?), and given that it was almost certainly written by an AI, it demonstrates that the architecture doesn't work especially well at any scale (I kid, I kid).
The Transformer is a more powerful model than a Markov chain, but on a machine as weak as the C64 a Markov chain could output text faster. It would surely sound "psychedelic", though: memory limits a Markov chain to a first- or second-order model, so at most the two preceding words are taken into account as context when predicting the next one (and there is no attention).
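A second-order word model of the kind described fits in a few lines. This is a plain-Python sketch (not C64 code) that shows why the output sounds "psychedelic": each word is drawn from counts conditioned only on the previous two words.

```python
import random
from collections import defaultdict

def train(words):
    # Map each two-word context to the words that followed it in the corpus.
    table = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        table[(a, b)].append(c)
    return table

def generate(table, seed, n=20):
    a, b = seed
    out = [a, b]
    for _ in range(n):
        nxt = table.get((a, b))
        if not nxt:
            break
        # No attention: only the two most recent words matter.
        a, b = b, random.choice(nxt)
        out.append(b)
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
table = train(corpus)
print(generate(table, ("the", "cat")))
```

With a real corpus the table itself becomes the memory problem, which is exactly the constraint that caps the order on a C64.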
On a plain vanilla C64, the Transformer cannot really show what it's capable of. An implementation using 2 bits per weight (vectorized) might do slightly better, perhaps.
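Packing four 2-bit weights into each byte would cut the memory footprint fourfold. A rough sketch of the pack/unpack step (the four-level codebook here is made up for illustration, not taken from the project):

```python
# Quantize each weight to one of four levels and pack four per byte.
LEVELS = [-1.0, -0.33, 0.33, 1.0]  # hypothetical 2-bit codebook

def quantize(w):
    # Index of the nearest codebook level.
    return min(range(4), key=lambda i: abs(LEVELS[i] - w))

def pack(weights):
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= quantize(w) << (2 * j)  # four 2-bit fields per byte
        out.append(b)
    return bytes(out)

def unpack(packed):
    return [LEVELS[(b >> (2 * j)) & 3] for b in packed for j in range(4)]

ws = [0.9, -0.8, 0.1, -0.2]
print(unpack(pack(ws)))  # -> [1.0, -1.0, 0.33, -0.33]
```

On the 6502 the unpack would be shifts and masks per weight, so the trade is memory for extra cycles in the inner loop.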
You can build an unlimited-order Markov chain by using a substring-search index on the training data to count possible continuations on the fly, instead of pre-computing a table of counts for all possible contexts: https://arxiv.org/abs/2401.17377 That paper uses suffix arrays, but more compact indices are possible: https://arxiv.org/abs/2506.12229
I'm not sure what the Venn diagram of the knowledge needed to understand what that sentence is suggesting looks like; the intersection is probably more crowded than one might think.
Believe me, using the 1541 as co-processor and extra storage was super tempting and on my mind all the time! So what do you think? Flash attention with K on the front side and V on the backside? :)
That's a good idea because, although I love this, 1 minute per token is absolutely savage. Whereas if you can juice the performance you're into semi-credible Jar Jar Binks simulator territory.
It does also make me wonder what you could do with somewhat more powerful retro hardware. I'd love to see what a transformer running on a PSX or an N64 could do.
This would have blown me away back in the late 80s/early 90s.
(Or maybe not, if it doesn't perform better than random, I haven't actually tried it out yet. Some more examples would have been nice!)
I wonder how far you could push this while still staying period correct, e.g. by adding a REU (RAM Expansion Unit), or even a GeoRAM (basically a REU on steroids).
SuperCPU would also be an option, but for me it's always blurring the line of "what is a C64" a bit too much, and it likely just makes it faster anyway.
Joseph Weizenbaum's ELIZA was rule-based and ran on even slower (1960s) hardware, but because it relied on simple pattern matching instead of neural nets, it would easily have been more responsive (the Emacs editor/operating system ships with an implementation; start it with: M-x doctor RETURN).
ELIZA was not written in assembler, but (different versions) in COMIT, FORTRAN and LISP.
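The core of ELIZA's trick is keyword spotting plus template reflection, which is why it stays responsive on slow hardware. A toy version of DOCTOR-style rules (the rules here are invented for illustration, not from Weizenbaum's script):

```python
import re

# Toy DOCTOR-style rules: (pattern, response template).
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
]

def respond(text):
    normalized = text.lower().rstrip(".!?")
    for pattern, template in RULES:
        m = re.match(pattern, normalized)
        if m:
            # Reflect the matched fragment back into the template.
            return template.format(*m.groups())
    return "Please go on."  # default when no keyword matches

print(respond("I am lonely"))   # How long have you been lonely?
print(respond("The weather."))  # Please go on.
```

Each turn is a handful of string scans, no matrix math at all, which is the whole performance difference versus the transformer.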
ELIZA is better, because this doesn't seem to generate anything coherent. You can try the original ELIZA with the DOCTOR script here: https://anthay.github.io/eliza.html
It (v3) mostly only says hello and bye, but I guess for 25k parameters you can't complain. (I think the rather exuberant copy is probably the product of Claude et al.)
Just reminded me of the random sentence generator program on my Vic-20. I had changed most of the words to all the bad words a preteen could think up. So many laughs with the neighborhood kids.
Interesting, I’ve always thought neural network progress was primarily bottlenecked by compute.
If it turns out that LLM-like models can produce genuinely useful outputs on something as constrained as a Commodore 64—or even more convincingly, if someone manages to train a capable model within the limits of hardware from that era—it would suggest we may have left a lot of progress on the table. Not just in terms of efficiency, but in how we framed the problem space for decades.
A human would use a proper dispatch table and wouldn't make excuses for a sloppy implementation ("Python is fast enough").
Besides, the author has an art and design background, which doesn't seem to match the deep knowledge of Transformers or assembly required for such a project.
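For reference, the "proper dispatch table" for an opcode interpreter is just an opcode-indexed array of handlers, so decode is a single lookup instead of a chain of comparisons. A minimal Python sketch with a made-up subset of 6502 opcodes (not the project's code):

```python
# Minimal dispatch-table sketch for a toy 6502-style interpreter.
class CPU:
    def __init__(self):
        self.a = 0  # accumulator
        self.x = 0  # X register

def op_lda_imm(cpu, operand):  # 0xA9: LDA #imm
    cpu.a = operand

def op_ldx_imm(cpu, operand):  # 0xA2: LDX #imm
    cpu.x = operand

def op_txa(cpu, operand):      # 0x8A: TXA
    cpu.a = cpu.x

def op_illegal(cpu, operand):
    raise ValueError("illegal opcode")

# 256-entry table: decode is one indexed load, no if/elif chain.
DISPATCH = [op_illegal] * 256
DISPATCH[0xA9] = op_lda_imm
DISPATCH[0xA2] = op_ldx_imm
DISPATCH[0x8A] = op_txa

def step(cpu, opcode, operand=0):
    DISPATCH[opcode](cpu, operand)

cpu = CPU()
step(cpu, 0xA2, 7)  # LDX #7
step(cpu, 0x8A)     # TXA
print(cpu.a)        # 7
```

The same idea in 6502 assembly would be a 256-entry jump table indexed by the opcode byte.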
I am the subject of your investigation. So, in your world meta-programming is a bad thing? Fine. In my world it isn't. The transition layers are how I kept four implementations bit-identical through the test suite. If you prefer to hand-roll this toward the goal, that's your decision, your life.
And yeah, I use AI where it makes sense. Architecture decisions are still mine. For the record: I'm from Farbrausch, so you are technically correct! The demoscene did become a UNESCO intangible cultural heritage a few years back, I guess that makes me an artist, FINALLY! :)
Maybe impressive in one way, but I'm also pretty sure a simple n-gram Markov model (a la Niall on the Amiga) would have a lower loss on the test set.
Transformers don't scale down very well, in my experience - I used to train local models all the time as new ones were released, as I recall transformers were the first ones I couldn't get better results out of with my limited training data and GPU.
Great work! Though I see some people criticizing the usefulness of this. Are they being sarcastic, or just really not understanding what is being discussed here? I can't tell. Maybe as an interesting follow-up you could train the transformer on something with a more limited vocabulary. Spoken language is complex, but a transformer can work on less complex domains like music or PET-BASIC code.
Thanks! The training corpus and code are in the repo if you want to try... Training takes just a couple of minutes on an RTX 3090. Don't get your hopes up too high, though. I can imagine that code would be harder, not easier. Even modest-sized transformer models struggle with proper GOTO targeting. It would look like BASIC, but essentially it would be friendly gibberish too.
harel | 11 days ago
tclancy | 11 days ago
(Came here to say an update to Eliza could really mess with the last person still talking to her.)
bighead1 | 11 days ago
wk_end | 11 days ago
forinti | 11 days ago
jll29 | 11 days ago
yorwba | 10 days ago
anthk | 10 days ago
- Install Cpanminus for Perl, some C compiler and sqlite3 just to be sure.
yorwba | 10 days ago
Lerc | 11 days ago
dnnddidiej | 10 days ago
Lerc | 10 days ago
gizmo64k | 10 days ago
gizmo64k | 10 days ago
classichasclass | 11 days ago
bartread | 11 days ago
ghstinda | 11 days ago
anyfoo | 11 days ago
LeFantome | 10 days ago
Have not heard much about it since launch. Although, now that I look, it seems they are just shipping now.
https://www.commodore.net/product-page/commodore-64-ultimate...
steve_taylor | 10 days ago
wk_end | 10 days ago
IcePic | 10 days ago
brcmthrowaway | 11 days ago
jll29 | 11 days ago
https://dl.acm.org/doi/pdf/10.1145/365153.365168
Geee | 11 days ago
daemonologist | 11 days ago
rob74 | 10 days ago
YOU> HI! HOW ARE YOU DOING?
C64> BELON
YOU> WHAT DO YOU MEAN BY THAT?
C64> YOU DES
YOU> YOU TOO!
C64> YOU DESER! THIS THE SHIN
YOU> DESER? SHIN? THOSE AREN'T EVEN REAL WORDS, WHAT'S GOING ON HERE?
C64>
YOU> THIS THE SHIN
C64> YOU MAKE ME HAPPY TOO
YOU> YOU DESER
C64> YOU MADE MY HEART! W
YOU> DES, DESER, DESERST?
C64> YOU DES
YOU> BELON
C64> I LIKE OURON ME! WONDERFUL!YOU!
YOU> YOU BELON DES
C64> YOU MAKE ME HAPPY
mixmastamyk | 11 days ago
Vaslo | 10 days ago
Brings back memories
borsch_not_soup | 10 days ago
dpe82 | 10 days ago
Very, very cool project though!
chillingeffect | 10 days ago
YOU> HELP I'M DROWNING
C64> YOU' HERE!
YOU> OH NO I'M ON FIRE
C64> IGLAY!
YOU> IM BEING SWALLOWED BE A SNAKE
C64>
YOU> BIRDS ARE NIPPING ON ME
C64> YOU
Razengan | 10 days ago
Maybe there is deeper wisdom in there that we have yet to unearth
numpad0 | 10 days ago
djmips | 10 days ago
Hackbraten | 10 days ago
arketyp | 10 days ago
rahen | 10 days ago
NooneAtAll3 | 10 days ago
meh
rahen | 10 days ago
This is a giveaway of AI generation, from the docstrings to the terrible opcode dispatch (Claude is bad at assembly and low-level optimization): https://github.com/gizmo64k/soulplayer-c64/blob/main/src/cpu...
gizmo64k | 10 days ago
pjmlp | 10 days ago
Even cooler would have been to have the 6502 code generated directly by the LLM.
vintermann | 10 days ago
bluejay2387 | 10 days ago
gizmo64k | 10 days ago