A 10 year old Xeon is all you need

Source: point.free
528 points by cafkafk 10 hours ago on hackernews | 233 comments

The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it.

I have a recycled server. To its credit, it has a whopping 128 GB RAM, but it’s DDR3… That RAM is 5-6 times slower than the current best laptop RAM. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptop’s CPU…

Oh, and as I did mention, we have no GPU. And no, the Xeon does not have an integrated GPU.

But, just hear me out…

If we were to just break out ollama here, well… as explained in earlier blog posts, we can’t. We’d be lucky if we could in 6 months, when they add support for the model we need, if they ever do. And even then, ollama simply doesn’t expose enough knobs for us to ever make this run well, and neither does standard llama.cpp.

But. Why would that stop us?


I’ve received feedback that some of the previous posts were too high level, so I’ll try to make things as clear as reasonably possible here. If you’re a tech worker, or a Linux enthusiast who has built a computer and used something like ChatGPT, most of this should be approachable.

So, just to really set the stage fully. The hardware, per lscpu:

  • CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz
  • Cores: 8 physical, 16 threads
  • Instruction sets: AVX2 (no AVX-512, no AVX-VNNI, no BF16)
  • Cache: 20 MiB L3, 2 MiB L2 total
  • Memory: 128 GB DDR3
  • GPU: none

For LLM inference, memory bandwidth is the limiting resource. Every token generated requires hauling gigabytes of weights from RAM into the CPU cache.

When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the “decoder pass”. During this phase, the model generates the output one piece (or “token”) at a time.

In this step, the system’s raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate that next word, the processor has to constantly pull massive amounts of data. That data is the “weights” that contain the model’s learned knowledge. It moves this from memory into the compute cores.

The processor executes the required matrix calculations so quickly that it is left sitting idle, waiting for the hardware to physically move the next chunk of weights across the memory bus. In traditional software terms, decoding is heavily memory-bound, not compute-bound.

This is the so-called “memory wall”, one of the single biggest performance hurdles today, whether you’re on a Xeon or an H100.
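
To put a rough number on that wall, here is a back-of-the-envelope sketch. The bandwidth figure is a placeholder I picked for illustration, not a measurement of this box. The 3.8B active parameters are the Gemma 4 26B-A4B figure quoted later in this post, and 8.5 bits per weight follows from the Q8_0 block layout (32 int8 values plus one 16-bit scale). It ignores KV cache reads and the drafter, so treat the result as a ceiling, not a prediction.

```python
def decode_ceiling_tokens_per_s(active_params: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is streamed
    from RAM exactly once per generated token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3.8e9            # active parameters per token, from this post
Q8_0_BITS_PER_WEIGHT = 8.5       # 34 bytes per block of 32 weights
ASSUMED_BANDWIDTH_GB_S = 40.0    # hypothetical placeholder, measure your own machine

ceiling = decode_ceiling_tokens_per_s(
    ACTIVE_PARAMS, Q8_0_BITS_PER_WEIGHT, ASSUMED_BANDWIDTH_GB_S
)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Swap in a measured bandwidth number and the function tells you how far any amount of flag tuning can possibly take you.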


Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, even when it runs at all, because its defaults are tuned for a generic GPU use case and leave a lot of improvements on the table. Further, it simply doesn’t have most of the optimizations that the state of the art currently uses to run these models at scale.

The remedy is to pull every optimization lever ik_llama.cpp exposes. Most of them are slightly obscure.

Here is the magic spell that makes this actually run.

llama-cli \
  --model gemma-4-26B-A4B-it-Q8_0.gguf \
  --model-draft gemma-4-26B-A4B-it-assistant-GGUF/\
wikitext-2-raw_ik-llama-mtp_drafter-conservative/\
gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
  -cnv --color --jinja --special \
  -sm graph -smgs -sas -mea 256 --split-mode-f32 \
  --temp 0.7 -t 8 --parallel 8 \
  --cpu-moe --merge-up-gate-experts \
  --flash-attn on --mla-use 3 \
  --mlock --run-time-repack --no-kv-offload

Under a black-box tool like ollama you never see this line. On aging hardware you have to understand what each flag does, because half of them won’t take, and the engine will tell you so in passing.


Speculative decoding.

--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune

This pairs the 26B verifier with the small drafter from the previous post: up to three tokens per draft (--draft-max 3), all probabilities accepted (--draft-p-min 0.0), and --spec-autotune adjusting the chain length per workload.

This ties directly back to our previous discussion about the memory-bound decoder pass.

When a model uses a long reasoning chain, it is generating those “thinking” tokens one by one. Even if the internal reasoning is hidden from the user and all you see is a short final answer, the hardware still has to perform a full decoder pass for every single token in that hidden chain.

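In fact, speculative decoding is currently one of the most brilliant software workarounds the AI industry has invented to get around the “memory wall”. The small drafter guesses the next few tokens, and the big verifier checks all of them in a single pass over its weights, so every accepted guess is a token that did not need its own trip across the memory bus. Spec autotune is how you squeeze the maximum speed out of it.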

The argument for speculative decoding is stronger on CPU than on GPU. CPU compute is cheap relative to the cost of streaming the verifier’s weights through cache, so spending extra cycles on a tiny drafter buys tokens at very little marginal cost. The drafter’s working set fits in L3; the verifier spills out of everything.
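
How much does a draft chain of three actually buy? A minimal sketch, assuming each drafted token is accepted independently with the same probability (a simplification; real acceptance rates vary with the text, and I have not measured them here). Under that assumption the expected number of tokens produced per verifier pass is the standard geometric-series result from the speculative decoding literature.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verifier pass when each of `draft_len`
    drafted tokens is accepted independently with `accept_rate`.
    The verifier always contributes one token itself, hence draft_len + 1."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Acceptance rates below are illustrative assumptions, not measurements.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=3)
    print(f"accept={rate:.1f} -> {tokens:.2f} tokens per verifier pass")
```

Since the verifier pass is where the memory traffic goes, tokens per pass is roughly the speedup you can hope for before the drafter’s own cost is subtracted.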


CPU and MoE routing.

--cpu-moe --merge-up-gate-experts -t 8 --parallel 8

Gemma 4 26B-A4B has 128 experts with 8 active per token, giving about 3.8B active parameters out of ~25.2B total. --cpu-moe keeps the expert weights in CPU memory; mainline llama.cpp documents the same flag as keeping all MoE weights on the CPU instead of offloading them to a GPU.

CPUs handle memory very differently from GPUs. While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

In an MoE model, constantly jumping around between 128 different experts can cause “cache thrashing”, where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we’re on DDR3!).

The flag does not change which experts the router picks, since that would change the model’s output. With no GPU in the box the experts would live in system RAM anyway, so here it mostly makes the intent explicit. What keeps the traffic manageable is the sparsity itself: only 8 of the 128 experts are read for each token.
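
For readers who have not seen expert routing before, here is a minimal sketch of generic top-k gating with the 128/8 numbers above. It is an illustration of the idea, not Gemma’s actual router; the random logits stand in for what a learned router layer would produce.

```python
import math
import random


def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their
    softmax weights so they sum to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in router scores
chosen = route_top_k(logits, k=8)
active_fraction = len(chosen) / len(logits)

print("experts used for this token:", [i for i, _ in chosen])
print(f"fraction of experts touched: {active_fraction:.4f}")
```

Only the weights of the chosen experts have to be read for that token, which is why a 25B-parameter model can behave like a roughly 4B one as far as the memory bus is concerned.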

--merge-up-gate-experts fuses two per-expert projections into a single matmul, which the logs confirm:

fused_up_gate = 1

This is a software trick to bypass the memory bandwidth bottleneck we discussed earlier.

Inside the experts, the math operations require data to be passed through different layers. Normally, the processor would calculate an “up projection”, write the result to memory, then load the weights for a “gate projection”, calculate that, and combine them. That requires moving data across the memory bus multiple times.

Instead of doing two separate trips over the memory bus, it combines the operations into a single step.
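
A toy version of the idea, in plain Python so it runs anywhere. The GELU-style gate is my assumption for illustration, and the real kernels work on quantized blocks rather than Python lists. The weights still have to be read either way; what the merge saves is the second pass over the input and the overhead of a separate matmul.

```python
import math


def matvec(rows, x):
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def gelu(v):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def gated_unfused(w_up, w_gate, x):
    up = matvec(w_up, x)        # first pass over x
    gate = matvec(w_gate, x)    # second pass over x
    return [gelu(g) * u for g, u in zip(gate, up)]


def gated_fused(w_up, w_gate, x):
    merged = w_up + w_gate      # stacked once; a real engine does this at load time
    both = matvec(merged, x)    # single pass over x
    n = len(w_up)
    up, gate = both[:n], both[n:]
    return [gelu(g) * u for g, u in zip(gate, up)]


w_up = [[1.0, 2.0], [0.5, -1.0]]
w_gate = [[0.1, 0.2], [-0.3, 0.4]]
x = [1.5, -2.0]
print(gated_unfused(w_up, w_gate, x))
print(gated_fused(w_up, w_gate, x))
```

Both functions return the same numbers; only the number of matmuls changes.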

-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other.


Memory pinning, repacking, KV cache.

--mlock --run-time-repack --no-kv-offload

--run-time-repack reorganizes weight matrices in memory immediately before inference to match the CPU’s cache layout. The logs confirm:

============ Repacked 265 tensors

Processors have their own ultra-fast, built-in memory called caches (L1, L2, and L3). Caches fetch memory in fixed-size lines (64 bytes on this CPU), and the AVX2 units work fastest when the values they need sit next to each other in those lines.

If the AI’s weight matrices are sitting in system RAM in a generic layout, the CPU has to awkwardly pull the data in pieces, resulting in “cache misses” where the CPU stalls. --run-time-repack tells the engine to spend a few seconds during startup to physically reorganize the massive tables of numbers in the RAM so they perfectly align with how the CPU wants to ingest them. It pays a small time penalty upfront to guarantee maximum memory bandwidth during the actual text generation.
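
Here is a toy illustration of what “repacking” means. The interleaved layout below is a made-up example, not ik_llama.cpp’s actual format: it groups four output rows together so that a kernel computing those four rows at once reads the buffer strictly front to back.

```python
def matvec(matrix, x):
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def repack_rows(matrix, tile=4):
    """Interleave `tile` rows at a time into one flat buffer.
    Hypothetical layout, for illustration only."""
    assert len(matrix) % tile == 0
    flat = []
    for r0 in range(0, len(matrix), tile):
        for c in range(len(matrix[0])):
            for r in range(r0, r0 + tile):
                flat.append(matrix[r][c])
    return flat


def matvec_repacked(flat, x, n_rows, tile=4):
    out = [0.0] * n_rows
    pos = 0  # only ever moves forward: purely sequential reads
    for r0 in range(0, n_rows, tile):
        for c in range(len(x)):
            for t in range(tile):
                out[r0 + t] += flat[pos] * x[c]
                pos += 1
    return out


matrix = [[r * 10 + c for c in range(3)] for r in range(8)]
x = [1, 2, 3]
flat = repack_rows(matrix)
print(matvec(matrix, x))
print(matvec_repacked(flat, x, n_rows=8))
```

Same numbers in, same numbers out. The only thing that changed is the order the weights sit in memory, which is the whole point.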

--mlock is meant to pin the model in RAM so the OS cannot swap any of it to disk.

mlock stands for “memory lock”, surprising, I know! In standard operating systems, if the system starts running out of RAM, it will quietly take data that hasn’t been used recently and “swap” (or page) it to the physical hard drive.

If an OS tries to swap out 27GB of AI weights to a disk, the generation speed will instantly drop to zero while the system chokes trying to read it back. --mlock tells the Linux kernel: “Pin this 27GB strictly in physical RAM. Do not ever move it to the disk.”

Notice that if you’re not careful, you’ll see this:

warning: failed to mlock 27628376064-byte buffer
  (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).

The flag is fine; the kernel-side memlock limit isn’t set high enough to pin a 27 GB buffer. This is not an LLM-shaped problem at all, it’s a ulimit default, and it’s the kind of footgun the black-box tools paper over by simply not asking for the optimization in the first place.
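
You can check this before launching anything. A small sketch using Python’s standard resource module (Unix only); the byte count is the one from the warning above, and the helper name is mine.

```python
import resource

MODEL_BYTES = 27_628_376_064  # buffer size reported in the mlock warning


def memlock_ok(needed_bytes: int, soft_limit: int) -> bool:
    """True if a buffer of `needed_bytes` fits under the memlock soft limit."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= needed_bytes


soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("memlock soft limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
print("memlock hard limit:", "unlimited" if hard == resource.RLIM_INFINITY else hard)
print("enough to pin the model:", memlock_ok(MODEL_BYTES, soft))
```

If the answer is False, raising the limit is a system-configuration job. On most Linux distributions the usual routes are ulimit -l in the launching shell (as root, as the warning says) or a memlock entry in /etc/security/limits.conf.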

Consider that for a moment: by default, many tools will happily let the OS push your model into swap if the kernel decides that’s the best option. You can imagine how much this can hurt performance…

--no-kv-offload tells the engine not to look for a GPU for the KV cache. There isn’t one to find, but the flag short-circuits the check.

The KV (Key-Value) cache is the AI’s short-term memory — it stores the context of the current conversation so the model doesn’t have to re-read the entire prompt for every new token.

Because the KV cache is constantly being read from and written to, AI engines usually try to “offload” it to a GPU, which has much faster memory than we do.

Since this specific setup is highly optimized to run purely on a CPU, letting the engine search the hardware buses for a GPU that doesn’t exist is a waste of time and could throw an error. This flag explicitly short-circuits that check, telling the engine to just keep the short-term memory in the system RAM alongside the weights.


Graph layout.

I’ve tried my best to keep this easy to understand, but this part is just plain hard to explain in a single blog post.

Now onto dark arts. A common frustration in bleeding-edge AI software is that the engine is being developed so fast that the developers don’t have time to write official documentation. If you want to know how to optimize the engine, you have to dig through the raw code or read the GitHub Pull Request (PR) comments between the developers.

-sm graph -smgs -sas -mea 256 --split-mode-f32

These flags govern how the computational graph is allocated across memory regions. There is some written documentation, but the full reference is ultimately the code.

The flag -sm graph sets the split mode to graph (often known in the industry as Tensor Parallelism). This is entirely about how you divide the massive math workload across multiple processors or memory regions (like multiple CPU sockets or GPUs).

  • Layer Split (The Default/Fallback): The engine slices the model horizontally. Processor A calculates Layers 1–10, then sends the data over the system bus to Processor B, which calculates Layers 11–20. While Processor A is working, Processor B is sitting idle.

  • Graph Split (The Goal): The engine slices the computational graph vertically. Processor A and Processor B calculate different halves of Layer 1 at the exact same time, combine their answers, and move to Layer 2 together. This keeps all hardware running at 100% simultaneously, drastically improving generation speed.
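
The two strategies can be sketched in a few lines. This toy runs everything serially in one process, so it only shows who owns which slice of the work, not the speedup; the function names and the two-device setup are mine, not the engine’s.

```python
def matvec(rows, x):
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def layer_split(layers, x):
    """Each device owns whole layers, so they run one after another."""
    for weights in layers:
        x = matvec(weights, x)
    return x


def graph_split(layers, x, n_devices=2):
    """Each device owns a slice of every layer's rows; the partial
    outputs are stitched together before the next layer."""
    for weights in layers:
        chunk = (len(weights) + n_devices - 1) // n_devices
        # In a real engine these slices would be computed concurrently.
        parts = [matvec(weights[i:i + chunk], x)
                 for i in range(0, len(weights), chunk)]
        x = [v for part in parts for v in part]
    return x


layers = [
    [[1, 2], [3, 4], [5, 6]],   # layer 1: 2 inputs -> 3 outputs
    [[1, 0, 1], [0, 1, 0]],     # layer 2: 3 inputs -> 2 outputs
]
x = [1, 1]
print(layer_split(layers, x))
print(graph_split(layers, x))
```

Both produce the same result. The difference is that in the graph split every device has something to do at every layer.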

On this run, the engine declines:

=======================================================
Split mode 'graph' is not supported for Gemma4 external MTP
  => changing split mode to 'layer'
=======================================================

Because MTP creates a much more complicated web of math at the very end of the network, this inference engine simply hasn’t gotten support yet to safely “graph split” (vertically slice) an MTP architecture. When the engine boots up, it detects the MTP layers, realizes -sm graph will break the math, and safely downgrades to the slower, sequential layer split so the model can still run.

I’ve included it because it will likely be very helpful in the future, so you should try your luck if you’re working on a newer version.

While -sm graph was disabled, these other flags still apply to how the engine manages memory:

  • -sas (Split Across Sockets): Explicitly tells the engine how to divide the workload across different physical CPU sockets (NUMA nodes) on a server motherboard. You may note we only have one CPU, but we could get more later. It’s a nice optimization; just bench it to be safe if you do this, since older boards may break current-day assumptions.

  • --split-mode-f32: When data is split across processors, it has to be stitched back together. This flag forces those intermediate connection points to use 32-bit floating-point precision (higher quality math). It keeps rounding errors at those join points from degrading the model’s output.

And don’t worry if you see this:

Oops: tensor with strange name rope_freqs.weight

It has a strange name. Strange names will not stop us here. :D


Attention.

Look. ikawrakow, creator of ik_llama.cpp, is beyond the word “cracked”.

Kawrakow wrote custom CPU kernels to handle Flash Attention, bypassing the need for a GPU during heavy context processing.

This lets us do something that normally you only do on a GPU.

--flash-attn on --mla-use 3

Flash Attention fuses the attention softmax with its matmuls to avoid materializing the full attention matrix. Duh, anyone knows this, but I’ll try to explain it.

To generate text, an AI has to calculate how every single word in your prompt relates to every other word. Mathematically, this creates a grid of size N×N (where N is the number of tokens).

If you give the AI a short sentence, that grid is small. But if you feed it a 100,000-word document, that matrix explodes into 10 billion cells. Normally, the processor calculates this massive matrix and “materializes” it — meaning it physically writes the entire giant grid out to the main system RAM, only to immediately read it back for the next step.

Flash Attention applies the Kernel Fusion trick, but to the attention mechanism. It calculates the attention scores in small chunks and fuses the math (the softmax) so that the giant N×N matrix is never actually written to RAM. It is calculated and consumed entirely inside the processor’s ultra-fast local cache.
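
The chunked softmax is easier to believe once you see it. Below is a minimal sketch for a single query, in plain Python: the naive version builds the full row of scores, the streaming version only ever holds one block of them plus a running maximum and running sum. The block size and function names are mine, and the real kernels are vectorized and work on many queries at once.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def attention_naive(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]      # full row materialized
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def attention_streaming(q, keys, values, block=2):
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        scores = [dot(q, k) * scale for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.5, 0.5], [-0.3, 2.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
print(attention_naive(q, keys, values))
print(attention_streaming(q, keys, values, block=2))
```

The two outputs agree to floating-point precision, and the streaming version never holds more than `block` scores at a time.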

Flash Attention was originally invented strictly for GPUs because it relies on how GPU hardware handles memory blocks. Successfully porting this highly complex, hardware-specific optimization to work on standard CPUs is a massive software engineering achievement. Well done ikawrakow.

--mla-use 3 enables Multi-Head Latent Attention. Earlier, we discussed the KV Cache (the AI’s short-term memory of the conversation that prevents it from having to re-read the whole prompt for every word).

In standard architectures, storing the raw Key and Value data for every single token eats up RAM incredibly fast. Multi-Head Latent Attention (MLA) is a breakthrough architecture that heavily compresses this short-term memory. Instead of saving raw data for every token, it compresses the Keys and Values into a much smaller, dense mathematical representation (a “latent” space).

This drastically reduces the memory footprint of the KV cache, allowing the model to remember massive conversations without running out of system RAM. The flag --mla-use 3 simply tells the engine to activate a specific tier or kernel implementation of this compression.

But all of this is just experimental stuff right, like the split mode graph? Nah. The logs confirm that Flash Attention and the fused MoE paths took (this excerpt has no matching line for MLA):

flash_attn    = 1
fused_moe     = 1
fused_up_gate = 1

The memory accounting from the logs:

------------------- Layer sizes:
Layer  0:    825.98,   2048.00,   2873.98   77.00 MiB
...
Layer 29:    840.59,   1024.00,   1864.59   77.00 MiB
Layer 30:    748.00,    435.00,   1183.00   MiB (output layer)
--------------------------------------------------------------------------
Total   :  24852.46,  56755.00,  81607.46 MiB
Memory required for model tensors + cache: 82355 MiB

An 82,355 MiB footprint, roughly 80 GiB, in DDR3 on a 2016 Xeon. About 24 GiB of weights and 55 GiB of KV cache at the full 262K context. The KV cache is larger than the model.
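
The log reports MiB, which is easy to misread as decimal megabytes. A quick conversion of the totals above, using only the numbers printed in the log:

```python
MIB = 1024 ** 2


def mib_to_gib(mib: float) -> float:
    return mib / 1024


def mib_to_gb(mib: float) -> float:
    return mib * MIB / 1e9


weights_mib = 24852.46    # "Total" row, first column
kv_cache_mib = 56755.00   # "Total" row, second column
total_mib = 81607.46      # "Total" row, third column
required_mib = 82355      # "Memory required for model tensors + cache"

# The first two columns of the Total row add up to the third.
assert abs((weights_mib + kv_cache_mib) - total_mib) < 0.01

for label, mib in (("weights", weights_mib),
                   ("KV cache", kv_cache_mib),
                   ("required", required_mib)):
    print(f"{label:>9}: {mib_to_gib(mib):6.1f} GiB = {mib_to_gb(mib):6.1f} GB")

kv_to_weights = kv_cache_mib / weights_mib
print(f"KV cache is {kv_to_weights:.2f}x the size of the weights")
```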

That a working configuration requires 25 flags, half of which are undocumented and a quarter of which fail silently, is a reasonable working definition of the usability moat described in the first post.

The engine loads a 25B-parameter MoE, runs speculative decoding against an MTP drafter, and generates text at reading speed on hardware that was already old before the architecture in question had been invented.


When we started this series a week ago, the state of local open-weights AI looked grim. We began by pulling back the curtain on the industry’s favorite marketing spin: the idea that dropping a massive, uncalibrated weights file onto a repository constitutes “open source.” We looked at the massive usability moat built out of missing documentation, silent defaults, and black-box wrappers that hide performance-killing decisions under the guise of user-friendliness.

In the second post, we rolled up our sleeves and waded into the muck. We hunted down obscure, unmerged pull requests, compiled specialized forks (ik_llama.cpp), flipped the standard logic of quantization on its head to build highly precise speculative decoding drafters, and wrote custom scripts to scrub infrastructure data leaks out of our GGUF metadata.

Finally, in this post, we put our money where our mouth is. We dragged a 2016 enterprise relic out of the closet (NAY, out of the grave), a single Intel Xeon running on agonizingly slow DDR3 RAM with absolutely no GPU to speak of, and forced it to run a cutting-edge, 26-billion-parameter Mixture-of-Experts architecture at reading speed. We did it without throwing exotic hardware at the problem. Instead we treated the deployment pipeline as a serious thing: mapping the architecture directly to physical hardware, tuning memory allocation, and unlocking the absolute limits of CPU cache optimization.

The lesson here is simple: The bottleneck to running state-of-the-art AI locally isn’t just in the silicon. It’s the need to understand how the inference engine actually works. Deeply.

While a cluster of data-center graphics cards, a corporate API token, or a massive budget are all extremely useful for specific workloads, for the ones that the open models cover, you just need refurbed hardware and to refuse to let black-box tools hold the steering wheel. Armed with the right fork, calibrated quants, and an understanding of the memory architecture under your hood, the usability moat vanishes.

The bleeding edge of Open Weight AI isn’t locked behind a paywall or a model provider. If you’re already running a homelab, it’s sitting right there on the command line of a ten-year-old server.

Welcome to the other side of the moat. Now go download the quants and get your hands dirty.

Thanks for reading :D