Running local models is good now

40 points by Yogthos 11 days ago on lobsters | 38 comments

On the one hand, I don't particularly have much desire to run these tools for my current workflows, but on the other hand the largest of the issues I have with them stem from the centralization aspects (which bleed into other areas too: environmental, privacy, power distribution, etc). So I am glad to see that things really do seem to be getting better on the locally hostable models front.

[OP] Yogthos | 10 days ago

I strongly suspect that's where things will go in the future. Nobody really wants to send all their data to some service provider, and on top of that, you're completely at their whim in terms of price hikes and model availability. As we just saw with the whole Fable fiasco from Anthropic, there's a real danger in allowing yourself to become a digital serf.

As local models and coding harnesses keep improving there's going to be fewer reasons to rent models from a provider, and I'd argue that's true even if local models are less capable overall. For example, a lot of people use DeepSeek instead of Claude simple because it's good enough while being far cheaper. Same argument applies to local, at some point it doesn't really matter if you can rent a better model when local one does the job.

There could also be a lot of possibilities with customizing and tuning these tools as well. For example, I haven't really seen anybody make a LoRA for a specific language, and that might make the model far more effective in a restricted domain. At that point the model could even perform better than a huge general purpose version.

zladuric | 10 days ago

I'm not sure why people keep saying that, Nobody really wants to send all their data to some service provider, and on top of that, you're completely at their whim in terms of price hikes and model availability..

Isn't everyone doing just that? From billions of people having their everything digital held at FAANG, to companies trusting everything to Microsoft and a few others... It's just us techies and a minor percentage of companies who don't give all our digital stuff away.

radex | 10 days ago

When talking about individuals - maybe. But organizations tend to want assurances. That does not necessarily mean they won't outsource, but at the very least an attempt will be made to have some contractual assurances and a plan to change vendors if needed.

(Not to say that organizations act rationally - just that an average organization does more due diligence than an average individual)

: zladuric | 9 days ago
I do agree but even then, assurance just means there's this plan. Everyone is "cloud agnostic", but everyone is at aws, right? That's what I mean. It's not even that nobody cares about this stuff. It's just that it's a hassle and you just wanna pay someoy to do some work for you so you can go on with your business.

[OP] Yogthos | 10 days ago

The context here is software devs, so it is precisely the demographic that cares about this stuff. Also, I don't think regular people really want to send the data to these companies. It's just that it happens to be convenient and accepted, and the barrier to not doing it makes it not worth it. Like say you're a regular person who has a phone and takes pictures with it. You don't even get a say in the matter. Apple and Google will just upload whatever you snap to their server. To actually have a bit of privacy you'd have to install GrapheneOS which is far beyond what most people can do.

joelgrus | 10 days ago

"Regular people" simply don't care. I promise you that if I went to the gym (or wherever) and asked people how concerned they were that their photos were uploaded to Apple photos, the average person would say "not at all."

For the most part even I don't care; my primary concern with giving my data to Google is "what if they suddenly unperson me and cut off access to it." But if I didn't store it in the cloud then I would just have to worry about losing access to it in different ways. (cf the hard drive full of MP3s that sits in a computer that hasn't been booted in like 10 years)

[OP] Yogthos | 10 days ago

Sure, most people don't understand the implications of not having any data sovereignty. That's really a problem with lack of education.

joelgrus | 10 days ago

People having different preferences from you is not "a problem with lack of education".

[OP] Yogthos | 10 days ago

People not understanding the consequences of handing over all their data to megacorps is absolutely a problem with lack of education. This isn't a matter of having a different informed perspective from me, it's a matter of people not knowing how this technology works and that the implications are. There is a reason why vast majority of people who work in tech are uncomfortable with relying on cloud services. And we've had countless examples of how this ends up being disastrous on top of that, so it's rather disingenuous to frame this as a matter of opinion when the harm is clear and documented.

zladuric | 9 days ago

In principle I agree with you. Most people don't even know what is going to happen when Megacorps get their monopolies and one of the causes is the lack of education. What I disagree on is your statement that the vast majority of devs care. I would say that even if the percentage of devs aware of the various issues, the majority still doesn't care that much.

I mean, we shovel money to Microsoft's and oracle's solutions even though there are free and better ones. We give all our data to Google and apple. We host all our stuff at Aws.

I know in a lot of cases it's not our choice but I still think it is not even a majority, let alone a vast one.

: [OP] Yogthos | 9 days ago
Sure that's fair, but the way I look at it is that if all things were equal people would choose to keep their data local. The problem is that the bar to use these services is lower and most people don't really want to bother investing the effort into doing something else. And a lot of time it's just a matter of pragmatism, you only have so much time during the day, and you have to decide on how you want to spend it. Sometimes you end up holding your nose and doing something that you'd do differently if you could devote more effort to it.

Incidentally, LLMs are actually helping with this exact problem. I have a friend who's a designer, and he was very frustrated with Win 11. And I kept suggesting he try Linux which was just too much effort for him to bother previously. He finally took the plunge with Zorin, and he's been using DeepSeek whenever he'd get stuck. That was enough for him to get his bearings and make the switch. He's been happily using Linux for around a year now and has no plans of going back. DeepSeek was what made it possible because every time he had some issue, it made it much easier to troubleshoot and find a solution.

Same pattern applies to a lot of other things where you need to figure out configuration, or how to put tooling together. You obviously still have to spend the time to understand what you're doing, but being able to get a lead on a specific problem you're stuck on means you're always able to make progress without being too frustrated.

jfb | 10 days ago

I feel like commoditization is coming for the frontier labs; once the quality of locally-runnable models begins to approach, say, a three month old frontier model, the entire economics of the industry will shift, and fast. Labs will have to have other axes of discrimination than "our model does best on benchmark X" when people fail to see the difference between first and third on the bench.

[OP] Yogthos | 10 days ago

For sure, and there is a point where models are just good enough for what you're doing so it doesn't really matter that there's a more powerful model you can rent. For example, a lot of people are already switching from Claude to DeepSeek even though Claude is better in absolute terms. We're just getting to the point now where local models are starting to become a viable option, and I expect that within a year or so we'll have local models that are good enough for most tasks.

That might be what pops the whole AI bubble. The models themselves are basically general purpose commodity, traditionally the margins on making these kinds of general tools are low. The value lives in customization and finding a niche domain. What's likely to happen is that we'll see companies popping up specializing in tuning on prem models for businesses to fit their specific needs. And the whole market for renting out general purpose models will collapse.

jfb | 10 days ago

I think about Cohere or Mistral here — there might be a niche for “less capable model but with a more amenable governance structure” even setting aside local models, right?

[OP] Yogthos | 10 days ago

For sure, I'd really love to see models developed in a community governed fashion as actual open source projects rather than just being graciously handed down to us by corps.

jfb | 10 days ago

Absolutely. The issue is of course how expensive these models are to build.

: [OP] Yogthos | 10 days ago
I'm hoping we'll see more projects like this going forward to address that. https://github.com/bigscience-workshop/petals

jmillikin | 10 days ago

once the quality of locally-runnable models begins to approach, say, a three month old frontier model, the entire economics of the industry will shift, and fast.

This seems unlikely as stated; so long as throwing more CPU/RAM at an LLM makes them more effective there will be a gap between what can run on a $500 laptop and a $500,000 rack-mounted server stuffed full of custom silicon.

What does seem possible is that in 5 years a local model could be 3 years behind the frontier, which would mean it's 2 years ahead of where the frontier is right now.

[OP] Yogthos | 10 days ago

I can tell you haven't actually run local models like Qwen 3.6 yourself. Also, ATLAS shows just how close even small models you can run locally can be to the frontier when handled properly.

jmillikin | 9 days ago

I'm an active and enthusiastic user of local models, and use both Gemma 4 and Qwen 3.5 pretty heavily. I've also invested a fair amount of effort into writing custom tooling, such as a local HTTP interceptor that can rewrite completion requests on the fly, which lets me (1) run the claude harness with a local model, and (2) fixes issues in other clients such as Zed's inflexible system prompt (https://github.com/zed-industries/zed/issues/51583). I've written comments on this site before about effective use of local models.

None of that changes the fact that a ~31B local model today is about where the proprietary hosted models were a year or two ago, well before the notable inflection point in capability that happened around Q4 2025. It is obvious during use that the current frontier models are very far ahead of local models, and if a benchmark doesn't show that gap then it's not a good benchmark.

And that gap can't necessarily be closed any more than you can close the gap between a laptop and a server for compiling or 3D scene rendering -- I can't buy a laptop with 256 cores and 2 TB of RAM, but there are multiple cloud providers that will rent me such hardware by the hour.

At some point in the future it'll be possible to run a Fable-equivalent model on my local workstation, either from models becoming more optimized or local hardware becoming faster (or redesigned entirely, e.g. Taalas), but it's a simple matter of physics that if Fable runs on consumer hardware then whatever's running on datacenter-class hardware will be even more capable.

[OP] Yogthos | 9 days ago

Yes, giant frontier models can do more than ones you can run locally. That's not what I'm talking about though. What I'm saying is that local models, especially Qwen 3.6 27b, are getting good enough. To put it another way, if I can get around town on a bicycle, I don't need a monster truck.

However, anybody who's watched technology evolve knows how silly it is to project linear evolution here. There are papers coming out literally on weekly basis right now with people discovering new tricks for optimizing models, reducing memory usage, and improving capabilities. And it's entirely possible that similarly to how we see current crop of local LLMs beat frontier models from a year ago, the exact same thing will happen next year too.

And it's a simple matter of common sense that if I can run Fable level model locally, it's going to be good enough for vast majority of things I do on day to day basis. We're already seeing this happen with people switching to DeepSeek instead of paying for Claude even though the latter is demonstrably more capable. If DeepSeek is doing good enough job, then it simply doesn't matter how much better Claude is. Similarly, if a model I can run and own does what I need it to, then I don't really care that the frontier model running in a data centre can do things that it can't.

: jmillikin | 9 days ago
What development do you anticipate in LLM implementation that would cause a local model to be as capable as a 3-month-old hosted frontier model, which is the claim I quoted?

jfb | 10 days ago

I mean this is a good point! But ISTM that there is a threshold and not thinking about that is a mistake when we look at what’s been good enough in other technical domains.

emk | 10 days ago

Local models are different in interesting ways, some of which may be an advantage:

Inference power consumption is roughly a high-end gaming GPU, and even then, only when generating tokens. This can usually be limited to about 300W. If you read the code, you're probably looking at spending 25% of a working day generating tokens, so 75W sustained.
Training power consumption for a few local-sized models a year could basically be absorbed into the background noise of industrial civilization.
You get to keep all your data local and you don't need to encourage the grifters quite so much.
Local models are dumber, which actually keeps me closer to the work. With Fable, I can give instructions like, "Fill in this street with houses," and get a bunch of crappy McMansions. With Qwen3.6 27B, I can say, "Paint these four rooms." The natural "chunk size" of the work and the small models' preference for concrete instructions force the user to understand the code in far more detail. That isn't to say that local models can't summarize project structure or find bugs. Just that they reward a far more hands-on working style.

Fable is the model that really convinced me we're screwed. It really can just crap out entire projects. The "McMansions" even look nice. But the roofs leak, the foundations are shaky, and the craft is just good enough, just long enough, to sell. This will, of course, likely be wildly successful in the market. I mean, even Fable's worst day is still better than plenty of enterprise SaaS (except for, you know, compliance and security).

So while I find that local models are interesting tools, I am really not looking forward to the messes created by the next generation of frontier models.

: k749gtnc9l3w | 10 days ago
Also, as long as the pre-answer-generation blabbering does not switch to neuralese, local models will unavoidably let it be inspected in real time, while I have heard that frontier models now hide it as an anti-distillation measure. This sometimes reveals interesting «confidence/doubt» information, and sometimes let me cancel the request and write a better version based on how the model rewords my question.

vrthra | 10 days ago

If there are academics here, what do you use local models for? I have found qwen3-coder:30b reasonable for latex edit, and for querying OCRed papers about their results. Any other usage?

nicoco | 10 days ago

Academic here. I don't use "agentic coding"; I don't use LLMs for writing at all (it's even forbidden by most editors isn't it?). I have been extremely underwhelmed every time I tried, not to mention the hassle and fragility of setting up a local inference pipeline (it requires using our shared computing cluster, my laptop's GPU is very tiny). I do use ollama/qwen3-coder or duck.ai occasionnally when I don't have the right keywords to search how to do something in a language or with a lib I am not very used to using, or for very specific stuff I am not an expert at all in (regex, SQL, ...).

: vrthra | 9 days ago
Which field? I think in CS the ACM policy is to be explicit about the use of LLMs.

k749gtnc9l3w | 10 days ago

First drafts of translations. Proofreading those translations helped fix quite a few mistakes in the teaching materials we could have noticed without translating but never did… (This is mostly relevant for teaching in an environment that is not single-language-only)

One-shotting first drafts of general quality-of-life small personalised scripts/mini-tools. Including a harness for the translation to exclude e.g. TikZ from translation requests. Needs debugging afterwards, debugging much more interesting than writing the slog part that slop does get right. Validation strategy obviously matters even more than for handwritten things, ideally it is «any remaining bugs will be pretty obvious when running the tool»…

Honestly, Qwen3.6 surprised me by being not that bad in drafting example solutions to rather standard-ish proof-writing exercises. Although editing to match the desired style might make this somewhat axe-porridge-ish/stone-soup-ish, but some formulas probably stay through the process… depends on tediousness, I guess.

vrthra | 9 days ago

Very interesting. Translations are of course one of the original uses in which AI became reasonably proficient. How is Tikz involved (i.e. that it needs to be excluded) in translations?

k749gtnc9l3w | 9 days ago

The thing to translate is a large chunk of LaTeX, among other things it contains some TikZ.

TikZ diagrams:

need very little translation if any
are a metric ton of tokens
are pretty annoying to check for minor hallucinated changes during copying

So it's easier to cut them out.

I am also now convinced that to translate well I want to feed the text in small pieces, not all at once. In addition to all the nice checkpointing and inspection properties, it makes it natural to inject the false history where «assistant replies» are actually written by me using the style and terminology I want to set.

Why I care about the metric ton of tokens: if copying TikZ from input to output takes more time than all the other translation, I start asking the natural question «do I have better tools for literal copying of a clearly delimited text fragment»! Either it is a big difference of how many pages I can get translated overnight (dense 20B–35B class models), or how many pages I can get translated while answering an unrelated email (MoE A3B–A4B class models with the same total size range).

I have heard horror stories about chatbot-LLM translations with frontier models, like losing a couple of pages in a short book; some of these seem to be unavoidable with frontier single-stream-LLMs if you feed the original text all at once, and impossible even with local-sized LLMs if you feed the original text paragraph-by-paragraph. So I need the slicing harness anyway, and probably it is a good idea to cut at TeX environment edges, and then special treatment for TikZ becomes a cheap add-on.

I guess to get this detailed control with hosted LLMs I would need to pay per-token as subscriptions are tied to specific tooling. Well, between English and French local models are pretty good when used with care. And I got a detailed explanation in a different discussion here on Lobste.rs — in what sense for English→Polish neither DeepL nor Claude Opus is good enough (and I think neither Qwen nor Gemma are good enough, even if a bit better (!) under careful use than Claude, but maybe not better than DeepL unless there is formatting)…

So, the formatted-text translation capability gap might be de-facto negative by now.

vrthra | 9 days ago

That is very interesting! Thank you for the insight. I also find that passing through OCR is a good idea before sending them to LLMs, which naturally removes the figures. The effect is the same I guess.

Rahul

: k749gtnc9l3w | 9 days ago
Well, there I have the LaTeX source and want to translate LaTeX source. But yeah, mixing the tasks makes everything harder; even asking a model to OCR (one page at a time), then asking the same model to translate, should be easier than doing both at the same time.

zipy124 | 10 days ago

Proof reading that goes beyond spell-check/grammar-check basically. Or writing quick scripts for data analysis, but only pilot experiment type stuff, not final analysis, so exploration.

vpr | 10 days ago

Reformed academic, run in fairly academic circles.

Don't see a lot of local LLM usage, outside of ML people, or schools who for whatever reason provide an endpoint on local clusters.

Frontier model usage is incredibly high. Mathematicians are using them extensively, from small lemmas to typesetting/picture generation/one-off code projects. What was once a undergraduate summer project is now a prompt away, everyone seems to recognize it's bad for training but the temptation for instant gratification is just too strong.

An important point to note is the only real angle for local LLMs is privacy/ethics, since academics are universally putting these tools on grants (though unclear how tedious this is to drag through European bureaucracy). My academic friends in math/physics are largely unconcerned about digital privacy & ethics, despite being some of the earliest computer adopters (digital "natives" in the 80s-90s). Exception there might be French friends who have an independent culture of open source/digital sovereignty.

felixs | 10 days ago

I've lately tried gemma4:12b on my Framework 13 with 32 GB memory for coding and while that was incredibly slow (~30 min for a simple task), it worked to my surprise. (tech stack: ollama, opencode running in a VM without network access - all in the repositories and setting it up quite straightforward)

So I can agree that local models are coming in range to be useful on "normal" machines, and it's somewhat exciting.

: k749gtnc9l3w | 9 days ago
Is it unified memory and iGPU? With Q4_K_M the larger-but-MoE (26B for Gemma4, 35B for Qwen3.6) version should still fit, but as it is mixture of experts with fewer parameters active for each specific token, it might be faster.

I think on my non-last-gen Ryzen's iGPU number the speed is indeed about the number of active parameters (I have 64 GiB so I use Q6 quantisations, but maybe it is not better than Q4)