Is it worthwhile to run local LLMs for coding today?

28 points by Akir a day ago on tildes | 23 comments

whs | a day ago

I went down this route, even buying an RTX 5090, and I'd say you're nowhere near a usable setup unless you have the budget for two of them and probably 128GB of memory, all in one machine. That being said, people have reported some success with the Mac Studio's unified memory, but due to the slower memory bandwidth it will be slower than a proper NVIDIA setup.

The reasons this doesn't work:

  1. You can test a lot of open weight models on OpenRouter. I'd say Qwen3.5 9B is quite good for getting non-coding tasks done (like querying); for coding, Qwen Coder is probably the best model you can run on a single RTX 3090, but it's nowhere near GLM-4.7 (you could point Claude Code at Qwen Coder, but it's not as agentic as it should be).
  2. You need a LOT of VRAM to load the model. The top open weight models today (GLM-4.7, Kimi K2.5, MiniMax M2.5, etc.) cannot be loaded on a single gaming GPU at all.
  • There are tricks with MoE models where you load only specific experts onto the GPU, but you still need enough RAM for the rest (or you can tolerate loading from your SSD; I tried an HDD and it took over 5 minutes to load Gemma3), and it requires some tuning of your LLM runtime.
  3. There are quantized models that may fit into a consumer GPU, but they are significantly watered down. Q4, the common variant for running most models on consumer GPUs, costs roughly 10% in quality.
  4. Not only do you need VRAM to load the model, you also need memory to support a large context window; I'm not sure whether that comes out of VRAM or RAM, because at this point I've exhausted both on my gaming PC (rough math is sketched below). I was planning to get 128GB of RAM, but with XMP and memory costs doubling I don't think I'll be doing that any time soon. Currently my max context window with Gemma3 27B is about 50-60k, compared to Claude and Qwen3.5's max of ~200k.
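
For a rough sense of the two memory costs above (weights plus KV cache), here's a back-of-the-envelope sketch. The formulas are standard approximations, and the example dimensions are illustrative guesses, not the real Gemma3 27B configuration:

    # Rough memory estimate for local inference: quantized weights plus
    # the KV cache, which grows linearly with context length.

    def weights_gb(params_b: float, bits_per_weight: float) -> float:
        """Memory for model weights; Q4 is ~4.5 bits/weight with overhead."""
        return params_b * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    context: int, bytes_per_elem: int = 2) -> float:
        """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
        return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

    # Illustrative numbers for a ~27B dense model (made-up config):
    print(f"weights @ Q4: {weights_gb(27, 4.5):.1f} GB")  # ~15.2 GB
    print(f"KV cache @ 60k context: {kv_cache_gb(46, 16, 128, 60_000):.1f} GB")

The point being that even a well-quantized model can get memory-bound by the KV cache alone at long contexts, which matches the 50-60k ceiling above.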

That being said, the one time I used Qwen Coder locally it was the second-fastest coding model I've ever seen, beaten only by Copilot's GPT-4o (for which I suspect Microsoft bought provisioned throughput).

kacey | a day ago

Btw, may I ask if you've given Qwen3.5 27B a shot? Some -- admittedly kinda bad -- benchmarks figure that it's about as good as Anthropic's cheap model (Haiku 3.5).

teaearlgraycold | a day ago

people have reported some success with the Mac Studio's unified memory, but due to the slower memory bandwidth it will be slower than a proper NVIDIA setup

The M3 Ultra has pretty good bandwidth (820 GB/s) but limited compute compared to high end GPUs.

cutmetal | a day ago

So interesting timing on this question. Replying to you because you seem to know what you're talking about and I'm curious what you think.

I'm waiting on delivery of a pair of Nvidia Tesla P40 data center GPUs. You can get them used on eBay right now for a little over $200/ea, shipped. They each have 24GB of VRAM, and I'm planning to put them into a machine with 64GB of RAM. (You do have to come up with a cooling solution as they're made to have server-grade blower fans wind-tunneling them; I found some cheap 3D-printed shrouds that funnel 120mm fans through them. They also have non-standard power inputs, so you need adapters for that too.)

My understanding is I should be able to run 70B-parameter models with a decent context window and decent speed. Does that sound realistic to you? In any case I'll find out next week!

tauon | 16 hours ago

70B parameters will be a rather tight fit for 48 GB of VRAM: at an 8-bit quantization the weights alone come to roughly 70 GB, so realistically you'd be looking at a 4-bit quant (roughly 35-40 GB) plus room for the context window, based on the model size numbers for the newly released Qwen 3.5 family here.
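
Quick arithmetic check (plain bits-to-bytes math, nothing model-specific):

    # Weight memory for a 70B-parameter model at common quantizations.
    params = 70e9
    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        gb = params * bits / 8 / 1e9
        verdict = "fits" if gb < 48 else "does not fit"
        print(f"{name}: {gb:.0f} GB -> {verdict} in 2x P40 (48 GB)")
    # FP16: 140 GB, Q8: 70 GB, Q4: 35 GB, all before any KV cache.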

Edit: But that doesn’t mean the smaller models aren’t capable! I mean, I’ve tested both qwen3.5:2b and qwen3.5:9b on nearly four-year-old fanless laptop hardware and was surprised at the quality of the outputs.
Innovation in the space is going pretty crazy currently, IMO, even if almost no locally run model will compare to the big guns you can get from cloud hardware, especially once you start to factor in the price of things: subsidized models like GPT 5.3 (Codex) or Claude’s Opus 4.6 are hard to beat in that regard, at least from what I can tell so far.

kacey | a day ago

I wouldn't, tbh? Thoughts, as a case-wise analysis:

  • Either the AI bubble pops, and all the hardware in those data centres (plus all the purchase orders and contracts) go up in flames, which will probably have ... an effect on computer hardware prices,
  • or the AI bubble succeeds, and we have fully autonomous AGI-class intelligences running amok in the cloud. The team responsible for getting the current state-of-the-art local model running (Qwen3.5-397B-A17B) was fired from Alibaba in the last week, so it's not terribly certain that local models will keep improving as fast as they have over the last year. Which means that you'll probably want to use cloud models anyways.

(edit) Ah, two addendums:

  1. If you want to have better hardware over the next, say, two years for some other reason (e.g. gaming), now's a decent time to spec up. IMO.
  2. In case it helps, as a data point: I've been running a 4-bit quant of Qwen3.5 35B A3B on a PC (9950X with 64 GB RAM and an RTX 2060), which infers at ~20-30 tokens/second (depending on batch size; it trends closer to 30 than not). It still requires handholding, but it's mostly capable of handling a decently technical workload (at the moment it's implementing an ML project I've been mulling for a while, and it's doing OK enough). It'd be fine for simple web apps or quick one-off scripts. (Setup sketched below.)
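
For anyone curious, here's roughly what that setup looks like with the llama-cpp-python bindings. The model filename and layer split are placeholders; the right values depend on your quant and GPU:

    # Minimal sketch: run a quantized MoE model with partial GPU offload.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3.5-35b-a3b-q4_k_m.gguf",  # placeholder quant file
        n_gpu_layers=12,  # offload what fits on a small GPU; rest stays in RAM
        n_ctx=32768,      # context window; more context means more memory
    )

    out = llm("Write a Python function that parses ISO 8601 dates.",
              max_tokens=256)
    print(out["choices"][0]["text"])

MoE models like this one only activate a few experts per token, which is part of why a modest GPU plus plenty of system RAM can still reach usable speeds.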

babypuncher | 23 hours ago

now's a decent time to spec up. IMO.

Now is a horrible time because hardware prices are grotesquely inflated by the billionaire class's insatiable appetite for shoving slop down our throats.

teaearlgraycold | 23 hours ago

Not Apple’s. I’d recommend people buy Apple hardware now before their fixed pre-inflation contracts run out. For general consumers it’s hard to justify alternatives in the <$1,000 range.

kacey | 23 hours ago

Sorry, I didn’t mean to offend. Apple products are -- IIRC -- one of the few computing products that haven’t seen a price jump because of the recent spike in the cost of everything. For people who really need the extra capacity, buying the beefiest spec you can now makes more sense than at any point between now and ~2029, since prices are only going to go up.

Genuinely, if Akir has the spare cash and has a reason to spec up for the next two years, now will probably be the most cost effective time to do so. Who knows what happens next. If you disagree, please feel free to make an argument ...? I can quote random blogspam if it'd help make mine, but I'm hopeful that we can discuss this point instead of screaming at each other.

It's not going to get better for years to come, most likely. Memory chips are already fully pre-ordered for three years.

0xSim | 20 hours ago

Unless you have a very powerful GPU and/or tons of RAM, your local model will just give you a slower but slightly better autocomplete. You'll pay thousands to run an LLM that is way worse than the current best models (Claude Opus 4.5/4.6), which are available on relatively cheap subscriptions.

It could be worth it if you use that computing power for something else, but investing that much money to run a local LLM is an awful idea.

shrike | 17 hours ago

If you want to learn how to use AI Agents, spend the 20€ or something for Claude Pro for a month and start using it. Both the app and Claude Code on the CLI. Try creating a tool with it that fixes an issue in your day to day life or work, see how it goes. Automate something you need to do manually 50 times a week. Or make a silly game.

Now you have a baseline of what the state of the art can do.

Then you can start experimenting with local models. Grab LM Studio, Ollama and ComfyUI and see what kind of free/open models there are. Some are good for coding, others can describe and even generate images.
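
Once something like Ollama is running, you can also poke at local models from code. Here's a minimal sketch against Ollama's local HTTP API (the model tag is a placeholder for whatever you've pulled):

    # Minimal sketch: query a local Ollama server over its HTTP API.
    # Assumes the Ollama daemon is running and the model has been pulled.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "qwen3.5:9b",  # placeholder: any model you've pulled
            "prompt": "Explain what a KV cache is in two sentences.",
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])

LM Studio exposes a similar local server (OpenAI-compatible, in its case), so the same kind of scripting works there too.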

Find the limits of the mainstream models, what will they do and what they won't. Try writing an AI assisted short story about a murderer and see how the model starts moralising on the character's actions or refuses to write about some things. Then grab some uncensored ones and get REALLY FUCKING WORRIED because they will generate detailed stories of the most heinous shit with no limits at all.

Try to recreate the same app you built with Claude using local models. It's slower, of course, but how does the quality compare? Good enough? Try using a local model in an IDE for autocomplete or agentic workflows; how does it feel?

ComfyUI is fun too; you can easily create "pipelines" for image generation, fully locally. Also see the bits about censored and uncensored models from the paragraphs about text-only models above. Oof.

[OP] Akir | 3 hours ago

I took your advice and it was quite eye-opening. One of the things Anthropic said it could do was onboarding, and it just so happened I had a project that's been on hiatus for about a year that I needed to brush up on. The response it came up with was significantly better than I thought it would be, though still probably not good enough for what I wanted. But then it continued by telling me there were some minor bugs that should be fixed, and it pushed me through a bunch of code review, which happens to be a great way to re-familiarize myself with my old code.

But god damn, even if I did spend the extra $400 for 32GB of RAM, a local model wouldn't be able to get anywhere near what this is doing. I doubt I'd be getting anything this good even if I spent $5000 on the highest end MacBook Pro.

This has really given me a new perspective on the storagepocalypse and why these companies are buying up these resources like there's no tomorrow. It's also really got me wondering if the current AI boom really is a bubble.

kacey | 3 hours ago

It's also really got me wondering if the current AI boom really is a bubble.

Not the OP, but IMO -- probably still a bubble, if only because there's still a gap between revenue and investment. Competitors appear capable of keeping pace with Anthropic/OpenAI at a steady ~6-12 months gap in capability, and they're doing so for pennies on the dollar. If the large, American AI firms can't demonstrate a way to keep their advantages proprietary, then a lot of the R&D investment which is going into making these systems will end up being written off: why would consumers/companies pay 10x for Anthropic/OpenAI when another service is available for mere fractions of the price?

But yeah, agreed that my expectations were blown out of the water while working with some frontier models. Even if they stay just as they are now, this will be massively disruptive to nearly all work done in front of a computer.

babypuncher | 23 hours ago

Oh god please don't become part of the problem. This insatiable appetite for slop is making computers too expensive for the rest of us

kacey | 23 hours ago

Uh ... individuals buying computers aren't driving up the price of components; it's OpenAI buying 40% of the world's RAM manufacturing capacity and the like. I'm sure Akir has reasons for wanting to code locally, and anyways, isn't their desire to use their computer just as valid as that of "the rest of us"?

LukeZaz | 19 hours ago

but I also know that I'm kind of falling behind by not embracing AI coding agents.

Doubtful. The way I've been watching people use these things has been nothing short of reckless; this technology has failed to prove itself countless times, and the cases in which it has worked have been few and far between. To say I am hesitant to believe that it is actually useful for coding and not simply fooling people into thinking it's useful for coding is an understatement. A good software professional is supposed to be wary of software.

But that's just one angle. And not one I prefer, frankly, since I still believe the tech could be good in a hypothetical future where a lot of things were different.

The advice I'd offer here is to consider more than just the cost or a hypothetical future wherein AI magically becomes everything it claims to be. People here have already answered that for you. I suggest instead to consider the following two factors:

  1. Generated code is significantly more prone to errors, due to hallucinations, the fact that you didn't write it and thus understand it less (if at all, depending on how much you read), and the fact that LLMs do not have brains and cannot think.
  2. From my personal viewpoint, at least, generated code is a moral failing. LLMs cause numerous problems, from the environmental to the sociological, both for yourself and for everyone else; knowing this and using them anyway is to declare to the world that one considers convenience more important than either. This is not me pointing a finger at you (knowing the harms of AI is not knowledge anyone's born with); this is me saying it's something you should be thinking about.

So when you ask yourself if a local LLM is worthwhile, please don't just stop at price. Ask yourself if the code will really be as good as some people claim it is, and more importantly, ask yourself if the risks and problems that LLMs (even local models!) create are worth the alleged utility.

shrike | 17 hours ago

No. 1 depends a lot on when you end the generation loop. If you don't give the AI agent (you are using an agent and not copy-pasting from ChatGPT web, right?) tools to validate its work by building, testing, and linting the code, of course it's going to be shit.
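
As a toy illustration of what "tools to validate its work" means, here's a rough sketch of such a loop. generate_patch is a hypothetical stand-in for whatever model call your agent makes; the validation step is the part that matters:

    # Toy sketch of an agent loop that gates generated code behind real
    # validation (here, a test suite) instead of trusting the first draft.
    import subprocess

    def generate_patch(task: str, feedback: str) -> None:
        """Hypothetical stand-in for a model call that edits the repo."""
        raise NotImplementedError

    def validate() -> tuple[bool, str]:
        """Run the tests; return (passed, output) to feed back to the model."""
        proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def agent_loop(task: str, max_rounds: int = 5) -> bool:
        feedback = ""
        for _ in range(max_rounds):
            generate_patch(task, feedback)  # model edits the code
            ok, feedback = validate()       # build/test/lint results
            if ok:
                return True                 # only accept validated work
        return False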

I've personally done full-ass PRs and bug fixes with nothing but prompts to Claude + Opus 4.6, and the code quality is on par with or a bit above what I'd write. Mostly because the model knows a few language/library tricks better than I do. Zero errors, zero hallucinations. And me not writing it doesn't matter; I READ it before I submit it to any human eyes, as you should. My job is to deliver code I have proven to work.

AI isn't going anywhere; someone opened the Pandora's box of large language models and associated tech. The only thing that varies is whether we get local models that are good enough for daily use or whether we need to rely on online models.

teaearlgraycold | a day ago

I hardly use local LLMs for coding, but I am pretty sure you'll want a 128GB MacBook Pro if you're looking to run anything remotely comparable to hosted models. Even then, a 256GB or 512GB Mac Studio is more of the right choice to run the best open weight models.

But as you don't seem to be a professional software engineer I don't think I can anticipate your needs. If you just need an LLM that can help write some small scripts and navigate the command line then I can see something useful fitting into 32GB. I've gotten some use out of GPT-OSS-20B on my 24GB MacBook Air at times when I didn't have internet access. But it was really just a fancy natural language CSS documentation lookup tool at that time. Not anything remotely comparable to modern "agentic" coding tools. The context window is much too small for that.

If you don't need the AI to be local then the free tiers for cloud hosted models will be your best option.

[OP] Akir | a day ago

Yeah, honestly the more I think about it the more dumb the idea becomes. I guess I just got bit by the FOMO bug.

I wouldn't expect a local model to run at the level of the hosted ones, so that isn't really a concern. My expectation was more along the lines of a debugging helper. I think it probably makes a lot more sense to just use their stuff as a pay-as-you-go thing if I ever feel the need to do some vibe coding or something like that. And for debugging I honestly find it somewhat rare to have LLMs be able to tell me something I couldn't find out by talking to the proverbial rubber duck.

clayh | a day ago

I can run 4B Qwen 3.5 on a 24GB M3 MacBook Air without a problem. However, I think you’ll need much larger models for good coding assistance. I agree that you’d probably want 128 or 256GB of RAM if you’re doing this for anything other than a hobby.

pete_the_paper_boat | 17 hours ago

Local inference is probably worth it for stuff like autocomplete, etc. That's a small, self-contained context that tiny models can handle.
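
For example, editor autocomplete is usually served by fill-in-the-middle (FIM) prompting against a small model. A rough sketch with the llama-cpp-python bindings; the model file is a placeholder, and the FIM token names follow the Qwen coder convention (they vary by model family, so check your model's docs):

    # Rough sketch: fill-in-the-middle completion with a small local model.
    # FIM special tokens differ between model families; these are Qwen-style.
    from llama_cpp import Llama

    llm = Llama(model_path="qwen-coder-small-q4.gguf", n_ctx=2048)  # placeholder

    prefix = "def mean(xs):\n    return "
    suffix = "\n\nprint(mean([1, 2, 3]))"
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    out = llm(prompt, max_tokens=32, stop=["\n\n"])
    print(out["choices"][0]["text"])  # ideally something like: sum(xs) / len(xs)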

Idk if full development is very feasible at a competitive price, unless speed isn't an issue; then maybe it's possible to set up a task, walk away, and come back to a solved problem.

entitled-entilde | 17 hours ago

Running local models is fun for hobby purposes like building an agent, fine-tuning, or digging into text analysis with embeddings. For coding, not so much. Remember, if your model makes a mistake in a tool call or runs out of context, the whole thing grinds to a halt. For me, AI coding should be fun and ergonomic, and this spoils it. If you want to try AI coding, just go commercial; Claude Code is $20 a month, which is not so bad. It's possible that two years from now a local model will be developed that does great on any task within 32GB, so I get the temptation. But if you run a local model and hate the whole experience, that's worse. At least have a backup plan for that memory if you do buy it.