mudkip | 21 hours ago
Yes, I agree that the value of Apple's chips is hard to beat, but there's still a massive bottleneck in hardware accessibility. The 128 GB MacBook described costs over $5,000 on their website, and even the most often recommended consumer GPU, a used 3090 with 24 GB of VRAM, goes for $700 at minimum. This effectively prices out all non-professional users who don't have that kind of money to spend simply to have parity with the $20 subscription on their phones. (And even those who do have to cope with the fact that the model will always be dumber than ChatGPT, and that their hardware will grow outdated very quickly.)
There's also a noticeable disconnect between the hardware we have and the primary focus of the open-source labs, which is to scale up and cater to their enterprise customers (just look at GLM-5's increase to 744B parameters, double GLM-4.5's 355B). We really need some kind of Cambrian explosion in cheap hardware for local models to be feasible.
enobayram | 12 hours ago
My personal issue with the frontier AI vendors boils down to two points:

1. They're bound to increase prices, and if one of them ends up finding itself in a monopoly position, they'll probably raise them to the point where they essentially reap 98% of the benefits.

2. They're above the law, so they can't be trusted for any purpose. They could essentially be doing anything with the data you send them, and they would probably get away with it even if they were caught in the act.

Other than these two points, I don't care whether the LLM is running on my machine or someone else's, so I'd be perfectly happy making LLM requests against my friendly neighborhood AI shop that invested in some top-end hardware to run the latest open-source models, contractually bound itself not to store my request data in any way, and is not powerful enough to skew the judicial system if it's caught violating that contract.
[OP] wils124 | 3 hours ago
> if one of them ends up finding itself in a monopoly position

I really don't see how any of the players are going to build a monopoly for themselves. There's really no "special sauce" for frontier models other than capital.
Aside from the distillation problem, there's also a game-theory element in play: if one of the well-capitalized tech giants falls behind in winning customers for closed, remote models, it's probably worth it for them to open-source good models as a defensive tactic. I.e., it's worth it for Facebook to commoditize AI if that denies OpenAI or Anthropic the capital to build a competitive ad product.
zladuric | 7 minutes ago
It doesn't have to be exactly a monopoly. Look at operating systems, phone operating systems, search, social media. True, they don't have monopolies, but they can do whatever the hell they want and they don't care about the law.
I think that was the point of the commenter before you.
jonathannen | 18 hours ago
> money to spend simply to have parity with the $20 subscription

I’m not convinced we’ve seen the real cost yet. That $20 is heavily subsidized - if anything, the risk is that prices go up, not down. The whole ecosystem is betting hard on cost curves improving.
> they have to cope with the fact that the model will always be dumber than ChatGPT

A lot of use cases just don’t need parity. “Good enough” already clears a surprisingly wide surface area - translation, transcription, summarization, games, glue code, specialized coding agents, etc.
Overall I agree, I just wonder if there are some edges there.
carlana | 5 hours ago
I also feel like it should be possible to do RLHF locally in a way that you can't in the cloud, so that my local coding agent knows the kind of stuff I want to see in a PR.
jonathannen | 4 hours ago
Have you ever attempted this? I'd love to hear about it!
I did attempt this with llama "back in the day", but hit the bitter lesson much quicker than I expected (and to be honest, it was hard and I just got too busy). Maybe Qwen is worth a shot too.
I'm currently doing an end-run around this with tool-augmented prompting (e.g. this eslint formatter library).
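For anyone who wants to poke at the fine-tuning route: you don't need full RLHF to get at the "knows my style" idea; a plain supervised LoRA fine-tune on your own PR comments is the simpler cousin. A minimal sketch, where the model name, data file, and hyperparameters are all placeholder assumptions, not a tested recipe:

```python
# Minimal local LoRA fine-tune on your own PR review comments.
# Assumes a JSONL file of {"text": ...} examples; names are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-Coder-1.5B"  # hypothetical pick; any small causal LM works
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA keeps the trainable parameter count small enough for consumer hardware.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

ds = load_dataset("json", data_files="my_pr_comments.jsonl")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("pr-style-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```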
Kye | 2 hours ago
Unless you have something more current, none of the attempts to figure out how much it really costs have panned out. For example: https://martinalderson.com/posts/no-it-doesnt-cost-anthropic-5k-per-claude-code-user/
It might be heavily subsidized, but I haven't seen a claim built on solid ground yet.
singpolyma | 20 hours ago
Also it depends on the use case. "8 t/s makes for a good conversation experience", sure, but only with no thinking. If you want agentic coding with a usable window, you need more like 60 t/s, which is possible with local models on these Macs or Ryzen AI Max, but not at the 120b+ sizes.
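The gap is easy to underestimate; here's a back-of-the-envelope comparison (the token counts are illustrative guesses, not measurements):

```python
# Wall-clock time to emit a response at different generation speeds.
# A chat reply is a few hundred tokens; one agentic coding turn (plan,
# diff, thinking) can easily run into the thousands. Sizes are guesses.
for tokens in (300, 5_000):
    for tps in (8, 60):
        print(f"{tokens:>5} tokens @ {tps:>2} t/s = {tokens / tps / 60:5.1f} min")
```

At 8 t/s a single long agent turn is a ten-minute wait; at 60 t/s it's just over a minute.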
ewintr | 14 hours ago
You don't have to buy an Apple or a 3090; you can also go for a Strix Halo platform. I got a Framework Desktop with 128 GB for around 3,000 euros, but you can go much cheaper if you go for the Chinese brands. A quick search brings me to a Bosgame M5 for $1,699.
Still not something my mother is going to spend on her new laptop, but things are moving fast. I now have a device in my house that is superior in every way to something like the original ChatGPT 3.5. The latest Qwen models are on another level: I run the Q6 variant of Qwen3.5-122B-A10B-GGUF, about 122 GB, and I'd say it is about 60-70% of the way to the leading proprietary models. This was considered impossible a couple of years ago.
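For anyone curious what running a quant like that looks like in practice, llama-cpp-python gets you going in a few lines (the file name and settings below are illustrative placeholders, not my exact setup):

```python
# Illustrative llama-cpp-python invocation for a large GGUF quant.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-122b-a10b-q6_k.gguf",  # hypothetical filename
    n_ctx=8192,        # context window; bigger costs more RAM for the KV cache
    n_gpu_layers=-1,   # offload all layers to the GPU / unified memory
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this diff: ..."}])
print(out["choices"][0]["message"]["content"])
```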
bityard | 4 hours ago
You could buy a 128 GB Strix Halo box for $1,700 a few months ago; right now you can't get one for less than $2,400 (the bare Framework motherboard), and complete systems are near or above $3,000.
mordae | 7 hours ago
So a Claude Code $200/month subscription costs roughly $5,000/month in actual compute. Once the price correction happens, that self-hosted beefed-up workstation stands in a pretty good position.
marcecoll | 6 hours ago
I'm still unconvinced by this. I'm not saying it's not being subsidized, but that $5k a month is measured at the price point at which they sell the API calls, not their actual cost, and only for some extreme cases. Of course we don't have numbers for their margins on API calls, but I'm guessing they are quite high.
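For scale, the headline figures are just token volume times list price; a sketch (the per-million-token rates below are assumptions for illustration, not published numbers):

```python
# Monthly API bill implied by a given token volume, at assumed list prices.
PRICE_IN, PRICE_OUT = 15.0, 75.0   # $/million tokens, assumed Opus-class rates

def monthly_bill(m_tok_in: float, m_tok_out: float) -> float:
    return m_tok_in * PRICE_IN + m_tok_out * PRICE_OUT

# A heavy agent user burning ~250M input / ~15M output tokens a month:
print(f"${monthly_bill(250, 15):,.0f}")   # ~$4,875 at list price
```

List price tells us the bill, not the margin; $5k/month mostly tells us someone is burning an enormous number of tokens.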
mysteriouspants | 3 hours ago
Have we considered that they may be selling their API calls at a loss as well?
marcecoll | 2 hours ago
Yes, but I doubt it. Of course I don't have data to back it up, but comparing it with other strong models served via OpenRouter by providers that don't have Anthropic's very deep pockets, I doubt they are selling it at a loss, particularly with how expensive Opus is.
adrien | 8 hours ago
What about shared servers for that? If you have a small team (5 people), a machine could be shared and bring the price down quite a lot. You can imagine a whole range between on-machine and the huge current providers.
bityard | 4 hours ago
Shared hosting is a possibility for local AI, but so much of the underlying stuff is under such active development that nothing gets fully tested upon release, and basic stuff is constantly broken. New models that beat the previous ones are released all the time, so there's lots of upgrade churn. Also, since some models work better than others at certain tasks, you might want the ability to switch between models or load multiple at once, and they don't all work best with the same backends, libraries, inference engines, etc.
What I'm getting at here is that self-hosting LLMs can be a fun (and moderately productive, if expensive) hobby, but it's a full-time job for at least one person if you need to support a team that relies on them to do their jobs.
cesarandreu | 20 hours ago
I think we're still one or two major breakthroughs away from local AIs being fully competitive. There has been a lot of progress toward packing more intelligence into increasingly smaller models, but I think there's still a long way to go.

Unless you're running specialized workloads all the time or you truly need the privacy guarantees, it probably makes more economic sense to rent a GPU at the moment. I'm hopeful that hardware prices will eventually come down and we'll be able to run more advanced models locally, but I think most current hardware will end up outdated in a few years, so it's better to wait.
singpolyma | 18 hours ago
We don't need smaller models, we need cheaper VRAM 😉
symgryph | 19 hours ago
If you tune the models properly, I've had a great deal of success with 70, 80, and 120B models in mxfp4 format with reasonable KV-cache and context values. As long as it's a properly scoped non-dense (MoE) model, my Framework Desktop works quite nicely: sometimes 50 to 60 tokens per second on 30-40B models and 30 to 40 on the 120B. It all depends on how you quantize and how current you stay. The optimizations are quite amazing if you're willing to spend the time. And I do think local is the future: they're going to raise prices to the point where no one will be able to afford them, and local will become king.
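The KV-cache settings in particular are worth quantifying; here's a rough estimator (the architecture numbers are made-up placeholders for a mid-size model, not any specific one):

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, per token.
# Layer/head counts below are illustrative placeholders.
def kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128,
                 ctx_len=32_768, bytes_per_elem=2):   # 2 = fp16, 1 = 8-bit
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

print(f"fp16 cache:  {kv_cache_gib():.1f} GiB")                    # ~6 GiB
print(f"8-bit cache: {kv_cache_gib(bytes_per_elem=1):.1f} GiB")    # ~3 GiB
```

Quantizing the cache roughly halves the memory a long context eats, which is often the difference between a model fitting or not.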
singpolyma | 18 hours ago
What 120B are you running at 40 t/s on a Framework? I tried setting up the Qwen3.5 122B MoE on a Framework, and after the kernel settings to get the VRAM to make it fit, it runs at 22 t/s.
informal | 17 hours ago
Capital has more motive to push AI for building valuable products than for improving personal life for fun. I think local will always exist as side projects or personal interests, but it's hard for it to compete with commercial data centers because of their lower cost.
jrgtt | 5 hours ago
Open source models harming the future revenue prospects of proprietary ones by piggybacking on them, while the latter scrambles to find legal grounds to stop it, would be the ultimate form of poetic justice.
quad | 17 hours ago
Have any open-source models done better than almost matching the performance of existing closed-source frontier models? Because, in the described world, frontier models are still the source of all distillations.
And what does "dominate" mean? We have software on workstations and software in data centres right now; it's usually the requirements of the software that drive where it's run. Is the idea that servers won't need AI?
kevinc | 6 hours ago
Is Apple betting on local? I think they're continuing to bet on consumer products. To a consumer, AI is just a feature. What Apple is most likely to use AI for is enriching their existing products. That's worth adding a bit to their costs, not multiplying them.
Siri's not working out—OK, let's buy Gemini. And if that's too much cost, roll it into another subscription service and take some off the top.
Apple holds the keys to a significant user base who stick with them even as their software quality suffers. Their bottom line hasn't suffered; that's proof enough that no one is beating them on user experience. And they just shipped a wildly popular computer by deciding this was a great time for an all-time-low price, adjusted for inflation.
As for building their own models and selling inference: Apple leadership doesn't like to enter markets unless they think they can own a significant piece of them, they're not overly hooked on quarterly results to the exclusion of the long term, and they of all companies know the game of hype.
cajually | an hour ago
When the world is going AI-mad, Apple seem to have few options. They are mainly a hardware company with zero server market share. They are best at vertical scaling, while their competition is decades ahead in software and horizontal scaling.
As I see it, this (betting on local) and simply doing nothing are their only viable options at this point.
While not an Apple user in any capacity, I'm very grateful for the demand for local AI this creates.
bakkot | 5 hours ago
This conflates "open" with "local". It's true that the best open-weights models are only six months behind the frontier (if you trust benchmarks). But those models have hundreds of billions of parameters. You're not going to be running an unquantized 397B-parameter model locally any time soon, at least not without dropping five figures on hardware; the rough arithmetic below shows why.
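A quick sketch, counting weights only and ignoring the KV cache:

```python
# Memory needed just to hold 397B parameters at various precisions.
params = 397e9
for name, bytes_per in (("bf16", 2), ("q8", 1), ("q4", 0.5)):
    print(f"{name}: {params * bytes_per / 2**30:6.0f} GiB")
# bf16 ~739 GiB, q8 ~370 GiB, q4 ~185 GiB: even 4-bit needs server-class memory
```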
> Apple has been criticized for being "behind" on AI, but their bet appears to be: have competitors burn cash to train models, let advances propagate into open source models, and make devices good enough to run them.

No, they're just going to license frontier models from one of the big three. They've announced that Gemini will be powering Siri, for example.
> However, the most recent Macbook 4 pro Max looks to have made a leap in the size of model that's viable locally (data):

Presumably this was supposed to be 5, not 4? But the link to the data for the table, which claims an M5 can reasonably use a 134.9B model, points to this tweet, which only makes claims about running Llama 70B.
Yogthos | 5 hours ago
That's what I'm expecting. The progress on making models smaller and faster has been very rapid, and I fully expect we'll get to a point where you'll be able to run the equivalent of current frontier models on a local machine within a few years. On top of that, we're seeing ASIC chips being developed that implement the model in hardware. These could become similar to GPU cards you just plug into your computer.
The tech industry has gone through many cycles of going from mainframe to personal computer over the years. As new tech appears, it requires a huge amount of computing power to run initially. But over time people figure out how to optimize it, hardware matures, and it becomes possible to run this stuff locally. I don't see why this tech should be any different.
white-star | 25 minutes ago
Rick Beato just compared big data centers to recording studios. It was an interesting comparison. I think "big iron" will stick around for big needs, and he allows for that, as recording studios continue to exist, though in much smaller numbers:
https://www.youtube.com/watch?v=YTLnnoZPALI
timthelion | 9 hours ago
I personally think that the most likely scenario is unfortunately not local AI but specialized AI chips that turn those city-sized data centers into room-sized data centers. https://taalas.com/products/
It's like everyone is building factories with steam engines even though the electric motor has already been invented.
zladuric | 7 hours ago
About the price point: I'm not sure that even higher prices at lower output are going to drive people away from the established players and toward the adoption of smaller open models.
You can see time and time again that consumers are "happy" to have bad service and pay premium prices, as long as the market is cornered by a few strong actors. Even when they know that different and better ways to deal with it all exist.
Look at car markets, at OS markets, at healthcare, at many other things. The fact that better ways to do stuff exist doesn't matter.
I'm not saying those smaller specialised models will not be there, just that they're gonna have a hard time taking the big players off the throne, when most people don't even know that smaller and better models exist.