> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time.
I suspect they weren't as efficient as they could be with token use either. Sounds like they were trying to encourage non-developers to vibe code stuff
I'd argue you have a lot more to worry about with developers as far as token usage goes because they're the ones who know how to rig up these wild workflows where tens of agents simulate an entire software development team. The non-developers are probably going to be sticking more in the realm of iterating via chat.
2nd link doesn't work.
That would be a neat tool, to find the original article and see how many levels of AI summary it has gone through, a game of AI telephone!
I had thought about creating something like that for finding comments for articles. For a given article, display links to comments for HN, lobsters, reddit, etc. However, I feel I already waste too much time reading comments. I shouldn't make it easier and more tempting.
Man, maybe it's time for me to give The Verge a subscription. They're the only ones actually doing any journalism here, with a bunch of AI blogs skimming off the top.
Surely a company as large as Microsoft is actively attempting to build their own models. They couldn't possibly have expected to stake the future of their software development on the conditions of a third party company?
At one point there were rumours that they'd do that. They also have the rights to OpenAI models for a few more years still, so they could always use those, but apparently they're also compute starved (like everyone else).
MSFT does have a frontier AI Lab. My friend works there. I don’t know what they’re doing. But MSFT is one of like 5 entities that actually have the talent and physical infrastructure to compete in model-building.
Okay, but what if you're not Microsoft's size and don't have an R&D budget large enough to fund development of your own models and tools?
This is a warning to any company not building its own AI that AI-assisted development could become really expensive really fast and most likely won't pay off. What Microsoft's move suggests is either that the current price is too high (even though it's still not high enough for e.g. Anthropic to be profitable), or that AI coding tools are only as good as the developers using them. So you can't meaningfully do layoffs by replacing the developers with AIs, because the cost is too high.
How does Microsoft plan to fix Copilot so that its cost will be so much lower than Claude's that budget overruns won't be a problem for their own customers?
I expect in the next year or so, we'll stop seeing headlines like "Anthropic buys $15b of compute from SpaceX" and we'll start seeing headlines like "Uber's AI department licenses GPT 6.2 as the foundation for their internal model," or something like that.
Smaller companies will have departments that distill larger models into something more specifically manageable and useful for them. At least, that's my personal prediction :)
How would that help with pricing? The cost of hardware is already subsidized to hell and back by investors and that's not dropping costs enough. I'm not concerned about Uber, they are way too big. I'm thinking sub 1000 employees in total and maybe 50 - 100 people in the IT department. Are they just going to be cut off from AI tools, because the cost of running them would ruin the company?
I do think your prediction makes sense, because the AI really isn't the product, it needs to be baked into something and licensing the models saves you the R&D and cost of implementing your own.
> Smaller companies will have departments that distill larger models into something more specifically manageable and useful for them.
In order to do that they'd have to make a concrete business case to justify the headcount and compute costs. They'd be facing the same fundamental economic problems Anthropic, OpenAI, MSFT, etc are facing just at a department level instead of a megacorp level. I hope they try it, sunlight is the best disinfectant.
However, when the pressure is turned up and people have to actually show results--and, like, be accountable--instead of just buying a subscription and externalizing the accountability, I don't think we'll see so much enthusiasm about AI coding. Whether or not an engineer is actually more or less productive with AI (not merely whether they feel more productive) will begin to matter a lot more. I don't see how people continue using AI in this hypothetical small company under those adverse conditions.
The frontier model space costs 1000x as much to develop as the small language models, and is only 1.5 years ahead.
Factually, the frontier models have not paid for themselves. So, if you're MSFT and Apple, you don't need to run in a race where even the winner loses massively.
You can try to train models 1.5 years behind that are highly likely to be profitable, given your market position.
The average person is lagging behind what AI is capable of by 3+ years anyway...
So you can save 1000x on training and 10x on inference and just use SOTA small models.
Why spend $5B training a model that's for sure not going to make $5B (after inference costs) when you can spend $5M building one that WILL make far more than that after inference costs?
There's definitely a way to use Claude code that is token conscious.
I've tried throwing unsupervised agentic software factory workflows against the wall, and they burned through my tokens like nobody's business but didn't produce much.
Supervised, human-in-the-loop process on the other hand is much more productive but doesn't consume nearly as much. Maybe that's why everyone's pushing agentic approaches so much.
At the enterprise level though, it's going to be hard to want to use a service in which costs are not predictable, and keeping those costs under control requires employee training.
Yes, because in video games there is always a chance to win so you can optimize your strategy around that chance. If you have a 1% chance to drop a legendary weapon, the question becomes how do I manufacture 100 chances for a weapon drop in the shortest possible time. With agentic coding there is no such guaranteed chance - in a way it's worse than a slot machine that is guaranteed to pay out eventually. You could spend hundreds of millions of tokens and still not get what you asked for.
> If you have a 1% chance to drop a legendary weapon, the question becomes how do I manufacture 100 chances for a weapon drop in the shortest possible time.
Sidenote, but I hope everyone realizes that 100 is kind of arbitrary here and does not mean the total chance to get something is 100%.
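To put a number on it: if the drops are independent, the chance of at least one success in n tries is 1 - (1 - p)^n. A minimal Python sketch (the function names are mine, and independence between tries is an assumption):

```python
import math


def chance_of_at_least_one(p: float, tries: int) -> float:
    """Probability of at least one success in `tries` independent attempts."""
    return 1.0 - (1.0 - p) ** tries


def tries_needed(p: float, target: float) -> int:
    """Smallest number of independent attempts whose success chance reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # 100 tries at 1% each gives roughly 63%, not 100%.
    print(f"{chance_of_at_least_one(0.01, 100):.1%}")
    # Number of tries needed to be 95% sure of at least one drop.
    print(tries_needed(0.01, 0.95))
```

So 100 tries at 1% leaves you with better than a one-in-three chance of walking away empty-handed.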
You’re right, the ARPG analogy isn’t great, it’s too simplistic. I was trying to come up with something heavily stochastic where people are coming up with strategies to get the odds in their favor. Maybe closer to speculating on the real estate market? But even that feels too simplistic compared to LLMs. Even the definition of a win isn’t well defined.
Actually it’s really its own thing, I don’t think the slot machine analogy works too well, you also have fixed odds (and you know they aren’t in your favor), and a binary output
The analogy to slot machine is that you're spending your own resources in hope of a reward. So you're ultimately bound by your resources and your strategy doesn't count for much in the grand scheme of things.
With employees, there are a lot of punishments in place that make people not want to mess up. Loss of wages and reputation, prison time,... Startups do not fail because they have a bug-ridden product, they fail because of the market.
With AI, all bets are off. They're not aligned with your goals and it's very hard to discern when they go off unless you're an expert. And if you are one, at best it's just a slight boost in typing, especially with all the work involved in software development.
There have also been winners of a slot machine gamba, so the analogy quite holds. I would even argue that there are considerably more slot machine gamba winners than the real world examples of actual LLM work.
There’s actually been a ton of research on how to optimize “slot machines,” at least in a generalized sense. For more reading, check out the literature on multi armed bandits.
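For a flavor of that literature, here is a minimal epsilon-greedy sketch in Python, one of the simplest bandit strategies. Everything here is illustrative: the payout probabilities are made up, and the model assumes each arm has fixed odds, which is exactly the assumption other commenters argue does not hold for agentic coding.

```python
import random


def epsilon_greedy(payout_probs, pulls, epsilon=0.1, seed=0):
    """Play a simulated Bernoulli bandit; return (pull counts, estimated payout rates)."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    estimates = [0.0] * len(payout_probs)
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(payout_probs))
        else:
            # Exploit: pull the arm with the best estimate so far.
            arm = max(range(len(payout_probs)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the arm's estimated payout rate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


if __name__ == "__main__":
    counts, estimates = epsilon_greedy([0.1, 0.5, 0.9], pulls=5000)
    print(counts, [round(e, 2) for e in estimates])
```

The strategy spends a small fraction of pulls exploring and the rest on whichever arm currently looks best, so most of the budget ends up on the highest-paying arm.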
Odd, I train teams (at large companies) to use harnesses effectively. So some training does exist.
I get the anti/skeptic sentiment. I've been called a lot of horrible things by a vocal contingent when they hear that I help train folks to learn software engineering best practices and then apply AI to that.
To be fair, the cost of software development has always been fairly unpredictable. What may be different is that the cost used to be roughly proportional to man-hours spent, while now the number of agents running in parallel may be less predictable.
> To be fair, the cost of software development has always been fairly unpredictable.
Yes, but in a "oops this is gonna take another two months to finish" kind of way, not the "oops this is the 12th time this month 8 developers have burned $2K in tokens in a single day and no one really knows how it happened" kind of way.
A belt-loaded spinwheel machine gun, where there is some chance the next bullet is a dummy round or goes in the wrong direction. And every time you reload, a new soldier is in charge of the gun.
You don't need that analogy, as the normal use of an automatic gun in war is not to kill, it is to suppress: stop the enemy from moving. If you are hit by a gun in automatic mode it is your own stupid fault. When you want to kill someone you switch to one shot or maybe 3 round bursts.
The only important thing to know is they're loud and thus impossible to conceal. Maybe someone out in the open would be hit if one starts firing unexpectedly. However, in the vast majority of cases, the purpose of an automatic gun in war is just to put out an endless stream of bullets so that everyone on the other side knows it's stupid to show their head. This allows your own people to move in relative safety, because they know where your automatic weapon isn't going to be firing and the enemy, of course, isn't going to be able to respond.
The cost per month is 100% known and always has been. What has been variable is the rate of delivery. AI is different: token spend is a variable cost, and it can be substantial relative to salaries in countries with lower wages.
i've worked at so many places where the propaganda/marketing and reality on the ground is so disorienting/shocking i don't really expect this to be any different...
2 months ago: no limits. 1 month ago we had a leaderboard for whoever had the highest token spend not taking into account what was actually produced. This week: “everyone is using opus too much, just use it for planning.”
since those headlines started i've felt it just encouraged inefficiency. "say as much as you can without saying anything." if you were accomplishing your task the need for more would end, thus there is incentive to never succeed.
So your costs scale with the number of users you have.
That's an opex that you can explain.
For tokens for developers it's maybe closer, cost/outcome wise, to hiring an external consulting company to write your code: money paid scales with work done, no promise of delivery, arbitrary unpredictable external price changes.
It's not quite the same, though it is similarly lucrative for consultants.
You can put a limit on token spend and provide training (and even pre-configured workflows) on how to limit token spend.
Like the other commenter said: cloud spend can also spin out of control if you don't pay attention, yet we've found ways to keep it under control (training, guardrails, limits, transparency).
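As a rough illustration of what a limit plus guardrail could look like, here is a small Python sketch. All names, thresholds, and prices are hypothetical and not any provider's actual API or pricing; a real setup would hook into whatever usage reporting the provider exposes.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request would push spend past the hard limit."""


@dataclass
class TokenBudget:
    limit_usd: float            # hard monthly cap (hypothetical figure set by the team)
    warn_fraction: float = 0.8  # soft threshold that triggers a warning
    spent_usd: float = 0.0

    def charge(self, tokens: int, usd_per_million: float) -> str:
        """Record a request's cost; return "ok" or "warn", or raise at the cap."""
        cost = tokens / 1_000_000 * usd_per_million
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"request of ${cost:.2f} exceeds remaining "
                f"${self.limit_usd - self.spent_usd:.2f}"
            )
        self.spent_usd += cost
        return "warn" if self.spent_usd >= self.warn_fraction * self.limit_usd else "ok"
```

The soft threshold is the transparency part: people hear about their spend before the hard cap stops them mid-task.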
The problem that I see is what you do if someone runs out of tokens. It doesn't very well work to say "well I guess you just get fired because you can't work at full speed for the rest of the month".
Personally, this feels like it's just trying to push the work of managers in allocating resources onto developers so that they have more work to do and can be blamed if anything goes wrong.
My experience as well... I've only hit Anthropic's 5hr threshold a few times, and two of them were within a half hour of the window. Also, all three times I'd already accomplished a LOT.
I tend to work with the agent, and observe what's going on as well as review/test and work through results/changes. I spend a lot more time planning tasks/features than the execution, even using the agent as part of planning and pre-documentation. It works really well. I don't think people burning through the 5hr allotment in under an hour are actually reviewing/QC/QA the results of what they're doing in any meaningful way, and likely producing as much garbage as good (slop).
I'm really curious as to HOW the MS employees were using the agents as much as what they were doing.
I suspect subscription limits are quite a bit higher than the equivalent tokens their dollar cost could purchase. I similarly feel like I can get a lot done with a $20/mo Claude Pro subscription, but also can easily spend $10-20/day at API pricing with similar usage.
Personally I prefer the API pricing because I feel like I'm not going to get rug pulled on my work. When it comes to personal stuff, I use the shit out of my sub, but it's not making me money.
I’ve made the same argument on here. Paying the full price (should!) make you consider your usage, pick the right model, delegate to cheaper/local providers, … It makes you use the models the way they’re going to be used after the subsidy ends.
Depends on what you're optimizing for. I'd hope that "after the subsidy ends", the "cheaper/local providers" will be at the level of at least current SOTA models. If not, then there's hardly a point using them anyway; if yes, then by sticking to subscription workflow you'll be learning the very workflow you'll be using "after subsidy ends".
Either way, I don't see much point of intentional austerity in times of extreme growth. There will be time for austerity once the growth ends.
Terms of service prohibit subscriptions for employees of companies bigger than X people. I suppose they could all sign up as individuals and try to get away with it but presumably that would look pretty obvious with a tiny bit of analytics.
Because with Max subscriptions, you have to use the Claude Agent SDK, which is basically running Claude Code underneath. You don't get to use the chat/Messages APIs with personal subscriptions, for that you need the API pricing.
The current thinking is that automated agents are what turn this from an industry in the tens of billions into a multi-trillion-dollar one. So yes, you are right on the money: agents stimulate demand for this thing they've built.
AI is expanding to meet the needs of expanding AI. Why worry about jobs? AI will provide plenty of work. If anything, I worry we'll be working more, not less. All that AI will need someone to vouch for it and to scapegoat when it makes mistakes.
A 98.6% cache hit rate doesn't distinguish an efficient workflow from an overly chatty linear agent repeatedly reusing the same context. Plus, it says nothing directly about whether the process makes useful progress per token.
I doubt it, the difference between someone slightly inefficient and someone extremely efficient isn't big enough to matter compared to how much they cost in salary.
You pay for cache hits on every turn, and even with the newest architectures longer context is slower and more energy intensive. Constructing concise turns that reuse the prefix and stop when the new context is no longer useful helps, as does pushing generation down into cheaper models while using stronger models for verification.
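A back-of-the-envelope Python sketch of why cached reads still add up in a long linear session. The prices are placeholders I made up for illustration (cached input at a tenth of fresh input), not any provider's actual rates, and output tokens are folded into the per-turn figure for simplicity.

```python
def session_cost(turns, tokens_per_turn, usd_per_mtok_cached, usd_per_mtok_fresh):
    """Cost of a linear agent session where every turn re-reads the full prior context.

    Returns (cached_tokens_read, total_usd).
    """
    cached_read = 0
    fresh = 0
    context = 0
    for _ in range(turns):
        cached_read += context    # whole prefix is served from cache, but still billed
        fresh += tokens_per_turn  # new tokens this turn are billed at the full rate
        context += tokens_per_turn
    usd = (cached_read * usd_per_mtok_cached + fresh * usd_per_mtok_fresh) / 1_000_000
    return cached_read, usd


if __name__ == "__main__":
    # Hypothetical prices: $0.30 per million cached tokens, $3.00 per million fresh tokens.
    print(session_cost(turns=200, tokens_per_turn=2000,
                       usd_per_mtok_cached=0.3, usd_per_mtok_fresh=3.0))
```

Under these assumptions, a 200-turn session has a cache hit rate around 99%, yet the cached reads are roughly 90% of the bill, because the number of cached tokens read grows quadratically with the number of turns.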
> There's definitely a way to use Claude code that is token conscious.
Colleague used Sonnet 4.6 on some pretty normal agentic coding tasks through AWS Bedrock to keep the data in the EU, 100 EUR usage in a single day. In comparison, the Mistral subscription costs about 20 EUR per month and we tested that for similar tasks it was okay, the usage got to around 10% of that monthly limit in a single day. Or Anthropic's own Max (5x) plan where you get way, way more tokens to do with as you please.
I feel like the sweet spot is having a monthly subscription with any of the providers (you're subsidized a bunch), but if you have to pay per tokens, now I'd just look in the direction of what tasks DeepSeek would be okay for, sadly probably not in the situation above. For a startup, though...
On the other hand, this feels a bit hypocritical:
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time, and sources tell me that Claude Code has proved very popular inside Microsoft over the past six months.
They're gonna say that the future is all AI... until they get the bill.
I was trying to get a better sense of the time cost quality matrix of these, so I threw together a quick eval of Sonnet 4.6, Mistral's dev model, and Opus 4.7 (figuring it's what you'd use if you were on Max).
The results for a function implementation and test of levenshtein distance in js are pretty similar but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.
The one detail I did forget to mention is that if anyone goes with the Mistral subscription (instead of paying per-token), then the Mistral Vibe tool gives you their Medium 3.5 model by default, with a 200k token context. It will probably be enough for plenty of tasks, though there's also a noticeable difference between that and up to 1M.
Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. The kind of problem where even small/bad models can excel. The golden standard for those tasks is just "use a library" so no wonder the beefy models are expensive: you're chartering a commercial airplane to go grocery shopping.
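For a sense of how small the task is, here is a standard dynamic-programming version in Python (the eval above used JS; this is just my sketch of the textbook algorithm, not the code any of the models produced):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```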
My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.
There's where even frontier models struggle, which makes comparisons meaningful.
It’s making guesses not decisions, framing as decisions will lead you astray to wasted time and tokens.
It’s vaguely productive to tell them a ton of relevant info upfront attempting to minimise their need for load bearing guesses. I say vaguely because obedience is generally only around the level where it's good enough to lull you into a false sense of security, not to actually be obedient.
It’s a bit more productive to use the various loop mechanisms (hooks, /goal etc) to evaluate each end of turn against guard rails and reject with clear instruction on what’s unacceptable. Obviously if you only do this without the front load of info then you’re likely to spend more tokens to reach a satisfactory end of iteration.
While you are correct that something like Antigravity 2 + Opus 4.6 can handle large scale software engineering tasks, I would argue that it is usually (but not always) better "coding agent hygiene" to work on smaller code modules and as the human in the loop be a partner, not someone who prompts and then disengages.
Breaking code up into composable chunks has worked well for me over 50+ years as a professional software developer, and I can't get away from the idea that it is still usually the way to go using agentic coding tools.
I was a Mistral Le Chat Pro subscriber (the €20/month plan). Yesterday I hit my monthly limit. Switching to PAYG I burned through another €40 in one evening, working on the same project, with the same tasks.
I upgraded my plan last night to Mistral Le Chat Teams. This now costs me €60 per month for two users. Limits have been reset, but I have no idea now if my per seat limit is higher than the Pro plan, or if the limit is shared between the seats, it’s really not clear. I guess I will find out next month. The limits reset on the first of the month and I really hope I don’t hit them in the next seven days.
I use Mistral Vibe CLI and I’ve written and implemented a couple of new skills[1]. Caveman, based on an idea I found online somewhere, this skill removes all extraneous response text, including articles. Makes for some fun reading, but supposedly reduces output tokens significantly. Hash-anchors, this one is based on a concept from Dirac[2], reduces search failures and also includes multi-file dispatch. It will be hard to measure, but Vibe tells me these two should result in roughly a 40% reduction in token burn.
I think it's great. People at a broad scale are getting first hand experience with resource management. It's a fairly cheap way of doing it too (in contrast to: learning this by managing humans) and we can all benefit from the skill transfer.
I find myself observing how my lead manages meetings ... "ah, this is like when I do that with Claude", "this is where he wants to understand what happened, like when I ask Claude" ... it's funny.
Yeah. Claude does good work but reviewing it all properly takes quite a bit of time. It got to the point I started having trouble maxing out my weekly allocation.
Dealt with that by going all out and making an agentic parallel code review skill. Basically an infinite TODO list generator. Now I'm definitely getting 100% of the usage I paid for. It really burns tokens like nobody's business, and catches a lot of issues while at it. I've been looping this review/fix process every week. It's dramatically reduced the amount of stuff I need to pay attention to during my human review sessions.
I’m interested in how this works in practise - I guess you’ve written a skill to do code review, then your Claude.md file tells it to use it after every change as a bg task? So does this work as a background task while Claude is working on the next ‘feature’?
There are many "critics", one for each quality I want reviewed. Correctness, consistency, maintainability, security, testing... Everything I could think of, and I keep adding more.
The scrutinize skill is the entry point. The Opus I'm talking to becomes an agent coordinator. He explores and autodiscovers the project's structure, subdivides it into logical sections.
Then he runs a truly absurd critic x section matrix against the entire project. Literally hundreds of these agents running in parallel, each focusing on one area. Ten minutes of this is enough to exhaust my Max 5x five hour window and put a serious dent in the weekly usage numbers.
It literally takes days to run a full agent sweep. I designed it around the rate limiting. The agents do file system style journaling in order to resume cleanly. They commit all of their findings as they go into an orphan branch in the repository. Further review runs can build on it and avoid searching for known issues.
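A rough Python sketch of the general shape, for anyone curious: a critic x section job matrix plus an append-only journal so an interrupted sweep can resume. This is not the actual skill. The function names and the JSONL file are my own assumptions, and a plain local file stands in for the orphan branch described above.

```python
import itertools
import json
from pathlib import Path


def pending_jobs(critics, sections, journal_path):
    """Yield (critic, section) pairs not yet recorded in the journal."""
    done = set()
    path = Path(journal_path)
    if path.exists():
        for line in path.read_text().splitlines():
            entry = json.loads(line)
            done.add((entry["critic"], entry["section"]))
    # Full matrix: every critic reviews every section.
    for job in itertools.product(critics, sections):
        if job not in done:
            yield job


def record(journal_path, critic, section, findings):
    """Append one finished job to the journal so a later run can skip it."""
    entry = {"critic": critic, "section": section, "findings": findings}
    with open(journal_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

After a rate-limit interruption, calling pending_jobs again with the same journal only yields the pairs that have not been recorded yet.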
The way it works in practice is I just run /scrutinize sweep and then go work on something else, or just go do my actual job, live my life, play video games, write an article for my blog or something. Come back five hours later to either resume the process or check the literally hundreds of issues that have been found by all the agents. Then Claude and myself will go in and evaluate and fix all of those issues one by one. Then review again. Then evaluate/fix again. I'm just gonna keep looping this over and over until zero issues are found. For all of my projects.
Going from solo hobbyist programmer to this was pretty insane. I can only imagine what these corporations with infinite money must be doing.
I would advise against it, depending on the project.
My lone lisp project gets the most love. I spend weeks reading, reviewing, restructuring and rewriting everything. It's the project where I'm concentrating all my efforts. Everything I push to master is absolutely my own work and I do want everyone to read it.
I had no trouble letting Claude take over maintenance of my static site generator and virtual machine orchestration scripts though. I wanted to care but... I didn't. I did glance over the finished product just to ensure it wasn't going to nuke my laptop the second it ran, but that's pretty much the extent of it.
You're welcome. Email me when you discover your own path? I could learn a thing or two as well. I'm pretty much a beginner when it comes to this stuff. Subscribed like two months ago.
Those critic skills are great. I see a real business opportunity for someone who can bundle everything you describe into a turnkey solution for a programmer like me who doesn't want to take months coming up with their own system and extensive .md files.
I'm currently, very painfully, removing a tiny bit of tech debt at a time from a massively complex project that we inherited from a 3rd-party vendor. Some of the tech debt is AI-related, some because it's a vendor who rarely has to maintain anything they create, some because when we first inherited it we had no grasp on the entire codebase and were just trying to change the plane wheels while flying (we still are).
What I'm doing now is the hardest kind of programming imo. I spend hours/week just meditating on how to chip away at this out-of-control codebase, figuring out how I can surgically remove some leaky abstraction that's spawned 5 cousins w/o disrupting the whole project. I'd be fascinated to see if the latest frontier model with a system like yours can actually help me. But I don't have the time or desire to invest the months of trial and error that I'm sure it took you to get to that point.
I used Claude Code's /insights function. It gets Claude himself to go over your sessions and usage patterns. It'll produce an HTML report that you can view.
In my case Claude saw that code review was my main activity and that I was manually and repeatedly asking claude to "review X, Y, Z..." so he suggested turning it into a skill. So I fired up the superpowers:brainstorming skill and bikeshedded it until I ended up with this heavy duty massively parallel super reviewing super claude. Refined it a bit after a couple weeks of use and the result is what you see in my repository.
I really don't like how the payment plans work with the providers right now. I feel this pressure to use all my tokens for the week, often just "wasting" them. But also, I want to take advantage of the subsidized tokens in Claude Code and Codex for as long as I can.
There is this real danger that our thinking, and the things we make, become bloated without constraints.
IMO software has gone to shit since both mobile phones and laptops mostly have massive amounts of compute. We always seem to use it to the limit, just because it's there.
It's the gacha of software development. We've got periodically resetting timers. Prompting is like booster packs: we have a finite number of dice rolls before the timer resets. We might even get a Legendary Ultra Rare pull whenever Claude happens to be feeling extra motivated. Before you even know what's happening, it's hijacked the brain's reward circuits to the point you're waking up at 3 AM because that's when the timer resets. Gotta saturate those timers with pulls and minmax everything in sight.
At least it's doing something productive instead of just sinking money into literal gambling simulators. Mercifully, unlike video games, automation is not "cheating".
That will be true in March 2027, but isn’t necessarily true today. I work in a large organization that is grandfathered into the Enterprise Subscription plan until then. We have thousands of seats using Claude Code.
The way coding agents work is fantastically wasteful. All the megabytes of code are processed over and over and over, sometimes within just one session.
There are papers describing KV cache precomputation for commonly used documents (e.g. KVLink), but, of course, it's not a priority for model providers: they'd rather sell you more tokens, also they would rather get to AGI/ASI first than optimize usage of existing models...
I believe OP is talking about new sessions or after compaction. He’s getting at the fact that LLMs are stateless and have to rediscover your codebase on every new session.
I meant caching on a bigger level. If you're an organization with 100 developers each doing 10 sessions a day, you're paying for 1000x the tokens in frequently used documents even if you had 100% KV cache hits within one session. Apparently that's too costly even for companies with trillion dollar market cap...
Normally KV cache works only if your context prefix is identical, but there are papers which demonstrate documents can be cached between different contexts.
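A toy Python illustration of the identical-prefix constraint. Token lists are stand-ins for real tokenized contexts and the function name is hypothetical; this models ordinary prefix caching only, not the cross-context techniques from the papers.

```python
def reusable_prefix_tokens(cached_context, new_context):
    """Number of leading tokens the two contexts share (what a prefix cache can reuse)."""
    shared = 0
    for old, new in zip(cached_context, new_context):
        if old != new:
            break
        shared += 1
    return shared


if __name__ == "__main__":
    doc = ["doc"] * 5
    # Same document, different leading token: nothing is reusable.
    print(reusable_prefix_tokens(["sys_a"] + doc, ["sys_b"] + doc))
    # Same prefix, different question at the end: everything before it is reusable.
    print(reusable_prefix_tokens(["sys"] + doc + ["q1"], ["sys"] + doc + ["q2"]))
```

The first case is the cross-session situation: the shared document is byte-for-byte the same, but because something earlier in the context differs, a plain prefix cache cannot reuse it.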
I think so too, otherwise why wouldn't you put that (purported) increased capacity/output into improving your existing products or creating new ones, with the headcount that you already have?
My experience is that Claude Code burns way more tokens than other agents, probably to ensure high levels of perceived quality, which is, most of the time, not worth the bloat for the user. The bloat works for Anthropic as an advertisement at the cost of your tokens.
it's kind of weird tho, jensen also said we should be burning tons of tokens as well... 'perceived quality' can't be the only reason these ceos are pushing token usage so hard, can it?
1. right now, usage correlates with experimentation and learning; few, if any, know how to make these things effective on their own over long sessions of activity
2. long term, you should be using more than one agent at a time, because they are running in the background based on events (new direct message / something happened in eg. github)
Lots of these places measure employee token use with managers having dashboards. It seems like performative code production rather than making anything useful.
Microsoft poorly manages token use of most expensive models in a pilot. Then they use that failure to advertise/position their own Github Copilot agents to procurement teams, over the now widely validated Claude Code-based agents.
At least Codex is trying to win validation on merit.
"incentivize to use as many tokens as possible" = "Upper management knows people don't like change so we are forcing them to come up with ways to use this thing". It does not mean that management will encourage wastefulness in the future, and it also doesn't mean that token usage from now won't be reviewed in the future. What's to stop them from dinging your performance in November because you wasted a hundred thousand on tokens with nothing to show for it?
Makes sense why Anthropic wants to IPO as soon as possible as the growth right now comes from temporary wastefulness. Makes all the investments more risky.
I launched an internal demo of Claude Code and Deepseek on the same day and we burned through our monthly allowance for Claude in just over a week, with more than half of that budget being spent in one day. With DS people are unable to go through that same amount of money in a month, not even close.
With that Claude feels like an expensive toy, while DS is a shovel, purely because developers do not feel like they are eating into a precious resource while using it. Also it does not feel like there is much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very-very capable.
Gemini got a big reduction in usage limits this week. There was backlash and they added 3x usage for Antigravity a day later but I haven't really tried it out to get a feel for it yet.
Google rug pulled Code Assist and Gemini CLI. They're moving everything to Antigravity and we would need to reinstall all our tooling, reconfigure any automations, and the mechanism to subscribe via GCP is much clunkier.
This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at Next.
And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.
After 2 weeks of Claude getting progressively worse and worse, today was the final straw.
I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".
I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.
I've seen a lot of this sentiment over the previous six months from people on reddit. I have yet to experience this myself as a developer with over 20 years of experience.
It’s because it’s not true, there’s no evidence for it that passes the sniff test. No lab is “shipping a worse model once they’ve got you”. People have a bad few days and blame the model providers instead of stepping back to fix their workflow.
What it does seem like is that they're tuning some knobs up and down or releasing new versions of models or system prompts that result in the model getting dumber and smarter in waves.
Opus has been dumb this week.
Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.
It could also just be luck and my impressions are false... who knows.
Opus 4.7 has been a real downgrade for me. I’m back to mid 2025 when I had to catch all the completely intermediary goals/assumptions the model is creating for itself
My "mental scratchpad" needs to be as sharp as possible to maximize my intelligence. I think of the LLM as a scratchpad for my thinking, I hope the Anthropic team can see this.
it's sort of good at thinking, writing specs, etc.. Also debugging. But as a coder: I see no advantage to opus 4.6 and I preferred sonnet most times already over opus 4.6.
Oh Opus is nerfed sure, but not that hard. Early this year Opus 4.6 could understand your prompt and your intention easily; it got worse around mid-April. Opus 4.7 is even worse than that.
However that's just it: you just need to improve your prompt and make it clearer, and it will perform just as well.
This account is an LLM-hype peddler, shilling for Anthropic (check comment history). If they say that Claude is not nerfed, then most likely it is, in fact, nerfed.
I wouldn't call correcting misinformation and FUD "peddling hype" or "shilling" but I suppose we are in a post-truth world, where if you push back against the anti-AI emotions and vibes with grounded facts, you must be a shill.
Anyways, please take your discourse of calling people you disagree with "shills" back to Reddit. I'd much rather engage with someone debating the merits of an argument.
If you are an LLM-hype peddler, you really should not be offended at being called out. Also, this is the merit you are ostensibly looking for — since you are a shill, everyone should know this first before taking your words seriously.
You should also check your LLM prompt for HN comments, because the original comment you replied to was not anti-AI, and, in fact, very much pro-AI. The only criticism it had was about model being degraded, so they could not go as hard at AI-assisted development anymore as they used to before. I guess it's a bit difficult for LLMs to spot the difference and make proper conclusion for now.
Also even if taking you seriously — how does writing "no, model performance is not degraded because I say so" serve as correcting misinformation? It only does if you are shilling for Anthropic (which you do), otherwise it's just hot air.
Not offended at all, but just ranting about how someone is a shill instead of responding to the substance of their argument is simply not the kind of discussion we have on HN. Read the guidelines.
> "no, model performance is not degraded because I say so" serve as correcting misinformation?
Because zero evidence has been provided other than feelings. That is not evidence of degradation, and we know they don't serve quants.
As always, I think this happens more to vibe coders. They don't understand that a bigger project means worse AI performance. On top of that, Opus feels nerfed at understanding prompts, so if your spec is bad you won't get a good result.
I see a lot of the "4.7 is a downgrade" sentiment. 4.7 does (mostly) what you ask it to do. 4.6 does what it thinks it should do. As someone with 20 years writing my own code I want the former, but the loud contingent online wants the latter.
When you're on a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.
I can tell you for a fact, Claude 4.7 was NOT doing what I told it to do (in fact the clear and complete opposite - repeatedly), a pretty simple architectural refactor, and that Codex did better and DeepSeek much better.
It was given very simple ways to verify success. It simply didn't do that and said it's at a good stopping point, despite moving in the WRONG direction not even doing 1% of the task, and being told to see the task through to completion.
Meanwhile, Codex broke it down into 3 steps and just got it done...
No, "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."
Claude worked on it for almost 200 commits over 2 weeks, needing to typically prompt it 3x to even TRY to make any progress instead of just wasting tokens to ignore me and tell me how big and risky it is.
Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.
Tell it to make changes, amend the commit, push --force-with-lease.
I'm attempting to make a memory safe language like Rust but with a substantially lower learning curve and added safety (but non-zero cost abstractions) fully with AI, almost entirely from my phone, commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.
Mostly to test how much LLMs can actually scale development.
Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.
From a purely cost basis perspective, it's hard to argue they aren't killing it.
But from a multiplier perspective, it's up in the air how great they are.
It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.
So at the self hosting phase, I get a great opportunity to see if the language can actually deliver on what I dream for.
#1 -> part of scaling is you can't review every single line of code.
LLMs don't really scale if you're still the bottleneck, or they only scale as much as you reviewing every line of code - that's not that much scaling...
So I try to only review certain parts, like making sure they aren't changing tests to allow architecturally broken code to slip through (because they regularly try, even when given explicit instructions not to). Or if I'm watching them make changes on my phone and see that they are clearly doing the exact opposite of what they're supposed to be doing (regularly if I'm watching).
#2 -> if commits are small, GitHub's setup is good enough that you can review code on your phone.
#3 -> if they're huge, I can just review on my laptop at lunch or something.
Theoretically, all of this can be solved easily with orchestration and require minimal oversight.
If you're using LLMs to write code and you're carefully reviewing every line with a jade-handled magnifying glass, you're not really scaling - at least to the degree I'm interested in.
> LLMs don't really scale if you're still the bottlneck
This only works if there are no consequences when your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.
I think what's funny is that employees were most likely already covering the cost for these tools because they are useful. Companies didn't believe employees were using these tools and now have forced their usage and no longer have the costs subsidized.
Similarly companies seem to reward high token usage as a sign of someone willing to play ball with AI and again have forced higher costs on themselves for people reward hacking or using tokens out of spite.
There is no world where I can put my company’s data through an external site without their express consent and security sign off. I suspect at most companies there’s zero path for people to have been paying for it themselves.
This isn't possible at any of the 5 places I have worked, but they are also all in highly regulated industries. Firewalls block virtually everything by default.
Fair, but I assume everything on my work laptop is key logged. Surely they would notice Claude phoning home from my company laptop? I suspect a network rule to look for that traffic is trivial?
My employer doesn't specifically block this stuff, but does put up a warning when you visit it to review our AI usage policy. There isn't detection for using things in ways we shouldn't, but they have an audit trail and can review it if there is suspicion.
I switched from Anthropic to OpenAI after spending ~$40K in equivalent token costs using Claude over 3 months.
I found Opus 4.7 to be slow and wasteful with token usage. It's shocking how inefficient it is with tasks like bash tool usage and web searching, delegating them to a dozen subagents only to get stuck and never return until you esc and intervene. That, in addition to all of the broken tooling Anthropic built in to limit token usage like the broken monitoring tool made managing Claude a chore. I was happy to pay $200/month for Opus 4.5 when they had more capacity, but 4.7 felt like a huge step back and no longer worth the price and inconvenience.
I remember an OpenAI employee comment on the GPT5.5 release post about how they specifically geared it towards long-horizon tasks, and it's been a breath of fresh air in that regard. I have five two-week long sessions going right now and there's been no degradation in performance or efficiency. It's much better at carrying rules/learnings forward even in long-running sessions and grounding/refreshing itself in verified facts when it loses context.
It's funny because in two weeks I've gotten way more done with GPT5.5 with way fewer tokens and way less handholding. I think this goes to show how important tooling and the harness are and how a capable model like Opus 4.7 can be severely handicapped by bad product decisions.
Being able to manage context over long-running sessions is a function of the harness, not the model. Are you using Claude Code with GPT5.5? Codex? piclaw? They’ll all have different context management strategies to let you keep going when you would otherwise have filled up context and be forced to stop.
It doesn’t matter how good the harness is if the model does a bad job of planning and continuing from long context. A good harness cannot overcome a weak model.
This does kind of beg the question: If developers are being laid off because AI is better/faster/cheaper or makes all their people 10x or whatever fig leaf, what happens if the required tooling ends up being more expensive? From the investor’s point of view is the drag of employee costs better or worse than a ballooning expense item?
There is no profit, expense, revenue. Those don't matter. Only thing that matters is stock price goes up, and laying off makes stock price go up. When laying off make stock price go down, then laying off stop.
I suspect AI would have to get drastically more expensive before it starts looking worse than payroll. If one developer using Claude Code can effectively substitute for 2 developers, you are already coming out ahead at current API pricing: assuming very heavy usage, your cost is going to be ~1.5x a developer's (factoring in costs beyond salary: benefits, PTO, the other overhead that comes with having employees).
So you're getting 2 for the price of 1.5. Scale that up to 500 devs at a big company and it's a big chunk of change saved on payroll.
If you instead kept your headcount or hired humans, AI would have to cost upwards of $15k/month/developer before it costs more than hiring. You're looking at about 4 billion tokens per month before humans break even or are cheaper.
True, that was more hypothetical if it got good enough to 2x.
But even taking a more realistic 1.25x (20% time savings) gain, let's say you drop from 500 to 400 devs, you'd have to hit around $4,000/dev/month in token spend before hiring humans again would break even.
Payroll is just expensive, in most companies it's by far the biggest expense. AI still has to cost drastically more before investors would call it out as being worse than increasing headcount, from a pure dollars perspective.
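The break-even figures in this sub-thread can be reproduced with a few lines of arithmetic. The sketch below is not the commenter's own calculation: it assumes a fully loaded cost of $16,000 per developer per month (a hypothetical figure, chosen because it roughly matches the thresholds quoted above) and that total output stays constant after the headcount reduction. The function name `break_even_ai_spend` is made up for illustration.

```python
def break_even_ai_spend(devs_before: int, devs_after: int,
                        monthly_cost_per_dev: float) -> float:
    """Monthly AI spend per remaining developer that exactly cancels payroll savings."""
    if not 0 < devs_after <= devs_before:
        raise ValueError("devs_after must be between 1 and devs_before")
    # Payroll no longer paid each month after the reduction.
    payroll_saved = (devs_before - devs_after) * monthly_cost_per_dev
    # Spread that saving across the developers who remain and use the tools.
    return payroll_saved / devs_after


# Assumption: fully loaded monthly cost per developer (salary plus overhead).
LOADED_COST = 16_000

two_x = break_even_ai_spend(500, 250, LOADED_COST)    # one dev does the work of two
modest = break_even_ai_spend(500, 400, LOADED_COST)   # 1.25x, i.e. 500 -> 400 devs

print(f"2x scenario:    ${two_x:,.0f} per dev per month")
print(f"1.25x scenario: ${modest:,.0f} per dev per month")
```

Under that assumed loaded cost, the 2x scenario breaks even at $16,000 per remaining developer per month and the 1.25x scenario at $4,000, in line with the "upwards of $15k" and "around $4,000" figures quoted. Taken together, the $15k and 4 billion token figures imply a blended price of about $3.75 per million tokens.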
I suppose if it all works out it'll end up way more expensive than the employees the models displaced ever were. These kinds of technologies usually end up as an oligopoly at best, and those players will have a wide moat by then, and the things these models build will be tweaked such that no other model or human being can realistically work on them anymore, and then they can price gouge everyone to the brink of unprofitability.
The model provider would be like a union, at least if unions had absolute control over their members, could take them all away at any time forever with no substantial negative consequences to itself, and spend billions on employer lock-in so switching to the competition is worse than paying the 12% model salary raise.
Because they are not people or alive, you can literally torture them if it gives you a mild increase in performance. For all practical purposes you can't do that to living humans. What is the price to put on being able to do that? It might weight the scales a bit for some employers.
"AI" is just a cover for laying ppl off and saving cost. But the pendulum will swing the other way and the companies will realise that knowledgeable ppl are still required to generate and utilize the generated code. No serious company can run with vibe-coded apps generated by laymen.
> If developers are being laid off because AI is better/faster/cheaper
This is, in my opinion, tripe. SWEs are being laid off because of post-Covid over-hiring. The only evidence for labour destruction is in junior hires. But not because anyone is being fired, but because entry-level jobs are being cannibalised.
In general, the economy outside the stock market is looking less and less great. The answer to this is to tighten the belt, and that means losing employees. Especially as there has not been any new great revenue sources outside AI in recent years.
> Especially as there has not been any new great revenue sources outside AI in recent years.
Nobody can make a profit with AI. Any clever idea can be cloned with AI, competition makes it unprofitable. No moat, no arbitrage opportunity. "During the gold rush, the only people making money were the men selling shovels."
We can definitely do amazing things with AI, and it makes us have superpowers, but so does everyone else. My competition also uses AI. I have to keep up with an AI powered competition now.
The shovels are the datacenters. China and America are building them. Even after the valuations puff out, that infrastructure will remain as a massive competitive advantage to those economies.
I am not convinced this will be true. The big piles of GPUs make sense when the models will change multiple times before the hardware fails; but when there's no more money for rapidly training models, the best can be encoded as circuits in much more energy efficient hardware, rendering even the new power supply infrastructure for the data centres useless.
More expensive is a difficult calculation: faster can sometimes warrant the higher cost, if it means you can go faster to market. Also, LLMs work 24x7, and can be scaled up and down as needed. Faster to off board an LLM than to fire an employee (especially here in Europe). So, even if AI is more expensive than a developer, from TCO and ROI perspective it can still make business sense.
I'm surprised they even had them in the first place. Doesn't Microsoft have a deep partnership with OpenAI? Aren't all Copilot things powered by various GPT models? I would assume the two companies have barter agreements of sorts.
> I understand that Microsoft is planning to remove most of its Claude Code licenses and push many of its developers to use Copilot CLI instead. While Claude Code has been a popular addition, it has also undermined Microsoft’s new GitHub Copilot CLI coding tool — a command line version of GitHub Copilot that runs outside of development apps like Visual Studio Code.
It's a forum called Hacker News that's been hacked and covertly refactored into Marketing Wars, their primary goal being to foster a space to draw in (marketing) projects/start-ups.
What's the point of eating your own dog food when the only thing you are doing is reselling other people's dog food? Microsoft don't have any competing LLM.
If you properly keep documents, architecture, and decision records, token consumption can be much lower. I am managing everything with two Codex Plus subs. Repo size is 300k LOC (backend).
Slightly related (me not understanding) is why the Copilot in VS Code is essentially just a CLI interface. Why can't it use the IDE tools (search, LSP, ...)? All it ever does is try to execute grep.
Because it’s far far easier to make a text-generation machine generate text that has decades of how-to explanations on the Internet than to correctly work an internal editor API that changes often and isn’t as well-documented.
Claude Copilot does seem a bit more lost on the interface side than other models, but then again all of them are. Only the baseline tier seems to have been fine tuned to the platform.
I replaced common grep with a semantic search wrapper for some projects. It was amusing. It has a response header that lets Claude know it is not using standard grep. Works fine. Have to outsmart them ;)
There is an option to turn on semantic indexing and search on Copilot in VS Code, although I notice no perceptible difference when I turn it on. The docs mention something about it.
Claude’s prompt heavily pushes it towards grep. We have an internal cross repo semantic search mcp and to get Claude to consistently use it a skill and prompting was not enough. A pre tool use hook is the answer. Claude will even write one for you if you describe the problem to it :)
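A pre-tool-use hook of the kind described here could look roughly like the sketch below. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input` fields, and that exiting with status 2 blocks the call and returns stderr to the model; check the current Claude Code hooks documentation before relying on either detail. The MCP tool name `mcp__code_search__semantic_search` is hypothetical, as are the helper names `decide` and `run`.

```python
import io
import json

# Hypothetical name for an internal semantic search MCP tool.
SEMANTIC_TOOL = "mcp__code_search__semantic_search"
BLOCK = 2  # assumed exit status meaning "block this tool call"
ALLOW = 0


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_status, message) for one pre-tool-use event."""
    tool = event.get("tool_name", "")
    tool_input = event.get("tool_input") or {}
    message = f"Use {SEMANTIC_TOOL} for code search instead of grep."
    if tool == "Grep":
        return BLOCK, message
    if tool == "Bash":
        # Only catch commands that start with grep or rg; other commands pass through.
        words = str(tool_input.get("command", "")).split()
        if words and words[0] in {"grep", "rg"}:
            return BLOCK, message
    return ALLOW, ""


def run(stdin, stderr) -> int:
    """Read one JSON event from stdin, write any message to stderr, return the status."""
    status, message = decide(json.load(stdin))
    if message:
        print(message, file=stderr)
    return status


# Simulated invocation with an in-memory event instead of real stdin.
err = io.StringIO()
event = {"tool_name": "Grep", "tool_input": {"pattern": "TODO"}}
status = run(io.StringIO(json.dumps(event)), err)
print(status, err.getvalue().strip())
```

As a real hook script, the simulated invocation at the end would be replaced by `sys.exit(run(sys.stdin, sys.stderr))`. Blocking with an explanatory message, rather than silently rewriting the call, lets the model see why grep was refused and pick the other tool itself.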
Same, with regard to TUIs in general. The VS code copilot chat extension has really nice integration for 'human in the loop' style agentic development. I build some tooling - https://www.agentkanban.io to integrate a taskboard and git worktrees with copilot chat
I'm with you there. I can't stand the CLI that wants to take you away from the mostly bad code it writes. Give me the structure, let me finesse it - to do that I need to actually see it no matter how much Anthropic pretends that it's perfect.
I run Claude Code inside an Emacs vterm for moderately long-lived work streams, and an ever-shifting set of tmuxes for quick small features or bug fixes. The way I ensure I read the code at least a bit is the same as for wholly hand-written code: I never do git add . but only add one file at a time, and I git diff each file just prior to adding it (except sometimes for code-genned files). I also arrange mostly to do incremental dev, sort of agile where I am the client and Claude is the dev team and I check the utility of each feature one by one, so what I end up with delights me. It does tend to do more than is needed, so I will mostly delete code it has written rather than fix things. Like, really, not every module's tunable constants need to be overridable from env vars. I am happy with the resulting systems; they have not collapsed into unmaintainable messes yet. Claude in a vterm in Emacs is nice UX: I can think, run shell commands, and look at code or git history while having a longer-running discussion.
I'm a little the opposite, what's the point of using an IDE with AI? I genuinely don't get it?
These days I just use Claude Code Desktop or Claude Code in powershell. Standalone, not inside an IDE. Honestly, I'm using Desktop more and more as it gets more features.
The IDE is for me. No AI in it at all. If I want to get Claude to do something specific to a file I just @ the file.
the obvious answer is because it's easier, faster, and more efficient to flip a true to false right in front of you than it is to prompt an llm.
if your response is "my prompts don't produce code that needs values flipped, ever." then I would wager you're only touching very simple things with an LLM.
for me I don't care about the token cost and prompt writing so much as the fact that it's just faster to change 0 to 1 and leaves me twiddling my thumbs for an llm output less.
The thing that drove me away from manual edits was that I found myself confusing the LLM all the time. It would read or write some code, I'd twiddle with things, and then the LLM's future references to the same code would be a mess.
On balance, and via dictation, it feels likely to be faster overall to just enact the changes I want 'inline' of the conversation thread.
Is this stuff any better now? I think current harnesses probably do have things like file change listeners that automatically inform agents before they act on a file they've previously engaged with if it has changed in the meantime.
If you do manual edits, I find it best to start a new conversation. But if your instructions and documentation are good enough, the new conversation won't have any problems picking up where it needs to be.
Having said that, I fear what June 1st brings for copilot
It might suddenly be very useless for me.
I just use Codex/Claude Code in one window and Neovim in another and navigate around using Niri’s keyboard shortcuts. I much prefer it to VS Code on a traditional desktop in almost every respect.
But why did you flip that true to false? It sounds like a missing unit test. So at a minimum it’s do the flip, find the right place to unit test, and write a test. Or I just tell my LLM “this should be false because of X, fix and write a test”
For me, I need to compare the generated code before committing. I also need to read the generated markdown plans for review before committing to execution. The VSCode CC extension also generates clickable links directly to a file if the query has something to do with it.
All of them are valid use cases of the VSCode CC extension for me.
A smart model can dramatically cut down the time to write complex firewall YAML, relying both on the existing file and the ugly draft (e.g. comma-delimited details of the rules I need) I put out. It makes it 5 minutes of lead time and 20 presses of tab instead of writing a shell/python script full of edge cases or just copying existing rules as a template and laboriously editing them -- a smart model knows what the specific firewall needs.
But I'm not a developer, so I use both - haiku via github for tab completion and CC for cli.
Productivity. You generate the skeleton of the code with Codex/Claude Code/et al. and refactor it manually. It's kind of unlikely that an AI agent will be able to one-shot every bit of code in the exact way you want, even with a fat AGENTS.md file. With a smart AI-native IDE like Zed, it will quickly be able to pick up what manual change you intend to make without you fully typing out anything, especially if it's repetitive. This helps enormously when you're debugging or profiling your code.
That’s like asking why anyone would use IDE autoformatting, linting, or build tools rather than constantly swapping to a terminal to run their command line versions. As in, why use tool integration in an integrated development environment? Because that’s the entire point. Classic IDE refactoring and code generation tools are limited to explicitly programmed operations, but a well-integrated LLM can do much more and smarter manipulations without you having to context switch and explain the context of what you want done.
Claude Code will write the whole thing for you, whereas doesn’t Copilot require input along the way of coding? I.e., it doesn’t do all the programming for you.
Most of us never had the option for work to pay for Claude Code -- some internal orgs did this. That being said I had a personal Claude Code subscription for a bit.
Honestly I find GitHub Copilot CLI (and now also the new GitHub Copilot app) quite decent. I mostly use it with Opus 4.7, or rarely with GPT-5.5. The VSCode extension is ok, but CLI or app are the better experience IMO.
I was recently talking to someone about that! I wasn't sure if it was my imagination, but I felt like Opus 4.6 was way more diligent about looking things up online and making sure that its response was accurate. While Opus 4.7 seems content to just throw out an answer as quickly as possible with little care for accuracy; I started to always remind it to do an online search and to double check its work, to the point where I had to add a custom memory.
I switched back to 4.6 thinking, as most did, that 4.7 introduced some jankiness. I switched back to 4.7 soon enough. I think I might've adapted myself to what 4.7 does and how it does things. 4.6 felt like a step backward.
4.7 is better if your spec is clearer. 4.6 is better if you give it more freedom in doing its tasks. 4.6 felt like it would veer off more often than 4.7 when given detailed specs, though, so perhaps that's it.
How does their versioning work? Because I've assumed that they're constantly tweaking their system prompts, I'm hoping that in a couple of months 4.7 will have improved over my first impressions. I caught significant hallucinations, something I'd rarely experienced with 4.6, if at all (I honestly can't remember one), but what worried me was the hallucinations I didn't catch.
Same. 4.7 intelligence is significantly worse than 4.6 on ALL 3P Harnesses. So only on Claude Code and Anthropic API/Subscription you get decent performance but on every other Harness and/or Cloud Provider inference (Bedrock) it performs worse than 4.6 on almost every task. This is not just anecdotal, i've talked to many colleagues from AWS, Microsoft and so on and they all agree that something fishy is going on.
I switched back to even Sonnet 4.6 in Claude Code over Opus 4.7. Every day or two I try a new task on Opus 4.7 and regret it.
Looking now I see that "Opus 4.6 Legacy" is an option that was not there before, so maybe Anthropic noticed that others are having the same difficulty.
i use 4.6 and i've configured advisor to be on 4.7, so when something's more complex the advisor can help. at least that's how i do it with claude code; not sure if the others have implemented the concept of advisors.
4.7 turned out to be a disaster in multilingual settings, so I've stuck with 4.6 so far. 4.7 seemed to be optimized for (a very specific slice of) coding at the expense of everything else.
I went to 4.7, didn't have a choice, found it unsatisfactory, then Claude quietly added in the option to use 4.6, so I'm back on 4.6, and I'm not the only one in my company.
I had far more hallucinations with 4.7 than 4.6.
I'll try it again after a few more months for them to get it right, but 4.6 is what changed my mind on LLMs as a tool, and 4.7 felt like a step backwards, so for now I'm sticking with something that has delivered me value, instead of arguing with a model ostensibly better that was making shit up 1 - 2 times a day. It was really disappointing.
I can give examples if needed, I screenshotted the most aggravating ones, but what worries me is which ones I didn't recognise.
Opus 4.7 went through a major degradation a few weeks ago (way more hallucinations and rabbit holes than usual). Anthropic fixed it. Give it another shot.
Opus 4.7 seems very smart but the adaptive reasoning makes me always uncertain how hard it is actually trying. And it is far too argumentative. It seems to think it HAS to contradict you in every response.
I have stuck with 4.6. I fully believe 4.7 can be smarter for truly complex and long running agentic use. But I prefer the more direct, literal mechanistic style and 4.6 seems to be peak Opus for that.
I still use 4.6 if I need Opus. It's mostly GPT-5.5 for me. Only if I know it cannot do some thing like push without running the tests (because AGENTS.md said so), I switch to 4.6.
Although GPT's been acting weird since Thursday...
Switched back when 4.7 had an issue last week and it was wayyy faster. I assume mostly because a lot of people have moved off but might consider using it more just for the speed boost.
Wouldn't they be forced into API pricing instead of per-seat like that though? That would potentially be a massive cost increase. But I've discovered through talking to colleagues some companies are already doing that. I can't understand why you'd ever do that when you can get VC subsidized pricing for now. At least for all initial in-plan usage. I doubt many developers go past the limit anyway and for those you switch just the extra usage to on demand anyway.
Funny I had the opposite experience. The Claude models seemed equivalent to GPT-5.4/5 in a generic harness like Copilot CLI or Opencode or Pi, but Claude Code the app/harness is so much better than all the others that I switched at work, even though I'd much prefer to use a non-proprietary harness (and eventually I do want to get Pi set up to be comparable).
Anthropic's Claude harness is much better than Copilot, i.e. the tools and instructions in each harness are different. Anthropic is just that much better (for claude models, likely an amount of co-development).
Personally, I looked into Copilot's prompt and saw things that made me put it down immediately to start working on my own. I'm now using OpenCode for reasons and I like it better than any Big AI tool. Using OC with Qwen3.6-MoE (for context) and generally happy with the results.
I never understand why Amazon even bothers to build their own coding agent.
GitHub Copilot is in a somewhat similar place as Microsoft's toy but still different -- it was more or less the first coding agent/assistant, and GitHub/VSCode/Microsoft has enough user base and impact to influence individual users and enterprises' choices.
For Amazon's coding agent -- I just never see anyone outside Amazon even mention Kiro or Amazon Q. Maybe a little bit when Kiro was offering tons of free credits. But I don't think it's even remotely relevant these days. I don't see news about companies adopting Kiro.
To me, it's just a matter of time before they are sunset, like Chime or a bunch of AWS products.
Is there any proprietary Amazon end-dev/ops facing service that's worth using? I've never had a good experience with any I've tried - CodeBuild, Cloud9, Q, SageMaker, WorkMail, WorkDocs, Chime, OpsWorks,...
I love AWS at the infrastructure level, but their PaaS tends to be meh, and their end-user directed stuff is usually atrocious.
When I was at AWS I had exactly one customer who used Chime and they loved it.
They were a manufacturing org and only managers had licenses to MS Office and users in Active Directory. Everybody else was registered on a separate OpenLDAP directory to avoid paying MS licenses.
Chime was cheaper per user than onboarding everybody into AD and paying Teams, and they could tack Chime usage into their AWS bill.
Microsoft have historically tended to dogfood their own products.
Obviously you want to be aware of what else is on the market, and use the right tool for the job -- but equally if you have a directly competing product, you'd prefer your org's telemetry and suggestions are directed towards improving your own software rather than your competitors'.
This was always a little weird to me because Microsoft internally is actively hostile to cross-org collaboration. If you worked in most of Azure you basically have 0 lanes of communication with someone from the Windows team and vice versa. Triply so for stuff like Kusto or Teams which you'd be dogfooding daily. I guess if there's a horrible stop the world bug it'd get surfaced through telemetry but normal user feedback is not a thing.
Compared to working at other big techs, where I was able to direct msg the engineers on the team for internal protobuf or datalake services in addition to user groups that were generally responsive it was just strange. Also Microsoft doesn't have a monorepo so you can't just commit patches to their service because you don't have access to their repos which I pretty regularly do elsewhere.
Maybe it's just Microsoft moving to more model agnostic tech within their copilot. I recently started using Microsoft 365 Copilot because corporate added Cowork which runs on Opus 4.7 which was better than the alternative we have available. Unlike the "real" Claude Code or Cowork this only has access to files in a specific onedrive folder in your personal sharepoint container, so it's much more compliant to things like NIS2.
Technically we're using Copilot and we're paying for it through Microsoft licenses, but it's using Opus 4.7. Even before this, most of our custom agents within M365 Copilot were one of the GPT models.
Or maybe you're right and they want their developers to use the copilot models.
Copilot cannot be behind any models because it's a harness, not a model. You can use any of the popular models through it, including Claude models. Though people have been saying that Claude CLI is a better experience.
I switched to OpenRouter and OpenCode a while ago. It is much cheaper, much much cheaper, and A LOT more reliable. Gemini in particular was a piece of trash when it came to uptime.
I switched from Claude Code to the GitHub Copilot app recently. Since our repositories are hosted on GitHub, I find the Copilot app better integrated for the PR workflow, with PR management available in the app. I don't think I miss any of the features of Claude Code. I never thought I would make the switch, but Copilot upped the game.
Also it became very hard to convince management to keep both Claude code and GitHub Copilot enterprise licenses.
I’ve been quite content with CoPilot’s $10/mo plan. Still offers access to Claude models (limited tokens) but has no time limits like the $20 Claude plan, so no interruptions in work flow. I use one of the free models for the more pedestrian tasks then sic Claude on the particularly thorny problems. Works very well for me.
Thanks for the heads up. I had a brainstorming session with Gemini about this (I can’t believe I just typed that sentence lol) and the plan is to switch to two local LLMs; a lightweight, fast 3B model for autocomplete, and a slower 14B model for chat sessions. Then I can switch to a DeepSeek Premium API key for the really tough stuff. It recommended the Continue plugin for VSCode.
When I am away from home I’ll run autossh on my dinosaur road laptop (which probably has 8MB video RAM lol) to connect to the home PC’s LLMs. Gemini assured me that this should run well over my intermittent cellular connection.
The comments I see recommending selective use of cheaper models don't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelihood by using a less effective model.
Saving money on tokens isn't something that's rewarded during performance reviews; particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
This, if you’re high performing, the company won’t question your use of tokens. If they want to limit it, they have ways to set limits on spend and usage.
Perhaps for now. But you know, after working solid with AI for two years and adopting effective methods using detailed plans, and having a lot of success with it, here is the problem:
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
> Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
This is why I have switched nearly all of my personal coding experiments over to Qwen3.6 27B. Opus makes it easy to gloss over too much and to delegate too much. And so I don't build sufficient memory of the code to provide long-term oversight.
But Qwen3.6 27B sits on a really interesting balance point. It understands code well enough to get 80% of the way to a good design, and it can fully implement a well-specified feature. But if my understanding of the code starts to weaken, things start going wrong much more quickly than they do with Claude.
Opus will happily take complex code beyond the point of salvation, if you allow it. I'm currently cleaning up a successful prototype code base, one that was partially vibe-coded and now needs to be put into production. And Opus generated massive amounts of tech debt. So clearly people who lean into vibe coding will need to keep upgrading their models for many years to keep up with the mess created by earlier models.
Strong agree (although I'm on Qwen3.6-35B-A3B, with 6-bit quant.). If you're a programmer, it gets the job done. When I occasionally don't want to care about the code, I switch over to DeepSeek V4 Pro.
Yes I use Opus 4.7 regularly as my daily AI tool. It can do incredible things for sure, but more in the sense of pure intellect not much in “emotional” or “creative” intelligence.
For example you might have a great design/architecture session and then run out of context. The next agent tries to piece things together from fragments of conversation and such. But it often starts going off on tangents, searching overly broad to understand, misses cues and nuance, all-the-while burning tokens.
As other articles have put it: AI makes doing the easy things easier and the hard things harder. Because hard things require creativity.
To bring this back to the original post: companies need people, and they shouldn’t expect that they can fire half their workforce and replace it with AI. Quite the contrary. The faster companies move with AI, the more technical debt they’ll end up with. It’s a guarantee.
“If you want to travel fast, go alone. If you want to travel far, go together.”
Now as you can see from the article, it starts turning. People are getting less pricey than agents on API pricing.
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
> But objective measures of the economy like unemployment and real wages look good to excellent
Oh hell no, ever since the tail end of Biden the trend for unemployment is showing upwards when corrected for seasonal effects [1], and for real wage growth the situation has been worse for an even longer time [2] - if not for the effects of the post covid stimulus packages plus emergency wage raises following the energy cost explosion thanks to the Russian invasion of Ukraine.
The story the stonk markets tell is completely decoupled from reality, partially because the AI wash trading bubble keeps distorting the statistics, partially because no matter what the stonk markets only can grow up because pension contributions keep blowing up the market [3]. Not getting that difference was what blew up Biden's reelection and is now screwing over Trump.
What is in the gutters is memories of 2008-2010. That was the last time folks experienced a bad economy. I remember Ed Elson saying something along the lines of "who cares about employment, what matters is inflation". Sure, if you're 27, you haven't got a clue what a bad economy looks like.
Unemployment and CPI : The most false statistics on the planet. Instead, look at employment population ratio 25-54, and core inflation. That will FREE YOUR MIND.
If you’re sitting under a tree in the rain and it gets soaked through and you start getting wet, finding another tree won’t help you.
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
Code quality matters to engineers. Find a senior manager who cares. Or worse, find a customer who cares.
While they obviously want a high quality product, no outages, a responsive system etc, I don’t think they necessarily understand why you need to avoid creating god-objects, need to reason about abstractions, etc.
Code quality also exists on different axes. I've seen the case where code quality was poor in some aspects, e.g., tons of technical debt, coupling making it difficult to make changes, but overall product quality was very high. It had to be: it was a medical device.
Most environments only care about the output. In the case I'm thinking of, Software made it perfectly clear to Management, most of whom were former engineers, that the product desperately needed redesign in some ways. But as long as the cost of that redesign exceeded the cost to get the next version out, it could be postponed. This went on for years.
As one that does, it’s a difficult discussion to have with the executives. My peers look like their teams are producing more than my teams are and any argument along the lines of “but their code sucks” isn’t going to hold water. The executives care but until there’s actual impact or poor quality, it won’t matter, and it’s a lagging metric. Many still don’t care about technical debt and that’s been well understood in industry for a while.
It’ll take production incidents, impacted customers, and brand damage to make the executives start to prioritize quality over quantity again.
That’s why they said they optimize for effective output at the cost of higher token use. They didn’t say they intend to have high token use; instead they implied it’s a second-order effect of seeking more effective output.
It doesn’t necessarily mean shipping faster either. Speeding up code production doesn’t mean it speeds up qa, compliance, and the litany of other things. Everyone seems to forget Amdahl’s law.
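For readers who want the Amdahl's law point spelled out, here is a minimal sketch. The 30% share and the 5x factor are made-up illustrative numbers, not figures from the thread or the article, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Overall speedup when only part of a process gets faster (Amdahl's law).

    fraction_accelerated: share of total delivery time spent in the sped-up step.
    local_speedup: how many times faster that step becomes.
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - fraction_accelerated           # QA, compliance, review, etc.
    accelerated = fraction_accelerated / local_speedup
    return 1.0 / (remaining + accelerated)


# Assumed for illustration: writing code is 30% of delivery time and gets 5x faster.
overall = amdahl_speedup(0.30, 5.0)
print(f"overall speedup: {overall:.2f}x")  # overall speedup: 1.32x
```

Under those assumptions, even an infinitely fast code generator would cap out near 1.43x, because the other 70% of the pipeline is unchanged.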
On a task by task basis the code Claude generates is pretty good these days.
The biggest issue I see is that it wants to rearchitect the code constantly and I have no faith in my tests anymore because Claude will just "fix" them
It’s too bad that, yet again, instead of the productivity gains leading to shorter work weeks, the benefits accrue to the companies. Just once I’d like to see productivity gains lead to more leisure time, not higher expectation.
American software engineers are paid commensurately more than equivalent roles in countries with strong worker rights. There is no free lunch.
Besides, it's probably counterproductive in the long run to think of strong worker rights as being opposed to the employer wanting higher productivity out of the worker.
Well, if we are talking the worldwide software development industry, FAANG-like salaries are a tiny exception. There are so many places without strong worker rights and without a high premium for workers.
The expectation of higher productivity measured by completely useless means, letting a highly qualified employee jump through hoops for the amusement and misconceptions of the C-level.
Maybe once we get universal income we can start recommending this. Until then the individual isn't to blame when the only option to keep providing is to keep grinding in a toxic environment.
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
> I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough.
And the tragedy is that this isn't sustainable, and we all involved deeply in tech know this. There is eventually going to be a big reality check the companies will have to pay, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us at least for now and for the foreseeable future. However when the rope eventually snaps these executives at best will fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones that will suffer through the next big layoffs.
Unfortunately, I think this is correct. Such as it ever has been with technological change. The folks at the bottom bear the brunt of the dislocation and the folks at the top pat themselves on the back for being so forward looking and get huge payouts regardless of the actual results. Further, the folks at the top are always incentivized to go along with the herd of their peers because if it works then they were on the bandwagon, and if it doesn’t work, well then, how could they have known because “Everyone was deceived.”
How much creativity do you need to fix bugs in corporate code? Almost zero. It’s maintenance, not creative work. Nothing against it, it’s needed, but let’s be real, would anybody be really sad if this work is overtaken by LLMs? I certainly won’t be, let them do it.
> How much creativity do you need to fix bugs in corporate code? Almost zero.
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I wouldn't be sad about the job itself, but about the dev who had a mortgage to pay and has now been replaced by a machine churning out crap code while their superiors get sore from patting themselves on the back.
IBM system/360 OS had more than 50,000 bugs which could not be fixed because fixing any single bug would introduce two new bugs. I fear that a lot of AI software systems will reach the same crapware state as IBM system/360 very very soon!
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
> the companies will have to pay, because you can't force creativity and quality
Most companies do not care about quality.
_users_ who have to interact with that software will pay the price.
Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some MCP servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
Anyone (including ANTHROP\C) "recommending selective use of cheaper models" is spending costly human time (which costs more over time) on correcting the machine (which costs less over time). This is a bad trade.
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
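The trade-off argued in this comment (human correction time versus token spend) can be written down as simple arithmetic. This is only a sketch: every number below is a hypothetical placeholder, and it illustrates the shape of the claim without verifying it. `total_cost` is an invented helper, not anything from a real tool.

```python
def total_cost(token_cost: float, human_hours: float, hourly_rate: float) -> float:
    """Total cost of a task: model spend plus the human time spent steering and fixing."""
    return token_cost + human_hours * hourly_rate


HOURLY_RATE = 100.0  # assumed loaded cost of an engineer-hour, in dollars

# Assumed scenario: the cheaper model needs more human correction.
cheap_model = total_cost(token_cost=5.0, human_hours=3.0, hourly_rate=HOURLY_RATE)
strong_model = total_cost(token_cost=40.0, human_hours=1.0, hourly_rate=HOURLY_RATE)

print(cheap_model, strong_model)  # 305.0 140.0
```

With these inputs the pricier model wins, but the conclusion flips as soon as the cheaper model needs no extra human time, which is the situation the replies describe.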
I disagree heartily with everything here, both in personal experience from the models, and in values about coding.
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular older Claude models than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller, answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the more the model is clever rather than clear and simple, the worse the outcome, the more that the model turns into an architecture astronaut, the worse the outcome.
I’d point out that smaller and simpler also makes their router code easier to review and that fewer lines will have fewer bugs (on average) and those bugs will be more obvious. But then, I’m old school and won’t let an AI work on code without reviewing it, and I mostly write code by hand.
Too many people see wages as a sunk cost and a constant. One problem though is AI costs per task are unpredictable, and management tends to prefer predictable outcomes over optimal outcomes.
> The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
This, I happily used the opus 4.6 fast mode to the tune of 5k for a project. The delivery of the project justified the 5k, if I only spent 500 but delivered the project 1 month later - I would have been in the dog house.
Your project cost $5k in tokens? How does that work? over what time? My understanding is that most developers are given pro max plans at $200/m and are expected to max that out.
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
My real comment is, why were they not just using their self-hosted copies of it? Do they pay back Anthropic for use of it in Azure? Broker a deal, let Anthropic charge you drastically less to use their model AND Anthropic could have made Claude Code work directly with Azure for Microsoft employees. Pennies on the dollar, and Microsoft could do it using low use GPUs to save on cost, or stack underused GPU compute (this is how serverless was born btw - its the unused resources in a web server somewhere).
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
I think quantifying tokens used is analogous to quantifying the amount of sawdust generated on a construction site.
Churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game theoretical dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
With the rise of agentic coding, this has become a sign of quality for me in my own PRs and reviews: New features implemented in less than a thousand lines of productive code.
When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
"""
Steve Ballmer
In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand lines of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS/2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing.
"""
So many times in my career I have seen a problem that could be handled with two lines of code and a table lookup being handled with 40 lines of code and a switch statement. So the guy writing the 40-line switch statement would get paid 20 times more money!
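A small sketch of the contrast being described, using an invented example (mapping HTTP status codes to a retry policy). The function and table names are hypothetical; the point is only that both versions behave identically while one is mostly data.

```python
# Branching version: one arm per case, and the logic grows with every new code.
def retry_policy_branching(status: int) -> str:
    if status == 429:
        return "backoff"
    elif status == 500:
        return "retry"
    elif status == 502:
        return "retry"
    elif status == 503:
        return "backoff"
    elif status == 504:
        return "retry"
    else:
        return "fail"


# Table version: the cases live in data, the logic is a single lookup.
RETRY_POLICY = {429: "backoff", 500: "retry", 502: "retry", 503: "backoff", 504: "retry"}


def retry_policy_lookup(status: int) -> str:
    return RETRY_POLICY.get(status, "fail")
```

Adding a new case to the table version is a one-entry change to `RETRY_POLICY`, which is also easier to review than another branch.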
To understand the token count thing - spending tokens is necessary and not sufficient to demonstrate that you are adopting AI.
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
So it’s a pragmatic metric that got a bit Goodharted.
My feeling is it's not as bad of a metric as people think. Companies don't fully know the best way to use AI and things are changing rapidly, so you want people using a lot of tokens even on stuff that seems maybe kind of dumb on the surface, because if you find one useful thing and share it in the org that makes up for a lot of failures.
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
NVidia will probably sue you for doing that, though.
Remember that the entire mantra of "productivity is a measure of how many shovels you break and replace" is only ever echoed by the one selling the shovels.
> The comments I see recommending selective use of cheaper models don't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelihood by using a less effective model.
I don't buy it. Old models such as GPT4.1 were faster than newer reasoning models, and their output was as good. Newer models end up wasting an ungodly amount of time with chain-of-thought steps which can be a complete waste of time if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.
What per cent of internal Microsoft IP runs through Anthropic? Do they not care about trade secrets, or certain groups allowed or not allowed to use tools that expose IP to external vendors?
Our shop is forced to use Copilot on gov cloud, and it’s so useless I usually stick to manually coding. Its syntax is messy, it randomly combines lines together, flips order, or drops a couple tokens worth of output in the middle of a line, and for some reason it consistently drops the last line of every code block. I assume we’re getting a few versions back of GPT under the hood. But it does make me appreciate how the models of the past year or so crossed the threshold from interesting to truly productivity-enhancing.
Between Copilot, Claude, and Gemini, I still actually prefer Gemini. I do a lot of scientific writing in addition to coding and Gemini is the only model I can trust to “just be right”. This trust then transfers over to its code output.
If you are talking about the Copilot built into vs code, that's not been my recent experience at all. Very capable in agent mode since gpt 5.4 came out.
It seems that people are using LLMs to generate code, but many complain of subpar code. I recall the early days of virtualization, when folks would use it but complain about performance. Hardware capacity continued to improve until virtualization became the de facto standard. I wonder if subpar code will get better as more powerful agents, models, or compute become available.
It's been said that technologies are not products. CC might be better, but at the end of the day M$ is going to want to cut costs and have employees use their own technology. Perhaps Copilot CLI is close enough, and the CC product doesn't justify the cost of the Claude (technology) license when M$ has their own technology to leverage.
Side note, it's so frustrating that The Verge puts a paywall at the fold. It makes me feel like the rest of the story is not worth reading. I'm not inclined to pay $2 to read a link that was posted on an aggregator.
I have noticed particularly in recent weeks and maybe couple of months that token costs are just ridiculous.
I can understand the upcoming IPOs and instinctive pressure to show profits ... but let's be honest, showcasing burning 1.3 million USD in tokens by a single developer in a month is the most ridiculous thing I have seen in my entire life.
The general principles still apply. You invest X and expect a return on that investment.
Unfortunately that's not so easy to promise or expect.
There's no real 1 to 1 correlation between amount of code written and returns, and even less between tokens burned and returns.
I'm starting to believe that the current token pricing approach, followed at the moment by all leading labs (especially considering OS models' capabilities), is borderline delusional ...
[OP] robertkarl | a day ago
I expect the r/LocalLLaMA guys to be going nuts about this news.
thewebguyd | 23 hours ago
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time.
I suspect they weren't as efficient as they could be with token use either. Sounds like they were trying to encourage non-developers to vibe code stuff
q3k | a day ago
call me a luddite, i'll be wearing it as a badge of honor
NitpickLawyer | a day ago
At one point there were rumours that they'd do that. They also have the rights to OpenAI models for a few more years still, so they could always use those, but apparently they're also compute starved (like everyone else).
mrweasel | a day ago
This is a warning to any company not building their own AI that AI-assisted development could become really expensive really fast and most likely won't pay off. What Microsoft is suggesting is that the current price is too high, but it's still not high enough for e.g. Anthropic to be profitable, or AI coding tools are only as good as the developers using them. So you can't meaningfully do layoffs by replacing the developers with AIs, because the cost is too high.
How does Microsoft plan to fix Copilot so that the cost will be so much lower than Claude that budget overruns won't be a problem for their own customers?
andyfilms1 | a day ago
Smaller companies will have departments that distill larger models into something more specifically manageable and useful for them. At least, that's my personal prediction :)
mrweasel | 22 hours ago
I do think your prediction makes sense, because the AI really isn't the product, it needs to be baked into something and licensing the models saves you the R&D and cost of implementing your own.
jcgrillo | 8 hours ago
In order to do that they'd have to make a concrete business case to justify the headcount and compute costs. They'd be facing the same fundamental economic problems Anthropic, OpenAI, MSFT, etc are facing just at a department level instead of a megacorp level. I hope they try it, sunlight is the best disinfectant.
However, when the pressure is turned up and people have to actually show results--and, like, be accountable--instead of just buying a subscription and externalizing the accountability, I don't think we'll see so much enthusiasm about AI coding. Whether or not an engineer is actually more or less productive with AI (not merely whether they feel more productive) will begin to matter a lot more. I don't see how people continue using AI in this hypothetical small company under those adverse conditions.
kridsdale1 | 22 hours ago
There may be a spot of “good enough to pay for and make a profit” that exists.
onlyrealcuzzo | 22 hours ago
The frontier model space costs 1000x as much to develop as the small language models, and is only 1.5 years ahead.
Factually, the frontier models have not paid for themselves. So, if you're MSFT and Apple, you don't need to run in a race where even the winner loses massively.
You can try to train models 1.5 years behind that are highly likely to be profitable, given your market position.
The average person is lagging behind what AI is capable of by 3+ years anyway...
So you can save 1000x on training and 10x on inference and just use SOTA small models.
Why spend $5B training a model that's for sure not going to make $5B (after inference costs) when you can spend $5M building one that WILL make far more than that after inference costs?
tra3 | a day ago
I've tried throwing unsupervised agentic software factory workflows against the wall, and they burned through my tokens like nobody's business but didn't produce much.
Supervised, human-in-the-loop process on the other hand is much more productive but doesn't consume nearly as much. Maybe that's why everyone's pushing agentic approaches so much.
SubiculumCode | a day ago
salawat | a day ago
dgellow | 12 hours ago
gambiting | 11 hours ago
echoangle | 10 hours ago
Sidenote but I hope everyone realizes that 100 is kind of arbitrary here and does not mean the total chance to get something is 100%.
avadodin | 8 hours ago
dgellow | 7 hours ago
Actually it’s really its own thing, I don’t think the slot machine analogy works too well, you also have fixed odds (and you know they aren’t in your favor), and a binary output
skydhash | 2 hours ago
With employees, there's a lot of punishments in place for people to not want to mess up. Loss of wages and reputation, prison time,... Startups do not fail because they have a bug-ridden product, they fail because of the market.
With AI, all bets are off. They're not aligned with your goals and it's very hard to discern when they go off unless you're an expert. And if you are one, at best it's just a slight boost in typing especially with all the work involved in software development.
serf | 10 hours ago
people still can't get over the unreasonable effectiveness of algorithms.
greenchair | 8 hours ago
arkadiytehgraet | 7 hours ago
LPisGood | 10 hours ago
subscribed | 7 hours ago
If this is the "analogy" you go for, you don't seem to be suited to make that comparison.
__mharrison__ | 6 hours ago
I get the anti/skeptic sentiment. I've been called a lot of horrible things by a vocal contingent when they hear that I help train folks to learn software engineering best practices and then apply AI to that.
layer8 | 23 hours ago
xienze | 23 hours ago
Yes, but in a "oops this is gonna take another two months to finish" kind of way, not the "oops this is the 12th time this month 8 developers have burned $2K in tokens in a single day and no one really knows how it happened" kind of way.
kridsdale1 | 22 hours ago
dgellow | 12 hours ago
bluGill | 8 hours ago
dgellow | 6 hours ago
bluGill | 5 hours ago
ilovecake1984 | 22 hours ago
sidewndr46 | 22 hours ago
iSnow | 22 hours ago
sidewndr46 | 21 hours ago
andrekandre | 18 hours ago
lukan | 9 hours ago
foolserrandboy | 9 hours ago
basch | 21 hours ago
mrgoldenbrown | 17 hours ago
Isn't this a (mildly exaggerated) description of AWS, which is a very successful service?
noodletheworld | 11 hours ago
So your costs scale with the number of users you have.
That's an opex that you can explain.
For tokens for developers it's maybe closer, cost/outcome wise, to hiring an external consulting company to write your code: money paid scales with work done, no promise of delivery, arbitrary unpredictable external price changes.
It's not quite the same, though it is similarly lucrative for consultants.
logicchains | 9 hours ago
Not if you're using it for running builds, running research jobs, model training, etc.
jochem9 | 13 hours ago
Like the other commenter said: cloud spend can also spin out of control if you don't pay attention, yet we've found ways to keep it under control (training, guardrails, limits, transparency).
harimau777 | 7 hours ago
Personally, this feels like it's just trying to push the work of managers in allocating resources onto developers so that they have more work to do and can be blamed if anything goes wrong.
tracker1 | a day ago
I tend to work with the agent, and observe what's going on as well as review/test and work through results/changes. I spend a lot more time planning tasks/features than the execution, even using the agent as part of planning and pre-documentation. It works really well. I don't think people burning through the 5hr allotment in under an hour are actually reviewing/QC/QA the results of what they're doing in any meaningful way, and likely producing as much garbage as good (slop).
I'm really curious as to HOW the MS employees were using the agents as much as what they were doing.
kristjansson | a day ago
skeledrew | a day ago
kridsdale1 | 22 hours ago
lawn | 23 hours ago
giancarlostoro | 23 hours ago
HDThoreaun | 23 hours ago
sulam | 22 hours ago
kristjansson | 15 hours ago
TeMPOraL | 10 hours ago
Either way, I don't see much point of intentional austerity in times of extreme growth. There will be time for austerity once the growth ends.
mrgoldenbrown | 17 hours ago
northern-lights | 10 hours ago
brookst | 23 hours ago
CoolestBeans | a day ago
dualvariable | 20 hours ago
beardyw | 12 hours ago
gmerc | 12 hours ago
hannofcart | 11 hours ago
visarga | 7 hours ago
thegreatpeter | 23 hours ago
brookst | 23 hours ago
gobdovan | 23 hours ago
kridsdale1 | 22 hours ago
recursive | 22 hours ago
nchie | 10 hours ago
hedgehog | 21 hours ago
KronisLV | 22 hours ago
Colleague used Sonnet 4.6 on some pretty normal agentic coding tasks through AWS Bedrock to keep the data in the EU, 100 EUR usage in a single day. In comparison, the Mistral subscription costs about 20 EUR per month and we tested that for similar tasks it was okay, the usage got to around 10% of that monthly limit in a single day. Or Anthropic's own Max (5x) plan where you get way, way more tokens to do with as you please.
I feel like the sweet spot is having a monthly subscription with any of the providers (you're subsidized a bunch), but if you have to pay per token, now I'd just look in the direction of what tasks DeepSeek would be okay for, sadly probably not in the situation above. For a startup, though...
On the other hand, this feels a bit hypocritical:
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time, and sources tell me that Claude Code has proved very popular inside Microsoft over the past six months.
They're gonna say that the future is all AI... until they get the bill.
michaelbuckbee | 22 hours ago
The results for a function implementation and test of Levenshtein distance in JS are pretty similar, but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.
https://5m6qnuhyde.evvl.io/
KronisLV | 20 hours ago
kaoD | 11 hours ago
Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. The kind of problem where even small/bad models can excel. The gold standard for those tasks is just "use a library" so no wonder the beefy models are expensive: you're chartering a commercial airplane to go grocery shopping.
My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.
There's where even frontier models struggle, which makes comparisons meaningful.
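To make the "small and self-contained" point concrete, here is a minimal sketch of the benchmarked problem. It is written in Python rather than the JS used in the linked comparison, and it is just the textbook two-row dynamic programming version, not what any of the models produced.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] is the distance from the empty string to b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]
```

The whole thing is about fifteen lines with no dependencies, which is why it says little about how a model copes with multi-package work.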
CraigJPerry | 10 hours ago
It’s making guesses not decisions, framing as decisions will lead you astray to wasted time and tokens.
It’s vaguely productive to tell them a ton of relevant info upfront attempting to minimise their need for load bearing guesses. I say vaguely because obedience is generally only around the level where it's good enough to lull you into a false sense of security, not to actually be obedient.
It’s a bit more productive to use the various loop mechanisms (hooks, /goal etc) to evaluate each end of turn against guard rails and reject with clear instruction on what's unacceptable. Obviously if you only do this without the front load of info then you’re likely to spend more tokens to reach a satisfactory end of iteration.
kaoD | 10 hours ago
mark_l_watson | 6 hours ago
Breaking code up into composable chunks has worked well for me over 50+ years as a professional software developer, and I can't get away from the idea that it is still usually the way to go using agentic coding tools.
dgellow | 12 hours ago
I mean, they will continue to say so, they just want to be the ones being paid for the service, not Anthropic :)
phillc73 | 8 hours ago
I upgraded my plan last night to Mistral Le Chat Teams. This now costs me €60 per month for two users. Limits have been reset, but I have no idea now if my per seat limit is higher than the Pro plan, or if the limit is shared between the seats, it’s really not clear. I guess I will find out next month. The limits reset on the first of the month and I really hope I don’t hit them in the next seven days.
I use Mistral Vibe CLI and I’ve written and implemented a couple of new skills[1]. Caveman, based on an idea I found online somewhere, this skill removes all extraneous response text, including articles. Makes for some fun reading, but supposedly reduces output tokens significantly. Hash-anchors, this one is based on a concept from Dirac[2], reduces search failures and also includes multi-file dispatch. It will be hard to measure, but Vibe tells me these two should result in roughly a 40% reduction in token burn.
[1] https://codeberg.org/MimosaDev/skills
[2] https://dirac.run/
nurettin | 12 hours ago
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Me: Are you sure?
Claude: Well actually there is a bug <more random stuff that looks right this time>
----- Now it is:
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Claude: Let me consult the advisor on that.
Claude: advisor came up with some advice, adjusting according to that. <more random stuff that looks right this time>
jstummbillig | 11 hours ago
visarga | 7 hours ago
matheusmoreira | 10 hours ago
Dealt with that by going all out and making an agentic parallel code review skill. Basically an infinite TODO list generator. Now I'm definitely getting 100% of the usage I paid for. It really burns tokens like nobody's business, and catches a lot of issues while at it. I've been looping this review/fix process every week. It's dramatically reduced the amount of stuff I need to pay attention to during my human review sessions.
jdsnape | 8 hours ago
matheusmoreira | 8 hours ago
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
There are many "critics", one for each quality I want reviewed. Correctness, consistency, maintainability, security, testing... Everything I could think of, and I keep adding more.
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
The scrutinize skill is the entry point. The Opus I'm talking to becomes an agent coordinator. He explores and autodiscovers the project's structure, subdivides it into logical sections.
Then he runs a truly absurd critic x section matrix against the entire project. Literally hundreds of these agents running in parallel, each focusing on one area. Ten minutes of this is enough to exhaust my Max 5x five hour window and put a serious dent in the weekly usage numbers.
It literally takes days to run a full agent sweep. I designed it around the rate limiting. The agents do file system style journaling in order to resume cleanly. They commit all of their findings as they go into an orphan branch in the repository. Further review runs can build on it and avoid searching for known issues.
The way it works in practice is I just run /scrutinize sweep and then go work on something else, or just go do my actual job, live my life, play video games, write an article for my blog or something. Come back five hours later to either resume the process or check the literally hundreds of issues that have been found by all the agents. Then Claude and myself will go in and evaluate and fix all of those issues one by one. Then review again. Then evaluate/fix again. I'm just gonna keep looping this over and over until zero issues are found. For all of my projects.
Going from solo hobbyist programmer to this was pretty insane. I can only imagine what these corporations with infinite money must be doing.
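A rough sketch of the critic x section matrix with resumable journaling described above. This is not the code from the linked repository: the function names, the JSON-lines journal file, and the entry fields are all hypothetical, and a local file stands in for the orphan branch the findings are committed to.

```python
import json
from itertools import product
from pathlib import Path


def pending_tasks(critics, sections, journal_path):
    """Return the (critic, section) pairs not yet recorded in the journal."""
    done = set()
    path = Path(journal_path)
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                entry = json.loads(line)
                done.add((entry["critic"], entry["section"]))
    # Full matrix minus everything a previous run already finished.
    return [pair for pair in product(critics, sections) if pair not in done]


def record_done(critic, section, findings, journal_path):
    """Append one finished task so a later run can skip it after a rate limit."""
    entry = {"critic": critic, "section": section, "findings": findings}
    with open(journal_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

A run would loop over `pending_tasks(...)`, hand each pair to an agent, and call `record_done(...)` as results come back. When the usage window runs out, the next run picks up from whatever is missing in the journal.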
resonious | 7 hours ago
matheusmoreira | 7 hours ago
My lone lisp project gets the most love. I spend weeks reading, reviewing, restructuring and rewriting everything. It's the project where I'm concentrating all my efforts. Everything I push to master is absolutely my own work and I do want everyone to read it.
I had no trouble letting Claude take over maintenance of my static site generator and virtual machine orchestration scripts though. I wanted to care but... I didn't. I did glance over the finished product just to ensure it wasn't going to nuke my laptop the second it ran, but that's pretty much the extent of it.
sgc | 7 hours ago
matheusmoreira | 6 hours ago
suzzer99 | 3 hours ago
I'm currently, very painfully, removing a tiny bit of tech debt at a time from a massively complex project that we inherited from a 3rd-party vendor. Some of the tech debt is AI-related, some because it's a vendor who rarely has to maintain anything they create, some because when we first inherited it we had no grasp on the entire codebase and were just trying to change the plane wheels while flying (we still are).
What I'm doing now is the hardest kind of programming imo. I spend hours/week just meditating on how to chip away at this out-of-control codebase, figuring out how I can surgically remove some leaky abstraction that's spawned 5 cousins w/o disrupting the whole project. I'd be fascinated to see if the latest frontier model with a system like yours can actually help me. But I don't have the time or desire to invest the months of trial and error that I'm sure it took you to get to that point.
matheusmoreira | 2 hours ago
In my case Claude saw that code review was my main activity and that I was manually and repeatedly asking claude to "review X, Y, Z..." so he suggested turning it into a skill. So I fired up the superpowers:brainstorming skill and bikeshedded it until I ended up with this heavy duty massively parallel super reviewing super claude. Refined it a bit after a couple weeks of use and the result is what you see in my repository.
visarga | 7 hours ago
kixxauth | 7 hours ago
There is this real danger that our thinking, and the things we make, become bloated without constraints.
IMO software has gone to shit since both mobile phones and laptops mostly have massive amounts of compute. We always seem to use it to the limit, just because it's there.
matheusmoreira | 7 hours ago
At least it's doing something productive instead of just sinking money into literal gambling simulators. Mercifully, unlike video games, automation is not "cheating".
krzyk | 6 hours ago
cebert | 5 hours ago
krzyk | 5 hours ago
blitzar | 9 hours ago
By buying a subscription and dealing with the limits. Using Claude Code and paying per token seems like the fast lane to the poorhouse.
killerstorm | a day ago
There are papers describing KV cache precomputation for commonly used documents (e.g. KVLink), but, of course, it's not a priority for model providers: they'd rather sell you more tokens, also they would rather get to AGI/ASI first than optimize usage of existing models...
brookst | 23 hours ago
beoberha | 22 hours ago
iainmerrick | 11 hours ago
killerstorm | 22 hours ago
Normally KV cache works only if your context prefix is identical, but there are papers which demonstrate documents can be cached between different contexts.
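A toy illustration of the prefix constraint, as a sketch only: tokens are plain strings and `cached_prefix_length` is a made-up helper, not any provider's API. With ordinary prefix caching, only the leading tokens that match a cached context exactly can reuse stored KV entries, because each entry depends on everything before it.

```python
def cached_prefix_length(cached_tokens, new_tokens):
    """Count the leading tokens shared with a cached context.

    Only this many tokens can reuse cached KV entries under plain
    prefix caching; everything after the first mismatch is recomputed.
    """
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared


system = ["You", "are", "helpful"]
document = ["doc1", "doc2", "doc3", "doc4"]

cached = system + document + ["What", "?"]
same_prefix = system + document + ["Why", "?"]
moved_document = document + system + ["Why", "?"]

print(cached_prefix_length(cached, same_prefix))     # 7: system + document reused
print(cached_prefix_length(cached, moved_document))  # 0: same document, no reuse
```

The second case is the one the papers target: the document is identical, but because it sits at a different position, a plain prefix cache gets nothing out of it.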
brookst | 19 hours ago
dgellow | 12 hours ago
brookst | 4 hours ago
dgellow | 4 hours ago
guluarte | a day ago
stock_toaster | 21 hours ago
zkmon | a day ago
andrekandre | 18 hours ago
verdverm | 3 hours ago
1. right now, usage correlates with experimentation and learning; few, if any, know how to make these things effective on their own over long sessions of activity
2. long term, you should be using more than one agent at a time, because they are running in the background based on events (new direct message / something happened in e.g. GitHub)
tyleo | a day ago
Speed without judgement always compounds badly.
andrewl-hn | 22 hours ago
https://www.folklore.org/Negative_2000_Lines_Of_Code.html
thadk | a day ago
At least Codex is trying to win validation on merit.
uniclaude | a day ago
HDThoreaun | 23 hours ago
boelboel | 22 hours ago
proxysna | a day ago
I've launched an internal demo of Claude Code and Deepseek on the same day and we burned through our monthly allowance for Claude in just over a week, with more than half of that budget being spent in one day. With DS people are unable to go through that same amount of money in a month, not even close.
With that Claude feels like an expensive toy, while DS is a shovel, purely because developers do not feel like they are eating into a precious resource while using it. Also it does not feel like there is much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very-very capable.
kridsdale1 | 22 hours ago
operatingthetan | 22 hours ago
saulpw | 22 hours ago
seabrookmx | 4 hours ago
This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at Next.
And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.
onlyrealcuzzo | 22 hours ago
After 2 weeks of Claude getting progressively worse and worse, today was the final straw.
I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".
I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.
mmusc | 20 hours ago
couchdb_ouchdb | 20 hours ago
Our_Benefactors | 15 hours ago
raincole | 12 hours ago
It's a good thing that hype-chasers are cancelling though. So we can use the services with a reasonable latency.
colechristensen | 12 hours ago
Opus has been dumb this week.
Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.
It could also just be luck and my impressions are false... who knows.
dgellow | 12 hours ago
Wowfunhappy | 11 hours ago
chantepierre | 11 hours ago
cdurth | 5 hours ago
shomp | 4 hours ago
kaeluka | 7 hours ago
solenoid0937 | 11 hours ago
People heard "Claude is nerfed" and now they see it everywhere, they notice failures a lot more than they would have otherwise.
Doesn't matter that Claude is not, in fact, nerfed. Perception is powerful and most humans are not rational.
fendy3002 | 9 hours ago
However that's just it, you just need to improve your prompt and make it clearer, and it will perform just as well.
fragmede | 9 hours ago
arkadiytehgraet | 6 hours ago
solenoid0937 | 4 hours ago
Anyways, please take your discourse of calling people you disagree with "shills" back to Reddit. I'd much rather engage with someone debating the merits of an argument.
arkadiytehgraet | 24 minutes ago
You should also check your LLM prompt for HN comments, because the original comment you replied to was not anti-AI, and, in fact, very much pro-AI. The only criticism it had was about model being degraded, so they could not go as hard at AI-assisted development anymore as they used to before. I guess it's a bit difficult for LLMs to spot the difference and make proper conclusion for now.
Also even if taking you seriously — how does writing "no, model performance is not degraded because I say so" serve as correcting misinformation? It only does if you are shilling for Anthropic (which you do), otherwise it's just hot air.
solenoid0937 | 15 minutes ago
> "no, model performance is not degraded because I say so" serve as correcting misinformation?
Because zero evidence has been provided other than feelings. That is not evidence of degradation, and we know they don't serve quants.
fendy3002 | 9 hours ago
johnfink8 | 7 hours ago
When you're on a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.
onlyrealcuzzo | 6 hours ago
It was given very simple ways to verify success. It simply didn't do that and said it's at a good stopping point, despite moving in the WRONG direction not even doing 1% of the task, and being told to see the task through to completion.
Meanwhile, Codex broke it down into 3 steps and just got it done...
No, "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."
Claude worked on it for almost 200 commits over 2 weeks, needing to typically prompt it 3x to even TRY to make any progress instead of just wasting tokens to ignore me and tell me how big and risky it is.
Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.
shomp | 4 hours ago
onlyrealcuzzo | 4 hours ago
Tell it what to do.
Commit, push to origin, review on GitHub.
Tell it to make changes, amend the commit, push --force-with-lease.
I'm attempting to make a memory safe language like Rust but with a substantially lower learning curve and added safety (but non-zero cost abstractions) fully with AI, almost entirely from my phone, commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.
Mostly to test how much LLMs can actually scale development.
Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.
From a purely cost basis perspective, it's hard to argue they aren't killing it.
But from a multiplier perspective, it's up in the air how great they are.
It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.
So at the self hosting phase, I get a great opportunity to see if the language can actually deliver on what I dream for.
shomp | 4 hours ago
onlyrealcuzzo | 3 hours ago
LLMs don't really scale if you're still the bottleneck, or they only scale as much as you reviewing every line of code - that's not that much scaling...
So I try to only review certain parts, like making sure they aren't changing tests to allow architecturally broken code to slip through (because they regularly try, even when given explicit instructions not to). Or if I'm watching them make changes on my phone and see that they are clearly doing the exact opposite of what they're supposed to be doing (regularly if I'm watching).
#2 -> if commits are small, GitHub's setup is good enough that you can review code on your phone.
#3 -> if they're huge, I can just review on my laptop at lunch or something.
Theoretically, all of this can be solved easily with orchestration and require minimal oversight.
If you're using LLMs to write code and you're carefully reviewing every line with a jade-handled magnifying glass, you're not really scaling - at least to the degree I'm interested in.
zarzavat | 2 hours ago
This only works if there are no consequences if your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.
dsagent | a day ago
Similarly companies seem to reward high token usage as a sign of someone willing to play ball with AI and again have forced higher costs on themselves for people reward hacking or using tokens out of spite.
QuiEgo | a day ago
kridsdale1 | 22 hours ago
Fun fact, up until you face a consequence for crime, all crime is free! Have fun and go win the competition game against your co-workers.
cityofdelusion | 22 hours ago
QuiEgo | 21 hours ago
RevEng | 19 hours ago
InsideOutSanta | 22 hours ago
o10449366 | a day ago
I found Opus 4.7 to be slow and wasteful with token usage. It's shocking how inefficient it is with tasks like bash tool usage and web searching, delegating them to a dozen subagents only to get stuck and never return until you esc and intervene. That, in addition to all of the broken tooling Anthropic built in to limit token usage like the broken monitoring tool made managing Claude a chore. I was happy to pay $200/month for Opus 4.5 when they had more capacity, but 4.7 felt like a huge step back and no longer worth the price and inconvenience.
I remember an OpenAI employee comment on the GPT5.5 release post about how they specifically geared it towards long-horizon tasks and it's been a breath of fresh air in that regard. I have five two-week long sessions going right now and there's been no degradation in performance or efficiency. It's much better at carrying rules/learnings forward even in long-running sessions and grounding/refreshing itself in verified facts when it loses context.
It's funny because in two weeks I've gotten way more done with GPT5.5 with way fewer tokens and way less handholding. I think this goes to show how important the tooling and the harness are and how a capable model like Opus 4.7 can be severely handicapped by bad product decisions.
gnat | 23 hours ago
beering | 4 hours ago
josefritzishere | a day ago
rnxrx | a day ago
ares623 | 23 hours ago
stock_toaster | 23 hours ago
While LLM Opex is "some future quarter" and very easy to co-mingle with other expenses.
thewebguyd | 23 hours ago
So you're getting 2 for the price of 1.5. Scale that up to 500 devs at a big company and it's a big chunk of change saved on payroll.
Keeping your headcount or hiring humans instead, AI would have to start to cost upwards of $15k/month/developer or more before it costs more than hiring. You're looking at about 4 billion tokens per month before humans start to break even or are cheaper.
jayd16 | 23 hours ago
thewebguyd | 22 hours ago
But even taking a more realistic 1.25x (20% time savings) gain, let's say you drop from 500 to 400 devs, you'd have to hit around $4,000/dev/month in token spend before hiring humans again would break even.
Payroll is just expensive, in most companies it's by far the biggest expense. AI still has to cost drastically more before investors would call it out as being worse than increasing headcount, from a pure dollars perspective.
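The break-even arithmetic above can be sketched as follows. The $15,000 fully loaded monthly cost per developer is an assumption borrowed from the $15k/month figure earlier in the thread, and the function name is made up for illustration.

```python
def break_even_token_spend(devs_before, devs_after, monthly_cost_per_dev):
    """Monthly AI spend per remaining developer that uses up the payroll savings."""
    if devs_after <= 0 or devs_after > devs_before:
        raise ValueError("devs_after must be between 1 and devs_before")
    # Payroll saved by the headcount reduction, per month.
    savings = (devs_before - devs_after) * monthly_cost_per_dev
    # Spread that saving across the developers who remain.
    return savings / devs_after


# 1.25x productivity: 500 developers' worth of output from 400 people.
# Assumed fully loaded cost: $15,000 per developer per month.
print(break_even_token_spend(500, 400, 15_000))  # 3750.0
```

With those inputs the result is $3,750 per developer per month, in line with the "around $4,000" figure. A higher assumed payroll cost pushes the break-even point up proportionally.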
mrgoldenbrown | 17 hours ago
ilovecake1984 | 22 hours ago
__mharrison__ | 6 hours ago
dividedbyzero | 23 hours ago
kridsdale1 | 22 hours ago
thewebguyd | 22 hours ago
dividedbyzero | 22 hours ago
Applejinx | 10 hours ago
user34283 | 11 hours ago
With research and hardware near guaranteed to bring the efficiency way up, I'm not scared here of massive price hikes.
There is no moat.
andrewl-hn | 22 hours ago
This would never fly if stock market was rational. But it never is.
dawnerd | 11 hours ago
marcosdumay | 2 hours ago
I wonder if this will happen before they have some obligatory debloating of the investors' exposure to the company.
ngc248 | 12 hours ago
JumpCrisscross | 11 hours ago
This is, in my opinion, tripe. SWEs are being laid off because of post-Covid over-hiring. The only evidence for labour destruction is in junior hires. But not because anyone is being fired, but because entry-level jobs are being cannibalised.
Ekaros | 9 hours ago
visarga | 7 hours ago
Nobody can make a profit with AI. Any clever idea can be cloned with AI, competition makes it unprofitable. No moat, no arbitrage opportunity. "During the gold rush, the only people making money were the men selling shovels."
We can definitely do amazing things with AI, and it makes us have superpowers, but so does everyone else. My competition also uses AI. I have to keep up with an AI powered competition now.
JumpCrisscross | 3 hours ago
ben_w | an hour ago
ako | 9 hours ago
skeledrew | a day ago
wg0 | 23 hours ago
rvz | 23 hours ago
kridsdale1 | 22 hours ago
chris_money202 | 8 hours ago
sergiomattei | 22 hours ago
Shalomboy | 22 hours ago
andrewl-hn | 22 hours ago
RevEng | 21 hours ago
relevant_stats | 22 hours ago
> I understand that Microsoft is planning to remove most of its Claude Code licenses and push many of its developers to use Copilot CLI instead. While Claude Code has been a popular addition, it has also undermined Microsoft’s new GitHub Copilot CLI coding tool — a command line version of GitHub Copilot that runs outside of development apps like Visual Studio Code.
And people here are interpreting this as related mainly to Claude burning too many tokens too quickly and suggesting Microsoft should rather use SomeOtherLLM©?
Is this Hacker News or rather Marketing Wars?
RobRivera | 22 hours ago
That message from Carlos's son
relevant_stats | 21 hours ago
johnnypangs | 12 hours ago
s_dev | 11 hours ago
ninjagoo | 8 hours ago
No public forum is naturally immune to the spread of (guerilla) marketing. [1]
[1] Internet Rule #48
righthand | 4 hours ago
wolvoleo | 21 hours ago
DeathArrow | 12 hours ago
nobodywillobsrv | 12 hours ago
gmerc | 12 hours ago
matt3210 | 11 hours ago
sreekanth850 | 11 hours ago
iamflimflam1 | 11 hours ago
What they wanted was for them to use both and give feedback on which was better.
The developers voted with their feet and didn’t use Copilot.
What Microsoft were hoping was that the opposite would happen...
cfunderburg | 11 hours ago
stanac | 11 hours ago
rplnt | 11 hours ago
skywhopper | 10 hours ago
Especially if you want effective results.
avadodin | 8 hours ago
bitexploder | 6 hours ago
mattmanser | 10 hours ago
At the moment it seems like the way it's been trained has been tightly coupled with grep.
It does feel bizarre though that it doesn't use the symbol servers.
a1o | 10 hours ago
https://code.visualstudio.com/docs/copilot/reference/workspa...
AlexMoffat | 7 hours ago
gbro3n | 11 hours ago
tags2k | 11 hours ago
lanstin | 4 hours ago
bakies | 4 hours ago
mattmanser | 10 hours ago
These days I just use Claude Code Desktop or Claude Code in PowerShell. Standalone, not inside an IDE. Honestly, I'm using Desktop more and more as it gets more features.
The IDE is for me. No AI in it at all. If I want to get Claude to do something specific to a file I just @ the file.
serf | 10 hours ago
if your response is "my prompts don't produce code that needs values flipped, ever." then I would wager you're only touching very simple things with an LLM.
for me I don't care about the token cost and prompt writing so much as the fact that it's just faster to change 0 to 1 and leaves me twiddling my thumbs for an llm output less.
NiloCK | 9 hours ago
On balance, and via dictation, it feels likely to be faster overall to just enact the changes I want 'inline' of the conversation thread.
Is this stuff any better now? I think current harnesses probably do have things like file change listeners that automatically inform agents before they act on a file they've previously engaged with if it has changed in the meantime.
ThunderSizzle | 8 hours ago
Having said that, I fear what June 1st brings for Copilot. It might suddenly be very useless for me.
thevillagechief | 6 hours ago
christophilus | 7 hours ago
That said, I never tried copilot.
brianwawok | 6 hours ago
krzyk | 6 hours ago
fendy3002 | 9 hours ago
All of them are valid use cases of the VSCode CC extension for me.
subscribed | 8 hours ago
Tab completion.
Smart model can cut down time to write complex firewall yaml dramatically, relying both on the existing file and the ugly draft (eg comma delimited details of the rules I need) I put out. It makes it 5 minutes lead time and 20 presses of tab instead of writing a shell/python full of edge cases or just copying existing rules as a template and laboriously editing them -- smart model knows what the specific firewall needs.
But I'm not a developer, so I use both - haiku via github for tab completion and CC for cli.
vovavili | 8 hours ago
metaltyphoon | 8 minutes ago
This doesn’t mean much if you are using a terminal editor.
Sharlin | 7 hours ago
harimau777 | 7 hours ago
I can also click on a file referenced by the AI and have it open immediately in the IDE so that I can inspect it.
Finally, it is a pain to write long, multi-line prompts in a CLI where you can't easily click around to edit different parts.
The primary weakness I've found in IDE based UI is that it struggles to get through the corporate security in order to run commands.
ninjagoo | 8 hours ago
MS thinks CoPilot is the Clark Griswold of LLMs when it's really Cousin Eddie...
RA_Fisher | 7 hours ago
mirekrusin | 7 hours ago
__mharrison__ | 6 hours ago
RA_Fisher | 5 hours ago
verst | 11 hours ago
Honestly I find GitHub Copilot CLI (and now also the new GitHub Copilot app) quite decent. I mostly use it with Opus 4.7, or rarely with GPT-5.5. The VSCode extension is ok, but CLI or app are the better experience IMO.
RA_Fisher | 7 hours ago
krzyk | 6 hours ago
gofreddygo | 10 hours ago
Underlying model choice still has no restrictions. Opus 4.6 is by far the most popular. there's still big $$$ bills going anthropic's way.
comboy | 10 hours ago
fendy3002 | 10 hours ago
4.7 IMO is around 10-20% worse at understanding your prompt intention. You need more effort to explain your intention more clearly so it doesn't divert.
TheAceOfHearts | 9 hours ago
Keyframe | 9 hours ago
fendy3002 | 9 hours ago
meowface | 8 hours ago
It's not a bad idea to skip it and wait until the next model release, but I personally will stick with 4.7.
techpression | 8 hours ago
jondwillis | 2 hours ago
EdwardDiego | 7 hours ago
siva7 | 6 hours ago
fendy3002 | 4 hours ago
epistasis | 4 hours ago
Looking now I see that "Opus 4.6 Legacy" is an option that was not there before, so maybe Anthropic noticed that others are having the same difficulty.
SequoiaHope | 9 hours ago
zuppy | 9 hours ago
lifthrasiir | 8 hours ago
trollbridge | 5 hours ago
willtemperley | 8 hours ago
I've spent the last couple of days building Swift bindings to a monster CPP lib and I've actually had fun.
EdwardDiego | 7 hours ago
I had far more hallucinations with 4.7 than 4.6.
I'll try it again after a few more months for them to get it right, but 4.6 is what changed my mind on LLMs as a tool, and 4.7 felt like a step backwards, so for now I'm sticking with something that has delivered me value, instead of arguing with a model ostensibly better that was making shit up 1 - 2 times a day. It was really disappointing.
I can give examples if needed, I screenshotted the most aggravating ones, but what worries me is which ones I didn't recognise.
samastur | 7 hours ago
/model command returns only 4 choices for me: Opus 4.7, two Sonnet options and Haiku.
whateveracct | 6 hours ago
samastur | 4 hours ago
For anyone else who may want this, use: export ANTHROPIC_MODEL=claude-opus-4-6
krzyk | 6 hours ago
Maybe this is because I'm on API pricing? (All new contracts for corps are pushed to that by Anthropic).
cpeterso | 5 hours ago
putlake | 3 hours ago
/model claude-opus-4-6[1M]
whiplash451 | 4 hours ago
theptip | 2 hours ago
UltraSane | an hour ago
zmmmmm | 7 hours ago
pimeys | 2 hours ago
Although GPT's been acting weird since Thursday...
nijave | 2 hours ago
vasco | 9 hours ago
bdavbdav | 9 hours ago
fortran77 | 7 hours ago
gwerbin | 6 hours ago
fortran77 | 6 hours ago
krzyk | 6 hours ago
And you get token-based pricing starting June 1.
verdverm | 4 hours ago
Personally, I looked into Copilot's prompt and saw things that made me put it down immediately to start working on my own. I'm now using OpenCode for reasons and I like it better than any Big Ai tool. Using OC with Qwen3.6-MoE (for context) and generally happy with the results.
Insanity | 7 hours ago
I think Kiro might have some “first mover” advantage internally, but CC feels better to use.
fg137 | 6 hours ago
GitHub Copilot is in a somewhat similar place as Microsoft's toy but still different -- it was more or less the first coding agent/assistant, and GitHub/VSCode/Microsoft has enough user base and impact to influence individual users and enterprises' choices.
For Amazon's coding agent -- I just never see anyone outside Amazon even mention Kiro or Amazon Q. Maybe a little bit when Kiro was offering tons of free credits. But I don't think it's even remotely relevant these days. I don't see news about companies adopting Kiro.
To me, it's just a matter of time before they are sunset, like Chime or a bunch of AWS products.
Insanity | 6 hours ago
For Kiro, I agree with you, it seems like wasted effort and Anthropic / OpenAI are miles ahead in their tooling.
cameronh90 | 6 hours ago
I love AWS at the infrastructure level, but their PaaS tends to be meh, and their end-user directed stuff is usually atrocious.
rescbr | 4 hours ago
They were a manufacturing org and only managers had licenses to MS Office and users in Active Directory. Everybody else was registered on a separate OpenLDAP directory to avoid paying MS licenses.
Chime was cheaper per user than onboarding everybody into AD and paying Teams, and they could tack Chime usage into their AWS bill.
__mharrison__ | 6 hours ago
There's a large (and growing!) contingent of people who don't write code these days. (Many don't even use the keyboard.)
cameronh90 | 6 hours ago
Obviously you want to be aware of what else is on the market, and use the right tool for the job -- but equally if you have a directly competing product, you'd prefer your org's telemetry and suggestions are directed towards improving your own software rather than your competitors'.
Anon1096 | 6 hours ago
Compared to working at other big techs, where I was able to direct msg the engineers on the team for internal protobuf or datalake services in addition to user groups that were generally responsive it was just strange. Also Microsoft doesn't have a monorepo so you can't just commit patches to their service because you don't have access to their repos which I pretty regularly do elsewhere.
Quothling | 3 hours ago
Technically we're using Copilot and we're paying for it through Microsoft licenses, but it's using Opus 4.7. Even before this, most of our custom agents within m365 copilot were one of the GPT models.
Or maybe you're right and they want their developers to use the copilot models.
keyle | 11 hours ago
Arguably, Copilot is GPT 5? Not sure what the CLI offers behind the covers.
patentlyze | 11 hours ago
It. is. so. bad.
It feels like it's at least 1-2 years behind the current top models.
gbro3n | 11 hours ago
tored | 9 hours ago
keyle | 7 hours ago
alternatex | 6 hours ago
meowkit | 11 hours ago
The CLI can swap to whatever model (/models) based on your subscriptions.
The copilots on desktop or Office Apps are likely just GPT5 nano or other tiny models with cheap inference
golf1052 | 10 hours ago
usernametaken29 | 10 hours ago
zabil | 10 hours ago
Also it became very hard to convince management to keep both Claude code and GitHub Copilot enterprise licenses.
cbdevidal | 9 hours ago
cbdevidal | 9 hours ago
mellosouls | 9 hours ago
Github Copilot offered probably the best value and was IMO underappreciated for a long time; I've been an annual subscriber since day 1.
The changes announced a few days ago completely revoke that value proposition, I doubt I'll continue with it.
cbdevidal | 9 hours ago
mellosouls | 9 hours ago
Changes to GitHub Copilot individual plans
https://news.ycombinator.com/item?id=47838508
GitHub Copilot is moving to usage-based billing
https://news.ycombinator.com/item?id=47923357
Multipliers for annual subscribers:
https://docs.github.com/en/copilot/reference/copilot-billing...
cbdevidal | 8 hours ago
When I am away from home I’ll run autossh on my dinosaur road laptop (which probably has 8MB video RAM lol) to connect to the home PC’s LLMs. Gemini assured me that this should run well over my intermittent cellular connection.
You just saved me some headache and money :-D
ramigb | 5 hours ago
__mharrison__ | 6 hours ago
New pricing model changes that. I will still keep it around for autocompletion (for the rare times when I open up an editor).
dminik | 9 hours ago
maxignol | 7 hours ago
harimau777 | 7 hours ago
Saving money on tokens isn't something that's rewarded during performance reviews; particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
cowsandmilk | 6 hours ago
krzyk | 6 hours ago
mschuster91 | 6 hours ago
ggititel | 5 hours ago
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
ekidd | 5 hours ago
This is why I have switched nearly all of my personal coding experiments over to Qwen3.6 27B. Opus makes it easy to gloss over too much and to delegate too much. And so I don't build sufficient memory of the code to provide long-term oversight.
But Qwen3.6 27B sits on a really interesting balance point. It understands code well enough to get 80% of the way to a good design, and it can fully implement a well-specified feature. But if my understanding of the code starts to weaken, things start going wrong much more quickly than they do with Claude.
Opus will happily take complex code beyond the point of salvation, if you allow it. I'm currently cleaning up a successful prototype code base right now, one that was partially vibe-coded and now needs to be put into production. And Opus generated massive amounts of tech debt. So clearly people who lean into vibe coding will need to keep upgrading their models for many years to keep up with the mess created by earlier models.
trollbridge | 5 hours ago
Opus is relegated to the planning / design phase.
maplethorpe | 5 hours ago
Have you tried Claude Opus 4.7?
ggititel | 5 hours ago
For example you might have a great design/architecture session and then run out of context. The next agent tries to piece things together from fragments of conversation and such. But it often starts going off on tangents, searching overly broad to understand, misses cues and nuance, all-the-while burning tokens.
As other articles have put it: AI makes doing the easy things easier and the hard things harder. Because hard things require creativity.
To bring this back to the original post: companies need people, and they shouldn’t expect that they can fire half their workforce and replace it with AI. Quite the contrary. The faster companies move with AI, the more technical debt they’ll end up with. It’s a guarantee.
“If you want to travel fast, go alone. If you want to travel far, go together.”
krzyk | 5 hours ago
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
trollbridge | 5 hours ago
I have DS-V4-Pro agents pretty much running 24/7. The cost is inconsequential. The same cannot be said for anything from Anthropic.
ponector | 5 hours ago
ac29 | 4 hours ago
Consumer sentiment is in the gutters certainly. But objective measures of the economy like unemployment and real wages look good to excellent
https://fred.stlouisfed.org/series/UNRATE
https://fred.stlouisfed.org/series/LES1252881600Q
throwaway85825 | 4 hours ago
mschuster91 | 3 hours ago
Oh hell no, ever since the tail end of Biden the trend for unemployment is showing upwards when corrected for seasonal effects [1], and for real wage growth the situation has been worse for an even longer time [2] - if not for the effects of the post covid stimulus packages plus emergency wage raises following the energy cost explosion thanks to the Russian invasion of Ukraine.
The story the stonk markets tell is completely decoupled from reality, partially because the AI wash trading bubble keeps distorting the statistics, partially because no matter what the stonk markets only can grow up because pension contributions keep blowing up the market [3]. Not getting that difference was what blew up Biden's reelection and is now screwing over Trump.
[1] https://www.bls.gov/charts/employment-situation/civilian-une...
[2] https://www.atlantafed.org/research-and-data/data/wage-growt...
[3] https://news.ycombinator.com/item?id=48233492
bachmeier | 3 hours ago
williamDafoe | an hour ago
ninkendo | 5 hours ago
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
some-guy | 5 hours ago
Insanity | 4 hours ago
While they obviously want a high quality product, no outages, a responsive system etc, I don’t think they necessarily understand why you need to avoid creating god-objects, need to reason about abstractions, etc.
vips7L | 4 hours ago
drob518 | 3 hours ago
HeyLaughingBoy | 2 hours ago
thayne | 2 hours ago
HeyLaughingBoy | 2 hours ago
Most environments only care about the output. In the case I'm thinking of, Software made it perfectly clear to Management, most of whom were former engineers, that the product desperately needed redesign in some ways. But as long as the cost of that redesign exceeded the cost to get the next version out, it could be postponed. This went on for years.
xena | 4 hours ago
snoman | 3 hours ago
It’ll take production incidents, impacted customers, and brand damage to make the executives start to prioritize quality over quantity again.
joshribakoff | 4 hours ago
vips7L | 4 hours ago
chocrates | 4 hours ago
fauigerzigerk | 3 hours ago
willj | 4 hours ago
generic92034 | 2 hours ago
willj | an hour ago
generic92034 | 2 hours ago
gottorf | an hour ago
Besides, it's probably counterproductive in the long run to think of strong worker rights as being opposed to the employer wanting higher productivity out of the worker.
generic92034 | 40 minutes ago
The expectation of higher productivity measured by completely useless means, letting a highly qualified employee jump through hoops for the amusement and misconceptions of the C-level.
gchamonlive | 4 hours ago
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
gchamonlive | 4 hours ago
And the tragedy is that this isn't sustainable, and we all involved deeply in tech know this. There is eventually going to be a big reality check the companies will have to pay, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us at least for now and for the foreseeable future. However when the rope eventually snaps these executives at best will fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones that will suffer through the next big layoffs.
drob518 | 3 hours ago
joquarky | 2 hours ago
They call themselves "risk takers" to justify their high pay.
confidantlake | 2 hours ago
rebolek | 3 hours ago
gchamonlive | 2 hours ago
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I'd not be sad about the job itself, but about the dev who had a mortgage to pay and is now replaced by a machine churning out crap code while their superiors get sore from patting themselves on the back.
williamDafoe | an hour ago
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
rixed | an hour ago
Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some mcp servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
Terretta | 4 hours ago
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
epistasis | 4 hours ago
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular older Claude models than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller, answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the more the model is clever rather than clear and simple, the worse the outcome, the more that the model turns into an architecture astronaut, the worse the outcome.
drob518 | 3 hours ago
opsnooperfax | 2 hours ago
majormajor | 2 hours ago
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
lumost | 4 hours ago
apsurd | 2 hours ago
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
giancarlostoro | 4 hours ago
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
bob1029 | 4 hours ago
Churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game theoretical dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
cwsx | 3 hours ago
arw0n | 3 hours ago
When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
rcleveng | an hour ago
""" Steve Ballmer In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand line of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS 2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing. """
From https://www.pbs.org/nerds/part2.html
williamDafoe | an hour ago
gausswho | 3 hours ago
We may be on the cusp of the AI age's new era of 'measure twice, cut once'.
theptip | 2 hours ago
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
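A rough sketch of how ratios like these could be computed, assuming a hypothetical log of suggestion events. The `SuggestionEvent` record, its fields, and the sample numbers are all invented for illustration and do not come from any real tool's telemetry:

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    """Hypothetical record of one AI suggestion shown to a developer."""
    accepted: bool    # the developer kept the suggestion
    edited: bool      # the developer modified it after accepting
    lines_added: int  # lines the suggestion contributed to the final diff


def adoption_metrics(events, total_diff_lines):
    """Return simple usage ratios from illustrative event data."""
    shown = len(events)
    accepted = [e for e in events if e.accepted]
    accepted_as_is = [e for e in accepted if not e.edited]
    ai_lines = sum(e.lines_added for e in accepted)
    return {
        # share of shown suggestions that were accepted at all
        "acceptance_rate": len(accepted) / shown if shown else 0.0,
        # share of accepted suggestions kept without edits
        "accepted_unedited_rate": len(accepted_as_is) / len(accepted) if accepted else 0.0,
        # share of the diff that came from accepted suggestions
        "ai_share_of_diff": ai_lines / total_diff_lines if total_diff_lines else 0.0,
    }


# Made-up sample data.
events = [
    SuggestionEvent(accepted=True, edited=False, lines_added=10),
    SuggestionEvent(accepted=True, edited=True, lines_added=5),
    SuggestionEvent(accepted=False, edited=False, lines_added=0),
    SuggestionEvent(accepted=True, edited=False, lines_added=15),
]
metrics = adoption_metrics(events, total_diff_lines=60)
print(metrics)
```

None of these ratios says anything about quality on its own; they only show whether and how the tool is being used.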
So it’s a pragmatic metric that got a bit Goodharted.
idopmstuff | an hour ago
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
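As a sketch of the "review the top 10%" idea, something like the following could pick out who gets reviewed. The user names and token counts are made up, and `top_token_users` is a hypothetical helper, not part of any billing or admin API:

```python
import math


def top_token_users(usage, fraction=0.10):
    """Return the users in the top `fraction` by token count (at least one)."""
    if not usage:
        return []
    k = max(1, math.ceil(len(usage) * fraction))
    ranked = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:k]]


# Made-up monthly token counts per developer.
usage = {
    "dev_01": 120_000,
    "dev_02": 95_000,
    "dev_03": 410_000,
    "dev_04": 88_000,
    "dev_05": 150_000,
    "dev_06": 230_000,
    "dev_07": 2_900_000,
    "dev_08": 175_000,
    "dev_09": 60_000,
    "dev_10": 310_000,
}
flagged = top_token_users(usage)
print(flagged)
```

The output is only a list of whose usage to look at; deciding whether that usage was justified still takes a human reading what the tokens were spent on.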
HappMacDonald | 7 minutes ago
Remember that the entire mantra of "productivity is a measure of how many shovels you break and replace" is only ever echoed by the one selling the shovels.
locknitpicker | an hour ago
I don't buy it. Old models such as GPT4.1 were faster than newer reasoning models, and their output was as good. Newer models end up wasting an ungodly amount of time with chain-of-thought steps which can be a complete waste of time if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.
jgalt212 | 7 hours ago
heisenbit | 5 hours ago
lou1306 | 5 hours ago
If anything, it's forced dogfooding, i.e., forcing their own workforce to beta-test their product.
plaidfuji | 5 hours ago
Between Copilot, Claude, and Gemini, I still actually prefer Gemini. I do a lot of scientific writing in addition to coding and Gemini is the only model I can trust to “just be right”. This trust then transfers over to its code output.
totalhack | 5 hours ago
siva7 | 4 hours ago
la64710 | 5 hours ago
goldylochness | 5 hours ago
loloquwowndueo | 4 hours ago
bel8 | 49 minutes ago
2) Opus is not even unambiguously best at coding anymore. GPT 5.5 splits that title for some time now.
3) I would have probably done the same in his position. Dogfood the product.
jadar | 4 hours ago
Side note, it's so frustrating that The Verge puts a paywall at the fold. It makes me feel like the rest of the story is not worth reading. I'm not inclined to pay $2 to read a link that was posted on an aggregator.
thisislife2 | 3 hours ago
Kapura | 3 hours ago
geoffbp | 3 hours ago
gradientsrneat | an hour ago
https://github.blog/news-insights/company-news/github-copilo...
Claude tokens are priced by GitHub at a disproportionately premium price compared to Gemini and OpenAI. I wonder why?
https://docs.github.com/en/copilot/reference/copilot-billing...
fredcallagan | 21 minutes ago