I feel like Claude Code is starting to fall over from being entirely written by LLMs. How do you even begin to fix precise bugs in a 1M+ LOC codebase all written by AI? It seems like LLMs are great for quickly adding large new features but not great for finding and fixing edge-cases.
This is real. I’ve seen some baffling bugs in prompt based stop hook behavior.
When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway. Then I went poking on GitHub and found a vibed fix diff that changed the behavior in a totally new direction (it did not update the documentation).
Seems like everyone over there is vibing and no one is rationalizing the whole.
>Seems like everyone over there is vibing and no one is rationalizing the whole.
Claude Code creator literally brags about running 10 agents in parallel 24/7.
It doesn't just seems like it, they confirmed like it is the most positive thing ever.
It's software engineering crack. Starting a project feels amazing, features are shipping, a complex feature in the afternoon - ezpz. But AI lacks permanence, for every feature you start over from scratch, except there is more of codebase now, but the context window is still the same. So there is drift, codebase randomizes, edge cases proliferate, and the implementation velocity slows down.
Full disclosure - I am a heavy codex user and I review and understand every line of code. I manually fight spurious tests it tries to add by pointing a similar one already exists and we can get coverage with +1 LOC vs +50. It's exhausting, but personal productivity is still way up.
I think the future is bright because training / fine-tuning taste, dialing down agentic frameworks, introducing adversarial agents, and increasing model context windows all seem attainable and stackable.
I think that the current test suite is far too small. For the Claude Code codebase, a sensible next step would be to generate thousands of tests. Without that kind of coverage, regressions are likely, and the existing checks and review process do not appear sufficient to reliably prevent them.
My request is that an entirely LLM-written feature should only be eligible for merge once all of those generated tests pass, so we have objective evidence that the change preserves existing behavior.
I usually have multiple agents up working on a codebase. But it's typically 1 agent building out features and 1 or 2 agents code reviewing, finding code smells, bad architecture, duplicated code, stale/dead code, etc.
I'm definitely faster, but there's a lot of LLM overhead to get things done right. I think if you're just using a single agent/session you're missing out on some of the speed gains.
I think a lot of the gains I get using an LLM is because I can have the multiple different agent sessions work on different projects at the same time.
I know at least one of the companies behind a coding agent we all have heard of has called in human experts to clean up their vibe coded IAC mess created in the last year.
I’m happy to throw an LLM at our projects but we also spend time refactoring and reviewing each other’s code. When I look at the AI-generated code I can visualize the direction it’s headed in—lots of copy-pasted code with tedious manual checks for specific error conditions and little thought about how somebody reading it could be confident that the code is correct.
I can’t understand how people would run agents 24/7. The agent is producing mediocre code and is bottlenecked on my review & fixes. I think I’m only marginally faster than I was without LLMs.
> with tedious manual checks for specific error conditions
And specifically: Lots of checks for impossible error conditions - often then supplying an incorrect "default value" in the case of those error conditions which would result in completely wrong behavior that would be really hard to debug if a future change ever makes those branches actually reachable.
I always thought that the vast majority of your codebase, the right thing to do with an error is to propagate it. Either blindly, or by wrapping it with a bit of context info.
I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case. It’s worth knowing about the error cases, but it requires a lot more knowledge and reasoning about the current state of the program to think about how they should be handled. Not something you can figure out just by looking at a snippet.
The answer (as usual) is reinforcement learning. They gave ten idiots some code snippets, and all of them went for the "belt and braces" approach. So now thats all we get, ever. It's like the previous versions that spammed emojis everywhere despite that not being a thing whatsoever in their training data. I don't think they ever fixed that, just put a "spare us the emojis" instruction in the system prompt bandaid.
Training data from junior programmers or introductory programming teaching material. No matter how carefully one labels data, the combination of programming’s subjectivity (damaging human labeling and reinforcement’s effectiveness at filtering around this) and the sheer volume of low-experience code in the input corpus makes this condition basically inevitable.
Garbage in garbage out as they say. I will be the first to admit that Claude enables me to do certain things that I simply could not do before without investing a significant amount of time and energy.
At the same time, the amount of anti-patterns the LLM generates is higher than I am able to manage. No Claude.md and Skills.md have not fixed the issue.
Building a production grade system using Claude has been a fools errand for me. Whatever time/energy i save by not writing code - I end up paying back when I read code that I did not write and fixing anti-patterns left and right.
I rationalized by a bit - deflecting by saying this is AI's code not mine. But no - this is my code and it's bad.
> At the same time, the amount of anti-patterns the LLM generates is higher than I am able to manage. No Claude.md and Skills.md have not fixed the issue.
This is starting to drive me insane. I was working on a Rust cli that depends on docker and Opus decided to just… keep the cli going with a warning “Docker is not installed” before jumping into a pile of garbage code that looks like it was written by a lobotomized kangaroo because it tries to use an Option<Docker> everywhere instead of making sure its installed and quitting with an error if it isn’t.
What do I even write in a CLAUDE.md file? The behavior is so stupid I don’t even know how to prompt against it.
> I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case.
Think about it, they have to work in a very limited context window. Like, just the immediate file where the change is taking place, essentially. Having broader knowledge of how the application deals with particular errors (catch them here and wrap? Let them bubble up? Catch and log but don't bubble up?) is outside its purview.
I can hear it now, "well just codify those rules in CLAUDE.md." Yeah but there's always edge cases to the edge cases and you're using English, with all the drawbacks that entails.
I have encoded rules against this in CLAUDE.md. Claude routinely ignores those rules until I ask "how can this branch be reached?" and it responds "it can't. So according to <rule> I should crash instead" and goes and does that.
This is my biggest frustration with the code they generate (but it does make it easy to check if my students have even looked at the generated code). I dont want to fail silently or hard code an error message, it creates a pile of lies to work through for future debugging
Writing bad tests and error handling have been the worst performance part of Claude for me.
In particular writing tests that do nothing, writing tests and then skipping them to resolve test failures, and everybody's favorite: writing a test that greps the source code for a string (which is just insane, how did it get this idea?)
Seriously. Maybe 60% of the time I use claude for tests, the "fix" for the failing tests is also to change the application code so the test passes (in some cases it will want to make massive architecture changes to accomodate the test, even if there's an easy way to adapt the test to better fit the arch). Maybe half the time that's the right thing to do, but the other half the time it is most definitely not. It's a high enough error rate that it borderlines on useful.
Usually you want to fix the code that's failing a test.
The assumption is that your test is right. That's TDD. Then you write your code to conform to the tests. Otherwise what's the point of the tests if you're just trying to rewrite them until they pass?
An expectation of professionalism, training and written material on software design, providing incentives (like promotions) to not produce crap, etc.
It's not a world where everything produced is immediately verified.
If a human consistently only produced the quality of work Claude Opus 4.5 is capable of I would expect them to be fired from just about any job in short order. Yes, they'd get some stuff done, but they'd do too much damage to be worth it. Of course humans are much more expensive than LLMs to manage so this doesn't mean it can't be a useful tool... just it's not that useful a tool yet.
Humans may be prone to err, but they don't confabulate like LLMs do. Also, the unit tests are done by people who know intimately the expected behavior of the code, which surprisingly, it's frequently the same programmer.
This can be abused because the programmer is both judge and jury, but people tend to handle this paradox much better than LLMs.
1. Competent humans architecting and leading the system who understand the specs, business needs, have critical thinking skills and are good at their job
2. Automated tests
3. Competent human reviewers
4. QA
5. Angry users
Cutting out 1 and 3 in favor of more tests isn't gunna work
Ugh, I just think everyone in these threads are talking past each other.
I'm personally not advocating for not having humans in the loop. I don't know of anybody using llm tools or advocating for them that are saying there shouldn't be humans in the loop. "vibe coding" seems to mean different things to different people.
Not my experience at all when I occasionally try making something purely coded by AI for fun. It starts off fine but the pile of sub-optimal patterns slowly builds towards an unmaintainable mess with tons of duplication of code, and state that somehow needs to be kept in sync. Tests and linters can't test that the code is actually reasonable code...
Doesn't mean it's not a useful tool - if you read and think about the output you can keep it in check. But the "100% of my contributions to Claude Code were written by Claude Code" claim by the creator makes me doubt this is being done.
PS. In the 5 minutes between starting and finishing writing the parent comment https://claude.ai/settings/usage just stopped displaying my quota usage... fun.
Everyone has been stressing over losing their job because of AI. I'm genuinely starting to think this will end in 5x more work needing to clean up the mess caused. Who's going to maintain all this generated code?
Most of us in the financial side of this space think so as well. This is why AI Ludditism doesn't make sense - CAT Hydraulic Excavators didn't end manual shovelers, it forced them to upskill.
Similarly, Human-in-the-loop utilization of AI/ML tooling in software development is expected and in fact encouraged.
Any IP that is monetizable and requires significant transformation will continue to see humans-in-the-loop.
Weak hiring in the tech industry is for other reasons (macro changes, crappy/overpriced "talent", foreign subsidies, demanding remote work).
As in the ranking/mental model increasingly being used by management in upper market organizations.
A Coding copilot subscription paired with a competent developer dramatically speeds up product and feature delivery, and also significantly upskills less competent developers.
That said, truly competent developers are few and far between, and the fact that developers in (eg.) Durham or remote are demanding a SF circa 2023 base makes the math to offshore more cost effective - even if the delivered quality is subpar (which isn't neccesarily true), it's good enough to release, and can be refactored at a later date.
What differentiates a "competent" developer from an "average" developer is the learning mindset. Plenty of people on HN kvetch about being forced to learn K8s, Golang, Cloud Primitives, Prompt Engineering, etc or not working in a hub, and then bemoan the job market.
If we are paying you IB Associate level salaries with a fraction of the pedigree and vetting needed to get those roles, upskilling is the least you can do.
We aren't paying mid 6 figure TC for a code monkey - at that point we may as well entirely use AI and an associate at Infosys - we are paying for critical and abstract thinking.
As such, AI in the hands of a truly competent engineer is legitimately transformative.
> Who's going to maintain all this generated code?
Other AI agents, I guess. Call Claude in to clean up code written by Gemini, then ChatGPT to clean up the bugs introduced by Claude, then start the cycle over again.
This is probably tongue in cheek, but I literally do this and it works.
I've had one llm one-shot a codebase. Then I use another one to review (with a pretty explicit prompt). I take that review and feed it to another agent to refactor. Repeat that a bunch of times.
That would be possible if you had just the spec, but after sometime most of the code will not have been generated through the original spec, but through lots of back and forth for adding features and big fixing. No way to run all that again.
Not that old big non-AI software doesn't have similar maintainability issues (I keep posting this example, but I don't actually want to callthat company out specifically, the problem is widespread: https://news.ycombinator.com/item?id=18442941).
That's why I'm reluctant to complain about the AI code issues too much. The problem of how software is written, on the higher level, the teams, the decisions, the rotating programmers, may be bigger than that of any particular technology or person actually writing the code.
I remember a company where I looked at a contractor job, they wanted me to fix a lot of code they had received from their Eastern European programmers. They complained about them a lot in our meeting. However, after hearing them out I was convinced the problem was not the people generating the code, but the ones above them who failed to provide them with accurate specs and clear guidance, and got surprised at the very end that it did not work as expected.
Similar with AI. It may be hard to disentangle what is project management, what is actually the fault of the AI. I found that you can live with pockets of suboptimal but mostly working code well enough, even adding features and fixing bugs easily, if the overall architecture is solid, and components are well isolated.
That is why I don't worry too much about the complaints here about bad error checks and other small stuff. Even if it is bad, you will have lots of such issues in typical large corporate projects, even with competent people. That's because programmers keep changing, management focuses on features over anything else (usually customers, internal or external, don't pay for code reorg, only for new features). The layers above the low level code are more important in deciding if the project is and remains viable.
From what the commenters say, it seems to me the problem starts much higher than the Claude code, so it is hard to say how much at fault AI generated code actually is IMHO. Whether you have inexperienced juniors or an AI producing code, you need solid project lead and architecture layers above the lines of code first of all.
That's why all the code in my project is generated from the "prompts" (actually just regular block comments + references) and so all of that is checked in.
In the cloud with a micro-service architecture this just makes sense. Expose an API and call it a day, who cares what's behind the API as long as it follows the spec.
Using AI doesn’t really change the fact that keeping ones and zeroes in check is like trying to keep quicksand in your hands and shape it.
Shaping of a codebase is the name of the game - this has always been, and still, is difficult. Build something, add to it, refactor, abstraction doesn’t sit right, refactor, semantics change, refactor, etc, etc.
I’m surprised at how so few seem to get this. Working enterprise code, many codebases 10-20 years old could just as well have been produced by LLMs.
We’ve never been good at paying debt and you kind of need a bit of OCD to keep a code base in check. LLM exacerbates a lack of continuous moulding as iterations can be massive and quick.
I was part of a big software development team once and that necessity I felt there, namely, being able to let go of the small details and focusing on the big picture is even more important when using llms.
I've been trying opencode a bit with gemini pro (and claude via those) with a rust project, and I have a push pre-hook to cargo check the code.
The amount of times I have to "yell" at the llm for adding #[allow] statements to silence the linter instead of fixing the code is crazy and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
I added a bunch of lines telling it to never do that in CLAUDE.md and it worked flawlessly.
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.
I was doing a somewhat elaborate form/graph in Google Worksheets, had to translate a bunch of cells from English to Spanish, and said "Why not use Gemini for this easy, grunt work? They tend to output good translations".
I spent 20 minutes between guiding it because it was putting the translation in the wrong cells, asking it not to convert the cells to a fancy table, and finally, convincing it that it really had access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.
I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.
I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.
This is one of those places I feel like they're trying to do too much with the LLMs and I think this is one of those places where there's "a bubble". I feel like the LLMs are text tools, so trying to take them out of their domain and force them somewhere else you're going to have problems.
Anyways, I replied because I had something else I wanted to say.
I was using Gemini in a google worksheet a while back. I had to cross reference a website and input stuff into a cell. I got Gemini to do it, had it do the first row, then the second, then I told it to do a batch of 10, then 20. It had a hiccup at 20, would take too long I guess. So I had it go back to 10. But then Gemini tells me it can't read my worksheet. I convince it that it can, but then it tells me it can't edit my worksheet. I argue with it, "you've been changing the worksheet wtf?" I convinced it that it could and it started again, but then after doing a couple it told me it couldn't again. We went back and forth a bit, I'd get it working, it would break, repeat. I think it was after the third time I just couldn't get it to do it again.
I looked up the docs, searched online, and I was concerned that I found Google didn't allow Gemini to do a lot of stuff to worksheets/docs/other google workspace stuff. They said they didn't allow it to do a ton of stuff that I definitely had Gemini doing.
Then a week or two went by and google announced they're allowing gemini to directly edit worksheets.
So wtf how did I get it to do it before it could do it???
I run multiple agents in separate sessions. It starts with one agent, building out features or working on a task/bug fix. Once it gets some progress, I spin up another session and have it just review the code. I explicitly tell it things to look out for. I tell it to let me know about things I'm not thinking of and to make me aware of any blind spots. Whatever it reviews I send back to the agent building out features (I used to also review what the review agent told me about, but now I probably only review it like 20% of the time). I'll also have an agent session started just for writing tests, I tell it to look at the code and see if it's testable, find duplicate code, stale/dead code. And so on and so forth.
Between all of that + deterministic testing it's hard for shit to end up in the code base.
You don't. These seems to be this idea that LLMs can do it all, but the reality is that it itself has limited amounts of memory, and thus context.
And this is not tied to the LLMs. It's that to EVERYTHING we do. There are limits everywhere.
And for humans the context window might be smaller, but at least we have developed methods of abstracting different context windows, by making libraries.
Now, as a trade-off of trying to go super-fast, changes need to be made in response to your current prompts, and there is no time validate behavior in cases you haven't considered.
And regardless of whether you have abstractions in libraries, or whether you have inlined code everywhere, you're gonna have issues.
With libraries changes in behavior are going to impact code in places you don't want, but also, you don't necessarily know, as you haven't tested all paths.
With inlined code everywhere you're probably going to miss instances, or code goes on to live its own life and you lose track of it.
They built a skyscraper while shifting out foundational pieces. And now a part of the skyscraper is on the foundation of your backyard shed.
What differences do you see between AI written codebases and a codebase written by engineers? Both parties create buggy code, but I can imagine the types of bugs are different. Is it just that bug fixing doesn't really scale because we don't have the ability to chomp down 1M+ LOC codebases into LLM context?
> I feel like Claude Code is starting to fall over from being entirely written by LLMs.
The degradation is palpable.
I have been using vscode github copilot chat with mostly the claude opus 4.5 model. The underlying code for vscode github copilot chat has turned to shit. It will continuously make mistakes no matter what for 20 minutes. This morning I was researching Claude Code and pricing thinking about switching however this post sounds like it has turned to shit also. I don't mind spending $300-$500 a month for a tool that was a month ago accomplishing in a day what would take me 3-4 days to code. However, the days since the last update have been shit.
Clearly the AI companies can't afford to run these models at profit. Do I buy puts?
Just like a leveraged ETF, the returns are twice as good when things are on the up and up, but when you dig a hole it takes three times the effort to dig yourself out because now going down twice as fast, and you're also paying interest (ie, you have no clue where the bodies are burried as you bury them twice as fast).
You do it the same way you fix every other disaster of a code-base. You add a ton of tests and start breaking it up into modules. You then rewrite each module/component/service/etc. one at a time using good practices. That's how every project gets out of the muck.
That's a big, slow, and expensive process though.
Will Anthropic actually do that or will they keep throwing AI at it and hope the AI figures this approach out? We shall see...
With the competition biting at their heels I don't think they have time to do that. They're stuck with what they have. At least until innovation settles a little
This explanation fits my intuition, but from an outsider's perspective, I can't say the user experience with claude code is noticeably more bug-ridden than what is typical for a rapidly scaling startup rushing crap out the door. It's vibes all the way down.
Despite having written a few books on LLM applications I use them sparingly for coding: to design and get started, and occasionally for debugging. I have no interest criticizing other people’s practices but I enjoy mostly writing code myself.
CRITICAL: MAKE NO MISTAKES!
CRITICAL: NEVER APOLOGIZE! MAKE IT RIGHT THE FIRST TIME INSTEAD!
CRITICAL: DO NOT HALLUCINATE OR CONFABULATE EVER!
CRITICAL: DON'T DELETE THE DATABASE WITHOUT ASKING FIRST!
CRITICAL: NEVER USE VERBATIM CODE BLOCKS FROM GPL LICENSED PROJECTS!
CRITICAL: CODE AS IF ELON MUSK WAS LOOKING OVER YOUR SHOULDER ALL THE TIME!
CRITICAL: IF YOU MAKE MISTAKES AGAIN I WILL GET PTSD AND DIE AND IT WILL BE YOUR FAULT!
...
Slightly off topic, but does anyone feel that they nerfed Claude Opus?
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no, it's just your expectations have risen. What was previously mind-blowing improvement is now expected, and any mis-steps feel amplified.
This is not always true. LLMs do get nerfed, and quite regularly, usually because they discover that users are using them more than expected, because of user abuse or simply because it attract a larger user base. One of the recent nerfs is the Gemini context window, drastically reduced.
What we need is an open and independent way of testing LLMs and stricter regulation on the disclosure of a product change when it is paid under a subscription or prepaid plan.
> What we need is an open and independent way of testing LLMs
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
Unfortunately, it's paywalled most of the historical data since I last looked at it, but interesting that opus has dipped below sonnet on overall performance.
Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them if they intend to re-benchmark models after 3 months. I've noticed, in particular, Gemini models dramatically reducing in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly reduced to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtle broken my benchmark but ultimately started seeing the same behavior in general online queries (Google AI Studio).
I usually agree with this. But I am using the same workflows and skills that were a breeze for Claude, but are causing it to run in cycles and require intervention.
This is not the same thing as a "omg vibes are off", it's reproducible, I am using the same prompts and files, and getting way worse results than any other model.
Eh, I've definitely had issues where Claude can no longer easily do what it's previously done. That's with constant documenting things in appropriate markdown files well and resetting context here and there to keep confusion minimal.
Also people who were lucky and had lots of success early on but then start to run into the actual problems of LLMs will experience that as "It was good and then it got worse" even when it didn't actually.
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
Opus was a non-deterministic probability machine in the past, present and the foreseeable future. The variance eventually shows up when you push it hard.
I've observed the same random foreign-language characters (I believe chinese or japanese?) interspersed without rhyme or reason that I've come to expect from low-quality, low-parameter-count models, even while using "opus 4.5".
An upcoming IPO increases pressure to make financials look prettier.
In fact as my prompts and documents get better it seems it does increasingly better.
Still, it can't replace a human, I really need to correct it at all, and if I try to one shot a feature I always end up spending more time refactoring it few days later.
Still, it's a huge boost to productivity, but the time it can take over without detailed info and oversight is far away.
They're A/B testing on the latest opus model, sometimes it's good sometimes it's worse than sonnet annoying as hell. I think they trigger it when you have excessive usage or high context use.
Or maybe when usage is high they tweak a setting that use cache when it shouldn't.
For all we know they do whatever experiment the want, to demonstrate theoretical better margin, to analyse user patterns when a performance drop occur.
Given what is done in other industries which don't face an existential issue, it wouldn't surprise me some whistle blowers in a few years tell us what's been going on.
This has been said about every LLM product from every provider since ChatGPT4.
I'm sure nerfing happens, but I think the more likely explanation is that humans have a tendency to find patterns in random noise.
They are constantly trying to reduce costs which means they're constantly trying to distill & quantize the models to reduce the energy cost per request. The models are constantly being "nerfed", the reduction in quality is a direct result of seeking profitability. If they can charge you $200 but use only half the energy then they pocket the difference as their profit. Otherwise they are paying more to run their workloads than you are paying them which means every request loses them money. Nerfing is inevitable, the only question is how much it reduces response quality & what their customers are willing to put up with.
I don’t care what anyone says about the cycle or that implying that it’s all in our heads. It’s bad bad.
I’m a Max x20 model who had to stop using it this week. Opus was regularly failing on the most basic things.
I regularly use the front end skill to pass mockups and Opus was always pixel perfect. This last week it seemed like the skill had no effect.
I don’t think they are purposely nerfing it but they are definitely using us as guinea pigs. Quantized model? The next Sonnet? The next Haiku? New tokenizing strategies?
I noticed that this week. I have a very straightforward claude command that lists exact steps to follow to fetch PR comments to bring them into the context window. Stuff like step one call gh pr view my/repo and it would call it with anthropiclabs/repo instead, it wouldn’t follow all the instructions, it wouldn’t pass the exact command I had written. I pointed out the mistake and it goes oh you are right! Then proceeded to make the same mistake again.
I used this command with sonnet 4.5 too and have never had a problem until this week. Something changed either in the harness or model. This is not just vibes. Workflows I have run hundreds of times have stopped working with Opus 4.5
I tried using Claude Code this week because I have a free account from my work.
However when I try to log in via CLI it takes me to a webpage with an “Authorize” button. Clicking the button does nothing. An error is logged to the console but nothing displays in the UI.
Sadly their whole frontend seems to be built without QC and mostly blindly assuming a happy path.
For the claude.ai UI, I've never had a single deep research properly transition (and I've done probably 50 or so) to its finished state. I just know to refresh the page after ~10mins to make the report show up.
Do you have API access (platform.claude.com) rather than Claude code (claude.ai)? I had similar issues trying to get Claude CLI working via the second method, not knowing there’s a difference
Oh is this whats been happening? I've been trying to ask question on a fairly long context window and history -- but it fails. No response it kind of acknowledges it received the input but then reprints the last output and then that whole dialogue is essentially dead ... same issue? Happened multiple times - quite frustrating.
Just a pro sub - not max.
Most of the time it gives me a heads up that I'm at 90% but a lot of the times it just failed, no warning, and I assumed it was I hit max.
I’ve also been encountering this behavior, coupled with rapidly declining length of use for a pro account now below an hour, and weekly limits getting hit by Wednesday despite achieving very little other than fixing its own mistakes after compressions.
One day everyone worshipping Claude when it works but once it goes on holiday, vibe-coders don't believe in using their own brains to solve the problem themselves.
Sometimes, poor old Claude wants to go on holiday and that is a problem?!?
> Just my own observation that the same pattern has occurred at least 3 times now:
> release a model; overhype it; provide max compute; sell it as the new baseline
> this attracts a new wave of users to show exponential growth & bring in the next round of VC funding (they only care about MAU going up, couldn’t care less about existing paying users)
> slowly degrade the model and reduce inference
> when users start complaining, initially ignore them entirely
then start gaslighting and make official statements denying any degradation
> then frame it as a tiny minority of users experiencing issues
then, when pressure grows, blame it on an “accidentally” misconfigured servers that “unintentionally” reduced quality (which coincidentally happened to save the company tonnes of $).
As many point out, it’s one thing to have a buggy product and another to ignore users.
Businesses like google were already a step in the wrong direction in terms of customer service, but the new wave of AI companies seem to have decided their only relation to clients is collecting their money.
Unclear costs, no support, gaslighting customers when a model is degraded, incoming rug pulls..
The VS Code plug-in is broken on Windows. The command-line interface is broken on Windows.
I just signed up as a paying customer, only to find that Claude is totally unusable for my purposes at the moment. There's also no support (shocker), despite their claims that you'll be E-mailed by the support team if you file a report.
And if you try to use the "regular" VS Code plug-in mode, it fails with error 3221225477. A search will turn up reports on that too.
If you query Claude itself on this, it will acknowledge that both of these are known problems: "Oh, you've hit this known issue." But no support for paying customers, who are promised a follow-up E-mail after reporting a problem. BULLSHIT. A week later, I haven't heard a peep, and what I paid for is thus far useless.
There's also this issue[1] with about 300 participants about limits being reached much more quickly since they stopped the 2x limit for the holidays. A few people from Anthropic joined the conversation but didn't say much. Some users say they solved the issue by creating a new account or changing their plan.
Something to check is if you’re opted into the test for the 1M context window. A co-worker told me this happened to them. They were burning a lot more tokens in the beta. Seems like creating a new account could track with this (but is obviously the Nuclear option).
I recently put a little money on the API for my personal account. I seem to burn more tokens on my personal account than my day job, in spite of using AI for 4x as long at work, and I’m trying to figure out why.
I hope that at some point companies start competing on quality instead of speed. LLMs will never be able to understand a codebase, and the more capable they get the more dangerous it is to just hand them the permission to blindly implement functionality and fix bugs. Bugs should be going down but they seem more prevalent than ever.
They already are competiting on quality. Why do you think Claude made Opus slower than Sonnet, yet with better benchmark scores.
LLMs do understand codebases and I've been able to get them to make reactors and clean up code without them breaking anything due to them understanding what they are doing.
Bugs are being solved faster than before. Crashes from production can directly be collected and fixed by a LLM with no engineering time needed other than a review.
They are trained on other code, ignore how your codebase is structured, and lack knowledge of it. To do so, you would need to feed the whole codebase every time you ask it for something, with extensive comments about the style, architecture, and so on. No amount of md files will help with that.
In large codebases, they struggle with code reuse, unless you point the agent to look for specific code.
Finding bugs has nothing to do with understanding the codebase. They find local bugs. If they could understand the whole codebase, we would be finding RCEs for popular OSS projects so easily, including browsers.
My guess is SRE culture is a tough sell at Anthropic. When you’re a frontier lab, almost everything else looks more prestigious and more immediately “impactful”.
Claude Code gets functionally worse every update. They need to get their shit together, hilarious to see Amodei at Davos talking big game about AGI and the latest update for a TUI application fucking changes observable behavior (like history scrolling with arrow keys), rendering random characters on the "newest" native version rendered in iTerm2, broken status line ... the list goes on and on.
This is the new status quo for software ... changing and breaking beneath your feet like sand.
I love CC, but there's so many bugs. Even the intended behavior is a mess - CC's VS Code UI bash tool stoped using my .zshrc so now it runs the wrong version of everything.
Codex is a bit better bug-wise but less enjoyable to use than CC. The larger context window and superiority of GPT 5.2 to Opus makes it mostly worth it to switch.
Claude Code's only saving grace is that it's pretty good from a fresh session - it can largely find and re-load into context what it needs to load. If I see my context ticking down, I ask it to give me a summary and TODO list, and either copy it, or have it put that into a docstring of what it's working on. Then just start a fresh session on that file. Shouldn't need to do this, for sure, but it gets it done in a pinch.
My largest gripe with Claude Code, and with encouraging my team to use it, is that checkpoints/rollbacks are still not implemented in the VS Code GUI, leading to a wildly inconsistent experience between terminal and GUI users: https://github.com/anthropics/claude-code/issues/10352
Anecdotal evidence of course but I have one long-running session in a terminal for over a month now. I work with it daily, compacts several times a day, I rollback conversation sometimes.
All with no issues.
Unsure what your use case is, but compaction makes it lose anormous amount of context. Claude code is better used on a task-by-task basis; things get bad. The whole purpose of init and CLAUDE.md is to prevent long chats from losing context and approach more surgically.
For the last month I've been working on a relatively big feature in a larger project.
I often compact the session when starting a new feature, often have to remind claude to read the claude.md etc. I still use it as if it was a new session regularly, it frequently doesn't remember what it did an hour ago, etc.
But the compact seems to work which is a very different experience than the one of the GP, who kills the session when it reaches the context limit and writes explicit summary files.
> checkpoints/rollbacks are still not implemented in the VS Code GUI
Rollbacks have been broken for me in the terminal for over a month. It just didn’t roll back the code most of the time. I’ve totally stopped using the feature and instead just rely on git. Is this this case for others?
I've been using /rewind in claude code (the terminal, not using vscode at all) quite a bit recently without issue - if that's the feature you're asking about.
Not discounting at all that you might "hold it" differently and have a different experience. E.g. I basically avoid claude code having any interaction with the VCS at all - and I could easily VCS interaction being a source of bugs with this sort of feature.
I mean double tapping escape, going back up the history, and choosing the “restore conversation and code” option. Sometimes bits of code are restored, but rarely all changes.
It worked when first released but hasn’t for ages now.
I've been using beads for longer term stuff (todo kinda stuff), have you given that a try?
I literally just posted in another comment that people shouldn't be worried about killing their current session/context window. I used to get worried about compaction and losing context, but now when I feel like things are slipping I kill it quick and start a new session.
This seems like the kind of problem someone with a Max subscription would run into. On my plus subscription, i'm too paranoid of my usage to allow the context window to get that large.
Claude writes all of their code. It's honestly a damning indictment of "AI is gonna replace engineers" when all the code the AI guys are giving us is dog.
People aren't going to forums and social media to hype up their own good code to nearly the same degree as otherwise. It's orders of magnitude more negative. There are ways of using AI well and using it poorly. There's no reason to correct your copmetition's unforced errors, or giving away an advantage in using these tools, for so long as there is a moat of effort and esoteric knowledge.
Just because 99% of the things you read are critical and negatively biased doesn't mean the subsequent determination or the consensus among participants in the public conversation have anything to do with reality.
Amodei is on the record about completely automating AI research in 6-12 months. He thinks it's an "exponential" loop & Anthropic is going to be the first to get there. That's not esoteric knowledge, that's the CEO saying so in public at the same time that their consumer facing tool is failing & their automated abuse detection is banning users for legitimate use cases.
I don't consider Anthropic to be one of the teams using AI particularly well. They're building the tools, they're not using the tools in the best, most skillful way possible.
Tangentially related: I would like to report a low-severity security vulnerability in Claude (web version), but I can't be bothered to go through the Hackerone formalities, since I don't care about a bounty.
Right now I'm defaulting to "do nothing" because I'm lazy, but if any Anthropic staff are reading this I'm happy to explain the details informally somewhere.
This is an N of 1, of course, but I can relate to the other folks who've been expressing their frustration with the state of Claude over the last couple weeks. Maybe it's just that I have higher expectations, but... I dunno, it really seems like Claude Code is just a lot WORSE right now than it was a couple weeks ago. It has constant bugs in the app itself, I have to babysit it a lot tighter, and it just seems ... dumber somehow. For instance, at the moment, it's literally trying to tell me, "No, it's fine that we've got 500 failing tests on our feature branch, because those same tests are passing in development."
I like a light editor with syntax highlighting and basic linting. Last time I was coding regularly I used VS code, but had only the default plugins. I only used it for basic text input. I always ran git and my code from the terminal. Does that help?
Try claude. It's basically the cli agent everyone is catching up to.
Start prompting it for annoying shit "Set up a project layout for X", then write things yourself inside that - the fun stuff or stuff you care about.
Then use it for refactors or extrapolation "I wrote this thing that works, but this old file is still in old format, do what I did there"
It's very good for helping with design of just above layperson knowledge. "I have this problem organizing xyz, what's a good pattern for this?"
or just "I want to do a project that does xyz, but dont know where to start, let's chat about it"
Some of these 'chatty' queries can be done in web, but having it on CLI is great b/c it'll just say "Can I do this for you" and you can easily delegate parts of the plan.
Give it a shot. That's pretty low level agentic use, and yes, it will demolish procrastination and startup inertia.
Somehow this post was pessimized by hn and you’ll probably see some bullshit report from anthropic first but not the actual evidence of anthropic being utterly abysmal and silent about their mistakes.
What really bothers me is that I’ve paid them for that subscription and their support team became rock solid and didn’t utter a word about paying me back for days when I couldn’t use my active chats.
I don’t even vibe code that much in Claude and it still manages not only to fail in existing chats where compaction should work, but also eat out my weekly limits astonishingly fast.
Hmm, so suddenly my stuck chats are compacting and moving on vs starting and immediately stopping again ad nauseam, seems like they fixed something finally?
measurablefunc | a day ago
OGEnthusiast | a day ago
throwjjj | a day ago
rigel8 | a day ago
eunoia | a day ago
When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway. Then I went poking on GitHub and found a vibed fix diff that changed the behavior in a totally new direction (it did not update the documentation).
Seems like everyone over there is vibing and no one is rationalizing the whole.
heliumtera | a day ago
Claude Code creator literally brags about running 10 agents in parallel 24/7. It doesn't just seems like it, they confirmed like it is the most positive thing ever.
TrainedMonkey | a day ago
Full disclosure - I am a heavy codex user and I review and understand every line of code. I manually fight spurious tests it tries to add by pointing a similar one already exists and we can get coverage with +1 LOC vs +50. It's exhausting, but personal productivity is still way up.
I think the future is bright because training / fine-tuning taste, dialing down agentic frameworks, introducing adversarial agents, and increasing model context windows all seem attainable and stackable.
tuhgdetzhh | a day ago
kaydub | a day ago
I'm definitely faster, but there's a lot of LLM overhead to get things done right. I think if you're just using a single agent/session you're missing out on some of the speed gains.
I think a lot of the gains I get using an LLM is because I can have the multiple different agent sessions work on different projects at the same time.
MrDarcy | a day ago
klodolph | a day ago
I can’t understand how people would run agents 24/7. The agent is producing mediocre code and is bottlenecked on my review & fixes. I think I’m only marginally faster than I was without LLMs.
gpm | a day ago
And specifically: Lots of checks for impossible error conditions - often then supplying an incorrect "default value" in the case of those error conditions which would result in completely wrong behavior that would be really hard to debug if a future change ever makes those branches actually reachable.
klodolph | a day ago
I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case. It’s worth knowing about the error cases, but it requires a lot more knowledge and reasoning about the current state of the program to think about how they should be handled. Not something you can figure out just by looking at a snippet.
stefan_ | a day ago
zbentley | a day ago
PrimalPower | a day ago
At the same time, the amount of anti-patterns the LLM generates is higher than I am able to manage. No Claude.md and Skills.md have not fixed the issue.
Building a production grade system using Claude has been a fools errand for me. Whatever time/energy i save by not writing code - I end up paying back when I read code that I did not write and fixing anti-patterns left and right.
I rationalized by a bit - deflecting by saying this is AI's code not mine. But no - this is my code and it's bad.
throwup238 | a day ago
This is starting to drive me insane. I was working on a Rust cli that depends on docker and Opus decided to just… keep the cli going with a warning “Docker is not installed” before jumping into a pile of garbage code that looks like it was written by a lobotomized kangaroo because it tries to use an Option<Docker> everywhere instead of making sure its installed and quitting with an error if it isn’t.
What do I even write in a CLAUDE.md file? The behavior is so stupid I don’t even know how to prompt against it.
xienze | a day ago
Think about it, they have to work in a very limited context window. Like, just the immediate file where the change is taking place, essentially. Having broader knowledge of how the application deals with particular errors (catch them here and wrap? Let them bubble up? Catch and log but don't bubble up?) is outside its purview.
I can hear it now, "well just codify those rules in CLAUDE.md." Yeah but there's always edge cases to the edge cases and you're using English, with all the drawbacks that entails.
gpm | a day ago
human_person | a day ago
colechristensen | a day ago
In particular writing tests that do nothing, writing tests and then skipping them to resolve test failures, and everybody's favorite: writing a test that greps the source code for a string (which is just insane, how did it get this idea?)
withinboredom | a day ago
freedomben | a day ago
kaydub | a day ago
The assumption is that your test is right. That's TDD. Then you write your code to conform to the tests. Otherwise what's the point of the tests if you're just trying to rewrite them until they pass?
skerit | a day ago
einpoklum | a day ago
That is not an uncommon occurrence in human-written code as well :-\
tobyjsullivan | a day ago
> Automation doesn't just allow you to create/fix things faster. It also allows you to break things faster.
https://news.ycombinator.com/item?id=13775966
Edit: found the original comment from NikolaeVarius
data_ders | a day ago
nrds | a day ago
agumonkey | a day ago
egeozcan | a day ago
Then again, the google home page was broken on FF on Android for how long?
cyanydeez | a day ago
AstroBen | a day ago
You can assert that something you want to happen is actually happening
How do you assert all the things it shouldn't be doing? They're endless. And AI WILL mess up
It's enough if you're actively reviewing the code in depth.. but if you're vibe coding? Good luck
brookst | a day ago
gpm | a day ago
It's not a world where everything produced is immediately verified.
If a human consistently only produced the quality of work Claude Opus 4.5 is capable of I would expect them to be fired from just about any job in short order. Yes, they'd get some stuff done, but they'd do too much damage to be worth it. Of course humans are much more expensive than LLMs to manage so this doesn't mean it can't be a useful tool... just it's not that useful a tool yet.
ASalazarMX | a day ago
This can be abused because the programmer is both judge and jury, but people tend to handle this paradox much better than LLMs.
AstroBen | a day ago
1. Competent humans architecting and leading the system who understand the specs, business needs, have critical thinking skills and are good at their job
2. Automated tests
3. Competent human reviewers
4. QA
5. Angry users
Cutting out 1 and 3 in favor of more tests isn't gunna work
kaydub | a day ago
I'm personally not advocating for not having humans in the loop. I don't know of anybody using llm tools or advocating for them that are saying there shouldn't be humans in the loop. "vibe coding" seems to mean different things to different people.
gpm | a day ago
Doesn't mean it's not a useful tool - if you read and think about the output you can keep it in check. But the "100% of my contributions to Claude Code were written by Claude Code" claim by the creator makes me doubt this is being done.
gpm | a day ago
Edit: And 3 minutes later it is back...
AstroBen | a day ago
dawnerd | a day ago
alephnerd | a day ago
Similarly, Human-in-the-loop utilization of AI/ML tooling in software development is expected and in fact encouraged.
Any IP that is monetizable and requires significant transformation will continue to see humans-in-the-loop.
Weak hiring in the tech industry is for other reasons (macro changes, crappy/overpriced "talent", foreign subsidies, demanding remote work).
AI+Competent Developer paid $300k TC > Competent Developer paid $400k TC >>> AI+Average Developer paid $30k TC >> Average Developer paid $40k TC >>>>> Average Developer paid $200k TC
fuzzzerd | a day ago
Huh?
alephnerd | a day ago
A Coding copilot subscription paired with a competent developer dramatically speeds up product and feature delivery, and also significantly upskills less competent developers.
That said, truly competent developers are few and far between, and the fact that developers in (eg.) Durham or remote are demanding a SF circa 2023 base makes the math to offshore more cost effective - even if the delivered quality is subpar (which isn't neccesarily true), it's good enough to release, and can be refactored at a later date.
What differentiates a "competent" developer from an "average" developer is the learning mindset. Plenty of people on HN kvetch about being forced to learn K8s, Golang, Cloud Primitives, Prompt Engineering, etc or not working in a hub, and then bemoan the job market.
If we are paying you IB Associate level salaries with a fraction of the pedigree and vetting needed to get those roles, upskilling is the least you can do.
We aren't paying mid 6 figure TC for a code monkey - at that point we may as well entirely use AI and an associate at Infosys - we are paying for critical and abstract thinking.
As such, AI in the hands of a truly competent engineer is legitimately transformative.
Tl;dr - Mo' money, Mo' expectations
kaydub | a day ago
FeteCommuniste | a day ago
Other AI agents, I guess. Call Claude in to clean up code written by Gemini, then ChatGPT to clean up the bugs introduced by Claude, then start the cycle over again.
kaydub | a day ago
I've had one llm one-shot a codebase. Then I use another one to review (with a pretty explicit prompt). I take that review and feed it to another agent to refactor. Repeat that a bunch of times.
inimino | a day ago
nosianu | a day ago
Not that old big non-AI software doesn't have similar maintainability issues (I keep posting this example, but I don't actually want to callthat company out specifically, the problem is widespread: https://news.ycombinator.com/item?id=18442941).
That's why I'm reluctant to complain about the AI code issues too much. The problem of how software is written, on the higher level, the teams, the decisions, the rotating programmers, may be bigger than that of any particular technology or person actually writing the code.
I remember a company where I looked at a contractor job, they wanted me to fix a lot of code they had received from their Eastern European programmers. They complained about them a lot in our meeting. However, after hearing them out I was convinced the problem was not the people generating the code, but the ones above them who failed to provide them with accurate specs and clear guidance, and got surprised at the very end that it did not work as expected.
Similar with AI. It may be hard to disentangle what is project management, what is actually the fault of the AI. I found that you can live with pockets of suboptimal but mostly working code well enough, even adding features and fixing bugs easily, if the overall architecture is solid, and components are well isolated.
That is why I don't worry too much about the complaints here about bad error checks and other small stuff. Even if it is bad, you will have lots of such issues in typical large corporate projects, even with competent people. That's because programmers keep changing, management focuses on features over anything else (usually customers, internal or external, don't pay for code reorg, only for new features). The layers above the low level code are more important in deciding if the project is and remains viable.
From what the commenters say, it seems to me the problem starts much higher than the Claude code, so it is hard to say how much at fault AI generated code actually is IMHO. Whether you have inexperienced juniors or an AI producing code, you need solid project lead and architecture layers above the lines of code first of all.
inimino | 2 hours ago
AstroBen | a day ago
I'd much rather make plans based on reality
inimino | 2 hours ago
direwolf20 | a day ago
ssl-3 | a day ago
If the code is cheap (and it certainly is), then tossing it out and replacing it can also be cheap.
kaydub | a day ago
jordanbeiber | a day ago
Shaping of a codebase is the name of the game - this has always been, and still, is difficult. Build something, add to it, refactor, abstraction doesn’t sit right, refactor, semantics change, refactor, etc, etc.
I’m surprised at how so few seem to get this. Working enterprise code, many codebases 10-20 years old could just as well have been produced by LLMs.
We’ve never been good at paying debt and you kind of need a bit of OCD to keep a code base in check. LLM exacerbates a lack of continuous moulding as iterations can be massive and quick.
egeozcan | 13 hours ago
tyfon | a day ago
The amount of times I have to "yell" at the llm for adding #[allow] statements to silence the linter instead of fixing the code is crazy and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
egeozcan | a day ago
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.
ASalazarMX | a day ago
egeozcan | 17 hours ago
At least with AGENTS/CLAUDE.md file, you know the agent will re-read those rules on every new session.
ASalazarMX | a day ago
I spent 20 minutes between guiding it because it was putting the translation in the wrong cells, asking it not to convert the cells to a fancy table, and finally, convincing it that it really had access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.
I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.
I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.
kaydub | a day ago
Anyways, I replied because I had something else I wanted to say.
I was using Gemini in a google worksheet a while back. I had to cross reference a website and input stuff into a cell. I got Gemini to do it, had it do the first row, then the second, then I told it to do a batch of 10, then 20. It had a hiccup at 20, would take too long I guess. So I had it go back to 10. But then Gemini tells me it can't read my worksheet. I convince it that it can, but then it tells me it can't edit my worksheet. I argue with it, "you've been changing the worksheet wtf?" I convinced it that it could and it started again, but then after doing a couple it told me it couldn't again. We went back and forth a bit, I'd get it working, it would break, repeat. I think it was after the third time I just couldn't get it to do it again.
I looked up the docs, searched online, and I was concerned that I found Google didn't allow Gemini to do a lot of stuff to worksheets/docs/other google workspace stuff. They said they didn't allow it to do a ton of stuff that I definitely had Gemini doing.
Then a week or two went by and google announced they're allowing gemini to directly edit worksheets.
So wtf how did I get it to do it before it could do it???
kaydub | a day ago
Manage that yourself! If you have hooks throwing errors then feed the error back into the llm.
swalsh | a day ago
egeozcan | 20 hours ago
kaydub | a day ago
I run multiple agents in separate sessions. It starts with one agent, building out features or working on a task/bug fix. Once it gets some progress, I spin up another session and have it just review the code. I explicitly tell it things to look out for. I tell it to let me know about things I'm not thinking of and to make me aware of any blind spots. Whatever it reviews I send back to the agent building out features (I used to also review what the review agent told me about, but now I probably only review it like 20% of the time). I'll also have an agent session started just for writing tests, I tell it to look at the code and see if it's testable, find duplicate code, stale/dead code. And so on and so forth.
Between all of that + deterministic testing it's hard for shit to end up in the code base.
OptionOfT | a day ago
And this is not tied to the LLMs. It's that to EVERYTHING we do. There are limits everywhere.
And for humans the context window might be smaller, but at least we have developed methods of abstracting different context windows, by making libraries.
Now, as a trade-off of trying to go super-fast, changes need to be made in response to your current prompts, and there is no time validate behavior in cases you haven't considered.
And regardless of whether you have abstractions in libraries, or whether you have inlined code everywhere, you're gonna have issues.
With libraries changes in behavior are going to impact code in places you don't want, but also, you don't necessarily know, as you haven't tested all paths.
With inlined code everywhere you're probably going to miss instances, or code goes on to live its own life and you lose track of it.
They built a skyscraper while shifting out foundational pieces. And now a part of the skyscraper is on the foundation of your backyard shed.
quietsegfault | a day ago
dataviz1000 | a day ago
The degradation is palpable.
I have been using vscode github copilot chat with mostly the claude opus 4.5 model. The underlying code for vscode github copilot chat has turned to shit. It will continuously make mistakes no matter what for 20 minutes. This morning I was researching Claude Code and pricing thinking about switching however this post sounds like it has turned to shit also. I don't mind spending $300-$500 a month for a tool that was a month ago accomplishing in a day what would take me 3-4 days to code. However, the days since the last update have been shit.
Clearly the AI companies can't afford to run these models at profit. Do I buy puts?
swalsh | a day ago
bmurphy1976 | a day ago
That's a big, slow, and expensive process though.
Will Anthropic actually do that or will they keep throwing AI at it and hope the AI figures this approach out? We shall see...
AstroBen | a day ago
borg16 | a day ago
folks have created software by "vibe coding". It is now time to "face the music" when doing so for production grade software at scale.
root_axis | a day ago
mark_l_watson | a day ago
ankit219 | a day ago
charcircuit | a day ago
heliumtera | a day ago
cube00 | a day ago
ASalazarMX | a day ago
esafak | a day ago
AznHisoka | a day ago
jampa | a day ago
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
eterm | a day ago
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no, it's just your expectations have risen. What was previously mind-blowing improvement is now expected, and any mis-steps feel amplified.
quentindanjou | a day ago
What we need is an open and independent way of testing LLMs and stricter regulation on the disclosure of a product change when it is paid under a subscription or prepaid plan.
Analemma_ | a day ago
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
Maxious | 21 hours ago
https://www.anthropic.com/engineering/a-postmortem-of-three-...
judahmeek | 3 hours ago
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
What am I missing here?
landl0rd | a day ago
Unfortunately, it's paywalled most of the historical data since I last looked at it, but interesting that opus has dipped below sonnet on overall performance.
dudeinhawaii | 20 hours ago
jampa | a day ago
This is not the same thing as a "omg vibes are off", it's reproducible, I am using the same prompts and files, and getting way worse results than any other model.
eterm | a day ago
It has a habit of trusting documentation over the actual code itself, causing no end of trouble.
Check your claude.md files (both local and ~user ) too, there could be something lurking there.
Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.
F7F7F7 | 22 hours ago
spike021 | a day ago
mrguyorama | a day ago
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
olao99 | 14 hours ago
ojr | 14 hours ago
landl0rd | a day ago
An upcoming IPO increases pressure to make financials look prettier.
epolanski | a day ago
In fact as my prompts and documents get better it seems it does increasingly better.
Still, it can't replace a human, I really need to correct it at all, and if I try to one shot a feature I always end up spending more time refactoring it few days later.
Still, it's a huge boost to productivity, but the time it can take over without detailed info and oversight is far away.
kachapopopow | a day ago
hirako2000 | a day ago
Or maybe when usage is high they tweak a setting that use cache when it shouldn't.
For all we know they do whatever experiment the want, to demonstrate theoretical better margin, to analyse user patterns when a performance drop occur.
Given what is done in other industries which don't face an existential issue, it wouldn't surprise me some whistle blowers in a few years tell us what's been going on.
root_axis | a day ago
measurablefunc | 23 hours ago
cap11235 | a day ago
F7F7F7 | 22 hours ago
I’m a Max x20 model who had to stop using it this week. Opus was regularly failing on the most basic things.
I regularly use the front end skill to pass mockups and Opus was always pixel perfect. This last week it seemed like the skill had no effect.
I don’t think they are purposely nerfing it but they are definitely using us as guinea pigs. Quantized model? The next Sonnet? The next Haiku? New tokenizing strategies?
ryanar | 12 hours ago
I used this command with sonnet 4.5 too and have never had a problem until this week. Something changed either in the harness or model. This is not just vibes. Workflows I have run hundreds of times have stopped working with Opus 4.5
paulhebert | a day ago
However when I try to log in via CLI it takes me to a webpage with an “Authorize” button. Clicking the button does nothing. An error is logged to the console but nothing displays in the UI.
We reached out to support who have not helped.
Not a great first impression
hobofan | a day ago
For the claude.ai UI, I've never had a single deep research properly transition (and I've done probably 50 or so) to its finished state. I just know to refresh the page after ~10mins to make the report show up.
roywiggins | a day ago
https://github.com/anthropics/claude-code/issues/14222
F7F7F7 | 22 hours ago
attheicearcade | a day ago
paulhebert | 19 hours ago
But it’s confusing.
The docs say run “Claude” and then pick your option.
I tried multiple options and neither worked
boringg | a day ago
Just a pro sub - not max.
Most of the time it gives me a heads up that I'm at 90% but a lot of the times it just failed, no warning, and I assumed it was I hit max.
kingkawn | a day ago
kilroy123 | a day ago
rvz | a day ago
Sometimes, poor old Claude wants to go on holiday and that is a problem?!?
daredoes | a day ago
---
> Just my own observation that the same pattern has occurred at least 3 times now:
> release a model; overhype it; provide max compute; sell it as the new baseline
> this attracts a new wave of users to show exponential growth & bring in the next round of VC funding (they only care about MAU going up, couldn’t care less about existing paying users)
> slowly degrade the model and reduce inference
> when users start complaining, initially ignore them entirely then start gaslighting and make official statements denying any degradation
> then frame it as a tiny minority of users experiencing issues then, when pressure grows, blame it on an “accidentally” misconfigured servers that “unintentionally” reduced quality (which coincidentally happened to save the company tonnes of $).
BoredPositron | a day ago
kace91 | a day ago
Businesses like google were already a step in the wrong direction in terms of customer service, but the new wave of AI companies seem to have decided their only relation to clients is collecting their money.
Unclear costs, no support, gaslighting customers when a model is degraded, incoming rug pulls..
cs02rm0 | a day ago
I cancelled my subscription.
VerifiedReports | a day ago
I just signed up as a paying customer, only to find that Claude is totally unusable for my purposes at the moment. There's also no support (shocker), despite their claims that you'll be E-mailed by the support team if you file a report.
brookst | a day ago
What symptoms do you see? There are some command line parameters for reinstall / update that might be worth trying.
EMM_386 | a day ago
https://github.com/anthropics/claude-code/issues/769
VerifiedReports | 19 hours ago
And if you try to use the "regular" VS Code plug-in mode, it fails with error 3221225477. A search will turn up reports on that too.
If you query Claude itself on this, it will acknowledge that both of these are known problems: "Oh, you've hit this known issue." But no support for paying customers, who are promised a follow-up E-mail after reporting a problem. BULLSHIT. A week later, I haven't heard a peep, and what I paid for is thus far useless.
copirate | a day ago
[1] https://github.com/anthropics/claude-code/issues/16157
codazoda | a day ago
I recently put a little money on the API for my personal account. I seem to burn more tokens on my personal account than my day job, in spite of using AI for 4x as long at work, and I’m trying to figure out why.
MicKillah | a day ago
cheschire | a day ago
deadbabe | a day ago
elemdos | a day ago
charcircuit | a day ago
LLMs do understand codebases and I've been able to get them to make reactors and clean up code without them breaking anything due to them understanding what they are doing.
Bugs are being solved faster than before. Crashes from production can directly be collected and fixed by a LLM with no engineering time needed other than a review.
f311a | 16 hours ago
They are trained on other code, ignore how your codebase is structured, and lack knowledge of it. To do so, you would need to feed the whole codebase every time you ask it for something, with extensive comments about the style, architecture, and so on. No amount of md files will help with that.
In large codebases, they struggle with code reuse, unless you point the agent to look for specific code.
Finding bugs has nothing to do with understanding the codebase. They find local bugs. If they could understand the whole codebase, we would be finding RCEs for popular OSS projects so easily, including browsers.
whoevercares | a day ago
MadsRC | a day ago
mccoyb | a day ago
This is the new status quo for software ... changing and breaking beneath your feet like sand.
direwolf20 | a day ago
cheriot | a day ago
kilroy123 | a day ago
Cursor, Claude code, Claude in the browser, and don't even get me started on Gemini.
mbm | a day ago
btown | a day ago
My largest gripe with Claude Code, and with encouraging my team to use it, is that checkpoints/rollbacks are still not implemented in the VS Code GUI, leading to a wildly inconsistent experience between terminal and GUI users: https://github.com/anthropics/claude-code/issues/10352
kuboble | a day ago
system2 | a day ago
kuboble | a day ago
For the last month I've been working on a relatively big feature in a larger project.
I often compact the session when starting a new feature, often have to remind claude to read the claude.md etc. I still use it as if it was a new session regularly, it frequently doesn't remember what it did an hour ago, etc.
But the compact seems to work which is a very different experience than the one of the GP, who kills the session when it reaches the context limit and writes explicit summary files.
hknceykbx | a day ago
nojs | a day ago
Rollbacks have been broken for me in the terminal for over a month. It just didn’t roll back the code most of the time. I’ve totally stopped using the feature and instead just rely on git. Is this this case for others?
gpm | a day ago
Not discounting at all that you might "hold it" differently and have a different experience. E.g. I basically avoid claude code having any interaction with the VCS at all - and I could easily VCS interaction being a source of bugs with this sort of feature.
nojs | a day ago
It worked when first released but hasn’t for ages now.
ninninninnin | a day ago
F7F7F7 | 22 hours ago
I’d much rather have the terminal version working again though.
kaydub | a day ago
I literally just posted in another comment that people shouldn't be worried about killing their current session/context window. I used to get worried about compaction and losing context, but now when I feel like things are slipping I kill it quick and start a new session.
swalsh | a day ago
lifetimerubyist | a day ago
observationist | a day ago
Just because 99% of the things you read are critical and negatively biased doesn't mean the subsequent determination or the consensus among participants in the public conversation have anything to do with reality.
measurablefunc | a day ago
observationist | a day ago
Dario is delusional, for this and other reasons.
lifetimerubyist | 10 hours ago
Yeah but Anthropic are and you'd think that if they were doing that the code would be..you know...good?
Retr0id | a day ago
Right now I'm defaulting to "do nothing" because I'm lazy, but if any Anthropic staff are reading this I'm happy to explain the details informally somewhere.
xyzsparetimexyz | 12 hours ago
delduca | a day ago
smithkl42 | a day ago
jimnotgym | a day ago
jvanderbot | a day ago
I like cli tools, and claude is generally considered a very good option for that.
I have a coworker who likes codex better.
jimnotgym | a day ago
jvanderbot | a day ago
Start prompting it for annoying shit "Set up a project layout for X", then write things yourself inside that - the fun stuff or stuff you care about.
Then use it for refactors or extrapolation "I wrote this thing that works, but this old file is still in old format, do what I did there"
It's very good for helping with design of just above layperson knowledge. "I have this problem organizing xyz, what's a good pattern for this?"
or just "I want to do a project that does xyz, but dont know where to start, let's chat about it"
Some of these 'chatty' queries can be done in web, but having it on CLI is great b/c it'll just say "Can I do this for you" and you can easily delegate parts of the plan.
Give it a shot. That's pretty low level agentic use, and yes, it will demolish procrastination and startup inertia.
jimnotgym | 15 hours ago
[OP] nurimamedov | a day ago
blks | a day ago
eboye | a day ago
bastard_op | 23 hours ago
ec109685 | 22 hours ago
Apple and Google do same thing with their silly forums.
trenchgun | 18 hours ago