The Claude Coding Vibes Are Getting Worse

41 points by ciferkey a day ago on lobsters | 12 comments

quasi_qua_quasi | 21 hours ago

I think part of the frustrating loop with these is that

  • the core metric you evaluate these on is very difficult to quantify ("how good is it at my tasks") and has lots of confounds (your tasks change over time, your expectations change, etc.)
  • even if you can quantify it, providers can change the model in ways that no longer let you use the old version for comparison
  • there is also no way to know whether the service runners are fiddling with stuff and not telling you

viraptor | 5 hours ago

Does anyone have a link to that continuously running benchmark checking for daily quality regressions? It's annoyingly hard to find now.

Edit: had to use Claude to find it https://marginlab.ai/trackers/claude-code/

simonw | 22 hours ago

This morning I had the brand new Claude Opus 4.7 and Qwen3.6-35B-A3B - a 21GB model file running directly on my laptop - draw me pelicans and I liked the Qwen one better.

jcelerier | 17 hours ago

Tried the Qwen 3.6 q4_m, but it was completely unable to write me a simple shader after four tries, something older Sonnet versions did without making a mistake.

elobdog | 5 hours ago

Hahahaha... I learnt something new today about pelicans riding a bicycle! Thank you.

cpurdy | 9 hours ago

QOTD from Hyperpape (elsewhere on the interwebs)

The metaphor I subscribe to with LLMs is that they’re like a talisman that accentuates your inbuilt tendencies.

To the extent that you might tend to jump to conclusions or be sloppy, they exacerbate that. To the extent that you are careful, they can be a tool for doing more.

Most of us are neither intellectual saints nor pure slop producers, so we have to be very careful.

matheusmoreira | 21 hours ago

Adaptive thinking is the only thinking-on mode, and in our internal evaluations it reliably outperforms extended thinking.

Boris literally went to HN and advised people to turn off adaptive thinking because it was buggy to the point of allocating zero thinking tokens to important things.

jrgtt | 12 hours ago

Is one now expected to know what a techfluencer said in some HN thread just to run reliable software?

matheusmoreira | 12 hours ago

He's not an influencer, he's the guy responsible for Claude Code. I see what you mean though. HN being the tech world's unofficial support page is a somewhat perverse outcome but at least it allows people to engage with insiders. Complaining loudly enough until it reaches the HN frontpage and some employee sees it seems to be extremely effective. For some companies like Google that seems to be the only way to get support.

About the issue I was referring to: adaptive thinking would sometimes allocate literally zero thinking tokens to a turn, leading to stupidity.

The data points at adaptive thinking under-allocating reasoning on certain turns — the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. we're investigating with the model team. interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.
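For anyone hunting for it later, that interim workaround amounts to setting a single environment variable before launching Claude Code. A minimal sketch, assuming a POSIX shell; the variable name is taken verbatim from the quoted post:

```shell
# Force a fixed reasoning budget instead of letting the model
# allocate thinking tokens per-turn (env var name from the quoted post).
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
# Then launch Claude Code in the same shell, e.g.:
# claude
```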

And now, less than two weeks after that incident, they release a model where it's not possible to disable adaptive thinking!

"The proprietary tool that is trained using a poorly understood random process with unknown data and evaluated on luftgeschäft benchmarks is not behaving as I've come to intuitively expect."

What a shock it must be!

alexandria | 44 minutes ago

I really like the way you put this. Exactly!

bbrown | 7 hours ago

I think GIGO is alive and well, as is Conway's Law: a program reflects the engineering culture that developed it.

https://techtrenches.dev/p/the-snake-that-ate-itself-what-claude