Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
Simon mentions further along in his article that given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), that it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
December 2025 was the breakthrough for me.
January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss altough 2x limits temporarily, not sure about Claude it's sort of okay still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
I only used Claude first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the 20$ subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
The openclaw ban pushed me over to 5.5 for some daily usage. I feel like Opus and 5.5 are good at very different things. 5.5 can be too literal, and it does not have as much of a ‘creative’ bent whether that’s toward design, UI/UX, interpreting vague instructions, etc. So, in that way, Opus had sort of spoiled me.
On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.
Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time.
I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple months ago, the claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.
It's real for me as a non coder previously uploading a python script asking it to add this function or that function used to break it now usually it just works at least with Claude and Chat Gpt models. Google Gemini still breaks stuff but rumors are their new flash model that will be announced soon is very good. I am usually working with data in csv files and generating spreadsheet pdf etc and the results for that has improved dramatically.
That’s me. Built a scraper do dump stuff to a csv of a list of images for further ocr and openCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on that what I think it should do instead.
And I know where to make slight changes without burning my allotments.
Paradox - you can get multiple inflection points even as systems start to have dimishing marginal returns in core capability, I think this is due to 'threshold crossing' where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofers process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some stuff myself just to learn how the enigma encryption machine worked, so wrote myself to learn. But professionally, I stopped coding in November.
Someone competent using them is today a requirement and for awhile will make the marginal utility of skilled workers greater than that of unskilled. The justification is that they are much more productive than they were before.
Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
While I certainly like parentheses highlighting and rainbow parentheses, I've programmed Clojure without syntax highlighting and while it’s not as nice as it would be with, it’s fine.
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
Please see Ben Evans’ podcast on a good take on this. Coding is just one of the task you do in your job, it is not the job or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value to the company. If this is automated then yes sadly your salary is not justified.
To me, LLM's free up time for me so that I can spend time on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.
I agree with this. I feel like there’s a false dichotomy right now in a lot of these discussions where one can only vibe code or only code by hand. It is possible to do both…
But we’ve had tools to automate out the boilerplate for years. We don’t need ai for that. It’s seriously like we all forgot we could run one command and scaffold a project. AI isn’t even that great at it. Last I tried a month ago it used a really out of date version of nextjs and picked all sorts of random deps that weren’t in the plan.
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
Or copy paste another file and edit the 10 lines that are actually different. The nice thing is that when you have an epiphany that you’ve already done this twice and that it’s for the same purpose, so you abstract the code and remove 100 lines from the project.
You have no idea how many times I’ve asked “why are we not using the project generator” or “why did you write 200 lines to parse a csv? Here’s a library and five lines to get it done” in the last year. Its easily up 20x compared to pre ai, and getting worse.
I agree, but the reality is that most people work to make a living, not to have fun. If you enjoy your job because you mostly get to write code in a tight feedback loop instead of doing the "hard" work of planning, writing and reviewing specs, balancing customer requirements, and the lot, you have a very privileged life. And those jobs are probably going to get fewer now.
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
I totally agree. I loved coding because of its closed feedback loop. Since last November, I also delegated it mostly to agents. Now I concentrate more on the design part, which is not the same. However, you move with the times and hope something else will become exciting. I do not know a more worthwhile and satisfying way than computing to spend my work hours.
I watched the last one S5:E17 What jobs are AI jobs and I think it gives the right framing to think about this. It is not prescriptive, it does not give a list which is smart. The job title might be the same but the actual role might have different context so the best is to have the right frame to explore your particular situation.
How do you justify your salary given that you're just using OSS compiler/editor any of us could use for free in your role ?
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with LLM is that you have a bunch of thinking traces saved as a part of your iteration proces - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.
They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to takeover and figure it out.
How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?
Supply and demand. Not many people are good at programming and it's highly in demand.
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
it can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.
this was always true in fact $20 is more than the free it costs for notepad++
it's a flippant statement. Go down the line of any tool; it's cost has basically nothing to do with skill difference to operate it. See basically everything. There's levels.
I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.
are you a programmer? it 100% requires skill. AI or not.
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
Yes I am. The vibe coding I've tried didn't work very well so I agree that still required my skill. But I also don't have access to the latest models and supposedly they're a lot better (see this article for example!).
So is it possible for non-programmers to vibe code if they have the latest models? If not now, what about in a few years?
AI is clearly a different class of tool to something like a welder.
I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
This is _the_ question we must all be able to answer, so here goes my attempt - we all have access to the same tools, before stackoverflow it was forums, books/manuals, so its always been about “getting there, showing up, figuring it out”
your hypothetical boss has other things to do than kick a LLM around at that price
Never to feed the trolls ... but, how does my carpenter deserve $100 an hour when he is using an electric drill and power saw I can get at Home Deepo for $100 bucks?
Most good developers are not employed because just because they can code well.
What is over is: fizzbuzz and trivial CS algorithm regurgitation as a gate.
In most cases you could work around that. For instance write the code yourself and make the AI write the tests. Or keep it busy writing superfluous documentation. Very few people are micromanaged to the extent that they can’t subvert the system.
A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
Exact same experience here. Prior to Opus 4.5 I'd sometimes use AI for some frontend webdev stuff (I am a C/C++/Python programmer; my HTML/CSS/JS knowledge is probably on par with a first-year uni student) and I'd have to manually edit things and retry, tell it not to attempt a paradigm that had failed before or cycle between models in Cursor just to try and get one that could make a simple widget that worked properly.
Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.
Purely vibe code won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan in small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.
At any point you need to have agents review, verify and test the other agents output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions tokens per day, you aren't doing it properly.
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.
I didn't use it often, but when it was needed it was needed.
Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.
It's vibing in the sense that I'm not really writing code, and I'm leaving a lot of decision to the models. I let them drive a lot of the design document details, I just made sure it contained the salient points. Implementation plans I just skimmed. Didn't write any code, just did some checks here and there.
But yes, I did think that it sorta felt like being a team lead for some eager programmers.
Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.
> planning every last detail before writing code is boring
Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.
There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.
> Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.
Part of this is true, part of it the agents catch at least a non-trivial portion of. If you prompt it to do a review, especially with a specific angle like ensuring sustained write performance, or how it will work when the future extensions are implemented, they do often catch a lot of issues.
I agree you lose a fair bit of the sense of "it feels like I'm doing something wrong", or "this doesn't seem optimal" etc. I think the skill in using these tools is to determine when you need that control and where it doesn't really matter.
For me, the fun in programming is sometimes to actually write code, solving a problem in a specific way or try some new approach. Other times the fun is to create something that works, and the code is more a means to an end.
The first case I'll probably still do by hand, like handmade vases despite factory made are cheap and readily available.
For the second case I think these newfangled tools have made it even more fun, since writing lots of boiler plate, repetitive event handles and whatnot is not my idea of fun.
> I think these newfangled tools have made it even more fun, since writing lots of boiler plate, repetitive event handles and whatnot is not my idea of fun
That’s what code generators, snippets plugins, macros, and the old copy-paste are here for. I wonder if you were using notepad to code. Because even nano had macros.
Those tools only get you so far, especially if you write something novel to you. Using a new framework or programming language say.
Sometimes using a new framework or programming language is the fun part.
But sometimes it's just the best way of solving a problem incidental to the fun part.
One of the two projects I vibed included a web frontend. I didn't touch a single line of HTML, CSS or JavaScript of the frontend. And I didn't touch the API on the backend. I'm not a web dev, so this isn't something I've got snippets for or whatever, and in this case wasn't the interesting part.
The interesting part for me in that case was making a tool that could help us, not the details how exactly how that was done.
> The interesting part for me in that case was making a tool that could help us, not the details how exactly how that was done.
And I wouldn’t argue about the economics of getting a MVP out. But with software, you often got one happy path and myriads way of getting into an incoherent state (and crashing early would be a boon in this case) and/or returning the wrong response. When you care about failure, you also care that your code is semantically right. The devil is very much in the details, especially if you have N>1 users.
Getting thing dones for me include a high confidence that the code will do the right thing. And that’s means reviewing each line and checking the semantics (only when it’s a few line of code) or building a test harness and making sure I handle contracts and invariants.
Snippets, Code Generators, and Copy-Paste gives me sample that I can trust, although I may need to edit. But LLM doesn’t. And I’m doubly doubtful when it’s something I’m not familiar with.
It's software development, but with much less actual programming (in my case none).
When I said I check the documents, the initial design document was the only I really took a hard look at. The intermediary I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.
Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.
Currently just manual. I'm not pushing the frontier here, just getting my feet wet.
While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.
While some people got it to work better, for me vibe coding games still didn't reach the point of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand handholding with the models. Games that are more interface based like point and click or something like reigns are easier though
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonable complex tasks, Codex/Claude will happily code you into an expensive corner.
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it is good enough to make developers obsolete today(reasonable impression to get from the way a lot of people hype it) you would be disappointed and after first couple of tries every few months you would probably not try it much with next generations. Reasonable if it's considered a dichotomy.
But a lot of people excited about new generations(including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better and indeed once a year or even 6 months at times there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other, you evaluate it based on if it's worth the effort and cost for the benefit you get from it and if it is and has a good DX you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to autocomplete in IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, I am skeptic to give it access more than in limited cases, to now I am getting close to letting it run free on my device in not so distant future I guess. Some of these were big jumps, at each point I was skeptical of growth. Everytime I thought now the growth will slow down from days 2k context window to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heurestics and constraint modelling and other things I throw at it. I am still needed in the loop, it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also had someone told me not in even 2019 that in 2026 we could have NLP models do what they do today, I would have posited it all as sci-fi and here I am waking up in awe of the world we live in and how quickly we adapt.
You're completely twisting what I said. I've never talked about people claiming it's not making developers obsolete. We are obviously extremely far from that. I'm talking about people who say it doesn't work to build basic features in their projects correctly.
Just take a look at this comment on a different topic, which lists all the pre-requisite for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235
If this is everything needed for an LLM to generate acceptable code, what is even the point of them?
Maybe we come from different cultures and context is harder to grasp just in text so maybe for those reasons your response feels ruder than I hope it was intended to be.
I am sorry for not being clear in my response but I didn't intend to twist your words. I am not sure where I did so. My response was intended to be a more general remark on the kind of discourse on this topic I see and that I think both sides are right from the context they are looking in with and also why I think both sides come out of this discussion exhausted of the other. Not discounting presence of bad actors but generally I think there are most engaging in good faith like you are probably.
Coming specifically to respond your last response, I don't think one needs all of these prerequisites to get value out of LLMs. In fact LLMs have helped me untangle some very messy ball of muds on projects where we previously deemed it not worth the effort and basically carried some codebases as legacy. Now we can write enough tests to feel confidence and do a port against those tests all in a span of few days, which we found impressive.
Now having said all this, I think I understand your perspective a bit better on your original comment.
While it's a very versatile hammer, if it doesn't work for your use case that's all great. I just think that a bit more patience though with honing it maybe could help you find areas where it could work for you. If not, cheers!
Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.
What changed I think was the context harvesting capability of the models. What most programmers did was - debugging and figuring out how something works were the time consuming part - the fix was usually trivial. And now models could do in seconds what took a developer hour or more.
If right now we create a smart grep that just takes everything for a piece of code and outlaw llm-s we will not regress to the previous level. The developers needed this context as much as llm-s to do their job.
Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.
There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instad of connected to the actual value. And of course I've had to instruct it on all the flavour I want.
But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.
I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion we need to focus more on efficiency instead of how large or complex a model can get as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate than we've really lost the idea of what models are supposed to be achieving.
You will immediately notice the difference if you use it at the threshold.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watching them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned o both the harness and he kinds of things people do on the harness, as well as some other things do make enormous difference.
Bu and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.
The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.
For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.
The pipeline controls the quality far more than the model, empirically.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, it's components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of it's capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project it's capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of it's understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the Le Cunn' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful and some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican form a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Neither of those are from 'under' they both look either front or top?
Imagine yourself under the ducks feet, looking up at an oblique angle - wings as I suggested. The AI won't do that, it has no reference for dimensionality.
you can replace the pelican and the bicycle with your preferred animal and a means of locomotion. I bet you can come up with a pair that definitely wasnt in the training data
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
I would say that most improvements are in easily verifiable things like code or math. Atleast that's where all the amazing results seem to be coming from.
Other domains I am not sure but I've heard from people like Cal Newport that the rate of increase outside of code and math are not as equally impressive
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
I mean yeah? It was marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
You might point them at the cloudflare blog about deploying mythos - I found it interesting. Upshot — as your folks discovered, deployment, harness, and utilization method matters for mythos and is a bit different than how you’d deploy a coding agent for writing code, but if you do that, you get something capable of end to end chaining and reasoning about a much broader class of vulnerabilities.
No personal experience with it. But the security team writeups I’ve read are significantly more positive about it than you describe, so it might be worth a second look.
Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two or rapid vulnerability discovery, patching, as well as reducing and minimalizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
If it turns out to be a good change or not is to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vuln, also it's cheaper and easier to find them so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
There's an interesting economic contest here as well - is it more sustainable for a malware group to spend $500 in tokens looking for an issue in my app? or for me to spend $500 scanning for issues on every deployment?
Systemically this usually favours the offence, as they could scan my app once every 6 months whereas I'd need to do it on weekly releases.
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
what are your thoughts on Software engineer replacement. My team has already seen big reductions. Q/A team is gone. Software Engineer reduced by a third.
Scared for the future
If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
General population, you mean non SWEs? Because there are many SWEs around the world who earn median wage and who stand to lose it all as the avalanche of firings is ramping up.
Non SWEs (salespeople, clerks, secretaries, assistants, taxi drivers, writers, 3D modelers, artists, designers) are of course going the same way. Unless they are protected (unionized or such), why would they have sympathy for SWEs? People of our ilk are the ones causing this (to them and to ourselves). What I will tell them is to not repeat our mistake, organize and protest.
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the least worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
Yeah, I don't think the role of QA is to write automated tests - developers should be doing most of that work.
The best QA people I've worked with didn't write much code at all. You'd give them a new system and they'd find all of the bugs, testing obscure edge-cases that you'd never thought of.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
This is the magic question that I'm very eager to hear the answer to.
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
There is an entire category of software engineers who exist entirely to knock out features on microservices or do easily automatible QA work whose jobs will disappear.
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slides decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce.docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
I was doing something like this, and then realized at least with claude that it’s so much better at HTML that it’s better to get an HTML-first deck together, which could then be turned into a PPT template and/or PDF directly, depending on needs.
It saved me a fair number of design-tweak steps in the md -> pandoc part of the workflow. Realistically, hand editing claude’s HTML is also easy in most cases, so I didn’t feel like I lost much (for the generative cases). Similarly if it’s mostly what I’ve written directly that’s the source it’ll be in markdown, and I’ve found it’s a faster path to have md -> (LLM-translated HTML deck) -> pdf.
With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.
If you are a bit technical, reveal.js is actually really nice for this. I one shotted a pdf export for that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
Sure it depends how it is done but for most uses I'd say they are not appropriate - building tools with them is ok if you double check (though how many people will when the answers seem good enough at first?).
I'd find it really troubling if financial analysts are using them without knowing the deep limitations of the tooling (which the companies selling them will not highlight for you). They don't actually count or reason so they are liable to just make up figures based on their training dataset, not the data you give them.
Using them for actual financial analysis and generating reports based on data will lead to hallucinated figures which conform to what was asked for, not what the data says and silently fills in gaps in the data. It's extremely dangerous and not something they are good at at all.
Don’t get me wrong, I very much agree with the danger. As I highlighted - I saw it this morning when someone used Claude to draw the wrong conclusions.
I’m saying there is a way in which they can be used where there isn’t scope for numerical hallucinations at all. They can write sql queries, for example, without ever being allowed to even see numbers.
What invariably does and will happen though is they’ll inner join instead of left join and some data will get missed. Or there will be some missing context (users in this set already have a certain class of property by virtue of some selection bias and that will be mistreated as some signal etc).
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
As a former data scientist, I started to use code agent 3 monthes ago. Before that, I use chat completion on web. Now, I nearly do everything which outputs documents with code agent.
I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
Yes, but no room is made for people who see no use for it. There is a forced-consensus that this technology is useful, which I have to combat against at work.
We teach in a very different environment, but your use sounds typical of my colleagues. "I ask it for suggestions and pick one", but nobody seems to wonder about what is lost when we shrink the horizon of what we will teach to the most likely outputs from a chatbot, one of which we will use.
Maybe this makes more sense in other fields. I have to prepare people to work in the shipping industry, in extremely dangerous roles where they will be operating heavy machinery, steering ships, driving cranes etc. The fact is that AI knows next to nothing about this field because an AI cannot experience handling a ship in rough weather, has never secured a boat to a ship's side with the rain and wind in its face.
Yet, when people are brought in to instruct our trainees, they are told to "tell AI what you want and pick one of the suggestions", in the best case, or just give over everything to the AI in the worst case. And nobody seems to be able to explain why this is a better way of working than sitting with a pen and paper, brainstorming some ideas for a lesson based on your real experiences, and then delivering it. The only justification I'm ever given is your one, "I pick from a list so I am really still in control", "it's quicker and I don't have to think as hard or as long", "it's better at making slides or writing good-sounding (to management and auditors) lesson plans". No-one ever seems to justify it by saying it is genuinely a better experience for the trainees.
> Yes, but no room is made for people who see no use for it. There is a forced-consensus that this technology is useful, which I have to combat against at work.
This is the crux of the issue -- The technology is useful. Using it appropriately is probably the thing that people are ignoring, but you're conflating one and the other in your comment.
It is not useful to you in this case, and complain that it is an overall detriment in your industry. Those are fine and reasonable statements and conditions, and I see no reason to disagree with them... But your first statement, people who see no use for it? That is, to me, as off-putting an opinion as the consequence-unaware hypebeasts who are running OpenClaw with access to their trading accounts and can't see why others aren't.
I sympathise with the idea that everyone wants to use the new hammer and so is treating every problem like a nail, but hammers are still pretty good tools. (And you can ignore the ex-NFT-fans hammering on their dicks in the corner.)
I mean only that I see no use for it myself, in my own work. I'm sure there are people working in roles around me who believe they get some use out of AI doing their work for them, and they will have to answer to auditors when they find problems with their work, or when someone is killed.
To me, as a non-techie person, it feels as if people who work in software believe that because their work can be done by AI, everyone else's can, too. Or that this would be better, simply because it proposes a technological solution to human work — it is taken as read that a solution which uses cool sounding computers and data farms is better than one done by humans with a pen and a pad and life experience. They don't have to justify this belief, because the money is on their side.
I don't mean to tar you with a too-wide brush, and I feel like you have a good handle on your personal acceptance for LLM assistance. No complaint there.
I do think, maybe alternative to your view, that LLMs can provide useful feedback to graduate-level employees in most fields.
It is not that the work can be done by LLMs -- we're not there, yet, in software or otherwise -- but that LLMs as useful tutors specifically in regard to denouncing known bad ideas is largely applicable all over.
What I mean by the above is that I have yet to find a truly interesting idea spun from whole cloth by an LLM. They're mediocre at it. They're trained from the aggregate thoughts of those in every industry, and you and I both know that the aggregate of the industry is, generally, mediocre.
Conversely, though, is the hit: They won't be worse than mediocre. An indefatigable tutor who gives no great advice but will counsel you against blowing yourself up (or cutting a limb off with a rope, or falling overboard) is, to me, worth an amount.
The failure modes will get better, the advice will get better. Are we there, now? Unsure. You can tell us all better.
What does that really mean though — ten more years of data centers exploiting local communities for their resources will mean that a computer might be able to teach people to tie knots, and reliably check their work... No government would allow that to certify someone, and no company would risk the lawsuit when someone dies doing what the AI tells them, so it's a non-starter. Even if it were possible, and governments got on board with certifying training like that, would anyone think this was better than what we have now?
What are the likely use cases in my industry then? That AI is used to bodge the important paperwork that protects lives; is used to draft legislation; is used by both employees and management to do things like personal development reports.
Is anyone meant to be impressed? Is this worth communities having their water stolen from them?
I appreciate I am skeptical, but it is hard not to be when the world spends all day telling you a piece of technology is going to fundamentally change the world, and in real life you only see people use it to blag CVs, personal reports, and lesson planning.
> "What does that really mean though — ten more years of data centers exploiting local communities for their resources"
That is purest hyperbole. Data centers use a lot of electricity, but they are hardly looting local communities. The water issue is wildly exaggerated, unless a data center is located in a desert, because most water is recirculated.
And why do you think no one will allow an AI to certify someone on certain topics. Their knowledge at the moment is roughly the average of people in the field. Is an average person in your field not able to certify others? In any case, AIs are improving very rapidly, so what is not possible today will be possible tomorrow.
As an example, let me point out the Tesla FSD. On a per-mile basis, self-driving Teslas have a massively lower accident rate (less than 20%) than human-driven vehicles. That is a very physical activity being handled by an AI.
I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every powerpoint resentation within our organisation at this point.
We have whatever AI is in teams transcribe every meeting, and it's scaringly good at it. It's also extremely good at sumerizing or finding things from pervious meetings when tasked. One disadvantage in this, is that I can see how stupid I sound on writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped beliving in AI.
It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT) means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite litterally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every promt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantee that the data are private, this is very important for many companies both from the IP protection perspective and the liability to expose some users/customers data (think of GDPR regulations is Europe).
I don't know who is behind Liberchat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to loose and if shit happens it is easier to sue them than some random VC-financed company from the USA.
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
At work the tools handed to most are still essentially chatbots. Getting access to coding tools is an uphill battle because there isn’t really a good way to manage risk yet. Hard enough to keep a coding agent in check locally and ensure it does rm -rf anything. Scale that to thousands of people with limited skill and it doesn’t really work. So currently they just don’t.
That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible
They lag behind because we build for ourselves first. We are rolling out Claude to the biz team this week and they will get access to Cowork, which is still preview aiui.
Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!
I've always been a "power user", making little python programs and figuring out new ways to do things with seemingly unrelated systems. My knowledge is shallow, but very broad.
A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.
Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.
So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise content systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.
So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?
In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.
I swear to god that DeepSeek V4-Flash is the most useful model available right now. It's SO FAST and is good enough for so many tasks that I run it most of the time for almost everything. Even when it messes up, it's so cheap to iterate that I can fix most problems without changing the model to a more "capable" one.
It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
>Starting from zero today, how would someone quickly get upto speed with the latest and greatest AI tooling on an extremely limited budget?
Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allows a massive usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of choice including Claude Code and you are good to go.
I made an account on OpenRouter.ai , created an API key, plugged the API key into the Zed editor, and started asking free models questions about my codebase.
Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.
I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.
I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Out of curiosity - what harness did you use, and what model? And how are you prompting? In my mind prompting like:
“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”
Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.
Sure give it a go, perhaps it will work better now with frontier models, I haven't tried it in a while (this was a year ago, things have improved since then). I'm not sure what tests for having amazing graphics, gameplay, input, UI, sounds, etc would look like, but it would be interesting to see the results!
okay hold my beer. both claude and codex running now.
EDIT: both agents took about 20 minutes. I used that exact prompt in a clean directory for each, and then said "deploy to netlify" - so a total of two prompts.
Netlify is having trouble claiming the Claude project, so if you need a password it's "My-Drop-Site"
FYI, Claude rated itself 7.7/10 for fun, and Codex 98/100 during the fun test loop. As you'll see if you poke at them, Claude needs a physics bug fix round. But I think these both did about what I would have expected.
Claude one doesn't really work (collision detection was the problem I had before too), but fairly close.
Yes when I tried previously I had a few gameplay issues in frogger and I couldn't manage to one-shot this sort of thing at the time (a year ago), so last year definitely saw some good progress at this sort of thing. The asteroids game I was very happy with though, had a very cool retro feel and was wireframe only. Wasn't so keen on the code produced as it had a patchwork feel to it.
To your point, I didn't even look at the code.. :) Okay, I looked at the codex code. it's super reasonable -- separation of concerns, operating on a state model, it's not over designed. I did not hate it. I also noted that codex put in a CRT simulator loop which is a nice touch.
I think a year ago this would have taken a lot of back and forth and arguing; to me that's kind of the point of Simon's article -- a lot more just 'works' now.
Sorry I meant the code a year ago - it took a bit more hand-holding at that point and it was a mishmash of different things, but I feel it’s just slightly easier now - still similar. Haven’t looked into this one just had a quick play. Thanks for trying it out!
I think his article is for the last 6 months - my feeling is progress with LLMs has stalled recently and generated code still has problems with accuracy and coherence and subtle bugs, but everyone has a different experience.
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a python user - this must have been at least 10 years ago and perhaps more.
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
I didn't even submit this one. I didn't actually think this was a good fit for hacker news, the pelican bicycle thing is pretty much played out here already!
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using less tokens, but maybe Qwen/Kiwi's quality is higher? I dont know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the likes. Margins are slim though and the harness at these model sizes is the most important factor.
This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.
My generation will be the new fox news boomers, but instead of Fox news it will be ChatGPT and Claude telling them that Israel is the greatest country in the world and if you disagree you must be an antisemite.
At that point their ground truth is completely skewed (already for some folk), everything is relative. Some of them will probably die off in self-induced Darwin award winning ways, but sadly certain skewed world views may persist.
That's not how the human mind works. People still get skewed views on body standards even when they know that what they are looking at is biased and/or photoshopped, for example. When an AI fake stirs emotions just right, half the people will not even care about the truth.
No, it will just become like in the Soviet system - people will not believe anything anymore, and become disillusioned and just not care about anything other than their immediate surroundings. That's what being innundated with fake/exaggerated information results in. People don't know what to believe because they're seeing/hearing everything, and bits will be retained, but in general there will be a "this is too much, I can't keep up, who even knows, shrug".
It’s the opposite, non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
What's the problem? If I enjoy some show, material or text, if it brings me value or a brief moment of happiness, I could care less if it was made by an AI or a human.
This racism against AI-generated stuff has to stop. If not, we'll have a butlerian jihad on our hands that will set back prosperity, development and science for decades, perhaps centuries.
People mention the artists... ohh, boohoo... either do it on your free time, improve your performance and selling skills or move to another job.
It's not my job to slave away only so that artists can day dream and produce stuff that no one cares about.
I think we need to start separating such concepts like entertainment from the ones of enjoyment, fascination, function, interest, satisfaction, beauty and the sublime a bit more. Art theory literally has books on these things, as they all fall under the topic of aesthetics. Do you really enjoy a frozen pizza from the oven at home in the same way as a freshly made pizza from an authentic pizza oven?
I always care about the processes involved, especially if any human work is involved, from all its accuracies to its errors. For me, interesting things happen while we balance our understandings with a certain amount of holism and a certain amount of reductionism. Putting it on either side of the scale, like your holistic statements, is just pure ideology, and that doesn't hold any merit in reality and is honestly just bland, repetitive and boring.
> Do you really enjoy a frozen pizza from the oven at home in the same way as a freshly made pizza from an authentic pizza oven?
Yes absolutely. I even measure them on the same scale and sometimes the frozen pizza wins.
I’ve literally got an authentic wood brining pizza oven at home and it can cook some great pizza, but that doesn’t mean its output is somehow in an untouchable category it’s just food. Further, with access to the real thing novelty goes away and it needs to sand on its own.
I think the elderly are particularly vulnerable. I also have at least one family member whose social media feed is 100% slop, they are blissfully unaware, and if you told them, they wouldn’t believe you.
Does "creative" mean that you are creative at coming up with ideas or does it mean that you are artistic and can create stuff?
I suppose it is more the latter, and it's the artistic people who create stuff who will suffer. The ones coming up with ideas, but previously couldn't create becasuse they lacked skill might win thanks to AI.
Coming up with ideas is easy, creating and putting in the effort is hard (until we had AI).
Probably the value of created stuff will go down rapidly because there will be so much of it.
When advertising agencies for example see that their copywriter can go from idea to concept with a video generator instead of engaging an animator, they’ll simply cut the middleman who used to create that animation for them and use the tool instead, even if the content isn’t as good (though the quality of this one is really pretty good, there are obvious problems). They’ll happily accept mediocrity to save money.
People will still create adverts but quality and creativity will go down and a lot of jobs are going to be suddenly displaced.
Yes sure if you look closely it’s slop, but a huge number of companies and advertisers just don’t care (and they feel the same about their social media content, blogs and yes code) - they will attempt to cut corners where they can to the detriment of true artists.
But yes, for anyone who does this for a living there will be obvious deficiencies, esp when you try to do something truly novel, intentional and interesting and don’t quite want what it produces.
But in this area they have made quite a lot of progress.
In a serious creative tool you would also want a lot more creative input. At a minimum the ability to steer the animation with skeletons that feed into a control net, or something like that. And the ability to control the look and feel and create much more consistent characters. Both things that exist in good tooling, but both things that create work that will keep animators employed. But it will dramatically reduce the number of animators needed to reach a given level of "good enough".
And looking at the trajectory of the animation industry, I don't think increases in productivity will be used to raise the quality of the animation if the alternative is to just pay fewer animators
Willison chose this task because (unlike actual images of pelicans) is was clearly not in training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
Thank you for the reply. Would something like a Squirrel flying a hangglider as an SVG be a good new test? Or would that be indirectly in the training data too?
Graphically perfect, but content-wise nonsense. The pelican's center of gravity is clearly behind the wheel. It needs to be above or very slightly ahead of the wheel.
Still impressed. And, to be honest, I don't think that this problem matter much. Physical accuracy is very nice, but for example is not the most important aspect when I watch a fantasy movie. Or even a scifi one.
The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.
> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".
Google/Gemini has pretty impressive audio visual capabilities. I tried to have Claude add mulch to a landscape picture and it looked like someone hit it with the orange spray paint tool in MS Paint. Nano Banana actually produced something fairly realistic
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
All I observe they got better at tool call and answering questions about big codebases, especially if the question has a vague pattern to search, and they're superuseful for that! For generating production code even with a lot of steering and baby sitting?
Absolutely not, not quite there not even close in my experience.
But we should stop talking about 1s and 0s, especially with marketing hype trains, there exist a gradient of capabalities that agents have that really depends on the intricacies of the codebase you're working on, I think everyone has yet to discover how to better apply these tools in their day to day work.
But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!
That's why the debate is so polizered imo, there isn't a shared experience
> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.
So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)
Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance, it pushes you to do better, maybe one piece at a time like you said with a good modularized codebase. I think more devs should share experiences like that, we should overthrow marketing and people narratives that "don't code anymore since X"
> I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
Don't want to be rough, but I'd like to read experiences about novelty ideas that solve people real problems in the real world, your project it's just about selling new shovels.
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.
> What percentage of human engineers are creating novel solutions for hard problems, you think?
IMO Every engineer should try spending his time in a company that tries to solve new problems.
Otherwise we will be stuck, as we are now, with big tech paying you mountains of money for doing nothing, incentivizing you to embark on useless activities for letting other managers have a career, fear layoffs and when that happen complaining about it because "it's a year i'm looking for a new job" pretending same compensation and environment. Web development jobs are particularly affected by that.
In the game industry, for example, if you don't do something interesting your game won't sell a copy.
Let me stress this out again, if LLMs get you 97% there, maybe you should try another idea.
> IMO Every engineer should try spending his time in a company that tries to solve new problems.
Yet typically 95% of software developers mainly work on CRUD-type apps. Coding agents are not perfect there either but they’re really a lot more reliable than they were a few months ago.
Please you don’t need to stress anything. I think you are conflating ideas.
Unique game loops ideas make a good game, it has very little to do with the engineering. This is true for most software engineering products. Most engineering work is just reinventing or reimplementing existing ideas, what you describe rarely exists. It may exist in that the people learning the new ideas think it’s novel but very little is truly unique.
Claude wrote me a little python script to help me sort and rank all the AI videos I've generated. It also extracted the metadata and organized it into a CSV. I sent it some hex dumps of the header and it got it first try. The header structure of webms generated by comfy are pretty novel.
As a random example of a "hard" problem solved by AI that I couldn't have realistically done myself, despite having decades of wide industry experience:
Reverse engineering a proprietary protocol from a binary executable.
I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.
My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)
Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.
There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
> There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
I think you could have. However I don't think you would have - there is a big difference. It is a lot of work to to that, and people who try normally give up. However if your boss told you could have. Note that I suspect from your story this is more like give this to a dozen people and in 2 years you get results - at a cost of several million dollars.
> For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.
As I said, this is an example of using AI successfully to produce a high quality product (one that I use every day).
But to your point: I am solving hard problems that people really have. You just don't see those because I haven't mentioned them publicly yet. And they won't be released or talked about until they're ready.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
You claim "very high quality" but can't even get the basic UI working properly. You wrap tmux and a container in 2k lines of code and claim quality, I think the comment above was aimed at this claim.
I did not much more than a cursory glance too, but found "./sandbox/create.go", a ~1300 lines long file with so much duplication even within just itself that I stopped counting.
Now it was a long time ago I did Go professionally, but I'm also in the camp of "That doesn't really count as high-quality", although I know for a fact you can get quality code out of LLMs, but I don't think that's a good showcase of that.
> I did not much more than a cursory glance too, but found "./sandbox/create.go", a ~1300 lines long file with so much duplication even within just itself that I stopped counting.
Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet.
Admittedly, the parsing & escaping code and some utility functions could be moved outside to shrink the file, but otherwise I'm having trouble finding issues with the code.
The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
Look for slight variations of the same thing but with different paths, variables, or modes and I think you'd be able to spot the rest as well.
I've noticed that the bar for "quality" when people judge AI is often significantly higher than what they'd hold a human to. I'm not saying GP et al are doing this (I haven't looked myself), but it is a widespread pattern I've noticed both professionally and personally. I don't know why it is.
I have seen it too. The answer is easy - they don’t like AI. I've seen similar things with some people that don’t like women in tech or certain minorities - they suddenly critique at an extremely high level. I also haven’t looked at this particular case, but it wouldn’t surprise me to be the same thing here.
Dude, are you for real? We've had the supposed inevitability of AI rammed down our throats since the minute LaMDA convinced Blake Lemoine it was sentient, we've watched CEOs hype up AI as if it were production-ready while it was still barely beta quality, LLM-driven chatbots have been stapled to the side of every product no matter how little sense it makes since OpenAI published an API, and we've been told to prepare for the inevitable "agentic future" even as Claude 3.5 had to have its hand held more than a wet-behind-the-ears freshman summer intern. We're told that this technology is going to eat the entire world economy and render human labor obsolete, starting with our jobs, but if it's genuinely supposed to do that, I think it's more than reasonable to expect it to write superhumanly perfect code, not just code that's incrementally better than the last model release but still bad; extraordinary claims require extraordinary evidence, after all. To liken AI skepticism to the obstacles faced by women and minorities in tech is a category error that trivializes actual human struggles against human prejudices.
> I also haven’t looked at this particular case, but it wouldn’t surprise me to be the same thing here.
Be surprised then, because me, who left the critique, probably exclusively programmed with agents for the last year or so, so unlikely I think the code is bad because I "don't like AI". I don't love it either, but wouldn't call myself a AI-hater by any measurements, would be weird to write articles like this if so: https://emsh.cat/en/one-human-one-agent-one-browser/
Again, I wasn't reacting specifically to you (as noted, I wouldn't be surprised if so, but I also wouldn't be surprised if not). I was making a more general statement.
The bar isn't any higher. There's just no grace given. No one is judging a hobby project made by a human on quality, and the person who the hobby project belongs to will rarely say that their code is high quality. And in a professional setting, I think people are fine with "good enough" but they're not going to claim anything is high-quality.
But people are so quick to label their vibe-coded codebase as high quality and no grace is going to be given to a machine.
What comments are you seeing that are calling code from humans high-quality?
Grace shouldn't be given though. The code from vibe coding should pass the review bar as-is. If you need to iterate, you've defeated the purpose.
Because the end result is people committing bad code. For some random hobby project, sure who cares. But people are using this at work. The codebase is rotting in a new innovative way.
Either the bar has to be set at "actually good code comes out of vibe coding" or you have to accept that codebases are going to steadily become less usable by human coders who use their fingers to type in emacs.
Suddenly every dev needs an agent to even work with the slop. Seems like an outcome Anthropic would love though....
People who use AI set the bar themselves when they claim they generate "very high quality work using Claude". Humans more rarely make such claims about the code they write themselves, but when they do, I expect they face similar scrutiny.
AI code is competent, but it's not great or high quality unless you have a good enough eye for quality to steer it with an iron hand. But if you do, you know the quality comes from proper guidance, so you still wouldn't say AI code is great. If you do say exactly that, it comes across as having low standards (which is fine if you own it) and people are going to jump on that just to bring you down a peg.
> "I've noticed that the bar for 'quality' when people judge AI is often significantly higher than what they'd hold a human to."
Because that is literally the hype being fed to us by the marketers at the AI companies and HN users promoting AI.
- AI promoters: "AI is doing Ph.D level work! LLMs are not just a token predictor, it is actually thinking and reasoning! It will replace all developers, including _you_, so get on board the AI hype train now!"
- AI promoters when confronted with blatant mistakes and reasoning errors from cutting edge models: "Why are you holding LLMs up to higher standards than humans? That's not fair or reasonable."
https://github.com/kstenerud/yoloai/blob/main/runtime/regist... <- `Register` embeds a copy of the code from `IsAvailable` because of the locking; that could be replaced with a private `isAvailable` that has no locking that both use (after doing their own locking)
Just out of curiosity, I enabled some other linters and it looks bad. Excluding test files, there are 110 functions with a cyclomatic complexity over 10 and 7 that are _over 50_. The worst is at 86, which is mind-boggling.
Could probably find more, but you get the drift. I'm sure it runs, but stylistically this is more along the lines of what I would expect an intern to do.
This is also sort of nit-picky, but like half the stuff in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe... isn't idiosyncratic, it's just the way those things work and a lot of them aren't even tricky. The one linked is particularly blatant; that's not limited to os.Stat that's literally just how permissions work. Denying permission on inodes is a property of the folder, not the file.
Claude Code will automatically "dumb" the TUI down a bit when it can't properly detect certain terminal capabilities, to avoid potential font rendering issues.
Likely there are some terminal caps that aren't being properly preserved inside of the sandbox. It's never bothered me since the agent itself works fine.
Did you not expect someone to actually look and critique it?
Whether the visual bugs are a deal breaker or not isn’t the point.
The point is that’s not high quality code, it may work. But it’s not code I would ship at my job and therefore it’s not high enough quality for anyone serious
Hey that's fine. You're free to make whatever judgment you wish.
But I still stand by the quality of my code, including here. You and I don't need to agree.
What decades of managing codebases (public and private, huge and small) has taught me is that there will always be an endless list of bugs and feature ideas and nice-to-haves and technical debt pressures in any given project. You'll never get to them all, so you prioritize (as I have done here). Functional bugs usually trump visual ones unless they're actually interfering with work.
Will I fix this bug? Probably, now that I'm aware of it. But there are more important matters to attend to first.
Edit: Turns out the bug comes from a mismatch with the terminal I'm using. With other terminals it looks fine. Term caps are surprisingly complicated, especially when you have multiple layers!
> You aren’t having a disagreement with a person. You’re having a disagreement with reality.
How so? Are you going to instruct us all on how a termcaps mismatch bug is an indicator of poor code quality, rather than an unfortunate bug emerging from within the chaos of the many layers of disparate technologies that must somehow be stitched together (along with their idiosyncrasies) in order to make a project like this work?
Because you won’t listen to a word anyone says lol.
You had a visual bug right at the top of the repos README. Then insisted you hadn’t noticed it before.
Whats important is not that specific visual bug, it’s what that bug says about the rest of the code.
How can we believe that this code is high quality if we see a glaring issue 5 seconds into opening the github?
We didn’t seek out your repo and start lobbing critiques at it. YOU POSTED IT as an example of high quality generated code. I’m telling you I am unimpressed
Really? So the discussion leading to the theory that there's likely a problem with termcaps disparity between layers didn't happen?
> Whats important is not that specific visual bug, it’s what that bug says about the rest of the code.
Really? So you can tell from a single cosmetic bug which doesn't affect its ability to perform its task, that the rest of the codebase is deficient? That's a pretty damn impressive skill!
Hater's gonna hate, I guess ¯\_(ツ)_/¯
The otherwise timid pack always circles after they sense a single drop of blood, no matter how small and insignificant.
Dude, you have a glaring visual bug that is immediately obvious, as the first thing shown in the repo, and also would be seen every time you tested the tool, but you didn't notice it at all. That does not bode well for you noticing other aspects of quality in the tool. Maybe that's the only quality issue, but we all very seriously doubt it.
I think you can fix that by setting an environment variable (regarding the terminal?) but it was a while since I checked. (I was running Claude as a subprocess and had similar issues.)
Also this reminds me of a principle I learned from a mentor. "People are visual buyers. If it looks good, people will think the code is good."
Unfortunately it doesn't matter whose fault the janky TUI is, people will see that and associate it with your software.
It's more along the lines of: Anyone with an axe to grind will find something to grind it on.
Early stage products will have some rough edges. We've seen that in Docker, Kubernetes, AWS, Azure, LXC, KVM, etc. And people griped and raged about the sheer incompetence of the maintainers and utter lack of quality, but they still used those tools even before the rough edges were polished away and folks finally settled down.
The less one pays for something, the more entitled one feels to whinge and heap on abuse.
I've been down this road so much now that it's no biggie if a few Karens want to blow off steam at my expense. I'm not above exposing their silliness though ;-)
It tackles similar kinds of problems, dealing with idiosyncrasies in Linux distros (and Mac), docker, containers, kata, firecracker, seatbelt, tart, tmux, Claude, the various terminal emulators out there, and trying to herd those cats such that it doesn't blow up in your face.
"This is, unfortunately, how narcissists behave. It's simply impossible for a narcissist to be wrong. They truly believe themselves to be right, all the time, and will even distort reality around them to "make" it true. And they do it all unconsciously." - kstenerud
Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.
The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.
It can look like that in certain conditions. The question is why are you so eager to give critique on unrelated work, appearing in a demo screencap, to someone who didn't produce it?
To be honest I assumed it was the screencap software running a basic terminal env without bells and whistles that CC needs, which I've seen before. If the actual tool functions like that too, that's not great. That said, it works for them, it works for them.
> Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.
That's like blaming the company making hammers because you're unable to build a lasting house with the hammer, it really isn't up to Anthropic, but all about how you use the tool you're holding.
I don't see how it cannot be true. Are you claiming that every developer who uses the same LLM harness + model would produce equal code, regardless of the prompt? That's clearly not true in my experience, and I cannot understand how it could be either.
And if that's not true, then it's quite literally about how you're holding this hammer.
There's a cowboy artist that paints with his penis and does amazing work. If I tried that it'd turn out incredibly poorly, I prefer to paint with paintbrushes.
Just because the naked cowboy can paint well with just his penis, doesn't mean a penis is the right tool for painting. It doesn't matter how you hold your penis, it's not the right tool.
> There's a cowboy artist that paints with his penis and does amazing work. If I tried that it'd turn out incredibly poorly, I prefer to paint with paintbrushes.
I can't decide which joke to make, either (little dick joke) "well yeah you'd have to be able to see your paintbrush in order to use it" or (big dick joke) "well yeah, if you can't even hold it in two hands, how are you supposed to paint with it?" so I'll just make both :-D
Hmm, ok, I think the penis in case is a bit distracting, can you de-analogize this to their real terms and tell me what this is supposed to mean and be related to developing with LLMs?
Just because you _can_ do something with a tool, doesn't mean it's the right tool for the job. Just because someone has contorted their entire process to adapt to a misshapen tool, and gotten good results, doesn't mean that's the right thing to do.
It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.
> Just because you _can_ do something with a tool, doesn't mean it's the right tool for the job. Just because someone has contorted their entire process to adapt to a misshapen tool, and gotten good results, doesn't mean that's the right thing to do.
Ok, I agree with this, don't use the wrong tool for the wrong job.
> It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.
Yes, I agree with this too.
I'm still not sure how this relates to LLMs and particular this specific context. I claimed that the output of your agents depend on the developer driving it. You're saying "not every tool is right for every job", I agree with this too, but is that against/for what I said?
Could you just clearly write out exactly what you're arguing for here, no analogies or metaphors, just plain and simple, because I still feel like we're having two different conversations.
Yeah, they have bad engineers, product people and testers.
Microsoft is pretty shit at launching products, does that mean "products" as a concept is wrong? No, it just means Microsoft is bad at products, not more than that. Not sure why you have to extrapolate over an entire ecosystem just because one actor is bad at something.
1) This tool breaks the Claude TUI. Exactly as described by the comment.
2) The Claude TUI itself is broken. The comment is wrong, but assuming the "billion dollar TUI product" is capable of basic rendering and it's the wrapper that broke it, that is an entirely reasonable assumption
The fun here is that both of these softwares were made extensively using AI. No matter which of our options is the case here, the point stands. An AI-built product was shown, it looks obviously ass.
The issue is likely that the tmux session being generated is for some reason not propagating all term caps. Most likely it's an interop issue between tmux and docker and the image running under docker - possibly even something with the terminal client that the pipeline doesn't like somewhere.
Claude Code correctly reduces its display to 7-bit ASCII in response (still functional, although less pretty). Once I get around to fixing this, it will probably result in another section in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
Edit: Looks like it's the terminal. That's a rabbit hole for another day.
Running through VS Code's terminal via VSCode tunnel, it looks like it normally does.
What's really interesting in this comment chain is an observation I've expressed a lot more lately. When someone knows an LLM was involved they raise their expectations. I do it too in my own work and I have to remind myself things like "this bug would've also likely occurred with a human working at this level of complexity." The real question is did the operator arbitrarily and knowingly increase the level of complexity or is it appropriate for the task.
And there’s good reason for that. Anthropic, OpenAI, Salesforce, and so on have aggressively marketed LLMs as better than humans at working. It’s no surprise when we find out something is build using an LLM, we expect it to match the marketing.
But what constitutes "better than humans at working"?
Zero defects? Because you can always find at least one defect. But people don't naturally think statistically, so they grasp the thing that confirms their bias and then hang on tenaciously.
You'll notice the incredible amount of vitriol resulting from a purely cosmetic bug (which, it turns out, results from a missing TERM env in the base image - Claude is very conservative when it can't determine utf-8 support with 100% certainty).
> The real question is did the operator arbitrarily and knowingly increase the level of complexity or is it appropriate for the task.
There's one major reason to have higher expectations for autonomous systems (of all kinds, not just LLM-powered) than for humans, at least those intended to be deployed at scale, and that's the scale. If a human makes a mistake, has biases, or even intentionally breaks the rules the impact of their actions is limited by the nature of them being a human, where something like an autonomous driving system, a coding agent, etc. is intended to be deployed by the thousands, millions, or more and any problematic behaviors happen at that scale.
There are obviously millions of bad drivers out there, but every one of the human ones is bad in different ways. If Waymo pushes a bad update there could be tens of thousands of "drivers" that suddenly become bad in identical ways.
Humans also have the ability to learn from our mistakes. The ones you'd want to have working for you usually don't make the same one twice. LLMs are pretty good at making the same mistake repeatedly, even the simplest things like basic math or counting letters.
They’re talking past each other. For some, “high quality” is a comment about implementation elegance. For others, “high quality” is about duct-taping crude implementations together to fashion a kickass user experience. To most, quality probably involves some convex combination of these.
That is the same fight the 2D animators were having with 3D aninmation 30 years ago. The resolution is likely to be the same: the tool wins but the fundamentals stay, and the line between competent and incompetent practitioners moves but does not disappear.
I think at this point there is no convincing people. Clearly there is value in these tools and it generates code when steered properly. Perhaps your struggles are down to a skill issue.
> The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.
It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.
I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.
Absolutely! I find its test generation, properly steered, to be top notch. In many ways it's like having a second head, because it'll spontaneously come up with test paths that I'd normally only get to after a month or so in one of my "aha! What about XYZ?" shower thoughts.
You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"
> The code I get from LLM's is usually much better than what I get from my peers
Then you should seriously question for who you're working for imo.
> It also isn't lazy.
It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum I want to know a test that can "guarantee" me that a further change doesn't break existing code. And producing this high quality in tests is HARD, and requires a lot of steering with agents. This culture of tests code coverage is just wrong, the best code base I worked with had code coverage only on the net percent of code that matters, the rest is covered by for static type checking and integration tests
> The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people.
I think it’s really down to this. Nobody can agree on what counts as production-quality code. I remember joining a company with what I think (hope) most of us would call horrible quality code. It was an absolute mess, barely compiled with hundreds of warnings, and had uncountable number of bugs. They didn’t even have a bug tracker so nobody even knew how many they had.
But the people working there already were so proud of it! None of them had ever worked for another company so they had no idea how bad their code was in comparison with the rest of the software industry (which itself is a very low bar). I told the founder we had a huge code quality problem and he looked at me like I had horns growing out of my head.
When someone says their LLM is producing “production-quality” code, actually look at it and see. Arguing about it on HN is pointless because everyone’s quality bar is different.
While reading this thread, I literally just caught an agent putting in the following CSS selector in a rule:
> .row > div > div, .alert
This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.
I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.
That's interesting. As I said I haven't tried using LLMs at this level, although I'm about to embark on some this week.
What I've found helps (at least at the other layers) is to have principles documents and standards documents for the AI to reference when it's modifying code. Principles documents describe the why, and standards documents describe the how.
So for example a few parts from my initial CSS-standards.md (still needs a lot of revision):
## Utility-first discipline
**Raw utilities everywhere by default. Never `@apply` for "components."** `@apply` exists only for
true low-level primitives that can't live in a template (e.g., `prose` overrides, embedded
third-party widget shells).
Wathan's stated position: extract only on "worrisome duplication." The Tailwind team explicitly
describes `@apply` as a tool you reach for after first reaching for templates. **Premature CSS
abstraction is the failure mode.**
## Spacing
Use only the default scale (`0, 0.5, 1, 1.5, 2, 3, 4, 6, 8, 12, 16, 24…`). **Never `p-[13px]`.** If
you need a value, change the scale in `@theme`:
```css
@theme {
--spacing: 0.25rem;
}
```
v4 uses a single `--spacing` multiplier; everything derives from it.
## Anti-patterns (banned)
- **`!` important prefix** (`!bg-red-500`). Fix specificity properly.
- **Arbitrary values for colour** (`bg-[#1da1f2]`). Define in `@theme`.
- **Arbitrary pixel offsets** as default (`top-[3px]`). Use the spacing scale. Tolerated only as
rare one-offs.
- **Nested custom CSS more than one level deep.**
- **`@apply` for any class that wraps fewer than ~5 utilities** or appears in fewer than ~3
templates.
- **Dynamic class string interpolation** (`text-${level}-500`) — purger can't see these.
- **Custom breakpoints in v1.**
- **Inline `<style>` blocks.** All CSS goes through `assets/css/app.css`.
Yeah, I have those, but it's still pretty hit and miss, and obviously, it ends up being a game of whack-a-mole for everything I find.
I don't mean to over-state the importance of these little errors, just to say that agents do plenty of dumb stuff, even today, and the people who say otherwise are selling something or (hot take incoming) some combination of stupid, lazy and/or delusional.
Just IME, the quality of the prompt often significantly affects whether it does bad stuff like your example. It's not easy by any stretch and I'm still getting there, but I'm up to a couple dozen or so "Agent Instructions" in my CLAUDE.md files for various projects that have to say things like: "when doing TDD, don't write tests to verify bug fixes in tests" because the agent is really good at following things literally. I am sure it will continue to improve, but until then every project needs some bandaid things like that.
Specifically for CSS, these bots really want to just barf out tailwind-style crap. If you deviate even slightly from the standards and practices of the modal front-end developer, you quickly see how these things are brittle, and no amount of prompting and cajoling will truly affect their behavior. In this case, you're kind of seeing the downstream affects of saying "no, do NOT do tailwind, make actual CSS with actual semantic class names please and thank you."
Perhaps ironically, this results in the quality of output I might expect if I had prompted a right-out-of-bootcamp coder to do the same. (But at least it doesn't whine about it!)
> these bots really want to just barf out tailwind-style crap.
I get it. The LLMs struggle most with state. They don’t have a real fix for that yet. People generally compensate by shoving everything into context, and making the context window as large as possible, which half-works.
Tailwind happens to be “stateless” CSS framework. Nothing uses anything else, nothing is shared, nothing is reused, nothing stacks. It’s super easy to write, since you don’t have to worry about anything else, and the styles are all duplicated dynamically and ‘compiled’ — to the point you can copy-and-paste a HTML block with tailwindcss classes from anywhere into your site, and it mostly ‘works’).
—-
Tailwind is uniquely suited for LLM use, because the problem Tailwind solves is the problem juniors (and now, LLMs) struggle with most. An LLM can happily write up a bunch of styles, without knowing any of the rest of the project state, and if it’s tailwind, it will mostly sort-of work.
It just also happens to be bad practice, this style of development is the exact thing we told everyone not to do for two decades. (“Inline styles are bad! Duplicate styles everywhere is bad! It’s bloated, it’s inefficient. It’s the mark of inexperienced front end. Don’t inline styles. Unless it’s a tailwindcss class, you can inline those styles, they get a pass I guess”).
We used to measure our JS and CSS in kilobytes, by 2011 standards this would be “far too bloated for production use”. For the old-timers, it can be hard to grapple with the idea that we’re just purposefully doing ‘worse’ front-end intentionally now. The calculation changes when half your content/styles/front-end is LLM-generated, and therefore completely disposable. Very “they don’t make them like they used to” vibes.
Maybe you're right, but honestly, I think they're just barfing out the median content related to UI development that you find on reddit and stackoverflow and github.
For better or worse, web UI development has descended down a dark rabbit hole of bad code over the last decade, and so that is what LLMs were trained on. GIGO.
Yeah, it quickly becomes a "which came first, the chicken or the egg?" thing between junior devs and LLMs.
I will say, if you use a Mistral model, and if you insist your CSS framework is Bulma (tell it, 'no tailwind', 'no preprocessor'), it does okay at staying away from Tailwind. (Not perfect, not great, but okay).
No LLM I've used can handle raw CSS well (yet). If you are carefully curating your own classes and styles, you might just be on your own for a bit.
Tailwind produces CSS bundles that are smaller than without in pretty much every possible use case. This is obvious if you think about how the generated stylesheet looks and how it would compress. HTML gets slightly bigger but not nearly enough for the combined bundle to not be smaller.
> I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention.
Yeah, absolutely. People think you're picking on, like, code formatting and no, dawg, your code doesn't do what you think it does, or it only handles the happiest of happy paths.
I do find it funny when people get mad about you critiquing their AI project. You didn't even write it, dude.
A standard Docker container, with the container UID/GID mirrored to the host user, holding the host user's API keys, with the host user's project directory bind-mounted. The tooling doesn't even use gVisor / Kata by default which could implement the claim made, but in reality this entire project appears to be security theater.
I’d like people to notice that those who claim this amazing AI productivity boost are always: pushing out software they don’t know how to judge the quality of and pushing projects that are 70% done. Every. Single. Time.
I use Claude all the time, it is immensely helpful. It is also very nuanced and requires a high level of expertise in a specific domain to produce quality work. Even then, that take time and effort. Anyone saying otherwise, quite frankly, doesn’t know what they’re doing.
you are experiencing reverse Dunning–Kruger effect.
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaulated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
Please do not cite Dunning–Kruger effect at random.
Who needs to generate a dumb demo of a 97% done crud app? We had code generators for those, everytime I read claims like that and I ask to explain further I then discover it's people who were not productive before generating the so called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.
If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??
Maybe in some parts of the world (including mine). But we haven’t have a lot of computers either. But 25 years ago, there was a lot of textbooks and computers editors like O’Reilly already active. I had the C programming language book (not 25 years ago, but the book is older than that) and you could learn a lot with that one book and codeblocks. Same thing with “The Go Programming Language”, “Learning Perl”, and “Programming Clojure”. You only need one book to get very decent.
All I know is that we have a gigantic amount of tech debt we accumulated on the web chasing the next web framework built on top of tons of abstractions with very disappointing native web apis that shouldn't be taken seriously nor the w3c who specified them.
And when an Agent it's capable of gluing together a web app with some crud backend with a very rounded corners UI, that solves nothing for end users, we call them capable. These are not hard problems
You insist that AI needs to be able to tackle hard problems, but can't say what qualifies as a hard problem. Can you see the problem with that? If you don't know what a hard problem looks like, how do you know the models can't tackle them?
It’s that it’s to able to tackle hard problems really. It’s because you have to give it the solution, and the patterns to follow, and then monitor it because it will go down weird paths.
If you’ve ever work directly with a user, you know how vague change requests can be. Try writing some vague prompts like that to the agent and see if it can solve them.
For some, writing down a (good?) specs and handing it to an agent is not very productive. Because by then, they already have an idea of the solution and can use the editor to have it done.
And Claude can actually tackle it the same way as humans do - here in the real world, where we don't have time to let some nonsense like "mathematically proven to be unsolvable" to stand between us and our goals: it can eyeball the code and give a good enough guess.
I don't really see your point. Most problems that people have aren't really super-novel, but just extremely bespoke.
To give a specific example, 12 months ago I had a client pay me me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.
First of all it just underlines how shitty the web has become, second If that's your work I'd chase a career path where Claude can't one-shot this kind of dumb stuff
Oddly enough switched from software to selling retro games online.
Made ridiculous bank during 2019-2023, lost money 2024-2025 (I wasn't doing proper accounting at that stage, so it took a while to really internalise that the market wasn't insane anymore), looks like we'll make a decent-ish profit in 2025-2026 after pivoting the business model. Some regrets but it's possible staying in software could have been just as turbulent.
Funnily enough we're finally at the stage where I can launching my SaaS side-hustle which I've been sitting on for the past year and a half, so that could end up back in software again soon.
I would never say never, since I don't know what Claude would look like in 5 years' time, but there's plenty it can't do at the moment.
To give a concrete example, I don't let it make sweeping changes to the main "business logic" of my SaaS. Not because it's necessarily wrong but because I can't easily verify it. But I'll let it rip on peripheral stuff, or co-work with it.
If you break down a complicated coding problem in smaller parts, it could be any problem.
You will see its basically a very reusable part thats already done uncountable times else where.
People who think they do something so special and novel that it just can't be done by non-human, struggle with breaking down a problem in smaller parts.
Even if you do have such novel problems, its not like every single day, every single bit of work you do is like that.
Factor 135066410865995223349603216278805969938881475605667027524485143851526510604859533833940287150571909441798207282164471551373680419703964191743046496589274256239341020864383202110372958725762358509643110564073501508187510676594629205563685529475213500852879416377328533906109750544334999811150056977236890927563 in less than 24 hours.
Come up with a way to sample from LLMs such that they can tell funny jokes. The jokes should not be recited jokes from elsewhere.
Implement a CUDA kernel that achieves optimal efficiency for PyTorch-like conv2d for "reasonable" shapes/strides/dilations/groups. (This task is the closest to being solved by LLMs, but they usually get stuck somewhere doing stupid things instead of considering more advanced optimization methods and still need a human to push them along).
I'm beginning to get the sense that Sturgeon's Law is at play here and the non-crap 10% of us are arguing with the 90% for whom LLM's shitty output is actually better than what they could do on their own.
I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems to be that's not the norm or the case everywhere.
and it's the 90% that's most vocal. Sturgeon and D-K seen to go hand-in-hand.
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
The answer is "for lots of people, but not you".
You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of poralizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".
I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.
My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT), they're able to just pick up the project and work at all these levels:
- Go code that implements the transpiler (parsing Wasm, building an AST)
- Go code that gets generated by serializing the AST to a .go file
- Go code that manipulates the AST (to optimize it), and its effect on the generated code
- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST
- C code that gets compiled to Wasm, then translated to Go, then called by Go
- Go code that gets called by this C code to implement a C stdlib
- WAT and WAST files that are used to implement the Wasm spec tests
I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.
And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parenthesis" in the Go code (I do have some LISP experience; it's still easier).
Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.
Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.
Also... I think our era has an intrinsic bias that change=progress, productivity, etc.
Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.
But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.
A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.
Maybe administration was never really a bottleneck.
Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.
Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.
I haven't heard of many belting out features, and increasing prices or sales.
Most bottlenecks are upstream of another bottleneck. Few are a "dam."
I believe by now we know exactly what it's good at and what it's terrible at.
The problem is that our CEO's fear of the future that pushes them to peculiar decisions that objectively make no sense (cf the infamous discussion of the Microsoft employee on Github that couldn't force its agent to do the proper thing).
It's not the first time I witness this kind of discrepancy and probably not the last, I just learned to adapt to it.
I agree, but you contradicted yourself just one line above.
> For generating production code even with a lot of steering and baby sitting? Absolutely not
Moreover this is further in contradiction with several facts:
1. the majority of this industry has always been composed by mediocre/bad developers, often unable to write a fizz buzz
2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties
3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?
> I agree, but you contradicted yourself just one line above.
>> > For generating production code even with a lot of steering and baby sitting? Absolutely not
with this last sentence I obviously meant in my experience, it's not that hard. I don't buy your facts are highly biased towards web development, that's a common mistake here on HN to think it's the totality of the industry, luckily it's not
I've quoted you two tools (Ghostty and Redis) whose development now regularly uses AI assistance to deliver production code. I quoted those because their authors shared their experiences, the strengths and the limits of the tooling.
There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's a magnitude more where people don't admit or tag their PRs as AI generated or co-generated.
In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask a cheap review to an AI tool.
The industry is changing, I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality ones, (especially when used, setup and overviewed properly) is plain denial.
Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.
You cited mostly web tech, which proves my point ;) Is antirez uses extensively agents to contribute to redis doesn't mean it's a becoming industry trend. I'd say quite the contrary, it isn't in the gaming industry for example, where novel ideas matter. And btw Antirez and Linus for example, put a lot of effort into steering agent into doing the right thing for them which is totally different than "these tools become just good"
> Half the projects I listed are system's programming related.
No they're not and those who are, are in overwhelming control by the engineers that steer continuously the agents in the right direction. First of all this isn't something you can do for novel ideas, especially in gaming, second it is indeed very bad the code they produce otherwise it won't require that much effort from high end professionals to bend the LLMs to their will.
Denial of nothing, it's pretty clear from my original comment above that gen ai is indeed deployed with varying degree of success in various stuff. My point is there wasn't any "inflection point" just a better integration between agents and os tools all inside a loop.
I successfully use AI in my day to day job, just not that much for coding, if I have a sense a task can be one-shotted by Claude I do, if not I don't. Simple as that
What novel ideas are you thinking of? In my experience there are very few games with novel software engineering. New gameplay mechanics or story or art or design, sure, but they're generally built with very old and standardized patterns.
My explanation for the lack of shared experience is very language dependent quality. I work in Go and it's gotten really really good. I have to pick the right abstraction and it can be overly verbose at times but it can make in 5 minutes what would have taken me an hour.
I had a really fun day yesterday because anthropics limits on their normal 20$ subscription allowed me to play around for the whole day without hitting a limit.
Its 'production' code because its a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.
The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.
This doesn't need to become so much better to be useful. Its already very useful for a lot ofuse cases like researchers which have to verify the math anyway but are not good in writing code for filtering their testdata, converting them and running it.
Small websites, fun projects, helper tools etc.
But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.
We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.
"We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."
This is nonsense. Im not a SWE but a CEO, if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.
Your Product Manager is not a coding job. Your Product Owner is not a coding job.
vibe-kanban exists you could already do a proper experiment letting your PO maintain a vibe-kanban board with proper requirements and see how an agent progresses.
But 5% is often enough wwhat breaks it. Doesn't help much when your PM, PO or CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.
Firstly i wrote examples but also etc. so its more than just that. It is also refactoring, cicd pipelines and co.
2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.
Now its one shoting a lot. Including websides, refactorings, etc.
The question is what is missing? How far are we that it can handle huge code bases vs. smaller ones? How far are we that it can comprehend the whole architecture and doesn't try to put a service in a wrong place just becaus the context is too small?
Mythos is 10 Trillion, that might be already pushing it.
95% might be not enough for someone in sense of "yeah i can't do the 95% and i can't do the 5% either the AI can do 100% or i still need Kevin with his knowledge even if its just for the last 5%"
> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet
>> embedding-shape 1 hour ago | root | parent | next [–]
>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?
Probably not really. For gaming, I think probably just need to have a better way to explain visual and what the problem is (collision not done correctly, ways to feedback to LLM's experimentation loop how that should be checked and why etc).
Models usually is broken if there is no feedback loop. Well, websites might be exception since they can one-shot pretty well. But there are plenty of things they can do well without one-shot that just requires a good feedback loop to be built.
Steve Yegge wrote about this in his book Vibe Coding. He says it takes about a year of experience before you're consistently getting good results. He writes about lots of different techniques for doing that, but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire.
It's been 4 years of using them for me, before writing a book I'd wait to have a decade of experience to share with others, otherwise it would have the same value as a book on a react tutorial
I'd say closer to 6 months for me but probably still some room to improve.
I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.
After having Claude Code "remember" my preferences and tools, it's more efficient.
It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.
> but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire
That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.
+1 to all of this. The challenge can be staying focused and thinking when the AI assistant is (1) moving very fast and (2) often times doing multiple things at the same time.
I know I have struggled to keep up, and fall into the trap of approving things (either commands or recommendations) without taking the time to really process and think about them.
It's a bit like the age old problem of "it's super easy to ask questions, and can be super hard to answer many of them". So the economy of the conversation gets out of whack fast.
Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.
I think it more reliably does IaC with established patterns especially when it can do a dry run.
Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho
Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.
I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.
I'm curious. What have you actually tried? Are you just prompting the LLM with one off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.
I'm curious. What makes you think that me sharing an example(which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?
As I said we have a plenty of different envs, codebases, requirements. Things are complex.
You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.
Let me stress this out again:
> That's why the debate is so polizered imo, there isn't a shared experience
In my experience most people with the type of critique I'm seeing from you have only tried it one time or have not taken the time to invest in an environment/process that will work for agentic coding.
My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.
Having said all that, you're right there isn't a shared experience.
I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tried, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:
assert_eq x, true if x == true
Both Claude and Codex, both with the latest versions and the original versions that had been working.
Now I just use deepseek. It isn't any dumber, and it costs way less.
Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.
I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.
This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.
@hollowturtle I'm surprised - do you really find that sota models aren't good enough to generate production code with steering and babysitting? My experience (Claude Code, mostly Opus 4.6) is that it's fantastic at this. At least in JS + TS + Elixir + Ruby. It does indeed need babysitting, my mental model is that it's an exoskeleton not a junior dev, but IME it's a friggin badass exoskeleton, easily 10x-ing my speed on most work. Notably I do NOT --dangerously-skip-permissions nor use claude code's auto mode, I micromanage and lightly review every line it's writing as it writes it, so I rarely have more than 2 sessions generating simultaneously. I suspect that a lot of the disappointment comes in when people try to delegate to it and trust it to not go off the rails. It hasn't earned that trust from me yet (and hasn't needed to yet).
Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.
It really depends on the task, but, in my experience, small to medium and bigger codebases, the amount of steering to get quality code is not worth it.
I see patterns and solutions emerging from hand coding, I'm not the other way around, I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimumn effort and context.
Starting with a prompt, or in plan mode, it's not how I trained as an engineer, I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand, for example my muscle memory suggest me to use a specific data structure only after I see some code patterns emerging, hard to explain hopefully makes sense.
If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc. it usually start with a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non web UIs). Let alone doing it via spec files, if it's something I don't care about yeah sure, maybe a little tool for entering/editing data, but alas it always default to slop web apps, and I get it I mean most of the training set is on web apps
I think this depends a lot on the task, the existing codebase, and the taste of the operator.
In general I tend to agree with you if you're talking a codebase you are deeply familiar with, the value-add from have agents write the code probably ranges from very small to negative in most cases.
On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.
Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll tradeoff ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.
Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.
I'm convinced that the polarization is that one's impression of AI has a direct 1:1 mapping with one's previous level of skill and sensitivity to quality. Most people are by definition average and they are impressed.
Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathon Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.
Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.
I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.
why is there no talk about the world is already run by AI by proxy? ie bureaucrats using chatgpt to make their speeches decisions shopfront designs etc. I just dont seem to read about this, intead its more this nebulous specific date in the future
"there’s zero chance any AI lab would train a model for such a ridiculous task"
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
It definitely seems like the point of no return has been passed.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The later performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
"Coding agents got really good - here, a bunch of non-releavant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right! "
Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like going into fulfilling over the next weeks, does it now?
If you only read bad news (i.e. mass news these days since that sells better) this will be the picture. But I have personally seen some insane stuff happen in biotech. Like, I can't believe we're lucky enough to possibly live our life in this kind of future. We already have actual therepeutics developed using Alphafold being tested right now in real clinical trials, but the next generation of stuff that will go into trials in the next 3-5 years will be insane. We will look back at current medicine like we look back at medieval times today.
My mother is going on 5 years with multiple myeloma, a cancer that would have offed her in 5 months if it weren’t for advances in maintenance chemotherapy.
The tooling has become so good though - the eco-system around the LLM. The models have become really good, yes - but it's definitely slowed in my opinion. The tooling is what really has become great - "harness" is probably the best word. When folk like Elon/Schmidt/Theil/etc. talk about singularities and industrial revolutions - it sounds extremely out of touch - or actually protective of the massive capex they've potentially sunk.
EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.
We all have had the client from hell: they don't know what they want, they change their requirements all the time. Whenever they have a new half-baked idea, I need to scramble and re-design the architecture. They have no clue that a small change request has a big impact on the code.
Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.
top model changes every other month between Claude, GPT and gemini. but its dominated by GPT overall. Claude has taken lead in coding task but GPT 5.5 has come stronger. gemini was good in between. but its dominated by GPT 5.5 and claude overall. Coding is the area where disruption is hardest. Opencalw early this year was a major breakthrough in agentic AI and it is still making noise and becoming more mature and going toward enterprise. Agentic coding is still in adoption phase where teams are trying it , trying to make sense out of it, running it and not beleving it and eventually it is discussion point over tea. it is still in adoption phase but needle has moved from being alient to being something real which team started discussing and using it like a champ.
I'm tired of the pelican bench, it made sense in the beginning, but at this point it got too popular and old to consider the assumptions from one year ago (absence in the sample/training/reinforcement) to still hold.
They’re definitely RL training the models on the pelican test. They patch any kind of test that shows them performing poorly by hardcoding some answers into the model.
How much of what is being generated by LLMs is actually value add? My perception is there are lots of great experiments, but little real value.
+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?
+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.
It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.
AI is like Sauron's Ring: it only amplifies the user's innate abilities.
It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.
> there’s zero chance any AI lab would train a model for such a ridiculous task.
A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.
For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT paper did several SVG and tikz tests and the actual image is rather arbitrary. You wouldn't want to optimize for a singular image but also if you're doing halfway decent training a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].
> One of my projects was a vibe-coded implementation of JavaScript in Python—a loose port of MicroQuickJS—which I called micro-javascript. You can try it out in your browser in this playground.
I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where from.
The honest summary that doesn't show up in the six-month roundup: the unevenness. Boilerplate, tests, scaffolding, glue code: dramatically faster, sometimes 5-10x. Architecture, data modeling, careful security work, judgment calls about what to build: same as before, sometimes slower because tab-completion sneaks in plausible-but-wrong defaults you then have to undo.
The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
I'm always surprised to see HN people saying models aren't good.
What are these guys building? The best engineers I know, from startup to big tech admit these models are incredible.
Including people I don't know personally, foundational engineers from every area. The average HN person though, is doing some quantum-alien computation that not even the best developers in the world can grasp.
There's also an inflection point in Feb-April: Claude got considerably worse, and arguably has not really recovered since then. They claim it's fixed, but my experience it is not as great as it once was. 4.7 is still useless.
Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.
iekekke | 18 hours ago
As time progresses one now has a yard stick to measure against progress. No more excuses - show me the money baby.
bb88 | 18 hours ago
zarzavat | 18 hours ago
minimaxir | 18 hours ago
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
jofzar | 17 hours ago
Mashimo | 16 hours ago
Mistral seems to be the exception. Their new model from a few weeks ago is worse then selfhosted gemma.
Antibabelic | 14 hours ago
captn3m0 | 7 hours ago
Antibabelic | 7 hours ago
energy123 | 16 hours ago
shepherdjerred | 18 hours ago
I'm not sure that's true anymore considering how popular Simon's blog is
nickvec | 17 hours ago
aaronbrethorst | 15 hours ago
_puk | 17 hours ago
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
kzrdude | 17 hours ago
sunaookami | 11 hours ago
simonw | 17 hours ago
muzani | 13 hours ago
throwaway2027 | 18 hours ago
_puk | 17 hours ago
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
Arn_Thor | 14 hours ago
ant6n | 14 hours ago
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
_puk | 13 hours ago
Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).
vessenes | 12 hours ago
On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.
Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.
dmpk2k | 16 hours ago
wilg | 16 hours ago
sph | 15 hours ago
OvervCW | 13 hours ago
aizk | 18 hours ago
Insanity | 18 hours ago
They definitely get something barebones up and running, but it's far from a fully fledged application.
adgjlsfhk1 | 17 hours ago
DeathArrow | 17 hours ago
I think the smart zone stays within the first 100k tokens, no mater if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagent for the tasks.
danielbln | 16 hours ago
minimaxir | 17 hours ago
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
baq | 16 hours ago
xbmcuser | 17 hours ago
Scoundreller | 16 hours ago
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on that what I think it should do instead.
And I know where to make slight changes without burning my allotments.
LAC-Tech | 13 hours ago
Gemini Pro on the other hand can be quite a pleasant experience.
bluegatty | 17 hours ago
'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofers process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
smackeyacky | 14 hours ago
asdff | 14 hours ago
halflife | 17 hours ago
kvakkefly | 17 hours ago
I did write some stuff myself just to learn how the enigma encryption machine worked, so wrote myself to learn. But professionally, I stopped coding in November.
viccis | 17 hours ago
aspenmartin | 16 hours ago
bsder | 16 hours ago
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
FeepingCreature | 16 hours ago
hansmayer | 15 hours ago
sampullman | 15 hours ago
dkersten | 14 hours ago
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
musebox35 | 16 hours ago
Timwi | 15 hours ago
But it's by far the most fun part and the only reason to take such a job...
OakNinja | 15 hours ago
stepbeek | 15 hours ago
dawnerd | 14 hours ago
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
skydhash | 9 hours ago
mekael | 4 hours ago
peepee1982 | 14 hours ago
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
musebox35 | 13 hours ago
BOOSTERHIDROGEN | 15 hours ago
musebox35 | 12 hours ago
rafaelmn | 16 hours ago
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
manmal | 14 hours ago
1. Spec -> plan -> code (all agent driven, maybe with grill-me or ultraplan)
2. Handwritten spec -> agent driven plan -> agent driven code
3. Agent driven spec -> vibed code -> Fix by handholding until ok-ish
4. Vibed throwaway prototypes -> extract useful patterns -> rewrite with handholding
5. Generate file structure with handholding -> manual TODO comments -> Fill in blanks with handholding
rafaelmn | 14 hours ago
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with LLM is that you have a bunch of thinking traces saved as a part of your iteration proces - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.
manmal | 8 hours ago
yieldcrv | 16 hours ago
Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
altmanaltman | 16 hours ago
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
It is extremely ignorant.
wilg | 16 hours ago
MikeNotThePope | 16 hours ago
pastel8739 | 15 hours ago
IshKebab | 15 hours ago
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
apsurd | 15 hours ago
this was always true in fact $20 is more than the free it costs for notepad++
it's a flippant statement. Go down the line of any tool; it's cost has basically nothing to do with skill difference to operate it. See basically everything. There's levels.
IshKebab | 14 hours ago
apsurd | 14 hours ago
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
IshKebab | 10 hours ago
So is it possible for non-programmers to vibe code if they have the latest models? If not now, what about in a few years?
AI is clearly a different class of tool to something like a welder.
komali2 | 15 hours ago
piva00 | 15 hours ago
peepee1982 | 14 hours ago
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
skor | 13 hours ago
mianos | 12 hours ago
Most good developers are not employed because just because they can code well.
What is over is: fizzbuzz and trivial CS algorithm regurgitation as a gate.
krzyk | 16 hours ago
satvikpendem | 15 hours ago
LtWorf | 15 hours ago
tonyedgecombe | 13 hours ago
junga | 15 hours ago
chasd00 | 9 hours ago
I actually use claudecode a lot, where it works it works very well for me.
bloppe | 14 hours ago
yen223 | 14 hours ago
AussieWog93 | 12 hours ago
Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.
troupo | 10 hours ago
I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.
Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.
DeathArrow | 17 hours ago
At any point you need to have agents review, verify and test the other agents output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions tokens per day, you aren't doing it properly.
fluder_tw | 14 hours ago
ssdspoimdsjvv | 12 hours ago
magicalhippo | 16 hours ago
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
nothinkjustai | 15 hours ago
magicalhippo | 13 hours ago
I didn't use it often, but when it was needed it was needed.
manmal | 15 hours ago
whatshisface | 14 hours ago
magicalhippo | 13 hours ago
But yes, I did think that it sorta felt like being a team lead for some eager programmers.
WesolyKubeczek | 14 hours ago
> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.
> I do check the documents, and what they're doing. I also check the tests, some more thorough.
Sounds like programming, but with extra steps.
dawnerd | 14 hours ago
mrcsharp | 13 hours ago
Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.
There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.
magicalhippo | 43 minutes ago
Part of this is true, part of it the agents catch at least a non-trivial portion of. If you prompt it to do a review, especially with a specific angle like ensuring sustained write performance, or how it will work when the future extensions are implemented, they do often catch a lot of issues.
I agree you lose a fair bit of the sense of "it feels like I'm doing something wrong", or "this doesn't seem optimal" etc. I think the skill in using these tools is to determine when you need that control and where it doesn't really matter.
magicalhippo | 13 hours ago
The first case I'll probably still do by hand, like handmade vases despite factory made are cheap and readily available.
For the second case I think these newfangled tools have made it even more fun, since writing lots of boiler plate, repetitive event handles and whatnot is not my idea of fun.
skydhash | 9 hours ago
That’s what code generators, snippets plugins, macros, and the old copy-paste are here for. I wonder if you were using notepad to code. Because even nano had macros.
magicalhippo | 5 hours ago
Sometimes using a new framework or programming language is the fun part.
But sometimes it's just the best way of solving a problem incidental to the fun part.
One of the two projects I vibed included a web frontend. I didn't touch a single line of HTML, CSS or JavaScript of the frontend. And I didn't touch the API on the backend. I'm not a web dev, so this isn't something I've got snippets for or whatever, and in this case wasn't the interesting part.
The interesting part for me in that case was making a tool that could help us, not the details how exactly how that was done.
skydhash | 3 hours ago
And I wouldn’t argue about the economics of getting a MVP out. But with software, you often got one happy path and myriads way of getting into an incoherent state (and crashing early would be a boon in this case) and/or returning the wrong response. When you care about failure, you also care that your code is semantically right. The devil is very much in the details, especially if you have N>1 users.
Getting thing dones for me include a high confidence that the code will do the right thing. And that’s means reviewing each line and checking the semantics (only when it’s a few line of code) or building a test harness and making sure I handle contracts and invariants.
Snippets, Code Generators, and Copy-Paste gives me sample that I can trust, although I may need to edit. But LLM doesn’t. And I’m doubly doubtful when it’s something I’m not familiar with.
magicalhippo | 13 hours ago
When I said I check the documents, the initial design document was the only I really took a hard look at. The intermediary I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.
Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.
nopurpose | 14 hours ago
magicalhippo | 13 hours ago
While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.
orrito | 15 hours ago
ryanjshaw | 14 hours ago
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonable complex tasks, Codex/Claude will happily code you into an expensive corner.
ben_w | 14 hours ago
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
iLoveOncall | 14 hours ago
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
harshitaneja | 13 hours ago
But a lot of people excited about new generations(including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better and indeed once a year or even 6 months at times there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other, you evaluate it based on if it's worth the effort and cost for the benefit you get from it and if it is and has a good DX you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to autocomplete in IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, I am skeptic to give it access more than in limited cases, to now I am getting close to letting it run free on my device in not so distant future I guess. Some of these were big jumps, at each point I was skeptical of growth. Everytime I thought now the growth will slow down from days 2k context window to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heurestics and constraint modelling and other things I throw at it. I am still needed in the loop, it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also had someone told me not in even 2019 that in 2026 we could have NLP models do what they do today, I would have posited it all as sci-fi and here I am waking up in awe of the world we live in and how quickly we adapt.
iLoveOncall | 12 hours ago
Just take a look at this comment on a different topic, which lists all the pre-requisite for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235
If this is everything needed for an LLM to generate acceptable code, what is even the point of them?
harshitaneja | 11 hours ago
I am sorry for not being clear in my response but I didn't intend to twist your words. I am not sure where I did so. My response was intended to be a more general remark on the kind of discourse on this topic I see and that I think both sides are right from the context they are looking in with and also why I think both sides come out of this discussion exhausted of the other. Not discounting presence of bad actors but generally I think there are most engaging in good faith like you are probably.
Coming specifically to respond your last response, I don't think one needs all of these prerequisites to get value out of LLMs. In fact LLMs have helped me untangle some very messy ball of muds on projects where we previously deemed it not worth the effort and basically carried some codebases as legacy. Now we can write enough tests to feel confidence and do a port against those tests all in a span of few days, which we found impressive.
Now having said all this, I think I understand your perspective a bit better on your original comment.
While it's a very versatile hammer, if it doesn't work for your use case that's all great. I just think that a bit more patience though with honing it maybe could help you find areas where it could work for you. If not, cheers!
ReptileMan | 11 hours ago
What changed I think was the context harvesting capability of the models. What most programmers did was - debugging and figuring out how something works were the time consuming part - the fix was usually trivial. And now models could do in seconds what took a developer hour or more.
If right now we create a smart grep that just takes everything for a piece of code and outlaw llm-s we will not regress to the previous level. The developers needed this context as much as llm-s to do their job.
sofixa | 11 hours ago
There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instad of connected to the actual value. And of course I've had to instruct it on all the flavour I want.
But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.
righthand | 7 hours ago
rTX5CMRXIfFG | 17 hours ago
Sparkyte | 17 hours ago
Personal opinion we need to focus more on efficiency instead of how large or complex a model can get as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate than we've really lost the idea of what models are supposed to be achieving.
minimaxir | 17 hours ago
bluegatty | 17 hours ago
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watching them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned o both the harness and he kinds of things people do on the harness, as well as some other things do make enormous difference.
Bu and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
dnnddidiej | 16 hours ago
nl | 17 hours ago
I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.
raincole | 17 hours ago
Hfuffzehn | 15 hours ago
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
mrothroc | 7 hours ago
The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.
For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.
The pipeline controls the quality far more than the model, empirically.
bluegatty | 17 hours ago
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, it's components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of it's capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project it's capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of it's understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the Le Cunn' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful and some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
nl | 17 hours ago
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
bluegatty | 17 hours ago
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican form a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
IanCal | 16 hours ago
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
nl | 16 hours ago
nl | 16 hours ago
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" model.
Edit: I've added the Opus version on the same link.
squeaky-clean | 14 hours ago
nl | 11 hours ago
bluegatty | 10 hours ago
Neither of those are from 'under' they both look either front or top?
Imagine yourself under the ducks feet, looking up at an oblique angle - wings as I suggested. The AI won't do that, it has no reference for dimensionality.
koonsolo | 15 hours ago
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
bluegatty | 10 hours ago
The model does not 'understand' comprehensively the relationship between anatomy, dimensionality, etc..
iammjm | 10 hours ago
viking123 | 13 hours ago
DeathArrow | 17 hours ago
simonw | 17 hours ago
rahimnathwani | 16 hours ago
It would support your point about the performance of 20GB local models.
ex-aws-dude | 17 hours ago
Does that suggest the uplift was only for things that are easily verifiable like code?
4b11b4 | 17 hours ago
rdedev | 16 hours ago
Other domains I am not sure but I've heard from people like Cal Newport that the rate of increase outside of code and math are not as equally impressive
vanuatu | 16 hours ago
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
tayo42 | 17 hours ago
yieldcrv | 16 hours ago
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
viking123 | 13 hours ago
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
DeathArrow | 17 hours ago
tptacek | 17 hours ago
thierrydamiba | 16 hours ago
simonw | 16 hours ago
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
krzyk | 16 hours ago
halflife | 15 hours ago
You used to have a couple of days to close a breach, now it 2 hours.
vessenes | 12 hours ago
No personal experience with it. But the security team writeups I’ve read are significantly more positive about it than you describe, so it might be worth a second look.
Gigachad | 15 hours ago
baq | 16 hours ago
tetha | 16 hours ago
We're most likely entering a year or two or rapid vulnerability discovery, patching, as well as reducing and minimalizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
tptacek | 16 hours ago
jxmesth | 16 hours ago
nickvec | 16 hours ago
gnyman | 15 hours ago
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vuln, also it's cheaper and easier to find them so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
muvlon | 14 hours ago
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
spacebanana7 | 12 hours ago
Systemically this usually favours the offence, as they could scan my app once every 6 months whereas I'd need to do it on weekly releases.
grey-area | 17 hours ago
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn and how could you trust it with your data? This is changes in ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
vishal_new | 16 hours ago
ShinyLeftPad | 16 hours ago
munksbeer | 10 hours ago
What are you going to tell them? Suddenly you're earning what they're earning for sitting at a desk every day?
ShinyLeftPad | 8 hours ago
Non SWEs (salespeople, clerks, secretaries, assistants, taxi drivers, writers, 3D modelers, artists, designers) are of course going the same way. Unless they are protected (unionized or such), why would they have sympathy for SWEs? People of our ilk are the ones causing this (to them and to ourselves). What I will tell them is to not repeat our mistake, organize and protest.
simonw | 16 hours ago
Personally, the more time I spend working with coding agents the least worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
kenloef | 15 hours ago
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
koonsolo | 15 hours ago
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
Mashimo | 15 hours ago
In my limited experience they write test cases, test each story, do regression test, verify bugs from customers. All by hand.
At my current job I don't want to miss them.
koonsolo | 8 hours ago
simonw | 9 hours ago
The best QA people I've worked with didn't write much code at all. You'd give them a new system and they'd find all of the bugs, testing obscure edge-cases that you'd never thought of.
vanuatu | 16 hours ago
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
trojans1290 | 16 hours ago
koonsolo | 15 hours ago
stuxnet79 | 15 hours ago
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
asdff | 14 hours ago
empath75 | 7 hours ago
dnnddidiej | 16 hours ago
LZ_Khan | 16 hours ago
conception | 16 hours ago
RobinL | 15 hours ago
angled | 15 hours ago
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
asdff | 15 hours ago
vessenes | 12 hours ago
It saved me a fair number of design-tweak steps in the md -> pandoc part of the workflow. Realistically, hand editing claude’s HTML is also easy in most cases, so I didn’t feel like I lost much (for the generative cases). Similarly if it’s mostly what I’ve written directly that’s the source it’ll be in markdown, and I’ve found it’s a faster path to have md -> (LLM-translated HTML deck) -> pdf.
jillesvangurp | 15 hours ago
If you are a bit technical, reveal.js is actually really nice for this. I one shotted a pdf export for that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
ta8903 | 14 hours ago
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
Gigachad | 15 hours ago
grey-area | 14 hours ago
aidos | 14 hours ago
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
grey-area | 12 hours ago
I'd find it really troubling if financial analysts are using them without knowing the deep limitations of the tooling (which the companies selling them will not highlight for you). They don't actually count or reason so they are liable to just make up figures based on their training dataset, not the data you give them.
Using them for actual financial analysis and generating reports based on data will lead to hallucinated figures which conform to what was asked for, not what the data says and silently fills in gaps in the data. It's extremely dangerous and not something they are good at at all.
aidos | 9 hours ago
I’m saying there is a way in which they can be used where there isn’t scope for numerical hallucinations at all. They can write sql queries, for example, without ever being allowed to even see numbers.
What invariably does and will happen though is they’ll inner join instead of left join and some data will get missed. Or there will be some missing context (users in this set already have a certain class of property by virtue of some selection bias and that will be mistreated as some signal etc).
angled | 16 hours ago
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
beng-nl | 16 hours ago
Thanks!
angled | 15 hours ago
beng-nl | 14 hours ago
Antibabelic | 16 hours ago
vanuatu | 16 hours ago
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
dawnerd | 14 hours ago
alexwwang | 15 hours ago
BOOSTERHIDROGEN | 15 hours ago
schnitzelstoat | 15 hours ago
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
opto | 15 hours ago
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
It makes no sense to me.
tkgally | 15 hours ago
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
bradley13 | 14 hours ago
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
AI is a tool. Use it appropriately.
opto | 12 hours ago
Yes, but no room is made for people who see no use for it. There is a forced-consensus that this technology is useful, which I have to combat against at work.
We teach in a very different environment, but your use sounds typical of my colleagues. "I ask it for suggestions and pick one", but nobody seems to wonder about what is lost when we shrink the horizon of what we will teach to the most likely outputs from a chatbot, one of which we will use.
Maybe this makes more sense in other fields. I have to prepare people to work in the shipping industry, in extremely dangerous roles where they will be operating heavy machinery, steering ships, driving cranes etc. The fact is that AI knows next to nothing about this field because an AI cannot experience handling a ship in rough weather, has never secured a boat to a ship's side with the rain and wind in its face.
Yet, when people are brought in to instruct our trainees, they are told to "tell AI what you want and pick one of the suggestions", in the best case, or just give over everything to the AI in the worst case. And nobody seems to be able to explain why this is a better way of working than sitting with a pen and paper, brainstorming some ideas for a lesson based on your real experiences, and then delivering it. The only justification I'm ever given is your one, "I pick from a list so I am really still in control", "it's quicker and I don't have to think as hard or as long", "it's better at making slides or writing good-sounding (to management and auditors) lesson plans". No-one ever seems to justify it by saying it is genuinely a better experience for the trainees.
hug | 11 hours ago
This is the crux of the issue -- The technology is useful. Using it appropriately is probably the thing that people are ignoring, but you're conflating one and the other in your comment.
It is not useful to you in this case, and complain that it is an overall detriment in your industry. Those are fine and reasonable statements and conditions, and I see no reason to disagree with them... But your first statement, people who see no use for it? That is, to me, as off-putting an opinion as the consequence-unaware hypebeasts who are running OpenClaw with access to their trading accounts and can't see why others aren't.
I sympathise with the idea that everyone wants to use the new hammer and so is treating every problem like a nail, but hammers are still pretty good tools. (And you can ignore the ex-NFT-fans hammering on their dicks in the corner.)
opto | 10 hours ago
To me, as a non-techie person, it feels as if people who work in software believe that because their work can be done by AI, everyone else's can, too. Or that this would be better, simply because it proposes a technological solution to human work — it is taken as read that a solution which uses cool sounding computers and data farms is better than one done by humans with a pen and a pad and life experience. They don't have to justify this belief, because the money is on their side.
hug | 9 hours ago
I do think, maybe alternative to your view, that LLMs can provide useful feedback to graduate-level employees in most fields.
It is not that the work can be done by LLMs -- we're not there, yet, in software or otherwise -- but that LLMs as useful tutors specifically in regard to denouncing known bad ideas is largely applicable all over.
What I mean by the above is that I have yet to find a truly interesting idea spun from whole cloth by an LLM. They're mediocre at it. They're trained from the aggregate thoughts of those in every industry, and you and I both know that the aggregate of the industry is, generally, mediocre.
Conversely, though, is the hit: They won't be worse than mediocre. An indefatigable tutor who gives no great advice but will counsel you against blowing yourself up (or cutting a limb off with a rope, or falling overboard) is, to me, worth an amount.
The failure modes will get better, the advice will get better. Are we there, now? Unsure. You can tell us all better.
On the ten year horizon, I'd place a bet, though.
opto | 5 hours ago
What are the likely use cases in my industry then? That AI is used to bodge the important paperwork that protects lives; is used to draft legislation; is used by both employees and management to do things like personal development reports.
Is anyone meant to be impressed? Is this worth communities having their water stolen from them?
I appreciate I am skeptical, but it is hard not to be when the world spends all day telling you a piece of technology is going to fundamentally change the world, and in real life you only see people use it to blag CVs, personal reports, and lesson planning.
bradley13 | 4 hours ago
That is purest hyperbole. Data centers use a lot of electricity, but they are hardly looting local communities. The water issue is wildly exaggerated, unless a data center is located in a desert, because most water is recirculated.
And why do you think no one will allow an AI to certify someone on certain topics. Their knowledge at the moment is roughly the average of people in the field. Is an average person in your field not able to certify others? In any case, AIs are improving very rapidly, so what is not possible today will be possible tomorrow.
As an example, let me point out the Tesla FSD. On a per-mile basis, self-driving Teslas have a massively lower accident rate (less than 20%) than human-driven vehicles. That is a very physical activity being handled by an AI.
Quothling | 15 hours ago
We have whatever AI is in teams transcribe every meeting, and it's scaringly good at it. It's also extremely good at sumerizing or finding things from pervious meetings when tasked. One disadvantage in this, is that I can see how stupid I sound on writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped beliving in AI.
It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT) means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite litterally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every promt).
piokoch | 15 hours ago
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantee that the data are private, this is very important for many companies both from the IP protection perspective and the liability to expose some users/customers data (think of GDPR regulations is Europe).
I don't know who is behind Liberchat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to loose and if shit happens it is easier to sue them than some random VC-financed company from the USA.
TrackerFF | 13 hours ago
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
generationP | 13 hours ago
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
Havoc | 12 hours ago
That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible
cold_harbor | 10 hours ago
verdverm | 6 hours ago
Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!
lopsotronic | an hour ago
A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.
Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.
So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise content systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.
So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?
In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.
pr337h4m | 16 hours ago
hootz | 10 hours ago
gib444 | 16 hours ago
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too how much you invest each month researching all the latest and great AI tech?
RobinL | 15 hours ago
gib444 | 15 hours ago
azuanrb | 13 hours ago
x86cherry | 15 hours ago
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
DeathArrow | 11 hours ago
Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allows a massive usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of choice including Claude Code and you are good to go.
Also, Nvidia is offering free access to top models for free through NIM - but you have 40 RPM limits. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25
NortySpock | 5 hours ago
Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.
I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.
I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.
bunzee | 16 hours ago
2001zhaozhao | 16 hours ago
Hmmm......
tardedmeme | 14 hours ago
kkarpkkarp | 15 hours ago
jrowen | 16 hours ago
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.
LarsDu88 | 15 hours ago
Opus 4.5 hit that point in November.
grey-area | 15 hours ago
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
vessenes | 12 hours ago
“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”
Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.
grey-area | 12 hours ago
vessenes | 11 hours ago
EDIT: both agents took about 20 minutes. I used that exact prompt in a clean directory for each, and then said "deploy to netlify" - so a total of two prompts.
Codex: https://astounding-bavarois-27b5a2.netlify.app
Claude: http://strong-hotteok-91dfb0.netlify.app
Netlify is having trouble claiming the Claude project, so if you need a password it's "My-Drop-Site"
FYI, Claude rated itself 7.7/10 for fun, and Codex 98/100 during the fun test loop. As you'll see if you poke at them, Claude needs a physics bug fix round. But I think these both did about what I would have expected.
grey-area | 10 hours ago
Claude one doesn't really work (collision detection was the problem I had before too), but fairly close.
Yes when I tried previously I had a few gameplay issues in frogger and I couldn't manage to one-shot this sort of thing at the time (a year ago), so last year definitely saw some good progress at this sort of thing. The asteroids game I was very happy with though, had a very cool retro feel and was wireframe only. Wasn't so keen on the code produced as it had a patchwork feel to it.
vessenes | 10 hours ago
I think a year ago this would have taken a lot of back and forth and arguing; to me that's kind of the point of Simon's article -- a lot more just 'works' now.
grey-area | 9 hours ago
I think his article is for the last 6 months - my feeling is progress with LLMs has stalled recently and generated code still has problems with accuracy and coherence and subtle bugs, but everyone has a different experience.
LarsDu88 | 2 hours ago
The game I was thinking of is relatively obscure -> Panel de Pon
pineapple_opus | 15 hours ago
emil-lp | 15 hours ago
Well, a combination of that and believing that replication of test data is a good measure of progress.
vessenes | 12 hours ago
JohnKemeny | 8 hours ago
At the same time failure proves little because most humans also could not manually create a correct SVG of a pelican riding a bicycle.
What is it exactly that such a test is testing?
In which situation would you measure the "competence" of a human being by asking them to write an SVG of a pelican riding a bicycle?
ClikeX | 14 hours ago
ActionHank | 2 hours ago
wewewedxfgdf | 15 hours ago
nickvec | 14 hours ago
koolala | 14 hours ago
koolala | 14 hours ago
specproc | 14 hours ago
dcminter | 13 hours ago
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
schnitzelstoat | 14 hours ago
victorbjorklund | 14 hours ago
xnorswap | 14 hours ago
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
troupo | 10 hours ago
simonw | 9 hours ago
nothinkjustai | 15 hours ago
tomhow | 12 hours ago
We detached this comment from https://news.ycombinator.com/item?id=48189072 and marked it off topic.
ramon156 | 15 hours ago
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using less tokens, but maybe Qwen/Kiwi's quality is higher? I dont know.
pferdone | 14 hours ago
0xCMP | 14 hours ago
jimbobthemighty | 14 hours ago
https://gemini.google.com/share/55e250c99693
grey-area | 14 hours ago
notachatbot123 | 14 hours ago
sevenzero | 14 hours ago
dzhiurgis | 13 hours ago
layer8 | 13 hours ago
tonyedgecombe | 13 hours ago
dzhiurgis | 11 hours ago
viking123 | 11 hours ago
KoolKat23 | 13 hours ago
eloisant | 12 hours ago
"No that's not me, that's AI"
alpaca128 | 12 hours ago
sofixa | 11 hours ago
drdaeman | 14 hours ago
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
colinb | 14 hours ago
Retric | 14 hours ago
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
floren | 12 hours ago
God, I'm sorry
grey-area | 12 hours ago
Please try to get her to stop.
abc123abc123 | 11 hours ago
This racism against AI-generated stuff has to stop. If not, we'll have a butlerian jihad on our hands that will set back prosperity, development and science for decades, perhaps centuries.
People mention the artists... ohh, boohoo... either do it on your free time, improve your performance and selling skills or move to another job.
It's not my job to slave away only so that artists can day dream and produce stuff that no one cares about.
simonw | 9 hours ago
aceazzameen | 9 hours ago
whilenot-dev | 9 hours ago
I always care about the processes involved, especially if any human work is involved, from all its accuracies to its errors. For me, interesting things happen while we balance our understandings with a certain amount of holism and a certain amount of reductionism. Putting it on either side of the scale, like your holistic statements, is just pure ideology, and that doesn't hold any merit in reality and is honestly just bland, repetitive and boring.
Retric | 3 hours ago
Yes absolutely. I even measure them on the same scale and sometimes the frozen pizza wins.
I’ve literally got an authentic wood brining pizza oven at home and it can cook some great pizza, but that doesn’t mean its output is somehow in an untouchable category it’s just food. Further, with access to the real thing novelty goes away and it needs to sand on its own.
ryandrake | 11 hours ago
mcfedr | 11 hours ago
flakeoil | 12 hours ago
I suppose it is more the latter, and it's the artistic people who create stuff who will suffer. The ones coming up with ideas, but previously couldn't create becasuse they lacked skill might win thanks to AI.
Coming up with ideas is easy, creating and putting in the effort is hard (until we had AI).
Probably the value of created stuff will go down rapidly because there will be so much of it.
grey-area | 12 hours ago
When advertising agencies for example see that their copywriter can go from idea to concept with a video generator instead of engaging an animator, they’ll simply cut the middleman who used to create that animation for them and use the tool instead, even if the content isn’t as good (though the quality of this one is really pretty good, there are obvious problems). They’ll happily accept mediocrity to save money.
People will still create adverts but quality and creativity will go down and a lot of jobs are going to be suddenly displaced.
AussieWog93 | 12 hours ago
grey-area | 12 hours ago
But yes, for anyone who does this for a living there will be obvious deficiencies, esp when you try to do something truly novel, intentional and interesting and don’t quite want what it produces.
But in this area they have made quite a lot of progress.
wongarsu | 12 hours ago
And looking at the trajectory of the animation industry, I don't think increases in productivity will be used to raise the quality of the animation if the alternative is to just pay fewer animators
hackable_sand | 3 hours ago
sfdlkj3jk342a | 14 hours ago
https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...
Interesting that it does better at making the pelican peddle in the video generation than in image generation.
ionwake | 13 hours ago
falcor84 | 13 hours ago
ionwake | 6 hours ago
simonw | 9 hours ago
IdiotSavage | 11 hours ago
mycall | 11 hours ago
ciberado | 11 hours ago
horsawlarway | 9 hours ago
The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.
navane | 9 hours ago
djeastm | 6 hours ago
swed420 | 9 hours ago
> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".
nijave | 8 hours ago
I've heard the same has happened with common benchmarks (they've ingested solutions into training data)
nijave | 8 hours ago
bradley13 | 14 hours ago
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
hollowturtle | 14 hours ago
It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
All I observe they got better at tool call and answering questions about big codebases, especially if the question has a vague pattern to search, and they're superuseful for that! For generating production code even with a lot of steering and baby sitting?
Absolutely not, not quite there not even close in my experience.
But we should stop talking about 1s and 0s, especially with marketing hype trains, there exist a gradient of capabalities that agents have that really depends on the intricacies of the codebase you're working on, I think everyone has yet to discover how to better apply these tools in their day to day work.
But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!
That's why the debate is so polizered imo, there isn't a shared experience
Razengan | 13 hours ago
You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.
So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)
Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol
hollowturtle | 13 hours ago
jaccola | 13 hours ago
Then I have a script that summarises that I usually run before pushing or at end of day.
Works quite well for both improving my code and the code ai wrote.
maccard | 10 hours ago
I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.
Razengan | 9 hours ago
kstenerud | 13 hours ago
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
hollowturtle | 13 hours ago
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
paulluuk | 13 hours ago
hollowturtle | 13 hours ago
IMO Every engineer should try spending his time in a company that tries to solve new problems.
Otherwise we will be stuck, as we are now, with big tech paying you mountains of money for doing nothing, incentivizing you to embark on useless activities for letting other managers have a career, fear layoffs and when that happen complaining about it because "it's a year i'm looking for a new job" pretending same compensation and environment. Web development jobs are particularly affected by that.
In the game industry, for example, if you don't do something interesting your game won't sell a copy.
Let me stress this out again, if LLMs get you 97% there, maybe you should try another idea.
pell | 12 hours ago
Yet typically 95% of software developers mainly work on CRUD-type apps. Coding agents are not perfect there either but they’re really a lot more reliable than they were a few months ago.
infecto | 8 hours ago
Unique game loops ideas make a good game, it has very little to do with the engineering. This is true for most software engineering products. Most engineering work is just reinventing or reimplementing existing ideas, what you describe rarely exists. It may exist in that the people learning the new ideas think it’s novel but very little is truly unique.
ThrowawayTestr | 13 hours ago
jiggawatts | 13 hours ago
Reverse engineering a proprietary protocol from a binary executable.
I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.
My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)
Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.
There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
bluGill | 7 hours ago
I think you could have. However I don't think you would have - there is a big difference. It is a lot of work to to that, and people who try normally give up. However if your boss told you could have. Note that I suspect from your story this is more like give this to a dozen people and in 2 years you get results - at a cost of several million dollars.
kstenerud | 13 hours ago
> For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.
As I said, this is an example of using AI successfully to produce a high quality product (one that I use every day).
But to your point: I am solving hard problems that people really have. You just don't see those because I haven't mentioned them publicly yet. And they won't be released or talked about until they're ready.
Philip-J-Fry | 13 hours ago
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
kstenerud | 12 hours ago
That's not what this product is; merely a tool it uses.
pprotas | 12 hours ago
kstenerud | 12 hours ago
I also strongly suspect that you'd only taken a cursory glance at the top of the readme prior to passing judgment.
embedding-shape | 11 hours ago
Now it was a long time ago I did Go professionally, but I'm also in the camp of "That doesn't really count as high-quality", although I know for a fact you can get quality code out of LLMs, but I don't think that's a good showcase of that.
kstenerud | 10 hours ago
Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet.
Admittedly, the parsing & escaping code and some utility functions could be moved outside to shrink the file, but otherwise I'm having trouble finding issues with the code.
embedding-shape | 9 hours ago
Look for slight variations of the same thing but with different paths, variables, or modes and I think you'd be able to spot the rest as well.
kstenerud | 9 hours ago
freedomben | 8 hours ago
kenjackson | 7 hours ago
TurkTurkleton | 7 hours ago
embedding-shape | 6 hours ago
Be surprised then, because me, who left the critique, probably exclusively programmed with agents for the last year or so, so unlikely I think the code is bad because I "don't like AI". I don't love it either, but wouldn't call myself a AI-hater by any measurements, would be weird to write articles like this if so: https://emsh.cat/en/one-human-one-agent-one-browser/
kenjackson | 3 hours ago
16bitvoid | 7 hours ago
But people are so quick to label their vibe-coded codebase as high quality and no grace is going to be given to a machine.
What comments are you seeing that are calling code from humans high-quality?
whateveracct | 34 minutes ago
Because the end result is people committing bad code. For some random hobby project, sure who cares. But people are using this at work. The codebase is rotting in a new innovative way.
Either the bar has to be set at "actually good code comes out of vibe coding" or you have to accept that codebases are going to steadily become less usable by human coders who use their fingers to type in emacs.
Suddenly every dev needs an agent to even work with the slop. Seems like an outcome Anthropic would love though....
breuleux | 5 hours ago
AI code is competent, but it's not great or high quality unless you have a good enough eye for quality to steer it with an iron hand. But if you do, you know the quality comes from proper guidance, so you still wouldn't say AI code is great. If you do say exactly that, it comes across as having low standards (which is fine if you own it) and people are going to jump on that just to bring you down a peg.
ThrowawayR2 | 5 hours ago
Because that is literally the hype being fed to us by the marketers at the AI companies and HN users promoting AI.
- AI promoters: "AI is doing Ph.D level work! LLMs are not just a token predictor, it is actually thinking and reasoning! It will replace all developers, including _you_, so get on board the AI hype train now!"
- AI promoters when confronted with blatant mistakes and reasoning errors from cutting edge models: "Why are you holding LLMs up to higher standards than humans? That's not fair or reasonable."
everforward | 2 hours ago
E.g.
https://github.com/kstenerud/yoloai/blob/main/internal/fileu... <- that recursively creates directories, but will only change permissions on the innermost dir (user may be unable to cd into intermediary directories)
https://github.com/kstenerud/yoloai/blob/main/internal/mcpsr... <- all the json.Marshal calls in this file just suppress errors, so if anything un-marshallable ends up in there the app will return empty strings with no errors logged
https://github.com/kstenerud/yoloai/blob/main/runtime/regist... <- `Register` embeds a copy of the code from `IsAvailable` because of the locking; that could be replaced with a private `isAvailable` that has no locking that both use (after doing their own locking)
https://github.com/kstenerud/yoloai/blob/main/runtime/exec.g... <- these functions are identical except for the strings.Trim, one should just call the other and then trim the output
Just out of curiosity, I enabled some other linters and it looks bad. Excluding test files, there are 110 functions with a cyclomatic complexity over 10 and 7 that are _over 50_. The worst is at 86, which is mind-boggling.
Could probably find more, but you get the drift. I'm sure it runs, but stylistically this is more along the lines of what I would expect an intern to do.
This is also sort of nit-picky, but like half the stuff in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe... isn't idiosyncratic, it's just the way those things work and a lot of them aren't even tricky. The one linked is particularly blatant; that's not limited to os.Stat that's literally just how permissions work. Denying permission on inodes is a property of the folder, not the file.
Philip-J-Fry | 12 hours ago
Can't you see in the gif? It's completely broken. My Claude doesn't look like that. Neither does anyone else's.
kstenerud | 12 hours ago
Likely there are some terminal caps that aren't being properly preserved inside of the sandbox. It's never bothered me since the agent itself works fine.
Philip-J-Fry | 12 hours ago
"It's never bothered me". Cool. But your tool is bugged.
kstenerud | 11 hours ago
Or feel free to avoid the tool entirely if this UI issue shakes your faith in its overall quality down to its very foundations.
This is hardly a hill to die on.
sjagauanbdvva | 11 hours ago
You claimed high quality and provided a repo.
Did you not expect someone to actually look and critique it?
Whether the visual bugs are a deal breaker or not isn’t the point.
The point is that’s not high quality code, it may work. But it’s not code I would ship at my job and therefore it’s not high enough quality for anyone serious
kstenerud | 11 hours ago
But I still stand by the quality of my code, including here. You and I don't need to agree.
What decades of managing codebases (public and private, huge and small) has taught me is that there will always be an endless list of bugs and feature ideas and nice-to-haves and technical debt pressures in any given project. You'll never get to them all, so you prioritize (as I have done here). Functional bugs usually trump visual ones unless they're actually interfering with work.
Will I fix this bug? Probably, now that I'm aware of it. But there are more important matters to attend to first.
Edit: Turns out the bug comes from a mismatch with the terminal I'm using. With other terminals it looks fine. Term caps are surprisingly complicated, especially when you have multiple layers!
hiw2d | 9 hours ago
kstenerud | 9 hours ago
Thanks for explaining it for me.
gilrain | 9 hours ago
gilrain | 9 hours ago
You aren’t having a disagreement with a person. You’re having a disagreement with reality.
kstenerud | 9 hours ago
How so? Are you going to instruct us all on how a termcaps mismatch bug is an indicator of poor code quality, rather than an unfortunate bug emerging from within the chaos of the many layers of disparate technologies that must somehow be stitched together (along with their idiosyncrasies) in order to make a project like this work?
sjagauanbdvva | 8 hours ago
You had a visual bug right at the top of the repos README. Then insisted you hadn’t noticed it before.
Whats important is not that specific visual bug, it’s what that bug says about the rest of the code.
How can we believe that this code is high quality if we see a glaring issue 5 seconds into opening the github?
We didn’t seek out your repo and start lobbing critiques at it. YOU POSTED IT as an example of high quality generated code. I’m telling you I am unimpressed
kstenerud | 7 hours ago
Really? So the discussion leading to the theory that there's likely a problem with termcaps disparity between layers didn't happen?
> Whats important is not that specific visual bug, it’s what that bug says about the rest of the code.
Really? So you can tell from a single cosmetic bug which doesn't affect its ability to perform its task, that the rest of the codebase is deficient? That's a pretty damn impressive skill!
Hater's gonna hate, I guess ¯\_(ツ)_/¯
The otherwise timid pack always circles after they sense a single drop of blood, no matter how small and insignificant.
eudamoniac | 3 hours ago
andai | 11 hours ago
Also this reminds me of a principle I learned from a mentor. "People are visual buyers. If it looks good, people will think the code is good."
Unfortunately it doesn't matter whose fault the janky TUI is, people will see that and associate it with your software.
kstenerud | 10 hours ago
Early stage products will have some rough edges. We've seen that in Docker, Kubernetes, AWS, Azure, LXC, KVM, etc. And people griped and raged about the sheer incompetence of the maintainers and utter lack of quality, but they still used those tools even before the rough edges were polished away and folks finally settled down.
The less one pays for something, the more entitled one feels to whinge and heap on abuse.
I've been down this road so much now that it's no biggie if a few Karens want to blow off steam at my expense. I'm not above exposing their silliness though ;-)
wasabi991011 | 6 hours ago
Is your product really the same complexity as these?
kstenerud | 4 hours ago
Is it doing it to the same scale? No - it's a single user app. But have a look at https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe... and you'll see the kind of shit a project like this has to handle. It's not trivial.
pprotas | 6 hours ago
blanched | 3 hours ago
walthamstow | 12 hours ago
The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.
Philip-J-Fry | 12 hours ago
I don't know what the project is. All I see is a TUI that looks completely broken.
Go and use Claude Code right now. Does it look like that? Random underscores all over the page. No it doesn't.
walthamstow | 12 hours ago
Philip-J-Fry | 12 hours ago
His tool wraps Claude and breaks the TUI. What's so hard to understand?
That's valid critique. What world have I woke up in today?
walthamstow | 11 hours ago
gilrain | 9 hours ago
> The question is why are you so eager to give critique on unrelated work, appearing in a demo screencap, to someone who didn't produce it?
I guess the question was actually, why were you so eager to critique a critique based on a false assumption?
I wish people would be careful what they support with their rhetoric.
albedoa | 6 hours ago
That is not the question. The topic of discussion had been defined multiple times before you commented!
embedding-shape | 11 hours ago
That's like blaming the company making hammers because you're unable to build a lasting house with the hammer, it really isn't up to Anthropic, but all about how you use the tool you're holding.
malfist | 10 hours ago
embedding-shape | 9 hours ago
And if that's not true, then it's quite literally about how you're holding this hammer.
malfist | 9 hours ago
Just because the naked cowboy can paint well with just his penis, doesn't mean a penis is the right tool for painting. It doesn't matter how you hold your penis, it's not the right tool.
freedomben | 9 hours ago
I can't decide which joke to make, either (little dick joke) "well yeah you'd have to be able to see your paintbrush in order to use it" or (big dick joke) "well yeah, if you can't even hold it in two hands, how are you supposed to paint with it?" so I'll just make both :-D
embedding-shape | 7 hours ago
malfist | 6 hours ago
It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.
embedding-shape | 6 hours ago
Ok, I agree with this, don't use the wrong tool for the wrong job.
> It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.
Yes, I agree with this too.
I'm still not sure how this relates to LLMs and particular this specific context. I claimed that the output of your agents depend on the developer driving it. You're saying "not every tool is right for every job", I agree with this too, but is that against/for what I said?
Could you just clearly write out exactly what you're arguing for here, no analogies or metaphors, just plain and simple, because I still feel like we're having two different conversations.
knollimar | 9 hours ago
embedding-shape | 7 hours ago
knollimar | 7 hours ago
embedding-shape | 6 hours ago
Microsoft is pretty shit at launching products, does that mean "products" as a concept is wrong? No, it just means Microsoft is bad at products, not more than that. Not sure why you have to extrapolate over an entire ecosystem just because one actor is bad at something.
knollimar | 3 hours ago
I wouldn't trust a toolmaker who doesn't know how to use the tools decently.
arcanemachiner | 6 hours ago
SlinkyOnStairs | 10 hours ago
1) This tool breaks the Claude TUI. Exactly as described by the comment.
2) The Claude TUI itself is broken. The comment is wrong, but assuming the "billion dollar TUI product" is capable of basic rendering and it's the wrapper that broke it, that is an entirely reasonable assumption
The fun here is that both of these softwares were made extensively using AI. No matter which of our options is the case here, the point stands. An AI-built product was shown, it looks obviously ass.
kstenerud | 10 hours ago
Claude Code correctly reduces its display to 7-bit ASCII in response (still functional, although less pretty). Once I get around to fixing this, it will probably result in another section in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
Edit: Looks like it's the terminal. That's a rabbit hole for another day.
Running through VS Code's terminal via VSCode tunnel, it looks like it normally does.
https://freeimage.host/i/BySkkDN
oooyay | 7 hours ago
draftsman | 5 hours ago
kstenerud | 5 hours ago
Zero defects? Because you can always find at least one defect. But people don't naturally think statistically, so they grasp the thing that confirms their bias and then hang on tenaciously.
You'll notice the incredible amount of vitriol resulting from a purely cosmetic bug (which, it turns out, results from a missing TERM env in the base image - Claude is very conservative when it can't determine utf-8 support with 100% certainty).
wolrah | 46 minutes ago
There's one major reason to have higher expectations for autonomous systems (of all kinds, not just LLM-powered) than for humans, at least those intended to be deployed at scale, and that's the scale. If a human makes a mistake, has biases, or even intentionally breaks the rules the impact of their actions is limited by the nature of them being a human, where something like an autonomous driving system, a coding agent, etc. is intended to be deployed by the thousands, millions, or more and any problematic behaviors happen at that scale.
There are obviously millions of bad drivers out there, but every one of the human ones is bad in different ways. If Waymo pushes a bad update there could be tens of thousands of "drivers" that suddenly become bad in identical ways.
Humans also have the ability to learn from our mistakes. The ones you'd want to have working for you usually don't make the same one twice. LLMs are pretty good at making the same mistake repeatedly, even the simplest things like basic math or counting letters.
gcr | 10 hours ago
my-next-account | 4 hours ago
vdelpuerto | 8 hours ago
wanderlust123 | 8 hours ago
wickedsight | 12 hours ago
Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.
It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.
I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.
kstenerud | 12 hours ago
You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"
hollowturtle | 12 hours ago
Then you should seriously question for who you're working for imo.
> It also isn't lazy.
It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum I want to know a test that can "guarantee" me that a further change doesn't break existing code. And producing this high quality in tests is HARD, and requires a lot of steering with agents. This culture of tests code coverage is just wrong, the best code base I worked with had code coverage only on the net percent of code that matters, the rest is covered by for static type checking and integration tests
ryandrake | 11 hours ago
I think it’s really down to this. Nobody can agree on what counts as production-quality code. I remember joining a company with what I think (hope) most of us would call horrible quality code. It was an absolute mess, barely compiled with hundreds of warnings, and had uncountable number of bugs. They didn’t even have a bug tracker so nobody even knew how many they had.
But the people working there already were so proud of it! None of them had ever worked for another company so they had no idea how bad their code was in comparison with the rest of the software industry (which itself is a very low bar). I told the founder we had a huge code quality problem and he looked at me like I had horns growing out of my head.
When someone says their LLM is producing “production-quality” code, actually look at it and see. Arguing about it on HN is pointless because everyone’s quality bar is different.
timr | 11 hours ago
> .row > div > div, .alert
This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.
I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.
kstenerud | 11 hours ago
LLMs have traditionally had problems with visual rendering (the good ol' pelican on the bicycle test). I wonder if this is more of the same?
timr | 11 hours ago
Like I said, this is just an example that happens to be CSS. I see this stuff daily, if not hourly.
kstenerud | 11 hours ago
What I've found helps (at least at the other layers) is to have principles documents and standards documents for the AI to reference when it's modifying code. Principles documents describe the why, and standards documents describe the how.
So for example a few parts from my initial CSS-standards.md (still needs a lot of revision):
timr | 10 hours ago
I don't mean to over-state the importance of these little errors, just to say that agents do plenty of dumb stuff, even today, and the people who say otherwise are selling something or (hot take incoming) some combination of stupid, lazy and/or delusional.
freedomben | 8 hours ago
Just IME, the quality of the prompt often significantly affects whether it does bad stuff like your example. It's not easy by any stretch and I'm still getting there, but I'm up to a couple dozen or so "Agent Instructions" in my CLAUDE.md files for various projects that have to say things like: "when doing TDD, don't write tests to verify bug fixes in tests" because the agent is really good at following things literally. I am sure it will continue to improve, but until then every project needs some bandaid things like that.
sjagauanbdvva | 11 hours ago
Amazing how the LLM is godly with things I don’t understand, and falls over completely when it works in my domain… I wonder why that is /s
timr | 10 hours ago
Specifically for CSS, these bots really want to just barf out tailwind-style crap. If you deviate even slightly from the standards and practices of the modal front-end developer, you quickly see how these things are brittle, and no amount of prompting and cajoling will truly affect their behavior. In this case, you're kind of seeing the downstream affects of saying "no, do NOT do tailwind, make actual CSS with actual semantic class names please and thank you."
Perhaps ironically, this results in the quality of output I might expect if I had prompted a right-out-of-bootcamp coder to do the same. (But at least it doesn't whine about it!)
maxsilver | 9 hours ago
I get it. The LLMs struggle most with state. They don’t have a real fix for that yet. People generally compensate by shoving everything into context, and making the context window as large as possible, which half-works.
Tailwind happens to be “stateless” CSS framework. Nothing uses anything else, nothing is shared, nothing is reused, nothing stacks. It’s super easy to write, since you don’t have to worry about anything else, and the styles are all duplicated dynamically and ‘compiled’ — to the point you can copy-and-paste a HTML block with tailwindcss classes from anywhere into your site, and it mostly ‘works’).
—-
Tailwind is uniquely suited for LLM use, because the problem Tailwind solves is the problem juniors (and now, LLMs) struggle with most. An LLM can happily write up a bunch of styles, without knowing any of the rest of the project state, and if it’s tailwind, it will mostly sort-of work.
It just also happens to be bad practice, this style of development is the exact thing we told everyone not to do for two decades. (“Inline styles are bad! Duplicate styles everywhere is bad! It’s bloated, it’s inefficient. It’s the mark of inexperienced front end. Don’t inline styles. Unless it’s a tailwindcss class, you can inline those styles, they get a pass I guess”).
We used to measure our JS and CSS in kilobytes, by 2011 standards this would be “far too bloated for production use”. For the old-timers, it can be hard to grapple with the idea that we’re just purposefully doing ‘worse’ front-end intentionally now. The calculation changes when half your content/styles/front-end is LLM-generated, and therefore completely disposable. Very “they don’t make them like they used to” vibes.
timr | 9 hours ago
For better or worse, web UI development has descended down a dark rabbit hole of bad code over the last decade, and so that is what LLMs were trained on. GIGO.
maxsilver | 9 hours ago
I will say, if you use a Mistral model, and if you insist your CSS framework is Bulma (tell it, 'no tailwind', 'no preprocessor'), it does okay at staying away from Tailwind. (Not perfect, not great, but okay).
No LLM I've used can handle raw CSS well (yet). If you are carefully curating your own classes and styles, you might just be on your own for a bit.
eudamoniac | 3 hours ago
habinero | 2 hours ago
Yeah, absolutely. People think you're picking on, like, code formatting and no, dawg, your code doesn't do what you think it does, or it only handles the happiest of happy paths.
I do find it funny when people get mad about you critiquing their AI project. You didn't even write it, dude.
dominotw | 9 hours ago
windexh8er | 9 hours ago
timacles | 7 hours ago
I use Claude all the time, it is immensely helpful. It is also very nuanced and requires a high level of expertise in a specific domain to produce quality work. Even then, that take time and effort. Anyone saying otherwise, quite frankly, doesn’t know what they’re doing.
treme | 13 hours ago
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaulated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
hollowturtle | 13 hours ago
Who needs to generate a dumb demo of a 97% done crud app? We had code generators for those, everytime I read claims like that and I ask to explain further I then discover it's people who were not productive before generating the so called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
LLMs can effectively validate your business idea
jaccola | 13 hours ago
If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??
layer8 | 12 hours ago
Just a nitpick regarding “never”: Learning resources weren’t abundant and free 25 years ago, that’s a more recent development.
skydhash | 10 hours ago
layer8 | 3 hours ago
HPsquared | 12 hours ago
viking123 | 11 hours ago
It will be just garbage on top of garbage.
jmcodes | 13 hours ago
hollowturtle | 12 hours ago
All I know is that we have a gigantic amount of tech debt we accumulated on the web chasing the next web framework built on top of tons of abstractions with very disappointing native web apis that shouldn't be taken seriously nor the w3c who specified them.
And when an Agent it's capable of gluing together a web app with some crud backend with a very rounded corners UI, that solves nothing for end users, we call them capable. These are not hard problems
squidbeak | 12 hours ago
skydhash | 10 hours ago
If you’ve ever work directly with a user, you know how vague change requests can be. Try writing some vague prompts like that to the agent and see if it can solve them.
For some, writing down a (good?) specs and handing it to an agent is not very productive. Because by then, they already have an idea of the solution and can use the editor to have it done.
hexasquid | 12 hours ago
HPsquared | 12 hours ago
TeMPOraL | 9 hours ago
AussieWog93 | 12 hours ago
To give a specific example, 12 months ago I had a client pay me me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.
These days you'd just one-shot it in Claude.
hollowturtle | 12 hours ago
AussieWog93 | 12 hours ago
But the client's problem was solved, and they're happy.
This is a genuinely useful thing. You don't need to shit all over it.
emilsoman | 12 hours ago
AussieWog93 | 11 hours ago
Made ridiculous bank during 2019-2023, lost money 2024-2025 (I wasn't doing proper accounting at that stage, so it took a while to really internalise that the market wasn't insane anymore), looks like we'll make a decent-ish profit in 2025-2026 after pivoting the business model. Some regrets but it's possible staying in software could have been just as turbulent.
Funnily enough we're finally at the stage where I can launching my SaaS side-hustle which I've been sitting on for the past year and a half, so that could end up back in software again soon.
I would never say never, since I don't know what Claude would look like in 5 years' time, but there's plenty it can't do at the moment.
To give a concrete example, I don't let it make sweeping changes to the main "business logic" of my SaaS. Not because it's necessarily wrong but because I can't easily verify it. But I'll let it rip on peripheral stuff, or co-work with it.
Glohrischi | 9 hours ago
CRUD applications and converting business requirements into code is the thing software developers do to 99% day in day out.
kamaal | 7 hours ago
You will see its basically a very reusable part thats already done uncountable times else where.
People who think they do something so special and novel that it just can't be done by non-human, struggle with breaking down a problem in smaller parts.
Even if you do have such novel problems, its not like every single day, every single bit of work you do is like that.
simonw | 10 hours ago
johndough | 2 hours ago
Factor 135066410865995223349603216278805969938881475605667027524485143851526510604859533833940287150571909441798207282164471551373680419703964191743046496589274256239341020864383202110372958725762358509643110564073501508187510676594629205563685529475213500852879416377328533906109750544334999811150056977236890927563 in less than 24 hours.
Come up with a way to sample from LLMs such that they can tell funny jokes. The jokes should not be recited jokes from elsewhere.
Implement a CUDA kernel that achieves optimal efficiency for PyTorch-like conv2d for "reasonable" shapes/strides/dilations/groups. (This task is the closest to being solved by LLMs, but they usually get stuck somewhere doing stupid things instead of considering more advanced optimization methods and still need a human to push them along).
simonw | an hour ago
y0eswddl | 9 hours ago
I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems to be that's not the norm or the case everywhere.
and it's the 90% that's most vocal. Sturgeon and D-K seen to go hand-in-hand.
mwigdahl | 4 hours ago
“No, it’s the children who are wrong.”
nayroclade | 12 hours ago
The answer is "for lots of people, but not you".
You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of poralizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".
hollowturtle | 9 hours ago
> Absolutely not, not quite there not even close in my experience.
I obviously mean in my experience, not the real truth.
> That everyone else is being led by a "marketing hype
That is obvious instead, and I later say there's not 0s or 1s, every job has his intrincancies
ncruces | 12 hours ago
My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT), they're able to just pick up the project and work at all these levels:
- Go code that implements the transpiler (parsing Wasm, building an AST)
- Go code that gets generated by serializing the AST to a .go file
- Go code that manipulates the AST (to optimize it), and its effect on the generated code
- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST
- C code that gets compiled to Wasm, then translated to Go, then called by Go
- Go code that gets called by this C code to implement a C stdlib
- WAT and WAST files that are used to implement the Wasm spec tests
I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.
And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parenthesis" in the Go code (I do have some LISP experience; it's still easier).
Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.
https://github.com/ncruces/wasm2go
netcan | 12 hours ago
Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.
Also... I think our era has an intrinsic bias that change=progress, productivity, etc.
Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.
But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.
A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.
Maybe administration was never really a bottleneck.
Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.
Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.
I haven't heard of many belting out features, and increasing prices or sales.
Most bottlenecks are upstream of another bottleneck. Few are a "dam."
benterix | 10 hours ago
The problem is that our CEO's fear of the future that pushes them to peculiar decisions that objectively make no sense (cf the infamous discussion of the Microsoft employee on Github that couldn't force its agent to do the proper thing).
It's not the first time I witness this kind of discrepancy and probably not the last, I just learned to adapt to it.
epolanski | 10 hours ago
I agree, but you contradicted yourself just one line above.
> For generating production code even with a lot of steering and baby sitting? Absolutely not
Moreover this is further in contradiction with several facts:
1. the majority of this industry has always been composed by mediocre/bad developers, often unable to write a fizz buzz
2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties
3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?
hollowturtle | 9 hours ago
> I agree, but you contradicted yourself just one line above.
>> > For generating production code even with a lot of steering and baby sitting? Absolutely not
with this last sentence I obviously meant in my experience, it's not that hard. I don't buy your facts are highly biased towards web development, that's a common mistake here on HN to think it's the totality of the industry, luckily it's not
epolanski | 9 hours ago
There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's a magnitude more where people don't admit or tag their PRs as AI generated or co-generated.
In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask a cheap review to an AI tool.
The industry is changing, I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality ones, (especially when used, setup and overviewed properly) is plain denial.
Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.
hollowturtle | 8 hours ago
chasd00 | 8 hours ago
hollowturtle | 7 hours ago
hackable_sand | 3 hours ago
epolanski | 7 hours ago
In general you do seem to be unaware of the trend.
And I want to stress it out: I'm not stoked for the trend or changes, but I'm not blind either.
hollowturtle | 7 hours ago
No they're not and those who are, are in overwhelming control by the engineers that steer continuously the agents in the right direction. First of all this isn't something you can do for novel ideas, especially in gaming, second it is indeed very bad the code they produce otherwise it won't require that much effort from high end professionals to bend the LLMs to their will.
Denial of nothing, it's pretty clear from my original comment above that gen ai is indeed deployed with varying degree of success in various stuff. My point is there wasn't any "inflection point" just a better integration between agents and os tools all inside a loop.
I successfully use AI in my day to day job, just not that much for coding, if I have a sense a task can be one-shotted by Claude I do, if not I don't. Simple as that
Miraste | 6 hours ago
hollowturtle | 3 hours ago
Veelox | 10 hours ago
Glohrischi | 10 hours ago
Its 'production' code because its a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.
The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.
This doesn't need to become so much better to be useful. Its already very useful for a lot ofuse cases like researchers which have to verify the math anyway but are not good in writing code for filtering their testdata, converting them and running it.
Small websites, fun projects, helper tools etc.
But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.
We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.
hiw2d | 9 hours ago
This is nonsense. Im not a SWE but a CEO, if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.
Glohrischi | 9 hours ago
Your Product Manager is not a coding job. Your Product Owner is not a coding job.
vibe-kanban exists you could already do a proper experiment letting your PO maintain a vibe-kanban board with proper requirements and see how an agent progresses.
But 5% is often enough wwhat breaks it. Doesn't help much when your PM, PO or CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.
hiw2d | 9 hours ago
Im hyper efficient. You clearly are not and are full of it.
If youre only doing 5%, you should only get paid for that. lol. Are you happy to take a salary drop?
Glohrischi | 9 hours ago
I'm a Cloud ARchitect with experience in coding (15 years) and infrastructure (10 years) and startup founder...
If you don't comprehend what i write, feel free to ask but don't be dick?
chasd00 | 8 hours ago
keybored | 8 hours ago
keybored | 9 hours ago
Glohrischi | 9 hours ago
2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.
Now its one shoting a lot. Including websides, refactorings, etc.
The question is what is missing? How far are we that it can handle huge code bases vs. smaller ones? How far are we that it can comprehend the whole architecture and doesn't try to put a service in a wrong place just becaus the context is too small?
Mythos is 10 Trillion, that might be already pushing it.
95% might be not enough for someone in sense of "yeah i can't do the 95% and i can't do the 5% either the AI can do 100% or i still need Kevin with his knowledge even if its just for the last 5%"
keybored | 2 hours ago
forlorn_mammoth | 8 hours ago
And no spelling errors either!
Also,
> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet
>> embedding-shape 1 hour ago | root | parent | next [–]
>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?
Falimonda | 9 hours ago
hollowturtle | 8 hours ago
randusername | 6 hours ago
F1 mechanic pops the hood of a mass-market Toyota Corolla and doesn't understand why everyone says it's really good.
A lot of us are out here building websites or phone apps.
Not to say that these things can't also be taken very seriously from first-principles, but I think that's rare.
liuliu | 4 hours ago
Models usually is broken if there is no feedback loop. Well, websites might be exception since they can one-shot pretty well. But there are plenty of things they can do well without one-shot that just requires a good feedback loop to be built.
hollowturtle | 3 hours ago
DennisP | 8 hours ago
hollowturtle | 8 hours ago
datadrivenangel | 8 hours ago
nijave | 8 hours ago
I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.
After having Claude Code "remember" my preferences and tools, it's more efficient.
It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.
noisy_boy | 8 hours ago
That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.
voncheese | 7 hours ago
I know I have struggled to keep up, and fall into the trap of approving things (either commands or recommendations) without taking the time to really process and think about them.
It's a bit like the age old problem of "it's super easy to ask questions, and can be super hard to answer many of them". So the economy of the conversation gets out of whack fast.
nijave | 8 hours ago
I think it more reliably does IaC with established patterns especially when it can do a dry run.
Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho
Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.
I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
rconti | 8 hours ago
prettyblocks | 8 hours ago
hollowturtle | 7 hours ago
As I said we have a plenty of different envs, codebases, requirements. Things are complex.
You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.
Let me stress this out again:
> That's why the debate is so polizered imo, there isn't a shared experience
prettyblocks | 6 hours ago
My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.
Having said all that, you're right there isn't a shared experience.
newaccount670 | 8 hours ago
An idiosyncrasy of humanity is that the dumbest individuals tend to also be the loudest.
psadauskas | 7 hours ago
Now I just use deepseek. It isn't any dumber, and it costs way less.
sroussey | 7 hours ago
JeremyNT | 6 hours ago
I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.
I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.
This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.
topherhunt | 4 hours ago
Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.
hollowturtle | 3 hours ago
I see patterns and solutions emerging from hand coding, I'm not the other way around, I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimumn effort and context.
Starting with a prompt, or in plan mode, it's not how I trained as an engineer, I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand, for example my muscle memory suggest me to use a specific data structure only after I see some code patterns emerging, hard to explain hopefully makes sense.
If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc. it usually start with a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non web UIs). Let alone doing it via spec files, if it's something I don't care about yeah sure, maybe a little tool for entering/editing data, but alas it always default to slop web apps, and I get it I mean most of the training set is on web apps
dasil003 | 2 hours ago
In general I tend to agree with you if you're talking a codebase you are deeply familiar with, the value-add from have agents write the code probably ranges from very small to negative in most cases.
On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.
Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll tradeoff ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.
travisgriggs | an hour ago
Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.
eudamoniac | 3 hours ago
Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathon Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.
Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.
I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.
ionwake | 13 hours ago
inglor_cz | 13 hours ago
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
bob1029 | 13 hours ago
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The later performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
zkmon | 13 hours ago
eloisant | 12 hours ago
> there’s zero chance any AI lab would train a model for such a ridiculous task
Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...
shantnutiwari | 9 hours ago
hansmayer | 12 hours ago
"Coding agents got really good - here, a bunch of non-releavant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right! "
Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like going into fulfilling over the next weeks, does it now?
GistNoesis | 12 hours ago
- Memory market cornering which mitigated the adoption of local AI despite great open model being released.
- Fast penetration of IP exfiltrating tools in companies world-wide.
- Developers producing more code that they can read.
- Autonomous agents killing Open Source by siphoning the attention economy
- Autonomous agents destroyed online communities (including HN)
- Autonomous agents being used in warfare (targeting, propaganda...)
- Widespread vulnerabilities discovered, Widespread supply chain attacks.
- Increasing inequality, fracture in perception, Green indicators, Grim realities.
sigmoid10 | 12 hours ago
viking123 | 11 hours ago
See you in 10-30 years when people are still dying of the same shit as today like oesophageal cancer and glioblastoma.
Maybe in the next century but by that time you and me both will be under the ground, and no, Amodei's doubling of human lifespan simply won't happen.
biophysboy | 8 hours ago
rm_-rf_slash | 7 hours ago
Medicine has done amazing things in my lifetime.
okamiueru | an hour ago
nektro | an hour ago
Asraelite | 12 hours ago
This is a good thing
felooboolooomba | 11 hours ago
evdubs | 4 hours ago
This is a bad thing.
willis936 | 9 hours ago
TeMPOraL | 8 hours ago
Wait, what? What is that?
> - Fast penetration of IP exfiltrating tools in companies world-wide.
That goes on the benefit side, I believe.
> - Autonomous agents killing Open Source by siphoning the attention economy
Anything attention economy disappearing is a "good riddance" to me.
john_strinlai | 7 hours ago
i believe they are just saying that RAM prices went crazy
nijave | 8 hours ago
Ideally we'll come out of the AI hype cycle having learned better practices.
ben8bit | 12 hours ago
EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.
sharperguy | 11 hours ago
_josh_meyer_ | 11 hours ago
alain94040 | 11 hours ago
Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.
kramit1288 | 11 hours ago
epolanski | 10 hours ago
MagicMoonlight | 10 hours ago
chrisss395 | 9 hours ago
+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?
+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.
It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.
Razengan | 8 hours ago
It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.
abstractbill | 8 hours ago
Should be the pelican bounced off.
romaniv | 7 hours ago
A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.
100% marketing, 0% science.
[1] https://arxiv.org/pdf/2303.12712
godelski | 6 hours ago
[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...
[1] I'm sure there is because of Simon's fame
Shocka1 | 6 hours ago
I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where from.
qiine | 6 hours ago
humm
subarctic | 6 hours ago
max_unbearable | 4 hours ago
The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
3l3ktr4 | 4 hours ago
exabrial | 22 minutes ago
Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.