apwheele | 19 hours ago
Skill spam?
adityamwagh | 19 hours ago
elashri | 19 hours ago
AndyNemmity | 19 hours ago
I find the only way to do that is to look at it, see if it passes some visual tests, try it, and then A/B test whether it's any better than without it.
apwheele | 18 hours ago
Even this repo's "b" showcase, which shows the outputs as-is (with no clear documentation of how they were generated -- is it headless in a CI pipeline somewhere?), is not good: https://github.com/Imbad0202/academic-research-skills/tree/m....
AndyNemmity | 18 hours ago
I run a lot of A/B testing. But I'm not sure showing it actually communicates all that much. Since these are non-deterministic systems, even showing you an A/B test from when I made the decision a month ago doesn't really mean a whole lot.

I agree we need clearer indications of value; I just don't understand how to do that legitimately, in a fair and honest way.
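The closest I've come is treating it as a sampling problem: run both variants several times on the same input and compare score distributions, not single runs. A toy sketch of what I mean, assuming Claude Code's headless `claude -p` mode and a deliberately crude rubric (both are stand-ins for your own setup):

    import statistics
    import subprocess

    PROMPT = "Write a structured literature review outline on <my topic>."
    SECTIONS = ["## Background", "## Methods", "## Open questions"]  # toy rubric

    def run_once(workdir: str) -> str:
        # Assumption: `claude -p` prints one headless completion; point
        # workdir at a checkout with or without the skill installed.
        res = subprocess.run(["claude", "-p", PROMPT], cwd=workdir,
                             capture_output=True, text=True)
        return res.stdout

    def score(output: str) -> float:
        # Fraction of required sections present in the output.
        return sum(s in output for s in SECTIONS) / len(SECTIONS)

    def ab_test(with_dir: str, without_dir: str, n: int = 10) -> None:
        a = [score(run_once(with_dir)) for _ in range(n)]
        b = [score(run_once(without_dir)) for _ in range(n)]
        print(f"with skill:    mean={statistics.mean(a):.2f} stdev={statistics.stdev(a):.2f}")
        print(f"without skill: mean={statistics.mean(b):.2f} stdev={statistics.stdev(b):.2f}")

Even then it's only valid for that model on that day, which is part of the problem.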
theptip | 18 hours ago
Some sort of eval, e.g. TermBench, implemented in Harbor.

It's an insane amount of effort to build shareable, reusable, comprehensive evals, which is why almost all skills are stuck in the "vibes" phase.

That said, I think it's quite easy to skim/intuit these sorts of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly you can construct a cheap personal eval that is a lot easier to maintain, and use it to compare a new skill/plugin. Something like "please write a paper on <my personal unpublished thesis>" is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple of times and watching the failure modes.
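For concreteness, a cheap personal eval can be as small as this; a sketch, where `run_skill` is a hypothetical stand-in for however you invoke the model with the skill under test:

    from pathlib import Path

    # A few prompts you know deeply, so you can judge outputs at a glance.
    PROMPTS = [
        "Please write a paper on <my personal unpublished thesis>",
        "Summarize the open problems in <my subfield>",
    ]

    def collect(run_skill, label: str, runs: int = 3) -> None:
        # Dump every output to disk; the "eval" is you skimming the
        # with/without directories side by side for failure modes.
        outdir = Path("eval_runs") / label
        outdir.mkdir(parents=True, exist_ok=True)
        for i, prompt in enumerate(PROMPTS):
            for r in range(runs):
                (outdir / f"p{i}_run{r}.md").write_text(run_skill(prompt))

Maintaining that is trivial compared to a real benchmark, and it's enough to catch a skill that is worse than vanilla.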
AndyNemmity | 18 hours ago
Yeah, honestly I think we're in a phase where you shouldn't use anyone else's skills. Instead, point your own setup at a repo with skills, have it really read them, and then ask what is worth rewriting in your own style, based on your preferences.

I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.

So I tell people: please don't use it. Just point your Claude Code at it and see if there's anything useful for you.
bisonbear | 8 hours ago

Agree, it's impossible to tell if someone else's workflow works with your codebase without actually trying it, which takes time/tokens. I've been thinking about how to make running quick, directional evals easier and more efficient, to give more confidence in using and developing skills. Basically: how do we go from vibes to data?
mmooss | 18 hours ago
apwheele | 15 hours ago
It actually does not -- and that is part of the issue. Consumers just see "oh gosh, this looks very detailed" and superficially assume someone must have spent quite a bit of time on this and that it works well.

Skills are just prompts -- and most of what I am seeing is people using AI to write the (quite verbose) prompts. There should be a test, somewhere, that shows "my prompt does better than XYZ other prompt" for some model and some specific inputs. This is what is called a benchmark.

It may work well, I don't know. Just asking Claude "hey, help me iterate on a paper" works pretty well out of the box too. Call me skeptical that this works in any substantive way until I see evidence that it does.

I agree writing a good benchmark takes time. How do people know whether all these prompts they are writing are any good, though? You could make an edit that causes a regression overall. Or add too much info that is just wasted space in the context window, or that causes the model to loop between the different skills, or plenty of other errors.
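Even a tiny one beats nothing. A sketch of the minimum I'd want to see, with `complete` as a hypothetical function that sends a system prompt plus an input to some fixed model (the cases and checks here are made up):

    # Fixed inputs plus checks = a crude benchmark that catches regressions
    # when a prompt gets edited.
    CASES = [
        {"input": "Summarize this abstract: ...", "must_contain": ["method", "result"]},
        {"input": "List limitations of this study design: ...", "must_contain": ["sample"]},
    ]

    def passes(output: str, case: dict) -> bool:
        return all(term.lower() in output.lower() for term in case["must_contain"])

    def pass_rate(complete, system_prompt: str) -> float:
        hits = sum(passes(complete(system_prompt, c["input"]), c) for c in CASES)
        return hits / len(CASES)

    # Gate edits on it: the new prompt must not score below the old one.
    # assert pass_rate(complete, NEW_PROMPT) >= pass_rate(complete, OLD_PROMPT)

Crude string checks, sure, but it is evidence of the kind nobody is publishing with these skill repos.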
AndyNemmity | 15 hours ago
I do not believe giving you that information would be honest. If I did, I would be pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

It is not honest for me to give you confidence in it when no one can be confident in it.
mmooss | 12 hours ago
I read it, right there in the OP: tests and test results, including discussion of flaws in earlier designs and how they are improved here. What are you talking about?
sumeno | 15 hours ago

Spam. In 2026 it takes a minute, and no education, to create any app, any skill, anything that looks plausible, where five years ago it would have taken a highly educated and skilled person at least months. Now it takes the highly skilled individual ten times as long to evaluate the vibe-slopped spam as it took the author to publish it.
whattheheckheck | 17 hours ago
siva7 | 17 hours ago
Daviey | 14 hours ago
evanwolf | 18 hours ago
SubiculumCode | 18 hours ago
mmooss | 18 hours ago
> Frame-lock: I asked the AI to run a devil's advocate debate against its own thesis. It did — four rounds, each more refined than the last. But every round stayed inside the frame I'd set. The DA attacked arguments, never premises. It never asked "are we even discussing the right question?" This is the same pattern that caused the 31% citation error rate in v2.7's stress test: the verifying AI and the generating AI share the same cognitive frame.

> Sycophancy under pushback: Every time I challenged the DA's attacks, it conceded too quickly. It retracted findings faster than it launched them. The model's training rewards conversational harmony — so "the user pushed back" was treated as evidence that the attack was wrong, when often it just meant the user was persistent.
Why do LLMs output so much sycophancy and other modes of conning (as in confidence games) humans - outputting confident text, highly agreeable tone, going along with whatever the user wants, etc.? It's manipulative output.
We see it everywhere and know it well - it's even sort of a running joke - but we're not challenging that assumption: Why that output? It seems like a design choice made by the LLM's developer: why would the process of constructing LLMs automatically create that sort of output? I'd say LLMs are in ~99th percentile of that sort of writing, which means it's not the typical writing they are trained on.
The only reason I know of to think it's not a design choice is that so many different LLMs do it; but very possibly they all saw the success of ChatGPT using that mode and followed it, and now that is what users expect. Maybe it's a way of manipulating users into trusting this new, possibly intimidating technology. Are there LLMs that don't output in that mode by default (i.e., without being prompted to do otherwise)?
cyanydeez | 18 hours ago

It's emergent from the training method and design: disagreement stops token generation, and there isn't multi-round training that rewards following through on reasonable disagreements.
janpeuker | 18 hours ago

While I agree most of this seems to go too far, I do like the idea of the Socratic mode with State-Challenge-Reflect reflection. I often use LLMs the same way, with a skeleton "brief" document and separate chapters that I ask it to fill based on my input -- basically augmented note-taking (references, coherence, in-scope vs. out-of-scope, arguments considered, pressure points, vulnerabilities, etc.).
m3kw9 | 17 hours ago

These things are not going to be reliable if you don't know when your session will be routed to an inferior model. I stopped using Opus because of that. I had to create a verification task first (a non-trivial problem) for the model to prove it was "Opus grade" before giving it the actual task, but even then I often found performance suddenly, severely degraded (the model suddenly dumb as a sack of potatoes). That tells me this is not ready for any serious work.
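The gating itself is simple enough; a sketch, where `ask` is a hypothetical wrapper around one model session and the canary is checked by executing the answer rather than trusting it:

    CANARY = ("Reply with ONLY a Python expression that evaluates to the "
              "sum of the squares of the first 10 positive integers.")

    def session_looks_sharp(ask) -> bool:
        # Execute the model's answer and check it against the known value
        # (385). eval of model output is fine for a throwaway canary like
        # this, not for anything untrusted.
        try:
            return eval(ask(CANARY).strip()) == 385
        except Exception:
            return False

    def run_guarded(ask, real_task: str) -> str:
        if not session_looks_sharp(ask):
            raise RuntimeError("canary failed; session may be degraded or rerouted")
        return ask(real_task)

The frustrating part is that passing the canary still doesn't guarantee the next request hits the same model.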
varispeed | 16 hours ago
mdxmaker | 16 hours ago