Skills in CC have been a bit frustrating for me. They don't trigger reliably and the emphasis on "it's just markdown" makes it harder to have them reliably call certain tools with the correct arguments.
The idea that agent harnesses should primarily have their functionality dictated by plaintext commands feels like a copout around programming in some actually useful, semi-opinionated functionality (not to mention that it makes capability-discoverability basically impossible). For example, Claude Code has three modes: plan, ask about edits, and auto-accept edits. I always start with a plan and then I end up with multiple tasks. I'd like to auto-accept edits for one step at a time, and the only way to do that is to ask CC to stop after each step, but that's not reliable; sometimes it just continues into the next step. If this were programmed explicitly into CC rather than relying on agent obedience, we could ditch the nondeterminism and just have a hook on task completion that toggles auto-accept back to "off."
I think unless you're doing simple tasks, skills are unreliable. For better reliability, I have the agent trigger APIs that handle the complex logic (and their own LLM calls) internally. Has anyone found a solid strategy for making complex 'skills' more dependable?
My only strategy is what used to be called slash commands but are also skills now, i.e. I call them explicitly. I think that actually works quite well, and you can allow specific tools and tell it to use specific hooks for security or validation in the frontmatter properties.
I haven't done a lot with skills yet, but maybe try and leverage hooks to enforce skill usage, and move most of the skill's logic and complexity into a script so the agent only needs to reason about how to call the script.
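To make the "move the logic into a script" idea concrete, here is a minimal sketch. Everything in it is hypothetical and only for illustration (the `file_ticket` name, the `--title` and `--priority` flags, the JSON result shape). The point is that validation happens in code, so the skill's markdown only has to explain how to call the script.

```python
import argparse

ALLOWED_PRIORITIES = ("low", "medium", "high")  # hypothetical domain rule


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="file_ticket",
        description="Hypothetical example: file a ticket with validated arguments.",
    )
    parser.add_argument("--title", required=True, help="Short summary of the issue")
    # argparse rejects any value outside `choices` and exits with a usage message.
    parser.add_argument("--priority", choices=ALLOWED_PRIORITIES, default="medium")
    return parser


def run(argv: list[str]) -> dict:
    """Parse arguments and return a structured result the agent can read."""
    args = build_parser().parse_args(argv)
    title = args.title.strip()
    if not title:
        # A structured error gives the agent something specific to react to.
        return {"ok": False, "error": "title must not be empty"}
    return {"ok": True, "ticket": {"title": title, "priority": args.priority}}
```

Wiring `run(sys.argv[1:])` to an entry point and printing the result as JSON would give the agent one command to learn instead of a page of prose to interpret.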
I think I'll wait until they are more reliable. For now, I use skills, but they just specify which endpoint to call. It should also be safer: the endpoint runs on a different VPS, and the agent has no access to any credentials other than the bearer token.
In my experience, all text “instruction” to the agent should be taken on a prayer. If you write compact agent guidance that is not contradictory and is local and useful to your project, the agent will follow it most of the time. There is nothing that you can write that will force the agent to follow it all of the time.
If one can accept failure to follow instructions, then the world is open. That condition does not really comport with how we think about machines. Nevertheless, it is the case.
Right now, a productive split is to place things that you need to happen into tooling and harnessing, and place things that would be nice for the agent to conceptualize into skills.
> sometimes it just continues to go into the next step
Use a structured workflow that loops on every task and includes a pause for user confirmation at the end. Enforce it with a hook. I'm not sure if you can toggle auto-accept this way, but I think the end result is what you're asking for.
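A rough sketch of what "loop on every task, pause for confirmation" could look like if the gate lived in code rather than in the prompt. This is not how Claude Code implements anything; `run_plan`, `execute`, and `confirm` are made-up names, and the callbacks stand in for whatever runs a step and asks the user.

```python
from typing import Callable, Iterable


def run_plan(
    tasks: Iterable[str],
    execute: Callable[[str], str],
    confirm: Callable[[str, str], bool],
) -> list[str]:
    """Run tasks one at a time and stop as soon as the user declines to continue."""
    completed = []
    for task in tasks:
        result = execute(task)
        completed.append(task)
        # The pause is enforced here, so the agent cannot skip it.
        if not confirm(task, result):
            break
    return completed
```

Because the loop owns the control flow, "sometimes it just continues" is no longer possible: the next step only starts if `confirm` returns true.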
I use this with great success, sometimes toggling auto-accept on when confidence is high that Claude can complete a step without guidance, and toggling it off when confidence is low and I want to slow down and steer, with Claude stopping between the steps. Now that prompt suggestions are a thing, you can just hit enter to accept the suggested prompt and continue.
> idea that agent harnesses should primarily have their functionality dictated by plaintext commands feels like a copout
I think it's more along the lines of acknowledging the fast-paced changes in the field, and refusing to cast into code something that's likely to rapidly evolve in the near future.
Once things settle down into tested practices, we'll see more "permanent" instrumentation arise.
"Code is cheap" has two interpretations here: one, that it's no longer seen as an artisanally crafted fine product, now it's "manufactured". Two, though, is that it's cheaper in ops -- once the criteria are fully discovered, once there are no more new paths for the agents to roam, things that have been cast into code consume minimal resources (on the AI scale of things), they're doggedly deterministic, and are free of heavy dependencies.
So yeah, I believe "it's a phase" but in a sense that it's a development phase, just like planning or prototyping.
Are you using either CLAUDE.md or .claude/INSTRUCTIONS.md to direct Claude about the different agents?
Also, be aware that when you add new instructions, if you don't tell Claude to reread these files, it will NOT have them in its context window until you tell it to read them OR you start a new CC session. This was a bit frustrating for me because it was not immediately obvious.
The saving grace of Claude Code skills is that when writing them yourself, you can give them frontmatter like "use when mentioning X" that makes them become relevant for very specific "shibboleths" - which you can then use when prompting.
Are we at an ideal balance where Claude Code is pulling things in proactively enough... without bringing in irrelevant skills just because the "vibes" might match in frontmatter? Arguably not. But it's still a powerful system.
For manual prompting, I use a "macro"-like system where I can just add `[@mymacro]` in the prompt itself and Claude will know to `./lookup.sh mymacro` to load its definition. Can easily chain multiple together. `[@code-review:3][@pycode]` -> 3x parallel code review, initialize subagents with python-code-guide.md or something. ...Also wrote a parser so it gets reminded by additionalContext in hooks.
Interestingly, I've seen Claude do `./lookup.sh relevant-macro` without any prompting by me. Probably due to it being mentioned in the compaction summary.
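For anyone curious what parsing that macro syntax might involve, here is a small sketch. The grammar is inferred only from the two examples above (`[@name]` and `[@name:count]`), so treat the regex and the `parse_macros` name as assumptions, not the commenter's actual parser.

```python
import re

# Matches [@name] or [@name:count]; grammar inferred from the examples, not a spec.
MACRO_RE = re.compile(r"\[@([A-Za-z0-9_-]+)(?::(\d+))?\]")


def parse_macros(prompt: str) -> list[tuple[str, int]]:
    """Return (macro_name, repeat_count) pairs in order of appearance."""
    return [
        (match.group(1), int(match.group(2) or 1))  # count defaults to 1
        for match in MACRO_RE.finditer(prompt)
    ]
```

A hook could run something like this over the incoming prompt and pass the expanded definitions back as additional context, which is roughly what the comment describes.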
I view them as more idiosyncratic docs, but focused on how to write code (there is so much huggingface code floating around the internet, the models do quite well with it already).
I have not had much success with skills that have tree based logic (if a do x, else do y), they just tend to do everything in the skill (so will do both x and y).
But just as "hey follow this outline of steps a,b,c" it works quite well in my experience.
So far my experience with skills is that they slow down or confuse agents unless you as the user understand what the skill actually contains and how it works. In general I would rather install a CLI tool and explain to the agent how I want it used vs. trying to get the agent to use a folder of instructions that I don't really understand what's inside.
Most LLM "harnessing" seems very lazy and bolted on. You can build much more robustly by leveraging a more complex application layer where you can manage state, but I guess people struggle to build that.
Common failure mode I've observed is people building a stateful harness for the LLM and then forgetting to tell the LLM about it. Leads to funny/disturbing results whenever the two "desync" in some way.
Example: a plan/act division, with the harness keeping state of which mode is active, and while in "plan mode", removing/disabling tools that can write data. Cue a mishandled timeout or a UI bug that prevents switching to "act mode", and suddenly the agent is spinning for 10 minutes questioning the nature of its reality, as the basic tools it needs to write code inexplicably ceased to exist, then opting for empirical experimentation and eventually figuring out a way to reimplement "search/replace" using shell calls or Python or whatever alternative wasn't properly sandboxed by the harness writers...
Part of this is just bugs in code, but what irks me is watching the LLM getting gaslighted or plain confused by rules of reality changing underneath it, all because the harness state wasn't made observable to the agent, or someone couldn't be arsed to have their error messages and security policies provide feedback to the LLM and not just the user.
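A minimal sketch of the alternative, where the harness state is visible to the agent: instead of silently removing write tools in plan mode, the harness keeps them callable and returns an error that names the mode and says what to do. The `Harness` class and the tool names are invented for illustration.

```python
from dataclasses import dataclass, field

WRITE_TOOLS = {"edit_file", "write_file"}  # hypothetical tool names


@dataclass
class Harness:
    mode: str = "plan"  # either "plan" or "act"
    log: list = field(default_factory=list)

    def call_tool(self, name: str, **kwargs) -> dict:
        """Dispatch a tool call, explaining any refusal to the agent."""
        if self.mode == "plan" and name in WRITE_TOOLS:
            return {
                "ok": False,
                "mode": self.mode,
                "error": (
                    f"Tool '{name}' is disabled because the harness is in plan mode. "
                    "Ask the user to switch to act mode; do not look for workarounds."
                ),
            }
        self.log.append((name, kwargs))  # stand-in for actually running the tool
        return {"ok": True, "mode": self.mode}
```

Every response carries the current mode, so if the harness and the agent ever disagree about it, the agent at least finds out on the next tool call.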
> So far my experience with skills is that they slow down or confuse agents unless you as the user understand what the skill actually contains and how it works. In general I would rather install a CLI tool and explain to the agent how I want it used vs. trying to get the agent to use a folder of instructions that I don't really understand what's inside.
For Claude Code I add the tooling into either CLAUDE.md or .claude/INSTRUCTIONS.md which Claude reads when you start a new instance. If you update it, you MUST ask Claude to reread the file so it knows the full instructions.
Skills feel analogous to behavioral programs. If you give an agent access to a programmable substrate (e.g. bash + CLI tools), you write these Markdown programs which are triggered and read when the agent thinks certain behaviors will be beneficial.
It's a great idea: really neat take on programmability, and can be reloaded while the agent is running without tweaking the harness, etc -- lots of benefits.
`pi` has a great skills implementation too.
I think skills might really shine if you take a minimal approach to the system prompt (like `pi`) -- a lot of the time, if I want to orchestrate the agent in some complex behavior, I want to start fresh, and having it walk through a bunch of skills ... possibly the smaller the system prompt, the more likely the agent is to follow the skills without issue.
Yes -- skills live in a special gap between "should have been a deterministic program" and "model already had the ability to figure this out". My personal experience leaves me in agreement that minimal system prompts are definitely the way to go.
I’ve had a great experience with CLI-related skills at work. We have written CLIs for systems like Jira, along with skills that document the CLIs and describe the organisation of Jira at our company. Claude Code loads these reliably whenever you mention Jira or an issue number.
Alternatively, I’ve had less luck with purely documentation skills. They seem to be loaded less reliably when they’re not linked to actions the agent wants to take, and it is frustrating to watch the agent try to figure something out when the docs are one skill load away.
Documentation-based skills don’t really work in practice. They tend to waste tokens instead of adding value.
CLI skills are also redundant when the CLI already provides clear built-in help messages. Those help messages are usually up to date, unlike separate skills that need to be maintained independently.
If the CLI itself is confusing (and would likely be confusing for humans as well) then targeted skills can serve as a temporary workaround, a kind of band-aid.
Where skills truly shine is when agents need to understand non-generic terms and concepts: unique product names, brand-specific terminology, custom function names, and other domain-specific language.
I strongly disagree about CLI help being a good enough solution. Skills with CLIs backing them is the gold standard right now for a reason.
1. Skills let the agent know the CLI is available because they get an entry in the context window.
2. They let you provide a ton of organisational knowledge and processes that the agent would have a hard time figuring out from the CLI alone.
3. It is just more efficient to provide quick information in a skill than it is to require an agent to figure out every detail from CLI help messages alone every single time.
Skills are only loaded when you need them, so you’ll probably use fewer tokens overall compared to MCP servers or including them manually in your main AGENTS.md/CLAUDE.md file, which are always loaded in the system prompt.
The tension between discoverability and flexibility is real. I wonder if there's room for a hybrid approach - structured skill metadata (think OpenAPI-style specs for inputs/outputs) that can be compiled down to markdown context when needed. This would let agents validate tool calls before making them, while still keeping the LLM-friendly text format for reasoning about when to use them.
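Here is one way that hybrid could be sketched: a small structured spec that is rendered to markdown for the model and also used to validate a call before it is made. The spec format, the `lookup-issue` skill, and both function names are assumptions made up for this example; it is far simpler than an OpenAPI document.

```python
# Hypothetical skill spec; the field names are invented for this sketch.
SPEC = {
    "name": "lookup-issue",
    "description": "Fetch an issue by key.",
    "inputs": {
        "issue_key": {"type": "string", "required": True},
        "include_comments": {"type": "boolean", "required": False},
    },
}

PY_TYPES = {"string": str, "boolean": bool}


def to_markdown(spec: dict) -> str:
    """Render the spec as the plain markdown the model reads."""
    lines = [f"# {spec['name']}", "", spec["description"], "", "Inputs:"]
    for name, rule in spec["inputs"].items():
        flag = "required" if rule["required"] else "optional"
        lines.append(f"- `{name}` ({rule['type']}, {flag})")
    return "\n".join(lines)


def validate_call(spec: dict, args: dict) -> list[str]:
    """Return a list of problems with a proposed call; empty means valid."""
    errors = []
    inputs = spec["inputs"]
    for name, rule in inputs.items():
        if name not in args:
            if rule["required"]:
                errors.append(f"missing required input: {name}")
            continue
        if not isinstance(args[name], PY_TYPES[rule["type"]]):
            errors.append(f"{name} must be of type {rule['type']}")
    for name in args:
        if name not in inputs:
            errors.append(f"unknown input: {name}")
    return errors
```

The same source of truth feeds both sides: the model gets readable text for deciding when to use the skill, and the harness gets something it can check deterministically.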
At what point does it become computationally cheaper to just generate random elf binaries, test them against constraints, and iterate until they work as specified?
See 'genetic programming' for techniques that are sort of based on this idea. Typical approach is to have a problem representation (gene analogues) that can be used to create a population of different individual solutions. Test them all against a fitness function and retain those that are 'best' according to some metric. Then create (breed) some new individuals who have some of the characteristics of the winners, perhaps mutated somewhat, insert these into the population. Repeat until you have solved the problem or have a good enough solution.
Challenges (apart from the time taken) are coming up with a good enough gene representation that captures the essence of the problem, building an efficient fitness function, and avoiding local maxima, i.e. a solution that is almost but not quite good enough, but from where you can't breed a better solution.
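The loop described above can be sketched in a few lines. This version evolves a fixed-length bit string, which is strictly a genetic algorithm rather than genetic programming over program trees, and the parameter values are arbitrary choices for illustration.

```python
import random


def evolve(fitness, genome_length=20, population_size=30, generations=60,
           mutation_rate=0.05, seed=0):
    """Evolve a list of 0/1 genes toward higher values of `fitness`."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

Calling `evolve(sum)` searches for the string with the most ones. That toy fitness function has no local maxima, so it sidesteps the hard part mentioned above; a realistic problem would need a much more careful representation and fitness function.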
I'm actually on the fence with skills. Vercel shared a study where they claimed skills actually performed worse [0] than just injecting the content into the context directly via agents.md. Similarly, there was a paper recently that suggested the same [1].
Of course, the classic response to these - even WITH the evidence is often "yOu'Re dOiNg iT wRonG". Does anyone actually have proof - where using skill.md is arguably better than not?
Edit: Fixed company name, added link to Vercel's claim
I think the paper is saying specifically that it's redundant to include information about your coding repository when that information is otherwise available to the agent in higher fidelity forms (e.g. package.json). This makes sense - but not sure it's about Skills directly.
For the former I'd be interested in learning more about that. From a harness perspective the difference would be the inclusion of the description in the system prompt, and an additional tool call to return the skill. While that's certainly less efficient than adding the context directly I'd be surprised if it degraded task performance significantly.
I tend to be quite focussed with my Skill/Tool usage in general though, inviting them in to context when needed rather than increasing the potential for model confusion.
Sorry, I misquoted the company, it was Vercel, not Cursor.
"A compressed 8KB docs index embedded directly in AGENTS.md achieved a 100% pass rate, while skills maxed out at 79% even with explicit instructions telling the agent to use them. Without those instructions, skills performed no better than having no documentation at all."
Gotcha - yeah, it removes the tool calling step so their content is always in context (noting they took action to try and reduce the size of that). The framing seems a little simplistic -- thanks for the link.
daturkel | a day ago
PantaloonFlames | a day ago
Frannky | a day ago
plufz | a day ago
chickensong | a day ago
Frannky | a day ago
chickensong | a day ago
Frannky | 21 hours ago
selridge | a day ago
Frannky | 22 hours ago
Rebelgecko | 20 hours ago
triage8004 | 18 hours ago
DarmokJalad1701 | a day ago
chickensong | a day ago
btbuildem | a day ago
daturkel | a day ago
btbuildem | 8 hours ago
siquick | a day ago
Referencing them in AGENTS/CLAUDE.md has increased their usage for me.
giancarlostoro | a day ago
btown | a day ago
winwang | 16 hours ago
conception | a day ago
ctoth | 21 hours ago
apwheele | 8 hours ago
RyanShook | a day ago
airstrike | a day ago
TeMPOraL | 21 hours ago
giancarlostoro | a day ago
selridge | a day ago
Putting that in a `.md` file just means you don’t need to do it twice.
mccoyb | a day ago
evalstate | a day ago
sothatsit | a day ago
jedisct1 | 13 hours ago
sothatsit | 12 hours ago
firemelt | a day ago
neurostimulant | 22 hours ago
rukuu001 | 23 hours ago
Ross00781 | 15 hours ago
bandrami | 12 hours ago
KineticLensman | 10 hours ago
neya | 11 hours ago
[0] https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
[1] https://arxiv.org/abs/2602.11988
evalstate | 10 hours ago
neya | 9 hours ago
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
evalstate | 9 hours ago