What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?
It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [json blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.
Big frontier models need this less than small models.
In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.
I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).
The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.
Maybe similar to Instructor [1] which was a cool tool for json and structured output enforcement combining pydandic with ai retry loops very handy for when models don't have that covered
Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS
I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?
Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.
Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.
Concrete example:
Task: get, analyze and report on Q3 sales data.
Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.
We send this to the model:
tool_result: [PrereqError] analyze_sales requires fetch_sales_data
to be called first. Available next steps: fetch_sales_data
Model emits a corrected fetch_sales_data(...) on the next turn.
Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.
We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.
And lastly bare text response nudges. Small models love to chat, we need them to call tools!
Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?
Very cool! I would look at the tokens returned by each of the calls. You can map those to API costs per input/output tokens. Forge should be capturing those (or can, as passthrough from llama.cpp).
At least, if I understand your economic benefit angle correctly.
For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".
Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).
Neat! Historically I've been most active on LinkedIn but the AI community seems very X-leaning so I'll make sure to pay closer attention there. Good luck with the harness, happy to connect!
Thank you! It was a really fun rabbit hole to fall into and I found a bunch of counterintuitive stuff.
I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior -focused fine tuning. But the harness matters almost more than the model in many cases.
I guess it's that, and yes, much as they learned speech patterns from us, now we start to learn from them.
I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...
AI usage is great. The problem is the asymmetry in effort between generating text automatically, and then further amplifying this via posting it, while then expecting human eyeballs to spend the time reading it. It is antisocial.
If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Brian Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576
Hi! Yes, I definitely think so. I've seen variance across all model families I looked at. The magnitude changes, but the presence of variance is a constant.
Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !
I only just skimmed it, but will try to dive deeper in a bit.
I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.
I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.
Once I dig into it more I'll try to highlight other deltas.
Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.
One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.
[OP] zambelli | 2 hours ago
fabian_shipamax | 2 hours ago
[OP] zambelli | 2 hours ago
schaefer | an hour ago
schaefer | 51 minutes ago
I run small models at home, so I'm very curious.
[OP] zambelli | 21 minutes ago
Out of curiosity, what models are you running?
tommica | 2 hours ago
[OP] zambelli | 2 hours ago
Big frontier models need this less than small models.
k__ | 2 hours ago
[OP] zambelli | 2 hours ago
k__ | 2 hours ago
[OP] zambelli | an hour ago
The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.
jimmySixDOF | 41 minutes ago
[1] https://github.com/567-labs/instructor
[OP] zambelli | 19 minutes ago
jf | an hour ago
[OP] zambelli | an hour ago
xiaod | an hour ago
[OP] zambelli | an hour ago
Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.
Concrete example: Task: get, analyze and report on Q3 sales data.
Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.
We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data
Model emits a corrected fetch_sales_data(...) on the next turn.
Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.
We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.
And lastly bare text response nudges. Small models love to chat, we need them to call tools!
dpweb | an hour ago
[OP] zambelli | an hour ago
At least, if I understand your economic benefit angle correctly.
For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".
mholubowski | an hour ago
[OP] zambelli | an hour ago
Aleesha_hacker | an hour ago
[OP] zambelli | an hour ago
I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior -focused fine tuning. But the harness matters almost more than the model in many cases.
snovv_crash | an hour ago
[OP] zambelli | an hour ago
Maybe I've been spending too much time reading the evals and I now sound like an LLM...
Either way, here I am - happy to answer any questions!
snovv_crash | 31 minutes ago
I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...
Best of luck with Forge!
throwaway20222 | 56 minutes ago
snovv_crash | 37 minutes ago
If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Brian Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576
rebekkamikkoa | 35 minutes ago
Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?
[OP] zambelli | 20 minutes ago
6r17 | 24 minutes ago
lucrbvi | 19 minutes ago
[0]: https://github.com/dottxt-ai/outlines
[OP] zambelli | 10 minutes ago
I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.
I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.
Once I dig into it more I'll try to highlight other deltas.
nzeid | 11 minutes ago
> python -m forge.proxy --backend-url http://localhost:8080 --port 8081
This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.
In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)
azurewraith | 9 minutes ago
One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.
Brief: https://statewright.ai/research