Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

110 points by zambelli 9 hours ago on hackernews | 36 comments

[OP] zambelli | 2 hours ago

Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.

fabian_shipamax | 2 hours ago

dashboard link is dead

[OP] zambelli | 2 hours ago

schaefer | an hour ago

yes, that link works for me.

schaefer | 51 minutes ago

super interesting work. It will take me a few days to dig in and really understand it. But I'm looking forward to it.

I run small models at home, so I'm very curious.

[OP] zambelli | 21 minutes ago

That's awesome! Let me know if quick start is causing issues or anything else you'd like to dig into.

Out of curiosity, what models are you running?

tommica | 2 hours ago

What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?

[OP] zambelli | 2 hours ago

It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [json blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.

Big frontier models need this less than small models.

So, this basically ensures that models call the right tools with the correct format?

[OP] zambelli | 2 hours ago

In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.
Ah, so it a "smart" retry mechanism?

[OP] zambelli | an hour ago

I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).

The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.

jimmySixDOF | 41 minutes ago

Maybe similar to Instructor [1] which was a cool tool for json and structured output enforcement combining pydandic with ai retry loops very handy for when models don't have that covered

[1] https://github.com/567-labs/instructor

[OP] zambelli | 19 minutes ago

Interesting! I'll look into that. Would mean another dep/integration but might be more robust.
Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS

[OP] zambelli | an hour ago

Very tangential! I'll try but it might take me a while.

xiaod | an hour ago

I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?

[OP] zambelli | an hour ago

Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.

We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

dpweb | an hour ago

Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?

[OP] zambelli | an hour ago

Very cool! I would look at the tokens returned by each of the calls. You can map those to API costs per input/output tokens. Forge should be capturing those (or can, as passthrough from llama.cpp).

At least, if I understand your economic benefit angle correctly.

For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".

mholubowski | an hour ago

Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).

[OP] zambelli | an hour ago

Neat! Historically I've been most active on LinkedIn but the AI community seems very X-leaning so I'll make sure to pay closer attention there. Good luck with the harness, happy to connect!

Aleesha_hacker | an hour ago

Impressive work, love seeing tools that boost local LLM reliability without touching the model itself

[OP] zambelli | an hour ago

Thank you! It was a really fun rabbit hole to fall into and I found a bunch of counterintuitive stuff.

I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior -focused fine tuning. But the harness matters almost more than the model in many cases.

snovv_crash | an hour ago

I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?

[OP] zambelli | an hour ago

I definitely use LLMs to help write things - but this is my draft!

Maybe I've been spending too much time reading the evals and I now sound like an LLM...

Either way, here I am - happy to answer any questions!

snovv_crash | 31 minutes ago

I guess it's that, and yes, much as they learned speech patterns from us, now we start to learn from them.

I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...

Best of luck with Forge!

throwaway20222 | 56 minutes ago

If you are so outright against using AI, why would he care if you read his article about AI?

snovv_crash | 37 minutes ago

AI usage is great. The problem is the asymmetry in effort between generating text automatically, and then further amplifying this via posting it, while then expecting human eyeballs to spend the time reading it. It is antisocial.

If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Brian Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576

rebekkamikkoa | 35 minutes ago

Hi Antoine!

Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?

[OP] zambelli | 20 minutes ago

Hi! Yes, I definitely think so. I've seen variance across all model families I looked at. The magnitude changes, but the presence of variance is a constant.
Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !

lucrbvi | 19 minutes ago

How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?

[0]: https://github.com/dottxt-ai/outlines

[OP] zambelli | 10 minutes ago

I only just skimmed it, but will try to dive deeper in a bit.

I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.

I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.

Once I dig into it more I'll try to highlight other deltas.

> # External mode — you manage llama-server, forge proxies it

> python -m forge.proxy --backend-url http://localhost:8080 --port 8081

This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.

In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)

azurewraith | 9 minutes ago

Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.

One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.

Brief: https://statewright.ai/research