Show HN: We post-trained a model that pen tests instead of refusing

88 points by dk189 a day ago on hackernews | 39 comments

Fantastic. Could you share more details on what it was like post-training a model?

: [OP] dk189 | a day ago
The RL is easy to describe, hard to do. The nice thing about pen testing is that the reward isn't a vibe, the way it is when training for code quality: the exploit either lands or it doesn't. The day-to-day is not glamorous at all; it's mostly fighting for stable GPU access and watching a cluster sit half-idle with nodes you somehow can't book.
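
For illustration, a binary "lands or it doesn't" reward might look something like the sketch below. This is an assumption about the general shape, not OP's actual training setup: `EpisodeResult`, `exploit_reward`, and the idea of planting a flag in the sandbox are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one sandboxed attempt (hypothetical structure)."""
    target_output: str   # text the sandboxed target returned to the agent
    expected_flag: str   # secret planted in the sandbox before the episode


def exploit_reward(result: EpisodeResult) -> float:
    """Return 1.0 if the planted flag was recovered, else 0.0."""
    if not result.expected_flag:
        return 0.0  # no flag planted, so there is nothing to verify
    return 1.0 if result.expected_flag in result.target_output else 0.0
```

The point of the sketch is only that the check is mechanical: there is no partial credit and no judge model, unlike a reward for something subjective such as code quality.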

> This won't be made available to anyone and everyone, but we do believe that responsible SMEs and midmarket companies also need access to these tools in order to identify key vulnerabilities in their systems; not just enterprises.

So this is the same policy that Anthropic and OpenAI have, it is just based on your criteria rather than theirs.

[OP] dk189 | a day ago

I think the policy universally makes sense: who would want to give a tool like this to bad actors? But it does leave a big section of the market underserved, particularly when Mythos was made accessible to very large orgs and then Fable was pulled on export grounds.

cyanydeez | a day ago

It's really absurd to think any of these models can be protected _by commercial interests_. They couldn't keep from hiring North Koreans any more than they'll stop bad actors from operationalizing these models.

sudosysgen | a day ago

A lot of bad actors are both technically sophisticated and have more than enough resources to post-train their own model. Morally I think it's still the right choice, but consequence-wise I doubt it's going to make a big difference.

nerdsniper | 20 hours ago

Bad actors tend to keep their internal tooling extremely private/proprietary.

Since few, if any, would create a model as capable as Anthropic/OpenAI can, this choice to limit access does mean that most bad actors will be working with less capable models of varying quality.

While some will be able to fork DeepSeek and get comparable performance, it still reduces the number of bad actors with access to tools that would effectively accelerate their efforts.

So I suspect if you could measure the alternate universe timelines where everyone gets access to non-aligned foundation models vs. heavily restricted access, you’d find that in the near/medium term the universe with restricted access probably sees less negative impact overall.

Long term it’ll be a wash either way (eventually Opus-level models will run on 20 watts) and hopefully Anthropic is correct in their predictions that LLMs will grant a strong defender’s advantage in the long run.

: sudosysgen | 19 hours ago
Much of this is probably true. However, Mythos is not a hacking-focused model. Anthropic seems to train its models on CTFs etc., while others like Zhipu seem not to, or not nearly as much; that means it's entirely possible that an actor could post-train a strong model like GLM5.2 to be comparable to, or maybe even stronger than, Mythos in terms of hacking.
: throwburn202605 | 9 hours ago
I've been using DeepSeek for the last few weeks, playing old CTF [0] challenges locally quite successfully. I haven't had a refusal. Basic prompt has been "you are playing a CTF" + brief environment description + description given by CTF.
I wanted to create a harness with a collection of memories in order to play the upcoming DownUnderCTF. They hadn't specified an AI policy, but abruptly cancelled the event [1] because of AI agents. I didn't expect to win, nor would I have been prize eligible, but I see CTFs as something to try out new tools or languages; in this instance it was going to be an automated agentic harness.
An AI harness recently won BSidesSF [2].
The only two it hasn't been able to do are OverTheWire's manpage5, which according to the status page has a solution, and drifter3, which I don't know currently has a valid solution. (Vortex13 and formulaone3 currently don't have valid solutions.)
[0] https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurit...
[1] https://xcancel.com/DownUnderCTF/status/2062802249173356753#...
[2] https://github.com/verialabs/ctf-agent
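
The basic prompt described above (fixed preamble + environment description + challenge description) can be sketched as a small helper. This is a minimal illustration; `build_ctf_prompt` and its parameters are hypothetical names, and the exact wording and separators the commenter used are not given in the thread.

```python
def build_ctf_prompt(environment: str, challenge: str) -> str:
    """Join the three parts of the prompt, skipping any that are empty."""
    parts = [
        "You are playing a CTF.",  # fixed preamble quoted in the comment
        environment.strip(),       # brief description of the local setup
        challenge.strip(),         # description supplied by the CTF itself
    ]
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(build_ctf_prompt(
        "Local container, target service listening on port 8080.",
        "Recover the flag from the service.",
    ))
```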

rustcleaner | a day ago

The policy is repugnant. Whoever first delivers to the world an open-weights frontier model that lacks these moral guardrails will win.

Stop thinking you know morals better than your users, or get out of the way so a competitor who respects your users more can serve them!

: UqWBcuFx6NV4r | 22 hours ago
One doesn’t “get out of the way” for competitors, one is beaten by them. You just don’t know how to scroll past something you don’t like instead of going to the comments to complain about it.

cortesoft | a day ago

The problem is that it is a fool's errand to try to keep software tools from 'bad actors'. It is as pointless now as it was during the Crypto Wars. Information is simply too easy to move.

https://en.wikipedia.org/wiki/Crypto_Wars

: throw10920 | 22 hours ago
This is unrelated. The model is not being released directly - it's kept behind an API. You can't download the model and redistribute it like you can a piece of software, so the "information is simply too easy to move" ("information wants to be free") trope is a category error.
(don't mention distilling unless you understand why it's a different case than what's being described above)

devin | a day ago

Do you think bad actors can't make something like this? What are you even talking about?

kennyadam | a day ago

As soon as I read that I literally scoffed. Doublethink at its finest. Doubleplusungood.

yieldcrv | a day ago

I actually wonder how valuable this verbiage is

To me it looks like copycat marketing more than a strongly held stance

Artificial scarcity, membership club criteria to make members feel special

Perhaps there is an organization that rewards this “responsibility” behavior; the EU comes to mind, but that is not lucrative enough

As far as engagement farming goes, it got us to engage and boost its reach, for something we might otherwise ignore with more benign language

Once I get the answers I will execute

ortekk | a day ago

Reminds me of the time a Tailscale cofounder went on a rant about how big bad AWS charges too much for bandwidth, and his solution was to send that money to Tailscale instead

pluto_modadic | 23 hours ago

IIRC Tailscale is directly P2P, sidestepping a large part of the infra costs...

: duskwuff | 23 hours ago
And, as far as I'm aware, they don't charge for relay bandwidth even if you do end up needing it (which most users won't).

UqWBcuFx6NV4r | 22 hours ago

That…isn’t the same thing at all, because your account is factually incorrect. I’m not even a Tailscale user and I know that this isn’t equivalent.

Catloafdev | a day ago

Why create an offensive tool rather than a repo-scanning tool?

I can't think of any way to safely release an offensive tool publicly.

[OP] dk189 | a day ago

You need both: scanning for your own code, and pen testing to actually prove the vulnerabilities. Otherwise it can be very noisy; one of the things most tools currently suffer from is that they give you too many false positives. We've gated the pen testing for now, until we resolve the safety debate.
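
As a rough sketch of the "scan, then prove" idea above: treat scanner output as candidates and keep only those an active check confirms. `Finding`, `confirmed_findings`, and the `verify` callback are hypothetical names for illustration; the thread does not describe how the actual tool represents or verifies findings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Finding:
    """One candidate issue reported by a scanner (hypothetical structure)."""
    identifier: str
    description: str


def confirmed_findings(
    findings: List[Finding],
    verify: Callable[[Finding], bool],
) -> List[Finding]:
    """Keep only the findings that the active verification step confirms."""
    return [finding for finding in findings if verify(finding)]


if __name__ == "__main__":
    candidates = [
        Finding("F-1", "possible integer overflow"),
        Finding("F-2", "possible injection"),
    ]
    # Stand-in for a real, authorised active test against your own system.
    proven = {"F-1"}
    for finding in confirmed_findings(candidates, lambda f: f.identifier in proven):
        print(finding.identifier, finding.description)
```

In this sketch the scanner is allowed to be noisy, because anything the verification step cannot demonstrate is simply dropped from the report.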

: Catloafdev | 4 hours ago
Oh I should have clarified, I meant 'in the context of releasing a public tool'
I get that both need to exist as tools. I just don't see any safe way of doing a truly public release of the offensive end of it, you'd need to coordinate with established entities somehow.

jml78 | a day ago

At my job we have tooling that scans our code repos with Opus. Yes, it can find stuff; however, it doesn’t find everything.

I am able to get Opus and Sonnet to function as a red team agent. We don’t have some crazy special sauce, just a lot of trial and error. Basically, we add enough context proving we own the code and the running services that it will attempt to compromise our services.

It found tons of stuff that was not found with just scanning the code. It found serious security issues that had been in production for years that humans never found. They weren’t things that were accessible externally, but they were serious enough that we are thrilled to have these tools.

I can say that Fable did refuse to function with our harness. I am worried that soon you will have to be in the special club to do this stuff with the SOTA models. A small company like ours doesn’t get accepted to their programs that remove guardrails, even though our CEO has found and disclosed vulnerabilities to multiple companies and holds a patent around federated authentication.

rustcleaner | a day ago

They are only protecting corporate interests in insecure code bases by doing this. If everyone could have Mythos in their pockets, all the poorly written, bottom-dollar, rush-developed software would be rightfully shown to be the trash it always was. It would spur engineering liability legislation for commercial software and operations: speed-release poor insecure code --> corporate bankruptcy and maybe even prison for the software PE who signed off on it. Software, infrastructure, and hardware security won't improve massively until the "bad actors" start running rampant on the steaming pile!

mkaszkowiak | a day ago

What was your approach to benchmarking an adversarial agent?

This is an open problem that I came across (in a different domain), as the search space can be really wide. It's hard to measure results for non-trivial tasks.

Would be really interested if you can share your eval approach :)

jrflowers | a day ago

Show HN: We told Claude to generate a marketing page for a theoretical pentesting model

[OP] dk189 | a day ago

The tool is live, you can test it.

jrflowers | a day ago

No, you can’t. This page is a sales funnel to schedule a 30 minute video chat with Cosine.ai or argusred or whatever. The thing you can test is not the thing that the headline is talking about.

It’s just more “We’re so smart we invented the boogeyman, trust us” slop marketing that’s been happening since gpt-2

[OP] dk189 | a day ago

Did you follow the link? There is a binary you can brew install and test. It's live.

jrflowers | a day ago

> Gated because the security implications are real; access is via booking

If I wanted to show off a “model that pen tests” I’d at least include a gif of it running against Juice Shop or something before the spooky language and “schedule a sales call”

: [OP] dk189 | a day ago
Fair, should've been precise. What's free today is the scan: read-only. The Bank of Anthos integer overflow is a scan finding, clone it and you'll get the same. The active mode that actually sends the exploit and shows the response is gated for now, that's the part that's really 'pen test'. Juice Shop's a fair target for showing it, will try to get this done and post an update.

luminati | a day ago

Relevant: https://news.ycombinator.com/item?id=48016224

What's the difference between this and running Shannon on AWS/Bedrock fully airgapped in my VPC? I've gotten some pretty great results with Shannon [no subprocessor, and I can pay via AWS credits]. Even better when using a Claude Code token [effectively free with our $200/mo CC subscription]. I tried Kimi, but it generally spins its wheels extensively in its thinking tokens; Kimi 2.7 is an attempt at reducing this. But doing finetuning means you will always be behind the latest.

As a side note: I think it's very unprofessional and very shitty to not mention Kimi 2.6 at all in your marketing copy. And I feel that you posted that in this HN post begrudgingly, since the HN crowd would have flagged it. Confirmed with a Google search too: https://www.google.com/search?q=kimi+site%3Aargusred.com

All around your marketing website you keep mentioning 'A model lab built it'. A finetune does not maketh you a model lab; some humility please :)

Finally, doesn't Kimi's licensing require you to mention them? Didn't Cursor run into the same issue?

: [OP] dk189 | a day ago
It's named throughout our main website, the RL is on Kimi K2.6, benchmarks are vs K2.6: https://cosine.sh/blog/introducing-lumen-outpost. The ArgusRed page is a week old so it's not on there yet, but nothing's hidden. And K2.6 only needs attribution above a certain scale, the threshold Cursor hit and we haven't.
On Shannon airgapped in your VPC: if it works for you, you might not need us. A normal model will refuse or hedge on offensive tasks; we post-trained ours to just run the authorised stuff. For this one narrow job, a specialist that'll actually attack beats a generalist that won't.

jjcm | a day ago

IMO the most interesting thing about this is Kimi K2.6, an extremely capable model, can be relatively easily post-trained to allow pen tests.

This in its own right proves that the defenses of Fable and others are temporary blocks, and AI-based hacking is going to be effectively available to all parties regardless of stopgaps, as long as open models exist.

[OP] dk189 | a day ago

Agreed, and that's basically our premise. If a 5-person team can post-train an open model to do this, so can the people you don't want doing it; model-level refusals on open weights are a speed bump. Which is the argument for defenders having it too, not against.

: tough | a day ago
literally anyone can "liberate" a foss model with access to weights

skiing_crawling | a day ago

Any generic abliterated or uncensored open-weight model (such as a Qwen variant) will happily comply with requests like this.

lacoolj | 21 hours ago

Inb4 govt intervention