> Preliminary trials with Claude Mythos Preview showed that it would not provide an apples-to-apples comparison with other models because of how we had set up the experiment and how the model was served.
What does this mean? My guess is they couldn’t co-locate Mythos close enough to reduce latency?
(I’m assuming this experiment pre-dates the export controls)
> My guess is they couldn’t co-locate Mythos close enough to reduce latency?
I doubt network latency is the reason. Even when connecting from literally across the world, network latency is lost in the noise of the overall response latency of even fast models.
The overall response latency of the model very well could have been the difference, though. AFAIK Mythos is structured to do relatively slow "deep thinking".
Depending on the timeline, it could be that they're not allowed to access Mythos because of something like non-US citizens on the team or the lack of some way for them to meet the constraint DOD has them under.
I strongly suspect that if that were the case, they would have just said directly that Mythos couldn't be used for that reason. It would be less confusing and less suspect messaging than saying it wasn't an "apples-to-apples comparison".
This mostly reads as a comparison between Opus 4.7 and 4.1. It would be more interesting if they reran the experiment against a team of humans with 4.7 to see how much the humans still improve the results today.
It feels like the prevailing view is: if I get $100m before everyone loses their job, and build the terminator before it un-livings everyone else, then I win.
I'm getting a bit tired of these disguised adverts.
Here's how non-robotics engineers used AI to do a short robot integration task faster than other non-robotics engineers without AI.
Where "better" mostly means faster, and who knows what happens on longer horizons, with actual robotics experts, robustness requirements, or tasks where the hard part is control rather than API spelunking.
> I'm getting a bit tired of these disguised adverts.
It's not disguised. Corporate blogs exist overtly to promote the company and its work.
Disguised promotions, where notionally independent media publish promotional pieces as news while concealing that they were fed to them by the party whose products they promote, are a thing, but this is just the most overt, undisguised promotion.
> It's not disguised. Corporate blogs exist overtly to promote the company and its work.
It is. That makes the "research" heavily biased. If xAI did the same thing, with Elon Musk screaming that it is "AGI", you would not believe them at all.
Given that the work is not independent, articles about this "research" can easily be manipulated, or the results massaged, to promote the company positively.
But when others outside of the company try out the work or reproduce it, they get different results. So of course we continue to hear unverified research, especially in AI, when the frontier labs do not release their architecture or weights at all.
So in this case, with labs funded by VC cash, the incentives are clear, and I would not straight up believe results from the first-party source unless multiple sources outside of the company have verified them.
You or some other interested person could go do that experiment and publish the results. It shouldn't be hard to figure out what hardware exactly they were using and get a copy, and the prompt also doesn't have to be exactly what they used, just similar enough in spirit. See just how similar/different the outcome is.
Using their own time and money to disprove corporate propaganda. By the time you’ve disproven it they’ve already released 10 new claims. “Why are you still talking about that old news, nobody cares.”
In this case at least it isn't that hard. Opus is available to pretty much everyone, and anyone of sufficient means (I'm guessing at least 95% of active HN) can also easily afford the hardware.
And obviously this isn't something that's being iterated on rapid-fire; there have been 2 relevant publications roughly a year apart. Absolutely no firehose of anything here. As such there should be no problem for someone with enough interest to attempt to disprove the claims, and hopefully share the results regardless of their findings.
> If xAI did the same thing, with Elon Musk screaming that it is "AGI", you would not believe them at all.
I’m not saying it is trustworthy or that I believe them, I am saying the advertising isn’t even a little bit disguised when it is communicated directly from what is overtly a promotional channel for the company involved.
It's like calling the “9 out of 10 dentists prefer...” claim in a TV commercial “disguised advertising” and then coming back with arguments about how it isn't trustworthy research when it is pointed out that TV commercials are openly ads. Yeah, it's not trustworthy, but the fact that it is corporate promotional material and not a neutral third-party report is not at all concealed.
It is overt advertising communicated through a channel whose sole and open purpose is advertising for the company whose products it advertises.
Disguised ad or not, I learned that LLMs have the emergent capability of learning to complete tasks in physical space, without being fine-tuned for it.
> However, once again, we are seeing a pattern whereby first, models are helpful to humans. Then, humans are helpful to models. Finally, models are largely able to do things themselves. We have seen this in cybersecurity and now the same dynamics are starting to take shape at the intersection of AI and the physical world.
It’s good they are the ones seeing those things because otherwise no one else would have. Now if only seeing things would translate into getting any actual economic value out of them… instead of losing billions. But hey, who am I to do a reality check on this shameless piece of hype.
bob778 | 21 hours ago
georgemcbay | 21 hours ago
bannable | 21 hours ago
georgemcbay | 21 hours ago
joshu | 21 hours ago
jascha_eng | 21 hours ago
etchalon | 21 hours ago
digitaltrees | 19 hours ago
didibus | 21 hours ago
dragonwriter | 20 hours ago
rvz | 20 hours ago
dozerly | 20 hours ago
skeledrew | 17 hours ago
jurgenburgen | 11 hours ago
It’s like the firehose of lies but done by corps.
skeledrew | 7 hours ago
dragonwriter | 2 hours ago
skeledrew | 17 hours ago
nickosh | 20 hours ago
fassssst | 16 hours ago
BobbyJo | 16 hours ago
usernametaken29 | 19 hours ago
InkCanon | 19 hours ago
aabhay | 17 hours ago
LoganDark | 14 hours ago