Gemini Robotics-ER 1.6

212 points by markerbrod 21 hours ago on hackernews | 78 comments

sho_hn | 21 hours ago

It does all start to feel like we'd get fairly close to being able to convincingly emulate a lot of human, or at least animal, behavior on top of the existing generative stack by using brain-like orchestration patterns ... if only inference were fast enough to do much more of it.

The gauge-reading example here is great, but in reality of course having the system synthesize that Python script, run the CV tasks, come back with the answer etc. is currently quite slow.

Once things go much faster, you can also start to use image generation to have models extrapolate possible futures from photos they take, then describe them back to themselves and make decisions based on that; loops like this. I think the assumption is that our brains do similar things unconsciously, before we integrate them into our conscious conception of mind.

I'm really curious what things we could build if we had 100x or 1000x inference throughput.

moonu | 20 hours ago

Idk if you've seen this already, but Taalas does this interesting thing where they embed the model directly onto the chip, which leads to super-fast speeds (https://chatjimmy.ai). The model they're using is an old, small Llama model, though, so the quality is pretty bad. But they say it can scale, and if that's really true, that'd be pretty insane and would unlock the inference you're talking about.

lachlan_gray | 19 hours ago

Robotics/control systems is exactly what came to mind when I saw this release! What struck me is the possibility of look ahead search in real time, a bit like alphazero's mcts.

pstuart | 10 hours ago

It's a fascinating proposition, and no doubt they'll get bigger models in there, and likely be able to cluster multiple models for a mega MoE. One thing that would really be great is if they could take the power requirements down: the chip requires 2.5 kW, which is modest in terms of what the big boys use but would be an issue on a battery-powered robot.

fennecfoxy | an hour ago

"The chip" no, a whole rack/deployment they offer takes 2.5kW. Not just one chip. Squeezing 2.5kW thru 1 chip would be mental.

Kostic | 20 hours ago

Taalas showed that you can make LLMs faster by turning them into ASICs and get 10k+ tokens/s generation. It's a matter of time now.

timmg | 19 hours ago

Actually pretty interesting to think: in a few years you might buy a Raspberry Pi-style computer board with an extra chip on it running one of these embodiment models, and you could slap it in a rover or something.

LetsGetTechnicl | 19 hours ago

What if we put slop images into slop machines and got slop^2 back out

tootie | 19 hours ago

Is emulating human behavior really a valuable end goal though? Humans exist as the evolutionary endpoint of exhaustion-hunting large prey and organic tool-making. We've built loads of industrial and residential automation tools in the last 100 years and none of them are humanoid. I'd imagine a household robot butler would be more like R2D2 with lots and lots of arms.

hootz | 17 hours ago

It is when the world was made to interface with us. We can't use robots for everything if they aren't emulating us, because we would have to adapt everything for the non-humanlike robots.

tootie | 13 hours ago

We build our living spaces against the constraints of the human form, but that still doesn't imply the human form is optimal for anything. There's no reason a robot traveling over a smooth surface should have legs instead of wheels or treads. There's no reason to have a head. Some kind of arm is a common design feature, but there's certainly no reason to have two, and no reason to be symmetrical. A domestic robot may be constrained in terms of scale (i.e. it should see things at counter height) but not shape.

hgoel | 13 hours ago

Really, the requirements are for the robot to move in predictable ways (if something looks like an arm, it ought to move like an arm, etc), and to have enough strength to be useful for difficult/tiring tasks while somehow also not being dangerous if something does go wrong.

famouswaffles | 9 hours ago

>We build our living spaces against the constraints of the human form, but that still doesn't imply the human form is optimal for anything.

We build just about everything we expect to interact with against the constraints of the human form, not just living spaces. And yes: because we built those spaces for the human body, the human body is by definition the optimal choice.

>There's no reason a robot traveling over smooth surface should have legs instead of wheels or treads.

There's a reason: the robot becomes useless for any surface that isn't smooth. What's it going to do about stairs? You're not going to make a bespoke solution that generalizes for us better than "feet that work". Do you think it's better to build a million different complex robot bodies for every situation? That defeats the purpose of being general-purpose.

Glemllksdf | 17 hours ago

Every single behavior? Certainly not. But otherwise, we are the result of a very, very long evolution, and there is nothing else around us as smart and as adaptable.

The planning-ahead-through-simulation thing, for example, seems to be a very good tool in neural-network-based architectures.

nozzlegear | 11 hours ago

> Humans exist as the evolutionary endpoint

Just want to pedantically point out that we're not at our evolutionary endpoint yet. Humans are still evolving!

jeffbee | 20 hours ago

Showing the murder dog reading a gauge using $$$ worth of model time is kinda not an amazing demo. We already know how to read gauges with machine vision. We also know how to order digital gauges out of industrial catalogs for under $50.

snickmy | 20 hours ago

Agree. I'm unclear on what the highlight of this post is. Is it the multimodality of the model (which can replace computer vision), is it the reasoning part, or is it the overall wrapper that makes it very easy to develop on top of?

PunchTornado | 17 hours ago

It's the fact that it's not task-specific.

readams | 19 hours ago

I think that where this gets interesting is when you can just drop these robotic systems into an environment that wasn't necessarily set up specifically to handle them. The $50 for your gauge isn't really the cost: it's engineering time to go through the whole environment and set it up so that the robotic system can deal with each of the specific tasks, each of which will require some bespoke setup.

Completely agree. I get that this is a stepping stone for future, more reliable robots, but I found the demonstration underwhelming.

gallerdude | 20 hours ago

I’ve been thinking about AI robotics lately… if internally at labs they have a GPT-2, GPT-3 “equivalent” for robotics, you can’t really release that. If a robot unloading your dishwasher breaks one of your dishes once, this is a massive failure.

So there might be awesome progress behind the scenes, just not ready for the general public.

monkeydust | 20 hours ago

I ended up watching Bicentennial Man (1999) with Robin Williams over the weekend. If you haven't seen it, I thought it was a good and timely watch, and it's kid-friendly. Without giving away the plot: the scene where it unloads the dishwasher... take my money!

spwa4 | 19 hours ago

It's called "VLA" (vision-language-action) models: https://huggingface.co/models?pipeline_tag=robotics

VLA models essentially take a webcam screenshot + some text (think "put the red block in the right box") and output motor control instructions to achieve that.
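As a rough sketch of that interface (names, action shapes, and the stubbed policy are illustrative assumptions, not any specific model's API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    joint_deltas: List[float]  # per-joint position deltas (radians)
    gripper: float             # 0.0 = open, 1.0 = closed

def vla_policy(image, instruction: str) -> Action:
    """Stub standing in for a real VLA checkpoint: a real model would
    tokenize the camera frame plus the instruction and decode a chunk
    of low-level actions, typically at tens of Hz."""
    return Action(joint_deltas=[0.0] * 6, gripper=0.0)

def control_loop(camera, robot, instruction: str, steps: int = 100):
    """Closed loop: grab a frame, ask the policy, apply the action."""
    for _ in range(steps):
        frame = camera.read()            # webcam screenshot
        action = vla_policy(frame, instruction)
        robot.apply(action)              # send motor commands
```

The real models differ in action chunking and control rate, but the (image, text) in, motor commands out shape is the common denominator.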

Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too: "Gemini Robotics".

A demo: https://www.youtube.com/watch?v=DeBLc2D6bvg

NitpickLawyer | 19 hours ago

> If a robot unloading your dishwasher breaks one of your dishes once, this is a massive failure.

That's a bit exaggerated, no? Early Roombas would get tangled in socks, drag pet poop all over the floor, break glass stuff and so on, and yet the market accepted that, evolved, and now we have plenty of cleaning robots from various companies, including cheap spying ones from China.

I actually think that there's a lot of value in being the first to deploy bots into homes, even if they aren't perfect. The amount of data you'd collect is invaluable, and by the looks of it, can't be synth generated in a lab.

I think the "safer" option is still the "bring them to factories first, offices next and homes last", but anyway I'm sure someone will jump straight to home deployments.

doubled112 | 19 hours ago

I have broken dishes loading and unloading the dishwasher. Am I a massive failure?

My non-AI dishwasher can't even always keep the water inside. Nothing is perfect.

Rekindle8090 | 19 hours ago

If someone paid 100 grand for you to load and unload the dishwasher, and the research to be able to do it cost hundreds of billions of dollars, decades of research, and hundreds of thousands of researchers, and that was the ONLY thing you could do, then yes, you WOULD be a massive failure.

logicprog | 18 hours ago

> If a robot unloading your dishwasher breaks one of your dishes once, this is a massive failure.

Depending on what the rate of breaking dishes is, this would be a massive improvement on me, a human being, since I break a really important dish I needed to use like ~2x per month on average.

nancyminusone | 16 hours ago

You really break a dish once every 2 weeks? That seems exceptionally clumsy.

Not here to shame you for it, for the record.

logicprog | 15 hours ago

> That seems exceptionally clumsy

That's me ;_;

Glemllksdf | 16 hours ago

From an economic standpoint, industry is by far the most relevant market anyway. It's easier, as the environment is a lot more controlled, professionals configure and maintain the robots, and buyers purchase in bulk and have more money.

My concern with a household robot is not the dishwasher but the TV screen, the glass door, the glass table, the animals (fish/aquarium), etc. that the robot might walk through, reach into, or fall onto.

adityamwagh | 15 hours ago

There's not enough internet-scale data for robotics. The gap is huge! So anyone who claims to have a GPT-like model is not being honest.

skybrian | 20 hours ago

Pointing a camera at a pressure gauge and recording a graph is something that I would have found useful and have thought about writing. Does software like that exist that’s available to consumers?

gunalx | 20 hours ago

Look into opencv.

vessenes | 19 hours ago

I'm pretty sure claude will one shot this for you, including making you a home assistant dashboard item if you ask it.

nickthegreek | 19 hours ago

frigate can be set up to do this I believe, but it's overkill. Openclaw could do it, slightly less overkill.

nozzlegear | 11 hours ago

I wonder how the municipal employees would react to cameras suddenly appearing on the meters around my house.

vessenes | 19 hours ago

Nice. I couldn't find the part I'm most interested in, though: latency. This beats their frontier vision model for some identification tasks, but for a robotics model, I'm interested in Hz. Since this is an "Embodied Reasoning" model, I'm assuming it's fairly slow; it's designed to pair with faster on-robot control models.

Anyway, cool.

WarmWash | 19 hours ago

In my quick image-recognition testing in AI Studio, its performance seems similar to 3.1 Pro, but it's much, much faster. It "thinks", but only for a few seconds.

Of course, this is for counting animal legs while giving coordinates, and for reading analog clocks, not coding or solving puzzles. I imagine the image performance per unit of model weight for this model is very high.

vessenes | 18 hours ago

I thought the entire discourse on how important pointing is was super interesting. I've been told, although I don't know if it's true, that dogs are the only animals that can understand human pointing. Fascinating to think this might be fundamental to world-intelligence requirements. Well, it's required, but it's interesting to think that it might be a core structure, or that learning it might force some sort of neural architecture that's helpful.

And, I was disappointed to see that pointing was just giving x,y coords. I wanted to see robots pointing at stuff.

vibe42 | 19 hours ago

A parcel of land.

A few robot legs and arms, big battery, off-the-shelf GPU. Solar panels.

Prompt: "Take care of all this land within its limits and grow some veggies."

jayd16 | 19 hours ago

Yeah I'm not sure how that's currently working out. https://proofofcorn.com/

jonas21 | 18 hours ago

That's kind of the opposite problem -- the agent doesn't have robot arms or legs or a parcel of land. It has to rely on people to get access to land and plant and harvest the corn, and those people are ignoring it.

Are you saying that it has failed? It isn't obvious to me from that page that anything in particular is going wrong. I don't think anyone is daft enough to claim that AI solves the "Iowa remains unplantable due to winter conditions" problem.

adelie | 18 hours ago

logs suggest it's been 'critically failing' and 'blocked for 68 days' on farmhand introduction, although the logs don't go back far enough (and cut off too early) to really tell what's going on. https://proofofcorn.com/log

jayd16 | 18 hours ago

Looking at the timeline, they seem hopelessly behind. It's currently the planting window, and they don't have land or a person to work it.

Ah, thank you. It's not the planting window yet where I am, north of Iowa, so I wasn't certain where they were.

What if it turns out that "take care of this land" means the traditional way California was taken care of, with regular small, slow burns? After over 10k years of this type of management, there are many important native species that won't even germinate without the presence of ash.

Or it could turn out to look like satoyama (Japanese peasant forests), or it could be more similar to the crop rotation that was traditionally practiced in many parts of Central Africa, where root crops were important.

In Russia, before the Soviets forced "modern scientific agriculture" on the peasants, they practiced things like contour farming (planting rows of crops along the contours of the land to slow water down) and maslins (intermixing multiple varieties of wheat and barley in the same patch). Now contour farming is an active area of research for its ability to prevent topsoil loss and build soil health, while maslins provide superior yield stability with little to no pesticides.

That's not even getting into the 40,000 to 120,000 varieties of rice we've documented, most of which are hyper-adapted to a very specific location, often even a single village.

My point is there is no one way to take care of a plot of land. It's all relative to a number of factors beyond just the abiotic characteristics of the land itself. Your goals and intentions matter and you will always find localized unique adaptations.

Done! The whole planet is now veggies.

taneq | 18 hours ago

It turned the planet into ‘taterchips!

rcoveson | 13 hours ago

Jury is still out as to how well it works, but the traditional prompt is: "Be fruitful, and multiply"

prossercj | 11 hours ago

Don't forget the extra instructions about not eating from the tree of the knowledge of good and evil.

harrall | 18 hours ago

Google and Boston Dynamics (of Spot, Atlas fame) formed a partnership a while back and they’ve been working on building models together.

Hyundai now owns Boston Dynamics and is pushing to get the robots into their factories.

w10-1 | 18 hours ago

Would this approach destroy critical investments in physics- or modeling-based reasoning?

I'm all for the task reasoning and the multi-view recognition, based on relevant points. I'm very uncomfortable with the loose world "understanding".

The fault model I see is that e.g., this "visual understanding" will get things mostly right: enough to build and even deliver products. However, these are only probabilistic guarantees based on training sets, and those are unlikely to survive contact with a complex interactive world, particularly since robots are often repurposed as tasks change.

So it's a kind of moral-product-hazard: it delivers initial results but delays risk to later, so product developers will have incentives to build and leave users holding the bag. (Indeed: users are responsible for integration risks anyway.)

It hacks our assumptions: we think that you can take an MVP and productize it, but in this case, you'll never backfit the model to conform to the physics in a reliable way. I doubt there's any way to harness Gemini to depend on a physics model, so we'll end up with mostly-working sunk investments out in the market - slop robots so cheap that tight ones can't survive.

pants2 | 17 hours ago

I wonder what new laws of physics AI models have discovered, encoded opaquely in their weights, yet to be discovered by physicists.

Is there an open source mini robot kit that lets me play around with agentic robots?

moffkalast | 17 hours ago

SO-ARM101 I guess? Or more likely the Lekivi variant.

colinator | 16 hours ago

I was also just in the market for a small experiment robot. I got the Hiwonder ArmPi FPV. Avoid it: the actuators are pretty bad, they're very "grindy", and the robot jitters like crazy when it moves. Any such problems with the Lekivi?

moffkalast | 2 hours ago

Hmm, I've never used the Hiwonder servos, so I'm not sure how they compare with the Feetech/Waveshare STS type, but these have been surprisingly good overall. There is still considerable backlash, which adds up to 1-2cm of translational error at the gripper when accumulated along the arm, but the control is really stable. I don't think there's any jitter at all. They are a bit loud when moving at max speed, but there is also an STS3250 brushless variant that's stronger and really quiet. Expensive, though.

I haven't tested the Lekivi specifically, but I've used lots of SO-ARMs and a custom-built Lekivi-like robot. I think some people have had issues with the rear omni wheel when moving forward, but I haven't seen that myself.

fennecfoxy | an hour ago

Waveshare has some pretty rad ones. I've been tempted to use one of their rovers for something.

Be careful, because you can easily overpay out the ass for "robot kits" online.

steveharing1 | 17 hours ago

Soon Open Source will fill the gap here as well

colinator | 17 hours ago

This seems perfect to hook up to my 'LLMs can control robots over MCP' system. The idea is that LLMs are great at writing code, so let's lean in to that. I'll give it a try! I just got a bigger robot, we'll see how it does...

https://colinator.github.io/Ariel/post1.html

Glemllksdf | 16 hours ago

Really unfortunate that I forgot what YT video I saw about this just 2 weeks ago.

It was about Google's PaLM-E evolution and progress. It basically has two models: one controls the robot, the other is an LLM, and they are combined together in some attention layer.

johntb86 | 15 hours ago

Must have been https://www.youtube.com/watch?v=2mrGMMmrVNE from Welch Labs.

colinator | 15 hours ago

That video is pretty good, thanks for finding it. I'm basically betting that an earlier, abandoned approach described in the video, "Code as Policy", will beat everything else. It requires no training data, and generalizes instantly to all robots.

Glemllksdf | 3 hours ago

Yes :-)

fennecfoxy | an hour ago

I would look into VLAs/VLMs if you haven't already.

martythemaniak | 17 hours ago

As the article notes, regular Gemini and Gemma also have spatial reasoning capabilities, which I decided to test by seeing whether Gemini could drive a little rover. It mostly could: https://martin.drashkov.com/2026/02/letting-gemini-drive-my-...

LLMs are really good at the sorts of tasks that have been missing from robotics (understanding, reasoning, planning, etc.), so we'll likely see much more use of them in various robotics applications. I guess the main questions right now are:

- Who sends the various fine-motor commands? The answer most labs/researchers have is "a smaller diffusion model": the LLM acts as a planner, then a smaller, faster diffusion model controls the actual motors. I suspect in many cases you can get away with the equivalent of a tool call, where the LLM simply invokes a particular subroutine, like "go forward 1m" or "tilt camera right".

- What do you do about memory? All the models are either purely reactive or take a very small slice of history as part of the input, so they all need some type of memory/state-management system to actually work on a task for more than a little while. It's not clear to me whether this will be standardized and become part of the models themselves, or whether everyone will just do their own thing.
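The tool-call version of the first point can be sketched as a skill registry the planner invokes by name. The skill names and the tool-call dict shape here are hypothetical, not any particular lab's API:

```python
# Hypothetical skill registry: the LLM plans in language, then emits
# tool calls that map onto a fixed set of vetted subroutines, so the
# model never outputs raw motor torques.
SKILLS = {}

def skill(fn):
    """Register a function as a callable robot skill."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def go_forward(meters: float) -> str:
    # Would hand off to the low-level controller / diffusion policy.
    return f"moved forward {meters}m"

@skill
def tilt_camera(direction: str) -> str:
    return f"camera tilted {direction}"

def dispatch(tool_call: dict) -> str:
    """Execute one tool call of the shape {'name': ..., 'args': {...}},
    as an LLM planner might emit it."""
    fn = SKILLS[tool_call["name"]]
    return fn(**tool_call["args"])
```

The appeal of this shape is that the set of things the robot can physically do stays fixed and auditable, while the planner stays general.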

colinator | 17 hours ago

For the fine-motor commands: or, the model can write the code to generate them on the fly. It seems to work, in my very limited experiment.

As for memory: my approach is to give the robot a python repl and, basically, a file system - the LLM can write modules, poke at the robot via interactive python, etc.

Basically, the LLM becomes a robot programmer, writing code in real-time.
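A minimal sketch of that filesystem-as-memory idea, assuming a hypothetical `robot_skills/` directory (the layout and names are illustrative, not the linked project's actual code):

```python
import importlib.util
import pathlib

SKILL_DIR = pathlib.Path("robot_skills")  # hypothetical on-robot layout

def save_skill(name: str, source: str) -> None:
    """Persist LLM-written code as a module; the filesystem is the memory."""
    SKILL_DIR.mkdir(exist_ok=True)
    (SKILL_DIR / f"{name}.py").write_text(source)

def load_skill(name: str):
    """Import a previously saved skill module by name."""
    path = SKILL_DIR / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod
```

Skills the LLM wrote in earlier sessions survive restarts and can be re-imported, which is one cheap way to get long-horizon state without touching the model itself.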

Isamu | 16 hours ago

> Our safest robotics model yet: Safety is integrated into every level of our embodied reasoning models. Gemini Robotics-ER 1.6 is our safest robotics model to date, demonstrating superior compliance with Gemini safety policies on adversarial spatial reasoning tasks compared to all previous generations.

The safety guidelines are interesting, they treat them as a goal that they are aspiring to achieve, which seems realistic. It’s not quite ready for prime time yet.

Lucasoato | 16 hours ago

Meanwhile, Gemini 3.1 Pro (which was released two months ago) was completely unavailable to me this afternoon, via both the API and my subscription.

Nothing was reported on Google's status page. Not even the CLI is responding; it's just left there waiting for an answer that will never arrive, even after 10 minutes.

shireboy | 15 hours ago

Maybe dumb question: One of the use cases is instrument reading of analog instruments. My brain immediately goes to "this should have some sensor sending data, and not be analog". Is having a robot dog read analog sensors really a better fit in some cases?

heisenzombie | 14 hours ago

It bears really thinking through the alternative:

So we're going to have some engineers specify suitable digital replacements given the process/environment/safety requirements. We'll procure those (noting that an industrial digital pressure transducer can easily push up towards $10k), schedule a plant shutdown (how much does that cost?), then pay a pipefitter/boilermaker to replace the old gauges with new pressure transmitters (do you need a hot work permit for that? Did you get your engineer to sign that off?). Then, your controls sparky has to find a way to route a drop back to your marshalling cabinet for connection into your fieldbus/HART/modbus/whatever network (do you have one of those?) so that your SCADA system can talk to it (do you have one of those?).

Obviously it's not really an apples-to-apples comparison, but I think the costs involved with making "simple" changes in industrial settings are easy to wildly underestimate.

fennecfoxy | an hour ago

I think the thing is: does it need to last 20-30 years between replacements if a robot can easily replace it, and they're cheap enough to add redundant ones? Do we really need crazy accuracy, even at an industrial level? Like, this pipe will burst at 200 psi, so the gauge needs to be accurate to 0.001 psi so we can sound the alarm when it hits 199.999 psi? Somehow I don't think so.

Dumb silicon is so super cheap now; just look at NFC tags, 1-cent microcontrollers, etc. We can litter our world with sensors.

Which I would love to see - but I'm also not discounting the usefulness of any robot just being able to read something we can read and vice versa.

schainks | 13 hours ago

I can see many cases where installing an IoT camera will be more reliable and less costly than, "shut down equipment to unplug this analog instrument, hook up a digital one, calibrate it, then restart the equipment".

If it ain't broke don't fix it — pointing a cheap camera at it with some cloud compute will suffice.

Glemllksdf | 3 hours ago

Having an IoT system working flawlessly across all the devices you own would be great, right?

Like your washing machine reporting its state, knowing if the sun is out, and running only when there is a lot of sun.

Your basement heater sending out its stats.

And your industrial machines doing the same thing.

Then you realize that we've been talking about Industry 4.0 for a decade now, everything IoT is either closed source or costs extra, and as for working together? Hahahaha...

I don't know why we can't have nice things; it would be that easy :|

fennecfoxy | an hour ago

Honestly? Because capitalism.

ComputerGuru | 13 hours ago

So should we be using this until Google deigns to release Gemini Flash 3.1? (Not flash lite or live)

fennecfoxy | an hour ago

I feel like this is a political move between Hyundai and Google (a favour by Google).

BD sat back on traditional programming and light ML techniques for ages while transformers went wild, and it's only now that they're like "oh shit".

Hence the partnership with Google; BD lacks the capabilities otherwise. I bet their internal marketing departments did a bit of hand-shaking to spin this piece as a favour for Hyundai/BD, because from Google's (and our) perspective, reading a gauge etc. isn't that impressive: multimodal transformers solved that years ago, and OpenCV many years before that. But to BD it's impressive, a desperate grasp of "we swear we're using modern ML now! Yes, our robot dances were sequenced and took dozens of takes, but now we'll start doing it for real, we swear!"