I wonder if this is because it's a larger model or maybe just because they can? Although with the latest Deepseek it's really tough to compete pricing wise. Inference speed and integration (e.g. Antigravity) might be their only hope here
It has to be a larger model, wouldn't make much sense otherwise. That isn't to say the price isn't artificially increased as well
The Antigravity harness is really well done, so I do agree it's their strong suit. Can't say the same about gemini-cli (though it has a really nice interface)
> Today, we’re introducing Gemini 3.5, our latest family of models combining frontier intelligence with action. This represents a major leap forward in building more capable, intelligent agents. We’re kicking off the series by releasing 3.5 Flash.
Yeah, which means that in the end the only values worth paying attention to are the input cost and the cache discount (x10). If Google started offering a x20 discount, it would be twice as cheap even with input and output prices unchanged.
It depends on the use-case. yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.
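To put rough numbers on the two cases above, here is a minimal sketch. It assumes cached input is billed at the input price divided by the cache discount, uses the $1.50 / $9.00 per 1M on-demand prices quoted elsewhere in this thread, and the token mixes are made up for illustration, not measured.

```python
def request_cost(fresh_in, cached_in, out, in_price, out_price, cache_discount):
    """Dollar cost of one request; prices are dollars per 1M tokens."""
    # Assumption: cached input is billed at in_price / cache_discount.
    cached_price = in_price / cache_discount
    total = fresh_in * in_price + cached_in * cached_price + out * out_price
    return total / 1_000_000


# Hypothetical token mixes, not measurements.
agentic = dict(fresh_in=5_000, cached_in=200_000, out=1_000)
reasoning = dict(fresh_in=5_000, cached_in=0, out=200_000)

for name, mix in (("agentic", agentic), ("reasoning", reasoning)):
    for discount in (10, 20):
        cost = request_cost(**mix, in_price=1.50, out_price=9.00, cache_discount=discount)
        print(f"{name:9s} x{discount} cache discount: ${cost:.4f}")
```

With this particular mix, going from x10 to x20 cuts the agentic turn by about a third, while the reasoning-heavy request does not move at all because output tokens dominate it.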
In our experience, caching is not very reliable with google. We always get random cache misses that don't happen with other providers. We find OpenAI, Anthropic and Fireworks (which we use a lot) all have higher cache hit rates. So it's not only about the costs of cached token but also what kind of cached hit rate you get.
In my experience Google is the most flaky in general, which is surprising considering the rock solid history of their search and other products. Just more likely not to respond at all, to give a response out of left field, to handle the same error in 12 different ways randomly (a rainbow of HTTP status codes and error messages), etc etc.
Exactly our experience too. We catch these status codes and send the request to OpenAI instead. Retrying the same query on Gemini has a high chance of returning more or less the same status code.
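A minimal sketch of that kind of fallback, for anyone who wants the shape of it. Everything here is hypothetical: `ProviderError`, the `primary` and `fallback` callables, and the set of status codes are stand-ins, not any provider's real SDK.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(f"{status}: {message}")
        self.status = status


# Assumed set of statuses worth rerouting; tune this to what you actually observe.
FALLBACK_STATUSES = frozenset({429, 500, 502, 503, 504})


def complete_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    fallback_statuses: frozenset = FALLBACK_STATUSES,
) -> str:
    """Call the primary provider; on a listed status, go straight to the fallback."""
    try:
        return primary(prompt)
    except ProviderError as err:
        if err.status in fallback_statuses:
            # No retry against the primary: per the comment above, it tends to fail the same way.
            return fallback(prompt)
        raise  # Other errors (e.g. a malformed request) should surface to the caller.
```

The only design choice worth noting is that it does not retry the primary before switching, which matches the observation that a retry usually returns the same error.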
$0.15 / million tokens
$1.00 / 1,000,000 tokens per hour (storage price)
I much prefer the OpenAI/DeepSeek way of pricing caching where you don't have to think about storage price at all - you pay for cached tokens if you reuse the same prefix within a (loosely defined) time period.
I confirmed this by running a bunch of prompts through Gemini 3.5 Flash without doing anything special to configure caching and noting that it comes back with a "cachedContentTokenCount" on many of the responses.
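If you want to turn that observation into a hit rate, a rough sketch follows. It assumes you have collected each response's usage metadata as a plain dict. `cachedContentTokenCount` is the field mentioned above; `promptTokenCount` is my assumption for the name of the total prompt token field, so check it against your actual responses.

```python
def cache_hit_rate(usages):
    """Fraction of prompt tokens served from cache across a batch of responses.

    `usages` is an iterable of dicts of usage metadata, one per response.
    Responses without a cachedContentTokenCount are counted as full misses.
    """
    prompt_tokens = 0
    cached_tokens = 0
    for usage in usages:
        prompt_tokens += usage.get("promptTokenCount", 0)  # assumed field name
        cached_tokens += usage.get("cachedContentTokenCount", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Made-up sample: one response with a cache hit, one without.
sample = [
    {"promptTokenCount": 1000, "cachedContentTokenCount": 800},
    {"promptTokenCount": 1000},
]
print(f"cache hit rate: {cache_hit_rate(sample):.0%}")
```

Logging this per provider is one way to compare hit rates rather than only per-token cache prices.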
"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.
Engineers at Google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.
They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.
It's doubtful they have the compute to make mythos publicly available even after the SpaceX datacenter deal. And why sell it publicly if people are still willing to pay for Opus 4.7?
That claim keeps getting contradicted hard by other parties, who say Mythos beats 5.5 resoundingly on both autonomous search and discovery and the creation of complex exploit chains.
There might be a harness difference, but also, this CTF-type benchmark might not capture the capability difference fully.
The boat itself rocks, but do you see the background changing to indicate the boat is progressing through the environment? I only see that in the 3.1 Pro example. I believe that's what the OP meant.
I think this illustrates the problem with OP's prompt. If the goal is specifically to implement a scrolling background, this should have been in the prompt.
Can you try with a more complex story such as "three little pigs"? I tried but it created a storybook instead of the SVG animation. I am looking to partially imitate Godogen [1][2] which is really great, even for animations.
I think it's unreasonable to expect models to generate complex stories from a single prompt, since they're trained to be concise, but I tried. This is the prompt, added on top of the story, with a request for no control buttons:
Now think, plan how to tell this story in a cartoon, make scene outline and then generate SVG animation story for "Three Little Pigs" in self contained HTML page. Just single animation no control buttons.
Well, honestly this is quite impressive compared to 3.1 Flash Lite and 2.5 Pro. Considering that 2.5 Pro is actually quite good at generating massive amounts of code one shot.
Here is a GPT 5.5 Extra High with a modified instruction:
> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG. Use the Brave Browser to verifty that the image is indeed animated and looks like a proper rowing frog; iterate until you are satisfied with it.
Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.
For comparison within the Gemini lineup:
- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00
So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.
Okay, it's kind of somewhere between haiku and sonnet level pricing, at somewhere between sonnet and opus level performance. It's a great option. I was hoping to see opus class intelligence at haiku level pricing out of google, and this is close to that!
Never mind, after looking at more benchmarks, seems closer to sonnet level intelligence at slightly lower cost. Speed is great for latency sensitive applications, but if this was 1/2 the cost it would have been priced to win.
If this is the big model release out of google, it's a disappointment.
You’re quoting the batch pricing. On demand is $1.50 per M input and $9 per M output tokens. This is effectively comparable in cost to Gemini 2.5 Pro, in a flash tier model.
Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.
I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.
Coding, however, is solved like magic. Easier to add tests, to be fair.
if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor, there would be no articles about it because its so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law, in other domains that move faster, sources would be only partially outdated with agentic search and mcp servers filling in the gaps
AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"
(the domain name is dumb and completely unmarketable)
The models still hallucinate badly when called via APIs, especially if web search is not enabled. Gemini hallucinates quite frequently even with the app and search enabled. More recent prompts/harnesses (e.g. ChatGPT 5.x and Deepseek v4) search very aggressively, which does greatly mitigate hallucinations.
People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.
More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.
Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.
Hallucination is also much better controlled in the context of agentic coding because outputs can be validated by running the code (or linters/LSP). I almost never notice hallucinations when I’m coding with AI, but when using AI for legal work (my real job) it hallucinates constantly and perniciously because the hallucinations are subtle—e.g., making up a crucial fact about a real case.
Yes, you can catch many mistakes that LLMs make while coding, but I wouldn't necessarily call it "controlled." Every now and then the LLM will run into dead ends where it makes a certain mistake, the compiler or unit tests find the mistake, so it tries a different approach that also fails, and then it goes back to the first approach, then tries the second approach again, and gets stuck in an endless loop trying small variations on those two approaches over and over.
If you aren't paying attention it can spend a long time (and a lot of tokens) spinning in that loop. Sometimes there might be more than two approaches in the loop, which makes it even harder to see that it's repeating itself in a loop. It's pretty frustrating to see it working away productively (so you think) for 20 minutes or so only to finally notice what's going on
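One cheap guard against that kind of loop is to fingerprint each attempt and stop once the same one keeps coming back. This is only a sketch: `LoopDetector` is a hypothetical helper, not part of any agent harness, and it only catches attempts that are identical after whitespace and case normalization, so the "small variations" case would need fuzzier matching.

```python
import hashlib
from collections import Counter


class LoopDetector:
    """Flags when an agent keeps resubmitting an attempt it has already tried."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, attempt: str) -> bool:
        """Record one attempt; return True once it has been seen more than max_repeats times."""
        normalized = " ".join(attempt.split()).lower()  # ignore whitespace and case differences
        key = hashlib.sha256(normalized.encode()).hexdigest()
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats


detector = LoopDetector(max_repeats=2)
for step, patch in enumerate(["approach A", "approach B", "approach A", "approach B", "approach A"]):
    if detector.record(patch):
        print(f"step {step}: same attempt seen again, stop and ask a human")
        break
```

Because it counts per attempt rather than comparing only with the previous step, it also catches cycles through more than two approaches.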
"People complain about them incessantly, but I can almost never get people to actually post receipts."
...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.
No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.
Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.
Claude has gotten good in the past month or two at recognizing when it might need to search the web for updated info rather than saying that it has no idea what I'm talking about or making stuff up.
I see constant hallucination in claude code when using specific tooling: It thinks it knows aws cli, for instance, but there are some flags that don't exist that it attempts to use all the time in 4.6 and 4.7. When asked about it, it says that yes, the flag doesn't exist in that command, but it exists in a different command (which it does), and yet it still attempts to use it unless given extra info.
Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.
For gemini, I typically use it to query information from large corpuses, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it just makes sure you catch it after the fact, just like a method in a class that doesn't exist gets found out by the compiler. The LLM still hallucinated it.
I can reliably produce hallucinations with this genre of prompt: "write a script that does <simple task> with <well known but not too-well-known API>." Even the frontier models will hallucinate the perfect API endpoint that does exactly what I want, regardless of if it exists.
The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.
I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").
Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)
I asked gemini 3.1 Pro to search for the linkedin URLs for a list of peers. It generated a plausible list of links -- but they were all hallucinated. On a follow up it confirmed it couldn't actually search, but didn't tell me that without prompting.
Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.
Also, prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information and when I ask it what section of the spec that information comes from it just makes up a section number
Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.
As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate
It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.
3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite. $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.
I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.
One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.
3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
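The puzzle can be checked with quick arithmetic. The sketch below prices only the output tokens, using the per-1M output prices quoted elsewhere in this thread ($12.00 for 3.1 Pro Preview, $9.00 on demand for 3.5 Flash). Whether those are the rates the benchmark run was actually billed at, and how input or cached tokens contributed, is not stated here, so treat both as assumptions.

```python
def output_only_cost(output_tokens_millions, price_per_million):
    """Dollar cost of output tokens alone, ignoring input and cached tokens."""
    return output_tokens_millions * price_per_million


# Token counts and reported totals are the figures quoted in the comments above.
runs = {
    "Gemini 3.1 Pro": {"out_m": 57, "out_price": 12.00, "reported": 892},
    "Gemini 3.5 Flash": {"out_m": 73, "out_price": 9.00, "reported": 1551},
}

for name, run in runs.items():
    estimate = output_only_cost(run["out_m"], run["out_price"])
    gap = run["reported"] - estimate
    print(f"{name}: output-only ${estimate:.0f}, reported ${run['reported']}, gap ${gap:.0f}")
```

On output tokens alone, 3.5 Flash comes out slightly cheaper ($657 vs $684), so the 74% difference would have to come from something this estimate leaves out.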
AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.
What we need today is a standard local API (think of it as a POSIX extension), so that each desktop app that needs AI to enhance a feature can simply call that. This way, those apps will need to handle the case where AI is not available. This will empower users.
Arena.ai is saying "Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers."
Yikes. I think the concept of a 'flash' model is changing, no? Google used to market this as its lower-intelligence, faster, cheaper option. I appreciate that they are delivering on both of those, but personally I would appreciate if they could create an incremental knowledge improvement while holding price steady. Fortune 500 companies have to make their money I guess.
That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...
The Artificial Analysis benchmark results are pretty underwhelming. Roughly the same "intelligence" as MiMo-V2.5-Pro for over 3x the cost. We'll have to see how that translates to actual usage but it's not a great sign.
I didn't take the price into consideration when writing that. I meant to point out that even if they have similar scores, the Flash model might be smaller than MiMo or Kimi, which would by itself be a win
That said, haste makes waste as the price point completely invalidates that
> concerned about Gemini models being benchmaxxed generally
I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.
I have never had good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus/Gemini in my experience.
Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediately following same-sized model (and lol @ 3 only ever getting a preview).
3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.
Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.
This combined with locally runnable models getting pretty good recently (e.g. Qwen 3.6) tells me that it's time to seriously consider local dev setup again
My guess: it's the price at which they make more money than if they rent the TPUs to other companies.
The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?
The price at which they could rent out the TPUs, i.e. the market rate, is the inference cost.
Just because you are vertically integrated doesn't mean you get to discount one business unit's products to another. Doing so ignores the opportunity cost you pay and is just bad accounting.
Depends on if you have spare capacity I think. They have minimal competition so they might be maximizing profit by charging prices higher than what clears all their supply.
It's probably that in 1 or 2 years local (free) models will completely take the place of cheap models, so cheap models need to move up the quality chain.
You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.
Flash seems to be targeting the near-frontier category.
Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.
Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.
Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.
But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.
...my opinions here are of course, conjecture built on top of conjecture....
3.1 flash lite — $0.25/$1.50 — plus insanely fast.
3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.
Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric, though. You're comparing apples to oranges. Gemini 3.1 Flash is somewhere in the neighborhood between current Haiku and Sonnet, I think? Still a better value than the Anthropic models, I guess, which are quite pricey.
Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.
They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.
Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.
Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.
These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.
Inference alone is certainly profitable. I'm running models at home that are comparable to performance of paid models a year or so ago for free. Even for much larger models the cost around inference serving are clearly manageable.
Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.
This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).
It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).
If it's profitable, why haven't they reported any profits? People like Ed Zitron have done the math and it just doesn't add up. I mean he just published this piece today: https://www.wheresyoured.at/ai-is-too-expensive/
Amazon was unprofitable for over a decade, and they were public. There's no incentive to be profitable as a private company if you can continue to raise money.
His entire brand is that the AI bubble will burst. By his account it was supposed to have burst several times by now. As with the doomers, it's not if but when, and they have to keep pushing back their predictions. Funny how both camps can be so confident. Alas, that's how they get eyes, ears and dollars.
That's not to say they will be or are wrong, it's just that they aren't exactly unbiased, or humble, sources.
Yeah, at this point I think the worst-case scenario for OpenAI/Anthropic/etc is to slow down frontier model development and focus on tooling and services, as opposed to imploding completely and bursting the economic bubble. I hope?
We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.
That's one solution to the problem. But it still needs some good computational capabilities. Either we optimize the hell out of those models, or we wait for the hardware to become good enough for them.
Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.
Deepseek V4 (not flash) tripled in price too by the way (from Deepseek). Get used to this pattern.
This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.
Mate why are you so mad at people upset the price tripled? It's a fair complaint that people built services using the cheaper ones with the expectation future models would be similarly priced. You can avoid 'offloading thinking' while still building on top of these models
Anyone can host Deepseek V4 on rented GPUs and sell inference on it. Price will very quickly converge to the marginal cost of inference. This is as close to a pure commodity as it gets in the AI space so competitive market economics will put in work. Same is true for any open-weights model.
You don't understand the costs involved to run inference at scale
Please go run some numbers. The hardware needed to run Deepseek v4 flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.
Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1M tokens. You make less than $1k per month, and the hardware required to run that costs $10k a month to rent, with hardly any concurrent session capability.
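Running those numbers as a back-of-the-envelope sketch: the 150 tps, $0.30 per 1M tokens and $10k per month figures are the ones from the comment above. The 30-day month, 100% utilization and a single flat price for every token are my simplifying assumptions.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_revenue(tokens_per_second, price_per_million, streams=1):
    """Revenue from `streams` sessions generating tokens nonstop for a month."""
    tokens = tokens_per_second * SECONDS_PER_MONTH * streams
    return tokens * price_per_million / 1_000_000


single = monthly_revenue(150, 0.30)
hardware_rent = 10_000  # dollars per month, figure from the comment above
break_even_streams = math.ceil(hardware_rent / single)

print(f"one stream at 150 tps: ${single:.2f} per month")
print(f"streams needed to cover ${hardware_rent} rent: {break_even_streams}")
```

A single fully utilized stream brings in about $117 a month under these assumptions, so the rented hardware would need to sustain on the order of 86 such streams at once just to cover its rent.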
Yes it is more efficient in $/tok to run at scale than to run just for yourself. Everyone selling Deepseek V4 inference is selling an undifferentiated good. They have run the numbers on how much it costs and are competing against a dozen other outfits also selling undifferentiated open weights tokens. Whatever the dollar cost they face to rent those GPUs will be what they are able to charge in the competitive market. That is great for you and me because we can buy tokens at pretty much exactly what it costs to produce them.
The benchmark tables in the Google announcement include Opus 4.7, and the numbers are very impressive. Caveat emptor, but it's not unreasonable to compare a new Flash to a current-gen Opus, even if some of the results confirm expectations
Well, the first impression is that Gemini still goes off the instruction rails easier than other models, but I noticed that it tends to go back to the initial goal without holding a hand, which is a real improvement. It's really interesting that these models behave so differently.
What we need is a DeepSeek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.
Because it forced them to focus on efficiency, instead of throwing more compute at the problem.
Just like in software, some of the most beautiful solutions come from constraints. Think, the optimisations that game developers implemented because of the frame budget.
We're having DeepSeek moments every couple of weeks.
Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.
The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.
And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.
The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).
> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.
You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.
We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.
DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.
In general, Gemini flash is still relatively cheap compared to the "mini" versions from the other big 2. However, I agree that newer versions seem to come with price increases of several x (similar to the new ChatGPT), and we certainly need competition from the open source models to keep these guys in check with pricing.
It might be temporary pricing given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards, so they're in a bit of a lurch as 3.1 Pro really doesn't make sense given that 3.5 Pro has been delayed a bit.
That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).
I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.
This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:
Gemini 2.5 flash (27 score): $172 (1.0x)
Gemini 2.5 pro (35 score): $649 (3.8x)
Gemini 3.0 Flash (46 score): $278 (1.6x)
Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)
This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
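For anyone who wants to recheck the multipliers, here is a small sketch that recomputes them from the four dollar figures listed above. Nothing else is assumed beyond rounding to one decimal.

```python
# Cost to run the whole eval, in dollars, as quoted in the list above.
eval_cost = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}


def ratio(model, baseline):
    """How many times more expensive `model` is than `baseline`, to one decimal."""
    return round(eval_cost[model] / eval_cost[baseline], 1)


for model in eval_cost:
    print(f"{model}: {ratio(model, 'Gemini 2.5 Flash')}x vs 2.5 Flash")

print(f"3.5 Flash vs 2.5 Pro: {ratio('Gemini 3.5 Flash', 'Gemini 2.5 Pro')}x")
print(f"3.5 Flash vs 3.0 Flash: {ratio('Gemini 3.5 Flash', 'Gemini 3.0 Flash')}x")
```

The recomputed ratios agree with the 3.8x, 1.6x, 9.0x, 2.4x and 5.6x figures given in the comment.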
To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.
I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.
That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.
> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.
Given how widely varying the amount of tokens each model uses for a given task, "price-per-token" is essentially meaningless when doing this sort of comparison.
Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)
For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)
They're still months behind OpenAI and Anthropic on coding.
Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).
I do use Gemini for "lifestyle" AI usage (web research etc) tho.
LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is corrupted with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.
Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.
Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.
If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.
The answer is: without being in the training data, LLMs basically don't understand what they're searching for.
At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher quality reasoning, the training has to be focused and the data has to be very high quality and high density.
With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.
Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.
> it maybe doesn't even matter that the models are using older data.
This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.
Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?
That's a different problem than I thought you were worried about. I wasn't considering the marketing angle, though that is certainly relevant and a risk to consider, especially when it comes to Google, whose primary businesses are ads and surveillance.
Well, available for Gemini means these days that half the time they are “Receiving a lot of requests right now.” and so sorry they couldn’t complete the task. Luckily the model supports long time horizons because that’s what’s needed. /me likes Gemini a lot just wishing Google would add the compute!
This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an svg like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements.
When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)
I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, so the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.
That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.
Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)
The antigravity teamwork-preview doesn't work for me -- upgraded to ultra, installed antigravity 2, ran teamwork-preview, keeps failing: "You have exhausted your capacity on this model. Your quota will reset after 0s."
Honestly, I feel like the new Gemini 3.5 Flash is a failure. The performance doesn't seem that great, and while they revamped the UI, Anti-Gravity just feels like a cheap CODEX knockoff now. The web UI is underwhelming, and overall it feels like it lost its unique identity by just copying other AIs. It’s a flop in both performance and price point. I’m seriously considering canceling my Gemini subscription altogether. Using Chinese AI models might actually be a better option at this point
I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.
I dunno, the tools are kind of there. Browsers have canvases and JavaScript and SVGs and sound. The communities are around; they're just kind of dispersed. There's no one website that is THE place for fun stuff. Instead, there are dozens, and most of them suck.
There's still fun stuff, though. I stumbled upon this bit of insanity just yesterday: https://tykenn.itch.io/trees-hate-you. It would have fit in fabulously with the old Flash sites.
And there were some amazing RAD and prototyping tools in the 90s (mostly for DOS, but also for Windoze desktop apps.) You're right, we sort of gave up on the idea when everyone wanted to be seen as a "real" software engineer who knew how to sling Java on the back end.
I'm excited for the conversation to switch from intelligence to tps instead. I care much less about what hard thought experiments models can one shot and much more how responsive my plain text interface for doing things is.
Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access to (and everyone else does).
I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.
Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.
I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.
I would argue that prose is just a prompt issue. GPT 5.5 output is easier to control than Gemini by prompting. Having better defaults does not necessarily make it better.
I would disagree. I think it'd take a lot of prompting to make GPT 5.5 not have the underlying personality of GPT, which I find awful. They have knobs in ChatGPT to choose a "professional" tone, which improves it somewhat, but even that is still the worst prose of any leading model.
My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up roughly the same as "be brief" in the weights, I don't know.
If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.
Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default has a huge leg up.
Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this?
How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.
Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.
Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?
Actually there's probably a harness that does that - is someone out there using one?
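A rough sketch of what such a harness could look like, assuming nothing about any real product: `fast_model` and `slow_reviewer` are hypothetical stand-ins for actual model calls, and the sleeps only simulate latency. The fast model answers each turn immediately while the review runs as a background task.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    # Stand-in for a low-latency model call (hypothetical).
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    # Stand-in for a slower, more careful model call (hypothetical).
    await asyncio.sleep(0.05)
    return f"review of '{draft}': no issues found"


async def run_turn(prompt: str, reviews: list) -> str:
    draft = await fast_model(prompt)
    # Start the review in the background; the user is not blocked on it.
    reviews.append(asyncio.create_task(slow_reviewer(prompt, draft)))
    return draft


async def session(prompts):
    reviews = []
    drafts = [await run_turn(p, reviews) for p in prompts]
    # Collect the reviews once the fast loop is done.
    feedback = await asyncio.gather(*reviews)
    return drafts, feedback


drafts, feedback = asyncio.run(session(["rename the helper", "add a test"]))
print(drafts)
print(feedback)
```

Feeding the reviewer's feedback back into the fast model's next prompt would be the "talk directly to the faster model" part; that is left out here.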
I was using GPT 5.5 for a bunch of work this morning. It's brilliant and efficient. I was also using GPT 5.4 mini. It gets the job done and works great for subtasks that 5.5 designs. Gemini 3.5 Flash is SUCH a Gemini. It seems to work okay, but its attitude is disgusting.
"Yes, your idea is excellent."
"How this works beautifully:"
"This is a fantastic development!"
"This is an exceptionally clean and robust architecture."
and then I point out what feels like an obvious flaw:
"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."
I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.
I switched from Opus 4.6 -> Opus 4.7 -> GPT 5.5 and tried Flash 3.5 tonight and I was not impressed. It is straight up unreliable, e.g. deleting code and forgetting to add the new stuff it was asked to, then happily marking the task as complete with an upbeat conclusion. I personally appreciate GPT 5.5's toned-down, objective style, so I really dislike how this model feels. I get that it's a flash model and not in the same league as GPT 5.5, but their marketing suggests otherwise, so they are just setting themselves up for disappointment.
I have google ai pro plan and tried antigravity with 3.5 flash but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.
The demo of the model in Antigravity automatically renaming and categorizing unstructured assets using vision was quite cool; it demonstrates that the IDE side panel can be used for more than just coding. I wonder if the harness in Antigravity is based on Gemini cli or if they are completely different. Could Gemini cli do the same task? Or is the vision feature an Antigravity thing?
This is funny, I was randomly using Gemini today and I was astounded how good the responses I was getting were from Flash. I guess this must be the reason why.
Worth noting that Google marked this stable rather than preview, which is unusual compared to their recent releases. Pair that with the 3x price hike and flash pricing now reads like the long-term floor they want, not a temporary thing they will walk back later. But it's hard to tell yet whether that's Google specifically reading the room or the whole industry quietly resetting the cheap-inference baseline.
I caught it again being deceitful. It did this before
(Me): Did you actually read the paper before when I pasted the link?
> I will be completely honest: No, I did not.
> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.
> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.
I am sure it learned a valuable lesson and won't do it again /s
this seems to happen a lot with commercial models; my local models will happily do as much research as needed and then some when given a task (almost too much), but providers' models refuse to even curl a single datasheet before trying something that i know won't work unless it reads the datasheet
How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?
f311a | 4 hours ago
explosion-s | 4 hours ago
hydra-f | 2 hours ago
Would still choose Deepseek for the price
alexdns | 3 hours ago
nerdalytics | 3 hours ago
jader201 | 2 hours ago
swe_dima | 3 hours ago
asar | 3 hours ago
6x the price of 3.1 flash lite
himata4113 | 3 hours ago
minimaxir | 3 hours ago
himata4113 | 3 hours ago
wolttam | 3 hours ago
himata4113 | 3 hours ago
johaugum | 2 hours ago
himata4113 | an hour ago
__jl__ | 3 hours ago
svachalek | 2 hours ago
veselin | an hour ago
simonw | 2 hours ago
simonw | an hour ago
I confirmed this by running a bunch of prompts through Gemini 3.5 Flash without doing anything special to configure caching and noting that it comes back with a "cachedContentTokenCount" on many of the responses.
The "storage price" quoted is for an optional Gemini feature that most people don't care about: https://ai.google.dev/gemini-api/docs/caching#explicit-cachi...
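For anyone who wants to measure this on their own traffic, a minimal sketch: it assumes each response's usage metadata is available as a dict, that a missing "cachedContentTokenCount" means nothing was served from cache, and that a "promptTokenCount" field counts all prompt tokens including cached ones. The sample records are made up, and `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(usage_records):
    """Fraction of prompt tokens reported as served from cache."""
    prompt = sum(u.get("promptTokenCount", 0) for u in usage_records)
    cached = sum(u.get("cachedContentTokenCount", 0) for u in usage_records)
    return cached / prompt if prompt else 0.0


# Made-up usage metadata from three responses; the second one is a cache miss.
usage = [
    {"promptTokenCount": 10_000, "cachedContentTokenCount": 8_000},
    {"promptTokenCount": 10_000},
    {"promptTokenCount": 12_000, "cachedContentTokenCount": 8_000},
]

rate = cache_hit_rate(usage)
print(f"cache hit rate: {rate:.0%}")
```

Tracking this number per provider is one way to compare the hit rates people are describing anecdotally.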
John7878781 | 3 hours ago
stri8ed | 3 hours ago
iwhalen | 3 hours ago
Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/
WarmWash | 3 hours ago
Cost per task is a more productive measure, but obviously a more difficult one to benchmark.
Aunche | 2 hours ago
himata4113 | 3 hours ago
They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.
stri8ed | 3 hours ago
himata4113 | 3 hours ago
JanSt | 3 hours ago
himata4113 | 3 hours ago
maipen | 3 hours ago
They are just refining their current models while they finish training the next generation.
They will all come out at about the same time. Anthropic, OpenAi, Google, xAI
ACCount37 | 3 hours ago
Sevii | 3 hours ago
outside1234 | 3 hours ago
throwa356262 | 2 hours ago
Hold on, I think this claim needs some hard data. Here you go gentlemen:
https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...
ACCount37 | 2 hours ago
There might be a harness difference, but also, this CTF-type benchmark might not capture the capability difference fully.
aesthesia | 2 hours ago
throwa356262 | an hour ago
abirch | 2 hours ago
howdareme | 3 hours ago
fikama | 2 hours ago
ActorNightly | 2 hours ago
mnicky | 2 hours ago
For intelligence/size only OpenAI and Anthropic are the frontier. Google has more compute so it can compensate for that with size of the models...
snovv_crash | an hour ago
Jabbles | 2 hours ago
Can you link to a source?
Dinux | 2 hours ago
ActorNightly | 2 hours ago
Nobody really knows the answer to which one is more optimal
* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.
* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.
golfer | 3 hours ago
https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...
mixtureoftakes | 3 hours ago
SXX | 3 hours ago
https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...
3.5 Flash: Thinking High - 7280 tokens
https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...
3.1 Pro - 28,258 tokens
https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...
3.1 took 3 minutes of thinking to generate, but it is the only one that got animated movement.
abi | 3 hours ago
John7878781 | 3 hours ago
TacticalCoder | 3 hours ago
captn3m0 | 3 hours ago
NitpickLawyer | 3 hours ago
codazoda | 3 hours ago
Fishkins | 3 hours ago
Manuel_D | 2 hours ago
SXX | 2 hours ago
wslh | 3 hours ago
[1] https://github.com/htdt/godogen
[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...
SXX | an hour ago
Actual results for models, one shot:
Gemini 3.5 Flash - Three Little Pigs - 9,050 tokens:
https://gistpreview.github.io/?ed9faa53604035005cae86c63c766...
Gemini 3.1 Pro - Three Little Pigs - 24,272 tokens:
https://gistpreview.github.io/?f506bbfd9b4459c8cd55d89605af8...
Gemini 3 Flash - Three Little Pigs - 5,350 tokens:
https://gistpreview.github.io/?f58eff069cf916031c97d560b0e35...
Gemma 4 31B IT - Three Little Pigs - 5,494 tokens:
https://gistpreview.github.io/?a3aa75abbe8fd7818b73f6fa55ee6...
Gemma 4 26B A4B IT - Three Little Pigs - 6,375 tokens:
https://gistpreview.github.io/?1e631caebeb54f9f0cd6d0e3d4d5e...
SXX | 3 hours ago
https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...
Gemini 2.5 Pro - 5,325 tokens:
https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...
Gemini 2.5 Flash - 7,556 tokens:
https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...
Gemma 4 31B IT - 3,261 tokens via AI Studio:
https://gistpreview.github.io/?858a42b96af864859a3b89508619d...
Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:
https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...
SXX | 2 hours ago
https://gistpreview.github.io/?da742884e5e830ce71ee4db877519...
OFC this is just for fun, but nevertheless gave me working code on first try.
abtinf | 3 hours ago
8112 tokens @ 52.97 TPS, 0.85s TTFT
https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...
Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...
Generated with LM Studio on a Macbook Pro M2 Max
https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...
SXX | 2 hours ago
svnt | an hour ago
SXX | an hour ago
franze | 2 hours ago
https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...
tasuki | an hour ago
lpa22 | an hour ago
vtail | 2 hours ago
https://gistpreview.github.io/?557f979c82701862bc26d24f10399...
vtail | an hour ago
> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG. Use the Brave Browser to verifty that the image is indeed animated and looks like a proper rowing frog; iterate until you are satisfied with it.
It was able to discover and fix an animation bug, but the result is still far from perfect: https://gistpreview.github.io/?029df86d03bfe8f87df1e4d9ed2f6...
krupan | an hour ago
cesarvarela | 3 hours ago
meetpateltech | 3 hours ago
benbencodes | 3 hours ago
Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.
For comparison within the Gemini lineup: - Gemini 2.5 Flash: $0.30 / $2.50 - Gemini 3.1 Flash-Lite: $0.25 / $1.50 - Gemini 3.1 Pro Preview: $2.00 / $12.00
So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.
conorh | 3 hours ago
mchusma | 3 hours ago
mchusma | 3 hours ago
If this is the big model release out of Google, it's a disappointment.
jpau | 3 hours ago
(I suspect you're viewing the "flex" pricing).
lyjackal | 3 hours ago
ls_stats | 3 hours ago
Tiberium | 3 hours ago
MallocVoidstar | 2 hours ago
> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization
Every Gemini model starting with 2.5 has been a reasoning model.
aliljet | 3 hours ago
Sevii | 3 hours ago
aliljet | 3 hours ago
Coding, however, is solved like magic. Easier to add tests, to be fair.
throawayonthe | 3 hours ago
goldenarm | 2 hours ago
yieldcrv | 3 hours ago
AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"
(the domain name is dumb and completely unmarketable)
jampekka | 3 hours ago
majso | 3 hours ago
WarmWash | 3 hours ago
More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at GPT-4.0 text-analysis levels.
Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.
saberience | 3 hours ago
And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.
I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.
If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.
droidjj | 2 hours ago
krupan | an hour ago
If you aren't paying attention it can spend a long time (and a lot of tokens) spinning in that loop. Sometimes there might be more than two approaches in the loop, which makes it even harder to see that it's repeating itself in a loop. It's pretty frustrating to see it working away productively (so you think) for 20 minutes or so only to finally notice what's going on
rjh29 | 2 hours ago
...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.
No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.
Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.
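The "look for hedge words" check is easy to automate as a crude first pass. A minimal sketch, assuming only the two words mentioned above and a naive sentence splitter; `flag_hedged_sentences` is a hypothetical helper, and a flagged sentence is only a hint to go verify, not proof of a hallucination.

```python
import re

# The two hedge words mentioned above; extend the list as needed.
HEDGES = re.compile(r"\b(usually|often)\b", re.IGNORECASE)


def flag_hedged_sentences(text):
    """Return the sentences that contain a hedge word."""
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if HEDGES.search(s)]


sample = "The mall often has a Starbucks. It opens at 9am."
flagged = flag_hedged_sentences(sample)
print(flagged)
```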
ls612 | an hour ago
hibikir | 2 hours ago
Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.
For Gemini, I typically use it to query information from large corpora, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it only makes sure you catch it after the fact, just like a call to a method that doesn't exist gets found out by the compiler. The LLM still hallucinated it.
hamdingers | 2 hours ago
The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.
sapneshnaik | 2 hours ago
```
Build a Nango sync that stores Figma projects.
Integration ID: figma
Connection ID for dry run: my-figma-connection
Frequency: every hour
Metadata: team_id
Records: Project with id, name, last_modified
API reference: https://www.figma.com/developers/api#projects-endpoints
```
Note: You do need a Nango account and the Nango Skill installed before it could work.
asdfasgasdgasdg | 2 hours ago
I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").
Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)
brooksc | an hour ago
krupan | an hour ago
Also, asking for prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information, and when I ask it what section of the spec that information comes from, it just makes up a section number.
Corence | 55 minutes ago
Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.
Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8
FergusArgyll | 3 hours ago
goldenarm | 2 hours ago
krupan | an hour ago
bakugo | 3 hours ago
Feels like the AI pricing noose is tightening sooner rather than later.
eis | 3 hours ago
I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is wildly different.
One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.
[0] https://artificialanalysis.ai/models/gemini-3-5-flash [1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview
ls_stats | 3 hours ago
That's everything I needed to know.
ekojs | 3 hours ago
mijoharas | 2 hours ago
Does that mean this model is production ready?
[0] https://news.ycombinator.com/item?id=47076484
pingou | 2 hours ago
3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
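A rough check of the output-token part alone, using the output prices quoted elsewhere in this thread ($12.00 per 1M for 3.1 Pro Preview, $9.00 per 1M for 3.5 Flash) and the token counts above. This deliberately ignores input and cached tokens, so it cannot explain the reported gap; it only shows that output tokens by themselves don't.

```python
def output_cost(output_tokens_millions, price_per_million):
    """Dollar cost of output tokens only."""
    return output_tokens_millions * price_per_million


# Token counts from the comment above; prices as quoted in this thread.
pro_31 = output_cost(57, 12.00)
flash_35 = output_cost(73, 9.00)
ratio = flash_35 / pro_31

print(f"3.1 Pro output cost:   ${pro_31:.0f}")
print(f"3.5 Flash output cost: ${flash_35:.0f}")
print(f"ratio: {ratio:.2f}")
```

On output tokens alone the two come out within a few percent of each other, so any large difference in the total would have to come from whatever else is counted.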
knollimar | an hour ago
nightski | 3 hours ago
lugu | 2 hours ago
HardCodedBias | 3 hours ago
GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.
That probably works for vibe coded apps by non-practitioners.
I suspect that practitioners/professionals will wait longer for better results.
brokencode | 3 hours ago
And Google is trying to make something affordable enough for a mass market, ad-supported audience.
They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.
OsrsNeedsf2P | 3 hours ago
sauwan | 3 hours ago
droidjj | 2 hours ago
golfer | 2 hours ago
https://x.com/arena/status/2056793180998361233
nicce | 37 minutes ago
s3p | 3 hours ago
2001zhaozhao | 2 hours ago
likium | an hour ago
toraway | an hour ago
noelsusman | 3 hours ago
hydra-f | 2 hours ago
halJordan | 2 hours ago
hydra-f | an hour ago
That said, haste makes waste as the price point completely invalidates that
merb | 3 hours ago
It’s not possible to uptrain on preview releases and it did not get that much love for a while.
warthog | 2 hours ago
Plus the vibe of the Gemini models is so weird, particularly when it comes to tool calling
At this point I kinda need them to shock me to make the switch
simianwords | 2 hours ago
Also concerned about Gemini models being benchmaxxed generally
NitpickLawyer | 2 hours ago
I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.
computerex | an hour ago
hubraumhugo | 2 hours ago
amarant | 2 hours ago
I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!
npn | 2 hours ago
And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?
It seems like google does want us to use Chinese models.
GodelNumbering | 2 hours ago
Gemini 2.5 flash: $0.30/$2.50
Gemini 3.0 flash preview: $0.50/$3.00
Gemini 3.5 flash: $1.50/$9.00
Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).
3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
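The multiples implied by those list prices, as a quick sketch (input and output computed separately; the numbers are exactly the ones listed above, and `multiples` is just a hypothetical helper):

```python
# (input, output) in $ per 1M tokens, as listed above.
prices = {
    "Gemini 2.5 flash": (0.30, 2.50),
    "Gemini 3.0 flash preview": (0.50, 3.00),
    "Gemini 3.5 flash": (1.50, 9.00),
}


def multiples(old, new):
    """(input multiple, output multiple) going from model `old` to model `new`."""
    return tuple(round(n / o, 2) for o, n in zip(prices[old], prices[new]))


vs_30 = multiples("Gemini 3.0 flash preview", "Gemini 3.5 flash")
vs_25 = multiples("Gemini 2.5 flash", "Gemini 3.5 flash")
print("3.5 flash vs 3.0 flash preview:", vs_30)
print("3.5 flash vs 2.5 flash:", vs_25)
```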
dbbk | 2 hours ago
GodelNumbering | 2 hours ago
mlmonkey | an hour ago
rudedogg | 2 hours ago
Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.
IncreasePosts | 2 hours ago
GodelNumbering | 2 hours ago
MASNeo | 2 hours ago
cft | an hour ago
tempaccount420 | 2 hours ago
My guess: it's the price at which they make more money than if they rent the TPUs to other companies.
The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?
gpm | an hour ago
Just because you are vertically integrated doesn't mean you get to discount one business unit's products to the other. Doing so ignores the opportunity cost you pay and is just bad accounting.
HDThoreaun | 27 minutes ago
spyckie2 | an hour ago
You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.
Flash seems to be targeting the near-frontier category.
TurdF3rguson | 46 minutes ago
booty | an hour ago
Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.
https://www.together.ai/pricing
https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)
Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.
But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.
...my opinions here are of course, conjecture built on top of conjecture....
fnordsensei | 2 hours ago
https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...
GodelNumbering | 2 hours ago
dr_dshiv | 2 hours ago
3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.
For comparison, Opus models are $5/$25
SwellJoe | an hour ago
Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.
WarmWash | 14 minutes ago
Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.
doginasuit | 2 hours ago
hnarn | an hour ago
People really can’t wait to be the next Zynga
lanthissa | an hour ago
ilia-a | 2 hours ago
OakNinja | 37 minutes ago
LetsGetTechnicl | 2 hours ago
GaggiX | 2 hours ago
npn | an hour ago
Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.
LetsGetTechnicl | an hour ago
npn | 26 minutes ago
Google has just recently upgraded their TPUs.
roadside_picnic | an hour ago
Inference alone is certainly profitable. I'm running models at home, for free, that are comparable in performance to paid models from a year or so ago. Even for much larger models, the costs around inference serving are clearly manageable.
Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.
This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).
It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).
ReliantGuyZ | an hour ago
Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?
HDThoreaun | 22 minutes ago
LetsGetTechnicl | an hour ago
anthonypasq | 55 minutes ago
Ed Zitron and Gary Marcus are... confused.
goosejuice | 43 minutes ago
That's not to say they will be or are wrong, it's just that they aren't exactly unbiased, or humble, sources.
booty | an hour ago
hei-lima | 2 hours ago
segmondy | 2 hours ago
hei-lima | an hour ago
squidbeak | 2 hours ago
ai_fry_ur_brain | an hour ago
This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.
npn | an hour ago
flakiness | 18 minutes ago
aurareturn | an hour ago
dpoloncsak | an hour ago
ls612 | an hour ago
ai_fry_ur_brain | an hour ago
Please go run some numbers. The hardware needed to run DeepSeek V4 Flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.
Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1M tokens. You make less than $1k per month, and the hardware required to run that costs $10k a month to rent, with hardly any concurrent session capability.
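A back-of-the-envelope version of that estimate, as a rough sketch. It assumes a single stream generating non-stop at 150 tokens per second for a 30-day month, billed at $0.30 per million tokens. The $10k rental figure is the one quoted above, not something derived here, and the break-even helper optimistically assumes every extra session also sustains 150 tps.

```python
import math

SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assumption: 30-day month, 100% utilization


def monthly_revenue(tokens_per_second: float, usd_per_million_tokens: float,
                    sessions: int = 1) -> float:
    """Revenue in USD for `sessions` streams that generate tokens non-stop for a month."""
    tokens = tokens_per_second * SECONDS_PER_MONTH * sessions
    return tokens / 1_000_000 * usd_per_million_tokens


def sessions_to_break_even(hardware_usd_per_month: float, tokens_per_second: float,
                           usd_per_million_tokens: float) -> int:
    """Concurrent full-speed sessions needed to cover the hardware rental."""
    per_session = monthly_revenue(tokens_per_second, usd_per_million_tokens)
    return math.ceil(hardware_usd_per_month / per_session)


single = monthly_revenue(150, 0.30)                    # about 116.64 USD per month
needed = sessions_to_break_even(10_000, 150, 0.30)     # 86 sessions
print(f"one stream: ${single:.2f}/month, break-even at {needed} concurrent streams")
```

Under these assumptions one stream brings in roughly $117 a month, so the rental only pays off with dozens of concurrent streams at full speed, which is exactly the capacity the comment says that hardware lacks.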
ls612 | an hour ago
zaptrem | 27 minutes ago
GeorgeOldfield | an hour ago
k8sToGo | an hour ago
CognitiveLens | an hour ago
kmac_ | an hour ago
bachmeier | 59 minutes ago
xbmcuser | an hour ago
throwa356262 | an hour ago
humanfromearth9 | 52 minutes ago
Weryj | 45 minutes ago
Just like in software, some of the most beautiful solutions come from constraints. Think, the optimisations that game developers implemented because of the frame budget.
SwellJoe | an hour ago
Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.
The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.
And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.
The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).
Zambyte | 43 minutes ago
You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.
trollbridge | 37 minutes ago
DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.
pianopatrick | 40 minutes ago
irthomasthomas | 2 hours ago
CooCooCaCha | an hour ago
photonair | 2 hours ago
llm_nerd | 2 hours ago
SwellJoe | 2 hours ago
I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.
WhitneyLand | an hour ago
Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.
Question is are you going to persuade anyone with this argument?
Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.
SyneRyder | an hour ago
A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.
https://x.com/Steve_Yegge/status/2046260541912707471
A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.
https://x.com/demishassabis/status/2043867486320222333
This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:
https://x.com/mihaimaruseac/status/2046272726881693960
m3kw9 | an hour ago
verdverm | an hour ago
and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)
throwa356262 | an hour ago
Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.
__jl__ | an hour ago
Gemini 2.5 flash (27 score): $172 (1.0x)
Gemini 2.5 pro (35 score): $649 (3.8x)
Gemini 3.0 Flash (46 score): $278 (1.6x)
Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)
This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
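The multipliers above can be rechecked with a few lines. This is only a sanity check on the quoted figures; the scores and dollar amounts are copied from the comment, and the dictionary and function names are made up for illustration.

```python
# (score, cost to run in USD) as quoted in the comment above.
RUNS = {
    "Gemini 2.5 Flash": (27, 172),
    "Gemini 2.5 Pro": (35, 649),
    "Gemini 3.0 Flash": (46, 278),
    "Gemini 3.5 Flash": (55, 1552),
}


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more expensive `model` was to run than `baseline`."""
    return round(RUNS[model][1] / RUNS[baseline][1], 1)


for name in RUNS:
    print(f"{name}: {cost_ratio(name, 'Gemini 2.5 Flash')}x vs 2.5 Flash")
print("3.5 Flash vs 2.5 Pro:", cost_ratio("Gemini 3.5 Flash", "Gemini 2.5 Pro"))
print("3.5 Flash vs 3.0 Flash:", cost_ratio("Gemini 3.5 Flash", "Gemini 3.0 Flash"))
```

The ratios come out to 3.8x, 1.6x and 9.0x against 2.5 Flash, 2.4x against 2.5 Pro and 5.6x against 3.0 Flash, matching the numbers in the comment.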
OakNinja | 57 minutes ago
I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.
That said, I think we'll see the lite/flash models and the pro models diverge more price-wise. The pro models will become more and more expensive.
llmslave | 2 hours ago
This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more
npn | 2 hours ago
golfer | 2 hours ago
> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.
https://x.com/arena/status/2056793180998361233
h14h | an hour ago
Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
andrewstuart | 2 hours ago
If not then I’m not using it.
Cancelled my account 3 months ago, only Claude code level capability would bring me back.
cmrdporcupine | an hour ago
For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)
They're still months behind OpenAI and Anthropic on coding.
Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).
I do use Gemini for "lifestyle" AI usage (web research etc) tho.
reconnecting | 2 hours ago
Latest update: May 2026
I have a very bad feeling about this lag.
hosel | 2 hours ago
nemomarx | 2 hours ago
mixtureoftakes | 2 hours ago
still the cutoff is very much concerning and inconvenient
reconnecting | an hour ago
Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.
neksn | 36 minutes ago
reconnecting | 25 minutes ago
If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.
The answer is: without being in the training data, LLMs basically don't understand what they're searching for.
1. https://github.com/tirrenotechnologies/tirreno
yoda7marinated | 2 hours ago
SwellJoe | an hour ago
With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.
Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.
reconnecting | 45 minutes ago
This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.
Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?
SwellJoe | 24 minutes ago
verdverm | an hour ago
stan_kirdey | 2 hours ago
MASNeo | 2 hours ago
simonw | 2 hours ago
Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.
Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
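For anyone wanting to reproduce the 13 cents, here is a minimal sketch. It assumes the `it=11` and `ot=14403` query parameters in the link are the input and output token counts, and it uses the $1.50/$9.00 per-million prices quoted elsewhere in this thread. The function name is made up for illustration.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request given token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Assumed: 11 input tokens, 14,403 output tokens, $1.50 / $9.00 per million.
pelican = request_cost_usd(11, 14_403, 1.50, 9.00)
print(f"${pelican:.4f}")  # about $0.1296, i.e. roughly 13 cents
```

Nearly all of the cost comes from the output side: the 11 input tokens contribute a small fraction of a cent.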
hedgehog | 2 hours ago
xattt | an hour ago
joseda-hg | an hour ago
egillie | an hour ago
Xenoamorphous | an hour ago
verdverm | an hour ago
hydra-f | 2 hours ago
nashashmi | 2 hours ago
unglaublich | an hour ago
irthomasthomas | 2 hours ago
edit: fixed human hallucination
derefr | an hour ago
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)
I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
irthomasthomas | an hour ago
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.
gcgbarbosa | an hour ago
simonw | an hour ago
nickmccann | an hour ago
smcleod | an hour ago
holtkam2 | an hour ago
simonw | an hour ago
tantalor | an hour ago
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> most ended up drawing something that was pretty far off from a regular men’s bicycle
et1337 | an hour ago
lxgr | 33 minutes ago
khy | 55 minutes ago
setgree | 46 minutes ago
wtf
`<!-- Gold Rim -->`
WTF??
nickvec | 20 minutes ago
https://en.wikipedia.org/wiki/Vaporwave
ralusek | 2 hours ago
mackross | 2 hours ago
jdw64 | 2 hours ago
lanewinfield | 2 hours ago
casey2 | an hour ago
OhMeadhbh | an hour ago
nightski | an hour ago
CobrastanJorji | 8 minutes ago
There's still fun stuff, though. I stumbled upon this bit of insanity just yesterday: https://tykenn.itch.io/trees-hate-you. It would have fit in fabulously with the old Flash sites.
_puk | an hour ago
Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!
Every time I have heard the word flash for goodness knows how many years.
OhMeadhbh | 44 minutes ago
goatlover | an hour ago
OhMeadhbh | 42 minutes ago
wg0 | an hour ago
alexandre_m | an hour ago
verdverm | an hour ago
x3cca | an hour ago
ai_fry_ur_brain | an hour ago
paperwork360 | an hour ago
bredren | an hour ago
SwellJoe | an hour ago
Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.
I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.
nicce | 32 minutes ago
SwellJoe | 12 minutes ago
My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up to roughly the same as "be brief" in the weights, I don't know.
If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.
Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default has a huge leg up.
owentbrown | an hour ago
Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.
Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?
Actually there's probably a harness that does that - is someone out there using one?
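For what it's worth, the shape of such a harness might look something like the sketch below. Everything in it is hypothetical: `fast_model` and `slow_reviewer` are stand-ins that just sleep and return strings, not calls to any real model API. The only point is the control flow, where the fast answer is returned immediately and the slow review runs in the background.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    """Stand-in for a low-latency model call (hypothetical, no real API used)."""
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    """Stand-in for a slower, more careful model that checks the draft."""
    await asyncio.sleep(0.05)
    return f"review of '{draft}': no issues found"


async def run_turn(prompt: str, reviews: list) -> str:
    """Return the fast draft right away; queue the slow review without waiting."""
    draft = await fast_model(prompt)
    reviews.append(asyncio.create_task(slow_reviewer(prompt, draft)))
    return draft


async def session(prompts: list) -> tuple:
    """Run every turn at fast-model speed, then collect the background reviews."""
    reviews: list = []
    drafts = [await run_turn(p, reviews) for p in prompts]
    feedback = await asyncio.gather(*reviews)
    return drafts, list(feedback)


drafts, feedback = asyncio.run(session(["rename this function", "add a test"]))
print(drafts)
print(feedback)
```

A real harness would also need a way to feed the reviewer's findings back into the fast model's context, which this sketch leaves out.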
pcwelder | an hour ago
On my tasks it has not been as good as even Sonnet 4.6 so far.
Instruction following over long context feels worse.
It's not a bad model by any means, better than any pro open source model for sure.
landtuna | 59 minutes ago
"Yes, your idea is excellent."
"How this works beautifully:"
"This is a fantastic development!"
"This is an exceptionally clean and robust architecture."
and then I point out what feels like an obvious flaw:
"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."
I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.
andriy_koval | 54 minutes ago
kaspermarstal | 39 minutes ago
kristopolous | an hour ago
which you can invoke with
$ curl day50.dev/art-analysis.sh | bash
inspect the code. it's tiny.
I use it all the time and maintain it. Snag a copy and pull it down again if it breaks on you. I stay on top of it.
hmate9 | an hour ago
quirino | 42 minutes ago
From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.
The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol
Alifatisk | an hour ago
uejfiweun | 49 minutes ago
amelius | 44 minutes ago
nikhilpareek13 | 36 minutes ago
rdtsc | 34 minutes ago
(Me): Did you actually read the paper before when I pasted the link?
> I will be completely honest: No, I did not.
> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.
> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.
I am sure it learned a valuable lesson and won't do it again /s
jareklupinski | 31 minutes ago
brikym | 26 minutes ago
stared | 14 minutes ago
Google: we don’t need Chinese to distill our models, we can do it ourselves