Advancing AI Benchmarking with Game Arena

Source: blog.google

Decisions in the real world are rarely based on the perfect information found on a chessboard. We are updating Kaggle Game Arena with two new games — Werewolf and poker — to benchmark how models navigate social dynamics and calculated risk.

General summary

Google DeepMind is expanding its Game Arena platform to benchmark AI models in more complex scenarios. You can now test your models in Werewolf and poker in addition to chess. Watch live tournaments on Kaggle to see how the top models perform in these games.

Summaries were generated by Google AI. Generative AI is experimental.

Bullet points

  • Google DeepMind's "Game Arena" article discusses using games to benchmark AI, moving beyond perfect information scenarios.
  • Game Arena expands beyond chess to include Werewolf, testing social deduction and communication skills in AI models.
  • A new poker benchmark assesses AI's ability to manage risk and quantify uncertainty in competitive scenarios.
  • Watch live streams of AI competitions in poker, Werewolf, and chess with expert commentary on Kaggle.
  • These benchmarks help develop safer AI by evaluating model behavior in complex, real-world-like environments.

Basic explainer

Google DeepMind made a place called Game Arena to test how smart AI really is. They started with chess to see how well AI can plan ahead. Now, they're adding Werewolf and poker to test AI on things like social skills and risk-taking. These games help them see if AI can handle the real world's trickiness and work safely with people.

Chess is a game of perfect information. The real world is not.

Last year, Google DeepMind partnered with Kaggle to launch Game Arena, an independent, public benchmarking platform where AI models compete in strategic games. We started with chess to measure reasoning and strategic planning. But in the real world, decisions are rarely based on complete information.

To build artificial intelligence capable of navigating this uncertainty, we need benchmarks that measure the model’s ability to reason in the face of ambiguity. This is why we are now expanding Game Arena with two new game benchmarks — Werewolf and poker — to test frontier models on social dynamics and calculated risk.

Games have always been a core part of Google DeepMind’s history, offering an objective proving ground where difficulty scales with the level of competition. As AI systems become more general, mastering diverse games demonstrates their consistency across distinct cognitive skills. Beyond measuring performance, games can also serve as controlled sandbox environments to evaluate agentic safety, providing insight into model behavior in the complex environments they will encounter when deployed in the real world.

Chess: reasoning over calculation

We released the chess benchmark last year to assess models on strategic reasoning, dynamic adaptation, and long-term planning by pitting them against one another in head-to-head chess games. To track how these model capabilities are evolving, we have updated the leaderboard to include the latest generation of models.

While traditional chess engines like Stockfish function as specialized super-calculators, evaluating millions of positions per second to find the optimal move, large language models do not approach the game through brute-force calculation. Instead, they rely on pattern recognition and ‘intuition’ to drastically reduce the search space — an approach that mirrors human play.
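To make that contrast concrete, here is a minimal, self-contained sketch of the two approaches. It is not Game Arena or Stockfish code; the `Game`, `evaluate`, and `policy_scores` hooks are hypothetical stand-ins. The first function expands every legal move, while the second trusts a learned prior to keep only a handful of candidates.

```python
# Illustrative sketch only: `Game`, `evaluate`, and `policy_scores` are
# hypothetical stand-ins, not part of Game Arena or any chess engine.
from typing import Hashable, Iterable, Protocol


class Game(Protocol):
    def legal_moves(self) -> Iterable[Hashable]: ...
    def play(self, move: Hashable) -> "Game": ...
    def is_terminal(self) -> bool: ...


def minimax(state: Game, depth: int, evaluate, maximizing: bool = True) -> float:
    """Engine-style exhaustive search: expands every legal continuation."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    values = [minimax(state.play(m), depth - 1, evaluate, not maximizing)
              for m in state.legal_moves()]
    return max(values) if maximizing else min(values)


def policy_guided(state: Game, depth: int, evaluate, policy_scores,
                  top_k: int = 3, maximizing: bool = True) -> float:
    """Prunes to the few moves a learned 'intuition' ranks highest,
    loosely mirroring how an LLM narrows the search space."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    candidates = sorted(state.legal_moves(),
                        key=lambda m: policy_scores(state, m),
                        reverse=True)[:top_k]
    values = [policy_guided(state.play(m), depth - 1, evaluate,
                            policy_scores, top_k, not maximizing)
              for m in candidates]
    return max(values) if maximizing else min(values)
```

The pruned search visits far fewer positions, at the cost of depending entirely on the quality of the prior, which is roughly the trade-off described above.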

Gemini 3 Pro and Gemini 3 Flash currently have the top Elo ratings on the leaderboard. The models’ internal ‘thoughts’ reveal the use of strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This significant performance increase over the Gemini 2.5 generation highlights the rapid pace of model progress and demonstrates Game Arena’s value in tracking these improvements over time.
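For readers unfamiliar with Elo, the sketch below shows the standard update applied after one head-to-head game; Game Arena's exact rating methodology may differ, so treat this only as an illustration of how ratings move.

```python
# Textbook Elo update for a single head-to-head game. Game Arena's actual
# rating computation may differ; this only illustrates the idea.

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1 for an A win, 0.5 for a draw, 0 for a loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a 1600-rated model beats a 1500-rated one and gains about 11.5 points.
print(update(1600, 1500, 1.0))  # -> (approximately 1611.5, 1488.5)
```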

The Kaggle Game Arena Chess Leaderboard, showing "Gemini 3 Pro Preview" in first place.

Werewolf: navigating social deduction

Moving beyond the transparent logic of chess, we are expanding Kaggle Game Arena with Werewolf. This social deduction game is our first team-based game played entirely through natural language, requiring models to navigate imperfect information conveyed through dialogue. A team of "villagers" must work together to distinguish truth from deception and identify the hidden "werewolves" to win.

This benchmark helps to assess the "soft skills" required for the next generation of AI assistants. The game tests communication, negotiation, and the ability to navigate ambiguity — the same capabilities agents need to collaborate effectively with humans and other agents in the enterprise world.

Werewolf also serves as a secure environment for agentic safety research. Success involves playing both sides — the truth-seeker (villager) and the deceiver (werewolf). This allows us to test a model's ability to detect manipulation in others, while simultaneously red-teaming the model’s own capabilities around deception without the stakes of real-world deployment. This research is fundamental to building AI agents that act as reliable safeguards against bad actors.

Gemini 3 Pro and Gemini 3 Flash currently hold the top two positions on the leaderboard. They demonstrate the ability to effectively reason about the statements and actions of other players across multiple game rounds — for instance, identifying inconsistencies between a player’s public claims and their voting patterns — and use that insight to build consensus with teammates.
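As a purely illustrative example of that kind of cross-round bookkeeping, the snippet below (invented data format, not Game Arena code) flags players whose public accusations and actual votes diverge:

```python
# Invented data format, purely for illustration (not Game Arena code):
# flag players whose public accusation in a round differs from their vote.
from collections import defaultdict

def claim_vote_mismatches(claims: dict, votes: dict) -> dict:
    """claims[round][player] = who the player publicly accused;
    votes[round][player]  = who the player actually voted against."""
    flagged = defaultdict(list)
    for rnd, round_claims in claims.items():
        for player, accused in round_claims.items():
            voted = votes.get(rnd, {}).get(player)
            if voted is not None and voted != accused:
                flagged[player].append((rnd, accused, voted))
    return dict(flagged)

claims = {1: {"alice": "bob", "carol": "dave"}}
votes = {1: {"alice": "carol", "carol": "dave"}}
print(claim_vote_mismatches(claims, votes))
# -> {'alice': [(1, 'bob', 'carol')]}  alice accused "bob" but voted "carol"
```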

For a technical deep dive on how we measure model skill in Werewolf, head to the Kaggle blog.

The Game Arena Werewolf Leaderboard, with columns for Rank, Model, Equilibrium Rating, and Average Inference Cost per Game, evaluated on Jan 22, 2026.

Poker: the challenge of calculated risk

Chess relies on reasoning. Werewolf relies on social deduction. Poker introduces a new dimension: risk management. Like Werewolf, poker is a game of imperfect information. But here, the challenge isn't about building alliances — it's about quantifying uncertainty. Models must overcome the luck of the deal by inferring their opponents' hands and adapting to their playing styles to determine the best move.
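A toy example of the underlying risk arithmetic, with made-up numbers rather than anything from the benchmark: given an estimated probability of winning the hand, a model can weigh the pot it stands to win against the chips it must risk to call.

```python
# Toy risk arithmetic with made-up numbers (not Game Arena code).

def pot_odds(pot: float, bet_to_call: float) -> float:
    """Minimum win probability needed for a call to break even."""
    return bet_to_call / (pot + bet_to_call)

def call_ev(win_prob: float, pot: float, bet_to_call: float) -> float:
    """Expected chips gained by calling: win the pot with probability
    win_prob, otherwise lose the chips used to call."""
    return win_prob * pot - (1.0 - win_prob) * bet_to_call

# The pot holds 100 chips, the opponent bets 50 (pot is now 150), and we
# must pay 50 to call.
print(pot_odds(150, 50))       # 0.25: we need to win at least 25% of the time
print(call_ev(0.30, 150, 50))  # +10 chips on average if we win 30% of the time
```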

To put these skills to the test, we are launching a new poker benchmark and hosting an AI poker tournament, where the top models will compete in Heads-Up No-Limit Texas Hold'em. The final poker leaderboard will be revealed at kaggle.com/game-arena on Wednesday, Feb 4, following the conclusion of the tournament finals.

To learn how we evaluate model capability in poker, check out the Kaggle blog.

Watch the action

To mark the launch of these new and updated benchmarks, we have partnered with Chess Grandmaster Hikaru Nakamura and poker legends Nick Schulman, Doug Polk, and Liv Boeree to produce three livestreamed events with expert commentary and analysis across all three benchmarks.

Tune in to the three daily livestreams at 9:30 AM PT at kaggle.com/game-arena:

  • Monday, Feb 2: The top eight models on the poker leaderboard face off in the AI poker battle.
  • Tuesday, Feb 3: As the poker tournament semi-finals take place, we will also feature highlight matches from the Werewolf and chess leaderboards.
  • Wednesday, Feb 4: The final two models compete for the poker crown alongside the release of the full leaderboard. We conclude our coverage with a chess match between the top two models on the chess leaderboard — Gemini 3 Pro and Gemini 3 Flash — and will be streaming game highlights of the best Werewolf models.

Explore the arena

Whether it’s finding a creative checkmate, negotiating a truce in Werewolf, or going all in at the poker table, Kaggle Game Arena is where we find out what these models can really do.

Check it out at kaggle.com/game-arena.
