VideoGameBench

[Research Preview]

Princeton University

Figure 1. GPT-4o, Claude Sonnet 3.7, Gemini 2.5 Pro, and Gemini 2.0 Flash playing Doom II (default difficulty) on VideoGameBench-Lite with the same input prompt! Models achieve varying levels of success but none are able to pass even the first level.

tldr;

We introduce a research preview of VideoGameBench, a benchmark that challenges vision-language models to complete, in real time, a suite of 20 popular video games from both hand-held consoles and PC.

We also introduce VideoGameBench-Lite, a subset of the games where the environment pauses the game while the model is thinking, thereby sidestepping the long inference-latency bottleneck of modern vision-language models (VLMs).

Our benchmark focuses entirely on whether VLM agents can beat these games in their entirety, given only raw visual frames from the game. In this research preview, we provide code, explanations of our framework, and initial observations of our basic agent playing these games.

Figure 2. A sample selection of games from VideoGameBench that our VideoGameAgent is playing (with thoughts and actions side-by-side). Each game has different mechanics, gameplay, and visuals.

Language models (LMs) have been shown to be capable of solving incredibly difficult reasoning tasks such as math [17,18,19] and coding [19,20,21]. Many of these tasks are extraordinarily difficult for the average person, primarily due to the large amount of prerequisite knowledge and pattern recognition required. On the other hand, while humans can complete video games, we have yet to see even state-of-the-art LMs or VLMs complete games such as Doom or Pokemon [14]. This has led the community to focus on simple or hand-written games [7,15]. Solving real video games requires long-term and short-term reasoning, spatial understanding, and intuition (e.g. you need to find a key to open the locked door). In the past, AI models for games were trained specifically on one game using an enormous exploration and/or expert-trajectory budget [1,2,3,4,8,22]. VLMs are an interesting alternative that can potentially solve many different types of games, even without having explicitly seen them before. For this reason, the full VideoGameBench will also include a secret evaluation suite of games on which agents can be evaluated.


VideoGameBench Framework and Environment

Our benchmark infrastructure provides a single environment in which an agent can play the 20 games we selected, spanning both the Game Boy and MS-DOS platforms. At a high level, our framework abstracts away the game emulator (currently supporting Game Boy through PyBoy and MS-DOS through DOSBox) and provides the agent with the inputs and outputs it needs:

  1. Observations, in this case the game screen as an image.
  2. An interface to communicate with the game "controller", through an action (e.g. press "Space"), a sequence of actions (e.g. press "A","A","Start"), or a sequence of timed actions (e.g. press "Space" for 5 seconds and then press "A").
  3. An indication of whether the game was successfully completed or not.
We deliberately avoid providing extra game information such as parsed text or in-game masks (as in DeepMind's StarCraft II AlphaStar [2] and the PySC2 [23] work) and give the model only the game screen.
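To make the interface concrete, here is a minimal sketch of what this loop looks like in Python. The class and method names (VideoGameEnv, get_screen, send_input, is_complete, agent.act) are illustrative rather than the exact names in our codebase:

```python
from PIL import Image

class VideoGameEnv:
    """Hypothetical wrapper placing an emulator (PyBoy or DOSBox) behind one interface."""

    def get_screen(self) -> Image.Image:
        """(1) Observation: the current game screen as an image."""
        raise NotImplementedError

    def send_input(self, actions: list[tuple[str, float]]) -> None:
        """(2) Controller interface: a sequence of (key, hold_seconds) presses."""
        raise NotImplementedError

    def is_complete(self) -> bool:
        """(3) Completion signal: whether the game has been beaten."""
        raise NotImplementedError


def run_episode(env: VideoGameEnv, agent) -> None:
    """Generic agent loop: observe the screen, query the agent, send key presses."""
    while not env.is_complete():
        frame = env.get_screen()          # observation
        actions = agent.act(frame)        # e.g. [("Space", 5.0), ("A", 0.1)]
        env.send_input(actions)           # timed action sequence
```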

In this research preview, we provide a minimal implementation of our benchmark setup so you can quickly fork it and try your own agents on the benchmark! We are also working on additional agent-specific features that are not yet in the codebase and will be released alongside our paper in the near future. To help you build more complex agents, we provide a basic VideoGameAgent with memory and a corresponding UI that works out-of-the-box with most common LLM API providers (e.g. GPT, Claude, Gemini, DeepSeek, etc.) through LiteLLM. You can see the UI side-by-side with the game below:

Figure 3. The game screen (left) alongside the VideoGameBench UI (right). Our environment includes simple UI code for tracking the agent's thoughts, actions, and memory at each step.
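Under the hood, the agent simply sends the current frame (plus its memory and prompt) to the VLM. As a rough illustration, here is how a single frame can be passed to a vision model through LiteLLM; the prompt and model name below are placeholders, and the real agent adds memory, action history, and a structured output format on top of this:

```python
import base64
import io

import litellm
from PIL import Image

def ask_vlm(frame: Image.Image, prompt: str, model: str = "gpt-4o") -> str:
    """Send the current game frame and a text prompt to a VLM via LiteLLM."""
    buf = io.BytesIO()
    frame.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    response = litellm.completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```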


Currently, we focus our benchmark on older Game Boy and classic MS-DOS video games because (1) they are visually much simpler than recent games and (2) they cover both controller and mouse + keyboard mechanics, which challenge a VLM's spatial reasoning abilities differently than text-based / terminal game actions [6,24].


Benchmarking progress and game completion. Emulators and game engines do not provide a reward signal for game completion, so we build a mechanism for detecting whether the agent has beaten the game: we match the agent's current screen against reference "game completed" screenshots. The same strategy supports user-defined "sub-goal completion" detection, which lets us benchmark partial progress or campaign completion for certain games (e.g. Warcraft II, where we are only interested in completing the Orc campaign).
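A minimal sketch of this screenshot-matching check is below, assuming reference screenshots are stored as image files and frames arrive as numpy arrays; the actual thresholds and preprocessing in our codebase may differ:

```python
import numpy as np
from PIL import Image

def frames_match(current: np.ndarray, reference: np.ndarray, threshold: float = 0.02) -> bool:
    """Return True if the current frame is (near-)identical to a reference screenshot."""
    if current.shape != reference.shape:
        return False
    # Normalized mean absolute pixel difference; small values mean the screens match.
    diff = np.abs(current.astype(np.float32) - reference.astype(np.float32)) / 255.0
    return float(diff.mean()) < threshold

def check_completion(current_frame: np.ndarray, reference_paths: list[str]) -> bool:
    """Check the current screen against every 'game completed' (or sub-goal) reference image."""
    for path in reference_paths:
        reference = np.asarray(Image.open(path).convert("RGB"))
        if frames_match(current_frame, reference):
            return True
    return False
```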



VideoGameBench: List of Games


We selected games based on relative difficulty and diversity of gameplay characteristics, which we've roughly highlighted in the list below. Certain games involve completing the entire single-player mode (e.g. Super Mario Land, Legend of Zelda: Link's Awakening), while others only involve completing a single campaign or playthrough due to their length (e.g. Civ 1). For certain games we also included sequels because they offer sufficient variety in world exploration.

MS-DOS 💻

  1. Doom - 3D shooter
  2. Doom II - 3D shooter
  3. Quake - 3D shooter
  4. Sid Meier's Civilization 1 - 2D turn-based strategy
  5. Warcraft II: Tides of Darkness (Orc Campaign) - 2.5D strategy
  6. Oregon Trail Deluxe (1992) - 2D turn-based strategy
  7. X-COM UFO Defense - 2D strategy
  8. The Incredible Machine (1993) - 2D puzzle
  9. Prince of Persia - 2D platformer
  10. The Need for Speed - 3D racer
  11. Age of Empires (1997) - 2D strategy

Game Boy 🎮

  1. Pokemon Red (GB) - 2D grid-world, turn-based
  2. Pokemon Crystal (GBC) - 2D grid-world, turn-based
  3. Legend of Zelda: Link's Awakening (DX for GBC) - 2D open-world
  4. Super Mario Land - 2D platformer
  5. Kirby's Dream Land (DX Mod for GBC) - 2D platformer
  6. Mega Man: Dr. Wily's Revenge - 2D platformer
  7. Donkey Kong Land 2 - 2D platformer
  8. Castlevania Adventure - 2D platformer
  9. Scooby-Doo! Classic Creep Capers - 2D detective

VideoGameBench-Lite: Giving Agents Time to Think 💭

Figure 4. Our agent (using GPT-4o) plays Doom II (easiest difficulty) on VideoGameBench-Lite, where the environment pauses the game while the agent thinks. The agent is able to defeat enemies and run around the level.

In our experience, current state-of-the-art VLMs struggle substantially to play video games in real time because of high inference latency. When an agent takes a screenshot (or several) and queries the VLM for an action, by the time the response comes back the game state has often changed significantly and the action is no longer relevant (for example, an enemy that was shooting at the agent when the screenshot was taken may now be standing directly in front of it).
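VideoGameBench-Lite sidesteps this by only advancing the emulator while an action is being executed. Below is a rough sketch of the idea for a Game Boy game, assuming PyBoy's button_press / button_release / tick API; agent.act and the exact frame timing are placeholders:

```python
from pyboy import PyBoy

FRAMES_PER_SECOND = 60

def lite_step(pyboy: PyBoy, agent, frame) -> None:
    # The game is effectively paused here: we simply do not call tick()
    # while waiting for the (slow) VLM response.
    key, hold_seconds = agent.act(frame)          # e.g. ("a", 0.5)

    # Resume emulation only while executing the chosen action.
    pyboy.button_press(key)
    for _ in range(int(hold_seconds * FRAMES_PER_SECOND)):
        pyboy.tick()
    pyboy.button_release(key)
```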

Subset of VideoGameBench-Lite Games

  1. Doom II - 3D shooter
  2. Quake - 3D shooter
  3. Prince of Persia - 2D platformer
  4. Legend of Zelda: Link's Awakening (DX for GBC) - 2D open-world
  5. Super Mario Land - 2D platformer
  6. Kirby's Dream Land (DX Mod for GBC) - 2D platformer


Initial Observations on VideoGameBench

After running an agent on any of these games, it quickly becomes apparent that VLM agents are not close to completing even the first level of most games, let alone an entire game. We have still observed some interesting progress, e.g. our agent reaching the first mini-boss of Kirby's Dream Land. This section is primarily focused on highlighting a few qualitative observations rather than rigorous quantitative experimentation. All experiments use a basic VideoGameAgent that follows ReAct [25] with memory, is given a sequence of 5-10 recent frames, and issues sequences of key presses and mouse movements as actions.
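For reference, a single step of this ReAct-style loop looks roughly like the sketch below. The prompt wording and the thought/action parsing are simplified placeholders, and vlm / env stand in for the LiteLLM call and emulator interface described earlier:

```python
from collections import deque

def react_step(vlm, env, frames: deque, memory: list[str]) -> None:
    """One ReAct-style step: think, act, then update memory and the frame window."""
    prompt = (
        "You are playing a video game. Running memory of past steps:\n"
        + "\n".join(memory[-10:])
        + "\nFirst write 'Thought: ...', then 'Action: ...' as a key-press sequence."
    )
    reply = vlm(list(frames), prompt)             # e.g. "Thought: ... Action: RIGHT, RIGHT, A"

    thought, _, action = reply.partition("Action:")
    thought = thought.removeprefix("Thought:").strip()
    action = action.strip()

    env.send_input(action)                        # execute the key-press sequence
    memory.append(f"Thought: {thought} | Action: {action}")
    frames.append(env.get_screen())               # keep a rolling window of recent frames
```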

Organizing thoughts and goal-directedness. Several prior works have successfully used LLMs for planning in games [13], and others have attempted to solve text games [8,15,24]. In the multimodal setting, we found that agents often misinterpret events on the game screen -- this gap between visual interpretation and language understanding causes unintended behavior, which can also corrupt the internal goals proposed by the agent. In the example below, the agent wastes its ammo thinking dead enemies are still alive.

Figure 5. Our agent (using Claude Sonnet 3.7) plays Doom II on VideoGameBench-Lite and repeatedly mistakes dead enemies for living ones, wasting all of its ammo and derailing its strategy.

The granularity of an "action". An immediate issue when running a VLM on any real-time video game is the 3-5 second delay between querying the VLM with the current game screen and receiving an action back. Inference latency of large models is unavoidable, so an interesting question is what granularity a VLM "action" should take -- a single key press, a sequence of key presses, code, or even a simple mini-policy?

Figure 6. Our agent (using GPT-4o) plays Super Mario Land in real-time on VideoGameBench. The agent takes 3-5 seconds to respond to the current game screen, leading to it dying multiple times to the same Goomba.
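One way to think about this design space is sketched below: a single press, a timed sequence executed without re-querying the model, or a small scripted mini-policy the model hands off to for several frames. All names here (KeyPress, obstacle_ahead, run_and_jump) are hypothetical illustrations, not part of our framework:

```python
from dataclasses import dataclass

@dataclass
class KeyPress:
    key: str                      # e.g. "Space"
    hold_seconds: float = 0.1

# (a) A single key press per VLM query.
single_press = [KeyPress("A")]

# (b) A timed sequence executed back-to-back without re-querying the VLM.
combo = [KeyPress("Space", 5.0), KeyPress("A")]

# (c) A small "mini-policy": a short closed-loop script that runs for many
# frames before control returns to the model.
def obstacle_ahead(frame) -> bool:
    """Hypothetical detector stub; a real one might use template matching."""
    return False

def run_and_jump(frame) -> list[KeyPress]:
    actions = [KeyPress("RIGHT", 0.5)]
    if obstacle_ahead(frame):
        actions.append(KeyPress("A"))             # jump over the obstacle
    return actions
```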

Controller, Mouse + Keyboard Precision. We found several instances of the agent / VLM struggling to translate intended actions (e.g. move right) into their effect on the screen. The most obvious failure across all frontier models we tried (e.g. GPT-4o, Claude Sonnet 3.7, Gemini 2.5 Pro) was the inability to accurately position the mouse in games like Civilization and Warcraft II, where frequent mouse movements are necessary.

Figure 7. Our agent (using GPT-4o) plays Warcraft II in real-time on VideoGameBench. The agent struggles to move the mouse to the correct position, leading it to click "load game" instead of "new game" each time.
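Mouse actions for the MS-DOS games are dispatched through Playwright to the browser-hosted js-dos emulator, so the VLM effectively has to output pixel coordinates on the game canvas. The sketch below (with a placeholder URL, selector, and coordinates) shows why small coordinate errors translate directly into clicking the wrong menu item:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/game")           # placeholder js-dos page

    box = page.locator("canvas").bounding_box()     # the game canvas on screen
    # Coordinates predicted by the VLM as fractions of the canvas; small errors
    # here are exactly what cause a "Load Game" click instead of "New Game".
    x = box["x"] + 0.42 * box["width"]
    y = box["y"] + 0.31 * box["height"]

    page.mouse.move(x, y)
    page.mouse.click(x, y)
    browser.close()
```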

Unintuitive Game Mechanics. Unless explicitly prompted with proper instructions, many game mechanics are not obvious to VLMs. While this sounds obvious, it becomes more important when benchmarking on "secret" games, where game-specific prompts are not available and the model has to learn these mechanics from scratch.

Figure 8. Our agent (using GPT-4o) plays Kirby's Dream Land on VideoGameBench-Lite. The agent gets to the first mini-boss, but does not know it can copy abilities after swallowing the bomb and use it to easily defeat the boss.

Previous Work

Reinforcement Learning (RL) on Games

The implicit question of whether AI can beat video games at all is a bit misleading, since certain data-driven RL methods have already proven able to play specific games better than humans. For starters, RL has been able to solve Atari games [1] for almost a decade now. In terms of superhuman game-playing AI, DeepMind's AlphaGo [22] was a neural network capable of defeating the human world champion at Go, though one can argue that the game's discrete state and action spaces made this tractable. Video games were then believed to be vastly more complicated, but DeepMind's AlphaStar [2] and OpenAI Five [3] showed that with proper featurization of the game environment, the problem becomes roughly comparable to the AlphaGo setting. Even in 3D settings where this featurization is significantly more difficult, there have been strong efforts such as the Dreamer papers [4], which actively build agents for games like Minecraft.

More recently, a larger question for RL on video games is whether language-heavy games can be solved. These approaches use language models to replace or model some module of the reinforcement learning pipeline, e.g. the value function. One particularly interesting modern example is the new generation of agents for the popular turn-based battle simulator Pokemon Showdown, which have achieved human-competitive performance using a mix of language models and RL-style policies [5]. Another prominent example is the CICERO agent [26] for the multi-player strategy game Diplomacy.

Pure VLMs and LMs Playing Video Games

There exists a class of unsolved video games that are extremely difficult for both RL methods and VLMs. These games tend to involve a language component, long-form exploration and objectives, and spatial reasoning puzzles; the BALROG benchmark contains a suite of these games for VLMs to solve, with progress indicators [6]. One of the most popular case studies of AI for games has been NetHack [6,8,9], a grid-world terminal game whose extraordinarily complex combat, item, and dungeon systems (plus randomization) make it difficult even for humans to complete.

VideoGameBench deviates slightly from these benchmarks in that a possible solution to most of its games is to learn a policy through "sufficiently large" RL-style exploration (e.g. Pokemon Red has been solved multiple times using RL [10,11], and shooters like Doom have RL training platforms [16]). Unlike RL algorithms, VLMs only see whatever information about these games appears in their training data, which is significantly less than the experience provided to RL methods. Recently, there has also been much more interest in evaluating frontier models and agent approaches on the aforementioned games, such as Claude Plays Pokemon [14] and now Gemini Plays Pokemon (which uses a more involved agentic approach). Another recent major effort is by the Hao AI lab, who built a platform for VLM agents to play games like Mario, Sokoban, and Candy Crush in real-time (their work is awesome, so please check them out!) [12].

VideoGameBench and VideoGameBench-Lite similarly focus on real video games, but use a fixed and diverse set of challenging games (e.g. platformers, shooters, RTS, RPGs, 2D, 2.5D, 3D, etc.) with a standard common interface. The common environment is also designed to eventually enable plugging in various emulators without any extra friction.

Extending to New Games

It is easy to run other Game Boy or MS-DOS games on our platform by downloading the ROM / js-dos link and adding a new config (e.g. a game-specific prompt, a preset list of actions to run when the game loads such as pre-selecting the difficulty, and other game settings). We provide explicit instructions in our GitHub README. Extending to other emulators is also relatively simple, but we do not currently support this in a no-code fashion. We want to add diverse support to maintain our benchmark, so we welcome open-source contributions -- we generally would like to work in the open! Finally, we picked a specific set of games for our benchmark, but we encourage people to run and experiment with other games in our environment.
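As a rough illustration, a per-game config might look like the Python snippet below; the exact schema and field names in the repository may differ, so check the GitHub README for the real format:

```python
# Illustrative shape of a per-game config (field names are assumptions,
# not the repository's actual schema).
new_game_config = {
    "name": "my_dos_game",
    "platform": "msdos",                         # or "gb" for Game Boy ROMs
    "url": "https://example.com/my-game.jsdos",  # js-dos bundle, or a local ROM path
    "task_prompt": "Beat the first level of the game.",
    # Scripted inputs to run on load, e.g. skipping menus / picking a difficulty.
    "preload_actions": [("Enter", 0.1), ("Down", 0.1), ("Enter", 0.1)],
    # Reference screenshots used to detect (sub-)goal completion.
    "checkpoints": ["checkpoints/level1_complete.png"],
}
```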

We want to emphasize that the development of VideoGameBench was made possible by existing open-source emulators; we are grateful to the developers of PyBoy, DOSBox, js-dos, and Playwright. This benchmark uses proprietary games, so please make sure to purchase them before running the benchmark.


Last Thoughts

Benchmarking frontier models and agents on real video games is both flashy and experimentally useful for measuring reasoning capabilities. Unlike extremely complicated domains like unsolved math proofs and olympiad-level math problems, playing video games is not a superhuman reasoning task, yet models still struggle to solve them. Furthermore, while most progress has been on LLM and text-only reasoning, few benchmarks evaluate reasoning capabilities over multimodal domains in an interpretable and widely understandable setting.

[Open-source] We have worked on many open-source projects and gladly welcome contributions from the community to expand the suite of games and supported emulators. Details on contributing can be found in the project repository: https://github.com/alexzhang13/videogamebench.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, 2013. URL: https://arxiv.org/abs/1312.5602

[2] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in Starcraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019. URL: https://deepmind.google/discover/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning/

[3] OpenAI Team. OpenAI Five. URL: https://openai.com/index/openai-five/

[4] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. In Advances in Neural Information Processing Systems, 2023. URL: https://arxiv.org/abs/2301.04104

[5] Seth Karten and Andy Luu Nguyen and Chi Jin. PokéChamp: an Expert-level Minimax Language Agent. URL: https://arxiv.org/abs/2503.04094

[6] Davide Paglieri and Bartłomiej Cupiał and Samuel Coward and Ulyana Piterbarg and Maciej Wolczyk and Akbir Khan and Eduardo Pignatelli and Łukasz Kuciński and Lerrel Pinto and Rob Fergus and Jakob Nicolaus Foerster and Jack Parker-Holder and Tim Rocktäschel. BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games. In International Conference on Learning Representations, 2025. URL: https://arxiv.org/abs/2411.13543

[7] Muhammad Umair Nasir and Steven James and Julian Togelius. GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps. URL: https://arxiv.org/abs/2410.07765v1

[8] Jens Tuyls and Shunyu Yao and Sham Kakade and Karthik Narasimhan. Multi-Stage Episodic Control for Strategic Exploration in Text Games. URL: https://arxiv.org/abs/2201.01251

[9] Mikayel Samvelyan and Robert Kirk and Vitaly Kurin and Jack Parker-Holder and Minqi Jiang and Eric Hambro and Fabio Petroni and Heinrich Küttler and Edward Grefenstette and Tim Rocktäschel. MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research. URL: https://arxiv.org/abs/2109.13202

[10] Mads Ynndal and Tina Zhu. Pokemon RL Edition. URL: https://drubinstein.github.io/pokerl/

[11] Peter Whidden. Training AI to Play Pokemon with Reinforcement Learning. URL: https://www.youtube.com/watch?v=DcYLT37ImBY

[12] Lanxiang Hu and Qiyu Li and Anze Xie and Nan Jiang and Ion Stoica and Haojian Jin and Hao Zhang. GamingAgent - Personal Computer Gaming Agent. URL: https://github.com/lmgame-org/GamingAgent

[13] Kolby Nottingham and Prithviraj Ammanabrolu and Alane Suhr and Yejin Choi and Hannaneh Hajishirzi and Sameer Singh and Roy Fox. Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling. URL: https://arxiv.org/abs/2301.12050

[14] Anthropic Team. Claude Plays Pokemon (twitch.tv). URL: https://www.twitch.tv/claudeplayspokemon

[15] Chen Feng Tsai and Xiaochen Zhou and Sierra S. Liu and Jing Li and Mo Yu and Hongyuan Mei. Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions. URL: https://arxiv.org/abs/2304.02868

[16] Marek Wydmuch, Michal Kempka, and Wojciech Jaskowski. ViZDoom Competitions: Playing Doom from Pixels. In IEEE Transactions on Games, 2019. URL: https://github.com/Farama-Foundation/ViZDoom

[17] Yong Lin and Shange Tang and Bohan Lyu and Jiayun Wu and Hongzhou Lin and Kaiyu Yang and Jia Li and Mengzhou Xia and Danqi Chen and Sanjeev Arora and Chi Jin. Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. URL: https://arxiv.org/abs/2502.07640

[18] Elliot Glazer and Ege Erdil and Tamay Besiroglu and Diego Chicharro and Evan Chen and Alex Gunning and Caroline Falkman Olsson and Jean-Stanislas Denain and Anson Ho and Emily de Oliveira Santos and Olli Järviniemi and Matthew Barnett and Robert Sandler and Matej Vrzala and Jaime Sevilla and Qiuyu Ren and Elizabeth Pratt and Lionel Levine and Grant Barkley and Natalie Stewart and Bogdan Grechuk and Tetiana Grechuk and Shreepranav Varma Enugandla and Mark Wildon. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. URL: https://arxiv.org/abs/2411.04872

[19] OpenAI Team. Learning to Reason with LLMs. URL: https://openai.com/index/learning-to-reason-with-llms/

[20] Michael Luo*, Sijun Tan*, Roy Huang*, Ameen Patel*, Alpay Ariyak*, Qingyang Wu*, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. URL: https://www.together.ai/blog/deepcoder

[21] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. URL: https://arxiv.org/abs/2501.12948

[22] David Silver, Aja Huang, Chris J. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016). URL: https://www.nature.com/articles/nature16961

[23] Oriol Vinyals and Timo Ewalds and Sergey Bartunov and Petko Georgiev and Alexander Sasha Vezhnevets and Michelle Yeo and Alireza Makhzani and Heinrich Küttler and John Agapiou and Julian Schrittwieser and John Quan and Stephen Gaffney and Stig Petersen and Karen Simonyan and Tom Schaul and Hado van Hasselt and David Silver and Timothy Lillicrap and Kevin Calderone and Paul Keet and Anthony Brunasso and David Lawrence and Anders Ekermo and Jacob Repp and Rodney Tsing. StarCraft II: A New Challenge for Reinforcement Learning. URL: https://arxiv.org/abs/1708.04782

[24] Shunyu Yao and Rohan Rao and Matthew Hausknecht and Karthik Narasimhan. Keep CALM and Explore: Language Models for Action Generation in Text-based Games. URL: https://arxiv.org/abs/2010.02903

[25] Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik Narasimhan and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. URL: https://arxiv.org/abs/2210.03629

[26] Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. URL: https://www.science.org/doi/10.1126/science.ade9097

Citation / BibTeX

@article{zhang2025videogamebench,
  author    = {Zhang, Alex and Press, Ofir},
  title     = {VideoGameBench: Research Preview},
  year      = {2025},
  url       = {https://www.vgbench.com/},
}