Teaching AI agents to ask better questions by playing “Battleship”

CSAIL and SEAS scholars added a twist by reframing the game around asking and answering natural language questions. In their “Collaborative Battleship” game, one participant is a “captain” who inquires about where hidden ships are, while their teammate plays the “spotter” by responding to those questions in real time.

The researchers first had over 40 humans play the game together, collecting their questions and yes-no answers to build the “BattleshipQA” dataset. These results were a helpful point of comparison when the team tested state-of-the-art LMs (like GPT-5) and smaller models (like Llama 4 Scout) on their game. Without training the models beforehand, they found that top LMs can “beat” humans at “Battleship” — that is, complete the game in fewer turns — but smaller systems are far less rational.

The chief issue was that many models are simply not adept at coming up with useful questions. To get LMs to inquire in ways that reveal more information about hidden ships, the researchers gave each model a Monte Carlo inference strategy, which carefully measures the likelihood of different options being correct with each response. The result: AI models that can beat regular players at “Battleship,” regardless of scale.
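As a rough illustration of what an informative question looks like, the sketch below scores two candidate yes/no questions by sampling possible boards and checking how evenly each question splits them. This is not the researchers' implementation: the toy 4x4 board, the single two-cell ship, the helper names (`sample_board`, `expected_info`), and the entropy-based score are all assumptions chosen for brevity.

```python
import math
import random

def entropy_bits(p_yes: float) -> float:
    """Entropy, in bits, of a yes/no answer that is 'yes' with probability p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

def sample_board(rng, size=4):
    """Hypothetical toy board: one 2-cell horizontal ship on a size x size grid."""
    row = rng.randrange(size)
    col = rng.randrange(size - 1)
    return frozenset({(row, col), (row, col + 1)})

def expected_info(question, boards):
    """Monte Carlo estimate of how much a truthful yes/no answer would reveal."""
    p_yes = sum(question(board) for board in boards) / len(boards)
    return entropy_bits(p_yes)

rng = random.Random(0)
boards = [sample_board(rng) for _ in range(2000)]

# Two candidate questions, each written as a predicate over a sampled board.
questions = {
    "Is the ship in the top half?": lambda b: all(r < 2 for r, _ in b),
    "Is there a ship segment at (0, 0)?": lambda b: (0, 0) in b,
}
scores = {text: expected_info(q, boards) for text, q in questions.items()}
best_question = max(scores, key=scores.get)
print(best_question)
```

In this toy setup, a question that splits the sampled boards roughly in half is worth close to one bit, while a question about a single cell is usually answered “no” and reveals much less.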

Perhaps the most striking results were Llama 4 Scout’s gains. As a relatively small LM, it only beat humans 8 percent of the time. But with refinements to its inference strategy, the model reached a “Battleship” win rate of 82 percent versus humans. This careful and efficient style of asking questions also enabled the model to outpace a frontier model (GPT-5), while operating at around 1 percent of its cost.

On top of this improvement, the researchers shrank the gap between humans and LMs in answering questions. While GPT-5 was a reliable spotter that helped models finish games faster, smaller systems had a bad habit of giving the wrong answers about where ships were hidden. The models saw an accuracy boost of 15 percent on average when they began converting questions into code that explicitly tells them how to verify their answers (for example, having the model run a quick search of an area when asked if a ship was there).

“Today’s language models are primarily optimized to answer complex queries, but it’s less clear whether they learn to ask good questions for themselves,” says MIT PhD student and CSAIL researcher Gabriel Grand SM ’23, who is a lead author on a paper about the work. “Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more efficiently.”

A sea change for LMs

The team’s first focus was getting LMs to ask better questions. By implementing Monte Carlo inference strategies, the LMs reason about potential guesses as individual particles. The ones that appear more valid with each answer from the spotter would be weighted more heavily, sort of like game balls that inflate or deflate each turn. With this more calculated, adaptive approach, the captain could make inquiries that extracted considerably more info from the spotter.
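The inflating and deflating of particles can be pictured with a minimal sketch like the one below. It is an illustration under stated assumptions rather than the paper's method: the 4x4 grid, the single two-cell ship, the `reweight` helper, and the `noise` parameter (an assumed chance that the spotter answers incorrectly) are all hypothetical.

```python
import random

SIZE = 4  # hypothetical toy grid, much smaller than a real board

def sample_board(rng):
    """One particle: a guess at where a single 2-cell horizontal ship sits."""
    row, col = rng.randrange(SIZE), rng.randrange(SIZE - 1)
    return frozenset({(row, col), (row, col + 1)})

def reweight(particles, weights, question, answer, noise=0.1):
    """Scale each particle's weight by how well it agrees with the spotter's answer.

    `noise` is an assumed probability that the spotter's answer is wrong.
    """
    scaled = [
        w * ((1 - noise) if question(board) == answer else noise)
        for board, w in zip(particles, weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]  # renormalize so the weights sum to 1

rng = random.Random(1)
particles = [sample_board(rng) for _ in range(500)]
weights = [1 / len(particles)] * len(particles)

def in_row_zero(board):
    """Code form of the question: is any part of the ship in row 0?"""
    return any(r == 0 for r, _ in board)

# The spotter says "yes": particles with a ship in row 0 inflate, the rest deflate.
weights = reweight(particles, weights, in_row_zero, answer=True)
mass_in_row_zero = sum(w for b, w in zip(particles, weights) if in_row_zero(b))
print(round(mass_in_row_zero, 2))
```

Before the answer, about a quarter of the total weight sits on row-0 particles. After a “yes,” most of it does, and the captain's next question can be chosen against this updated picture.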

The scientists then turned to the widely used programming language Python to help out AI spotters. Each question the captain asked was automatically converted into an encoded command. For example, a question like, “Is there a ship in column one that spans two rows?” turns into instructions for the spotter LM to search the area in question and assess how wide the digital game piece is. By giving the model clear directions in a language it understands particularly well, each system gave correct answers considerably more often. The lightweight system GPT-4o-mini saw a nearly 30 percent performance bump, for instance, and even the large model Claude 4 Opus jumped about eight points.
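To give a sense of what such a translation might look like, here is a minimal sketch of the example question written as a small Python check. The board encoding, the `answer_question` name, and the choice to treat “column one” as index 0 are assumptions made for illustration. The article does not show the code the models actually generate.

```python
from collections import Counter

# Hypothetical board encoding: "." is water, a letter marks one ship's cells.
BOARD = [
    "A..B",
    "A..B",
    "...B",
    "CC..",
]

def answer_question(board):
    """Code form of: "Is there a ship in column one that spans two rows?"

    Column one is treated as index 0 here, which is an assumption about phrasing.
    """
    column = [row[0] for row in board]
    # Count how many rows of this column each ship occupies.
    rows_per_ship = Counter(cell for cell in column if cell != ".")
    return any(count == 2 for count in rows_per_ship.values())

print("yes" if answer_question(BOARD) else "no")  # prints "yes" (ship A)
```

Because the answer comes from running the check against the board rather than from the model's impression of it, the spotter has less room to misread the grid.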

“The field has seen a lot of success from ‘auto-formalization’ strategies, in which LMs generate code to verify their solutions,” says senior author Jacob Andreas, an MIT electrical engineering and computer science associate professor and CSAIL principal investigator. “What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs’ exploration and information gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving.”

But how would this approach fare in other board games? The team tested their newly equipped LMs at “Guess Who?”, where large and small models skillfully whittled down 100 options to correctly guess which hidden character had been chosen. Llama 4 Scout was successful 30 percent of the time, but after Grand and his colleagues’ tweaks, it completed the task on over 72 percent of its runs. Meanwhile, GPT-4o leapt from 62 percent to 90 percent. GPT-5 was the spotter in each game to ensure questions were answered as accurately as possible.

While LMs have made promising progress in both games, there’s room for improvement. For instance, the models still struggle to answer complex questions, compared to humans. OpenAI researcher, recent Harvard graduate, and coauthor Valerio Pepe adds that “GPT-5 can beat your average ‘Battleship’ player, and gets a hair better with our methods. However, expert players are still hard to beat for all models, unlike in chess, where even top players don’t succeed against AI systems.”

The researchers’ findings show that AI agents have untapped potential in “needle-in-a-haystack” discovery: navigating a massive space of options to find a rare solution to scientific challenges. While improved information-seeking skills would make them excellent research assistants with, say, identifying a compound’s molecular structure, the researchers caution that “Collaborative Battleship” is a somewhat simple test bed. They’d like to test LMs in more complex settings, where the systems have to consider far more options.

Grand also plans to have humans and AI models collaborate to study whether they work better together. The models might also benefit from a bit of fine-tuning on game simulations, and with more computing power, LMs would have more advanced inference capabilities to predict how a game will evolve.

“As AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time,” says Robert Hawkins, assistant professor of linguistics at Stanford University, who wasn’t involved in the paper. “This work elegantly captures these phenomena in a controlled collaborative setting, and makes a compelling case that the real bottleneck for AI agents isn’t just the calculation of optimal questions, but the pragmatic reasoning needed to make the most of their answers.”

Grand and Pepe wrote the paper with two CSAIL principal investigators: MIT Associate Professor Jacob Andreas and MIT Professor Joshua Tenenbaum. Their work was supported, in part, by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, a Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They showcased their paper as an oral presentation at the International Conference on Learning Representations (ICLR) in April.