Teaching AI models to say "I'm not sure"

Confidence is persuasive. In artificial intelligence systems, it is often misleading.

The technique, called RLCR (Reinforcement Learning with Calibration Rewards), trains language models to produce calibrated confidence estimates alongside their answers. In addition to coming up with an answer, the model thinks about its uncertainty in that answer, and outputs a confidence score. In experiments across multiple benchmarks, RLCR reduced calibration error by up to 90 percent while maintaining or improving accuracy, both on the tasks the model was trained on and on entirely new ones it had never seen. The work will be presented at the International Conference on Learning Representations later this month.

The problem traces to a surprisingly simple source. The reinforcement learning (RL) methods behind recent breakthroughs in AI reasoning, including the training approach used in systems like OpenAI's o1, reward models for getting the right answer and penalize them for getting it wrong. Nothing in between. A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance. Over time, this trains models to confidently answer every question they are asked, whether they have strong evidence or are effectively flipping a coin.

That overconfidence has consequences. When models are deployed in medicine, law, finance, or any setting where users make decisions based on AI outputs, a system that expresses high confidence regardless of its actual certainty becomes unreliable in ways that are difficult to detect from the outside. A model that says "I'm 95 percent sure" when it is right only half the time is more dangerous than one that simply gets the answer wrong, because users get no signal that they should seek a second opinion.

"The standard training approach is simple and powerful, but it gives the model no incentive to express uncertainty or say I don’t know," says Mehul Damani, an MIT PhD student and co-lead author on the paper. "So the model naturally learns to guess when it is unsure."

RLCR addresses this by adding a single term to the reward function: a Brier score, a well-established measure that penalizes the gap between a model's stated confidence and its actual accuracy. During training, models learn to reason about both the problem and their own uncertainty, producing an answer and a confidence estimate together. Confidently wrong answers are penalized. So are unnecessarily uncertain correct ones.
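As a rough illustration of how such a reward could be computed, here is a minimal Python sketch. It assumes the reward is a correctness term minus a Brier-style squared error between the stated confidence and the 0-or-1 outcome. The function name `rlcr_style_reward` is hypothetical, and the exact form and weighting used in the paper may differ.

```python
def rlcr_style_reward(is_correct: bool, confidence: float) -> float:
    """Illustrative reward: correctness minus a Brier-style penalty.

    Assumption (not confirmed by the article): the two terms are
    combined with equal weight.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")

    # 1.0 for a right answer, 0.0 for a wrong one (the usual binary reward).
    correctness = 1.0 if is_correct else 0.0

    # Brier-style term: squared gap between stated confidence and outcome.
    brier_penalty = (confidence - correctness) ** 2

    return correctness - brier_penalty


if __name__ == "__main__":
    for is_correct, confidence in [(True, 0.9), (True, 0.5), (False, 0.1), (False, 0.9)]:
        reward = rlcr_style_reward(is_correct, confidence)
        print(f"correct={is_correct}, confidence={confidence}: reward={reward:.2f}")
```

Under this assumed form, a correct answer stated with 0.5 confidence earns less than the same answer stated with 0.9 confidence, and a wrong answer stated with 0.9 confidence is punished far more than a wrong answer stated with 0.1 confidence.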

The math backs it up: the team proved formally that this type of reward structure guarantees models that are both accurate and well-calibrated. They then tested the approach on a 7-billion-parameter model across a range of question-answering and math benchmarks, including six datasets the model had never been trained on.

The results showed a consistent pattern. Standard RL training actively degraded calibration compared to the base model, making models worse at estimating their own uncertainty. RLCR reversed that effect, substantially improving calibration with no loss in accuracy. The method also outperformed post-hoc approaches, in which a separate classifier is trained to assign confidence scores after the fact. "What's striking is that ordinary RL training doesn't just fail to help calibration. It actively hurts it," says Isha Puri, an MIT PhD student and co-lead author. "The models become more capable and more overconfident at the same time."
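Calibration here means how closely a model's stated confidence tracks how often it is actually right. The article does not say which metric the team reported, so the following is only a sketch of one common choice, expected calibration error, which groups predictions into confidence bins and averages the gap between confidence and accuracy in each bin. The function name and the choice of 10 equal-width bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must be in [0, 1]")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(outcome for _, outcome in bucket) / len(bucket)
        # Weight each bin's gap by the share of predictions it holds.
        error += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return error


if __name__ == "__main__":
    # A model that always says 95 percent but is right half the time.
    overconfident = expected_calibration_error([0.95] * 4, [True, False, True, False])
    print(f"calibration error: {overconfident:.2f}")
```

In this toy case the gap is about 0.45, the difference between the stated 95 percent and the observed 50 percent accuracy. A perfectly calibrated model would score 0.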

The team also demonstrated that the confidence estimates produced by RLCR are practically useful at inference time. When models generate multiple candidate answers, selecting the one with the highest self-reported confidence, or weighting votes by confidence in a majority-voting scheme, improves both accuracy and calibration as compute scales.
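The two selection strategies described above can be sketched in a few lines of Python. This is an illustrative sketch, not the team's implementation: the function names are hypothetical, each candidate is assumed to be an (answer, confidence) pair, and ties are broken by whichever answer appears first.

```python
from collections import defaultdict


def highest_confidence_answer(candidates):
    """Return the answer whose self-reported confidence is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


def confidence_weighted_vote(candidates):
    """Majority vote in which each vote counts in proportion to its confidence."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence
    # The answer with the largest summed confidence wins.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical samples from one model on one question.
    samples = [("42", 0.9), ("41", 0.6), ("41", 0.5), ("40", 0.2)]
    print(highest_confidence_answer(samples))  # single most confident sample
    print(confidence_weighted_vote(samples))   # answer with most total confidence
```

The two strategies can disagree. In the sample data, the single most confident candidate is "42", while "41" wins the weighted vote because its two moderately confident samples add up to more total confidence.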

An additional finding suggests that the act of reasoning about uncertainty itself has value. The researchers trained classifiers on model outputs and found that including the model's explicit uncertainty reasoning in the input improved the classifier's performance, particularly for smaller models. The model's self-reflective reasoning about what it does and doesn’t know contains real information, not just decoration.

In addition to Damani and Puri, other authors on the paper are Stewart Slocum, Idan Shenfeld, Leshem Choshen, and senior authors Jacob Andreas and Yoon Kim.