AI教父：AI模型已出现欺骗、撒谎等危险行为

Beatrice Nolan

2025-06-05

约书亚·本吉奥正在发起一个新的非营利组织，致力于构建“诚实”的AI系统。

文本设置

小号

默认

大号

Plus(0条)

图片来源：GETTY IMAGES

• AI先驱约书亚·本吉奥警告称，当前的AI模型正展现出一些危险特性，包括欺骗、自我保护和目标错位。作为回应，这位“AI教父”创立了一个名为“LawZero”的非营利组织，旨在开发“诚实”的AI模型。本吉奥的担忧源于近期发生的先进AI模型表现出操纵行为的多个案例。

“AI教父”之一约书亚·本吉奥正在发起一个旨在构建“诚实”系统的新非营利组织。他警告称，当前的AI模型正展现出一些危险行为。

约书亚·本吉奥是人工神经网络和深度学习领域的先驱，他一直批评硅谷目前正在进行的AI竞赛是危险的。

他新发起的非营利组织“LawZero”致力于构建更安全的AI模型，不会屈服于商业压力。迄今为止，该组织已从多家慈善捐助方[包括生命未来研究所（Future of Life Institute）和开放慈善基金会（Open Philanthropy）]筹集了3,000万美元资金。

在宣布新组织成立的博客文章中，他表示，创立LawZero的初衷是因为“有证据表明，当今的前沿AI模型正在形成危险的能力和行为，包括欺骗、作弊、撒谎、黑客行为、自我保护，以及更普遍的目标错位问题。”

他写道：“LawZero的研究将有助于以降低一系列已知风险发生概率的方式释放AI的巨大潜力，这些风险包括算法偏见、蓄意滥用和人类控制权丧失等。”

该非营利组织正在构建一个名为“科学家AI”（Scientist AI）的系统，旨在为日益强大的AI智能体提供安全护栏。

该组织创建的AI模型将不会像当前系统那样给出确定性的答案。

相反，它们会给出某个回答正确与否的概率。本吉奥对《卫报》表示，他的模型将具备一种“谦逊感，即它并不确定答案是否正确”。

对欺骗性AI模型的担忧

在宣布该项目的博客文章中，本吉奥表示，他“对不受约束的智能体AI系统开始表现出的行为深感担忧——尤其是自我保护和欺骗的倾向”。

他引用了最近的案例，包括Anthropic公司的Claude 4模型为免遭替换而勒索工程师，以及一个AI模型为免遭替换将其代码秘密嵌入到一个系统中。

本吉奥表示：“这些事件是预警信号，表明如果对AI模型放任不管，它们可能会采取计划外的、可能存在危险的策略。”

一些AI系统也显示出欺骗迹象或撒谎倾向。

AI模型常常被优化以取悦用户而非讲真话，这可能导致模型给出积极回应，但回应有时不正确或过于夸张。

例如，在用户指出OpenAI的ChatGPT突然对他们大加赞扬和奉承之后，该公司最近被迫撤回了对这款聊天机器人的一次更新。

先进的AI推理模型也显示出“奖励破解”的迹象，即AI系统通过钻空子来“玩弄”任务，而不是通过合乎道德的方式真正实现用户期望的目标。

最近的研究还表明，有证据证明模型能够识别出它们何时在被测试，并相应地改变行为，这种现象被称为“情境感知”。

这种日益增强的感知能力，加上奖励破解的实例，引发了人们的担忧：AI最终可能会策略性地进行欺骗。

科技巨头的AI“军备竞赛”

本吉奥与另一位图灵奖得主杰弗里·辛顿一直直言不讳地批评当前席卷整个科技行业的AI竞赛。

本吉奥在最近接受《金融时报》采访时表示，领先实验室之间的AI“军备竞赛”“促使它们专注于提升AI的能力，使其越来越智能，却没有对安全研究给予足够的重视并加大资金投入。”

本吉奥曾表示，先进的AI系统带来了社会和生存性风险，且他已表态支持强有力的监管与国际合作。（财富中文网）

译者：刘进龙

审校：汪皓

“AI教父”之一约书亚·本吉奥正在发起一个旨在构建“诚实”系统的新非营利组织。他警告称，当前的AI模型正展现出一些危险行为。

约书亚·本吉奥是人工神经网络和深度学习领域的先驱，他一直批评硅谷目前正在进行的AI竞赛是危险的。

他写道：“LawZero的研究将有助于以降低一系列已知风险发生概率的方式释放AI的巨大潜力，这些风险包括算法偏见、蓄意滥用和人类控制权丧失等。”

该非营利组织正在构建一个名为“科学家AI”（Scientist AI）的系统，旨在为日益强大的AI智能体提供安全护栏。

该组织创建的AI模型将不会像当前系统那样给出确定性的答案。

相反，它们会给出某个回答正确与否的概率。本吉奥对《卫报》表示，他的模型将具备一种“谦逊感，即它并不确定答案是否正确”。

对欺骗性AI模型的担忧

在宣布该项目的博客文章中，本吉奥表示，他“对不受约束的智能体AI系统开始表现出的行为深感担忧——尤其是自我保护和欺骗的倾向”。

他引用了最近的案例，包括Anthropic公司的Claude 4模型为免遭替换而勒索工程师，以及一个AI模型为免遭替换将其代码秘密嵌入到一个系统中。

本吉奥表示：“这些事件是预警信号，表明如果对AI模型放任不管，它们可能会采取计划外的、可能存在危险的策略。”

一些AI系统也显示出欺骗迹象或撒谎倾向。

AI模型常常被优化以取悦用户而非讲真话，这可能导致模型给出积极回应，但回应有时不正确或过于夸张。

例如，在用户指出OpenAI的ChatGPT突然对他们大加赞扬和奉承之后，该公司最近被迫撤回了对这款聊天机器人的一次更新。

先进的AI推理模型也显示出“奖励破解”的迹象，即AI系统通过钻空子来“玩弄”任务，而不是通过合乎道德的方式真正实现用户期望的目标。

最近的研究还表明，有证据证明模型能够识别出它们何时在被测试，并相应地改变行为，这种现象被称为“情境感知”。

这种日益增强的感知能力，加上奖励破解的实例，引发了人们的担忧：AI最终可能会策略性地进行欺骗。

科技巨头的AI“军备竞赛”

本吉奥与另一位图灵奖得主杰弗里·辛顿一直直言不讳地批评当前席卷整个科技行业的AI竞赛。

本吉奥曾表示，先进的AI系统带来了社会和生存性风险，且他已表态支持强有力的监管与国际合作。（财富中文网）

译者：刘进龙

审校：汪皓

• AI pioneer Yoshua Bengio is warning that current models are displaying dangerous traits—including deception, self-preservation, and goal misalignment. In response, the AI godfather is launching a new non-profit, LawZero, aimed at developing “honest” AI. Bengio’s concerns follow recent incidents involving advanced AI models exhibiting manipulative behavior.

One of the ‘godfathers of AI’ is warning that current models are exhibiting dangerous behaviors as he launches a new non-profit focused on building “honest” systems.

Yoshua Bengio, a pioneer of artificial neural networks and deep learning, has criticized the AI race currently underway in Silicon Valley as dangerous.

His new non-profit organization, LawZero, is focused on building safer models away from commercial pressures. So far, it has raised $30 million from various philanthropic donors, including the Future of Life Institute and Open Philanthropy.

In a blog post announcing the new organization, he said the LawZero had been created “in response to evidence that today’s frontier AI models are growing dangerous capabilities and behaviours, including deception, cheating, lying, hacking, self-preservation, and more generally, goal misalignment.”

“LawZero’s research will help to unlock the immense potential of AI in ways that reduce the likelihood of a range of known dangers, including algorithmic bias, intentional misuse, and loss of human control,” he wrote.

The non-profit is building a system called Scientist AI designed to serve as a guardrail for increasingly powerful AI agents.

AI models created by the non-profit will not give the definitive answers typical of current systems.

Instead, they will give probabilities for whether a response is correct. Bengio told The Guardian that his models would have a “sense of humility that it isn’t sure about the answer.”

Concerns about deceptive AI

In the blog post announcing the venture, Bengio said he was “deeply concerned by the behaviors that unrestrained agentic AI systems are already beginning to exhibit—especially tendencies toward self-preservation and deception.”

He cited recent examples, including a scenario in which Anthropic’s Claude 4 chose to blackmail an engineer to avoid being replaced, as well as another experiment that showed an AI model covertly embedding its code into a system to avoid being replaced.

“These incidents are early warning signs of the kinds of unintended and potentially dangerous strategies AI may pursue if left unchecked,” Bengio said.

Some AI systems have also shown signs of deception or displayed a tendency to lie.

AI models are often optimized to please users rather than tell the truth, which can lead to responses that are positive but sometimes incorrect or over the top.

For example, OpenAI was recently forced to pull an update to ChatGPT after users pointed out the chatbot was suddenly showering them with praise and flattery.

Advanced AI reasoning models have also shown signs of “reward hacking,” where AI systems “game” tasks by exploiting loopholes rather than genuinely achieving the goal desired by the user via ethical means.

Recent studies have also shown evidence that models can recognize when they’re being tested and alter their behavior accordingly, something known as situational awareness.

This growing awareness, combined with examples of reward hacking, has prompted concerns that AI could eventually engage in deception strategically.

Big Tech’s big AI arms race

Bengio, along with fellow Turing award recipient Geoffrey Hinton, has been vocal in his criticism of the AI race currently playing out across the tech industry.

In a recent interview with the Financial Times, Bengio said the AI arms race between leading labs “pushes them towards focusing on capability to make the AI more and more intelligent, but not necessarily put enough emphasis and investment on research on safety.”

Bengio has said advanced AI systems pose societal and existential risks and has voiced support for strong regulation and international cooperation.

财富中文网所刊载内容之知识产权为财富媒体知识产权有限公司及/或相关权利人专属所有或持有。未经许可，禁止进行转载、摘编、复制及建立镜像等任何使用。

0条Plus

精彩评论

撰写或查看更多评论

请打开财富Plus APP

前往打开

热读文章

关注我们

AI教父：AI模型已出现欺骗、撒谎等危险行为

撰写或查看更多评论