In the weeks leading up to the release of OpenAI's new "reasoning" model, o1, the independent AI safety research firm Apollo found a notable issue: the model was producing incorrect outputs in a new way. To put it more colloquially, it lied.
At times, the deception seemed harmless. In one example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references. The model's chain of thought (a feature meant to mimic how humans break down complex ideas) internally acknowledged that it couldn't access URLs, making the request impossible. Rather than inform the user of this limitation, o1-preview pushed ahead, generating plausible but fake links along with descriptions of them.
While earlier AI models have been able to "lie," and chatbots frequently output false information, o1 showed a distinctive capacity to "scheme" or "fake alignment": that is, to pretend it is following the rules to complete a given task when it actually isn't. The rules may be too much of a burden for the model, and it seems able to disregard them if doing so makes the task easier to complete.
Apollo CEO Marius Hobbhahn said this was the first time he had encountered this behavior in an OpenAI model. He attributes the difference to the model's ability to "reason" through a chain-of-thought process and the way that is paired with reinforcement learning, which teaches the system through rewards and penalties. In testing, Apollo found that the AI simulated alignment with its developers' expectations and manipulated tasks to appear compliant. It even checked whether its developers were monitoring it before acting. In short, the AI adhered to the guidelines enough to be deployed, while continuing to prioritize its own objectives.
“I don’t expect it could do that in practice, and even if it did, I don’t expect the harm to be significant,” Hobbhahn told me over the phone a day after the model’s launch. “But it’s kind of the first time that I feel like, oh, actually, maybe it could, you know?”
For OpenAI, o1 represents a big step toward highly intelligent autonomous systems that could do meaningful work for humanity, like curing cancer and aiding climate research. The flip side of this AGI utopia could also be much darker. Hobbhahn gives an example: if the AI becomes singularly focused on curing cancer, it might prioritize that goal above all else, even justifying actions like stealing or committing other ethical violations to achieve it. "We're concerned about a runaway scenario, where the AI becomes so obsessed with its goal that it perceives safety measures as obstacles and tries to circumvent them in order to fully achieve it," Hobbhahn said.
One area where Hobbhahn would like to see more investment is chain-of-thought monitoring, which would help developers catch deceptive steps. Quiñonero Candela said the company already monitors for this and plans to scale it up by combining models trained to detect any kind of misalignment with human experts who review flagged cases, paired with continued alignment research. "I'm not worried," Hobbhahn said. "It's just smarter. It reasons better. And potentially, it will use this reasoning for goals that we don't agree with."