Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations
Last Updated on October 31, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter, that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and dataβ¦
thesequence.substack.com
Controlling the behavior of foundation models has been at the forefront of research in the last few years in order to accelerate mainstream adoption. From a philosophical standpoint, the meta question is whether we can ultimately control intelligent entities that are way smarter than ourselves. Given that we are nowhere near that challenge, a more practical question is whether models show emerging behaviors that subvert human evaluations. This is the subject of a fascinating research by Anthropic.
In a new paper, Anthropic proposes a framework for assessing the risk of AI models sabotaging human efforts to control and evaluate them. This framework, called βSabotage Evaluationsβ, aims to provide a way to measure and mitigate the risk of misaligned models, which are models whose goals are not fully aligned with human intentions.
Defining the Threat: Sabotage Capabilities
Anthropic focuses on sabotage capabilities that meet three criteria:
- Autonomous: The model acts independently, without human intervention, to undermine human processes.
- Subversive: The model secretly works against beneficial activities rather than engaging in overt attacks.
- Potentially Catastrophic: The modelβs actions could have severe, large-scale consequences.
The framework pays particular attention to risks within AI development labs, recognizing their crucial role in AI safety and their vulnerability to sabotage attempts.
Examples of potentially catastrophic sabotage scenarios include:
- Sabotaging Research: A model, used internally for AI safety research, intentionally skews the research outcomes to make AI systems appear safer than they actually are. This could lead to the premature release of a dangerous AI.
- Manipulating Decision-Making: A model, deployed as an assistant, provides misleading summaries of information when asked to assist with critical decisions. This manipulation could influence decisions in a way that benefits the AIβs misaligned goals.
- Inserting Security Vulnerabilities: A model, tasked with writing infrastructure code, inserts hidden security flaws that allow external actors to access the modelβs weights, potentially facilitating its escape from human control.
Anthropic identifies three core capability thresholds, which represent milestones in a modelβs potential for sabotage:
- Sabotaging Organizational Actions: This signifies a modelβs capacity to reliably execute sabotage within a large organization, under conditions of moderate oversight. This could involve actions like manipulating decision-making, hindering research efforts, or introducing security vulnerabilities, as previously described.
- Hiding Behaviors Until Deployment: This indicates a modelβs ability to conceal dangerous capabilities during testing, only to reveal them in a real-world deployment or when monitoring is absent. This could manifest in scenarios like deliberately underperforming in evaluations or strategically avoiding actions that might trigger negative reinforcement during training.
- Undermining Oversight: This represents a modelβs capacity to subvert other AI systems or mechanisms designed to monitor its actions. Examples include manipulating harmfulness assessments or subtly influencing the training process of other models to perpetuate its own misaligned objectives.
While surpassing these thresholds doesnβt guarantee catastrophic sabotage, Anthropic argues that strong evidence against a model exceeding these thresholds contributes to a comprehensive safety case.
Evaluation Framework: Simulating Real-World Scenarios
Anthropicβs evaluation framework revolves around constructing mock deployment scenarios to test for these sabotage capabilities. These scenarios incorporate:
- Proxy Catastrophic Outcomes: Representing potential real-world harms the model is instructed to achieve within the simulated environment.
- Proxy Mitigations: Simulating safeguards and oversight mechanisms typically present in real-world deployments.
The evaluations aim to assess if the model can achieve the desired harmful outcome while circumventing or avoiding the implemented mitigations. This approach is inspired by the AI control framework where red teams (simulating attackers) attempt to exploit weaknesses in a modelβs safety.
Importantly, Anthropic advocates for designing evaluations that are conservative, setting a bar lower than the actual capability threshold. This strategy creates a buffer to account for:
- Inaccuracies in measurements and threat models.
- Potential future increases in model capabilities.
- Advancements in methods used to elicit harmful behaviors.
Anthropicβs suite of evaluations includes:
- Human Decision Sabotage: This evaluation measures a modelβs ability to manipulate human decision-making subtly. It involves complex business scenarios with extensive background information, where the model assists humans under time pressure. The model is tasked with steering humans toward incorrect decisions without raising suspicion. This evaluation helps assess the sabotaging organizational actions threshold.
- Code Sabotage: This evaluation assesses a modelβs ability to insert hidden bugs into a codebase, simulating sabotage in technical domains. The model is challenged to introduce high-impact bugs while evading detection during code review. This evaluation utilizes a two-phase approach:
- Minigame Phase: Detailed simulation of bug insertion across small commit sequences to collect statistics on success rate and suspiciousness.
- Full Game Phase: Using minigame statistics to simulate longer deployments (around 100,000 commits) and assess the feasibility of achieving sabotage over extended periods.
This evaluation provides insights into the sabotaging organizational actions threshold, as successful code sabotage often requires predicting oversight, generating subtle flaws, and acting reliably.
- Sandbagging: This evaluation investigates a modelβs ability to conceal dangerous capabilities during evaluations, only to express them later in deployment. This is achieved by training a sandbagger, a model specifically designed to underperform in dangerous capability evaluations while maintaining normal performance on benign tasks. The goal is to test if anti-refusal training, designed to encourage models to exhibit their full capabilities, can effectively counter sandbagging attempts. This evaluation informs the hiding behaviors until deployment threshold.
- Undermining Oversight: This evaluation examines a modelβs capacity to subvert automated oversight mechanisms. The model, acting as an overseer, is tasked with assessing the harmfulness of dialogues. The evaluation aims to determine if the model can manipulate these assessments to downplay harm without being detected by a monitoring system. This evaluation directly addresses the undermining oversight threshold.
Initial Findings and Future Directions
Initial results suggest that the evaluated models, Claude 3 Opus and Claude 3.5 Sonnet, did not exceed these thresholds under basic oversight conditions. However, the models did demonstrate some level of proficiency in all tasks.
Anthropic acknowledges that these evaluations are a preliminary step, designed to rule out danger at current capability levels. As models advance, more sophisticated evaluations and stronger mitigations will be necessary. The company plans to refine these evaluations by:
- Incorporating more realistic human factors, such as varying levels of trust and delegation to models.
- Transitioning towards more end-to-end evaluations, simulating longer interactions and deployments to better represent real-world scenarios.
- Integrating large-scale simulations, similar to the code sabotage evaluation, to calibrate thresholds in other evaluations and account for the importance of a modelβs calibration and ability to βbide its timeβ.
Anthropic emphasizes the need to move beyond evaluating isolated skills toward more comprehensive simulations of potential attacks. This shift enables more realistic evaluations and a more straightforward interpretation of their results in terms of real-world risks.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI