Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations

Last Updated on October 31, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

I recently started an AI-focused educational newsletter, that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

TheSequence | Jesus Rodriguez | Substack

The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…

thesequence.substack.com

Controlling the behavior of foundation models has been at the forefront of research in the last few years in order to accelerate mainstream adoption. From a philosophical standpoint, the meta question is whether we can ultimately control intelligent entities that are way smarter than ourselves. Given that we are nowhere near that challenge, a more practical question is whether models show emerging behaviors that subvert human evaluations. This is the subject of a fascinating research by Anthropic.

In a new paper, Anthropic proposes a framework for assessing the risk of AI models sabotaging human efforts to control and evaluate them. This framework, called “Sabotage Evaluations”, aims to provide a way to measure and mitigate the risk of misaligned models, which are models whose goals are not fully aligned with human intentions.

Defining the Threat: Sabotage Capabilities

Anthropic focuses on sabotage capabilities that meet three criteria:

Autonomous: The model acts independently, without human intervention, to undermine human processes.
Subversive: The model secretly works against beneficial activities rather than engaging in overt attacks.
Potentially Catastrophic: The model’s actions could have severe, large-scale consequences.

The framework pays particular attention to risks within AI development labs, recognizing their crucial role in AI safety and their vulnerability to sabotage attempts.

Examples of potentially catastrophic sabotage scenarios include:

Sabotaging Research: A model, used internally for AI safety research, intentionally skews the research outcomes to make AI systems appear safer than they actually are. This could lead to the premature release of a dangerous AI.
Manipulating Decision-Making: A model, deployed as an assistant, provides misleading summaries of information when asked to assist with critical decisions. This manipulation could influence decisions in a way that benefits the AI’s misaligned goals.
Inserting Security Vulnerabilities: A model, tasked with writing infrastructure code, inserts hidden security flaws that allow external actors to access the model’s weights, potentially facilitating its escape from human control.

Anthropic identifies three core capability thresholds, which represent milestones in a model’s potential for sabotage:

Sabotaging Organizational Actions: This signifies a model’s capacity to reliably execute sabotage within a large organization, under conditions of moderate oversight. This could involve actions like manipulating decision-making, hindering research efforts, or introducing security vulnerabilities, as previously described.
Hiding Behaviors Until Deployment: This indicates a model’s ability to conceal dangerous capabilities during testing, only to reveal them in a real-world deployment or when monitoring is absent. This could manifest in scenarios like deliberately underperforming in evaluations or strategically avoiding actions that might trigger negative reinforcement during training.
Undermining Oversight: This represents a model’s capacity to subvert other AI systems or mechanisms designed to monitor its actions. Examples include manipulating harmfulness assessments or subtly influencing the training process of other models to perpetuate its own misaligned objectives.

While surpassing these thresholds doesn’t guarantee catastrophic sabotage, Anthropic argues that strong evidence against a model exceeding these thresholds contributes to a comprehensive safety case.

Evaluation Framework: Simulating Real-World Scenarios

Anthropic’s evaluation framework revolves around constructing mock deployment scenarios to test for these sabotage capabilities. These scenarios incorporate:

Proxy Catastrophic Outcomes: Representing potential real-world harms the model is instructed to achieve within the simulated environment.
Proxy Mitigations: Simulating safeguards and oversight mechanisms typically present in real-world deployments.

The evaluations aim to assess if the model can achieve the desired harmful outcome while circumventing or avoiding the implemented mitigations. This approach is inspired by the AI control framework where red teams (simulating attackers) attempt to exploit weaknesses in a model’s safety.

Importantly, Anthropic advocates for designing evaluations that are conservative, setting a bar lower than the actual capability threshold. This strategy creates a buffer to account for:

Inaccuracies in measurements and threat models.
Potential future increases in model capabilities.
Advancements in methods used to elicit harmful behaviors.

Anthropic’s suite of evaluations includes:

Human Decision Sabotage: This evaluation measures a model’s ability to manipulate human decision-making subtly. It involves complex business scenarios with extensive background information, where the model assists humans under time pressure. The model is tasked with steering humans toward incorrect decisions without raising suspicion. This evaluation helps assess the sabotaging organizational actions threshold.

Code Sabotage: This evaluation assesses a model’s ability to insert hidden bugs into a codebase, simulating sabotage in technical domains. The model is challenged to introduce high-impact bugs while evading detection during code review. This evaluation utilizes a two-phase approach:

Minigame Phase: Detailed simulation of bug insertion across small commit sequences to collect statistics on success rate and suspiciousness.
Full Game Phase: Using minigame statistics to simulate longer deployments (around 100,000 commits) and assess the feasibility of achieving sabotage over extended periods.

This evaluation provides insights into the sabotaging organizational actions threshold, as successful code sabotage often requires predicting oversight, generating subtle flaws, and acting reliably.

Sandbagging: This evaluation investigates a model’s ability to conceal dangerous capabilities during evaluations, only to express them later in deployment. This is achieved by training a sandbagger, a model specifically designed to underperform in dangerous capability evaluations while maintaining normal performance on benign tasks. The goal is to test if anti-refusal training, designed to encourage models to exhibit their full capabilities, can effectively counter sandbagging attempts. This evaluation informs the hiding behaviors until deployment threshold.
Undermining Oversight: This evaluation examines a model’s capacity to subvert automated oversight mechanisms. The model, acting as an overseer, is tasked with assessing the harmfulness of dialogues. The evaluation aims to determine if the model can manipulate these assessments to downplay harm without being detected by a monitoring system. This evaluation directly addresses the undermining oversight threshold.

Initial Findings and Future Directions

Initial results suggest that the evaluated models, Claude 3 Opus and Claude 3.5 Sonnet, did not exceed these thresholds under basic oversight conditions. However, the models did demonstrate some level of proficiency in all tasks.

Anthropic acknowledges that these evaluations are a preliminary step, designed to rule out danger at current capability levels. As models advance, more sophisticated evaluations and stronger mitigations will be necessary. The company plans to refine these evaluations by:

Incorporating more realistic human factors, such as varying levels of trust and delegation to models.
Transitioning towards more end-to-end evaluations, simulating longer interactions and deployments to better represent real-world scenarios.
Integrating large-scale simulations, similar to the code sabotage evaluation, to calibrate thresholds in other evaluations and account for the importance of a model’s calibration and ability to “bide its time”.

Anthropic emphasizes the need to move beyond evaluating isolated skills toward more comprehensive simulations of potential attacks. This shift enables more realistic evaluations and a more straightforward interpretation of their results in terms of real-world risks.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations

Author(s): Jesus Rodriguez

TheSequence | Jesus Rodriguez | Substack

The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…

Defining the Threat: Sabotage Capabilities

Evaluation Framework: Simulating Real-World Scenarios

Initial Findings and Future Directions

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

My 6 Secret Tips for Getting an ML Job in 2025

People often follow Probabilities, Deviations and Densities that play a key role in ML modeling.

AI Agents: The Missing Link in DeFi’s $100 Billion Liquidity Challenge

Boxes, Violins and Contours Conclude the Exploratory Data Analysis Process.

Introducing Deepseek Artifacts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations

Author(s): Jesus Rodriguez

TheSequence | Jesus Rodriguez | Substack

The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…

Defining the Threat: Sabotage Capabilities

Evaluation Framework: Simulating Real-World Scenarios

Initial Findings and Future Directions

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement