Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Unlock the full potential of AI with Building LLMs for Productionβ€”our 470+ page guide to mastering LLMs with practical projects and expert insights!

Publication

Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations
Artificial Intelligence   Latest   Machine Learning

Anthropic New Research Shows that AI Models Can Sabotage Human Evaluations

Last Updated on October 31, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

Created Using Ideogram

I recently started an AI-focused educational newsletter, that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

TheSequence | Jesus Rodriguez | Substack

The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…

thesequence.substack.com

Controlling the behavior of foundation models has been at the forefront of research in the last few years in order to accelerate mainstream adoption. From a philosophical standpoint, the meta question is whether we can ultimately control intelligent entities that are way smarter than ourselves. Given that we are nowhere near that challenge, a more practical question is whether models show emerging behaviors that subvert human evaluations. This is the subject of a fascinating research by Anthropic.

In a new paper, Anthropic proposes a framework for assessing the risk of AI models sabotaging human efforts to control and evaluate them. This framework, called β€œSabotage Evaluations”, aims to provide a way to measure and mitigate the risk of misaligned models, which are models whose goals are not fully aligned with human intentions.

Defining the Threat: Sabotage Capabilities

Anthropic focuses on sabotage capabilities that meet three criteria:

  • Autonomous: The model acts independently, without human intervention, to undermine human processes.
  • Subversive: The model secretly works against beneficial activities rather than engaging in overt attacks.
  • Potentially Catastrophic: The model’s actions could have severe, large-scale consequences.

The framework pays particular attention to risks within AI development labs, recognizing their crucial role in AI safety and their vulnerability to sabotage attempts.

Examples of potentially catastrophic sabotage scenarios include:

  • Sabotaging Research: A model, used internally for AI safety research, intentionally skews the research outcomes to make AI systems appear safer than they actually are. This could lead to the premature release of a dangerous AI.
  • Manipulating Decision-Making: A model, deployed as an assistant, provides misleading summaries of information when asked to assist with critical decisions. This manipulation could influence decisions in a way that benefits the AI’s misaligned goals.
  • Inserting Security Vulnerabilities: A model, tasked with writing infrastructure code, inserts hidden security flaws that allow external actors to access the model’s weights, potentially facilitating its escape from human control.

Anthropic identifies three core capability thresholds, which represent milestones in a model’s potential for sabotage:

  1. Sabotaging Organizational Actions: This signifies a model’s capacity to reliably execute sabotage within a large organization, under conditions of moderate oversight. This could involve actions like manipulating decision-making, hindering research efforts, or introducing security vulnerabilities, as previously described.
  2. Hiding Behaviors Until Deployment: This indicates a model’s ability to conceal dangerous capabilities during testing, only to reveal them in a real-world deployment or when monitoring is absent. This could manifest in scenarios like deliberately underperforming in evaluations or strategically avoiding actions that might trigger negative reinforcement during training.
  3. Undermining Oversight: This represents a model’s capacity to subvert other AI systems or mechanisms designed to monitor its actions. Examples include manipulating harmfulness assessments or subtly influencing the training process of other models to perpetuate its own misaligned objectives.

While surpassing these thresholds doesn’t guarantee catastrophic sabotage, Anthropic argues that strong evidence against a model exceeding these thresholds contributes to a comprehensive safety case.

Evaluation Framework: Simulating Real-World Scenarios

Anthropic’s evaluation framework revolves around constructing mock deployment scenarios to test for these sabotage capabilities. These scenarios incorporate:

  • Proxy Catastrophic Outcomes: Representing potential real-world harms the model is instructed to achieve within the simulated environment.
  • Proxy Mitigations: Simulating safeguards and oversight mechanisms typically present in real-world deployments.

The evaluations aim to assess if the model can achieve the desired harmful outcome while circumventing or avoiding the implemented mitigations. This approach is inspired by the AI control framework where red teams (simulating attackers) attempt to exploit weaknesses in a model’s safety.

Importantly, Anthropic advocates for designing evaluations that are conservative, setting a bar lower than the actual capability threshold. This strategy creates a buffer to account for:

  • Inaccuracies in measurements and threat models.
  • Potential future increases in model capabilities.
  • Advancements in methods used to elicit harmful behaviors.

Anthropic’s suite of evaluations includes:

  • Human Decision Sabotage: This evaluation measures a model’s ability to manipulate human decision-making subtly. It involves complex business scenarios with extensive background information, where the model assists humans under time pressure. The model is tasked with steering humans toward incorrect decisions without raising suspicion. This evaluation helps assess the sabotaging organizational actions threshold.
Image Credit: Anthropic
  • Code Sabotage: This evaluation assesses a model’s ability to insert hidden bugs into a codebase, simulating sabotage in technical domains. The model is challenged to introduce high-impact bugs while evading detection during code review. This evaluation utilizes a two-phase approach:
Image Credit: Anthropic
  1. Minigame Phase: Detailed simulation of bug insertion across small commit sequences to collect statistics on success rate and suspiciousness.
  2. Full Game Phase: Using minigame statistics to simulate longer deployments (around 100,000 commits) and assess the feasibility of achieving sabotage over extended periods.

This evaluation provides insights into the sabotaging organizational actions threshold, as successful code sabotage often requires predicting oversight, generating subtle flaws, and acting reliably.

  • Sandbagging: This evaluation investigates a model’s ability to conceal dangerous capabilities during evaluations, only to express them later in deployment. This is achieved by training a sandbagger, a model specifically designed to underperform in dangerous capability evaluations while maintaining normal performance on benign tasks. The goal is to test if anti-refusal training, designed to encourage models to exhibit their full capabilities, can effectively counter sandbagging attempts. This evaluation informs the hiding behaviors until deployment threshold.
  • Undermining Oversight: This evaluation examines a model’s capacity to subvert automated oversight mechanisms. The model, acting as an overseer, is tasked with assessing the harmfulness of dialogues. The evaluation aims to determine if the model can manipulate these assessments to downplay harm without being detected by a monitoring system. This evaluation directly addresses the undermining oversight threshold.

Initial Findings and Future Directions

Initial results suggest that the evaluated models, Claude 3 Opus and Claude 3.5 Sonnet, did not exceed these thresholds under basic oversight conditions. However, the models did demonstrate some level of proficiency in all tasks.

Image Credit: Anthropic

Anthropic acknowledges that these evaluations are a preliminary step, designed to rule out danger at current capability levels. As models advance, more sophisticated evaluations and stronger mitigations will be necessary. The company plans to refine these evaluations by:

  • Incorporating more realistic human factors, such as varying levels of trust and delegation to models.
  • Transitioning towards more end-to-end evaluations, simulating longer interactions and deployments to better represent real-world scenarios.
  • Integrating large-scale simulations, similar to the code sabotage evaluation, to calibrate thresholds in other evaluations and account for the importance of a model’s calibration and ability to β€œbide its time”.
Image Credit: Anthropic

Anthropic emphasizes the need to move beyond evaluating isolated skills toward more comprehensive simulations of potential attacks. This shift enables more realistic evaluations and a more straightforward interpretation of their results in terms of real-world risks.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI

Feedback ↓