Anthropic News · 2h ago

Anthropic Details Fable 5 Cybersecurity Safeguards and Jailbreak Framework

aisecurityengineer

feature announcement

Anthropic has redeployed Fable 5 globally and published new details on its cybersecurity safeguards, in particular the safety classifiers designed to detect and block dangerous uses. The company also introduced a draft AI jailbreak severity framework, developed with Glasswing partners, that aims to standardize how jailbreak risk is communicated. The updates are relevant to users, security researchers, and AI developers interested in responsible AI deployment and risk mitigation.

→Detailed Classification of Cybersecurity Use Cases
→Draft AI Jailbreak Severity Framework Proposed
→Fable 5 Redeployed Globally with Enhanced Cyber Safeguards
→HackerOne Program Launched for Security Researchers
→Dual-Use Nature of Cybersecurity Capabilities Addressed

Features (2)

Detailed Classification of Cybersecurity Use Cases

Anthropic has published details of Fable 5's safety classifiers, which sort cybersecurity-related activity into four categories: prohibited use, high-risk dual use, low-risk dual use, and benign use. Each category has a specific intended model behavior, chosen to manage the risk it carries.
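
As a rough illustration of the four-way classification, the sketch below models the categories as an ordered enum. Only the four category names come from the announcement. The numeric ordering, the `PLACEHOLDER_ACTIONS` mapping, and the helper names `parse_category` and `placeholder_action` are hypothetical, since this summary does not say what the intended behavior for each category is.

```python
from enum import IntEnum


class CyberUseCategory(IntEnum):
    """The four categories named in the announcement.

    The numeric ordering (low to high risk) is an assumption made for
    illustration; the source only names the categories.
    """
    BENIGN_USE = 0
    LOW_RISK_DUAL_USE = 1
    HIGH_RISK_DUAL_USE = 2
    PROHIBITED_USE = 3


# Hypothetical placeholder actions. The summary says each category has a
# specific intended behavior but does not state what that behavior is.
PLACEHOLDER_ACTIONS = {
    CyberUseCategory.BENIGN_USE: "allow",
    CyberUseCategory.LOW_RISK_DUAL_USE: "allow",
    CyberUseCategory.HIGH_RISK_DUAL_USE: "restrict",
    CyberUseCategory.PROHIBITED_USE: "block",
}


def parse_category(label: str) -> CyberUseCategory:
    """Map a label such as 'High-risk dual use' to its enum member."""
    key = label.strip().upper().replace("-", "_").replace(" ", "_")
    return CyberUseCategory[key]


def placeholder_action(label: str) -> str:
    """Look up the placeholder action for a category label."""
    return PLACEHOLDER_ACTIONS[parse_category(label)]
```

An ordered enum makes it easy to compare categories by risk (for example, `CyberUseCategory.PROHIBITED_USE > CyberUseCategory.BENIGN_USE`), though whether the real classifiers treat the categories as a strict ordering is not stated.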

Draft AI Jailbreak Severity Framework Proposed

A draft AI jailbreak severity framework has been developed with Glasswing partners to provide a consistent way of describing the risks associated with AI model jailbreaks. It is intended to make communication between AI developers and governments easier.

Enhancements (1)

HackerOne Program Launched for Security Researchers

Anthropic has launched a HackerOne program through which security researchers can submit potential cyber jailbreaks they discover in Fable 5 for review. The program is meant to foster collaboration on preventing misuse of AI technology.

Notes (3)

Fable 5 Redeployed Globally with Enhanced Cyber Safeguards

Fable 5 has been redeployed and is now available globally to all users. The release comes with updated cybersecurity safeguards and is accompanied by a draft jailbreak severity framework.

Dual-Use Nature of Cybersecurity Capabilities Addressed

The announcement acknowledges that many cybersecurity capabilities are dual-use: the same capability can serve both benign and harmful purposes. Anthropic aims to use its classifier system to allow defensive uses while preventing misuse.

Safety Margin Adjustments for Classifier Accuracy

The safety margin for Fable 5's classifiers has been increased relative to previous models to give greater confidence that harmful behaviors are caught. Anthropic intends to adjust the classifiers over time so that more benign uses are allowed.
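
One way to picture the trade-off a safety margin creates is the minimal sketch below. It assumes a classifier that emits a harm score between 0 and 1 and a margin that lowers the score at which a request is flagged. The announcement summary does not describe how the margin is actually implemented, so the functions `is_flagged` and `flagged_count`, the threshold values, and the scores are all made up for illustration.

```python
def is_flagged(score: float, base_threshold: float, safety_margin: float) -> bool:
    """Flag a request when its score reaches the threshold minus the margin.

    Hypothetical model: a larger margin lowers the effective threshold,
    so more borderline requests are flagged.
    """
    return score >= base_threshold - safety_margin


def flagged_count(scores: list[float], base_threshold: float, safety_margin: float) -> int:
    """Count how many scores would be flagged under the given margin."""
    return sum(is_flagged(s, base_threshold, safety_margin) for s in scores)


# Made-up harm scores for five requests.
scores = [0.10, 0.35, 0.45, 0.62, 0.90]

wide_margin = flagged_count(scores, base_threshold=0.6, safety_margin=0.2)
no_margin = flagged_count(scores, base_threshold=0.6, safety_margin=0.0)
print(wide_margin, no_margin)
```

In this toy model the wider margin flags the borderline 0.45 request as well as the two high-scoring ones, which mirrors the idea that a larger margin catches more at the cost of blocking some benign uses, and that shrinking it later lets more of those through.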

Read the original announcement →

https://www.anthropic.com/news/fable-safeguards-jailbreak-framework