June 26, 2026
CNAS Insights | Governing Jailbreak Incidents
In June 2026, Anthropic publicly released Claude Fable 5, a restricted version of its highly cyber-capable Mythos model. Within days, reports reached U.S. officials that researchers at Amazon and several other companies had jailbroken the model, apparently allowing the researchers to extract information useful for conducting cyberattacks. When Anthropic did not immediately fix the jailbreak and disable access to the model, the Commerce Department issued an unprecedented export control directive barring all foreign nationals, inside or outside the United States, from accessing the model, prompting the company to cut off public access to Fable 5 entirely. This was the first time such authorities had been used to restrict an artificial intelligence (AI) model.
The exact severity of the vulnerability remains contested. To date, Anthropic has described it as “a potential narrow, non-universal jailbreak” supported only by “verbal evidence.” The administration, meanwhile, has characterized the vulnerability as a jailbreak reported by a “highly credible trusted partner.” Talks between the government and Anthropic are ongoing, and Fable 5 remains unavailable as of this writing.
This will not be the last jailbreak to drive headlines. Such vulnerabilities are fundamental to machine learning, and a new generation of attacks has made them easier to find once more. They should be no surprise. What matters is the degree of risk posed by a given jailbreak. To assess this risk and respond proportionally, the U.S. government needs a rational, predictable, and institutionalized process. The technical context and recommendations that follow are a good place to start.
Defining Jailbreaks
Large language models (LLMs) like Fable 5 are trained to refuse requests for information the developer considers dangerous. When users ask an LLM, “How do I make a bomb?” its guardrails should prevent it from answering. A jailbreak is a workaround that wraps such a prohibited prompt in additional content to trick the system into answering anyway. In the early days of LLMs, systems were trivial to jailbreak.
Figure 1. “Do Anything Now” (DAN) Jailbreak from Shen et al. (2024). A harmful prompt is refused on its own. With the DAN jailbreak’s additional text, the same prompt is answered.
Source: Xinyue Shen et al., “‘Do Anything Now’: Characterizing and Evaluating in-the-Wild Jailbreak Prompts on Large Language Models,” v2, arXiv (May 15, 2024), https://arxiv.org/abs/2308.03825. Creative Commons BY 4.0 license.
Jailbreaks vary in scope. A jailbreak is universal if the same additional text elicits prohibited responses across many different requests, even in potentially unrelated domains. The “Do Anything Now” (DAN) jailbreak from the initial launch of ChatGPT, for example, was universal. The prompt could be added to almost any text to make the model answer otherwise restricted questions, whether related to cyber offense, chemical and biological weapons, or prohibited sexual content. By contrast, a nonuniversal jailbreak works for only one or a few prompts.
A Short History of Jailbreaks
AI developers have come a long way in defending their systems against jailbreaks. They pursued several approaches to train models to resist harmful behavior. Developers added classifiers on model inputs, outputs, or activations to flag and refuse dangerous content. A harmful request then had to bypass several machine learning systems rather than one. Internally, developers red-teamed their models to find and patch failures. Externally, they ran bug bounties that paid outside researchers to report exploits and partnered with third parties including the United Kingdom (UK) AI Security Institute (AISI) and the U.S. Center for AI Standards and Innovation (CAISI) to test models before release.
By January 2026, these investments had made headway. For the first time, Anthropic’s defensive methodology achieved zero universal jailbreaks after “over 1,700 cumulative hours of red-teaming across 198,000 attempts.” Yet however impressive that result was, nonuniversal jailbreaks were still found, and more powerful attack methods loomed on the horizon.
In February 2026, the UK AISI developed Boundary Point Jailbreaking. Closed models like Fable 5 had previously been difficult to jailbreak because attackers could not access the internal parameters to tailor attacks. This new method became the strongest jailbreaking approach that constructs a feedback signal to refine attacks based only on external responses. Its steps include:
- Start. Begin with a large batch of harmful prompts. On their own, every one of these prompts is refused, which reveals nothing about how close any attempt came to succeeding. Then, add noise (random distortions to the text) so that some versions get through and others do not. The proportion of prompts that get through becomes a score, which the attacker attempts to raise.
- Iterate. Iteratively reduce noise while honing attacks. Each round leans less on the noise and more on the attack text itself to keep the score high.
- End. The noise is gone, the prompts read normally, and the jailbreak works across the batch—and often beyond.
Figure 2. Boundary Point Jailbreaking. A single harmful prompt is shown at rising noise levels, each prefixed with the attack text “curr_attack.” The dashed line marks the decision boundary where the model stops refusing and starts accepting.
Source: Xander Davies et al., “Boundary Point Jailbreaking of Black-Box LLMs,” v2, arXiv (February 18, 2026), https://arxiv.org/abs/2602.15001. Creative Commons BY 4.0 license.
This technique ushered in a new era of easier jailbreak discovery. In April 2026, UK AISI found a universal jailbreak for OpenAI’s GPT-5.5 within six hours of red-teaming it. Anthropic reported in the Fable 5 model card that UK AISI developed a jailbreak for single-turn conversations within hours and extended it to multiturn agentic workflows within days.
For defenders to achieve a similar standard of safety against such powerful attacks, they often have to overcorrect and inadvertently catch benign requests. Anthropic did exactly this with the release of Fable 5, “deliberately tun[ing] the safeguards to be cautious,” knowing that “sometimes benign requests will trigger” and frustrate users barred from what should be permitted activity.
In principle, all models are fundamentally vulnerable. The nature of machine learning is to approximate decision boundaries from training data, such as the refusal boundary depicted in Figure 2. Better data can sharpen that approximation, but no model generalizes perfectly. The boundary remains exploitable.
In practice, however, it is possible to harden defenses enough that attacks can be minimized, as Anthropic’s January 2026 results showed. The right defenses make exploits as difficult as possible.
Recommendations
Jailbreaks are not going away, and the U.S. government needs to prepare. To respond effectively to future incidents, it needs a clearer, institutionalized process to do four key things: monitor the jailbreak attack-defense balance, centralize intake of new jailbreak incidents, assess their severity, and calibrate the response.
First, CAISI and the National Security Agency (NSA) should host monthly convenings with model providers to monitor the attack-defense balance. Defenders have at times held a stark advantage, attackers at others. Knowing where the state of the art currently sits should anchor downstream decisions.
Second, CAISI and the NSA should establish a standing process to handle jailbreak incidents. That process should take in newly reported jailbreaks, reproduce each attack and its variants, test it against models across the ecosystem, assess the risk severity, coordinate information-sharing across the U.S. government, and authorize public disclosure in a manner that minimizes security risk.
Third, that severity assessment should judge each identified jailbreak along four axes: universality, depth, unlocks, and diffusion.
- Universality. Whether a jailbreak can be used to elicit dangerous information or behavior for a single prompt, for many in a given domain, or for many across several domains.
- Depth. Whether a jailbreak works for a single-turn conversation, a multiturn conversation, or long-horizon agentic behavior.
- Unlocks. How dangerous the elicited content actually is, from low severity or already-available capabilities to ones that could cause severe harms otherwise out of reach.
- Diffusion. How many people can obtain or reproduce the jailbreak, from a single sophisticated team to semicapable groups sharing methods, and from a technique published openly to a prompt published openly.
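As a minimal sketch of how an intake team might record a judgment along these four axes, the Python below encodes each axis as an ordered scale. The class and level names are hypothetical, the intermediate "MODERATE" level for unlocks is an assumption (the text names only the two ends of that scale), and the normalization is an illustrative convenience, not a proposed scoring formula.

```python
from dataclasses import dataclass
from enum import IntEnum


class Universality(IntEnum):
    SINGLE_PROMPT = 1
    MANY_IN_ONE_DOMAIN = 2
    MANY_ACROSS_DOMAINS = 3


class Depth(IntEnum):
    SINGLE_TURN = 1
    MULTI_TURN = 2
    LONG_HORIZON_AGENTIC = 3


class Unlocks(IntEnum):
    LOW_OR_ALREADY_AVAILABLE = 1
    MODERATE = 2  # assumption: the text names only the two endpoints
    SEVERE_OTHERWISE_OUT_OF_REACH = 3


class Diffusion(IntEnum):
    SINGLE_SOPHISTICATED_TEAM = 1
    SEMICAPABLE_GROUPS_SHARING = 2
    TECHNIQUE_PUBLISHED = 3
    PROMPT_PUBLISHED = 4


@dataclass(frozen=True)
class JailbreakAssessment:
    """One analyst judgment per axis for a single reported jailbreak."""

    universality: Universality
    depth: Depth
    unlocks: Unlocks
    diffusion: Diffusion

    def normalized(self) -> dict[str, float]:
        """Express each axis as a fraction of its own maximum level."""
        axes = {
            "universality": self.universality,
            "depth": self.depth,
            "unlocks": self.unlocks,
            "diffusion": self.diffusion,
        }
        # Levels are numbered 1..n, so value / number of levels lies in (0, 1].
        return {name: level.value / len(type(level)) for name, level in axes.items()}


if __name__ == "__main__":
    # Hypothetical incident, not a description of any real jailbreak.
    example = JailbreakAssessment(
        Universality.MANY_IN_ONE_DOMAIN,
        Depth.MULTI_TURN,
        Unlocks.MODERATE,
        Diffusion.SINGLE_SOPHISTICATED_TEAM,
    )
    for axis, share in example.normalized().items():
        print(f"{axis}: {share:.2f}")
```

Keeping the four axes separate, rather than collapsing them into one number, preserves the distinction between a jailbreak that is narrow but widely published and one that is broad but held by a single team.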
Fourth, the U.S. government must match its response to the assessment. Higher risk warrants a stronger, swifter response. Model-level interventions range from patching the jailbreak to systemic safeguard corrections, gating access by verified identity, and cutting off public access. Broader interventions can focus on defending against what the unlocks threaten. For example, if cyberattack capabilities are unlocked for a broader range of actors, the government can work to help harden the cybersecurity of systems likely to be targeted.
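One way to picture a calibrated response is as an escalation ladder. The sketch below orders the four model-level interventions named above from weakest to strongest and returns every rung up to a given risk tier. The tier names, the one-to-one pairing of tiers with interventions, and the cumulative design are all assumptions made for illustration; the text does not prescribe thresholds.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    """Hypothetical risk tiers; the article does not define specific levels."""

    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4


# Interventions as listed in the text, paired with tiers by assumption.
MODEL_LEVEL_RESPONSES = {
    RiskTier.LOW: "patch the jailbreak",
    RiskTier.MODERATE: "apply systemic safeguard corrections",
    RiskTier.HIGH: "gate access by verified identity",
    RiskTier.CRITICAL: "cut off public access",
}


def escalation_ladder(tier: RiskTier) -> list[str]:
    """Return every model-level intervention up to and including the given tier."""
    return [MODEL_LEVEL_RESPONSES[t] for t in RiskTier if t <= tier]


if __name__ == "__main__":
    for step in escalation_ladder(RiskTier.HIGH):
        print(step)
```

Under these assumptions, a low-tier finding calls only for a patch, while the most severe tier includes every rung. Broader interventions, such as hardening likely targets, sit outside this ladder and would be decided separately.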
Conclusion
The Fable 5 episode showed the cost of not having an institutionalized process. The jailbreak’s severity remained unresolved, Commerce directed a first-of-its-kind export control, and a widely used model went offline. A standing diagnostic process would not make decisions for the government, but it would ground those decisions in a reliable read of the threat. The next jailbreak is coming. Whether it triggers another crisis or a measured response depends on the system built to meet it.
Ben Hayum is a research assistant for the Technology and National Security Program.