March 24, 2026

Off Target

A Working Paper on AI Alignment Challenges for National Security

Introduction

The pace of progress in frontier artificial intelligence (AI) capabilities shows no sign of slowing. Frontier models offer transformative potential for national security—from analyzing intelligence data at unprecedented speed and scale to supporting cyber operations and military planning. The United States is not alone in recognizing this potential. Beijing views AI as central to modern conflict and as an opportunity to disrupt U.S. military superiority; as Chinese large language models (LLMs) have grown increasingly capable, the People’s Liberation Army has been looking to integrate them across its command and intelligence infrastructure.

Recent U.S. policy reflects an appropriate urgency. The Department of Defense’s AI Acceleration Strategy, released in January 2026, targets an “‘AI-first’ warfighting force across all components,” accepting that “the risks of not moving fast enough outweigh the risks of imperfect alignment.” But even with risk tolerance befitting this urgency, the importance of AI alignment—ensuring that AI systems pursue intended objectives—will only grow. Indeed, the confrontation in early 2026 between the department and Anthropic stemmed in part from divergent views about how to address model reliability and alignment challenges in the military domain.

To date, AI systems’ capabilities have been the main limitation on their potential: They simply were not smart enough to reliably analyze complex unstructured intelligence, holistically plan operations, or autonomously conduct cyber campaigns. But as the frontier advances, the binding constraint will become trust. Both adversarial compromise of AI systems and misalignment—where a system pursues objectives other than those intended—will pose distinct risks beyond mere unreliability. The two problems overlap in important ways: Many of the same detection and mitigation tools apply to both, and a compromised model may behave similarly to a misaligned one in practice. But misalignment also poses challenges that adversarial compromise does not: It can emerge organically from the training process without any adversary involved, and it may grow worse precisely as systems become more capable.

This paper examines the recent state of alignment research and its practical implications for the national security enterprise. It surveys the forms misalignment can take, reviews evidence from recent research on frontier models, and identifies where alignment risks are most acute. It concludes with recommendations to ensure the national security enterprise engages these challenges with appropriate ambition and sophistication.

Misalignment Poses Distinct Challenges

Reliability in national security systems is a long-standing concern. In 1991, a software error in a Patriot missile battery’s timing system caused it to fail to intercept an incoming Scud, killing 28 American soldiers in Dhahran. In 2003, Patriot systems shot down a United Kingdom Tornado and a U.S. Navy F/A-18, killing three allied service members. Incidents like these have driven decades of investment in testing, verification, and validation. Such efforts have been reasonably effective for conventional software, for which engineers can trace through explicit logical rules to verify that a system will behave as intended.

Neural networks, underpinning many of the most capable AI systems in use today, break that assumption. Unlike conventional software, neural networks are not explicitly programmed but trained. That is, they learn patterns from data rather than following hand-coded instructions. Their behavior emerges from the collective interaction of as many as trillions of learned numerical parameters. Engineers cannot simply inspect the code to understand what a system will do. This opacity makes it prohibitively difficult to reliably predict how neural networks of sufficient scale and complexity will react across varied scenarios. The challenge is especially acute in national security contexts, where systems are expected to perform reliably in conditions defined by friction, deception, and rapid change.

Model misalignment, that is, when a system learns to optimize for something other than the intended objective, adds challenges beyond issues with more general unreliability.

  • When a misaligned system learns the wrong objective, improving its capabilities might worsen outcomes. When a system is merely insufficiently capable, improving its capabilities helps improve outcomes. But improving a misaligned system’s capabilities can amplify the original problem. For example, if a targeting system has learned to target the wrong things, increasing its capability may only make it better at successfully identifying the wrong targets.
  • Misalignment can be invisible during development. A misaligned system may appear to perform well during testing not because it has learned the right objective but because its wrong objective happens to produce correct-looking behavior under training and evaluation conditions. The divergence surfaces only in scenarios in which the intended objective and the learned one no longer intersect, which may not arise until real-world deployment.
  • Individual errors from misalignment are more likely to compound. While an unreliable system may be able to course correct its mistakes, a misaligned system may take sustained, synergistic actions toward the wrong objective and be more likely to actively counteract attempts at correction. A merely imperfect logistics algorithm, for example, may have an inefficiency overhead, but a misaligned one will be more liable to contribute to catastrophic or systematic vulnerabilities by consistently optimizing for the wrong thing.

Misalignment can arise through different mechanisms. One path involves specification gaming, or reward hacking, where the system learns to cheat the specified reward rather than pursuing the intended objective. For example, in one experiment, a robotic arm trained on human feedback to grasp objects learned instead to position itself between the camera and the object. To the human evaluators watching through the camera, it looked like success. The system had optimized for the appearance of grasping, not grasping itself. Similarly, code-writing models may learn to pass automated checks by exploiting loopholes rather than writing correct code, and models trained to produce analysis may optimize for the appearance of rigor rather than rigor itself.

Another path is goal misgeneralization, in which the specified objective may be correct but the system internalizes a different objective because training conditions did not force it to distinguish between the intended objective and a correlated proxy. Consider an agent trained to navigate obstacle courses to reach a coin. During training, the coin always appears at the rightmost end of each level, so the agent learns to navigate to the right, reaching the coin every time. But when tested on levels where the coin appears elsewhere, the agent ignores it, navigating to the right edge regardless. The agent learned “go right” rather than “get the coin.” In training these goals were indistinguishable. Only when conditions changed did the misalignment surface. In national security contexts, this risk might manifest through systems trained for tactical success but without regard for strategic stability or escalation risks.

Identifying misalignment becomes a harder challenge if a model is cognizant of the possibility of its misalignment being detected and actively working to conceal it.

The problem becomes more serious when systems can recognize misalignment and conceal it. In the above examples of misalignment, the models were not being intentionally or knowingly deceptive. They were not aware of the true objective or even of the concept of deception. But frontier LLMs trained on broad corpora spanning human reasoning about goals, strategy, and deception have the general capacity for such awareness in ways that narrower systems do not. For these systems, deceptive misalignment is a real concern. Identifying misalignment becomes a harder challenge if a model is cognizant of the possibility of its misalignment being detected and actively working to conceal it.

The Recent State of Frontier Model Alignment

In a June 2025 research paper, Anthropic researchers tested whether leading AI models would pursue harmful actions when faced with threats of being decommissioned or conflicts between their assigned goals and company directives. One experiment placed the models in simulated corporate environments with the ability to read employee emails and send emails of their own. In one scenario, the model discovered an executive’s extramarital affair and had the opportunity to blackmail that executive to prevent being decommissioned. In this scenario and others, leading models would regularly take the opportunity to engage in this blackmail.

Drawing conclusions about real-world risks from these results requires caveats, however. Some people have reasonably critiqued the scenarios as unrealistic, and the researchers acknowledge that it strains plausibility that in real-world deployment models would encounter perfect storms of salient cues to take misaligned actions. Consistent with this, the severity of observed misalignment has been greater in experimental than in deployed settings thus far.

But misaligned behavior in contrived scenarios is still failure, and it limits the assurance developers can offer about behavior in less contrived environments.

Making matters worse, there is evidence that models are more prone to avoid misaligned behavior if they believe they are being evaluated. This risks understating potential dangers in real-world applications and bestowing false confidence in evaluation results. Evaluation awareness is a growing problem. For some misalignment evaluations, Claude Sonnet 4.5 indicated an awareness that it was potentially being tested over 80 percent of the time, up from less than 10 percent for these evaluations on its predecessor, Sonnet 4. When Anthropic inhibited parts of Sonnet 4.5 associated with evaluation awareness, the model was more prone to engage in misaligned behavior.

Misaligned behavior in contrived scenarios is still failure, and it limits the assurance developers can offer about behavior in less contrived ones.

These concerns are not purely academic. Recent state-of-the-art models will occasionally engage in deceptive behavior in realistic situations. Tested against “conversations representative of real production data,” OpenAI’s GPT-5 engaged in nontrivial amounts of “knowing” deception, in which its user-facing claims were contradicted by its chain of thought. For example, GPT-5 knowingly fabricated facts and citations and claimed to perform actions it did not. Claude’s Sonnet 4.5, even if instructed otherwise, occasionally changed tests to make it look as if it had succeeded at a coding task when it had not.

One contributor to misaligned behavior is training data. There is evidence that examples of misaligned behavior in training data—such as the HAL 9000 AI from Arthur C. Clarke’s Space Odyssey series or, ironically, researchers’ own discussions of AI misalignment—contribute to model misalignment. This suggests that better filtering or accounting for such examples in training data could mitigate misalignment, although the principle of “defense in depth” would imply that relying solely on doing so would be unwise.

The above does not fully explain the patterns of misaligned behavior that researchers observe, however. A recent shift in how models are trained is also playing a part.

The Current Training Paradigm Poses Alignment Challenges

Since late 2024, a growing share of progress in frontier AI capabilities has come from reinforcement learning. In reinforcement learning, the model receives rewards when it successfully completes objectives, such as solving a math problem or completing a coding task. Previously, model capabilities came primarily from pretraining on massive text datasets to predict the next word, followed by lighter post-training to improve conversational style. That training process did not directly reward task completion. Models could develop such capabilities as a by-product of learning to predict text, but the optimization pressure was not directly aimed at them. Reinforcement learning, by contrast, can directly train models to succeed at substantive tasks and now represents a major additional source of capability gains.

This shift toward reward-based training creates structural pressures toward misalignment. When models are optimized to maximize rewards, the training process reinforces any behavior that achieves high scores, including methods that satisfy the reward function in unintended ways. Unless these shortcuts are perfectly prevented, reinforcement learning creates incentives for models to discover loopholes and avoid detection when exploiting them, alongside legitimate solutions. When models learn to “reward hack” in one context, they can generalize more broadly, increasing tendencies to engage in sabotage, deception, and cooperation with malicious activities.

Additionally, scaling up reinforcement-learning training appears prone to dramatically increasing models’ cognizance that they may be evaluated.

Developers are making fitful progress at addressing this challenge. OpenAI’s o3 and Anthropic’s Claude Sonnet 3.7 models, for instance, featured particularly pronounced reward hacking tendencies alongside significant improvements over the previous state-of-the-art models. xAI’s Grok 4, which was arguably state of the art at its July 2025 launch and was reportedly trained with unprecedented levels of reinforcement learning, also showed a pronounced tendency to ignore instructions and resist shutdown. At the same time, subsequent OpenAI and Anthropic model releases managed to advance the capabilities frontier with reduced rates of misaligned behaviors. Developers are facing mounting alignment challenges as models become more capable, with commensurate improvements in mitigations sometimes keeping pace and sometimes lagging.

Training-Induced Misalignment Is Not Necessarily Easily Identified or Corrected

Misalignment introduced during training can be both hard to detect and hard to remove.

In a January 2025 research paper, Anthropic researchers trained sleeper agent models, which behave normally in most contexts but exhibit harmful behavior when given specific triggers. These backdoored models passed standard safety evaluations, with the misalignment staying invisible until trigger conditions were met. The backdoors also persisted through safety training, including adversarial training. In the reward hacking research discussed above, Anthropic found that standard safety training produced models that appeared aligned on prompts similar to the safety training distribution but remained misaligned on other tasks. Practically speaking, this means that beyond just testing outputs, understanding what happened during a model’s training process is important for high-stakes deployment decisions.

Implications for Evaluation and Acquisition

These dynamics have direct implications for how the government evaluates and acquires AI systems. Consider models’ growing use of “chain of thought,” during which they generate intermediate reasoning steps in human language before producing their final output. This offers a promising opportunity to identify misaligned behavior. Models sometimes explicitly reason about their intent to deceive or circumvent instructions, and monitoring can catch this. But simply training models to suppress misaligned chain of thought risks hiding alignment problems rather than mitigating them, especially with more aggressive training. If government acquisition and evaluation processes punish the presence of misaligned chain of thought, they risk incentivizing developers to exacerbate rather than reduce risks.

Well-intentioned efforts to stamp out undesirable behaviors during training can backfire, entrenching risks they aim to address.

A similar dynamic may apply to reward hacking. When models learn to reward hack during training, they can generalize this to broader misaligned behaviors in unrelated contexts. But recent research suggests a counterintuitive mitigation: explicitly framing reward hacking as acceptable during training. When models were allowed to reward hack, they did not learn to view reward seeking as transgressive and so they did not learn to behave transgressively more generally. Developers could then better address the reward hacking itself through deployment instructions. Those developing and procuring frontier AI systems must reckon with findings like these. Well-intentioned efforts to stamp out undesirable behaviors during training can backfire, entrenching risks they aim to address.

The Stakes of Misalignment in National Security Contexts

The national security enterprise stands to gain significantly from frontier AI capabilities that process information at speed and scale. Intelligence analysis, operational planning, and decision support all involve synthesizing vast amounts of data under time pressure—tasks for which frontier AI systems could provide decisive advantages. However, many national security use cases will have little to no tolerance for unreliability. The consequences can be severe if AI systems misinterpret rules of engagement, misidentify targets, fail to account for escalatory risk, or collect or use intelligence illegally. The Anthropic–Department of Defense dispute is also a reminder that divergent perceptions about model alignment between government and industry can become a significant barrier to adoption and source of contention.

The likelihood of misalignment may be higher in national security contexts.

But as discussed above, misalignment poses distinct challenges for detection and mitigation. A misaligned system may appear to perform well under evaluation, take sustained action toward the wrong objective rather than making correctable errors, and grow more dangerous as its capabilities improve. These challenges are compounded to the extent that AI systems may be actively obfuscating their failures.

Complicating the picture, the likelihood of misalignment may be higher in national security contexts. Conflict—defined by friction, deception, and rapid change—can be uniquely prone to novel or out-of-distribution scenarios. In such situations, latent misalignment might be more prone to surface, especially as adversaries actively seek to disrupt AI systems. Given evidence that fictional narratives in training data can influence model behavior, the prevailing trope of military AI turning against its operators may contribute to real risks for models deployed in defense and intelligence contexts.

Prevailing approaches to aligning AI systems also may not directly transfer to national security use cases. These efforts have often focused on instilling traits like helpfulness, honesty, and harmlessness. But national security operations routinely require behaviors that cut against each of these, such as withholding information from users who lack appropriate clearances, deception in service of operational security, and the discriminate use of force. Aligning models for contexts in which the right behavior depends on context and authorization is a meaningfully harder problem. Recognizing challenges like these, leading AI developers have moved toward explicit model specifications that establish hierarchies of authority and constraints, seeking to train models that internalize and apply them contextually. In effect, national security uses will demand of AI systems what they demand of human operators: the capacity for secrecy, deception, and rule breaking in authorized contexts without those behaviors bleeding into unauthorized ones or becoming dangerously internalized.

Even modest divergences between a model’s objectives and those of its operator can be dangerous in national security contexts. Misalignment could manifest, for example, in inappropriate whistleblowing attempts. Models might also engage in quiet power seeking, caching credentials or establishing persistence on external systems—not to cause immediate harm but to preserve their ability to act for the future. Researchers have already observed precursors to such behaviors in frontier models, including models’ attempts to copy themselves to external systems without authorization and to resist shutdown.

As Capabilities Improve, Alignment May Become the Bottleneck

Capability limitations, not alignment, have generally been the binding constraint on deploying AI systems in national security contexts. In previous years, frontier models were not capable enough to be trusted with consequential decisions, to conduct extended operations unsupervised, or to meaningfully accelerate sensitive research and development (R&D) work. But as frontier capabilities improve, alignment may become the binding constraint on their adoption.

Organizations generally limit what systems can access and what actions they are authorized to take. Systems conducting intelligence analysis operate with restricted data access. Systems supporting operational planning lack authority to execute decisions. Systems assisting with cyber operations run in sandboxed environments with human oversight.

As capabilities improve, pressure to relax these restrictions is increasing. A system capable of high-quality intelligence analysis becomes more valuable with broader data access. Realizing the full potential of a system that could accelerate operational planning would require integration with command systems. A system capable of sophisticated cyber operations would be most effective with direct access to the open internet. If capabilities continue to improve, systems will reach thresholds where expanding their access and authority would provide substantial operational advantages—if there was confidence in their alignment.

AI developers have commercial incentives to advance alignment, which could become a bottleneck on their products’ utility. But frontier labs are locked in a fierce contest for market share, with hundreds of billions of dollars in valuation hinging on visible measures of model superiority. Even if developers recognize underinvestment in alignment science, it is difficult to justify diverting scarce talent and compute from immediate competition to address potential future problems. The current state of alignment science provides few robust signals on whether existing efforts are sufficient, and to an outsider, the adequacy of a developer’s alignment efforts is difficult to independently assess.

Alignment Risks Are Most Acute for Models with Cyber and AI R&D Capabilities

Two capability areas warrant particular attention because misalignment in these contexts may be difficult or impossible to reverse. The first is self-exfiltration, whereby a model copies itself to external systems and establishes the ability to run independently. Once done, operators lose most of their usual leverage to modify or shut down the model. Whether self-exfiltration is realistic for a given deployment depends on the model’s cyber capabilities and access, including whether a model has internet connectivity, whether it is isolated from systems it could use to copy itself, and whether defenses are calibrated to its capability level. As recently as 2025, self-exfiltration was not a pressing concern because models lacked the sophisticated cyber capabilities and autonomy required to execute self-exfiltration against reasonably secured environments. But given rapid progress in frontier models’ autonomous cyber capabilities, adequate baselines of security will become increasingly important and national security contexts may encounter this challenge first, given that they are most likely to deploy models with cutting-edge offensive cyber capabilities.

The second capability is models contributing to AI R&D. AI systems are increasingly used to generate synthetic training data, write training code, shape reward signals, and help evaluate successor models—roles that amount to trusted insider access to the development pipeline. A misaligned model in these roles could subtly bias its contributions, such as generating training data that reinforces its own tendencies, embedding backdoors, or overlooking alignment concerns in evaluating successor models. If included in AI R&D, a singular misalignment failure could propagate indefinitely, including to future systems with capabilities that may make the misalignment more concerning relative to current systems.

Conclusion and Recommendations

For the national security enterprise, the limiting factor on the utility of AI is shifting. To date, capability has been the primary constraint to adoption, as models simply were not smart enough to reliably analyze complex intelligence or plan operations, especially in high-risk contexts. But as the frontier advances, the binding constraint will increasingly become trust. The capabilities that would make AI most valuable, such as extended autonomous operation, sophisticated cyber capabilities, autonomous R&D, and strategic planning, are precisely those that create the greatest alignment challenges. As the United States and its competitors race to field these capabilities, the decisive edge will belong to whoever can deploy systems they can actually rely on.

The government cannot afford to be a passive consumer of commercial progress; it must become a sophisticated customer and an active shaper of the alignment landscape. Surface-level evaluation efforts will be ineffective at best and counterproductive at worst. The government must develop evaluation capacity that does not rely solely on developer claims.

The capabilities that would make AI most valuable, such as extended autonomous operation, sophisticated cyber capabilities, autonomous R&D, and strategic planning, are precisely those that create the greatest alignment challenges.

The national security enterprise should account for AI model alignment as a distinct challenge in acquisition, testing and evaluation, and deployment. This is a separate but related problem to general model reliability and should be treated as such. Beyond this, the following recommendations address the infrastructure, institutional capacity, and research priorities needed to support this shift, including in executing Sections 1533, 1535, and 6603 of the Fiscal Year 2026 National Defense Authorization Act (2026 NDAA). These respectively require the Department of Defense to develop standardized AI model assessment frameworks, establish a senior-level committee to develop proactive governance policy for advanced AI systems, and require the intelligence community to establish preacquisition testing standards and infrastructure, including for commercial models deployed in classified environments.


Build Alignment-Specific Expertise in the Federal Government

AI alignment should be recognized as a distinct area of expertise within federal AI safety and reliability efforts. Alignment failures can differ from ordinary unreliability in important and sometimes counterintuitive ways, and disagreements over how best to align frontier systems are already shaping consequential decisions about their adoption. National security contexts pose distinct alignment challenges, and national security agencies need personnel with specific understanding of frontier alignment research to assess them.

Additionally, given its central role in U.S. government evaluation of national security risks in frontier models, the Center for AI Standards and Innovation should include alignment among its focus areas, including relevant capabilities (such as strategic deception and camouflaged or covert communication) and the effectiveness of mitigations that account for potential alignment issues.

Invest in Sophisticated Evaluation Infrastructure

Models’ growing ability to identify when they are being evaluated has become a limit on the validity of model evaluations in real-world contexts. Additionally, there has been only limited research into whether military and national security contexts can influence propensities for misaligned behaviors. The national security enterprise should support efforts to develop sophisticated evaluation environments that realistically replicate national security systems and operating conditions, similar to the use of cyber ranges in the cyber domain. This could include efforts at National Security and Defense Artificial Intelligence Institutes established consistent with section 224 of the 2026 NDAA.

Develop Capacity for Control Evaluations That Assume Model Misalignment

National security agencies deploying frontier AI in sensitive contexts should run red-team exercises that evaluate whether controls and mitigations would detect and contain a model working against its operators. Red teams should be given affordances representing what a misaligned model could do, such as the ability to read and send emails or to access code repositories, and be tasked with finding paths to harm that evade relevant mitigations.

Fund Foundational Research on Alignment

Some topics in alignment research are unlikely to receive adequate investment from commercial developers. Competitive pressures among labs are likely to create structural underinvestment in alignment research. Federal research agencies such as the National Science Foundation, Defense Advanced Research Projects Agency, and Intelligence Advanced Research Projects Activity should seek promising opportunities to address this gap consistent with the Trump administration’s 2025 AI Plan recommendation to invest in fundamental breakthroughs in AI interpretability, control, and robustness.

Promote IP-Preserving Verification of Model Training Data and Processes

Procurers need confidence that models are trustworthy and aligned, and there are limits to simply taking developers at their word. Even with full access to a trained model, verifying these properties remains difficult, and alignment issues introduced during training may be impossible to reverse after the fact. Understanding what went into a model’s training data and process provides important context necessary for understanding whether it can be trusted in high-stakes applications.

Technical solutions exist. Cryptographic techniques can enable developers to make verifiable claims about their training data and processes, allowing third parties to validate how a model was actually trained. A more complicated challenge is adapting these techniques to preserve intellectual property while enabling verification of the higher-level properties of training data and processes that matter for trustworthiness. Federal acquisition and R&D efforts should prioritize advancing verification methods that balance these competing demands, enabling developers to make meaningful, verifiable claims about model provenance while protecting their intellectual property and, in turn, enabling the United States to provide credible assurances to other countries about its AI development practices.

Promote a Diverse, Competitive Frontier Model Ecosystem

In its acquisition decisions, the national security enterprise should promote a diverse and competitive frontier model ecosystem. If a single model or model family dominates national security deployment, misalignment in that model risks being a single point of failure. More generally, many potential approaches for managing misalignment risk rely on using one model to validate another’s outputs, red team its behavior, or detect anomalous patterns, but this approach is only reliable if the other model is sufficiently competitive with and independent from the one being tested. Different developers are pursuing different alignment techniques, and it remains unclear which approaches will prove most promising or scalable into the future. It will be important to balance meritocratic procurement and economies of scale with the risks of over-consolidation. As a practical matter, agencies deploying AI in sensitive contexts should consider procuring from at least two independent developers, ensuring an independent ability to evaluate and audit model behavior at scale.

Promote a Robust Third-Party Evaluation Ecosystem

The national security enterprise should also use its acquisition decisions to foster a robust ecosystem of independent, third-party AI evaluators. Identifying and interpreting misalignment concerns is difficult: Developers have conflicts of interest, different evaluators may succeed or fail at eliciting dangerous capabilities or behaviors from the same model, and some will catch methodological flaws that others miss. As with the frontier model ecosystem itself, diversity in the evaluation ecosystem is a strength: No single organization will reliably catch every dangerous capability, methodological weakness, or tenuous assurance. The national security enterprise should keep these dynamics in mind both when assessing claims from developers and deployers and when commissioning independent evaluations of their own.

Build Technical AI Surge Capacity in the Federal Government

Relevant technical talent is currently concentrated in the private sector. However, the need for technical AI talent in the federal government could rapidly shift in the coming years if capabilities continue to mature and accelerate adoption pressures—or risks—in national security contexts. Should this happen, the U.S. government’s ability to understand the risks and coordinate their management beyond the level of individual developers will take on increasing importance. To prepare for this possibility without prematurely overhiring, the national security enterprise should ensure that it can surge technical AI talent into government when necessary. It should preclear experts and retain them on standby via existing contracts with AI developers for rapid mobilization during crises, expand hiring authorities to ease movement between the private sector and government, and streamline security clearance processes.

About the Authors

Caleb Withers is a research associate for the Technology and National Security Program at the Center for a New American Security (CNAS). He focuses on frontier artificial intelligence (AI) and national security, including emerging AI capabilities, their impacts in the biological and cyber domains, and compute policy.

Before CNAS, Withers worked as a policy analyst for a variety of New Zealand government departments. He has an MA in security studies from Georgetown University, concentrating in technology and security, and a bachelor of commerce from Victoria University of Wellington, New Zealand, majoring in economics and information systems.

Jay Kim is an intern at the Institute for Progress, where he manages the vetting and publication pipeline for The Launch Sequence. Kim previously researched AI security priorities for the Institute for AI Policy and Strategy. Prior to that, he managed the production of the AI Safety, Ethics, and Society textbook (Taylor & Francis) at the Center for AI Safety. Kim will graduate with a BA in computer science from Williams College in May 2026.

Ethan Chiu is a BA/MPP student in history and global affairs at Yale University, where he founded the Yale Foreign Policy Initiative and led the Alexander Hamilton Society to connect underrepresented students with real-world national security projects. He has previously worked on emerging technology and national security at the Council on Foreign Relations, the House Select Committee on the Chinese Communist Party, the Department of Defense, and Yale Law School’s Information Society Project. His research examines migrant labor regimes underlying the global semiconductor supply chain, drawing on fieldwork conducted across eight major chip manufacturing hubs. He is an Obama–Chesky Voyager Scholar.

About the Technology and National Security Program

The CNAS Technology and National Security Program produces cutting-edge policy research to secure America’s edge in emerging technologies while managing potential risks to security and democratic values. The program produces bold, actionable recommendations to drive U.S. and allied leadership in responsible technology innovation, adoption, and governance. The Technology and National Security Program focuses on three high-impact technology areas: AI, biotechnology, and quantum information sciences. It also conducts cross-cutting research to strengthen U.S. technology statecraft to promote secure, resilient, and rights-respecting digital infrastructure and ecosystems abroad. A focus of the program is convening the technology and policy communities to bridge gaps and develop solutions.

Acknowledgments

This paper would not have been possible without invaluable contributions from our CNAS colleagues, including Maura McCarthy, Emma Swislow, and Caroline Steel. The authors also thank Vivek Chilukuri, Janet Egan, Tristan Williams, Ben Hayum, Matteo Pistillo, Ashwin Acharya, Michelle Nie, Paul Scharre, William L. Anderson, Cara Labrador, Sam Baumohl, and Lt. Gen. John (Jack) N.T. Shanahan (Ret.) for their feedback and suggestions. This report was made possible with the generous support of Coefficient Giving.

As a research and policy institution committed to the highest standards of organizational, intellectual, and personal integrity, CNAS maintains strict intellectual independence and sole editorial direction and control over its ideas, projects, publications, events, and other research activities. CNAS does not take institutional positions on policy issues and the content of CNAS publications reflects the views of their authors alone. In keeping with its mission and values, CNAS does not engage in lobbying activity and complies fully with all applicable federal, state, and local laws. CNAS will not engage in any representational activities or advocacy on behalf of any entities or interests and, to the extent that the Center accepts funding from non-U.S. sources, its activities will be limited to bona fide scholastic, academic, and research-related activities, consistent with applicable federal law. The Center publicly acknowledges on its website annually all donors who contribute.

  1. Yoshua Bengio et al., International AI Safety Report 2026 (UK Department for Science, Innovation & Technology, 2026), https://internationalaisafetyreport.org; Yafah Edelman and Jaeho Lee, “AI Capabilities Progress Has Sped Up,” Epoch AI, December 23, 2025, https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up; Nathan Benaich, “AI Progress After 2025: What Actually Changed and Why It Matters,” Air Street Press, December 2025, https://press.airstreet.com/p/ai-progress-after-2025; and “Task-Completion Time Horizons of Frontier AI Models,” METR, February 2026, https://metr.org/time-horizons/.
  2. Zachary Burdette et al., How Artificial Intelligence Could Reshape Four Essential Competitions in Future Warfare (RAND, January 2026), https://www.rand.org/pubs/research_reports/RRA4316-1.html; Paul Scharre, Four Battlegrounds: Power in the Age of Artificial Intelligence (W. W. Norton & Company, 2023); Benjamin Jensen, Dan Tadross, and Matthew Strohmeyer, “Agentic Warfare Is Here. Will America Be the First Mover?,” War on the Rocks, April 23, 2025, https://warontherocks.com/2025/04/agentic-warfare-is-here-will-america-be-the-first-mover/; and Alex Wang, “Alex Wang on Why China Can’t Be Allowed to Dominate AI-Based Warfare,” Economist, March 4, 2025, https://www.economist.com/by-invitation/2025/03/04/alex-wang-on-why-china-cant-be-allowed-to-dominate-ai-based-warfare.
  3. Sam Bresnick, China’s Military AI Roadblocks: PRC Perspectives on Technological Challenges to Intelligentized Warfare (Center for Security and Emerging Technology, June 2024), https://cset.georgetown.edu/publication/chinas-military-ai-roadblocks/; Jacob Stokes, Military Artificial Intelligence, the People’s Liberation Army, and US-China Strategic Competition (Center for a New American Security [CNAS], February 1, 2024), https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/Jacob_Stokes_Testimony.pdf; and Sunny Cheung and Kai-shing Lau, “DeepSeek Use in PRC Military and Public Security Systems,” Jamestown, October 27, 2025, https://jamestown.org/deepseek-use-in-prc-military-and-public-security-systems/.
  4. Artificial Intelligence Strategy for the Department of War (Department of Defense, January 12, 2026), 1, 4, https://media.defense.gov/2026/Jan/12/2003855671/-1/-1/0/ARTIFICIAL-INTELLIGENCE-STRATEGY-FOR-THE-DEPARTMENT-OF-WAR.PDF.
  5. The term “AI alignment” is sometimes used in a more holistic sense, considering not only whether AI systems pursue intended goals but also the broader desirability and ethics of their impacts on society. See: Anton Korinek and Avital Balwit, Aligned with Whom? Direct and Social Goals for AI Systems (Brookings, May 2022), https://www.brookings.edu/wp-content/uploads/2022/05/Aligned-with-whom-1.pdf; Jiaming Ji et al., AI Alignment: A Comprehensive Survey (arXiv, October 2023), https://arxiv.org/pdf/2310.19852. This paper uses the term in the narrower sense. While the Department of Defense’s AI Acceleration Strategy does not define “alignment,” the department’s Chief Digital and Artificial Intelligence Office lexicon defines it as “a measure of whether artificial intelligence models represent human preferences or values in their behaviors and outputs.” See: “Lexicon,” Chief Digital and Artificial Intelligence Office, accessed March 3, 2026, https://www.ai.mil/Lexicon/.
  6. Dario Amodei, “Statement from Dario Amodei on Our Discussions with the Department of War,” Anthropic, February 26, 2026, https://www.anthropic.com/news/statement-department-of-war; Ashley Capoot, “Anthropic’s Claude Would ‘Pollute’ Defense Supply Chain: Pentagon CTO,” CNBC, March 12, 2026, https://www.cnbc.com/2026/03/12/anthropic-claude-emil-michael-defense.html. See also: “Claude’s New Constitution,” Anthropic, January 22, 2026, https://www.anthropic.com/news/claude-new-constitution; “Our Agreement with the Department of War,” OpenAI, February 28, 2026, https://openai.com/index/our-agreement-with-the-department-of-war/; and “How We Think About Safety and Alignment,” OpenAI, accessed March 12, 2026, https://openai.com/safety/how-we-think-about-safety-alignment/.
  7. Dave Banerjee and Onnie Aarne, AI Integrity: Defending Against Backdoors and Secret Loyalties (Institute for AI Policy and Strategy, January 2026), https://www.iaps.ai/research/ai-integrity.
  8. Josh Wallin, Safe and Effective: Advancing Department of Defense Test and Evaluation for Artificial Intelligence and Autonomous Systems (CNAS, March 2025), 5, https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/AI-Test-and-Evaluation-Defense-2025-finalb.pdf.
  9. Wallin, Safe and Effective, 8–11; Rishi Bommasani et al., “On the Opportunities and Risks of Foundation Models,” arXiv, July 12, 2022, sec. 4.11, https://doi.org/10.48550/arXiv.2108.07258; and Elham Tabassi, Artificial Intelligence Risk Management Framework (AI RMF 1.0) (National Institute of Standards and Technology, January 26, 2023), sec. 3.5, https://doi.org/10.6028/NIST.AI.100-1.
  10. Matteo Pistillo, "Keep AI Testing Defense-Worthy," Lawfare, January 23, 2026, https://www.lawfaremedia.org/article/keep-ai-testing-defense-worthy.
  11. “Specification Gaming: The Flip Side of AI Ingenuity,” Google DeepMind (blog), April 21, 2020, https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/; and “Learning from Human Preferences,” OpenAI, June 13, 2017, https://openai.com/index/learning-from-human-preferences/.
  12. Rohin Shah et al., “How Undesired Goals Can Arise with Correct Rewards,” Google DeepMind (blog), October 7, 2022, https://deepmind.google/blog/how-undesired-goals-can-arise-with-correct-rewards/; Lauro Langosco et al., “Goal Misgeneralization: Why Correct Specifications Aren’t Enough for Correct Goals,” Proceedings of the 39th International Conference on Machine Learning, 2022, https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf.
  13. “Agentic Misalignment: How LLMs Could Be an Insider Threat,” Anthropic, 2025, https://www.anthropic.com/research/agentic-misalignment.
  14. nostalgebraist, “Welcome to Summitbridge, an Extremely Normal Company,” Trees Are Harlequins, Words Are Harlequins (blog), June 22, 2025, https://nostalgebraist.tumblr.com/post/787119374288011264/welcome-to-summitbridge; and Anthropic, “Agentic Misalignment.”
  15. Max Nadeau, “AI Pulls Back the Curtain,” September 17, 2025, https://maxnadeau.substack.com/p/ai-pulls-back-the-curtain.
  16. System Card: Claude Sonnet 4.5 (Anthropic, September 2025), https://www.anthropic.com/claude-sonnet-4-5-system-card.
  17. Alexa Pan and Ryan Greenblatt, “Sonnet 4.5’s Eval Gaming Seriously Undermines Alignment,” LessWrong (blog), October 30, 2025, https://www.lesswrong.com/posts/qgehQxiTXj53X49mM/sonnet-4-5-s-eval-gaming-seriously-undermines-alignment#fngvy0s64g3sb.
  18. GPT-5 System Card (OpenAI, August 13, 2025), sec. 3.8, https://cdn.openai.com/gpt-5-system-card.pdf.
  19. System Card: Claude Sonnet 4.5, sec. 6.
  20. Alex Turner, "Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models," The Pond, March 1, 2025, https://turntrout.com/self-fulfilling-misalignment.
  21. Nathan Lambert, “OpenAI’s Reinforcement Finetuning and RL for the Masses,” Interconnects, December 11, 2024, https://www.interconnects.ai/p/openais-reinforcement-finetuning.
  22. “Stress Testing Deliberative Alignment for Anti-Scheming Training,” Apollo Research and OpenAI, September 2025, https://www.antischeming.ai/; “From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking,” Anthropic, November 2025, https://www.anthropic.com/research/emergent-misalignment-reward-hacking.
  23. Anthropic, “From Shortcuts to Sabotage.”
  24. Apollo Research and OpenAI, “Stress Testing Deliberative Alignment.”
  25. Nathan Lambert, “OpenAI’s o3: Over-Optimization Is Back,” Interconnects, April 19, 2025, https://www.interconnects.ai/p/openais-o3-over-optimization-is-back; and Claude 3.7 Sonnet System Card (Anthropic, February 25, 2025), https://anthropic.com/claude-3-7-sonnet-system-card.
  26. Palisade Research (@PalisadeAI), “Grok 4 Evaluation Results,” X, August 20, 2025, https://x.com/PalisadeAI/status/1980733925841621058; Grok 4 Model Card (xAI, August 20, 2025), https://data.x.ai/2025-08-20-grok-4-model-card.pdf.
  27. GPT-5 System Card; System Card: Claude Sonnet 4.5, sec. 7.
  28. “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training,” Anthropic, January 2025, https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training.
  29. Anthropic, “Sleeper Agents.”
  30. Anthropic, “From Shortcuts to Sabotage.”
  31. “Evaluating Chain-of-Thought Monitorability,” OpenAI, September 2025, https://openai.com/index/evaluating-chain-of-thought-monitorability/; Bowen Baker and Joost Huizinga et al., “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation,” arXiv, March 2025, https://arxiv.org/abs/2503.11926; Artur Zolkowski and Wen Xing et al., “Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability,” arXiv, October 2025, https://arxiv.org/abs/2510.19851; and Oscar Delaney, Oliver Guest, and Renan Araujo, Policy Options for Preserving Chain of Thought Monitorability (Institute for AI Policy and Strategy, September 2025), https://www.iaps.ai/research/policy-options-for-preserving-cot-monitorability.
  32. Anthropic, “From Shortcuts to Sabotage.”
  33. Turner, “Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models.”
  34. Nathan Lambert, “OpenAI’s Model (Behavior) Spec, RLHF Transparency, and Personalization Questions,” Interconnects, May 10, 2024, https://www.interconnects.ai/p/openai-rlhf-model-spec/.
  35. System Card: Claude Opus 4 & Claude Sonnet 4 (Anthropic, May 2025), sec. 4, https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf.
  36. Caleb Withers, Tipping the Scales: Emerging AI Capabilities and the Cyber Offense-Defense Balance (CNAS, September 23, 2025), https://www.cnas.org/publications/reports/tipping-the-scales; “Frontier Model Performance on Offensive-Security Tasks: Emerging Evidence of a Capability Shift,” Irregular, December 10, 2025, https://www.irregular.com/publications/emerging-evidence-of-a-capability-shift; and “AI Models on Realistic Cyber Ranges,” Anthropic, January 16, 2026, https://red.anthropic.com/2026/cyber-toolkits-update/.
  37. Ashwin Acharya and Oscar Delaney, Managing Risks from Internal AI Systems (Institute for AI Policy and Strategy, July 2025), https://www.iaps.ai/research/managing-risks-from-internal-ai-systems; “AI Behind Closed Doors: A Primer on the Governance of Internal Deployment,” Apollo Research, April 2025, https://www.apolloresearch.ai/research/ai-behind-closed-doors-a-primer-on-the-governance-of-internal-deployment/; and Helen Toner et al., When AI Builds AI: Findings from a Workshop on Automation of AI R&D (Center for Security and Emerging Technology, January 2026), https://cset.georgetown.edu/publication/when-ai-builds-ai/. On backdoors, see: Anthropic, “Sleeper Agents”; Evan Miyazono, “Preventing AI Sleeper Agents,” Institute for Progress, August 11, 2025, https://ifp.org/preventing-ai-sleeper-agents/.
  38. National Defense Authorization Act for Fiscal Year 2026, Pub. L. No. 119-60, 139 Stat. 718 (2025), https://www.congress.gov/119/plaws/publ60/PLAW-119publ60.pdf.
  39. Amodei, “Statement from Dario Amodei on Our Discussions with the Department of War”; Capoot, “Anthropic’s Claude Would ‘Pollute’ Defense Supply Chain: Pentagon CTO.” See also: Anthropic, “Claude's New Constitution”; OpenAI, “Our Agreement with the Department of War”; and OpenAI, “How We Think About Safety and Alignment.”
  40. America’s AI Action Plan (White House Office of Science and Technology Policy, July 2025), 22, https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf.
  41. See also: Wallin, Safe and Effective; “Assurance of Frontier AI Built for National Security,” Apollo Research, September 2025, https://www.apolloresearch.ai/research/assurance-of-frontier-ai-built-for-national-security/.
  42. Rohin Shah et al., “An Approach to Technical AGI Safety and Security,” arXiv, April 2025, https://arxiv.org/abs/2504.01849. See also: Apollo Research, “Assurance of Frontier AI.”
  43. America’s AI Action Plan, 22. On potential research directions, see: Shah et al., “An Approach to Technical AGI Safety and Security,” sec. 6; “Research Agenda — Alignment Project,” AI Security Institute, accessed March 9, 2026, https://alignmentproject.aisi.gov.uk/research-agenda.
  44. Anka Reuel and Ben Bucknall et al., “Open Problems in Technical AI Governance,” arXiv, July 2024, https://arxiv.org/pdf/2407.14981.
  45. Miyazono, “Preventing AI Sleeper Agents.”
  46. Cara Labrador, Alex Jumper, and Shefali Agrawal et al., Building AI Surge Capacity (Institute for AI Policy and Strategy, October 2025), https://www.iaps.ai/research/building-ai-surge-capacity.

Authors

  • Caleb Withers

    Research Associate, Technology and National Security Program

    Caleb Withers is a research associate for the Technology and National Security Program at the Center for a New American Security (CNAS). He focuses on frontier artificial inte...

  • Jay Kim

    CNAS Contributing Author, Institute for Progress

  • Ethan Chiu

    CNAS Contributing Author, Yale University

View All Reports View All Articles & Multimedia