December 01, 2024
Military Artificial Intelligence Test and Evaluation Model Practices
In December 2024, an international group of American, Indian, European, and Chinese experts completed a white paper on “Military Artificial Intelligence Test and Evaluation Model Practices.” The document is the result of nearly two years of discussion and revision in an academic-to-academic track 2 dialogue setting. The goal of the process was to determine whether experts from the multiple delegations could reach consensus on principles and practices for test and evaluation (T&E) of weapons and related military systems with significant AI components, so that those systems operate more safely, securely, and responsibly. Previous dialogue meetings included academics and former officials with military, diplomatic, intelligence, computer science, corporate, and legal backgrounds from the United States, China, and an international delegation drawn from Europe, Asia, Russia, and elsewhere. A brief summary of the white paper can be found below.
Summary of full white paper
AI Testing and Evaluation Characteristics
AI-enabled systems, both civilian and military, possess distinctive features that significantly influence the testing and evaluation (T&E) process. These characteristics affect not only the evaluation of AI models themselves, but also the broader systems into which these models are integrated.
- Continuous Testing and Monitoring: AI-enabled systems require ongoing evaluation throughout their entire lifecycle, from initial design through long-term sustainment. This continuous approach calls for more integrated collaboration among designers, developers, testers, and end-users.
- Post-Deployment Evolution and Unpredictability: The potential for continued learning and post-deployment transformation in AI systems, coupled with their inherent opacity, introduces an element of operational unpredictability.
- Dynamic Learning and Rapid Updates: AI and machine learning/deep learning systems possess the unique ability to learn directly from data without additional coding, enabling frequent system updates and, in the case of online learning, real-time adaptations.
- Agile Governance: AI-enabled systems require a shift from traditional linear and sequential software development methodologies to more flexible and responsive approaches.
- Adversarial Resilience: In conjunction with independent ‘red teaming’ exercises, it is necessary for T&E processes to incorporate specific tests for evaluating the effects and risks of dedicated adversarial attacks against AI datasets and models.
- Data-Centric Focus and Computational Demands: The foundation of AI systems lies in their data and the infrastructure required to process it. This centrality of data introduces unique challenges, including the potential for skewed, corrupted, or incomplete datasets, which can significantly affect system performance and reliability.
22 Military AI Model Practices for Test and Evaluation
- AI T&E must include test, evaluation and assessment data obtained under conditions as close as possible to the conditions expected during operational deployment of the system, ideally based on real-world data.
- Choices of test methods should be informed by the extent to which the algorithms and components of an AI system are interpretable and understandable, and these qualities should be assessed through a robust T&E process with clear performance indicators and evaluation metrics.
- The design and development process for AI systems should incorporate T&E requirements from the beginning.
- T&E of AI-enabled military systems should be viewed as a continual process.
- AI T&E should include testing under real-world conditions with due regard for the resilience and robustness of the system, including appropriate handling of edge cases and boundary conditions in harsh, uncertain, and dynamic operational environments; error identification and correction; and rollback/failsafe modes, especially in high-risk lethal autonomous weapons systems (LAWS).
- T&E plans should specify when and to what extent modelling and simulation will be used to test AI systems and how this kind of testing will be validated, especially for systems designed to function in environments in which adversaries are expected to deny or deceive fielded AI models.
- T&E plans should include how testing results and system performance will be communicated to all relevant stakeholders.
- Human-system integration and/or human-machine teaming should be considered as an integral component of T&E design.
- Given the potential integration of large language models (LLMs)/generative AI into military systems, special attention should be paid in T&E to factors arising from LLMs, including reinforcement learning from human feedback (RLHF), fine tuning, and retrieval-augmented generation (RAG), along with known LLM limitations.
- T&E requirements should be designed to ensure that it is possible to evaluate system compliance with relevant legal requirements, including the obligation to conduct legal reviews of a system's ability to be used in compliance with international humanitarian law and other relevant international law instruments.
- T&E should assess not only the performance of components and subsystems of an AI-enabled military weapon or decision support system, but also overall AI system performance and the integration of these components, subsystems, and any external or pre-existing platforms.
- T&E plans and systems documentation should identify a rigorous process by which, prior to deployment of a military AI system to a new operational context or when there are significant changes in the operational environment, hazards are identified, analyzed, and remediated.
- T&E plans should identify high risk catastrophic errors that could occur during operations and how these may be prevented, detected, and remediated, especially in the case of strategic command and control systems.
- T&E plans should establish how to evaluate if deployed military AI systems continue to meet their performance goals.
- Collection, management, assessment, and use of data throughout a military AI system’s lifecycle, including during operational deployment, is critical.
- As governments work to develop and strengthen their AI T&E practices for AI-enabled military systems, they should coordinate their efforts with civilian standards, tools, and documentation and draw on professional standards from military and civilian contexts.
- Governments should consider what role is appropriate for the United Nations or expert-level multilateral organizations with respect to standard setting or regulating military AI with respect to technical and/or governance issues that affect T&E.
- Governments should engage in dialogue to learn from each other and share lessons learned from development and deployment of military AI systems, including about T&E standards, T&E’s role in mitigating risks and/or “incidents,” and other transparency and confidence-building measures.
- As part of continuous T&E over a system’s lifecycle, governments and international agencies should consider establishing standards for investigation and remediation of “high consequence incidents” that occur from the use of military AI during exercises or operational deployments.
- To promote transparency, mutual understanding, and consistent best practice, states should publicly release aspects of their processes and approaches to T&E of AI-enabled military systems.
- As governments adopt military AI T&E best practices, they should consider which practices could form the basis of a legally or politically binding instrument and what, if any, might be appropriate enforcement mechanisms.
- Until states gain more experience in developing, testing, and fielding AI-enabled military systems, they should be guided by the precautionary principle: the idea that introducing a new product or process whose ultimate effects are disputed or unknown should be approached with caution, pause, and review.
Read the full white paper at INHR.
Download the full white paper.