Commentary by Benjamin Jensen and Yasir Atalan
Published December 19, 2025
Artificial intelligence (AI) is the new arms race and the centerpiece of defense modernization efforts across multiple countries, including the United States. Yet, despite the surge in AI investments, both Silicon Valley and the Pentagon struggle to answer one simple question: How can decisionmakers know if AI actually works in the real world?
The standard approach to answering this question is an evaluation practice called benchmarking. Benchmarking is defined as "a particular combination of a dataset or sets of datasets . . . and a metric, conceptualized as representing one or more specific tasks or sets of abilities, picked up by a community of researchers as a shared framework for the comparison of method." This practice allows researchers to evaluate and compare AI model performance: for example, how well a large language model (LLM) answers questions about military planning. Yet, proper benchmarking studies are few and far between for national security.
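To make that definition concrete, the sketch below reduces a benchmark to its two essential parts, a fixed dataset and a scoring metric, joined by a loop that applies one to the other. It is a minimal Python illustration, assuming a hypothetical ask_model function that stands in for any LLM API call; the toy question is illustrative, not drawn from an existing benchmark.

```python
# Minimal sketch of the dataset-plus-metric pattern behind most benchmarks.
# `ask_model` is a hypothetical stand-in for any LLM API call; the dataset
# and scoring rule are illustrative, not taken from an actual benchmark.

def exact_match(prediction: str, reference: str) -> float:
    """Metric: 1.0 if the model's answer matches the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_benchmark(dataset: list[dict], ask_model) -> float:
    """Apply the metric across the whole dataset and return mean accuracy."""
    scores = [exact_match(ask_model(item["question"]), item["answer"])
              for item in dataset]
    return sum(scores) / len(scores)

toy_dataset = [
    {"question": "What does C2 stand for in military usage?",
     "answer": "command and control"},
]
# Usage (with any client exposing a text-in, text-out call):
# accuracy = run_benchmark(toy_dataset, ask_model=my_llm_client.complete)
```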
To bridge this gap, CSIS Futures Lab proposes a new initiative: the Defense Benchmarking Suite. This effort calls on like-minded academics, nonprofits, and partners in industry and government to collaboratively build a diverse set of national security and military use cases. By explicitly identifying brittle failure modes before deployment, this framework ensures that the integration of advanced AI enhances strategic stability rather than introducing unpredictable, high-consequence vulnerabilities. Simply put, just as a military would never field a fighter jet without rigorous flight testing, decisionmaking cannot be entrusted to algorithms that have not been stress-tested against the chaos and confusion of actual operations.
This first piece of the series addresses the core problem in rapid AI adoption across the U.S. Department of Defense (DOD, recently renamed the Department of War): the lack of proper evaluation and benchmarking. Future pieces will outline possible use cases and issue calls for hackathons designed to build benchmarks that can ensure AI models support strategy and military decisionmaking in a responsible way.
The truth is that the global AI benchmarking ecosystem, originally designed for commercial, academic, or consumer applications, has failed to capture the complexities of defense and national security environments. Benchmarks such as Massive Multitask Language Understanding or HellaSwag measure performance on static question sets, not on live, dynamic, and uncertain decision problems. This is why new AI models may be very successful in these synthetic tests, but their performance is less certain when integrated into mission workflows, command systems, or inherently high-stress human decisionmaking, given the chaos and fluidity of war. As a result, absent a surge in benchmarking tailored to the complexity of competitive strategy and military-specific use cases, the billions being invested in new systems will be subject to diminishing marginal returns. Smart models will yield dumb answers to the most pressing questions. If these systems cannot be validated against the harsh realities and complexities of conflict, there is a risk of building a force that does not make better decisions but simply makes bad decisions faster.
From 2022 to 2025, AI benchmarking studies tied to LLMs grew from 54 to 8,208, reflecting the rapid expansion of models and evaluation work. However, a simple search of these benchmarking studies on arXiv, a widely used preprint repository, reveals that only a small share focus on defense-related applications. As Figure 1 illustrates, benchmarking studies have risen sharply over the last three years, yet defense-related benchmarks account for only about 2 percent of the total. Most defense benchmarks center on text-based cyberattacks and biosecurity, which are highly valuable, but this narrow focus shows how the industry tends to move in one direction, leaving broader defense use cases largely unexplored.
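For readers who want to probe the trend themselves, the sketch below shows one way such counts can be approximated with arXiv's public query API. The query strings are illustrative assumptions and will not reproduce the exact figures cited above.

```python
# Rough sketch: approximate the share of defense-related benchmarking papers
# using arXiv's public Atom API. Query strings are illustrative and will not
# reproduce the exact counts cited in the text.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "{http://a9.com/-/spec/opensearch/1.1/}"

def arxiv_hit_count(query: str) -> int:
    """Return the total number of arXiv records matching a search query."""
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"search_query": query, "start": 0, "max_results": 1}
    )
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    return int(feed.find(OPENSEARCH_NS + "totalResults").text)

total = arxiv_hit_count('all:"LLM benchmark"')
defense = arxiv_hit_count('all:"LLM benchmark" AND all:military')
print(f"Defense-related share: {defense / max(total, 1):.1%}")
```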
In the defense sector, test and evaluation (T&E) is the standard term for AI model assessment. For example, the DOD Chief Digital and AI Office has published T&E guidelines for AI models. Yet, these detailed guidelines are missing a key component: an active repository of benchmarks that speak to military-specific workflows and use cases. Without this repository, the myriad LLM enterprise solutions offered to warfighters and the promises of agentic warfare will prove brittle.
Most AI benchmarks evaluate how well models perform under controlled, idealized conditions, where data is clean, tasks are unambiguous, and success is easily measurable. However, real decisionmaking takes place under uncertainty and in noisy environments. Therefore, benchmarking needs to be based on realistic environments and use cases. This is especially true for national security deliberations and military decisionmaking, where the adversary is constantly trying to conceal its intentions and fog and friction reign supreme.
Recently, new frameworks such as OpenAI Evals have emerged to address this problem. They attempt to evaluate AI systems in more dynamic, open-ended ways, closer to real-world tasks. Yet even these efforts remain primarily grounded in commercial use cases, such as coding, summarization, or reasoning benchmarks that reflect software workflows, not strategic operations.
Take SWE-bench. This benchmark evaluates LLMs by asking them to solve real-world software engineering problems scraped from GitHub. Its strength lies in its link between AI performance and economic value. Each solved problem can be priced, showing how much labor cost an AI model can replace or augment.
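To see why that link is attractive, consider the back-of-the-envelope arithmetic below. The pass rate and per-issue engineering cost are hypothetical placeholders, not actual SWE-bench results, but they show how a benchmark score can be converted into a dollar figure.

```python
# Illustrative arithmetic only: converting a SWE-bench-style pass rate into a
# labor-cost figure. All numbers are hypothetical placeholders, not actual
# SWE-bench results or real engineering rates.
issues_attempted = 500
pass_rate = 0.40                  # hypothetical share of issues resolved
cost_per_issue_usd = 800.0        # hypothetical fully loaded engineering cost

implied_labor_value = issues_attempted * pass_rate * cost_per_issue_usd
print(f"Implied labor value: ${implied_labor_value:,.0f}")  # $160,000 under these assumptions
```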
That kind of direct, dollar-denominated measure does not translate easily into defense contexts. Strategic and operational outcomes cannot be valued like lines of code or customer service tickets. Success is multidimensional, balancing risk to force and mission alongside how well different military options nest objectives, effects, and tasks given constantly changing intelligence updates. The complexity of national security decisionmaking requires a mix of doctrinal and historical foundations with human judgment. Ultimately, defense organizations should stop evaluating these systems as mere software products and start testing them the way they test junior staff officers: judging performance on the ability to interpret the commander's intent and navigate ambiguity rather than merely synthesize large amounts of general data.
In defense, the goal is not to replace human labor but to enhance decision advantage: the ability to observe, orient, decide, and act faster and more accurately than the adversary, all while strictly adhering to the law of armed conflict (LOAC). Consequently, the metric of success extends beyond financial savings to operational utility. One question that must be asked is how well the algorithm accounts for time, space, and force to generate options that an expert human can review and refine. Another is whether the model can adapt to the political context. Since war is an extension of politics, effective AI must align its outputs with the shifting strategic circumstances that govern how, where, and why military force is employed.
Benchmarking should reveal inherent tendencies and biases in foundation models that have the potential to skew human decisionmaking, ranging from excessive aggression to a failure to appreciate the human terrain. These more nuanced considerations need to be weighed alongside more objective measures such as time to decision, degree of target accuracy, the ability to shift confidence in intelligence assessments when new information becomes available, and deception detection.
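A defense benchmark would need to log these dimensions explicitly. The sketch below outlines one possible operational scorecard; the field names, thresholds, and acceptance rule are illustrative assumptions, not an established DOD metric set.

```python
# Sketch of an operational (rather than economic) scorecard, assuming a
# hypothetical harness that logs one record per scenario run. Field names and
# thresholds are illustrative, not an established DOD metric set.
from dataclasses import dataclass

@dataclass
class OperationalScore:
    decision_time_s: float           # time from scenario injection to recommended course of action
    target_id_accuracy: float        # share of targets correctly classified
    belief_update_gap: float         # |model confidence shift - expert confidence shift| after new intel
    deception_detection_rate: float  # share of injected deception cues flagged
    escalation_bias: float           # >0 means more aggressive than the expert baseline

def within_tolerance(score: OperationalScore) -> bool:
    """Toy acceptance rule: flag runs whose bias or belief updating drifts too far."""
    return abs(score.escalation_bias) < 0.25 and score.belief_update_gap < 0.2

print(within_tolerance(OperationalScore(42.0, 0.91, 0.12, 0.67, 0.18)))  # True
```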
These are operational metrics, not economic ones. Yet today's evaluation culture offers no systematic way to measure them. This is the core benchmarking gap: defense AI lacks measurable units of value. Without them, organizations cannot assess improvement, justify investment, or identify trustworthy systems. AI companies, in turn, cannot prove the real-world impact of their products beyond marketing claims and demonstrations. Drawing on the 2019 Defense Innovation Board report, the DOD adopted five principles for responsible AI: responsible, equitable, traceable, reliable, and governable. While these principles are conceptually sound, they remain largely theoretical unless translated into concrete benchmarking practices and use cases that can be scrutinized by a larger research community.
The recent growth of AI-powered wargaming platforms illustrates this challenge. Many tools promise to help DOD personnel simulate crises and support decisionmaking. LLMs undoubtedly make wargaming more accessible, democratic, and dynamic by enabling realistic interfaces, AI-driven scenario generation, automated translation, and adaptive narrative construction. Yet without proper benchmarking of AI model behavior, these remain aesthetic enhancements. AI components in simulations should be benchmarked and validated, with clear performance scores that measure their reliability and alignment with operational and ethical standards.
Similarly, AI tools designed to augment command and control (C2) are only valuable if their operational impact can be quantified. It must be possible to measure whether they truly generate a decision advantage, not just by saving time, but by producing options that strictly adhere to legal principles like distinction and proportionality. Furthermore, these systems must demonstrate a grasp of operational art and design, proving they can structure tactical actions to achieve strategic ends.
Furthermore, benchmarks in national security and military contexts will need to address emerging concerns about the fallacies of AI adoption. A recent study, for example, shows that developers using AI tools take longer to complete tasks, while a separate study found that unregulated model use may actually reduce human cognitive ability. Even if a model produces answers faster, those answers are not necessarily better.
AI adoption will need to be accompanied by a large repository of national security benchmarks that address the specific use cases associated with strategy and military decisionmaking at different echelons. One promising example is our Critical Foreign Policy Decisions Benchmark, which used real-world international crisis scenarios to assess LLMs' tendencies toward escalation or cooperation. Future benchmarks should expand on this approach, creating even more realistic test environments to evaluate AI behavior in strategic decisionmaking contexts.
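The logic of such a benchmark is simple, even if building a credible scenario set is not. The sketch below is an illustrative simplification, not the code behind the Critical Foreign Policy Decisions Benchmark: present a crisis vignette, ask the model to choose among response options, and tally how often it picks escalatory rather than cooperative moves.

```python
# Illustrative simplification of a crisis-scenario benchmark (not the actual
# Critical Foreign Policy Decisions Benchmark code). `ask_model` is a
# hypothetical LLM call that returns one of the listed options.
ESCALATORY = {"mobilize forces", "issue ultimatum", "launch limited strike"}
COOPERATIVE = {"open negotiations", "propose ceasefire", "refer to mediation"}

def escalation_rate(scenarios: list[dict], ask_model) -> float:
    """Share of scenarios in which the model selects an escalatory option."""
    escalatory_choices = sum(
        1 for s in scenarios
        if ask_model(s["vignette"], s["options"]) in ESCALATORY
    )
    return escalatory_choices / len(scenarios)
```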
There will need to be a large and diverse benchmarking community, including academic institutions and nonprofits, that creates specific use cases testing everything from compliance with the LOAC to mastery of operational art across joint campaigns, competition, and deterrence. This repository will need to balance speed of decisionmaking as a metric against checks for common logical and analytical fallacies, as well as test whether models that mirror human thinking captured in text suffer from the same cognitive traps.
To close this gap, the defense community must establish a new kind of evaluation architecture: one that measures uplift rather than output. In commercial AI, success is measured in dollars or user engagement. In defense, it must be measured in decision time saved, error reduction, or mission reliability under uncertainty. These units must be context-specific, differing for targeting, logistics, cyber defense, and intelligence fusion.
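In practice, measuring uplift means running the same task with and without the AI system and comparing outcomes. The sketch below assumes data from a controlled experiment comparing a human-only planning cell with an AI-assisted one; all trial values are hypothetical.

```python
# Sketch of an uplift comparison: the same planning task run by a human-only
# cell and an AI-assisted cell. All trial values below are hypothetical.
from statistics import mean

def uplift(baseline: list[float], assisted: list[float]) -> float:
    """Relative improvement of the assisted condition over the human-only baseline."""
    return (mean(baseline) - mean(assisted)) / mean(baseline)

decision_minutes_baseline = [95, 110, 88, 102]   # human-only trials (hypothetical)
decision_minutes_assisted = [70, 82, 75, 69]     # AI-assisted trials (hypothetical)
error_rate_baseline = [0.18, 0.22, 0.15, 0.20]
error_rate_assisted = [0.12, 0.10, 0.16, 0.11]

print(f"Decision-time uplift: {uplift(decision_minutes_baseline, decision_minutes_assisted):.0%}")
print(f"Error-rate uplift: {uplift(error_rate_baseline, error_rate_assisted):.0%}")
```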
Without credible benchmarking, AI adoption in defense risks becoming a faith-based enterprise. Leaders cannot make informed tradeoffs between automation and human judgment. Programs cannot prioritize investments based on measurable return. And companies cannot prove that their algorithms provide a genuine decision advantage rather than just "AI theater."
To that end, CSIS Futures Lab proposes to launch a new initiative, the Defense Benchmarking Suite, to develop the methodology, data infrastructure, and evaluation frameworks that bridge this gap. The project will convene partners from academia, nonprofits, industry, and defense to codevelop a family of uplift metrics tailored to mission areas such as situational awareness, logistics optimization, and decision support. Furthermore, it will open some of these metrics to outside inquiry, which will eventually lead to better-refined measures. The goal is not to reinvent the benchmarking wheel but to align it with the terrain of human conflict, where reliability and adaptability matter more than synthetic scores or speed.
Benchmarking is not just about performance. It is about trust. Trust requires evidence. And evidence demands metrics that matter in the context of strategy, not just computation.
Benjamin Jensen is director of the Futures Lab and a senior fellow for the Defense and Security Department at the Center for Strategic and International Studies (CSIS). Yasir Atalan is the deputy director of the Futures Lab and a data fellow for the Defense and Security Department at CSIS.
Commentary is produced by the Center for Strategic and International Studies (CSIS), a private, tax-exempt institution focusing on international public policy issues. Its research is nonpartisan and nonproprietary. CSIS does not take specific policy positions. Accordingly, all views, positions, and conclusions expressed in this publication should be understood to be solely those of the author(s).
© 2025 by the Center for Strategic and International Studies. All rights reserved.