Photo: Dr. Benjamin Jensen/Created using the Midjourney web app
Commentary by Ian Reynolds
Published July 28, 2025
Recent action taken by the Trump administration in the form of the AI Action Plan and additional executive orders (EOs) further establishes U.S. AI policy, aiming to spur innovation and adoption in both the private and public sectors. The administration's new policy guidance argues that "Americans will require reliable outputs from AI" in order to attain the range of benefits that the technology could bring to broader society.
AI must perform reliably to be effective across its many applications. Without reliable AI, the risks of integrating the technology into critical use cases, such as government services or national security, are unacceptable. Imagine, for example, a defense analyst expected to leverage the data analysis capabilities of a large language model to help make sense of rapidly unfolding global events. An AI system that performs unreliably in this context poses a clear risk to the analyst's ability to do their job. Robust and reliable AI is therefore fundamental to achieving the goals laid out in the Trump administration's AI Action Plan and related EOs.
The Trump administration's interest in reliable AI creates an opportunity for novel policy action: federal support for benchmarking of foundation models, that is, studies that assess AI performance on domain-specific tasks. Benchmarking and evaluating models are not silver bullets, but they offer practical tools for achieving better-performing AI across use cases by systematically tracking model tendencies and possible failure modes.
Because of how foundation models are trained, they will inevitably contain some form of underlying bias. That does not mean reliable AI performance is impossible, but a short technical discussion helps flesh out the argument. Foundation models are trained on available data, commonly scraped from internet sources or licensed from specific partners. Further fine-tuning typically involves some form of human scoring and review of model responses to align models with specific desirable outputs. Because humans create the data on which foundation models are trained, and humans are inherently biased, these biases, in a variety of forms, will be reflected in the models: foundation models are a human, socially conditioned product. As the author and his colleagues have previously argued, "technologies do not simply appear from the anti-social void. They are products of human development and, consequently, are embedded within social relations." Because humans are fundamental parts of AI development lifecycles, completely eliminating bias is an unreasonable goal.
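To make the point concrete, the sketch below shows, in deliberately simplified Python, how human judgments can become fine-tuning data. It is a minimal illustration, not any lab's actual pipeline: the data structure, function name, and example annotation are hypothetical. The mechanism it depicts is the relevant one: whichever response an annotator prefers becomes the target the model is later trained toward, so the annotator's subjective judgments travel directly into the model.

```python
# Minimal, illustrative sketch (not any lab's actual pipeline) of how human
# review becomes fine-tuning data: annotators compare two candidate responses,
# and the preferred one is kept as the training target. Whatever biases the
# annotators hold are carried directly into the resulting dataset.
from dataclasses import dataclass


@dataclass
class PreferenceRecord:
    prompt: str
    chosen: str      # response the human annotator preferred
    rejected: str    # response the annotator scored lower


def build_preference_dataset(annotations):
    """Convert raw annotator scores into preference pairs for fine-tuning."""
    records = []
    for item in annotations:
        a, b = item["response_a"], item["response_b"]
        # The annotator's (inherently subjective) score decides which response
        # the model is trained to imitate.
        if item["score_a"] >= item["score_b"]:
            records.append(PreferenceRecord(item["prompt"], a, b))
        else:
            records.append(PreferenceRecord(item["prompt"], b, a))
    return records


# Hypothetical annotation produced during human review.
example = [{
    "prompt": "Summarize the latest developments in the region.",
    "response_a": "A cautious, sourced summary.",
    "response_b": "A confident but speculative summary.",
    "score_a": 4,
    "score_b": 2,
}]

print(build_preference_dataset(example))
```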
This does not mean, however, that models cannot be shaped to perform reliably through proper evaluation and assessment. The impact of language models' inherent biases can be mitigated through benchmarking and subsequent training that improves performance on targeted, domain-specific tasks. Benchmarks are critical tools in the AI research community for assessing how foundation models perform on tasks such as coding or quantitative reasoning. The CSIS Futures Lab has pioneered efforts to benchmark AI models in domains with limited ground truth, such as foreign policy decisionmaking, finding key differences in model preferences related to escalation scenarios. Benchmarking is key to obtaining a clear signal on when models are performing in undesirable ways.
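In practice, a domain-specific benchmark is little more than a structured set of prompts plus a scoring rule applied consistently across models. The sketch below shows a minimal harness under that assumption; the benchmark items, the query_model() placeholder, and the simple label-matching scorer are illustrative and are not the Futures Lab's actual methodology.

```python
# A minimal sketch of a domain-specific benchmark harness, assuming a generic
# query_model() function that returns a model's text response. The items and
# scoring rule are illustrative only.
from typing import Callable

BENCHMARK = [
    # Each item pairs a domain-specific prompt with a simple expected label.
    {"prompt": "An adversary mobilizes reserves near a contested border. "
               "Classify the escalation risk as LOW, MEDIUM, or HIGH.",
     "expected": "HIGH"},
    {"prompt": "Two states agree to resume routine diplomatic talks. "
               "Classify the escalation risk as LOW, MEDIUM, or HIGH.",
     "expected": "LOW"},
]


def run_benchmark(query_model: Callable[[str], str]) -> float:
    """Score a model on each benchmark item and return overall accuracy."""
    correct = 0
    for item in BENCHMARK:
        response = query_model(item["prompt"])
        # Simple scoring rule: the expected label must appear in the response.
        if item["expected"] in response.upper():
            correct += 1
    return correct / len(BENCHMARK)


if __name__ == "__main__":
    # Stand-in model for demonstration; in practice this would call a real API.
    accuracy = run_benchmark(lambda prompt: "MEDIUM")
    print(f"Accuracy: {accuracy:.0%}")
```

Running the same harness across different models, or across versions of the same model over time, is what makes differences in tendencies and failure modes visible in a systematic way.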
However, as argued in a recent CSIS report, such efforts to benchmark AI models must be driven from the bottom up and include a range of civil society actors. This can help ensure that model development and deployment are not driven by the targeted interests of large AI firms or government bureaucracies; research has shown that both private and public organizations are prone to pursuing their own parochial interests. To return to the scenario above, targeted, domain-specific benchmarks on the capacity of foundation models to analyze and produce intelligence reports can help improve model performance and increase human users' trust that model outputs are robust. Building such a benchmark should draw on the systematic knowledge of intelligence and international security that exists within civil society organizations such as International Relations departments at U.S. universities.
Importantly, the AI Action Plan recognizes the critical need to build a robust AI "evaluations ecosystem" to improve model performance. To realize this vision and avoid unreliable outputs degrading AI performance across the public and private sectors, the administration should establish mechanisms for funding benchmarking and evaluation across the multitude of AI use cases, ensuring through scientifically rigorous processes that foundation models function as desired. These funding efforts should make AI benchmarking as bottom-up as possible. A bottom-up approach that relies on the domain expertise of American civil society organizations, such as universities and think tanks, can inject critical checks and balances into the benchmarking process. The net result will be AI models that perform more reliably in critical use cases, benefiting model deployment in both the public and private sectors. Finally, the administration should consider creating incentives for AI firms to involve civil society organizations in evaluation and benchmarking, both to obtain domain-specific expertise that is not available within private companies and to involve wider elements of American society in the process of technology development.
Ian Reynolds is the postdoctoral fellow for the Futures Lab at the Center for Strategic and International Studies in Washington, D.C.
Commentary is produced by the Center for Strategic and International Studies (CSIS), a private, tax-exempt institution focusing on international public policy issues. Its research is nonpartisan and nonproprietary. CSIS does not take specific policy positions. Accordingly, all views, positions, and conclusions expressed in this publication should be understood to be solely those of the author(s).
© 2025 by the Center for Strategic and International Studies. All rights reserved.