03/13/2026 | Press release | Distributed by Public on 03/13/2026 11:23
At a glance
As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.
When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.
Today, we are excited to announce the open-source release of AgentRx, an automated, domain-agnostic framework designed to pinpoint the "critical failure step" in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.
The challenge: Why AI agents are hard to debug
Modern AI agents are long-running, multi-tool systems: a single task can span dozens of steps across web interfaces, APIs, and cloud environments, with each step depending on the ones before it.
Traditional success metrics (like "Did the task finish?") don't tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.
Introducing AgentRx: An automated diagnostic "prescription"
AgentRx (short for "Agent Diagnosis") treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to "guess" the error, AgentRx uses a structured, multi-stage pipeline that checks the trajectory step by step.
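The multi-stage idea can be sketched as a set of independent checks, each of which either passes or flags the earliest step that violates it; the pipeline then reports the earliest flagged step. The stage functions below (`no_empty_tool_output`, `no_unknown_tools`) are invented examples, not AgentRx's actual checks.

```python
from typing import Callable, Optional

# A stage inspects a trajectory (a list of step dicts) and returns the
# index of the first step that violates its check, or None if it passes.
Stage = Callable[[list[dict]], Optional[int]]

def no_empty_tool_output(steps: list[dict]) -> Optional[int]:
    """Flag the first step whose tool returned nothing."""
    for i, s in enumerate(steps):
        if s["output"] == "":
            return i
    return None

def no_unknown_tools(steps: list[dict],
                     allowed=frozenset({"search", "write_file"})) -> Optional[int]:
    """Flag the first step that used a tool outside the allowed set."""
    for i, s in enumerate(steps):
        if s["tool"] not in allowed:
            return i
    return None

def diagnose(steps: list[dict], stages: list[Stage]) -> Optional[int]:
    """Run all stages; report the earliest flagged step, if any."""
    flagged = [idx for stage in stages
               if (idx := stage(steps)) is not None]
    return min(flagged) if flagged else None

steps = [
    {"tool": "search", "output": "ok"},
    {"tool": "delete_db", "output": "done"},  # not in the allowed tool set
    {"tool": "write_file", "output": ""},     # empty output
]
print(diagnose(steps, [no_empty_tool_output, no_unknown_tools]))  # → 1
```

Because each check is an explicit, inspectable function rather than a single opaque LLM judgment, the diagnosis comes with evidence: the stage that fired tells you *which* constraint the flagged step violated.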
A new benchmark for agent failures
To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains.
Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a "Plan Adherence Failure" (where the agent ignored its own steps) and an "Invention of New Information" (hallucination).
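A failure taxonomy like this is useful precisely because the categories are machine-checkable labels rather than free-form notes. The sketch below illustrates the two categories named above with a toy labeling rule; it is not the AgentRx classifier, and the remaining seven category names ship with the benchmark rather than being invented here.

```python
from enum import Enum

class FailureCategory(Enum):
    # Two of the nine categories named in the announcement; the full
    # taxonomy is released with the AgentRx Benchmark.
    PLAN_ADHERENCE_FAILURE = "agent ignored its own stated plan"
    INVENTION_OF_NEW_INFORMATION = "agent hallucinated facts or tool output"

def label(planned_action: str, taken_action: str,
          claimed_result: str, observed_result: str):
    """Toy labeling rule, illustrative only.

    Hallucination (claiming a result the tool never produced) is checked
    first, then deviation from the agent's own plan.
    """
    if claimed_result != observed_result:
        return FailureCategory.INVENTION_OF_NEW_INFORMATION
    if taken_action != planned_action:
        return FailureCategory.PLAN_ADHERENCE_FAILURE
    return None  # no failure detected by these two checks
```

For example, an agent that runs the tests it planned to run but reports "all passed" when two failed would be labeled `INVENTION_OF_NEW_INFORMATION`, while one that silently skips a planned step would be labeled `PLAN_ADHERENCE_FAILURE`.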
Key results
In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines.
By providing the "why" behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.
We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.
We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful, but auditable and reliable.
Acknowledgements
We would like to thank Avaljot Singh and Suman Nath for contributing to this project.