
The rise of agentic AI part 6: Introducing AI Model Versioning and A/B testing for smarter LLM services

Debug, optimize, and secure your AI models with confidence

As agentic AI applications and systems gain traction, delivering reliable, high-performing LLMs and agents becomes challenging due to heterogeneous stacks, non-deterministic behavior, and cost sensitivity across multi-cloud runtimes. Reliable delivery and deployment to production require end-to-end telemetry across the full chain:
UI/services → orchestration/agents (LangChain, LlamaIndex, MCP/A2A) → RAG pipeline (embedding + vector DB) → model gateway (OpenAI, Azure OpenAI, Bedrock, Gemini, Mistral, DeepSeek) → GPU/infra. To support deterministic rollouts and continuous model improvement, teams need standardized tracing/metrics, guardrail signal capture, and automated cost and performance governance.
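
To make that concrete, here is a minimal sketch of what such end-to-end tracing can look like, using plain OpenTelemetry for Python rather than any particular vendor SDK. The span and attribute names are illustrative (loosely modeled on the GenAI semantic conventions), and call_model() is a hypothetical stand-in for your model-gateway client:

    # Minimal sketch: one trace per request, one nested span per stage of the chain.
    from opentelemetry import trace

    tracer = trace.get_tracer("agentic-service")

    def call_model(question: str) -> str:
        # Hypothetical stand-in for your model-gateway client (OpenAI, Bedrock, ...).
        return "stubbed answer"

    def handle_request(question: str) -> str:
        with tracer.start_as_current_span("service.handle_request"):
            with tracer.start_as_current_span("agent.orchestration"):        # LangChain / LlamaIndex step
                with tracer.start_as_current_span("rag.retrieve") as rag:    # embedding + vector DB lookup
                    rag.set_attribute("db.system", "vector-db")
                with tracer.start_as_current_span("gen_ai.chat") as llm:     # model gateway call
                    llm.set_attribute("gen_ai.system", "openai")
                    llm.set_attribute("gen_ai.request.model", "gpt-4o")
                    answer = call_model(question)
                return answer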

The hidden challenges of AI model management

The invisible bottlenecks

AI models, especially LLMs, are prone to issues like hallucinations, degraded performance, and incorrect outputs. Debugging these problems is often like finding a needle in a haystack. Existing tools fall short in providing a unified view to compare prompts, datasets, or model versions, making it hard to identify regressions or improvements.

Navigating the risks of rapid model evolution

The impact of deprecation and automatic upgrades on cost, performance, and quality

The rapid pace of innovation in the AI space means that providers like OpenAI and Anthropic frequently release new versions of their models, such as GPT-5 or Claude Opus 4.1. While these updates often promise better performance and new capabilities, they can also introduce significant risks for your AI services:

Deprecation of older versions: Providers may discontinue support for older models, forcing you to adopt newer versions without sufficient time to test their impact.

Automatic upgrades: Many AI providers automatically update their underlying models, which can lead to unexpected changes in behavior, degraded performance, or even broken workflows (see the pinning sketch after this list).

Compatibility issues: Changes in model behavior, such as output format or token usage, can disrupt your application's functionality, requiring adjustments to prompts, configurations, or integrations.
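
One common mitigation for the deprecation and automatic-upgrade risks above is to pin a dated model snapshot instead of a floating alias, so behavior only changes when you deliberately switch. Here is a rough sketch with the OpenAI Python SDK; the model IDs are illustrative, so check your provider's current catalog before pinning:

    # Sketch: pin a dated snapshot so a provider-side upgrade can't silently
    # change behavior; switch the constant deliberately once you've validated it.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PINNED_MODEL = "gpt-4o-2024-08-06"   # dated snapshot (illustrative)
    # FLOATING_ALIAS = "gpt-4o"          # alias the provider may upgrade underneath you

    response = client.chat.completions.create(
        model=PINNED_MODEL,
        messages=[{"role": "user", "content": "Summarize today's open incidents."}],
    )
    print(response.choices[0].message.content)

When a provider announces a deprecation, you can promote the newer snapshot behind an A/B test before flipping the constant, rather than being upgraded silently.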

Tracking token usage and managing costs is another uphill battle. Add to this the risk of prompt injection attacks and data leaks, and it's clear that traditional methods are no longer sufficient.

The new AI Model Versioning and A/B testing

Ship better models with confidence. In a single view, compare models and versions to validate improvements and spot bottlenecks across latency, reliability, token usage, cost, and output quality; then drill into prompt-level differences to confirm why a variant wins. When something breaks, follow the request end to end with distributed tracing, from input through orchestration steps and model calls to completion, so you can pinpoint exactly where an error or slowdown originated.

Compare models and versions: Detect bottlenecks and validate improvements in a single view.

Trace prompt failures: Debug errors from input to output with our Distributed Tracing solution.

Monitor costs and token usage: Gain real-time insights into token consumption and cost implications.

Detect security and guardrail risks: Identify and alert on vulnerabilities like prompt injection attacks, toxic responses, or captured PII.

Attach your own attributes like user session, feedback, or dataset ID for additional debugging information.
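
As a rough illustration of the last two points, the sketch below flags a suspicious completion and attaches custom debugging context to the active span using plain OpenTelemetry. The regex-based PII check is a deliberately naive placeholder for a real guardrail, and the attribute names are assumptions rather than a fixed schema:

    # Sketch: attach custom debugging attributes and a naive guardrail signal
    # to whatever span is currently active. The PII regex is a placeholder,
    # not a real detector.
    import re
    from opentelemetry import trace

    EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def enrich_current_span(completion: str, session_id: str, dataset_id: str, feedback: str | None = None) -> None:
        span = trace.get_current_span()
        span.set_attribute("user.session.id", session_id)
        span.set_attribute("dataset.id", dataset_id)
        if feedback is not None:
            span.set_attribute("user.feedback", feedback)
        if EMAIL_PATTERN.search(completion):          # crude stand-in for PII detection
            span.set_attribute("guardrail.pii_detected", True)
            span.add_event("guardrail.violation", {"type": "pii"})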

How it works

With AI Model Versioning, you can track metadata such as model version, dataset ID, and hyperparameters.

A/B testing lets you expose different user segments to model variations, providing data-driven insights into performance metrics like accuracy and cost.
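
Here is a minimal sketch of how such a split can work: each user is deterministically hashed into a variant, and the variant plus model version are recorded on the active span so results can later be grouped and compared. Variant names, model IDs, and the traffic split are all illustrative:

    # Sketch: deterministically assign each user to a model variant and tag the
    # active span so traces and metrics can be compared per variant.
    import hashlib
    from opentelemetry import trace

    VARIANTS = {
        "control":   {"model": "gpt-4o-2024-08-06", "traffic": 0.8},
        "candidate": {"model": "gpt-4o-2024-11-20", "traffic": 0.2},
    }

    def assign_variant(user_id: str) -> str:
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return "control" if bucket < VARIANTS["control"]["traffic"] * 100 else "candidate"

    def choose_model(user_id: str) -> str:
        variant = assign_variant(user_id)
        span = trace.get_current_span()
        span.set_attribute("experiment.variant", variant)
        span.set_attribute("gen_ai.request.model", VARIANTS[variant]["model"])
        return VARIANTS[variant]["model"]

Hash-based assignment keeps a given user in the same variant across sessions, which keeps latency, cost, and quality comparisons between the two models clean.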

Instrument in minutes: Use the supported OpenTelemetry-based SDK to instrument your service and capture prompts, completions, token usage, errors, and guardrail signals. You can also enrich spans with attributes like model.version, dataset.id, user/session, and feedback for deeper analysis.
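
As a vendor-neutral sketch of that step, the snippet below wraps a single model call in an OpenTelemetry span and records prompt, completion, token usage, errors, and a rough cost estimate. The attribute names loosely follow the GenAI semantic conventions, and the per-token prices are made-up placeholders:

    # Sketch: wrap one model call in a span and record prompt, completion,
    # token usage, errors, and a rough cost estimate. Prices are placeholders.
    from openai import OpenAI
    from opentelemetry import trace

    client = OpenAI()
    tracer = trace.get_tracer("agentic-service")

    INPUT_PRICE_PER_1K = 0.0025    # illustrative USD per 1K input tokens
    OUTPUT_PRICE_PER_1K = 0.0100   # illustrative USD per 1K output tokens

    def traced_completion(model: str, prompt: str) -> str:
        with tracer.start_as_current_span("gen_ai.chat") as span:
            span.set_attribute("gen_ai.request.model", model)
            span.set_attribute("gen_ai.prompt", prompt)            # apply your redaction policy
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
            except Exception as exc:
                span.record_exception(exc)
                span.set_status(trace.StatusCode.ERROR)
                raise
            usage = response.usage
            completion = response.choices[0].message.content
            span.set_attribute("gen_ai.completion", completion)
            span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
            span.set_attribute("gen_ai.cost.estimate_usd",
                               usage.prompt_tokens / 1000 * INPUT_PRICE_PER_1K
                               + usage.completion_tokens / 1000 * OUTPUT_PRICE_PER_1K)
            return completion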

Start analyzing out of the box: Once data is flowing, the AI Observability app provides ready-made dashboards and distributed tracing so you can compare models/versions, monitor costs and tokens, and debug prompt failures end to end. No extra setup is required; you can try it out on the Dynatrace Playground right now.

By combining observability, AI-driven insights, and organizational knowledge, we're enabling systems that don't just react but learn and adapt. Each critical issue or incident you resolve fuels a living knowledge base, paving the way for proactive incident prevention through alerting.

What's next?

We're committed to enhancing these capabilities further. Upcoming updates will include a dedicated app experience for multi-model and multi-cloud setups, advanced visualization tools, enhanced security features, intelligent forecasting, and alerting for cost/performance and guardrail optimization.

Get started today

Ready to revolutionize your AI services? Here's how:

  1. Sign up for a free trial.
  2. Install the AI Observability app.
  3. Explore the AI Model Versioning ready-made dashboard, or check it out on the Dynatrace Playground.

Together, let's build smarter, more reliable AI systems.
