06/26/2026 | Press release | Distributed by Public on 06/26/2026 10:02
Large language models and agents are rapidly transforming how organizations build software, automate workflows, and interact with data. From copilots to autonomous agents, AI-powered systems are increasingly responsible for answering questions, generating code, and supporting operational decisions. But as organizations move from experimentation to production, measuring performance reliably is no longer optional; this is where LLM evaluations become essential.
This is the second post in our series on LLM evaluations. In the companion post, Evaluate LLM and agent quality in Dynatrace AI Observability with dt-evals, we showed you how to run online evaluations against real GenAI prompt traces and bring quality scores into Dynatrace AI Observability alongside latency, cost, and errors. This post steps back to the fundamentals: what evaluations are, how they work, and the methods teams use to measure AI quality.
Just as traditional software relies on testing frameworks to ensure reliability, AI systems require robust evaluation frameworks to measure the quality, accuracy, and safety of model outputs. Evals are the primary mechanism by which teams build trust in, iterate on, and responsibly deploy AI systems. Without them, organizations risk deploying systems that produce unreliable answers, hallucinate facts, or quietly degrade in performance over time.
Traditional software produces deterministic outputs - the same input consistently returns the same result, making pass/fail testing straightforward. LLMs are probabilistic systems: the same prompt can produce different responses depending on context, temperature, and model behavior. This variability makes conventional testing methods insufficient.
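To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary. The candidate tokens and their scores are made-up values for illustration, not output from any real model, and production LLMs sample over far larger vocabularies.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores, chosen only for illustration.
tokens = ["Paris", "Lyon", "Berlin"]
logits = [3.0, 1.5, 0.5]

rng = random.Random(0)
for temperature in (0.2, 1.0, 2.0):
    draws = [sample_token(tokens, logits, temperature, rng) for _ in range(10)]
    print(temperature, draws)
```

At a low temperature the top-scoring token dominates the draws; at higher temperatures the alternatives appear more often, which is why a single expected string is a fragile test oracle.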
Instead of verifying a single correct output, teams must evaluate across multiple dimensions simultaneously:
This transforms evaluation from simple pass/fail checks into continuous measurement of AI quality.
Figure 1. Prompt stream with evaluation results shown in AI Observability app
The most well-known consequence of probabilistic generation is hallucination - when a model produces plausible-sounding but factually incorrect information. This happens because LLMs predict likely word sequences rather than verify facts, which enables powerful reasoning but introduces serious risk in enterprise environments where accuracy is critical.
Addressing this requires evaluation frameworks that track signals like factual accuracy, semantic similarity, groundedness in source data, and consistency across responses. These metrics transform subjective quality judgments into measurable, improvable signals.
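As a rough illustration of how such signals can become numbers, the sketch below computes two crude lexical proxies: consistency as the mean pairwise Jaccard overlap between several responses to the same prompt, and groundedness as the share of response words that also appear in the source text. These definitions are simplifying assumptions for this example; real pipelines typically rely on embeddings or model-based graders instead of word overlap.

```python
import re
from itertools import combinations

def word_set(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def consistency(responses):
    """Mean pairwise Jaccard overlap across responses to the same prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def groundedness(response, source):
    """Fraction of response words that also occur in the source text."""
    words = word_set(response)
    if not words:
        return 0.0
    return len(words & word_set(source)) / len(words)

source = "The invoice is due on the first of March."
print(groundedness("The invoice is due in March.", source))
print(consistency(["Due in March.", "It is due in March.", "Due in April."]))
```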
An LLM evaluation is a systematic process of testing a model or AI-powered system to determine whether it meets a defined standard of quality. That standard could be factual accuracy, helpfulness, safety, tone, latency, cost-efficiency, or any other measurable dimension that matters to the application.
Evaluations translate vague product goals ("the assistant should be helpful and safe") into concrete, repeatable measurements. They allow teams to:
Evals exist on a spectrum of formality, from a small hand-curated test set run locally, to a large, automated pipeline running thousands of test cases in CI/CD on every deployment.
Figure 2. AI Evaluation & Agentic App Performance dashboard showing dt-evals results in Dynatrace AI Observability
At their core, evaluations follow a consistent pattern regardless of their complexity:
This loop can run manually during development, automatically in CI/CD pipelines, or continuously against live production traffic.
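A minimal sketch of such a loop follows. It assumes one common shape (take test cases, run the system under test, score each output, aggregate the scores); the names `EvalCase`, `run_eval`, `echo_upper`, and `exact` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_eval(cases, system, grader):
    """Run every case through the system, score each output, and aggregate."""
    results = []
    for case in cases:
        output = system(case.prompt)
        score = grader(output, case.expected)
        results.append({"prompt": case.prompt, "output": output, "score": score})
    mean = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"mean_score": mean, "results": results}

# Stand-ins for a real AI system and a real grader (assumptions for the sketch).
def echo_upper(prompt):
    return prompt.upper()

def exact(output, expected):
    return 1.0 if output == expected else 0.0

cases = [EvalCase("ok", "OK"), EvalCase("hi", "HI"), EvalCase("no", "yes")]
summary = run_eval(cases, echo_upper, exact)
print(summary["mean_score"])
```

Because the system and the grader are passed in as plain callables, the same function can be invoked from a notebook, a CI job, or a worker that reads sampled production traces.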
This is one of the most common points of confusion in the space.
LLM evaluations are the broader discipline - the full process described above. They encompass everything from how you define success to how you collect test data to how you score outputs to how you act on results.
LLM-as-a-Judge is one specific scoring method that can be used within an evaluation pipeline. It involves using a language model (often a strong general-purpose model like GPT-5 or Claude Sonnet 4.6) to automatically assess the quality of another model's outputs.
Think of it this way: evaluations are the framework, and LLM-as-a-Judge is one type of grader you can plug into that framework - alongside code-based graders, human graders, or embedding-based similarity checks.
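The plug-in idea can be sketched as follows. Every grader shares one signature, and the judge is just another grader that happens to call a model. The prompt wording, the 1-to-5 scale, and the `call_model` callable are assumptions for illustration; a real judge would call whichever model API your stack uses, and here it is replaced by a stub so the example runs offline.

```python
import re
from typing import Callable

# Every grader maps (question, answer) to a score between 0.0 and 1.0.
Grader = Callable[[str, str], float]

JUDGE_PROMPT = (
    "Rate the answer to the question on a scale of 1 to 5 for factual accuracy.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with the number only."
)

def make_llm_judge(call_model: Callable[[str], str]) -> Grader:
    """Wrap a model-calling function (hypothetical) as a grader."""
    def grade(question: str, answer: str) -> float:
        reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
        match = re.search(r"\b[1-5]\b", reply)
        if match is None:
            return 0.0  # an unparseable verdict counts as a failure in this sketch
        return (int(match.group()) - 1) / 4  # map 1..5 onto 0.0..1.0
    return grade

def non_empty_grader(question: str, answer: str) -> float:
    """A code-based grader with the same signature as the judge."""
    return 1.0 if answer.strip() else 0.0

def fake_judge_model(prompt: str) -> str:
    """Stub standing in for a real judge model call."""
    return "4"

graders = {"judge": make_llm_judge(fake_judge_model), "non_empty": non_empty_grader}
for name, grader in graders.items():
    print(name, grader("What is the capital of France?", "Paris"))
```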
Code-based evaluations use deterministic functions - written in Python or any other language - to score model outputs. No secondary LLM is involved.
You write a function that takes the model output as input and returns a score. The function might check for exact string matches, run regex patterns, execute generated code and test it, parse JSON and validate its structure, call an external API to verify a fact, or compare numerical results.
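Three of these checks might look like the following in Python. This is an illustrative sketch: the function names and the 0.0/1.0 scoring convention are assumptions, not part of any particular eval framework.

```python
import json
import re

def exact_match(output, expected):
    """Score 1.0 when the output equals the expected string, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def matches_pattern(output, pattern):
    """Score 1.0 when the regular expression is found anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0

def valid_json_with_keys(output, required_keys):
    """Score 1.0 when the output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0

print(exact_match(" 42 ", "42"))
print(matches_pattern("Order ID: AB-1234", r"[A-Z]{2}-\d{4}"))
print(valid_json_with_keys('{"name": "x", "age": 3}', ["name", "age"]))
```

Graders like these are cheap and repeatable, so they can run on every test case without adding model cost or judge variance.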
Code-based LLM evals are the first tool to reach for whenever a task has a clear, verifiable answer. They form the backbone of any reliable eval suite.
These two modes are not competing approaches - they're complementary phases of a complete evaluation strategy.
Offline evals run against a static, pre-collected dataset before a system reaches production. They're the evaluation equivalent of unit and integration tests in software development.
Key advantage: Full control over the test distribution and ground-truth labels.
Key limitation: The dataset may not reflect real user behavior or the long tail of production inputs.
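In practice, an offline eval often acts as a release gate. The sketch below assumes a small labeled dataset stored as JSON Lines and a pass-rate threshold; the dataset contents, the `toy_system` stand-in, and the 0.9 threshold are invented for illustration.

```python
import json

# A static, pre-collected dataset with ground-truth labels (made-up examples).
DATASET_JSONL = """\
{"id": "q1", "prompt": "2 + 2", "expected": "4"}
{"id": "q2", "prompt": "capital of France", "expected": "Paris"}
{"id": "q3", "prompt": "3 * 3", "expected": "9"}
"""

def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def offline_eval(cases, system, threshold):
    """Compare outputs to labels and decide whether the pass rate clears the gate."""
    if not cases:
        raise ValueError("offline eval needs at least one test case")
    failures = [c["id"] for c in cases if system(c["prompt"]).strip() != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures, "passed": pass_rate >= threshold}

def toy_system(prompt):
    """Stand-in for the AI system under test."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")

report = offline_eval(load_cases(DATASET_JSONL), toy_system, threshold=0.9)
print(report)
```

Returning the failing case IDs alongside the pass rate makes a blocked deployment easier to debug than a bare pass/fail flag.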
Online evals run against live production traffic in real time or near real time. They observe what is actually happening when real users interact with the system.
Key advantage: Captures real-world usage patterns, prompts, and failure modes from production traffic, giving teams the most representative signal for monitoring AI quality over time.
Key limitation: No pre-defined labels; scoring must rely on heuristics, implicit signals (thumbs up/down, re-prompts), or async LLM-as-a-Judge pipelines.
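Two of these label-free techniques can be sketched briefly: a rolling approval rate built from thumbs up/down feedback, and deterministic sampling that selects a fraction of traces for asynchronous judging. The class name, window size, and hash-based sampling scheme are assumptions for illustration, not a description of any specific product.

```python
import hashlib
from collections import deque

class RollingFeedback:
    """Track the approval rate over the most recent feedback events."""

    def __init__(self, window):
        self.signals = deque(maxlen=window)  # older events drop off automatically

    def record(self, thumbs_up):
        self.signals.append(1 if thumbs_up else 0)

    def approval_rate(self):
        if not self.signals:
            return None  # no feedback yet
        return sum(self.signals) / len(self.signals)

def should_sample(trace_id, rate):
    """Deterministically pick roughly `rate` of traces for asynchronous judging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate

feedback = RollingFeedback(window=3)
for vote in (True, True, False, False):
    feedback.record(vote)
print(feedback.approval_rate())
print(should_sample("trace-123", 0.1))
```

Hashing the trace ID rather than drawing a random number means the same trace is always either in or out of the sample, which keeps re-runs and backfills consistent.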
The field of LLM evals is evolving rapidly. As enterprises deploy increasingly autonomous AI systems, evaluation can play an important role in improving AI accuracy, reliability, and safety.
Here are the trends worth watching and investing in:
Organizations that invest early in robust evaluation frameworks and combine them with AI observability will be positioned to scale AI safely across their operations.
See our companion blog post, Evaluate LLM and agent quality in Dynatrace AI Observability with dt-evals, to learn how dt-evals lets you run LLM-as-a-judge evaluations on real GenAI traces and turn AI quality into a queryable, trendable, and alertable signal inside Dynatrace AI Observability.
Because in the end, AI systems are only as trustworthy as the processes used to evaluate them.