Unlock advanced query functionality with distribution metrics

As organizations break down monolithic applications in favor of more distributed, microservices-based architectures, they need to collect increasing amounts of metric data. But how do you summarize this data to provide insights at scale? Averages are simple to calculate but can be misleading, especially in complex, distributed environments where outlier values skew the average.

Percentiles are more informative, but computing precise, accurate percentiles across millions of metric values is a substantial task. Calculating a percentile exactly requires you to retain all the raw values (percentiles can't be reaggregated without sacrificing accuracy), sort them, and return the value whose rank matches the percentile. When millions of values need to be summarized, storing and sorting them all leads to impractically high computational costs.
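To see why this gets expensive, here's a minimal Python sketch of the exact approach just described; the function name and sample data are illustrative:

```python
import math

def exact_percentile(values, p):
    """Exact percentile: keep every raw value, sort, and index by rank."""
    ordered = sorted(values)                    # O(n log n) over ALL raw values
    rank = math.ceil(p / 100 * len(ordered))    # rank of the pth percentile
    return ordered[max(rank - 1, 0)]

# Every host's raw values must be centralized before sorting, so the
# memory and compute cost grows with the total number of values.
latencies_ms = [98, 110, 87, 1405, 92, 280, 101, 95, 89, 1120]
print(exact_percentile(latencies_ms, 95))  # -> 1405
```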

You can use distribution metrics to help solve these challenges. With distribution metrics, you can measure globally accurate percentiles and enable enhanced query functionality for your distributed systems and applications.

In this post, we'll explain how to use distribution metrics to:

- Calculate globally accurate percentiles
- Visualize large-scale data distributions by using heatmaps
- Set SLOs with threshold queries
- Create precise, granular monitors
- Identify statistically calculated outliers with standard deviations
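Before diving in, here's a quick illustration of how you might submit values to a distribution metric from a Python service by using the DogStatsD client in the datadog package. This sketch assumes a locally running Datadog Agent with DogStatsD enabled; the metric name and tags are hypothetical:

```python
import time
from datadog import initialize, statsd

# Assumes a Datadog Agent with DogStatsD listening locally (the default).
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def timed_search(query):
    start = time.time()
    time.sleep(0.05)  # stand-in for real search work
    duration_ms = (time.time() - start) * 1000
    # Each raw value goes to the Agent, which folds it into a sketch so
    # percentiles can be computed globally, server-side, across hosts.
    statsd.distribution("search.request.duration", duration_ms, tags=["service:search"])

timed_search("example query")
```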

Calculate globally accurate percentiles to obtain the right insights

A simple illustration of how averages can provide misleading information is a customer satisfaction survey. Suppose you sent a survey to 10 customers to understand the likelihood of customer churn. Based on your domain expertise, you know that customers tend to churn when they give a rating of 4 or below.

Let's say that half of the customers gave the worst rating of 1 while the other half gave the best rating of 10. If you relied on the average customer rating of 5.5, you would not worry about churn. You would then be surprised when five customers churned. The average is not a useful measure in this case because it fails to represent the happy customers' ratings and the unhappy customers' ratings.

Now assume that you're tracking search request duration for a service to ensure that your customers have a fast search experience in your application. Consider the results from the following Datadog widget.

Widget that shows an average request duration of 98 ms, a p95 average of 280 ms, and a p99 average of 1,405 ms.

If you were to rely on the average, you might think that your customers' search experience is much faster than it actually is. An average duration of 98 ms seems reasonable. However, many datasets are skewed by extreme outliers. In this case, failed searches (requests that return quickly because of user or validation errors) are skewing the average downward.

To avoid this misleading conclusion, you can use distribution metrics to look at the percentiles of your dataset. The 95th and 99th percentiles show that while 95 percent of your users see search responses in under 300 ms, the experience for the slowest 5 percent is several times slower. This slowness can lower customer engagement with your application, and the problem worsens as your customer base and business grow over time.

Many vendors calculate percentiles at the individual host or agent level, but summarizing percentile values across all hosts produces misleading results because it reaggregates each host's precomputed percentiles. Datadog's distribution metric type instead summarizes all your raw data by using the DDSketch data structure and aggregates a metric's values across all hosts server-side. DDSketches are memory efficient and enable Datadog to efficiently compute any user-defined percentile or standard deviation (in addition to minimum, maximum, sum, average, and count values).
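To make the idea concrete, here's a heavily simplified, illustrative Python sketch in the spirit of DDSketch (not Datadog's actual implementation): positive values are mapped to logarithmic buckets sized by a relative accuracy alpha, so any quantile can be answered within that relative error, and two sketches merge by summing bucket counts:

```python
import math

class ToyDDSketch:
    """Simplified relative-accuracy sketch (illustrative, positive values only)."""

    def __init__(self, alpha=0.01):
        self.gamma = (1 + alpha) / (1 - alpha)  # bucket growth factor
        self.buckets = {}                       # bucket index -> count
        self.count = 0

    def add(self, value):
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] = self.buckets.get(index, 0) + 1
        self.count += 1

    def merge(self, other):
        # Server-side aggregation across hosts is just a bucket-wise sum:
        # no raw values are retained, and no accuracy is lost by merging.
        for index, c in other.buckets.items():
            self.buckets[index] = self.buckets.get(index, 0) + c
        self.count += other.count

    def quantile(self, q):
        rank = max(int(q * self.count), 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= rank:
                # The bucket midpoint is within alpha of the true value.
                return 2 * self.gamma ** index / (self.gamma + 1)
```

Because merging is lossless, per-host sketches can be combined into one global sketch, which is what makes globally accurate percentiles tractable at scale.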

Visualize large-scale data distributions by using heatmaps

Distribution metrics capture high-resolution data that you can visualize by using Datadog heatmaps. With heatmaps, you can get a full picture of what's going on in your environment and identify seasonal patterns over time. For example, consider the following heatmap of a service's latency. The heatmap shows a strong mode at 20 ms, with daily pulses at higher latencies.

After investigating further, we can trace the mode back to a specific health check. When we filter out that check, we can better understand seasonal patterns in the service's latency. If these latencies were instead plotted as aggregated percentiles over time, the low-latency health check calls would distort the overall interpretation of latency.

Set SLOs with threshold queries

After you use heatmaps to explore how your metric's values are distributed over time, you can use threshold queries to create SLOs. SLOs set targets for a service's performance to improve platform stability and the consistency of user experiences.

For example, you might want to ensure the freshness of customer data by tracking how long your platform takes to ingest data before it reaches your storage systems. To build these SLOs, you can define threshold queries to ensure that you're within your error budget for the month. Threshold queries, available only for the distribution metric type, count the number of raw distribution metric values that fall above or below a numerical threshold.

SLO that measures the time it takes your platform to ingest data and send it to your storage layer.

Metric-based SLOs are calculated by dividing the sum of good events by the sum of total events over time. In this example, a good event is defined as a call with latency of less than 30 seconds, so you can count good events with a threshold query that specifies 30 seconds as the threshold. Once the SLO is set up, you can track and set alerts on its error budget and burn rate, and display them as widgets in dashboards.
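The arithmetic behind such an SLO is straightforward. Here's an illustrative Python sketch with hypothetical ingestion latencies and a 99 percent target:

```python
# Hypothetical raw ingestion latencies (in seconds) that a distribution
# metric would capture; a threshold query counts the values below 30.
latencies = [12.0, 8.5, 29.0, 31.2, 7.7, 45.0, 22.1, 9.9, 28.3, 14.6]

THRESHOLD_S = 30     # a "good" event ingests in under 30 seconds
SLO_TARGET = 0.99    # target: 99% of events are good

good = sum(1 for v in latencies if v < THRESHOLD_S)
sli = good / len(latencies)

# Error budget: the fraction of bad events the SLO tolerates.
error_budget = 1 - SLO_TARGET
budget_consumed = (1 - sli) / error_budget

print(f"SLI = {sli:.1%}, error budget consumed = {budget_consumed:.0%}")
```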

Create precise, granular monitors

To ensure that you're in compliance with your SLOs, you need monitors to proactively alert you about issues that might arise. As your business and services scale, you need more granular percentiles to monitor the customer experience and related performance service level agreements (SLAs).

For example, suppose that you maintain a payment service that handles a large number of orders every month. The payment service's queue latency directly impacts your monthly reported revenue, so you need to be alerted about unacceptably high queue latency (even at the 99.99th percentile) and resolve any issue quickly.

Threshold metric monitor that alerts on p99.99 queue latency.

By using a distribution metric, you can set a percentile threshold within metric monitors to alert you when the payment service's p99.99 queue latency exceeds that threshold. You can also set up metric-based change, anomaly, outlier, and forecast monitors. Alerts on the payment service's queue latency help you proactively identify what's blocking the queue, whether your job pods are healthy, and whether you need to scale up.
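A monitor like this could also be created programmatically with Datadog's Python API client. The following is a sketch, not a definitive recipe: the metric name, threshold, and notification handle are hypothetical, and the query syntax should be verified against your own metrics:

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

# Hypothetical monitor: alert when p99.99 queue latency (a distribution
# metric) exceeds 5 seconds, averaged over the last 5 minutes.
body = Monitor(
    name="Payment service p99.99 queue latency is too high",
    type=MonitorType("metric alert"),
    query="avg(last_5m):p99.99:payment.queue.latency{service:payment} > 5",
    message="Queue latency breached the p99.99 threshold. @oncall-payments",
)

configuration = Configuration()  # reads DD_API_KEY / DD_APP_KEY from the environment
with ApiClient(configuration) as api_client:
    api = MonitorsApi(api_client)
    print(api.create_monitor(body=body))
```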

Identify statistically calculated outliers with standard deviations

Additionally, for any timeseries data (for example, CPU usage, server response time, or request rate), you might want to monitor and troubleshoot outlier values that represent a degraded user experience. Instead of guessing at a static threshold value to identify those outliers, you can use the standard deviation aggregator to set precise, statistically calculated thresholds for your metric-based SLOs and metric monitors.

Standard deviation of a distribution metric, as shown on the Metrics Explorer page.

When you monitor resource consumption (for example, network traffic, CPU usage, memory usage), you need to be able to balance consumption across resources and provision them accordingly. With the standard deviation, you can view the historical variation in your resources' consumption relative to the average consumption over time, which is useful for load balancing. You can receive alerts when resources are more than a standard deviation above or below the mean and then investigate whether load is properly balanced.
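As a worked illustration of that alert condition, here's a small Python sketch with hypothetical per-host CPU values, flagging anything more than one standard deviation from the mean:

```python
from statistics import mean, stdev

# Hypothetical per-host CPU usage samples (percent).
cpu = {"host-a": 41.0, "host-b": 44.5, "host-c": 39.8, "host-d": 71.2, "host-e": 42.3}

mu = mean(cpu.values())
sigma = stdev(cpu.values())

# The same condition the alert above describes: a host more than one
# standard deviation from the mean suggests a load imbalance.
outliers = {host: v for host, v in cpu.items() if abs(v - mu) > sigma}
print(f"mean = {mu:.1f}%, stdev = {sigma:.1f}%, outliers = {outliers}")
```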

Start using distribution metrics today

Datadog distribution metrics summarize your data by providing globally accurate percentiles across distributed systems and applications to unlock advanced analysis and monitoring. With distribution metrics, you can use heatmaps to visualize large-scale data distributions, set appropriate SLOs, create precise monitors, and identify outliers. If you don't already have a Datadog account, you can sign up for a 14-day free trial to get started.