01/23/2025 | News release | Distributed by Public on 01/23/2025 11:21
Every day, Stripe processes billions of dollars in payment volume. This figure surges during peak volume periods; over Black Friday and Cyber Monday in 2024, businesses processed more than $31 billion on Stripe. Our users rely on us to maximize their revenue and deliver a seamless experience to their customers, which is why we consistently monitor our global money movement systems to ensure they're operating smoothly.
One approach to monitoring payment performance would be tracking aggregate performance across all payments on our platform. While this would give us a comprehensive overview, it would likely obscure degradations affecting specific segments of traffic. For instance, a card issuer might make system changes that alter the acceptance of a specific payment type (e.g., a UK card issuer begins to decline recurring payments on prepaid cards at high rates). Given the scale of Stripe's processing, that spike in failed payments might not be enough to move global metrics, even though specific businesses in the UK with high use of prepaid cards would feel an acute impact.
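To make the dilution effect concrete, here is a small arithmetic sketch. Every volume and rate in it is hypothetical, chosen only for illustration; none of them are Stripe figures.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Fraction of payment attempts that succeeded."""
    return successes / attempts


# Hypothetical numbers, for illustration only.
GLOBAL_ATTEMPTS = 10_000_000      # all payments in the window
GLOBAL_SUCCESSES = 9_000_000      # 90% global success rate
SLICE_ATTEMPTS = 5_000            # e.g. recurring payments on one issuer's prepaid cards
SLICE_SUCCESSES_BEFORE = 4_500    # 90% success before the issuer change
SLICE_SUCCESSES_AFTER = 2_500     # 50% success after the issuer change

# Payments that used to succeed in the slice and now fail.
lost_payments = SLICE_SUCCESSES_BEFORE - SLICE_SUCCESSES_AFTER

slice_drop = (
    success_rate(SLICE_SUCCESSES_BEFORE, SLICE_ATTEMPTS)
    - success_rate(SLICE_SUCCESSES_AFTER, SLICE_ATTEMPTS)
)
global_drop = (
    success_rate(GLOBAL_SUCCESSES, GLOBAL_ATTEMPTS)
    - success_rate(GLOBAL_SUCCESSES - lost_payments, GLOBAL_ATTEMPTS)
)

print(f"slice drop:  {slice_drop:.2%}")
print(f"global drop: {global_drop:.4%}")
```

Under these assumed numbers, the slice loses 40 percentage points of success rate while the global rate moves by about 0.02 percentage points, which is easily lost in normal variation.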
To solve this problem, we have developed a system that offers near real-time visibility into the performance of subsets, or "slices," of Stripe traffic. We apply a combination of machine learning (ML) and time series algorithms to detect performance degradations across various metrics, including payment success rates, authentication rates, costs, fraud, and more. When our monitoring system detects a degradation, it automatically alerts the relevant experts at Stripe to investigate and resolve the underlying issue. Achieving this capability required us to grapple with three fundamental problems:
We monitor payments in a high-dimensional space characterized by over 16,000 payment-related variables. These include more than 10,000 issuing banks, hundreds of currencies, countries, card products, and payment features (e.g., Apple Pay, mail or telephone orders, account funding transfers). Performance issues might arise from unique combinations of any of these factors (e.g., a spike in failures on digital wallet payments from debit cards on French issuers).
To define slices effectively, we have to balance two goals: slices must be narrow enough to isolate the specific shape of a degradation, yet data-rich enough to detect degradations with high statistical confidence. To do this, we:
Finally, we've invested heavily in platformization: we've built slice monitoring as a general framework that allows engineering teams to develop their own metrics and slices while leveraging core slice-monitoring algorithms and operational tooling. This accelerates their ability to build and deploy powerful, precise detectors.
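The trade-off between narrow and data-rich slices described above can be sketched as follows. This is a minimal illustration, not Stripe's implementation: the function name `candidate_slices`, the dimension names, and the `max_order` and `min_volume` parameters are all assumptions made for the example. It enumerates combinations of payment attributes and keeps only those with enough transactions to support a statistical test.

```python
from collections import Counter
from itertools import combinations


def candidate_slices(transactions, dimensions, max_order=2, min_volume=100):
    """Count transactions per attribute combination and keep data-rich ones.

    A slice key is a tuple of (dimension, value) pairs, e.g.
    (("country", "FR"), ("funding", "debit")).
    """
    counts = Counter()
    for txn in transactions:
        # Combine up to `max_order` dimensions to form progressively narrower slices.
        for order in range(1, max_order + 1):
            for dims in combinations(dimensions, order):
                key = tuple((d, txn[d]) for d in dims)
                counts[key] += 1
    # Drop slices too sparse to evaluate with confidence.
    return {key: n for key, n in counts.items() if n >= min_volume}


# Hypothetical traffic, for illustration only.
transactions = (
    [{"country": "FR", "funding": "debit", "wallet": True}] * 150
    + [{"country": "FR", "funding": "credit", "wallet": False}] * 50
    + [{"country": "UK", "funding": "prepaid", "wallet": False}] * 30
)
slices = candidate_slices(transactions, ("country", "funding", "wallet"))
for key, volume in sorted(slices.items(), key=lambda kv: -kv[1]):
    print(volume, key)
```

Raising `max_order` yields narrower slices at the cost of a rapidly growing number of combinations, and raising `min_volume` trades coverage for statistical confidence. A production system would need a far more selective strategy than brute-force enumeration over thousands of attribute values.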
Once we define slices, our next challenge is to accurately identify performance degradation within them. This task is more complex than it might initially appear, as payment metrics can fluctuate dramatically for valid reasons.
Standard time-series anomaly detection approaches compare the current value of a time series against a baseline derived from historic data. However, for payment slice monitoring, standard anomaly detection is insufficient due to the absence of a stable baseline. There are many sources of underlying variation: customer onboarding, fraud trends, changes in business behavior, and more. For example, running a large free trial or launching a new line of business or product could alter customer composition, which consequently affects payment success rates. An algorithm that doesn't account for these intricacies would likely trigger false positives, hindering the effectiveness of our monitoring system.
To control for underlying changes in transaction composition, we employ a combination of Stripe's machine learning models and time-series analysis. First, we leverage ML models to estimate the probability of success for every transaction in our monitoring dataset (i.e., the expected outcome). These models are trained on Stripe's vast transaction-level datasets. Next, we conduct near real-time, time-series anomaly detection, adjusting for the underlying probability of success.
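One way to picture the adjustment step is to compare observed successes in a time window against the number the model expected, scaled by the variance implied by the per-transaction probabilities. The sketch below is an assumption about how such a score could be computed (a normal approximation to a sum of independent Bernoulli outcomes), not a description of Stripe's actual detector; `adjusted_z_score` and the sample data are hypothetical.

```python
import math


def adjusted_z_score(outcomes, predicted_probs):
    """Standardized gap between observed and model-expected successes.

    outcomes:        1 for a successful payment, 0 for a failure.
    predicted_probs: the model's success probability for each payment.
    """
    observed = sum(outcomes)
    expected = sum(predicted_probs)
    # Variance of a sum of independent Bernoulli trials with different p.
    variance = sum(p * (1.0 - p) for p in predicted_probs)
    if variance == 0:
        return 0.0
    return (observed - expected) / math.sqrt(variance)


# Window A (hypothetical): a free trial shifts the mix toward riskier
# payments. The raw success rate is only 60%, but the model expected that.
z_mix_shift = adjusted_z_score([1] * 600 + [0] * 400, [0.6] * 1000)

# Window B (hypothetical): the mix is unchanged (expected 90% success),
# but only 80% of payments succeed.
z_degraded = adjusted_z_score([1] * 800 + [0] * 200, [0.9] * 1000)

print(f"mix shift: z = {z_mix_shift:.2f}")
print(f"degraded:  z = {z_degraded:.2f}")
```

Window A has the lower raw success rate, yet its score is close to zero because the outcome matches the expectation. Window B scores far below zero because payments that should have succeeded did not. A detector that compared raw rates against a historical baseline would get both cases wrong.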
Accurately detecting performance degradations is the first step in our monitoring process. The next challenge lies in determining how and when to act. Even with highly precise anomaly detection, monitoring tens of thousands of slices inherently leads to false positives from random fluctuations in metrics. Given that, we need to enable rapid detection while avoiding unnecessary alarms caused by transient drops in performance.
To achieve this, we use a finite state machine that aggregates losses over time, only triggering alerts when loss thresholds from sustained events are breached. Alerts are classified based on urgency (derived from the rate of volume loss) and inferred root cause, streamlining the routing to the appropriate team for investigation and remediation.
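A loss-aggregating state machine of this kind might look like the sketch below. The three states, the thresholds, and the names `SliceAlertMachine`, `observe`, and `urgency` are all assumptions made for illustration; the source does not describe the actual states or thresholds Stripe uses, and root-cause classification is omitted.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"      # anomalous, but losses are below the alert threshold
    ALERTING = "alerting"    # cumulative losses breached the threshold


@dataclass
class SliceAlertMachine:
    loss_threshold: float = 100.0    # hypothetical: lost payments needed to alert
    recovery_intervals: int = 3      # quiet intervals needed to reset
    urgent_loss_rate: float = 50.0   # hypothetical: losses per interval for "high"
    state: State = State.HEALTHY
    cumulative_loss: float = 0.0
    anomalous_intervals: int = 0
    quiet_intervals: int = 0

    def observe(self, is_anomalous: bool, loss: float = 0.0) -> State:
        """Feed one monitoring interval for this slice; return the new state."""
        if is_anomalous:
            self.quiet_intervals = 0
            self.anomalous_intervals += 1
            self.cumulative_loss += loss
            if self.cumulative_loss >= self.loss_threshold:
                self.state = State.ALERTING
            elif self.state is State.HEALTHY:
                self.state = State.SUSPECT
        elif self.state is not State.HEALTHY:
            self.quiet_intervals += 1
            if self.quiet_intervals >= self.recovery_intervals:
                self._reset()
        return self.state

    def _reset(self) -> None:
        self.state = State.HEALTHY
        self.cumulative_loss = 0.0
        self.anomalous_intervals = 0
        self.quiet_intervals = 0

    def urgency(self) -> str:
        """Classify urgency from the average loss per anomalous interval."""
        if self.state is not State.ALERTING:
            return "none"
        rate = self.cumulative_loss / self.anomalous_intervals
        return "high" if rate >= self.urgent_loss_rate else "low"


# A transient blip never alerts; a sustained degradation does.
blip = SliceAlertMachine()
blip.observe(True, loss=20)
for _ in range(3):
    blip.observe(False)

sustained = SliceAlertMachine()
for _ in range(3):
    sustained.observe(True, loss=40)

print(blip.state, sustained.state, sustained.urgency())
```

In this sketch, a single anomalous interval moves the slice only to a suspect state, and three quiet intervals clear it. An alert fires once accumulated losses cross the threshold, and a faster rate of loss maps to a higher urgency.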
Solving these problems has resulted in a slice monitoring platform that identifies real degradations in payment performance each day with a precision exceeding 90%. This level of effectiveness gives us excellent coverage without generating an unsustainable operational burden from false positives.
As we continue to refine and expand our slice monitoring capabilities, we're committed to sharing this powerful tool with you. Later this year, we plan to make slice monitoring alerts available directly to select users. Additionally, all users will have full visibility into payment success rates in the Payments analytics page of the Stripe Dashboard, where you'll also find recommendations for performance optimization. For more information on how to increase your revenue on Stripe, check out Stripe's payments performance suite. We'll also be talking more about this at Stripe Sessions 2025, so come join us.