Aligning SRE And Security For Better Incident Response

In this series, we looked at why we combined our SRE and security teams into one cohesive group, and how we made that happen. With this combined approach, we set out to build our internal platform and customer-facing products with a security-first mindset, while still drawing upon the deep expertise of our existing SRE practices. Combining the teams improved the way we build tools for both our engineers and customers and strengthened our ability to mitigate risks. This post focuses on how that approach improved our incident response and how you can apply these lessons to your own organization.

As we worked to integrate security with SRE, we identified common patterns in our day-to-day operations that showed why alignment would benefit both sides. These included duplicate tooling and processes, frequent incidents that spanned both reliability and security domains, and notable gaps in coverage for audit logging and change control. Before we could improve our incident response, we first had to address some of these central pain points for both groups, such as knowledge silos and gaps in our monitoring capabilities.

Addressing these issues resulted in platform-wide configuration baselines, comprehensive team documentation, shared runbooks and dashboards, and cross-functional exercises to strengthen incident response. These assets helped us foster a blameless, collaborative approach to managing incidents by giving our engineers the tools they needed to work together efficiently.

Configuration baselines, which establish the standards for how we should set up our systems, help us shift controls left to prevent incidents and also guide our response when something goes wrong. Our baselines treat compliance requirements as the bare minimum but scale to help us focus on the most impactful platform issues and align with our golden paths.

By establishing clear, platform-wide standards for security and reliability, our teams have a shared reference point that makes it easier to identify whether an issue is urgent enough to be declared an incident or should simply be tracked as a bug or vulnerability instead. This consistency reduces hesitation and ensures that problems aren't ignored.

We maintain a list of cloud-agnostic baselines, which are routinely reviewed and updated as needed. Our methodology for creating these standards involves evaluating individual rules against the following questions:

  • Can we reasonably enforce this rule with minimal triaging and custom logic?
  • Is the risk that this rule addresses sufficient to warrant expedited attention?
  • Are there legitimate reasons why this rule might frequently generate false positives?
  • Is this rule a standard best practice or a security concern?

Every rule is assigned a severity level, which helps us prioritize findings and ties directly into our criteria for declaring an incident versus creating a bug or vulnerability report. Not every misconfiguration or vulnerability warrants an early-morning page, so we define these baselines clearly up front so that engineers aren't left guessing.

In practice, these baselines function as both preventive guardrails and decision-making tools during incidents. For example, if a baseline requires that all production databases be encrypted, we can immediately classify a discovered unencrypted volume as high severity. On the other hand, a misconfiguration that has existed quietly for two months may not trigger an urgent incident, but it should still be monitored and assessed.
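To make this concrete, here is a minimal sketch of how a baseline rule and its severity could drive the incident-versus-bug decision. The rule names, fields, and threshold are illustrative assumptions, not a description of our actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class BaselineRule:
    """A platform-wide configuration baseline (fields are illustrative)."""
    rule_id: str
    description: str
    severity: Severity
    enforceable: bool  # can we check this automatically with minimal triaging?

@dataclass
class Finding:
    rule: BaselineRule
    resource: str

# At or above this severity, a finding is treated as an incident rather than a bug.
INCIDENT_THRESHOLD = Severity.HIGH

def triage(finding: Finding) -> str:
    """Decide whether a finding becomes an incident or a tracked bug/vulnerability."""
    if finding.rule.severity.value >= INCIDENT_THRESHOLD.value:
        return f"declare incident: {finding.rule.rule_id} on {finding.resource}"
    return f"file bug/vulnerability: {finding.rule.rule_id} on {finding.resource}"

# Example: an unencrypted production database volume violates a high-severity baseline.
encryption_rule = BaselineRule(
    rule_id="db-encryption-at-rest",
    description="All production databases must be encrypted at rest",
    severity=Severity.HIGH,
    enforceable=True,
)
print(triage(Finding(rule=encryption_rule, resource="prod-orders-db")))
```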

This alignment between baselines and escalation paths reduces hesitation in addressing an issue. With it, engineers can confidently declare an incident because they have the data they need to do so. It also helps ensure that we don't ignore important problems simply because they don't fit a narrow definition of an incident.

Over time, incidents surface gaps in our baseline configurations as well. For example, if our investigation during a security incident reveals missing audit logs, we will adjust our requirements for logging configurations, such as retention periods and formats, where necessary. We also continually update our threat detections based on the cause of a security incident, such as a threat actor attempting to compromise accounts. These iterative updates ensure our baselines remain effective, and they create a consistent system that helps us mitigate configuration drift, respond efficiently to high-risk issues, and strengthen both the security and reliability of our platform.

Merging our SRE and security groups required a shared understanding of expectations during incidents, so we unified guidelines and tooling for both security- and reliability-related events. These steps ensure that security incidents follow the same patterns and timelines as any other operational incident, and that familiarity makes incidents less intimidating to manage.

To set these expectations, we considered the following questions:

  • Who do we bring in during an incident?
  • What should their response time be?
  • What steps are they expected to take for each incident?

These questions let us define role-specific guidelines so that everyone working on an incident is confident in their responsibilities and support. For example, all relevant security teams go through our standard incident management training, and gaps in our response protocols are remediated with approval from security leadership. Within this shared framework, we introduced security leads, a new role that not only drives security incidents but also provides relevant context and direction during other types of critical events.
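As a rough illustration of how role-specific expectations might be encoded, the sketch below models incident roles with expected response times and responsibilities. The role names, times, and duties are hypothetical placeholders rather than our actual guidelines.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRole:
    """Role-specific expectations for responders (values are hypothetical)."""
    name: str
    response_time_minutes: int  # how quickly this role is expected to engage
    responsibilities: list[str] = field(default_factory=list)

# A minimal, illustrative role catalog; a real one would live alongside
# on-call schedules and incident management tooling.
ROLES = [
    IncidentRole(
        name="incident_commander",
        response_time_minutes=15,
        responsibilities=["coordinate responders", "own the timeline and communications"],
    ),
    IncidentRole(
        name="security_lead",
        response_time_minutes=15,
        responsibilities=[
            "run the security-focused risk assessment",
            "provide security context during non-security incidents",
        ],
    ),
    IncidentRole(
        name="service_owner",
        response_time_minutes=30,
        responsibilities=["diagnose and remediate the affected system"],
    ),
]

def who_to_page(is_security_related: bool) -> list[str]:
    """Return the roles to bring in for a given incident type."""
    roles = ["incident_commander", "service_owner"]
    if is_security_related:
        roles.append("security_lead")
    return roles
```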

As part of our standard steps for declaring an incident and establishing a current state, we also conduct a security-focused risk assessment. This is a structured set of questions that a security lead answers when called to investigate a reported security-related issue:

  • If a threat actor is involved, what is their objective? How confident are you in this assessment?
  • Can you determine which stage of the attack the threat actor is in, such as initial access?
  • What are the most likely attack paths for a threat actor to achieve their objectives?
  • How likely is it that a threat actor without internal knowledge could identify these paths?

We encourage security leads to use words of estimative probability (WEPs) to express the likelihood of a specific outcome, such as a threat actor identifying attack paths within our systems. These probability estimates enable our team to scope and prioritize risk effectively.
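The sketch below shows one way a risk assessment and a WEP scale might be represented. The probability ranges follow one commonly cited estimative-language scale, and the field names are illustrative assumptions rather than our actual assessment template.

```python
from dataclasses import dataclass

# One commonly used mapping of words of estimative probability (WEPs) to
# probability ranges; teams should agree on a single scale up front.
WEP_SCALE = {
    "almost no chance":    (0.01, 0.05),
    "very unlikely":       (0.05, 0.20),
    "unlikely":            (0.20, 0.45),
    "roughly even chance": (0.45, 0.55),
    "likely":              (0.55, 0.80),
    "very likely":         (0.80, 0.95),
    "almost certain":      (0.95, 0.99),
}

@dataclass
class RiskAssessment:
    """A security lead's structured answers (field names are illustrative)."""
    threat_actor_objective: str
    objective_confidence: str        # a WEP term from WEP_SCALE
    attack_stage: str                # e.g., "initial access"
    likely_attack_paths: list[str]
    path_discovery_likelihood: str   # a WEP term from WEP_SCALE

    def numeric_ranges(self) -> dict:
        """Translate WEP terms into probability ranges for prioritization."""
        return {
            "objective_confidence": WEP_SCALE[self.objective_confidence],
            "path_discovery_likelihood": WEP_SCALE[self.path_discovery_likelihood],
        }
```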

In addition to comprehensive team and process documentation, we also unified our incident runbooks and monitoring dashboards. Our organization maintains a library of detailed runbooks that we use as part of the incident management process. Security runbooks are developed by our Security Incident Response Team (SIRT) in close partnership with other relevant system owners, which ensures technical accuracy. This collaboration also serves as a planning exercise by allowing us to clarify how teams will work together during an incident, what information each role will need, and when they'll need it. Having greater visibility across both reliability and security domains enables us to easily follow a predictable set of remediation steps and resolve issues faster.

Our shared runbooks include high-level graphs that help teams quickly scope the incident to a specific timeframe. They also contain links to relevant logs and traces, along with guidance on what to look for during the review process. For example, in the case of possible DDoS activity, our runbooks include the following dedicated queries that help teams evaluate the likelihood of a legitimate attack:

  • Spikes in requests to specific routes, IP addresses, or ASNs
  • Surges in authentication or authorization attempts (successful or not)
  • Unusual increases in 2xx responses, which could signal an HTTP flood
  • Spikes in 4xx responses, which may indicate a credential stuffing attack attempting to blend in with DDoS traffic
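As a rough sketch of how such checks could be encoded outside of a query language, the example below compares counts in the current window against a recent baseline window. The field names, thresholds, and heuristics are assumptions, not our actual runbook queries.

```python
from collections import Counter

def spike_ratio(current: int, baseline: int) -> float:
    """How many times larger the current window is than the baseline window."""
    return current / max(baseline, 1)

def ddos_indicators(current_logs: list[dict], baseline_logs: list[dict],
                    threshold: float = 5.0) -> list[str]:
    """Flag surge patterns described in the runbook: per-route spikes, 2xx floods,
    and 4xx bursts. Each log record is assumed to have 'route' and 'status' keys."""
    findings = []

    # Per-route request spikes relative to the baseline window.
    cur_routes = Counter(log["route"] for log in current_logs)
    base_routes = Counter(log["route"] for log in baseline_logs)
    for route, count in cur_routes.items():
        if spike_ratio(count, base_routes[route]) >= threshold:
            findings.append(f"request spike on {route}")

    def status_count(logs, low, high):
        return sum(low <= log["status"] <= high for log in logs)

    # Unusual increases in 2xx responses can signal an HTTP flood.
    if spike_ratio(status_count(current_logs, 200, 299),
                   status_count(baseline_logs, 200, 299)) >= threshold:
        findings.append("surge in 2xx responses (possible HTTP flood)")

    # Spikes in 4xx responses may indicate credential stuffing hiding in DDoS traffic.
    if spike_ratio(status_count(current_logs, 400, 499),
                   status_count(baseline_logs, 400, 499)) >= threshold:
        findings.append("surge in 4xx responses (possible credential stuffing)")

    return findings
```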

If an incident surfaces a gap in coverage for how to investigate or resolve an issue, we encourage engineers to create the necessary documentation and update runbooks accordingly as part of the postmortem process. Our cross-functional dashboards provide a high-level overview of the critical data we use to investigate incidents so we can quickly connect reliability context to security issues. For example, if we notice an unusual spike in failed login attempts, we can pivot directly to related security signals to investigate further.

Each signal includes additional context, such as associated IP addresses or geolocation, that helps us quickly determine whether the activity stems from a platform misconfiguration or a legitimate attack, such as a distributed credential stuffing campaign.
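A minimal sketch of that kind of triage heuristic might look like the following; the field names and thresholds are illustrative assumptions, not our production detection logic.

```python
from collections import Counter

def classify_failed_logins(events: list[dict]) -> str:
    """Rough heuristic: failures spread across many source IPs and many accounts
    suggest a distributed credential stuffing campaign, while failures concentrated
    on a single client or account more often point to a misconfiguration."""
    ips = Counter(e["client_ip"] for e in events)
    accounts = Counter(e["username"] for e in events)

    if len(ips) > 100 and len(accounts) > 50:
        return "likely distributed credential stuffing: escalate to the security lead"
    if len(ips) <= 3 and len(accounts) <= 3:
        return "likely client or platform misconfiguration: route to the service owner"
    return "ambiguous: review geolocation and user-agent context"
```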

By building dashboards that SRE and security teams use together, every responder works from the same context, which reduces miscommunication and accelerates the decision-making process.

Building resilience involves more than just improving the way we remediate issues. We also wanted to find opportunities to practice incident response before an issue occurs. We regularly conduct exercises that simulate both security and reliability incidents so we can refine our incident management processes and documentation.

For security, SIRT participates in purple team exercises alongside our threat detection group. These drills help refine detection logic, improve runbooks, and give engineers the muscle memory to handle incidents as if they were real. Some drills are live simulations, while others are theoretical or whiteboard-based tabletop scenarios that let us explore edge cases without affecting our production environments.

To test platform reliability, we take a similar approach through chaos engineering experiments and both small- and large-scale gamedays. These events deliberately introduce controlled failures into our systems, giving teams the opportunity to diagnose and remediate issues under realistic conditions.
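A minimal sketch of a controlled failure injection, assuming a hypothetical wrapper around a downstream call rather than any specific chaos tooling, might look like this:

```python
import random
import time

def inject_failures(call, error_rate: float = 0.1, extra_latency_s: float = 0.5):
    """Wrap a service call so a gameday can introduce controlled errors and latency.
    The rates here are illustrative; real experiments should define an explicit
    blast radius and an easy way to abort."""
    def chaotic_call(*args, **kwargs):
        if random.random() < error_rate:
            raise RuntimeError("injected failure (chaos experiment)")
        time.sleep(extra_latency_s * random.random())  # add jittered latency
        return call(*args, **kwargs)
    return chaotic_call

# Example: wrap a hypothetical downstream dependency during a small-scale gameday.
# fetch_inventory = inject_failures(fetch_inventory, error_rate=0.05)
```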

Our goals are the same regardless of the type of exercise we conduct: Identify weaknesses early, improve processes and tooling in a safe environment, and ensure that our team can respond quickly when a real incident happens.

In this post, we looked at the outcomes of combining our SRE and security groups as well as how that approach significantly improved the way we manage incidents. While every organization's structure is different, the principles of shared visibility and platform enablement can apply broadly. If you're exploring similar changes for your reliability and security teams, start by identifying shared pain points and aligning on team goals. From there, incremental changes in process visibility and ownership will allow you to build the necessary tools for collaborative incident response and more resilient, secure applications.

The workflows we described in this post, such as creating shared runbooks and dashboards, are possible with Datadog. Check out our documentation if you'd like to learn more about our incident management and security capabilities. If you're new to Datadog, you can sign up for a free 14-day trial.
