Splunk Inc.

12/20/2024 | News release | Archived content

The Observability Center of Excellence: Measuring & Improving Observability as a Service (OaaS)

Welcome to the third blog of the Observability Center of Excellence (O11y CoE) series! If you've been following along, we've discussed the why behind an O11y CoE, and we explored how to assemble and structure the team to make it a reality. Now, we're ready to dive deeper into one of the CoE's critical functions: defining and measuring Observability as a Service (OaaS).

In the context of an Observability CoE, OaaS is the operating model for delivering observability capabilities to the organization. Much like other "as a Service" models, OaaS focuses on providing observability as a scalable, measurable, and value-driven practice that supports teams across the business. To determine its effectiveness, it must be instrumented, just like the systems it aims to monitor.

Is your observability practice positioned to help teams resolve incidents faster, reduce downtime, and optimize performance? Defining some base KPIs early in your journey not only helps the CoE answer these questions but also enables it to leverage data to understand what's working (and what's not). These KPIs provide visibility into the CoE's value, empowering it to continuously refine and improve its delivery of observability services.

In this blog, we'll explore:

  • What makes a good KPI and how KPIs differ from OKRs.
  • Common pitfalls to avoid when defining metrics.
  • A framework of practical, actionable KPIs to measure the success of your OaaS.
  • Real-world examples of KPIs that can serve as a great starting point.

By the end, you'll have the tools and insights to ensure your Observability CoE is delivering measurable value through OaaS, setting the stage for future enhancements like maturity assessments and tactical implementations.

KPIs vs. OKRs: understanding the difference

A fellow Splunker created a great article on KPIs, OKRs, and metrics, breaking down their distinctions and how they complement each other. The gist is simple: KPIs monitor ongoing performance and historical trends, while OKRs align teams around strategic goals and measurable outcomes.

KPIs (Key Performance Indicators) are like the operational pulse of your observability practice. They answer questions like, "What's happening right now?" and "What trends have emerged over time?" These indicators provide a near-real-time and historical view into the health of your OaaS, helping you identify trends, measure effectiveness, and take action.

OKRs (Objectives and Key Results) are about where you want to go. They combine a clear objective (the goal) with measurable results to ensure progress. While KPIs tell you what's happening, OKRs drive strategic alignment and improvements.

How they work together: an example

Imagine your Observability CoE tracks a KPI called Agent Saturation, which measures the percentage of available resources instrumented with observability agents. This KPI shows how comprehensively your environment is covered.

  • The KPI tells you: "We currently have 75% saturation across Tier 0 and Tier 1 applications."
  • The related OKR might be:
    • Objective: Achieve full observability coverage for critical services.
    • Key Result: Increase agent saturation to 95% across Tier 0 and Tier 1 applications by the end of the quarter.

In this case, the KPI provides the current state and historical context, while the OKR establishes the target state and timeframe for improvement. Together, they ensure the CoE can monitor progress while driving a strategic outcome.
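As a rough sketch of the arithmetic behind this example, the snippet below computes Agent Saturation and the number of additional resources needed to hit the Key Result. The function names and the resource counts (150 of 200) are hypothetical, chosen only to reproduce the 75% and 95% figures above; your own inventory source will differ.

```python
import math


def agent_saturation(instrumented: int, total: int) -> float:
    """Percentage of in-scope resources that run an observability agent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * instrumented / total


def resources_to_target(instrumented: int, total: int, target_pct: float) -> int:
    """Additional resources to instrument to reach target_pct (never negative)."""
    needed = math.ceil(total * target_pct / 100.0)
    return max(0, needed - instrumented)


# Hypothetical Tier 0 / Tier 1 inventory: 200 resources, 150 with agents.
current = agent_saturation(150, 200)      # KPI: current state
gap = resources_to_target(150, 200, 95)   # OKR: distance to the target
```

Here the KPI is the value of `current`, while `gap` expresses the Key Result as concrete work remaining.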

Why both matter

KPIs and OKRs complement each other by ensuring your OaaS practice is operationally effective and strategically aligned:

  • KPIs show how your observability practice is performing now and over time.
  • OKRs define the strategic improvements you're driving toward.

Together, they create a feedback loop: KPIs inform how close you are to achieving OKRs, while OKRs ensure you're focusing on initiatives that deliver meaningful value.

By distinguishing between KPIs and OKRs, your Observability CoE can build a framework that tracks progress, measures success, and aligns with organizational goals. In the next section, we'll explore what makes a good KPI and common pitfalls to avoid.

What makes a good KPI?

Any service offering thrives on actionable, meaningful, and relevant KPIs that provide insights into what's working and what isn't. A well-chosen KPI doesn't just measure performance; it also drives continuous service improvement and supports broader objectives, such as enabling the Observability CoE (O11y CoE) to achieve its OKRs.

For those looking for a deep dive into the nuances of good vs. bad KPIs, I recommend checking out this Splunk article on KPI management. It explores how to identify impactful KPIs, avoid common mistakes, and set up management frameworks.

Common pitfalls to avoid

Defining KPIs is as much about knowing what to avoid as it is about selecting the right metrics. Some common pitfalls include:

  • Vanity Metrics: These look impressive but fail to provide actionable insights. For example, tracking the number of dashboards created may not correlate with better observability outcomes.
  • Subjective or Complex Metrics: Metrics like "False Positive Rates" or "Incident Costs Prevented," while insightful, often rely on subjective inputs or complex calculations, making them hard to track consistently or accurately. If a team member's opinion is required, implement a well-defined decision tree or flowchart to make the process consistent and repeatable.
  • Overloading Metrics: Tracking too many KPIs can overwhelm teams and dilute focus. Prioritize metrics that provide the most value and directly impact your O11y CoE's objectives.
  • Misaligned KPIs: Ensure your KPIs align with your organization's goals and support your OKRs. Misaligned metrics can lead to wasted effort and resources.
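To illustrate the decision-tree idea for subjective metrics, here is a minimal sketch that turns a reviewer's judgment into two fixed yes/no questions, so that a "False Positive Rate" is computed the same way regardless of who reviews the alert. The questions, labels, and field names are assumptions for illustration, not a prescribed taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertReview:
    """Yes/no answers a reviewer records for one alert (hypothetical fields)."""
    condition_was_real: bool    # Did the alerted condition actually occur?
    action_was_required: bool   # Did someone need to do something about it?


def classify_alert(review: AlertReview) -> str:
    """Walk a fixed decision tree so every reviewer reaches the same label."""
    if not review.condition_was_real:
        return "false_positive"
    if not review.action_was_required:
        return "non_actionable"
    return "true_positive"


def false_positive_rate(reviews: list[AlertReview]) -> float:
    """Percentage of reviewed alerts labelled false_positive."""
    if not reviews:
        return 0.0
    false_positives = sum(classify_alert(r) == "false_positive" for r in reviews)
    return 100.0 * false_positives / len(reviews)
```

The point of the design is that the opinion is captured once, as answers to explicit questions, and the label follows mechanically from those answers.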

The Role of the O11y CoE in KPI success

The Observability CoE is central to ensuring success with both KPIs and OKRs. By defining actionable KPIs early and aligning them with clear OKRs, the CoE can:

  • Establish a Baseline Understanding: Gain a comprehensive view of the organization's observability landscape.
  • Leverage Data for Insights: Identify strengths, gaps, and areas for improvement through data-driven analysis.
  • Validate Impact and Communicate Value: Use KPIs and OKRs to demonstrate the CoE's effectiveness to stakeholders.
  • Drive Iterative Improvement: Utilize OKRs to set measurable objectives that promote continuous value delivery and updates to the business.

Defining KPIs isn't just about tracking progress; it's about laying the foundation for a successful Observability as a Service (OaaS) model. By explicitly integrating OKRs, your O11y CoE gains the ability to continuously adapt, refine, and enhance its value proposition. This alignment ensures that observability practices keep delivering value to the business in regular increments, keeping the organization responsive and competitive.

Categories of observability KPIs

When identifying KPIs for your Observability CoE, it's useful to group them into categories based on their focus and purpose. To quickly recap, OaaS KPIs should help assess whether your OaaS operating model is effectively delivering, or is positioned to deliver, observability capabilities to the organization. Organizing KPIs into these categories ensures your measurements are actionable and aligned with the outcomes your Observability as a Service (OaaS) practice strives to achieve.

Later in this blog, I'll provide specific examples of O11y KPIs, including their descriptions, purposes, calculations, potential data sources, and which category they fall under. For now, let's explore the core KPI categories:

1. Availability

Focus: Ensuring observability tools and platforms are operational and accessible. This type of KPI tracks the reliability of your observability ecosystem, helping you answer questions like:

  • Are the tools available when teams need them?
  • How often are disruptions impacting service quality?
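One simple way to express an availability KPI is the share of a reporting period during which a tool was up. The sketch below assumes you can export outage durations for each tool; the function name and inputs are hypothetical.

```python
from datetime import timedelta


def availability_pct(period: timedelta, outages: list[timedelta]) -> float:
    """Percentage of the period during which the tool was available."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta(0))
    # Cap at the period length in case outage records overlap or are duplicated.
    downtime = min(downtime, period)
    return 100.0 * (1 - downtime / period)


# Hypothetical month: two outages totalling 7.2 hours in a 30-day period.
month = timedelta(days=30)
outages = [timedelta(hours=3), timedelta(hours=4, minutes=12)]
monthly_availability = availability_pct(month, outages)
disruption_count = len(outages)
```

Tracking `disruption_count` next to the percentage helps separate one long outage from many short ones, which affect teams differently.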

2. Utilization

Focus: Monitoring the deployment and use of observability tools and resources. Utilization KPIs measure things like license usage, tool versioning, and deployment coverage, ensuring you're getting the most out of your investments. Key questions include:

  • Are you maximizing the licenses you've purchased?
  • Are tools updated and deployed across your environment?
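Both questions reduce to simple ratios once the inputs are available. The following sketch assumes you can obtain a license count and a list of deployed agent versions from your tool administrators; all names and numbers are illustrative.

```python
def license_utilization(used: int, purchased: int) -> float:
    """Percentage of purchased licenses currently in use."""
    if purchased <= 0:
        raise ValueError("purchased must be positive")
    return 100.0 * used / purchased


def version_currency(deployed_versions: list[str], current_version: str) -> float:
    """Percentage of deployed agents running the current version."""
    if not deployed_versions:
        return 0.0
    up_to_date = sum(1 for v in deployed_versions if v == current_version)
    return 100.0 * up_to_date / len(deployed_versions)


# Hypothetical inputs: 420 of 500 licenses in use, 3 of 4 agents current.
licenses = license_utilization(420, 500)
currency = version_currency(["1.4", "1.4", "1.3", "1.4"], "1.4")
```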

3. Adoption

Focus: Measuring engagement with observability tools and practices across teams and environments. Adoption KPIs cover two key dimensions:

  • People perspective: Are teams actively using observability tools in their workflows?
  • Coverage perspective: Are critical resources, such as Tier 0 applications, fully instrumented and monitored?
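For the people perspective, one possible measure is the share of provisioned users who logged in recently. The sketch below assumes a mapping of user to last login date, which many tools can export; the 30-day window and the user names are arbitrary choices for illustration.

```python
from datetime import date, timedelta


def active_user_pct(last_login: dict[str, date], as_of: date,
                    window_days: int = 30) -> float:
    """Percentage of provisioned users who logged in within the window."""
    if not last_login:
        return 0.0
    cutoff = as_of - timedelta(days=window_days)
    active = sum(1 for seen in last_login.values() if seen >= cutoff)
    return 100.0 * active / len(last_login)


# Hypothetical export of last login dates per user.
logins = {
    "alice": date(2024, 12, 18),
    "bob": date(2024, 11, 1),
    "carol": date(2024, 10, 1),
    "dave": date(2024, 12, 1),
}
adoption = active_user_pct(logins, as_of=date(2024, 12, 20))
```

A login is only a proxy for use in a workflow, so treat this as a starting signal rather than proof of adoption.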

4. Optimization

Focus: Enhancing efficiency and reducing noise. Optimization KPIs evaluate how well your observability practice reduces unnecessary alerts, improves workflows, and minimizes manual effort. These KPIs tackle questions like:

  • Are observability operations efficiently meeting business needs?
  • Are there opportunities to reduce spend while maintaining critical observability capabilities?
  • Are monitoring processes helping resolve incidents faster and more efficiently?

Next steps: diving into KPI examples

By organizing KPIs into these types, you can align your measurements with the strategic goals of your CoE and your organization. In the next section, we'll take a look at some specific examples of OaaS KPIs, explaining their purpose, how to calculate them, and some practical "pro-tips" based on my experience.

Taking the next steps

Now that you've explored the critical role KPIs play in defining and measuring Observability as a Service (OaaS), it's time to put these ideas into action. Here's your call to action:

  1. Start Collecting Metrics:
    Begin gathering data for the KPIs we've discussed, even if it's as simple as plugging them into a spreadsheet. This initial step will help your tools administration teams understand the type of information you'll be requesting. More importantly, it may inspire them to think of systematic, programmatic ways to retrieve this data using APIs, automated reports, or other integrations.
  2. Set Your First CoE OKR:
    Make your initial objective simple and actionable. For example:
    Objective: Establish foundational OaaS KPIs.
    Key Result: Collect KPI data for your top X observability tools within the next two months.
  3. Leverage Metrics in Executive Updates:
    Use the outcomes from this exercise to enhance your Observability CoE's monthly updates with your executive champion. Highlight early wins, gaps, and actionable insights to build momentum and alignment.
  4. Create Achievable Goals Based on Data:
    Once you've established baseline data, use it to define meaningful and attainable goals. For example:
    KPI Goal: Reduce the number of observability tools by 5% in the next quarter.
  5. Stay Tuned for What's Next:
    In upcoming blogs, we'll explore deeper aspects of creating a leading observability practice, including tools inventory, rationalization, and strategies for streamlining your observability ecosystem.
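If a spreadsheet is your starting point for step 1, a small script can keep the format consistent from day one. This is a minimal sketch using Python's standard csv module; the column names, tool name, and file name are assumptions you would adapt to your own environment.

```python
import csv
from datetime import date

# Hypothetical column layout: one row per KPI reading.
FIELDS = ["date", "tool", "kpi", "value"]


def write_header(out) -> None:
    """Write the column header to an open text file or buffer."""
    csv.writer(out).writerow(FIELDS)


def append_snapshot(out, snapshot_date: date, tool: str,
                    kpi: str, value: float) -> None:
    """Write one KPI reading as a CSV row to an open text file or buffer."""
    csv.writer(out).writerow([snapshot_date.isoformat(), tool, kpi, value])
```

To append to a file on disk, open it with `open("oaas_kpis.csv", "a", newline="")`, call `write_header` once when the file is first created, and pass the handle to `append_snapshot` for each reading. One row per reading keeps the data easy to pivot later by tool, KPI, or date.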

If you're passionate about learning more about observability, I'd encourage you to check out my teammates' observability content on Splunk's community blog and watch some of our latest videos on YouTube (Splunk Observability for Engineers).