Splunk Inc.

01/03/2025 | News release | Archived content

IT/ITIL Problem Management

If there is anything that frustrates IT users, it is repeated issues that persist without a reasonable explanation of their cause or an effective, permanent resolution. Service disruptions, whether due to slow responsiveness or corrupted data, are an inevitable part of IT. However, when these issues become recurrent, they can leave a lasting bitter impression.

Ask Microsoft, which experienced repeat issues with its Azure Front Door service on 30 July and 5 August this year. What differentiates mediocre IT teams from mature ones is that the latter consistently focus on anticipating and preventing such issues before they happen.

What is problem management in IT?

Problem management, as defined by ITIL 4 guidance, is the practice responsible for reducing the likelihood and impact of incidents by identifying actual and potential causes, as well as managing workarounds and known errors.

The focus here is not on quickly restoring services to normal; that is the role of incident management. Instead, the emphasis is on investigating the root causes of incidents and implementing measures to contain or eliminate them, a process that may require more time.

The value of this practice comes from:

  • Increased availability
  • Improved service levels
  • Reduced costs
  • Improved customer convenience and satisfaction

Problem management is carried out in three main phases:

Phase 1: Problem identification

Problems are identified through two different approaches:

Reactive problem management

This approach is a reaction to incidents that have already happened and involves investigating their symptoms and then unearthing their causes. The main drivers for reactive problem management are contributing to the resolution of open incidents and preventing their recurrence.

For example, repeated instances of API gateway timeouts will be investigated for possible network issues, misconfiguration, or unresponsive servers.
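A minimal sketch of how recurring incidents might be surfaced as problem candidates. It assumes incidents are exported as records with hypothetical `service` and `symptom` fields, and that three or more occurrences count as a recurrence; both the field names and the threshold are illustrative assumptions rather than ITIL guidance.

```python
from collections import Counter

# Hypothetical incident export; field names are assumptions for illustration.
incidents = [
    {"id": "INC-1", "service": "api-gateway", "symptom": "timeout"},
    {"id": "INC-2", "service": "api-gateway", "symptom": "timeout"},
    {"id": "INC-3", "service": "billing", "symptom": "corrupted data"},
    {"id": "INC-4", "service": "api-gateway", "symptom": "timeout"},
]


def problem_candidates(incidents, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times."""
    counts = Counter((i["service"], i["symptom"]) for i in incidents)
    return [key for key, n in counts.items() if n >= threshold]


candidates = problem_candidates(incidents)
print(candidates)
```

Each pair returned would then be a candidate for a problem record and a closer look at network issues, misconfiguration, or unresponsive servers.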

Proactive problem management

This approach has the objective of preventing incidents before they occur. It involves analysing information to identify latent incident causes before they lead to a service disruption, drawing from:

  • Systems
  • Vendors
  • Users
  • Other sources

For example, a vendor shares information about a newly discovered vulnerability, or developers unearth a bug while building the next feature update. Once the cause is identified, the risks are analyzed and a response is prepared to minimize the likelihood or impact of an incident.

Reactive problem management is the more common of the two problem identification techniques. However, as organizations mature in their problem management capability, it becomes more desirable to invest in proactive problem management. The challenge lies in quantifying the value to the business, as prevented incidents and intangible resolution actions can be difficult to measure.

Phase 2: Problem control

Once a problem has been identified, the first step in problem control is to register it in readiness for detailed investigation. A problem record is created in the organization's chosen mechanism (spreadsheet, ticketing system, or case management tool) by a designated problem management practitioner.

Problems should be recorded separately from incidents, since their focus is different and the timelines involved are much longer. The general information captured at this step includes:

  • Date of registration
  • Description of problem
  • Classification based on type (hardware/software/network/cloud/security etc.)
  • Priority based on likelihood and impact
  • Assignee who will coordinate the investigation and resolution actions
  • Related IT configuration items affected by the problem
  • Related incidents (in the case of reactive problem management)
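The fields above could be captured in a simple record structure. The following is only a sketch: the class name, the 1 to 5 scales for likelihood and impact, and the priority thresholds are assumptions chosen for illustration, and a real ticketing or case management tool will define its own schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProblemRecord:
    registered_on: date
    description: str
    classification: str  # e.g. "hardware", "software", "network"
    likelihood: int      # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int          # assumed scale: 1 (minor) to 5 (severe)
    assignee: str
    configuration_items: list[str] = field(default_factory=list)
    related_incidents: list[str] = field(default_factory=list)

    @property
    def priority(self) -> str:
        """Map likelihood x impact to a label; thresholds are illustrative."""
        score = self.likelihood * self.impact
        if score >= 15:
            return "high"
        if score >= 6:
            return "medium"
        return "low"


record = ProblemRecord(
    registered_on=date(2025, 1, 3),
    description="Repeated API gateway timeouts",
    classification="network",
    likelihood=4,
    impact=4,
    assignee="problem.manager",
    configuration_items=["api-gateway"],
    related_incidents=["INC-1", "INC-2", "INC-4"],
)
print(record.priority)
```

Deriving priority from likelihood and impact, rather than storing it as free text, keeps the ranking consistent across records.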

Root cause analysis

Once the problem is registered, the main activity of problem control begins: root cause analysis. Here, information on the IT system and its underlying components is analyzed to trace each cause back to its own cause, until the root of the problem is unearthed. Note that in some cases there may be several root causes rather than a single one.

Apart from IT components, root cause analysis should also consider other factors such as:

  • Human error
  • Configuration changes
  • User behavior
  • Procedure errors

Since no one person has all the skills and information needed to look at a problem from multiple angles, problem solving is best done by a multidisciplinary team of technical and business experts, according to ITSM.express guidance.

There are many techniques for conducting root cause analysis, and it is crucial that organizations train their tech teams on how to apply them and how to select the right technique for a given situation. The ITIL v3 Service Operation publication provides some guidance on selecting techniques, as seen below:

Problem situation | Suggested analysis technique
Complex problems where a sequence of events needs to be assembled to determine exactly what happened | Chronological analysis, Technical observation post
Uncertainty over which problems should be addressed first | Pain value analysis, Brainstorming
Uncertain whether a presented root cause is truly the root cause | 5-Whys, Hypothesis testing
Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment | Technical observation post, Kepner-Tregoe, Hypothesis testing, Brainstorming
Uncertainty over where to start for problems that appear to have multiple causes | Pareto analysis, Kepner-Tregoe, Ishikawa diagrams, Brainstorming
Struggling to identify the exact point of failure for a problem | Fault isolation, Ishikawa diagrams, Kepner-Tregoe, Affinity mapping, Brainstorming
Uncertain where to start when trying to find root cause | 5-Whys, Kepner-Tregoe, Brainstorming, Affinity mapping
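
To illustrate one of the techniques named above, Pareto analysis ranks causes by how often they occur and picks out the few that account for most incidents. The sketch below assumes past incidents have already been tagged with a cause category; the categories, counts, and the 80% cutoff are hypothetical values used only for illustration.

```python
from collections import Counter

# Hypothetical cause categories assigned to past incidents.
causes = (
    ["misconfiguration"] * 12
    + ["expired certificate"] * 5
    + ["network issue"] * 2
    + ["human error"] * 1
)


def pareto(causes, cutoff=0.8):
    """Return the most frequent causes that together reach `cutoff` of all incidents."""
    counts = Counter(causes)
    total = sum(counts.values())
    selected, running = [], 0
    for cause, n in counts.most_common():
        selected.append(cause)
        running += n
        if running / total >= cutoff:
            break
    return selected


vital_few = pareto(causes)
print(vital_few)
```

With these sample counts, two of the four categories cover 17 of the 20 incidents, which suggests where an investigation with multiple apparent causes might start.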


When a problem has been analyzed but has yet to be resolved, it is given the status "known error". Should the investigation reveal that the root cause was already addressed during incident resolution, the problem record is closed at this point. However, if a short-term measure was applied to reduce the impact or likelihood of incident recurrence, it is recorded as a workaround.
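
The status rules described above can be sketched as a small decision function. The status names other than "known error" and the function signature are assumptions made for illustration; real tools typically have richer workflows.

```python
from enum import Enum


class Status(Enum):
    UNDER_INVESTIGATION = "under investigation"  # assumed name for the pre-analysis state
    KNOWN_ERROR = "known error"
    CLOSED = "closed"


def next_status(analyzed: bool, root_cause_resolved: bool) -> Status:
    """Pick a status from whether analysis is done and the root cause is fixed."""
    if not analyzed:
        return Status.UNDER_INVESTIGATION
    if root_cause_resolved:
        return Status.CLOSED
    return Status.KNOWN_ERROR


print(next_status(analyzed=True, root_cause_resolved=False).value)
```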

Workarounds are extremely useful in helping to resolve further incidents faster and should be properly documented and communicated to first-level support teams. One example of well-documented workarounds comes from AWS, whose documentation for the IVS real-time streaming Android broadcast SDK lists known issues and their associated workarounds.


Phase 3: Error control

The last phase of problem management is error control, in which the problem record is eventually closed once one of two conditions is met:

  • The problem is solved, i.e. a solution is found that reduces the likelihood or impact of an incident to an acceptable level
  • The problem no longer affects the organization because the context has changed

The ideal scenario is one in which a permanent solution that eradicates the root cause is identified and implemented. This could involve a myriad of actions such as system reconfiguration, migration, change of modules, patching/upgrades, updates to policies, enhancement of controls, etc. Depending on the IT problem being addressed, several practices may need to be applied during the resolution:

  • Service financial management: A cost-benefit analysis of the permanent solution is conducted to determine if its implementation makes sense from value and budget perspectives.
  • Change enablement: This involves the submission, review and approval of a change request for the implementation of the permanent solution.
  • Risk management: The risks associated with implementing the permanent solution are analyzed to inform the decision to deploy.
  • Release and deployment: Here the planning, installation/configuration, testing, and review of the permanent solution are carried out under the approved change request.
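
As a rough illustration of the cost-benefit step, a simple payback calculation compares the one-off cost of a permanent solution with the incident cost it is expected to avoid each year. This is only one of several ways to frame such an analysis, and the function name and all figures below are hypothetical.

```python
def payback_years(fix_cost, incidents_per_year, cost_per_incident, reduction=1.0):
    """Years until avoided incident cost equals the one-off cost of the fix.

    `reduction` is the assumed fraction of incidents the fix prevents (0 to 1).
    """
    annual_saving = incidents_per_year * cost_per_incident * reduction
    if annual_saving <= 0:
        return float("inf")  # the fix never pays for itself
    return fix_cost / annual_saving


# Hypothetical figures: a 40,000 fix against 10 incidents a year at 2,000 each.
years = payback_years(fix_cost=40_000, incidents_per_year=10, cost_per_incident=2_000)
print(years)
```

A long payback period is one reason an organization might decide to keep a workaround in place rather than fund the permanent solution.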

Some organizations see fit to maintain permanent workarounds as their error control. The reasons behind this may include budget, risk, legacy infrastructure, target architecture, vendor advice, and other considerations. However, the use of permanent workarounds to prevent incidents may inadvertently lead to increased technical debt. Known errors should be regularly reviewed to determine whether their context has changed in a way that allows a shift from workaround to permanent solution.

Prioritizing problem management

Problem management is a practice that many organizations struggle to prioritize, often overshadowed by the fast-paced demands of deploying features and restoring services. Its strength lies in helping IT functions evolve beyond a reactive, firefighting mode to more effective system design and maintenance. Achieving this transformation, however, requires a comprehensive and strategic approach.

To reach higher levels of maturity, leadership must actively promote proactive problem management and integrate its metrics into executive dashboards. Investments in upskilling the technology workforce, guided by frameworks such as SFIA, are essential. Additionally, organizations should deploy technologies that enhance problem investigation, including tools with observability and machine learning capabilities, to support more efficient and effective problem resolution.