Splunk Inc.

01/03/2025 | News release | Archived content

IT/ITIL Problem Management

If there is anything that frustrates IT users, it is repeated issues that seem to persist without any reasonable explanation about their cause or effective and permanent resolution. Service disruptions, whether due to slow responsiveness or corrupted data, are an inevitable part of IT. However, when these issues become recurrent, they can leave a lasting bitter impression.

Ask Microsoft, which experienced repeat issues with its Azure Front Door service on 30th July and 5th August this year. What differentiates mediocre IT teams from mature ones is that the latter consistently focus on anticipating and preventing such issues before they happen.

What is problem management in IT?

Problem management, as defined by ITIL 4 guidance, is the practice responsible for reducing the likelihood and impact of incidents by identifying actual and potential causes, as well as managing workarounds and known errors.

The focus here is not on quickly restoring services to normal; that is the role of incident management. Instead, the emphasis is on investigating the root causes of incidents and implementing measures to contain or eliminate them, a process that may require more time.

The value of this practice comes from:

Increased availability
Improved service levels
Reduced costs
Improved customer convenience and satisfaction

Problem management is carried out in 3 main phases:

Phase 1: Problem identification

Problems are identified through two different approaches:

Reactive problem management

This approach is a reaction to incidents that have already happened and involves investigating their symptoms and then unearthing their causes. The main drivers for reactive problem management are contributing to the resolution of open incidents and preventing their recurrence.

For example, repeated instances of API gateway timeouts will be investigated for possible network issues, misconfiguration, or unresponsive servers.
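A minimal sketch of how recurring incidents might be surfaced as problem candidates, assuming incidents can be exported as simple records. The record layout, the sample data, and the threshold of three occurrences are all illustrative assumptions rather than anything prescribed by ITIL.

```python
from collections import Counter

# Hypothetical incident records: (incident_id, service, symptom).
incidents = [
    ("INC-101", "api-gateway", "timeout"),
    ("INC-102", "api-gateway", "timeout"),
    ("INC-103", "billing-db", "corrupted data"),
    ("INC-104", "api-gateway", "timeout"),
    ("INC-105", "web-portal", "slow response"),
]


def find_problem_candidates(records, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times.

    The threshold is an assumed cut-off; each organization would tune it.
    """
    counts = Counter((service, symptom) for _, service, symptom in records)
    return [key for key, n in counts.items() if n >= threshold]


candidates = find_problem_candidates(incidents)
print(candidates)  # [('api-gateway', 'timeout')]
```

Each candidate would then be handed to a practitioner for investigation; the count only signals recurrence and says nothing about the cause.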

Proactive problem management

This approach aims to prevent incidents before they occur. It involves analyzing information to identify latent incident causes before they lead to a service disruption, drawing from:

Systems
Vendors
Users
Other sources

For example, a vendor shares information about a newly discovered vulnerability, or developers unearth a bug while building the next feature update. Once this cause is identified, the risks are analyzed and a response to minimize the incident likelihood or impact is prepared.

Reactive problem management is the more common of the two problem identification techniques. However, as organizations mature in their problem management capability, it becomes more desirable to invest in proactive problem management. The challenge lies in quantifying the value to the business, as prevented incidents and intangible resolution actions can be difficult to measure.

Phase 2: Problem control

Once a problem has been identified, the first step in controlling it is to register it in readiness for detailed investigation. A problem record is created in the organization's chosen mechanism (spreadsheet, ticketing system, or case management tool) by a designated problem management practitioner.

Problems should be recorded separately from incidents since their focus is different and the timelines involved are much longer. The general information captured at this step includes:

Date of registration
Description of problem
Classification based on type (hardware/software/network/cloud/security etc.)
Priority based on likelihood and impact
Assignee who will coordinate the investigation and resolution actions
Related IT configuration items affected by the problem
Related incidents (in the case of reactive problem management)
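The fields above could be captured in a simple data structure. The sketch below is one possible shape, not a standard schema: the class name, the 1 to 5 scales for likelihood and impact, and the score bands that map to a priority label are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ProblemRecord:
    registered_on: date
    description: str
    classification: str  # e.g. "hardware", "software", "network", "cloud", "security"
    likelihood: int      # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int          # assumed scale: 1 (minor) to 5 (severe)
    assignee: str
    configuration_items: List[str] = field(default_factory=list)
    related_incidents: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject scores outside the assumed 1-5 scale.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def priority(self) -> str:
        """Derive a priority label from likelihood x impact (assumed bands)."""
        score = self.likelihood * self.impact
        if score >= 15:
            return "high"
        if score >= 6:
            return "medium"
        return "low"


record = ProblemRecord(
    registered_on=date(2025, 1, 3),
    description="Repeated API gateway timeouts",
    classification="network",
    likelihood=4,
    impact=4,
    assignee="problem.manager",          # hypothetical assignee
    configuration_items=["api-gateway"],
    related_incidents=["INC-101", "INC-102", "INC-104"],
)
print(record.priority)  # high
```

Deriving priority from the two scores, rather than storing it as free text, keeps records comparable when deciding which problem to investigate first.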

Root cause analysis

Once the problem is registered, the main activity of problem control begins: root cause analysis. Here, information on the IT system and its underlying components is analyzed to trace each cause back to its own cause, until the root of the problem is unearthed. It is important to note that in some cases, there may not be just a single root cause but several.

Apart from IT components, root cause analysis should also consider other factors such as:

Human error
Configuration changes
User behavior
Procedure errors

Since no one person can have all the skills and information to look at a problem from multiple angles, problem solving is best done using a multidisciplinary team of technical and business experts according to ITSM.express guidance.

There are many techniques for conducting root cause analysis, and it is crucial that organizations train their tech teams on how to apply them and how to select the right technique for a given situation. The ITIL v3 Service Operation publication provides some guidance on selecting techniques, as seen below:

Problem situation	Suggested analysis technique
Complex problems where a sequence of events needs to be assembled to determine exactly what happened	Chronological analysis, Technical observation post
Uncertainty over which problems should be addressed first	Pain value analysis, Brainstorming
Uncertain whether a presented root cause is truly the root cause	5-Whys, Hypothesis testing
Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment	Technical observation post, Kepner-Tregoe, Hypothesis testing, Brainstorming
Uncertainty over where to start for problems that appear to have multiple causes	Pareto analysis, Kepner-Tregoe, Ishikawa diagrams, Brainstorming
Struggling to identify the exact point of failure for a problem	Fault isolation, Ishikawa diagrams, Kepner-Tregoe, Affinity mapping, Brainstorming
Uncertain where to start when trying to find root cause	5-Whys, Kepner-Tregoe, Brainstorming, Affinity mapping
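As an illustration of one technique from the table, Pareto analysis ranks causes by how many incidents they account for and focuses on the few that make up most of the total. The sketch below assumes incident counts per cause are already available; the cause names, the counts, and the 80% cut-off are illustrative assumptions.

```python
from collections import Counter


def pareto_vital_few(cause_counts, cutoff=0.8):
    """Return causes, largest first, that together reach `cutoff` of all incidents."""
    total = sum(cause_counts.values())
    if total == 0:
        return []
    selected, running = [], 0
    for cause, count in Counter(cause_counts).most_common():
        selected.append(cause)
        running += count
        if running / total >= cutoff:
            break
    return selected


# Hypothetical incident counts per suspected cause.
incident_causes = {
    "misconfiguration": 42,
    "network issue": 30,
    "unresponsive server": 18,
    "expired certificate": 6,
    "human error": 4,
}
print(pareto_vital_few(incident_causes))
# ['misconfiguration', 'network issue', 'unresponsive server']
```

The output only suggests where to look first; each selected cause still needs to be confirmed with a technique such as 5-Whys or hypothesis testing.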

When a problem has been analyzed but has yet to be resolved, it is designated the status "known error". Should the investigation reveal that the root cause was addressed during incident resolution, then the problem record is closed at this point. However, if a short-term measure was applied to reduce the impact or likelihood of incident recurrence, then this is recorded as a workaround.
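The status changes described here can be sketched as a small state machine. Only "known error" is a status named in the text; the other status names and the exact set of allowed transitions are assumptions chosen for illustration.

```python
from enum import Enum


class ProblemStatus(Enum):
    REGISTERED = "registered"          # assumed name
    UNDER_ANALYSIS = "under analysis"  # assumed name
    KNOWN_ERROR = "known error"        # status named in the text
    CLOSED = "closed"


# Assumed transition rules between statuses.
ALLOWED_TRANSITIONS = {
    ProblemStatus.REGISTERED: {ProblemStatus.UNDER_ANALYSIS},
    # Close directly if the root cause was already fixed during incident resolution.
    ProblemStatus.UNDER_ANALYSIS: {ProblemStatus.KNOWN_ERROR, ProblemStatus.CLOSED},
    ProblemStatus.KNOWN_ERROR: {ProblemStatus.CLOSED},
    ProblemStatus.CLOSED: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target


status = ProblemStatus.REGISTERED
status = advance(status, ProblemStatus.UNDER_ANALYSIS)
status = advance(status, ProblemStatus.KNOWN_ERROR)
print(status.value)  # known error
```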

Workarounds are extremely useful in helping to resolve further incidents faster and should be properly documented and communicated to first-level support teams. One example of well-documented workarounds comes from AWS, whose documentation for the IVS real-time streaming Android broadcast SDK lists known issues and their associated workarounds.

(Related reading: incident response plans.)

Phase 3: Error control

The last phase of problem management is error control, where the problem record is eventually closed after one of two options is applied:

The problem is solved, i.e. a solution is found that reduces the likelihood or impact of an incident to an acceptable level
The problem no longer affects the organization because the context has changed

The ideal scenario is when a permanent solution that eradicates the root cause is identified and implemented. This could involve a myriad of actions such as system reconfiguration, migration, change of modules, patching/upgrades, updates to policies, enhancement of controls, etc. Depending on the IT problem being addressed, several practices would need to be applied during the resolution:

Service financial management: A cost-benefit analysis of the permanent solution is conducted to determine if its implementation makes sense from value and budget perspectives.
Change enablement: This involves the submission, review and approval of a change request for the implementation of the permanent solution.
Risk management: The risks associated with implementing the permanent solution are analyzed to inform the decision to deploy.
Release and deployment: Here the planning, installation/configuration, testing, and review of the permanent solution are carried out under the approved change request.
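The cost-benefit analysis mentioned under service financial management can be sketched with simple arithmetic: compare the incident cost a permanent solution is expected to avoid against what it costs to implement. The function name, the parameters, and every figure below are hypothetical, and a real analysis would also weigh risk and intangible benefits.

```python
def net_benefit(incidents_per_year, cost_per_incident, solution_cost,
                reduction_pct=100, years=3):
    """Estimated avoided incident cost over `years`, minus the solution cost.

    `reduction_pct` is the assumed share of incidents the fix prevents.
    """
    avoided = incidents_per_year * cost_per_incident * years * reduction_pct / 100
    return avoided - solution_cost


# Hypothetical figures: 12 incidents a year at 5,000 each, a fix costing
# 60,000 that is expected to prevent 90% of incidents over 3 years.
result = net_benefit(12, 5_000, 60_000, reduction_pct=90, years=3)
print(result)  # 102000.0
```

A negative result under these assumptions would be one signal that maintaining a workaround may be the more economical choice for now.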

Some organizations see fit to maintain permanent workarounds as their error control. This choice may be driven by budget, risk, legacy infrastructure, target architecture, vendor advice, and other considerations. However, the use of permanent workarounds to prevent incidents may inadvertently lead to increased technical debt. Known errors should be regularly reviewed to identify whether their context has changed in a way that allows a shift from workaround to permanent solution.

Prioritizing problem management

Problem management is a practice that many organizations struggle to prioritize, often overshadowed by the fast-paced demands of deploying features and restoring services. Its strength lies in helping IT functions evolve beyond a reactive, firefighting mode to more effective system design and maintenance. Achieving this transformation, however, requires a comprehensive and strategic approach.

To reach higher levels of maturity, leadership must actively promote proactive problem management and integrate its metrics into executive dashboards. Investments in upskilling the technology workforce, guided by frameworks such as SFIA, are essential. Additionally, organizations should deploy technologies that enhance problem investigation, including tools with observability and machine learning capabilities, to support more efficient and effective problem resolution.
