Problem Management Process explained!
Purpose & Scope
Purpose
The objectives of the Problem Management process are to:
- Prevent the occurrence of problems and resulting incidents.
- Eliminate recurring incidents by identifying their root cause and initiating preventive actions.
- Minimize the impact of incidents and problems that cannot be prevented.
Scope
All problems with a (possible) negative effect on <Customer’s Name>’s services for which an SLA is signed lie within the scope of Problem Management.
Definitions
Incident
An incident is an unplanned interruption to an IT service, a reduction in the quality of an IT service, or a failure of a configuration item that has not yet impacted service.
Problem
A problem is the unknown cause of one or more incidents.
Known Error
A Known Error is a problem that has been successfully diagnosed and for which a permanent solution or workaround exists.
In any event, it remains a Known Error until it is permanently fixed by a change.
Workaround
A workaround is an identified means of resolving a particular incident that allows normal service to be resumed; however, it does not resolve the issue that caused the incident in the first place.
Change
The addition, modification, or removal of anything that could influence <Customer’s Name>’s IT Services.
Priority
The result of assessing the impact to the business and the time frame during which the problem must be resolved to minimize further disruption.
Known Error Database (KEDB)
It is a database containing all Known Error Records. The Known Error Database is part of the Service Knowledge Management System (SKMS).
Root Cause
It is the underlying cause of an incident or problem. It is the most basic reason for an undesirable condition or state of a CI or an IT Service.
Partner
A third party responsible for supplying goods or services that are required to deliver <Customer’s Name>’s IT Services.
Note: For deeper definitions of each of the terms above, or if you want to become proficient in Problem Management, please refer to the ITIL 4 guide and exam.
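The Known Error definition above can be read as a simple rule. The sketch below is only an illustration of that rule: the `Problem` class and its field names are hypothetical and do not come from any particular ITSM tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """Hypothetical, minimal problem record used only for this illustration."""
    problem_id: str
    root_cause: Optional[str] = None          # filled in once diagnosed
    workaround: Optional[str] = None          # temporary way to restore service
    permanent_solution: Optional[str] = None  # identified permanent fix, if any
    fixed_by_change: bool = False             # True once a change has fixed it


def is_known_error(problem: Problem) -> bool:
    """Apply the definition: diagnosed, has a workaround or solution, not yet fixed."""
    diagnosed = problem.root_cause is not None
    has_remedy = (problem.workaround is not None
                  or problem.permanent_solution is not None)
    return diagnosed and has_remedy and not problem.fixed_by_change


example = Problem("PRB-001", root_cause="Expired certificate",
                  workaround="Restart the gateway service")
print(is_known_error(example))  # True: diagnosed, workaround exists, not yet fixed
```

Once a change permanently fixes the problem, setting `fixed_by_change` makes the same record stop counting as a Known Error.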
Roles and Responsibilities
Problem Manager
Responsibilities
- Identify Problems through proactive analysis of incident data, trend analysis, etc.
- Liaise with problem resolution groups to ensure that problems are resolved within SLA targets.
- Report on Problem Management activities.
- Ensure all Problems are registered and tracked.
- Own and protect the KEDB.
- Include Known Errors in the KEDB and manage the KEDB.
- Escalate to management if necessary.
- Liaise with Partners to ensure the resolution of problems and to provide problem-related information.
- Identify, receive, accept, register, categorize, prioritize, and assign Problem tickets to resolution groups.
- Identify and allocate resources to investigate and diagnose problems.
- Act as the single point of contact (SPOC) for the Problem Management process and related activities.
- Coordinate process improvement proposals.
- Formally close all Problem Records.
- Arrange, run, and document all activities relating to Major Problem Reviews.
Application Management Team
Responsibilities
- Provide effective specialist contribution to the analysis and resolution of Problems and Known Errors.
- Track the progress of owned Problem Records throughout their life cycle.
- Determine the root causes of problems by using various techniques.
- Develop workarounds and preventive actions.
- Recommend solutions.
- Submit requests for change (RFCs).
- Ensure the Known Error Database / Knowledge Database is updated.
- Update records to keep the Problem Manager informed of the status.
- Complete the Problem Record and communicate the solution to the appropriate IT support groups, including the Service Desk.
Input, Output
Inputs
- Event
- Incident
- Trend Analysis
- Major Incident
- Configuration Management System
- Information from Technical teams
- Product Information
- Known Error from Partner
Outputs
- Known Errors
- Request For Change (RFC)
- KEDB
- Measures & Reports
- Workarounds
- Permanent Solutions
Problem Management Process
The Problem Management process is intended to reduce both the number and the severity of incidents and problems affecting <Customer’s Name>’s business.
Problem Management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, Problem Management has a strong interface with Knowledge Management, and tools such as the Known Error Database will be used.
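As an illustration of how a Known Error Database might support this, the sketch below matches a new problem description against recorded symptoms by keyword overlap. It is a minimal sketch under stated assumptions: the record layout (`id`, `symptoms`), the `min_overlap` threshold, and the sample entries are all hypothetical, and a real KEDB tool will have its own search features.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into a set of alphanumeric keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_known_errors(description, kedb, min_overlap=2):
    """Return IDs of Known Error records sharing at least min_overlap keywords.

    Results are ordered by the number of shared keywords, highest first.
    """
    query = tokenize(description)
    hits = []
    for record in kedb:
        overlap = query & tokenize(record["symptoms"])
        if len(overlap) >= min_overlap:
            hits.append((len(overlap), record["id"]))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [record_id for _, record_id in hits]


# Hypothetical KEDB entries, for illustration only.
kedb = [
    {"id": "KE-1", "symptoms": "login page timeout after password reset"},
    {"id": "KE-2", "symptoms": "nightly batch job fails with disk full error"},
]

print(match_known_errors("Users report timeout on login page", kedb))  # ['KE-1']
```

Plain keyword overlap is deliberately crude; it ignores synonyms and common words, so the threshold would need tuning against real records.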
Structural analysis of <Customer’s Name>’s IT infrastructure, reports generated from support software, and user-group meetings can also result in the identification of Problems and Known Errors. This is proactive Problem Management. Problem control focuses on transforming Problems into Known Errors. Error control focuses on resolving Known Errors.
Although Incident and Problem Management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact, and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.
Activity No. | Step | Description | Input-Output | Role |
1 | Problem Detection | There are multiple ways of detecting Problems. These include: (a) the Service Desk suspecting or detecting an unknown cause of one or more incidents; (b) the Service Desk or resolving group having resolved an incident with a workaround without determining a definitive cause, and suspecting that it is likely to recur; (c) analysis of an incident by a technical support group revealing that an underlying problem exists or is likely to exist; (d) Event Management revealing the need for a Problem Record; (e) Major Incidents for which the root cause must be identified; (f) analysis of incidents as part of proactive Problem Management, resulting in the need to raise a Problem Record so that the underlying fault can be investigated further; (g) notification from a Partner that a problem exists that has to be resolved. | Input: Incident Database, Major Incident, Trend analysis, Product information, Input from Technical Staff, Input from Service Desk, Event Management, Partner; Output: Detected Problem | Problem Manager/ Service Desk/ Technical Staff/ Tools/ Partner |
2 | Problem Logging | Regardless of the detection method, all the relevant details of the problem must be recorded so that a full historic record exists. A cross-reference must be made to the incident(s) that initiated the Problem Record, and all relevant details must be copied from the Incident Record(s) to the Problem Record. Refer to <Problem Logging Checklist>. | Input: Detected Problem, Problem Logging Checklist; Output: Problem Record | Problem Manager/ Service Desk |
3 | Problem Categorization | Problems must be categorized in the same way as incidents (and it is advisable to use the same coding system) so that the true nature of the problem can be easily traced in the future and meaningful management information can be obtained. | Input: Problem Record, Problem Categories; Output: Categorized Problem | Problem Manager |
4 | Problem Prioritization | Allocate an appropriate priority code using <Problem Priority Guidelines>. Problems should be prioritized in the same way as incidents; prioritization should also consider the frequency and impact of the related incidents. | Input: Problem Record, Problem Priority Guidelines; Output: Problem Record | Problem Manager |
5 | Resource Allocation | Depending on the expertise needed to resolve the Problem, the Problem Manager will assign it to an appropriate Problem Owner. The Problem Manager can set up a team of one or more experts to resolve the Problem; however, the Problem Owner remains responsible for resolving the Problem at all times. | Input: Problem List; Output: Assigned Problems | Problem Manager |
6 | Investigation & Diagnosis | An investigation should be conducted to diagnose the root cause of the problem. The Configuration Management System (CMS) must be used to determine the level of impact and to assist in pinpointing and diagnosing the exact point of failure. The Known Error Database (KEDB) should also be accessed, and problem-matching techniques (such as keyword searches) should be used to see if the problem has occurred before and, if so, to find the resolution. Many problem analysis, diagnosis, and solving techniques are available. Refer to <Problem Investigation & RCA Techniques>. The RCA should be documented and sent for the necessary approvals. Refer to <RCA Template>. | Input: Problem Record, CMS, KEDB, Problem Investigation & RCA Techniques; Output: RCA | Application Management Team/ Problem Manager |
7 | RCA Approved? | After investigation and diagnosis, the RCA is completed and documented. The RCA should be reviewed and approved by the appropriate authorities. If the RCA is not approved, further investigation is required until an approved RCA is reached. | Input: RCA; Output: Approved RCA | Problem Manager & Customer |
8 | Workaround/ Permanent Fix? | In some cases, it may be possible to find a workaround to the incidents caused by the problem, that is, a temporary way of overcoming the difficulties. It is important, however, that work on a permanent resolution continues. Where a workaround is found, the Problem Record must therefore remain open, and details of the workaround must always be documented within the Problem Record. | Input: Problem Record; Output: Workaround | Application Management Team |
9 | Create Known Error Record | As soon as the diagnosis is complete, and particularly where a workaround has been found (even though it may not yet be a permanent resolution), a Known Error Record must be raised and placed in the Known Error Database so that, if further incidents or problems arise, they can be identified and the service restored more quickly. During testing of new applications, systems, or releases, it is possible that minor faults are not rectified, often because of the balance that has to be struck between delivering new functionality to the business as quickly as possible and ensuring totally fault-free code or components. Where a decision is made to release something into the production environment that includes known deficiencies, these should be logged as Known Errors in the KEDB, together with details of workarounds or resolution activities. There should be a formal step in the testing sign-off that ensures that this handover always takes place. | Input: Problem Record, Workaround, Permanent Solution, Product Information; Output: Known Error Record | Problem Manager |
10 | Partner Involvement Required? | If it becomes evident during the investigation phase that a Partner should be involved in resolving the problem, the Problem Manager must coordinate and work closely with the Partner to resolve it. | Input: Problem Record, RCA; Output: Partner involvement | Problem Manager |
11 | Change Required? | If changes are required to solve the problem, an RFC should be raised and communicated to Change Management. Refer to <RFC Template>. | Input: Problem Record; Output: RFC | Application Management Team |
12 | Resolution | As soon as a solution has been found, and if it requires a Change, the Change Management process should be initiated. The resolution should be applied only when the change has been approved and scheduled for release. There will be some problems for which a Business Case for resolution cannot be justified (e.g., where the impact is limited but the cost of resolution would be extremely high). In such cases, a decision must be taken to leave the Problem Record open but to use a workaround description in the Known Error Record to detect and resolve any recurrences quickly. | Input: Problem Record; Output: Resolution | Application Management Team |
13 | Major Problem Review | After every major problem (as determined by the <Problem Priority Guidelines>), a review should be conducted to learn any lessons for the future. Lessons learned should be documented in the appropriate procedures, work instructions, diagnostic scripts, or Known Error Records. The knowledge gained from the review should be incorporated into a service review meeting with <Customer’s Name> to ensure the customer is aware of the actions taken and the plans to prevent future major incidents from occurring. This helps to improve customer satisfaction and assures the <Customer’s Name> business that Oracle GSD-AMS is handling major incidents responsibly and actively working to prevent their recurrence. Refer to <Major Problem Review agenda & Action tracker> for the review topics. | Input: Major Problem Record; Output: Minutes of Meeting, Improvement Actions | Problem Manager |
14 | Problem Closure | When any change has been completed (and successfully reviewed) and the resolution has been applied, the Problem Record should be formally closed. Refer to <Problem Closure Checklist> for the checks that should be done during closure. | Input: Problem Record; Output: Lessons learnt | Problem Manager |
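The prioritization step (activity 4) combines business impact, the time frame for resolution, and the frequency of related incidents. The actual <Problem Priority Guidelines> are not reproduced in this document, so the sketch below is only an assumed scheme: the three-point `Level` scale, the impact-plus-urgency formula, and the `frequency_threshold` of 5 are all hypothetical values chosen for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale for impact and urgency (1 = highest)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def problem_priority(impact: Level, urgency: Level,
                     related_incidents: int,
                     frequency_threshold: int = 5) -> int:
    """Return a priority from 1 (most critical) to 5 (least critical)."""
    # Base priority from an assumed impact + urgency scheme.
    base = int(impact) + int(urgency) - 1
    # Assumed rule: many related incidents raise the priority by one step.
    if related_incidents >= frequency_threshold:
        base = max(1, base - 1)
    return base


# Medium impact, low urgency, but seven related incidents so far.
print(problem_priority(Level.MEDIUM, Level.LOW, related_incidents=7))  # 3
```

Reusing the same scale as Incident Management keeps incident and problem priorities comparable, which is what the process recommends.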
References
- Procedure for Incident Management Process
- Procedure for Asset & Configuration Management
- Procedure for Change Management
- Procedure for Knowledge Management
- Procedure for Partner Management
Measurements
Reports are generated based on the metrics below.
All metrics should be broken down by category and priority.
Metrics | Description |
No. of Problems recorded | Total number of problems recorded in the period (as a control measure) |
% of problems resolved within SLA | The percentage of problems resolved within SLA targets (and the percentage that are not). |
No. of open Problems | Number and trend of the open problems (Including backlog) |
No. of Major Problems | The number of major problems (opened and closed and backlog) |
No. of Major Problem reviews | Total number of Major Problems resolved vs. number of reviews conducted |
No. of Known Error additions | Total number of Known Errors added to the KEDB |
No. of Problems waiting for Change | Number of Problems for which RCA is available and waiting for change to resolve the Problem |
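A few of these metrics can be computed directly from problem records. The sketch below shows one possible way to do so, broken down by category; the `ProblemRecord` fields and the sample data are hypothetical, and a real reporting tool would read these values from the Problem Management system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProblemRecord:
    """Hypothetical record holding only the fields these metrics need."""
    category: str
    opened: datetime
    sla_target: timedelta
    resolved: Optional[datetime] = None  # None while the problem is open


def pct_resolved_within_sla(records):
    """Percentage of resolved problems closed within SLA; None if none resolved."""
    resolved = [r for r in records if r.resolved is not None]
    if not resolved:
        return None
    within = sum(1 for r in resolved if r.resolved - r.opened <= r.sla_target)
    return 100.0 * within / len(resolved)


def open_problems(records):
    """Number of problems that have not been resolved yet."""
    return sum(1 for r in records if r.resolved is None)


def by_category(records):
    """Group records by category so each metric can be reported per category."""
    groups = defaultdict(list)
    for record in records:
        groups[record.category].append(record)
    return dict(groups)


# Made-up sample data, for illustration only.
t0 = datetime(2023, 1, 2, 9, 0)
sample = [
    ProblemRecord("network", t0, timedelta(days=5), t0 + timedelta(days=3)),
    ProblemRecord("network", t0, timedelta(days=5), t0 + timedelta(days=8)),
    ProblemRecord("database", t0, timedelta(days=5)),
]

for category, group in by_category(sample).items():
    print(category, len(group), open_problems(group),
          pct_resolved_within_sla(group))
```

The same grouping function could take a priority field instead of a category to produce the per-priority breakdown.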
Authored by Vijay Chander – All Rights Reserved 2023