Accelerating the ITIL Incident Management Lifecycle
How can you accelerate the ITIL Incident Management Lifecycle?
Before we go there, let’s clarify exactly what Incident Management is and where it sits in the overall picture. IT Service Management (ITSM) encompasses the full range of processes and activities that help design, plan, deliver and control the IT services for your employees and your customers.
ITIL (formerly an acronym for Information Technology Infrastructure Library, but used nowadays as a standalone term) provides a detailed set of best practices to facilitate IT Service Management. It is designed to ensure alignment between the IT services you provide and the overall needs of your business.
The ITIL approach is based on bringing focus to five key process areas:
- Service Strategy – Understanding organizational objectives and customer needs.
- Service Design – Turning the service strategy into a plan for delivering business objectives.
- Service Transition – Developing and improving capabilities for introducing services.
- Service Operation – Managing the services in supported environments.
- Continual Service Improvement – Making incremental and large-scale improvements on an ongoing basis.
What is ITIL Incident Management?
In the ITIL approach, Incident Management lives under the Service Operation umbrella, alongside Event Management and Problem Management. These activities interact and overlap to maintain the uptime of your infrastructure, to rapidly fix issues as they arise and to prevent them from recurring.
It’s worth spending a moment to dissect the differences between the three:
Event Management is focused on monitoring the infrastructure to detect changes that may or may not lead to an incident being logged. An incident is defined as an unplanned interruption or degradation of an IT service.
Incident Management is focused on the restoration of service as fast as possible.
Problem Management aims to fix the root cause of incidents (single or multiple) and the prevention of further incidents.
One way to look at it is that Event Management looks for things that are going awry, Problem Management works on fixing issues and Incident Management is all about restoring the service once normalcy is compromised.
The Incident Management Lifecycle’s Phases
The Incident Management lifecycle progresses through several phases. Each of these phases has the potential to significantly impact how fast an interrupted or degraded service is restored. They include:
- Incident identification
- Logging
- Categorization/Prioritization
- Investigation/Escalation
- Resolution
- Closure
- Communication
Typically a service desk provides a single point of contact and is the primary functional organization responsible for Incident Management. The service desk manages the lifecycle and interfaces with users affected by an outage.
The transformation of an event to an incident provides the first opportunity for acceleration. For example, finding the root cause of issues is the most time-consuming part of solving them. Many products on the market use correlation simply to suppress downstream alarms and point to the first unresponsive device they see as the root cause of an outage. But to really reduce mean time to repair (MTTR), it is important to verify what the issue is and pinpoint the true cause.
The ability of an IT Operations Management (ITOM) platform to filter through the tidal wave of events and rapidly zero in on the key ones directly affects the time it takes to generate an incident. ITOM platforms need to be able to see the whole picture. Gaps in monitoring significantly impair the ability to generate incidents.
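As a rough illustration of the basic correlation described above, the sketch below suppresses downstream alarms by keeping only those failed devices whose upstream dependencies are still healthy. The topology, device names and the `root_cause_candidates` function are hypothetical, and this is only the starting point: as noted, a candidate still has to be verified before it is treated as the true cause.

```python
from typing import Dict, List, Set


def root_cause_candidates(depends_on: Dict[str, List[str]], down: Set[str]) -> Set[str]:
    """Return the down devices that have no down upstream dependency.

    Alarms from every other down device are treated as downstream noise.
    """
    candidates = set()
    for device in down:
        upstream = depends_on.get(device, [])
        if not any(parent in down for parent in upstream):
            candidates.add(device)
    return candidates


# Hypothetical topology: each device lists what it directly depends on.
depends_on = {
    "web-01": ["switch-a"],
    "web-02": ["switch-a"],
    "switch-a": ["core-router"],
    "db-01": ["switch-b"],
    "switch-b": ["core-router"],
}

# Three alarms arrive, but only one device is a plausible starting point.
down = {"web-01", "web-02", "switch-a"}
print(sorted(root_cause_candidates(depends_on, down)))  # ['switch-a']
```

A gap in monitoring shows up directly here: if `switch-a` were not monitored, both web servers would be reported as independent candidates.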
Staying Ahead with Predictive and Proactive Steps
Better still, predictive and proactive capabilities can get ahead of problem identification and, in some cases, take steps to avoid the issue altogether.
Once an incident is identified, the next step is logging it and collecting basic information. Gathering the correct information can go a long way to adding efficiency to the eventual resolution. Templates can provide a systematic approach to facilitate this.
Categorization and prioritization follow. The former makes it possible to rapidly figure out whom to contact to resolve the issue and the latter ensures that incidents are addressed in the right order. Look for a platform that can prioritize incidents based on the business service criticality for the services impacted.
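One way to picture prioritization by business service criticality is the minimal sketch below. The criticality table, the `Incident` fields and the tie-break on opening time are all assumptions made for illustration, not features of any particular platform.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical criticality ratings: 1 is the most business-critical.
SERVICE_CRITICALITY: Dict[str, int] = {
    "payments": 1,
    "email": 2,
    "intranet-wiki": 3,
}
DEFAULT_CRITICALITY = 4  # assumed rating for services not in the table


@dataclass
class Incident:
    incident_id: str
    category: str
    affected_services: List[str]
    opened_at: int  # minutes since an arbitrary reference point


def priority(incident: Incident) -> int:
    """An incident inherits the rating of its most critical affected service."""
    ratings = [SERVICE_CRITICALITY.get(s, DEFAULT_CRITICALITY)
               for s in incident.affected_services]
    return min(ratings, default=DEFAULT_CRITICALITY)


def work_queue(incidents: List[Incident]) -> List[Incident]:
    """Order by priority first, then by age so older incidents come first."""
    return sorted(incidents, key=lambda i: (priority(i), i.opened_at))


incidents = [
    Incident("INC-1", "application", ["intranet-wiki"], opened_at=0),
    Incident("INC-2", "network", ["email", "payments"], opened_at=10),
    Incident("INC-3", "application", ["email"], opened_at=5),
]
print([i.incident_id for i in work_queue(incidents)])  # ['INC-2', 'INC-3', 'INC-1']
```

The category field is carried along unchanged; it decides whom to contact, while the priority decides the order of work.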
That is followed by investigation. Then, if case resolution procedures are not readily available, service desk operators can rely on workflows to determine escalation.
Each of these steps can be accelerated immensely by leveraging automation techniques. Frequently occurring incidents can be referenced in a knowledge base to rapidly move the ticket along based on preconfigured workflows. For example, tickets can be automatically routed to the experts for that particular type of issue, saving time not only on workflow delegation but also on troubleshooting.
Once the incident is routed to the best-suited specialist, resolution becomes the focus. This is the primary goal of the Problem Management team, and the ability of the tools used to rapidly get to the root cause of issues directly affects the restoration of services. That, of course, leads to case closure.
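Automated routing of this kind can be sketched as a simple keyword match against past tickets. The knowledge-base entries, team names and `route_ticket` function below are hypothetical; a real service desk tool would likely use richer matching than word overlap.

```python
from typing import Dict, List

# Hypothetical knowledge-base entries: keywords seen in frequently occurring
# incidents and the team that usually resolves them.
KNOWLEDGE_BASE: List[Dict] = [
    {"keywords": {"vpn", "tunnel", "remote"}, "team": "network-ops"},
    {"keywords": {"password", "locked", "login"}, "team": "identity-support"},
    {"keywords": {"database", "query", "timeout"}, "team": "dba-team"},
]


def route_ticket(summary: str, default_team: str = "service-desk") -> str:
    """Send the ticket to the team whose keywords overlap most with the summary.

    Tickets that match nothing stay with the default team for manual triage.
    """
    words = set(summary.lower().split())
    best_team, best_score = default_team, 0
    for entry in KNOWLEDGE_BASE:
        score = len(words & entry["keywords"])
        if score > best_score:
            best_team, best_score = entry["team"], score
    return best_team


print(route_ticket("VPN tunnel drops for remote staff"))  # network-ops
print(route_ticket("Printer is out of toner"))            # service-desk
```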
Multi-perspective analysis quickly and accurately pinpoints root cause and reduces false alarms by looking at problems from multiple viewpoints. The analysis takes normality/abnormality, relationships and performance into account to find the true root cause of failures and performance problems, not simply what failed.
Workflows built to streamline processes ultimately save both time and money. Your monitoring tool should detect a problem, look at it from multiple viewpoints, leverage automation to take corrective action, retest to see if the problem was fixed and then create a ticket.
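The detect, correct, retest and ticket sequence described above might look something like the following sketch. The `handle_alert` function and its three callbacks are illustrative assumptions; in practice each would be supplied by the monitoring and ticketing tools in use.

```python
from typing import Callable, Dict, List


def handle_alert(
    check: Callable[[], bool],
    corrective_action: Callable[[], None],
    create_ticket: Callable[[Dict[str, str]], None],
) -> str:
    """Verify the problem, attempt a fix, retest, then record the outcome."""
    if check():
        # The service is healthy on verification, so nothing is logged.
        return "false-alarm"

    corrective_action()

    if check():
        # Retest passed: the ticket documents what the automation did.
        create_ticket({"status": "resolved", "note": "fixed by automated action"})
        return "auto-resolved"

    # Retest failed: the ticket goes to a person with the attempt on record.
    create_ticket({"status": "open", "note": "automated action did not restore service"})
    return "escalated"


# Simulated service that a restart brings back.
state = {"service_up": False}
tickets: List[Dict[str, str]] = []

result = handle_alert(
    check=lambda: state["service_up"],
    corrective_action=lambda: state.update(service_up=True),
    create_ticket=tickets.append,
)
print(result, tickets)
```

Creating the ticket after the retest means the record reflects the real outcome, whether the automation resolved the issue or a specialist still needs to step in.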
Best Practices for Maintaining Effective Communication
Additionally, the use of a Known Error Database (KEDB) is a significant accelerator.
Incident Management teams benefit from the use of a KEDB that is primarily maintained by the Problem Management team. In return, Problem Management benefits greatly from accurate incident data, which it uses to build the KEDB and to rapidly diagnose issues.
It follows that as the incident is closed, relevant information is captured in the KEDB for future use. Not doing so can add inefficiencies to the overall Incident Management lifecycle – and is a tragic loss of otherwise easily leveraged information.
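A minimal sketch of this loop, assuming a KEDB keyed on a normalized symptom description, is shown below. The `KnownErrorDatabase` class, its fields and the `close_incident` helper are hypothetical and stand in for whatever the ITSM tool provides.

```python
from typing import Dict, Optional


class KnownErrorDatabase:
    """Toy in-memory KEDB keyed on a normalized symptom string."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = {}

    @staticmethod
    def _key(symptom: str) -> str:
        # Lowercase and collapse whitespace so minor wording noise still matches.
        return " ".join(symptom.lower().split())

    def lookup(self, symptom: str) -> Optional[Dict[str, str]]:
        return self._entries.get(self._key(symptom))

    def record(self, symptom: str, root_cause: str, workaround: str) -> None:
        self._entries[self._key(symptom)] = {
            "root_cause": root_cause,
            "workaround": workaround,
        }


def close_incident(kedb: KnownErrorDatabase, symptom: str,
                   root_cause: str, workaround: str) -> None:
    """Capture what was learned as part of closing the incident."""
    kedb.record(symptom, root_cause, workaround)


kedb = KnownErrorDatabase()
print(kedb.lookup("Mail queue backing up"))  # None: first occurrence

close_incident(kedb, "Mail queue backing up",
               root_cause="disk full on relay host",
               workaround="purge old spool files")

# The next occurrence is diagnosed from the KEDB instead of from scratch.
print(kedb.lookup("mail  queue backing up"))
```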
The primary goal of Incident Management is the restoration of service. However, the other essential aim is to maintain effective communication with the user community and the staff within the IT organization working on the incident.
The ability to accelerate the Incident Management lifecycle affects not only how your business performs, but also how your customers perceive you. Incident Management has a direct effect on customer satisfaction. After all, the overarching goal is to reduce the impact to the business whenever an outage occurs.