IT Event Management Process explained!

IT Event Management: Strategies and Best Practices

Purpose & Scope

Purpose

The objective of the Event Management process is to monitor all events that occur throughout the <Customer’s Name> IT infrastructure & applications, both to confirm normal operation and to detect exception conditions and escalate them to the AMS service desk.

Scope

The scope of this process covers the customer’s IT infrastructure & applications that are under the control of the AMS service desk.

Definitions

Event

An Event can be defined as any detectable occurrence that has significance for the management of the <Customer’s Name> IT infrastructure & applications or for the delivery of <Customer’s Name> IT Services.

Alert

An Alert can be defined as a warning that a threshold has been reached, something has changed, or a Failure has occurred.

Configuration Item (CI)

A Configuration Item is any component that needs to be managed in order to deliver a <Customer’s Name> IT Service.
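As a rough illustration of how these three definitions relate, the sketch below models them as small Python data classes. The field names (`ci_id`, `metric`, `value`) and the helper `alert_if_threshold_reached` are hypothetical and are not part of any Event Management tool; they only show that an Alert is raised about an Event, and an Event always belongs to a CI.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConfigurationItem:
    """A component that must be managed to deliver the IT service."""
    ci_id: str   # hypothetical identifier
    name: str


@dataclass(frozen=True)
class Event:
    """A detectable occurrence that is significant for managing a CI."""
    ci: ConfigurationItem
    metric: str    # hypothetical: what was measured, e.g. "disk_used_pct"
    value: float
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass(frozen=True)
class Alert:
    """A warning raised about an event, e.g. because a threshold was reached."""
    event: Event
    message: str


def alert_if_threshold_reached(event: Event, threshold: float) -> Optional[Alert]:
    """Return an Alert when the event's value reaches the threshold, else None."""
    if event.value >= threshold:
        message = (
            f"{event.ci.name}: {event.metric}={event.value} "
            f"reached threshold {threshold}"
        )
        return Alert(event, message)
    return None
```

Only the threshold case of an Alert is modelled here; alerts for changes or failures would need their own triggers.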

Roles and Responsibilities

Operations Control Function responsibilities

  • Oversee the execution & monitoring of operational activities.
  • Define a central observation & monitoring capability, then use those consoles to exercise monitoring & control activities (Console Management).
  • Manage routine batch jobs or scripts (Job Scheduling).
  • Perform backup & restore operations.

Service desk responsibilities

  • Respond to alerts when required.
  • Take ownership of Incidents identified by Event Management and escalate them to the appropriate team.

Application Management Team responsibilities

  • Participate in the instrumentation of the service, classify events and ensure that auto-responses are defined.
  • Test the service to ensure that events are generated properly and that the defined responses are appropriate.
  • Deal with Incidents & Problems related to events.
  • Perform Event Management activities for the systems that are under the control of the Application Management Service (AMS).

 

Input, Output

Inputs

  • Event
  • Threshold and Exception rules
  • Instrumentation Decisions
  • Thresholds

Outputs

  • Communications
  • Notifications
  • Incident
  • Problem
  • Change

Event Management Process flow

The ability to detect events, make sense of them and determine the appropriate control action is provided by Event Management. Event Management is therefore the basis for Operational Monitoring and Control of <Customer’s Name> IT infrastructure.

Event Management process flow

 

Activity 1: Event Occurrence
Description: Events occur continuously, but not all of them will be detected & registered. It is important to understand which types of events need to be detected.
Input: Instrumentation decisions, Event
Output: Detected event
Role: –

Activity 2: Instrumentation
Description: Define what needs to be monitored about CIs and how to monitor them. Define & design exactly what to monitor, and how to monitor and control the IT Service. The “Instrumentation decisions & mechanism checklist” provides the basis for the key decisions & mechanisms for instrumentation.
Input: –
Output: Instrumentation decisions
Role: Application Management Team

Activity 3: Event Notification
Description: CIs should be configured to generate a standard set of events, based on what is required to operate the CI. A general principle of event notification is that the more meaningful the data it contains and the more targeted the audience, the easier it is to make decisions about the event.
Input: Event
Output: Event Notification
Role: Automated Tool

Activity 4: Event Detection
Description: Once an event notification has been generated, it should be detected so that the meaning of the event can be read and interpreted.
Input: –
Output: –
Role: Automated Tool

Activity 5: Event Filtering
Description: The first level of correlation is performed here. Determine whether the event is informational, a warning or an exception.
Input: Detected Event
Output: Informational/Warning/Exception
Role: Operations Control/Automated Tool

Activity 6: Significance of Event?
Description: Categorize the significance of the event into the following broad categories:
  • Informational: An event that does not require any action and does not represent an exception. These events are typically used to check the status of a device or service, or to confirm the successful completion of an activity.
  • Warning: An event that is generated when a service or device is approaching a threshold. Warnings are raised so that the situation can be checked and the appropriate action taken to prevent an exception.
  • Exception: An event indicating that a service or device is currently operating abnormally. This typically means that an SLA has been breached and the business is being impacted. Exceptions can represent a total failure, impaired functionality or degraded performance.
Input: –
Output: Categorization of event
Role: Operations Control

Activity 7: Event Correlation
Description: If an event is significant, a decision has to be made about exactly what the significance is and what actions need to be taken to deal with it. Correlation is done by a “Correlation Engine”, which compares the event with a set of criteria & rules. The correlation engine is programmed according to the performance requirements. It also matches events against each other to check for similarity.
Input: Event
Output: Required response
Role: Operations Control/Automated Tool

Activity 8: Trigger
Description: This is the mechanism that initiates the required response identified by the correlation engine. At this point a number of response options are available, and they can be chosen in any combination.
Input: Required response
Output: Response actions
Role: Operations Control/Automated Tool

Activity 9: Log Event
Description: Log the event in the Event Management tool or in a system log, regardless of what activity is performed.
Input: Event
Output: Event Log
Role: Operations Control/Automated Tool

Activity 10: Auto Response
Description: Some events are understood well enough that the appropriate response has already been defined and automated. The trigger initiates the action and then evaluates whether it completed successfully. If not, an Incident or Problem Record is created.
Input: Event
Output: Auto Response
Role: Automated Tool

Activity 11: Alert & Human Intervention
Description: The event is escalated if it requires human intervention. The alert contains all the information necessary for the person handling it to determine the appropriate action.
Input: Event
Output: –
Role: Operations Control/Automated Tool

Activity 12: Incident/Problem/Change?
Description: Some events represent a situation where the appropriate response needs to be handled through the Incident, Problem or Change Management process. A single event may initiate any one or a combination of these three processes.
Input: Event
Output: Incident/Problem/Change
Role: Operations Control/Automated Tool

Activity 13: Review Actions
Description: Significant events & exceptions are reviewed to check how they were handled. The handover to other processes is also reviewed. Review results are input to the Improvement process.
Input: Event Log
Output: Improvement actions, Corrective actions
Role: Application Management Team

Activity 14: Effective?
Description: Check the effectiveness of the actions taken on the event. If the results are satisfactory, proceed to closure; otherwise invoke the Incident, Problem or Change Management process as required.
Input: Event Log
Output: Actions identified
Role: Application Management Team

Activity 15: Close Event
Description: All logged events should be formally closed. Events should be linked to the appropriate Incident/Problem/Change records.
Input: Event Log
Output: Closed Event
Role: Operations Control
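To make Activities 5 to 10 more concrete, here is a minimal Python sketch of filtering, correlation, triggering and logging. It is an illustration under stated assumptions, not a description of any particular monitoring product: the threshold values, the `AUTO_RESPONSES` set, the response names and the mapping from significance to response are all hypothetical, since the process itself does not prescribe them.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


@dataclass(frozen=True)
class Event:
    ci: str        # Configuration Item that raised the event
    metric: str    # what was measured (hypothetical names)
    value: float


# Hypothetical threshold rules: metric -> (warning level, exception level)
THRESHOLDS = {"disk_used_pct": (80.0, 95.0)}

# Hypothetical (ci, metric) pairs for which an automated response is defined
AUTO_RESPONSES = {("db-01", "disk_used_pct")}


def classify(event: Event) -> Significance:
    """Activities 5-6: informational, warning or exception?"""
    no_limit = (float("inf"), float("inf"))
    warning_level, exception_level = THRESHOLDS.get(event.metric, no_limit)
    if event.value >= exception_level:
        return Significance.EXCEPTION
    if event.value >= warning_level:
        return Significance.WARNING
    return Significance.INFORMATIONAL


class CorrelationEngine:
    """Activity 7: match similar events and choose the required response."""

    def __init__(self) -> None:
        self.seen: Counter = Counter()

    def required_response(self, event: Event, significance: Significance) -> str:
        key = (event.ci, event.metric, significance)
        self.seen[key] += 1
        if significance is Significance.INFORMATIONAL:
            return "log_only"
        if self.seen[key] > 1:
            return "suppress_duplicate"   # a similar event was already handled
        if significance is Significance.WARNING:
            if (event.ci, event.metric) in AUTO_RESPONSES:
                return "auto_response"
            return "alert"
        return "incident"


def handle(events, engine=None):
    """Activities 8-9: pick a response for each event and log every event."""
    if engine is None:
        engine = CorrelationEngine()
    event_log = []
    for event in events:
        significance = classify(event)
        response = engine.required_response(event, significance)
        # Every event is logged, regardless of the response chosen.
        event_log.append((event, significance, response))
    return event_log


if __name__ == "__main__":
    readings = [42.0, 85.0, 86.0, 97.0]
    log = handle(Event("db-01", "disk_used_pct", v) for v in readings)
    for event, significance, response in log:
        print(event.value, significance.value, response)
```

In this sketch the second warning for the same CI and metric is suppressed as a duplicate, which is one simple form of the similarity matching described in Activity 7. A real correlation engine would normally also consider time windows and relationships between CIs.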

References

Measurements

Reports are generated based on the metrics below.

  • Number of events by category: occurrence of events in each category, which indicates performance.
  • Number of events by significance: number of events generated that are Informational, Warning or Exception.
  • Number & percentage of events that required human intervention: how many events required human intervention, which indicates the opportunity for automation.
  • Number & percentage of events that resulted in Incidents and Changes: the share of events that resulted in Incidents & Changes.
  • Number & percentage of repeated/duplicate events: helps to fine-tune the correlation engine.
  • Number of events routed for review: helps to understand the effectiveness of Event Management.
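The counts and percentages above can be derived directly from the event log. The following sketch assumes each logged event is a dictionary with the hypothetical keys `significance`, `human_intervention`, `raised_incident_or_change`, `duplicate` and `reviewed`; a real Event Management tool will expose its own schema.

```python
from collections import Counter


def summarize(event_log):
    """Compute report metrics from a list of event records.

    Each record is a dict with hypothetical keys:
    "significance" (str), "human_intervention" (bool),
    "raised_incident_or_change" (bool), "duplicate" (bool), "reviewed" (bool).
    """
    total = len(event_log)

    def count(key):
        # Number of events for which the given flag is set.
        return sum(1 for record in event_log if record[key])

    def pct(n):
        # Percentage of all events, rounded to one decimal place.
        return round(100.0 * n / total, 1) if total else 0.0

    human = count("human_intervention")
    escalated = count("raised_incident_or_change")
    duplicates = count("duplicate")
    return {
        "events_by_significance": dict(
            Counter(record["significance"] for record in event_log)
        ),
        "human_intervention": (human, pct(human)),
        "incidents_and_changes": (escalated, pct(escalated)),
        "duplicates": (duplicates, pct(duplicates)),
        "routed_for_review": count("reviewed"),
    }
```

Each percentage is reported alongside its raw count, so that a high percentage on a very small number of events is not misread as a trend.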

Authored by Vijay Chander – All rights Reserved – 2023
