Friday, May 6, 2011

Problem Management

There comes a time in the hosting world when there is an incident in the LIVE environment.  The service desk receives a notification, perhaps from a client or from a monitoring threshold that has been breached.  It is the job of the service center to mitigate the incident and restore the service to normal working conditions.  The service desk starts a conference call, and it is their responsibility to bring in anyone who can assist with the incident, as quickly as possible.

After some time the incident is resolved.  The case for the incident is closed and everyone goes back to whatever they were working on before being called into action.

This should NOT be the end of the action.  A change should be submitted so that the issue does not happen again.

There should be a person who assumes the role of Problem Manager.  The Problem Manager should own the task of finding the root cause of the incident if the severity / impact is high enough.  It is also important to get everyone, or at least the majority, to agree on that root cause.  It could be a code problem, an infrastructure glitch, a procedure that was not followed, etc.  The Problem Manager should also look at trends, because multiple incidents often point to a single underlying problem.
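As a rough illustration of the trend-watching idea, here is a minimal Python sketch that groups closed incidents by the affected component and flags any component that either recurs or had a high-severity incident.  The `Incident` fields, the severity scale (1 = most severe), and the thresholds are all assumptions made up for this example; a real ticketing system will have its own fields and rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; not taken from any particular ticketing tool.
    case_id: str
    component: str  # e.g. "database", "load-balancer"
    severity: int   # assumed scale: 1 = most severe


def problem_candidates(incidents, min_recurrence=2, high_severity=1):
    """Return {component: [case ids]} for components worth a root cause review.

    A component qualifies if it has at least `min_recurrence` incidents,
    or if any one of its incidents is at or above the `high_severity` level.
    """
    by_component = defaultdict(list)
    for incident in incidents:
        by_component[incident.component].append(incident)

    candidates = {}
    for component, group in by_component.items():
        recurring = len(group) >= min_recurrence
        severe = any(i.severity <= high_severity for i in group)
        if recurring or severe:
            candidates[component] = [i.case_id for i in group]
    return candidates


if __name__ == "__main__":
    closed = [
        Incident("C1", "database", 3),
        Incident("C2", "database", 3),
        Incident("C3", "cache", 1),
        Incident("C4", "dns", 4),
    ]
    print(problem_candidates(closed))
```

In this sample the two low-severity database incidents are flagged because they recur, the single cache incident is flagged because of its severity, and the one-off DNS incident is left alone.  The output is only a starting list for the Problem Manager to review, not a root cause.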

The Problem Manager should have the ability to form a group of Subject Matter Experts (SMEs), a sort of 'SWAT' team, to assist with uncovering the root cause.  The SME team should consist of members from different parts of the organization (development, deployment, operations, service desk, monitoring group, networking / infrastructure, etc.).

With the help of the SME team, the Problem Manager will be able to determine the root cause and subsequently submit a change order to make sure that the issue does not happen again. 

This process is outlined extensively in the IT Infrastructure Library (ITIL).
