Thursday, May 19, 2011

WebSphere Tuning

We are just finishing up an objective that includes tuning the garbage collection (GC) algorithm of the WebSphere JVM. We have four servers in the cluster, and each zLinux server (guest) runs one JVM. My colleague picked four different GC tuning settings and applied one to each guest's JVM: the first setting to the first guest, the second to the second guest, and so on, so that each of the four guests in the cluster used a unique GC tuning setting. We ran a load test while monitoring the JVMs and observed that the second guest performed the best. That setting was then applied to all of the guests' JVMs and the test was run again. Performance improved dramatically.

Instead of running four tests to determine which setting was best, we only needed to run one. It saved us a ton of time!
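
The comparison step can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used: the guest names, the setting labels, the metrics, and every number below are made-up placeholders, and the ranking rule (highest throughput, with the GC pause share as a tie-breaker) is an assumption about what "performed the best" means.

```python
from dataclasses import dataclass


@dataclass
class GuestResult:
    guest: str             # cluster member name (hypothetical)
    gc_setting: str        # label for the GC tuning setting (hypothetical)
    throughput_rps: float  # requests per second during the load test
    gc_pause_pct: float    # percentage of test time spent in GC pauses


def pick_best(results):
    """Return the result with the highest throughput; a lower GC pause share breaks ties."""
    return max(results, key=lambda r: (r.throughput_rps, -r.gc_pause_pct))


# Placeholder measurements, one per guest, all taken from the same load test.
results = [
    GuestResult("guest1", "setting-A", 410.0, 6.2),
    GuestResult("guest2", "setting-B", 520.0, 2.1),
    GuestResult("guest3", "setting-C", 455.0, 4.8),
    GuestResult("guest4", "setting-D", 430.0, 5.5),
]

best = pick_best(results)
print(f"Apply {best.gc_setting} (from {best.guest}) to every JVM in the cluster")
```

A side-by-side comparison like this is only fair if the load is spread evenly across the four guests during the test, so it is worth checking the request counts per guest before trusting the ranking.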

Friday, May 6, 2011

Problem Management

There comes a time in the hosting world when there is an incident in the LIVE environment. The service desk receives some notification, maybe from a client or from a monitoring threshold that has been breached. There is an incident, and it is the job of the service center to mitigate it and restore the service to normal working conditions. The service desk will start a conference call. It is their responsibility to get anyone involved who can assist with the incident, and to do it as quickly as possible.

After some time the incident is resolved. The case for the incident has been closed and everyone goes back to working on whatever they were doing prior to being called into action for the incident.

This should NOT be the end of action.  Some sort of change should be submitted so that the issue does not happen again.

There should be a person who will assume the role of a Problem Manager. The Problem Manager should own the task of finding the root cause of the incident if the severity / impact is high enough. It is also important to get everyone, or at least the majority, to agree on that root cause. It could be a code problem, an infrastructure glitch, a procedure that was not followed, etc. The Problem Manager should also look at trends. Multiple incidents can often point to a single underlying problem.
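
Looking at trends can start out very simply. Here is a minimal sketch in Python, assuming each closed incident is recorded with an agreed root cause category; the incident IDs, the category names, and the threshold of two are all hypothetical and not taken from any real service desk tool.

```python
from collections import Counter


def recurring_causes(incidents, threshold=2):
    """Return root cause categories that appear in at least `threshold` incidents."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= threshold}


# Placeholder incident records (hypothetical IDs and categories).
incidents = [
    {"id": "INC-1", "root_cause": "code"},
    {"id": "INC-2", "root_cause": "infrastructure"},
    {"id": "INC-3", "root_cause": "code"},
    {"id": "INC-4", "root_cause": "procedure"},
]

print(recurring_causes(incidents))
```

Any category that shows up repeatedly is a candidate for a problem record of its own, even if each individual incident looked minor.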

The Problem Manager should have the ability to form a group of Subject Matter Experts (SMEs), sort of a 'SWAT' team, to assist with uncovering the root cause. The SME team should consist of members from different parts of the organization (development, deployment, operations, service desk, monitoring group, networking / infrastructure, etc.).

With the help of the SME team, the Problem Manager will be able to determine the root cause and subsequently submit a change order to make sure that the issue does not happen again. 

This process is outlined extensively in the IT Infrastructure Library (ITIL).