In organizations where IT support handles both the Service Desk and new project work, the difference between an incident and a problem is often overlooked. In most cases, when an incident occurs (one that a system administrator can often resolve with a sequence of commands), we focus too much on finding the cause in order to remove it permanently, when the Service Desk should instead restore the service as quickly as possible, providing a workaround if necessary.
Fast recovery of an IT service is a priority, especially when many users or clients are affected by the outage. The analysis to identify the root cause can run in parallel or in a second phase, as part of the problem management process.
In some cases the workaround can also be integrated into a monitoring system that detects the malfunction and applies the workaround to restore the service automatically. If this happens at a time when users or customers are not using the IT service, the service is restored with no impact on them. This leads to higher user satisfaction and, at the same time, more peace of mind for the IT department.
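The check-then-workaround idea described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not how any particular monitoring product implements it: the function name `restore_service`, the `max_attempts` parameter and the three status strings are hypothetical.

```python
from typing import Callable


def restore_service(is_healthy: Callable[[], bool],
                    workaround: Callable[[], None],
                    max_attempts: int = 2) -> str:
    """Apply a known workaround when a health check fails.

    Returns "ok" if nothing had to be done, "recovered" if the
    workaround restored the service, and "escalate" if a human
    has to take over. All names here are illustrative.
    """
    if is_healthy():
        return "ok"
    for _ in range(max_attempts):
        workaround()          # incident management: restore first
        if is_healthy():
            return "recovered"
    return "escalate"         # automation failed, hand over to a person
```

Note that the function only restores the service. Finding out why the check failed in the first place remains a separate task for problem management.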
Let’s look at a typical example:
The DFS service on our Windows file server crashed. Of course, by Murphy’s Law, this always happens over the weekend, so on Monday morning no one could access the files on the network until the system administrator on duty restarted the service.
The system engineers analyzed the problem during the week: the service was crashing and was not restarting automatically because of a dependency on the Remote Registry service, which was set to DISABLED. They were not able, however, to find the real root cause of the crash.
The following Monday we had a “déjà vu”: the DFS service had crashed again. This is what happens when you mix the roles of Service Desk / Incident Management and Problem Management.
What we did:
We introduced a check in NetEye, our monitoring system, to verify the status of the DFS service, and we created an automated procedure that sets the Remote Registry service back to the auto_start state and restarts the DFS service.
This procedure is linked to the same NetEye check so that it is executed automatically when the check reports an error.
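A procedure of this kind could look roughly like the following Python sketch, which drives the standard Windows `sc` command. It is not our actual NetEye script. The service names `Dfs` and `RemoteRegistry`, the function names, and the 0 / 2 return values (borrowed from the Nagios-style plugin convention for OK and CRITICAL) are assumptions made for illustration.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_command(cmd: List[str]) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def is_running(sc_query_output: str) -> bool:
    """Return True if the STATE line of `sc query` output says RUNNING."""
    for line in sc_query_output.splitlines():
        if line.strip().startswith("STATE"):
            return "RUNNING" in line
    return False


def remediation_commands(service: str = "Dfs",
                         dependency: str = "RemoteRegistry") -> List[List[str]]:
    """Commands that re-enable the dependency and start the service.

    Service names are assumptions; adjust them to the actual system.
    """
    return [
        ["sc", "config", dependency, "start=", "auto"],  # undo DISABLED
        ["sc", "start", dependency],
        ["sc", "start", service],
    ]


def check_and_fix(run: Runner = run_command,
                  service: str = "Dfs",
                  dependency: str = "RemoteRegistry") -> int:
    """Return 0 if the service is (or ends up) running, 2 otherwise."""
    if is_running(run(["sc", "query", service])):
        return 0
    for cmd in remediation_commands(service, dependency):
        run(cmd)
    # `sc start` returns before the service is fully up, so a real
    # script would probably wait or retry before this final check.
    return 0 if is_running(run(["sc", "query", service])) else 2
```

Passing the command runner as a parameter keeps the logic testable on a machine without the service installed. In production the default `run_command` would be used.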
The procedure has also been made available through the NetEye Action Launchpad to our company’s administration department, whose staff have no IT skills but are the first to start work in the early morning. With this tool they can solve the problem on their own if the automatic procedure fails.
Now that the incident is closed in the right way, we can concentrate on the problem. If the incident happens again and the monitoring system cannot resolve it automatically, the administration department has the self-service solution as a fallback. This removes the need for IT support to be on site as early as 7 am, when the first users start working, and leaves us more time for problem analysis.