Incident Management: Safety-Critical Practices Are On The Rise
In May 2017, WannaCry ransomware locked computers around the world. The incident hit the UK’s National Health Service hard. The attack hindered urgent NHS services by blocking access to its computers. It locked out vital medical equipment such as MRI scanners and devices for testing blood and tissue samples. Some hospitals had to send ambulances to other locations.
The impact of technology incidents has long since escaped the back office. Overall, as digital systems become more critical, their risks to society increase. If you are unconscious and a doctor can’t access your medical records, they might give you treatment you’re allergic to. If an MRI scanner can get locked out, what about the IV unit supplying your medicine? Or your heart rate monitor telling the nurses’ station that you’re alive? Incident management is critical in such scenarios. Here at Forrester, we’re observing these trends:
- Complex systems continue to fail, and there’s no sign that this will change. As lower-level components become more resilient, we keep pushing the systems to do more and keep layering newer technology on top of older. In fact, we’re hearing troubling anecdotes that incidents are becoming harder, not easier, to resolve.
- As incident costs and risks increase, so does the need for fast, coordinated response. Identifying an incident condition, assembling a team, and tracking the response to the incident are critical practices for digital organizations. Responding capably to an incident requires frictionless, rapid dispatch and close coordination.
- Digital managers are learning from safety-critical practices. Web-scale properties have found that incident management practices from fire and police services are valuable in a digital context. The influence of these practices continues to spread.
Last November, prominent safety science experts Drs. Sidney Dekker, Steven Spear, and Richard Cook appeared on a high-profile panel at the 2017 DevOps Enterprise Summit.
They called on the audience to adopt a higher level of professionalization. Improving incident response by adopting practices from safety-critical domains is an important part of that professionalization.
As John Allspaw, founder of Adaptive Capacity Labs, notes in his research, incident management in digital operations is not well understood. However, there’s applicable research on the subject of “teams engaging in understanding and resolving anomalies under high-tempo, high-consequence conditions such as healthcare, aviation, space operations, and the military.” Practices that originated in US fire services (the Incident Command System, now part of the National Incident Management System) are finding their way into technology organizations, perhaps including your company’s.
Focusing on the actual performance of incident responders is changing the IT industry’s priorities. ITIL emphasized consistently filling out and routing incident tickets. Modern incident management emphasizes collaboration. Now, organizations are turning to chat-first, highly integrated collaborative platforms. New vendors are on the move, including PagerDuty, xMatters, Everbridge, BigPanda, OpsGenie, VictorOps, Nexthink, and Resolve Systems. They harness the power of chat platforms like Slack and HipChat. And they focus on speeding incident response, for example by alerting responders and tracking their acknowledgments.
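To make "alerting responders and tracking their acknowledgments" concrete, here is a minimal Python sketch. It is not any vendor's API: the `Incident` and `Dispatcher` names, the ordered on-call chain, and the rule of escalating to the next responder after a fixed timeout are all assumptions made for illustration. Times are passed in explicitly as seconds so the example stays deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; not modeled on any vendor's schema."""
    title: str
    opened_at: float
    # Responder name -> time the alert was sent.
    alerted: Dict[str, float] = field(default_factory=dict)
    # Responder name -> time the alert was acknowledged.
    acked: Dict[str, float] = field(default_factory=dict)


class Dispatcher:
    """Alerts an ordered on-call chain and escalates unacknowledged alerts."""

    def __init__(self, chain: List[str], ack_timeout: float) -> None:
        if not chain:
            raise ValueError("on-call chain must not be empty")
        self.chain = chain
        self.ack_timeout = ack_timeout  # assumed escalation window, in seconds

    def open(self, title: str, now: float) -> Incident:
        """Open an incident and alert the first responder in the chain."""
        incident = Incident(title=title, opened_at=now)
        self._alert(incident, self.chain[0], now)
        return incident

    def _alert(self, incident: Incident, responder: str, now: float) -> None:
        # A real system would page, call, or post to chat here.
        incident.alerted[responder] = now

    def acknowledge(self, incident: Incident, responder: str, now: float) -> None:
        """Record the first acknowledgment from an alerted responder."""
        if responder not in incident.alerted:
            raise KeyError(f"{responder} was never alerted")
        incident.acked.setdefault(responder, now)

    def tick(self, incident: Incident, now: float) -> Optional[str]:
        """Escalate if the latest alert has timed out without any acknowledgment.

        Returns the newly alerted responder, or None if nothing changed.
        """
        if incident.acked:
            return None
        latest = self.chain[len(incident.alerted) - 1]
        if now - incident.alerted[latest] < self.ack_timeout:
            return None
        if len(incident.alerted) == len(self.chain):
            return None  # chain exhausted; nobody left to alert
        next_responder = self.chain[len(incident.alerted)]
        self._alert(incident, next_responder, now)
        return next_responder

    def time_to_acknowledge(self, incident: Incident) -> Optional[float]:
        """Seconds from opening to the first acknowledgment, if any."""
        if not incident.acked:
            return None
        return min(incident.acked.values()) - incident.opened_at
```

A call sequence might look like this: open an incident with a chain of `["primary", "secondary", "manager"]` and a 300-second timeout, call `tick` periodically, and record `acknowledge` when someone responds. The time-to-acknowledge figure is the kind of responder-performance measure that a ticket-routing view of incident management tends to miss.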
For more information, see my recent reports, The Changing Landscape Of IT Incident And Crisis Management and Now Tech: Continuous Resolution, Q1 2018.