About three weeks ago, British Airways suffered a massive IT failure, resulting in more than 400 canceled flights affecting over 75,000 passengers, due to a mistake made by an engineer.
It may not be fair to jump to conclusions about the root cause of the disaster.
But there are lessons we can all learn from it.
Lesson #1 Failure Happens Despite Your Best Efforts
If any industry knows that failures can have huge (and sometimes tragic) consequences, it’s the aviation industry.
Airlines invest heavily in resiliency to reduce the risk of reputational, legal, and financial fallout.
Despite their best efforts, almost everything that can fail will fail eventually.
But the question is: How long did it take BA to reboot its systems and resume business? Why wasn’t the recovery faster?
Outages happen, whether caused by a fire, power outage, disk failure, or human error. The “Sorry, our system is down. We really really want to help you but there’s nothing we can do” approach won’t make things any better; it’s the disaster recovery plan that can minimize the impact.
Many organizations thought they had a robust business continuity plan in place until a disaster like Superstorm Sandy struck, taking out data centers along the East Coast.
There is a HUGE difference between THINKING you have a DR plan in place and actively testing your DR plan and SEEING it in action.
Recovery isn’t just about being able to switch over from one set of systems to another.
Recovery depends on the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) defined by your business continuity plan. Depending on those targets, you might need data replication to keep primary and secondary data centers continuously in sync.
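To make the two objectives concrete, here is a minimal Python sketch of how the result of a DR drill could be checked against them. The class name, function name, target values, and timestamps are all hypothetical and chosen only for illustration; a real plan would take its targets from the business continuity plan and its timings from monitoring and replication logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_drill(objectives, outage_start, service_restored, last_replicated):
    """Compare one drill's measured timings with the stated objectives."""
    downtime = service_restored - outage_start
    # Data written after the last successful replication is lost on failover.
    data_loss_window = outage_start - last_replicated
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical targets: restore within 4 hours, lose at most 15 minutes of data.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Hypothetical drill: service came back after 6 hours, and the last
# replication to the secondary site finished 5 minutes before the outage.
outage_start = datetime(2017, 6, 1, 9, 0)
result = evaluate_drill(
    objectives,
    outage_start=outage_start,
    service_restored=outage_start + timedelta(hours=6),
    last_replicated=outage_start - timedelta(minutes=5),
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this made-up drill the replication kept data loss within the RPO, but the six-hour recovery missed the RTO, which is exactly the kind of gap that only shows up when the plan is tested.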
Lesson #2 Investing in Staff Training and Facility Management is Cheaper than Downtime Costs
According to Uptime Institute, most data center outages are caused by human error.
Regardless of the size of your business, there are questions that should be addressed through proper training and facility management:
- What’s the condition of the facilities that house your IT systems?
- What’s the Tier of the data center you own or operate out of?
- How well is your staff trained?
Yes, we know that you are under huge pressure to deliver resilient IT on a tight budget.
According to an IDC report, the average hourly cost of a critical application failure is $500,000 to $1 million. (Source: http://info.appdynamics.com/rs/appdynamics/images/DevOps-metrics-Fortune1K.pdf )
And a recent availability gap report by Veeam Software and ESG shows that $21.8M is the average annual cost of downtime. (Source: https://go.veeam.com/2017-availability-report)
Now, would you rather invest more in staff training and facility management or have more unplanned downtime caused by ill-prepared IT staff?
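For a rough sense of scale, the hourly range from the IDC figure above can be multiplied out for an outage of a given length. This is only a back-of-the-envelope sketch: the function name is made up, the four-hour duration is a hypothetical example, and real costs will vary widely by business.

```python
def outage_cost_range(hours, low_per_hour=500_000, high_per_hour=1_000_000):
    """Return the (low, high) estimated cost in dollars of an outage.

    The default hourly rates are the $500,000 to $1 million range quoted
    above; replace them with figures for your own business.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * low_per_hour, hours * high_per_hour


# Hypothetical four-hour outage of a critical application.
low, high = outage_cost_range(4)
print(f"${low:,.0f} to ${high:,.0f}")  # $2,000,000 to $4,000,000
```

Comparing a number like that with the annual training and facilities budget is one simple way to frame the trade-off.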
Lesson learned: Hope is not an IT strategy.