With the plethora of threats IT networks face daily, it is inevitable that sooner or later you will have an issue, even if you are British Airways.
On Saturday 27th May 2017, an outage began causing mayhem for the airline: hundreds of flights across multiple airports were cancelled or delayed, and the knock-on effect hit tens of thousands of customers attempting to travel.
Not an IT failure
Over the course of the weekend and the days that followed, media outlets reported several different causes and theories as to what had caused the airline's global outage. Bearing recent events in mind, many may at first have assumed that systems had been hacked or infected by malware; however, it was quickly established, in part due to the length of the outage, that the issue lay with power, or the lack thereof.
British Airways representatives went on to state that this was not an IT failure but that an electrical supply had been interrupted, lending weight to the theory that human error was to blame, whether through a contractor or because many aspects of BA's IT are outsourced. BA's data centres use an uninterruptible power supply (UPS) and have power generators and batteries on hand as a plan B; the fact that these were not able to contain the problem is the real failure, IT or not.
While the investigation into what really caused the outage, which affected an estimated 75,000 people and is predicted to cost between £50m and £100m in compensation alone, is ongoing, Bill Francis, Head of IT at IAG, BA's parent company, issued the following statement:
“This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries… After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem.”
So much for business continuity
Disaster recovery? Failover? Business continuity? If any of these had been planned for, those plans certainly did not stand up to a real scenario. The effects of the outage were still being felt a week after it occurred, and two weeks on there has been no official statement on why the outage had such a profound effect on BA's global systems.
- Disaster Recovery (DR) refers to policy-driven procedures for restoring data, infrastructure and systems at scale after a natural or human-caused disaster. DR can also include failover to systems to which data is replicated.
- Business Continuity refers to the capability of an organization to continue operating and delivering a service or product in the event of a disaster or other incidents.
Disaster Recovery best practice
Every organization with an IT network, from BA to a small start-up, should use disaster recovery planning to ensure that in the event of a disaster (or outage) it can return to operational capacity as quickly as possible, with minimal disruption to unaffected systems and to customers.
Whilst disaster recovery has historically been an area only enterprise organizations could afford, advances in modern technology have opened up DR to everyone, and the rise of cloud platforms has accelerated this.
1. Complete a risk assessment and business impact analysis
Each organization is different, and the data or systems that are most valuable will vary. Assessing the risks to your network and the impact of any downtime is a vital step in understanding how to recover, and what should be recovered in what order.
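The prioritisation that falls out of a business impact analysis can be captured in something as simple as a ranked list. Below is a minimal sketch in Python, assuming hypothetical per-system estimates of hourly downtime cost and maximum tolerable outage; every name and figure is illustrative, not BA's:

```python
# Rank systems for recovery priority from a simple business impact
# analysis. All systems, costs and tolerances below are illustrative
# assumptions.

systems = [
    {"name": "booking",  "cost_per_hour": 50000, "max_outage_hours": 1},
    {"name": "check-in", "cost_per_hour": 30000, "max_outage_hours": 2},
    {"name": "loyalty",  "cost_per_hour": 2000,  "max_outage_hours": 24},
    {"name": "intranet", "cost_per_hour": 500,   "max_outage_hours": 48},
]

# Recover first the systems that tolerate the least downtime; break
# ties by the highest hourly cost of being down.
priority = sorted(
    systems,
    key=lambda s: (s["max_outage_hours"], -s["cost_per_hour"]),
)

for rank, s in enumerate(priority, start=1):
    print(f"{rank}. {s['name']}: tolerates {s['max_outage_hours']}h, "
          f"costs {s['cost_per_hour']}/h while down")
```

Even a simple ordering like this, agreed in advance, tells the recovery team what to bring back first instead of leaving it to judgement calls mid-incident.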
2. Have full off-site backups of data
You may have an end-to-end DR solution in place, but if systems fail, backups should be the first port of call for recovering valuable data. Keeping backups off-site helps ensure that any potential disaster doesn't also damage the backup data.
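One way to gain confidence in an off-site copy is to verify it against the original after every transfer. Here is a minimal sketch, assuming both the source backup and the off-site destination are reachable as file paths; in practice the destination would be a remote system or cloud bucket, and all names here are illustrative:

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source, offsite_dir):
    """Copy a backup off-site and confirm the copy matches the original."""
    offsite_dir = Path(offsite_dir)
    offsite_dir.mkdir(parents=True, exist_ok=True)
    target = offsite_dir / Path(source).name
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)
```

Comparing checksums rather than file sizes catches silent corruption in transit, which is exactly the kind of fault you only otherwise discover when you need the backup.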
3. Implement a full disaster recovery plan
If disaster strikes, there will be hundreds of things to do; a written plan and procedures to follow will speed up the process and reduce confusion. In many cases a written plan is also necessary for compliance and/or insurance purposes.
4. Regularly test your disaster recovery plan
Having a plan is all well and good, but if disaster strikes and your DR plan fails, the effects of the disaster could be worsened. Regularly testing the plan ensures it works, highlights where it could be strengthened, and keeps it in line with your ever-changing network.
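A DR test doesn't have to be a full annual exercise to be useful; even an automated restore drill catches silently broken backups. Here is a minimal sketch, with a hypothetical restore step (a plain file copy standing in for real restore tooling) and a known "canary" record that any good backup should contain; the function names are illustrative assumptions:

```python
from pathlib import Path

def restore_backup(backup_file, scratch_dir):
    """Stand-in for real restore tooling: here a restore is just a copy."""
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    target = scratch / Path(backup_file).name
    target.write_bytes(Path(backup_file).read_bytes())
    return target

def restore_drill(backup_file, scratch_dir, canary):
    """Restore to a scratch area and check a known record survived."""
    restored = restore_backup(backup_file, scratch_dir)
    return canary in restored.read_text()
```

Run on a schedule, a drill like this turns "we think the backups work" into evidence, and a failing drill is a far cheaper way to find a broken plan than a real outage.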