Global payment organisation Visa suffered an outage on Friday 1st June, leading to more than 5 million failed transactions. The problem began at around 2:35 pm and was not resolved until around 12:45 am on Saturday.
Visa has been forthcoming with information, releasing several statements and updates since the outage, which has not been associated with any breach, hack or ransomware attack. British Airways fell victim to a similar outage in June 2017 and came under criticism for its slow response; that outage led to thousands of flights being delayed or cancelled.
Visa, like many other organisations, operates from data centre facilities with the ability to fail over instantly to alternate hardware should something go wrong. In this case, the failover also failed. During the peak disruption periods, which totalled an hour over the course of the outage, 35% of transactions failed; throughout the rest of the outage the failure rate was between 7% and 10%. Of the transactions that failed, 2.4 million were in the UK and about 2.8 million were across the rest of Europe.
In a statement, Charlotte Hogg, European Chief Executive for Visa, said:
“We understand what hardware malfunctioned (the switch) and the nature of the malfunction. We do not yet understand precisely why the switch failed at the time it did.”
It is thought that a partial failure of a component in a switch led to the issue, preventing the backup switch from activating and so preventing operations from failing over. Visa has hired Ernst & Young to complete an independent review following the outage.
The importance of testing backup systems
Visa, much like British Airways and GitLab before it, had provisions in place to ensure that outages and data loss would not occur. These systems, however, did not function as planned. The primary function of any backup system may be to protect systems, but its purpose is to ensure that a recovery or failover can take place smoothly; in other words, to ensure business continuity.
Although systems are in place, it is vital that all organisations regularly test backups and failover systems to ensure they work. Failing to test will, more often than not, result in failures and outages, and could even lead to data loss or a breach. In January 2017 GitLab suffered this fate. Although data was being protected by five different backup technologies and setups, not one had been tested, and the implementations all failed when called upon. A technician accidentally deleted a live database holding around 300GB of data, including over 4,500 customer projects. Had any one of these five implementations been tested, the data could have been recovered and downtime kept to a minimum. The company eventually managed to recover some data from an 8-month-old backup.
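As an illustration of what testing a backup can mean in practice, the following is a minimal Python sketch of a restore test: it archives a directory, restores the archive into a scratch location and checks that every file comes back with the same checksum. The function names (checksum_tree, backup, restore_test) are hypothetical and chosen for this example; it is not a description of how Visa, British Airways or GitLab run their systems. A real test would normally compare against checksums recorded when the backup was taken, rather than against the live data.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive.

    The archive should live outside the source directory.
    """
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_test(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source.

    Returns True only if the restored files match the source exactly.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksum_tree(Path(scratch)) == checksum_tree(source)
```

The point of the sketch is that a backup is only proven once a restore has actually been carried out and checked; a backup job that merely completes without errors says nothing about whether the data can be recovered.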
System failures and the customer
A system failure, outage, breach or loss is likely to make headlines. The past two years have seen an increase in attacks, but also in consumers' awareness of issues with data and data security, due in part to the implementation of the GDPR across Europe. Making headlines for a data breach will cause organisational and reputational damage and could put off potential new customers. Dixons is another organisation to have made headlines in the UK for this reason, after losing the credit card details of nearly 6 million customers; many now speculate that it will be one of the first organisations to be hit with a large fine under the GDPR.
Best practice for testing backup systems
Backup or secondary systems make up a vital part of business continuity and disaster recovery plans. Ensuring these are aligned with the data protection policies will give guidance on the acceptable testing policies throughout an organisation.
Establishing testing schedules will ensure that tests take place across the organisation. Tests should be carried out at least once a year.
Environments grow in size and complexity over time and, with threats like ransomware, updates are almost constant. Systems should be tested fully before and after all major upgrades to ensure that updates do not cause secondary systems to fail.
A secondary system may initially be implemented on a small system or a small number of machines, but over time it may need to scale to cover multiple sites, machines or aspects of an environment. Testing that a secondary system can cope with this is vital before a complex, time-consuming and potentially costly roll-out.
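A testing schedule is easier to enforce when overdue systems are flagged automatically. The short Python sketch below assumes a simple record of when each system was last tested and reports those that have gone more than a year without a test. The system names, the overdue_systems function and the 365-day threshold are illustrative assumptions based on the once-a-year minimum suggested here, not part of any particular product.

```python
from datetime import date, timedelta

# Assumed maximum gap between tests, reflecting the once-a-year minimum.
MAX_INTERVAL = timedelta(days=365)


def overdue_systems(last_tested: dict[str, date], today: date) -> list[str]:
    """Return the names of systems last tested more than MAX_INTERVAL ago, sorted."""
    return sorted(
        name for name, tested in last_tested.items()
        if today - tested > MAX_INTERVAL
    )


# Hypothetical records of the most recent failover or restore test per system.
records = {
    "database-backup": date(2017, 1, 1),
    "backup-switch": date(2018, 3, 1),
    "offsite-archive": date(2016, 5, 5),
}
print(overdue_systems(records, today=date(2018, 6, 1)))
```

A report like this could be run weekly so that a missed test is noticed long before the system is needed in earnest.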