The popularity and usage of cloud computing have increased manifold in the last few years. From leading tech companies across the globe to startups, everyone wants to venture into this space. While cloud service providers are obliged to take responsibility for their infrastructure and ensure its security and reliability, they do not always succeed.
Here we list a few cloud outages from 2019 that put customer systems and devices at risk.
In May 2019, Salesforce faced one of its biggest service disruptions when the deployment of a database script to its Pardot Marketing Cloud ended up granting elevated permissions to regular users. The company had to block those users to prevent employees from stealing sensitive corporate data. It later had to block network access to other Salesforce services, such as Sales Cloud and Service Cloud, to avoid further damage. As a result, customers were unable to access Pardot Marketing Cloud for 20 hours, and it took 12 days to fully restore the other Salesforce services. The company later stated that the faulty database script had led it to shut down almost its entire infrastructure in order to address the broken user permissions.
In August, an Amazon AWS US-EAST-1 datacenter in Northern Virginia experienced a power failure, after which the datacenter’s backup generators began to fail. As a result, 7.5% of the EC2 instances and EBS volumes became unavailable. After power was restored, Amazon determined that some EC2 instances and EBS volumes had incurred hardware damage and that the data stored on them was no longer recoverable. Some customers suffered extensive data loss. It was one of the major instances of hardware failure in 2019, and it shows that data stored in the cloud is not automatically safe and still needs a separate backup.
Many iCloud users across the globe saw the message “Service Unavailable – DNS failure” for several hours in July. This widespread cloud outage affected services such as App Store, Apple Music, Apple TV, Apple Books, Apple ID, Apple Music Subscriptions and more. While the issue has since been resolved, during the outage users could not use functions such as Find My iPhone to locate their devices. The company stated that the outage was the result of a ‘BGP route flap’ issue that caused severe packet loss for users in North America.
Even Microsoft faced its share of cloud outages this year, affecting Azure, Microsoft 365, Dynamics, and DevOps. In May, an outage lasting more than an hour caused network connectivity errors in Microsoft Azure and severely affected cloud services including Office 365, Microsoft Teams, Xbox Live, and several others that are widely used by Microsoft’s commercial customers. Engineers identified the root cause as an incorrect name server delegation that affected DNS resolution and network connectivity, with downstream impact on dependent services. The services were recovered, and no customer DNS records were impacted during the incident.
Google Cloud Servers
The cloud servers in the us-east1 region were cut off from the rest of the world because of an issue with Cloud Networking and Load Balancing. The disruption was traced to physical damage to multiple concurrent fibre bundles that serve network paths in us-east1. Google carried out extensive mitigation work afterwards; however, users still faced increased latency.
Google Cloud Platform
Very recently, Google Cloud Platform (GCP) experienced major issues with services including Cloud Dataflow, Cloud Storage, and Compute Engine, affecting multiple products globally. The company first stated that its engineers were investigating the matter, and later that they had identified the cause and were rolling out a mitigation. The incident appeared to affect some Google Cloud APIs across us-east1, us-east4 and southamerica-east1, with some APIs impacted globally. It came almost 20 days after users faced 100 per cent packet loss to and from about 20 per cent of instances in GCP’s us-west1-b region for two and a half hours. The company stated that the cause of that earlier failure was its Chubby lock system, which resulted in the control plane losing and regaining leadership in short succession.
In July 2019, visitors to Cloudflare-backed sites received 502 errors caused by a massive spike in CPU utilization on the network. The company said that the 30-minute outage was due to a CPU spike which, in turn, was caused by a bad software deploy that was then rolled back. The company also clarified that this was not an attack and that its internal team was performing a full post-mortem to understand how the incident occurred and how to prevent it from happening again. Once the issue was fixed, the company stated that everything was back to normal and blamed its own software for the mishap.
Facebook and Instagram face outage
Earlier this year, Facebook and Instagram suffered issues caused by a server configuration change. During the outage, users faced problems with Facebook and the Facebook-owned properties Instagram and WhatsApp for around 14 hours. More recently, users again complained that Facebook had stopped working and that they were unable to carry out activities such as sharing a new post or accessing Messenger.