Human Error -> Root cause AWS S3 outage is found
Martijn Veldkamp
“Strategic Technology Leader | Customer’s Virtual CTO | Salesforce Expert | Helping Businesses Drive Digital Transformation”
March 3, 2017
An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.
While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including Elastic Compute Cloud (EC2), Elastic Block Store (EBS) volumes and AWS Lambda were also impacted while the S3 APIs were unavailable.
Read the whole story here
