Critical Amazon cloud services suffered an outage earlier this week, disrupting a large number of web sites, including Quora, Spotify, Netflix, Slack, Pinterest, Buzzfeed, Trello and IFTTT, along with isitdownrightnow.com (Is It Down Right Now), a web site that monitors whether other sites are down (See: Amazon Web Services suffers major outage).
The outage, however, did not affect Amazon's own e-commerce site.
An investigation into the issue revealed that the disruption was caused by an incorrect command executed by an employee, which set off a cascading effect that took down a number of other services.
An employee of the Amazon Simple Storage Service (S3) team was conducting a routine debugging operation, investigating S3 billing services that were running slowly.
The employee intended to execute a command that would take down a small number of servers that handled the S3 billing process, but a typographical error in one of the command's inputs caused a larger number of servers to be taken down.
The index subsystem, which holds the metadata tracking all objects in S3, went down. The placement subsystem, which depends on the index subsystem to work properly, went down as well. Together, the disruptions crippled the system, which could no longer serve API requests from clients in the Northern Virginia region, designated US-EAST-1.
"At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," Amazon said.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."
Amazon says it is making changes to its system to make sure incorrect commands do not trigger an outage of its web services in the future.
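One way a safeguard of this kind could work is sketched below in Python. It is an illustration only, not Amazon's tooling: the server names, subsystem labels, minimum-capacity figures and the `check_removal` function are all hypothetical. The idea is that a removal request is rejected outright if it would leave any subsystem with fewer servers than it needs, so a mistyped input that selects too many servers fails before anything is taken down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    subsystems: frozenset  # subsystems this (hypothetical) server supports


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem below its minimum."""


def check_removal(fleet, to_remove, min_capacity):
    """Return the servers that would remain, or raise if the request is unsafe."""
    remove_names = set(to_remove)
    unknown = remove_names - {s.name for s in fleet}
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    remaining = [s for s in fleet if s.name not in remove_names]

    # Check every subsystem, not only the one the operator meant to touch,
    # because a single server may support several subsystems.
    for subsystem, minimum in min_capacity.items():
        left = sum(1 for s in remaining if subsystem in s.subsystems)
        if left < minimum:
            raise CapacityError(
                f"{subsystem} would drop to {left} servers (minimum {minimum})"
            )
    return remaining


# Hypothetical fleet and thresholds, assumed for illustration only.
FLEET = [
    Server("s1", frozenset({"billing"})),
    Server("s2", frozenset({"billing"})),
    Server("s3", frozenset({"billing"})),
    Server("s4", frozenset({"billing", "index"})),
    Server("s5", frozenset({"index", "placement"})),
    Server("s6", frozenset({"index", "placement"})),
]
MIN_CAPACITY = {"billing": 2, "index": 2, "placement": 1}

if __name__ == "__main__":
    # Intended request: remove one billing server. This passes the check.
    print([s.name for s in check_removal(FLEET, ["s1"], MIN_CAPACITY)])

    # Mistyped request that selects too many servers. This is rejected.
    try:
        check_removal(FLEET, ["s1", "s4", "s5"], MIN_CAPACITY)
    except CapacityError as err:
        print("refused:", err)
```

In this sketch the intended request goes through, while the oversized one is refused because the index subsystem would be left with a single server. A real system would also need to limit how quickly capacity is removed, which this example does not attempt.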
"We want to apologize for the impact this event caused for our customers," it said.
"While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses.
"We will do everything we can to learn from this event and use it to improve our availability even further."