Day 3 of Amazon's Troubles
Posted by Tim Stephenson, RaddOnline® on Saturday, April 23, 2011
We're now on the third day of trouble for Amazon Web Services and they are still reporting trouble. Well over 50 hours now puts them at at least 10 times their yearly downtime limits. Oh well, as a good friend of mine told me yesterday, "whatayagunnado".
Here's the latest status entry from Amazon.
9:11 PM PDT We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances-- and move toward a full recovery.
The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesnt happen again. Once we have additional information on the progress that is being made, we will post additional updates.
Apr 23, 1:55 AM PDT We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information.
