Amazon EC2 and RDS Offline

Posted by Tim Stephenson, RaddOnline® on Thursday, April 21, 2011

Many of our sites are offline today because of an outage with Amazon's Elastic Compute Cloud service and their Relational Database Service. As well as our sites, major sites like Reddit, Foursquare, Quara, and Heroku are also offline.

Amazon says performance issues affected instances of its Elastic Compute Cloud (EC2) service and its Relational Database Service, and it’s “continuing to work towards full resolution”. These are hosted in its North Virginia data centre.

1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.
2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.
3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.
4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.
8:54 AM PDT We'd like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.