Reddit Didn’t Need To Have An Outage When AWS Went Down
Several large companies experienced downtime this past week when Amazon Web Services had an outage in one of its data centers, affecting a single Availability Zone of the US-EAST-1 Region. A hardware failure was the likely culprit, and it took down Elastic Block Store (EBS) and Relational Database Service (RDS) resources for several hours. Among the companies hit was the popular media site Reddit, whose users had trouble with everything from loading posts to processing comments. But do companies have to go down whenever an AWS data center blips? The answer is "No".
The purpose of this post is to highlight several key approaches AWS customers can take to avoid downtime like Reddit’s.
EBS Backups
One of the most notable issues with Amazon’s downtime was the loss of EBS volume data. There are two proven methods to help prevent this from occurring:
EBS Snapshots
A snapshot is a point-in-time backup of an EBS volume. Snapshots are stored in Amazon S3 in the same Region as the volume and can be copied to any other Region. They are incremental and can be completely automated. They can also be taken across several volumes at the same moment, which keeps workloads that span many volumes consistent.
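To make the snapshot workflow above concrete, here is a minimal sketch in Python. It assumes the boto3 SDK and uses its EC2 client calls `create_snapshots`, `copy_snapshot`, and the `snapshot_completed` waiter. The function name, the instance ID, and the choice of destination Region are illustrative assumptions, not something prescribed by AWS or taken from Reddit's setup.

```python
"""Sketch: multi-volume snapshot of one instance plus a cross-Region copy."""


def snapshot_instance_and_copy(ec2_source, ec2_dest, instance_id, source_region):
    """Snapshot every EBS volume attached to one instance, then copy each
    snapshot to the Region that ec2_dest is bound to.

    ec2_source and ec2_dest are boto3 EC2 clients (or objects exposing the
    same methods). Returns a dict of source snapshot ID -> copied snapshot ID.
    """
    # create_snapshots captures all attached volumes at the same point in time.
    response = ec2_source.create_snapshots(
        InstanceSpecification={"InstanceId": instance_id, "ExcludeBootVolume": False},
        Description=f"Scheduled backup of {instance_id}",
        CopyTagsFromSource="volume",
    )
    snapshot_ids = [snap["SnapshotId"] for snap in response["Snapshots"]]

    # A snapshot must reach the "completed" state before it can be copied.
    ec2_source.get_waiter("snapshot_completed").wait(SnapshotIds=snapshot_ids)

    copies = {}
    for snapshot_id in snapshot_ids:
        # copy_snapshot is issued against the client for the destination Region.
        copy = ec2_dest.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} from {source_region}",
        )
        copies[snapshot_id] = copy["SnapshotId"]
    return copies
```

With boto3 installed and credentials configured, the two clients could be created with `boto3.client("ec2", region_name="us-east-1")` and `boto3.client("ec2", region_name="us-west-2")`. Running the function on a schedule (for example every 4 hours) is left to whatever scheduler you already use.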
Pricing your snapshot strategy used to be annoying (what does $0.05 per GB-month mean, anyway?), but our friends at Skeddly made a Snapshot Calculator that we use to help price out incremental backup strategies for our customers where appropriate.
As an example: taking an EBS Snapshot every 4 hours on a 2TB volume that changes a ton (9% daily change) runs $370 per month in most regions. That’s a small price to pay for more frequent backups!
Continuous replication
Block-level replication is what the team at North Labs advises for mission-critical workloads. Tools like CloudEndure Disaster Recovery are available from the AWS Marketplace. Compressed, real-time data replication into lightweight “standby” servers allows for a sub-second Recovery Point Objective (RPO).
CloudEndure Disaster Recovery costs less than traditional DR approaches, too. It cuts down on Total Cost of Ownership (TCO) because it uses smaller appliances during standby. Only during a disaster do those appliances scale up to absorb your workload traffic.
Pricing for CloudEndure is set on a per-machine basis. Monthly, annual, and 3-year licenses are available. 3-year licenses come with a steep discount for organizations that are “all-in” on their DR strategy. An organization running 150 machines will pay $11,000 per month with monthly licenses. Annual and 3-year licenses come out to $120,600 and $286,200, respectively. This is a massive cost savings compared to running mirrored environments in a DR facility!
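The license figures above are easier to compare once they are normalized to a cost per machine per month. The short sketch below does that arithmetic. It assumes that the $120,600 and $286,200 figures are totals for the full 12-month and 36-month terms, which is an interpretation of the numbers quoted above rather than a published price list.

```python
MACHINES = 150

# Totals quoted in the post for 150 machines. Treating the annual and 3-year
# figures as the cost of the whole term is an assumption.
PLANS = {
    "monthly": {"total": 11_000, "months": 1},
    "annual": {"total": 120_600, "months": 12},
    "3-year": {"total": 286_200, "months": 36},
}


def per_machine_month(plan):
    """Cost of one machine for one month under the given plan."""
    return plan["total"] / plan["months"] / MACHINES


def savings_vs_monthly(plan):
    """Fractional discount relative to paying month to month."""
    monthly_rate = per_machine_month(PLANS["monthly"])
    return 1 - per_machine_month(plan) / monthly_rate


for name, plan in PLANS.items():
    print(
        f"{name:8s} ${per_machine_month(plan):6.2f} per machine-month, "
        f"{savings_vs_monthly(plan):5.1%} below monthly pricing"
    )
```

Under that assumption, the monthly license works out to roughly $73 per machine per month, the annual license to $67, and the 3-year license to $53, which is a discount of about 9% and 28% respectively.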
Relational Database Service
Amazon RDS has been a game-changer since it was first released, and North Labs has migrated hundreds of databases to RDS. The biggest downside? By default, an RDS instance runs in a single Availability Zone (AZ).
Using Multi-AZ RDS
Use Multi-AZ configurations for your most critical database workloads. In a Multi-AZ deployment, data from the primary database is synchronously replicated to a standby database in another AZ. If the primary becomes unavailable, RDS automatically fails over to the standby. The database endpoint stays the same throughout the process, so no reconfiguration is necessary.
In typical circumstances, you will spend twice as much per compute hour as you would with a single AZ deployment. We find that our customers are usually over-provisioning their databases, so doing a bit of “right-sizing” ahead of time helps save on cost.
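As a rough illustration of how little work the Multi-AZ switch takes, the sketch below lists RDS instances that are still single-AZ and requests the conversion for one of them. It assumes the boto3 SDK and uses its RDS client calls `describe_db_instances` (through a paginator) and `modify_db_instance`. The function names and the instance identifier in the usage note are hypothetical, and Aurora cluster members, which handle availability differently, are not treated specially here.

```python
def single_az_instances(rds):
    """Return the identifiers of RDS instances that are not Multi-AZ.

    rds is a boto3 RDS client (or an object exposing the same methods).
    """
    found = []
    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            if not db.get("MultiAZ", False):
                found.append(db["DBInstanceIdentifier"])
    return found


def enable_multi_az(rds, instance_id, apply_immediately=False):
    """Ask RDS to convert one instance to a Multi-AZ deployment.

    With apply_immediately=False the change waits for the instance's next
    maintenance window instead of starting right away.
    """
    return rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MultiAZ=True,
        ApplyImmediately=apply_immediately,
    )
```

With boto3 installed, `rds = boto3.client("rds", region_name="us-east-1")` followed by `single_az_instances(rds)` gives a quick audit list, and `enable_multi_az(rds, "orders-db")` would request the change for a hypothetical instance named `orders-db`. Right-sizing the instance first, as suggested above, keeps the doubled compute cost in check.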
Introducing: Amazon Aurora
Another amazing alternative to using Multi-AZ RDS is migrating to Amazon Aurora. Aurora is a fully managed, MySQL- and PostgreSQL-compatible database built for the cloud. The need for patching and administration is all but eliminated, and data gets replicated across three AZs by default! Backup to S3 is continuous with Aurora, and you can have up to 15 low-latency read replicas out of the box.
If you’re using Oracle Database or Microsoft SQL Server, you can convert your schema with the AWS Schema Conversion Tool and move the data with AWS Database Migration Service.
So what now?
Let’s face it: downtime is a killer. But you don’t need to invest millions of dollars to develop a DR strategy capable of withstanding single AZ outages. Using the techniques in this post will help your mission-critical workloads stay healthy!
North Labs is a veteran-owned, dedicated Amazon Web Services Consulting Partner specializing in workload migrations and 24/7 cloud managed services.
EBS Snapshots and continuous replication? A modern cloud-native app shouldn't be storing any state on EBS. The loss of an AZ should have no impact on the platform if it's designed properly. Reddit must have some legacy application components which are still storing state locally.
Their database was not multi AZ? That is a simple checkbox.
So it was more a case of Reddit being short-sighted with their DR implementation.
Awesome post Collin! Love to see where North Labs is going and their customer obsession!