Reddit Didn’t Need To Have An Outage When AWS Went Down

Collin Graves

We help companies advance their cloud data analytics mission with fractional data teams that cost less than one full-time hire. | USAF Veteran

Published Sep 9, 2019

Several large companies experienced downtime this past week when Amazon Web Services had an outage in a data center in one Availability Zone of its US-EAST-1 region. Hardware failure was the likely culprit, and it took down Elastic Block Store (EBS) volumes and Relational Database Service (RDS) instances for several hours. Among the companies hit was the popular media site Reddit, whose users had trouble with everything from loading posts to processing comments. But is downtime inevitable when an AWS data center blips? The answer is "No".

The purpose of this post is to highlight several key approaches AWS customers can take to avoid downtime like Reddit’s.

EBS Backups

One of the most notable issues with Amazon’s downtime was the loss of EBS volume data. There are two proven methods to help prevent this from occurring:

EBS Snapshots

Snapshots are point-in-time backup copies of an EBS volume. They are stored in Amazon S3 and can be copied to any other AWS region. EBS Snapshots are incremental and can be completely automated. They are capable of maintaining consistency across workloads that have many volumes working together.
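As a rough illustration of what automating this might look like, here is a minimal Python sketch that snapshots a volume and copies the result to a second region. The function name `snapshot_and_copy` and its parameters are hypothetical. The two client arguments are assumed to be boto3 EC2 clients (for example `boto3.client("ec2", region_name="us-east-1")`), one for the source region and one for the destination.

```python
from datetime import datetime, timezone


def snapshot_and_copy(ec2_source, ec2_dest, volume_id, source_region):
    """Snapshot one EBS volume, then copy the snapshot to another region.

    ec2_source and ec2_dest are assumed to be boto3 EC2 clients created
    for the source and destination regions respectively.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")

    # Start an incremental, point-in-time snapshot of the volume.
    snap = ec2_source.create_snapshot(
        VolumeId=volume_id,
        Description=f"scheduled backup of {volume_id} at {stamp}",
    )
    snapshot_id = snap["SnapshotId"]

    # Block until the snapshot has completed before trying to copy it.
    ec2_source.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The copy is requested from the destination region's client.
    copy = ec2_dest.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"cross-region copy of {snapshot_id}",
    )
    return snapshot_id, copy["SnapshotId"]
```

Run on a schedule (a cron job or a scheduled Lambda function would both work), something along these lines keeps a recent copy of your data outside the region that is having a bad day.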

Pricing your snapshot strategy used to be annoying (what does $0.05 per GB-month mean, anyway?), but our friends at Skeddly made a Snapshot Calculator that we use to help price out incremental backup strategies for our customers where appropriate.

As an example: taking an EBS Snapshot every 4 hours on a 2TB volume that changes a ton (9% daily change) runs $370 per month in most regions. That’s a small price to pay for more frequent backups!
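To make the GB-month math less mysterious, here is a back-of-the-envelope model in Python. It is a simplification and not how Skeddly's calculator necessarily works: it assumes one full copy of the volume plus every day's changed blocks kept for a retention window. The 30-day retention, the 2 TB = 2,000 GB conversion, and the function name `estimate_snapshot_cost` are all assumptions made for illustration.

```python
def estimate_snapshot_cost(volume_gb, daily_change_rate, retention_days,
                           price_per_gb_month=0.05):
    """Rough monthly snapshot storage cost in dollars.

    Simplified model: the first snapshot stores the whole volume, and each
    retained day adds only the blocks that changed that day.
    """
    full_copy_gb = volume_gb
    incremental_gb = volume_gb * daily_change_rate * retention_days
    return round((full_copy_gb + incremental_gb) * price_per_gb_month, 2)


# 2 TB volume (taken as 2,000 GB), 9% daily change, assumed 30-day retention
monthly = estimate_snapshot_cost(2000, 0.09, 30)
print(f"${monthly:,.2f} per month")  # prints "$370.00 per month"
```

Under those assumptions the model lands on the same $370 figure. Note that snapshot frequency does not appear in it: because snapshots are incremental, this sketch treats storage as driven by how much data changes and how long you keep it, not by how often you snapshot.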

Continuous replication

Block-level replication is what the team at North Labs advises for mission-critical workloads. Tools like CloudEndure Disaster Recovery are available from the AWS Marketplace. Compressed, real-time data replication into lightweight “standby” servers allows for sub-second Recovery Point Objectives (RPOs).

CloudEndure Disaster Recovery costs less than traditional DR approaches, too. It cuts down on Total Cost of Ownership (TCO) because it uses smaller appliances during standby. Only during a disaster do those appliances scale up to absorb your workload traffic.

Pricing for CloudEndure is set on a per-machine basis. Monthly, annual, and 3-year licenses are available, and 3-year licenses come with a steep discount for organizations that are “all-in” on their DR strategy. An organization running 150 machines will pay $11,000 per month with monthly licenses. Annual and 3-year licenses come out to $120,600 and $286,200, respectively. This is a massive cost saving compared with running mirrored environments in a DR facility!
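To compare the three license options on equal footing, the short Python sketch below converts each one to an effective cost per machine per month. It assumes the annual and 3-year figures are totals for the whole term (12 and 36 months), which is an interpretation of the numbers above and not something stated in a price list. The helper name `per_machine_month` is made up for this example.

```python
MACHINES = 150

# plan name -> (total price in dollars, months covered by that price)
PLANS = {
    "monthly": (11_000, 1),
    "annual": (120_600, 12),   # assumed to be the total for one year
    "3-year": (286_200, 36),   # assumed to be the total for three years
}


def per_machine_month(total, months, machines=MACHINES):
    """Effective price per machine per month for one license plan."""
    return total / months / machines


for name, (total, months) in PLANS.items():
    print(f"{name:>8}: ${per_machine_month(total, months):.2f} per machine per month")

# Savings of the 3-year license versus paying monthly for 36 months
three_year_savings = 11_000 * 36 - 286_200
print(f"3-year savings vs. monthly: ${three_year_savings:,}")
```

Read that way, the price drops from roughly $73 per machine per month on monthly licenses to $67 on annual and $53 on 3-year licenses.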

Relational Database Service

Amazon RDS has been a game-changer since it was first released, and North Labs has migrated hundreds of databases to RDS. The biggest downside? It only runs in a single availability zone (AZ) by default.

Using Multi-AZ RDS

Use Multi-AZ configurations for your most critical database workloads. In Multi-AZ deployments, data from a primary database gets replicated to a standby database in another AZ. During a disaster situation, a failover is automatically performed to the standby database. Database endpoints stay fixed throughout the process so no reconfiguration is necessary.
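For an existing single-AZ instance, the switch can also be made programmatically. The Python sketch below is one way to do it, assuming `rds` is a boto3 RDS client (for example `boto3.client("rds")`). The function names and the instance identifier in the usage note are hypothetical.

```python
def enable_multi_az(rds, db_instance_id, apply_immediately=False):
    """Ask RDS to convert an existing instance to a Multi-AZ deployment.

    With apply_immediately=False the change waits for the instance's next
    maintenance window instead of starting right away.
    """
    response = rds.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        MultiAZ=True,
        ApplyImmediately=apply_immediately,
    )
    return response["DBInstance"]["DBInstanceIdentifier"]


def is_multi_az(rds, db_instance_id):
    """Report whether the instance currently runs as a Multi-AZ deployment."""
    page = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    return page["DBInstances"][0]["MultiAZ"]
```

A call such as `enable_multi_az(rds, "orders-db")` requests the change, and `is_multi_az` can be polled afterwards, since the conversion is not instantaneous.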

In typical circumstances, you will spend twice as much per compute hour as you would with a single AZ deployment. We find that our customers are usually over-provisioning their databases, so doing a bit of “right-sizing” ahead of time helps save on cost.
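A quick bit of arithmetic shows why right-sizing matters here. In the Python sketch below the hourly prices are hypothetical placeholders, not real RDS rates, and the 2x multiplier is the "twice as much per compute hour" rule of thumb from above.

```python
HOURS_PER_MONTH = 730  # common approximation for one month of runtime


def monthly_cost(hourly_price, multi_az=False):
    """Monthly compute cost, doubling the hourly price for Multi-AZ."""
    multiplier = 2 if multi_az else 1
    return round(hourly_price * multiplier * HOURS_PER_MONTH, 2)


# Hypothetical prices: an over-provisioned instance at $0.40/hour versus
# an instance half that size at $0.20/hour.
oversized_single_az = monthly_cost(0.40)
right_sized_multi_az = monthly_cost(0.20, multi_az=True)

print(f"Over-provisioned, single AZ: ${oversized_single_az:,.2f}")
print(f"Right-sized, Multi-AZ:       ${right_sized_multi_az:,.2f}")
```

With these made-up numbers, dropping to an instance half the size pays for the standby entirely. Your own savings depend on how over-provisioned the database really is.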

Introducing: Amazon Aurora

An amazing alternative to using Multi-AZ RDS is migrating to Amazon Aurora. Aurora is a fully-managed, MySQL- and PostgreSQL-compatible database built for the cloud. The need for patching and administration is all but eliminated, and data gets replicated across three AZs by default! Backup to S3 is continuous with Aurora, and you can have up to 15 low-latency read replicas out of the box.

If you’re using Oracle Database or Microsoft SQL Server, you can convert your schema with the AWS Schema Conversion Tool and move the data with AWS Database Migration Service.

So what now?

Let’s face it: downtime is a killer. But you don’t need to invest millions of dollars to develop a DR strategy capable of withstanding single AZ outages. Using the techniques in this post will help your mission-critical workloads stay healthy!

North Labs is a veteran-owned, dedicated Amazon Web Services Consulting Partner specializing in workload migrations and 24/7 cloud managed services.

Jason Baker

Leading the delivery of modern cloud services which are essential to the health of our democracy.

EBS Snapshots and continuous replication? A modern cloud-native app shouldn't be storing any state on EBS. The loss of an AZ should have no impact on the platform if it's designed properly. Reddit must have some legacy application components which are still storing state locally.

Brian Hostetter

Their database was not multi AZ? That is a simple checkbox.

Gustavo Chavez

Sr. Manager DevSecOps Platform Engineering

So it was more of Reddit being short sighted with their DR implementation.

Vincent Paterno

Driving Cloud & AI Adoption & Consumption - Microsoft

Awesome post Collin! Love to see where North Labs is going and their customer obsession!
