How A Simple Serverless Error Almost Cost Us Hundreds Of Dollars

🙌 Hello there! Welcome to The Serverless Spotlight!

In this week's edition, I talk about a mistake we made in a real-world project I worked on and the valuable lessons it taught us about cloud computing and costs.


I once worked as an AWS developer, designing the database and APIs for a client’s application.

As part of a team of four, we designed a microservice responsible for sending out emails when an application form was submitted.

Long story short, we had a Lambda function publish to an SNS topic, which in turn triggered a subscription that sent emails to the applicant as well as to us (in the development environment).

Email service architecture

When I got around to testing this function, I noticed a very strange behavior.

The email client on my machine kept pinging with new emails.

It started with one email, then 10, then 100, and within a few minutes I had received around 200 emails.

All with the same message about the test template we wrote.

The worst part was that the emails didn’t stop; they were actually increasing in frequency.

At the start, we were getting a few emails a minute, but after several minutes, we were getting several emails per second.

At that rate, had we left the issue unchecked, we would have sent out 100K emails within a couple of hours.

And if anyone else had happened to test the function at the same time, that number would have climbed even higher.

I didn’t want to imagine the costs associated with this.

But if I had to, the estimate would have been somewhere north of $100 after a few hours.

And thousands of dollars if the function had been left running for days.

Imagine deploying Friday afternoon and fixing it on Monday morning!

But before the email count could climb past about 20K, I dove into the Lambda function’s code.

I immediately noticed there was no error handling: there should have been a try/catch block wrapping all of the code.

The catch block would have caught the error and prevented this issue.

But to understand how and why, we must first understand how Lambda and SNS work.

After identifying and fixing the issue, here were my main takeaways and lessons.

How Did This Happen?

When the Lambda function is invoked to send an email, it publishes a message to the SNS topic, which triggers the email subscription.

If the Lambda function encounters an error after sending the message to SNS but before completing execution, the function will still be retried.

By default, Lambda automatically retries a failed asynchronous invocation up to two more times, for a total of three attempts.

In the case of SNS, here’s what the official docs state about SNS retries:

“If Lambda is not available, SNS will retry 2 times at 1 seconds apart, then 10 times exponentially backing off from 1 seconds to 20 minutes and finally 38 times every 20 minutes for a total 50 attempts over more than 13 hours before the message is discarded from SNS”.

This can result in many delivery attempts and explains the seemingly endless loop of emails.
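To make this failure mode concrete, here is a minimal, hypothetical reconstruction in TypeScript with the AWS SDK v3. The event shape, the TOPIC_ARN environment variable, and the writeAuditRecord step are illustrative stand-ins rather than the actual project code; the point is simply that an uncaught error after a successful publish fails the whole invocation and triggers retries.

```typescript
// Hypothetical reconstruction of the problematic handler (all names illustrative).
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});

// Stand-in for whatever follow-up step actually threw in the real code.
async function writeAuditRecord(_event: unknown): Promise<void> {
  throw new Error("simulated failure after the SNS publish");
}

export const handler = async (event: { applicantEmail: string }) => {
  // 1. The email notification is published to the SNS topic first...
  await sns.send(
    new PublishCommand({
      TopicArn: process.env.TOPIC_ARN, // hypothetical environment variable
      Message: `New application received from ${event.applicantEmail}`,
    })
  );

  // 2. ...then a later step throws an uncaught error. The whole invocation is
  // marked as failed, Lambda retries it, and every retry publishes the email
  // to SNS again.
  await writeAuditRecord(event);

  return { status: "sent" };
};
```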

Why Did This Happen?

This happened for two reasons:

  • The Lambda function ran into an uncaught error, which caused Lambda to automatically retry the invocation a few more times.
  • Simultaneously, SNS kept retrying delivery to the subscription that sends the email. Because the Lambda function appeared “unavailable”, SNS followed the retry policy described above, and the combined retries resulted in a massive number of emails being sent out over time.

SNS buffers pending messages in a queue: instead of delivering every message at the same time, it enqueues them and works through them one at a time.

This also explains why emails kept coming in even when no Lambda function had run in a while.

How Could We Have Avoided This?

Error Handling

This is how I prevented the problem from happening again.

By adding error handling (try/catch blocks) in all the necessary places, the function now exits gracefully on any error.

If the Lambda function had gracefully handled errors from the start, it would not have been retried. It would also have remained available to SNS, so SNS would not have kept retrying its messages.
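As a sketch of what the fix looks like (again with hypothetical names, assuming a Node.js/TypeScript handler and the AWS SDK v3), it comes down to wrapping the handler body in a try/catch so a failure after the publish no longer fails the whole invocation:

```typescript
// A minimal sketch of the fixed handler: errors are caught and logged,
// so the invocation succeeds and neither Lambda nor SNS keeps retrying.
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});

// Stand-in for the follow-up work that used to throw.
async function writeAuditRecord(_event: unknown): Promise<void> {
  // real implementation here
}

export const handler = async (event: { applicantEmail: string }) => {
  try {
    await sns.send(
      new PublishCommand({
        TopicArn: process.env.TOPIC_ARN, // hypothetical environment variable
        Message: `New application received from ${event.applicantEmail}`,
      })
    );

    await writeAuditRecord(event);
    return { status: "sent" };
  } catch (err) {
    // Log and exit gracefully instead of letting the invocation fail,
    // which is what caused the retry storm of duplicate emails.
    console.error("email flow failed", err);
    return { status: "error" };
  }
};
```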

To stop the issue after the fact, all I did was delete the SNS topic; the emails stopped a few hours later (many were still in the queue).

Dead Letter Queue

When a Lambda function keeps retrying failed invocations, it is often useful to configure a dead-letter queue (DLQ).

This is also a possible preventative measure.

Combined with a limit on retry attempts, the dead-letter queue keeps the function from retrying more than a few times.

After the defined maximum retry attempts, the failed event is sent to the dead-letter queue and is not retried again.
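Here is a hedged sketch of what that configuration could look like using the AWS SDK v3 for Lambda; the function name and queue ARN are placeholders, not values from the real project. It attaches an SQS dead-letter queue for events that still fail after all retries, and caps the automatic retries for asynchronous invocations:

```typescript
// Sketch: attach a dead-letter queue and cap async retries for a Lambda function.
import {
  LambdaClient,
  UpdateFunctionConfigurationCommand,
  PutFunctionEventInvokeConfigCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

async function configureDeadLetterQueue(): Promise<void> {
  // 1. Events that still fail after all retries are sent to this SQS queue
  //    instead of being dropped (or endlessly resent).
  await lambda.send(
    new UpdateFunctionConfigurationCommand({
      FunctionName: "send-application-email", // hypothetical function name
      DeadLetterConfig: {
        TargetArn: "arn:aws:sqs:us-east-1:123456789012:email-dlq", // hypothetical ARN
      },
    })
  );

  // 2. Cap the automatic retries for asynchronous invocations (0 to 2 allowed).
  await lambda.send(
    new PutFunctionEventInvokeConfigCommand({
      FunctionName: "send-application-email",
      MaximumRetryAttempts: 0, // fail fast; the failed event goes to the DLQ
    })
  );
}

configureDeadLetterQueue().catch(console.error);
```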

Conclusion

Running into issues is the best way to learn, especially when costs are at stake. The biggest problems often teach us the greatest lessons.

In our case, mechanisms such as error handling and dead-letter queues proved critical when dealing with Lambda.

Robust error-handling code is a must and could save a lot of time and money.


👋 My name is Uriel Bitton and I hope you learned something in this edition of The Serverless Spotlight.

🔗 You can share the article with your network to help others learn as well.

📬 If you want to learn how to save money in the cloud you can subscribe to my brand new newsletter The Cloud Economist.

🙌 I hope to see you in next week's edition!


