Serverless Data Engineering: How to Generate Parquet Files with AWS Lambda and Upload to S3
In the world of big data, the ability to process and analyze large volumes of data quickly and efficiently is critical for businesses to gain insights and make informed decisions. However, building data engineering pipelines that can handle such large volumes of data can be a challenge, especially when it comes to managing the underlying infrastructure.
Fortunately, AWS Lambda and S3 provide an easy and cost-effective way to build serverless data engineering pipelines. In this tutorial, we'll walk you through how to use AWS Lambda and S3 to generate and store Parquet files for data analytics, without needing to manage any servers.
What is Parquet?
Parquet is a columnar storage format that is designed to be highly efficient for analytics workloads. It's optimized for processing large datasets and allows for fast and efficient queries. Parquet is supported by many popular big data tools and frameworks, including Apache Spark, Apache Hive, and Amazon Athena.
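To make the format concrete, here is a minimal, self-contained example of writing a small dataset to Parquet and reading back only selected columns. It assumes pandas and pyarrow are installed and is illustrative only, not part of the pipeline code.

```python
# Minimal illustration of writing and reading a Parquet file locally.
# Assumes pandas and pyarrow are installed (pip install pandas pyarrow).
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "event": ["click", "view", "click"],
        "amount": [9.99, 0.0, 4.50],
    }
)

# Write the DataFrame to Parquet using the pyarrow engine
df.to_parquet("events.parquet", engine="pyarrow", index=False)

# Read back only the columns you need -- columnar storage makes this cheap
subset = pd.read_parquet("events.parquet", columns=["user_id", "amount"])
print(subset)
```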
Why use AWS Lambda and S3?
AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. It's highly scalable and can handle large volumes of data processing tasks efficiently. S3 is an object storage service that provides highly scalable and durable storage for your data. Together, they provide a powerful platform for building serverless data engineering pipelines.
Video Guide:
Architecture
The architecture of our data processing system is designed for speed, efficiency, and flexibility. Using AWS Lambda with an SQS queue, we can process large amounts of data quickly without managing any servers. Our Lambda function receives data in batches from the queue, and both the batch size and the memory allocation can be scaled as needed. You can also run AWS Lambda Power Tuning to optimize the function's performance. Processed data is stored in Parquet format in S3, partitioned by year, month, and day, creating a RAW zone that is easy to manage and query. The architecture is also flexible: data can be sent to the SQS queue from various sources such as EventBridge or SNS, making it a versatile module for processing many types of data.
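To illustrate the partitioning scheme, here is a small sketch of how a year/month/day partitioned S3 key could be constructed. The prefix and file naming are placeholders, not necessarily the exact layout used in the repository.

```python
# Illustrative sketch of the year/month/day partitioning used for the RAW zone.
# The "raw" prefix and UUID-based file name are placeholders.
from datetime import datetime, timezone
import uuid

def build_raw_zone_key(prefix: str = "raw") -> str:
    """Build an S3 object key partitioned by year, month, and day."""
    now = datetime.now(timezone.utc)
    return (
        f"{prefix}/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{uuid.uuid4()}.parquet"
    )

print(build_raw_zone_key())
# e.g. raw/year=2024/month=05/day=17/3f2c1a7e-....parquet
```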
Step-by-Step Guide
Step 1: Clone the Repository from GitHub
Step 2: Edit the Env File and Deploy the Stack
Step 3: Open a Terminal and Run the Python File to Publish Messages to SQS
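If you want a sense of what such a publisher script might look like, here is a minimal sketch using boto3. The queue URL, region, and message payload are placeholders; use the script and values from the repository for the actual run.

```python
# Sketch of a publisher script that sends test messages to the SQS queue.
# The queue URL, region, and payload shape are placeholders.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-data-queue"

for i in range(10):
    message = {"order_id": i, "status": "created", "details": {"channel": "web"}}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))

print("Published 10 messages")
```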
Check S3
Code Explanation
Serverless YML
Lambda Function Explanation
First, we define all the required imports.
The DataTransform class is a helper class that contains methods for data transformation.
The class contains two methods, flatten_dict and dict_clean, both decorated with an error_handler decorator. The error_handler function handles errors that may occur while the methods run: it prints a message indicating whether the method succeeded, and if exit_flag is set to True, it exits the program when an error occurs.
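As a rough sketch, a decorator along these lines might look like the following; the implementation in the repository may differ in its exact messages and behavior.

```python
# A minimal sketch of an error_handler decorator as described above.
import functools
import sys

def error_handler(exit_flag: bool = False):
    """Wrap a method, report success or failure, and optionally exit on error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                print(f"{func.__name__} succeeded")
                return result
            except Exception as exc:
                print(f"{func.__name__} failed: {exc}")
                if exit_flag:
                    sys.exit(1)
        return wrapper
    return decorator
```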
The flatten_dict method takes a nested dictionary as input and returns a flattened dictionary, where nested keys are concatenated with a separator. The parent_key and sep arguments are optional and are used to specify the parent key and separator for nested keys.
The dict_clean method takes a dictionary as input and returns a cleaned dictionary, where any values that are None, "None", "null", or empty strings are replaced with "n/a". The method converts all values to strings.
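The two methods could be sketched roughly as follows; the repository's version also wraps them with the error_handler decorator shown above.

```python
# A sketch of the two transformation methods described above.
class DataTransform:
    def flatten_dict(self, d, parent_key="", sep="_"):
        """Flatten a nested dict, joining nested keys with `sep`."""
        items = {}
        for key, value in d.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(self.flatten_dict(value, new_key, sep=sep))
            else:
                items[new_key] = value
        return items

    def dict_clean(self, d):
        """Replace None/empty values with 'n/a' and stringify everything else."""
        return {
            k: "n/a" if v in (None, "None", "null", "") else str(v)
            for k, v in d.items()
        }

# Example:
# DataTransform().flatten_dict({"a": {"b": 1}})  ->  {"a_b": 1}
```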
Overall, the DataTransform class provides useful methods for cleaning and transforming data, which can be useful in a variety of data processing tasks.
The function starts by creating an instance of the DataTransform class, which provides the methods for flattening and cleaning nested dictionaries.
The function then processes the messages received in the event argument using a for loop that iterates over the Records key in the event dictionary. The messages are first converted to Python dictionaries using json.loads, and then flattened and cleaned using the DataTransform methods. The processed messages are then added to a list called processed_messages.
The function then converts the processed_messages list to a Pandas DataFrame, which is then converted to an Arrow table using the pa.Table.from_pandas method. The Arrow table is then written to a Parquet file in memory using the pq.write_table method.
The Parquet file is then uploaded to an S3 bucket using the s3.put_object method, which takes the bucket name, the object key, and the Parquet file contents as bytes.
Finally, the function returns a dictionary with a statusCode of 200 and a body of "Parquet file uploaded to S3".
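Putting the pieces together, here is a condensed sketch of such a handler. The bucket environment variable, key layout, and file naming are placeholders rather than the exact values used in the repository.

```python
# A condensed sketch of the Lambda handler described above. Bucket name,
# environment variable, and key layout are placeholders.
import io
import json
import os
from datetime import datetime, timezone

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = os.environ.get("BUCKET_NAME", "my-raw-zone-bucket")

def handler(event, context):
    transform = DataTransform()  # helper class sketched earlier
    processed_messages = []

    # Each SQS record carries a JSON body; flatten and clean it
    for record in event["Records"]:
        message = json.loads(record["body"])
        flat = transform.flatten_dict(message)
        processed_messages.append(transform.dict_clean(flat))

    # Convert to a Pandas DataFrame, then to an Arrow table,
    # and write Parquet into an in-memory buffer
    df = pd.DataFrame(processed_messages)
    table = pa.Table.from_pandas(df)
    buffer = io.BytesIO()
    pq.write_table(table, buffer)

    # Upload the buffer to S3 under a year/month/day partitioned key
    now = datetime.now(timezone.utc)
    key = (
        f"raw/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{context.aws_request_id}.parquet"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())

    return {"statusCode": 200, "body": "Parquet file uploaded to S3"}
```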
Conclusion
Serverless Data Engineering is a powerful tool for businesses that need to quickly and easily process and store data in the cloud. AWS Lambda and S3 are two of the most popular and widely used services in this space. In this blog post, we discussed how to use these services to generate Parquet files and upload them to S3.
We began by discussing the benefits of using Parquet files and why they are an ideal data format for big data processing. Next, we provided an overview of AWS Lambda and how it can be used to process data in a serverless environment. We then walked through the process of using AWS Lambda to generate Parquet files and upload them to S3.
We provided a Python code example that demonstrated how to use the DataTransform class to transform and clean data before converting it to a Pandas DataFrame and then to an Arrow table. Finally, we used PyArrow to write the Arrow table to a Parquet file and upload it to S3.
By following the steps outlined in this blog post, you can quickly and easily generate Parquet files from your data and store them in the cloud. This can help you to streamline your data processing workflows, reduce costs, and improve the scalability and reliability of your data infrastructure.