Serverless Data Engineering: How to Generate Parquet Files with AWS Lambda and Upload to S3

In the world of big data, the ability to process and analyze large volumes of data quickly and efficiently is critical for businesses to gain insights and make informed decisions. However, building data engineering pipelines that can handle such large volumes of data can be a challenge, especially when it comes to managing the underlying infrastructure.

Fortunately, AWS Lambda and S3 provide an easy and cost-effective way to build serverless data engineering pipelines. In this tutorial, we'll walk you through how to use AWS Lambda and S3 to generate and store Parquet files for data analytics, without needing to manage any servers.

What is Parquet?

Parquet is a columnar storage format that is designed to be highly efficient for analytics workloads. It's optimized for processing large datasets and allows for fast and efficient queries. Parquet is supported by many popular big data tools and frameworks, including Apache Spark, Apache Hive, and Amazon Athena.
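To make this concrete, here is a minimal example (separate from the project code) of writing and reading a Parquet file with pandas and PyArrow, the same libraries the Lambda function uses later in this post:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A small sample dataset
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "click"],
    "amount": [9.99, 0.0, 4.50],
})

# Write the DataFrame to a Parquet file via an Arrow table
table = pa.Table.from_pandas(df)
pq.write_table(table, "events.parquet")

# Read it back; only the requested columns are loaded, which is
# what makes the columnar format efficient for analytics
events = pq.read_table("events.parquet", columns=["event", "amount"]).to_pandas()
print(events)
```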


Why use AWS Lambda and S3?

AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. It's highly scalable and can handle large volumes of data processing tasks efficiently. S3 is an object storage service that provides highly scalable and durable storage for your data. Together, they provide a powerful platform for building serverless data engineering pipelines.


Architecture


The architecture of our data processing system is designed for speed, efficiency, and flexibility. Using AWS Lambda and an SQS queue, we can process large amounts of data quickly without any server management. Our Lambda functions receive data in batches from the queue, and the batch size and memory can be scaled as needed. Additionally, you can run AWS Lambda Power Tuning to find the most cost-effective memory configuration for the functions. Processed data is stored in Parquet format in S3, partitioned by year, month, and day, creating a RAW zone that is easy to manage and query. The architecture is also flexible: data can be sent to the SQS queue from various sources such as EventBridge or SNS, making it a versatile building block for processing many types of data.
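As an illustration of that partitioning scheme, a RAW-zone object key could be built like this (the prefix and file naming are hypothetical, not taken from the repository):

```python
from datetime import datetime, timezone
import uuid

def build_raw_key(prefix: str = "raw") -> str:
    """Build an S3 key partitioned by year, month, and day."""
    now = datetime.now(timezone.utc)
    return (
        f"{prefix}/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{uuid.uuid4()}.parquet"
    )

print(build_raw_key())
# e.g. raw/year=2024/month=05/day=17/<uuid>.parquet
```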

Step-by-Step Guide

Step 1: Clone the Repository from GitHub



Step 2: Edit the Env File and Deploy the Stack


Step 3: Open a Terminal and Run the Python File to Publish Messages to SQS
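The publisher script itself is not reproduced here, but a minimal version along these lines (the queue URL and payload fields are placeholders, not taken from the repository) will push test messages to SQS with boto3:

```python
import json
import boto3

# Placeholder queue URL -- replace with the queue created by the stack
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-ingest-queue"

sqs = boto3.client("sqs")

def publish_messages(count: int = 10) -> None:
    """Send a few sample nested JSON messages to the queue."""
    for i in range(count):
        payload = {
            "order_id": i,
            "customer": {"name": f"customer-{i}", "country": "US"},
            "amount": 19.99,
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))

if __name__ == "__main__":
    publish_messages()
```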


Check S3
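You can verify the output in the S3 console, or with a quick boto3 listing like this (the bucket name is a placeholder for the one created by the stack):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name -- use the bucket created by the stack
response = s3.list_objects_v2(Bucket="my-raw-zone-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```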


Code Explanation

Serverless YML


Lambda Function Explanation

First, we define all the required imports.
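The import block is roughly the following (the exact list may differ slightly in the repository; the BUCKET_NAME environment variable is an assumption):

```python
import json
import os
from io import BytesIO
from datetime import datetime, timezone

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# S3 client and target bucket, assumed to be supplied via the stack's env file
s3 = boto3.client("s3")
BUCKET_NAME = os.environ.get("BUCKET_NAME")
```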






The DataTransform class is a helper class that contains methods for data transformation.

The class contains two methods, flatten_dict and dict_clean, both decorated with a decorator function called error_handler. The error_handler function handles errors that may occur while the methods run: it prints a message indicating whether the method succeeded, and if exit_flag is set to True, it exits the program when an error occurs.

The flatten_dict method takes a nested dictionary as input and returns a flattened dictionary, where nested keys are concatenated with a separator. The parent_key and sep arguments are optional and are used to specify the parent key and separator for nested keys.

The dict_clean method takes a dictionary as input and returns a cleaned dictionary, where any values that are None, "None", "null", or empty strings are replaced with "n/a". The method converts all values to strings.

Overall, the DataTransform class provides reusable methods for cleaning and transforming data that come in handy across a variety of data processing tasks.
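Putting that description together, here is a self-contained sketch of the class (the names follow the explanation above, but the exact implementation in the repository may differ):

```python
import sys
from functools import wraps


def error_handler(exit_flag: bool = False):
    """Report success/failure of the wrapped method and optionally exit on error."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                print(f"{func.__name__} succeeded")
                return result
            except Exception as exc:
                print(f"{func.__name__} failed: {exc}")
                if exit_flag:
                    sys.exit(1)
        return wrapper
    return decorator


class DataTransform:
    """Helper class with methods for data transformation."""

    @error_handler(exit_flag=True)
    def flatten_dict(self, d, parent_key="", sep="_"):
        """Flatten a nested dictionary, joining nested keys with `sep`."""
        items = {}
        for key, value in d.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(self.flatten_dict(value, parent_key=new_key, sep=sep))
            else:
                items[new_key] = value
        return items

    @error_handler(exit_flag=True)
    def dict_clean(self, d):
        """Replace None/'None'/'null'/empty values with 'n/a' and stringify everything."""
        return {
            key: "n/a" if value in (None, "None", "null", "") else str(value)
            for key, value in d.items()
        }
```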


The function starts by creating an instance of the DataTransform class, which provides the methods for flattening and cleaning nested dictionaries described above.

The function then processes the messages received in the event argument using a for loop that iterates over the Records key in the event dictionary. The messages are first converted to Python dictionaries using json.loads, and then flattened and cleaned using the DataTransform methods. The processed messages are then added to a list called processed_messages.

The function then converts the processed_messages list to a Pandas DataFrame, which is then converted to an Arrow table using the pa.Table.from_pandas method. The Arrow table is then written to a Parquet file in memory using the pq.write_table method.

The Parquet file is then uploaded to an S3 bucket using the s3.put_object method, which takes the S3 bucket name, the object key, and the Parquet file contents as a binary stream.

Finally, the function returns a dictionary with a statusCode of 200 and a body of "Parquet file uploaded to S3".
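Based on that walkthrough, the handler can be sketched as follows. It reuses the DataTransform class sketched in the previous section; the bucket name, environment variable, and key layout are illustrative rather than copied from the repository:

```python
import json
import os
from datetime import datetime, timezone
from io import BytesIO

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET_NAME = os.environ.get("BUCKET_NAME", "my-raw-zone-bucket")  # placeholder default


def handler(event, context):
    transform = DataTransform()  # class sketched in the previous section

    # Flatten and clean every SQS message in the batch
    processed_messages = []
    for record in event["Records"]:
        message = json.loads(record["body"])
        flattened = transform.flatten_dict(message)
        cleaned = transform.dict_clean(flattened)
        processed_messages.append(cleaned)

    # Convert to a DataFrame, then to an Arrow table, then to Parquet in memory
    df = pd.DataFrame(processed_messages)
    table = pa.Table.from_pandas(df)
    buffer = BytesIO()
    pq.write_table(table, buffer)

    # Upload the Parquet file to the partitioned RAW zone in S3
    now = datetime.now(timezone.utc)
    key = (
        f"raw/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{context.aws_request_id}.parquet"
    )
    s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=buffer.getvalue())

    return {"statusCode": 200, "body": "Parquet file uploaded to S3"}
```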


Conclusion

Serverless Data Engineering is a powerful tool for businesses that need to quickly and easily process and store data in the cloud. AWS Lambda and S3 are two of the most popular and widely used services in this space. In this blog post, we discussed how to use these services to generate Parquet files and upload them to S3.

We began by discussing the benefits of using Parquet files and why they are an ideal data format for big data processing. Next, we provided an overview of AWS Lambda and how it can be used to process data in a serverless environment. We then walked through the process of using AWS Lambda to generate Parquet files and upload them to S3.

We provided a Python code example that demonstrated how to use the DataTransform class to transform and clean data before converting it to a Pandas DataFrame and then to an Arrow table. Finally, we used the Arrow library to write the table to a Parquet file and uploaded it to S3.

By following the steps outlined in this blog post, you can quickly and easily generate Parquet files from your data and store them in the cloud. This can help you to streamline your data processing workflows, reduce costs, and improve the scalability and reliability of your data infrastructure.
