Serverless Data Engineering: How to Generate Parquet Files with AWS Lambda and Upload to S3
In the world of big data, the ability to process and analyze large volumes of data quickly and efficiently is critical for businesses to gain insights and make informed decisions. However, building data engineering pipelines that can handle such large volumes of data can be a challenge, especially when it comes to managing the underlying infrastructure.
Fortunately, AWS Lambda and S3 provide an easy and cost-effective way to build serverless data engineering pipelines. In this tutorial, we'll walk you through how to use AWS Lambda and S3 to generate and store Parquet files for data analytics, without needing to manage any servers.
What is Parquet?
Parquet is a columnar storage format that is designed to be highly efficient for analytics workloads. It's optimized for processing large datasets and allows for fast and efficient queries. Parquet is supported by many popular big data tools and frameworks, including Apache Spark, Apache Hive, and Amazon Athena.
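To make the format concrete, here is a minimal, self-contained example of writing a small dataset to Parquet and reading back only selected columns. It assumes pandas and pyarrow are installed and is illustrative only, not part of the pipeline code.

```python
# Minimal illustration of writing and reading a Parquet file locally.
# Assumes pandas and pyarrow are installed (pip install pandas pyarrow).
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "event": ["click", "view", "click"],
        "amount": [9.99, 0.0, 4.50],
    }
)

# Write the DataFrame to Parquet using the pyarrow engine
df.to_parquet("events.parquet", engine="pyarrow", index=False)

# Read back only the columns you need -- columnar storage makes this cheap
subset = pd.read_parquet("events.parquet", columns=["user_id", "amount"])
print(subset)
```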
Why use AWS Lambda and S3?
AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. It's highly scalable and can handle large volumes of data processing tasks efficiently. S3 is an object storage service that provides highly scalable and durable storage for your data. Together, they provide a powerful platform for building serverless data engineering pipelines.
Video Guide:
Architecture
The architecture of our data processing system is designed for speed, efficiency, and flexibility. Using AWS Lambda with an SQS queue, we can process large amounts of data quickly without managing any servers. Our Lambda function receives data in batches from the queue, and both the batch size and the memory allocation can be scaled as needed. You can also run AWS Lambda Power Tuning to optimize the function's performance. Processed data is stored in Parquet format in S3, partitioned by year, month, and day, creating a RAW zone that is easy to manage and query. The architecture is also flexible: data can be sent to the SQS queue from various sources such as EventBridge or SNS, making it a versatile module for processing many types of data.
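To illustrate the partitioning scheme, here is a small sketch of how a year/month/day partitioned S3 key could be constructed. The prefix and file naming are placeholders, not necessarily the exact layout used in the repository.

```python
# Illustrative sketch of the year/month/day partitioning used for the RAW zone.
# The "raw" prefix and UUID-based file name are placeholders.
from datetime import datetime, timezone
import uuid

def build_raw_zone_key(prefix: str = "raw") -> str:
    """Build an S3 object key partitioned by year, month, and day."""
    now = datetime.now(timezone.utc)
    return (
        f"{prefix}/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{uuid.uuid4()}.parquet"
    )

print(build_raw_zone_key())
# e.g. raw/year=2024/month=05/day=17/3f2c1a7e-....parquet
```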
Step-by-Step Guide
Step 1: Clone the Repository from GitHub
Step 2: Edit the Env File and Deploy the Stack
Step 3: Open a Terminal and Run the Python File to Publish Messages to SQS
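If you want a sense of what such a publisher script might look like, here is a minimal sketch using boto3. The queue URL, region, and message payload are placeholders; use the script and values from the repository for the actual run.

```python
# Sketch of a publisher script that sends test messages to the SQS queue.
# The queue URL, region, and payload shape are placeholders.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-data-queue"

for i in range(10):
    message = {"order_id": i, "status": "created", "details": {"channel": "web"}}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))

print("Published 10 messages")
```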
Check S3
Code Explanation
Serverless YML
Lambda Function Explanation
First, we define all the required imports.
The DataTransform class is a helper class that contains methods for data transformation.
The class contains two methods, flatten_dict and dict_clean, both decorated with an error_handler decorator. The error_handler function handles errors that may occur while the methods run: it prints a message indicating whether the method succeeded, and if exit_flag is set to True, it exits the program when an error occurs.
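As a rough sketch, a decorator along these lines might look like the following; the implementation in the repository may differ in its exact messages and behavior.

```python
# A minimal sketch of an error_handler decorator as described above.
import functools
import sys

def error_handler(exit_flag: bool = False):
    """Wrap a method, report success or failure, and optionally exit on error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                print(f"{func.__name__} succeeded")
                return result
            except Exception as exc:
                print(f"{func.__name__} failed: {exc}")
                if exit_flag:
                    sys.exit(1)
        return wrapper
    return decorator
```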
The flatten_dict method takes a nested dictionary as input and returns a flattened dictionary, where nested keys are concatenated with a separator. The parent_key and sep arguments are optional and are used to specify the parent key and separator for nested keys.
The dict_clean method takes a dictionary as input and returns a cleaned dictionary, where any values that are None, "None", "null", or empty strings are replaced with "n/a". The method converts all values to strings.
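The two methods could be sketched roughly as follows; the repository's version also wraps them with the error_handler decorator shown above.

```python
# A sketch of the two transformation methods described above.
class DataTransform:
    def flatten_dict(self, d, parent_key="", sep="_"):
        """Flatten a nested dict, joining nested keys with `sep`."""
        items = {}
        for key, value in d.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(self.flatten_dict(value, new_key, sep=sep))
            else:
                items[new_key] = value
        return items

    def dict_clean(self, d):
        """Replace None/empty values with 'n/a' and stringify everything else."""
        return {
            k: "n/a" if v in (None, "None", "null", "") else str(v)
            for k, v in d.items()
        }

# Example:
# DataTransform().flatten_dict({"a": {"b": 1}})  ->  {"a_b": 1}
```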
Overall, the DataTransform class provides useful methods for cleaning and transforming data, which can be useful in a variety of data processing tasks.
The function starts by creating an instance of the DataTransform class, which provides the methods for flattening and cleaning nested dictionaries.
The function then processes the messages received in the event argument using a for loop that iterates over the Records key in the event dictionary. The messages are first converted to Python dictionaries using json.loads, and then flattened and cleaned using the DataTransform methods. The processed messages are then added to a list called processed_messages.
The function then converts the processed_messages list to a Pandas DataFrame, which is then converted to an Arrow table using the pa.Table.from_pandas method. The Arrow table is then written to a Parquet file in memory using the pq.write_table method.
The Parquet file is then uploaded to an S3 bucket using the s3.put_object method, which takes the bucket name, the object key, and the Parquet file contents as bytes.
Finally, the function returns a dictionary with a statusCode of 200 and a body of "Parquet file uploaded to S3".
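Putting the pieces together, here is a condensed sketch of such a handler. The bucket environment variable, key layout, and file naming are placeholders rather than the exact values used in the repository.

```python
# A condensed sketch of the Lambda handler described above. Bucket name,
# environment variable, and key layout are placeholders.
import io
import json
import os
from datetime import datetime, timezone

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = os.environ.get("BUCKET_NAME", "my-raw-zone-bucket")

def handler(event, context):
    transform = DataTransform()  # helper class sketched earlier
    processed_messages = []

    # Each SQS record carries a JSON body; flatten and clean it
    for record in event["Records"]:
        message = json.loads(record["body"])
        flat = transform.flatten_dict(message)
        processed_messages.append(transform.dict_clean(flat))

    # Convert to a Pandas DataFrame, then to an Arrow table,
    # and write Parquet into an in-memory buffer
    df = pd.DataFrame(processed_messages)
    table = pa.Table.from_pandas(df)
    buffer = io.BytesIO()
    pq.write_table(table, buffer)

    # Upload the buffer to S3 under a year/month/day partitioned key
    now = datetime.now(timezone.utc)
    key = (
        f"raw/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{context.aws_request_id}.parquet"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())

    return {"statusCode": 200, "body": "Parquet file uploaded to S3"}
```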
Conclusion
Serverless Data Engineering is a powerful tool for businesses that need to quickly and easily process and store data in the cloud. AWS Lambda and S3 are two of the most popular and widely used services in this space. In this blog post, we discussed how to use these services to generate Parquet files and upload them to S3.
We began by discussing the benefits of using Parquet files and why they are an ideal data format for big data processing. Next, we provided an overview of AWS Lambda and how it can be used to process data in a serverless environment. We then walked through the process of using AWS Lambda to generate Parquet files and upload them to S3.
We provided a Python code example that demonstrated how to use the DataTransform class to transform and clean data before converting it to a Pandas DataFrame and then to an Arrow table. Finally, we used PyArrow to write the Arrow table to a Parquet file and upload it to S3.
By following the steps outlined in this blog post, you can quickly and easily generate Parquet files from your data and store them in the cloud. This can help you to streamline your data processing workflows, reduce costs, and improve the scalability and reliability of your data infrastructure.