Build a Production-Ready Real-Time Transactional Apache Hudi Data Lake from DynamoDB Streams Using Glue, Lambda, and Kinesis | Stream Processing

Author

Soumil N Shah

Software Developer

I earned a Bachelor of Science in Electronic Engineering and a double master’s in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Software Engineer at Jobtarget, where I spend most of my time developing ingestion frameworks and creating microservices and scalable architectures on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on a streaming application for ingesting real-time stream data into Elasticsearch via Kinesis and Firehose.

Project Overview

In this architecture, users purchase items from online retailers and generate order transactions that are stored in DynamoDB. The raw data layer of the data lake stores this order transaction data. To accomplish this, we enable Kinesis Data Streams for DynamoDB and stream real-time transactions from DynamoDB into a Kinesis data stream, process the streaming data with a Lambda function, and insert the records into a second Kinesis stream, where a Glue streaming job processes them and upserts them into an Apache Hudi transactional data lake. Finally, we build dashboards and derive insights using QuickSight.

In this project, you’ll learn how to:

  • Create and configure AWS Glue
  • Create a Hudi Table
  • Create a Spark Data Frame
  • Add data to the Hudi Table
  • Query data via Athena

Introduction

There is no single definition of “big data,” but in general it refers to data sets that are too large or complex for traditional data processing and analysis tools. Big data often includes data that is unstructured, meaning it does not fit neatly into traditional databases. Big data can come from sources such as social media, sensors, and transactional data. Organizations often gather enormous amounts of data and continue to produce ever-increasing amounts of data, ranging from terabytes to petabytes and occasionally even exabytes. 

Data that is continuously generated from many sources, such as a website's log files or the page analytics of an e-commerce site, is known as streaming data. The size and significance of this data grow at the same rate as system traffic. Amazon Kinesis Data Streams provides a scalable, cost-effective, and durable streaming data solution. Kinesis Data Streams can collect terabytes of data every second from tens of thousands of sources, including internet clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.
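
As a small, hedged illustration of pushing an event into a stream, the snippet below uses boto3 to put a single JSON record onto a Kinesis data stream. The stream name, region, and payload fields are assumptions for this example, not part of this project's code.

import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Example order event; field names are illustrative assumptions
order_event = {
    "order_id": "1001",
    "customer_id": "c-42",
    "amount": 59.99,
    "status": "CREATED",
}

response = kinesis.put_record(
    StreamName="order-transactions",        # assumed stream name
    Data=json.dumps(order_event).encode("utf-8"),
    PartitionKey=order_event["order_id"],   # records with the same key go to the same shard
)
print(response["ShardId"], response["SequenceNumber"])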

A data lakehouse is a data management architecture that uses a data lake as the central data repository. The architecture described here includes three main components: the data lake, the data warehouse, and the data mart. The data lake is a central repository for all data, both structured and unstructured. The data warehouse is a centralized repository for structured data. The data mart is a subset of the data warehouse focused on a specific business function or team.

Section II 

What is Apache Hudi and why should we use it for data lakes?

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycle and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes to simplify Change Data Capture (CDC) and streaming data ingestion, and helps handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near real-time access to updated data using familiar tools (Amazon).
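
To make the record-level upsert model concrete, here is a minimal PySpark sketch that writes a small DataFrame to a Hudi table on S3. The table name, key fields, partition column, and bucket path are placeholder assumptions, and running it requires the Hudi Spark bundle on the classpath.

from pyspark.sql import SparkSession

# Spark session; the Hudi Spark bundle must be available for format("hudi") to work
spark = (
    SparkSession.builder
    .appName("hudi-write-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A tiny example DataFrame standing in for order transactions
df = spark.createDataFrame(
    [("1001", "c-42", 59.99, "2023-01-01 10:00:00", "2023-01-01")],
    ["order_id", "customer_id", "amount", "updated_at", "order_date"],
)

hudi_options = {
    "hoodie.table.name": "orders",                                # assumed table name
    "hoodie.datasource.write.recordkey.field": "order_id",        # record-level key used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",     # latest value wins on duplicate keys
    "hoodie.datasource.write.partitionpath.field": "order_date",  # assumed partition column
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
)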

Popular Features of HUDI

1. Upserts and deletes with fast, pluggable indexing
2. Transactions, rollbacks, and concurrency control
3. Automatic file sizing, data clustering, compaction, and cleaning
4. Built-in metadata tracking for scalable storage access
5. Incremental queries and record-level change streams (see the sketch after this list)
6. SQL reads/writes from Spark, Presto, Trino, Hive, and more
7. Streaming ingestion with built-in CDC sources and tools
8. Backwards-compatible schema evolution and enforcement
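
To illustrate feature 5 above, the following sketch runs a Hudi incremental query with PySpark, reading only the records committed after a given instant time. The S3 path and begin instant are illustrative assumptions, and the Hudi Spark bundle must be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read-sketch").getOrCreate()

incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",  # read commits after this instant
}

incremental_df = (
    spark.read.format("hudi")
    .options(**incremental_options)
    .load("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
)
incremental_df.show()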

Section III

Architecture 


Figure: High-level architecture diagram

The first block is DynamoDB, which stores all transactions. On DynamoDB, we enable streams and stream these transactions into a Kinesis data stream. To transform the incoming streaming data, we use a Lambda function. The Lambda receives a batch of records, which we deserialize and convert from DynamoDB JSON to regular JSON, flatten any complex map objects, and insert into the next Kinesis stream, which receives the flattened JSON data from the Lambda function. The Glue streaming job takes the streaming data from that Kinesis data stream and UPSERTs it into the data lake. Apache Hudi is used to build our data lake. Data from the raw Hudi tables is then consumed to build dimension and fact tables or to create views. End users can consume the data by running ad hoc queries with Athena, and valuable insights can be derived from the data by building dashboards with QuickSight.
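
As a rough sketch of the transform described above (not the project's exact code), the following Lambda handler decodes the Kinesis records that carry the DynamoDB change events, converts DynamoDB JSON into regular JSON with boto3's TypeDeserializer, flattens nested map objects, and forwards the result to the next Kinesis stream. The output stream name, the order_id partition key, and the event layout details are assumptions for illustration.

import base64
import json

import boto3
from boto3.dynamodb.types import TypeDeserializer

kinesis = boto3.client("kinesis")
deserializer = TypeDeserializer()
OUTPUT_STREAM = "flattened-order-transactions"  # assumed name of the second Kinesis stream


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


def handler(event, context):
    out_records = []
    for record in event["Records"]:
        # Kinesis delivers each DynamoDB change record as base64-encoded JSON
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        new_image = payload.get("dynamodb", {}).get("NewImage", {})
        # Convert DynamoDB attribute-value JSON ({"S": ...}, {"N": ...}) into plain Python values
        plain = {k: deserializer.deserialize(v) for k, v in new_image.items()}
        out_records.append({
            "Data": json.dumps(flatten(plain), default=str).encode("utf-8"),
            "PartitionKey": str(plain.get("order_id", "unknown")),  # assumed partition key field
        })
    if out_records:
        kinesis.put_records(StreamName=OUTPUT_STREAM, Records=out_records)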

Step 1: Users in this architecture purchase items from online retailers, generating order transactions that are stored in DynamoDB.
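
As a minimal sketch of how such an order transaction could land in DynamoDB (the table and attribute names are assumptions for illustration):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders_table = dynamodb.Table("orders")  # assumed table name

orders_table.put_item(
    Item={
        "order_id": "1001",                   # assumed partition key
        "customer_id": "c-42",
        "amount": "59.99",                    # kept as a string to sidestep float/Decimal handling
        "status": "CREATED",
        "updated_at": "2023-01-01 10:00:00",
        "order_date": "2023-01-01",
    }
)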

Step 2: The raw data layer stores the order transaction data that is fed into the data lake. To accomplish this, we enable Kinesis Data Streams for DynamoDB and stream real-time transactions from DynamoDB into a Kinesis data stream, process the streaming data with a Lambda function, and insert it into a second Kinesis stream, where a Glue streaming job processes the records and upserts them into the Apache Hudi transactional data lake.
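
The following is a minimal, hedged sketch of what the Glue streaming job could look like: it reads the flattened records from Kinesis as a streaming DataFrame and upserts each micro-batch into a Hudi table on S3. The stream ARN, S3 paths, field names, and window size are assumptions, and the job requires the Hudi libraries to be available to Glue.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

hudi_options = {
    "hoodie.table.name": "orders",                                # assumed table name
    "hoodie.datasource.write.recordkey.field": "order_id",        # record-level key used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",     # latest value wins on duplicate keys
    "hoodie.datasource.write.partitionpath.field": "order_date",  # assumed partition column
    "hoodie.datasource.write.operation": "upsert",
}

# Read the flattened order events from Kinesis as a streaming DataFrame
kinesis_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/flattened-order-transactions",
        "classification": "json",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)


def process_batch(batch_df, batch_id):
    # Upsert each non-empty micro-batch into the Hudi table on S3
    if batch_df.count() > 0:
        (
            batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
        )


glue_context.forEachBatch(
    frame=kinesis_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-datalake-bucket/checkpoints/orders/",  # assumed path
    },
)
job.commit()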

Step 3: Users can build dashboards and derive insights using QuickSight.
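
For the ad hoc queries mentioned in the architecture section, a minimal boto3 sketch of submitting a query against the Hudi table through Athena might look like this. The database, table, and output location are assumptions, and it presumes the table is registered in the Glue Data Catalog.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT order_id, amount, status FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "hudidb"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/athena-results/"},  # assumed path
)
print(response["QueryExecutionId"])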

Video Tutorial

Code:

Step-by-Step PDF Guide:


Learn More on HUDI?
