Build a Production-Ready Real-Time Transactional Apache Hudi Data Lake from DynamoDB Streams Using Glue, Lambda, and Kinesis | Stream Processing

Author

Soumil N Shah

Software Developer

I earned a Bachelor of Science in Electronic Engineering and a double master’s in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Software Engineer at Jobtarget, where I spend most of my time developing ingestion frameworks and creating microservices and scalable architectures on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on a streaming application for ingesting real-time stream data into Elasticsearch via Kinesis and Firehose.

Project Overview

In this architecture, users purchase items from online retailers and generate order transactions that are stored in DynamoDB. The raw data layer of the data lake stores this order transaction data. To accomplish this, we enable Kinesis Data Streams for DynamoDB and stream real-time transactions from DynamoDB into a Kinesis data stream, process the streaming data with a Lambda function, and insert the records into a second Kinesis stream, where a Glue streaming job processes them and upserts them into an Apache Hudi transactional data lake. Finally, we build dashboards and derive insights using QuickSight.

In this project, you’ll learn how to:

  • Create and configure AWS Glue
  • Create a Hudi Table
  • Create a Spark Data Frame
  • Add data to the Hudi Table
  • Query data via Athena

Introduction

There is no single definition of “big data,” but in general it refers to data sets that are too large or complex for traditional data processing and analysis tools. Big data often includes data that is unstructured, meaning it does not fit neatly into traditional databases. Big data can come from sources such as social media, sensors, and transactional data. Organizations often gather enormous amounts of data and continue to produce ever-increasing amounts of data, ranging from terabytes to petabytes and occasionally even exabytes. 

Data that is continuously generated from many sources, such as a website's log files or the page analytics of an e-commerce site, is known as streaming data. The size and significance of this data grow at the same rate as system traffic. Amazon Kinesis Data Streams provides a scalable, cost-effective, and durable streaming data solution. Kinesis Data Streams can collect terabytes of data every second from tens of thousands of sources, including internet clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.
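
As a small, hedged illustration of pushing an event into a stream, the snippet below uses boto3 to put a single JSON record onto a Kinesis data stream. The stream name, region, and payload fields are assumptions for this example, not part of this project's code.

import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Example order event; field names are illustrative assumptions
order_event = {
    "order_id": "1001",
    "customer_id": "c-42",
    "amount": 59.99,
    "status": "CREATED",
}

response = kinesis.put_record(
    StreamName="order-transactions",        # assumed stream name
    Data=json.dumps(order_event).encode("utf-8"),
    PartitionKey=order_event["order_id"],   # records with the same key go to the same shard
)
print(response["ShardId"], response["SequenceNumber"])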

A data lakehouse is a data management architecture that uses a data lake as the central data repository. The architecture described here includes three main components: the data lake, the data warehouse, and the data mart. The data lake is a central repository for all data, both structured and unstructured. The data warehouse is a centralized repository for structured data. The data mart is a subset of the data warehouse focused on a specific business function or team.

Section II 

What is Apache Hudi and why should we use it for data lakes?

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycle and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes to simplify Change Data Capture (CDC) and streaming data ingestion, and helps handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near real-time access to updated data using familiar tools (Amazon).
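
To make the record-level upsert model concrete, here is a minimal PySpark sketch that writes a small DataFrame to a Hudi table on S3. The table name, key fields, partition column, and bucket path are placeholder assumptions, and running it requires the Hudi Spark bundle on the classpath.

from pyspark.sql import SparkSession

# Spark session; the Hudi Spark bundle must be available for format("hudi") to work
spark = (
    SparkSession.builder
    .appName("hudi-write-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A tiny example DataFrame standing in for order transactions
df = spark.createDataFrame(
    [("1001", "c-42", 59.99, "2023-01-01 10:00:00", "2023-01-01")],
    ["order_id", "customer_id", "amount", "updated_at", "order_date"],
)

hudi_options = {
    "hoodie.table.name": "orders",                                # assumed table name
    "hoodie.datasource.write.recordkey.field": "order_id",        # record-level key used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",     # latest value wins on duplicate keys
    "hoodie.datasource.write.partitionpath.field": "order_date",  # assumed partition column
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
)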

Popular Features of HUDI

1. Upserts and deletes with fast, pluggable indexing
2. Transactions, rollbacks, and concurrency control
3. Automatic file sizing, data clustering, compaction, and cleaning
4. Built-in metadata tracking for scalable storage access
5. Incremental queries and record-level change streams (see the sketch after this list)
6. SQL reads/writes from Spark, Presto, Trino, Hive, and more
7. Streaming ingestion with built-in CDC sources and tools
8. Backwards-compatible schema evolution and enforcement
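
To illustrate feature 5 above, the following sketch runs a Hudi incremental query with PySpark, reading only the records committed after a given instant time. The S3 path and begin instant are illustrative assumptions, and the Hudi Spark bundle must be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read-sketch").getOrCreate()

incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",  # read commits after this instant
}

incremental_df = (
    spark.read.format("hudi")
    .options(**incremental_options)
    .load("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
)
incremental_df.show()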

Section III

Architecture 


Figure: High-level architecture diagram

The first block is DynamoDB, which stores all transactions. On DynamoDB, we enable streams and stream these transactions into a Kinesis data stream. To transform the incoming streaming data, we use a Lambda function. The Lambda receives a batch of records, which we deserialize and convert from DynamoDB JSON to regular JSON, flatten any complex map objects, and insert into the next Kinesis stream, which receives the flattened JSON data from the Lambda function. The Glue streaming job takes the streaming data from that Kinesis data stream and UPSERTs it into the data lake. Apache Hudi is used to build our data lake. Data from the raw Hudi tables is then consumed to build dimension and fact tables or to create views. End users can consume the data by running ad hoc queries with Athena, and valuable insights can be derived from the data by building dashboards with QuickSight.
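
As a rough sketch of the transform described above (not the project's exact code), the following Lambda handler decodes the Kinesis records that carry the DynamoDB change events, converts DynamoDB JSON into regular JSON with boto3's TypeDeserializer, flattens nested map objects, and forwards the result to the next Kinesis stream. The output stream name, the order_id partition key, and the event layout details are assumptions for illustration.

import base64
import json

import boto3
from boto3.dynamodb.types import TypeDeserializer

kinesis = boto3.client("kinesis")
deserializer = TypeDeserializer()
OUTPUT_STREAM = "flattened-order-transactions"  # assumed name of the second Kinesis stream


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


def handler(event, context):
    out_records = []
    for record in event["Records"]:
        # Kinesis delivers each DynamoDB change record as base64-encoded JSON
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        new_image = payload.get("dynamodb", {}).get("NewImage", {})
        # Convert DynamoDB attribute-value JSON ({"S": ...}, {"N": ...}) into plain Python values
        plain = {k: deserializer.deserialize(v) for k, v in new_image.items()}
        out_records.append({
            "Data": json.dumps(flatten(plain), default=str).encode("utf-8"),
            "PartitionKey": str(plain.get("order_id", "unknown")),  # assumed partition key field
        })
    if out_records:
        kinesis.put_records(StreamName=OUTPUT_STREAM, Records=out_records)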

Step 1: Users in this architecture purchase items from online retailers, generating order transactions that are stored in DynamoDB.
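
As a minimal sketch of how such an order transaction could land in DynamoDB (the table and attribute names are assumptions for illustration):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders_table = dynamodb.Table("orders")  # assumed table name

orders_table.put_item(
    Item={
        "order_id": "1001",                   # assumed partition key
        "customer_id": "c-42",
        "amount": "59.99",                    # kept as a string to sidestep float/Decimal handling
        "status": "CREATED",
        "updated_at": "2023-01-01 10:00:00",
        "order_date": "2023-01-01",
    }
)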

Step 2: The raw data layer stores the order transaction data that is fed into the data lake. To accomplish this, we enable Kinesis Data Streams for DynamoDB and stream real-time transactions from DynamoDB into a Kinesis data stream, process the streaming data with a Lambda function, and insert it into a second Kinesis stream, where a Glue streaming job processes the records and upserts them into the Apache Hudi transactional data lake.
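
The following is a minimal, hedged sketch of what the Glue streaming job could look like: it reads the flattened records from Kinesis as a streaming DataFrame and upserts each micro-batch into a Hudi table on S3. The stream ARN, S3 paths, field names, and window size are assumptions, and the job requires the Hudi libraries to be available to Glue.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

hudi_options = {
    "hoodie.table.name": "orders",                                # assumed table name
    "hoodie.datasource.write.recordkey.field": "order_id",        # record-level key used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",     # latest value wins on duplicate keys
    "hoodie.datasource.write.partitionpath.field": "order_date",  # assumed partition column
    "hoodie.datasource.write.operation": "upsert",
}

# Read the flattened order events from Kinesis as a streaming DataFrame
kinesis_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/flattened-order-transactions",
        "classification": "json",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)


def process_batch(batch_df, batch_id):
    # Upsert each non-empty micro-batch into the Hudi table on S3
    if batch_df.count() > 0:
        (
            batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://my-datalake-bucket/hudi/orders/")  # assumed S3 path
        )


glue_context.forEachBatch(
    frame=kinesis_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-datalake-bucket/checkpoints/orders/",  # assumed path
    },
)
job.commit()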

Step 3: Users can build dashboards and derive insights using QuickSight.
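
For the ad hoc queries mentioned in the architecture section, a minimal boto3 sketch of submitting a query against the Hudi table through Athena might look like this. The database, table, and output location are assumptions, and it presumes the table is registered in the Glue Data Catalog.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT order_id, amount, status FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "hudidb"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/athena-results/"},  # assumed path
)
print(response["QueryExecutionId"])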

Video Tutorial

Code:

Step-by-Step PDF Guide:


Learn More on HUDI?
