Change Data Capture (CDC) Events Ingestion

Data Engineering System Design 1: Change Data Capture (CDC) events ingestion

Why Change Data Capture (CDC)?

Most applications start with a small data set and grow into larger ones that support multiple use cases. Different use cases, such as data warehousing, analytics, time-series data, and advanced text search, call for different data storage options.

Over time, the application creates multiple versions of its data across these stores. The source database remains the source of truth whenever downstream storage disagrees with it.

For example, imagine a customer opening a bank account and depositing money into it. The record is first created in a relational database, and subsequent versions of the data are propagated to downstream stores. In this case, the relational database is the source of truth, and the records stored downstream after transformation and filtering are derived data.

Enter CDC

CDC is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to derived data systems.

The CDC process has three stages:

  1. Change detection
  2. Change generation and ingestion
  3. Change propagation

Change Detection

  1. Periodically poll for changes in a last_change_timestamp column.
  2. Watch the database's write-ahead log (WAL).
  3. Use database triggers on row-level operations.
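
The first option, timestamp polling, can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical accounts table with an integer last_change_timestamp column and an in-memory SQLite database standing in for the source database. The function name fetch_changes and the schema are assumptions made for the example.

```python
import sqlite3

def fetch_changes(conn, last_seen):
    """Return rows changed after last_seen (oldest first) and the new watermark."""
    rows = conn.execute(
        "SELECT id, balance, last_change_timestamp FROM accounts "
        "WHERE last_change_timestamp > ? ORDER BY last_change_timestamp",
        (last_seen,),
    ).fetchall()
    # The watermark advances to the newest timestamp seen, or stays put.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark

# Hypothetical source table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, balance INTEGER, last_change_timestamp INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 10)")
conn.execute("INSERT INTO accounts VALUES (2, 50, 20)")

# First poll picks up both rows.
changes, watermark = fetch_changes(conn, 0)

# A later update bumps the timestamp, so the second poll sees only that row.
conn.execute(
    "UPDATE accounts SET balance = 150, last_change_timestamp = 30 WHERE id = 1"
)
changes2, watermark2 = fetch_changes(conn, watermark)
```

Polling like this only sees the latest state of each row at poll time. Deleted rows and intermediate updates between two polls are not visible, which is one reason log-based detection is often preferred.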

Change Generation 

  1. The write-ahead log can be read to generate CDC events.
  2. Each log entry is converted into an event.
  3. Each event contains a single change (insert, update, or delete).
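
As a rough sketch of the conversion step, the snippet below turns simplified log entries into one event per change. The log entry layout (lsn, table, op, old, new) and the ChangeEvent class are assumptions for illustration and do not match the WAL format of any specific database.

```python
from dataclasses import dataclass
from typing import Optional

VALID_OPS = ("insert", "update", "delete")

@dataclass(frozen=True)
class ChangeEvent:
    lsn: int                 # position of the change in the log
    table: str
    op: str                  # one of VALID_OPS
    before: Optional[dict]   # row state before the change (None for inserts)
    after: Optional[dict]    # row state after the change (None for deletes)

def to_event(log_entry: dict) -> ChangeEvent:
    """Convert one (hypothetical) decoded log entry into a single-change event."""
    op = log_entry["op"]
    if op not in VALID_OPS:
        raise ValueError(f"unsupported operation: {op}")
    return ChangeEvent(
        lsn=log_entry["lsn"],
        table=log_entry["table"],
        op=op,
        before=log_entry.get("old"),
        after=log_entry.get("new"),
    )

# Hypothetical decoded log for the bank account example.
log = [
    {"lsn": 1, "table": "accounts", "op": "insert",
     "new": {"id": 1, "balance": 0}},
    {"lsn": 2, "table": "accounts", "op": "update",
     "old": {"id": 1, "balance": 0}, "new": {"id": 1, "balance": 100}},
    {"lsn": 3, "table": "accounts", "op": "delete",
     "old": {"id": 1, "balance": 100}},
]

events = [to_event(entry) for entry in log]
```

Carrying both the before and after row state lets a consumer handle deletes and compute diffs without going back to the source database.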

Change Ingestion 

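  1. Maintaining the order of events is critical.
  2. Events are written to an event bus such as Kafka.
  3. The event bus must be highly reliable and scalable, and it must guarantee the order of events. Kafka guarantees order only within a partition, so events for the same row should share a message key.
  4. Events are written to a Kafka topic.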
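
To see why keying matters for ordering, here is a small in-memory simulation of a partitioned topic. It does not use a Kafka client. The CRC32 hash is a stand-in for Kafka's own partitioner, and the names partition_for and publish are made up for this sketch.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically (stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Each partition is an append-only list, so order is preserved per partition.
partitions = defaultdict(list)

def publish(key: str, event: str) -> int:
    """Append the event to the partition chosen by its key."""
    partition = partition_for(key)
    partitions[partition].append((key, event))
    return partition

# Interleaved changes for two hypothetical accounts.
incoming = [
    ("account-1", "insert"),
    ("account-2", "insert"),
    ("account-1", "update"),
    ("account-1", "delete"),
]
for key, op in incoming:
    publish(key, op)

# All events for one key land in one partition, in the order they were sent.
account_1_ops = [
    op
    for key, op in partitions[partition_for("account-1")]
    if key == "account-1"
]
```

Using the primary key of the changed row as the message key is a common choice, because it keeps every change to that row in a single ordered sequence.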

Change Propagation

  1. Any downstream application can subscribe to this topic.
  2. Depending on the use case, downstream applications can consume events as a continuous stream or in an event-driven fashion.
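
A downstream subscriber can be sketched as a loop that applies each event to its own derived store and remembers how far it has read. The topic here is a plain Python list and the store is a dict, both assumptions standing in for a real Kafka topic and a real downstream system such as a search index.

```python
def apply_event(store: dict, event: dict) -> None:
    """Apply one change event to the derived store."""
    if event["op"] == "delete":
        store.pop(event["key"], None)
    else:
        # Inserts and updates both overwrite with the latest row state.
        store[event["key"]] = event["after"]

def consume(topic: list, store: dict, offset: int = 0) -> int:
    """Apply all events from offset onward and return the next offset to read."""
    for event in topic[offset:]:
        apply_event(store, event)
    return len(topic)

# Hypothetical topic contents for two accounts.
topic = [
    {"key": 1, "op": "insert", "after": {"balance": 0}},
    {"key": 1, "op": "update", "after": {"balance": 100}},
    {"key": 2, "op": "insert", "after": {"balance": 50}},
    {"key": 2, "op": "delete", "after": None},
]

search_index = {}
next_offset = consume(topic, search_index)

# A new event arrives later; the consumer resumes from its saved offset.
topic.append({"key": 1, "op": "update", "after": {"balance": 70}})
resumed_offset = consume(topic, search_index, next_offset)
```

Because each consumer tracks its own offset, several downstream applications can read the same topic independently and at their own pace.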

[Figure: CDC events ingestion system with various use-cases]

Debezium:

  1. Debezium is an open-source tool that captures row-level changes from popular databases such as PostgreSQL, MySQL, Oracle, Db2, and MongoDB.
  2. Support for Cassandra and Vitess is in the incubation stage.
  3. Debezium can run as a standalone server (Debezium Server), or it can be added as a dependency to application code.
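
Debezium change events carry an envelope with before, after, source, op, and ts_ms fields, where op is "c" (create), "u" (update), "d" (delete), or "r" (snapshot read). The sketch below reads such an envelope. The sample event values and the summarize helper are made up for illustration, and the example assumes a relational connector whose source block includes a table name.

```python
import json

OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read (snapshot)"}

def summarize(raw: str) -> dict:
    """Extract the operation, table, and relevant row from a change event."""
    message = json.loads(raw)
    # With JSON schemas enabled the envelope sits under "payload";
    # otherwise the message itself is the envelope.
    payload = message.get("payload", message)
    op = payload["op"]
    # Deletes have no "after" state, so fall back to the "before" state.
    row = payload["before"] if op == "d" else payload["after"]
    return {
        "operation": OP_NAMES.get(op, op),
        "table": payload["source"]["table"],
        "row": row,
    }

# Hypothetical update event for the bank account example.
raw_event = json.dumps({
    "before": {"id": 1, "balance": 0},
    "after": {"id": 1, "balance": 100},
    "source": {"connector": "postgresql", "table": "accounts"},
    "op": "u",
    "ts_ms": 1700000000000,
})

summary = summarize(raw_event)
```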

Debezium Architecture:

[Figure: Debezium architecture]

Kafka Connect:

  1. Kafka Connect provides source connectors to pull data into Kafka and sink connectors to push data out of it.
  2. Available connectors include JDBC source/sink, JMS source, Elasticsearch sink, Amazon S3 sink, HDFS 2 sink, and Kafka Replicator.
  3. Kafka Connect can be deployed in either standalone or distributed mode.
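
In distributed mode, connectors are typically registered through the Kafka Connect REST API by posting a JSON document to the /connectors endpoint. The sketch below only builds that request and does not send it. The host, port, credentials, connector name, and table name are placeholder assumptions, and the config keys shown are those of the Debezium PostgreSQL connector in recent versions (older versions use different names for some of them).

```python
import json
import urllib.request

def build_registration_request(connect_url: str, name: str, config: dict):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values; replace with the details of a real environment.
connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.dbname": "bank",
    "topic.prefix": "bank",
    "table.include.list": "public.accounts",
}

request = build_registration_request(
    "http://localhost:8083", "accounts-connector", connector_config
)

# To actually register the connector against a running Connect worker:
# urllib.request.urlopen(request)
```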
