Change Data Capture (CDC) Events Ingestion
Data Engineering System Design 1: Change Data Capture (CDC) events ingestion
Why Change Data Capture (CDC)?
Most applications start with a small data set and grow into larger ones that support multiple use cases. Different use cases, such as data warehousing, analytics, time-series data, and advanced text search, call for different data storage options.
Over time, the application creates multiple versions of its data. The source database remains the source of truth if there is a discrepancy in downstream storage.
For example, imagine a customer opening a bank account and depositing money into it. The record is first created in a relational database, and subsequent versions of the data are propagated to downstream stores. In this case, the relational database is the source of truth, and the records stored downstream after transformation and filtering are derived data.
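To make the distinction between source-of-truth and derived data concrete, here is a minimal sketch in Python. The account rows, the `derive` transformation, and the helper names are all hypothetical and chosen only for illustration; the point is that derived data can always be rebuilt from the source, which is how a discrepancy downstream gets resolved.

```python
from decimal import Decimal

# Hypothetical rows in the source relational database (the source of truth).
source_accounts = [
    {"account_id": 1, "owner": "alice", "balance": Decimal("150.00"), "status": "open"},
    {"account_id": 2, "owner": "bob", "balance": Decimal("0.00"), "status": "closed"},
]


def derive(row):
    """Filter and transform one source row into its derived form.

    Assumed rules for this sketch: keep only open accounts and expose
    the balance as integer cents.
    """
    if row["status"] != "open":
        return None
    return {"account_id": row["account_id"], "balance_cents": int(row["balance"] * 100)}


def rebuild(source_rows):
    """Recompute the full derived data set from the source of truth."""
    derived = {}
    for row in source_rows:
        record = derive(row)
        if record is not None:
            derived[record["account_id"]] = record
    return derived


def find_discrepancies(source_rows, downstream):
    """Return the account ids whose downstream copy disagrees with the source."""
    expected = rebuild(source_rows)
    keys = expected.keys() | downstream.keys()
    return sorted(k for k in keys if expected.get(k) != downstream.get(k))


# A stale downstream copy that missed the latest deposit.
stale_downstream = {1: {"account_id": 1, "balance_cents": 10000}}
mismatched = find_discrepancies(source_accounts, stale_downstream)
```

Here `mismatched` contains account 1, and calling `rebuild(source_accounts)` would replace the stale copy with values recomputed from the source.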
Enter CDC
CDC is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to derived data systems.
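As a rough illustration of "a form in which they can be replicated", the sketch below replays a list of change events against an in-memory replica. The event shape (an `op` code plus `before` and `after` row images) is modeled loosely on the envelope Debezium emits; the table contents and the `apply_change` helper are assumptions made for this example.

```python
def apply_change(replica, event):
    """Apply one change event to a replica keyed by primary key `id`.

    Assumed op codes: "c" = create, "u" = update, "d" = delete,
    "r" = read (row captured during an initial snapshot).
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical stream of changes captured from an accounts table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "balance": 0}},
    {"op": "u", "before": {"id": 1, "balance": 0}, "after": {"id": 1, "balance": 100}},
    {"op": "c", "before": None, "after": {"id": 2, "balance": 50}},
    {"op": "d", "before": {"id": 2, "balance": 50}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Replaying the events in order leaves the replica holding only account 1 with its latest balance, which matches the state of the source after the same writes.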
The CDC process has the following stages:
Change Detection
Change Generation
Change Ingestion
Change Propagation
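The stages listed above can be sketched end to end in a few lines of Python. This is a toy under stated assumptions, not how a production system works: detection here compares two snapshots of a table, whereas tools such as Debezium read the database's transaction log; the in-memory `deque` stands in for a durable event log; and all function names are hypothetical.

```python
from collections import deque


def detect_changes(old, new):
    """Change detection: find keys whose row differs between two snapshots."""
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            yield key, before, after


def generate_event(key, before, after):
    """Change generation: describe one detected change as an event."""
    if before is None:
        op = "insert"
    elif after is None:
        op = "delete"
    else:
        op = "update"
    return {"key": key, "op": op, "before": before, "after": after}


def ingest(log, events):
    """Change ingestion: append events to an ordered log."""
    for event in events:
        log.append(event)


def propagate(log, sinks):
    """Change propagation: deliver each logged event to every downstream sink."""
    while log:
        event = log.popleft()
        for sink in sinks:
            sink(event)


old_snapshot = {1: {"balance": 0}}
new_snapshot = {1: {"balance": 100}, 2: {"balance": 5}}

log = deque()
ingest(log, (generate_event(*change) for change in detect_changes(old_snapshot, new_snapshot)))

received = []  # stand-in for a warehouse, search index, or cache
propagate(log, [received.append])
ops = [event["op"] for event in received]
```

Keeping the stages as separate functions mirrors the separation in the list: the detection strategy or the set of sinks can change without touching the other stages.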
CDC events ingestion system with various use-cases
Debezium:
Debezium Architecture:
Kafka Connect:
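As a hedged illustration of how Debezium is typically run on Kafka Connect, the sketch below builds the JSON body that would be sent to the Kafka Connect REST API (`POST /connectors`) to register a Debezium PostgreSQL connector. The property names follow the Debezium 2.x connector documentation as I understand it; the host, credentials, database, and table names are placeholders invented for this example, and `build_connector_request` is a hypothetical helper.

```python
import json


def build_connector_request(name, host, dbname, tables):
    """Build a Kafka Connect registration payload for a Debezium connector.

    All values below are placeholders; check the property names against
    the Debezium version you deploy.
    """
    config = {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",          # logical decoding plugin
        "database.hostname": host,
        "database.port": "5432",
        "database.user": "cdc_user",        # placeholder credentials
        "database.password": "change-me",
        "database.dbname": dbname,
        "topic.prefix": name,               # prefix for the Kafka topics
        "table.include.list": ",".join(tables),
    }
    return {"name": name, "config": config}


payload = build_connector_request(
    "bank", "db.example.internal", "bank", ["public.accounts"]
)
body = json.dumps(payload, indent=2)
```

The resulting `body` is only the request payload; actually sending it requires a running Kafka Connect worker with the Debezium plugin installed.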