Unleash a No-Code ETL Pipeline with Azure Data Factory (ADF)
1. Introduction:
I am excited to share my first LinkedIn article. Today I will discuss how I built a production-grade, no-code ETL pipeline with Microsoft Azure Data Factory (ADF) and why it is an ideal tool for someone entering the data engineering field as a fresher. I will also share the pros and cons of ADF and how to address the cons.
2. What Is Azure Data Factory:
ADF is a fully managed data integration service and an end-to-end ETL and ELT tool. It covers everything from pulling data from a source or data lake (ADLS Gen2) through to deploying the pipeline that feeds a dashboard, all with a simple drag-and-drop visual interface (UI).
In ADF, the transformation logic you design visually is translated into Apache Spark code and run on managed Spark clusters, so you do not write or maintain that code yourself. This helps with performance, scalability, and cost.
ADF has 90+ connectors for big data sources such as Amazon S3 and Google Cloud Storage and for leading data warehouses and databases like Snowflake, Oracle Exadata, etc.
In short, ADF is primarily built for orchestrating and managing data workflows.
3. Components of ADF:
ADF has multiple components for ingesting data from the source, transforming it, and moving the transformed data to the target sink. The major components needed to build an ETL pipeline are:
1. Linked Services: Here we define the connection information that Data Factory needs to connect to external sources (HTTP, data storage, etc.).
2. Datasets: In ADF, datasets are like labeled folders that tell us exactly where a piece of data (for example, the path of a CSV or Delta file) is stored.
Note: If you are confused about the difference between Linked Services and Datasets, here is an easy explanation.
Linked Services (LS) hold the connection to a storage service, such as Amazon S3 or Azure Data Lake Storage Gen2, so that Azure Data Factory can interact with it. For example, if there is an S3 bucket named sample-folder-S3, a Linked Service lets ADF communicate with it.
Datasets, on the other hand, pinpoint the exact location of a file. For instance, a dataset for sample.csv contains the information needed to locate sample.csv inside the sample-folder-S3 bucket.
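To make the distinction concrete, here is a minimal sketch that builds the two JSON definitions as Python dictionaries. The layout follows my understanding of the ADF JSON format for an Amazon S3 linked service and a delimited-text dataset, so treat the property names as assumptions to check against the official reference. The names SampleS3LinkedService and SampleCsv and the placeholder credentials are hypothetical.

```python
import json

# Hypothetical names for illustration only.
LINKED_SERVICE_NAME = "SampleS3LinkedService"
BUCKET_NAME = "sample-folder-S3"
FILE_NAME = "sample.csv"

# Linked service: HOW to connect (store type and credentials).
linked_service = {
    "name": LINKED_SERVICE_NAME,
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access-key-id>",  # placeholder, not a real key
            "secretAccessKey": {"type": "SecureString", "value": "<secret>"},
        },
    },
}

# Dataset: WHAT to read, pointing at the linked service by name.
dataset = {
    "name": "SampleCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": LINKED_SERVICE_NAME,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AmazonS3Location",
                "bucketName": BUCKET_NAME,
                "fileName": FILE_NAME,
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(dataset, indent=2))
```

Notice that the dataset carries no credentials at all. It only names the linked service, so several datasets can reuse one connection.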
3. Mapping Data Flows: In ADF, data flows are used to express your data transformation logic (select, sort, aggregate, lookup, join, etc.) through the visual interface. ADF executes this logic on a Spark cluster to improve performance. The cluster spins up and down as needed, so there is no need to manage or maintain clusters yourself.
Link for optimizing ADF data flows: Mapping data flows performance and tuning guide
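To show what kind of logic a data flow expresses, here is a plain-Python sketch of a Select, then Aggregate, then Sort chain. This is not how ADF runs it (ADF generates Spark code for you). The rows, column names, and helper functions are all hypothetical and only mirror what you would otherwise drag and drop.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for a source dataset.
rows = [
    {"order_id": 1, "country": "IN", "amount": 120.0, "note": "a"},
    {"order_id": 2, "country": "CA", "amount": 80.0, "note": "b"},
    {"order_id": 3, "country": "IN", "amount": 50.0, "note": "c"},
]


def select(rows, columns):
    """Keep only the named columns (like a Select transformation)."""
    return [{c: r[c] for c in columns} for r in rows]


def aggregate_sum(rows, key, value):
    """Group by `key` and sum `value` (like an Aggregate transformation)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return [{key: k, "total_" + value: v} for k, v in totals.items()]


def sort_desc(rows, column):
    """Order rows by `column`, largest first (like a Sort transformation)."""
    return sorted(rows, key=lambda r: r[column], reverse=True)


selected = select(rows, ["country", "amount"])
totals = aggregate_sum(selected, "country", "amount")
result = sort_desc(totals, "total_amount")
print(result)
# [{'country': 'IN', 'total_amount': 170.0}, {'country': 'CA', 'total_amount': 80.0}]
```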
4. Pipelines: In ADF, a pipeline is a logical grouping of activities that together perform a unit of work. A data factory can contain one or more pipelines. For example, a pipeline can ingest data from a source and then run queries from a Databricks notebook on a Databricks cluster.
Pipelines are executed by Triggers. A trigger enables automation by running the pipeline on a schedule or in response to an event, without the need for manual intervention.
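The pipeline and trigger described above can be sketched as JSON definitions, again built as Python dictionaries. The structure reflects my understanding of the ADF pipeline and schedule trigger format, so verify the property names against the official docs. Every name here (IngestThenTransform, SampleCsv, RawZoneCsv, SampleDatabricks, the notebook path, the start time) is hypothetical. The small execution_order helper is my own illustration of how dependsOn orders activities. It is not part of ADF.

```python
PIPELINE_NAME = "IngestThenTransform"  # hypothetical

pipeline = {
    "name": PIPELINE_NAME,
    "properties": {
        "activities": [
            {
                "name": "IngestCsv",
                "type": "Copy",
                "inputs": [{"referenceName": "SampleCsv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "RawZoneCsv", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            },
            {
                "name": "RunNotebook",
                "type": "DatabricksNotebook",
                # Run only after the copy step has succeeded.
                "dependsOn": [
                    {"activity": "IngestCsv", "dependencyConditions": ["Succeeded"]}
                ],
                "linkedServiceName": {
                    "referenceName": "SampleDatabricks",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {"notebookPath": "/Shared/transform_orders"},
            },
        ]
    },
}

# Schedule trigger: run the pipeline once a day.
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": PIPELINE_NAME, "type": "PipelineReference"}}
        ],
    },
}


def execution_order(pipeline_def):
    """Return activity names in an order that respects dependsOn."""
    activities = pipeline_def["properties"]["activities"]
    deps = {a["name"]: {d["activity"] for d in a.get("dependsOn", [])} for a in activities}
    ordered = []
    while deps:
        ready = sorted(name for name, d in deps.items() if d <= set(ordered))
        if not ready:
            raise ValueError("circular dependsOn")
        ordered.extend(ready)
        for name in ready:
            del deps[name]
    return ordered


print(execution_order(pipeline))  # ['IngestCsv', 'RunNotebook']
```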
5. Pros And Cons Of ADF:
Every tool has its pros and cons. Here is the list for ADF.
Pros:
Cons:
6. How to overcome the Cons:
Limited customization and data storage types: Pipelines in the real world can be more complex and need more customization. To overcome this, you can write complex logic and queries in Azure Databricks notebooks, which are tightly integrated with ADF.
Azure Databricks offers a high degree of customization along with performance features that help reduce the cost of the ETL pipeline, such as the Photon engine (a vectorized query engine written in C++ that is compatible with Apache Spark APIs, with reported speedups of up to 20x on some big data workloads), liquid clustering, and other optimization techniques.
Databricks also addresses the need for an open lakehouse storage format, especially Delta (an extension of Parquet that adds ACID transactions by keeping a file-based transaction log). Diagram representation of Delta by Avril Aysha.
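To give a feel for what a file-based transaction log means, here is a toy sketch in Python. It is a heavy simplification and not the real Delta protocol. I am only assuming the general idea that each commit records which data files were added or removed, and that replaying the commits in order gives the current state of the table. The file names are hypothetical.

```python
# In a real log each commit is a file on storage; here each commit is
# simply a list of actions kept in memory.
commits = [
    # Version 0: the table is created with two data files.
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    # Version 1: one file is rewritten (old file removed, new file added).
    [{"remove": {"path": "part-000.parquet"}}, {"add": {"path": "part-002.parquet"}}],
]


def live_files(commits, version=None):
    """Replay commits 0..version and return the data files in the table."""
    if version is None:
        version = len(commits) - 1
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(commits)))     # ['part-001.parquet', 'part-002.parquet']
print(sorted(live_files(commits, 0)))  # ['part-000.parquet', 'part-001.parquet']
```

Because a reader only sees files listed by fully written commits, a half-finished write never shows up in the table, and replaying up to an older version gives you the table as it was at that point.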
7. Credits:
Credit goes to Ramesh Retnasamy, and I would like to thank Mary Loubele and Minita Dabhi, M.Eng. for constantly supporting and motivating me to write my first LinkedIn article.
I have given my best to writing this article. If you have any feedback, feel free to share it in the comments or let me know personally. Happy learning ;)