Building a Data-Driven Future: Part 1 - Five Approaches to Data Processing
This post is part of a five-blog series on how businesses can plan and build a data-driven future. The series walks readers through the basics, fundamental dynamics, and nuances of modern data management and processing. This post describes five approaches to data processing that anyone working with data should consider. In the next four posts we will cover a deep dive into ELT, why data is broken for enterprises, the current state of the data ecosystem, and the future of data with a look at emerging concepts like data fabric and data mesh.
As enterprises double down on becoming data-driven, data is being used for analytics, data science, and operations in every aspect of the business, be it product, sales, marketing, logistics, or HR. Different applications generate data in different ways and need to consume it in different ways.
How is data processing done today? Part 1 of this series breaks down the different ways in which data makes its way to various data-driven applications.
How Applications Work with Data
The purpose of any application is to serve a user's needs. An application both consumes and produces data. What is the right data system for such an application - database, files, API, or streams? This is a decision developers need to make, based on three key criteria - schema, format, and velocity:
- Schema: Attributes contained in a record, e.g. a contact record has name, phone, address attributes
- Format: The serialization of the data, e.g. JSON, XML, or CSV.
- Velocity: Batch, Streaming, or Real-time.
Based on these three criteria, developers choose the right data system - database, files, API, streams, etc. - for the data consumption and production steps.
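To make schema and format concrete, here is a minimal, illustrative Python sketch (not from the original post) that expresses the same contact record from the list above, with its name, phone, and address attributes, in two formats, JSON and CSV. Velocity would then describe how often such records are produced or consumed.

```python
import csv
import io
import json

# The same contact record (schema: name, phone, address) expressed in two formats.
contact = {"name": "Ada Lovelace", "phone": "+1-555-0100", "address": "12 Analytical Way"}

# Format 1: JSON, common for APIs and streams
as_json = json.dumps(contact)

# Format 2: CSV, common for files and batch exports
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "phone", "address"])
writer.writeheader()
writer.writerow(contact)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```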
The Data Processing Stage
Data produced by one application or system is often consumed by another application or system. This may be for analytical purposes, data science applications, or operational needs. An example of an operational use is when the data from inventory management systems in a store is used by the order management system.
In order to make data generated by one application usable in another application, two key data processing steps need to take place:
- Move data from one system to another. For example, to analyze log files you copy them to a data warehouse, or after you have shortlisted sales opportunities in a spreadsheet you send them to Salesforce. When moving data, the velocity may vary: it can be batch, real-time, or streaming.
- Modify the data to make it useful. For example, data is collected in local time zones, e.g. Pacific Standard Time (PST), but needs to be normalized to UTC.
E, T, L: Three Essential Steps of Data Processing
This is where E, T, and L come into play in the context of data processing:
E: Extract, or read, data from a source system, which could be a file, a database, an API, a stream, human input - anything.
T: Transform the data, meaning modify its attributes, schema, or format. This step can involve computations, joins, and enrichment, and in some cases also filtering, validating, and quality-checking the data.
L: Load the data into a target or destination system where you will use it. The target system can be a database, data warehouse, API, SaaS service, file, email, stream, or spreadsheet - anywhere the data becomes ready for consumption by another application.
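To ground these three steps, here is a small, hedged Python sketch. The file name activity.log, the comma-separated log layout, and the use of SQLite as a stand-in for a data warehouse are all assumptions made purely for illustration; the transform step normalizes Pacific time to UTC, as in the example above.

```python
import sqlite3
from datetime import datetime
from zoneinfo import ZoneInfo

# E: Extract - read raw events from an assumed "timestamp,page,status" log file
def extract(path):
    with open(path) as f:
        for line in f:
            ts, page, status = line.strip().split(",")
            yield {"ts": ts, "page": page, "status": status}

# T: Transform - normalize local Pacific time to UTC, as in the example above
def transform(record):
    local = datetime.fromisoformat(record["ts"]).replace(tzinfo=ZoneInfo("America/Los_Angeles"))
    record["ts"] = local.astimezone(ZoneInfo("UTC")).isoformat()
    return record

# L: Load - write the cleaned record into a target table (SQLite stands in for a warehouse)
def load(conn, record):
    conn.execute(
        "INSERT INTO page_views (ts, page, status) VALUES (?, ?, ?)",
        (record["ts"], record["page"], record["status"]),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (ts TEXT, page TEXT, status TEXT)")
for rec in extract("activity.log"):  # assumed input file
    load(conn, transform(rec))
conn.commit()
```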
Understanding the Five Styles of Data Processing
Now that we understand the three steps of E, T, and L, we can construct the five styles of data processing. Each of them is a variation of these three steps.
Data Push vs Pull
Push and pull are two common modes of data flow.
Data Pull: You request data, and it is fetched and delivered instantly. For example, you request the status of your FedEx package, and the latest information is pulled instantly to tell you where your package is.
Data Push: Data is delivered to you periodically or based on an event or rule. For example, when your package is shipped or delivered, a text message is sent to notify you.
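As a rough illustration of the two modes, the Python sketch below contrasts a pull, where the application requests the package status itself, with a push, where a handler simply waits to be called when an event occurs. The endpoint URL and field names are hypothetical.

```python
import json
import urllib.request

TRACKING_URL = "https://api.example.com/track"  # hypothetical tracking endpoint

# Pull: the consumer asks for the latest status whenever it needs it
def pull_status(package_id):
    with urllib.request.urlopen(f"{TRACKING_URL}/{package_id}") as resp:
        return json.load(resp)

# Push: the provider calls this handler when an event (shipped, delivered) occurs;
# the consumer only registers it and waits
def on_package_event(event):
    print(f"Package {event['package_id']} is now {event['status']}")
```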
Style 1: ETL (Extract-Transform-Load)
You may need to transform before you load in cases where:
- Data volume is large on extract, so you want to load only part of the data. For example, your web server generates terabytes of activity logs recording every visit to your website, but you only care about pages that failed to load. Here the transform step filters out the unneeded data (see the sketch after this list).
- Information may change often. For example, every visit to your page has an IP address, and you use a geo provider to map the IP address to the visitor's city or country. Since IP assignments can change dynamically, you need to perform this transform in real time rather than later, when the information may have become inaccurate.
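The sketch below illustrates the first case: a transform that filters web-server logs down to failed page loads before anything is loaded. This is a hedged, minimal example; the log layout, the position of the status code, and the warehouse client's insert_many method are assumptions, not a real API.

```python
# E: read raw web-server log lines
def extract(path):
    with open(path) as f:
        yield from f

# T: keep only failed page loads (HTTP 5xx) so terabytes of logs shrink before loading
def transform(lines):
    for line in lines:
        status = line.split()[-2]  # assumed position of the status code in the log format
        if status.startswith("5"):
            yield {"raw": line.strip(), "status": status}

# L: load only the filtered records into the warehouse
def load(records, warehouse):
    warehouse.insert_many(records)  # hypothetical warehouse client, for illustration only
```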
Style 2: ELT (Extract-Load-Transform)
This works well for cases where:
- Data replication cost is not a concern and the target model in the data warehouse is known or predefined. For example, when replicating data from Salesforce Accounts into Snowflake, it is easy to pre-define how the Salesforce objects will map to tables (most ELT tools have already done this for the user).
- The transform is performed by analysts who are comfortable writing SQL, as in the sketch below. This is essential when a powerful GUI-based transform is not available and code is the primary option.
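Here is a minimal sketch of the ELT pattern, assuming SQLite as a stand-in for Snowflake and illustrative field names: raw Salesforce Account records are loaded into a staging table first, and the transform is expressed afterwards as SQL running inside the warehouse.

```python
import json
import sqlite3

# SQLite stands in for Snowflake; table and field names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_accounts (id TEXT, name TEXT, payload TEXT)")

def load_raw(accounts):
    # L: replicate source records as-is into a staging table
    conn.executemany(
        "INSERT INTO raw_accounts (id, name, payload) VALUES (?, ?, ?)",
        [(a["Id"], a["Name"], json.dumps(a)) for a in accounts],
    )

def transform_in_warehouse():
    # T: analyst-written SQL reshapes the data after it has already been loaded
    conn.execute(
        "CREATE TABLE accounts AS SELECT id, UPPER(name) AS account_name FROM raw_accounts"
    )

load_raw([{"Id": "001", "Name": "Acme Corp"}])
transform_in_warehouse()
```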
Style 3: API Integration
API integration is the right choice when:
- You are connecting two applications or services. For example, every time a customer calls into your call center, you want a ticket to be automatically generated in your support system (sketched below).
- Data volume is not too large. While API integration also moves data from one service to another, it typically moves a small unit of data at a time. As such, most existing solutions are not meant to use API integration as a large-scale data flow mechanism.
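A rough sketch of the call-center example follows. The support system's endpoint and the payload fields are hypothetical; the point is that a single, small unit of data moves between two services via an API call.

```python
import json
import urllib.request

SUPPORT_API = "https://support.example.com/api/tickets"  # hypothetical endpoint

def on_customer_call(call):
    # A small unit of data moves from the call-center system to the support system
    ticket = {
        "subject": f"Call from {call['customer_name']}",
        "phone": call["phone"],
        "notes": call.get("notes", ""),
    }
    req = urllib.request.Request(
        SUPPORT_API,
        data=json.dumps(ticket).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # e.g. the id of the ticket that was created
```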
Streaming vs Real-time
While fast streaming and real-time may seem similar on the surface, there are some key differences to understand when thinking in terms of data velocity and responsiveness.
Streaming: Think of this as a queue. Data enters at the back of the queue and is consumed when it reaches the head. The time from data entering the queue to being consumed can be a snap, but don't be surprised if it takes longer when the queue is very full (think traffic jam). Even 20-30 seconds can feel like an eternity when you are trying to check the status of your FedEx package or online auction.
Real-time: Data is delivered instantly, typically within single-digit or tens of milliseconds. These systems are designed to avoid queues or other mechanisms that may introduce uncertain delays. Of course, an improperly designed real-time system can get slow and give you the same bad experience as waiting in a queue.
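The contrast can be sketched in a few lines of Python (an illustration, not a real streaming system): a streaming consumer reads events from the head of a queue, so its latency depends on the backlog, while a real-time lookup answers the request directly with no queue in the path.

```python
import queue

events = queue.Queue()  # stands in for a streaming system such as a message queue

# Streaming: consumers read from the head of the queue; latency depends on the backlog
def stream_consumer(handle):
    while True:
        event = events.get()  # waits until an event reaches the head of the queue
        handle(event)

# Real-time: the answer is served directly on request, with no queue in the path
def realtime_lookup(package_id, store):
    return store[package_id]  # e.g. an indexed key-value read returning in milliseconds
```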
Style 4: API Proxy
While not traditionally considered data processing, an API proxy isn't that much different, except that data flows between APIs and the flow is on-demand, i.e. real-time. This is ideal when:
- You are connecting to multiple similar APIs. For example, you pull package tracking status from multiple APIs, each with its own authentication and format. A proxy layer can unify these variations to present a single authentication scheme and format to the application consuming this information (see the sketch after this list).
- Speed is real-time. When you are serving a live user request, the API-to-API flow has to be instantaneous and cannot incur more than a few milliseconds of latency.
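Below is a hedged sketch of a proxy layer in Python. The carrier names, endpoints, and response fields are hypothetical; the proxy's job is simply to hide those differences and return one normalized schema to the consuming application.

```python
import json
import urllib.request

# Hypothetical carriers, endpoints, and response fields
CARRIERS = {
    "carrier_a": {"url": "https://a.example.com/track/{id}", "status_field": "state"},
    "carrier_b": {"url": "https://b.example.com/v2/status/{id}", "status_field": "delivery_status"},
}

def track(carrier, package_id):
    cfg = CARRIERS[carrier]
    with urllib.request.urlopen(cfg["url"].format(id=package_id)) as resp:
        raw = json.load(resp)
    # Normalize each carrier's response into one schema for the consuming application
    return {"carrier": carrier, "package_id": package_id, "status": raw[cfg["status_field"]]}
```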
Style 5: Data-as-a-Service
Here the final data consumption happens through an API, and just like an API proxy, the data is delivered on demand:
- Data is served from a large store. Typically a database or fast data warehouse is the ultimate source of data. For example, you are an equity research provider with ratings for different stocks generated via analytics and ML. DaaS helps you create APIs to this data that you can provide to your customers, who can then embed these APIs in their live applications (sketched below).
- Real-time speed. Data is requested by an application via an API and delivered within a few milliseconds. This also requires transforming the data into an API structure, which in turn requires an underlying real-time flow of data.
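A minimal DaaS sketch follows, assuming Flask for the API layer and SQLite holding a stock_ratings table produced elsewhere by analytics or ML; both are stand-ins chosen only for illustration.

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
db = sqlite3.connect("ratings.db", check_same_thread=False)  # assumed store of ML-generated ratings

@app.route("/ratings/<ticker>")
def get_rating(ticker):
    row = db.execute(
        "SELECT ticker, rating, updated_at FROM stock_ratings WHERE ticker = ?", (ticker,)
    ).fetchone()
    if row is None:
        return jsonify({"error": "unknown ticker"}), 404
    return jsonify({"ticker": row[0], "rating": row[1], "updated_at": row[2]})

if __name__ == "__main__":
    app.run()
```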
Nexla Advantage: A Holistic Approach to Data Processing
The five styles of data processing essentially address the variations in requirements around the velocity of data flow and the sequence of the E, L, and T steps.
Nexla was built with a vision of data flow across organizations that is orthogonal to how other companies approach data processing. Our approach elegantly solves the complex problem of addressing every possible variation of data processing with ease of use and simplicity. Nexsets provide a logical view of data that abstracts format, schema, and velocity. The result is a simple yet powerful, converged approach to data processing.
_______________________________
In the next post, I will take a deeper look at ELT. ELT has skyrocketed in popularity and for good reasons. But with increasing popularity, often comes hype where shortcomings get overlooked. Stay tuned to learn about when and how to best leverage ELT.
This article is a cross-post from the Nexla blog: Building a Data-Driven Future: Part 1 - Five Approaches to Data Processing.