Building a Data-Driven Future: Part 1 - Five Approaches to Data Processing
This post is part of a five-blog series on how businesses can plan and build a data-driven future. The series walks readers through the basics, fundamental dynamics, and nuances of modern data management and processing. This post describes five approaches to data processing that anyone working with data should consider. In the next four posts we will cover a deep dive into ELT, why data is broken for enterprises, the current state of the data ecosystem, and the future of data with a look at emerging concepts like data fabric and data mesh.
As enterprises double down on becoming data-driven, data is being used for analytics, data science, and operations in every aspect of the business, be it product, sales, marketing, logistics, or HR. Different applications generate data in different ways and need to consume it in different ways.
How is data processing done today? Part 1 of this series breaks down the different ways in which data makes its way to various data-driven applications.
How Applications Work with Data
The purpose of any application is to serve a user's needs. An application both consumes and produces data. What is the right data system for such an application - database, files, API, or streams? This is a decision developers need to make, based on three key criteria - schema, format, and velocity:
- Schema: Attributes contained in a record, e.g. a contact record has name, phone, address attributes
- Format: The serialization of the data, e.g. JSON, XML, or CSV.
- Velocity: Batch, Streaming, or Real-time.
Based on these three criteria, developers choose the right data system - database, files, API, streams, etc. - for the data consumption and production steps.
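To make schema and format concrete, here is a minimal, illustrative Python sketch (not from the original post) that expresses the same contact record from the list above, with its name, phone, and address attributes, in two formats, JSON and CSV. Velocity would then describe how often such records are produced or consumed.

```python
import csv
import io
import json

# The same contact record (schema: name, phone, address) expressed in two formats.
contact = {"name": "Ada Lovelace", "phone": "+1-555-0100", "address": "12 Analytical Way"}

# Format 1: JSON, common for APIs and streams
as_json = json.dumps(contact)

# Format 2: CSV, common for files and batch exports
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "phone", "address"])
writer.writeheader()
writer.writerow(contact)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```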
The Data Processing Stage
Data produced by one application or system is often consumed by another application or system. This may be for analytical purposes, data science applications, or operational needs. An example of an operational use is when the data from inventory management systems in a store is used by the order management system.
In order to make data generated by one application usable in another application, two key data processing steps need to take place:
- Move data from one system to another. For example, to analyze log files you copy them to a data warehouse, or after you have shortlisted sales opportunities in a spreadsheet you send them to Salesforce. When moving data, the velocity may vary: it can be batch, real-time, or streaming.
- Modify the data to make it useful. For example, data is collected in local time zones, e.g. Pacific Standard Time (PST), but needs to be normalized to UTC.
E, T, L: Three Essential Steps of Data Processing
This is where E, T, and L come into play in the context of data processing:
E: Extract, or read, data from a source system, which could be a file, a database, an API, a stream, human input - anything.
T: Transform the data, meaning modify its attributes, schema, or format. This step can involve computations, joins, and enrichment, and in some cases also filtering, validating, and quality-checking the data.
L: Load the data into a target or destination system where you will use it. The target system can be a database, data warehouse, API, SaaS service, file, email, stream, or spreadsheet - anywhere the data becomes ready for consumption by another application.
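To ground these three steps, here is a small, hedged Python sketch. The file name activity.log, the comma-separated log layout, and the use of SQLite as a stand-in for a data warehouse are all assumptions made purely for illustration; the transform step normalizes Pacific time to UTC, as in the example above.

```python
import sqlite3
from datetime import datetime
from zoneinfo import ZoneInfo

# E: Extract - read raw events from an assumed "timestamp,page,status" log file
def extract(path):
    with open(path) as f:
        for line in f:
            ts, page, status = line.strip().split(",")
            yield {"ts": ts, "page": page, "status": status}

# T: Transform - normalize local Pacific time to UTC, as in the example above
def transform(record):
    local = datetime.fromisoformat(record["ts"]).replace(tzinfo=ZoneInfo("America/Los_Angeles"))
    record["ts"] = local.astimezone(ZoneInfo("UTC")).isoformat()
    return record

# L: Load - write the cleaned record into a target table (SQLite stands in for a warehouse)
def load(conn, record):
    conn.execute(
        "INSERT INTO page_views (ts, page, status) VALUES (?, ?, ?)",
        (record["ts"], record["page"], record["status"]),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (ts TEXT, page TEXT, status TEXT)")
for rec in extract("activity.log"):  # assumed input file
    load(conn, transform(rec))
conn.commit()
```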
Understanding the Five Styles of Data Processing
Now that we understand the three steps of E, T, and L, we can construct the five styles of data processing. Each of them is a variation of these three steps.
Data Push vs Pull
Push and pull are two common modes of data flow.
Data Pull: You request data, and it is fetched and delivered instantly. For example, you request the status of your FedEx package, and the latest information is pulled instantly to tell you where your package is.
Data Push: Data is delivered to you periodically or based on an event or rule. For example, when your package is shipped or delivered, a text message is sent to notify you.
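As a rough illustration of the two modes, the Python sketch below contrasts a pull, where the application requests the package status itself, with a push, where a handler simply waits to be called when an event occurs. The endpoint URL and field names are hypothetical.

```python
import json
import urllib.request

TRACKING_URL = "https://api.example.com/track"  # hypothetical tracking endpoint

# Pull: the consumer asks for the latest status whenever it needs it
def pull_status(package_id):
    with urllib.request.urlopen(f"{TRACKING_URL}/{package_id}") as resp:
        return json.load(resp)

# Push: the provider calls this handler when an event (shipped, delivered) occurs;
# the consumer only registers it and waits
def on_package_event(event):
    print(f"Package {event['package_id']} is now {event['status']}")
```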
Style 1: ETL (Extract-Transform-Load)
You may need to transform before you load in cases where:
- Data volume is large on extract, so you want to load only part of the data. For example, your web server generates terabytes of activity logs recording every visit to your website, but you only care about pages that failed to load. Here the transform step filters out the unneeded data (see the sketch after this list).
- Information may change often. For example, every visit to your page has an IP address, and you use a geo provider to map the IP address to the visitor's city or country. Since IP assignments can change dynamically, you need to perform this transform in real time rather than later, when the information may have become inaccurate.
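The sketch below illustrates the first case: a transform that filters web-server logs down to failed page loads before anything is loaded. This is a hedged, minimal example; the log layout, the position of the status code, and the warehouse client's insert_many method are assumptions, not a real API.

```python
# E: read raw web-server log lines
def extract(path):
    with open(path) as f:
        yield from f

# T: keep only failed page loads (HTTP 5xx) so terabytes of logs shrink before loading
def transform(lines):
    for line in lines:
        status = line.split()[-2]  # assumed position of the status code in the log format
        if status.startswith("5"):
            yield {"raw": line.strip(), "status": status}

# L: load only the filtered records into the warehouse
def load(records, warehouse):
    warehouse.insert_many(records)  # hypothetical warehouse client, for illustration only
```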
Style 2: ELT (Extract-Load-Transform)
This works well for cases where:
- Data replication cost is not a concern and the target model in the data warehouse is known or predefined. For example, when replicating data from Salesforce Accounts into Snowflake, it is easy to pre-define how the Salesforce objects will map to tables (most ELT tools have already done this for the user).
- The transform is performed by analysts who are comfortable writing SQL, as in the sketch below. This is essential when a powerful GUI-based transform is not available and code is the primary option.
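Here is a minimal sketch of the ELT pattern, assuming SQLite as a stand-in for Snowflake and illustrative field names: raw Salesforce Account records are loaded into a staging table first, and the transform is expressed afterwards as SQL running inside the warehouse.

```python
import json
import sqlite3

# SQLite stands in for Snowflake; table and field names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_accounts (id TEXT, name TEXT, payload TEXT)")

def load_raw(accounts):
    # L: replicate source records as-is into a staging table
    conn.executemany(
        "INSERT INTO raw_accounts (id, name, payload) VALUES (?, ?, ?)",
        [(a["Id"], a["Name"], json.dumps(a)) for a in accounts],
    )

def transform_in_warehouse():
    # T: analyst-written SQL reshapes the data after it has already been loaded
    conn.execute(
        "CREATE TABLE accounts AS SELECT id, UPPER(name) AS account_name FROM raw_accounts"
    )

load_raw([{"Id": "001", "Name": "Acme Corp"}])
transform_in_warehouse()
```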
Style 3: API Integration
API integration is the right choice when:
- You are connecting two applications or services. For example, every time a customer calls into your call center, you want a ticket to be automatically generated in your support system (sketched below).
- Data volume is not too large. While API integration also moves data from one service to another, it typically moves a small unit of data at a time. As such, most existing solutions are not meant to use API integration as a large-scale data flow mechanism.
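A rough sketch of the call-center example follows. The support system's endpoint and the payload fields are hypothetical; the point is that a single, small unit of data moves between two services via an API call.

```python
import json
import urllib.request

SUPPORT_API = "https://support.example.com/api/tickets"  # hypothetical endpoint

def on_customer_call(call):
    # A small unit of data moves from the call-center system to the support system
    ticket = {
        "subject": f"Call from {call['customer_name']}",
        "phone": call["phone"],
        "notes": call.get("notes", ""),
    }
    req = urllib.request.Request(
        SUPPORT_API,
        data=json.dumps(ticket).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # e.g. the id of the ticket that was created
```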
Streaming vs Real-time
While fast streaming and real-time may seem similar on the surface, there are some key differences to understand when thinking in terms of data velocity and responsiveness.
Streaming: Think of this as a queue. Data enters at the back of the queue and is consumed when it reaches the head. The time from data entering the queue to being consumed can be a snap, but don't be surprised if it takes longer when the queue is very full (think traffic jam). Even 20-30 seconds can feel like an eternity when you are trying to check the status of your FedEx package or online auction.
Real-time: Data is delivered instantly, typically within single-digit or tens of milliseconds. These systems are designed to avoid queues or other mechanisms that may introduce uncertain delays. Of course, an improperly designed real-time system can get slow and give you the same bad experience as waiting in a queue.
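The contrast can be sketched in a few lines of Python (an illustration, not a real streaming system): a streaming consumer reads events from the head of a queue, so its latency depends on the backlog, while a real-time lookup answers the request directly with no queue in the path.

```python
import queue

events = queue.Queue()  # stands in for a streaming system such as a message queue

# Streaming: consumers read from the head of the queue; latency depends on the backlog
def stream_consumer(handle):
    while True:
        event = events.get()  # waits until an event reaches the head of the queue
        handle(event)

# Real-time: the answer is served directly on request, with no queue in the path
def realtime_lookup(package_id, store):
    return store[package_id]  # e.g. an indexed key-value read returning in milliseconds
```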
Style 4: API Proxy
While not traditionally considered data processing, an API proxy isn't that much different, except that data flows between APIs and the flow is on-demand, i.e. real-time. This is ideal when:
- You are connecting to multiple similar APIs. For example, you pull package tracking status from multiple APIs, each with its own authentication and format. A proxy layer can unify these variations to present a single authentication scheme and format to the application consuming this information (see the sketch after this list).
- Speed is real-time. When you are serving a live user request, the API-to-API flow has to be instantaneous and cannot incur more than a few milliseconds of latency.
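Below is a hedged sketch of a proxy layer in Python. The carrier names, endpoints, and response fields are hypothetical; the proxy's job is simply to hide those differences and return one normalized schema to the consuming application.

```python
import json
import urllib.request

# Hypothetical carriers, endpoints, and response fields
CARRIERS = {
    "carrier_a": {"url": "https://a.example.com/track/{id}", "status_field": "state"},
    "carrier_b": {"url": "https://b.example.com/v2/status/{id}", "status_field": "delivery_status"},
}

def track(carrier, package_id):
    cfg = CARRIERS[carrier]
    with urllib.request.urlopen(cfg["url"].format(id=package_id)) as resp:
        raw = json.load(resp)
    # Normalize each carrier's response into one schema for the consuming application
    return {"carrier": carrier, "package_id": package_id, "status": raw[cfg["status_field"]]}
```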
Style 5: Data-as-a-Service
Here the final data consumption happens through an API, and just like an API proxy, the data is delivered on demand:
- Data is served from a large store. Typically a database or fast data warehouse is the ultimate source of data. For example, you are an equity research provider with ratings for different stocks generated via analytics and ML. DaaS helps you create APIs to this data that you can provide to your customers, who can then embed these APIs in their live applications (sketched below).
- Real-time speed. Data is requested by an application via an API and delivered within a few milliseconds. This also requires transforming the data into an API structure, which in turn requires an underlying real-time flow of data.
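A minimal DaaS sketch follows, assuming Flask for the API layer and SQLite holding a stock_ratings table produced elsewhere by analytics or ML; both are stand-ins chosen only for illustration.

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
db = sqlite3.connect("ratings.db", check_same_thread=False)  # assumed store of ML-generated ratings

@app.route("/ratings/<ticker>")
def get_rating(ticker):
    row = db.execute(
        "SELECT ticker, rating, updated_at FROM stock_ratings WHERE ticker = ?", (ticker,)
    ).fetchone()
    if row is None:
        return jsonify({"error": "unknown ticker"}), 404
    return jsonify({"ticker": row[0], "rating": row[1], "updated_at": row[2]})

if __name__ == "__main__":
    app.run()
```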
Nexla Advantage: A Holistic Approach to Data Processing
The five styles of data processing essentially address the variations in requirements around the velocity of data flow and the sequence of the E, L, and T steps.
Nexla was built with a vision of data flow across organizations that is orthogonal to how other companies approach data processing. Our approach elegantly solves the complex problem of addressing every possible variation of data processing with ease of use and simplicity. Nexsets provide a logical view of data that abstracts format, schema, and velocity. The result is a simple yet powerful, converged approach to data processing.
_______________________________
In the next post, I will take a deeper look at ELT. ELT has skyrocketed in popularity and for good reasons. But with increasing popularity, often comes hype where shortcomings get overlooked. Stay tuned to learn about when and how to best leverage ELT.
This article is a cross-post from the Nexla blog: Building a Data-Driven Future: Part 1 - Five Approaches to Data Processing.