Spark-Beyond Basics: Liquid Clustering in Delta tables

Aman Dahiya

Senior Azure Data Engineer | Microsoft Certified-Azure Data Engineer| Databricks | Pyspark | SQL |Python-Pandas

Published Jul 17, 2024

Before we discuss anything about Liquid Clustering, you have to know about Z-Ordering. (Don't you worry, I got you 😉)

The question you might ask is: when we already had Z-Ordering, why Liquid Clustering? 🤔

Shortcomings of Z-Ordering

Once you have partitioned and Z-Ordered the table on a given column, you can't change the Z-Order column without rewriting your data.
ZORDER has significant write amplification, because it is not incremental and cannot be done on write. Meaning, you can't Z-Order only the new rows and append them to the existing table; you have to append the new rows and then run Z-Order on the updated table.
The choice of column decides how good or how poor your query performance will be after applying Z-Order. (If the column is of low cardinality, well, you are doomed! 🥲)
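
To make the second point concrete, here is a rough sketch of the append-then-reorder cycle that Z-Ordering pushes you into. The table names `events` and `new_events` and the column `event_time` are made up for illustration; they are not from any real workload.

```sql
-- Hypothetical tables: events (target) and new_events (staging).

-- Step 1: append the new rows. Nothing is Z-Ordered at write time,
-- so these rows simply land in new files.
INSERT INTO events
SELECT * FROM new_events;

-- Step 2: run Z-Ordering as a separate maintenance step after the write.
OPTIMIZE events ZORDER BY (event_time);
```

Every batch of new data needs both steps, which is where the extra write cost comes from.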

These are shortcomings you won't face right after implementing Z-Ordering, but they'll hit you eventually, and by then it will be too late to fix them! 🤐

And hence, Databricks came up with Liquid Clustering.

Liquid Clustering

Liquid Clustering is an innovative data management technique that replaces table partitioning and ZORDER. 🤔

The shortcomings of Z-Order are all addressed by Liquid Clustering, and much more, by figuring out the right data layout for you. 🤨

Benefits of Liquid Clustering

Benefit 1

NO NEED TO CONSIDER CARDINALITY WHILE CHOOSING THE COLUMN! 😱 Meaning, you can select any column to “cluster by” based on your query pattern. Liquid will take care of skew, producing consistent file sizes and avoiding over- and under-partitioning.
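
As a small sketch of "cluster by your query pattern": suppose most queries filter on a customer and a time range. The `orders` table and its columns below are hypothetical, picked only to show the idea.

```sql
-- Hypothetical table: cluster on the columns that queries filter on,
-- without worrying about whether they are high or low cardinality.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    TIMESTAMP
) CLUSTER BY (customer_id, order_ts);

-- A typical query that filters on the clustering columns.
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_ts >= TIMESTAMP '2024-07-01';
```

The point of the design is that the clustering keys follow the filters in your queries, instead of you hunting for a "safe" partition column.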

Benefit 2

WRITING CLUSTERED DATA IS WAYYYY FASTER! 🫣 Liquid offers low write amplification. Databricks reports that Liquid achieves 7x faster write times than partitioning + Z-Order.

Moreover, using DatabricksIQ, we can apply Liquid Clustering at write time on new data during ingestion. This means we don't have to append first and then apply clustering on the complete table. Clustering happens on the go! 😏
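
Here is roughly what that looks like on a table that is already clustered. This is a minimal sketch: `table1` is the example table created later in this post, and `new_rows` is a hypothetical staging table with the same schema. Which write operations actually cluster on write can depend on your runtime version and on how much data you write, so treat that part as an assumption to verify in your own workspace.

```sql
-- Append new data to the clustered table.
-- No ZORDER step and no full-table rewrite afterwards.
INSERT INTO table1
SELECT * FROM new_rows;  -- new_rows is a hypothetical staging table

-- On a Liquid-clustered table, a plain OPTIMIZE (no ZORDER BY clause)
-- triggers clustering for data that still needs it.
OPTIMIZE table1;
```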

Benefit 3

ROW-LEVEL CONCURRENCY WITH LIQUID CLUSTERING! Databricks says it is the only Lakehouse that offers row-level concurrency (parallel writes that don't conflict as long as they touch different rows). And now, we can combine Liquid with this to perform clustering with row-level concurrency.

Some of these benefits can be implemented manually (with extra headache) alongside Z-Ordering. But why take a manual approach when a better, more robust technique already exists? 😎😎

Note: Liquid Clustering is generally available from DBR 15.2 onwards (it was in public preview on earlier runtimes, starting with DBR 13.3 LTS).

-- Create a new table clustered by column t
CREATE TABLE table1 (t TIMESTAMP, s STRING) CLUSTER BY (t);
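
And remember the first shortcoming of Z-Ordering, where you couldn't change the column without rewriting data? A short sketch of how that looks with Liquid, reusing `table1` from above. As far as I understand it, changing the keys only affects how data is clustered from then on; existing files are not rewritten on the spot.

```sql
-- Change the clustering column from t to s.
-- This is a metadata change; no manual table rewrite is needed.
ALTER TABLE table1 CLUSTER BY (s);

-- Trigger clustering with the new key.
OPTIMIZE table1;

-- Inspect the table details and check the clustering columns in the output.
DESCRIBE DETAIL table1;

-- If you ever want to turn clustering off:
ALTER TABLE table1 CLUSTER BY NONE;
```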

That sums up the latest tech rollout by Databricks: Liquid Clustering.

If you liked the blog, please clap 👏 so it reaches all the Data Engineers.
