Spark-Beyond Basics: Liquid Clustering in Delta tables

Before we discuss anything about Liquid Clustering, you have to know about Z-Ordering. (Don't you worry, I got you 😉)


The question you might ask is: when we already had Z-Ordering, why Liquid Clustering? 🤔

Shortcomings of Z-Ordering

  1. Once you have partitioned a table and Z-Ordered it on a column, you can't change the Z-Order column without rewriting your data.
  2. ZORDER has significant write amplification, because it is not incremental and cannot be done on write. In other words, you can't Z-Order just the new rows and append them to the existing table; you have to append the new rows and then run Z-Order on the updated table.
  3. The choice of column decides how good or how poor your query performance will be after applying Z-Order. (If the column has low cardinality, well, you are doomed! 🥲)

These are shortcomings you won't face right after implementing Z-Ordering, but they'll hit you eventually, and by then it will be too late to fix them! 🤐
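To make shortcoming 2 concrete, here is a rough sketch of the append-then-Z-Order routine. The table name events and its columns are made up for illustration; OPTIMIZE ... ZORDER BY is the regular Databricks SQL command.

```sql
-- Hypothetical table, used only for illustration
CREATE TABLE events (event_time timestamp, user_id string, event_type string);

-- Step 1: append the new rows (they land without any Z-Order applied)
INSERT INTO events VALUES (current_timestamp(), 'u-42', 'click');

-- Step 2: run Z-Order as a separate pass after the write;
-- it cannot happen as part of the INSERT itself
OPTIMIZE events ZORDER BY (user_id);
```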

And hence, Databricks came up with Liquid Clustering.

Liquid Clustering

Liquid Clustering is an innovative data management technique that replaces table partitioning and ZORDER. 🤔

The shortcomings of Z-Order are all addressed by Liquid Clustering, and much more, by figuring out the right data layout for you. 🤨
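For example, the clustering column is no longer locked in. A minimal sketch, assuming the table1 (columns t and s) created at the end of this post:

```sql
-- Switch the clustering key from t to s;
-- this statement by itself does not rewrite the existing files
ALTER TABLE table1 CLUSTER BY (s);

-- Later OPTIMIZE runs and new writes cluster by the new key
OPTIMIZE table1;
```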

Benefits of Liquid Clustering

Benefit 1

NO NEED TO CONSIDER CARDINALITY WHILE CHOOSING THE COLUMN! 😱 Meaning, you can select any column to “cluster by” based on your query pattern. Liquid will take care of skew, producing consistent file sizes and avoiding over- and under-partitioning.

Benefit 2

WRITING CLUSTERED DATA IS WAYYYY FASTER! 🫣 Liquid offers low write amplification. In Databricks' benchmarks, Liquid achieves 7x faster write times than partitioning + Z-Order.

Moreover, using DatabricksIQ, we can apply Liquid Clustering at write time on new data during ingestion. This means we don't have to append and then apply clustering to the complete table. Clustering happens on the go! 😏
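A small sketch of what that looks like day to day, again assuming table1(t, s) from this post. Whether a given write gets clustered on the fly depends on the operation and the data size, so treat the comments as expectations rather than guarantees:

```sql
-- New rows are appended; eligible writes are clustered at write time (best effort)
INSERT INTO table1 VALUES (current_timestamp(), 'new row');

-- On a liquid table, OPTIMIZE is incremental:
-- it only rewrites data that still needs clustering (no ZORDER BY clause needed)
OPTIMIZE table1;
```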

Benefit 3

ROW-LEVEL CONCURRENCY WITH LIQUID CLUSTERING! Databricks says it is the only Lakehouse that offers row-level concurrency (parallel writes to the same table that conflict only when they touch the same rows). And now we can combine Liquid with this to perform clustering with row-level concurrency.
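Here is a rough illustration, assuming table1(t, s) from this post and two sessions running at the same time. As far as I know, row-level concurrency relies on deletion vectors, so the sketch assumes they have not been turned off on the table:

```sql
-- Session A: touches only rows where s = 'a'
UPDATE table1 SET s = 'a-done' WHERE s = 'a';

-- Session B, running concurrently: touches only rows where s = 'b'
UPDATE table1 SET s = 'b-done' WHERE s = 'b';

-- With row-level concurrency the two writes should not conflict,
-- even if the affected rows happen to sit in the same data file.
```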

Some of these benefits can be implemented manually (with extra headache) alongside Z-Ordering. But why take a manual approach when a better, more robust technique already exists? 😎😎

Note: Liquid Clustering is generally available from DBR 15.2 onwards (it was in public preview from DBR 13.3 LTS).

-- Creating a new table
CREATE TABLE table1(t timestamp, s string) CLUSTER BY (t);
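
A few companion statements, sketched under the assumption that table1 exists as created above; table2 is a made-up name:

```sql
-- Create a clustered copy with CTAS (table2 is hypothetical)
CREATE TABLE table2 CLUSTER BY (s) AS SELECT * FROM table1;

-- Check which columns a table is clustered by (look at the clusteringColumns field)
DESCRIBE DETAIL table1;

-- Turn clustering off for future writes
ALTER TABLE table1 CLUSTER BY NONE;
```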



That sums up the latest tech rollout by Databricks: Liquid Clustering.

If you liked the blog, please clap 👏 to help it reach all the Data Engineers.

Thanks for reading! 😁

More articles by Aman Dahiya