Spark-Beyond Basics: Liquid Clustering in Delta tables
Before we discuss anything about Liquid Clustering, you have to know about Z-Ordering (don't you worry, I got you 😉).
The question you might ask is: when we already had Z-Ordering, why Liquid Clustering? 🤔
Shortcomings of Z-Ordering
These are shortcomings you won't face right after implementing Z-Ordering, but they'll hit you eventually, and by then it will be too late to fix them! 🤐
And hence, Databricks came up with Liquid Clustering.
Liquid Clustering
Liquid Clustering is an innovative data management technique that replaces table partitioning and ZORDER. 🤔
The shortcomings of Z-Order are all addressed by Liquid Clustering, and much more, because it figures out the right data layout for you. 🤨
Benefits of Liquid Clustering
Benefit 1
NO NEED TO CONSIDER CARDINALITY WHILE CHOOSING THE COLUMN! 😱 Meaning, you can select any column to “cluster by” based on your query pattern. Liquid will take care of skew, produce consistent file sizes, and avoid over- and under-partitioning.
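To make that concrete, here is a minimal sketch of clustering by a high-cardinality column. The `events` table and its columns are hypothetical and made up for illustration; only the `CLUSTER BY` clause comes from the syntax shown later in this post.

```sql
-- Hypothetical table: user_id has very high cardinality, so it would be
-- a poor choice for PARTITIONED BY, but it is fine as a clustering key.
CREATE TABLE events (
  user_id    BIGINT,
  event_time TIMESTAMP,
  payload    STRING
) CLUSTER BY (user_id);

-- A query that filters on the clustering column is the kind of access
-- pattern the layout is meant to serve.
SELECT * FROM events WHERE user_id = 42;
```

The idea is simply to pick the column your queries filter on most often, instead of hunting for a column with "just the right" number of distinct values.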
Benefit 2
WRITING CLUSTERED DATA IS WAYYYY FASTER! 🫣 Liquid offers low write amplification. Databricks reports that Liquid achieves 7x faster write times than partitioning + Z-Order.
Moreover, using DatabricksIQ, we can apply Liquid Clustering at write time on new data during ingestion. This means we don't have to append first and then apply clustering on the complete table. Clustering happens on the go! 😏
Benefit 3
ROW-LEVEL CONCURRENCY WITH LIQUID CLUSTERING! Databricks is the only Lakehouse that offers row-level concurrency (that is, parallel writes to the same table). And now we can combine Liquid with this to perform clustering with row-level concurrency.
Some of these benefits can be implemented manually (with extra headache) alongside Z-Ordering. But why take a manual approach when a better, more robust technique already exists? 😎😎
Note: Liquid Clustering is generally available from DBR 15.2 onwards (it was in public preview on earlier runtimes, starting with DBR 13.3 LTS).
-- Creating a new table
CREATE TABLE table1(t timestamp, s string) CLUSTER BY (t);
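If the table already exists, or the query pattern changes later, the clustering keys can be changed too. Below is a rough sketch that reuses the `table1` example above; switching the key to the column `s` is just an assumption for illustration.

```sql
-- Change the clustering key on an existing Delta table.
-- The ALTER itself does not rewrite existing data.
ALTER TABLE table1 CLUSTER BY (s);

-- Trigger clustering (run this periodically, e.g. from a scheduled job).
OPTIMIZE table1;

-- Check which columns the table is currently clustered by
-- (see the clusteringColumns field in the output).
DESCRIBE DETAIL table1;

-- Turn clustering off again if it is no longer needed.
ALTER TABLE table1 CLUSTER BY NONE;
```

Because the keys are not baked into the directory layout the way partitions are, changing them is a metadata change followed by an `OPTIMIZE`, not a full table rebuild by hand.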
That sums up the latest tech rollout by Databricks: Liquid Clustering.
If you liked the blog, please clap 👏 to help it reach all the Data Engineers.
Thanks for reading! 😁