Spark - repartition() vs coalesce()

1. Repartitioning is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that minimizes data movement compared to repartition(); you can use it to decrease the number of RDD partitions.

2. With repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased (see the sketch after the coalesce example below).

3. coalesce() minimizes data movement. Because the number of partitions is known to be decreasing, the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept.

Example:

Node 1 = 1,2,3

Node 2 = 4,5,6

Node 3 = 7,8,9

Node 4 = 10,11,12

If I use coalesce(2), I will get this result:

Node 1 = 1,2,3 + (10,11,12)

Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
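Here is a minimal Scala sketch of the same idea. It assumes a local SparkSession named spark and the 12 example values above spread across 4 partitions; the exact grouping of partitions after coalesce() can vary between runs and cluster layouts.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("coalesce-demo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 12, numSlices = 4)

// repartition() can grow or shrink the partition count; coalesce() can only shrink it
println(rdd.repartition(8).getNumPartitions)  // 8
println(rdd.coalesce(8).getNumPartitions)     // still 4 -- coalesce() will not increase partitions
println(rdd.coalesce(2).getNumPartitions)     // 2

// coalesce(2) merges whole parent partitions in place, without a full shuffle
val coalesced = rdd.coalesce(2)
coalesced.glom().collect().foreach(part => println(part.mkString(",")))
println(coalesced.toDebugString)  // narrow CoalescedRDD lineage, no shuffle stage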

4. The repartition() algorithm does a full shuffle and creates new partitions with data that is distributed evenly.

Example:

Partition 00000: 1, 2, 3

Partition 00001: 4, 5, 6

Partition 00002: 7, 8, 9

Partition 00003: 10, 11, 12

If I use repartition(2), I will get this result:

Partition A: 1, 3, 4, 6, 7, 9, 10, 12

Partition B: 2, 5, 8, 11

The repartition() method creates entirely new partitions and shuffles all the data across them.
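By contrast, here is a rough sketch of repartition(2) on the same data, again assuming a local SparkSession named spark; the output is spread roughly evenly across the two new partitions, though the exact values per partition depend on how the shuffle distributes them.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("repartition-demo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 12, numSlices = 4)

// repartition(2) performs a full shuffle: every partition's data is redistributed,
// so the resulting partitions do not correspond to any of the original ones
val repartitioned = rdd.repartition(2)
repartitioned.glom().collect().foreach(part => println(part.mkString(",")))
println(repartitioned.toDebugString)  // lineage contains a ShuffledRDD, i.e. a full shuffle

As a rule of thumb, coalesce() is a good fit after a filter that leaves many small partitions, while repartition() is the choice when you need to increase parallelism or rebalance skewed data, at the cost of a full shuffle.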
