Spark - repartition() vs coalesce()

1. Repartitioning is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that minimizes data movement compared to repartition(); you can use it to decrease the number of RDD partitions.

2. With repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased (see the sketch after the coalesce example below).

3. coalesce() minimizes data movement. Because the number of partitions is known to be decreasing, the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept.

Example:

Node 1 = 1,2,3

Node 2 = 4,5,6

Node 3 = 7,8,9

Node 4 = 10,11,12

If I use coalesce(2), I will get this result:

Node 1 = 1,2,3 + (10,11,12)

Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
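Here is a minimal Scala sketch of the same idea. It assumes a local SparkSession named spark and the 12 example values above spread across 4 partitions; the exact grouping of partitions after coalesce() can vary between runs and cluster layouts.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("coalesce-demo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 12, numSlices = 4)

// repartition() can grow or shrink the partition count; coalesce() can only shrink it
println(rdd.repartition(8).getNumPartitions)  // 8
println(rdd.coalesce(8).getNumPartitions)     // still 4 -- coalesce() will not increase partitions
println(rdd.coalesce(2).getNumPartitions)     // 2

// coalesce(2) merges whole parent partitions in place, without a full shuffle
val coalesced = rdd.coalesce(2)
coalesced.glom().collect().foreach(part => println(part.mkString(",")))
println(coalesced.toDebugString)  // narrow CoalescedRDD lineage, no shuffle stage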

4. The repartition() algorithm does a full shuffle and creates new partitions with data that is distributed evenly.

Example:

Partition 00000: 1, 2, 3

Partition 00001: 4, 5, 6

Partition 00002: 7, 8, 9

Partition 00003: 10, 11, 12

If I use repartition(2), I will get this result:

Partition A: 1, 3, 4, 6, 7, 9, 10, 12

Partition B: 2, 5, 8, 11

The repartition() method creates entirely new partitions and shuffles all the data across them.
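By contrast, here is a rough sketch of repartition(2) on the same data, again assuming a local SparkSession named spark; the output is spread roughly evenly across the two new partitions, though the exact values per partition depend on how the shuffle distributes them.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("repartition-demo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 12, numSlices = 4)

// repartition(2) performs a full shuffle: every partition's data is redistributed,
// so the resulting partitions do not correspond to any of the original ones
val repartitioned = rdd.repartition(2)
repartitioned.glom().collect().foreach(part => println(part.mkString(",")))
println(repartitioned.toDebugString)  // lineage contains a ShuffledRDD, i.e. a full shuffle

As a rule of thumb, coalesce() is a good fit after a filter that leaves many small partitions, while repartition() is the choice when you need to increase parallelism or rebalance skewed data, at the cost of a full shuffle.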
