Spark Job Optimisation
Optimising Spark jobs means improving resource utilisation and execution efficiency to reduce overall runtime.
Here are the key steps and best practices for optimising Spark jobs:
1. Analyze and Debug Existing Jobs
Use the Spark UI to inspect stages, task durations, shuffle read/write sizes, and spills; uneven task times usually indicate data skew.
Review executor logs and query plans to find the expensive operations before tuning anything.
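As a quick illustration, a query plan can be inspected before the job runs. This is a minimal sketch assuming a local SparkSession and a placeholder input path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PlanInspection")
  .master("local[*]")
  .getOrCreate()

// "path/to/data" is a placeholder input path
val df = spark.read.parquet("path/to/data")

// Prints the parsed, analyzed, optimized, and physical plans,
// revealing shuffles (Exchange nodes) and join strategies
df.groupBy("key").count().explain(true)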
2. Optimize Data Operations
Use repartition() to evenly distribute data across partitions and increase parallelism.
Use coalesce() to reduce the number of partitions without a full shuffle when less parallelism is needed.
For skewed data, use salting techniques or partition on more balanced keys.
Pre-aggregate or filter heavily skewed keys before joining to shrink the shuffled data.
Minimise wide transformations like join, groupBy, or distinct, which trigger shuffles.
Use broadcast() joins when one dataset is small enough to fit in each executor's memory.
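For example, a shuffle join can be replaced with a broadcast join when one side is small. This is a sketch; the paths and the "key" column are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoin").getOrCreate()

val facts = spark.read.parquet("path/to/facts")   // large dataset
val dims  = spark.read.parquet("path/to/dims")    // small lookup table

// broadcast() ships the small table to every executor,
// replacing a shuffle join with a local hash join
val joined = facts.join(broadcast(dims), Seq("key"))
```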
3. Tune Resource Allocation
Balance the number of executors and cores per executor.
Spark runs one task per core. Over-provisioning can lead to resource contention or task failures.
Increase executor memory (spark.executor.memory) if you encounter memory errors.
Tune JVM settings, such as `spark.executor.memoryOverhead` and `spark.driver.memory`.
Enable dynamic allocation (spark.dynamicAllocation.enabled) to adjust the executor count automatically based on workload.
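These settings can be applied when building the session. The values below are illustrative; the right numbers depend on the cluster:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ResourceTuning")
  .config("spark.executor.memory", "8g")           // heap per executor
  .config("spark.executor.cores", "4")             // one task per core
  .config("spark.executor.memoryOverhead", "1g")   // off-heap headroom
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()
```

Note that dynamic allocation also requires an external shuffle service, or spark.dynamicAllocation.shuffleTracking.enabled on Spark 3.x.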
4. Use Efficient Data Formats
File Formats: Prefer columnar formats such as Parquet or ORC with compression; they support predicate pushdown and column pruning.
Partitioning: Partition output data on columns that are frequently filtered on, so queries read only the relevant files.
Caching and Persistence: Use cache() or persist() for DataFrames reused across multiple actions, and unpersist() them when no longer needed.
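A sketch combining these three ideas; the paths and the partition column "event_date" are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFormats").getOrCreate()

val events = spark.read.parquet("path/to/events")

// Cache a DataFrame reused by several downstream actions
events.cache()

// Write compressed columnar output, partitioned by a commonly filtered column
events.write
  .mode("overwrite")
  .partitionBy("event_date")
  .option("compression", "snappy")
  .parquet("path/to/events_partitioned")
```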
5. Optimize Configurations
Tune spark.sql.shuffle.partitions to match your data volume; the default of 200 is often wrong for very small or very large jobs.
On Spark 3.x, enable Adaptive Query Execution (spark.sql.adaptive.enabled) so partition counts and join strategies are adjusted at runtime.
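Illustrative SQL engine settings; the partition count is an assumption and depends on the job:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ConfigTuning")
  .config("spark.sql.shuffle.partitions", "400")    // default is 200
  .config("spark.sql.adaptive.enabled", "true")     // AQE, Spark 3.x
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```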
Example Tuning Scenario
For a Spark job that reads a large Parquet file, groups data by a key, and writes it back:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("Example").getOrCreate()

// Read the source Parquet data
val df = spark.read.parquet("path/to/data")

// Aggregate values per key (a wide transformation that shuffles)
val groupedDF = df.groupBy("key").agg(sum("value").as("total"))

// Cache if reused later
groupedDF.cache()

// Write optimized data
groupedDF.write.mode("overwrite").parquet("path/to/output")
```
Optimization Steps:
Verify the shuffle behaviour with explain() before changing anything.
Tune spark.sql.shuffle.partitions so the groupBy shuffle produces reasonably sized partitions.
Keep the cache() call only if groupedDF is reused by more than one action.
Coalesce before writing if the job would otherwise produce many small output files.
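Applying these steps, the job might look like the following sketch; the partition counts are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("OptimizedExample")
  .config("spark.sql.shuffle.partitions", "400")   // illustrative value
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()

val df = spark.read.parquet("path/to/data")

val groupedDF = df.groupBy("key").agg(sum("value").as("total"))

// Reduce the number of output files without a full shuffle
groupedDF.coalesce(50)
  .write.mode("overwrite")
  .parquet("path/to/output")
```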