An Introduction to Apache Spark, PySpark, and DataFrame Transformations
A Comprehensive Guide to Mastering Big Data Analysis
Introduction: The Big Data Problem
Apache Spark emerged as a new engine and programming model for data analytics. Its origin goes back to 2009, and the main reason it has gained so much importance in recent years is a shift in the economic factors that underlie computer applications and hardware.
Historically, the power of computers grew steadily over time. Each year, new processors could perform operations faster, and the applications running on top of them automatically got faster as well.
All of this changed around 2005, when limits in heat dissipation caused the industry to switch from making individual processors faster to parallelizing across CPU cores. This meant that applications, and the code that runs them, had to change too. This shift is what laid the groundwork for new models like Apache Spark.
In addition, the cost of sensors and storage has only decreased in recent years. Nowadays it is remarkably inexpensive to collect and store vast amounts of information.
There is so much data available that the way we process and analyze it must change radically too, by running large parallel computations on clusters of computers. These clusters combine the power of many machines working simultaneously, making expensive computational tasks like data processing much easier to tackle.
And this is where Apache Spark comes into play.
So if you want to find out more about:
- The basics of Apache Spark
- The intuition behind why it is important and how it operates
- How to perform analysis operations with PySpark and DataFrames (a short sketch follows below)
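To give a flavour of what this looks like in practice, here is a minimal PySpark sketch of a DataFrame transformation. It assumes a local Spark installation; the file name sales.csv and the columns country and amount are made-up examples, not part of the original article.

```python
# A minimal sketch, assuming a local Spark setup.
# "sales.csv", "country", and "amount" are hypothetical example names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a CSV file into a DataFrame (lazy: nothing is computed yet)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Chain transformations: filter rows, then group and aggregate
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("country")
      .agg(F.sum("amount").alias("total_amount"))
)

# An action like show() triggers the actual distributed computation
summary.show()

spark.stop()
```

Note that the transformations themselves (filter, groupBy, agg) are lazy; Spark only executes the work across the cluster when an action such as show() is called.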
I invite you to click on the link and read the full article!