What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing with built-in modules for SQL, streaming, machine learning, and graph processing. Spark can run on Apache Hadoop, Apache Mesos, Kubernetes, on its own, or in the cloud, and it can read from diverse data sources.

A common question is when to use Apache Spark and when to use Apache Hadoop. Both are Apache top-level projects, both are among the most prominent distributed systems on the market today, and they are often used together. Hadoop is used primarily for disk-heavy operations with the MapReduce paradigm. Spark is a more flexible, and often more costly, in-memory processing architecture. Understanding the features of each will guide your decision on which to use when.

Learn how to use Dataproc to run Apache Spark clusters on Google Cloud in a simpler, integrated, more cost-effective way.

Apache Spark overview

The Spark ecosystem includes five key components:

1. Spark Core is a general-purpose, distributed data processing engine. On top of it sit libraries for SQL, stream processing, machine learning, and graph computation, all of which can be used together in an application. Spark Core is the base of the whole project, providing distributed task dispatching, scheduling, and basic I/O functionality.

2. Spark SQL is the Spark module for working with structured data, and it supports a common way to access a variety of data sources. It lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. Spark SQL supports the HiveQL syntax and allows access to existing Apache Hive warehouses. A server mode provides standard connectivity through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC).

3. Spark Streaming makes it easy to build scalable, fault-tolerant streaming solutions. It brings the Spark language-integrated API to stream processing, so you can write streaming jobs in the same way as batch jobs. Spark Streaming supports Java, Scala, and Python, and features stateful, exactly-once semantics out of the box. 

4. MLlib is the Spark scalable machine learning library with tools that make practical ML scalable and easy. MLlib contains many common learning algorithms, such as classification, regression, recommendation, and clustering. It also contains workflow and other utilities, including feature transformations, ML pipeline construction, model evaluation, distributed linear algebra, and statistics. 

5. GraphX is the Spark API for graphs and graph-parallel computation. It is flexible and works seamlessly with both graphs and collections, unifying extract, transform, load (ETL), exploratory analysis, and iterative graph computation within one system. In addition to a highly flexible API, GraphX comes with a variety of graph algorithms. It competes on performance with the fastest graph systems, while retaining the flexibility, fault tolerance, and ease of use of Spark.
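
To make the Spark Core description above more concrete, here is a minimal sketch of the pattern it refers to: split the data into partitions, dispatch one task per partition, then merge the partial results. This is plain Python using only the standard library, not the Spark API. The function names (split_into_partitions, count_words, merge_counts, word_count) and the sample lines are hypothetical, and local threads stand in for the executors that a real cluster engine would schedule across machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Task that runs independently on one partition (the 'map' side)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial results (the 'reduce' side)."""
    return left + right


def word_count(records, num_partitions=3):
    partitions = split_into_partitions(records, num_partitions)
    # Dispatch one task per partition. A cluster engine would send these
    # tasks to executors on different machines instead of local threads.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample input, used only for illustration.
lines = [
    "spark core schedules tasks",
    "spark sql queries tables",
    "mllib trains models",
    "graphx walks graphs",
]
totals = word_count(lines)
print(totals.most_common(1))  # [('spark', 2)]
```

Because each task only sees its own partition and the merge step is associative, the result does not depend on how many partitions are used. That property is what lets an engine spread this kind of work across a cluster.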

What are the benefits of Apache Spark?

Speed

For some workloads, Spark can run up to 100 times faster than Hadoop MapReduce. Spark achieves high performance for both batch and streaming data using a state-of-the-art directed acyclic graph (DAG) scheduler, a query optimizer, and a physical execution engine.
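
One idea behind a DAG-based engine is that transformations are recorded as a plan first and executed only when a result is requested, which gives the engine a chance to schedule and optimize the whole plan. The sketch below illustrates that idea in plain Python. It is not Spark's implementation or API: the LazyDataset class and its methods are hypothetical, and the plan here is a simple linear chain, whereas a real scheduler also handles branching, joins, and distribution across machines.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the recorded plan, in order

    def map(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        """Return the names of the recorded steps."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        # Action: build the chained iterator and pull every element through it.
        stream = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


calls = []


def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
print(pipeline.plan())       # ['map', 'filter']
calls_before = len(calls)
print(calls_before)          # 0, nothing has executed yet
result = pipeline.collect()
print(result)                # [1, 9, 25]
```

Note that square is not called until collect runs. Deferring the work this way is what allows an engine to look at the full plan before deciding how to execute it.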

Ease of use

Spark offers more than 80 high-level operators that make it easy to build parallel apps. You can use it interactively from Scala, Python, R, and SQL shells to write applications quickly.

Generality

Spark powers a stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Open source framework innovation

Spark is backed by global communities united around introducing new concepts and capabilities faster and more effectively than internal teams working on proprietary solutions. The collective power of an open source community delivers more ideas, quicker development, and troubleshooting when issues arise, which translates into a faster time to market. 

Why choose Spark over a SQL-only engine?

Apache Spark is a fast, general-purpose cluster computation engine that can be deployed in a Hadoop cluster or in stand-alone mode. With Spark, programmers can write applications quickly in Java, Scala, Python, R, and SQL, which makes it accessible to developers, data scientists, and advanced business people with statistics experience. Using Spark SQL, users can connect to a wide range of data sources and present them as tables to be consumed by SQL clients. In addition, interactive machine learning algorithms are easily implemented in Spark.

With a SQL-only engine like Apache Impala, Apache Hive, or Apache Drill, users can only use SQL or SQL-like languages to query data stored across multiple databases. That means these frameworks are narrower in scope than Spark.

How are companies using Spark?

Many companies are using Spark to help simplify the challenging and computationally intensive task of processing and analyzing high volumes of real-time or archived data, both structured and unstructured. Spark also enables users to seamlessly integrate relevant complex capabilities like machine learning and graph algorithms.

Data engineers

Data engineers use Spark for coding and building data processing jobs, with the option to program in an expanded language set.

Data scientists

Data scientists can have a richer experience with analytics and ML using Spark with GPUs. The ability to process larger volumes of data faster with a familiar language can help accelerate innovation.

