Life After Hadoop: Getting Data Science to Work for Your Business

Five essential ingredients to scaling data science

Josh Patterson
Towards Data Science
9 min read · Aug 6, 2019

By Josh Patterson and Jim Scott

Data science can have transformative effects on a business. But not if it takes weeks to get any results.

Putting data science to work can be simplified into three steps. First, capture data from your business processes. Second, combine this big data with massive processing to create machine learning models. Third, validate the models for accuracy and deploy them.

This almost makes it sound easy, but to date it’s been anything but. Traditional approaches like Apache Hadoop and CPU-based infrastructure aren’t up to the task — they pose too many bottlenecks and are too slow.

NVIDIA and its partners are building GPU-based tools that let businesses unleash data science using AI. Below are five key elements to building AI-optimized data centers, but first let’s take a look at the crossroads data science is at.

The Trouble with Big Data

Data scientists implementing machine learning need three things: algorithms, lots of data, and lots of compute power. They require terabytes of data, petaflops of compute, and hundreds of iterative learning cycles to create a machine learning data pipeline.

Big data was initially built on Hadoop because it scaled easily. This allowed tremendous amounts of data to be stored. To process this data, software frameworks such as MapReduce were used with Hadoop across large clusters.

However, common data processing tasks associated with machine learning have many steps (data pipelines), which Hadoop couldn’t handle efficiently. Apache Spark solved this problem by holding all the data in system memory, which allowed more flexible and complex data pipelines, but introduced new bottlenecks.

The many steps of these pipelines include gathering, cleaning, standardizing and enriching data. But processing these complicated workflows using only CPUs is prohibitively slow.
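
To make those steps concrete, here is a minimal sketch of such a pipeline in pandas; the file and column names are hypothetical. Each pass is quick on a sample, but run over hundreds of gigabytes on CPUs, these are the stages that stretch into hours.

```python
import pandas as pd

# Gather: load raw event data (hypothetical files and columns).
raw = pd.read_csv("clicks.csv")

# Clean: drop malformed rows and duplicates.
clean = raw.dropna(subset=["user_id"]).drop_duplicates()

# Standardize: normalize timestamps to UTC.
clean["ts"] = pd.to_datetime(clean["ts"], utc=True)

# Enrich: join in user attributes to build features.
features = clean.merge(pd.read_csv("users.csv"), on="user_id", how="left")
```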

Analyzing even a few hundred gigabytes of data could take hours if not days on Spark clusters with hundreds of CPU nodes. Now consider processing the terabytes of data that big data is known for, and a data scientist could be forgiven for thinking the process had ground to a halt.

And this doesn’t even get at the real work a data scientist wants to perform. While a single run of a workflow is taking a few hours to complete, data scientists in the meantime want to modify variables, models, and model parameters to select the best combinations and iterate their ideas.

Workflows involve many iterations of transforming raw data into training data, which is fed into many algorithm combinations; these in turn undergo hyperparameter tuning to find the right combinations of models, model parameters, and data features for optimal accuracy and performance.
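
A small sketch shows how quickly those iterations multiply: even a modest hyperparameter grid turns one workflow into dozens of full training runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a real training set.
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [8, 16, None],
    "max_features": ["sqrt", 0.5],
}

# 2 x 3 x 2 = 12 combinations, times 5 CV folds = 60 training runs.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```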

The New Era of Data Science

To iterate quickly, data scientists need more compute power. In fact, what they need is supercomputing power, but in a form that enterprises can easily integrate into existing infrastructure. Old-style data centers just aren't going to get it done.

GPUs can accelerate many steps of their data pipelines. While adding GPUs to data centers provides a benefit, it’s not enough.

To tap the potential of data science, GPUs have to be at the center of data center design. Generally speaking, data science workflows on GPUs are roughly 10x faster end to end than on CPUs.

We’ve been working with our partners to create the essential building blocks to address bottlenecks in data centers for data science. There are five elements of the optimized AI data center:

Compute Nodes

With their tremendous computational performance, systems with NVIDIA GPUs are the core compute building block for AI data centers. From NVIDIA T4 to V100, NVIDIA enterprise GPUs supercharge servers across numerous applications.

We’ve seen a 10x-plus improvement in streaming analytic applications by adding a single T4 per node. Elsewhere, we’ve seen 1,000x improvements in graph analytics with V100 GPUs connected with NVSwitch in our DGX systems.

NVIDIA DGX systems deliver groundbreaking AI performance and lead the charge in GPU density with eight and 16 GPUs in a single node. A single DGX-1 can replace on average 50 dual-socket CPU servers. This is the first step to give data scientists the industry’s most powerful tools for data exploration.

Networking

The story doesn't stop at scaling up from CPUs to T4 and DGX systems. With the massive data that enterprises commonly collect, data scientists also have to scale out. Most data centers today use 10 or 25 GbE networking because, with CPUs doing the processing, there's almost no benefit from anything faster. In traditional systems, GPUs communicate over this standard networking stack, so once data has to cross nodes, the CPU and network become the bottleneck.

Remote direct memory access (RDMA) removes this bottleneck, providing top performance for deep learning and machine learning. Combining RDMA in Mellanox NICs with NCCL2 (the NVIDIA Collective Communications Library) and OpenUCX (an open-source point-to-point communication framework) has led to tremendous improvements in training speed. Because RDMA allows GPUs to communicate directly with each other across nodes at up to 100 Gb/s, they can span multiple nodes and operate as if they were on one massive server.
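
The dask-cuda and UCX-Py projects expose this plumbing from Python. As a minimal sketch, assuming the hardware and drivers are present, a local multi-GPU cluster can be started with the UCX protocol so workers use NVLink within a node and InfiniBand/RDMA across nodes:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per local GPU, communicating over UCX instead of TCP.
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_nvlink=True,      # NVLink for intra-node GPU-to-GPU transfers
    enable_infiniband=True,  # RDMA over InfiniBand for cross-node transfers
)
client = Client(cluster)
print(client)
```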

The benefits of RDMA aren't limited to training. It's also critical for large, complex ETL workloads: improving networking with RDMA provides up to a 20x benefit for distributed joins compared with systems without it. In addition, RDMA, formerly limited to GPU-to-GPU communication, provides a further 2x speed boost when ingesting data at rest via GPUDirect Storage.
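
A distributed join is the canonical shuffle-heavy ETL step here. A minimal dask-cudf sketch, using the cluster above (file and column names are hypothetical):

```python
import dask_cudf

# Each table is a set of Parquet shards spread across the cluster's GPUs.
transactions = dask_cudf.read_parquet("transactions/*.parquet")
customers = dask_cudf.read_parquet("customers/*.parquet")

# The merge shuffles rows between workers; with UCX enabled, that traffic
# moves over NVLink/RDMA instead of the CPU networking stack.
joined = transactions.merge(customers, on="customer_id", how="inner")
print(joined.head())
```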

Storage

Accelerated storage is not a new concept to enterprises; it just didn't provide a significant cost improvement for traditional big data clusters built on Hadoop. Given the performance of CPUs, the answer was to use slower disks and replicate data many times to increase locality.

But as workloads became more complex, replication increased, and CPUs could spend up to 70 percent of their time sending data back and forth among the hundreds of nodes in a Spark cluster. Meanwhile, a single GPU has at most 48GB of memory, so data eventually has to be shuffled in from storage, and traditional storage becomes the final bottleneck.

GPUDirect Storage allows both NVMe and NVMe over Fabric (NVMe-oF) storage to read and write data directly to the GPU, bypassing the CPU and system memory. Legacy architectures placed a "bounce buffer" in the CPU's system memory on the path between storage and the GPU, where extra copies of data were staged to keep GPU applications fed.

With GPUDirect Storage, once all of the computation happens on the GPU, the GPU becomes not only the fastest computing element but also the one with the highest I/O bandwidth. This frees up the CPU and system memory for other tasks, while giving each GPU access to orders of magnitude more data at up to 50 percent greater bandwidth.
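
For illustration, KvikIO, RAPIDS' Python binding for the cuFile/GPUDirect Storage API, lets a file be read straight into GPU memory. A minimal sketch with a hypothetical file name (on systems without GPUDirect Storage support, KvikIO falls back to a bounce-buffer path):

```python
import cupy as cp
import kvikio

# Destination buffer allocated directly in GPU memory.
buf = cp.empty(100_000_000, dtype=cp.uint8)

# Read the file's bytes straight into the GPU buffer; with GPUDirect
# Storage available, the copy bypasses CPU system memory entirely.
f = kvikio.CuFile("shard-0000.bin", "r")
f.read(buf)
f.close()
```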

NVSwitch unified GPUs within a system, RDMA unified GPUs across nodes, and now GPUDirect Storage unifies GPUs and NVMe storage, extending their capacity to petabyte scale.

Together, unifying GPUs intra-node with NVLink and NVSwitch, internode with RDMA networks, and with NVMe-oF storage via GPUDirect Storage provides the highest throughput for GPU processing and training.

Kubernetes and Containers

Deploying software is the last key to scaling data science. Enterprises typically rely on some virtualization or orchestration layer to deploy software. In the data science world, YARN long reigned supreme. It's the backbone of Hadoop, Spark and the rest of the traditional big data ecosystem, and it allows enterprises to switch between libraries in their workflows.

Nearly all YARN deployments are on a version that predates GPU support. This makes it difficult for data scientists in YARN environments to target GPU machines for their workloads, and it's one of several reasons why enterprises are moving to Kubernetes and Docker containers.

Packaging and deploying applications via containers is critical to gaining flexibility. Combining containerized applications with Kubernetes lets businesses immediately reprioritize which task is most important, and adds resiliency, reliability and scalability to AI data centers. Kubernetes and containers are also much faster to deploy than traditional packaging on YARN: while spinning up a data science cluster could take tens of minutes on YARN, it can be done in seconds with Kubernetes.
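
With the NVIDIA device plugin installed, a GPU becomes just another schedulable resource. A minimal sketch using the official kubernetes Python client (the image tag and pod name are hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()  # use local kubeconfig credentials

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rapids-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rapids",
                image="rapidsai/rapidsai:latest",  # hypothetical image tag
                # Requesting nvidia.com/gpu lets the scheduler place the
                # pod on a node with a free GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```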

Being able to change tasks quickly reduces idle cycles, which improves the throughput and productivity of a cluster. As compute resources become more powerful and dense, this is essential to maximizing the value of a data center.

NVIDIA realized the importance of Docker and Kubernetes and invested in this community early. This investment has resulted in NVIDIA GPUs being a first-class resource within Kubernetes. Software can access all the functionality and hardware acceleration in the data center needed to maximize corporate investments in data science.

Accelerated Software

RAPIDS is a suite of open source software libraries and APIs for executing data science pipelines entirely on GPUs — and can reduce training times from days to minutes. Built on NVIDIA CUDA-X AI, RAPIDS unites years of development in graphics, machine learning, deep learning, HPC and more.

By hiding the complexities of working with the GPU, and even the behind-the-scenes communication protocols within the data center architecture, RAPIDS creates a simple way to get data science done. As more data scientists use Python and other high-level languages, providing acceleration without code changes is essential to rapidly improving development time.
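
Because cuDF mirrors the pandas API, moving an existing workflow onto the GPU is often close to a one-line change. A minimal sketch with hypothetical file and column names:

```python
import cudf  # drop-in for `import pandas as pd` in many workflows

# CSV parsing and decompression happen on the GPU.
df = cudf.read_csv("transactions.csv")

# Familiar pandas-style groupby/aggregate, executed by CUDA kernels.
totals = df.groupby("customer_id").agg({"amount": "sum"})
print(totals.sort_values("amount", ascending=False).head())
```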

Another way to accelerate development is through integration with leading data science frameworks like Apache Spark, Dask and Numba, as well as numerous deep learning frameworks such as PyTorch, Chainer and Apache MXNet. In doing so, RAPIDS provides the foundation of a new accelerated data science ecosystem.
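
That interoperability works because these libraries share GPU memory via the `__cuda_array_interface__` convention rather than copying. As a small sketch, a custom Numba CUDA kernel can operate directly on a CuPy array:

```python
import cupy as cp
from numba import cuda

@cuda.jit
def scale(x, factor):
    # One thread per element of the array.
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

arr = cp.arange(1_000_000, dtype=cp.float32)

# Numba consumes the CuPy array zero-copy via __cuda_array_interface__.
threads = 256
blocks = (arr.size + threads - 1) // threads
scale[blocks, threads](arr, 2.0)
print(arr[:4])  # [0. 2. 4. 6.]
```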

As Apache 2.0-licensed open source software, RAPIDS not only brings this ecosystem together on GPUs, but also lowers the barrier to entry for new libraries. Increasingly, products are being built on top of RAPIDS. For example, BlazingSQL, a SQL engine that was recently fully open-sourced, adds even more accelerated capabilities for users.

RAPIDS allows a new wave of applications to have native integrations, Kubernetes support, RDMA functionality and GPUDirect Storage across a myriad of data formats without having to invest in years of development. And more software makes it easier for data scientists to do what they want to do: solve challenging problems with data. Pairing these benefits with the other facets of an AI data center lets you tackle data science without performance limitations.

NVIDIA GPUs in Action

Regardless of industry or use case, when putting machine learning into action, most data science problems break down into similar steps: iteratively preprocessing data to build features, training models with different parameters, and evaluating the models to ensure performance translates into valuable results.
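
A minimal sketch of that loop on GPUs with cuDF and cuML (file and column names are hypothetical):

```python
import cudf
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
from cuml.model_selection import train_test_split

# Preprocess: load features and labels on the GPU.
df = cudf.read_parquet("features.parquet")
X = df.drop(columns=["label"])
y = df["label"].astype("int32")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train: each parameter change here is one more iteration of the loop.
model = RandomForestClassifier(n_estimators=100, max_depth=8)
model.fit(X_train, y_train)

# Evaluate: confirm accuracy before deploying the model.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```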

RAPIDS helps accelerate all of these steps while maximizing the user's hardware investment. Early customers have taken full data pipelines that took days, if not weeks, and run them in minutes. They've simultaneously reduced costs and improved the accuracy of their models, because faster iteration lets data scientists explore more models and parameter combinations, as well as train on larger datasets.

Retailers are improving their forecasting. Finance companies are better assessing credit risk. And adtech firms are enhancing their ability to predict click-through rates. Data scientists often achieve improvements of 1–2 percent. This may seem trivial, but it can translate to tens or hundreds of millions of dollars of revenue and profitability.

Help Shape the Future

NVIDIA is committed to simplifying, unifying and accelerating the data science community. By optimizing the whole stack — from hardware to software — and by removing bottlenecks, NVIDIA is helping data scientists everywhere do more than ever with less. This translates into more value for enterprises from their most precious resources: their data and data scientists.

But, we’re not doing this alone. Building bridges across the data science community accelerates everyone. Be a part of the community. Start using accelerated libraries, let us know what features should come next, and apply to join our advisory consortium.

NVIDIA is building the data science platform of the future, and we want you to be a part of it.
