Winning Solution of RecSys2020 Challenge
Accelerated Feature Engineering and Training for Recommender Systems
By Benedikt Schifferer and Even Oldridge
Our 1st place solution of the RecSys Challenge 2020 focused on predicting tweet interaction based on this year’s dataset provided by the competition host, Twitter. NVIDIA’s interdisciplinary team included colleagues from NVIDIA’s KGMON (Kaggle Grandmasters), NVIDIA’s RAPIDS (Data Science), and NVIDIA’s Merlin (Recommender Systems) who collaborated on the winning solution.
In the post, The Great AI Bake-Off: Recommendation Systems on the Rise, we tell our story behind the scene. The focus of this blog post is to share our learnings of this year’s competitions from a technical perspective, including speeding up end-2-end data processing by 25x with GPUs-acceleration and influencing the next version of NVTabular, which simplifies data processing.
Our approach achieved the highest score in seven of the eight metrics used to calculate the final leaderboard position. The acceleration of our pipeline was critical in our ability to quickly perform exploratory data analysis (EDA) and led to discovering a range of effective features used in the final solution, which is provided as open-source software.
Introduction
Recommender systems (RecSys) play a critical role in driving user engagement and revenue on online platforms. The data structure is often tabular and requires extensive feature engineering to find features that provide the best performance when modeling. Although deep learning architectures can automatically extract good data representations in other domains, such as images and text, they still depend on expert knowledge and feature engineering in tabular data. Top solutions in previous and current RecSys challenges depended on custom features fed into boosted decision trees or neural networks.
For the RecSys 2020 Challenge, Twitter provided a dataset containing 200 million user-tweet combinations, requiring participants to predict the user’s behavior given a user-tweet combination: does the user like “reply”, “retweet”, and/or “retweet with comment” the tweet? A description of the data schema can be found here. The 200 million tweet dataset required significant computation to do feature engineering and prepare the dataset for modeling, and our winning solution leveraged several key tools to accelerate our pipeline.
Acceleration of ETL and Training
Discovering the most accurate model requires running many experiments, such as varying feature engineering, different validation dataset splits, and finding hyperparameters. One experiment may require running multiple steps of the data pipeline. For example, a new feature requires training a model and calculating the performance metrics. It is crucial to optimize the data pipeline end-2-end to be able to run all experiments in time.
Our final implementation runs entirely on GPU and includes feature engineering, preprocessing, and training the models for the 200M interactions all in two minutes 18 seconds, a speed-up of greater than two orders of magnitude over CPU optimized code. This was achieved using a combination of RAPIDS cuDF, Dask, UCX, and XGBoost on a single machine with four NVIDIA V100 GPUs. Let’s analyze the different stages of our data pipeline.
Preparing
We must preprocess the original dataset to get it into an optimal format. This is only performed once and then used by all future experiments. In this step, we loaded the provided TSV files, imputed NaNs, factorized categorical variables, optimized variable data types, processed lists into columns, decoded BERT tokens into text, extracted text features (e.g., word counts), and computed a dozen of count features related to engaging and engaged users. As this was a one-time preprocessing step, we did not include it in our comparison in Figure 1.
Feature Engineering
The essential feature engineering techniques are Target Encoding, Count Encoding, and Difference Lag (variable differences in time). After brainstorming an idea, we would compute the feature(s) and then train, infer, and compute validation scores. It was essential to speed up this cycle as much as possible. Using RAPIDS cuDF, DASK, and UCX enabled us to use four NVIDIA V100 GPUs for this task. With these libraries, it takes less than one minute to engineer all features, while using Pandas on CPU takes eight hours 56 minutes, a speed-up factor of over 500x. Even compared to optimized CPU code (Pandas+Dask) on 20 cores, our solution provided a speed-up of 42x. Figure 1 visualizes the speed up, excluding Pandas, as it is outside the range.
Training
Using Dask-XGBoost enabled our models to be trained on four NVIDIA V100 GPUs, which sped up model training and inference significantly compared to LightGBM on CPU. Using this multi-GPU solution, we accelerated training by a factor of 120x. The competition metrics PRAUC and RCE required the calculation of AUC and log loss. We accelerated these calculations with GPU and Numba JIT, reducing computation from minutes to seconds.
End-2-end our solution was 280x faster than our initial CPU-based implementation and 25x faster than the optimized CPU code. The techniques used in the solution are also being integrated into NVTabular, our recommender system-specific ETL library.
Simplifying recommendation workflows with NVTabular
NVIDIA’s Merlin is an application framework and ecosystem that enables end-2-end development of recommender systems, accelerated on NVIDIA GPUs. NVTabular is the ETL component of NVIDIA Merlin. In the NVTabular v0.2 release, we implemented the learnings from the RecSys2020 challenge to provide the operators Target Encoding, Count Encoding, and Difference Lag, as well as new multi-GPU support. NVTabular creates the feature engineering pipeline in just 64 lines of code.
Let’s take a look at the pipeline operator-by-operator:
- We transform the labels (targets), reply, retweet, retweet with comment, and like to a binary column. In the raw data, positive interactions are encoded with the timestamp.
- We apply Target Encoding with 5 k-fold strategy and smoothing factor of 20 on all labels for the categorical features defined by cat_groups.
- We use Count Encoding defined on the categorical features in columns.
- We create a new feature by taking the ratio of a_following_count and a_follower_count.
- We create a new feature by taking the ratio of b_following_count and b_follower_count.
- We convert the timestamp to datetime format.
- We extract the hour from the timestamp.
- We convert the datatype for the columns b_follower_count, b_following_count, and language.
- We apply Difference Encoding on the columns b_follower_count, b_following_count, and language.
- We fill missing values with 0.
The full end-2-end implementation will be released here.
Conclusion
In this blog post, we explained our 1st place solution of the RecSys2020 challenge to a technical audience. We shared the techniques that enabled us to win the competition. We were able to speed up our pipeline 25x (optimized CPU implementation) to 280x (initial CPU implementation), running end-2-end in two minutes and 18 seconds. We showed a detailed analysis of the different stages of our pipeline. First, a data preparation step is executed only once. Second, the feature engineering step was accelerated by 42x, and finally, the training step achieved 120x speed up. Our GPU-optimized pipeline enabled us to quickly iterate on new ideas and run many experiments in a short time, giving us a competitive advantage to win this year’s competition.
In addition, we share the data pipeline as an NVTabular workflow, requiring only 64 lines of code. We implemented our learnings throughout this process into NVTabular, providing easy access to the same techniques for the community.
Authors
A collaboration of the participating team in the RecSys2020 challenge: Benedikt Schifferer, Gilberto Titericz Junior, Chris Deotte, Christof Henkel, Kazuki Onodera, Jiwei Liu, Bojan Tunguz, Even Oldridge, Gabriel De Souza Pereira Moreira and Ahmet Erdem.