My Third Win in Kaggle's Data Science for Good Competition (with key tips)

Shivam Bansal

VP, AI Field Technology @ H2O.ai • 3x Kaggle Grandmaster • NUS Valedictorian

Published Jul 2, 2019

I just won 1st place in yet another Kaggle Data Science for Good Analytics Competition. In this article, I share a high-level overview of the competition, Data Science for Good Challenge: City of Los Angeles, and of my approach. My submission can be viewed here.

Previous Wins: This is the third time that I have won this challenge. Prior to this, I had also won the following DSfG competitions:

What are Data Science for Good Analytics Competitions?

Unlike the standard "Prediction Competitions" on Kaggle, in which the focus is on solving a structured machine learning problem, Analytics Competitions aim to solve open-ended data science problems. The winners are determined by evaluating the solutions presented in kernels against multiple evaluation criteria.

Problem Statement: City of Los Angeles

The workforce at the City of Los Angeles serves as the backbone for a number of services. As an organization, the City was facing a unique hiring situation: one-third of its workers are retiring very soon, and it has opened a number of job roles. The organization wants to improve the job bulletins that will be used to fill all the open positions. Most of this data is in unstructured form and needs to be converted into a structured format before it can be analysed for actionable insights. The content, tone, and format of job bulletins can strongly influence the quality of the applicants. In addition, within such a large organization it is difficult to clearly identify which promotions are available.

Key Objectives

Develop an NLP framework to accurately convert the job descriptions into structured information and entities.
Develop a framework to measure and quantify the implicit bias in the text, with a focus on providing a solution that encourages diversity among the candidates.
Develop a system to clearly identify the promotion pathways within the organization.

My Approach

I shared my solution in a series of 5 kernels. In the first kernel, a complete job bulletin entity extraction and data structuring engine is developed. In the second kernel, a deep analysis of unconscious bias is performed and a framework is shared to validate and reproduce the results. In the third kernel, the impact of content, tone, and language is measured. Finally, in the last two kernels, a methodology is shared to visualize the promotion pathways (both explicit and implicit) and to identify the possible promotion pathways for a role.

In part 1, I developed a robust, generic NLP parsing module to extract well-defined structured entities and information from the text. The architecture is shown below:

NLP Architecture to Structure Large Text Files

Additionally, I shared a Python package for the same module: PyCola on PyPI
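To make the idea of structuring a bulletin concrete, here is a minimal sketch of heading-based parsing with regular expressions. It is not the PyCola API and not the actual parsing engine from the kernel; the function name `parse_bulletin`, the list of section headings, and the sample bulletin text are all made up for illustration.

```python
import re

# Hypothetical section headings; real bulletins may use different ones.
SECTION_HEADINGS = ["ANNUAL SALARY", "DUTIES", "REQUIREMENTS", "WHERE TO APPLY"]


def parse_bulletin(text):
    """Split a bulletin into a flat record keyed by section heading."""
    alternatives = "|".join(re.escape(h) for h in SECTION_HEADINGS)
    # A heading is assumed to sit alone on its own line.
    pattern = r"^(%s)[ \t]*$" % alternatives
    # With one capture group, re.split returns
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = re.split(pattern, text, flags=re.MULTILINE)

    record = {"title": parts[0].strip()}
    for heading, body in zip(parts[1::2], parts[2::2]):
        key = heading.lower().replace(" ", "_")
        record[key] = " ".join(body.split())  # collapse line breaks

    # Pull a numeric salary range such as "$68,611 to $100,307" if present.
    salary = re.search(r"\$([\d,]+)\s+to\s+\$([\d,]+)",
                       record.get("annual_salary", ""))
    if salary:
        record["salary_min"] = int(salary.group(1).replace(",", ""))
        record["salary_max"] = int(salary.group(2).replace(",", ""))
    return record


# Made-up sample text, only to exercise the function.
sample = """SYSTEMS ANALYST

ANNUAL SALARY
$68,611 to $100,307

DUTIES
A Systems Analyst analyzes procedures and
recommends improvements.

REQUIREMENTS
Two years of full-time paid experience as a Management Assistant.
"""

record = parse_bulletin(sample)
for key, value in record.items():
    print(key, "->", value)
```

A real engine would need many more rules (dates, exam types, licence requirements), but the pattern of splitting on known headings and then applying a targeted expression to each section scales reasonably well.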

In the second kernel, I focused on the analysis, measurement, and quantification of unconscious (or implicit) bias present in the job bulletins. Different visualizations were used to showcase key insights and trends. In the end, I performed a simulation experiment to validate the hypothesis.
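One simple way to quantify implicit bias in text is to count gender-coded words. The sketch below illustrates that general idea only; it is not the method used in the kernel. The two word sets are tiny illustrative assumptions, and the function name `gender_coding_score` is hypothetical. A real analysis would rely on a much larger, published lexicon.

```python
import re
from collections import Counter

# Tiny illustrative lexicons (assumptions, not a validated word list).
MASCULINE = {"competitive", "dominant", "aggressive", "decisive", "lead"}
FEMININE = {"supportive", "collaborative", "nurturing", "cooperative", "interpersonal"}


def gender_coding_score(text):
    """Return counts of coded words and a score in [-1, 1].

    A positive score means more masculine-coded words, a negative score
    means more feminine-coded words, and 0.0 means balanced or none found.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    masc = sum(words[w] for w in MASCULINE)
    fem = sum(words[w] for w in FEMININE)
    total = masc + fem
    score = 0.0 if total == 0 else (masc - fem) / total
    return {"masculine": masc, "feminine": fem, "score": score}


result = gender_coding_score(
    "We seek a decisive, competitive candidate to lead a collaborative team."
)
print(result)
```

Scoring every bulletin this way gives a distribution that can be plotted or compared across job classes.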

The aim of the third kernel was to provide a deep analysis framework to measure how well the job descriptions are written. This included an analysis of content, tone, and language. To validate the hypothesis, I also created a benchmark index by analysing data from Fortune 500 companies.
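The article does not list the exact metrics behind this analysis. As one hedged example of measuring how a description is written, the sketch below computes the Flesch Reading Ease score, a common readability measure. The syllable counter is a rough vowel-group heuristic, and both sample sentences are made up.

```python
import re


def count_syllables(word):
    # Rough heuristic: one syllable per group of consecutive vowels,
    # with a minimum of one syllable per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up examples: a plainly worded duty versus a jargon-heavy one.
plain = "You will help the team fix roads. You will work with the public."
dense = ("The incumbent facilitates interdepartmental coordination of "
         "infrastructure rehabilitation initiatives.")

plain_score = flesch_reading_ease(plain)
dense_score = flesch_reading_ease(dense)
print(round(plain_score, 1), round(dense_score, 1))
```

Comparing such scores against those of a reference set of job descriptions is one way a benchmark index could be framed.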

Finally, in the fourth and fifth kernels, I developed methods to find both the explicit and the implicit promotion pathways. The following architecture diagram shows the process.

The indirect connections were obtained using contextual similarity among word embeddings generated from pre-trained fastText models. The figure on the right shows an example of the promotion pathways that were generated. The graphs were rendered with d3.js, loaded through require.js in Kaggle kernels and controlled from Python.
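The similarity step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the kernels: the three-dimensional vectors are made-up stand-ins for embeddings that would come from a pre-trained fastText model, the job titles are invented, and the 0.8 threshold and the function name `implicit_links` are arbitrary choices.

```python
import math
from itertools import combinations

# Toy vectors standing in for per-job embeddings (made-up numbers).
job_vectors = {
    "Accounting Clerk": [0.9, 0.1, 0.0],
    "Senior Accountant": [0.8, 0.3, 0.1],
    "Tree Surgeon": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def implicit_links(vectors, threshold=0.8):
    """Return (job_a, job_b, similarity) for every pair above the threshold."""
    links = []
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            links.append((a, b, round(sim, 3)))
    return links


links = implicit_links(job_vectors)
print(links)
```

Each returned pair can be treated as an undirected edge in a graph of related roles, which is the kind of structure a d3.js force layout can draw.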

In the end, I would like to share some key tips that really helped me to win:

Iterative Approach: From my experience, I have learned that excellent solutions are prepared step by step. Breaking the problem into milestones, and those further into mini-tasks, is a good approach to follow. When preparing a solution, always take one mini-task at a time: complete it, validate it, and improve it.
Structured Thinking: Most of the Data Science for Good challenges are open-ended data science problems, so it is important to approach them in a structured manner.
Well-Structured and Clean Code: The solutions for this type of competition are always obtained through an iterative process. No one can get a complete and accurate solution in one go. Hence, it is very important to follow good coding practices so that iterations become easier.
Effective Data Storytelling: This type of competition is very open-ended, so it is necessary to communicate the important insights to the community in a well-structured manner. The use of the right, neat visualizations, along with the necessary explanations, is useful.
Innovative Elements: Always focus on adding something new, creative, and innovative. Instead of following what others are doing, a winning solution is one that uses something unique.

Mohan HR

Chair - Events, IEEE Computer Society, Former AVP Systems, The Hindu, Board Member AI Forum, IEEE Ambassador, Past President - CSI, Past Chair - IEEE CS, IEEE PCS Madras & ACM Chennai, IEEE CS R10 GAC

Hearty congratulations

Paul Ngana

Senior Reliability Engineer

Shivam Bansal what tool did you use to draw your architecture diagrams?

Janmejaya Nanda

Congrats Shivam Bansal. Such a nice explanation with beautiful visualization.

Riddhi L.

Experimenting with Gen AI @ TCS Cloud Labs

Kudos
