Data Analysis and Decision Making in Formula-1: Part-1
from ferrari.com, Ferrari car in 2014

Data Analysis and Decision Making in Formula-1: Part-1

In 2019, Mercedes tweeted that “Races are won at the track. Championships are won at the factory.” Clearly, Mercedes knows a thing or two about Formula-1 (F1) racing success -- this car maker has manufactured the chassis and engine for each of the last seven F1 winners. The last time another manufacturer’s vehicle achieved F1 victory was in 2013. 

But is it just the car? Or the design? Or the engine? It is none of those things in isolation. In addition to superior engineering skills, Mercedes has created a strong data analytics foundation to sustain its competitiveness in F1; meanwhile, other manufacturers have done the same.

 As a fan of F1 racing, I decided to analyze a series of 13 Formula-1 data sets posted to Kaggle. I am interested in exploring any patterns that the data may help me to uncover, such as common reasons for drivers not finishing races, correlations among qualifying round and final race performance, the impact of maintenance issues on performance, or anything else that I can unearth through data mining. 

This article is the first in a series of blog posts titled “Data Analysis and Decision Making in Formula-1.” The first post is focused on exploration of F1 data from the last 70 years. Using the R language, I will dig into the data, while utilizing key exploratory concepts such as joining two data frames, filtering data, summarizing data, and creating visualizations using the ggplot2 package.

Joining two data frames using inner_join() and merge() function

One of the first questions that I sought to answer here was “Who is the winningest F-1 driver during this time period?” After answering the question, I would like to generate a visualization that shows the F-1 drivers, ranked by number of total wins. I started out with a couple of library() functions before importing four of the datasets:

library(readr)
library(tidyverse)
drivers <- read_csv("drivers.csv")
results <- read_csv("results.csv")
races <- read_csv(“races.csv”)

status <- read_csv(“status.csv”)

The results.csv file contains 24,960 rows of data, with detailed information about the performance of each driver in each race during the time period. It does not include the drivers’ names, however -- only a numeric identifier (driverId). The screenshot below shows the first six rows of results. 

head(results)
No alt text provided for this image

In order to make an easily-interpretable visualization, the driver’s name, rather than a driverId, should be used. In order to have drivers’ names alongside their performance data, I will need to combine information from results.csv and drivers.csv. As is often the case with R, there are multiple approaches that can be taken to accomplish this task. 

 Here, I am using the inner_join() function from the dplyr package:

 results_agg_total <- inner_join(drivers,results,by = "driverId")

The inner_join() function will ignore any driverId which is not present in both of the data frames -- rows with driverId values contained in one of the dataframes, but not the other, will simply be dropped from the resulting merged dataframe.  The screenshot below shows the first six rows of results_agg_total, which includes all the unique columns from drivers and results

Alternatively, we can obtain similar results using the merge() function from R’s base package, as shown below:

results_agg_total_alt <- merge(drivers,results,by = "driverId")

Next, I will perform one more merge operation. By joining races.csv with results_agg_total, I will be able to see Grand Prix race names alongside race results. Here is a quick look at races:

head(races)
No alt text provided for this image
results_agg_total <- inner_join(races,results_agg_total,by = "raceId")

head(results_agg_total)
No alt text provided for this image

As the next data preparation step, I will use the unite() function from tidyr package to generate a name column that combines drivers’ first and last names into a new variable that is formatted in a consistent way (in the original version, driverRef is formatted differently for drivers whose surnames are not unique in the dataset):

results_agg_total <- results_agg_total %>% unite(name, forename, surname, sep=" ")

Now, we have a dataframe that includes driver name, race name, and race information, all in a single place. We’re ready for the next step!

Filtering a Dataframe so that only Particular Rows are Retained

I would like to create a new data frame that contains one row for each Grand Prix race winner. This can be achieved with the following code:

results_winners <- results_agg_total %>% filter(position==1)

Creating a dataframe with all podium finishers (1st, 2nd, or 3rd place in a Grand Prix race) can be accomplished with the help of the logical “OR” operator in R, which is the | symbol. 

results_podium_finishers <- results_agg_total %>%

  filter(position==1|position==2|positionText==3)

Creating visualizations using ggplot

For this section, I will use the ggplot2 package to create visualizations to answer the following questions:

  1. Who are the top 10 drivers in Formula-1 history, ranked by total wins ?
  2. Which drivers have accumulated the most race retirements* during their careers?
  3. Who are the top drivers, as ranked by win percentage?
  4. Who are the most retirement-prone drivers, as ranked by retirement percentage?
  5. What are the most common reasons for retirement?

 *Retirement from the race means the driver abandons the car and leaves the race without finishing.  

Cleveland Dot Plot for Top 10 Drivers in Formula-1 history

Cleveland dot plots are commonly used to depict the relationship between categorical and numerical variables. Developed by Cleveland and McGill, these can be a handy alternative to the barplot, as they are less cluttered and therefore may be more aesthetically pleasing. 

 In order to create the dot plot, I have written some additional code to generate a dataframe that includes the full names of the 10 winningest drivers, as shown below:

mostwins <- as.data.frame(sort(table(results_winners$name), decreasing=TRUE)[1:10])

View(mostwins)
No alt text provided for this image

When using the geom_point() function with a categorical variable on one axis and a numeric variable on the other axis, a Cleveland Dot Plot is rendered:

ggplot(mostwins, aes(x = reorder(Var1, Freq),y=Freq)) +
geom_point() + ggtitle('Most Title Wins') + xlab('Driver Name') + ylab('Total
  Titles')+coord_flip()
No alt text provided for this image

As indicated above, Lewis Hamilton is the most successful driver in the history of F-1. The other names listed here are well-known drivers. Followers of F-1 racing may be familiar with the rivalry between Prost and Senna.

Understanding Retirements

retirements <- results_agg_total %>% filter(positionText=="R")
mostrets <- as.data.frame(sort(table(retirements$name), decreasing=TRUE)[1:10])

View(mostrets)
No alt text provided for this image
No alt text provided for this image

Understanding Podium Finishers

most_podium <- as.data.frame(sort(table(results_podium_finishers$name), decreasing=TRUE))
races_total <- as.data.frame(sort(table(results_agg_total$name), decreasing=TRUE))
freq_winners <- inner_join(most_podium, races_total, by="Var1")
freq_winners$pct <- freq_winners$Freq.x/freq_winners$Freq.y   
freq_winners <- arrange(freq_winners, desc(pct))

View(freq_winners)            
No alt text provided for this image

The screenshot above shows that all drivers are now sorted from top to bottom, based on the percent of races in which they achieved one of the top three positions. These results might be misleading, however, as some drivers have just a handful of total races. By removing drivers with fewer than 20 races, we can get a more meaningful depiction of win percentage:

freq_winners2 <- filter(freq_winners, Freq.y >= 20)
freq_winners2 <- slice(freq_winners2, 1:10)

ggplot(freq_winners2, aes(x = reorder(Var1, pct),y=pct)) +

  geom_point() + ggtitle('Most Frequent Podium Placement (20 Or More Races)') + xlab('Driver Name') + ylab('Total Podium Percentage')+ coord_flip()
No alt text provided for this image

Understanding Frequent Winners and Retiree from the Race

most_retire <- as.data.frame(sort(table(retirements$name), decreasing=TRUE))
freq_retire <- inner_join(most_retire, races_total, by="Var1")
freq_retire$pct <- freq_retire$Freq.x/freq_retire$Freq.y   
freq_retire <- arrange(freq_retire, desc(pct))


View(freq_retire)
No alt text provided for this image

Using the same methodology as with the win percentages, we run into the same issue with percentage calculations -- drivers with very small numbers of total races appear at the top of the list, in a way that could be misleading. Again, we will filter this dataframe to only include drivers with 20 or more races, and then select the 10 drivers with the highest percentages from among this group:

freq_ret2 <- filter(freq_retire, Freq.y >= 20)

freq_ret2 <- slice(freq_ret2, 1:10)
No alt text provided for this image

Alternatively, to depict the drivers with the least frequent retirements (among those with 20 or more races), we can use the code shown below. Note that the axis showing retirement percentage is reversed on the graph shown below, so that lower retirement percentages appear further to the right along the axis depicting percentages.

freq_ret3 <- arrange(freq_retire, pct)
freq_ret3 <- slice(freq_ret3, 1:10)
ggplot(freq_ret3, aes(x = reorder(Var1, pct),y=pct)) +

  geom_point() + ggtitle('Least Frequent Retirements (20 Or More Races)') + xlab('Driver Name') + ylab('Total Retirement Percentage')+coord_flip() + scale_y_reverse()
No alt text provided for this image

Historical Reasons for F-1 Race Retirements

Now, let’s try to dig into the data to determine the most frequent reasons for drivers’ retirement from races.  To achieve this task, we will need to merge the status.csv file with the results.csv file, using a process similar to the one shown previously in this article. 

No alt text provided for this image

Gearbox and Spun off are the most technical reasons of retirement other than the engine failure. The spin off usually occurs due to lock-up.  As might be expected with high-speed auto racing, accidents and collisions are also the causes of many retirements.

A look at the results_agg_total dataframe, with an emphasis on win percentage and retirement percentage, shows some interesting results. Here is a look at that dataframe:

#Merge all of the dataframe together using the driverId

results_agg_total <- merge(drivers,results_agg_total,by = "driverId")
results_agg_total <- merge(results_agg,results_agg_total,by = "driverId")
results_agg_total <- merge(results_agg_retire,results_agg_total,by = "driverId")
results_agg_total$win_percentage <- (results_agg_total$TotalTitles/results_agg_total$TotalRaces)*100
results_agg_total$retire_percentage <- (results_agg_total$TotalRetirement/results_agg_total$TotalRaces)*100

View(results_agg_total)
No alt text provided for this image

The above table shows the performance of top drivers in terms of winning percentage vs. retire percentage.

 One of the striking features you can note is that Lewis Hamilton is third on the winning percentage but his retirement percentage is very low that means he is one of the drivers who “finished” lots of races. The greatest of the drivers such as Michael Schumacher, Prost, and Senna had very high retirement percentages in their career.

The article is co-authored by Prof. Greg Page, BU and Abhishek Kumar, Master's student in Applied Business Analytics, BU.

Sheetal Raina

Senior Data Engineer at Visa | MS Data Analytics Student at Penn State University

3y

This is great analysis,Abhishek. Do include the performance of teams like McLaren, Renault and other middle order teams who might lack the engine perfection of Mercedes but are still able to provide F1 fans with a good race weekend😋 Looking forward to the other posts in this series.

To view or add a comment, sign in

Insights from the community

Explore topics