CYCLISTIC BIKE-SHARE COMPANY
INTRODUCTION
This is the Google Data Analytics Professional Certificate course capstone project. In this case study, I will analyze a public dataset for a fictional bike-share company provided by Google. This is a large dataset with millions of rows, so I will be using both SQL an R programming language to analyze the dataset.
In order to answer the key business questions, I’ll follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Summary about Case.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
The case study roadmap below will be followed on each step:
• Code, when necessary
• Key tasks.
• Deliverables.
ASK
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Lily Moreno (the director of marketing and my manager) has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Key tasks
1. Identify the business task
• The main objective is to design marketing strategies aimed at converting casual riders to annual members by understanding how they differ.
2. Consider key stakeholders
• Director of Marketing (Lily Moreno), Marketing Analytics team, Executive team.
Deliverable
A clear statement of the business task
• To find the differences between the casual riders and annual members.
In this case I’m part of marketing analytic team and the manager Lily Moreno has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Understanding the business task.
What is the problem I’m trying to solve?
Cyclistic Company needs to increase their annual membership, the director of marketing and manager has decided the best strategy to use, so the problem that I am trying to solve is to convert casual users to annual members.
In order to use the strategy presented by the director, I need to Analyze Cyclistic’s Rides Reports for 2022 and Identify insights on how annual members and casual riders use cyclistic bikes differently. To accomplish this, I’ll try to answer How, When, Where and Why each rider uses the bike sharing service.
How can your insights drive business decisions?
The goal of this case study is to get insights that will lead us to create a marketing campaign which would lead casual members to be converted to annual members and help the company to be more profitable.
My stakeholders will be the Marketing Director Lily Moreno and Cyclistic executive team that will decide whether to approve the recommended marketing program.
PREPARE
Data Location: I’ll use Cyclistic historical data which is located at their cloud servers to identify trends. You can access data HERE
Key tasks
1. Download data and store it appropriately.
• Data has been downloaded and copies have been stored securely on my computer and here on Kaggle.
2. Identify how it’s organized.
The data is in CSV (comma-separated values) format, and there are a total of 13 columns.
3. Sort and filter the data.
For this analysis, I will be using data for the last 12 months (Feb 2021 - Jan 2022) because it is more current.
4. Determine the credibility of the data.
The data is credible and reliable because it is from the company’s website.
Deliverable
1. A description of all data sources used main source of data provided by the Cylistic company
I proceed to download data and store it on a folder.
Understanding how Data is Organized.
Data is stored monthly on a .csv format files, it is organized in rows and columns. I will proceed to open one file in a spreadsheet to understand what information come in the data.
There are thirteen columns, starting with an unique ID by ride, rideable type, date time columns about when the ride start and stop, start station name and end station name as an ID for each station, latitude and longitude for starting and ending point, and finally the type of user (member or casual).
Are there issues with bias or credibility in this data: There is no biased data because this is full representation of all stations for all rides by day and time, but mainly data comes directly the company’s website.
Does this data ROCCC:
The data is Reliable, Original, Comprehensive, Current and Cited.
For this analysis I got Data from the original source, provided by the company which upload updated monthly reports on csv format. The data has been made available by Motivate International Inc. under this license.
Uploading data to Database
I will Upload all file to a Database on BigQuery so I can start analyzing with SQL. First I uploaded each file and stored them on separate tables on a Database.
Here, I am joining all tables into one so I can analyze all the dataset better. Now I’ll have the full year report in just one table.
Here is the first view of a full year 2022 report for Cyclistic. As you can see in the bottom, the full report contains 5,667,717 records.
Verifying Data Integrity
Data Type
First I proceed to check if all columns have the correct Data Type.
Null fields
Now I check for null fields in the table, consulting the columns I will use for analysis and using the function COUNTIF.
Got Nulls on two columns
I got null fields in start station and end station columns, but there are so many nulls, so it may be just missing information.
Therefore I need to dive in to understand those null fields. In this query I create a new column calculating rides length, so I can have an idea if nulls correspond to errors or just there is some missing information.
Null fields represent a lot of rides and many ride hours, so I don’t feel confidence deleting all this fields. In this case while is not possible to communicate with the company owner of the dataset, I will assign ‘ambulatory’ name both for start and end stations.
Duplicated fields
I have 5 million records, so a small percentage of duplicated fields could lead to a wrong analysis.
First I analyze Ride ID: Using functions COUNT + GROUP BY and HAVING
No duplicated IDs
Even when I got not duplicated fields, could exist some mistyping. To find it out, I query for fields with a different length of characters.
LENGTH FUNCTION.
All Ride IDs has the same length
Now I look for duplicates or mistyping on User Type.
Any mistyping on User's Type Column
In case of start station name, I review all stations with same name.
Everything looks correct
Watch out for the Process and Analyze Phase in my next post. Thanks for reading. Kindly like, share and leave a comment below.
PROCESS PHASE
I used R to process and analyse the dataset. Then I installed and loaded the following packages.
Install and load necessary packages
install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("geosphere")
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(geosphere)
Import data to R studio
202201_divvy <- read_csv("C:/Users/User/Downloads/202201_divvy.xlsx")
202202_divvy <- read_csv("C:/Users/User/Downloads/202202_divvy.xlsx")
202203_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202203_divvy_tripdata.xlsx")
202204_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202204_divvy_tripdata.xlsx")
202205_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202205_divvy_tripdata.xlsx")
202206_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202206_divvy_tripdata.xlsx")
202207_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202207_divvy_tripdata.xlsx")
202208_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202208_divvy_tripdata.xlsx")
202209_divvy_publictripdata <- read_csv("C:/Users/User/Downloads/202209_divvy_publictripdata.xlsx")
202210_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202210_divvy_tripdata.xlsx")
202211_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202211_divvy_tripdata.xlsx")
202212_divvy_tripdata <- read_csv("C:/Users/User/Downloads/202212_divvy_tripdata.xlsx")
Now I have verified all data integrity, so let proceed to PROCESS step.
PROCESS
Key tasks
1. Check the data for errors.
2. Choose your tools.
3. Transform the data so you can work with it effectively.
4. Document the cleaning process.
Deliverable
1. Documentation of any cleaning or manipulation of data
I used SQL to read and clean Data in order to get it ready to be analyzed. There are so many SQL tools, but in this case I will use Google BigQuery, a cloud based SQL tool.
Now that I have checked the data, I’m going the clean it, which include filtering, selecting columns I will use for final analysis and creating some calculations.
Cleaned dataset
As you can see the cleaned dataset has 5,590,846 records, and now we can analyze rides by month, day of week or by hours.
The following code chunks will be used for this phase.
head(trip_data) #see the first 6 rows of the data frame
nrow(trip_data) #how many rows are in the data frame
colnames(trip_data) #list of column names
dim(trip_data) #dimensions of the data frame
summary(trip_data) #statistical summary of data, mainly for numerics
str(trip_data) #see list of columns and data types
tail(trip_data) #see the last 6 rows of the data frame
Adding columns for date, month, year, day of the week into the data frame.
trip_data$date <- as.Date(trip_data$started_at)
trip_data$month <- format(as.Date(trip_data$date), "%m")
trip_data$day <- format(as.Date(trip_data$date), "%d")
trip_data$year <- format(as.Date(trip_data$date), "%Y")
Recommended by LinkedIn
trip_data$day_of_week <- format(as.Date(trip_data$date), "%A")
colnames(trip_data) #to get the names of all the columns
Add a ride_length calculation to trip_data
trip_data$ride_length <- difftime(trip_data$ended_at, trip_data$started_at)
str(trip_data) #to inspect the structure of the columns
Convert ride_length from Factor to Numeric in order to run calculations
trip_data$ride_length <- as.numeric(as.character(trip_data$ride_length))
is.numeric(trip_data$ride_length) #to confirm it is now numeric
Add ride_distance calculation to trip_data
trip_data$ride_distance <- distGeo(matrix(c(trip_data$start_lng, trip_data$start_lat), ncol=2), matrix (c(trip_data$end_lng, trip_data$end_lat), ncol=2))
trip_data$ride_distance <- trip_data$ride_distance/1000 #distance in km
Remove "bad" data
The data frame includes a few hundred entries when bikes where taken out of docks and checked for quality by Divvy where ride_length was negative or "zero"
trip_data_clean <- trip_data[!(trip_data$ride_length <= 0),]
glimpse(trip_data_clean) #gives column names and data in the column
ANALYZE
In this phase, I will use Tableau to see how members and casual riders are using cyclistic bike-share.
All the required information are now in one place and ready for exploration.
Key tasks
1. Aggregate your data so it’s useful and accessible.
2. Organize and format your data.
3. Perform calculations.
4. Identify trends and relationships.
Deliverable
1. A summary of the analysis.
From the above graph, we can observe that casual riders went a longer distance compared to members by a few kilometres.
Finally, I’ll try to find trends and relationships on data, so I can understand better how Users use Cyclistic bike-sharing differently.
Just to refresh, all queries will include a comparison between annual Member Vs Casual riders.
For this analysis I’ll be using the following functions: COUNT, SUM, AVG, MAX, MIN, ROUND, WHERE, GROUP BY and ORDER BY
General Records
Let’s get a first general metric of Members Vs Casual Users.
For that, I will count rides, get total ride hours, average ride minutes, max ride hours and min ride seconds.
Here we can see, at this time Cyclistic has 44%(1,007,466.00) more rides than members, but casual users represent 59%(419,044.95) more ride hours than members and a bigger average ride minutes.
* Percentage calculation comes from the difference on total_rides and total_ride_hours columns.
Monthly Metrics
Now I’m going to review Monthly User’s behaviour. Here I expect to find which month has the bigger demand by riders.
Functions used here: COUNT, SUM, AVG, MAX, MIN, ROUND, GROUP BY, ORDER BY
Monthly Metric - Query
Monthly Metric - Result
Here we see June, July, August and September are top month for both, members with an average of 400k per month that mean 22% of all rides and casual users with and average 350k rides represent 25% of all rides and 62% of casual rides.
While July, June and August represent top month for casual users on total ride hours with up 190k per month. In fact, Casual Users has double Ride Time vs. Members and the same with average ride minutes.
Day of Week - Metrics
I expect to find which day of week has the highest demand by different users.
Day of Week Metric - Query
Day of Week Metric - Result
Members: Top ride days are Tuesday, Wednesday and Thursday, meaning 28% of total rides, but in general member’s rides are almost the same during all the week. Average ride around 12 minutes increasing a little bit on weekend to 14 minutes.
Casual: Top ride days are Saturday and Sunday, meaning 17% of total rides, but the ride time on just weekends represent 26%. Average ride close to 30 minutes.
Hourly Metrics
¿What’s the preferred time of the day for taking rides in case of Members vs Casual users? I’ll know in the following query.
Hourly Metric - Query
Hourly Metric - Result
Members has a ride peak at 8 mornings and at 17 afternoons while casual users tend to ride more at afternoon.
Top ride hours for both members and casual users goes from 14 to 19, almost half of all rides are done at this hour range.
Members has 46% of all its rides at these hours and 48% of all its ride hours while casual does 48% of its rides and 47% of its ride time.
Average ride keeps around 13 minutes for members and 28 for casual users.
Top Stations
I expect to find any stations preference by users.
Top Stations - Query
Top Stations - Result
Ambulatory that could mean a place at not specific station represent 9% of all rides for members and 6% for casual users.
For Stations we know a name, there is just one with a big difference of rides (Streeter Dr & Gran Ave.) 57k rides, in fact first four stations with more rides are for casual users hitting 140k rides.
Any way ride’s difference between stations is not too representative, at least 100 stations has up 10k rides. Top 10 stations for members represent just 3.6% of all rides and top 10 stations for casual users represent just 4.4% of all rides, but thanks to this query I can know exactly which stations are the most used by casual users.
Rideable Type Metrics
Now I want to know which is the preferred Rideable Type for each user type.
Rideable Type - Query
Rideable Type - Result
Members: Don’t have ride’s preference for rideable type, classic and electric are used almost equally with 30% for classic and 29% for electric. For ride time (longer rides), members have a little preference for classic bike meaning 22% of all ride time.
Casuals: Have a marked preference for electric bikes hitting 53% of casual rides and 22% of all rides, but for mid length rides, casuals prefer classic bikes too.
Casual’s classic rides represent the bigger quantity of ride hours with 23% of all ride hours and the biggest average ride is too for casuals with 123.38 minutes.
Long Rides Vs Short Rides.
While I was diving into data, could notice there is more rides with ride length < 24 hours than rides > 24 hours, so I wanted to know what users likes more.
We have seen along the analysis casual users usually takes longer rides than members and I can verify that even with short rides this pattern is the same, casual users get a ride average of 22.14 minutes while members just keep in 12.57.
Following code chunks will be used for this phase.
str(trip_data_clean) #first lets check the structure of the data frame
summary(trip_data_clean) #to check the summarized details of the clean data frame
Conduct descriptive analysis
Descriptive analysis on 'ride_length'
mean = straight average (total ride length / total rides)
median = midpoint number of ride length
max = longest ride
min = shortest ride
trip_data_clean %>%
group_by(member_casual) %>%
summarise(average_ride_length = mean(ride_length), median_length = median(ride_length),
max_ride_length = max(ride_length), min_ride_length = min(ride_length))
Let's visualize members and casuals by the total ride taken (ride count)
trip_data_clean %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id))
• From the above graph, we can observe that there are more member riders compared to casual rides based on the ride count.
Let's see total rides and average ride time by each day for members vs casual riders
# lets order the days of the week
trip_data_clean$day_of_week <- ordered(trip_data_clean$day_of_week,
levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
trip_data_clean %>%
group_by(member_casual, day_of_week) %>% #groups by member_casual
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_ride_length = mean(ride_length),.groups="drop") %>% # calculates the average duration
arrange(member_casual, day_of_week) #sort
let's visualize the above table by days of the week and number of rides taken by member and casual riders.
trip_data_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n(), .groups="drop") %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
labs(title ="Total rides of Members and Casual riders Vs. Day of the week") +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
From the above graph, it is observed that casual riders have the highest number of rides on the weekend(saturday and sunday) compared to the other days while members are quite consistent but they have the lowest number of rides on the weekend.
Let's visualize average ride by day of the week
trip_data_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(average_ride_length = mean(ride_length), .groups="drop") %>%
ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
labs(title ="Average ride time of Members and Casual riders Vs. Day of the week")
• From the above graph, we can observe that casual riders ride for a longer time during the week with the highest ride on the weekends while members drive at a consistent pace during the week with the highest rides on the weekend.
Let's visualize the total rides taken by members and casuals by month
trip_data_clean %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n(),.groups="drop") %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
labs(title ="Total rides by Members and Casual riders by Month") +
theme(axis.text.x = element_text(angle = 45)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
• From the above graph we can observe that members had higher number of rides all through the year except in June, July and August where casual riders took more rides.
Let's compare Members and Casual riders depending on ride distance
trip_data_clean %>%
group_by(member_casual) %>% drop_na() %>%
summarise(average_ride_distance = mean(ride_distance)) %>%
ggplot() +
geom_col(mapping= aes(x= member_casual,y= average_ride_distance,fill=member_casual), show.legend = FALSE)+
labs(title = "Mean distance traveled by Members and Casual riders")
Summary Analysis
Last year 2022 Cyclistic had 5,590,846 rides, 40.1% Casual Users and 59.9% Members. About Ride Time, there were 1,836,571 ride hours, 61.41% for Casual Users and 38.59% for Members. Average ride minute was 21.21, 29.53 for Casual Users and 12.89 for Members. Here we can see an indicator that tells us Casual Users tend to take longer rides than Members.
Hottest months for both Members and Casual were from June to September which is expected given this is summer season, representing 29% for Members and 25% for Casuals of all rides and 62% of total casual rides.
The behavior on days of week show Members has slightly more rides from Tuesday to Thursday but in general all days of week has the same average ride around 400k rides, having Sunday the lower ride demand. In case of Casual User, has a ride’s preference for weekends with 37% of total casual rides vs. 63% rides on weekdays. Ride Time average even increase on weekends meaning 42% of total casual ride hours.
Members has some peak hours at 8 and 17, reaching 100k rides at 7 mornings while Casual hit 100k rides at 11 and have its peak ride hour at 17 same as member.
Both users have the same range hours with more rides, from 14 to 19, hitting almost 50% of all members and casual rides and ride time, while average ride is 13 for members and 28 for casual users.
Regarding Stations there’s just one with an important difference vs. other, Streeter Dr & Grand Ave with 57,525 rides by Casuals, but that is not too representative.
Out of 1673 Stations, Top 25, represent 13% of total member rides and 12% of total ride hours while for casuals are 18% rides and 24% ride time.
For complete the understanding of ride habits, Long Rides (Rides>24 hours) represent just 0.1% of all rides, but 16% of total ride time and casual user owner 87% of long rides, 94% of ride hours, representing 15% of all ride time.
For short rides difference is lower, 59% for members and 41% for casual riders, but average ride is just 12.57 for members and 22.14 for casual users.
Thanks for reading, any feedback is welcome.