CYCLISTIC BIKE-SHARE COMPANY

Maryann Onyeka

--Junior data analyst

Published Jan 13, 2024

INTRODUCTION

This is the capstone project for the Google Data Analytics Professional Certificate. In this case study, I will analyze a public dataset for a fictional bike-share company provided by Google. The dataset is large, with millions of rows, so I will be using both SQL and the R programming language to analyze it.

In order to answer the key business questions, I’ll follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Summary of the Case

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

The case study roadmap below will be followed on each step:

• Code, when necessary

• Key tasks.

• Deliverables.

ASK

Three questions will guide the future marketing program:

1. How do annual members and casual riders use Cyclistic bikes differently?

2. Why would casual riders buy Cyclistic annual memberships?

3. How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno (the director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Key tasks

1. Identify the business task

• The main objective is to design marketing strategies aimed at converting casual riders to annual members by understanding how they differ.

2. Consider key stakeholders

• Director of Marketing (Lily Moreno), Marketing Analytics team, Executive team.

Deliverable

A clear statement of the business task

• To find the differences between the casual riders and annual members.

In this case, I’m part of the marketing analytics team, and my manager, Lily Moreno, has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Understanding the business task.

What is the problem I’m trying to solve?

Cyclistic needs to increase its number of annual memberships, and the director of marketing has already decided on the strategy. The problem I am trying to solve is therefore how to convert casual riders into annual members.

To support the strategy presented by the director, I need to analyze Cyclistic’s ride reports for 2022 and identify insights into how annual members and casual riders use Cyclistic bikes differently. To accomplish this, I’ll try to answer how, when, where, and why each type of rider uses the bike-sharing service.

How can your insights drive business decisions?

The goal of this case study is to get insights that will help us create a marketing campaign to convert casual riders into annual members and help the company become more profitable.

My stakeholders will be the Marketing Director Lily Moreno and Cyclistic executive team that will decide whether to approve the recommended marketing program.

PREPARE

Data Location: I’ll use Cyclistic’s historical trip data, which is hosted on the company’s cloud servers, to identify trends. You can access the data HERE

Key tasks

1. Download data and store it appropriately.

• Data has been downloaded and copies have been stored securely on my computer and here on Kaggle.

2. Identify how it’s organized.

The data is in CSV (comma-separated values) format, and there are a total of 13 columns.

3. Sort and filter the data.

For this analysis, I will be using the 12 monthly files for 2022 (Jan 2022 - Dec 2022) because they are the most current.

4. Determine the credibility of the data.

The data is credible and reliable because it is from the company’s website.

Deliverable

1. A description of all data sources used: the main source is the trip data provided by Cyclistic.

I downloaded the data and stored it in a folder.

Understanding how Data is Organized.

The data is stored in monthly .csv files, organized in rows and columns. I opened one file in a spreadsheet to understand what information the data contains.

There are thirteen columns: a unique ID for each ride, the rideable type, the date-time columns for when the ride started and ended, the start and end station names with an ID for each station, the latitude and longitude of the starting and ending points, and finally the type of user (member or casual).

Are there issues with bias or credibility in this data? I found no sign of bias, because the data covers all rides at all stations by day and time, and it comes directly from the company’s website.

Does this data ROCCC:

The data is Reliable, Original, Comprehensive, Current and Cited.

For this analysis I got the data from the original source: the company, which uploads updated monthly reports in CSV format. The data has been made available by Motivate International Inc. under this license.

Uploading data to Database

I will upload all the files to a database in BigQuery so I can start analyzing with SQL. First I uploaded each file and stored it in a separate table.

Here, I combine all the monthly tables into one so I can analyze the whole dataset more easily. Now I have the full-year report in a single table.

Here is the first view of the full-year 2022 report for Cyclistic. As you can see at the bottom, the full report contains 5,667,717 records.

Verifying Data Integrity

Data Type

First I check whether all columns have the correct data type.

Null fields

Now I check for null fields in the table, querying the columns I will use for analysis with the COUNTIF function.

Nulls found in two columns

I found null fields in the start station and end station columns. There are so many of them that they are probably missing information rather than errors.
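The original check was run in BigQuery and the query is not reproduced here. As a rough equivalent, the same count of missing values per column can be sketched in R. The data frame below is a made-up stand-in for the trips table; only the column names follow the dataset.

```r
# Toy stand-in for the trips table (values invented for illustration)
trips <- data.frame(
  ride_id            = c("A1", "A2", "A3", "A4"),
  start_station_name = c("Clark St", NA, "State St", NA),
  end_station_name   = c("State St", "Clark St", NA, NA),
  member_casual      = c("member", "casual", "casual", "member"),
  stringsAsFactors   = FALSE
)

# Count missing values in every column,
# similar in spirit to COUNTIF(column IS NULL) in SQL
null_counts <- colSums(is.na(trips))
print(null_counts)
```

Columns with a count of zero are complete; any other column needs a closer look before it is used in the analysis.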

Therefore I need to dig deeper to understand those null fields. In this query I create a new column that calculates ride length, so I can get an idea of whether the nulls correspond to errors or whether some information is simply missing.

The null fields represent a lot of rides and many ride hours, so I don’t feel confident deleting all of these records. Since it is not possible to contact the company that owns the dataset, I will assign the name ‘ambulatory’ to both the start and end stations where the name is missing.
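The reasoning above can be sketched in R as follows. This is only an illustration on invented rows, not the BigQuery query used for the analysis: it computes ride length in minutes, measures how many rides and minutes sit behind the missing station names, and then labels those stations instead of dropping the rides. The column name ride_length_min is my own choice for this sketch.

```r
# Toy data (invented); column names follow the dataset
trips <- data.frame(
  ride_id    = c("A1", "A2", "A3", "A4"),
  started_at = as.POSIXct(c("2022-01-01 08:00:00", "2022-01-01 09:00:00",
                            "2022-01-01 10:00:00", "2022-01-01 11:00:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2022-01-01 08:15:00", "2022-01-01 09:30:00",
                            "2022-01-01 10:45:00", "2022-01-01 11:10:00"), tz = "UTC"),
  start_station_name = c("Clark St", NA, "State St", NA),
  end_station_name   = c("State St", "Clark St", NA, NA),
  stringsAsFactors   = FALSE
)

# Ride length in minutes (hypothetical column name)
trips$ride_length_min <- as.numeric(
  difftime(trips$ended_at, trips$started_at, units = "mins")
)

# How many rides, and how much riding time, have a missing station name?
missing_station <- is.na(trips$start_station_name) | is.na(trips$end_station_name)
rides_missing   <- sum(missing_station)
minutes_missing <- sum(trips$ride_length_min[missing_station])

# Keep those rides and label the missing stations instead of deleting them
trips$start_station_name[is.na(trips$start_station_name)] <- "ambulatory"
trips$end_station_name[is.na(trips$end_station_name)]     <- "ambulatory"
```

If the rides with missing stations had near-zero lengths, deleting them as errors would be easier to justify; substantial ride time argues for keeping them.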

Duplicated fields

I have more than 5 million records, so even a small percentage of duplicate records could lead to a wrong analysis.

First I analyze the ride ID, using COUNT with GROUP BY and HAVING.

No duplicated IDs
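For readers following along in R rather than SQL, a comparable duplicate check might look like the sketch below. The IDs are invented, and one is repeated on purpose so the check has something to find.

```r
# Invented ride IDs; "A2" appears twice on purpose
ride_ids <- c("A1", "A2", "A3", "A2")

# Count occurrences per ID and keep only IDs seen more than once,
# similar in spirit to GROUP BY ride_id HAVING COUNT(*) > 1
id_counts      <- table(ride_ids)
duplicated_ids <- names(id_counts[id_counts > 1])
print(duplicated_ids)
```

An empty result would correspond to the "no duplicated IDs" outcome reported above.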

Even though I found no duplicates, some IDs could still be mistyped. To find out, I query for IDs with a different number of characters.

LENGTH FUNCTION.

All ride IDs have the same length
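A length check of this kind can be sketched in R with nchar(). The IDs below are invented; the third one is deliberately one character short so that more than one length shows up.

```r
# Invented IDs; the last one has 15 characters instead of 16
ride_ids <- c("C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF741C66")

# Distinct ID lengths present in the data
lengths_seen <- sort(unique(nchar(ride_ids)))

# IDs whose length differs from the most common length
common_length <- as.integer(names(which.max(table(nchar(ride_ids)))))
odd_ids       <- ride_ids[nchar(ride_ids) != common_length]
print(odd_ids)
```

A single value in lengths_seen means every ID has the same length, which is the result reported for this dataset.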

Now I look for duplicates or mistyping in the user type column.

No mistyping in the user type column

For the start station name, I review all stations that share the same name.

Everything looks correct
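As a sketch of what these last two checks could look like in R, the example below lists user types that are not exactly "member" or "casual", and flags station names that differ only by letter case or surrounding spaces. The rows are invented, with one mistyped value of each kind, and the normalisation rule (lower case plus trimmed whitespace) is my own assumption about what counts as a likely typo.

```r
# Invented rows with one mistyped user type and one station name variant
trips <- data.frame(
  member_casual      = c("member", "casual", "casual", "Member"),
  start_station_name = c("Clark St", "Clark St ", "State St", "State St"),
  stringsAsFactors   = FALSE
)

# User types other than the two expected values
unexpected_types <- setdiff(unique(trips$member_casual), c("member", "casual"))

# Station names that collapse to the same value once case and
# surrounding whitespace are ignored
normalised <- tolower(trimws(trips$start_station_name))
variants_per_station <- tapply(
  trips$start_station_name, normalised,
  function(x) length(unique(x))
)
suspect_stations <- names(variants_per_station[variants_per_station > 1])
```

Empty results for both unexpected_types and suspect_stations would match the "everything looks correct" outcome.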

Watch out for the Process and Analyze Phase in my next post. Thanks for reading. Kindly like, share and leave a comment below.

PROCESS PHASE

I used R to process and analyse the dataset. First I installed and loaded the following packages.

Install and load necessary packages

install.packages("tidyverse")

install.packages("lubridate")

install.packages("ggplot2")

install.packages("dplyr")

install.packages("geosphere")

library(tidyverse)

library(lubridate)

library(ggplot2)

library(dplyr)

library(geosphere)

Import data into RStudio

# Object names start with "divvy_" because R names cannot begin with a digit.

# Assumption: the monthly files are saved as .csv, as described in the Prepare phase.

divvy_202201 <- read_csv("C:/Users/User/Downloads/202201_divvy.csv")

divvy_202202 <- read_csv("C:/Users/User/Downloads/202202_divvy.csv")

divvy_202203 <- read_csv("C:/Users/User/Downloads/202203_divvy_tripdata.csv")

divvy_202204 <- read_csv("C:/Users/User/Downloads/202204_divvy_tripdata.csv")

divvy_202205 <- read_csv("C:/Users/User/Downloads/202205_divvy_tripdata.csv")

divvy_202206 <- read_csv("C:/Users/User/Downloads/202206_divvy_tripdata.csv")

divvy_202207 <- read_csv("C:/Users/User/Downloads/202207_divvy_tripdata.csv")

divvy_202208 <- read_csv("C:/Users/User/Downloads/202208_divvy_tripdata.csv")

divvy_202209 <- read_csv("C:/Users/User/Downloads/202209_divvy_publictripdata.csv")

divvy_202210 <- read_csv("C:/Users/User/Downloads/202210_divvy_tripdata.csv")

divvy_202211 <- read_csv("C:/Users/User/Downloads/202211_divvy_tripdata.csv")

divvy_202212 <- read_csv("C:/Users/User/Downloads/202212_divvy_tripdata.csv")
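The later code works on a single data frame called trip_data, so the monthly data frames need to be stacked into one first. That step is not shown in the article; a minimal sketch, assuming every month has the same columns, could look like this. The two small data frames (jan and feb) are hypothetical stand-ins for the imported files.

```r
# Hypothetical stand-ins for two of the imported monthly data frames
jan <- data.frame(ride_id = c("A1", "A2"),
                  member_casual = c("member", "casual"),
                  stringsAsFactors = FALSE)
feb <- data.frame(ride_id = "B1",
                  member_casual = "casual",
                  stringsAsFactors = FALSE)

monthly <- list(jan, feb)

# Stacking only makes sense if all months share the same columns
same_columns <- all(sapply(
  monthly,
  function(df) identical(colnames(df), colnames(monthly[[1]]))
))
stopifnot(same_columns)

# Append the months row-wise into one data frame
trip_data <- do.call(rbind, monthly)
nrow(trip_data)
```

With the tidyverse loaded, dplyr::bind_rows() is a common alternative to do.call(rbind, ...) for the same purpose.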

Now that I have verified the integrity of the data, let’s proceed to the PROCESS step.

PROCESS

Key tasks

1. Check the data for errors.

2. Choose your tools.

3. Transform the data so you can work with it effectively.

4. Document the cleaning process.

Deliverable

1. Documentation of any cleaning or manipulation of data

I used SQL to read and clean the data in order to get it ready for analysis. There are many SQL tools, but in this case I will use Google BigQuery, a cloud-based SQL tool.

Now that I have checked the data, I’m going to clean it, which includes filtering, selecting the columns I will use for the final analysis, and creating some calculated columns.

Cleaned dataset

As you can see the cleaned dataset has 5,590,846 records, and now we can analyze rides by month, day of week or by hours.
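The cleaning queries themselves are not shown, so the following is only a sketch of what such a step could look like in R. It assumes, purely for illustration, that rides with zero or negative duration are dropped and that only a handful of columns are kept; the actual filters used to arrive at 5,590,846 records may differ. The column name ride_length_min is hypothetical.

```r
# Invented raw rows: one normal ride, one zero-length ride,
# one ride that "ends" before it starts, and one short ride
raw <- data.frame(
  ride_id       = c("A1", "A2", "A3", "A4"),
  rideable_type = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  started_at    = as.POSIXct(c("2022-03-01 08:00:00", "2022-03-01 09:00:00",
                               "2022-03-01 10:00:00", "2022-03-01 11:00:00"), tz = "UTC"),
  ended_at      = as.POSIXct(c("2022-03-01 08:20:00", "2022-03-01 09:00:00",
                               "2022-03-01 09:50:00", "2022-03-01 11:05:00"), tz = "UTC"),
  member_casual = c("member", "casual", "casual", "member"),
  start_lat     = c(41.90, 41.88, 41.89, 41.91),
  stringsAsFactors = FALSE
)

# Calculated column: ride length in minutes
raw$ride_length_min <- as.numeric(
  difftime(raw$ended_at, raw$started_at, units = "mins")
)

# Filter out non-positive durations and keep only the columns needed later
keep_cols <- c("ride_id", "rideable_type", "started_at", "ended_at",
               "member_casual", "ride_length_min")
cleaned <- raw[raw$ride_length_min > 0, keep_cols]
nrow(cleaned)
```

Comparing nrow(raw) with nrow(cleaned) gives the number of records removed, which is worth recording as part of the cleaning documentation.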

The following code chunks will be used for this phase.

head(trip_data) #see the first 6 rows of the data frame

nrow(trip_data) #how many rows are in the data frame

colnames(trip_data) #list of column names

dim(trip_data) #dimensions of the data frame

summary(trip_data) #statistical summary of data, mainly for numerics

str(trip_data) #see list of columns and data types

tail(trip_data) #see the last 6 rows of the data frame

Adding columns for the date, month, day, year, and day of the week to the data frame.

trip_data$date <- as.Date(trip_data$started_at)

trip_data$month <- format(as.Date(trip_data$date), "%m")

trip_data$day <- format(as.Date(trip_data$date), "%d")

trip_data$year <- format(as.Date(trip_data$date), "%Y")

trip_data$day_of_week <- format(as.Date(trip_data$date), "%A") #full weekday name, e.g. Monday
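To see what these derived columns contain, here is a small self-contained sketch on two invented timestamps. It repeats the date columns and also shows a numeric weekday and a ride length in minutes; the names weekday_num and ride_length are my own choices for this example, not columns from the original data.

```r
# Two invented rides (a weekday morning and a weekend afternoon)
trip_data <- data.frame(
  started_at = as.POSIXct(c("2022-01-03 08:00:00", "2022-01-08 14:30:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2022-01-03 08:25:00", "2022-01-08 15:00:00"), tz = "UTC")
)

trip_data$date  <- as.Date(trip_data$started_at)
trip_data$month <- format(trip_data$date, "%m")
trip_data$day   <- format(trip_data$date, "%d")
trip_data$year  <- format(trip_data$date, "%Y")

# Weekday name (depends on the system locale) and
# ISO weekday number (1 = Monday ... 7 = Sunday, locale-independent)
trip_data$day_of_week <- format(trip_data$date, "%A")
trip_data$weekday_num <- as.integer(format(trip_data$date, "%u"))

# Ride length in minutes
trip_data$ride_length <- as.numeric(
  difftime(trip_data$ended_at, trip_data$started_at, units = "mins")
)

str(trip_data)
```

Note that format() returns character columns, so month, day, and year sort as text; the numeric weekday is convenient when days need to be ordered from Monday to Sunday in a chart.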
