Introduction of the company
Cyclistic is a bike-share company in Chicago. Their bike-share program features more than 5,800 bicycles and 600 docking stations. The company distinguishes itself from its competitors by offering reclining bikes, hand tricycles, and cargo bikes providing an opportunity for people with disabilities.
The company offers three pricing plans:
- single-ride passes
- full-day passes
- annual membership
Customers who purchase single-ride or full-day
passes are reffered to as casual riders
and customers who
purchase annual membership are reffered to as
Cyclistic members
.
Business Task
Annual members are much more profitable and the director of marketing, Lily Moreno, believes that maximizing the number of annual members will be the key to future growth. Moreno also believes that there is a very good chance to convert casual riders into members rather than creating a marketing campaign that targets all-new customers.
Goal:
- Design marketing strategies aimed at converting casual riders into annual members
Answer following questions:
- How do annual members and casual riders use Cyclistic bikes
differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
My Task
As a junior data analyst, you have been asked to analyze data to help with the solution of the the first question: how do annual members and casual riders use Cyclistic bikes differently.
Data Used for the Analysis
I examined 12 months of data reaching from the 2nd quartile of 2019 to the 1st quartile of 2020.
Data source: divvy_tripdata
Question 1 : How do annual members and casual riders use Cyclistic bikes differently?
Loading Packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Collecting Data
<- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q2.csv") q2_2019
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q3.csv") q3_2019
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q4.csv") q4_2019
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2020_Q1.csv") q1_2020
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning Data
Renaming columns to be consistent with q1_2020
<- rename(q4_2019,
q4_2019 ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
<- rename(q3_2019,
q3_2019 ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
<- rename(q2_2019,
q2_2019 ride_id = "01 - Rental Details Rental ID",
rideable_type = "01 - Rental Details Bike ID",
started_at = "01 - Rental Details Local Start Time",
ended_at = "01 - Rental Details Local End Time",
start_station_name = "03 - Rental Start Station Name",
start_station_id = "03 - Rental Start Station ID",
end_station_name = "02 - Rental End Station Name",
end_station_id = "02 - Rental End Station ID",
member_casual = "User Type")
Converting ‘ride_id’ and ‘rideable_type’ in 2019 to ‘character’ to match 2020
<- mutate(q4_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
q4_2019 <- mutate(q3_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q2_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type)) q2_2019
Combining data frames
<- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020) all_trips
Removing data points that were not used since 2020
<- all_trips %>%
all_trips select(-c(start_lat,start_lng, end_lat, end_lng,birthyear, gender,"01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
Renaming ‘member_casual’ entries
<- all_trips %>%
all_trips mutate(member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual"))
Converting ‘ride_length’ to numeric
$ride_length <- as.numeric(as.character(all_trips$ride_length)) all_trips
Removing irrelevant entries
<- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length < 0),] all_trips_v2
Descriptive Analysis
Analysing ‘ride_length’
%>%
all_trips_v2 summarise(average_ride_length = mean(ride_length),
median_ride_length = median(ride_length), # the middle number
longest_ride_length = max(ride_length),
shortest_ride_length = min(ride_length))
## # A tibble: 1 × 4
## average_ride_length median_ride_length longest_ride_length shortest_ride_len…¹
## <dbl> <dbl> <dbl> <dbl>
## 1 1479. 712 9387024 1
## # … with abbreviated variable name ¹shortest_ride_length
Comparing members and casual users ride length
%>%
all_trips_v2 group_by(member_casual) %>%
summarise(average_ride_length = mean(ride_length),
median_ride_length = median(ride_length), # the middle number
longest_ride_length = max(ride_length),
shortest_ride_length = min(ride_length))
## # A tibble: 2 × 5
## member_casual average_ride_length median_ride_length longest_ride_le…¹ short…²
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 casual 3553. 1546 9387024 2
## 2 member 850. 589 9056634 1
## # … with abbreviated variable names ¹longest_ride_length, ²shortest_ride_length
Correcting the order of day_of_week
$day_of_week <- ordered(all_trips_v2$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) all_trips_v2
Average ride time by each day for members and casual users
%>%
all_trips_v2 group_by(day_of_week, member_casual) %>%
summarise(average_length_of_ride = mean(ride_length),
number_of_rides = n()) %>%
arrange(day_of_week)
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: day_of_week [7]
## day_of_week member_casual average_length_of_ride number_of_rides
## <ord> <chr> <dbl> <int>
## 1 Monday casual 3372. 103296
## 2 Monday member 843. 472196
## 3 Tuesday casual 3596. 90510
## 4 Tuesday member 826. 508445
## 5 Wednesday casual 3719. 92457
## 6 Wednesday member 824. 500329
## 7 Thursday casual 3683. 102679
## 8 Thursday member 824. 484177
## 9 Friday casual 3774. 122404
## 10 Friday member 825. 452790
## 11 Saturday casual 3332. 209543
## 12 Saturday member 969. 287958
## 13 Sunday casual 3581. 181293
## 14 Sunday member 920. 267965
Visualizating Data
Length of rides
%>%
all_trips_v2 group_by(day_of_week, member_casual) %>%
summarise(average_length_of_ride = mean(ride_length),
number_of_rides = n()) %>%
arrange(day_of_week) %>%
ggplot(mapping = aes(x = day_of_week, y = average_length_of_ride, fill = member_casual)) + geom_col(position = "dodge", color = "black") +
labs(title = "The Length of Rides in Seconds", subtitle = "The difference in the length of rides between members and casual riders", x = "Day of Week", y = "Average Length of Ride", fill = "Casuals vs Members")
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
Number of Rides
%>%
all_trips_v2 group_by(day_of_week, member_casual) %>%
summarise(average_length_of_ride = mean(ride_length),
number_of_rides = n()) %>%
arrange(day_of_week) %>%
ggplot(mapping = aes(x = day_of_week, y = number_of_rides, fill = member_casual)) + geom_col(color = "black") +
facet_wrap(~member_casual) +
labs(title = "Number of Rides", subtitle = "The difference in the number of rides between members and casual riders", x = "Day of Week", y = "Average Number of Rides", fill = "Casuals vs Members") +
theme(axis.text.x = element_text(angle = 45)) + # changing the angle of x axis labels
expand_limits( x = c(0,NA), y = c(0,NA)) + # changes the outcome
scale_y_continuous(labels = scales::comma) # of y to readable numbers
Conclusion
Analyzing and visualizing the data provided us with two main differences between casual riders and members:
- Casual riders use bikes for significantly longer period of time
compared to members
- Casual riders do not rent bikes as often as members
The simple analysis has a potential to lead the marketing team to a data-driven decision. We have found out, that casual riders are much more likely to rent a bike for a longer duration of time which we can use to adjust the membership program so it is more lucrative for them to become members.
One of the options might be to provide a discount for future renting for every 10km a customer rides. Casual riders do not use the renting service that often but if the discount (or other benefits) was significant enough, they might be willing to become members.