Introduction of the company

Cyclistic is a bike-share company in Chicago. Their bike-share program features more than 5,800 bicycles and 600 docking stations. The company distinguishes itself from its competitors by offering reclining bikes, hand tricycles, and cargo bikes providing an opportunity for people with disabilities.

The company offers three pricing plans:

single-ride passes
full-day passes
annual membership

Customers who purchase single-ride or full-day passes are reffered to as casual riders and customers who purchase annual membership are reffered to as Cyclistic members.

Business Task

Annual members are much more profitable and the director of marketing, Lily Moreno, believes that maximizing the number of annual members will be the key to future growth. Moreno also believes that there is a very good chance to convert casual riders into members rather than creating a marketing campaign that targets all-new customers.

Goal:

Design marketing strategies aimed at converting casual riders into annual members

Answer following questions:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

My Task

As a junior data analyst, you have been asked to analyze data to help with the solution of the the first question: how do annual members and casual riders use Cyclistic bikes differently.

Data Used for the Analysis

I examined 12 months of data reaching from the 2^nd quartile of 2019 to the 1^st quartile of 2020.

Data source: divvy_tripdata

Question 1 : How do annual members and casual riders use Cyclistic bikes differently?

Loading Packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(skimr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Collecting Data

q2_2019 <- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q2.csv")

## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num  (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q3_2019 <- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q3.csv")

## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q4_2019 <- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2019_Q4.csv")

## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

q1_2020 <- read_csv("C:\\Users\\tomas\\OneDrive\\Plocha\\Google Data Analytics Certificate\\CaseStudies\\Track_1_CaseStudy_1\\Divvy_Trips_2020_Q1.csv")

## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data

Renaming columns to be consistent with q1_2020

q4_2019 <- rename(q4_2019,
                  ride_id = trip_id,
                  rideable_type = bikeid,
                  started_at = start_time,
                  ended_at = end_time,
                  start_station_name = from_station_name,
                  start_station_id = from_station_id,
                  end_station_name = to_station_name,
                  end_station_id = to_station_id,
                  member_casual = usertype)
q3_2019 <- rename(q3_2019,
                  ride_id = trip_id,
                  rideable_type = bikeid,
                  started_at = start_time,
                  ended_at = end_time,
                  start_station_name = from_station_name,
                  start_station_id = from_station_id,
                  end_station_name = to_station_name,
                  end_station_id = to_station_id,
                  member_casual = usertype)
q2_2019 <- rename(q2_2019,
                  ride_id = "01 - Rental Details Rental ID",
                  rideable_type = "01 - Rental Details Bike ID",
                  started_at = "01 - Rental Details Local Start Time",
                  ended_at = "01 - Rental Details Local End Time",
                  start_station_name = "03 - Rental Start Station Name",
                  start_station_id = "03 - Rental Start Station ID",
                  end_station_name = "02 - Rental End Station Name",
                  end_station_id = "02 - Rental End Station ID",
                  member_casual = "User Type")

Converting ‘ride_id’ and ‘rideable_type’ in 2019 to ‘character’ to match 2020

q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))

Combining data frames

all_trips <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020)

Removing data points that were not used since 2020

all_trips <- all_trips %>% 
  select(-c(start_lat,start_lng, end_lat, end_lng,birthyear, gender,"01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Renaming ‘member_casual’ entries

all_trips <- all_trips %>% 
  mutate(member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual"))

Adding time-related collumns

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at)

Converting ‘ride_length’ to numeric

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

Removing irrelevant entries

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length < 0),]

Descriptive Analysis

Analysing ‘ride_length’

all_trips_v2 %>% 
  summarise(average_ride_length = mean(ride_length),
            median_ride_length = median(ride_length), # the middle number
            longest_ride_length = max(ride_length),
            shortest_ride_length = min(ride_length))

## # A tibble: 1 × 4
##   average_ride_length median_ride_length longest_ride_length shortest_ride_len…¹
##                 <dbl>              <dbl>               <dbl>               <dbl>
## 1               1479.                712             9387024                   1
## # … with abbreviated variable name ¹shortest_ride_length

Comparing members and casual users ride length

all_trips_v2 %>% 
  group_by(member_casual) %>% 
  summarise(average_ride_length = mean(ride_length),
            median_ride_length = median(ride_length), # the middle number
            longest_ride_length = max(ride_length),
            shortest_ride_length = min(ride_length))

## # A tibble: 2 × 5
##   member_casual average_ride_length median_ride_length longest_ride_le…¹ short…²
##   <chr>                       <dbl>              <dbl>             <dbl>   <dbl>
## 1 casual                      3553.               1546           9387024       2
## 2 member                       850.                589           9056634       1
## # … with abbreviated variable names ¹longest_ride_length, ²shortest_ride_length

Correcting the order of day_of_week

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

Average ride time by each day for members and casual users

all_trips_v2 %>% 
  group_by(day_of_week, member_casual) %>% 
  summarise(average_length_of_ride = mean(ride_length), 
            number_of_rides = n()) %>% 
  arrange(day_of_week)

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   day_of_week [7]
##    day_of_week member_casual average_length_of_ride number_of_rides
##    <ord>       <chr>                          <dbl>           <int>
##  1 Monday      casual                         3372.          103296
##  2 Monday      member                          843.          472196
##  3 Tuesday     casual                         3596.           90510
##  4 Tuesday     member                          826.          508445
##  5 Wednesday   casual                         3719.           92457
##  6 Wednesday   member                          824.          500329
##  7 Thursday    casual                         3683.          102679
##  8 Thursday    member                          824.          484177
##  9 Friday      casual                         3774.          122404
## 10 Friday      member                          825.          452790
## 11 Saturday    casual                         3332.          209543
## 12 Saturday    member                          969.          287958
## 13 Sunday      casual                         3581.          181293
## 14 Sunday      member                          920.          267965

Visualizating Data

Length of rides

all_trips_v2 %>% 
  group_by(day_of_week, member_casual) %>% 
  summarise(average_length_of_ride = mean(ride_length), 
            number_of_rides = n()) %>% 
  arrange(day_of_week) %>% 
  ggplot(mapping = aes(x = day_of_week, y = average_length_of_ride, fill = member_casual)) + geom_col(position = "dodge", color = "black") +
  labs(title = "The Length of Rides in Seconds", subtitle = "The difference in the length of rides between members and casual riders", x = "Day of Week", y = "Average Length of Ride", fill = "Casuals vs Members")

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

Number of Rides

all_trips_v2 %>% 
  group_by(day_of_week, member_casual) %>% 
  summarise(average_length_of_ride = mean(ride_length), 
            number_of_rides = n()) %>% 
  arrange(day_of_week) %>% 
  ggplot(mapping = aes(x = day_of_week, y = number_of_rides, fill = member_casual)) + geom_col(color = "black") +
  facet_wrap(~member_casual) +
  labs(title = "Number of Rides", subtitle = "The difference in the number of rides between members and casual riders", x = "Day of Week", y = "Average Number of Rides", fill = "Casuals vs Members") +
  theme(axis.text.x = element_text(angle = 45)) + # changing the angle of x axis labels
  expand_limits( x = c(0,NA), y = c(0,NA)) +  # changes the outcome 
  scale_y_continuous(labels = scales::comma)  # of y to readable numbers

Conclusion

Analyzing and visualizing the data provided us with two main differences between casual riders and members:

Casual riders use bikes for significantly longer period of time compared to members
Casual riders do not rent bikes as often as members

The simple analysis has a potential to lead the marketing team to a data-driven decision. We have found out, that casual riders are much more likely to rent a bike for a longer duration of time which we can use to adjust the membership program so it is more lucrative for them to become members.

One of the options might be to provide a discount for future renting for every 10km a customer rides. Casual riders do not use the renting service that often but if the discount (or other benefits) was significant enough, they might be willing to become members.

Google Data Analytics : Case Study 1

Tomas Kejha

2022-10-30