Cyclistic is a bike-share company in Chicago that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members.
Design marketing strategies aimed at converting casual riders into annual members by answering the following question:
Based on their behavior, how “casual” users could be motivated to change to an annual subscription?
Cyclistic trip data can be download here https://divvy-tripdata.s3.amazonaws.com/index.html (Note: The data sets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license https://ride.divvybikes.com/data-license-agreement).
For this analysis I’m using the data from September 2021 to August
2022.
Each data set was separated per month, column names were consistent and
organized. There is no obvious sign of bias.
Please note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that for the purposes of this case study, I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
For the purposes of this case study I decided to use R, and Tableau. To ensure data’s integrity I have saved a copy of the original data sets as well as a copy of each of the steps of the analysis process, as well as adding comments in every step of said process.
First I install the libraries needed for this analysis.
#Adding libraries that I need for my analysis
library(tidyverse) #will load the core tidyverse packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate) #helps with date-time attributes
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) #helps visualize data
library(geosphere) #helps manage distance, direction, area, etc for geographic coordinates
getwd() #displays your working directory
## [1] "/home/csllmai/Projects/cyclistic-case-study"
setwd("~/Projects/cyclistic-case-study") #sets your working directory to simplify calls to data
Next step is to load the data sets (csv files) into their respective data frame variable in R.
#=====================
# Load each data set (csv files) and assign to different variables
#=====================
cyclistic_df1 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df2 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df3 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df4 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df5 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df6 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df7 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df8 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df9 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df10 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df11 <- read_csv("202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df12 <- read_csv("202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
While the column names don’t have to be in the same order, they DO need to match before we can use a command to join them into one data frame.
#====================================================
# Compare column names for each of the files
#====================================================
colnames(cyclistic_df1)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df2)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df3)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df4)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df5)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df6)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df7)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df8)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df9)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df10)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df11)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(cyclistic_df12)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
There’s no need to rename columns, all columns in the data frames are consistent. Now it is time to inspect the structure of the data frames looking for inconsistencies. After that, I was ready to generate a single data frame containing the 12 months data, from September 2021 to August 2022.
#====================================================
# Inspect the structure of the data frames looking for inconsistencies
#====================================================
str(cyclistic_df1)
str(cyclistic_df2)
str(cyclistic_df3)
str(cyclistic_df4)
str(cyclistic_df5)
str(cyclistic_df6)
str(cyclistic_df7)
str(cyclistic_df8)
str(cyclistic_df9)
str(cyclistic_df10)
str(cyclistic_df11)
str(cyclistic_df12)
#====================================================
# Create unique data frame -12months- from September 2021 to August 2022
#====================================================
cyclistic_df <- rbind(cyclistic_df1, cyclistic_df2, cyclistic_df3, cyclistic_df4, cyclistic_df5,
cyclistic_df6, cyclistic_df7, cyclistic_df8, cyclistic_df9, cyclistic_df10,
cyclistic_df11, cyclistic_df12)
In addition to head()
you can also use the
str()
and glimpse()
functions to get summaries
of each column in your data arranged horizontally. You can try these two
functions by completing and running the code chunks below:
{Dimension of the initial data frame}
dim(cyclistic_df)
{Check the first observations contained in the data
frame}
head(cyclistic_df)
{str compactly display the structure of the data
frame}
str(cyclistic_df)
{similar to str, but tries to show as much data as
possible}
glimpse(cyclistic_df)
#====================================================
# Cleaning the data frame
#====================================================
cyclistic_df_v1 <- drop_na(cyclistic_df)
#dim(cyclistic_df_v1)
There are a few problems that need to get fixed:
(1) I’ll add additional columns of data – such as day, month, year –
that provide additional opportunities to aggregate the data.
(2) I’ll add a calculated column for length of ride
called(ride_length
) to the entire data frame for
consistency.
(3) There are some rides where ride_length
shows up as
negative, including several hundred rides where I assume the bikes were
took out of circulation for Control reasons. I will delete these
rides.
Now I add the columns date
, month
,
day
, and year
. This will allow to aggregate
ride data for each month, day, or year. I’ll also add
ride_length
by subtracting started_at from ended_at, also
create ride_distance
, ride_speed
and
day_of_week
columns and save them in a new data frame that
includes the new columns.
#====================================================
#Adding new columns date, month, day, year, ride_length, ride_distance, ride_speed, day_of_week
#====================================================
cyclistic_df_v1$date <- as.Date(cyclistic_df_v1$started_at) #The default format is yyyy-mm-dd
cyclistic_df_v1$month <- format(as.Date(cyclistic_df_v1$date), "%m")
cyclistic_df_v1$day <- format(as.Date(cyclistic_df_v1$date), "%d")
cyclistic_df_v1$year <- format(as.Date(cyclistic_df_v1$date), "%Y")
cyclistic_df_v1$day_of_week <- format(as.Date(cyclistic_df_v1$date), "%A")
# Add a "ride_length" calculation to cyclistic_df_v1 (in seconds)
cyclistic_df_v1$ride_length <- difftime(cyclistic_df_v1$ended_at, cyclistic_df_v1$started_at)
# Add ride_distance using coordinates from start_lat to end_lat
cyclistic_df_v1$ride_distance <- (distGeo(matrix(c(cyclistic_df_v1$start_lng,cyclistic_df_v1$start_lat), ncol = 2),
matrix(c(cyclistic_df_v1$end_lng, cyclistic_df_v1$end_lat), ncol = 2)))/1000
cyclistic_df_v1$ride_speed <- c(cyclistic_df_v1$ride_distance)/as.numeric(c(cyclistic_df_v1$ride_length), units="hours")
# Inspect the structure of the columns
str(cyclistic_df_v1)
## tibble [4,560,146 × 21] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:4560146] "E92C804563F261EC" "9ECA91210441E847" "3DAA144C4E85E3AA" "53B99FE9BD0D35F8" ...
## $ rideable_type : chr [1:4560146] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:4560146], format: "2021-09-05 01:25:08" "2021-09-05 13:33:41" ...
## $ ended_at : POSIXct[1:4560146], format: "2021-09-05 01:35:46" "2021-09-05 14:06:08" ...
## $ start_station_name: chr [1:4560146] "Wells St & Walton St" "Larrabee St & Armitage Ave" "Dearborn St & Monroe St" "Kingsbury St & Kinzie St" ...
## $ start_station_id : chr [1:4560146] "TA1306000011" "TA1309000006" "TA1305000006" "KA1503000043" ...
## $ end_station_name : chr [1:4560146] "Desplaines St & Kinzie St" "Clark St & Leland Ave" "Dearborn St & Monroe St" "Orleans St & Elm St" ...
## $ end_station_id : chr [1:4560146] "TA1306000003" "TA1309000014" "TA1305000006" "TA1306000006" ...
## $ start_lat : num [1:4560146] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:4560146] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:4560146] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:4560146] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:4560146] "casual" "casual" "casual" "casual" ...
## $ date : Date[1:4560146], format: "2021-09-05" "2021-09-05" ...
## $ month : chr [1:4560146] "09" "09" "09" "09" ...
## $ day : chr [1:4560146] "05" "05" "04" "14" ...
## $ year : chr [1:4560146] "2021" "2021" "2021" "2021" ...
## $ day_of_week : chr [1:4560146] "Sunday" "Sunday" "Saturday" "Tuesday" ...
## $ ride_length : 'difftime' num [1:4560146] 638 1947 579 587 ...
## ..- attr(*, "units")= chr "secs"
## $ ride_distance : num [1:4560146] 1.5 5.79 0 1.53 0 ...
## $ ride_speed : num [1:4560146] 8.45 10.7 0 9.37 0 ...
# Convert "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(cyclistic_df_v1$ride_length)
## [1] FALSE
cyclistic_df_v1$ride_length <- as.numeric(as.character(cyclistic_df_v1$ride_length))
is.numeric(cyclistic_df_v1$ride_length)
## [1] TRUE
Ride_length with negative values, meaning the ride has started after it ended which is not correct. The data frame also includes a few hundred entries when bikes were taken out of docks and checked for quality.For that reason I have to delete them, but before that I have to make sure I only delete the ones with negative or less than 60 seconds as a ride_length value.
I create a new version of the data frame (cyclistic_df_v2) since data is being removed.
cyclistic_df_v2 <- cyclistic_df_v1[!(cyclistic_df_v1$start_station_name == "HQ QR" | cyclistic_df_v1$ride_length <=60),]
# Inspect the structure of the columns
str(cyclistic_df_v2) # observations left after removing negative ride_length values
## tibble [4,487,607 × 21] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:4487607] "E92C804563F261EC" "9ECA91210441E847" "3DAA144C4E85E3AA" "53B99FE9BD0D35F8" ...
## $ rideable_type : chr [1:4487607] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:4487607], format: "2021-09-05 01:25:08" "2021-09-05 13:33:41" ...
## $ ended_at : POSIXct[1:4487607], format: "2021-09-05 01:35:46" "2021-09-05 14:06:08" ...
## $ start_station_name: chr [1:4487607] "Wells St & Walton St" "Larrabee St & Armitage Ave" "Dearborn St & Monroe St" "Kingsbury St & Kinzie St" ...
## $ start_station_id : chr [1:4487607] "TA1306000011" "TA1309000006" "TA1305000006" "KA1503000043" ...
## $ end_station_name : chr [1:4487607] "Desplaines St & Kinzie St" "Clark St & Leland Ave" "Dearborn St & Monroe St" "Orleans St & Elm St" ...
## $ end_station_id : chr [1:4487607] "TA1306000003" "TA1309000014" "TA1305000006" "TA1306000006" ...
## $ start_lat : num [1:4487607] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:4487607] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:4487607] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:4487607] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:4487607] "casual" "casual" "casual" "casual" ...
## $ date : Date[1:4487607], format: "2021-09-05" "2021-09-05" ...
## $ month : chr [1:4487607] "09" "09" "09" "09" ...
## $ day : chr [1:4487607] "05" "05" "04" "14" ...
## $ year : chr [1:4487607] "2021" "2021" "2021" "2021" ...
## $ day_of_week : chr [1:4487607] "Sunday" "Sunday" "Saturday" "Tuesday" ...
## $ ride_length : num [1:4487607] 638 1947 579 587 7466 ...
## $ ride_distance : num [1:4487607] 1.5 5.79 0 1.53 0 ...
## $ ride_speed : num [1:4487607] 8.45 10.7 0 9.37 0 ...
I double check to make sure there are no duplicated values before starting with the analysis phase.
#=====================================
# Unique values per column
#=====================================
print("Unique values per column:")
## [1] "Unique values per column:"
cyclistic_df_v2 %>% summarise_all(list(~n_distinct(.)))
## # A tibble: 1 × 21
## ride_id rideable_type starte…¹ ended…² start…³ start…⁴ end_s…⁵ end_s…⁶ start…⁷
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4487607 3 3874760 3882223 1355 1213 1384 1230 497714
## # … with 12 more variables: start_lng <int>, end_lat <int>, end_lng <int>,
## # member_casual <int>, date <int>, month <int>, day <int>, year <int>,
## # day_of_week <int>, ride_length <int>, ride_distance <int>,
## # ride_speed <int>, and abbreviated variable names ¹started_at, ²ended_at,
## # ³start_station_name, ⁴start_station_id, ⁵end_station_name, ⁶end_station_id,
## # ⁷start_lat
#=====================================
# Summary on ride_length (all figures in seconds)
#=====================================
mean(cyclistic_df_v2$ride_length) #straight average (total ride length / rides)
## [1] 1097.009
median(cyclistic_df_v2$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 671
max(cyclistic_df_v2$ride_length) #longest ride
## [1] 2442301
min(cyclistic_df_v2$ride_length) #shortest ride
## [1] 61
You can condense the four lines above to one line using summary() on
the specific attribute.
summary(cyclistic_df_v2$ride_length)
Now I want to compare information per user type(casual and subscribers).
#=====================================
# Compare members and casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = mean)
## cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1 casual 1568.1325
## 2 member 765.7713
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = median)
## cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1 casual 894
## 2 member 557
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = min)
## cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1 casual 61
## 2 member 61
#=====================================
# Average ride time by each day for members vs casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual + cyclistic_df_v2$day_of_week, FUN = mean)
## cyclistic_df_v2$member_casual cyclistic_df_v2$day_of_week
## 1 casual Friday
## 2 member Friday
## 3 casual Monday
## 4 member Monday
## 5 casual Saturday
## 6 member Saturday
## 7 casual Sunday
## 8 member Sunday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Tuesday
## 12 member Tuesday
## 13 casual Wednesday
## 14 member Wednesday
## cyclistic_df_v2$ride_length
## 1 1460.5213
## 2 748.9705
## 3 1628.7957
## 4 741.6798
## 5 1726.4485
## 6 860.8886
## 7 1807.7487
## 8 864.0588
## 9 1384.3927
## 10 734.5616
## 11 1372.5103
## 12 720.6006
## 13 1342.5320
## 14 727.2049
#=====================================
# Notice that the days of the week are out of order. Let's fix that.
#=====================================
cyclistic_df_v2$day_of_week <- ordered(cyclistic_df_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#=====================================
# Average ride length by each day for members vs casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual + cyclistic_df_v2$day_of_week, FUN = mean)
## cyclistic_df_v2$member_casual cyclistic_df_v2$day_of_week
## 1 casual Sunday
## 2 member Sunday
## 3 casual Monday
## 4 member Monday
## 5 casual Tuesday
## 6 member Tuesday
## 7 casual Wednesday
## 8 member Wednesday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Friday
## 12 member Friday
## 13 casual Saturday
## 14 member Saturday
## cyclistic_df_v2$ride_length
## 1 1807.7487
## 2 864.0588
## 3 1628.7957
## 4 741.6798
## 5 1372.5103
## 6 720.6006
## 7 1342.5320
## 8 727.2049
## 9 1384.3927
## 10 734.5616
## 11 1460.5213
## 12 748.9705
## 13 1726.4485
## 14 860.8886
#=====================================
# analyze number of rides data by user type and weekday
#=====================================
cyclistic_df_v2 %>%
group_by(member_casual, day_of_week) %>% #groups by user type and weekday
summarise(number_of_rides = n() #calculates the number of rides
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, day_of_week) # sorts by user type and weekday
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_week number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sunday 339288 1808.
## 2 casual Monday 218010 1629.
## 3 casual Tuesday 202756 1373.
## 4 casual Wednesday 214001 1343.
## 5 casual Thursday 227928 1384.
## 6 casual Friday 254610 1461.
## 7 casual Saturday 396022 1726.
## 8 member Sunday 307848 864.
## 9 member Monday 370205 742.
## 10 member Tuesday 419619 721.
## 11 member Wednesday 427296 727.
## 12 member Thursday 405983 735.
## 13 member Friday 359199 749.
## 14 member Saturday 344842 861.
I create a general summary and export later for further analysis and create my data visualizations using Tableau.
#=====================================
#General Summary
#=====================================
getmode <- function(day_of_week) {
uniqv <- unique(day_of_week)
uniqv[which.max(tabulate(match(day_of_week, uniqv)))]
}
clean_cyclistic_df_summary <-
cyclistic_df_v2 %>%
group_by(member_casual, rideable_type, month.abb[as.numeric(cyclistic_df_v2$month)], year, day_of_week) %>%
summarise(number_of_rides = n(),
average_ride_length = mean(ride_length),
average_ride_distance = mean(ride_distance),
average_ride_speed = mean(ride_speed),
mode_dow = getmode(day_of_week))
clean_cyclistic_df_summary
## # A tibble: 420 × 10
## # Groups: member_casual, rideable_type,
## # month.abb[as.numeric(cyclistic_df_v2$month)], year [60]
## membe…¹ ridea…² month…³ year day_o…⁴ numbe…⁵ avera…⁶ avera…⁷ avera…⁸ mode_…⁹
## <chr> <chr> <chr> <chr> <ord> <int> <dbl> <dbl> <dbl> <ord>
## 1 casual classi… Apr 2022 Sunday 7638 1674. 2.07 7.02 Sunday
## 2 casual classi… Apr 2022 Monday 3954 1425. 1.90 7.87 Monday
## 3 casual classi… Apr 2022 Tuesday 5028 1399. 1.96 8.23 Tuesday
## 4 casual classi… Apr 2022 Wednes… 3159 1202. 1.86 8.45 Wednes…
## 5 casual classi… Apr 2022 Thursd… 5616 1465. 2.00 7.93 Thursd…
## 6 casual classi… Apr 2022 Friday 5568 1368. 1.90 7.70 Friday
## 7 casual classi… Apr 2022 Saturd… 15879 1834. 2.36 6.90 Saturd…
## 8 casual classi… Aug 2022 Sunday 17534 1616. 2.14 7.25 Sunday
## 9 casual classi… Aug 2022 Monday 14287 1406. 2.04 7.90 Monday
## 10 casual classi… Aug 2022 Tuesday 17437 1421. 2.13 8.06 Tuesday
## # … with 410 more rows, and abbreviated variable names ¹member_casual,
## # ²rideable_type, ³`month.abb[as.numeric(cyclistic_df_v2$month)]`,
## # ⁴day_of_week, ⁵number_of_rides, ⁶average_ride_length,
## # ⁷average_ride_distance, ⁸average_ride_speed, ⁹mode_dow
I generate a couple of basic plots.
#=====================================
# Let's visualize the number of rides by user type
#=====================================
ggplot(data = cyclistic_df_v2, mapping = aes(x = day_of_week), y = (summarise(number_of_rides = n()))) +
geom_bar(aes(fill = member_casual)) +
labs(title="Cyclistic: Number of Rides",
subtitle = "Number of rides per type of user",
x = "Day of week", y = "Number of rides", fill = "User type")
#=====================================
# Let's create a visualization for average ride duration
#=====================================
clean_cyclistic_df_summary %>%
ggplot() +
geom_col(mapping = aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
labs(title="Cyclistic: Average duration of rides",
subtitle = "Average duration of rides per user type",
x = "Day of week", y = "Average ride duration (sec)", fill = "User type")
I create csv files to export a summary of my analysis and create other visualizations using Tableau.
#=================================================
#Export summary file for further analysis
#=================================================
expfile <- cyclistic_df_v2 %>%
group_by(member_casual, day_of_week) %>% #groups by user type and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, day_of_week)
write.csv(expfile, file = '~/Projects/cyclistic-case-study/expfile.csv')
write.csv(clean_cyclistic_df_summary, file = '~/Projects/cyclistic-case-study/summary.csv')
Thank you for reading me.