Cyclistic.

Cyclistic is a bike-share company in Chicago that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.


Scenario.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members.


Business Task.

Design marketing strategies aimed at converting casual riders into annual members by answering the following question:

Based on their behavior, how “casual” users could be motivated to change to an annual subscription?


Data Sources.

Cyclistic trip data can be download here https://divvy-tripdata.s3.amazonaws.com/index.html (Note: The data sets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license https://ride.divvybikes.com/data-license-agreement).

For this analysis I’m using the data from September 2021 to August 2022.
Each data set was separated per month, column names were consistent and organized. There is no obvious sign of bias.

Please note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that for the purposes of this case study, I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.


Let’s get started with the process phase!

For the purposes of this case study I decided to use R, and Tableau. To ensure data’s integrity I have saved a copy of the original data sets as well as a copy of each of the steps of the analysis process, as well as adding comments in every step of said process.

First I install the libraries needed for this analysis.

#Adding libraries that I need for my analysis
library(tidyverse)  #will load the core tidyverse packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)  #helps with date-time attributes
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)  #helps visualize data
library(geosphere) #helps manage distance, direction, area, etc for geographic coordinates
getwd() #displays your working directory
## [1] "/home/csllmai/Projects/cyclistic-case-study"
setwd("~/Projects/cyclistic-case-study") #sets your working directory to simplify calls to data

Next step is to load the data sets (csv files) into their respective data frame variable in R.

#=====================
# Load each data set (csv files) and assign to different variables
#=====================
cyclistic_df1 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df2 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df3 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df4 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df5 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df6 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df7 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df8 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df9 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df10 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df11 <- read_csv("202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df12 <- read_csv("202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

While the column names don’t have to be in the same order, they DO need to match before we can use a command to join them into one data frame.

#====================================================
# Compare column names for each of the files
#====================================================
colnames(cyclistic_df1)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df2)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df3)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df4)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df5)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df6)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df7)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df8)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df9)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df10)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df11)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(cyclistic_df12)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

There’s no need to rename columns, all columns in the data frames are consistent. Now it is time to inspect the structure of the data frames looking for inconsistencies. After that, I was ready to generate a single data frame containing the 12 months data, from September 2021 to August 2022.

#====================================================
# Inspect the structure of the data frames looking for inconsistencies
#====================================================
str(cyclistic_df1)
str(cyclistic_df2)
str(cyclistic_df3)
str(cyclistic_df4)
str(cyclistic_df5)
str(cyclistic_df6)
str(cyclistic_df7)
str(cyclistic_df8)
str(cyclistic_df9)
str(cyclistic_df10)
str(cyclistic_df11)
str(cyclistic_df12)

#====================================================
# Create unique data frame -12months- from September 2021 to August 2022
#====================================================
cyclistic_df <- rbind(cyclistic_df1, cyclistic_df2, cyclistic_df3, cyclistic_df4, cyclistic_df5,
                  cyclistic_df6, cyclistic_df7, cyclistic_df8, cyclistic_df9, cyclistic_df10,
                  cyclistic_df11, cyclistic_df12)

In addition to head() you can also use the str() and glimpse() functions to get summaries of each column in your data arranged horizontally. You can try these two functions by completing and running the code chunks below:

{Dimension of the initial data frame}
dim(cyclistic_df)

{Check the first observations contained in the data frame}
head(cyclistic_df)

{str compactly display the structure of the data frame}
str(cyclistic_df)

{similar to str, but tries to show as much data as possible}
glimpse(cyclistic_df)


#====================================================
# Cleaning the data frame
#====================================================
cyclistic_df_v1 <- drop_na(cyclistic_df)
#dim(cyclistic_df_v1)

There are a few problems that need to get fixed:
(1) I’ll add additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.
(2) I’ll add a calculated column for length of ride called(ride_length) to the entire data frame for consistency.
(3) There are some rides where ride_length shows up as negative, including several hundred rides where I assume the bikes were took out of circulation for Control reasons. I will delete these rides.

Now I add the columns date, month, day, and year. This will allow to aggregate ride data for each month, day, or year. I’ll also add ride_length by subtracting started_at from ended_at, also create ride_distance, ride_speed and day_of_week columns and save them in a new data frame that includes the new columns.

#====================================================
#Adding new columns date, month, day, year, ride_length, ride_distance, ride_speed, day_of_week
#====================================================
cyclistic_df_v1$date <- as.Date(cyclistic_df_v1$started_at) #The default format is yyyy-mm-dd  
cyclistic_df_v1$month <- format(as.Date(cyclistic_df_v1$date), "%m")  
cyclistic_df_v1$day <- format(as.Date(cyclistic_df_v1$date), "%d")  
cyclistic_df_v1$year <- format(as.Date(cyclistic_df_v1$date), "%Y")  
cyclistic_df_v1$day_of_week <- format(as.Date(cyclistic_df_v1$date), "%A")  
# Add a "ride_length" calculation to cyclistic_df_v1 (in seconds)  
cyclistic_df_v1$ride_length <- difftime(cyclistic_df_v1$ended_at, cyclistic_df_v1$started_at)  
# Add ride_distance using coordinates from start_lat to end_lat  
cyclistic_df_v1$ride_distance <- (distGeo(matrix(c(cyclistic_df_v1$start_lng,cyclistic_df_v1$start_lat), ncol = 2), 
                        matrix(c(cyclistic_df_v1$end_lng, cyclistic_df_v1$end_lat), ncol = 2)))/1000
cyclistic_df_v1$ride_speed <- c(cyclistic_df_v1$ride_distance)/as.numeric(c(cyclistic_df_v1$ride_length), units="hours")


# Inspect the structure of the columns
str(cyclistic_df_v1)
## tibble [4,560,146 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4560146] "E92C804563F261EC" "9ECA91210441E847" "3DAA144C4E85E3AA" "53B99FE9BD0D35F8" ...
##  $ rideable_type     : chr [1:4560146] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:4560146], format: "2021-09-05 01:25:08" "2021-09-05 13:33:41" ...
##  $ ended_at          : POSIXct[1:4560146], format: "2021-09-05 01:35:46" "2021-09-05 14:06:08" ...
##  $ start_station_name: chr [1:4560146] "Wells St & Walton St" "Larrabee St & Armitage Ave" "Dearborn St & Monroe St" "Kingsbury St & Kinzie St" ...
##  $ start_station_id  : chr [1:4560146] "TA1306000011" "TA1309000006" "TA1305000006" "KA1503000043" ...
##  $ end_station_name  : chr [1:4560146] "Desplaines St & Kinzie St" "Clark St & Leland Ave" "Dearborn St & Monroe St" "Orleans St & Elm St" ...
##  $ end_station_id    : chr [1:4560146] "TA1306000003" "TA1309000014" "TA1305000006" "TA1306000006" ...
##  $ start_lat         : num [1:4560146] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:4560146] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:4560146] 41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:4560146] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:4560146] "casual" "casual" "casual" "casual" ...
##  $ date              : Date[1:4560146], format: "2021-09-05" "2021-09-05" ...
##  $ month             : chr [1:4560146] "09" "09" "09" "09" ...
##  $ day               : chr [1:4560146] "05" "05" "04" "14" ...
##  $ year              : chr [1:4560146] "2021" "2021" "2021" "2021" ...
##  $ day_of_week       : chr [1:4560146] "Sunday" "Sunday" "Saturday" "Tuesday" ...
##  $ ride_length       : 'difftime' num [1:4560146] 638 1947 579 587 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ride_distance     : num [1:4560146] 1.5 5.79 0 1.53 0 ...
##  $ ride_speed        : num [1:4560146] 8.45 10.7 0 9.37 0 ...
# Convert "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(cyclistic_df_v1$ride_length)
## [1] FALSE
cyclistic_df_v1$ride_length <- as.numeric(as.character(cyclistic_df_v1$ride_length))
is.numeric(cyclistic_df_v1$ride_length)
## [1] TRUE

Remove “bad” data.

Ride_length with negative values, meaning the ride has started after it ended which is not correct. The data frame also includes a few hundred entries when bikes were taken out of docks and checked for quality.For that reason I have to delete them, but before that I have to make sure I only delete the ones with negative or less than 60 seconds as a ride_length value.

I create a new version of the data frame (cyclistic_df_v2) since data is being removed.

cyclistic_df_v2 <- cyclistic_df_v1[!(cyclistic_df_v1$start_station_name == "HQ QR" | cyclistic_df_v1$ride_length <=60),]

# Inspect the structure of the columns
str(cyclistic_df_v2) # observations left after removing negative ride_length values
## tibble [4,487,607 × 21] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4487607] "E92C804563F261EC" "9ECA91210441E847" "3DAA144C4E85E3AA" "53B99FE9BD0D35F8" ...
##  $ rideable_type     : chr [1:4487607] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:4487607], format: "2021-09-05 01:25:08" "2021-09-05 13:33:41" ...
##  $ ended_at          : POSIXct[1:4487607], format: "2021-09-05 01:35:46" "2021-09-05 14:06:08" ...
##  $ start_station_name: chr [1:4487607] "Wells St & Walton St" "Larrabee St & Armitage Ave" "Dearborn St & Monroe St" "Kingsbury St & Kinzie St" ...
##  $ start_station_id  : chr [1:4487607] "TA1306000011" "TA1309000006" "TA1305000006" "KA1503000043" ...
##  $ end_station_name  : chr [1:4487607] "Desplaines St & Kinzie St" "Clark St & Leland Ave" "Dearborn St & Monroe St" "Orleans St & Elm St" ...
##  $ end_station_id    : chr [1:4487607] "TA1306000003" "TA1309000014" "TA1305000006" "TA1306000006" ...
##  $ start_lat         : num [1:4487607] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:4487607] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:4487607] 41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:4487607] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:4487607] "casual" "casual" "casual" "casual" ...
##  $ date              : Date[1:4487607], format: "2021-09-05" "2021-09-05" ...
##  $ month             : chr [1:4487607] "09" "09" "09" "09" ...
##  $ day               : chr [1:4487607] "05" "05" "04" "14" ...
##  $ year              : chr [1:4487607] "2021" "2021" "2021" "2021" ...
##  $ day_of_week       : chr [1:4487607] "Sunday" "Sunday" "Saturday" "Tuesday" ...
##  $ ride_length       : num [1:4487607] 638 1947 579 587 7466 ...
##  $ ride_distance     : num [1:4487607] 1.5 5.79 0 1.53 0 ...
##  $ ride_speed        : num [1:4487607] 8.45 10.7 0 9.37 0 ...

Analysis.

I double check to make sure there are no duplicated values before starting with the analysis phase.

#=====================================
# Unique values per column
#=====================================
print("Unique values per column:")
## [1] "Unique values per column:"
cyclistic_df_v2 %>% summarise_all(list(~n_distinct(.)))
## # A tibble: 1 × 21
##   ride_id rideable_type starte…¹ ended…² start…³ start…⁴ end_s…⁵ end_s…⁶ start…⁷
##     <int>         <int>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 4487607             3  3874760 3882223    1355    1213    1384    1230  497714
## # … with 12 more variables: start_lng <int>, end_lat <int>, end_lng <int>,
## #   member_casual <int>, date <int>, month <int>, day <int>, year <int>,
## #   day_of_week <int>, ride_length <int>, ride_distance <int>,
## #   ride_speed <int>, and abbreviated variable names ¹​started_at, ²​ended_at,
## #   ³​start_station_name, ⁴​start_station_id, ⁵​end_station_name, ⁶​end_station_id,
## #   ⁷​start_lat
#=====================================
# Summary on ride_length (all figures in seconds)
#=====================================
mean(cyclistic_df_v2$ride_length) #straight average (total ride length / rides)
## [1] 1097.009
median(cyclistic_df_v2$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 671
max(cyclistic_df_v2$ride_length) #longest ride
## [1] 2442301
min(cyclistic_df_v2$ride_length) #shortest ride
## [1] 61

You can condense the four lines above to one line using summary() on the specific attribute.
summary(cyclistic_df_v2$ride_length)

Now I want to compare information per user type(casual and subscribers).

#=====================================
# Compare members and casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = mean)
##   cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1                        casual                   1568.1325
## 2                        member                    765.7713
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = median)
##   cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1                        casual                         894
## 2                        member                         557
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual, FUN = min)
##   cyclistic_df_v2$member_casual cyclistic_df_v2$ride_length
## 1                        casual                          61
## 2                        member                          61
#=====================================
# Average ride time by each day for members vs casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual + cyclistic_df_v2$day_of_week, FUN = mean)
##    cyclistic_df_v2$member_casual cyclistic_df_v2$day_of_week
## 1                         casual                      Friday
## 2                         member                      Friday
## 3                         casual                      Monday
## 4                         member                      Monday
## 5                         casual                    Saturday
## 6                         member                    Saturday
## 7                         casual                      Sunday
## 8                         member                      Sunday
## 9                         casual                    Thursday
## 10                        member                    Thursday
## 11                        casual                     Tuesday
## 12                        member                     Tuesday
## 13                        casual                   Wednesday
## 14                        member                   Wednesday
##    cyclistic_df_v2$ride_length
## 1                    1460.5213
## 2                     748.9705
## 3                    1628.7957
## 4                     741.6798
## 5                    1726.4485
## 6                     860.8886
## 7                    1807.7487
## 8                     864.0588
## 9                    1384.3927
## 10                    734.5616
## 11                   1372.5103
## 12                    720.6006
## 13                   1342.5320
## 14                    727.2049
#=====================================
# Notice that the days of the week are out of order. Let's fix that.
#=====================================
cyclistic_df_v2$day_of_week <- ordered(cyclistic_df_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#=====================================
# Average ride length by each day for members vs casual users
#=====================================
aggregate(cyclistic_df_v2$ride_length ~ cyclistic_df_v2$member_casual + cyclistic_df_v2$day_of_week, FUN = mean)
##    cyclistic_df_v2$member_casual cyclistic_df_v2$day_of_week
## 1                         casual                      Sunday
## 2                         member                      Sunday
## 3                         casual                      Monday
## 4                         member                      Monday
## 5                         casual                     Tuesday
## 6                         member                     Tuesday
## 7                         casual                   Wednesday
## 8                         member                   Wednesday
## 9                         casual                    Thursday
## 10                        member                    Thursday
## 11                        casual                      Friday
## 12                        member                      Friday
## 13                        casual                    Saturday
## 14                        member                    Saturday
##    cyclistic_df_v2$ride_length
## 1                    1807.7487
## 2                     864.0588
## 3                    1628.7957
## 4                     741.6798
## 5                    1372.5103
## 6                     720.6006
## 7                    1342.5320
## 8                     727.2049
## 9                    1384.3927
## 10                    734.5616
## 11                   1460.5213
## 12                    748.9705
## 13                   1726.4485
## 14                    860.8886
#=====================================
# analyze number of rides data by user type and weekday
#=====================================
cyclistic_df_v2 %>% 
  group_by(member_casual, day_of_week) %>%  #groups by user type and weekday
  summarise(number_of_rides = n()   #calculates the number of rides
            ,average_duration = mean(ride_length)) %>%  # calculates the average duration
  arrange(member_casual, day_of_week)   # sorts by user type and weekday
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual day_of_week number_of_rides average_duration
##    <chr>         <ord>                 <int>            <dbl>
##  1 casual        Sunday               339288            1808.
##  2 casual        Monday               218010            1629.
##  3 casual        Tuesday              202756            1373.
##  4 casual        Wednesday            214001            1343.
##  5 casual        Thursday             227928            1384.
##  6 casual        Friday               254610            1461.
##  7 casual        Saturday             396022            1726.
##  8 member        Sunday               307848             864.
##  9 member        Monday               370205             742.
## 10 member        Tuesday              419619             721.
## 11 member        Wednesday            427296             727.
## 12 member        Thursday             405983             735.
## 13 member        Friday               359199             749.
## 14 member        Saturday             344842             861.

I create a general summary and export later for further analysis and create my data visualizations using Tableau.

#=====================================
#General Summary 
#=====================================
getmode <- function(day_of_week) {
  uniqv <- unique(day_of_week)
  uniqv[which.max(tabulate(match(day_of_week, uniqv)))]
}
clean_cyclistic_df_summary <- 
  cyclistic_df_v2 %>%
  group_by(member_casual, rideable_type, month.abb[as.numeric(cyclistic_df_v2$month)], year, day_of_week) %>% 
  summarise(number_of_rides = n(),  
            average_ride_length = mean(ride_length),
            average_ride_distance = mean(ride_distance),
            average_ride_speed = mean(ride_speed),
            mode_dow = getmode(day_of_week))

clean_cyclistic_df_summary
## # A tibble: 420 × 10
## # Groups:   member_casual, rideable_type,
## #   month.abb[as.numeric(cyclistic_df_v2$month)], year [60]
##    membe…¹ ridea…² month…³ year  day_o…⁴ numbe…⁵ avera…⁶ avera…⁷ avera…⁸ mode_…⁹
##    <chr>   <chr>   <chr>   <chr> <ord>     <int>   <dbl>   <dbl>   <dbl> <ord>  
##  1 casual  classi… Apr     2022  Sunday     7638   1674.    2.07    7.02 Sunday 
##  2 casual  classi… Apr     2022  Monday     3954   1425.    1.90    7.87 Monday 
##  3 casual  classi… Apr     2022  Tuesday    5028   1399.    1.96    8.23 Tuesday
##  4 casual  classi… Apr     2022  Wednes…    3159   1202.    1.86    8.45 Wednes…
##  5 casual  classi… Apr     2022  Thursd…    5616   1465.    2.00    7.93 Thursd…
##  6 casual  classi… Apr     2022  Friday     5568   1368.    1.90    7.70 Friday 
##  7 casual  classi… Apr     2022  Saturd…   15879   1834.    2.36    6.90 Saturd…
##  8 casual  classi… Aug     2022  Sunday    17534   1616.    2.14    7.25 Sunday 
##  9 casual  classi… Aug     2022  Monday    14287   1406.    2.04    7.90 Monday 
## 10 casual  classi… Aug     2022  Tuesday   17437   1421.    2.13    8.06 Tuesday
## # … with 410 more rows, and abbreviated variable names ¹​member_casual,
## #   ²​rideable_type, ³​`month.abb[as.numeric(cyclistic_df_v2$month)]`,
## #   ⁴​day_of_week, ⁵​number_of_rides, ⁶​average_ride_length,
## #   ⁷​average_ride_distance, ⁸​average_ride_speed, ⁹​mode_dow

Basic Plots

I generate a couple of basic plots.

#=====================================
# Let's visualize the number of rides by user type
#=====================================
ggplot(data = cyclistic_df_v2, mapping = aes(x = day_of_week), y = (summarise(number_of_rides = n()))) +
  geom_bar(aes(fill = member_casual)) + 
  labs(title="Cyclistic: Number of Rides", 
       subtitle = "Number of rides per type of user",
       x = "Day of week", y = "Number of rides", fill = "User type")

#=====================================  
# Let's create a visualization for average ride duration
#=====================================  
clean_cyclistic_df_summary %>% 
ggplot() +
geom_col(mapping = aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  labs(title="Cyclistic: Average duration of rides", 
       subtitle = "Average duration of rides per user type",
       x = "Day of week", y = "Average ride duration (sec)", fill = "User type")

I create csv files to export a summary of my analysis and create other visualizations using Tableau.

#=================================================
#Export summary file for further analysis
#=================================================

expfile <- cyclistic_df_v2 %>% 
  group_by(member_casual, day_of_week) %>%  #groups by user type and weekday
  summarise(number_of_rides = n()   #calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>%      # calculates the average duration
  arrange(member_casual, day_of_week)
write.csv(expfile, file = '~/Projects/cyclistic-case-study/expfile.csv')
write.csv(clean_cyclistic_df_summary, file = '~/Projects/cyclistic-case-study/summary.csv')

Thank you for reading me.