Take-home Exercise 3

In this take-home exercise 3, we will examine the financial status of the city of Engagement, Ohio USA using tidyverse, ggoplot2 and its extension.

Hulwana Saifulzaman https://www.linkedin.com/in/hulwana-saifulzaman/ (SMU, Master of IT in Business)https://scis.smu.edu.sg/master-it-business
2022-05-16

1. Overview

In this take-home exercise, appropriate static statistical graphics methods are used to reveal the financial status of the city of Engagement, Ohio USA, particularly on the business aspect.

The key questions to address are:

● Which businesses appear to be more prosperous?

● Which appear to be struggling?

Due to limited data available, we will only examine the businesses of restaurants and pubs of the city of Engagement.

The data would be processed by using appropriate tidyverse family of packages and the statistical graphics would be prepared using ggplot2 and its extensions.

2. Getting Started

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The chunk code below will do the trick.

packages = c('tidyverse', 'lubridate', 'patchwork', 'RColorBrewer', 'ggthemes', 
             'ggiraph')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Dataset

The TravelJournal dataset has been obtained from the Journals folder retrieved from Vast Challenge 2022 website.

The reason why this dataset is used as opposed to FinancialJournal.csv is because it contains data of the travelEndLocationId which is crucial in identifying how mcuh sales were genrated by each pubs or restaurants.

3.1. Importing Data

The code chunk below import TravelJournal.csv from the data folder by using read_csv() of readr into R and save it as an tibble data frame called travel.

travel <- read_csv("data/TravelJournal.csv")

3.2. Data Definition

The following data definition has been extracted from the VAST Challenge 2022 Dataset Descriptions file which can be obtained from the Vast-Challenge-2022 folder downloaded earlier.

Data definition

TravelJournal.csv data contains information about about participants’ motivation for movement around the city. This provides a compressed summary and additional context regarding location-event and financial transaction information contained in the Participant Logs.

● participantId (integer): unique ID corresponding to the participant in question

● travelStartTime (datetime): the time when the participant started traveling

● travelStartLocationId (integer): the unique ID corresponding to the location the participant is leaving when they begin to travel, NA if unknown

● travelEndTime (datetime): the time when the participant concluded their travel

● travelEndLocationId (integer): the unique ID corresponding to the location the participant is traveling to

● purpose (string factor): a description of the purpose for the recorded travel, one of: {“Coming Back From Restaurant”, “Eating”, “Going Back to Home”, “Recreation (Social Gathering)”, “Work/Home Commute”}

● checkInTime (datetime): the time when the participant checked in to their destination

● checkOutTime (datetime): the time when the participant left their destination

● startingBalance (double): the participant’s starting balance at the beginning of their travels

● endingBalance (double): the participant’s ending balance at the conclusion of their travel

3.3. Data Structure and Summary

The following codes are executed to reveal the structure and summary statistics of the data:

str(travel)
spec_tbl_df [2,099,656 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId        : num [1:2099656] 23 876 902 919 154 509 911 23 512 556 ...
 $ travelStartTime      : POSIXct[1:2099656], format: "2022-03-01 05:20:00" ...
 $ travelStartLocationId: num [1:2099656] 532 NA NA NA NA ...
 $ travelEndTime        : POSIXct[1:2099656], format: "2022-03-01 05:55:00" ...
 $ travelEndLocationId  : num [1:2099656] 894 1804 1801 1802 446 ...
 $ purpose              : chr [1:2099656] "Recreation (Social Gathering)" "Eating" "Eating" "Eating" ...
 $ checkInTime          : POSIXct[1:2099656], format: "2022-03-01 05:55:00" ...
 $ checkOutTime         : POSIXct[1:2099656], format: "2022-03-01 06:00:00" ...
 $ startingBalance      : num [1:2099656] 851 2072 2116 2120 2246 ...
 $ endingBalance        : num [1:2099656] 850 2066 2110 2115 2242 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   travelStartTime = col_datetime(format = ""),
  ..   travelStartLocationId = col_double(),
  ..   travelEndTime = col_datetime(format = ""),
  ..   travelEndLocationId = col_double(),
  ..   purpose = col_character(),
  ..   checkInTime = col_datetime(format = ""),
  ..   checkOutTime = col_datetime(format = ""),
  ..   startingBalance = col_double(),
  ..   endingBalance = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

3.4. Data Preparation

In order to get the sales volume for each of the pubs and restaurants, we will need to execute the following steps:

● Filter the purpose to consists only of “Recreation (Social Gathering)” and “Eating”. “Recreation (Social Gathering)” would refer to pubs whereas “Eating” would refer to restaurants.

● Compute the amount spent by a customer at each venue by subtracting the endingBalance from the startingBalance.

● Extract year-month, weekday and hour from the checkInTime. This will be used in the plotting of line chart (to see the monthly sales pattern) as well as for the generation of heatmap to visualize the check in rate of customers across the weekdays at an hourly interval.

sales <- travel %>% select(participantId, travelEndLocationId, purpose, checkInTime, startingBalance, endingBalance) %>% 
  filter(purpose == "Eating" | purpose == "Recreation (Social Gathering)") %>%  
  mutate(spend = startingBalance - endingBalance) %>% 
  mutate(date = format(as.Date(checkInTime), "%Y-%m")) %>% 
  mutate(wkday = wday(checkInTime, label = TRUE)) %>% mutate(hr = hour(checkInTime)) 

sales$travelEndLocationId <- as.character(sales$travelEndLocationId)

Group the sales by the travelEndLocationId, date, weekday and hour.

● Count the number of customers checking in at each hour.

● Sum the total sales at each hour.

sales_hourly <- sales %>% group_by(purpose, travelEndLocationId, date, wkday, hr) %>% 
  summarise(total = sum(spend), customer = n())

Pubs data

pubs_hourly <- sales_hourly %>% filter(purpose == "Recreation (Social Gathering)")  

4. Data Visualization

4.1. Total sales for pubs

In order to ascertain which pubs are prospering or struggling, we will compute the total sales throughout the priod covered and ranked them.

pubs_t <- pubs_hourly %>% group_by(travelEndLocationId) %>% 
  summarise(overall = sum(total))

4.1.1 Total sales for all pubs

pubs_rank<- ggplot(data=pubs_t, aes(x=reorder(travelEndLocationId, -overall), y=overall)) + 
  geom_bar(stat = "identity", fill="navy blue")  + 
  ylim(0,900000) +
  theme_bw() +
  labs(x = "Pubs", y = "Total Sales", title ="Ranking of pubs by overall sales")

pubs_rank

As the y-values are too long to be displayed in the chart we will format it for easy readability.

so_formatter <- function(x) {
  dplyr::case_when(
      x < 1e3 ~ as.character(x),
      x < 1e6 ~ paste0(as.character(x/1e3), "K"),
      x < 1e9 ~ paste0(as.character(x/1e6), "M"),
  )
}

pubs_t$lab <- so_formatter(signif(pubs_t$overall, digits=3))

4.2.1 Top 5 Pubs

The top 5 performing pubs are pubs 1342, 1344, 1798, 893 1nd 1343. We notice that the top 2 pubs exceed $500k in sales across the study period.

pubs_5 <- top_n(pubs_t, 5)

pubs_5_bar<- ggplot(data=pubs_5, aes(x=reorder(travelEndLocationId, -overall), y=overall,  fill=travelEndLocationId)) + 
  geom_bar(stat = "identity")  + 
  ylim(0,1000000) +
  geom_text(aes(label = lab , vjust = -1)) +
  labs(x = "Pubs", y = "Total Sales", title ="Top 5 Pubs") +
  #scale_fill_manual(values=c("skyblue2","dodgerblue1", "dodgerblue3","royalblue3","blue4")) + 
  theme_bw() +
  theme(legend.position = "none")

pubs_5_bar

4.2.2 Monthly Sales Chart for top 5 pubs

As the previous chart only compares at the overall sales volume, it provides no insights on how the businesses are performing overtime. Therefore, the Monthly Sales chart visualizes how the pubs are performing at a monthly interval.

Prepare the monthly sales data for pubs

● Group by the pubs and year-month

● As the month of May 2023 is incomplete we will filter that out.

pubs_monthly <- pubs_hourly %>% group_by(travelEndLocationId, date) %>% 
  summarise(monthly_sales = sum(total)) %>% filter(date != "2023-05")

Inner join with the top 5 pubs

We see that for the top 5 pubs in general, there is a sharp decline in sales from March 2022 to April 2022. Businesses continues to dip till May 2022 after which the sales fluctuates.

pubs_line <- merge(pubs_monthly, pubs_5, by = "travelEndLocationId")

pubs_5_line<-ggplot(data=pubs_line, aes(x=date, y=monthly_sales, group=travelEndLocationId)) +
  geom_line(aes(color=travelEndLocationId)) +
  theme_bw() +
  ylim(20000, 120000) +
  labs(x = "Date", y = "Sales", title ="Monthly Sales for Top 5 Pubs") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))

pubs_5_line

Combine the 2 charts

pubs_5_bar / pubs_5_line + plot_layout(heights = c(1,1))

4.3.1 Bottom 5 pubs

Pubs with the lowest sales are pubs 443, 444, 442, 894 and 1799.

pubs_b5 <- top_n(pubs_t, -5)

pubs_b5_bar<- ggplot(data=pubs_b5, aes(x=reorder(travelEndLocationId, +overall), y=overall,  fill=travelEndLocationId)) + 
  geom_bar(stat = "identity")  + 
  ylim(0,400000) +
  geom_text(aes(label = lab , vjust = -1)) +
  labs(x = "Pubs", y = "Total Sales", title ="Bottom 5 Pubs") + 
  theme_bw() +
  theme(legend.position = "none")

pubs_b5_bar

4.3.2 Monthly Sales Chart for bottom 5 pubs

Inner join with the bottom 5 pubs

We notice similar sales pattern for the bottom 5 performing pubs as that of the top 5 performing pubs, in which there is a sharp decline in sales from March 2022 to April 2022 and the sales volume fluctuates from June 2022 onwards.

pubs_lineb <- merge(pubs_monthly, pubs_b5, by = "travelEndLocationId")

pubs_5_lineb<-ggplot(data=pubs_lineb, aes(x=date, y=monthly_sales, group=travelEndLocationId)) +
  geom_line(aes(color=travelEndLocationId)) +
  theme_bw() +
  ylim(10000, 50000) +
  labs(x = "Date", y = "Sales", title ="Monthly Sales for Bottom 5 Pubs") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))

pubs_5_lineb

4.4. Customer check in for all pubs

To deep-dive further on the customer patter of the pubs, we will plot the heatmap using the average check-in for each weekday at hourly intervals.

Calculate the average number of customer per weekday at hourly intervals

● Make wkday and hr into factors.

● Get the average number of customers check in per timing.

wkday_levels <- c('Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon', 'Sun')

checkin_pubs<- sales_hourly %>% filter(purpose == "Recreation (Social Gathering)") %>%
  mutate(wkday = factor(wkday, levels = wkday_levels),
    hr  = factor(hr, levels = 0:23))

checkin_pubs_hm <- checkin_pubs %>% group_by(travelEndLocationId, wkday, hr) %>% 
  summarise(ave_checkin = mean(customer)) %>% na.omit

Plot the heatmap

For pubs, we observe that the peak and non-peak hours are relatively similar for all pubs.

pubs_heatmap<-ggplot(checkin_pubs_hm, 
       aes(hr, 
           wkday, 
           fill = ave_checkin)) + 
  geom_tile(color = "white", 
          size = 0.1) + 
  theme_tufte(base_family = "Helvetica") + 
  coord_equal() +
  scale_fill_gradient(name = "# of customers",
                    low = "sky blue", 
                    high = "dark blue") +
  facet_wrap(~travelEndLocationId, ncol = 4) +
  labs(x = NULL, y = NULL, 
     title = "Average Check into pubs by weekday and time of the day") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 7),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6) )

pubs_heatmap

4.5. Plot the boxplot distribution of monthly sales for all pubs

pubs_levels <- c('1342', '1344', '1798', '893', '1343', '1800', '892', '1799', '894', '442', '444', '443')


pubs_monthly$travelEndLocationId <- factor(pubs_monthly$travelEndLocationId, pubs_levels)

pubs_box_all <- ggplot(data=pubs_monthly, aes(x=travelEndLocationId, y=monthly_sales)) +
              geom_boxplot() +
              stat_summary(geom = "point",fun.y="mean",colour ="green",size=2) +
              geom_hline(aes(yintercept=mean(monthly_sales,na.rm=T)), color="red", linetype="dashed", size=1) + 
              ggtitle("Distribution of monthly sales")

pubs_box_all