1. Overview

In this take-home exercise, appropriate static statistical graphics methods are used to reveal the demographic of the city of Engagement, Ohio USA.

The data would be processed by using appropriate tidyverse family of packages and the statistical graphics would be prepared using ggplot2 and its extensions.

2. Getting Started

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The chunk code below will do the trick.

3. Dataset

The Participants dataset has been obtained from the Attributes folder retrieved from Vast Challenge 2022 website.

3.1. Importing Data

The code chunk below import Participants.csv from the data folder by using read_csv() of readr into R and save it as an tibble data frame called participants_data.

participants_data <- read_csv("data/Participants.csv")

3.2. Data Dictionary

The following data definition has been extracted from the VAST Challenge 2022 Dataset Descriptions file which can be obtained from the Vast-Challenge-2022 folder downloaded earlier.

Data definition

Participants.csv data contains information about the residents of Engagement, OH that have agreed to participate in this study.

● participantId (integer): unique ID assigned to each participant

● householdSize (integer): the number of people in the participant’s household

● haveKids (boolean): whether there are children living in the participant’s household

● age (integer): participant’s age in years at the start of the study

● educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}

● interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.

● joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.

3.3. Data Exploration

The participants data consists of 1011 observations with 7 variables namely participantID, householdSize, haveKids, age, educationLevel, interestGroup and joviality. We will omit participantID from our exploration as it has no significant meaning. To visualize the distribution of the 6 other variables, we will plot a bar chart to understand the spread.

The code chunk below plot a bar chart by using geom_bar of ggplot2.

3.3.1.Summary Statistic

To have a rough idea on the distribution of the variables, we ontained the summary statistics using the built-in function summary().

summary(participants_data)

 participantId    householdSize    haveKids            age       
 Min.   :   0.0   Min.   :1.000   Mode :logical   Min.   :18.00  
 1st Qu.: 252.5   1st Qu.:1.000   FALSE:710       1st Qu.:29.00  
 Median : 505.0   Median :2.000   TRUE :301       Median :39.00  
 Mean   : 505.0   Mean   :1.964                   Mean   :39.07  
 3rd Qu.: 757.5   3rd Qu.:3.000                   3rd Qu.:50.00  
 Max.   :1010.0   Max.   :3.000                   Max.   :60.00  
 educationLevel     interestGroup        joviality       
 Length:1011        Length:1011        Min.   :0.000204  
 Class :character   Class :character   1st Qu.:0.240074  
 Mode  :character   Mode  :character   Median :0.477539  
                                       Mean   :0.493794  
                                       3rd Qu.:0.746819  
                                       Max.   :0.999234

3.3.2. Distribution of Household Size

First we will convert householdSize into categorical variable by executing the following code.

participants_data$householdSize <- as.factor(participants_data$householdSize)

We observe that there are 3 distinct groups under householdSize which are 1, 2 and 3. This indicates that for this particular sample, the participants relatively have smaller family size, with householdSize = 2 having the highest frequency.

ggplot(data = participants_data,
       aes(x = householdSize)) +
  geom_bar(fill = "navy") + 
  geom_text(stat="count", 
      aes(label=paste0(..count.., " (", 
      round(..count../sum(..count..)*100,
            1), "%)")),
      vjust=-1) +
  xlab("Household Size") +
  ylab("Count") +
  theme_classic() +
  coord_cartesian(ylim=c(0,400)) +
  ggtitle("Distribution of Ohio participants by Household Size")

When we take a look at the average of persons per household, 2016-2020 for Ohio obtained from U.S.Census Bureau, it roughly agrees with the above observation with a value of 2.41. Thus, depicting that the Ohio population has a preference to have smaller families.

3.3.3. Distribution of ‘Have Kids’

For the haveKids variable, it is a logical data which takes either TRUE or FALSE values. We see that majority of the participants do not have kids yet. This variable has similar details as the householdSize variable, where householdSize = 3 would also give an indication whether the participants have kids.

ggplot(data = participants_data,
       aes(x = haveKids)) +
  geom_bar(fill = "navy") + 
  geom_text(stat="count", 
      aes(label=paste0(..count.., " (", 
      round(..count../sum(..count..)*100,
            1), "%)")),
      vjust=-1) +
  xlab("Have Kids") +
  ylab("Count") +
  theme_classic() +
  coord_cartesian(ylim=c(0,750)) + 
  ggtitle("Distribution of 'Have Kids'")

3.3.4. Distribution by Age

As age consists of many distinct categories, the count label for each age is omitted and instead we increase the number of ticks and include horizontal grid lines for readability. The age ranges from 18-60 years old, indicating that the participants surveyed for this analysis are from the working population.

ggplot(data = participants_data,
       aes(x = age)) +
  geom_bar(fill = "navy") +
  xlab("Age") +
  ylab("Count") +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(breaks = seq(0, 40, by = 5)) +
  ggtitle("Distribution of Ohio participants by Age") +
  geom_vline(aes(xintercept=mean(age,na.rm=T)),
             color="red", linetype="dashed", size=1) +
  geom_text(aes(x=40, label="mean = 39.07", y=30), colour="red", angle=90, text=element_text(size=9))

Based on the summary statistics calculated at section 3.3.1, the mean age of the participants is 39.07. As it is difficult to compare the discrete ages, we will group them in interval of 5 years from age 20 - 60 years old.

3.3.4.1. Distribution of Age Groups

We can do this by implementing the following code.

participants_data$ageGroup <- cut(participants_data$age,
                                  breaks = c(-Inf,20, 25, 30, 35, 40, 45, 50, 55, 60, Inf),
                                  labels = c("<20", "20-24", "25-29","30-34", "35-39", 
                                             "40-44", "45-49","50-54", "55-59", ">60"),
                                  right = FALSE)

When we plot the distribution of the Ohio participants by age group, we see that the frequency is relatively similar among the groups with <20 and >60 groups as the exception. This is due to the fact that for age <20, it only includes participants with age 18 or 19. Whereas for age >60 it only includes participants who are aged 60 at the time of the survey.

ggplot(data = participants_data,
       aes(x = ageGroup)) +
  geom_bar(fill = "navy") +
  xlab("Age Group (years)") +
  ylab("Count") +
  geom_text(stat="count", 
      aes(label = paste0(round(..count../sum(..count..)*100,1), "%"), vjust=-1)) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 140, by = 20), limits = c(0, 140)) +
  ggtitle("Distribution of Ohio participants by Age Group")

3.3.5. Distribution of Interest Groups

The distribution of interestGroup are relatively the same for all groups, in which each group consists of 8.2% - 11.5% of the total participants. As the information on interestGroup have been redacted to avoid bias, we do not have specific details on what this variable means.

ggplot(data = participants_data,
       aes(x = interestGroup)) +
  geom_bar(fill = "navy") +
  xlab("Interest Group") +
  ylab("Count") +
  geom_text(stat="count", 
      aes(label = paste0(round(..count../sum(..count..)*100,1), "%"), vjust=-1)) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 120, by = 20), limits = c(0, 120)) +
  ggtitle("Distribution of Ohio participants by Interest Group")

3.3.6. Distribution by Education Level

When the bar chart based on the education level was plotted, we note that majority of the participants minimally has completed high school or colleges. More than one fifth of them has a bachelor’s degree.

ggplot(data = participants_data, aes(x =reorder(educationLevel, educationLevel, function(x)-length(x)))) +
  geom_bar(fill = "navy") +
  xlab("Education Level") +
  ylab("Count") +
  geom_text(stat="count", 
      aes(label=paste0(..count.., " (", round(..count../sum(..count..)*100,1), "%)")), vjust=-1) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 550, by = 100), limits = c(0, 550)) +
  ggtitle("Distribution of Ohio participants by Education Level")

4. Deep Dive

We will further investigate, if there are any underlying patterns based on the demographic distribution of the Ohio participants.

4.1. Comparing two variables

4.1.1. Do younger participants achieve higher academic certifications than the older generation?

Although there are slightly more younger participants (age 20-34) who have a bachelors’ degree, the distribution of age group between the different education level is relatively similar.

ggplot(data = participants_data,
       aes(x = ageGroup)) +
  geom_bar(aes(fill = educationLevel)) +
  facet_grid(.~educationLevel) +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Does age have any influence on the education level?")

ggplot(data = participants_data,
       aes(x = age, y = educationLevel)) +
  geom_boxplot() +
  geom_vline(aes(xintercept=mean(age,na.rm=T)),
             color="red", linetype="dashed", size=1) +
  stat_summary(geom = "point",fun.y="mean",colour ="green",size=2) +
  ggtitle("Does age have any influence on the education level?")

4.1.2. Does age have any influence on the participants’ household size?

Despite separating the participants based on their household size, we observe little difference in the mean age group. This shows that the age spread has little influence on the household size of the participants.

ggplot(data = participants_data,
       aes(y = age, x = householdSize)) +
  geom_boxplot() +
  geom_hline(aes(yintercept=mean(age,na.rm=T)),
             color="red", linetype="dashed", size=1) +
  geom_text(aes(x=3.3, label="mean = 39.07", y=40), colour="red", angle=0, text=element_text(size=9)) +
  stat_summary(geom = "point",fun.y="mean",colour ="green",size=2) +
  ggtitle("Does age have any influence on the household size?")

4.2. Participants’ Joviality based on the Demographic Factor

4.2.1. Joviality Index based on Age Group and Education Level

Comparing the joviality index of the participant based on the level of education as well as age group, we spot that participants age less than 20 with either a bachelors degree or graduate certification has a higher jovial median than other combination groups. On the other hand, participant with low education level and age between 30-34 years old have the lowest median joviality.

ggplot(data = participants_data,
       aes(x = ageGroup, y = joviality)) +
  geom_boxplot() +
  facet_grid(educationLevel ~.)

4.2.2 Joviality Index based on Interest Group and Education Level

Graduate participants with interest group A and low educated participants with interest group F have the two lowest median joviality values.

ggplot(data = participants_data,
       aes(x = interestGroup, y = joviality)) +
  geom_boxplot() +
  facet_grid(educationLevel~.)

5. Challenges faced

No information on the gender, race, employment and economic status of the participants are found in the dataset. These information could further aid us in understanding if certain demographic factor has any influence on the needs of the community.
As the details for interestGroup has been redacted to eliminate biasness, we do not have an idea on what this variable means and thus might not be very useful for our analysis.
Although we noticed that educated participants with age less than 20 have higher median joviality index whereas the low educated participants of age 30-34 years old have the lowest median joviality index, we cannot be too quick to assume that the education level has large influence on the joviality values. We will need more details such as economic factor and employment details to better understand what is the real cause for such a difference in the median joviality index.

Take-home Exercise 1