In this Take-home Exercise 1, we will examine the demographic distribution of volunteer participants from Ohio, USA.
In this take-home exercise, appropriate static statistical graphics methods are used to reveal the demographics of the city of Engagement, Ohio, USA.
The data are processed using the appropriate tidyverse packages, and the statistical graphics are prepared using ggplot2 and its extensions.
Before we get started, we need to ensure that the required R packages have been installed. If they have, we load them; if not, we install them first and then load them into the R environment.
The code chunk below will do the trick.
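A minimal sketch of such a chunk, using a common install-if-missing pattern. The package list is an assumption: only tidyverse (which bundles readr, dplyr and ggplot2) is named in this write-up, so any ggplot2 extensions would need to be added to the vector.

```r
# Packages assumed for this exercise; extend the vector as needed.
packages <- c("tidyverse")

for (p in packages) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(p, character.only = TRUE)
}
```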
The Participants dataset was obtained from the Attributes folder retrieved from the VAST Challenge 2022 website.
The code chunk below imports Participants.csv from the data folder into R by using read_csv() of readr and saves it as a tibble data frame called participants_data.
participants_data <- read_csv("data/Participants.csv")
The following data definition has been extracted from the VAST Challenge 2022 Dataset Descriptions file which can be obtained from the Vast-Challenge-2022 folder downloaded earlier.
Participants.csv data contains information about the residents of Engagement, OH that have agreed to participate in this study.
● participantId (integer): unique ID assigned to each participant
● householdSize (integer): the number of people in the participant’s household
● haveKids (boolean): whether there are children living in the participant’s household
● age (integer): participant’s age in years at the start of the study
● educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
● interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
● joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
The participants data consists of 1011 observations of 7 variables, namely participantId, householdSize, haveKids, age, educationLevel, interestGroup and joviality. We omit participantId from our exploration as it is only an identifier and carries no analytical meaning. To visualize the distribution of the other 6 variables, we plot bar charts to understand their spread.
The code chunks below plot the bar charts by using geom_bar() of ggplot2.
To get a rough idea of the distribution of the variables, we obtain the summary statistics using the built-in function summary().
summary(participants_data)
 participantId    householdSize    haveKids            age
 Min.   :   0.0   Min.   :1.000   Mode :logical   Min.   :18.00
 1st Qu.: 252.5   1st Qu.:1.000   FALSE:710       1st Qu.:29.00
 Median : 505.0   Median :2.000   TRUE :301       Median :39.00
 Mean   : 505.0   Mean   :1.964                   Mean   :39.07
 3rd Qu.: 757.5   3rd Qu.:3.000                   3rd Qu.:50.00
 Max.   :1010.0   Max.   :3.000                   Max.   :60.00
 educationLevel     interestGroup        joviality
 Length:1011        Length:1011        Min.   :0.000204
 Class :character   Class :character   1st Qu.:0.240074
 Mode  :character   Mode  :character   Median :0.477539
                                       Mean   :0.493794
                                       3rd Qu.:0.746819
                                       Max.   :0.999234
First, we convert householdSize into a categorical variable by executing the following code.
participants_data$householdSize <- as.factor(participants_data$householdSize)
We observe that there are 3 distinct groups under householdSize: 1, 2 and 3. This indicates that, for this particular sample, the participants have relatively small households, with householdSize = 2 having the highest frequency.
ggplot(data = participants_data,
       aes(x = householdSize)) +
  geom_bar(fill = "navy") +
  # label each bar with its count and share; after_stat() replaces the deprecated ..count..
  geom_text(stat = "count",
            aes(label = paste0(after_stat(count), " (",
                               round(after_stat(count) / sum(after_stat(count)) * 100, 1),
                               "%)")),
            vjust = -1) +
  xlab("Household Size") +
  ylab("Count") +
  theme_classic() +
  coord_cartesian(ylim = c(0, 400)) +
  ggtitle("Distribution of Ohio participants by Household Size")
The average number of persons per household in Ohio for 2016-2020, obtained from the U.S. Census Bureau, is 2.41, which roughly agrees with the above observation. This suggests that Ohio households tend to be small.
The haveKids variable is logical and takes either TRUE or FALSE. We see that the majority of the participants (710 of 1011) do not have kids. This variable overlaps with householdSize, since householdSize = 3 also gives an indication of whether a participant has kids.
ggplot(data = participants_data,
       aes(x = haveKids)) +
  geom_bar(fill = "navy") +
  # label each bar with its count and share; after_stat() replaces the deprecated ..count..
  geom_text(stat = "count",
            aes(label = paste0(after_stat(count), " (",
                               round(after_stat(count) / sum(after_stat(count)) * 100, 1),
                               "%)")),
            vjust = -1) +
  xlab("Have Kids") +
  ylab("Count") +
  theme_classic() +
  coord_cartesian(ylim = c(0, 750)) +
  ggtitle("Distribution of 'Have Kids'")
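To make the bar labels concrete, here is a small base R sketch of the same "count (percent%)" arithmetic that the geom_text() layer performs. The helper name count_labels is hypothetical, and the input vector is rebuilt from the FALSE/TRUE counts reported by summary() rather than read from the file.

```r
# Build "count (percent%)" labels for each category of a vector,
# mirroring the label expression used inside geom_text().
count_labels <- function(x) {
  counts <- table(x)
  pct <- round(as.numeric(counts) / sum(counts) * 100, 1)
  labels <- paste0(as.numeric(counts), " (", pct, "%)")
  setNames(labels, names(counts))
}

# Reconstructed from the summary() output: 710 FALSE and 301 TRUE
have_kids <- rep(c(FALSE, TRUE), times = c(710, 301))
count_labels(have_kids)
```

Computing the labels outside the plot like this is a quick way to check the percentages shown on the chart.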
As age takes many distinct values, the count label for each age is omitted; instead, we increase the number of ticks and include horizontal grid lines for readability. Ages range from 18 to 60 years, indicating that the participants surveyed are of working age.
ggplot(data = participants_data,
       aes(x = age)) +
  geom_bar(fill = "navy") +
  xlab("Age") +
  ylab("Count") +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(breaks = seq(0, 40, by = 5)) +
  ggtitle("Distribution of Ohio participants by Age") +
  # dashed red line at the mean age (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept = mean(age, na.rm = TRUE)),
             color = "red", linetype = "dashed", linewidth = 1) +
  # annotate() draws the label once; size is in mm, so 9 / .pt gives 9 pt text
  annotate("text", x = 40, y = 30, label = "mean = 39.07",
           colour = "red", angle = 90, size = 9 / .pt)
Based on the summary statistics calculated in section 3.3.1, the mean age of the participants is 39.07. As it is difficult to compare individual ages, we group them into 5-year intervals from age 20 to 60.
We can do this by implementing the following code.
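The grouping code is not shown here, so the following is only a sketch of one way to do it with base R's cut(). The helper name add_age_group and the bin labels are assumptions; in particular the top bin is labelled "60+" here and corresponds to the group the text calls ">60".

```r
# Bin ages into the groups described in the text: under 20, 5-year bands
# from 20 to 59, and a final bin for age 60.
add_age_group <- function(df) {
  breaks <- c(-Inf, 19, 24, 29, 34, 39, 44, 49, 54, 59, Inf)
  labels <- c("<20", "20-24", "25-29", "30-34", "35-39",
              "40-44", "45-49", "50-54", "55-59", "60+")
  # right = TRUE (the default) makes each bin closed on the right,
  # e.g. (19, 24] holds ages 20 to 24 for integer ages
  df$ageGroup <- cut(df$age, breaks = breaks, labels = labels)
  df
}

# Illustrative ages only, not taken from Participants.csv
demo <- add_age_group(data.frame(age = c(18, 19, 20, 24, 30, 59, 60)))
as.character(demo$ageGroup)
```

Applied to the real data, this would be `participants_data <- add_age_group(participants_data)`, which creates the ageGroup column used in the plots that follow.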
When we plot the distribution of the Ohio participants by age group, we see that the frequency is relatively similar across groups, with the <20 and >60 groups as the exceptions. This is because the <20 group only includes participants aged 18 or 19, whereas the >60 group only includes participants who were aged 60 at the time of the survey.
ggplot(data = participants_data,
       aes(x = ageGroup)) +
  geom_bar(fill = "navy") +
  xlab("Age Group (years)") +
  ylab("Count") +
  # percentage label above each bar; vjust is a constant, so it sits outside aes()
  geom_text(stat = "count",
            aes(label = paste0(round(after_stat(count) / sum(after_stat(count)) * 100, 1), "%")),
            vjust = -1) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 140, by = 20), limits = c(0, 140)) +
  ggtitle("Distribution of Ohio participants by Age Group")
The distribution of interestGroup is relatively even, with each group making up 8.2% - 11.5% of the total participants. As the information on interestGroup has been redacted to avoid bias, we do not have specific details on what this variable means.
ggplot(data = participants_data,
       aes(x = interestGroup)) +
  geom_bar(fill = "navy") +
  xlab("Interest Group") +
  ylab("Count") +
  # percentage label above each bar; vjust is a constant, so it sits outside aes()
  geom_text(stat = "count",
            aes(label = paste0(round(after_stat(count) / sum(after_stat(count)) * 100, 1), "%")),
            vjust = -1) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 120, by = 20), limits = c(0, 120)) +
  ggtitle("Distribution of Ohio participants by Interest Group")
When the bar chart of education level is plotted, we note that the majority of the participants have completed at least high school or college. More than one fifth of them have a bachelor’s degree.
# reorder() sorts the education levels by descending frequency
ggplot(data = participants_data,
       aes(x = reorder(educationLevel, educationLevel, function(x) -length(x)))) +
  geom_bar(fill = "navy") +
  xlab("Education Level") +
  ylab("Count") +
  geom_text(stat = "count",
            aes(label = paste0(after_stat(count), " (",
                               round(after_stat(count) / sum(after_stat(count)) * 100, 1), "%)")),
            vjust = -1) +
  theme_classic() +
  theme(panel.grid.major.y = element_line(color = "grey")) +
  scale_y_continuous(breaks = seq(0, 550, by = 100), limits = c(0, 550)) +
  ggtitle("Distribution of Ohio participants by Education Level")
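The chart above orders the bars by frequency. An alternative, sketched here as an assumption rather than as the approach used in this exercise, is to follow the natural order of the levels listed in the data dictionary by converting the column to an ordered factor. The sample values are illustrative only.

```r
# Level order taken from the data dictionary, lowest to highest
education_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# Illustrative values only, not taken from Participants.csv
education <- c("Graduate", "Low", "Bachelors", "HighSchoolOrCollege", "Low")

# ordered = TRUE makes comparisons such as "Low" < "Graduate" meaningful
education_ordered <- factor(education, levels = education_levels, ordered = TRUE)

levels(education_ordered)
table(education_ordered)
```

Because ggplot2 draws factor levels in level order, mapping `x = factor(educationLevel, levels = education_levels)` would place the bars from Low to Graduate.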
We will further investigate whether there are any underlying patterns in the demographic distribution of the Ohio participants.
Although there are slightly more younger participants (age 20-34) who have a bachelor’s degree, the distribution of age groups across the different education levels is relatively similar.
ggplot(data = participants_data,
aes(x = ageGroup)) +
geom_bar(aes(fill = educationLevel)) +
facet_grid(.~educationLevel) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Does age have any influence on the education level?")
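ggplot(data = participants_data,
       aes(x = age, y = educationLevel)) +
  geom_boxplot() +
  # dashed red line at the overall mean age
  geom_vline(aes(xintercept = mean(age, na.rm = TRUE)),
             color = "red", linetype = "dashed", linewidth = 1) +
  # green point marks the mean age per education level (fun replaces the deprecated fun.y)
  stat_summary(geom = "point", fun = "mean", colour = "green", size = 2) +
  ggtitle("Does age have any influence on the education level?")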
Even after separating the participants by household size, we observe little difference in mean age between the groups. This suggests that age has little association with the household size of the participants.
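ggplot(data = participants_data,
       aes(y = age, x = householdSize)) +
  geom_boxplot() +
  # dashed red line at the overall mean age
  geom_hline(aes(yintercept = mean(age, na.rm = TRUE)),
             color = "red", linetype = "dashed", linewidth = 1) +
  # annotate() draws the label once; size is in mm, so 9 / .pt gives 9 pt text
  annotate("text", x = 3.3, y = 40, label = "mean = 39.07",
           colour = "red", angle = 0, size = 9 / .pt) +
  # green point marks the mean age per household size (fun replaces the deprecated fun.y)
  stat_summary(geom = "point", fun = "mean", colour = "green", size = 2) +
  ggtitle("Does age have any influence on the household size?")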
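The group means marked on the box plot can also be checked numerically. The sketch below uses base R's tapply() on a small made-up data frame, so the numbers it prints are illustrative and say nothing about the actual participants.

```r
# Illustrative rows only, not taken from Participants.csv
demo <- data.frame(
  householdSize = factor(c(1, 1, 2, 2, 3, 3)),
  age           = c(30, 40, 25, 45, 38, 42)
)

# Mean age within each household size
mean_age <- tapply(demo$age, demo$householdSize, mean)
mean_age

# Difference of each group mean from the overall mean
mean_age - mean(demo$age)
```

Running the same two calls on participants_data would show how far each household size sits from the overall mean of 39.07.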
Comparing the joviality index of the participants by education level and age group, we see that participants aged under 20 with either a bachelor’s degree or a graduate qualification have a higher median joviality than the other groups. On the other hand, participants with a low education level aged 30-34 have the lowest median joviality.
ggplot(data = participants_data,
aes(x = ageGroup, y = joviality)) +
geom_boxplot() +
facet_grid(educationLevel ~.)
Graduate participants in interest group A and participants with a low education level in interest group F have the two lowest median joviality values.
ggplot(data = participants_data,
aes(x = interestGroup, y = joviality)) +
geom_boxplot() +
facet_grid(educationLevel~.)
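Reading medians off faceted box plots is error-prone, so a tabulation can help confirm which combinations are lowest. This is a sketch using base R's aggregate() on made-up rows; the values are illustrative and do not reproduce the findings above.

```r
# Illustrative rows only, not taken from Participants.csv
demo <- data.frame(
  educationLevel = c("Low", "Low", "Low", "Graduate", "Graduate", "Graduate"),
  interestGroup  = c("F", "F", "A", "A", "A", "F"),
  joviality      = c(0.10, 0.30, 0.80, 0.20, 0.40, 0.90)
)

# Median joviality for every education level / interest group combination
medians <- aggregate(joviality ~ educationLevel + interestGroup,
                     data = demo, FUN = median)

# Sort ascending so the lowest medians appear first
medians <- medians[order(medians$joviality), ]
medians
```

Swapping interestGroup for ageGroup in the formula would give the corresponding table for the age group comparison.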
No information on the gender, race, employment or economic status of the participants is found in the dataset. This information could further help us understand whether certain demographic factors influence the needs of the community.
As the details of interestGroup have been redacted to eliminate bias, we do not know what this variable means, and thus it might not be very useful for our analysis.
Although we noticed that educated participants aged under 20 have a higher median joviality index, whereas participants with a low education level aged 30-34 have the lowest, we cannot be too quick to assume that education level has a large influence on joviality. We will need more details, such as economic factors and employment status, to better understand the real cause of such a difference in the median joviality index.