Tutorial 11: Categorical Data
Q1 - Proportion of Fruity Candies
This tutorial uses a Halloween candy dataset that compares popular candy brands. Calculate the proportion of fruity candies in the dataset.
In categorical datasets like this one, information is usually stored as indicator variables: a value of 1 means yes and a value of 0 means no. For such a 0/1 indicator, the proportion of 1s is mean(indicator == 1), because R treats TRUE as 1 and FALSE as 0 when averaging.
Feel free to run this code block to visualize the data.
Use the right indicator variable (0 or 1) for fruity candies.
df <- read.csv("candy-data.csv")
mean(df$fruity == 1)
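As a small illustration of why the mean of a 0/1 indicator is a proportion, the sketch below uses a made-up vector (the values are hypothetical and not taken from candy-data.csv) and computes the same quantity in two ways.

```r
# Hypothetical 0/1 indicator values, not taken from candy-data.csv
fruity <- c(1, 0, 0, 1, 1, 0, 1, 0)

# Comparing to 1 gives TRUE/FALSE; mean() counts TRUE as 1 and FALSE as 0
prop_by_mean <- mean(fruity == 1)

# The same proportion written out as (number of 1s) / (number of values)
prop_by_count <- sum(fruity == 1) / length(fruity)

print(c(by_mean = prop_by_mean, by_count = prop_by_count))
```

If a column may contain missing values, mean(x == 1, na.rm = TRUE) is one way to ignore them.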
Q2 - Conditional Proportion
Return the proportion of hard candies among all fruity candies, i.e. return a single numeric value for P(Hard Candy = 1 | Fruity Candy = 1).
Feel free to run this code block to visualize the data.
Subset to fruity == 1 and then compute mean(hard == 1) in that subset.
df <- read.csv("candy-data.csv")
sub <- subset(df, fruity == 1)
mean(sub$hard == 1)
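The same conditional proportion can also be read off a table of row proportions. The sketch below assumes a small made-up data frame called toy (hypothetical values, not the candy data) and compares the subset approach with prop.table().

```r
# Hypothetical toy data, not taken from candy-data.csv
toy <- data.frame(
  fruity = c(1, 1, 1, 1, 0, 0, 0, 1),
  hard   = c(1, 0, 0, 1, 1, 0, 0, 0)
)

# Approach 1: keep only fruity rows, then average the hard indicator
by_subset <- mean(toy$hard[toy$fruity == 1] == 1)

# Approach 2: cross-tabulate, then divide each row by its row total
counts    <- table(fruity = toy$fruity, hard = toy$hard)
row_props <- prop.table(counts, margin = 1)

# Row "1" (fruity), column "1" (hard) holds P(hard = 1 | fruity = 1)
by_table <- row_props["1", "1"]

print(c(by_subset = by_subset, by_table = by_table))
```

With margin = 1 every row of row_props sums to 1, so each row is a conditional distribution of hard given that value of fruity.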
Q3 - Difference in Conditional Proportions
Compute \(p_1\) = P(bar = 1 | chocolate = 1), the proportion of candy bars among chocolate candies, and then compute \(p_0\) = P(bar = 1 | chocolate = 0), the proportion of candy bars among candies that are not made of chocolate. Finally, return \(p_1 - p_0\).
Feel free to run this code block to visualize the data.
Use bar as the event and compare chocolate vs. no chocolate.
df <- read.csv("candy-data.csv")
p1 <- mean(subset(df, chocolate == 1)$bar == 1)
p0 <- mean(subset(df, chocolate == 0)$bar == 1)
p1 - p0
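Both conditional proportions can be computed in one step by averaging the indicator within each group. This is only a sketch on made-up values (the toy data frame is hypothetical), using tapply() as an alternative to two subset() calls.

```r
# Hypothetical toy data, not taken from candy-data.csv
toy <- data.frame(
  chocolate = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  bar       = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# Mean of the bar indicator within each chocolate group;
# the result is named by group: "0" and "1"
props <- tapply(toy$bar == 1, toy$chocolate, mean)

# p1 - p0, with the leftover group name stripped
diff_props <- as.numeric(props["1"] - props["0"])

print(props)
print(diff_props)
```

A positive difference means bars are more common among chocolate candies in the toy data; a negative one would mean the opposite.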
Q4 - Chi-square Test of Independence
Perform a chi-square test of independence between chocolate and bar and return the p-value.
A chi-square test of independence uses a contingency table to test \(H_0\): the variables are independent vs. \(H_a\): the variables are associated.
In R, we can do this by creating a table(x, y) of the two variables and then passing that table to the test function.
Feel free to run this code block to visualize the data.
Make the appropriate table and fill in the correct test.
df <- read.csv("candy-data.csv")
tab <- table(df$chocolate, df$bar)
chisq.test(tab, correct = FALSE)$p.value
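To see what the test computes, the sketch below rebuilds the Pearson statistic by hand on made-up counts (hypothetical, not the candy data): expected counts under independence come from the row and column totals, and the p-value is the upper tail of a chi-square distribution. It then compares the result with chisq.test(..., correct = FALSE).

```r
# Hypothetical indicators: 30 chocolate and 40 non-chocolate candies
chocolate <- rep(c(1, 0), times = c(30, 40))
bar <- c(rep(c(1, 0), times = c(18, 12)),  # bars among chocolate candies
         rep(c(1, 0), times = c(4, 36)))   # bars among the rest

observed <- table(chocolate = chocolate, bar = bar)

# Expected count per cell under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and its degrees of freedom, (rows - 1) * (cols - 1)
stat <- sum((observed - expected)^2 / expected)
dof  <- (nrow(observed) - 1) * (ncol(observed) - 1)

p_by_hand <- pchisq(stat, df = dof, lower.tail = FALSE)
p_builtin <- chisq.test(observed, correct = FALSE)$p.value

print(c(by_hand = p_by_hand, builtin = p_builtin))
```

The two p-values should agree here because correct = FALSE turns off the continuity correction that chisq.test() otherwise applies to 2×2 tables.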
Q5 - Fisher’s Exact Test
Run Fisher’s exact test for association between fruity and hard candies. Return the p-value.
Fisher’s exact test is often used for 2×2 tables, especially when some expected counts are small. It computes the p-value exactly from the hypergeometric distribution of the table given its row and column totals, instead of relying on the large-sample approximation behind the chi-square test. That makes it more reliable than the chi-square test when expected counts are low.
Feel free to run this code block to visualize the data.
Make the appropriate table and fill in the correct test.
df <- read.csv("candy-data.csv")
tab <- table(df$fruity, df$hard)
fisher.test(tab)$p.value
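The sketch below uses a made-up 2×2 table (hypothetical counts, not the candy data) to check for small expected counts and to rebuild the two-sided p-value by hand. It assumes the usual two-sided definition, in which the p-value sums the hypergeometric probabilities of all tables that are no more likely than the observed one, with the row and column totals held fixed.

```r
# Hypothetical 2x2 table of counts, not taken from candy-data.csv
tab <- matrix(c(8, 2,
                1, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(fruity = c("0", "1"), hard = c("0", "1")))

# Expected counts under independence; values below 5 are the usual warning sign
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
small_expected <- any(expected < 5)

# With margins fixed, the top-left cell follows a hypergeometric distribution
x <- tab[1, 1]       # observed top-left count
m <- sum(tab[, 1])   # first column total
n <- sum(tab[, 2])   # second column total
k <- sum(tab[1, ])   # first row total

support <- max(0, k - n):min(k, m)   # possible values of the top-left cell
probs   <- dhyper(support, m, n, k)
p_obs   <- dhyper(x, m, n, k)

# Sum the probabilities of tables no more likely than the observed one;
# the factor (1 + 1e-7) is a small tolerance for floating-point ties
p_by_hand <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_builtin <- fisher.test(tab)$p.value

print(small_expected)
print(c(by_hand = p_by_hand, builtin = p_builtin))
```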
Q6 - Create a Categorical Outcome
The data were collected from real people through a website that showed participants two fun-sized candies and asked them to click on the one they would prefer to receive. In total, more than 269 thousand votes were collected from 8,371 different IP addresses.
Hence, each candy has a win percentage (winpercent) associated with it. For this question:
Create high_win as:
- “High” if winpercent >= median(winpercent)
- “Low” otherwise
Then test whether the proportion of High differs between chocolate and non-chocolate candies using prop.test(…). Return the p-value.
This is a two-proportion test comparing:
- group 1: chocolate = 1
- group 2: chocolate = 0
for the “success” outcome: high_win == “High”
Feel free to run this code block to visualize the data.
Find the cutoff by computing the median, then fill in the blanks with the appropriate indicator variable.
df <- read.csv("candy-data.csv")
cutoff <- median(df$winpercent)
high_win <- ifelse(df$winpercent >= cutoff, "High", "Low")
x1 <- sum(high_win[df$chocolate == 1] == "High")
n1 <- sum(df$chocolate == 1)
x2 <- sum(high_win[df$chocolate == 0] == "High")
n2 <- sum(df$chocolate == 0)
prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value
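A two-proportion test without continuity correction is closely related to the chi-square test on the corresponding 2×2 table of group by outcome. The sketch below uses made-up counts (x1, n1, x2, n2 are hypothetical here, not computed from the candy data) to show the two p-values side by side.

```r
# Hypothetical counts: "High" outcomes and group sizes
x1 <- 20; n1 <- 30   # chocolate group
x2 <- 12; n2 <- 40   # non-chocolate group

# Two-proportion test, no continuity correction
p_prop <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value

# The same data as a 2x2 table: rows are groups, columns are High/Low
tab <- rbind(chocolate     = c(High = x1, Low = n1 - x1),
             non_chocolate = c(High = x2, Low = n2 - x2))

# Chi-square test of independence on that table, no continuity correction
p_chisq <- chisq.test(tab, correct = FALSE)$p.value

print(c(prop_test = as.numeric(p_prop), chisq_test = as.numeric(p_chisq)))
```

Both calls rest on the same Pearson statistic, so the p-values should match for a two-sided comparison of two groups.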
Q7 - Odds Ratio
Compute the odds ratio for the 2×2 table of chocolate (rows) vs. bar (columns):
\[
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}
\] where:
- a = #(chocolate=1, bar=1)
- b = #(chocolate=1, bar=0)
- c = #(chocolate=0, bar=1)
- d = #(chocolate=0, bar=0)
The odds ratio is a common measure of association for 2×2 categorical data:
- OR > 1 suggests positive association
- OR = 1 suggests no association
- OR < 1 suggests negative association
Use the odds ratio formula after making the appropriate table.
df <- read.csv("candy-data.csv")
tab <- table(df$chocolate, df$bar)
a <- tab["1","1"]
b <- tab["1","0"]
c <- tab["0","1"]
d <- tab["0","0"]
odds_choc1 <- a / b
odds_choc0 <- c / d
(or <- odds_choc1 / odds_choc0)
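As a worked check of the formula, the sketch below uses a made-up 2×2 table (hypothetical counts, not the candy data) and computes the odds ratio both as a ratio of odds and as the cross-product ad / bc. The names n11, n10, n01 and n00 are illustrative stand-ins for a, b, c and d.

```r
# Hypothetical 2x2 table of counts, not taken from candy-data.csv
tab <- matrix(c(36, 4,
                12, 18),
              nrow = 2, byrow = TRUE,
              dimnames = list(chocolate = c("0", "1"), bar = c("0", "1")))

n11 <- tab["1", "1"]   # a: chocolate = 1, bar = 1
n10 <- tab["1", "0"]   # b: chocolate = 1, bar = 0
n01 <- tab["0", "1"]   # c: chocolate = 0, bar = 1
n00 <- tab["0", "0"]   # d: chocolate = 0, bar = 0

# Odds of being a bar within each chocolate group, then their ratio
or_from_odds <- (n11 / n10) / (n01 / n00)

# Equivalent cross-product form: (a * d) / (b * c)
or_cross <- (n11 * n00) / (n10 * n01)

print(c(from_odds = or_from_odds, cross_product = or_cross))
```

Here the odds are 18/12 = 1.5 for chocolate candies and 4/36 for the rest, giving an odds ratio of 13.5 in the toy table. If any cell is zero, the ratio is 0, infinite or undefined, so the counts are worth checking first.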