Tutorial 11: Categorical Data

Q1 - Proportion of fruity candies

For this tutorial, we will be using a Halloween candy dataset that compares different popular candy brands. Calculate the proportion of fruity candies from the dataset.

Info

With categorical datasets like this one, information is usually stored as indicator variables, i.e. value = 1 means yes and value = 0 means no. For such a 0/1 indicator, the proportion of 1s is mean(indicator == 1).
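As a minimal sketch of the indicator idea, the snippet below uses a small made-up vector rather than the candy data; the name `fruity_toy` is hypothetical. Because the values are only 0 and 1, `mean(x == 1)` and `mean(x)` give the same proportion.

```r
# Hypothetical 0/1 indicator for six candies (not taken from the real dataset)
fruity_toy <- c(1, 0, 0, 1, 1, 0)

# Proportion of 1s: the comparison yields TRUE/FALSE, and mean() counts TRUE as 1
prop_fruity <- mean(fruity_toy == 1)

# For a pure 0/1 indicator, the plain mean is the same number
same_value <- mean(fruity_toy)

print(prop_fruity)  # 0.5, since three of the six values are 1
```

On the candy data, the same call would be applied to the fruity indicator column, assuming it is stored as 0/1 in candy-data.csv.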

Preview

Feel free to run this code block to visualize the data.

Q2 — Conditional Proportion

Return the proportion of hard candies among all fruity candies, i.e. return a single numeric value for P(Hard Candy = 1 | Fruity Candy = 1).

Preview

Feel free to run this code block to visualize the data.
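A conditional proportion can be sketched in two steps: restrict to the conditioning group, then take the proportion of 1s inside it. The data frame `toy` below is made up for illustration, and the column names `fruity` and `hard` are assumptions about how the real file names these indicators.

```r
# Hypothetical toy data; the column names fruity and hard are assumed
toy <- data.frame(
  fruity = c(1, 1, 1, 1, 0, 0, 0, 0),
  hard   = c(1, 0, 0, 1, 0, 1, 0, 0)
)

# Step 1: keep only the rows in the conditioning group (fruity == 1)
hard_among_fruity <- toy$hard[toy$fruity == 1]

# Step 2: proportion of 1s within that group, i.e. P(hard = 1 | fruity = 1)
p_hard_given_fruity <- mean(hard_among_fruity == 1)

print(p_hard_given_fruity)  # 0.5: two of the four fruity rows are hard
```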

Q3 — Difference in Conditional Proportions

Compute \(p_1\) = P(bar == 1 | chocolate == 1), i.e. the proportion of candy bars among all chocolate candies, and then compute \(p_0\) = P(bar == 1 | chocolate == 0), i.e. the proportion of candy bars among candies that are not made of chocolate. Finally, return \(p_1 - p_0\).

Preview

Feel free to run this code block to visualize the data.

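# Load the candy data
df <- read.csv("candy-data.csv")

# Proportion of bars among chocolate candies: P(bar == 1 | chocolate == 1)
p1 <- mean(subset(df, chocolate == 1)$bar == 1)
# Proportion of bars among non-chocolate candies: P(bar == 1 | chocolate == 0)
p0 <- mean(subset(df, chocolate == 0)$bar == 1)

# Difference in conditional proportions
p1 - p0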

Q4 — Chi-square Test of Independence

Perform a chi-square test of independence between chocolate and bar and return the p-value.

Info

A chi-square test of independence uses a contingency table to test \(H_0\): the variables are independent, against \(H_a\): the variables are associated.

In R, we can run this test by building a contingency table of the two variables with table(x, y) and passing it to chisq.test().
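To make the role of the contingency table concrete, here is a small sketch on made-up indicators (the vectors `x` and `y` are hypothetical, not the candy columns). It compares the observed table with the counts expected under independence, which is the comparison the test statistic is built from.

```r
# Hypothetical 0/1 indicators for 40 items (not the candy data)
x <- rep(c(1, 0), times = c(16, 24))
y <- c(rep(c(1, 0), times = c(12, 4)),   # among x == 1: 12 ones, 4 zeros
       rep(c(1, 0), times = c(6, 18)))   # among x == 0: 6 ones, 18 zeros

# Observed 2x2 contingency table
tab <- table(x, y)

# Expected count per cell under independence: row total * column total / n
expected_by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# chisq.test() stores the same expected counts in its result
res <- chisq.test(tab, correct = FALSE)

print(tab)
print(res$expected)
print(res$p.value)
```

The further the observed counts sit from the expected ones, the larger the test statistic and the smaller the p-value.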

Preview

Feel free to run this code block to visualize the data.

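df <- read.csv("candy-data.csv")

# 2x2 contingency table: chocolate in rows, bar in columns
tab <- table(df$chocolate, df$bar)
# correct = FALSE turns off the continuity correction that chisq.test() applies to 2x2 tables by default
chisq.test(tab, correct = FALSE)$p.value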

Q5 — Fisher’s Exact Test

Run Fisher’s exact test for association between fruity and hard candies. Return the p-value.

Info

Fisher’s exact test is often used for 2×2 tables, especially when some expected counts are small. It is an exact test: the p-value is computed directly from the hypergeometric distribution of the table given its row and column totals, rather than from the large-sample chi-square approximation. This makes it more reliable than the chi-square test when expected counts are low.

Preview

Feel free to run this code block to visualize the data.
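The following sketch runs Fisher’s exact test on a made-up 2×2 table whose expected counts are partly below 5, the situation in which the chi-square approximation is usually considered doubtful. The counts and the dimension labels `fruity` and `hard` are hypothetical; on the real data the table would instead come from table() applied to the two indicator columns.

```r
# Hypothetical 2x2 counts, small enough that some expected counts fall below 5
tab_small <- matrix(c(7, 2,
                      3, 8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(fruity = c("0", "1"),
                                    hard   = c("0", "1")))

# Expected counts under independence, to see why the chi-square
# approximation is questionable for this table
expected <- outer(rowSums(tab_small), colSums(tab_small)) / sum(tab_small)

# Fisher's exact test: p-value from the hypergeometric distribution,
# conditioning on the row and column totals
fisher_p <- fisher.test(tab_small)$p.value

print(expected)
print(fisher_p)
```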

Q6 — Create a Categorical Outcome

The data were collected from actual people through a website where participants were presented with two fun-sized candies and asked to click on the one they would prefer to receive. In total, more than 269 thousand votes were collected from 8,371 different IP addresses.

As a result, each candy has a win percentage (winpercent) associated with it. For this question:

Create high_win as:
“High” if winpercent >= median(winpercent)
“Low” otherwise

Then test whether the proportion of High differs between chocolate and non-chocolate candies using prop.test(…). Return the p-value.

Info

This is a two-proportion test. It compares the proportion of “successes”, i.e. candies with high_win == “High”, between two groups:

group 1: chocolate = 1

group 2: chocolate = 0

Preview

Feel free to run this code block to visualize the data.

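df <- read.csv("candy-data.csv")

# Label each candy "High" or "Low" relative to the median win percentage
cutoff <- median(df$winpercent)
high_win <- ifelse(df$winpercent >= cutoff, "High", "Low")

# Group 1 (chocolate): number of "High" candies and group size
x1 <- sum(high_win[df$chocolate == 1] == "High")
n1 <- sum(df$chocolate == 1)

# Group 2 (non-chocolate): number of "High" candies and group size
x2 <- sum(high_win[df$chocolate == 0] == "High")
n2 <- sum(df$chocolate == 0)

# Two-proportion test without continuity correction
prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value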

Q7 — Odds Ratio

Compute the odds ratio for the 2×2 table of chocolate (rows) vs. bar (columns):

\[ OR = \frac{a/b}{c/d} \] where:
- a = #(chocolate=1, bar=1)
- b = #(chocolate=1, bar=0)
- c = #(chocolate=0, bar=1)
- d = #(chocolate=0, bar=0)

Info

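The odds ratio is a common measure of association for 2×2 categorical data. Here it compares the odds of being a bar for chocolate candies with the odds of being a bar for non-chocolate candies:

OR > 1 suggests a positive association (higher odds of being a bar among chocolate candies)

OR = 1 suggests no association

OR < 1 suggests a negative association (lower odds of being a bar among chocolate candies)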

Preview

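df <- read.csv("candy-data.csv")

# Rows are chocolate (0/1), columns are bar (0/1)
tab <- table(df$chocolate, df$bar)

# Cell counts, indexed by the "0"/"1" labels of the table
a <- tab["1","1"]
b <- tab["1","0"]
c <- tab["0","1"]
d <- tab["0","0"]

# Odds of being a bar within each chocolate group
odds_choc1 <- a / b
odds_choc0 <- c / d

# Wrapping the assignment in parentheses also prints the result
(or <- odds_choc1 / odds_choc0)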
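As a worked check of the formula, the sketch below plugs in made-up counts (not taken from the candy data) and shows that the ratio of odds (a/b) / (c/d) equals the cross-product form (a·d) / (b·c). The names `n11`, `n10`, `n01` and `n00` are hypothetical stand-ins for a, b, c and d.

```r
# Hypothetical cell counts, not from the candy data
n11 <- 20  # a: chocolate = 1, bar = 1
n10 <- 15  # b: chocolate = 1, bar = 0
n01 <- 2   # c: chocolate = 0, bar = 1
n00 <- 48  # d: chocolate = 0, bar = 0

# Ratio of the two odds, as in the definition above
or_ratio_of_odds <- (n11 / n10) / (n01 / n00)

# Algebraically identical cross-product form
or_cross_product <- (n11 * n00) / (n10 * n01)

print(or_ratio_of_odds)   # 32
print(or_cross_product)   # 32
```

With these counts the odds ratio is well above 1, which would point to a positive association between the two indicators.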