Tutorial 11: Categorical Data

Q1 - Proportion of fruity candies

For this tutorial, we will be using a Halloween candy dataset that compares different popular candy brands. Calculate the proportion of fruity candies from the dataset.

Photo by Joanna Kosinska on Unsplash
Note

With categorical datasets like this one, information is usually stored as indicator variables, i.e. value = 1 means yes and value = 0 means no. For such a 0/1 indicator, the proportion of 1s is mean(indicator == 1).
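
As a minimal sketch of why the mean of an indicator is a proportion, the snippet below uses a small made-up vector (the values are hypothetical and not taken from candy-data.csv) and checks that mean() agrees with counting by hand.

```r
# Hypothetical 0/1 indicator for six candies (illustrative, not from candy-data.csv)
fruity <- c(1, 0, 1, 1, 0, 0)

# fruity == 1 is a logical vector; mean() treats TRUE as 1 and FALSE as 0
prop_mean <- mean(fruity == 1)

# The same proportion computed as count / total
prop_count <- sum(fruity == 1) / length(fruity)

print(c(prop_mean, prop_count))
```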

Use the right indicator variable (0 or 1) for fruity candies.

df <- read.csv("candy-data.csv")
mean(df$fruity == 1)

Q2 — Conditional Proportion

Return the proportion of hard candies among all fruity candies, i.e. return a single numeric value for P(Hard Candy = 1 | Fruity Candy = 1).

Subset to fruity == 1 and then compute mean(hard == 1) in that subset.

df <- read.csv("candy-data.csv")

sub <- subset(df, fruity == 1)
mean(sub$hard == 1)
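
The same conditional proportion can also be read off a table of row proportions. The sketch below assumes a small hypothetical data frame named toy (not the real candy data) and compares the subsetting approach with prop.table(), where margin = 1 normalises each row so that it sums to 1.

```r
# Hypothetical toy data: 4 fruity and 4 non-fruity candies
toy <- data.frame(
  fruity = c(1, 1, 1, 1, 0, 0, 0, 0),
  hard   = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Approach 1: restrict to fruity candies, then take the mean of the indicator
p_subset <- mean(toy$hard[toy$fruity == 1] == 1)

# Approach 2: row proportions of the contingency table, i.e. P(hard | fruity)
cond <- prop.table(table(fruity = toy$fruity, hard = toy$hard), margin = 1)
p_table <- unname(cond["1", "1"])

print(c(p_subset, p_table))
```

Both approaches condition on fruity == 1. The table version additionally gives P(hard = 1 | fruity = 0) in the row labelled "0".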

Q3 — Difference in Conditional Proportions

Compute \(p_1\) = P(bar == 1 | chocolate == 1), i.e. the proportion of candy bars among all chocolate candies, and then compute \(p_0\) = P(bar == 1 | chocolate == 0), i.e. the proportion of candy bars among candies that are not made of chocolate. Finally, return \(p_1 - p_0\).

Photo by Denny Müller on Unsplash

Use bar as the event and compare chocolate vs. no chocolate.

df <- read.csv("candy-data.csv")

p1 <- mean(subset(df, chocolate == 1)$bar == 1)
p0 <- mean(subset(df, chocolate == 0)$bar == 1)

p1 - p0
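
As an alternative sketch, both conditional proportions can be computed in one call with tapply(), which applies mean() within each level of the grouping variable. The data frame toy below is hypothetical and only serves to make the example self-contained.

```r
# Hypothetical toy data: 4 chocolate and 4 non-chocolate candies
toy <- data.frame(
  chocolate = c(1, 1, 1, 1, 0, 0, 0, 0),
  bar       = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Proportion of bars within each chocolate group; names are "0" and "1"
props <- tapply(toy$bar == 1, toy$chocolate, mean)

# p1 - p0: chocolate group minus non-chocolate group
diff_props <- unname(props["1"] - props["0"])

print(props)
print(diff_props)
```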

Q4 — Chi-square Test of Independence

Perform a chi-square test of independence between chocolate and bar and return the p-value.

Note

A chi-square test of independence uses a contingency table to test:

\(H_0\): The variables are independent.
\(H_a\): The variables are associated.

In R, we can run this test by creating a table(x, y) of the two variables and then performing the test on that table.
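
To make the mechanics concrete, here is a sketch that computes the Pearson chi-square statistic by hand and compares it with chisq.test(). The counts are made up for illustration. Note that chisq.test() applies a continuity correction to 2×2 tables by default, so correct = FALSE is needed for the two results to agree.

```r
# Hypothetical 2x2 counts (illustrative only)
tab <- matrix(c(20, 10, 5, 25), nrow = 2, byrow = TRUE,
              dimnames = list(chocolate = c("1", "0"), bar = c("1", "0")))

# Expected counts under independence: (row total * column total) / grand total
n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson statistic and its p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
stat_manual <- sum((tab - expected)^2 / expected)
p_manual <- pchisq(stat_manual, df = 1, lower.tail = FALSE)

# Built-in test without the continuity correction
res <- chisq.test(tab, correct = FALSE)

print(c(manual = p_manual, builtin = res$p.value))
```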

Make the appropriate table and fill in the correct test.

df <- read.csv("candy-data.csv")

tab <- table(df$chocolate, df$bar)
chisq.test(tab, correct = FALSE)$p.value

Q5 — Fisher’s Exact Test

Run Fisher’s exact test for association between fruity and hard candies. Return the p-value.

Note

Fisher’s exact test is often used for 2×2 tables, especially when some expected counts are small. It is an exact test: the p-value is computed directly from the hypergeometric distribution of the table given its row and column totals, rather than from the large-sample approximation behind the chi-square test, so it remains valid when expected counts are low.
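
The sketch below illustrates this on a small hypothetical table (the counts are invented). It first inspects the expected counts, then reproduces the two-sided Fisher p-value from the hypergeometric distribution by summing the probabilities of all tables that are no more likely than the observed one. The small tolerance factor mirrors the one used inside fisher.test().

```r
# Hypothetical 2x2 table with small counts (illustrative only)
tab <- matrix(c(3, 1, 1, 4), nrow = 2, byrow = TRUE,
              dimnames = list(fruity = c("0", "1"), hard = c("0", "1")))

# Expected counts under independence; all are below 5 here.
# suppressWarnings() hides the "approximation may be incorrect" warning.
expected <- suppressWarnings(chisq.test(tab)$expected)

# Built-in exact test
p_fisher <- fisher.test(tab)$p.value

# Manual version: the top-left cell is hypergeometric given the margins
m <- sum(tab[, 1])   # first column total
n <- sum(tab[, 2])   # second column total
k <- sum(tab[1, ])   # first row total
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)
p_obs <- dhyper(tab[1, 1], m, n, k)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(c(fisher = p_fisher, manual = p_manual))
```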

Make the appropriate table and fill in the correct test.

df <- read.csv("candy-data.csv")

tab <- table(df$fruity, df$hard)
fisher.test(tab)$p.value

Q6 — Create a Categorical Outcome

The data were collected through a website where participants were shown two fun-sized candies and asked to click on the one they would prefer to receive. In total, more than 269 thousand votes were collected from 8,371 different IP addresses.

Each candy therefore has a win percentage (winpercent) associated with it. For this question:

Create high_win as:
- "High" if winpercent >= median(winpercent)
- "Low" otherwise

Then test whether the proportion of High differs between chocolate and non-chocolate candies using prop.test(…). Return the p-value.

Note

This is a two-proportion test comparing:
- group 1: chocolate = 1
- group 2: chocolate = 0

for the "success" outcome high_win == "High".
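
As a sketch of how this two-proportion test relates to the chi-square test, the example below uses made-up success counts and group sizes. With the continuity correction switched off in both calls, prop.test() on two groups and chisq.test() on the corresponding 2×2 table give the same p-value.

```r
# Hypothetical counts: successes and group sizes for two groups
x <- c(30, 15)   # number of "High" candies in each group
n <- c(40, 50)   # total candies in each group

# Two-proportion test without continuity correction
p_prop <- prop.test(x, n, correct = FALSE)$p.value

# Same data as a 2x2 table of successes and failures
tab <- cbind(success = x, failure = n - x)
p_chisq <- chisq.test(tab, correct = FALSE)$p.value

print(c(prop = p_prop, chisq = p_chisq))
```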

Find the cutoff by computing the median and then add in the appropriate indicator variable in the blanks.

df <- read.csv("candy-data.csv")

cutoff <- median(df$winpercent)
high_win <- ifelse(df$winpercent >= cutoff, "High", "Low")

x1 <- sum(high_win[df$chocolate == 1] == "High")
n1 <- sum(df$chocolate == 1)

x2 <- sum(high_win[df$chocolate == 0] == "High")
n2 <- sum(df$chocolate == 0)

prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value

Q7 — Odds Ratio

Compute the odds ratio for the 2×2 table of chocolate (rows) vs. bar (columns):

\[ OR = \frac{a/b}{c/d} \] where:
- a = #(chocolate=1, bar=1)
- b = #(chocolate=1, bar=0)
- c = #(chocolate=0, bar=1)
- d = #(chocolate=0, bar=0)
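
Dividing the two odds and clearing the fractions gives an equivalent cross-product form, which can be handy for checking a result by hand. It is defined only when b and c are both non-zero:

\[ OR = \frac{a/b}{c/d} = \frac{a \, d}{b \, c} \]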

Note

The odds ratio is a common measure of association for 2×2 categorical data:
- OR > 1 suggests a positive association
- OR = 1 suggests no association
- OR < 1 suggests a negative association
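
A small worked example, using invented counts rather than the candy data, shows how the odds ratio is read. The odds of being a bar are 20/10 = 2 in the first row and 5/25 = 0.2 in the second, so the ratio is well above 1.

```r
# Hypothetical 2x2 counts laid out as rows = chocolate, columns = bar
tab <- matrix(c(20, 10, 5, 25), nrow = 2, byrow = TRUE,
              dimnames = list(chocolate = c("1", "0"), bar = c("1", "0")))

# Odds of bar = 1 within each chocolate group
odds_choc1 <- tab["1", "1"] / tab["1", "0"]   # a / b
odds_choc0 <- tab["0", "1"] / tab["0", "0"]   # c / d

# Odds ratio as a ratio of odds, and as a cross-product for comparison
or_ratio <- odds_choc1 / odds_choc0
or_cross <- (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])

print(c(or_ratio, or_cross))
```

In this toy table the odds ratio is greater than 1, which points to a positive association between the row and column variables.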

Use the odds ratio formula after making the appropriate table.

df <- read.csv("candy-data.csv")

tab <- table(df$chocolate, df$bar)

a <- tab["1","1"]
b <- tab["1","0"]
c <- tab["0","1"]
d <- tab["0","0"]

odds_choc1 <- a / b
odds_choc0 <- c / d

(or <- odds_choc1 / odds_choc0)