Tutorial 08: ANOVA

Q1 — One-way ANOVA (manual): Happiness score by region

We start this tutorial with the city lifestyle dataset, that gives info about In the city lifestyle dataset, we want to test whether the mean happiness score differs across four regions: Europe, Asia, North America, and Africa.

We test:

\[ H_0: \mu_{\text{Europe}} = \mu_{\text{Asia}} = \mu_{\text{NorthAm}} = \mu_{\text{Africa}} \]

\[ H_1:\ \text{at least one regional mean is different.} \]

Use a one-way ANOVA on happiness score, computed manually from sums of squares.

Your task:

Subset to those four regions.
Remove rows with missing happiness scores.
Compute SSR, SSE, $df1$, $df2$, MSR, MSE, F, and the p-value

\[ F = \frac{\text{MSR}}{\text{MSE}}, \qquad p = P\bigl(F_{df1, df2} \ge F\bigr). \]

Return a named numeric vector: c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).

Info

For a one-way ANOVA with response $y_{ij}$ in group (i) (of size $n_i$), group means $bar{y}_i$, and overall mean $bar{y}$:

Between–group sum of squares \[ \mathrm{SSR} = \sum_i n_i \, (\hat{y}_i - \bar{y})^2 \]

Within–group sum of squares \[ \mathrm{SSE} = \sum_i (y_{i} - \hat{y_{i}})^2 \]

Degrees of freedom \[ df_{\text{between}} = k - 1, \qquad df_{\text{within}} = N - k \]

Mean squares \[ \mathrm{MSR} = \frac{\mathrm{SSB}}{df_{\text{between}}}, \qquad \mathrm{MSE} = \frac{\mathrm{SSW}}{df_{\text{within}}} \]

Preview


df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

y <- sub$happiness_score
g <- sub$region

group_means <- tapply(y, g, mean)
group_ns    <- tapply(y, g, length)

overall_mean <- mean(y)

SSB <- sum(group_ns * (group_means - overall_mean)^2)

SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))

k  <- length(group_means)
N  <- length(y)

df1 <- k - 1
df2  <- N - k

MSR <- SSB / df1
MSE <- SSW / df2

Fstat <- MSR / MSE
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

Q2 — One-way ANOVA (built-in): Happiness score by region

Now we will redo the same test using R’s built-in aov() function on the same subset of the city dataset.

Return c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = p) extracted from the ANOVA table.

Info

For a one-way ANOVA fit with fit <- aov(y ~ group, data = …), you can extract the ANOVA table with summary(fit)[[1]]. The row corresponding to the factor has columns:

“Df”: factor degrees of freedom

“F value”: F statistic

“Pr(>F)”: p-value

Preview


df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

fit <- aov(happiness_score ~ region, data = sub)

tab <- summary(fit)[[1]]

df_region <- tab["region", "Df"]
df_resid  <- tab["Residuals", "Df"]
Fstat     <- tab["region", "F value"]
pval      <- tab["region", "Pr(>F)"]

c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = pval)

Q3 — Fisher’s LSD (no adjustment): Pairwise region comparisons

Using the same city subset as in Q1–Q2, perform pairwise t-tests on happiness scores between regions without multiple-comparison adjustment. This corresponds to Fisher’s LSD after a significant global ANOVA.

Use pairwise.t.test() with p.adjust.method = “none” and return the matrix of p-values.

Info

Fisher’s LSD (Least Significant Difference) is essentially:

Run a global ANOVA to check that at least one mean differs.

If significant, run unadjusted pairwise t-tests between groups, using the same residual variance.

In R, pairwise.t.test(y, group, p.adjust.method = “none”)$p.value gives a matrix of unadjusted pairwise p-values.

Preview


df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "none")

out$p.value

Q4 — Bonferroni-adjusted pairwise region comparisons

Now repeat the previous question but use a Bonferroni correction for multiple comparisons. Again, return the matrix of p-values.

Info

The Bonferroni adjustment is a simple and conservative way to control the family-wise error rate when making many comparisons. In R, use p.adjust.method = “bonferroni” in pairwise.t.test().

Preview


df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "bonferroni")

out$p.value

Q5 — One-way ANOVA (manual): Exam score by motivation level

In the student performance dataset, we want to test whether the mean exam score differs across motivation levels (Low, Medium, High).

We test

\[ H_0:\ \mu_{\text{Low}} = \mu_{\text{Medium}} = \mu_{\text{High}} \]

\[ H_1:\ \text{at least one of } \mu_{\text{Low}},\, \mu_{\text{Medium}},\, \mu_{\text{High}} \text{ is different.} \]

Use a one-way ANOVA on Exam_Score, computed manually from sums of squares.

Your task:

Subset to rows with non-missing Exam_Score and Motivation_Level.
Treat Motivation_Level as a factor with three levels (Low, Medium, High).
Compute SSR, SSE, df1, df2, MSR, MSE, (F), and the p-value

\[ F = \frac{\mathrm{MSR}}{\mathrm{MSE}}, \qquad p = P\bigl(F_{df_1, df_2} \ge F\bigr). \]

Return a named numeric vector:
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).

Preview


df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

y <- sub$Exam_Score
g <- sub$mot

group_means <- tapply(y, g, mean)
group_ns    <- tapply(y, g, length)

overall_mean <- mean(y)

SSB <- sum(group_ns * (group_means - overall_mean)^2)

SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))

k  <- length(group_means)
N  <- length(y)

df1 <- k - 1
df2  <- N - k

MSR <- SSB / df1
MSE <- SSW / df2

Fstat <- MSR / MSE
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

Q6 — One-way ANOVA (built-in): Exam score by motivation level

Now use aov() to test for differences in mean exam score across motivation levels.

Return c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval).

Preview


df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

fit <- aov(Exam_Score ~ mot, data = sub)

tab <- summary(fit)[[1]]

df_mot   <- tab["mot", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat    <- tab["mot", "F value"]
pval     <- tab["mot", "Pr(>F)"]

c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval)

Q7 — Fisher’s LSD: Exam score pairwise comparisons by motivation

Using the same subset as Q5–Q6, compute unadjusted pairwise t-tests on Exam_Score across motivation levels (Fisher’s LSD).

Return the matrix of p-values.

Preview


df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

out <- pairwise.t.test(sub$Exam_Score,
sub$mot,
p.adjust.method = "none")

out$p.value

Q8 — Bonferroni: Exam score pairwise comparisons by parental education

Finally, investigate another factor: parental education level (High School, College, Postgraduate).

Use the student dataset to:

Subset to rows with non-missing Exam_Score and Parental_Education_Level.

Treat Parental_Education_Level as a factor with three levels.

Run pairwise.t.test() on Exam_Score across parental education levels, using Bonferroni adjustment.

Return the p-value matrix.

Preview


df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Parental_Education_Level))

sub$edu <- factor(sub$Parental_Education_Level)

out <- pairwise.t.test(sub$Exam_Score,
sub$edu,
p.adjust.method = "bonferroni")

out$p.value