Tutorial 08: ANOVA

Q1 — One-way ANOVA (manual): Happiness score by region

We start this tutorial with the city lifestyle dataset, that gives info about In the city lifestyle dataset, we want to test whether the mean happiness score differs across four regions: Europe, Asia, North America, and Africa.

We test:

\[ H_0: \mu_{\text{Europe}} = \mu_{\text{Asia}} = \mu_{\text{NorthAm}} = \mu_{\text{Africa}} \]

vs

\[ H_1:\ \text{at least one regional mean is different.} \]

Use a one-way ANOVA on happiness score, computed manually from sums of squares.

Your task:

  • Subset to those four regions.
  • Remove rows with missing happiness scores.
  • Compute SSR, SSE, \(df1\), \(df2\), MSR, MSE, F, and the p-value

\[ F = \frac{\text{MSR}}{\text{MSE}}, \qquad p = P\bigl(F_{df1, df2} \ge F\bigr). \]

Return a named numeric vector: c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).

Photo by Arno Retief on Unsplash
NoteInfo

For a one-way ANOVA with response \(y_{ij}\) in group (i) (of size \(n_i\)), group means \(bar{y}_i\), and overall mean \(bar{y}\):

Between–group sum of squares \[ \mathrm{SSR} = \sum_i n_i \, (\hat{y}_i - \bar{y})^2 \]

Within–group sum of squares \[ \mathrm{SSE} = \sum_i (y_{i} - \hat{y_{i}})^2 \]

Degrees of freedom \[ df_{\text{between}} = k - 1, \qquad df_{\text{within}} = N - k \]

Mean squares \[ \mathrm{MSR} = \frac{\mathrm{SSB}}{df_{\text{between}}}, \qquad \mathrm{MSE} = \frac{\mathrm{SSW}}{df_{\text{within}}} \]

NotePreview

Use mean for both group and overall means. df1 = k - 1, df2 = N - k. MSR = SSB / df1, MSE = SSW / df2, then Fstat = MSR / MSE.

df <- read.csv("city_lifestyle_dataset.csv") sub <- subset( df, country %in% c("Europe","Asia","North America","Africa") & is.finite(happiness_score) ) sub$region <- factor(sub$country) y <- sub$happiness_score g <- sub$region group_means <- tapply(y, g, mean) group_ns <- tapply(y, g, length) overall_mean <- mean(y) SSB <- sum(group_ns * (group_means - overall_mean)^2) SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2))) k <- length(group_means) N <- length(y) df1 <- k - 1 df2 <- N - k MSR <- SSB / df1 MSE <- SSW / df2 Fstat <- MSR / MSE pval <- pf(Fstat, df1, df2, lower.tail = FALSE) c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

y <- sub$happiness_score
g <- sub$region

group_means <- tapply(y, g, mean)
group_ns    <- tapply(y, g, length)

overall_mean <- mean(y)

SSB <- sum(group_ns * (group_means - overall_mean)^2)

SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))

k  <- length(group_means)
N  <- length(y)

df1 <- k - 1
df2  <- N - k

MSR <- SSB / df1
MSE <- SSW / df2

Fstat <- MSR / MSE
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

Q2 — One-way ANOVA (built-in): Happiness score by region

Now we will redo the same test using R’s built-in aov() function on the same subset of the city dataset.

Return c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = p) extracted from the ANOVA table.

NoteInfo

For a one-way ANOVA fit with fit <- aov(y ~ group, data = …), you can extract the ANOVA table with summary(fit)[[1]]. The row corresponding to the factor has columns:

“Df”: factor degrees of freedom

“F value”: F statistic

“Pr(>F)”: p-value

NotePreview

Use happiness_score ~ region in aov(), and “region” as the row name in the ANOVA table.

df <- read.csv("city_lifestyle_dataset.csv") sub <- subset( df, country %in% c("Europe","Asia","North America","Africa") & is.finite(happiness_score) ) sub$region <- factor(sub$country) fit <- aov(happiness_score ~ region, data = sub) tab <- summary(fit)[[1]] df_region <- tab["region", "Df"] df_resid <- tab["Residuals", "Df"] Fstat <- tab["region", "F value"] pval <- tab["region", "Pr(>F)"] c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = pval)

df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

fit <- aov(happiness_score ~ region, data = sub)

tab <- summary(fit)[[1]]

df_region <- tab["region", "Df"]
df_resid  <- tab["Residuals", "Df"]
Fstat     <- tab["region", "F value"]
pval      <- tab["region", "Pr(>F)"]

c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = pval)

Q3 — Fisher’s LSD (no adjustment): Pairwise region comparisons

Using the same city subset as in Q1–Q2, perform pairwise t-tests on happiness scores between regions without multiple-comparison adjustment. This corresponds to Fisher’s LSD after a significant global ANOVA.

Use pairwise.t.test() with p.adjust.method = “none” and return the matrix of p-values.

NoteInfo

Fisher’s LSD (Least Significant Difference) is essentially:

Run a global ANOVA to check that at least one mean differs.

If significant, run unadjusted pairwise t-tests between groups, using the same residual variance.

In R, pairwise.t.test(y, group, p.adjust.method = “none”)$p.value gives a matrix of unadjusted pairwise p-values.

NotePreview

Use sub$happiness_score as the response, sub$region as the group, and “none” for p.adjust.method.

df <- read.csv("city_lifestyle_dataset.csv") sub <- subset( df, country %in% c("Europe","Asia","North America","Africa") & is.finite(happiness_score) ) sub$region <- factor(sub$country) out <- pairwise.t.test(sub$happiness_score, sub$region, p.adjust.method = "none") out$p.value

df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "none")

out$p.value

Q4 — Bonferroni-adjusted pairwise region comparisons

Now repeat the previous question but use a Bonferroni correction for multiple comparisons. Again, return the matrix of p-values.

NoteInfo

The Bonferroni adjustment is a simple and conservative way to control the family-wise error rate when making many comparisons. In R, use p.adjust.method = “bonferroni” in pairwise.t.test().

NotePreview

Use the same sub$happiness_score and sub$region, but set p.adjust.method = “bonferroni”.

df <- read.csv("city_lifestyle_dataset.csv") sub <- subset( df, country %in% c("Europe","Asia","North America","Africa") & is.finite(happiness_score) ) sub$region <- factor(sub$country) out <- pairwise.t.test(sub$happiness_score, sub$region, p.adjust.method = "bonferroni") out$p.value

df <- read.csv("city_lifestyle_dataset.csv")

sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)

sub$region <- factor(sub$country)

out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "bonferroni")

out$p.value

Q5 — One-way ANOVA (manual): Exam score by motivation level

In the student performance dataset, we want to test whether the mean exam score differs across motivation levels (Low, Medium, High).

We test

\[ H_0:\ \mu_{\text{Low}} = \mu_{\text{Medium}} = \mu_{\text{High}} \]

vs

\[ H_1:\ \text{at least one of } \mu_{\text{Low}},\, \mu_{\text{Medium}},\, \mu_{\text{High}} \text{ is different.} \]

Use a one-way ANOVA on Exam_Score, computed manually from sums of squares.

Your task:

  • Subset to rows with non-missing Exam_Score and Motivation_Level.
  • Treat Motivation_Level as a factor with three levels (Low, Medium, High).
  • Compute SSR, SSE, df1, df2, MSR, MSE, (F), and the p-value

\[ F = \frac{\mathrm{MSR}}{\mathrm{MSE}}, \qquad p = P\bigl(F_{df_1, df_2} \ge F\bigr). \]

Return a named numeric vector:
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).

Photo by Leonardo Vargas on Unsplash
NotePreview

Same pattern as Q1: use mean for group and overall means. df1 = k - 1, df2 = N - k, MSR = SSR/df1, MSE = SSE/df2.

df <- read.csv("student_performance.csv") sub <- subset(df, !is.na(Exam_Score) & !is.na(Motivation_Level)) sub$mot <- factor(sub$Motivation_Level) y <- sub$Exam_Score g <- sub$mot group_means <- tapply(y, g, mean) group_ns <- tapply(y, g, length) overall_mean <- mean(y) SSB <- sum(group_ns * (group_means - overall_mean)^2) SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2))) k <- length(group_means) N <- length(y) df1 <- k - 1 df2 <- N - k MSR <- SSB / df1 MSE <- SSW / df2 Fstat <- MSR / MSE pval <- pf(Fstat, df1, df2, lower.tail = FALSE) c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

y <- sub$Exam_Score
g <- sub$mot

group_means <- tapply(y, g, mean)
group_ns    <- tapply(y, g, length)

overall_mean <- mean(y)

SSB <- sum(group_ns * (group_means - overall_mean)^2)

SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))

k  <- length(group_means)
N  <- length(y)

df1 <- k - 1
df2  <- N - k

MSR <- SSB / df1
MSE <- SSW / df2

Fstat <- MSR / MSE
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)

Q6 — One-way ANOVA (built-in): Exam score by motivation level

Now use aov() to test for differences in mean exam score across motivation levels.

Return c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval).

NotePreview

Use Exam_Score ~ mot, and the row name “mot” in the ANOVA table.

df <- read.csv("student_performance.csv") sub <- subset(df, !is.na(Exam_Score) & !is.na(Motivation_Level)) sub$mot <- factor(sub$Motivation_Level) fit <- aov(Exam_Score ~ mot, data = sub) tab <- summary(fit)[[1]] df_mot <- tab["mot", "Df"] df_resid <- tab["Residuals", "Df"] Fstat <- tab["mot", "F value"] pval <- tab["mot", "Pr(>F)"] c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval)

df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

fit <- aov(Exam_Score ~ mot, data = sub)

tab <- summary(fit)[[1]]

df_mot   <- tab["mot", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat    <- tab["mot", "F value"]
pval     <- tab["mot", "Pr(>F)"]

c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval)

Q7 — Fisher’s LSD: Exam score pairwise comparisons by motivation

Using the same subset as Q5–Q6, compute unadjusted pairwise t-tests on Exam_Score across motivation levels (Fisher’s LSD).

Return the matrix of p-values.

NotePreview

Use sub$Exam_Score, sub$mot, and “none” as the adjustment method.

df <- read.csv("student_performance.csv") sub <- subset(df, !is.na(Exam_Score) & !is.na(Motivation_Level)) sub$mot <- factor(sub$Motivation_Level) out <- pairwise.t.test(sub$Exam_Score, sub$mot, p.adjust.method = "none") out$p.value

df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))

sub$mot <- factor(sub$Motivation_Level)

out <- pairwise.t.test(sub$Exam_Score,
sub$mot,
p.adjust.method = "none")

out$p.value

Q8 — Bonferroni: Exam score pairwise comparisons by parental education

Finally, investigate another factor: parental education level (High School, College, Postgraduate).

Use the student dataset to:

Subset to rows with non-missing Exam_Score and Parental_Education_Level.

Treat Parental_Education_Level as a factor with three levels.

Run pairwise.t.test() on Exam_Score across parental education levels, using Bonferroni adjustment.

Return the p-value matrix.

NotePreview

Use sub$Exam_Score and sub$edu with p.adjust.method = “bonferroni”.

df <- read.csv("student_performance.csv") sub <- subset(df, !is.na(Exam_Score) & !is.na(Parental_Education_Level)) sub$edu <- factor(sub$Parental_Education_Level) out <- pairwise.t.test(sub$Exam_Score, sub$edu, p.adjust.method = "bonferroni") out$p.value

df <- read.csv("student_performance.csv")

sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Parental_Education_Level))

sub$edu <- factor(sub$Parental_Education_Level)

out <- pairwise.t.test(sub$Exam_Score,
sub$edu,
p.adjust.method = "bonferroni")

out$p.value