Tutorial 08: ANOVA
Q1 — One-way ANOVA (manual): Happiness score by region
We start this tutorial with the city lifestyle dataset, that gives info about In the city lifestyle dataset, we want to test whether the mean happiness score differs across four regions: Europe, Asia, North America, and Africa.
We test:
\[ H_0: \mu_{\text{Europe}} = \mu_{\text{Asia}} = \mu_{\text{NorthAm}} = \mu_{\text{Africa}} \]
vs
\[ H_1:\ \text{at least one regional mean is different.} \]
Use a one-way ANOVA on happiness score, computed manually from sums of squares.
Your task:
- Subset to those four regions.
- Remove rows with missing happiness scores.
- Compute SSR, SSE, \(df1\), \(df2\), MSR, MSE, F, and the p-value
\[ F = \frac{\text{MSR}}{\text{MSE}}, \qquad p = P\bigl(F_{df1, df2} \ge F\bigr). \]
Return a named numeric vector: c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).
For a one-way ANOVA with response \(y_{ij}\) in group (i) (of size \(n_i\)), group means \(bar{y}_i\), and overall mean \(bar{y}\):
Between–group sum of squares \[ \mathrm{SSR} = \sum_i n_i \, (\hat{y}_i - \bar{y})^2 \]
Within–group sum of squares \[ \mathrm{SSE} = \sum_i (y_{i} - \hat{y_{i}})^2 \]
Degrees of freedom \[ df_{\text{between}} = k - 1, \qquad df_{\text{within}} = N - k \]
Mean squares \[ \mathrm{MSR} = \frac{\mathrm{SSB}}{df_{\text{between}}}, \qquad \mathrm{MSE} = \frac{\mathrm{SSW}}{df_{\text{within}}} \]
Use mean for both group and overall means. df1 = k - 1, df2 = N - k. MSR = SSB / df1, MSE = SSW / df2, then Fstat = MSR / MSE.
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
y <- sub$happiness_score
g <- sub$region
group_means <- tapply(y, g, mean)
group_ns <- tapply(y, g, length)
overall_mean <- mean(y)
SSB <- sum(group_ns * (group_means - overall_mean)^2)
SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))
k <- length(group_means)
N <- length(y)
df1 <- k - 1
df2 <- N - k
MSR <- SSB / df1
MSE <- SSW / df2
Fstat <- MSR / MSE
pval <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
y <- sub$happiness_score
g <- sub$region
group_means <- tapply(y, g, mean)
group_ns <- tapply(y, g, length)
overall_mean <- mean(y)
SSB <- sum(group_ns * (group_means - overall_mean)^2)
SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))
k <- length(group_means)
N <- length(y)
df1 <- k - 1
df2 <- N - k
MSR <- SSB / df1
MSE <- SSW / df2
Fstat <- MSR / MSE
pval <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)Q2 — One-way ANOVA (built-in): Happiness score by region
Now we will redo the same test using R’s built-in aov() function on the same subset of the city dataset.
Return c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = p) extracted from the ANOVA table.
For a one-way ANOVA fit with fit <- aov(y ~ group, data = …), you can extract the ANOVA table with summary(fit)[[1]]. The row corresponding to the factor has columns:
“Df”: factor degrees of freedom
“F value”: F statistic
“Pr(>F)”: p-value
Use happiness_score ~ region in aov(), and “region” as the row name in the ANOVA table.
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
fit <- aov(happiness_score ~ region, data = sub)
tab <- summary(fit)[[1]]
df_region <- tab["region", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat <- tab["region", "F value"]
pval <- tab["region", "Pr(>F)"]
c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = pval)
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
fit <- aov(happiness_score ~ region, data = sub)
tab <- summary(fit)[[1]]
df_region <- tab["region", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat <- tab["region", "F value"]
pval <- tab["region", "Pr(>F)"]
c(F = Fstat, df1 = df_region, df2 = df_resid, p_value = pval)Q3 — Fisher’s LSD (no adjustment): Pairwise region comparisons
Using the same city subset as in Q1–Q2, perform pairwise t-tests on happiness scores between regions without multiple-comparison adjustment. This corresponds to Fisher’s LSD after a significant global ANOVA.
Use pairwise.t.test() with p.adjust.method = “none” and return the matrix of p-values.
Fisher’s LSD (Least Significant Difference) is essentially:
Run a global ANOVA to check that at least one mean differs.
If significant, run unadjusted pairwise t-tests between groups, using the same residual variance.
In R, pairwise.t.test(y, group, p.adjust.method = “none”)$p.value gives a matrix of unadjusted pairwise p-values.
Use sub$happiness_score as the response, sub$region as the group, and “none” for p.adjust.method.
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "none")
out$p.value
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "none")
out$p.valueQ4 — Bonferroni-adjusted pairwise region comparisons
Now repeat the previous question but use a Bonferroni correction for multiple comparisons. Again, return the matrix of p-values.
The Bonferroni adjustment is a simple and conservative way to control the family-wise error rate when making many comparisons. In R, use p.adjust.method = “bonferroni” in pairwise.t.test().
Use the same sub$happiness_score and sub$region, but set p.adjust.method = “bonferroni”.
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "bonferroni")
out$p.value
df <- read.csv("city_lifestyle_dataset.csv")
sub <- subset(
df,
country %in% c("Europe","Asia","North America","Africa")
& is.finite(happiness_score)
)
sub$region <- factor(sub$country)
out <- pairwise.t.test(sub$happiness_score,
sub$region,
p.adjust.method = "bonferroni")
out$p.valueQ5 — One-way ANOVA (manual): Exam score by motivation level
In the student performance dataset, we want to test whether the mean exam score differs across motivation levels (Low, Medium, High).
We test
\[ H_0:\ \mu_{\text{Low}} = \mu_{\text{Medium}} = \mu_{\text{High}} \]
vs
\[ H_1:\ \text{at least one of } \mu_{\text{Low}},\, \mu_{\text{Medium}},\, \mu_{\text{High}} \text{ is different.} \]
Use a one-way ANOVA on Exam_Score, computed manually from sums of squares.
Your task:
- Subset to rows with non-missing
Exam_ScoreandMotivation_Level.
- Treat
Motivation_Levelas a factor with three levels (Low, Medium, High).
- Compute SSR, SSE,
df1,df2, MSR, MSE, (F), and the p-value
\[ F = \frac{\mathrm{MSR}}{\mathrm{MSE}}, \qquad p = P\bigl(F_{df_1, df_2} \ge F\bigr). \]
Return a named numeric vector:
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval).
Same pattern as Q1: use mean for group and overall means. df1 = k - 1, df2 = N - k, MSR = SSR/df1, MSE = SSE/df2.
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
y <- sub$Exam_Score
g <- sub$mot
group_means <- tapply(y, g, mean)
group_ns <- tapply(y, g, length)
overall_mean <- mean(y)
SSB <- sum(group_ns * (group_means - overall_mean)^2)
SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))
k <- length(group_means)
N <- length(y)
df1 <- k - 1
df2 <- N - k
MSR <- SSB / df1
MSE <- SSW / df2
Fstat <- MSR / MSE
pval <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
y <- sub$Exam_Score
g <- sub$mot
group_means <- tapply(y, g, mean)
group_ns <- tapply(y, g, length)
overall_mean <- mean(y)
SSB <- sum(group_ns * (group_means - overall_mean)^2)
SSW <- sum(tapply(y, g, function(x) sum((x - mean(x))^2)))
k <- length(group_means)
N <- length(y)
df1 <- k - 1
df2 <- N - k
MSR <- SSB / df1
MSE <- SSW / df2
Fstat <- MSR / MSE
pval <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, df1 = df1, df2 = df2, p_value = pval)Q6 — One-way ANOVA (built-in): Exam score by motivation level
Now use aov() to test for differences in mean exam score across motivation levels.
Return c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval).
Use Exam_Score ~ mot, and the row name “mot” in the ANOVA table.
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
fit <- aov(Exam_Score ~ mot, data = sub)
tab <- summary(fit)[[1]]
df_mot <- tab["mot", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat <- tab["mot", "F value"]
pval <- tab["mot", "Pr(>F)"]
c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval)
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
fit <- aov(Exam_Score ~ mot, data = sub)
tab <- summary(fit)[[1]]
df_mot <- tab["mot", "Df"]
df_resid <- tab["Residuals", "Df"]
Fstat <- tab["mot", "F value"]
pval <- tab["mot", "Pr(>F)"]
c(F = Fstat, df1 = df_mot, df2 = df_resid, p_value = pval)Q7 — Fisher’s LSD: Exam score pairwise comparisons by motivation
Using the same subset as Q5–Q6, compute unadjusted pairwise t-tests on Exam_Score across motivation levels (Fisher’s LSD).
Return the matrix of p-values.
Use sub$Exam_Score, sub$mot, and “none” as the adjustment method.
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
out <- pairwise.t.test(sub$Exam_Score,
sub$mot,
p.adjust.method = "none")
out$p.value
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Motivation_Level))
sub$mot <- factor(sub$Motivation_Level)
out <- pairwise.t.test(sub$Exam_Score,
sub$mot,
p.adjust.method = "none")
out$p.valueQ8 — Bonferroni: Exam score pairwise comparisons by parental education
Finally, investigate another factor: parental education level (High School, College, Postgraduate).
Use the student dataset to:
Subset to rows with non-missing Exam_Score and Parental_Education_Level.
Treat Parental_Education_Level as a factor with three levels.
Run pairwise.t.test() on Exam_Score across parental education levels, using Bonferroni adjustment.
Return the p-value matrix.
Use sub$Exam_Score and sub$edu with p.adjust.method = “bonferroni”.
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Parental_Education_Level))
sub$edu <- factor(sub$Parental_Education_Level)
out <- pairwise.t.test(sub$Exam_Score,
sub$edu,
p.adjust.method = "bonferroni")
out$p.value
df <- read.csv("student_performance.csv")
sub <- subset(df,
!is.na(Exam_Score) &
!is.na(Parental_Education_Level))
sub$edu <- factor(sub$Parental_Education_Level)
out <- pairwise.t.test(sub$Exam_Score,
sub$edu,
p.adjust.method = "bonferroni")
out$p.value