Tutorial 06: One Sample Hypothesis Tests

Q1 — One Sample Hypothesis Test on the Mean (σ known)

For this question, we use the 2014 Soccer World Cup Tournament Predictions dataset. Test whether the average Soccer Power Index (spi) for a team equals 70 when the population standard deviation is known to be σ = 12.

We test

H₀: μ = 70

H₁: μ ≠ 70

Use a two-sided z-test with σ known.

Note

This is a one-sample test on a single numeric variable where the population standard deviation is assumed to be known. In that situation, we use a z-based procedure and compare the sample mean to the hypothesized value.
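The z-based procedure described above can be sketched on a small made-up vector. This is only an illustration under stated assumptions: the helper name z_test_manual and the toy values are introduced here and are not part of the tutorial or the dataset.

```r
# Hypothetical helper (not part of the tutorial): one-sample, two-sided z-test
z_test_manual <- function(x, mu_0, sigma) {
  x <- x[!is.na(x)]                            # drop missing values
  n <- length(x)                               # sample size
  z <- (mean(x) - mu_0) / (sigma / sqrt(n))    # standardize with the known sigma
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value
  c(z = z, p_value = p)
}

# Made-up scores for illustration only (not taken from the dataset)
toy <- c(68, 74, 81, 59, 77, 72, 66, 83)
res <- z_test_manual(toy, mu_0 = 70, sigma = 12)
res
```

Here pnorm(..., lower.tail = FALSE) is the same quantity as 1 - pnorm(...), but it stays accurate when |z| is large.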

Use the spi column, test against 70 with σ = 12, compute the z statistic, and get the two-sided p-value from pnorm(). Print the z statistic and the p-value as a named numeric vector.

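# Load the data and pull out the SPI column
df <- read.csv("world_cup_predictions.csv", stringsAsFactors = FALSE, check.names = FALSE)
spi_col <- df$spi

mu_0 <- 70    # hypothesized mean under H0
sigma <- 12   # known population standard deviation
n <- sum(!is.na(spi_col))   # sample size: number of non-missing SPI values
xbar <- mean(spi_col, na.rm = TRUE)

# z statistic and two-sided p-value from the standard normal
z <- (xbar - mu_0) / (sigma / sqrt(n))
p_two <- 2 * (1 - pnorm(abs(z)))

c(z = z, p_value = p_two)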

Q2 — One-sample z-test on SPI (σ known, using BSDA)

For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70 when the population standard deviation is known to be 12. This time we will use the z.test() function from the BSDA package to run the test in a single line, and you will just fill in the key arguments.

Note

Base R does not have a built-in one-sample z-test. The BSDA package provides z.test(), which computes the z statistic, refers it to the standard normal distribution, and returns the p-value. We use it here only to shorten the code; the logic is the same as in Q1.
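To check that z.test() follows the same logic as the manual calculation, the two can be compared on a small made-up vector. This is a sketch that assumes the BSDA package is installed; the comparison is skipped otherwise, and the toy values are not from the dataset.

```r
# Made-up values for illustration only
toy <- c(68, 74, 81, 59, 77, 72, 66, 83)

# Manual two-sided z-test, as in Q1
z_manual <- (mean(toy) - 70) / (12 / sqrt(length(toy)))
p_manual <- 2 * pnorm(abs(z_manual), lower.tail = FALSE)

# Compare with BSDA::z.test() only if the package is available
if (requireNamespace("BSDA", quietly = TRUE)) {
  out <- BSDA::z.test(toy, mu = 70, sigma.x = 12, alternative = "two.sided")
  print(c(manual = p_manual, bsda = out$p.value))
} else {
  print(c(manual = p_manual))
}
```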

Use the same hypotheses as before (test against 70) and the same known σ (12). Keep the test two-sided.

df <- read.csv("world_cup_predictions.csv")
x <- df$spi

library(BSDA)

out <- BSDA::z.test(x,
                    mu = 70,
                    sigma.x = 12,
                    alternative = "two.sided")

out$p.value

Q3 — One-sample test on SPI (σ unknown)

Now test the same hypothesis,

H₀: μ = 70

H₁: μ ≠ 70

but now do not assume the population standard deviation is known. Use the sample SD, form the t statistic, and get the two-sided p-value from the t distribution.

Note

When σ is unknown, we use the sample SD to standardize and compare to a t distribution with n − 1 degrees of freedom.

Note

For a one-sample test with unknown \(\sigma\), use \(t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}\) with \(df = n - 1\), and a two-sided p-value \(p = 2 \times (1 - F_t(|t|))\).
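The formula above can be tried on a small made-up vector. The sketch below also refers the same statistic to the standard normal, purely for comparison, to show why the t distribution matters when n is small. The toy values are an assumption for illustration, not data from the tutorial.

```r
# Made-up values for illustration only
toy <- c(68, 74, 81, 59, 77, 72, 66, 83)
mu_0 <- 70

n <- length(toy)
s <- sd(toy)                                  # sample SD replaces the unknown sigma
t_stat <- (mean(toy) - mu_0) / (s / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_t <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Same statistic referred to the normal, for comparison only
p_norm <- 2 * pnorm(abs(t_stat), lower.tail = FALSE)

c(t = t_stat, df = n - 1, p_from_t = p_t, p_from_normal = p_norm)
```

The t-based p-value is the larger of the two because the t distribution has heavier tails than the normal.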

Use the SPI column, test against 70, compute mean and sd, then use the t formula and a two-sided p-value from pt().

df <- read.csv("world_cup_predictions.csv")
spi_col <- df$spi

mu_0 <- 70
n <- sum(!is.na(spi_col))
xbar <- mean(spi_col, na.rm = TRUE)
s <- sd(spi_col, na.rm = TRUE)

dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))

c(t = t_stat, df = dfree, p_value = p_two)

Q4 — One-sample t-test on SPI (σ unknown, built-in)

For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70, but we will let R do the one-sample t-test for us using the built-in t.test() function. This is the “practical” version of Q3.

Note

When the population standard deviation is not given, we use a one-sample t-test: t.test(x, mu = μ₀, alternative = "two.sided"). R estimates the standard deviation from the sample, uses df = n − 1, and returns the p-value.
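t.test() returns an object whose components can be pulled out by name. The sketch below lists the most commonly used ones and checks the t statistic against the manual formula from Q3. The toy vector is made up for illustration.

```r
# Made-up values for illustration only
toy <- c(68, 74, 81, 59, 77, 72, 66, 83)

out <- t.test(toy, mu = 70, alternative = "two.sided")

out$statistic   # t statistic
out$parameter   # degrees of freedom (n - 1)
out$p.value     # two-sided p-value
out$conf.int    # 95% confidence interval for the mean
out$estimate    # sample mean

# Manual t statistic for comparison
t_manual <- (mean(toy) - 70) / (sd(toy) / sqrt(length(toy)))
all.equal(unname(out$statistic), t_manual)
```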

Use the SPI column from the data, test it against 70, and keep the test two-sided.

df <- read.csv("world_cup_predictions.csv")
x <- df$spi
out <- t.test(x, mu = 70, alternative = "two.sided")
out$p.value

Q5 — One-sample t-test on Track Score (σ unknown, manual, large n)

For this question, use the large music dataset (spotify-2024.csv). We want to test whether the average track score in this dataset equals 45.

H₀: μ = 45

H₁: μ ≠ 45

We will compute the test statistic and the p-value manually.

Note

One-sample t (σ unknown):

\(t=\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\), with df = n-1, and two-sided p-value \(p = 2 \times(1 - F_t(|t|)).\)
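With a large sample, the t distribution with n − 1 degrees of freedom is very close to the standard normal. That is a general property rather than something this tutorial states, and the degrees of freedom below are arbitrary values chosen for illustration. A quick way to see it is to compare two-sided 5% critical values:

```r
# Arbitrary degrees of freedom, from small to large
dfs <- c(5, 30, 100, 1000, 4000)

crit_t <- qt(0.975, df = dfs)   # two-sided 5% critical value from t
crit_z <- qnorm(0.975)          # two-sided 5% critical value from the normal

data.frame(df = dfs, t_critical = crit_t, gap_to_z = crit_t - crit_z)
```

The gap shrinks as df grows, so for a dataset with thousands of rows the t-based and z-based p-values will be nearly identical.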

Use Track.Score as x, compute mean() and sd(), then plug them into the t formula. Keep the test two-sided.

df <- read.csv("spotify-2024.csv")
x <- df$Track.Score
x <- x[is.finite(x)]

mu_0 <- 45
n <- length(x)
xbar <- mean(x)
s <- sd(x)

dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))

c(t = t_stat, df = dfree, p_value = p_two)

Q6 — One-sample t-test on Track Score (σ unknown, built-in)

Now run the same hypothesis test using R’s built-in t.test() on the same column.

H₀: μ = 45

H₁: μ ≠ 45

Note

t.test(x, mu = 45, alternative = "two.sided") will estimate the SD, use df = n − 1, and give the p-value.

Use the Track.Score column.

df <- read.csv("spotify-2024.csv")
x <- df$Track.Score

out <- t.test(x,
              mu = 45,
              alternative = "two.sided")

out$p.value
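Q5 filters the column with is.finite() before computing anything, while the t.test() call here receives the raw column. For missing values this makes no difference, because t.test() drops NA entries itself. The sketch below checks that on a made-up vector; it does not cover Inf or -Inf values, which is.finite() would also remove.

```r
# Made-up scores with missing values, for illustration only
toy_na <- c(41.2, NA, 47.5, 39.8, 52.1, NA, 44.0, 46.3)

# Built-in test on the raw vector (NA values are dropped internally)
p_builtin <- t.test(toy_na, mu = 45, alternative = "two.sided")$p.value

# Manual test after removing NA values explicitly
x <- toy_na[!is.na(toy_na)]
t_stat <- (mean(x) - 45) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = FALSE)

c(builtin = p_builtin, manual = p_manual)
```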

Q7 — One-sample test on a proportion (base R prop.test())

We use the video game sales dataset. Let’s test whether 4.5% of the listed games were published by Nintendo.

Hypotheses:

\(H_0: p = 0.045\)

\(H_1: p \neq 0.045\)

We will form a success/failure variable: “Publisher is Nintendo” = success.

Note

For a one-sample test on a proportion with \(x\) successes out of \(n\) trials, use prop.test(x, n, p = p0, alternative = "two.sided"). This uses a chi-square test with 1 degree of freedom. By default, prop.test() applies Yates' continuity correction (correct = TRUE).
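The chi-square statistic that prop.test() reports for a single proportion can be reproduced by hand. The sketch below uses made-up counts, not the values from vgsales.csv, and includes the continuity correction that prop.test() applies by default.

```r
# Made-up counts for illustration only
x <- 60       # number of successes
n <- 1000     # number of trials
p0 <- 0.045   # hypothesized proportion

out <- prop.test(x = x, n = n, p = p0, alternative = "two.sided")

# Manual chi-square with Yates' continuity correction
expected <- n * p0
yates <- min(0.5, abs(x - expected))
chisq_manual <- (abs(x - expected) - yates)^2 / (n * p0 * (1 - p0))
p_manual <- pchisq(chisq_manual, df = 1, lower.tail = FALSE)

c(chisq_builtin = unname(out$statistic), chisq_manual = chisq_manual,
  p_builtin = out$p.value, p_manual = p_manual)
```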

Count how many rows have Publisher == "Nintendo", test against 0.045, and keep it two-sided.

vg <- read.csv("vgsales.csv")

x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))

out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")

out$p.value

Q8 — From chi-square to z (follow-up on the same test)

prop.test() reports a chi-square statistic with 1 df. For a 1-df test, \(\chi^2 = z^2\). Let’s extract that chi-square value and take the square root.

We use the same dataset and the same hypothesis as Q7.

Note

If a one-sample proportion test gives \(\chi^2\) with 1 df, then \(|z| = \sqrt{\chi^2}\). The square root is never negative, so if you need the direction, give \(z\) the sign of \(\hat{p} - p_0\).
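The relationship can be checked against the usual one-sample z formula for a proportion, \(z = (\hat{p} - p_0) / \sqrt{p_0 (1 - p_0) / n}\). The sketch below uses made-up counts. One caveat: the match is exact only when the continuity correction is switched off (correct = FALSE), whereas with the default correction the square root comes out slightly smaller in magnitude.

```r
# Made-up counts for illustration only
x <- 60
n <- 1000
p0 <- 0.045
p_hat <- x / n

# Direct z statistic for a single proportion
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Chi-square without continuity correction: its square root equals |z|
out_nc <- prop.test(x, n, p = p0, correct = FALSE)
z_from_chisq <- sign(p_hat - p0) * sqrt(unname(out_nc$statistic))

# Default call (correct = TRUE) for comparison
out_c <- prop.test(x, n, p = p0)

c(z = z, z_from_chisq = z_from_chisq,
  sqrt_corrected = sqrt(unname(out_c$statistic)))
```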

Re-run the same prop.test() as Q7 and take sqrt() of the test statistic.

vg <- read.csv("vgsales.csv")

x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))

out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")

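# out$statistic is a named value ("X-squared"); drop the name for clean labels
chisq_val <- unname(out$statistic)
z_val <- sqrt(chisq_val)   # magnitude of z only; the sign is lost

c(chisq = chisq_val, z_from_chisq = z_val)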

Q9 — One-sample proportion using prop_test() (rstatix)

Now we repeat the same test (“is the proportion of Nintendo-published games 4.5%?”) using the prop_test() function from the rstatix package, which wraps prop.test() and returns the result as a data frame.

We’ll create a logical column and then test it.

Note

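rstatix::prop_test() can test a single proportion against a hypothesized value and returns the test statistic and p-value as a data frame. Because it is a wrapper around prop.test(), the statistic it reports is the chi-square value; take its square root, as in Q8, to get |z|. We’ll test against 0.045.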

Use the logical column is_nintendo.

library(rstatix)

vg <- read.csv("vgsales.csv")

vg$is_nintendo <- vg$Publisher == "Nintendo"

out <- rstatix::prop_test(
  x = sum(vg$is_nintendo, na.rm = TRUE),
  n = sum(!is.na(vg$is_nintendo)),
  p = 0.045
)

out