Tutorial 06: One Sample Hypothesis Tests
Q1 — One Sample Hypothesis Test on the Mean (σ known)
For this question, we use the 2014 Soccer World Cup Tournament Predictions dataset. Test whether the average Soccer Power Index (spi) for a team equals 70 when the population standard deviation is known to be σ = 12.
We test
H₀: μ = 70
H₁: μ ≠ 70
Use a two-sided z-test with σ known.
This is a one-sample test on a single numeric variable where the population standard deviation is assumed to be known. In that situation, we use a z-based procedure and compare the sample mean to the hypothesized value.
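For reference, the z-based procedure described above can be written out explicitly. This is the standard form of the statistic; the symbol \(\Phi\) (the standard normal CDF) is notation introduced here and does not appear elsewhere in the tutorial.

\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad p = 2 \times (1 - \Phi(|z|)) \]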
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.
Use the spi column, test against 70 with σ = 12, compute the sample mean, then use the z formula and a two-sided p-value from pnorm().
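# Load the data; check.names = FALSE keeps the original column names
df <- read.csv("world_cup_predictions.csv", stringsAsFactors = FALSE, check.names = FALSE)
spi_col <- df$spi
mu_0 <- 70    # hypothesized mean under H0
sigma <- 12   # known population standard deviation
n <- sum(!is.na(spi_col))   # number of teams with a non-missing SPI
xbar <- mean(spi_col, na.rm = TRUE)
z <- (xbar - mu_0) / (sigma / sqrt(n))
p_two <- 2 * (1 - pnorm(abs(z)))   # two-sided p-value from the normal distribution
c(z = z, p_value = p_two)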
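Once z and the p-value are available, the test decision can be reached in two equivalent ways. The sketch below illustrates this on a small made-up vector; `spi_demo` and the significance level `alpha = 0.05` are assumptions for illustration and are not taken from the dataset or the question.

```r
# Hypothetical SPI values, made up for illustration (not from the dataset)
spi_demo <- c(89.2, 74.5, 66.1, 80.3, 71.9, 62.4, 77.8, 85.0)

mu_0  <- 70     # hypothesized mean under H0
sigma <- 12     # population SD, assumed known
alpha <- 0.05   # significance level (an assumption; the question does not fix one)

n    <- length(spi_demo)
xbar <- mean(spi_demo)
z    <- (xbar - mu_0) / (sigma / sqrt(n))

# Route 1: compare |z| with the two-sided critical value
z_crit      <- qnorm(1 - alpha / 2)
reject_by_z <- abs(z) > z_crit

# Route 2: compare the two-sided p-value with alpha
p_two       <- 2 * (1 - pnorm(abs(z)))
reject_by_p <- p_two < alpha

c(z = z, z_crit = z_crit, p_value = p_two)
c(reject_by_z = reject_by_z, reject_by_p = reject_by_p)
```

Both routes always give the same decision, because the p-value falls below alpha exactly when |z| exceeds the critical value.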
Q2 — One-sample z-test on SPI (σ known, using BSDA)
For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70 when the population standard deviation is known to be 12. This time we will use the z.test() function from the BSDA package to run the test in a single line, and you will just fill in the key arguments.
Base R does not have a built-in one-sample z-test. The BSDA package provides z.test(), which computes the z statistic, refers it to the standard normal distribution, and returns the p-value. We use it here only to shorten the code; the logic is the same as in Q1.
Use the same hypotheses as before (test against 70) and the same known σ (12). Keep the test two-sided.
df <- read.csv("world_cup_predictions.csv")
x <- df$spi
library(BSDA)
out <- BSDA::z.test(x,
                    mu = 70,
                    sigma.x = 12,
                    alternative = "two.sided")
out$p.value
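To make the three steps behind a one-sample z-test concrete, here is a minimal base-R sketch. The function `z_test_sketch()` is a hypothetical helper written for this illustration; it is not part of BSDA and does not reproduce everything z.test() returns. The data vector is made up.

```r
# Hypothetical helper (not part of BSDA) that mirrors the steps of a one-sample z-test
z_test_sketch <- function(x, mu, sigma.x,
                          alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  x <- x[is.finite(x)]                       # drop NA, NaN and infinite values
  n <- length(x)
  z <- (mean(x) - mu) / (sigma.x / sqrt(n))  # z statistic
  p <- switch(alternative,                   # tail area under the standard normal
              two.sided = 2 * pnorm(-abs(z)),
              less      = pnorm(z),
              greater   = pnorm(z, lower.tail = FALSE))
  list(statistic = z, p.value = p, n = n)
}

# Hypothetical SPI values with one missing entry, made up for illustration
spi_demo <- c(89.2, 74.5, NA, 66.1, 80.3, 71.9, 62.4, 77.8, 85.0)
res <- z_test_sketch(spi_demo, mu = 70, sigma.x = 12)
res$statistic
res$p.value
```

Writing the helper once and changing only `alternative` makes it easy to see how the one-sided p-values relate to the two-sided one.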
Q3 — One-sample test on SPI (σ unknown)
Now test the same hypothesis,
H₀: μ = 70
H₁: μ ≠ 70
but this time do not assume that the population standard deviation is known. Use the sample SD, form the t statistic, and get the two-sided p-value from the t distribution.
When σ is unknown, we use the sample SD to standardize and compare to a t distribution with n − 1 degrees of freedom.
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.
For a one-sample test with unknown \(\sigma\), use \(t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}\) with \(df = n - 1\), and a two-sided p-value \(p = 2 \times (1 - F_t(|t|))\).
Use the SPI column, test against 70, compute mean and sd, then use the t formula and a two-sided p-value from pt().
df <- read.csv("world_cup_predictions.csv")
spi_col <- df$spi
mu_0 <- 70
n <- sum(!is.na(spi_col))
xbar <- mean(spi_col, na.rm = TRUE)
s <- sd(spi_col, na.rm = TRUE)
dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))
c(t = t_stat, df = dfree, p_value = p_two)
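Replacing σ with the sample SD adds uncertainty, which is why the t distribution has heavier tails than the normal. The sketch below compares two-sided critical values for a few sample sizes; the sample sizes and `alpha = 0.05` are assumptions chosen for illustration.

```r
alpha    <- 0.05                      # significance level (an assumption)
n_values <- c(5, 10, 32, 100, 5000)   # hypothetical sample sizes

t_crit <- qt(1 - alpha / 2, df = n_values - 1)   # t critical values, df = n - 1
z_crit <- qnorm(1 - alpha / 2)                   # normal critical value

data.frame(n = n_values,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t critical value is always larger than the normal one and shrinks toward it as n grows, so for large samples the z and t procedures give nearly the same answer.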
Q4 — One-sample t-test on SPI (σ unknown, built-in)
For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70, but we will let R do the one-sample t-test for us using the built-in t.test() function. This is the “practical” version of Q3.
When the population standard deviation is not given, we use a one-sample t-test: t.test(x, mu = μ₀, alternative = "two.sided"). R estimates the standard deviation from the sample, uses df = n − 1, and returns the p-value.
Use the SPI column from the data, test it against 70, and keep the test two-sided.
df <- read.csv("world_cup_predictions.csv")
x <- df$spi
out <- t.test(x, mu = 70, alternative = "two.sided")
out$p.value
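One way to convince yourself that t.test() performs the same calculation as the manual version is to run both on the same vector and compare. The sketch below does this on made-up values; `spi_demo` is a hypothetical vector, not the dataset.

```r
# Hypothetical SPI values, made up for illustration
spi_demo <- c(89.2, 74.5, 66.1, 80.3, 71.9, 62.4, 77.8, 85.0)
mu_0 <- 70

# Manual calculation
n        <- length(spi_demo)
t_manual <- (mean(spi_demo) - mu_0) / (sd(spi_demo) / sqrt(n))
p_manual <- 2 * (1 - pt(abs(t_manual), df = n - 1))

# Built-in calculation
out <- t.test(spi_demo, mu = mu_0, alternative = "two.sided")

# Each comparison should print TRUE
all.equal(unname(out$statistic), t_manual)
all.equal(unname(out$parameter), n - 1)
all.equal(out$p.value, p_manual)
```

unname() strips the labels ("t", "df") that t.test() attaches to its results, so that only the numbers are compared.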
Q5 — One-sample t-test on Track Score (σ unknown, manual, large n)
For this question, use the large music dataset. We want to test whether the average track score in this dataset equals 45.
H₀: μ = 45
H₁: μ ≠ 45
We will compute the test statistic and the p-value manually.
One-sample t (σ unknown):
\(t=\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\), with df = n-1, and two-sided p-value \(p = 2 \times(1 - F_t(|t|)).\)
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.
Use Track.Score as x, use mean() and sd() before plugging into the t formula. Make it two-sided.
df <- read.csv("spotify-2024.csv")
x <- df$Track.Score
x <- x[is.finite(x)]
mu_0 <- 45
n <- length(x)
xbar <- mean(x)
s <- sd(x)
dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))
c(t = t_stat, df = dfree, p_value = p_two)
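With a large sample the t statistic can become very large, and the form 2 * (1 - pt(|t|)) then loses precision because pt() rounds to exactly 1. A sketch of the issue, using a hypothetical t statistic and degrees of freedom that are not taken from the dataset:

```r
t_big <- 10      # hypothetical, very large t statistic
dfree <- 4000    # hypothetical degrees of freedom

# pt(10, 4000) is so close to 1 that it is stored as exactly 1
p_naive  <- 2 * (1 - pt(abs(t_big), df = dfree))

# Asking for the lower tail at -|t| computes the small area directly
p_stable <- 2 * pt(-abs(t_big), df = dfree)

c(naive = p_naive, stable = p_stable)

# For a moderate t statistic the two forms agree
t_mod <- 2.1
all.equal(2 * (1 - pt(abs(t_mod), df = dfree)),
          2 * pt(-abs(t_mod), df = dfree))
```

The two expressions are mathematically identical because the t distribution is symmetric; the second simply avoids subtracting two nearly equal numbers.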
Q6 — One-sample t-test on Track Score (σ unknown, built-in)
Now run the same hypothesis test using R’s built-in t.test() on the same column.
H₀: μ = 45
H₁: μ ≠ 45
t.test(x, mu = 45, alternative = "two.sided") estimates the SD from the sample, uses df = n − 1, and returns the p-value.
Use the Track.Score column.
df <- read.csv("spotify-2024.csv")
x <- df$Track.Score
out <- t.test(x,
              mu = 45,
              alternative = "two.sided")
out$p.value
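The manual version in Q5 filters the column before computing anything, while the call here passes it in directly. That works because t.test() drops missing values itself. The sketch below checks this on a made-up vector and shows the other components stored in the result; `score_demo` is hypothetical.

```r
# Hypothetical track scores with two missing entries, made up for illustration
score_demo <- c(41.2, 55.0, NA, 38.7, 47.3, 62.1, NA, 44.9, 50.5)

out_raw   <- t.test(score_demo, mu = 45)                       # NAs left in
out_clean <- t.test(score_demo[!is.na(score_demo)], mu = 45)   # NAs removed first

all.equal(out_raw$p.value, out_clean$p.value)   # should print TRUE

# Other pieces of the result besides the p-value
out_raw$statistic   # t statistic
out_raw$parameter   # degrees of freedom (non-missing n minus 1)
out_raw$estimate    # sample mean
out_raw$conf.int    # 95% confidence interval for the mean
```

Note that this covers NA values only; infinite values are not removed by t.test(), which is one reason to filter with is.finite() as in Q5 if the column might contain them.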
Q7 — One-sample test on a proportion (base R prop.test())
We use the video game sales dataset. Let’s test whether 4.5% of the listed games were published by Nintendo.
Hypotheses:
\(H_0: p = 0.045\)
\(H_1: p \neq 0.045\)
We will form a success/failure variable: “Publisher is Nintendo” = success.
For a one-sample test on a proportion with \(x\) successes out of \(n\) trials, use prop.test(x, n, p = p0, alternative = "two.sided"). This performs a chi-square test with 1 degree of freedom and, by default, applies a continuity correction (correct = TRUE).
Count how many rows have Publisher == "Nintendo", test against 0.045, and keep it two-sided.
vg <- read.csv("vgsales.csv")
x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))
out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")
out$p.value
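The chi-square statistic that prop.test() reports for a single proportion can be reproduced by hand, including the default continuity correction of 0.5. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
x_demo <- 60       # hypothetical number of successes
n_demo <- 1000     # hypothetical number of trials
p0     <- 0.045    # hypothesized proportion

expected <- n_demo * p0   # expected successes under H0

# Continuity-corrected chi-square statistic with 1 df
chisq_manual <- (abs(x_demo - expected) - 0.5)^2 / (n_demo * p0 * (1 - p0))
p_manual     <- pchisq(chisq_manual, df = 1, lower.tail = FALSE)

out <- prop.test(x_demo, n_demo, p = p0, alternative = "two.sided")

# Each comparison should print TRUE
all.equal(unname(out$statistic), chisq_manual)
all.equal(out$p.value, p_manual)
```

Dropping the `- 0.5` term gives the uncorrected statistic, which is what prop.test() returns when called with correct = FALSE.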
Q8 — From chi-square to z (follow-up on the same test)
prop.test() reports a chi-square statistic with 1 df. For a 1-df test, \(\chi^2 = z^2\). Let’s extract that chi-square value and take the square root.
We use the same dataset and the same hypothesis as Q7.
If a one-sample proportion test gives \(\chi^2\) with 1 df, then \(|z| = \sqrt{\chi^2}\). The square root is never negative, so if you need direction, give z the sign of \(\hat{p} - p_0\).
Re-run the same prop.test() as Q7 and take sqrt() of the test statistic.
vg <- read.csv("vgsales.csv")
x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))
out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")
chisq_val <- unname(out$statistic)   # drop the "X-squared" label for cleaner output
z_val <- sqrt(chisq_val)             # magnitude of z; the sign is lost
c(chisq = chisq_val, z_from_chisq = z_val)
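The relation between the chi-square statistic and z is exact only when the continuity correction is switched off. The sketch below recovers a signed z from prop.test(..., correct = FALSE) and compares it with the textbook formula \(z = (\hat{p} - p_0) / \sqrt{p_0 (1 - p_0) / n}\). The counts are hypothetical, chosen so that the sample proportion lies below p0.

```r
x_demo <- 30       # hypothetical number of successes
n_demo <- 1000     # hypothetical number of trials
p0     <- 0.045    # hypothesized proportion

p_hat    <- x_demo / n_demo
z_manual <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n_demo)

# Without the continuity correction, chi-square equals z^2 exactly
out <- prop.test(x_demo, n_demo, p = p0, correct = FALSE)
z_from_chisq <- sign(p_hat - p0) * sqrt(unname(out$statistic))

all.equal(z_from_chisq, z_manual)   # should print TRUE

# With the default correction, the recovered magnitude is slightly smaller
out_cc <- prop.test(x_demo, n_demo, p = p0)
c(uncorrected = abs(z_manual), corrected = sqrt(unname(out_cc$statistic)))
```

The sign has to be restored by hand because squaring discards it; here z is negative since the sample proportion is below the hypothesized value.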
Q9 — One-sample proportion using prop_test() (rstatix)
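Now we repeat the same test, “is the proportion of Nintendo-published games 4.5%?”, but use the prop_test() function from the rstatix package, which wraps prop.test() and returns the result as a tidy data frame.
We’ll create a logical column and then test it.
rstatix::prop_test() can test a single proportion against a hypothesized value and returns the test statistic and p-value in a single row. Because it wraps prop.test(), the statistic column holds the chi-square statistic, so take its square root as in Q8 if you want the magnitude of z. We’ll test against 0.045.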
Use the logical column is_nintendo.
library(rstatix)
vg <- read.csv("vgsales.csv")
vg$is_nintendo <- vg$Publisher == "Nintendo"
out <- rstatix::prop_test(
  x = sum(vg$is_nintendo, na.rm = TRUE),
  n = sum(!is.na(vg$is_nintendo)),
  p = 0.045
)
out
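The two inputs to the test, the success count x and the trial count n, both come from the logical column. The sketch below shows how they are formed on a tiny made-up data frame with one missing publisher; `vg_demo` is hypothetical and only mimics the Publisher column.

```r
# Hypothetical mini data frame, made up for illustration
vg_demo <- data.frame(
  Publisher = c("Nintendo", "Sega", NA, "Nintendo", "Sony"),
  stringsAsFactors = FALSE
)

# Comparing NA with a string gives NA, so the missing row stays missing
vg_demo$is_nintendo <- vg_demo$Publisher == "Nintendo"
vg_demo$is_nintendo

x_demo <- sum(vg_demo$is_nintendo, na.rm = TRUE)   # TRUE counts as 1
n_demo <- sum(!is.na(vg_demo$is_nintendo))         # rows with a known publisher

c(successes = x_demo, trials = n_demo, p_hat = x_demo / n_demo)

# The sample proportion is also the mean of the logical column
mean(vg_demo$is_nintendo, na.rm = TRUE)
```

Counting n from the non-missing rows keeps the numerator and denominator consistent: a row with an unknown publisher is neither a success nor a failure.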