Tutorial 06: One Sample Hypothesis Tests

Q1 — One Sample Hypothesis Test on the Mean (σ known)

For this question, we use the 2014 Soccer World Cup Tournament Predictions dataset. Test whether the average Soccer Power Index (spi) for a team equals 70 when the population standard deviation is known to be σ = 12.

We test

H₀: μ = 70

H₁: μ ≠ 70

Use a two-sided z-test with σ known.

Info

This is a one-sample test on a single numeric variable where the population standard deviation is assumed to be known. In that situation, we use a z-based procedure and compare the sample mean to the hypothesized value.
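
Written out, the z-based procedure looks like this. The symbol \(\Phi\) (the standard normal CDF) is notation introduced here for illustration; it plays the same role as \(F_t\) in the t-test questions later on.

\[
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad p = 2 \times \left(1 - \Phi(|z|)\right)
\]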

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.

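# Load the data; keep strings as character and column names as written in the file
df <- read.csv("world_cup_predictions.csv", stringsAsFactors = FALSE, check.names = FALSE)
spi_col <- df$spi
mu_0 <- 70     # hypothesized mean under H0
sigma <- 12    # population SD, assumed known
n <- sum(!is.na(spi_col))    # number of teams with a non-missing spi
xbar <- mean(spi_col, na.rm = TRUE)
z <- (xbar - mu_0) / (sigma / sqrt(n))    # z statistic
p_two <- 2 * (1 - pnorm(abs(z)))          # two-sided p-value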

c(z = z, p_value = p_two)
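
To turn the z statistic into a decision, we can compare it to a critical value, compare the p-value to a significance level, or check whether the hypothesized mean falls inside the z-based confidence interval. The sketch below shows that the three routes agree. It uses made-up values (`spi_demo`) rather than the real dataset, and the significance level `alpha = 0.05` is an assumption, since the question does not fix one.

```r
# Hypothetical SPI-like values, made up for illustration only (not the real dataset)
spi_demo <- c(62.1, 75.4, 81.0, 58.3, 69.9, 72.5, 66.0, 79.2, 84.7, 60.8)

mu_0  <- 70    # hypothesized mean under H0
sigma <- 12    # population SD, assumed known
alpha <- 0.05  # significance level (an assumption, not given in the question)

n    <- length(spi_demo)
xbar <- mean(spi_demo)
se   <- sigma / sqrt(n)           # standard error of the mean when sigma is known

z      <- (xbar - mu_0) / se
z_crit <- qnorm(1 - alpha / 2)    # two-sided critical value
p_two  <- 2 * pnorm(-abs(z))      # equivalent to 2 * (1 - pnorm(abs(z)))

# z-based confidence interval for the mean
ci <- xbar + c(-1, 1) * z_crit * se

# Three equivalent ways to reach the same decision
reject_by_z  <- abs(z) > z_crit
reject_by_p  <- p_two < alpha
reject_by_ci <- mu_0 < ci[1] || mu_0 > ci[2]

c(z = z, p_value = p_two, ci_low = ci[1], ci_high = ci[2])
c(by_z = reject_by_z, by_p = reject_by_p, by_ci = reject_by_ci)
```

With real data, replace `spi_demo` with the non-missing values of `df$spi`.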

Q2 — One-sample z-test on SPI (σ known, using BSDA)

For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70 when the population standard deviation is known to be 12. This time we will use the z.test() function from the BSDA package to run the test in a single line, and you will just fill in the key arguments.

Info

Base R does not have a built-in one-sample z-test. The BSDA package provides z.test(), which computes the z statistic, refers it to the standard normal distribution, and returns the p-value. We use it here only to shorten the code; the logic is the same as in Q1.

Preview


df <- read.csv("world_cup_predictions.csv")
x <- df$spi

library(BSDA)

out <- BSDA::z.test(x,
                    mu = 70,
                    sigma.x = 12,
                    alternative = "two.sided")

out$p.value

Q3 — One-sample test on SPI (σ unknown)

Now test the same hypothesis,

H₀: μ = 70

H₁: μ ≠ 70

but this time do not assume the population standard deviation is known. Use the sample SD, form the t statistic, and get the two-sided p-value from the t distribution.

Info

When σ is unknown, we use the sample SD to standardize and compare to a t distribution with n − 1 degrees of freedom.

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.

Info

For a one-sample test with unknown \(\sigma\), use \(t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}\) with \(df = n - 1\), and a two-sided p-value \(p = 2 \times (1 - F_t(|t|))\).

df <- read.csv("world_cup_predictions.csv")
spi_col <- df$spi

mu_0 <- 70
n <- sum(!is.na(spi_col))
xbar <- mean(spi_col, na.rm = TRUE)
s <- sd(spi_col, na.rm = TRUE)

dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))

c(t = t_stat, df = dfree, p_value = p_two)

Q4 — One-sample t-test on SPI (σ unknown, built-in)

For this question, we again use the 2014 Soccer World Cup Tournament Predictions dataset. We want to test whether the average Soccer Power Index (spi) for all teams equals 70, but we will let R do the one-sample t-test for us using the built-in t.test() function. This is the “practical” version of Q3.

Info

When the population standard deviation is not given, we use a one-sample t-test: t.test(x, mu = μ₀, alternative = "two.sided"). R estimates the standard deviation from the sample, uses df = n − 1, and returns the p-value.


df <- read.csv("world_cup_predictions.csv")
x <- df$spi
out <- t.test(x, mu = 70, alternative = "two.sided")
out$p.value
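
As a quick check that Q3 and Q4 really are the same test, the sketch below computes the t statistic by hand and with t.test() and compares the two. The vector `x_demo` is made up for illustration (including one missing value) and is not the real dataset.

```r
# Hypothetical values for illustration only (not the real dataset); one value is missing
x_demo <- c(62.1, 75.4, NA, 81.0, 58.3, 69.9, 72.5, 66.0, 79.2, 84.7, 60.8)
mu_0 <- 70

# Manual calculation, as in Q3
n      <- sum(!is.na(x_demo))
xbar   <- mean(x_demo, na.rm = TRUE)
s      <- sd(x_demo, na.rm = TRUE)
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_man  <- 2 * pt(-abs(t_stat), df = n - 1)

# Built-in test, as in Q4; t.test() drops missing values itself
out <- t.test(x_demo, mu = mu_0, alternative = "two.sided")

manual  <- c(t = t_stat, df = n - 1, p_value = p_man)
builtin <- c(t = unname(out$statistic),
             df = unname(out$parameter),
             p_value = out$p.value)

rbind(manual, builtin)      # the two rows should match
all.equal(manual, builtin)
```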

Q5 — One-sample t-test on Track Score (σ unknown, manual, large n)

For this question, use the large music dataset. We want to test whether the average track score in this dataset equals 45.

H₀: μ = 45

H₁: μ ≠ 45

We will compute the test statistic and the p-value manually.

Info

One-sample t (σ unknown):

\(t=\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\), with df = n-1, and two-sided p-value \(p = 2 \times(1 - F_t(|t|)).\)

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to show more or fewer rows.


df <- read.csv("spotify-2024.csv")
x <- df$Track.Score
x <- x[is.finite(x)]

mu_0 <- 45
n <- length(x)
xbar <- mean(x)
s <- sd(x)

dfree <- n - 1
t_stat <- (xbar - mu_0) / (s / sqrt(n))
p_two <- 2 * (1 - pt(abs(t_stat), df = dfree))

c(t = t_stat, df = dfree, p_value = p_two)
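
Because this question is flagged as a large-n case, it may help to see how the t critical value approaches the normal one as the degrees of freedom grow. This is a minimal sketch; the df values are illustrative choices, not taken from the dataset.

```r
# Illustrative degrees of freedom (not taken from the dataset)
df_values <- c(5, 30, 100, 1000, 4000)

t_crit <- qt(0.975, df = df_values)   # two-sided 5% critical values for t
z_crit <- qnorm(0.975)                # corresponding normal critical value

# The gap between t and z shrinks as df increases
data.frame(df = df_values,
           t_crit = t_crit,
           excess_over_z = t_crit - z_crit)
```

For large samples, then, the t-test and the z-test with the sample SD plugged in give nearly the same p-value.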

Q6 — One-sample t-test on Track Score (σ unknown, built-in)

Now run the same hypothesis test using R’s built-in t.test() on the same column.

H₀: μ = 45

H₁: μ ≠ 45

Info

t.test(x, mu = 45, alternative = "two.sided") estimates the SD from the sample, uses df = n − 1, and returns the p-value.

Preview


df <- read.csv("spotify-2024.csv")
x <- df$Track.Score

out <- t.test(x,
              mu = 45,
              alternative = "two.sided")

out$p.value

Q7 — One-sample test on a proportion (base R `prop.test()`)

We use the video game sales dataset. Let’s test whether 4.5% of the listed games were published by Nintendo.

Hypotheses:

\(H_0: p = 0.045\)

\(H_1: p \neq 0.045\)

We will form a success/failure variable: “Publisher is Nintendo” = success.

Info

For a one-sample test on a proportion with \(x\) successes out of \(n\) trials, use prop.test(x, n, p = p0, alternative = "two.sided"). This performs a chi-square test with 1 degree of freedom and applies Yates' continuity correction by default (correct = TRUE).

Preview


vg <- read.csv("vgsales.csv")

x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))

out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")

out$p.value

Q8 — From chi-square to z (follow-up on the same test)

prop.test() reports a chi-square statistic with 1 df. For a 1-df test, \(\chi^2 = z^2\). Let’s extract that chi-square value and take the square root.

We use the same dataset and the same hypothesis as Q7.

Info

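If a one-sample proportion test gives \(\chi^2\) with 1 df, then \(|z| = \sqrt{\chi^2}\). The square root is never negative, so if you need the direction, take the sign from \(\hat{p} - p_0\). With the default continuity correction in prop.test(), this is the continuity-corrected z.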


vg <- read.csv("vgsales.csv")

x <- sum(vg$Publisher == "Nintendo", na.rm = TRUE)
n <- sum(!is.na(vg$Publisher))

out <- prop.test(x = x,
                 n = n,
                 p = 0.045,
                 alternative = "two.sided")

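chisq_val <- unname(out$statistic)    # drop the "X-squared" name so the output labels stay clean
z_val <- sqrt(chisq_val)              # |z|; the sign is lost when squaring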

c(chisq = chisq_val, z_from_chisq = z_val)
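
To see where the relationship between chi-square and z comes from, the sketch below computes the usual one-sample z for a proportion by hand and compares its square to the chi-square value from prop.test(), with and without the continuity correction. The counts are hypothetical and are not taken from vgsales.csv.

```r
# Hypothetical counts for illustration only (not taken from vgsales.csv)
x  <- 60      # number of successes
n  <- 1000    # number of trials
p0 <- 0.045   # hypothesized proportion

p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # signed z, using p0 in the standard error

no_cc   <- prop.test(x, n, p = p0, correct = FALSE)  # no continuity correction
with_cc <- prop.test(x, n, p = p0)                   # correct = TRUE is the default

# Without the correction, chi-square equals z^2; with it, chi-square is a bit smaller
c(z_squared     = z^2,
  chisq_no_cc   = unname(no_cc$statistic),
  chisq_with_cc = unname(with_cc$statistic))

# Recover a signed z from the uncorrected chi-square value
z_signed <- sign(p_hat - p0) * sqrt(unname(no_cc$statistic))

# The two-sided normal p-value matches the chi-square p-value
p_from_z <- 2 * pnorm(-abs(z))
c(z = z, z_signed = z_signed, p_from_z = p_from_z, p_chisq = no_cc$p.value)
```

The design choice here is to switch the correction off only for the comparison; the tutorial's own calls keep the default.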

Q9 — One-sample proportion using `prop_test()` (rstatix)

Now we repeat the same test (“is the proportion of Nintendo-published games 4.5%?”) using the prop_test() function from the rstatix package, which returns the result as a tidy data frame.

We’ll create a logical column and then test it.

Info

rstatix::prop_test() can test a single proportion against a hypothesized value. It is a wrapper around base R prop.test(), so the statistic it reports is the chi-square value (take the square root, as in Q8, to get |z|), alongside the p-value. We’ll test against 0.045.

library(rstatix)

vg <- read.csv("vgsales.csv")

vg$is_nintendo <- vg$Publisher == "Nintendo"

out <- rstatix::prop_test(
  x = sum(vg$is_nintendo, na.rm = TRUE),
  n = sum(!is.na(vg$is_nintendo)),
  p = 0.045
)

out
