Tutorial 05: Two Sample Confidence Intervals
Q1 — Pooled (Equal Variances): Reviews
Using the local file recipe_reviews.csv, compute a 95% two-sided confidence interval for \[ \mu_{FiveStar} - \mu_{NotFive}\] where the outcome is best_score and the groups are stars == 5 (FiveStar) vs stars != 5 (NotFive). Read the CSV with read.csv(“recipe_reviews.csv”), build the two groups, and print the 2-element vector c(lower, upper).
For a pooled two-sample CI with equal variances:
\[ \bar{x}_1 - \bar{x}_2 \ \ \pm \ \ t^*_{df=n_1 + n_2 - 2} s_p \sqrt{\frac{1}{n1} + \frac{1}{n2}}, \ \ s^2_p = \frac{(n_1 - 1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 - 2}\] Use \(\alpha\) = 0.05 \(\implies t^* = t_{1 - \frac{\alpha}{2}, df}\)
Group 1 = FiveStar (rows with stars == 5), Group 2 = NotFive (rows with stars != 5).
The “Recipe Reviews and User Feedback Dataset” is a comprehensive repository of data encompassing various aspects of recipe reviews and user interactions. It includes essential information such as the recipe name, its ranking on the top 100 recipes list, a unique recipe code, and user details like user ID, user name, and an internal user reputation score. Each review comment is uniquely identified with a comment ID and comes with additional attributes, including the creation timestamp, reply count, and the number of up-votes and down-votes received. Users’ sentiment towards recipes is quantified on a 1 to 5 star rating scale, with a score of 0 denoting an absence of rating. This dataset is a valuable resource for researchers and data scientists, facilitating endeavors in sentiment analysis, user behavior analysis, recipe recommendation systems, and more. It offers a window into the dynamics of recipe reviews and user feedback within the culinary website domain.
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.
Confirm you’re comparing the means of two groups formed by a boolean condition on stars. Use the pooled-variance standard error and the appropriate two-sided t critical value for a 95% interval. Print just the lower and upper bounds as a numeric vector.
df <- read.csv("recipe_reviews.csv")
x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score
n1 <- sum(!is.na(x1)); n2 <- sum(!is.na(x2))
m1 <- mean(x1, na.rm = TRUE); m2 <- mean(x2, na.rm = TRUE)
s1 <- stats::sd(x1, na.rm = TRUE); s2 <- stats::sd(x2, na.rm = TRUE)
dfree <- n1 + n2 - 2
sp2 <- (((n1 - 1) * s1^2) + ((n2 - 1) * s2^2)) / dfree
sp <- sqrt(sp2)
tstar <- qt(0.975, dfree)
se <- sp * sqrt(1/n1 + 1/n2)
est <- m1 - m2
c(est - tstar*se, est + tstar*se)
df <- read.csv("recipe_reviews.csv")
x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score
n1 <- sum(!is.na(x1)); n2 <- sum(!is.na(x2))
m1 <- mean(x1, na.rm = TRUE); m2 <- mean(x2, na.rm = TRUE)
s1 <- stats::sd(x1, na.rm = TRUE); s2 <- stats::sd(x2, na.rm = TRUE)
dfree <- n1 + n2 - 2
sp2 <- (((n1 - 1) * s1^2) + ((n2 - 1) * s2^2)) / dfree
sp <- sqrt(sp2)
tstar <- qt(0.975, dfree)
se <- sp * sqrt(1/n1 + 1/n2)
est <- m1 - m2
c(est - tstar*se, est + tstar*se)Q2 — Pooled (Equal Variances): Using t.test()
Compute a 90% two-sided confidence interval for \[ \mu_{Adelie} - \mu_{Gentoo}\] for bill length (mm) using pooled variances. Use the local file penguins.csv and only the two species Adelie and Gentoo.
For equal-variance two-sample CIs, t.test(yield ~ method, var.equal = TRUE, conf.level = 0.90) uses the pooled standard error and df = \(n_1 + n_2 - 2\). To get the CI for Adelie − Gentoo, set the factor level order to c(“Adelie”,“Gentoo”).
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.
When σ is known, \[ \mathrm{CI}_{1-\alpha}:\ \bar{x} \pm x_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}. \]
Use a two-group factor for the species and the pooled-variance option in t.test. Ensure you’re returning just the interval bounds, not the whole test object.
df <- read.csv("penguins.csv")
keep <- df$species %in% c("Adelie","Gentoo")
df <- subset(df, keep & !is.na(bill_length_mm))
df$method <- factor(df$species, levels = c("Adelie","Gentoo")) # order matters for the contrast
df$yield <- df$bill_length_mm
tt <- t.test(yield ~ method, data = df, var.equal = TRUE, conf.level = 0.90)
tt$conf.int
df <- read.csv("penguins.csv")
keep <- df$species %in% c("Adelie","Gentoo")
df <- subset(df, keep & !is.na(bill_length_mm))
df$method <- factor(df$species, levels = c("Adelie","Gentoo")) # order matters for the contrast
df$yield <- df$bill_length_mm
tt <- t.test(yield ~ method, data = df, var.equal = TRUE, conf.level = 0.90)
tt$conf.intQ3 — Welch (Unequal Variances): Using t.test()
Compute an 80% two-sided confidence interval for \[ \mu_{Chinstrap} - \mu_{Gentoo}\] for flipper_length_mm using Welch’s unequal-variance method (the default in t.test). Use the local file penguins.csv and only the species Chinstrap and Gentoo. Print only the 2-number CI vector.
t.test(y ~ g, conf.level = 0.80) uses Welch by default (var.equal = FALSE). To obtain Chinstrap − Gentoo, set the factor level order to c(“Chinstrap”,“Gentoo”).
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.
Ensure you keep exactly two species and set their order to match the contrast. Return just the two CI bounds from the test result.
df <- read.csv("penguins.csv")
keep <- df$species %in% c("Chinstrap","Gentoo")
df <- subset(df, keep & !is.na(flipper_length_mm))
df$method <- factor(df$species, levels = c("Chinstrap","Gentoo"))
df$y <- df$flipper_length_mm
tt <- t.test(y ~ method, data = df, conf.level = 0.80) # Welch by default
tt$conf.int
df <- read.csv("penguins.csv")
keep <- df$species %in% c("Chinstrap","Gentoo")
df <- subset(df, keep & !is.na(flipper_length_mm))
df$method <- factor(df$species, levels = c("Chinstrap","Gentoo"))
df$y <- df$flipper_length_mm
tt <- t.test(y ~ method, data = df, conf.level = 0.80) # Welch by default
tt$conf.intQ4 — Two-Sample CI for a Difference of Proportions
In recipe_reviews.csv, compare the proportion of reviews where thumbs_up > thumbs_down between FiveStar (stars == 5) and NotFive (stars != 5). Compute a 95% two-sided CI for \[p_{FiveStar} - p_{NotFive}\]. Print c(lower, uppper).
Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.
CI for difference of proportions: \[ \mathrm{CI}_{1-\alpha}:\ \hat{p_1} - \hat{p_2} \pm z_{1-\alpha/2} * \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}. \]
Compute \[\hat{p}\] per group, then compute their SE and z* at 0.975.
df <- read.csv("recipe_reviews.csv")
g1 <- subset(df, stars == 5) #segregating the dataset
g2 <- subset(df, stars != 5)
count1 <- as.integer(g1$thumbs_up > g1$thumbs_down)
count2 <- as.integer(g2$thumbs_up > g2$thumbs_down)
n1 <- sum(is.finite(count1)); n2 <- sum(is.finite(count2))
p1 <- mean(count1, na.rm = TRUE); p2 <- mean(count2, na.rm = TRUE)
se <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
z <- qnorm(0.975);
(p1 - p2) + c(-1,1)*z*se
df <- read.csv("recipe_reviews.csv")
g1 <- subset(df, stars == 5) #segregating the dataset
g2 <- subset(df, stars != 5)
count1 <- as.integer(g1$thumbs_up > g1$thumbs_down)
count2 <- as.integer(g2$thumbs_up > g2$thumbs_down)
n1 <- sum(is.finite(count1)); n2 <- sum(is.finite(count2))
p1 <- mean(count1, na.rm = TRUE); p2 <- mean(count2, na.rm = TRUE)
se <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
z <- qnorm(0.975);
(p1 - p2) + c(-1,1)*z*seQ5 — Two-Sample CI for a Difference of Proportions using prop.test
In penguins.csv, compare the male proportion between Chinstrap and Gentoo. Compute an evidence-only CI (no hypothesis interpretation needed) for \(p_{Chinstrap} - p_{Gentoo}\) at 90% and print c(lower upper).
The prop.test() function in R performs tests of proportions, allowing for the comparison of proportions across groups or against specific values. Here is an example according to this question. here’s a super simple one-sample prop.test using penguins.csv. It asks: among Gentoo penguins, what fraction are male? It prints the estimate and a 95% CI. Run the code to see!
Filter to two species, count “male” in each, and feed (x1,n1),(x2,n2) to prop.test.
df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Chinstrap","Gentoo") & !is.na(sex))
x1 <- sum(df$species=="Chinstrap" & df$sex=="male")
n1 <- sum(df$species=="Chinstrap")
x2 <- sum(df$species=="Gentoo" & df$sex=="male")
n2 <- sum(df$species=="Gentoo")
prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.90, correct = FALSE)$conf.int
df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Chinstrap","Gentoo") & !is.na(sex))
x1 <- sum(df$species=="Chinstrap" & df$sex=="male")
n1 <- sum(df$species=="Chinstrap")
x2 <- sum(df$species=="Gentoo" & df$sex=="male")
n2 <- sum(df$species=="Gentoo")
prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.90, correct = FALSE)$conf.intQ6 — Two-Sample CI for a Difference of Proportions (island=Biscoe)
Using penguins.csv, compare the proportion of penguins on Biscoe island between Adelie and Gentoo. Compute a 95% two-sided CI for \(p_{Adelie} - p_{Gentoo}\) and print c(lower upper).
Count “Biscoe” within each species to get x1, x2. Use totals for n1, n2, then prop.test(…)$conf.int. Order your counts as (Adelie, Gentoo) to get Adelie − Gentoo.
df <- read.csv("penguins.csv")
sub <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(island))
x1 <- sum(sub$species=="Adelie" & sub$island=="Biscoe"); n1 <- sum(sub$species=="Adelie")
x2 <- sum(sub$species=="Gentoo" & sub$island=="Biscoe"); n2 <- sum(sub$species=="Gentoo")
prop.test(x=c(x1,x2), n=c(n1,n2), conf.level=0.95, correct=FALSE)$conf.int
df <- read.csv("penguins.csv")
sub <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(island))
x1 <- sum(sub$species=="Adelie" & sub$island=="Biscoe"); n1 <- sum(sub$species=="Adelie")
x2 <- sum(sub$species=="Gentoo" & sub$island=="Biscoe"); n2 <- sum(sub$species=="Gentoo")
prop.test(x=c(x1,x2), n=c(n1,n2), conf.level=0.95, correct=FALSE)$conf.intQ7 — Two-Sample CI for a Ratio of Variances (manual F-based)
Using recipe_reviews.csv, compare the variance of best_score for FiveStar vs NotFive. Build a 95% CI for \(\sigma^2_{FiveStar} / \sigma^2_{NotFive}\) and print c(lower, upper).
Here is a glimpse of the dataset:
Taking group 1 as FiveStar here, with \(s_1^2\) and \(s_2^2\) as sample variances, here is the CI: \[ \text{CI}_{1-\alpha}\!\left(\frac{\sigma_1^2}{\sigma_2^2}\right) = \left( \frac{{s_1^2}/{s_2^2}}{F_{1-\alpha/2,\,df_1,\,df_2}}, \; \frac{{s_1^2}/{s_2^2}}{F_{\alpha/2,\,df_1,\,df_2}} \right), \qquad df_1=n_1-1,\; df_2=n_2-1. \]
Use FiveStar as numerator. Compute \(s_1^2\) and \(s_2^2\) and divide by the F cutoffs (0.975 and 0.025).
df <- read.csv("recipe_reviews.csv")
x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score
x1 <- x1[is.finite(x1)]
x2 <- x2[is.finite(x2)]
n1 <- length(x1)
n2 <- length(x2)
s1 <- stats::var(x1)
s2 <- stats::var(x2)
df1 <- n1 - 1
df2 <- n2 - 1
ratio <- s1/s2
F_lo <- qf(0.975, df1, df2)
F_hi <- qf(0.025, df1, df2)
c(ratio/F_lo, ratio/F_hi)
df <- read.csv("recipe_reviews.csv")
x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score
x1 <- x1[is.finite(x1)]
x2 <- x2[is.finite(x2)]
n1 <- length(x1)
n2 <- length(x2)
s1 <- stats::var(x1)
s2 <- stats::var(x2)
df1 <- n1 - 1
df2 <- n2 - 1
ratio <- s1/s2
F_lo <- qf(0.975, df1, df2)
F_hi <- qf(0.025, df1, df2)
c(ratio/F_lo, ratio/F_hi)Q8 - Two-Sample CI for a Ratio of Variances using var.test
In penguins.csv, compare variance of bill_length_mm between Adelie (group 1) and Gentoo (group 2). Compute an 80% CI for \(\sigma^2_{Adelie} / \sigma^2_{Gentoo}\) using var.test and print the two bounds.
Here is an example to demonstrate how var.test works. We are taking the recipe reviews dataset to showcase the usage. Run the code to view results.
Filter to Adelie and Gentoo, remove NAs, then feed the two numeric vectors to var.test.
df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(bill_length_mm))
x <- df$bill_length_mm[df$species=="Adelie"]
y <- df$bill_length_mm[df$species=="Gentoo"]
var.test(x, y, conf.level = 0.80)$conf.int
df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(bill_length_mm))
x <- df$bill_length_mm[df$species=="Adelie"]
y <- df$bill_length_mm[df$species=="Gentoo"]
var.test(x, y, conf.level = 0.80)$conf.int