Tutorial 05: Two Sample Confidence Intervals

Q1 — Pooled (Equal Variances): Reviews

Using the local file recipe_reviews.csv, compute a 95% two-sided confidence interval for \[ \mu_{FiveStar} - \mu_{NotFive}\] where the outcome is best_score and the groups are stars == 5 (FiveStar) vs stars != 5 (NotFive). Read the CSV with read.csv(“recipe_reviews.csv”), build the two groups, and print the 2-element vector c(lower, upper).

Info

For a pooled two-sample CI with equal variances:

\[ \bar{x}_1 - \bar{x}_2 \ \ \pm \ \ t^*_{df=n_1 + n_2 - 2} s_p \sqrt{\frac{1}{n1} + \frac{1}{n2}}, \ \ s^2_p = \frac{(n_1 - 1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 - 2}\] Use \(\alpha\) = 0.05 \(\implies t^* = t_{1 - \frac{\alpha}{2}, df}\)

Group 1 = FiveStar (rows with stars == 5), Group 2 = NotFive (rows with stars != 5).

Preview

The “Recipe Reviews and User Feedback Dataset” is a comprehensive repository of data encompassing various aspects of recipe reviews and user interactions. It includes essential information such as the recipe name, its ranking on the top 100 recipes list, a unique recipe code, and user details like user ID, user name, and an internal user reputation score. Each review comment is uniquely identified with a comment ID and comes with additional attributes, including the creation timestamp, reply count, and the number of up-votes and down-votes received. Users’ sentiment towards recipes is quantified on a 1 to 5 star rating scale, with a score of 0 denoting an absence of rating. This dataset is a valuable resource for researchers and data scientists, facilitating endeavors in sentiment analysis, user behavior analysis, recipe recommendation systems, and more. It offers a window into the dynamics of recipe reviews and user feedback within the culinary website domain.

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.

df <- read.csv("recipe_reviews.csv")

x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score

n1 <- sum(!is.na(x1)); n2 <- sum(!is.na(x2))
m1 <- mean(x1, na.rm = TRUE); m2 <- mean(x2, na.rm = TRUE)
s1 <- stats::sd(x1, na.rm = TRUE); s2 <- stats::sd(x2, na.rm = TRUE)

dfree <- n1 + n2 - 2
sp2 <- (((n1 - 1) * s1^2) + ((n2 - 1) * s2^2)) / dfree
sp <- sqrt(sp2)

tstar <- qt(0.975, dfree)
se <- sp * sqrt(1/n1 + 1/n2)
est <- m1 - m2

c(est - tstar*se, est + tstar*se)

Q2 — Pooled (Equal Variances): Using t.test()

Compute a 90% two-sided confidence interval for \[ \mu_{Adelie} - \mu_{Gentoo}\] for bill length (mm) using pooled variances. Use the local file penguins.csv and only the two species Adelie and Gentoo.

Info

For equal-variance two-sample CIs, t.test(yield ~ method, var.equal = TRUE, conf.level = 0.90) uses the pooled standard error and df = \(n_1 + n_2 - 2\). To get the CI for Adelie − Gentoo, set the factor level order to c(“Adelie”,“Gentoo”).

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.

Info

When σ is known, \[ \mathrm{CI}_{1-\alpha}:\ \bar{x} \pm x_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}. \]

df <- read.csv("penguins.csv")
keep <- df$species %in% c("Adelie","Gentoo")
df <- subset(df, keep & !is.na(bill_length_mm))

df$method <- factor(df$species, levels = c("Adelie","Gentoo")) # order matters for the contrast
df$yield <- df$bill_length_mm

tt <- t.test(yield ~ method, data = df, var.equal = TRUE, conf.level = 0.90)
tt$conf.int

Q3 — Welch (Unequal Variances): Using t.test()

Compute an 80% two-sided confidence interval for \[ \mu_{Chinstrap} - \mu_{Gentoo}\] for flipper_length_mm using Welch’s unequal-variance method (the default in t.test). Use the local file penguins.csv and only the species Chinstrap and Gentoo. Print only the 2-number CI vector.

Info

t.test(y ~ g, conf.level = 0.80) uses Welch by default (var.equal = FALSE). To obtain Chinstrap − Gentoo, set the factor level order to c(“Chinstrap”,“Gentoo”).

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.


df <- read.csv("penguins.csv")

keep <- df$species %in% c("Chinstrap","Gentoo")
df <- subset(df, keep & !is.na(flipper_length_mm))

df$method <- factor(df$species, levels = c("Chinstrap","Gentoo"))
df$y <- df$flipper_length_mm

tt <- t.test(y ~ method, data = df, conf.level = 0.80) # Welch by default
tt$conf.int

Q4 — Two-Sample CI for a Difference of Proportions

In recipe_reviews.csv, compare the proportion of reviews where thumbs_up > thumbs_down between FiveStar (stars == 5) and NotFive (stars != 5). Compute a 95% two-sided CI for \[p_{FiveStar} - p_{NotFive}\]. Print c(lower, uppper).

Preview

Run this code chunk to get a glimpse of the dataset. Feel free to change the values to visualize more/less number of rows.

Info

CI for difference of proportions: \[ \mathrm{CI}_{1-\alpha}:\ \hat{p_1} - \hat{p_2} \pm z_{1-\alpha/2} * \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}. \]

df <- read.csv("recipe_reviews.csv")
g1 <- subset(df, stars == 5) #segregating the dataset
g2 <- subset(df, stars != 5)
count1 <- as.integer(g1$thumbs_up > g1$thumbs_down)
count2 <- as.integer(g2$thumbs_up > g2$thumbs_down)
n1 <- sum(is.finite(count1)); n2 <- sum(is.finite(count2))
p1 <- mean(count1, na.rm = TRUE); p2 <- mean(count2, na.rm = TRUE)
se <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
z <- qnorm(0.975); 
(p1 - p2) + c(-1,1)*z*se

Q5 — Two-Sample CI for a Difference of Proportions using prop.test

In penguins.csv, compare the male proportion between Chinstrap and Gentoo. Compute an evidence-only CI (no hypothesis interpretation needed) for \(p_{Chinstrap} - p_{Gentoo}\) at 90% and print c(lower upper).

Preview

Info

The prop.test() function in R performs tests of proportions, allowing for the comparison of proportions across groups or against specific values. Here is an example according to this question. here’s a super simple one-sample prop.test using penguins.csv. It asks: among Gentoo penguins, what fraction are male? It prints the estimate and a 95% CI. Run the code to see!

df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Chinstrap","Gentoo") & !is.na(sex))
x1 <- sum(df$species=="Chinstrap" & df$sex=="male")
n1 <- sum(df$species=="Chinstrap")
x2 <- sum(df$species=="Gentoo" & df$sex=="male")
n2 <- sum(df$species=="Gentoo")
prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.90, correct = FALSE)$conf.int

Q6 — Two-Sample CI for a Difference of Proportions (island=Biscoe)

Using penguins.csv, compare the proportion of penguins on Biscoe island between Adelie and Gentoo. Compute a 95% two-sided CI for \(p_{Adelie} - p_{Gentoo}\) and print c(lower upper).

Preview

df <- read.csv("penguins.csv")
sub <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(island))
x1 <- sum(sub$species=="Adelie" & sub$island=="Biscoe"); n1 <- sum(sub$species=="Adelie")
x2 <- sum(sub$species=="Gentoo" & sub$island=="Biscoe"); n2 <- sum(sub$species=="Gentoo")
prop.test(x=c(x1,x2), n=c(n1,n2), conf.level=0.95, correct=FALSE)$conf.int

Q7 — Two-Sample CI for a Ratio of Variances (manual F-based)

Using recipe_reviews.csv, compare the variance of best_score for FiveStar vs NotFive. Build a 95% CI for \(\sigma^2_{FiveStar} / \sigma^2_{NotFive}\) and print c(lower, upper).

Preview

Here is a glimpse of the dataset:

Info

Taking group 1 as FiveStar here, with \(s_1^2\) and \(s_2^2\) as sample variances, here is the CI: \[ \text{CI}_{1-\alpha}\!\left(\frac{\sigma_1^2}{\sigma_2^2}\right) = \left( \frac{{s_1^2}/{s_2^2}}{F_{1-\alpha/2,\,df_1,\,df_2}}, \; \frac{{s_1^2}/{s_2^2}}{F_{\alpha/2,\,df_1,\,df_2}} \right), \qquad df_1=n_1-1,\; df_2=n_2-1. \]

df <- read.csv("recipe_reviews.csv")
x1 <- subset(df, stars == 5)$best_score
x2 <- subset(df, stars != 5)$best_score
x1 <- x1[is.finite(x1)]
x2 <- x2[is.finite(x2)]
n1 <- length(x1)
n2 <- length(x2)
s1 <- stats::var(x1)
s2 <- stats::var(x2)
df1 <- n1 - 1
df2 <- n2 - 1
ratio <- s1/s2
F_lo <- qf(0.975, df1, df2)
F_hi <- qf(0.025, df1, df2)
c(ratio/F_lo, ratio/F_hi)

Q8 - Two-Sample CI for a Ratio of Variances using var.test

In penguins.csv, compare variance of bill_length_mm between Adelie (group 1) and Gentoo (group 2). Compute an 80% CI for \(\sigma^2_{Adelie} / \sigma^2_{Gentoo}\) using var.test and print the two bounds.

Info

Here is an example to demonstrate how var.test works. We are taking the recipe reviews dataset to showcase the usage. Run the code to view results.

Preview

df <- read.csv("penguins.csv")
df <- subset(df, species %in% c("Adelie","Gentoo") & !is.na(bill_length_mm))
x <- df$bill_length_mm[df$species=="Adelie"]
y <- df$bill_length_mm[df$species=="Gentoo"]
var.test(x, y, conf.level = 0.80)$conf.int