Tutorial 10: Inference Techniques on Regression
Q1 — Fit the Regression Model
For this tutorial, we will be using a Life Expectancy dataset from the World Health Organization (WHO). Using your tutorial 9 knowledge, fit a simple linear regression model with Schooling as the predictor (X) and Life Expectancy as the response (Y). Return the estimated slope.
Feel free to run this code block to check your answer.
Fill coef(model)[___] with the right coefficient name.
df <- read.csv("life_expectancy.csv", check.names=FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
coef(model)["schooling"]
Q2 — Interpret the slope
A policy increases schooling by 2 years for every country. Using the fitted regression model, compute the predicted change in life expectancy associated with this policy.
Do this in two ways:
- Compute two fitted values at \(x_0\) and \(x_0 + 2\), then take the difference.
- Use the model’s slope to compute the same change.
For a simple linear regression line, the “effect” of increasing \(x\) by a fixed amount \(\Delta\) can be seen by comparing fitted values at two \(x\)-values separated by \(\Delta\).
A good way to sanity-check an interpretation is to compute the same effect using (1) fitted values and (2) the slope coefficient, and confirm they agree.
Feel free to run this code block to check your answer.
Use the idea “effect = fitted value at the new \(x\) minus fitted value at the old \(x\)”. You can get fitted values either from the line equation or from predict() using newdata. For the slope method, interpret the slope as the change in the fitted response per one-unit increase in \(x\).
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
x0 <- mean(le$schooling)
yhat0 <- unname(predict(model, newdata = data.frame(schooling = x0)))
yhat2 <- unname(predict(model, newdata = data.frame(schooling = x0 + 2)))
effect_via_preds <- yhat2 - yhat0
b1 <- unname(coef(model)["schooling"])
effect_via_slope <- 2 * b1
c(effect_via_preds = effect_via_preds,
effect_via_slope = effect_via_slope)
Q3 — Compute \(R^2\)
Return the model’s \(R^2\).
\(R^2\) is the proportion of variability in life expectancy explained by schooling.
Feel free to run this code block to check your answer.
Produce model summary and return the appropriate statistic from it.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
summary(model)$r.squared
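As a sanity check, \(R^2\) can also be computed by hand as \(1 - \text{SSE}/\text{SST}\). A minimal sketch using R's built-in mtcars data as a stand-in (in case the WHO CSV is not at hand):

```r
# Fit a simple regression on a built-in dataset (stand-in for the WHO data)
fit <- lm(mpg ~ wt, data = mtcars)

# R^2 = 1 - SSE/SST, where SSE is the residual sum of squares
# and SST is the total sum of squares around the mean response
sse <- sum(resid(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - sse / sst

# Should match the value reported by summary()
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```

The two numbers agree because summary() computes \(R^2\) from exactly these sums of squares.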
Q4 — Compute the p-value
Compute the p-value for the slope estimate \(b_1\), testing \(H_0: \beta_1 = 0\).
This tests whether there is evidence of a linear relationship between schooling and life expectancy.
Feel free to run this code block to check your answer.
In the coefficient table returned by summary(), the p-values are in the column ‘Pr(>|t|)’; index the table by the row name and that column.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)
summary(mod)$coefficients["schooling", "Pr(>|t|)"]
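The p-value in the table can be reproduced from first principles: the test statistic is \(t = b_1 / \text{SE}(b_1)\), compared against a t-distribution with \(n - 2\) degrees of freedom. A sketch on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

ct <- summary(fit)$coefficients      # coefficient table (matrix)
b1 <- ct["wt", "Estimate"]
se <- ct["wt", "Std. Error"]

t_stat   <- b1 / se                                      # t = estimate / standard error
p_manual <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value

c(manual = p_manual, from_table = ct["wt", "Pr(>|t|)"])
```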
Q5 — 95% CI for the slope
Return a 95% confidence interval for the slope using the R built-in command confint().
The confint() function in R is used to compute confidence intervals for one or more parameters in a fitted model.
Feel free to run this code block to check your answer.
Pass the fitted model and the coefficient name to confint(); see ?confint for details.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)
as.numeric(confint(mod, "schooling"))
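The interval confint() returns is just \(b_1 \pm t_{0.975,\,n-2}\,\text{SE}(b_1)\), which you can verify by hand. A sketch on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset
ct <- summary(fit)$coefficients
b1 <- ct["wt", "Estimate"]
se <- ct["wt", "Std. Error"]

tcrit <- qt(0.975, df = fit$df.residual)   # 97.5% t-quantile gives a 95% CI
ci_manual <- c(b1 - tcrit * se, b1 + tcrit * se)

rbind(manual = ci_manual, confint = confint(fit, "wt"))
```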
Q6 — Predict the mean life expectancy
Predict the mean life expectancy at schooling = 12 years.
A point prediction is \(\hat{y}\) evaluated at a given \(x\) value. In this case, \(x = 12\).
Feel free to run this code block to check your answer.
Look up how to use predict().
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
predict(model, newdata = data.frame(schooling = 12))
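The point prediction is just the line equation \(b_0 + b_1 x\); predict() can also attach a 95% confidence interval for the mean response via interval = "confidence". A sketch on the built-in mtcars data (a stand-in, with a hypothetical x-value of 3):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

# Point prediction at x = 3 is just the line equation b0 + b1 * x
x_new <- 3
yhat_manual <- unname(coef(fit)[1] + coef(fit)[2] * x_new)

# predict() returns the same point estimate, plus a 95% CI for the mean response
pred <- predict(fit, newdata = data.frame(wt = x_new), interval = "confidence")
list(manual = yhat_manual, from_predict = pred)
```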
Q7 — Residuals sum
Confirm whether the residuals of the model sum to 0. Round the sum to a few decimal places.
When the model includes an intercept, the least-squares residuals sum to exactly 0 (up to tiny floating-point error).
Use resid() to get residuals.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
round(sum(resid(model)), 6)
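The zero sum is a consequence of including an intercept: the normal equations force the residuals to be orthogonal to the constant column. Dropping the intercept removes that guarantee, as this sketch on the built-in mtcars data (a stand-in) shows:

```r
fit_int   <- lm(mpg ~ wt, data = mtcars)       # with intercept
fit_noint <- lm(mpg ~ wt - 1, data = mtcars)   # intercept removed

# With an intercept the normal equations force sum(e_i) = 0;
# without one, the residuals need not sum to zero
c(with_intercept = sum(resid(fit_int)),
  no_intercept   = sum(resid(fit_noint)))
```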
Q8 — Standardized residuals
Compute the standardized residuals and count how many satisfy \(|r_i| > 2\).
Standardized residuals are useful for spotting outliers: observations with \(|r_i| > 2\) are commonly flagged for a closer look.
Use rstandard() to get standardized residuals.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
r <- rstandard(model)
sum(abs(r) > 2)
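Under the hood, rstandard() computes \(r_i = e_i / (s \sqrt{1 - h_{ii}})\), where \(s\) is the residual standard error and \(h_{ii}\) is the leverage of observation \(i\). A sketch verifying this on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

e <- resid(fit)
h <- hatvalues(fit)                  # leverages h_ii
s <- summary(fit)$sigma              # residual standard error

r_manual <- e / (s * sqrt(1 - h))    # internally studentized residuals

# Should agree with rstandard(); then count the |r_i| > 2 flags
stopifnot(all.equal(unname(r_manual), unname(rstandard(fit))))
sum(abs(r_manual) > 2)
```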