Tutorial 10: Inference Techniques on Regression
Q1 — Fit the Regression Model
For this tutorial, we will be using a Life Expectancy dataset from the World Health Organization (WHO). Using your tutorial 9 knowledge, fit a simple linear regression model with Schooling as the predictor (X) and Life Expectancy as the response (Y). Return the estimated slope.
Feel free to run this code block to check your answer.
Fill coef(model)[___] with the right coefficient name.
df <- read.csv("life_expectancy.csv", check.names=FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
coef(model)["schooling"]
Q2 — Interpret the slope
A policy increases schooling by 2 years for every country. Using the fitted regression model, compute the predicted change in life expectancy associated with this policy.
Do this in two ways:
- Compute two fitted values at \(x_0\) and \(x_0 + 2\), then take the difference.
- Use the model’s slope to compute the same change.
For a simple linear regression line, the “effect” of increasing \(x\) by a fixed amount \(\Delta\) can be seen by comparing fitted values at two \(x\)-values separated by \(\Delta\).
A good way to sanity-check an interpretation is to compute the same effect using (1) fitted values and (2) the slope coefficient, and confirm they agree.
Feel free to run this code block to check your answer.
Use the idea “effect = fitted value at the new \(x\) minus fitted value at the old \(x\)”. You can get fitted values either from the line equation or from predict() using newdata. For the slope method, interpret the slope as the change in the fitted response per one-unit increase in \(x\).
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
x0 <- mean(le$schooling)
yhat0 <- unname(predict(model, newdata = data.frame(schooling = x0)))
yhat2 <- unname(predict(model, newdata = data.frame(schooling = x0 + 2)))
effect_via_preds <- yhat2 - yhat0
b1 <- unname(coef(model)["schooling"])
effect_via_slope <- 2 * b1
c(effect_via_preds = effect_via_preds,
effect_via_slope = effect_via_slope)
Q3 — Compute \(R^2\)
Return the model’s \(R^2\).
\(R^2\) is the proportion of variability in life expectancy explained by schooling.
Feel free to run this code block to check your answer.
Produce model summary and return the appropriate statistic from it.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
summary(model)$r.squared
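As a sanity check, \(R^2\) can also be computed by hand as \(1 - \text{SSE}/\text{SST}\). A minimal sketch using R's built-in mtcars data as a stand-in (in case the WHO CSV is not at hand):

```r
# Fit a simple regression on a built-in dataset (stand-in for the WHO data)
fit <- lm(mpg ~ wt, data = mtcars)

# R^2 = 1 - SSE/SST, where SSE is the residual sum of squares
# and SST is the total sum of squares around the mean response
sse <- sum(resid(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - sse / sst

# Should match the value reported by summary()
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```

The two numbers agree because summary() computes \(R^2\) from exactly these sums of squares.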
Q4 — Compute the p-value
Compute the p-value for the slope estimate \(b_1\), testing \(H_0: \beta_1 = 0\).
This tests whether there is evidence of a linear relationship between schooling and life expectancy.
Feel free to run this code block to check your answer.
In the coefficient table returned by summary(), the p-values are in the column ‘Pr(>|t|)’; index the table by the row name and that column.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)
summary(mod)$coefficients["schooling", "Pr(>|t|)"]
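The p-value in the table can be reproduced from first principles: the test statistic is \(t = b_1 / \text{SE}(b_1)\), compared against a t-distribution with \(n - 2\) degrees of freedom. A sketch on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

ct <- summary(fit)$coefficients      # coefficient table (matrix)
b1 <- ct["wt", "Estimate"]
se <- ct["wt", "Std. Error"]

t_stat   <- b1 / se                                      # t = estimate / standard error
p_manual <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value

c(manual = p_manual, from_table = ct["wt", "Pr(>|t|)"])
```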
Q5 — 95% CI for the slope
Return a 95% confidence interval for the slope using the R built-in command confint().
The confint() function in R is used to compute confidence intervals for one or more parameters in a fitted model.
Feel free to run this code block to check your answer.
Pass the fitted model and the coefficient name to confint(); see ?confint for details.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)
as.numeric(confint(mod, "schooling"))
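The interval confint() returns is just \(b_1 \pm t_{0.975,\,n-2}\,\text{SE}(b_1)\), which you can verify by hand. A sketch on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset
ct <- summary(fit)$coefficients
b1 <- ct["wt", "Estimate"]
se <- ct["wt", "Std. Error"]

tcrit <- qt(0.975, df = fit$df.residual)   # 97.5% t-quantile gives a 95% CI
ci_manual <- c(b1 - tcrit * se, b1 + tcrit * se)

rbind(manual = ci_manual, confint = confint(fit, "wt"))
```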
Q6 — Predict the mean life expectancy
Predict the mean life expectancy at schooling = 12 years.
A point prediction is \(\hat{y}\) evaluated at a given \(x\) value. In this case, \(x = 12\).
Feel free to run this code block to check your answer.
Look up how to use predict().
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
predict(model, newdata = data.frame(schooling = 12))
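The point prediction is just the line equation \(b_0 + b_1 x\); predict() can also attach a 95% confidence interval for the mean response via interval = "confidence". A sketch on the built-in mtcars data (a stand-in, with a hypothetical x-value of 3):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

# Point prediction at x = 3 is just the line equation b0 + b1 * x
x_new <- 3
yhat_manual <- unname(coef(fit)[1] + coef(fit)[2] * x_new)

# predict() returns the same point estimate, plus a 95% CI for the mean response
pred <- predict(fit, newdata = data.frame(wt = x_new), interval = "confidence")
list(manual = yhat_manual, from_predict = pred)
```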
Q7 — Residuals sum
Confirm whether the residuals of the model sum to 0. Round the sum to a few decimal places.
When the model includes an intercept, the least-squares residuals sum to exactly 0 (up to tiny floating-point error).
Use resid() to get residuals.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
round(sum(resid(model)), 6)
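The zero sum is a consequence of including an intercept: the normal equations force the residuals to be orthogonal to the constant column. Dropping the intercept removes that guarantee, as this sketch on the built-in mtcars data (a stand-in) shows:

```r
fit_int   <- lm(mpg ~ wt, data = mtcars)       # with intercept
fit_noint <- lm(mpg ~ wt - 1, data = mtcars)   # intercept removed

# With an intercept the normal equations force sum(e_i) = 0;
# without one, the residuals need not sum to zero
c(with_intercept = sum(resid(fit_int)),
  no_intercept   = sum(resid(fit_noint)))
```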
Q8 — Standardized residuals
Compute the standardized residuals and count how many satisfy \(|r_i| > 2\).
Standardized residuals are useful for spotting outliers: observations with \(|r_i| > 2\) are commonly flagged for a closer look.
Use rstandard() to get standardized residuals.
df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)
r <- rstandard(model)
sum(abs(r) > 2)
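Under the hood, rstandard() computes \(r_i = e_i / (s \sqrt{1 - h_{ii}})\), where \(s\) is the residual standard error and \(h_{ii}\) is the leverage of observation \(i\). A sketch verifying this on the built-in mtcars data (a stand-in, not the WHO dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in dataset

e <- resid(fit)
h <- hatvalues(fit)                  # leverages h_ii
s <- summary(fit)$sigma              # residual standard error

r_manual <- e / (s * sqrt(1 - h))    # internally studentized residuals

# Should agree with rstandard(); then count the |r_i| > 2 flags
stopifnot(all.equal(unname(r_manual), unname(rstandard(fit))))
sum(abs(r_manual) > 2)
```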