Tutorial 10: Inference Techniques on Regression

Q1 - Fit the Regression Model

For this tutorial, we will be using a Life Expectancy dataset from the World Health Organization (WHO). Using your Tutorial 9 knowledge, fit a simple linear regression model with Schooling as X (the predictor) and Life Expectancy as Y (the response). Return the estimated slope.

Hint:

Feel free to run this code block to explore the data.

Fill in coef(model)[___] with the right coefficient name.


df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))                 # strip stray whitespace from column names
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)                              # drop rows with missing values

model <- lm(life_exp ~ schooling, data = le)   # Y = life expectancy, X = schooling
coef(model)["schooling"]                       # estimated slope

Q2 — Interpret the slope

A policy increases schooling by 2 years in every country. Using the fitted regression model, compute the predicted change in life expectancy associated with this policy.

Do this in two ways:

  1. Compute two fitted values at \(x_0\) and \(x_0 + 2\), then take the difference.
  2. Use the model’s slope to compute the same change.
Info:

In a simple linear regression, the “effect” of increasing \(x\) by a fixed amount \(\Delta\) can be seen by comparing fitted values at two \(x\)-values separated by \(\Delta\).

A good way to sanity-check an interpretation is to compute the same effect using (1) fitted values and (2) the slope coefficient, and confirm they agree.

Hint:


Use the idea “effect = fitted value at (new x) minus fitted value at (old x)”. You can get fitted values either from the line equation or from predict() using newdata. For the slope method, interpret the slope as “change in fitted response per 1 unit of x”.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))

le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)

model <- lm(life_exp ~ schooling, data = le)

x0 <- mean(le$schooling)       # baseline schooling level

# Method 1: difference of fitted values at x0 and x0 + 2
yhat0 <- unname(predict(model, newdata = data.frame(schooling = x0)))
yhat2 <- unname(predict(model, newdata = data.frame(schooling = x0 + 2)))
effect_via_preds <- yhat2 - yhat0

# Method 2: slope times the change in x
b1 <- unname(coef(model)["schooling"])
effect_via_slope <- 2 * b1

c(effect_via_preds = effect_via_preds,
  effect_via_slope = effect_via_slope)

Q3 — Compute \(R^2\)

Return the model’s \(R^2\).

Info:

\(R^2\) is the proportion of variability in life expectancy explained by schooling.
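The definition above can be checked by hand: \(R^2 = 1 - SSE/SST\), where \(SSE\) is the residual sum of squares and \(SST\) the total sum of squares. A minimal sketch on simulated data (the data and variable names here are illustrative, not the tutorial's dataset):

```r
set.seed(1)
x <- rnorm(50)                # simulated predictor
y <- 2 + 3 * x + rnorm(50)    # simulated response
fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)          # unexplained variation
sst <- sum((y - mean(y))^2)       # total variation around the mean
r2_manual <- 1 - sse / sst

r2_manual
summary(fit)$r.squared            # agrees with r2_manual
```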

Hint:


Produce the model summary and return the appropriate statistic from it.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)

summary(model)$r.squared

Q4 — Computing P-value

Compute the p-value for the slope, testing \(H_0: \beta_1 = 0\).

Info:

This tests whether there is evidence of a linear relationship between schooling and life expectancy.
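Under \(H_0\), the statistic \(t = b_1 / SE(b_1)\) follows a \(t\) distribution with \(n - 2\) degrees of freedom, and the p-value reported by summary() can be reproduced by hand. A sketch on simulated data (illustrative names, not the tutorial's dataset):

```r
set.seed(1)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)
ct <- summary(fit)$coefficients

b1 <- ct["x", "Estimate"]         # slope estimate
se <- ct["x", "Std. Error"]       # its standard error
t_stat <- b1 / se
p_manual <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value

c(p_manual, ct["x", "Pr(>|t|)"])  # the two values agree
```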

Hint:


The second index into summary(mod)$coefficients selects the column ‘Pr(>|t|)’, which holds the p-values.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)

summary(mod)$coefficients["schooling", "Pr(>|t|)"]

Q5 — 95% CI for the slope

Return a 95% confidence interval for the slope using the R built-in command confint().

Info:

The confint() function in R is used to compute confidence intervals for one or more parameters in a fitted model.
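For the slope, confint() computes \(b_1 \pm t_{0.975,\,n-2} \cdot SE(b_1)\). A sketch verifying this on simulated data (names are illustrative):

```r
set.seed(1)
x <- rnorm(30)
y <- 2 - x + rnorm(30)
fit <- lm(y ~ x)

b1 <- unname(coef(fit)["x"])
se <- summary(fit)$coefficients["x", "Std. Error"]
tcrit <- qt(0.975, df = fit$df.residual)    # critical t value
ci_manual <- c(b1 - tcrit * se, b1 + tcrit * se)

ci_manual
as.numeric(confint(fit, "x"))               # matches ci_manual
```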

Hint:


Pass the right coefficient name, and look up how confint() is used here.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
mod <- lm(life_exp ~ schooling, data = le)

as.numeric(confint(mod, "schooling"))

Q6 — Predict the mean life expectancy

Predict the mean life expectancy at schooling = 12 years.

Info:

A point prediction is simply \(\hat{y}\) evaluated at a given \(x\) value; here, \(x = 12\).
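Equivalently, \(\hat{y}\) can be computed straight from the coefficients as \(b_0 + b_1 x\); predict() does the same bookkeeping. A sketch on simulated data (names are illustrative):

```r
set.seed(1)
x <- rnorm(60, mean = 12, sd = 3)
y <- 40 + 2.5 * x + rnorm(60)
fit <- lm(y ~ x)

b <- coef(fit)
yhat_manual <- unname(b[1] + b[2] * 12)     # b0 + b1 * 12
yhat_predict <- unname(predict(fit, newdata = data.frame(x = 12)))

c(yhat_manual, yhat_predict)                # identical
```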

Hint:


Look up how to use predict().

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)

predict(model, newdata = data.frame(schooling = 12))

Q7 — Residuals sum

Check whether the model's residuals sum to 0. Round the sum.

Info:

When the model includes an intercept, the residuals sum to 0 (up to tiny numerical error).
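The zero-sum property follows from the intercept's least-squares normal equation; drop the intercept and it generally fails. A quick sketch on simulated data (illustrative, not the tutorial's dataset):

```r
set.seed(1)
x <- rnorm(50)
y <- 5 + 2 * x + rnorm(50)

with_int    <- lm(y ~ x)        # intercept included
without_int <- lm(y ~ x - 1)    # intercept removed

sum(resid(with_int))            # ~0, up to numerical error
sum(resid(without_int))         # generally far from 0
```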

Hint:

Use resid() to get residuals.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)

round(sum(resid(model)), 6)

Q8 — Standardized residuals

Compute the standardized residuals and count how many satisfy \(|r_i| > 2\).

Info:

Standardized residuals are useful for spotting outliers: observations with \(|r_i| > 2\) are a common red flag.
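For lm() fits, rstandard() computes \(r_i = e_i / (s\sqrt{1 - h_{ii}})\), where \(e_i\) is the raw residual, \(s\) the residual standard error, and \(h_{ii}\) the leverage. A sketch checking this on simulated data (names are illustrative):

```r
set.seed(1)
x <- rnorm(40)
y <- 1 + x + rnorm(40)
fit <- lm(y ~ x)

e <- resid(fit)                 # raw residuals
h <- hatvalues(fit)             # leverages h_ii
s <- summary(fit)$sigma         # residual standard error
r_manual <- e / (s * sqrt(1 - h))

all.equal(unname(r_manual), unname(rstandard(fit)))   # TRUE
```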

Hint:

Use rstandard() to get standardized residuals.

df <- read.csv("life_expectancy.csv", check.names = FALSE)
names(df) <- trimws(names(df))
le <- df[, c("Life expectancy","Schooling")]
names(le) <- c("life_exp","schooling")
le <- na.omit(le)
model <- lm(life_exp ~ schooling, data = le)

r <- rstandard(model)
sum(abs(r) > 2)