Tutorial 09: Intro to Regression
Q1 — Fit a simple linear regression model (movies only)
Using titles.csv, we will work with movies only and fit:
\[y = \beta_0 + \beta_1 x + \varepsilon\]
where y = imdb_score and x = runtime.
Your task:
- Fit a regression line
- Return a named numeric vector: c(b0 = ..., b1 = ...)
A simple linear regression estimates a straight-line relationship between a predictor x and a response y.
- \(b_0\) (intercept): the predicted value of y when x = 0
- \(b_1\) (slope): the predicted change in y for a one-unit increase in x
In R, once you fit the model, you can extract the two estimated coefficients from the fitted object.
You need a fitted regression model object using the filtered movie data. Then extract the two coefficient estimates from that object (intercept first, slope second).
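Why the extraction step needs unname() is easy to miss, so here is a minimal sketch on made-up data. The `toy` data frame and its exact line y = 2 + 3x are assumptions for illustration only and have nothing to do with titles.csv. The sketch shows that coef() returns a named vector, and that c() would otherwise glue the old names onto the new ones.

```r
# Hypothetical toy data: points lying exactly on y = 2 + 3x
toy <- data.frame(x = 1:5)
toy$y <- 2 + 3 * toy$x

fit_toy <- lm(y ~ x, data = toy)
b <- coef(fit_toy)

names(b)                          # "(Intercept)" "x"

# Without unname(), c() combines the new and old names
names(c(b0 = b[1], b1 = b[2]))    # "b0.(Intercept)" "b1.x"

# With unname(), only the requested names remain
res <- c(b0 = unname(b[1]), b1 = unname(b[2]))
names(res)                        # "b0" "b1"
```

The same pattern applies to any lm() fit: the first element of coef() is the intercept and the second is the slope of the single predictor.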
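df <- read.csv("titles.csv")
sub <- subset(df,                 # keep movies with a usable runtime and score
  type == "MOVIE" &
    is.finite(runtime) &          # drops NA, NaN and Inf
    is.finite(imdb_score) &
    runtime > 0
)
fit <- lm(imdb_score ~ runtime, data = sub)   # least-squares fit
b <- coef(fit)                                # intercept first, then slope
c(b0 = unname(b[1]), b1 = unname(b[2]))       # unname() drops the original labels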
Q2 — Compute \(\hat\beta_1\) (slope) by computation
Compute the slope estimator using the formula:
\[\hat\beta_1=\frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}\]
Here, x = runtime and y = imdb_score (movies only). Return the numeric value b1_hat.
In this formula:
- The numerator measures how x and y move together (a “covariance-like” quantity).
- The denominator measures how spread out x is (a “variance-like” quantity).
- The slope is the “co-movement” scaled by the spread in x.
Start by centering both variables around their sample means. Then compute the two sums shown in the formula (top and bottom) and combine them to get the slope.
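As a quick check of the "covariance over variance" reading, the sketch below uses five made-up points (`x_toy` and `y_toy` are hypothetical and not taken from titles.csv). Dividing the numerator and the denominator by n - 1 turns them into the sample covariance and sample variance, so the ratio should agree with cov() / var().

```r
# Hypothetical toy data, small enough to verify by hand
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- c(2, 4, 5, 4, 5)

# Centre both variables around their sample means (3 and 4)
dx <- x_toy - mean(x_toy)   # -2 -1  0  1  2
dy <- y_toy - mean(y_toy)   # -2  0  1  0  1

num <- sum(dx * dy)         # 6
den <- sum(dx^2)            # 10
b1_formula <- num / den     # 0.6

# Same ratio, because the (n - 1) factors cancel
b1_cov_var <- cov(x_toy, y_toy) / var(x_toy)

all.equal(b1_formula, b1_cov_var)   # TRUE
```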
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
x <- sub$runtime
y <- sub$imdb_score
xbar <- mean(x)
ybar <- mean(y)
num <- sum((x - xbar) * (y - ybar))   # how x and y move together
den <- sum((x - xbar)^2)              # how spread out x is
b1_hat <- num / den
b1_hat
Q3 — Compute \(\hat\beta_0\) (intercept) by computation
Compute the intercept estimator:
\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]
Return the numeric value b0_hat.
Conceptually:
- The fitted regression line must pass through the point \((\bar{x}, \bar{y})\).
- Once you have the slope, the intercept is whatever value makes the line go through that mean point.
Use the fact that the fitted line goes through \((\bar{x}, \bar{y})\). If you already computed the slope in Q2, you only need the sample means to determine the intercept.
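The mean-point property can be checked numerically. The following is a small sketch on made-up data (`x_toy`, `y_toy` and the `_toy` names are hypothetical, not from titles.csv): it computes the slope and intercept by hand, confirms that the line evaluated at the mean of x returns the mean of y, and compares the result with lm().

```r
# Hypothetical toy data
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- c(2, 4, 5, 4, 5)

xbar <- mean(x_toy)   # 3
ybar <- mean(y_toy)   # 4

b1_toy <- sum((x_toy - xbar) * (y_toy - ybar)) / sum((x_toy - xbar)^2)   # 0.6
b0_toy <- ybar - b1_toy * xbar                                           # 4 - 0.6 * 3 = 2.2

# Evaluating the fitted line at xbar gives back ybar
all.equal(b0_toy + b1_toy * xbar, ybar)               # TRUE

# lm() agrees with the hand computation
fit_toy <- lm(y_toy ~ x_toy)
all.equal(unname(coef(fit_toy)), c(b0_toy, b1_toy))   # TRUE
```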
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
x <- sub$runtime
y <- sub$imdb_score
xbar <- mean(x)
ybar <- mean(y)
b1_hat <- sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)   # slope, as in Q2
b0_hat <- ybar - b1_hat * xbar                               # line passes through (xbar, ybar)
b0_hat
Q4 — Cross-check \((\hat\beta_0, \hat\beta_1)\) using summary(lm)
Fit the same model as Q1 and extract coefficients from:
summary(fit)$coefficients
Return c(b0 = ..., b1 = ...).
Most regression summaries include a coefficient table where:
- each row corresponds to a coefficient (intercept + predictor)
- the main estimate column gives the fitted coefficient values
Your goal is to extract the two estimates for this model.
Look for the coefficient table in the model summary. It will have a row for the intercept and a row for the predictor, and an “estimate”-type column containing the fitted values.
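To see what that table looks like before indexing into it, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical, not titles.csv). For an lm() fit, summary()$coefficients is a numeric matrix whose row names are the coefficient names and whose first column is called "Estimate".

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit_toy <- lm(y ~ x, data = toy)

tab <- summary(fit_toy)$coefficients

is.matrix(tab)   # TRUE, so index it as tab[row, column]
rownames(tab)    # "(Intercept)" "x"
colnames(tab)    # "Estimate" "Std. Error" "t value" "Pr(>|t|)"

# The Estimate column holds the same numbers that coef() returns
all.equal(tab[, "Estimate"], coef(fit_toy))   # TRUE
```

Indexing by row and column name, rather than by position, keeps the extraction readable and robust if the row order ever changes.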
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
fit <- lm(imdb_score ~ runtime, data = sub)
tab <- summary(fit)$coefficients        # matrix: one row per coefficient
b0 <- tab["(Intercept)", "Estimate"]    # index by row name and column name
b1 <- tab["runtime", "Estimate"]
c(b0 = b0, b1 = b1)
Q5 — Point prediction by hand (using b0, b1)
Using the same filtered movie subset and the fitted line, compute the predicted IMDb score when:
- \(x_0 = 90\) minutes
Return a single numeric value: yhat_90.
A point prediction on a fitted line is the model’s estimated mean response at a chosen \(x_0\). Conceptually: “plug \(x_0\) into the fitted line.”
A point prediction uses the fitted line’s two coefficients and a chosen \(x_0\). Use the intercept + slope idea: one part is the “baseline”, the other is the “change per minute” times \(x_0\).
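The plug-in idea can be tried on a line whose coefficients are easy to verify by hand. This is a small sketch on made-up data (the `toy` data frame and the chosen `x0` values are assumptions for illustration): the least-squares line for these five points has intercept 2.2 and slope 0.6.

```r
# Hypothetical toy data with fitted line yhat = 2.2 + 0.6 * x
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
b_toy <- coef(lm(y ~ x, data = toy))

# Baseline (intercept) plus change per unit (slope) times x0
x0 <- 4
yhat_x0 <- unname(b_toy[1] + b_toy[2] * x0)   # 2.2 + 0.6 * 4 = 4.6
yhat_x0

# The same expression works for several x0 values at once
x0_many <- c(2, 4)
yhat_many <- unname(b_toy[1] + b_toy[2] * x0_many)   # 3.4 and 4.6
yhat_many
```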
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
fit <- lm(imdb_score ~ runtime, data = sub)
b <- coef(fit)                         # b[1] = intercept, b[2] = slope
x0 <- 90
yhat_90 <- unname(b[1] + b[2] * x0)    # plug x0 into the line; drop the "(Intercept)" label
yhat_90
Q6 — Point prediction using predict()
Use the fitted model to predict the IMDb score at:
- \(x_0 = 90\) minutes
This time, use predict() and a newdata data frame.
Return a single numeric value: yhat_90.
Built-in prediction methods require:
- a fitted model object
- a newdata data frame with the predictor column named exactly as in the model
Create a one-row newdata data frame whose column name matches the predictor used in the model, then call the model’s prediction function and extract the numeric value.
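A minimal sketch of this workflow on made-up data follows (the `toy` data frame and the `_toy` names are hypothetical, not from titles.csv). It illustrates two details: the newdata column carries the same name as the predictor in the formula, and predict() returns a named vector, labelled by the row names of newdata.

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit_toy <- lm(y ~ x, data = toy)

# The column is called x, matching the right-hand side of y ~ x
new_toy <- data.frame(x = 4)
p <- predict(fit_toy, newdata = new_toy)

names(p)     # "1", the row name of new_toy
unname(p)    # a bare number, about 4.6

# Agrees with plugging x0 = 4 into the fitted line by hand
b_toy <- coef(fit_toy)
all.equal(unname(p), unname(b_toy[1] + b_toy[2] * 4))   # TRUE
```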
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
fit <- lm(imdb_score ~ runtime, data = sub)
new <- data.frame(runtime = 90)                  # column name must match the predictor
yhat_90 <- unname(predict(fit, newdata = new))   # unname() drops the row label "1"
yhat_90
Q7 — Compare two predictions (difference in fitted values)
Using the fitted model, compute:
\[\hat{y}(120) - \hat y(80)\]
Return a single numeric value: diff_hat.
This question is about comparing fitted values at two predictor values. Conceptually: “how much does the model expect the response to change when \(x\) changes from 80 to 120?”
Get two fitted values for \(x = 80\) and \(x = 120\) (any reasonable method), then subtract in the requested order.
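Because the fitted line is straight, the intercept cancels in the subtraction: \(\hat{y}(120) - \hat{y}(80) = (b_0 + 120\, b_1) - (b_0 + 80\, b_1) = 40\, b_1\). The sketch below checks the same cancellation on made-up data (the `toy` data frame and the evaluation points 2 and 4 are assumptions for illustration), where the difference should equal the gap in x times the slope.

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit_toy <- lm(y ~ x, data = toy)

# Fitted values at x = 2 and x = 4, in that order
yhat_toy <- predict(fit_toy, newdata = data.frame(x = c(2, 4)))
diff_toy <- unname(yhat_toy[2] - yhat_toy[1])    # yhat(4) - yhat(2)

# The difference equals (4 - 2) times the slope
slope_toy <- unname(coef(fit_toy)["x"])
all.equal(diff_toy, (4 - 2) * slope_toy)         # TRUE
```

This gives a useful sanity check for the tutorial answer: diff_hat should match 40 times the slope found in Q1.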
df <- read.csv("titles.csv")
sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)
fit <- lm(imdb_score ~ runtime, data = sub)
yhat <- predict(fit, newdata = data.frame(runtime = c(80, 120)))   # yhat[1] at 80, yhat[2] at 120
diff_hat <- unname(yhat[2] - yhat[1])                              # yhat(120) - yhat(80)
diff_hat