Tutorial 09: Intro to Regression

Q1 — Fit a simple linear regression model (movies only)

Using titles.csv, we will work with movies only and fit:

\[y = \beta_0 + \beta_1 x + \varepsilon\]

with y = imdb_score and x = runtime.

Your task:
- Fit a regression line
- Return a named numeric vector: c(b0 = ..., b1 = ...)

Note: Info

A simple linear regression estimates a straight-line relationship between a predictor x and a response y.

  • \(b_0\) (intercept): the predicted value of y when x = 0
  • \(b_1\) (slope): the predicted change in y for a one-unit increase in x

In R, once you fit the model, you can extract the two estimated coefficients from the fitted object.
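As a minimal sketch of that extraction step, the example below fits a line to a small made-up data set and pulls out the two estimates with coef(). The `toy` data frame and its values are illustrative assumptions, not taken from titles.csv; they are chosen so that the true line is known in advance.

```r
# Toy data (hypothetical values, built so that y = 2 + 3 * x exactly)
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 + 3 * toy$x

# Fit the straight line y = b0 + b1 * x by least squares
toy_fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" first, then the slope for "x"
toy_coef <- coef(toy_fit)

# Rename the two estimates to b0 and b1
toy_est <- c(b0 = unname(toy_coef[1]), b1 = unname(toy_coef[2]))
toy_est
```

Because the toy data lie exactly on a line, the estimates recover the intercept 2 and slope 3 up to floating-point error.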

Note: Preview

You need a fitted regression model object using the filtered movie data. Then extract the two coefficient estimates from that object (intercept first, slope second).

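# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

# Fit imdb_score = b0 + b1 * runtime by least squares
fit <- lm(imdb_score ~ runtime, data = sub)

# Return the estimates as a named vector: intercept first, slope second
b <- coef(fit)
c(b0 = unname(b[1]), b1 = unname(b[2]))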

Q2 — Compute \(\hat\beta_1\) (slope) by hand

Compute the slope estimator using the formula:

\[\hat\beta_1=\frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}\]

Here, \(x\) = runtime and \(y\) = imdb_score (movies only). Return the numeric value b1_hat.

Note: Info

In the formula above:

  • The numerator measures how x and y move together (a “covariance-like” quantity).

  • The denominator measures how spread out x is (a “variance-like” quantity).

  • The slope is the “co-movement” scaled by the spread in x.
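To make the "covariance-like over variance-like" reading concrete, here is a small sketch on made-up numbers (the vectors `x` and `y` below are hypothetical, not from titles.csv). It checks that the sum formula gives the same slope as the ratio of the sample covariance to the sample variance.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Numerator: co-movement of x and y; denominator: spread of x
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sum((x - mean(x))^2)
slope_formula <- num / den

# cov() and var() both divide by (n - 1), so that factor cancels in the ratio
slope_cov_var <- cov(x, y) / var(x)

# TRUE when the two computations agree up to floating-point error
all.equal(slope_formula, slope_cov_var)
```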

Note: Preview

Start by centering both variables around their sample means. Then compute the two sums shown in the formula (top and bottom) and combine them to get the slope.

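# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

x <- sub$runtime
y <- sub$imdb_score

# Sample means used to center both variables
xbar <- mean(x)
ybar <- mean(y)

# Numerator (co-movement of x and y) and denominator (spread of x)
num <- sum((x - xbar) * (y - ybar))
den <- sum((x - xbar)^2)

# Slope estimate
b1_hat <- num / den
b1_hat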

Q3 — Compute \(\hat\beta_0\) (intercept) by hand

Compute the intercept estimator:

\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

Return the numeric value b0_hat.

Note: Info

Conceptually:

  • The fitted regression line must pass through the point \((\bar{x}, \bar{y})\).

  • Once you have the slope, the intercept is whatever value makes the line go through that mean point.
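One way to check this is to substitute \(x = \bar x\) into the fitted line \(\hat y(x) = \hat\beta_0 + \hat\beta_1 x\) and use the intercept formula from this question:

\[\hat y(\bar x) = \hat\beta_0 + \hat\beta_1 \bar x = (\bar y - \hat\beta_1 \bar x) + \hat\beta_1 \bar x = \bar y\]

The slope terms cancel, so the prediction at the mean runtime is the mean score, whatever the value of \(\hat\beta_1\).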

Note: Preview

Use the fact that the fitted line goes through \((\bar{x}, \bar{y})\). If you already computed the slope in Q2, you only need sample means to determine the intercept.

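# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

x <- sub$runtime
y <- sub$imdb_score

xbar <- mean(x)
ybar <- mean(y)

# Slope estimate, computed as in Q2
b1_hat <- sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)

# Intercept: the value that makes the line pass through (xbar, ybar)
b0_hat <- ybar - b1_hat * xbar
b0_hat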

Q4 — Cross-check \((\hat\beta_0, \hat\beta_1)\) using summary(lm)

Fit the same model as Q1 and extract coefficients from:

summary(fit)$coefficients

Return c(b0 = ..., b1 = ...).

Note: Info

Most regression summaries include a coefficient table where:

  • each row corresponds to a coefficient (intercept + predictor)

  • the main estimate column gives the fitted coefficient values

Your goal is to extract the two estimates for this model.
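As a rough illustration of what that table looks like in R, the sketch below fits a line to made-up numbers (the `toy` data frame is hypothetical, not from titles.csv) and inspects the row and column names of summary()'s coefficient matrix.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
toy_fit <- lm(y ~ x, data = toy)

# The coefficient table is a numeric matrix with one row per coefficient
tab <- summary(toy_fit)$coefficients
rownames(tab)  # "(Intercept)" "x"
colnames(tab)  # "Estimate" "Std. Error" "t value" "Pr(>|t|)"

# The "Estimate" column holds the same values that coef() returns
est <- tab[, "Estimate"]
all.equal(est, coef(toy_fit))
```

Indexing by row and column name, rather than by position, keeps the extraction readable and robust if more predictors are added later.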

Note: Preview

Look for the coefficient table in the model summary. It will have a row for the intercept and a row for the predictor, and an “estimate”-type column containing the fitted values.

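# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

# Coefficient table: one row per coefficient, estimates in the "Estimate" column
tab <- summary(fit)$coefficients

# Index by row and column name rather than by position
b0 <- tab["(Intercept)", "Estimate"]
b1 <- tab["runtime", "Estimate"]

c(b0 = b0, b1 = b1)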

Q5 — Point prediction by hand (using b0, b1)

Using the same filtered movie subset and the fitted line, compute the predicted IMDb score when:

  • \(x_0 = 90\) minutes

Return a single numeric value: yhat_90.

Note: Info

A point prediction on a fitted line is the model’s estimated mean response at a chosen \(x_0\). Conceptually: “plug \(x_0\) into the fitted line.”
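The plug-in idea can be sketched with made-up coefficients. The values of `b0`, `b1`, and `x0` below are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from titles.csv.

```r
# Hypothetical coefficients, for illustration only
b0 <- 2  # intercept: prediction at x = 0
b1 <- 3  # slope: predicted change in y per one-unit increase in x

# Plug the chosen predictor value into the fitted line
x0 <- 4
yhat <- b0 + b1 * x0
yhat  # 2 + 3 * 4 = 14
```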

Note: Preview

A point prediction uses the fitted line’s two coefficients and a chosen \(x_0\). Use the intercept + slope idea (one part is the “baseline”, the other is the “change per minute” times \(x_0\)).

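# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)
b <- coef(fit)

# Plug x0 into the fitted line; unname() drops the leftover "(Intercept)" label
x0 <- 90
yhat_90 <- unname(b[1] + b[2] * x0)
yhat_90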

Q6 — Point prediction using predict()

Use the fitted model to predict the IMDb score at:

  • \(x_0 = 90\) minutes

This time, use predict() and a newdata data frame.

Return a single numeric value: yhat_90.

Note: Info

Built-in prediction methods require:

  • a fitted model object
  • a newdata data frame with the predictor column named exactly as in the model
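
The naming requirement can be illustrated on made-up data. The `toy` data frame, its column names `minutes` and `score`, and the misspelled column `mins` are all hypothetical; the sketch also assumes no other object named `minutes` exists in the workspace.

```r
# Toy data (hypothetical values; column names chosen for illustration)
toy <- data.frame(
  minutes = c(60, 80, 100, 120, 140),
  score   = c(5.8, 6.1, 6.3, 6.6, 6.9)
)
toy_fit <- lm(score ~ minutes, data = toy)

# Column name matches the predictor in the formula: one prediction comes back
ok <- predict(toy_fit, newdata = data.frame(minutes = 90))
length(ok)

# Column name does not match: the predictor is missing from newdata.
# Assuming no object called `minutes` exists elsewhere, predict() signals
# an error, which tryCatch() captures here instead of stopping the script.
bad <- tryCatch(
  predict(toy_fit, newdata = data.frame(mins = 90)),
  error = function(e) e
)
inherits(bad, "error")
```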

Note: Preview

Create a one-row newdata data frame whose column name matches the predictor used in the model, then call the model’s prediction function and extract the numeric value.

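# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

# One-row newdata; the column name must match the predictor in the formula
new <- data.frame(runtime = 90)

# unname() drops the row label so a plain numeric value is returned
yhat_90 <- unname(predict(fit, newdata = new))
yhat_90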

Q7 — Compare two predictions (difference in fitted values)

Using the fitted model, compute:

\[\hat{y}(120) - \hat y(80)\]

Return a single numeric value: diff_hat.

Note: Info

This question is about comparing fitted values at two predictor values. Conceptually: “how much does the model expect the response to change when \(x\) changes from 80 to 120?”
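For a straight-line fit the intercept cancels in this difference, so \(\hat y(120) - \hat y(80) = \hat\beta_1 (120 - 80)\). The sketch below checks that identity on made-up data; the `toy` data frame and its values are hypothetical, not from titles.csv.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  x = c(60, 80, 100, 120, 140),
  y = c(5.8, 6.1, 6.3, 6.6, 6.9)
)
toy_fit <- lm(y ~ x, data = toy)

# Difference of two fitted values, in the order yhat(120) - yhat(80)
yhat <- predict(toy_fit, newdata = data.frame(x = c(80, 120)))
diff_pred <- unname(yhat[2] - yhat[1])

# The intercept cancels, leaving slope * (120 - 80)
diff_slope <- unname(coef(toy_fit)["x"]) * (120 - 80)

# TRUE when the two computations agree up to floating-point error
all.equal(diff_pred, diff_slope)
```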

Note: Preview

Get two fitted values for \(x = 80\) and \(x = 120\) (any reasonable method), then subtract in the requested order.

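# Load the data and keep movies with a usable runtime and IMDb score
df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

# Fitted values at 80 and 120 minutes, in that order
yhat <- predict(fit, newdata = data.frame(runtime = c(80, 120)))

# yhat(120) - yhat(80); unname() drops the row label from the result
diff_hat <- unname(yhat[2] - yhat[1])
diff_hat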