Tutorial 09: Intro to Regression

Q1 — Fit a simple linear regression model (movies only)

Using titles.csv, we will work with movies only and fit:

\[y = \beta_0 + \beta_1 x + \varepsilon\]

where y = imdb_score and x = runtime.

Your task:
- Fit a regression line
- Return a named numeric vector: c(b0 = ..., b1 = ...)

Info

A simple linear regression estimates a straight-line relationship between a predictor x and a response y.

- $b_0$ (intercept): the predicted value of y when x = 0
- $b_1$ (slope): the predicted change in y for a one-unit increase in x

In R, once you fit the model, you can extract the two estimated coefficients from the fitted object.
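
As a minimal sketch of that extraction step, the toy example below fits a line to made-up data (not from titles.csv) in which y is exactly 2 + 3x, so the coefficients to recover are known in advance. The names toy, toy_fit, toy_b, and toy_coefs are hypothetical and used only for illustration.

```r
# Toy data (hypothetical, not from titles.csv): y is exactly 2 + 3 * x
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 + 3 * toy$x

# Fit the simple linear regression y ~ x
toy_fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" first, then the predictor
toy_b <- coef(toy_fit)

# Relabel as b0 / b1, dropping the names that coef() attached
toy_coefs <- c(b0 = unname(toy_b[1]), b1 = unname(toy_b[2]))
toy_coefs
```

Up to floating-point rounding, b0 should come back as 2 and b1 as 3.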

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

b <- coef(fit)
c(b0 = unname(b[1]), b1 = unname(b[2]))

Q2 — Compute $\hat\beta_1$ (slope) by hand

Compute the slope estimator using the formula:

\[\hat\beta_1=\frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}\]

Here, x = runtime and y = imdb_score (movies only). Return the numeric value b1_hat.

Info

In the formula above, $\bar x$ and $\bar y$ are the sample means of x and y.

- The numerator measures how x and y move together (a “covariance-like” quantity).
- The denominator measures how spread out x is (a “variance-like” quantity).
- The slope is the “co-movement” scaled by the spread in x.
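
To make the “covariance-like over variance-like” reading concrete, here is a small sketch on made-up numbers (not from titles.csv). It computes the slope three ways: from the sums in the formula, from cov() and var(), and from lm(). The vectors x_toy and y_toy are hypothetical.

```r
# Toy data (hypothetical, not from titles.csv)
x_toy <- c(60, 75, 90, 105, 120)
y_toy <- c(5.8, 6.1, 6.0, 6.6, 6.9)

# Slope from the sums in the formula
b1_sums <- sum((x_toy - mean(x_toy)) * (y_toy - mean(y_toy))) /
  sum((x_toy - mean(x_toy))^2)

# Same slope from the sample covariance and variance:
# both divide by (n - 1), which cancels in the ratio
b1_cov <- cov(x_toy, y_toy) / var(x_toy)

# Slope reported by lm()
b1_lm <- unname(coef(lm(y_toy ~ x_toy))[2])

all.equal(b1_sums, b1_cov)
all.equal(b1_sums, b1_lm)
```

All three agree up to floating-point rounding, which is why the numerator and denominator are only “like” a covariance and a variance: the missing 1/(n - 1) factors cancel.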

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

x <- sub$runtime
y <- sub$imdb_score

xbar <- mean(x)
ybar <- mean(y)

num <- sum((x - xbar) * (y - ybar))
den <- sum((x - xbar)^2)

b1_hat <- num / den
b1_hat

Q3 — Compute $\hat\beta_0$ (intercept) by hand

Compute the intercept estimator:

\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

Return the numeric value b0_hat.

Info

Conceptually:

- The fitted regression line must pass through the point $(\bar{x}, \bar{y})$.
- Once you have the slope, the intercept is whatever value makes the line go through that mean point.
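
One way to see why the line passes through the mean point, sketched under the usual least-squares setup (the line is chosen to minimize the sum of squared residuals): at the minimizing values, the derivative with respect to the intercept is zero, so the residuals sum to zero.

\[
\begin{aligned}
\sum_{i=1}^{n}\left(y_i-\hat\beta_0-\hat\beta_1 x_i\right) &= 0 \\
n\bar y - n\hat\beta_0 - n\hat\beta_1 \bar x &= 0 \\
\hat\beta_0 &= \bar y - \hat\beta_1 \bar x
\end{aligned}
\]

Rearranged, the last line reads $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$: plugging $\bar x$ into the fitted line returns $\bar y$.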

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

x <- sub$runtime
y <- sub$imdb_score

xbar <- mean(x)
ybar <- mean(y)

b1_hat <- sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)

b0_hat <- ybar - b1_hat * xbar
b0_hat

Q4 — Cross-check $(\hat\beta_0, \hat\beta_1)$ using summary(lm)

Fit the same model as Q1 and extract coefficients from:

summary(fit)$coefficients

Return c(b0 = ..., b1 = ...).

Info

Most regression summaries include a coefficient table where:

- each row corresponds to a coefficient (intercept + predictor)
- the main estimate column gives the fitted coefficient values

Your goal is to extract the two estimates for this model.
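
For orientation, the sketch below shows what the coefficient table looks like for an lm fit on made-up data (not from titles.csv). The names toy, toy_fit, and toy_tab are hypothetical. The table is a numeric matrix, so rows and columns can be picked out by name.

```r
# Toy data (hypothetical, not from titles.csv)
toy <- data.frame(
  x = c(60, 75, 90, 105, 120),
  y = c(5.8, 6.1, 6.0, 6.6, 6.9)
)
toy_fit <- lm(y ~ x, data = toy)

# The coefficient table is a numeric matrix
toy_tab <- summary(toy_fit)$coefficients

rownames(toy_tab)  # "(Intercept)" "x"
colnames(toy_tab)  # "Estimate" "Std. Error" "t value" "Pr(>|t|)"

# The "Estimate" column carries the same values as coef()
all.equal(toy_tab[, "Estimate"], coef(toy_fit))
```

Indexing by row and column name, rather than by position, keeps the extraction readable and guards against picking up the wrong column.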

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

tab <- summary(fit)$coefficients

b0 <- tab["(Intercept)", "Estimate"]
b1 <- tab["runtime", "Estimate"]

c(b0 = b0, b1 = b1)

Q5 — Point prediction by hand (using b0, b1)

Using the same filtered movie subset and the fitted line, compute the predicted IMDB score when:

$x_0 = 90$ minutes

Return a single numeric value: yhat_90.

Info

A point prediction on a fitted line is the model’s estimated mean response at a chosen $x_0$. Conceptually: “plug $x_0$ into the fitted line.”
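
Written out for this question, with $\hat\beta_0$ and $\hat\beta_1$ standing for the fitted intercept and slope, the plug-in step is:

\[
\hat y(x_0) = \hat\beta_0 + \hat\beta_1 x_0
\quad\Longrightarrow\quad
\hat y(90) = \hat\beta_0 + 90\,\hat\beta_1
\]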

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)
b <- coef(fit)

x0 <- 90
yhat_90 <- unname(b[1] + b[2] * x0)  # drop the inherited "(Intercept)" name
yhat_90

Q6 — Point prediction using predict()

Use the fitted model to predict the IMDB score at:

$x_0 = 90$ minutes

This time, use predict() and a newdata data frame.

Return a single numeric value: yhat_90.

Info

Built-in prediction methods require:

- a fitted model object
- a newdata data frame with the predictor column named exactly as in the model
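
The naming requirement is easy to trip over, so here is a small sketch on made-up data (not from titles.csv). The column names minutes, score, and runtime_min are hypothetical, and the example assumes no object called minutes exists in the calling environment.

```r
# Toy data (hypothetical, not from titles.csv)
toy <- data.frame(
  minutes = c(60, 75, 90, 105, 120),
  score   = c(5.8, 6.1, 6.0, 6.6, 6.9)
)
toy_fit <- lm(score ~ minutes, data = toy)

# Column name matches the predictor in the formula: works
ok <- predict(toy_fit, newdata = data.frame(minutes = 100))

# Same value by plugging into the fitted line
by_hand <- unname(coef(toy_fit)[1] + coef(toy_fit)[2] * 100)
all.equal(unname(ok), by_hand)

# Column name does not match: predict() cannot find `minutes`
# (assuming nothing named `minutes` exists in the calling environment),
# so the error message is captured instead of a prediction
bad <- tryCatch(
  predict(toy_fit, newdata = data.frame(runtime_min = 100)),
  error = function(e) conditionMessage(e)
)
bad
```

Here ok is a numeric prediction, while bad holds the text of an error message.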

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

new <- data.frame(runtime = 90)
yhat_90 <- unname(predict(fit, newdata = new))  # drop the row name "1"
yhat_90

Q7 — Compare two predictions (difference in fitted values)

Using the fitted model, compute:

\[\hat{y}(120) - \hat y(80)\]

Return a single numeric value: diff_hat.

Info

This question is about comparing fitted values at two predictor values. Conceptually: “how much does the model expect the response to change when x changes from 80 to 120?”
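
Because the fitted line is straight, the intercept cancels in the difference and only the slope matters. A short worked version, writing the fitted line as $\hat y(x) = \hat\beta_0 + \hat\beta_1 x$:

\[
\hat y(120) - \hat y(80)
= (\hat\beta_0 + 120\,\hat\beta_1) - (\hat\beta_0 + 80\,\hat\beta_1)
= 40\,\hat\beta_1
\]

So diff_hat should equal 40 times the fitted slope, which gives a quick sanity check on the result.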

Preview


df <- read.csv("titles.csv")

sub <- subset(df,
  type == "MOVIE" &
    is.finite(runtime) &
    is.finite(imdb_score) &
    runtime > 0
)

fit <- lm(imdb_score ~ runtime, data = sub)

yhat <- predict(fit, newdata = data.frame(runtime = c(80, 120)))
diff_hat <- unname(yhat[2] - yhat[1])  # drop the row name "2"
diff_hat
