Tutorial 02: Descriptive Statistics

Q1 — Five-number summary

Return the five-number summary (Min, Q1, Median, Q3, Max) of the Ozone column in the airquality dataset. Then return its IQR.

Info

The five-number summary (Min, Q1, Median, Q3, Max) quickly shows center and spread. The IQR = Q3 − Q1 is robust to outliers and underlies boxplot fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR). You’ll use base R functions that compute these directly for a numeric vector.
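One possible approach is sketched below, assuming fivenum() and IQR() are the base R functions intended. The variable names (ozone, five, ozone_iqr, fences) are illustrative only. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles that quantile() and IQR() use by default.

```r
# Ozone contains missing values, so NAs have to be handled explicitly.
ozone <- airquality$Ozone

# fivenum() returns Min, lower hinge, Median, upper hinge, Max.
# Its na.rm argument already defaults to TRUE; it is spelled out for clarity.
five <- fivenum(ozone, na.rm = TRUE)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")

# IQR() does not drop NAs by default, so na.rm = TRUE is required here.
ozone_iqr <- IQR(ozone, na.rm = TRUE)

# Boxplot fences from the text above, built from quantile()-based quartiles.
q <- quantile(ozone, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
fences <- c(lower = q[1] - 1.5 * ozone_iqr, upper = q[2] + 1.5 * ozone_iqr)

five
ozone_iqr
fences
```

summary(airquality$Ozone) is another option; it reports quantile()-based quartiles together with the mean and the NA count.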

Preview

Run the code below to see a preview of the dataset.

Q2 — Mean, variance, and sd

Compute the mean, variance, and standard deviation of the Temp column in the airquality dataset.

Info

Mean, variance, and standard deviation summarize center and spread. Variance and SD are in squared and original units respectively. Real-world data often has NAs; make sure your summary ignores them appropriately.
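To see why the NA handling matters, here is a small sketch on a made-up vector (x is hypothetical and not part of the tutorial data). It also shows that the standard deviation is the square root of the variance.

```r
# A tiny illustrative vector with one missing value (not tutorial data).
x <- c(2, 4, NA, 6)

mean(x)                       # NA: a single missing value poisons the result
mean(x, na.rm = TRUE)         # 4: mean of 2, 4, 6

v <- var(x, na.rm = TRUE)     # 4: sample variance, in squared units
s <- sd(x, na.rm = TRUE)      # 2: standard deviation, in original units

all.equal(s, sqrt(v))         # TRUE: sd is the square root of the variance
```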

Preview

Run the code below to see a preview of the dataset.

c(
  mean = mean(airquality$Temp, na.rm = TRUE),
  var  = var(airquality$Temp,  na.rm = TRUE),
  sd   = sd(airquality$Temp,   na.rm = TRUE)
)

Q3 — Grouped summary by Month

Compute the mean Ozone by Month using base R. Return a named numeric vector (names are months).
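A minimal sketch of one base R approach, assuming split() plus sapply() is acceptable here; tapply() would work too but returns a one-dimensional array rather than a plain vector. The name ozone_by_month is illustrative.

```r
# split() groups the Ozone values by Month; sapply() applies mean() to each group.
# na.rm = TRUE drops the missing Ozone readings within each month.
ozone_by_month <- sapply(
  split(airquality$Ozone, airquality$Month),
  mean,
  na.rm = TRUE
)

# A named numeric vector; the names are the month numbers as strings.
ozone_by_month
```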

Preview

Run the code below to see a preview of the dataset.

Q4 — ggplot Boxplot ordered by median (ChickWeight)

We now move on to the ChickWeight dataset, which records an experiment on the effect of diet on the early growth of chicks. Return a ggplot object that draws boxplots of weight by Diet, with the Diet levels reordered by median weight. Fill in the blanks.

Info

You can reorder categories by a statistic (like the median) right inside aes() to make comparisons meaningful. In ggplot2, this is commonly done with reorder() (or forcats helpers). The layer for boxplots is geom_boxplot().

library(ggplot2)

ggplot(
  ChickWeight,
  # reorder() sorts the Diet levels by the median weight within each diet
  aes(x = reorder(factor(Diet), weight, FUN = median), y = weight)
) +
  geom_boxplot()
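To check what reorder() does outside of a plot, the sketch below applies it directly to the Diet column and inspects the result. The names diet_by_median and medians are illustrative.

```r
# Reorder the Diet levels by the median weight within each diet.
diet_by_median <- reorder(ChickWeight$Diet, ChickWeight$weight, FUN = median)

# The new level order, lowest median first.
levels(diet_by_median)

# Per-diet medians, listed in the reordered level order (non-decreasing).
medians <- tapply(ChickWeight$weight, diet_by_median, median)
medians
```

The same levels are kept; only their order changes, and that order is what ggplot2 uses to place the boxes along the x axis.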

Q5 — Histogram with Tukey fences

Show a histogram of ChickWeight$weight and add vertical lines at Q1, Q3, and the Tukey fences Q1 − 1.5·IQR and Q3 + 1.5·IQR.

Info

Tukey fences are boundaries used to flag potential outliers in a dataset. The lower fence is Q1 − 1.5·IQR and the upper fence is Q3 + 1.5·IQR; data points falling outside these fences are treated as potential outliers.

Preview

Here is a preview of the dataset:

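library(ggplot2)

# Quartiles of weight; names = FALSE keeps them as a plain numeric vector
qs  <- quantile(ChickWeight$weight, c(0.25, 0.75), names = FALSE)
iqr <- diff(qs)             # IQR = Q3 - Q1
lo  <- qs[1] - 1.5 * iqr    # lower Tukey fence
hi  <- qs[2] + 1.5 * iqr    # upper Tukey fence

ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(bins = 12) +
  geom_vline(xintercept = c(lo, qs[1], qs[2], hi))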
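As a follow-up sketch, the fences can also be used to pull out the flagged observations directly. The names w, q, fence_lo, fence_hi, outliers, and n_out are illustrative and not part of the exercise.

```r
w <- ChickWeight$weight

# Quartiles and Tukey fences, computed as in the exercise.
q        <- quantile(w, c(0.25, 0.75), names = FALSE)
fence_lo <- q[1] - 1.5 * diff(q)
fence_hi <- q[2] + 1.5 * diff(q)

# Observations outside the fences are the potential outliers.
outliers <- w[w < fence_lo | w > fence_hi]
n_out    <- length(outliers)

n_out
range(w)   # compare the data range with c(fence_lo, fence_hi)
```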

Q6 — Histogram of Weight (ggplot2)

Make a histogram of weight with about 12 bins. Store it in p_hist.

Info

Use the built-in ChickWeight dataset. Plot the distribution of weight as a histogram with about 12 bins and assign the plot to p_hist. Return a ggplot object (no printing required).
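A minimal sketch of one way to do this, assuming ggplot2 is installed and that bins = 12 is an acceptable reading of "about 12 bins":

```r
library(ggplot2)

# Map weight to x; geom_histogram() counts observations in 12 equal-width bins.
p_hist <- ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(bins = 12)
```

Because the plot is only assigned, nothing is drawn until p_hist is printed.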

Preview

Here is a preview of the dataset:

Q7 — Boxplot of weight by Diet (ggplot2)

Make a boxplot of weight grouped by Diet (treat Diet as categorical). Store the plot in p_box.

Preview

Here is a preview of the dataset:

library(ggplot2)

p_box <- ggplot(ChickWeight, aes(x = factor(Diet), y = weight)) +
  geom_boxplot()
p_box

Q8 — Scatter with smoother: Time vs weight (ggplot2)

Build a scatterplot from ChickWeight mapping Time → x and weight → y, and add a smoother on top of the points. Store in p_scatter.

Preview

library(ggplot2)

p_scatter <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point() +
  # Smoothed trend line; se = FALSE hides the confidence band
  geom_smooth(se = FALSE)
p_scatter