Tutorial 02: Descriptive Statistics

Q1 — Five-number summary

Return the five-number summary for the Ozone column of the airquality dataset, then return its IQR.

Info

The five-number summary (Min, Q1, Median, Q3, Max) quickly shows center and spread. The IQR = Q3 − Q1 is robust to outliers and underlies the boxplot fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR). You’ll use base R functions that compute these directly for a numeric vector.
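As a minimal sketch of the quantities described above, the block below computes a five-number summary and the IQR for a small made-up vector `x` (hypothetical values, not taken from airquality). It assumes R's default quantile definition, which is also what IQR() uses.

```r
# Hypothetical toy data (not from airquality); the last value is missing.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 40, NA)

# quantile() stops with an error on NAs unless na.rm = TRUE is set.
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")

# IQR() returns Q3 - Q1 under the same default quantile definition.
iqr <- IQR(x, na.rm = TRUE)

five
iqr
```

For this vector Q1 is 4 and Q3 is 10, so the IQR is 6; the value 40 sits far above Q3 + 1.5·IQR = 19.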


Call the base function that prints the key summary statistics (its output also includes the mean and a count of NAs), then on the next line return the robust spread of the same column; remember to handle missing values.

summary(airquality$Ozone)
IQR(airquality$Ozone, na.rm = TRUE)

Q2 — Mean, variance, and sd

Compute the mean, variance, and standard deviation of the Temp column in the airquality dataset.

Info

Mean, variance, and standard deviation summarize center and spread. The variance is in squared units, while the standard deviation is in the original units of the data. Real-world data often contains NAs, so make sure your summaries handle them appropriately.
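To illustrate the two points above, the sketch below uses a made-up vector `temps` (hypothetical values, not taken from airquality). It shows how a single NA propagates through mean() unless na.rm = TRUE is set, and that sd() is the square root of var().

```r
# Hypothetical toy data (not from airquality); the last value is missing.
temps <- c(60, 62, 64, NA)

mean(temps)                      # NA: a missing value propagates by default

m <- mean(temps, na.rm = TRUE)
v <- var(temps, na.rm = TRUE)    # sample variance (divides by n - 1), squared units
s <- sd(temps, na.rm = TRUE)     # standard deviation, original units

c(mean = m, var = v, sd = s)
all.equal(s, sqrt(v))            # sd is the square root of the variance
```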


Construct a single named vector with three entries; each entry calls the corresponding base summary function with missing-value handling.

c(
  mean = mean(airquality$Temp, na.rm = TRUE),
  var  = var(airquality$Temp,  na.rm = TRUE),
  sd   = sd(airquality$Temp,   na.rm = TRUE)
)

Q3 — Grouped summary by Month

Compute the mean Ozone by Month using base R. Return a named numeric vector (names are months).


Look up tapply(), which applies a function to the subsets of a vector defined by a grouping variable.
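A minimal sketch of how tapply() behaves, using two made-up vectors (`ozone` and `month` here are hypothetical toy data, not columns of airquality): the first argument holds the values, the second defines the groups, and any extra arguments are passed on to the function.

```r
# Hypothetical toy data: one value per observation and a grouping label.
ozone <- c(10, 20, NA, 40, 50)
month <- c(5, 5, 6, 6, 6)

# Split ozone by month, then apply mean() to each piece.
# Extra arguments (here na.rm = TRUE) are passed on to mean().
by_month <- tapply(ozone, month, mean, na.rm = TRUE)

by_month
names(by_month)   # the group labels become the names
```

Group 5 averages 10 and 20; group 6 averages 40 and 50 after the NA is dropped.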

tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)

Q4 — ggplot Boxplot ordered by median (ChickWeight)

We now move on to the ChickWeight dataset, which records an experiment on the effect of diet on the early growth of chicks. Return a ggplot object that draws boxplots of weight by Diet, with Diet reordered by median weight.

Info

You can reorder categories by a statistic (such as the median) directly inside aes(), which makes the groups easier to compare. In ggplot2 this is commonly done with reorder() (or the forcats helpers). The layer for boxplots is geom_boxplot().
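To see what reorder() does before using it inside aes(), the sketch below applies it to a small made-up data frame (`df`, `g`, and `v` are hypothetical names used only for illustration). It needs only base R.

```r
# Hypothetical toy data: three groups with different medians.
df <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  v = c(10, 12, 1, 3, 5, 7)
)

levels(df$g)   # default level order is alphabetical

# reorder() sorts the levels by FUN applied to v within each group
# (group medians here: a = 11, b = 2, c = 6).
g_sorted <- reorder(df$g, df$v, FUN = median)

levels(g_sorted)   # ascending by median: "b" "c" "a"
```

Because ggplot2 lays out a discrete axis in level order, the reordered factor places the group with the lowest median first.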

Map Diet to x and weight to y; reorder x by median(weight) inside aes(…). Use geom_boxplot() for the layer.

library(ggplot2)
ggplot(
  ChickWeight,
  aes(x = reorder(factor(Diet), weight, FUN = median), y = weight)
) +
  geom_boxplot()

Q5 — Histogram with Tukey fences

Show a histogram of ChickWeight$weight and add vertical lines at Q1, Q3, and the Tukey fences Q1 − 1.5·IQR and Q3 + 1.5·IQR.

Info

Tukey fences are boundaries used to flag potential outliers in a dataset: the lower fence is Q1 − 1.5·IQR and the upper fence is Q3 + 1.5·IQR. Data points that fall outside these fences are treated as potential outliers.
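The fence rule can be sketched as a small helper. `tukey_fences()` below is a hypothetical function written for this illustration (it is not part of base R or ggplot2), and `w` is made-up data rather than ChickWeight.

```r
# Hypothetical helper: returns the lower and upper Tukey fences of x.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Hypothetical toy data with one unusually large value.
w <- c(35, 50, 60, 70, 80, 90, 100, 110, 300)

fences <- tukey_fences(w)
fences

# Values outside the fences are flagged as potential outliers.
outliers <- w[w < fences["lower"] | w > fences["upper"]]
outliers
```

Here Q1 = 60 and Q3 = 100, so the fences are 0 and 160 and only the value 300 is flagged.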


geom_vline() adds vertical lines to the plot at the x positions given in xintercept.

library(ggplot2)
# Q1 and Q3 of weight
qs <- quantile(ChickWeight$weight, c(.25, .75))
# Tukey fences: diff(qs) is the IQR (Q3 - Q1)
lo <- qs[1] - 1.5 * diff(qs)
hi <- qs[2] + 1.5 * diff(qs)
ggplot(ChickWeight, aes(weight)) +
  geom_histogram(bins = 12) +
  geom_vline(xintercept = c(qs[1], qs[2], lo, hi))

Q6 — Histogram of Weight (ggplot2)

Make a histogram of weight with about 12 bins and store it in p_hist.

Info

Use the built-in ChickWeight dataset. Plot the distribution of weight as a histogram with about 12 bins and assign the plot to p_hist. Return a ggplot object (no printing required).


Create a ggplot using ChickWeight with weight mapped on x. Add a histogram layer with ~12 bins. Assign the plot object to p_hist.

library(ggplot2)
p_hist <- ggplot(ChickWeight, aes(weight)) +
  geom_histogram(bins = 12)
p_hist

Q7 — Boxplot of weight by Diet (ggplot2)

Make a boxplot of weight grouped by Diet (treat Diet as categorical). Store the plot in p_box.


Build a ggplot from ChickWeight. Map factor(Diet) to x and weight to y. Add a boxplot layer. Assign the result to p_box.
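The factor() call matters because ggplot2 draws one box per level of a discrete x variable. The base R check sketched below shows how Diet is stored: in the copy of ChickWeight that ships with R it is already a factor, so factor(Diet) acts as a harmless safeguard. `diet_num` is a hypothetical numeric copy created only to show the conversion.

```r
# ChickWeight ships with R (datasets package).
is.factor(ChickWeight$Diet)   # TRUE: Diet is stored as a factor
levels(ChickWeight$Diet)      # the four diet labels

# Hypothetical numeric copy of Diet, used only for illustration.
diet_num <- as.integer(as.character(ChickWeight$Diet))
is.factor(diet_num)           # FALSE: a plain integer vector
nlevels(factor(diet_num))     # factor() turns it back into 4 groups
```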

library(ggplot2)
p_box <- ggplot(ChickWeight, aes(x = factor(Diet), y = weight)) +
  geom_boxplot()
p_box

Q8 — Scatter with smoother: Time vs weight (ggplot2)

Build a scatterplot from ChickWeight mapping Time → x and weight → y, add a smoother, and store the plot in p_scatter.


Build a ggplot from ChickWeight mapping Time → x and weight → y. Add points, then a smoother (hide the SE ribbon). Assign the final plot to p_scatter.
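When no method is given, geom_smooth() picks a smoother automatically and prints a message naming it; for data with fewer than 1,000 rows, such as ChickWeight, that is a loess curve. As an optional variation on the exercise, and not the required answer, the sketch below names the method and formula explicitly and adds a straight-line fit for comparison. The object names `p_loess` and `p_lm` are hypothetical.

```r
library(ggplot2)

# Explicit loess smoother; naming method and formula avoids the
# informational message about the automatically chosen smoother.
p_loess <- ggplot(ChickWeight, aes(Time, weight)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Straight-line fit for comparison (ordinary least squares).
p_lm <- ggplot(ChickWeight, aes(Time, weight)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_loess
p_lm
```

In both plots se = FALSE hides the standard-error ribbon, as the exercise requests.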

library(ggplot2)
p_scatter <- ggplot(ChickWeight, aes(Time, weight)) +
  geom_point() +
  geom_smooth(se = FALSE)
p_scatter