Chapter 1 Introduction

1.1 Foundations

Intuitively, statistics can be considered the science of uncertainty. Formally,

Definition 1.1 (Statistics) Statistics is the science of collecting, classifying, summarizing, analyzing and interpreting data.

In statistics, researchers are often interested in the characteristics of a large group. They observe behaviours, patterns, and trends in order to draw conclusions about that group, and they need data to support those conclusions.

Definition 1.2 (Population) In statistics, a population is the entire group of individuals, items, or measurements that share a characteristic of interest and about which conclusions are to be drawn.

A population can be a set of existing objects, such as all people in Canada, or a hypothetical set of objects, such as the set of all possible hands in a game of poker. Populations are often large. Additional examples are given below.

Example 1.1

  • All students at the University of Toronto
  • All residents of Mississauga

Collecting data from every object in a population is often not feasible. Instead, researchers usually select a finite number of objects from the population to study.

Definition 1.3 (Sample) A sample is a subset of a population.

A sample is usually much smaller than the population and is therefore easier to manage. Researchers can often take measurements on every unit in a sample.

Example 1.2

  • A random sample of 50 students at the University of Toronto.
  • A random sample of 100 residents of Mississauga.
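As a rough illustration of drawing a sample from a population, the sketch below selects 50 units at random from a made-up population. The population size and the use of ID numbers are assumptions chosen only for illustration; they do not describe any real institution.

```python
import random

# Hypothetical population: student ID numbers 1..60000 (made-up size).
population = list(range(1, 60001))

# Fixed seed so the draw is reproducible from run to run.
rng = random.Random(2024)

# Simple random sample of 50 units, drawn without replacement.
sample = rng.sample(population, k=50)

print(len(sample))                      # 50
print(len(set(sample)))                 # 50, no unit is selected twice
print(set(sample) <= set(population))   # True, the sample is a subset of the population
```

Sampling without replacement is used here so that each selected unit appears only once, which matches the idea of a sample as a subset of the population.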

Definition 1.4 (Unit) A unit (or observational unit, element) is the smallest entity in a study from which data are collected or measured. It is the fundamental building block of the data set.

Example 1.3

  • In a sample of 10 students, a unit would be one of the students in this sample.

Each unit has certain characteristics we can measure. When we record the values of all characteristics for a particular unit, we obtain an observation.

Definition 1.5 (Observation) An observation is the collection of measured values for all variables from a single unit. In a data table, one row typically represents one observation.

Example 1.4

  • If a student’s age, height, and weight are recorded, the set of these three values for that student is one observation.

The characteristics themselves are called variables. Variables can be numerical (e.g., height) or categorical (e.g., eye colour) and describe different aspects of the units we are studying.

Definition 1.6 (Variable) A variable is a characteristic or attribute that can be measured or recorded for each unit, and can vary from unit to unit. In a data table, one column typically represents one variable.

Example 1.5

  • If we record the age of each student in a study, “age” is the variable.
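The relationship between units, observations, and variables can be sketched as a small data table. In the example below each dictionary plays the role of one row (one observation for one unit) and each key plays the role of one column (one variable). The students and their values are invented for illustration.

```python
# Hypothetical measurements for three students (values are made up).
# Each dict is one observation (one row); each key is one variable (one column).
observations = [
    {"age": 19, "height_cm": 170.0, "eye_colour": "brown"},
    {"age": 21, "height_cm": 158.5, "eye_colour": "blue"},
    {"age": 20, "height_cm": 181.2, "eye_colour": "brown"},
]

# The variables are the column names shared by every observation.
variables = list(observations[0].keys())

# One observation: all recorded values for a single unit.
first_observation = observations[0]

# One variable: the same characteristic read across all units.
ages = [row["age"] for row in observations]

print(variables)          # ['age', 'height_cm', 'eye_colour']
print(first_observation)  # {'age': 19, 'height_cm': 170.0, 'eye_colour': 'brown'}
print(ages)               # [19, 21, 20]
```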

In any statistical investigation, we first define the population—the full set of units we want to study. A population unit is simply one member of that full set.

Definition 1.7 (Population Unit) A population unit is a unit that belongs to the entire population — the complete set of units that share the characteristic(s) of interest in a study.

Example 1.6

  • If the population is all students at a university, then one student at that university is a population unit.

Because studying an entire population is often impractical, we select a smaller group—a sample—to represent it. A sample unit is one member of that smaller group.

Definition 1.8 (Sample Unit) A sample unit is a unit that is part of the sample — the subset of the population selected for measurement or observation in the study.

Example 1.7

  • If we select 50 students from a university for a survey, one of these selected students is a sample unit.

By clearly distinguishing these terms, we can describe our data precisely, communicate our methods effectively, and avoid confusion when interpreting results.

Certain characteristics of a population, and of a sample drawn from it, are of particular interest to us.

Definition 1.9 (Parameter) A parameter is a numerical quantity of a population which summarizes a characteristic of the population.

Some examples of parameters are introduced in this section. These terms will be discussed in more detail in later chapters.

Example 1.8

  • Population mean \(\mu\)
  • Population standard deviation \(\sigma\)
  • Population proportion \(p\)

The true value of a parameter is usually unknown since it is extremely difficult to take measurements on every unit in a population. Therefore we often use measurements from a sample to estimate the value of a parameter.

Definition 1.10 (Statistic) A statistic is a numerical quantity of a sample which summarizes a characteristic of the sample.

Because we can measure every unit in a sample, the numerical values of statistics can usually be calculated and are therefore known.

Some examples of statistics are given below. As with the parameters in Example 1.8, these terms will be discussed in more detail in later chapters.

Example 1.9

  • Sample mean \(\bar{x}\)
  • Sample standard deviation \(s\)
  • Sample proportion \(\hat{p}\)
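To make the distinction between parameters and statistics concrete, the following sketch simulates a population that is small enough to measure in full, which is rarely possible in practice. The heights, the population size of 10,000, and the 180 cm cut-off are all assumptions made for illustration. The formulas behind these quantities are left to later chapters; here the standard library's functions are used as they are.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population: simulated heights (cm) of 10,000 people.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameters: computed from every unit in the population.
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)
p = sum(h > 180 for h in population) / len(population)  # proportion taller than 180 cm

# Statistics: computed from a sample of 100 units.
sample = rng.sample(population, k=100)
x_bar = statistics.fmean(sample)       # sample mean
s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)
p_hat = sum(h > 180 for h in sample) / len(sample)      # sample proportion

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}, s = {s:.2f}")
print(f"p = {p:.3f}, p_hat = {p_hat:.3f}")
```

The statistics will generally be close to, but not equal to, the corresponding parameters, and a different sample would give different values.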

Statistics can be broken down into two broad categories, descriptive statistics and inferential statistics, which are given in Definitions 1.11 and 1.12 below.

Definition 1.11 (Descriptive Statistics) Descriptive statistics are numerical and graphical methods used to analyze, interpret, and represent data.

Definition 1.12 (Inferential Statistics) Inferential statistics use information from a sample to make generalizations about a larger population.

1.2 Types of data

Data can be classified into two main categories: quantitative and qualitative data.

Definition 1.13 (Quantitative data) Data which can be measured numerically.

Example 1.10

  • Height
  • Weight
  • Age
  • Temperature

Definition 1.14 (Qualitative data) Data which cannot be measured numerically. Qualitative data fall into categories instead.

Example 1.11

  • Favourite colour out of red, green, blue
  • Favourite flavour of ice-cream out of chocolate, vanilla, strawberry
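One practical consequence of the distinction is that the two types of data are summarized differently. The sketch below, using invented values, averages a quantitative variable and counts the categories of a qualitative one.

```python
from collections import Counter
import statistics

# Hypothetical data for four people (values are made up).
heights_cm = [170.0, 158.5, 181.5, 165.0]            # quantitative: measured numerically
favourite_colour = ["red", "blue", "red", "green"]   # qualitative: falls into categories

# Arithmetic is meaningful for quantitative data.
mean_height = statistics.fmean(heights_cm)

# For qualitative data we count how many units fall into each category.
colour_counts = Counter(favourite_colour)

print(mean_height)             # 168.75
print(colour_counts["red"])    # 2
print(colour_counts["green"])  # 1
```

Taking the mean of the colours would be meaningless, which is why the categories are counted instead.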

1.3 Introduction to inferential statistics

As discussed in Section 1.1, the numerical values of parameters are usually unknown, whereas the numerical values of statistics are calculated and known.

Figure 1.1 shows the relationship between parameters from a population and statistics from a sample.

Figure 1.1: Illustration of parameters in a population and statistics in a sample

The aim of statistical inference is to produce estimators of the population parameters and to examine, by means of probability statements, how accurate these estimators are. We also quantify our confidence that a statistic is representative of the corresponding parameter, and we use statistics to test hypotheses about the population parameters. Inference is discussed in more detail in Section 4.1.
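As an informal preview of what a probability statement about accuracy might look like, the sketch below repeatedly draws samples from a simulated population and records how often the sample mean lands within 2 cm of the population mean. The population, the sample size of 100, the 2 cm tolerance, and the 1,000 repetitions are all assumptions made for illustration. In a real study the population mean is unknown, so this kind of check is only possible in a simulation.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical population: simulated heights (cm) of 10,000 people.
population = [rng.gauss(170, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # the parameter, known here only because we simulated it

# Draw many samples of size 100 and compute the sample mean of each.
n_repeats = 1_000
sample_means = [
    statistics.fmean(rng.sample(population, k=100))
    for _ in range(n_repeats)
]

# Share of samples whose mean is within 2 cm of the population mean.
within_2 = sum(abs(m - mu) <= 2 for m in sample_means) / n_repeats

print(f"Share of sample means within 2 cm of mu: {within_2:.2f}")
```

A large share indicates that, for this population and sample size, the sample mean is usually a close estimate of the population mean.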