Chapter 10 Analysis of Categorical Data
We now consider applications of the very important chi-square statistic, first proposed by Karl Pearson in 1900. To get some idea of why Pearson proposed his chi-square statistic, we begin with the Binomial case. That is, let \(Y_{1}\) be \(\operatorname{Bin}\left(n, p_{1}\right)\), where \(0<p_{1}<1\). According to the Central Limit Theorem,
\[ Z=\frac{Y_{1}-n p_{1}}{\sqrt{n p_{1}\left(1-p_{1}\right)}} \]
has a distribution that is approximately \(N(0,1)\) for large \(n\), particularly when \(n p_{1} \geq 5\) and \(n\left(1-p_{1}\right) \geq 5\).
Thus, it is not surprising that \(Q_{1}=Z^{2}\) is approximately \(\chi^{2}(1)\). If we let \(Y_{2}=n-Y_{1}\) and \(p_{2}=1-p_{1}\), we see that \(Q_{1}\) may be written as
\[ Q_{1}=\frac{\left(Y_{1}-n p_{1}\right)^{2}}{n p_{1}\left(1-p_{1}\right)}=\frac{\left(Y_{1}-n p_{1}\right)^{2}}{n p_{1}}+\frac{\left(Y_{2}-n p_{2}\right)^{2}}{n p_{2}} \]
or
\[ Q_{1}=\sum_{i=1}^{2} \frac{\left(Y_{i}-n p_{i}\right)^{2}}{n p_{i}} \]
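As a quick numerical check of this identity, the following R sketch evaluates both forms of \(Q_{1}\) (the values \(n=50\), \(p_{1}=0.3\), and \(y_{1}=18\) are arbitrary choices for illustration):

```r
# Verify that the two forms of Q1 agree for illustrative values
n  <- 50
p1 <- 0.3
y1 <- 18                     # an arbitrary observed count
y2 <- n - y1
p2 <- 1 - p1
q1_z2  <- (y1 - n * p1)^2 / (n * p1 * (1 - p1))
q1_sum <- (y1 - n * p1)^2 / (n * p1) + (y2 - n * p2)^2 / (n * p2)
c(q1_z2, q1_sum)
## [1] 0.8571429 0.8571429
```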
To generalize, we let an experiment have \(k\) (instead of only two) mutually exclusive and exhaustive outcomes, say, \(A_{1}, A_{2}, \ldots, A_{k}\). Let \(p_{i}=P\left(A_{i}\right)\), so that \(\sum_{i=1}^{k} p_{i}=1\). The experiment is repeated \(n\) independent times, and we let \(Y_{i}\) represent the number of times the experiment results in \(A_{i}, i=1,2, \ldots, k\). The joint distribution of \(Y_{1}, Y_{2}, \ldots, Y_{k-1}\) is a generalization of the Binomial distribution.
The joint pmf of \(Y_{1}, Y_{2}, \ldots, Y_{k-1}\) is
\[ f\left(y_{1}, y_{2}, \ldots, y_{k-1}\right)=\frac{n!}{y_{1}!\, y_{2}! \cdots y_{k}!}\, p_{1}^{y_{1}} p_{2}^{y_{2}} \cdots p_{k}^{y_{k}}, \quad y_{i} \geq 0 \text{ for } i=1,2,\ldots,k-1, \quad y_{1}+y_{2}+\cdots+y_{k-1} \leq n, \] where \(y_{k}=n-\left(y_{1}+y_{2}+\cdots+y_{k-1}\right)\) and \(p_{k}=1-\left(p_{1}+p_{2}+\cdots+p_{k-1}\right)\). We say that \(Y_{1}, Y_{2}, \ldots, Y_{k-1}\) have a multinomial distribution with parameters \(n\) and \(p_{1}, p_{2}, \ldots, p_{k-1}\).
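To illustrate, R's dmultinom() evaluates this pmf directly; the counts and probabilities below are arbitrary illustrative values:

```r
# P(Y1 = 2, Y2 = 3) with n = 10 and k = 3 cells, so y3 = 10 - 2 - 3 = 5
dmultinom(c(2, 3, 5), size = 10, prob = c(0.2, 0.3, 0.5))
## [1] 0.08505
```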
Pearson then extended \(Q_{1}\) to an expression that we denote by \(Q_{k-1}\),
\[ Q_{k-1}=\sum_{i=1}^{k} \frac{\left(Y_{i}-n p_{i}\right)^{2}}{n p_{i}} \]
He argued that \(Q_{k-1}\) has an approximate chi-square distribution with \(k-1\) degrees of freedom, in much the same way that we argued \(Q_{1}\) is approximately \(\chi^{2}(1)\).
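Pearson's approximation can be checked empirically. The following R sketch (the choices \(k=4\), \(n=200\), and the cell probabilities are arbitrary illustrative values) simulates many multinomial samples, computes \(Q_{k-1}\) for each, and checks that the upper 5% point of \(\chi^{2}(k-1)\) is exceeded about 5% of the time:

```r
# Simulate Q_{k-1} under a multinomial model and compare to chi-square(k-1)
set.seed(1)                                    # for reproducibility
k <- 4
n <- 200
p <- c(0.1, 0.2, 0.3, 0.4)                     # arbitrary cell probabilities
counts <- rmultinom(10000, size = n, prob = p) # k x 10000 matrix of cell counts
q <- colSums((counts - n * p)^2 / (n * p))     # Q_{k-1} for each simulated sample
mean(q > qchisq(0.95, df = k - 1))             # should be close to 0.05
```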
10.1 Multinomial Response Model
- When a categorical (qualitative) variable of interest results in one of two responses (\(k=2\)) (e.g., “Yes” or “No” to having an epidural to reduce pain during childbirth, or “Yes” or “No” to still breastfeeding at six months), the data, called counts, can be analyzed with a Binomial probability distribution.
- Categorical data with more than two levels (classes, categories) often result from a multinomial model.
- The Binomial model is a multinomial model with \(k=2\).
Properties of the Multinomial Model
- There are \(n\) identical trials.
- There are \(k\) possible outcomes to each trial (in a two-way table, \(k = r \times c\), where \(r\) is the number of rows and \(c\) is the number of columns). These outcomes are sometimes called classes, categories, or cells.
- The probabilities of the \(k\) outcomes, denoted by \(p_{1}, p_{2}, p_{3}, \ldots, p_{k}\), where \(p_{1}+p_{2}+p_{3}+\ldots+p_{k}=1\), remain the same from trial to trial.
- The trials are independent.
- The random variables of interest are the cell counts \(n_{1}, n_{2}, \ldots, n_{k}\) of the number of observations that fall into each of the \(k\) categories.
Example 10.1 Suppose that customers can purchase one of the three brands of milk at a supermarket. In a study to determine whether one brand is preferred over another, a record is made of a sample of \(n=300\) milk purchases. The data are shown below. Do the data provide sufficient evidence to indicate a preference for one or more brands?
Table 10.1: Observed Counts for Example 10.1

Brand 1 | Brand 2 | Brand 3 | Total |
---|---|---|---|
78 | 117 | 105 | 300 |
Step 1. State Hypotheses.
If all the brands are equally preferred, then the probability that a purchaser will choose any one brand is the same as the probability of choosing any other - that is, \(p_{1}=p_{2}=p_{3}=1 / 3\). Therefore, the null hypothesis of “no preference” is
\[ H_{0}: p_{1}=p_{2}=p_{3}=1 / 3 \]
If \(p_{1}, p_{2}\), and \(p_{3}\) are not all equal, the brands are not equally preferred; in other words, the purchasers must have a preference for one (or possibly two) brands. The alternative hypothesis is
\[ H_{a}: p_{1}, p_{2}, \text { and } p_{3} \text { are not all equal } \]
Therefore, we seek a test statistic that will detect a lack of fit of the observed cell counts to our hypothesized (null) expected cell counts based on the hypothesized cell probabilities. These expected values are:
\[ \begin{aligned} & E\left(n_{1}\right)=n p_{1}=(300)\left(\frac{1}{3}\right)=100 \\ & E\left(n_{2}\right)=n p_{2}=(300)\left(\frac{1}{3}\right)=100 \\ & E\left(n_{3}\right)=n p_{3}=(300)\left(\frac{1}{3}\right)=100 \end{aligned} \]
Step 2. Computing test statistic.
Table 10.2: Expected Counts for Example 10.1
Brand 1 | Brand 2 | Brand 3 | Total |
---|---|---|---|
100 | 100 | 100 | 300 |
The test statistic for comparing the observed and expected cell counts (and, consequently, testing \(H_{0}: p_{1}=p_{2}=p_{3}=1/3\)) is the \(X^{2}\) statistic:
\[ \begin{align} \chi^{2} &=\sum_{\text {all cells }} \frac{(\text { observed }- \text { expected })^{2}}{\text { expected }}\\ & =\sum_{\text {all cells }} \frac{\left(n_{i}-E\left(n_{i}\right)\right)^{2}}{E\left(n_{i}\right)} \\ &=\frac{(78-100)^{2}}{100}+\frac{(117-100)^{2}}{100}+\frac{(105-100)^{2}}{100} \\ &=4.84+2.89+0.25=7.98 \end{align} \]
Step 3. Finding P-value.
To find the P-value, compare \(X^{2}\) with critical values from the chi-square distribution with degrees of freedom one fewer than the number of values the brand can take. That's \(3-1=2\) degrees of freedom. From Table D, we see that \(X^{2}=7.98\) falls between the critical values for right-tail areas 0.02 and 0.01 of the chi-square distribution with 2 degrees of freedom. So the P-value of \(X^{2}=7.98\) satisfies \(0.01<P\text{-value}<0.02\).
Step 4. Conclusion.
If we use \(\alpha=0.05\), then since our P-value \(<\alpha=0.05\), we reject \(H_{0}\) at the \(5\%\) significance level. We conclude that the three brands of milk are not equally preferred.
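The hand calculations above can be verified in R with chisq.test(), which defaults to equal cell probabilities:

```r
milk <- c(78, 117, 105)   # observed purchase counts for the three brands
chisq.test(milk)          # test statistic, degrees of freedom, and P-value
## 
##  Chi-squared test for given probabilities
## 
## data:  milk
## X-squared = 7.98, df = 2, p-value = 0.0185
```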
Example 10.2 Raymond Weil is about to come out with a new watch and wants to find out whether people have special preferences of the color of the watchband, or whether all four colors under consideration are equally preferred. A random sample of 80 prospective watch buyers is selected. Each person is shown the watch with four different band colors and asked to state his or her preference. The results (observed counts) are given below.
Table 10.3: Observed Counts for Example 10.2
Tan | Brown | Maroon | Black | Total |
---|---|---|---|---|
12 | 40 | 8 | 20 | 80 |
Use \(\alpha=0.01\).
Step 1. State Hypotheses.
If all the colors are equally preferred, then the probability that a buyer will choose any one color is the same as the probability of choosing any other - that is, \(p_{1}=p_{2}=p_{3}=p_{4}=1/4\). Therefore, the null hypothesis of “no preference” is
\[ H_{0}: p_{1}=p_{2}=p_{3}=p_{4}=1 / 4 \]
If \(p_{1}, p_{2}, p_{3}\) and \(p_{4}\) are not all equal, the colors are not equally preferred. The alternative hypothesis is
\[ H_{a}: p_{1}, p_{2}, p_{3} \text { and } p_{4} \text { are not all equal } \]
Therefore, we seek a test statistic that will detect a lack of fit of the observed cell counts to our hypothesized (null) expected cell counts based on the hypothesized cell probabilities. These expected values are:
\[ \begin{aligned} & E\left(n_{1}\right)=n p_{1}=(80)\left(\frac{1}{4}\right)=20 \\ & E\left(n_{2}\right)=n p_{2}=(80)\left(\frac{1}{4}\right)=20 \\ & E\left(n_{3}\right)=n p_{3}=(80)\left(\frac{1}{4}\right)=20 \\ & E\left(n_{4}\right)=n p_{4}=(80)\left(\frac{1}{4}\right)=20 \end{aligned} \]
Table 10.4: Expected Counts for Example 10.2
Tan | Brown | Maroon | Black | Total |
---|---|---|---|---|
20 | 20 | 20 | 20 | 80 |
Step 2. Computing test statistic.
The test statistic for comparing the observed and expected cell counts (and, consequently, testing \(H_{0}: p_{1}=p_{2}=p_{3}=p_{4}=1/4\)) is the \(X^{2}\) statistic:
\[\begin{aligned} \chi^{2} &=\sum_{\text {all cells }} \frac{(\text { observed }- \text { expected })^{2}}{\text { expected }} \\ &= \sum_{\text {all cells }} \frac{\left(n_{i}-E\left(n_{i}\right)\right)^{2}}{E\left(n_{i}\right)} \\ &= \frac{(12-20)^{2}}{20}+\frac{(40-20)^{2}}{20}+\frac{(8-20)^{2}}{20}+\frac{(20-20)^{2}}{20} \\ &= 64/20+400/20+144/20+0=30.4 \end{aligned}\]

Step 3. Finding P-value.
To find the P-value, compare \(X^{2}\) with critical values from the chi-square distribution with degrees of freedom one fewer than the number of values the color can take. That's \(4-1=3\) degrees of freedom. From Table D, we see that \(X^{2}=30.4\) is greater than the greatest entry in the \(\mathrm{df}=3\) row, which is the critical value for tail area 0.0005. The P-value is therefore smaller than 0.0005.
Step 4. Conclusion.
Since our P-value \(<\alpha=0.01\), we reject the null hypothesis that all four colors are equally likely to be chosen. Some colors are evidently preferred over others; indeed, the P-value is very small.
```r
x <- c(12, 40, 8, 20)    # observed counts for Tan, Brown, Maroon, Black
chisq.test(x)            # test statistic, degrees of freedom, and P-value
chisq.test(x)$expected   # expected counts under H0
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 30.4, df = 3, p-value = 1.137e-06
## [1] 20 20 20 20
```
Exercise 10.1 Consider a multinomial experiment involving \(n=150\) trials and \(k=5\) cells. The observed frequencies resulting from the experiment are shown in the accompanying table, and the null hypothesis to be tested is as follows:
\[ H_{0}: p_{1}=0.1, p_{2}=0.2, p_{3}=0.3, p_{4}=0.2, p_{5}=0.2 \]
Test the hypothesis at the \(1 \%\) significance level.
Table 10.5: Observed Counts for Exercise 10.1
Cell | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Frequency | 12 | 32 | 42 | 36 | 28 |
Step 1. State Hypotheses.
The null hypothesis is given above; the alternative is \(H_{a}\): at least one \(p_{i}\) differs from its specified value. We seek a test statistic that will detect a lack of fit of the observed cell counts to the expected cell counts based on the hypothesized cell probabilities. These expected values are:
\[ \begin{aligned} & E\left(n_{1}\right)=n p_{1}=(150)\left(\frac{1}{10}\right)=15 \\ & E\left(n_{2}\right)=n p_{2}=(150)\left(\frac{2}{10}\right)=30 \\ & E\left(n_{3}\right)=n p_{3}=(150)\left(\frac{3}{10}\right)=45 \\ & E\left(n_{4}\right)=n p_{4}=(150)\left(\frac{2}{10}\right)=30 \\ & E\left(n_{5}\right)=n p_{5}=(150)\left(\frac{2}{10}\right)=30 \end{aligned} \]
Table 10.6: Expected Counts for Exercise 10.1
Cell | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Frequency | 15 | 30 | 45 | 30 | 30 |
Step 2. Computing test statistic.
\[\begin{aligned} \chi^{2} &=\sum_{\text {all cells }} \frac{(\text { observed }- \text { expected })^{2}}{\text { expected }}\\ &=\sum_{\text {all cells }} \frac{ \big(n_{i}-E(n_{i}) \big)^{2}}{E\left(n_{i}\right)} \\ &=\frac{(12-15)^{2}}{15}+\frac{(32-30)^{2}}{30}+\frac{(42-45)^{2}}{45}+\frac{(36-30)^{2}}{30}+\frac{(28-30)^{2}}{30} \\ &=9/15+4/30+9/45+36/30+4/30=2.2667 \end{aligned}\]

Step 3. Finding P-value.
To find the P-value, compare \(X^{2}\) with critical values from the chi-square distribution with degrees of freedom one fewer than the number of cells. That's \(5-1=4\) degrees of freedom. From Table D, we see that \(X^{2}=2.2667\) is smaller than the smallest entry (5.39) in the \(\mathrm{df}=4\) row, which is the critical value for tail area 0.25. The P-value is therefore greater than 0.25.
Step 4. Conclusion.
Since our P-value \(>0.25>\alpha=0.01\), we do not have enough evidence to reject the null hypothesis \(H_{0}\); that is, there is not enough evidence to infer that at least one \(p_{i}\) differs from its specified value.
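The same result follows from chisq.test() by supplying the hypothesized probabilities through the p argument:

```r
obs <- c(12, 32, 42, 36, 28)                     # observed cell frequencies
chisq.test(obs, p = c(0.1, 0.2, 0.3, 0.2, 0.2))  # hypothesized probabilities
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 2.2667, df = 4, p-value = 0.6868
```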
10.2 The Chi-square Test for Goodness of Fit
A categorical variable has \(k\) possible outcomes, with probabilities \(p_{1}, p_{2}, \ldots, p_{k}\). That is, \(p_{i}\) is the probability of the \(i\)th outcome. We have \(n\) independent observations from this categorical variable. To test the null hypothesis that the probabilities have specified values
\[\begin{aligned} H_{0}:& \, p_{1} = p_{hyp_1}, \, p_{2} = p_{hyp_2}, \, \ldots, \, p_{k} = p_{hyp_k} \\ H_{a}:& \, \text{at least one } p_{i} \text{ differs from its hypothesized value,} \end{aligned}\]use the chi-square statistic
\[ \chi^{2}=\sum \frac{(\text { observed count }- \text { expected count })^{2}}{\text { expected count }} \]
The expected counts are equal to \(n p_{hyp_i}\). The P-value is the area to the right of \(X^{2}\) under the density curve of the chi-square distribution with \(k-1\) degrees of freedom.
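Rather than bracketing the P-value with a table, the upper-tail area can be computed directly in R with pchisq(); for instance, for the statistic of Example 10.2:

```r
# Area to the right of X^2 = 30.4 under the chi-square density with 3 df
pchisq(30.4, df = 3, lower.tail = FALSE)  # ~1.137e-06, matching chisq.test() above
```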
10.3 Sample Size Assumption
We must have enough data for these methods to work. We usually check the Expected Cell Frequency Condition: we should expect to see at least five individuals in each cell. This condition plays the same role as the requirement that \(np\) and \(n(1-p)\) both be at least 10 when we tested proportions.
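In R, the expected counts can be inspected directly before trusting the chi-square approximation; this sketch reuses the watchband data from Example 10.2:

```r
# Flag any cell whose expected frequency falls below five
expected <- chisq.test(c(12, 40, 8, 20))$expected  # expected counts under H0
any(expected < 5)                                  # TRUE would signal a violation
## [1] FALSE
```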