mailing an educational pamphlet; and
calling each resident.
Then, the city randomly selects \(200\) households and randomly assigns them to one of three categories:
receiving the pamphlet;
receiving a phone call;
the control group (no form of intervention).
Finally, the city will use the results of this test to decide what is the best way to ask their residents to recycle more.
Can you guess which hypothesis test they will use to make this decision? A Chi-square test for independence!
Chi-Square Test of Independence Definition
Occasionally, you want to know if there is a relationship between two categorical variables.
Think of it this way:
If you know something about one variable, can you use that information to learn about the other variable?
You can use a Chi-square test of independence to do just that.
A Chi-square \( (\chi^{2}) \) test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.
If there is a relationship between the two categorical variables, then knowing the value of one variable tells you something about the value of the other variable.
If there is no relationship between the two categorical variables, then they are independent.
Assumptions for a Chi-Square Test of Independence
All the Pearson Chi-square tests (for independence, homogeneity, and goodness of fit) share the same basic assumptions. The main difference is how the assumptions apply in practice. To use a Chi-square test of independence, the following must be true:
The two variables must be categorical.
Groups must be mutually exclusive, i.e., each observation belongs to exactly one group, and the sample must be randomly selected.
Expected counts must be at least \(5\).
Observations must be independent.
Continuing from the introductory example, three months after the city's intervention methods are tested, they look at the outcome and put the data into a contingency table. The groups that must be mutually exclusive are the subgroups: (Recycles-Pamphlet), (Does Not Recycle-Control), etc.
Table 1. Contingency table, Chi-square test for independence.
Contingency Table |
---|
Intervention | Recycles | Does Not Recycle | Row Totals |
Pamphlet | 46 | 18 | 56 |
Phone Call | 47 | 19 | 77 |
Control | 49 | 21 | 67 |
Column Totals | 142 | 58 | \(n =\) 200 |
Null Hypothesis and Alternative Hypothesis for a Chi-Square Test of Independence
When testing for independence, you start by assuming that the two variables are independent, then check whether the data give you enough evidence to conclude that they are not.
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]
Notice that the Chi-square test for independence makes no claims about the kind of relationship between the two categorical variables, only whether a relationship exists.
Replacing “Variable A” and “Variable B” with the variables in the city recycling example, you get:
Your population is all the households in your city.
- Null Hypothesis \[ \begin{align}H_{0}: &\text{“if a household recycles” and} \\&\text{“the type of intervention received”} \\&\text{are not related.}\end{align} \]
- Alternative Hypothesis \[ \begin{align}H_{a}: &\text{“if a household recycles” and} \\&\text{“the type of intervention received”} \\&\text{are related.}\end{align} \]
Expected Frequencies of a Chi-Square Test of Independence
As with other Chi-square tests, a Chi-square test of independence works by comparing your observed and expected frequencies. You calculate expected frequencies using the contingency table. So, the expected frequency for row \(r\) and column \(c\) is given by the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
where,
\(E_{r,c}\) is the expected frequency for population (or, row) \(r\) at level (or, column) \(c\) of the categorical variable,
\(r\) is the number of populations, which is also the number of rows in a contingency table,
\(c\) is the number of levels of the categorical variable, which is also the number of columns in a contingency table,
\(n_{r}\) is the number of observations from population (or, row) \(r\),
\(n_{c}\) is the number of observations from level (or, column) \(c\) of the categorical variable, and
\(n\) is the total sample size.
Continuing with the city recycling example:
Your city now calculates the expected frequencies using the formula above and the contingency table.
- \(E_{1,1}=\frac{56 \cdot 142}{200} = 39.76\)
- \(E_{1,2}=\frac{56 \cdot 58}{200} = 16.24\)
- \(E_{2,1}=\frac{77 \cdot 142}{200} = 54.67\)
- \(E_{2,2}=\frac{77 \cdot 58}{200} = 22.33\)
- \(E_{3,1}=\frac{67 \cdot 142}{200} = 47.57\)
- \(E_{3,2}=\frac{67 \cdot 58}{200} = 19.43\)
Table 2. Contingency table with observed frequencies and expected frequencies, Chi-square test for independence.
Contingency Table with Observed (O) Frequencies and Expected Frequencies (E) |
---|
Intervention | Recycles | Does Not Recycle | Row Totals |
Pamphlet | \(O_{1,1} = 46\), \(E_{1,1} = 39.76\) | \(O_{1,2} = 18\), \(E_{1,2} = 16.24\) | 56 |
Phone Call | \(O_{2,1} = 47\), \(E_{2,1} = 54.67\) | \(O_{2,2} = 19\), \(E_{2,2} = 22.33\) | 77 |
Control | \(O_{3,1} = 49\), \(E_{3,1} = 47.57\) | \(O_{3,2} = 21\), \(E_{3,2} = 19.43\) | 67 |
Column Totals | 142 | 58 | \(n =\) 200 |
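The expected-frequency formula can be sketched in a few lines of Python. This is only an illustration: the dictionary names and the helper `expected_frequency` are made up for this example, and the sketch takes the row and column totals exactly as printed in Table 1 rather than re-deriving them from the cell counts.

```python
# Row and column totals exactly as printed in Table 1.
row_totals = {"Pamphlet": 56, "Phone Call": 77, "Control": 67}
col_totals = {"Recycles": 142, "Does Not Recycle": 58}
n = sum(row_totals.values())  # total sample size


def expected_frequency(n_r, n_c, n):
    """E_{r,c} = n_r * n_c / n (hypothetical helper name)."""
    return n_r * n_c / n


# One expected frequency per (row, column) cell of the table.
expected = {
    (row, col): expected_frequency(n_r, n_c, n)
    for row, n_r in row_totals.items()
    for col, n_c in col_totals.items()
}

for (row, col), value in expected.items():
    print(f"{row:<10} | {col:<16} | {value:.2f}")
```

Each printed value should match the corresponding \(E_{r,c}\) entry in Table 2.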
Degrees of Freedom for a Chi-Square Test of Independence
As in the Chi-square test for homogeneity, you are comparing two variables and need the contingency table to add up in both dimensions.
The formula for the degrees of freedom is the same in both the homogeneity and independence tests:
\[ k = (r - 1) (c - 1) \]
where,
\(k\) is the degrees of freedom,
\(r\) is the number of populations, which is also the number of rows in a contingency table, and
\(c\) is the number of levels of the categorical variable, which is also the number of columns in a contingency table.
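As a quick check of the formula, here is a minimal Python sketch. The function name `degrees_of_freedom` is an assumption made for this example.

```python
def degrees_of_freedom(num_rows, num_cols):
    """k = (r - 1)(c - 1) for a contingency table with r rows and c columns."""
    if num_rows < 2 or num_cols < 2:
        raise ValueError("a contingency table needs at least 2 rows and 2 columns")
    return (num_rows - 1) * (num_cols - 1)


# The recycling example has 3 interventions (rows) and 2 outcomes (columns).
print(degrees_of_freedom(3, 2))
```

The formula is symmetric in \(r\) and \(c\), so it does not matter which variable you put in the rows.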
Formula for a Chi-Square Test of Independence
The formula (also called a test statistic) for a Chi-square test of independence is:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
where,
\(O_{r,c}\) is the observed frequency for population \(r\) at level \(c\), and
\(E_{r,c}\) is the expected frequency for population \(r\) at level \(c\).
The Chi-square test statistic measures how far your observed frequencies are from the frequencies you would expect if the two variables were unrelated.
Steps to Calculate the Test Statistic for a Chi-Square Test of Independence
Step \(1\): Create a Table
Using your contingency table, create a table that separates your observed and expected values into two columns.
Table 3. Table of observed frequencies and expected frequencies, Chi-square test for independence.
Table of Observed and Expected Frequencies |
---|
Intervention | Outcome | Observed Frequency | Expected Frequency |
Pamphlet | Recycles | 46 | 39.76 |
Pamphlet | Does Not Recycle | 18 | 16.24 |
Phone Call | Recycles | 47 | 54.67 |
Phone Call | Does Not Recycle | 19 | 22.33 |
Control | Recycles | 49 | 47.57 |
Control | Does Not Recycle | 21 | 19.43 |
Step \(2\): Subtract Expected Frequencies from Observed Frequencies
Add a new column to your table called “O – E”. In this column, put the result of subtracting the expected frequency from the observed frequency.
Table 4. Table of observed frequencies, expected frequencies, and O – E, Chi-square test for independence.
Table of Observed, Expected, and O-E Frequencies |
---|
Intervention | Outcome | Observed Frequency | Expected Frequency | O – E |
Pamphlet | Recycles | 46 | 39.76 | 6.24 |
Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 |
Phone Call | Recycles | 47 | 54.67 | -7.67 |
Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 |
Control | Recycles | 49 | 47.57 | 1.43 |
Control | Does Not Recycle | 21 | 19.43 | 1.57 |
Decimals in this table are rounded to \(2\) digits.
Step \(3\): Square the Results from Step \(2\)
Add a new column to your table called “(O – E)²”. In this column, put the result of squaring the results from the previous column.
Table 5. Table of observed frequencies, expected frequencies, O – E, and (O – E)², Chi-square test for independence.
Table of Observed, Expected, O-E, and (O-E)² Frequencies |
---|
Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² |
Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 |
Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 |
Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 |
Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 |
Control | Recycles | 49 | 47.57 | 1.43 | 2.04 |
Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 |
Decimals in this table are rounded to \(2\) digits.
Step \(4\): Divide the Results from Step \(3\) by the Expected Frequencies
Add a new column to your table called “(O – E)²/E”. In this column, put the result of dividing the results from the previous column by their expected frequencies.
Table 6. Table of observed frequencies, expected frequencies, O – E, (O – E)², and (O – E)²/E, Chi-square test for independence.
Table of Observed, Expected, O-E, (O-E)², and (O-E)²/E Frequencies |
---|
Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 | 0.98 |
Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 | 0.19 |
Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 | 1.08 |
Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 | 0.50 |
Control | Recycles | 49 | 47.57 | 1.43 | 2.04 | 0.04 |
Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 | 0.13 |
Decimals in this table are rounded to \(2\) digits.
Step \(5\): Add the Results from Step \(4\) to get the Chi-Square Test Statistic
Finally, add up all the values in the last column of your table to calculate your Chi-square test statistic:
\[ \begin{align}\chi^{2} &= \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \\&= 0.9793 + 0.1907 + 1.0761 + 0.4966 + 0.04299 + 0.1269 \\&= 2.91259\end{align} \]
The formula here uses the non-rounded numbers from the tables above to get a more accurate answer.
The Chi-square test statistic for the Chi-square test of independence in the city recycling example is:
\[ \chi^{2} = 2.91259 \]
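The five steps above amount to one short loop. The following Python sketch uses the observed and expected frequencies as listed in Table 3. The function name `chi_square_statistic` is an assumption made for this example.

```python
# Observed counts and expected frequencies as listed in Table 3, in the order
# Pamphlet, Phone Call, Control (each as Recycles, Does Not Recycle).
observed = [46, 18, 47, 19, 49, 21]
expected = [39.76, 16.24, 54.67, 22.33, 47.57, 19.43]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells (steps 2 to 5 in one pass)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.4f}")
```

The result should agree with the hand calculation to about two decimal places. Any small difference comes from rounding in the intermediate columns.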
Steps to Perform a Chi-Square Test of Independence
If your calculated test statistic is large enough, then you can draw the conclusion that the observed frequencies are not what you would expect if the variables are indeed unrelated. But what is considered “large enough”?
To determine whether the test statistic is large enough to reject the null hypothesis, you compare the test statistic to a critical value from a Chi-square distribution table. This act of comparison is the heart of the Chi-square test of independence.
Follow the \(6\) steps below to perform a Chi-square test of independence.
Note that steps \(1, 2\) and \(3\) were outlined in detail above.
Step \(1\): State the Hypotheses
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]
Step \(2\): Calculate the Expected Frequencies
Use your contingency table to calculate the expected frequencies using the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
Step \(3\): Calculate the Chi-Square Test Statistic
Use the formula for a Chi-square test of independence to calculate the Chi-square test statistic:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
Step \(4\): Find the Critical Chi-Square Value
You have two options for finding the critical value:
use a Chi-square distribution table, or
use a critical value calculator.
Either way, there are two pieces of information you need to know to find the critical value:
the degrees of freedom, \(k\), given by the formula:
\[ k = (r - 1) (c - 1) \]
and the significance level, \( \alpha \), which is usually \( 0.05 \).
Referring back to the city recycling example, find the critical value.
Find the critical Chi-square value.
- Calculate the degrees of freedom.
- Using the contingency table for the city recycling example, recall that there are \(3\) intervention groups (the rows of the contingency table) and \(2\) outcome groups (the columns of the contingency table). So, the degrees of freedom are:\[ \begin{align} k &= (r - 1) (c - 1) \\&= (3 - 1) (2 - 1) \\&= 2 \text{ degrees of freedom}\end{align} \]
- Choose a significance level.
- Typically, a significance level of \( 0.05 \) is used, so use that here.
- Using either a Chi-square distribution table or a critical value calculator, determine the critical value.
- According to the Chi-square distribution table below, for \(k = 2\) and \( \alpha = 0.05 \), the critical value is:\[ \chi^{2} \text{critical value} = 5.99 \]
Table 7. Percentage points of the Chi-square distribution, Chi-square test for independence.
Percentage Points of the Chi-Square Distribution |
---|
Degrees of Freedom (k) | Probability of a Larger Value of \(\chi^{2}\); Significance Level (α) |
k | 0.99 | 0.95 | 0.90 | 0.75 | 0.50 | 0.25 | 0.10 | 0.05 | 0.01 |
1 | 0.000 | 0.004 | 0.016 | 0.102 | 0.455 | 1.32 | 2.71 | 3.84 | 6.63 |
2 | 0.020 | 0.103 | 0.211 | 0.575 | 1.386 | 2.77 | 4.61 | 5.99 | 9.21 |
3 | 0.115 | 0.352 | 0.584 | 1.212 | 2.366 | 4.11 | 6.25 | 7.81 | 11.34 |
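For the special case of \(k = 2\) degrees of freedom, you can reproduce the table row without a calculator. A Chi-square distribution with \(2\) degrees of freedom is an exponential distribution with mean \(2\), so the probability of a larger value than \(x\) is \(e^{-x/2}\), and the critical value is \(-2 \ln \alpha\). The sketch below relies on that shortcut. It does not work for other degrees of freedom, where you would still use the table or a statistics library. The function name is an assumption made for this example.

```python
import math


def chi2_critical_value_df2(alpha):
    """Critical value for k = 2 only: solve exp(-x / 2) = alpha for x."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return -2.0 * math.log(alpha)


for alpha in (0.10, 0.05, 0.01):
    value = chi2_critical_value_df2(alpha)
    print(f"alpha = {alpha:.2f}: critical value = {value:.2f}")
```

The three values printed correspond to the \(0.10\), \(0.05\), and \(0.01\) columns of the row for \(k = 2\).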
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value
Now for the moment of truth! Is your test statistic large enough to reject the null hypothesis? Compare it to the critical value you just found to find out.
Again, continuing with the city recycling example, compare the test statistic to the critical value.
The Chi-square test statistic is: \( \chi^{2} = 2.91259 \)
The critical value is: \( 5.99 \)
The Chi-square test statistic is less than the critical value.
Step \(6\): Decide Whether to Reject the Null Hypothesis
Finally, decide whether to reject the null hypothesis.
If the Chi-square value is greater than the critical value, then the difference between the observed and expected frequencies is significant; \( (p < \alpha) \)
If the Chi-square value is less than the critical value, then the difference between the observed and expected frequencies is not significant; \( (p > \alpha) \)
Decide whether to reject the null hypothesis for the city recycling example.
The Chi-square value is less than the critical value.
- So, the city does not reject the null hypothesis that whether a household recycles and the type of intervention it receives are unrelated.
- There is not a significant difference between the observed frequencies and the expected frequencies. The data do not provide evidence that the proportion of households that recycle differs between interventions.
The city concludes that it has no evidence that its interventions affect whether households choose to recycle.
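The decision rule for the critical-value approach can be written as a small function. This is a minimal sketch, and `decide` is a hypothetical helper name rather than part of any library.

```python
def decide(chi2_statistic, critical_value):
    """Apply the critical-value decision rule to a Chi-square test statistic."""
    if chi2_statistic > critical_value:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Recycling example: test statistic 2.91259, critical value 5.99.
print(decide(2.91259, 5.99))
```

Note that the function never "accepts" the null hypothesis. A small test statistic only means the data are consistent with independence.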
Using Critical Value VS Using P-Value
In the steps to perform a Chi-square test of independence, you calculated and used the critical value to decide whether to reject the null hypothesis.
A critical value of a Chi-square test of independence is a value that is compared to the value of the test statistic, so you can determine whether to reject the null hypothesis.
It is important to know, however, that there is another option you can use: the \(p\)-value.
The \(p\)-value of a Chi-square test of independence is associated with the calculated value of its test statistic. It is the area to the right of the test statistic under the Chi-square curve with \(k\) degrees of freedom.
The image below sums up the critical value approach vs. the \(p\)-value approach.
Figure 1. A diagram showing how you can use either a \(p\)-value or a critical value to determine whether to reject the null hypothesis.
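To see the \(p\)-value approach on the city recycling example, you can use the fact that for \(k = 2\) degrees of freedom the area to the right of \(x\) under the Chi-square curve is exactly \(e^{-x/2}\). The sketch below relies on that special case only, and the function name is an assumption made for this example. For other degrees of freedom you would use a \(p\)-value calculator or a statistics library.

```python
import math


def chi2_p_value_df2(chi2_statistic):
    """Upper-tail area for k = 2 only: P(X >= x) = exp(-x / 2)."""
    if chi2_statistic < 0:
        raise ValueError("the test statistic cannot be negative")
    return math.exp(-chi2_statistic / 2.0)


alpha = 0.05
p_value = chi2_p_value_df2(2.91259)  # test statistic from the recycling example
print(f"p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The \(p\)-value is larger than \( \alpha = 0.05 \), so this approach leads to the same decision as comparing the test statistic to the critical value.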
Chi-Square Test for Independence – Example
Many jobseekers are applying via online job boards these days. Sites like Indeed, ZipRecruiter, and CareerBuilder have thousands of enticing posts inviting people to apply. It’s never been easier for fraudulent recruiters to lure in unsuspecting and vulnerable people.
Are fraudulent recruiters more prevalent in some industries than others?
The contingency table below contains real counts of fraudulent and non-fraudulent online job openings, by industry. These are the \(10\) most common industries in the dataset. This is quite a big dataset, but a good representation of what statisticians do in the real world.
Table 8. Contingency table, Chi-square test for independence.
Contingency Table |
---|
Industry | Real | Fraud | Row Totals |
Information Technology | 1702 | 32 | 1734 |
Computer Software | 1371 | 5 | 1376 |
Internet | 1062 | 0 | 1062 |
Marketing / Advertising | 783 | 45 | 828 |
Education | 822 | 0 | 822 |
Financial Services | 744 | 35 | 779 |
Healthcare | 446 | 51 | 497 |
Consumer Services | 334 | 24 | 358 |
Telecom. | 316 | 26 | 342 |
Oil / Energy | 178 | 109 | 287 |
Column Totals | 7758 | 327 | \(n=\) 8085 |
Solution:
Step \(1\): State the Hypotheses.
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“if a job post is real” and “the job industry” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“if a job post is real” and “the job industry” are related.} \]
Step \(2\): Calculate Expected Frequencies.
- Using the contingency table above and the formula:\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n}, \]create a table that has your calculated expected frequencies.
Table 9. Table of expected frequencies, Chi-square test for independence.
Table of Expected Frequencies |
---|
Industry | Real | Fraud | Row Totals |
Information Technology | 1663.8679 | 70.1321 | 1734 |
Computer Software | 1320.3473 | 55.6527 | 1376 |
Internet | 1019.0471 | 42.9529 | 1062 |
Marketing / Advertising | 794.5113 | 33.4887 | 828 |
Education | 788.754 | 33.246 | 822 |
Financial Services | 747.4931 | 31.5069 | 779 |
Healthcare | 476.8987 | 20.1013 | 497 |
Consumer Services | 343.5206 | 14.4794 | 358 |
Telecom. | 328.1677 | 13.8323 | 342 |
Oil / Energy | 275.3922 | 11.6078 | 287 |
Column Totals | 7758 | 327 | \(n =\) 8085 |
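With \(20\) cells, calculating the expected frequencies by hand gets tedious. The following Python sketch builds the whole table of expected frequencies from the observed counts. The variable names are assumptions made for this example, and the counts are copied from the contingency table above.

```python
# Observed (real, fraud) counts per industry, copied from the contingency table.
observed = {
    "Information Technology": (1702, 32),
    "Computer Software": (1371, 5),
    "Internet": (1062, 0),
    "Marketing / Advertising": (783, 45),
    "Education": (822, 0),
    "Financial Services": (744, 35),
    "Healthcare": (446, 51),
    "Consumer Services": (334, 24),
    "Telecom.": (316, 26),
    "Oil / Energy": (178, 109),
}

# Row totals, column totals, and overall sample size.
row_totals = {industry: sum(counts) for industry, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
n = sum(col_totals)

# E_{r,c} = n_r * n_c / n for every cell.
expected = {
    industry: tuple(row_totals[industry] * n_c / n for n_c in col_totals)
    for industry in observed
}

for industry, (real, fraud) in expected.items():
    print(f"{industry:<24} | {real:10.4f} | {fraud:8.4f}")
```

Because the totals are computed from the cell counts, a mistyped row or column total cannot creep into the expected frequencies.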
Step \(3\): Calculate the Chi-Square Test Statistic.
- Create a table to hold your calculated values and use the formula:\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]to calculate your test statistic.
Table 10. Chi-square test statistics.
Using a Table to Calculate the Chi-Square Test Statistic |
---|
Industry | Job Post Status | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
Information Technology | Real | 1702 | 1663.868 | 38.132 | 1454.057 | 0.874 |
Information Technology | Fraud | 32 | 70.132 | -38.132 | 1454.057 | 20.733 |
Computer Software | Real | 1371 | 1320.347 | 50.653 | 2565.696 | 1.943 |
Computer Software | Fraud | 5 | 55.653 | -50.653 | 2565.696 | 46.102 |
Internet | Real | 1062 | 1019.047 | 42.953 | 1844.952 | 1.811 |
Internet | Fraud | 0 | 42.953 | -42.953 | 1844.952 | 42.953 |
Marketing / Advertising | Real | 783 | 794.511 | -11.511 | 132.510 | 0.167 |
Marketing / Advertising | Fraud | 45 | 33.489 | 11.511 | 132.510 | 3.957 |
Education | Real | 822 | 788.754 | 33.246 | 1105.297 | 1.401 |
Education | Fraud | 0 | 33.246 | -33.246 | 1105.297 | 33.246 |
Financial Services | Real | 744 | 747.493 | -3.493 | 12.202 | 0.016 |
Financial Services | Fraud | 35 | 31.507 | 3.493 | 12.202 | 0.387 |
Healthcare | Real | 446 | 476.899 | -30.899 | 954.730 | 2.002 |
Healthcare | Fraud | 51 | 20.101 | 30.899 | 954.730 | 47.496 |
Consumer Services | Real | 334 | 343.521 | -9.521 | 90.642 | 0.264 |
Consumer Services | Fraud | 24 | 14.479 | 9.521 | 90.642 | 6.260 |
Telecom. | Real | 316 | 328.168 | -12.168 | 148.053 | 0.451 |
Telecom. | Fraud | 26 | 13.832 | 12.168 | 148.053 | 10.703 |
Oil / Energy | Real | 178 | 275.392 | -97.392 | 9485.241 | 34.443 |
Oil / Energy | Fraud | 109 | 11.608 | 97.392 | 9485.241 | 817.144 |
Decimals in this table are rounded to \(3\) digits.
- Add all the values in the last column of the table above to calculate the test statistic:\[ \begin{align}\chi^{2} &= 0.8739 + 20.7331 + 1.9432 + 46.1019 + 1.8105 \\&+ 42.9529 + 0.1668 + 3.9569 + 1.4013 + 33.246 \\&+ 0.0163 + 0.3873 + 2.0020 + 47.4959 + 0.2639 \\&+ 6.2601 + 0.4512 + 10.7034 + 34.4427 + 817.1437 \\&\approx 1072.353.\end{align} \]
The formula here uses the non-rounded numbers from the table above to get a more accurate answer.
- The Chi-square test statistic is:\[ \chi^{2} \approx 1072.353 .\]
Step \(4\): Find the Critical Chi-Square Value and the \(P\)-Value.
In the real world, a statistician would likely be more interested in the \(p\)-value than in simply reporting whether the result was significant, because the \(p\)-value gives a more specific conclusion. Say you want to be really sure that there is a relationship before you report one, and choose a significance level of \(\alpha = 0.01\).
- Calculate the degrees of freedom, using \(r = 10\) industries (rows) and \(c = 2\) job post statuses (columns): \[ \begin{align}k &= (r - 1)(c - 1) \\&= (10 - 1) (2 - 1) \\&= 9 \cdot 1 \\&= 9 \text{ degrees of freedom}\end{align} \]
- Using a Chi-square distribution table, look at the row for \(9\) degrees of freedom and the column for \(0.01\) significance to find the critical value of \(21.67\).
- To use a \(p\)-value calculator, you need the test statistic and degrees of freedom.
- Plugging the degrees of freedom and the test statistic into a \(p\)-value calculator, you get a \(p\)-value very close to \(0\).
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value.
- The test statistic of about \(1072.353\) is much, much larger than the critical value of \(21.67\), which means you have sufficient evidence to reject the null hypothesis.
- The \(p\)-value is also very low, much less than the significance level, which would also let you reject the null hypothesis.
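If you do not have a \(p\)-value calculator at hand, the Python sketch below recomputes the test statistic from the observed counts and then evaluates the \(p\)-value. It relies on a closed form for the upper-tail area that holds only when the degrees of freedom are odd (here \(k = 9\)), written with the standard library's `math.erfc`. The function names are assumptions made for this example. In practice you would more likely call a statistics library, for instance SciPy's `scipy.stats.chi2.sf`.

```python
import math

# Observed (real, fraud) counts per industry, in the table's row order.
observed = [
    (1702, 32), (1371, 5), (1062, 0), (783, 45), (822, 0),
    (744, 35), (446, 51), (334, 24), (316, 26), (178, 109),
]


def chi_square_statistic(table):
    """Pearson Chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - n_r * n_c / n) ** 2 / (n_r * n_c / n)
        for row, n_r in zip(table, row_totals)
        for obs, n_c in zip(row, col_totals)
    )


def chi2_p_value_odd_df(x, k):
    """Area to the right of x under a Chi-square curve with odd k."""
    if k < 1 or k % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    root = math.sqrt(x)
    normal_tail = math.erfc(root / math.sqrt(2.0))  # two-sided normal tail
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # normal pdf at root
    series, term = 0.0, root  # first term: x^(1/2) / 1
    for r in range(1, (k - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)  # next term: x^(r + 1/2) / (1 * 3 * ... * (2r + 1))
    return normal_tail + 2.0 * density * series


chi2 = chi_square_statistic(observed)
k = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_p_value_odd_df(chi2, k)
print(f"chi-square = {chi2:.2f}, k = {k}, p-value = {p_value:.3e}")
```

As a sanity check, `chi2_p_value_odd_df(21.67, 9)` should come out close to \(0.01\), which matches the critical value found in step \(4\).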
Step \(6\): Decide Whether to Reject the Null Hypothesis.
- It looks like there is a strong relationship between industry and the number of fraudulent recruiters out there.
- Look at the table from step \(2\).
- Here, you can see that the number of fraudulent jobs in the Oil industry is way higher than expected, and by itself contributes enough for you to conclude that industry and recruiter scams are not independent.
Therefore, you can confidently reject the null hypothesis.
Chi-Square Test for Independence – Key takeaways
- A Chi-square test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.
- The following must be true in order to use a Chi-square test of independence:
- The two variables must be categorical.
- Groups must be mutually exclusive, i.e., each observation belongs to exactly one group, and the sample must be randomly selected.
- Expected counts must be at least \(5\).
- Observations must be independent.
- The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.
- The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.
- The expected frequency for row \(r\) and column \(c\) of a Chi-square test of independence is given by the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
- The degrees of freedom for a Chi-square test of independence are given by the formula:
\[ k = (r - 1) (c - 1) \]
- The formula (also called a test statistic) for a Chi-square test of independence is:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]