Chi-Square Test for Homogeneity Definition
When you want to know whether a single categorical variable follows the same probability distribution in two or more populations (like in the movie preference question above), you can use a Chi-square test for homogeneity.
A Chi-square \( (\chi^{2}) \) test for homogeneity is a non-parametric Pearson Chi-square test that you apply to a single categorical variable from two or more different populations to determine whether they have the same distribution.
In this test, you randomly collect data from each population to determine whether the distribution of the categorical variable differs significantly between the \(2\) or more populations.
Conditions for a Chi-Square Test for Homogeneity
All the Pearson Chi-square tests share the same basic conditions. The main difference is how the conditions apply in practice. A Chi-square test for homogeneity requires a categorical variable from at least two populations, and the data needs to be the raw count of members of each category. This test is used to check if the two variables follow the same distribution.
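To be able to use this test, the conditions for a Chi-square test of homogeneity are:
- The variables must be categorical.
- Groups must be mutually exclusive.
- Expected counts must be at least \(5\).
- Observations must be independent.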
Reference the study “Out-of-Hospital Cardiac Arrest in High-Rise Buildings: Delays to Patient Care and Effect on Survival”¹, which was published in the Canadian Medical Association Journal (CMAJ) on April \(5, 2016\).
This study compared where adults live (house or townhouse, \(1^{st}\) or \(2^{nd}\) floor apartment, and \(3^{rd}\) or higher floor apartment) with whether they survived a heart attack (survived or did not survive).
Your goal is to learn if there is a difference in the survival category proportions (i.e., are you more likely to survive a heart attack depending on where you live?) for the \(3\) populations:
- heart attack victims who live in either a house or a townhouse,
- heart attack victims who live on the \(1^{st}\) or \(2^{nd}\) floor of an apartment building, and
- heart attack victims who live on the \(3^{rd}\) or higher floor of an apartment building.
Contingency Table |
---|
Living Arrangement | Survived | Did Not Survive | Row Totals |
House or Townhouse | 217 | 5314 | 5531 |
1st or 2nd Floor Apartment | 35 | 632 | 667 |
3rd or Higher Floor Apartment | 46 | 1650 | 1696 |
Column Totals | 298 | 7596 | \(n =\) 7894 |
Table 1. Table of contingency, Chi-Square test for homogeneity.
Chi-Square Test for Homogeneity: Null Hypothesis and Alternative Hypothesis
The question underlying this hypothesis test is: Do these two variables follow the same distribution?
The hypotheses are formed to answer that question.
- The null hypothesis is that the two variables are from the same distribution.\[ \begin{align}H_{0}: p_{1,1} &= p_{2,1} \text{ AND } \\p_{1,2} &= p_{2,2} \text{ AND } \ldots \text{ AND } \\p_{1,n} &= p_{2,n}\end{align} \]
The null hypothesis requires every single category to have the same probability between the two variables.
- The alternative hypothesis is that the two variables are not from the same distribution, i.e., at least one of the equalities in the null hypothesis is false.\[ \begin{align}H_{a}: p_{1,1} &\neq p_{2,1} \text{ OR } \\p_{1,2} &\neq p_{2,2} \text{ OR } \ldots \text{ OR } \\p_{1,n} &\neq p_{2,n}\end{align} \]
If even one category has a different probability from one variable to the other, then the null hypothesis is false, and a large enough sample will tend to produce a significant result and provide evidence to reject it.
The null and alternative hypotheses in the heart attack survival study are:
The population is people who live in houses, townhouses, or apartments and who have had a heart attack.
- Null Hypothesis\( H_{0}: \) The proportions in each survival category are the same for all \(3\) groups of people.
- Alternative Hypothesis\( H_{a}: \) The proportions in each survival category are not the same for all \(3\) groups of people.
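For this study, the hypotheses can also be sketched in symbols. The notation \(p_{i}\) for the survival probability in group \(i\) (\(1\) = house or townhouse, \(2\) = \(1^{st}\) or \(2^{nd}\) floor apartment, \(3\) = \(3^{rd}\) or higher floor apartment) is an assumption introduced here for illustration only:
\[ \begin{align}H_{0}&: p_{1} = p_{2} = p_{3} \\H_{a}&: p_{i} \neq p_{j} \text{ for at least one pair } i \neq j\end{align} \]
Because there are only two survival categories, equal survival probabilities automatically mean equal probabilities of not surviving, so one equality chain covers both categories.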
Expected Frequencies for a Chi-Square Test for Homogeneity
You must calculate the expected frequencies for a Chi-square test for homogeneity individually for each population at each level of the categorical variable, as given by the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
where,
\(E_{r,c}\) is the expected frequency for population \(r\) at level \(c\) of the categorical variable,
\(r\) indexes the populations, which are the rows of the contingency table,
\(c\) indexes the levels of the categorical variable, which are the columns of the contingency table,
\(n_{r}\) is the number of observations from population \(r\),
\(n_{c}\) is the number of observations from level \(c\) of the categorical variable, and
\(n\) is the total sample size.
Continuing with the heart attack survival study:
Next, you calculate the expected frequencies using the formula above and the contingency table, putting your results into a modified contingency table to keep your data organized.
- \( E_{1,1} = \frac{5531 \cdot 298}{7894} = 208.795 \)
- \( E_{1,2} = \frac{5531 \cdot 7596}{7894} = 5322.205 \)
- \( E_{2,1} = \frac{667 \cdot 298}{7894} = 25.179 \)
- \( E_{2,2} = \frac{667 \cdot 7596}{7894} = 641.821 \)
- \( E_{3,1} = \frac{1696 \cdot 298}{7894} = 64.024 \)
- \( E_{3,2} = \frac{1696 \cdot 7596}{7894} = 1631.976 \)
Table 2. Table of contingency with observed frequencies, Chi-Square test for homogeneity.
Contingency Table with Observed (O) Frequencies and Expected (E) Frequencies |
---|
Living Arrangement | Survived | Did Not Survive | Row Totals |
House or Townhouse | \(O_{1,1}\): 217; \(E_{1,1}\): 208.795 | \(O_{1,2}\): 5314; \(E_{1,2}\): 5322.205 | 5531 |
1st or 2nd Floor Apartment | \(O_{2,1}\): 35; \(E_{2,1}\): 25.179 | \(O_{2,2}\): 632; \(E_{2,2}\): 641.821 | 667 |
3rd or Higher Floor Apartment | \(O_{3,1}\): 46; \(E_{3,1}\): 64.024 | \(O_{3,2}\): 1650; \(E_{3,2}\): 1631.976 | 1696 |
Column Totals | 298 | 7596 | \(n =\) 7894 |
Decimals in the table are rounded to \(3\) digits.
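The expected-frequency calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the study: the function name `expected_frequencies` and the list-of-rows layout are assumptions made here, and the counts are copied from Table 1.

```python
# Observed counts from Table 1: rows are populations,
# columns are (survived, did not survive).
observed = [
    [217, 5314],   # house or townhouse
    [35, 632],     # 1st or 2nd floor apartment
    [46, 1650],    # 3rd or higher floor apartment
]

def expected_frequencies(table):
    """Return E[r][c] = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[n_r * n_c / n for n_c in col_totals] for n_r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 3) for e in row])

# One of the test's conditions: every expected count is at least 5.
meets_condition = all(e >= 5 for row in expected for e in row)
print("All expected counts >= 5:", meets_condition)
```

The printed values should agree with Table 2 up to rounding in the last digit, and the final line checks the expected-count condition before you rely on the test.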
Degrees of Freedom for a Chi-Square Test for Homogeneity
The contingency table for a Chi-square test for homogeneity has two dimensions: the populations (rows) and the levels of the categorical variable (columns). The table needs to add up in both dimensions.
Since you need the rows to add up and the columns to add up, the degrees of freedom is calculated by:
\[ k = (r - 1) (c - 1) \]
where,
\(k\) is the degrees of freedom,
\(r\) is the number of populations, which is also the number of rows in a contingency table, and
\(c\) is the number of levels of the categorical variable, which is also the number of columns in a contingency table.
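As a small sketch, the degrees of freedom can be read straight off the shape of the table of raw counts (totals excluded). The helper name `degrees_of_freedom` is hypothetical, and the counts are those of the heart attack survival study.

```python
def degrees_of_freedom(table):
    """k = (r - 1)(c - 1) for a table with r rows and c columns of raw counts."""
    r = len(table)       # number of populations
    c = len(table[0])    # number of levels of the categorical variable
    return (r - 1) * (c - 1)

observed = [[217, 5314], [35, 632], [46, 1650]]
k = degrees_of_freedom(observed)
print(k)  # 2
```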
Chi-Square Test for Homogeneity: Formula
The formula (also called a test statistic) of a Chi-square test for homogeneity is:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
where,
\(O_{r,c}\) is the observed frequency for population \(r\) at level \(c\), and
\(E_{r,c}\) is the expected frequency for population \(r\) at level \(c\).
How to Calculate the Test Statistic for a Chi-Square Test for Homogeneity
Step \(1\): Create a Table
Starting with your contingency table, remove the “Row Totals” column and the “Column Totals” row. Then, separate your observed and expected frequencies into two columns, like so:
Table 3. Table of observed and expected frequencies, Chi-Square test for homogeneity.
Table of Observed and Expected Frequencies |
---|
Living Arrangement | Status | Observed Frequency | Expected Frequency |
House or Townhouse | Survived | 217 | 208.795 |
House or Townhouse | Did Not Survive | 5314 | 5322.205 |
1st or 2nd Floor Apartment | Survived | 35 | 25.179 |
1st or 2nd Floor Apartment | Did Not Survive | 632 | 641.821 |
3rd or Higher Floor Apartment | Survived | 46 | 64.024 |
3rd or Higher Floor Apartment | Did Not Survive | 1650 | 1631.976 |
Decimals in this table are rounded to \(3\) digits.
Step \(2\): Subtract Expected Frequencies from Observed Frequencies
Add a new column to your table called “O – E”. In this column, put the result of subtracting the expected frequency from the observed frequency:
Table 4. Table of observed and expected frequencies, Chi-Square test for homogeneity.
Table of Observed, Expected, and O – E Frequencies | |
---|
Living Arrangement | Status | Observed Frequency | Expected Frequency | O – E |
House or Townhouse | Survived | 217 | 208.795 | 8.205 |
House or Townhouse | Did Not Survive | 5314 | 5322.205 | -8.205 |
1st or 2nd Floor Apartment | Survived | 35 | 25.179 | 9.821 |
1st or 2nd Floor Apartment | Did Not Survive | 632 | 641.821 | -9.821 |
3rd or Higher Floor Apartment | Survived | 46 | 64.024 | -18.024 |
3rd or Higher Floor Apartment | Did Not Survive | 1650 | 1631.976 | 18.024 |
Decimals in this table are rounded to \(3\) digits.
Step \(3\): Square the Results from Step \(2\)
Add another new column to your table called “(O – E)²”. In this column, put the result of squaring the results from the previous column:
Table 5. Table of observed and expected frequencies, Chi-Square test for homogeneity.
Table of Observed, Expected, O – E, and (O – E)² Frequencies | | |
---|
Living Arrangement | Status | Observed Frequency | Expected Frequency | O – E | (O – E)² |
House or Townhouse | Survived | 217 | 208.795 | 8.205 | 67.322 |
House or Townhouse | Did Not Survive | 5314 | 5322.205 | -8.205 | 67.322 |
1st or 2nd Floor Apartment | Survived | 35 | 25.179 | 9.821 | 96.452 |
1st or 2nd Floor Apartment | Did Not Survive | 632 | 641.821 | -9.821 | 96.452 |
3rd or Higher Floor Apartment | Survived | 46 | 64.024 | -18.024 | 324.865 |
3rd or Higher Floor Apartment | Did Not Survive | 1650 | 1631.976 | 18.024 | 324.865 |
Decimals in this table are rounded to \(3\) digits.
Step \(4\): Divide the Results from Step \(3\) by the Expected Frequencies
Add a final new column to your table called “(O – E)²/E”. In this column, put the result of dividing the results from the previous column by their expected frequencies:
Table 6. Table of observed and expected frequencies, Chi-Square test for homogeneity.
Table of Observed, Expected, O – E, (O – E)², and (O – E)²/E Frequencies | | | |
---|
Living Arrangement | Status | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
House or Townhouse | Survived | 217 | 208.795 | 8.205 | 67.322 | 0.322 |
House or Townhouse | Did Not Survive | 5314 | 5322.205 | -8.205 | 67.322 | 0.013 |
1st or 2nd Floor Apartment | Survived | 35 | 25.179 | 9.821 | 96.452 | 3.831 |
1st or 2nd Floor Apartment | Did Not Survive | 632 | 641.821 | -9.821 | 96.452 | 0.150 |
3rd or Higher Floor Apartment | Survived | 46 | 64.024 | -18.024 | 324.865 | 5.074 |
3rd or Higher Floor Apartment | Did Not Survive | 1650 | 1631.976 | 18.024 | 324.865 | 0.199 |
Decimals in this table are rounded to \(3\) digits.
Step \(5\): Sum the Results from Step \(4\) to get the Chi-Square Test Statistic
Finally, add up all the values in the last column of your table to calculate your Chi-square test statistic:
\[ \begin{align}\chi^{2} &= \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \\&= 0.322 + 0.013 + 3.831 + 0.150 + 5.074 + 0.199 \\&= 9.589.\end{align} \]
The Chi-square test statistic for the Chi-square test for homogeneity in the heart attack survival study is:
\[ \chi^{2} = 9.589. \]
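The five steps above can be collapsed into one loop. The following is a sketch under the assumption that the table is stored as a list of rows of raw counts; the function name `chi_square_statistic` is made up for this illustration.

```python
observed = [
    [217, 5314],   # house or townhouse
    [35, 632],     # 1st or 2nd floor apartment
    [46, 1650],    # 3rd or higher floor apartment
]

def chi_square_statistic(table):
    """Sum (O - E)^2 / E over every cell of the contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for r, row in enumerate(table):
        for c, o in enumerate(row):
            e = row_totals[r] * col_totals[c] / n   # expected frequency
            total += (o - e) ** 2 / e               # steps 2 to 4 for one cell
    return total                                     # step 5: the sum

chi2 = chi_square_statistic(observed)
print(round(chi2, 3))
```

Because the code keeps full precision until the end, intermediate values can differ from the hand-built tables in the last digit, while the final statistic should agree with the value above to about three decimal places.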
Steps to Perform a Chi-Square Test for Homogeneity
To determine whether the test statistic is large enough to reject the null hypothesis, you compare the test statistic to a critical value from a Chi-square distribution table. This act of comparison is the heart of the Chi-square test of homogeneity.
Follow the \(6\) steps below to perform a Chi-square test of homogeneity.
Steps \(1, 2\) and \(3\) are outlined in detail in the previous sections: “Chi-Square Test for Homogeneity: Null Hypothesis and Alternative Hypothesis”, “Expected Frequencies for a Chi-Square Test for Homogeneity”, and “How to Calculate the Test Statistic for a Chi-Square Test for Homogeneity”.
Step \(1\): State the Hypotheses
- The null hypothesis is that the two variables are from the same distribution.\[ \begin{align}H_{0}: p_{1,1} &= p_{2,1} \text{ AND } \\p_{1,2} &= p_{2,2} \text{ AND } \ldots \text{ AND } \\p_{1,n} &= p_{2,n}\end{align} \]
- The alternative hypothesis is that the two variables are not from the same distribution, i.e., at least one of the equalities in the null hypothesis is false.\[ \begin{align}H_{a}: p_{1,1} &\neq p_{2,1} \text{ OR } \\p_{1,2} &\neq p_{2,2} \text{ OR } \ldots \text{ OR } \\p_{1,n} &\neq p_{2,n}\end{align} \]
Step \(2\): Calculate the Expected Frequencies
Reference your contingency table to calculate the expected frequencies using the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
Step \(3\): Calculate the Chi-Square Test Statistic
Use the formula for a Chi-square test for homogeneity to calculate the Chi-square test statistic:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
Step \(4\): Find the Critical Chi-Square Value
To find the critical Chi-square value, you can either:
- use a Chi-square distribution table, or
- use a critical value calculator.
No matter which method you choose, you need \(2\) pieces of information:
- the degrees of freedom, \(k\), given by the formula:\[ k = (r - 1) (c - 1) \]
- the significance level, \(\alpha\), which is usually \(0.05\).
Find the critical value of the heart attack survival study.
To find the critical value:
- Calculate the degrees of freedom.
- Using the contingency table, notice that there are \(3\) rows and \(2\) columns of raw data. Therefore, the degrees of freedom are:\[ \begin{align}k &= (r - 1) (c - 1) \\&= (3-1) (2-1) \\&= 2 \text{ degrees of freedom}\end{align} \]
- Pick a significance level.
- Generally, unless otherwise specified, the significance level of \( \alpha = 0.05 \) is what you want to use. This study also used that significance level.
- Determine the critical value (you can use a Chi-square distribution table or a calculator). A Chi-square distribution table is used here.
- According to the Chi-square distribution table below, for \( k = 2 \) and \( \alpha = 0.05 \), the critical value is:\[ \chi^{2} \text{ critical value} = 5.99. \]
Table 7. Table of percentage points, Chi-Square test for homogeneity.
Percentage Points of the Chi-Square Distribution |
---|
Degrees of Freedom (k) | Probability of a Larger Value of \(\chi^{2}\); Significance Level (α) |
k | 0.99 | 0.95 | 0.90 | 0.75 | 0.50 | 0.25 | 0.10 | 0.05 | 0.01 |
1 | 0.000 | 0.004 | 0.016 | 0.102 | 0.455 | 1.32 | 2.71 | 3.84 | 6.63 |
2 | 0.020 | 0.103 | 0.211 | 0.575 | 1.386 | 2.77 | 4.61 | 5.99 | 9.21 |
3 | 0.115 | 0.352 | 0.584 | 1.212 | 2.366 | 4.11 | 6.25 | 7.81 | 11.34 |
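If you do not have a table at hand, a critical value can be approximated numerically. The sketch below uses only Python's standard library; the helper names `chi2_sf` and `critical_value` are assumptions of this example, and the upper-tail probability is built from the standard closed forms for \(1\) and \(2\) degrees of freedom plus a recurrence that adds \(2\) degrees of freedom at a time.

```python
import math

def chi2_sf(x, k):
    """P(X > x) for a Chi-square variable with k degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed-form starting points: k = 2 (even k) or k = 1 (odd k).
    if k % 2 == 0:
        q, a = math.exp(-half), 2
    else:
        q, a = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(x; a + 2) = Q(x; a) + half^(a/2) * e^(-half) / Gamma(a/2 + 1)
    while a < k:
        q += half ** (a / 2) * math.exp(-half) / math.gamma(a / 2 + 1)
        a += 2
    return q

def critical_value(alpha, k, tol=1e-9):
    """Find x such that chi2_sf(x, k) = alpha, using bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, k) > alpha:   # widen the bracket until it contains the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_sf(mid, k) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(critical_value(0.05, 2), 2))  # 5.99
```

For \(k = 2\) the result matches the table entry used above; other rows of the table can be checked the same way by changing `alpha` and `k`.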
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value
Is your test statistic large enough to reject the null hypothesis? To find out, compare it to the critical value.
Compare your test statistic to the critical value in the heart attack survival study:
The Chi-square test statistic is: \( \chi^{2} = 9.589 \)
The critical Chi-square value is: \( 5.99 \)
The Chi-square test statistic is greater than the critical value.
Step \(6\): Decide Whether to Reject the Null Hypothesis
Finally, decide if you can reject the null hypothesis.
If the Chi-square value is less than the critical value, then the difference between the observed and expected frequencies is not statistically significant (i.e., \( p > \alpha \)), and you fail to reject the null hypothesis.
If the Chi-square value is greater than the critical value, then you have a significant difference between the observed and expected frequencies (i.e., \( p < \alpha \)), and you reject the null hypothesis.
Now you can decide whether to reject the null hypothesis for the heart attack survival study:
The Chi-square test statistic is greater than the critical value; i.e., the \(p\)-value is less than the significance level.
- So, you reject the null hypothesis: you have sufficient evidence that the proportions in the survival categories are not the same for the \(3\) groups.
The test itself does not say which group differs. Comparing the observed and expected frequencies, though, the largest shortfall in survivors is among those who live on the third or higher floor of an apartment (\(46\) observed versus \(64.024\) expected), which suggests a smaller chance of survival for that group.
P-Value of a Chi-Square Test for Homogeneity
The \(p\)-value of a Chi-square test for homogeneity is the probability, assuming the null hypothesis is true, that a Chi-square random variable with \(k\) degrees of freedom is at least as large as the calculated test statistic. You can use a Chi-square distribution calculator to find the \(p\)-value of a test statistic. Alternatively, you can use a Chi-square distribution table to determine whether the \(p\)-value of your test statistic is below a certain significance level.
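For the heart attack survival study, the \(p\)-value can be sketched without a calculator, because with \(2\) degrees of freedom the upper-tail probability of the Chi-square distribution has the closed form \(e^{-x/2}\). The function name `p_value_two_df` is hypothetical, and the shortcut does not apply to other degrees of freedom.

```python
import math

def p_value_two_df(chi2):
    """P(X > chi2) for exactly 2 degrees of freedom: exp(-chi2 / 2)."""
    return math.exp(-chi2 / 2)

alpha = 0.05
p = p_value_two_df(9.589)   # test statistic from the heart attack study
print(round(p, 4))          # 0.0083
print("reject H0" if p < alpha else "fail to reject H0")
```

The \(p\)-value is below \(0.05\), which agrees with the comparison between the test statistic and the critical value.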
Chi-Square Test for Homogeneity VS Independence
At this point, you might ask yourself, what is the difference between a Chi-square test for homogeneity and a Chi-square test for independence?
You use the Chi-square test for homogeneity when you have only \(1\) categorical variable from \(2\) (or more) populations.
When surveying students in a school, you might ask them for their favorite subject. You ask the same question to \(2\) different populations of students:
- freshmen and
- seniors.
You use a Chi-square test for homogeneity to determine if the freshmen's preferences differed significantly from the seniors' preferences.
You use the Chi-square test for independence when you have \(2\) categorical variables from the same population.
In a school, students could be classified by:
- their handedness (left- or right-handed) or by
- their field of study (math, physics, economics, etc.).
You use a Chi-square test for independence to determine if handedness is related to choice of study.
Chi-Square Test for Homogeneity Example
Continuing from the example in the introduction, you decide to find an answer to the question: do men and women have different movie preferences?
You select a random sample of \(500\) college freshmen: \(200\) men and \(300\) women. Each person is asked which of the following movies they like best: The Terminator; The Princess Bride; or The Lego Movie. The results are shown in the contingency table below.
Table 8. Contingency table, Chi-Square test for homogeneity.
| Contingency Table | |
---|
Movie | Men | Women | Row Totals |
The Terminator | 120 | 50 | 170 |
The Princess Bride | 20 | 140 | 160 |
The Lego Movie | 60 | 110 | 170 |
Column Totals | 200 | 300 | \(n =\) 500 |
Solution:
Step \(1\): State the Hypotheses.
- Null hypothesis: the proportion of men who prefer each movie is equal to the proportion of women who prefer each movie. So,\[ \begin{align}H_{0}: p_{\text{men like The Terminator}} &= p_{\text{women like The Terminator}} \text{ AND} \\p_{\text{men like The Princess Bride}} &= p_{\text{women like The Princess Bride}} \text{ AND} \\p_{\text{men like The Lego Movie}} &= p_{\text{women like The Lego Movie}}\end{align} \]
- Alternative hypothesis: at least one of the equalities in the null hypothesis is false. So,\[ \begin{align}H_{a}: p_{\text{men like The Terminator}} &\neq p_{\text{women like The Terminator}} \text{ OR} \\p_{\text{men like The Princess Bride}} &\neq p_{\text{women like The Princess Bride}} \text{ OR} \\p_{\text{men like The Lego Movie}} &\neq p_{\text{women like The Lego Movie}}\end{align} \]
Step \(2\): Calculate Expected Frequencies.
- Using the above contingency table and the formula for expected frequencies:\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n}, \]create a table of expected frequencies.
Table 9. Table of data for movies, Chi-Square test for homogeneity.
Movie | Men | Women | Row Totals |
The Terminator | 68 | 102 | 170 |
The Princess Bride | 64 | 96 | 160 |
The Lego Movie | 68 | 102 | 170 |
Column Totals | 200 | 300 | \(n =\) 500 |
Step \(3\): Calculate the Chi-Square Test Statistic.
- Create a table to hold your calculated values and use the formula:\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]to calculate your test statistic.
Table 10. Table of data for movies, Chi-Square test for homogeneity.
Movie | Person | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
Terminator | Men | 120 | 68 | 52 | 2704 | 39.765 |
Terminator | Women | 50 | 102 | -52 | 2704 | 26.510 |
Princess Bride | Men | 20 | 64 | -44 | 1936 | 30.250 |
Princess Bride | Women | 140 | 96 | 44 | 1936 | 20.167 |
Lego Movie | Men | 60 | 68 | -8 | 64 | 0.941 |
Lego Movie | Women | 110 | 102 | 8 | 64 | 0.627 |
Decimals in this table are rounded to \(3\) digits.
- Add all the values in the last column of the table above to calculate the Chi-square test statistic:\[ \begin{align}\chi^{2} &= 39.76470588 + 26.50980392 \\&+ 30.25 + 20.16667 \\&+ 0.9411764706 + 0.6274509804 \\&= 118.2598039.\end{align} \]
The formula here uses the non-rounded numbers from the table above to get a more accurate answer.
- The Chi-square test statistic is:\[ \chi^{2} = 118.2598039. \]
Step \(4\): Find the Critical Chi-Square Value and the \(P\)-Value.
- Calculate the degrees of freedom.\[ \begin{align}k &= (r - 1) (c - 1) \\&= (3 - 1) (2 - 1) \\&= 2\end{align} \]
- Using a Chi-square distribution table, look at the row for \(2\) degrees of freedom and the column for \(0.05\) significance to find the critical value of \(5.99\).
- To use a \(p\)-value calculator, you need the test statistic and degrees of freedom.
- Input the degrees of freedom and the Chi-square test statistic into the calculator to get:\[ P(\chi^{2} > 118.2598039) \approx 0. \]
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value.
- The test statistic of \(118.2598039\) is much larger than the critical value of \(5.99\).
- The \(p\)-value is also much less than the significance level.
Step \(6\): Decide Whether to Reject the Null Hypothesis.
- Because the test statistic is larger than the critical value and the \(p\)-value is less than the significance level, you have sufficient evidence to reject the null hypothesis. In other words, the data suggest that men and women have different movie preferences.
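The whole movie example can be reproduced in one short script. This is a sketch using only the standard library, with variable names chosen for this illustration. The critical value and \(p\)-value use shortcuts that hold only because this table has \(2\) degrees of freedom.

```python
import math

# Observed counts from Table 8: rows are movies, columns are (men, women).
observed = [
    [120, 50],    # The Terminator
    [20, 140],    # The Princess Bride
    [60, 110],    # The Lego Movie
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Test statistic: sum of (O - E)^2 / E over every cell.
chi2 = 0.0
for r, row in enumerate(observed):
    for c, o in enumerate(row):
        e = row_totals[r] * col_totals[c] / n   # expected frequency
        chi2 += (o - e) ** 2 / e

k = (len(observed) - 1) * (len(observed[0]) - 1)   # degrees of freedom

# Shortcuts valid only for k = 2, where P(X > x) = exp(-x / 2).
alpha = 0.05
critical = -2 * math.log(alpha)
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.4f}, k = {k}, critical = {critical:.2f}, p = {p_value:.3g}")
print("reject H0" if chi2 > critical else "fail to reject H0")
```

The printed \(p\)-value is a very small positive number rather than exactly zero, which is why a calculator that rounds its output may display it as \(0\).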
Chi-Square Test for Homogeneity – Key takeaways
- A Chi-square test for homogeneity is a Chi-square test that is applied to a single categorical variable from two or more different populations to determine whether they have the same distribution.
- This test has the same basic conditions as any other Pearson Chi-square test:
- The variables must be categorical.
- Groups must be mutually exclusive.
- Expected counts must be at least \(5\).
- Observations must be independent.
- The null hypothesis is that the variables are from the same distribution.
- The alternative hypothesis is that the variables are not from the same distribution.
- The degrees of freedom for a Chi-square test for homogeneity is given by the formula:\[ k = (r - 1) (c - 1) \]
- The expected frequency for row \(r\) and column \(c\) of a Chi-square test for homogeneity is given by the formula:\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
- The test statistic for a Chi-square test for homogeneity is given by the formula:\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
References
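1. “Out-of-Hospital Cardiac Arrest in High-Rise Buildings: Delays to Patient Care and Effect on Survival.” Canadian Medical Association Journal (CMAJ), April 5, 2016. https://pubmed.ncbi.nlm.nih.gov/26783332/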