Why is the expected frequency during a chi square dependence test calculated the way that it is?

Kayla Mcdowell

Answered question

2022-10-15

I understand the chi-square test for checking whether a certain model is appropriate, and I understand how the expected values are chosen there. But when it comes to the dependence test (the one that uses a contingency table), I don't understand why the expected frequency is calculated from the observed frequencies in the contingency table as (row total × column total) / grand total.
Someone please explain.

Answer & Explanation

periasemdy

Beginner, 2022-10-16, Added 15 answers

Suppose you have two categorical variables: X with four levels I, II, III, and IV, and Y with three levels A, B, and C. We have the following table:

                    Levels of X
Levels of Y      I     II    III     IV   Total
-----------------------------------------------
A               15     32     18      5      70
B                8     29     23     18      78
C                1     20     25     22      68
-----------------------------------------------
Total           24     81     66     45     216

You want to test whether categorical variables X and Y are independent, using a chi-squared goodness-of-fit (GOF) test statistic.
The key idea is to use the null hypothesis of independence to get the expected values $E_{ij}$ for each of the 12 cells in the table.
We estimate $P(X = \mathrm{I})$ as $\hat{P}(X = \mathrm{I}) = 24/216$. Similarly, $\hat{P}(Y = \mathrm{A}) = 70/216$. Then, by independence, we multiply to find $\hat{P}(X = \mathrm{I}, Y = \mathrm{A}) = \frac{24}{216} \times \frac{70}{216} = \frac{1680}{46{,}656}$.
Then, to find the expected count for cell (A, I), we multiply by the total sample size 216 to get $E_{\mathrm{A},\mathrm{I}} = E_{11} = 216 \times \frac{1680}{46{,}656}$. Altogether, canceling a factor of 216 in numerator and denominator, we have found
$$E_{\mathrm{A},\mathrm{I}} = \frac{(\text{Row A total})(\text{Column I total})}{\text{Grand Total}} = \frac{70(24)}{216} = 7.78.$$
Notice that it is OK to round slightly, but don't round the $E_{ij}$ to integers.
To complete the analysis you need to use the same procedure to find the other eleven $E_{ij}$'s.
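The same procedure for all twelve cells can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `expected_counts` and the list-of-lists layout are assumptions made here, not part of the original answer.

```python
# Observed counts from the table above.
# Rows are Y = A, B, C; columns are X = I, II, III, IV.
observed = [
    [15, 32, 18, 5],
    [8, 29, 23, 18],
    [1, 20, 25, 22],
]


def expected_counts(table):
    """Return E_ij = (row i total) * (column j total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Under independence, P(row i, col j) is estimated by
    # (row_total / n) * (col_total / n); multiplying by n cancels one n.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for label, row in zip("ABC", expected):
    print(label, [round(e, 2) for e in row])
```

The first entry printed for row A should be about 7.78, agreeing with the hand calculation, and each row of expected counts sums to the same row total as the observed counts.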
Then you find the GOF statistic
$$Q = \sum_{i} \sum_{j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$
where $n_{ij}$ is the observed count in cell $(i, j)$ and there are twelve terms in the double sum. Each term is called a contribution to Q.
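As a rough sketch of the double sum, the following Python computes Q directly from the observed table. The helper name `chi_square_statistic` is hypothetical and only the standard library is used.

```python
# Observed counts: rows are Y = A, B, C; columns are X = I, II, III, IV.
observed = [
    [15, 32, 18, 5],
    [8, 29, 23, 18],
    [1, 20, 25, 22],
]


def chi_square_statistic(table):
    """Sum (n_ij - E_ij)^2 / E_ij over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    q = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / grand_total
            q += (n_ij - e_ij) ** 2 / e_ij  # one "contribution" to Q
    return q


q = chi_square_statistic(observed)
print(round(q, 3))
```

For these data the result should agree with the value Q = 27.135 quoted in the answer.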
To find the critical value for testing the null hypothesis that X and Y are independent categorical variables, we use the fact that $Q \overset{\text{aprx}}{\sim} \mathsf{Chisq}((r-1)(c-1))$, where the table has $r$ rows and $c$ columns. The approximation is valid provided that all of the $E_{ij}$'s are larger than 5. (Some authors say it is OK for a few to be as small as 3 if most are larger than 5.)
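A quick way to check the "all expected counts larger than 5" condition is to look at the smallest expected count. The snippet below is a minimal sketch with the table hard-coded; the variable names are illustrative.

```python
# Observed counts: rows are Y = A, B, C; columns are X = I, II, III, IV.
observed = [
    [15, 32, 18, 5],
    [8, 29, 23, 18],
    [1, 20, 25, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# The chi-squared approximation is usually trusted when this exceeds 5.
smallest = min(min(row) for row in expected)
print(round(smallest, 2), smallest > 5)
```

Here the smallest expected count belongs to cell (C, I), which has the smallest row total and the smallest column total, and it is comfortably above 5.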
The degrees of freedom are $(r-1)(c-1) = 2(3) = 6$ in our example, so that the critical value for a test at the 5% level is $q^* = 12.59$ from printed chi-squared tables. In our example, after a lot of computation perhaps best done on a computer, we get $Q = 27.135 > 12.59$, so we reject the null hypothesis that the two categorical variables are independent.
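If printed tables are not at hand, the tail probability can be checked numerically. The sketch below relies on a closed form for the chi-squared upper tail that holds only when the degrees of freedom are even, $P(\mathsf{Chisq}(2k) > x) = e^{-x/2} \sum_{i=0}^{k-1} (x/2)^i / i!$; the helper name `chi2_sf_even_df` is made up for this illustration. For general degrees of freedom a statistics library would be the usual choice.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(Chisq(df) > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term           # adds (x/2)^i / i!
        term *= half / (i + 1)  # next term of the series
    return math.exp(-half) * total


df = (3 - 1) * (4 - 1)                          # (r - 1)(c - 1) for a 3 x 4 table
p_value = chi2_sf_even_df(27.135, df)           # tail area beyond the observed Q
tail_at_critical = chi2_sf_even_df(12.59, df)   # should be close to 0.05
print(df, p_value, tail_at_critical)
```

The tail area beyond 12.59 comes out very close to 0.05, which is what makes 12.59 the 5% critical value, and the tail area beyond 27.135 is far below 0.05, consistent with rejecting independence.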
Note about degrees of freedom: Notice that if you have the six counts 15, 32, 18, 8, 29, 23, along with the row and column totals, you can figure out the remaining six entries in the body of the table. One says that, given the totals, only six of the entries are 'free to vary'. This is the origin of the phrase 'degrees of freedom'.
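To make the 'free to vary' idea concrete, here is a small sketch that rebuilds the whole table from the six free counts and the marginal totals. The variable names are illustrative only.

```python
# Marginal totals from the table.
row_totals = [70, 78, 68]        # rows A, B, C
col_totals = [24, 81, 66, 45]    # columns I, II, III, IV

# The six "free" counts: rows A and B, columns I to III.
free = [
    [15, 32, 18],
    [8, 29, 23],
]

# Each free row is completed by subtracting from its row total.
rows = [r + [row_totals[i] - sum(r)] for i, r in enumerate(free)]

# The last row is whatever is left over in each column.
last_row = [col_totals[j] - sum(r[j] for r in rows) for j in range(len(col_totals))]
rows.append(last_row)

for label, row in zip("ABC", rows):
    print(label, row)
```

The reconstructed rows reproduce the original table exactly, so once the totals are fixed only (3 - 1)(4 - 1) = 6 cells carry independent information.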
Below is an analysis of these data from Minitab 17 statistical software. In the printout, notice the value $E_{\mathrm{A},\mathrm{I}} = 7.78$, as computed above.

Chi-Square Test for Association

           I       II      III       IV      All
A         15       32       18        5       70
        7.78    26.25    21.39    14.58
      6.7063   1.2595   0.5369   6.2976

B          8       29       23       18       78
        8.67    29.25    23.83    16.25
      0.0513   0.0021   0.0291   0.1885

C          1       20       25       22       68
        7.56    25.50    20.78    14.17
      5.6879   1.1863   0.8580   4.3314

All       24       81       66       45      216

Cell Contents: Count
               Expected count
               Contribution to Chi-square

Pearson Chi-Square = 27.135, DF = 6, P-Value = 0.000
