Chi-Square

Nominal and Rank Order Data

Chi-Square Statistical Tests

Characteristics of the Chi-Square Distribution

The computed value of Chi-Square is never negative because the difference between the observed frequency and the expected frequency is squared, that is, (O - E)², and the denominator is the expected frequency, which must also be positive.
There is a family of Chi-Square distributions. There is a Chi-Square distribution for 1 degree of freedom, another for 2 degrees of freedom, another for 3 degrees of freedom, and so on.
The shape of the Chi-Square distribution does not depend on the size of the sample. It does depend upon the number of categories.
The Chi-Square distribution is positively skewed. However, as the number of degrees of freedom increases, the distribution begins to approximate the normal distribution.
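As a rough illustration of the last point, the sketch below uses a standard result that is not stated in the text above: a Chi-Square distribution with k degrees of freedom has mean k and skewness sqrt(8/k). The function name is made up for this example. The skewness is always positive but shrinks toward 0 (the skewness of a normal distribution) as the degrees of freedom grow.

```python
import math


def chi_square_skewness(df):
    """Skewness of a Chi-Square distribution with `df` degrees of freedom.

    Uses the standard result skewness = sqrt(8 / df); this formula is an
    addition for illustration and does not appear in the text.
    """
    return math.sqrt(8.0 / df)


# The skewness stays positive but decreases as df increases.
for df in (1, 2, 4, 10, 50):
    print(f"df={df:>2}  mean={df:>2}  skewness={chi_square_skewness(df):.3f}")
```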

Limitations of Chi-Square

If there are only two cells, the expected frequency in each cell should be 5 or more.
For more than two cells, Chi-Square should not be applied if more than 20% of the cells have expected frequencies (E) less than 5.
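The two rules above can be checked mechanically before running a test. The following is a minimal sketch; the function name expected_counts_ok is hypothetical, and the function only encodes the two rules as stated here.

```python
def expected_counts_ok(expected):
    """Return True if the expected frequencies satisfy the two rules above."""
    if len(expected) < 2:
        raise ValueError("need at least two cells")

    # Number of cells whose expected frequency is below 5.
    small = sum(1 for e in expected if e < 5)

    if len(expected) == 2:
        # Two cells: every expected frequency must be 5 or more.
        return small == 0

    # More than two cells: no more than 20% of the cells may have E < 5.
    # Integer arithmetic avoids any floating-point comparison issues.
    return small * 100 <= 20 * len(expected)


print(expected_counts_ok([21, 21, 21, 21, 21]))  # all cells are 5 or more
print(expected_counts_ok([4, 4, 10, 10, 10]))    # 40% of cells are below 5
```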

Goodness-of-fit Analysis - Chi-Square can be used to determine if observed data fits the pattern that was expected (the claim). In the simplest form of this type of Chi-Square test the observed data is listed in a single column along with the associated expected frequencies. The statistic is calculated as the sum over all categories of (O - E)²/E, with the number of degrees of freedom given by k - 1, where k is the number of categories. The expected frequencies may all be equal, or they may differ from category to category. A couple of examples should make the difference clear.

Example 1 - Goodness-of-fit - Equal expected frequencies. - Suppose that you run a plant and you want to determine if employees take sick leave with equal frequency on all days of the week. You collect the data contained in the table shown below. The data in the table indicates that on Monday there were 30 people who missed work due to illness. The counts for Tuesday through Friday were 14, 18, 16 and 27, respectively. All told, 105 days of work were missed during the week. If the distribution of missed days were uniform, that is, the same for each day, then we would expect 105/5 = 21 workers to miss work on each day.

Day	Mon.	Tue.	Wed.	Thur.	Fri.	Totals
Observed	30	14	18	16	27	105
Expected	21	21	21	21	21	105

The null hypothesis will be that the frequencies are all the same (claim was equal frequency for all days) and the alternative will be that one or more are different.
We will use alpha = 0.025.
The data is nominal and the Chi-Square statistic is appropriate.
The critical value of the Chi-Square statistic with alpha = 0.025 and degrees of freedom = 5-1 = 4 is 11.143. If the value we calculate is greater than 11.143 we will reject the null hypothesis and conclude that the frequencies are not all the same.
Chi-Square=(30-21)²/21 + (14-21)²/21 +(18-21)²/21 +(16-21)²/21 +(27-21)²/21= 9.52.
Since 9.52 is less than 11.143, we cannot reject the null hypothesis. Practically, this means that at the 0.025 level of significance our data does not differ significantly from what we would expect if the sick day frequencies were all the same.
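The steps of Example 1 can be reproduced with a few lines of Python. This is only a sketch: the critical value 11.143 is copied from the text rather than computed, and the helper upper_tail_df4 is a hypothetical name that uses the closed form exp(-x/2)(1 + x/2) for the upper-tail area of a Chi-Square distribution with exactly 4 degrees of freedom (a standard result, not given in the text) to confirm that 11.143 cuts off an area of about 0.025.

```python
import math

# Observed sick days, Monday through Friday (from the table in Example 1).
observed = [30, 14, 18, 16, 27]

# Equal expected frequencies: total divided by the number of categories.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-Square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1      # k - 1 degrees of freedom
critical_value = 11.143     # alpha = 0.025, df = 4 (taken from the text)
reject_null = chi_square > critical_value


def upper_tail_df4(x):
    """P(Chi-Square >= x) for 4 degrees of freedom only (closed form)."""
    return math.exp(-x / 2) * (1 + x / 2)


print(f"Chi-Square = {chi_square:.2f}, df = {df}, reject null: {reject_null}")
print(f"Tail area beyond the critical value: {upper_tail_df4(critical_value):.4f}")
```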

Example 2 - Goodness-of-fit - Unequal expected frequencies. - Suppose that it was claimed that the car colors in your area were present in the following proportions: 40% silver, 25% red, 15% blue, 10% green and 10% other colors. If you decided to test this claim and took a random sample of 100 cars, you might end up with the following results: 35 silver, 22 red, 21 blue, 6 green and 16 other colors. Now you want to do a hypothesis test.

The null hypothesis will be that the frequencies in your area are consistent with those claimed. The alternative will be that one or more of the frequencies is different from that claimed.
We will use alpha = 0.05.
The data is nominal and the Chi-Square statistic is appropriate.
The critical value of the Chi-Square statistic with alpha = 0.05 and 5 - 1 = 4 degrees of freedom is 9.488. If the calculated value of Chi-Square is greater than 9.488 we will reject the null hypothesis and accept the alternative. Otherwise we will not reject the null hypothesis.
To get the expected frequencies (E) we will multiply 100 (the number of cars observed) by the proportion of cars claimed for each color:

Silver	Red	Blue	Green	Other
0.40(100)=40	0.25(100)=25	0.15(100)=15	0.10(100)=10	0.10(100)=10
Chi-Square=(35-40)²/40 + (22-25)²/25 +(21-15)²/15 +(6-10)²/10 +(16-10)²/10= 8.59
Since 8.59 is less than 9.488, we cannot reject the null hypothesis. As far as we can tell (at the alpha = 0.05 level of significance) the data is consistent with the claim that the frequencies match the stated frequencies.
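Example 2 can be sketched the same way. The variable names below are chosen for this illustration, and the critical value 9.488 is copied from the text rather than computed. The claimed percentages are kept as whole numbers so that the expected frequencies come out exactly. The statistic works out to about 8.585, which the text reports as 8.59.

```python
# Claimed percentages and observed counts from Example 2.
claimed_percent = {"Silver": 40, "Red": 25, "Blue": 15, "Green": 10, "Other": 10}
observed = {"Silver": 35, "Red": 22, "Blue": 21, "Green": 6, "Other": 16}

n = sum(observed.values())  # sample size (100 cars)

# Expected frequency for each color: claimed proportion times sample size.
expected = {color: pct * n / 100 for color, pct in claimed_percent.items()}

# Chi-Square statistic: sum of (O - E)^2 / E over all colors.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

df = len(observed) - 1     # k - 1 degrees of freedom
critical_value = 9.488     # alpha = 0.05, df = 4 (taken from the text)
reject_null = chi_square > critical_value

print(f"Expected: {expected}")
print(f"Chi-Square = {chi_square:.3f}, df = {df}, reject null: {reject_null}")
```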

Contingency Table Analysis - Suppose we want to determine if speeding on the highway is independent of driver age. We take a random sample of 200 drivers and obtain the following data:

Age	Under 25	25 to 55	Over 55	Total
Not Speeding	70	65	8	140
Speeding	30	15	2	60
Totals	100	80	20	200

The null hypothesis is that the proportions of speeders and non-speeders are independent of the age of the driver. The alternative hypothesis is that they are not independent.
We use alpha = 0.01.
The data is nominal and the Chi-Square statistic is appropriate.
For a contingency table, degrees of freedom = (rows - 1)(columns - 1) = (2 - 1)(3 - 1) = 2 degrees of freedom. With alpha = 0.01 and 2 degrees of freedom the critical value of Chi-Square is 9.210.
To get the expected frequency for each cell in a contingency table, multiply the row total by the column total and divide by the grand total. For each cell in our table this looks like the following:
- (140)(100)/(200) = 70
- (140)(80)/(200) = 56
- (140)(20)/(200) = 14
- (60)(100)/(200) = 30
- (60)(80)/(200) = 24
- (60)(20)/(200) = 6
Chi-Square = (70-70)²/70 + (65-56)²/56 + (8-14)²/14 + (30-30)²/30 + (15-24)²/24 + (2-6)²/6 = 10.06
The calculated value of Chi-Square is greater than the critical value, so we reject the null hypothesis and conclude that age and speeding are not independent.
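The contingency table calculation can be sketched as follows. This sketch takes the row, column and grand totals exactly as printed in the table and does not recompute them from the cells; the printed cell counts do not add up to those totals in every row and column, so recomputing the totals from the cells would give different expected frequencies. The variable names are chosen for this illustration. For exactly 2 degrees of freedom the upper-tail area of the Chi-Square distribution is exp(-x/2) (a standard result, not given in the text), so the critical value can be obtained as -2 ln(alpha).

```python
import math

# Marginal totals exactly as printed in the table (not recomputed from cells).
row_totals = {"Not Speeding": 140, "Speeding": 60}
col_totals = {"Under 25": 100, "25 to 55": 80, "Over 55": 20}
grand_total = 200

# Observed cell counts as printed in the table.
observed = {
    ("Not Speeding", "Under 25"): 70,
    ("Not Speeding", "25 to 55"): 65,
    ("Not Speeding", "Over 55"): 8,
    ("Speeding", "Under 25"): 30,
    ("Speeding", "25 to 55"): 15,
    ("Speeding", "Over 55"): 2,
}

# Expected frequency for each cell: row total * column total / grand total.
expected = {
    (row, col): row_totals[row] * col_totals[col] / grand_total
    for row in row_totals
    for col in col_totals
}

# Chi-Square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed
)

# Degrees of freedom: (rows - 1)(columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Critical value for df = 2 only: solve exp(-x / 2) = alpha for x.
alpha = 0.01
critical_value = -2 * math.log(alpha)
reject_null = chi_square > critical_value

print(f"Chi-Square = {chi_square:.2f}, df = {df}")
print(f"Critical value = {critical_value:.3f}, reject null: {reject_null}")
```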