Chi-Square Statistical Tests
Characteristics of the Chi-Square Distribution
- The computed value of Chi-Square is never negative because the difference between the observed frequency (O) and the expected frequency (E) is squared, that is, (O - E)², and the denominator is the expected frequency, which must also be positive.
- There is a family of Chi-Square distributions. There is a Chi-Square distribution for 1 degree of freedom, another for 2 degrees of freedom, another for 3 degrees of freedom, and so on.
- The shape of the Chi-Square distribution does not depend on the size of the sample. It does depend upon the number of categories.
- The Chi-Square distribution is positively skewed. However, as the number of degrees of freedom increases, the distribution begins to approximate the normal distribution.
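The last point can be illustrated numerically. The sketch below relies on standard properties of the Chi-Square distribution that are not stated in the text above (mean equal to the degrees of freedom, variance equal to twice the degrees of freedom, skewness equal to the square root of 8 divided by the degrees of freedom). The function name `chi_square_moments` is only an illustrative choice.

```python
import math


def chi_square_moments(df):
    """Return mean, variance and skewness of a Chi-Square distribution.

    Uses the standard closed-form results (an assumption brought in from
    general statistics, not from the text above):
    mean = df, variance = 2 * df, skewness = sqrt(8 / df).
    """
    return {
        "mean": df,
        "variance": 2 * df,
        "skewness": math.sqrt(8 / df),
    }


# Skewness shrinks toward 0 (the skewness of a normal distribution)
# as the number of degrees of freedom grows.
for df in (1, 2, 3, 10, 30, 100):
    moments = chi_square_moments(df)
    print(f"df={df:>3}  skewness={moments['skewness']:.3f}")
```

The skewness stays positive for every number of degrees of freedom but gets steadily smaller, which is what "begins to approximate the normal distribution" means in practice.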
Limitations of Chi-Square
- If there are only two cells, the expected frequency in each cell should be 5 or more.
- For more than two cells, Chi-Square should not be applied if more than 20% of the cells have expected frequencies (E) less than 5.
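These two rules of thumb are easy to check before running a test. The following is a minimal sketch, assuming the expected frequencies are available as a plain list; the function name `chi_square_applicable` is hypothetical.

```python
def chi_square_applicable(expected):
    """Check the two rules of thumb listed above.

    expected: list of expected frequencies, one per cell.
    Returns True if Chi-Square may reasonably be applied.
    """
    cells = len(expected)
    small = sum(1 for e in expected if e < 5)  # cells with E below 5

    if cells <= 2:
        # Two cells: every expected frequency must be 5 or more.
        return small == 0

    # More than two cells: at most 20% of cells may have E below 5.
    # Written with integers (small / cells <= 1/5) to avoid rounding issues.
    return 5 * small <= cells


print(chi_square_applicable([21, 21, 21, 21, 21]))  # no small cells
print(chi_square_applicable([6, 4]))                # two cells, one below 5
print(chi_square_applicable([10, 10, 4, 4]))        # half the cells below 5
```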
Goodness-of-fit Analysis - Chi-Square can be used to determine if observed data fits the pattern that was expected (the claim). In the simplest form of this type of Chi-Square test, the observed data is listed in a single column along with the associated expected frequencies. The statistic is calculated by summing (O - E)²/E over all categories, with the number of degrees of freedom determined by k-1, where k is the number of categories. The test can involve a situation where the expected frequencies are equal or a situation where the expected frequencies differ. A couple of examples should make the difference clear.
Example 1 - Goodness-of-fit - Equal expected frequencies. - Suppose that you run a plant and you want to determine if employees take sick leave with equal frequency on all days of the week. You collect the data contained in the table shown below. The data in the table indicates that on Monday there were 30 people who missed work due to illness. On the other days of the week it was 14, 18, 16 and 27 respectively. All told 105 days of work were missed during the week. If the distribution of missed days was uniform, the same for each day, then we would expect 105/5 = 21 workers to miss work on each day.
Day | Mon. | Tue. | Wed. | Thur. | Fri. | Totals |
---|---|---|---|---|---|---|
Observed | 30 | 14 | 18 | 16 | 27 | 105 |
Expected | 21 | 21 | 21 | 21 | 21 | 105 |
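The calculation for this example can be sketched in a few lines. The observed counts come from the table above. The 0.05 significance level and its tabulated critical value of 9.488 for 4 degrees of freedom are assumptions added for illustration, since the example does not name a significance level.

```python
# Observed sick-leave counts for Monday through Friday (from the table).
observed = [30, 14, 18, 16, 27]

total = sum(observed)              # 105 missed days in all
expected = total / len(observed)   # uniform claim: 105 / 5 = 21 per day

# Chi-Square statistic: sum of (O - E)^2 / E over the five days.
chi_square = sum((o - expected) ** 2 / expected for o in observed)

# Degrees of freedom: k - 1, where k is the number of categories.
degrees_of_freedom = len(observed) - 1

# ASSUMPTION: a 0.05 significance level, which the example does not state.
# 9.488 is the tabulated Chi-Square critical value for 4 degrees of freedom.
CRITICAL_VALUE_05_DF4 = 9.488
reject = chi_square > CRITICAL_VALUE_05_DF4

print(f"chi-square = {chi_square:.3f}, df = {degrees_of_freedom}")
print("reject uniform claim at 0.05:", reject)
```

The statistic works out to 200/21, or about 9.52, which is just above 9.488. Under the assumed 0.05 level, the claim that sick leave is spread evenly across the week would be rejected, though only narrowly.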
Example 2 - Goodness-of-fit - Unequal expected frequencies. - Suppose that it was claimed that the car colors in your area were present in the following proportions: 40% silver, 25% red, 15% blue, 10% green and other colors 10%. If you decided to test this claim and took a random sample of 100 cars, you might end up with the following results: 35 silver, 22 red, 21 blue, 6 green and 16 other colors. Now you want to do a hypothesis test. The expected frequencies are found by multiplying each claimed proportion by the sample size:
Color | Silver | Red | Blue | Green | Other |
---|---|---|---|---|---|
Observed | 35 | 22 | 21 | 6 | 16 |
Expected | 0.40(100)=40 | 0.25(100)=25 | 0.15(100)=15 | 0.10(100)=10 | 0.10(100)=10 |
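With unequal expected frequencies the only change is how E is obtained: each claimed proportion is multiplied by the sample size. The sketch below uses the claimed percentages and the observed counts from the example. The dictionary layout and variable names are illustrative choices.

```python
# Claimed share of each color, in percent, and the observed sample counts.
claimed_percent = {"silver": 40, "red": 25, "blue": 15, "green": 10, "other": 10}
observed = {"silver": 35, "red": 22, "blue": 21, "green": 6, "other": 16}

sample_size = sum(observed.values())  # 100 cars

# Expected frequency for each color: claimed proportion times sample size.
expected = {color: pct * sample_size / 100 for color, pct in claimed_percent.items()}

# Contribution of each color to the statistic: (O - E)^2 / E.
contributions = {
    color: (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
}

chi_square = sum(contributions.values())
degrees_of_freedom = len(observed) - 1  # k - 1
largest = max(contributions, key=contributions.get)

for color, value in contributions.items():
    print(f"{color:>6}: O={observed[color]:>2}  E={expected[color]:>4.0f}  "
          f"(O-E)^2/E={value:.3f}")
print(f"chi-square = {chi_square:.3f}, df = {degrees_of_freedom}")
print("largest contribution:", largest)
```

Listing the per-color contributions shows where the sample departs most from the claim. Here the "other" category contributes the most to the total.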
Contingency Table Analysis - Suppose we want to determine if speeding on the highway is independent of driver age. We take a random sample of 200 drivers and obtain the following data:
Age | Under 25 | 25 to 55 | Over 55 | Total |
---|---|---|---|---|
Not Speeding | 70 | 65 | 5 | 140 |
Speeding | 30 | 15 | 15 | 60 |
Totals | 100 | 80 | 20 | 200 |
Each expected frequency is calculated as (row total)(column total)/(grand total). Starting with the top-left cell and going across each row in turn, the expected frequencies are:
- (140)(100)/(200) = 70
- (140)(80)/(200) = 56
- (140)(20)/(200) = 14
- (60)(100)/(200) = 30
- (60)(80)/(200) = 24
- (60)(20)/(200) = 6
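The six calculations above all follow the same pattern, so they can be generated from the row and column totals alone. This is a minimal sketch using the totals from the table. The degrees-of-freedom rule (rows - 1)(columns - 1) is the standard one for contingency tables and is an addition not stated in the text above.

```python
# Marginal totals taken from the contingency table.
row_totals = {"Not Speeding": 140, "Speeding": 60}
col_totals = {"Under 25": 100, "25 to 55": 80, "Over 55": 20}
grand_total = 200

# Expected frequency of each cell: (row total)(column total) / (grand total).
expected = {
    row: {col: r * c / grand_total for col, c in col_totals.items()}
    for row, r in row_totals.items()
}

# ASSUMPTION (standard rule, not given in the text): degrees of freedom
# for a contingency table are (rows - 1) * (columns - 1).
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

for row, cells in expected.items():
    print(row, cells)
print("df =", degrees_of_freedom)
```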