Date: Mon, 21 Mar 2005
I was wondering if you might be able to help me with a small question regarding the chi-squared test? If you have identified the genotypic frequencies of two alleles at a particular locus (e.g. AA = 0.65, AG = 0.27, GG = 0.08) and you want to compare them with the expected genotypes under Hardy-Weinberg, are there one or two degrees of freedom? If only one, is it necessary to apply the continuity correction (i.e. (|O-E|-0.5)^2/E)? If the p value suggests that the chi-squared value is not significant, does this mean that the data do not deviate from the expected HW frequencies?
1) You are aware that you do the test on expected NUMBERS and not expected frequencies, aren't you?
2) Degrees of freedom. As you say, normally in a goodness of fit test, where you are fitting three numbers to their expected numbers, you take one away from the number of values you are fitting. However, this assumes you are generating the hypothesis to be tested from outside the data. A 9:3:3:1 ratio in a dihybrid dominant gene F2 cross with no linkage is an example (or a 1:2:1 ratio for a single-locus F2, or 1:1:1:1:1:1 for a perfectly fair die); there are 3, 2 and 5 degrees of freedom for each type of test, respectively. A minimal sketch of one such test follows below.
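Here is what such a test looks like in Python; the F2 counts are invented purely for illustration, and the hypothesis (the 9:3:3:1 ratio) comes from Mendelian theory, not from the data:

# Hypothetical F2 counts for a dihybrid cross (invented for illustration).
observed = [90, 28, 33, 9]
ratio = [9, 3, 3, 1]              # the hypothesis comes from OUTSIDE the data
total = sum(observed)

# Expected NUMBERS, not frequencies: total x 9/16, 3/16, 3/16, 1/16.
expected = [total * r / sum(ratio) for r in ratio]

# The usual goodness-of-fit statistic, Sum[((O - E)^2)/E].
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Only the grand total is taken from the data, so d.f. = 4 - 1 = 3.
df = len(observed) - 1
print(f"X^2 = {x2:.2f} on {df} d.f. (0.05 critical value for 3 d.f. is 7.81)")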
Why do you take off one degree of freedom? The reason is that with three cells, you obtain the expected numbers partly from the total you found in the experiment; thus one parameter comes from the data. So only two of the three numbers are actually free to vary; the third is fixed by subtraction from the total.
But when you are fitting a Hardy-Weinberg population ratio for a single gene with two alleles, you are actually obtaining yet another parameter from the data: you are estimating the gene frequency, as well as using the total. This means you are left with only one free parameter (or degree of freedom) from your three numbers, with the total and the gene frequency being the two parameters you estimate from the data. In the 1:2:1 ratio of a Mendelian cross, the hypothesis comes from outside the data; here, more of the information for the ratios comes from inside the data.
Does this explain the degrees of freedom problem?
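To make this concrete, here is a minimal sketch in Python. The genotype counts are an assumption on my part: I have pretended your frequencies came from a sample of 100 individuals (65 AA, 27 AG, 8 GG); with your real data you would plug in the actual counts.

# Hypothetical genotype COUNTS at a two-allele locus (assumed sample of 100).
obs = {"AA": 65, "AG": 27, "GG": 8}
n = sum(obs.values())                      # number of individuals

# Estimate the allele frequency of A from the data itself; this costs a
# degree of freedom on top of the one lost to the total.
p = (2 * obs["AA"] + obs["AG"]) / (2 * n)
q = 1 - p

# Hardy-Weinberg expected NUMBERS: n*p^2, n*2pq, n*q^2.
exp = {"AA": n * p ** 2, "AG": n * 2 * p * q, "GG": n * q ** 2}

x2 = sum((obs[g] - exp[g]) ** 2 / exp[g] for g in obs)

# 3 classes - 1 (for the total) - 1 (for the estimated frequency) = 1 d.f.
df = len(obs) - 2
print(f"X^2 = {x2:.2f} on {df} d.f. (0.05 critical value for 1 d.f. is 3.84)")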
3) Continuity corrections. Statisticians disagree about whether you should use them at all, and then, I think, only on 2x2 contingency tables (see below if you don't know what that means), not in goodness of fit tests. They are used to take account of the fact that we often compare some statistic, such as Sum[((O-E)^2)/E], with a true chi-square distribution, but that small numbers mean the actual distribution of the statistic under the null hypothesis is very bitty, and not very like the chi-square at all. (NB, Sum[((O-E)^2)/E] is NOT itself the chi-square distribution; it only approximates one when N is large.)
But Yates' correction, the method of subtracting 0.5 from |O-E| to make the values less extreme, is very crude if you think about it! I realize that you are often taught to use this correction in introductory statistics courses, but if you read one of the most widely available statistics books, Sokal and Rohlf, they advise against it. It is unnecessary to use a correction if you have reasonable sample sizes, since the 0.5 is small in comparison with the potential O-E values.
Personally, I feel that you should not use them, because if you have enough data to say anything at all, you shouldn't worry about whether your P value is, say, 0.04 or 0.06. If you have P < 0.001, it won't make a difference anyway. The P value is only an approximation, and it is the general order of magnitude you should be worried about, not whether it strays over an arbitrary boundary such as 0.05.
In any case, such corrections should not be used on a goodness of fit test like this, or the totals won't add up properly.
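If you want to see what the correction actually does on a 2x2 table (the kind of table it is usually reserved for), here is a minimal sketch; the counts are invented for illustration, and with totals of this size the conclusion is the same with or without the correction:

# A hypothetical 2x2 table (invented for illustration), analysed with and
# without Yates' correction, i.e. (|O - E| - 0.5)^2 / E per cell.
observed = [[120, 80],
            [70, 130]]
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

# Expected number per cell = row total x column total / grand total.
expected = [[r * c / grand for c in col_tot] for r in row_tot]

cells = [(observed[i][j], expected[i][j]) for i in range(2) for j in range(2)]
x2       = sum((o - e) ** 2 / e for o, e in cells)
x2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)
print(f"uncorrected X^2 = {x2:.2f}, Yates-corrected X^2 = {x2_yates:.2f}")
# Both are far beyond the 1 d.f. 0.05 critical value of 3.84, so the verdict
# does not change.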
Instead of the (O-E)^2/E test, I like Sokal & Rohlf's suggestion that you should use a G-test, another statistic which is also approximately chi-square distributed under the null hypothesis when your sample size is large. The statistic is G = 2 Sum[O ln(O/E)], where ln is the natural logarithm. It has the virtue of being fairly close to chi-square distributed even at quite low sample sizes (I have done simulations to test it), and it has the added justification that it is just twice the natural log of the likelihood ratio for your data against the null hypothesis.
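As a minimal sketch (the 1:2:1 counts are invented for illustration), the two statistics can be computed side by side like this:

import math

def g_statistic(observed, expected):
    # G = 2 * Sum[O * ln(O / E)]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

def x2_statistic(observed, expected):
    # The usual Sum[((O - E)^2)/E], for comparison
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts fitted to a 1:2:1 Mendelian ratio.
observed = [28, 46, 26]
total = sum(observed)
expected = [total / 4, total / 2, total / 4]

print(f"G   = {g_statistic(observed, expected):.3f}")   # on 2 d.f. here
print(f"X^2 = {x2_statistic(observed, expected):.3f}")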
I am not going to go into what likelihood means here; perhaps you have heard of it. But the likelihood ratio itself, and Bayesian probabilities derived from it, are today viewed by most statisticians (including in the UCL stats Dept) as much more important means of inference than the P values that you usually get taught about in Biostats courses.
Also, if you wanted to compare two sets of variables (e.g. in one population the frequency of allele A is 0.8 and allele B is 0.2, and in another population allele A = 0.75 and allele B = 0.25) to see whether the difference between them was significant, what kind of test would you use? -- Kate
4) Your final question is about what to do if we have, say, two populations, each of which was sampled, and you got allele counts of, say, A = 32 and a = 8 in population 1, and A = 60 and a = 20 in population 2. How do you go about testing whether they have the same allele frequencies? (This is the null hypothesis.)
First, let me reiterate that the problem as you set it up for me, purely in terms of frequencies, cannot be tested. You need the actual numbers to do the test; that is why I have used whole numbers above.
What we have here is a "2x2 contingency table", or a "test of homogeneity". If the gene frequencies are homogeneous, the expected numbers are given by the average gene frequency as estimated from the totals. So the frequency of A should be 92/120 in both populations under the null hypothesis. The expected numbers of A and a in population 1 are therefore 40 x 92/120 and 40 x 28/120, and similarly the expected numbers of A and a in population 2 are 80 x 92/120 and 80 x 28/120. This leads to a nice easy rule: in a contingency table, obtain the expected value for each cell by multiplying the row total by the column total and dividing by the overall total.
Finally, you need to know that in a contingency table, the number of degrees of freedom = (r-1)(c-1), where r is the number of rows and c is the number of columns. Here we have a 2x2 table, so there is only 1 degree of freedom.
Here are the complete test results:
Observed numbers:

                     A        a     row totals
  Population 1      32        8         40
  Population 2      60       20         80
  column totals     92       28        120

Expected numbers:

                     A        a
  Population 1    30.67     9.33
  Population 2    61.33    18.67
Degrees of freedom = 1
G-test statistic, G = 0.38
Chi-square statistic, X^2 = 0.37
(n.b. NO Yates correction!)
P>>0.05! So not "significantly" different.
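For completeness, here is a minimal Python sketch that reproduces the numbers above from the observed table:

import math

# Observed allele counts from the example above.
table = [[32, 8],     # population 1: A, a
         [60, 20]]    # population 2: A, a

row_tot = [sum(r) for r in table]            # 40, 80
col_tot = [sum(c) for c in zip(*table)]      # 92, 28
grand = sum(row_tot)                         # 120

# Expected number per cell = row total x column total / grand total.
expected = [[r * c / grand for c in col_tot] for r in row_tot]

cells = [(table[i][j], expected[i][j]) for i in range(2) for j in range(2)]
x2 = sum((o - e) ** 2 / e for o, e in cells)
g = 2 * sum(o * math.log(o / e) for o, e in cells)

df = (len(row_tot) - 1) * (len(col_tot) - 1)         # (2-1)(2-1) = 1
print(f"X^2 = {x2:.2f}, G = {g:.2f}, d.f. = {df}")   # 0.37, 0.38, 1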