16 Hypothesis tests

16.1 For two categorical variables

A common procedure is to test for an association between two categorical variables. We can illustrate this procedure using a tabulation with a \(\chi{}^2\) statistic. (Of course, the variable sex is not necessarily binary valued.)

tabulate sex class, expected chi2
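The statistic and its p-value can also be retrieved after the tabulation, which is handy when you want to report them to a chosen precision. The following is a minimal sketch, assuming the course data set (with the variables sex and class) is already in memory; the display formats are an arbitrary choice.

```stata
* Assumes the course data (variables sex and class) are loaded
tabulate sex class, expected chi2

* tabulate leaves its results in r(): r(chi2) is the test statistic
* and r(p) is the associated p-value
display "chi2 = " %6.3f r(chi2) "   p = " %6.4f r(p)
```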

16.2 Exercise

Create the table with \(\chi{}^2\) statistic and expected values as above. Should you reject the H0 that sex and class are not associated?

16.3 For one continuous and one categorical variable of two levels

If we have one continuous numeric variable and one two-level categorical variable (such as employed vs unemployed) that divides our data into two groups, we can ask whether the mean of the continuous variable differs between the groups (with H0 being that it does not).

If our two groups are independent, then we must first ask whether the variance of the continuous variable is more or less equal between groups. The null hypothesis is that the variances are equal. This is tested with Stata’s robvar command, which compares the variances across groups. We can test the maths scores by sex in our data:

robvar maths, by(sex)

Knowing whether or not we are dealing with groups displaying (more or less) equal variance in the variable of interest, we can go on to conduct an independent samples t-test. The code is

ttest maths, by(sex)

This assumes that we have interpreted the results of robvar to mean that the variance in maths is equal for the two groups; by default, ttest treats the group variances as equal.
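If robvar instead suggests that the variances differ, one common alternative is a t-test that does not pool the variances. A sketch, again assuming the course data are in memory:

```stata
* Assumes the course data (variables maths and sex) are loaded
* unequal drops the equal-variance assumption and uses
* Satterthwaite's approximation for the degrees of freedom
ttest maths, by(sex) unequal

* welch is a closely related option that uses Welch's approximation
ttest maths, by(sex) welch
```

The two options usually lead to the same conclusion; they differ only in how the degrees of freedom are approximated.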

16.4 Exercise

Run the robvar procedure above but for the history and sex variables. What are the three W statistics produced? Which of them tests that the variances are equal for a comparison of means? Is there strong enough evidence in this case to reject the null hypothesis?

Use the ttest command to test the null hypothesis that

\[\mu{}\ english _{\ female \ students} = \mu \ english _{\ male\ students}\]

What conclusion do you draw?

16.5 The paired samples t-test

We can also compare the same group of subjects on two measures to see if the means differ. In this case there is no need to check the variances before conducting the test. For example, we could test whether or not mean scores in English and History differ (with the null hypothesis that they do not):

ttest english == history

Using this procedure, how do English scores compare to History scores and how do English scores compare to Mathematics scores?
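One way to see why no variance check is needed is that the paired test works on a single column of within-student differences. The sketch below shows the equivalence; the variable name diff_eng_hist is made up for illustration and is not part of the course data.

```stata
* Assumes the course data (variables english and history) are loaded
* diff_eng_hist is a hypothetical helper variable
generate diff_eng_hist = english - history

* A one-sample t-test that the mean difference is zero gives the same
* t statistic and p-value as: ttest english == history
ttest diff_eng_hist == 0
```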

16.6 One continuous and one categorical variable of more than two levels

We can compare the level of avxm by teacher, that is to say, test the null hypothesis

\[\mu{}\ avxm \ _{teacher \ one} = \mu{}\ avxm \ _{teacher \ two} = \mu{}\ avxm \ _{teacher \ three} \]

16.6.1 One way ANOVA and post-hoc testing

The Stata command to test the null hypothesis above is

oneway avxm teacher, bonferroni tabulate

This command produces summary statistics for each group, the ANOVA F statistic, its associated probability, and other quantities calculated as part of the ANOVA. In the version given above, the tabulate option adds the table of summary statistics and the bonferroni option adds a table of pairwise comparisons using the Bonferroni correction. We can examine the pairwise comparisons separately, if we wish, with

pwmean avxm, over(teacher) mcompare(bonferroni) effects

This method does not display the ANOVA table itself and the mcompare() option gives us access to a slightly different range of correction options.
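As an illustration of that wider range, Tukey's method is available through mcompare() but is not one of the corrections offered by oneway. A sketch, assuming the same data:

```stata
* Assumes the course data (variables avxm and teacher) are loaded
* Same pairwise comparisons as above, using Tukey's adjustment
* instead of Bonferroni
pwmean avxm, over(teacher) mcompare(tukey) effects
```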

16.7 Two continuous variables

16.8 Correlation

Analysis of two continuous variables begins with calculating the Pearson correlation coefficient, R. This statistic ranges from -1 to +1, with

  • -1 indicating a perfect inverse (negative) linear correlation
  • 0 indicating no linear correlation
  • +1 indicating a perfect positive linear correlation

We should take note that a correlation has not only magnitude and direction, but also an associated hypothesis test, with the null hypothesis that the true correlation is 0. This test gives a p value associated with R.
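As a reminder of what is being computed, the coefficient and the usual test statistic can be written as below. The symbols \(x_i\), \(y_i\) and \(n\) (the paired observations and their number) are notation introduced here for illustration, and the t form of the test assumes the data are approximately bivariate normal.

\[R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad t = R\sqrt{\frac{n-2}{1-R^2}}\]

Under the null hypothesis, \(t\) is compared with a t distribution on \(n-2\) degrees of freedom.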

The code to compute R in Stata is

correlate var1 var2

This computes R for var1 and var2. If you do not specify a variable list, Stata computes correlations between all non-string variables in your data set.
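Note that correlate reports the coefficients only. To see the p value for each coefficient, one option is pwcorr with the sig option. A sketch, assuming the course data are in memory:

```stata
* Assumes the course data (variables maths and history) are loaded
* sig prints the p-value under each coefficient;
* obs prints the number of observations used for each pair
pwcorr maths history, sig obs
```

Unlike correlate, pwcorr uses all available observations for each pair of variables, so the two commands can differ when some values are missing.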

16.9 Exercise

Compute Pearson correlations with significance values for the pairs

  • english-maths
  • english-history

Explain to your learning partner what the results mean to you.

16.9.1 Simple visualisation of correlation

The simplest way to visualise a correlation is with a scatter plot. You may wish to consider, based on your plans for further analysis, which variable to assign to which axis: scatter places the first variable listed on the vertical axis and the second on the horizontal axis. To create a scatter plot you can start with

scatter english history

To add the trend line:

scatter english history || lfit english history

And add a confidence interval:

scatter english history || lfitci english history
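Plots are drawn in the order they are listed, so in the command above the confidence band is drawn over the points. A sketch of the same graph with the order reversed, written in the parenthesis notation that Stata accepts as an alternative to ||:

```stata
* Assumes the course data (variables english and history) are loaded
* The band and fitted line are drawn first, then the points on top
twoway (lfitci english history) (scatter english history)
```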

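Now you can add labels, titles and so on. Because the title below treats English as the predictor, history is listed first (vertical axis) and english second (horizontal axis):

* history on the y axis, english on the x axis, to match the titles
twoway lfitci history english || scatter history english, jitter(5) ///
    title("English as a predictor of History scores") ///
    legend(off) ///
    mcolor(red) ///
    msymbol(Oh) ///
    subtitle("For all students") ///
    xtitle("English exam scores") ///
    ytitle("History exam scores") ///
    scheme(sj)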

Stata has a very large range of graphing commands and options. While they are reasonably complicated, a good way to explore them is through this gallery.

16.10 Exercise

Using any resources you can find, try to find more Stata graph schemes and try at least three on the code above.