12 Some exploratory analysis

Before embarking on the systematic modelling and testing of data, you may wish to explore its broad outline. There are several useful Stata procedures for this task, including:

  • simple visualisations:

    • box plots;
    • histograms;
    • bar charts;

  • summary statistics;

  • tables.

12.1 Simple visualisations

For continuous numeric data you can create box plots and histograms. In your do-file, add the line:

hist maths

You might like to create graphs for a list of variables. You could try first creating a macro containing the variable names and then creating the histograms, as in the following lines:

local conts maths english history
hist `conts'

Unfortunately, this does not work, since hist can only be followed by a single variable. In any case, as soon as Stata creates a new graph, the currently open graph window is replaced. We can avoid both problems by looping over the variables and exporting each histogram as it is created.

Add the following lines to your script:

foreach var in `conts' {
    hist `var'
    * replace overwrites any file left over from an earlier run
    graph export `var'.png, replace
}

Once this loop terminates, you can look in your current directory to find the exported graphs.
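
If you would rather keep the graphs open in Stata than export them, one alternative is the name() option, which stores each graph in memory under its own name so that a new graph does not replace the previous one. The following is a minimal sketch, assuming the data set containing maths, english and history is already loaded. The graph names h_maths, h_english and h_history are illustrative choices, not part of the original script.

```stata
* Assumes the data set containing maths, english and history is in memory
local conts maths english history

foreach var of local conts {
    * name() keeps each graph in memory under its own name;
    * replace overwrites a graph of the same name from an earlier run
    histogram `var', name(h_`var', replace)
}

* List the graphs in memory and any graph files in the current directory
graph dir
```

Each named graph opens in its own window or tab, so the three histograms can be compared side by side.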

You can dig deeper into your data by grouping values by any factor (categorical) variable, for example:

graph box `conts', by(sex)

12.2 Exercise

Create histograms for the english and history data. How similar or dissimilar do you think these data are?

Create box plots of maths for each level of sex. How do you think the male and female maths scores compare?

Using your web searching powers, see if you can generate a box plot of maths scores that is subset by both sex and teacher. You should look out for mention of an option called over.

Having created the box plot by sex above, open Stata’s graph editor and add the title “Maths scores by gender” to the plot.

12.3 Summary statistics

Quick summary statistics for continuous numeric variables can be calculated with the summarize command. Try the command

summarize maths

You will see that this gives a brief summary of the variable.

You can add detail to the summary with the detail option:

summarize maths english, detail

Summary statistics for continuous numeric variables can also be created with the tabstat command.

tabstat maths english

and you can specify statistics with the statistic option:

tabstat maths english, statistic(median var skew)
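
The summary can also be broken down by a categorical variable. As a sketch, assuming the factor variable sex used for the box plots in section 12.1, tabstat's by() option reports the chosen statistics separately for each group:

```stata
* Mean, standard deviation and median of each score, separately by sex
* (assumes maths, english and sex exist in the data set in memory)
tabstat maths english, statistics(mean sd median) by(sex)
```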

A Stata command that calculates statistics displays some default output, which is described in its help file. However, many routines actually compute more statistics than they display. These quantities are stored in the return list for the command and can be accessed after it runs. For descriptive statistics, the list is called r(). The brackets indicate that we can select a particular statistic from the list by name, for example r(mean). Consider:

summarize maths

The default output looks like:

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       maths |         30    53.76667    6.295501         39         64

If we now run the command return list, we see:

scalars:
                  r(N) =  30
              r(sum_w) =  30
               r(mean) =  53.76666666666667
                r(Var) =  39.63333333333333
                 r(sd) =  6.295501039101918
                r(min) =  39
                r(max) =  64
                r(sum) =  1613

These further statistics can be viewed individually using, for example,

di r(Var)

and in more advanced procedures you can use them in collections and in mathematical expressions.
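
As a sketch of using stored results in an expression: the results in r() are overwritten by the next command that stores results there, so it is common either to use them straight away or to copy the ones you need into a local macro. The names m and maths_c below are illustrative, and the example assumes the same data set is loaded.

```stata
* Run summarize without printing its table; r() is still filled in
quietly summarize maths

* Use stored results directly in an expression (coefficient of variation)
display "Coefficient of variation: " r(sd) / r(mean)

* Copy the mean into a local macro before another command overwrites r()
local m = r(mean)

* Create a mean-centred copy of the variable
generate maths_c = maths - `m'
```

Note that generate will stop with an error if maths_c already exists, so drop the variable first if you rerun these lines.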

12.4 Exercise

Compute the summary statistics with detail for history. Use the return list from the command and then display the kurtosis of the variable.