12 Some exploratory analysis
Before embarking on the systematic modelling and testing of data, you may wish to explore its broad outline. There are several useful Stata procedures for this task, including:
- simple visualisations:
  - box plots;
  - histograms;
  - bar charts;
- summary statistics;
- tables.
12.1 Simple visualisations
For continuous numeric data you can create box plots and histograms. In your do-file, add the line
hist maths
You might like to create graphs for a list of variables. A natural first attempt is to create a macro containing the variable names and then pass it to hist, as in the following lines:
local conts maths english history
hist `conts'
Unfortunately, this does not work, since hist can only be followed by a single variable. In any case, as soon as Stata creates a new graph, the currently open graph window is destroyed. We can avoid both problems by creating and exporting each histogram inside a loop in a program (see 7).
Add the following lines to your script:
foreach var in `conts' {
    hist `var'
    * replace overwrites an existing file, so the script can be run more than once
    graph export `var'.png, replace
}
Once this loop terminates, you can look in your current directory to find the exported graphs.
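If you would also like the graphs to stay open in Stata rather than only being saved to disk, one option is to give each graph its own name. The following is a minimal sketch, assuming the dataset containing maths, english and history is already in memory; the hist_ prefix used for the graph and file names is purely illustrative.

```stata
* Assumes the dataset containing maths, english and history is in memory
local conts maths english history

foreach var of local conts {
    * name() keeps each graph in memory under its own name,
    * so a new histogram does not replace the previous one
    histogram `var', name(hist_`var', replace)
    * graph export saves the most recently drawn graph;
    * replace overwrites an existing file of the same name
    graph export "hist_`var'.png", replace
}

* List the graphs currently held in memory
graph dir
```

Note that a local macro only exists while the do-file (or the selected block of lines) that defines it is running. If you run the loop on its own from the do-file editor, include the local line in the selection, otherwise `conts' will be empty.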
You can dig deeper into your data by grouping values by a factor (categorical) variable, for example:
graph box `conts', by(sex)
12.2 Exercise
Create histograms for the english and history data. How similar or dissimilar do you think these data are?
Create box plots of maths for each level of sex. How do you think the male and female maths scores compare?
Using your web searching powers, see if you can generate a box plot of maths scores that is subset by both sex and teacher. You should look out for mention of an option called over.
Having created the box plot by sex above, open Stata’s graph editor and add the title “Maths scores by gender” to the plot.
12.3 Summary statistics
Quick summary statistics for continuous numeric variables can be calculated with the summarize command. Try the command
summarize maths
You will see that this gives a brief summary of the variable.
You can add detail to the summary with the detail option:
summarize maths english, detail
Summary statistics for continuous numeric variables can also be created with the tabstat command:
tabstat maths english
You can choose which statistics are reported with the statistics option (abbreviated here to statistic):
tabstat maths english, statistic(median var skew)
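Just as box plots can be grouped by a factor variable, tabstat can report its statistics separately for each group. The lines below are a short sketch, assuming maths, english and sex are variables in the dataset in memory.

```stata
* Assumes maths, english and sex are variables in the dataset in memory

* by() reports the requested statistics for each level of sex, plus a total
tabstat maths english, statistics(mean sd median) by(sex)

* columns(statistics) puts the statistics in columns, one row per variable
tabstat maths english, statistics(n mean sd min max) columns(statistics)
```

The second layout can be easier to read when you summarise many variables at once.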
A Stata command that calculates statistics will display some default output. You can check what is displayed using help. However, many routines actually compute more statistics than are displayed. These quantities are stored in the return list for the command and can be accessed after it runs. For descriptive statistics, the list is called r(). The brackets indicate that we can select a particular statistic from the list by name, for example r(mean). Consider
summarize maths
The default output looks like
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       maths |         30    53.76667    6.295501         39         64
If we now run the command return list, we see
scalars:
r(N) = 30
r(sum_w) = 30
r(mean) = 53.76666666666667
r(Var) = 39.63333333333333
r(sd) = 6.295501039101918
r(min) = 39
r(max) = 64
r(sum) = 1613
These stored results can be viewed individually using, for example,
di r(Var)
and in more advanced procedures you can use them in collections and in mathematical expressions.
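As an illustration of using stored results in expressions, the sketch below computes the coefficient of variation of maths and then builds a standardised score. It assumes maths is a numeric variable in the dataset in memory; maths_z is a hypothetical variable name chosen for this example.

```stata
* Assumes maths is a numeric variable in the dataset in memory
quietly summarize maths

* Copy the results into local macros straight away, because the next
* command that stores results in r() will overwrite them
local m  = r(mean)
local sd = r(sd)

* Use the stored results in an expression: coefficient of variation
display "CV of maths = " `sd' / `m'

* Use them to build a new variable: standardised maths score
* (maths_z is a hypothetical variable name)
generate maths_z = (maths - `m') / `sd'
summarize maths_z
```

With the summary shown earlier (mean 53.77, standard deviation 6.30) the coefficient of variation comes out at roughly 0.117, and the standardised variable should have a mean close to 0 and a standard deviation close to 1.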