10 Selecting
Selection of either variables or cases in Stata is often presented as applied to a single command or procedure. So, we use a list of variables after a command name and we may or may not then apply some criterion to filter the cases displayed. This works, but it can be needlessly repetitive, and sometimes we should consider
- temporarily reducing the data set in memory and operating over all data;
- creating a filter variable to reduce complex selection expressions.
In what follows I mainly treat selecting data in the “traditional” way, but I will suggest that judicious use of preserve and restore, as well as the practice of storing selection criteria in filter variables, can improve your Stata experience.[4]
10.1 Selecting variables
We will use the command list, which displays rows of variable values, to illustrate the selection of variables.
With most commands, variables can be included in the varlist that follows a command name. So on the command line we can type
list maths english history
which displays all the values for those three variables. Much of the time this is the only selection of variables you need.
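As a minimal sketch of a varlist in use, the lines below load the auto data set that ships with Stata as a stand-in for your own data. That substitution is an assumption made for illustration: price, mpg and weight take the place of maths, english and history. Note that sysuse auto, clear replaces whatever data are in memory, so try it in a fresh session. The in 1/10 range restricts the listing to the first ten observations.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* List every observation for three named variables
list price mpg weight

* List the same variables for the first ten observations only
list price mpg weight in 1/10
```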
But there are times when you wish to select a subset of variables for manipulation, perhaps for a series of operations. In this case we can use preserve and restore.
The command preserve takes a snapshot of a data set. If we then manipulate or modify the data, we can return to the snapshot state with the command restore. Between the two, you can use drop varlist to remove variables from the workspace or keep varlist to specify the variables to be kept in the workspace.
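The snapshot-and-return cycle can be sketched as follows. This is an illustration on Stata's bundled auto data set, not on the data used in this chapter, and the choice of variables is arbitrary. The system value c(k) reports the number of variables currently in memory, which makes the effect of each step visible.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* Take a snapshot of the full data set
preserve

* Work with a reduced set of variables
keep make price mpg
describe
display "variables after keep: " c(k)

* Return to the snapshot: the dropped variables are back
restore
display "variables after restore: " c(k)
```

If these lines are run as a do-file and the restore is left out, Stata restores the preserved data automatically when the do-file finishes, so it is clearer to write the restore explicitly at the point where you want the full data back.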
10.2 Exercise
Use the preserve command to take a snapshot of your data and then drop the variables surname and sex. List the first ten observations in your data. Use restore to return to your original data. List the first ten observations again. Describe to your rubber duck[5] the effect of preserve and restore.
10.3 Selecting cases
If you need to select cases, that is, rows from your data, you should use the if qualifier in your command. We will use list again to illustrate. Type the following on the command line
list surname english maths if english > 60 & maths < 50
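One caveat applies to any condition of this kind. It is a general Stata rule rather than anything specific to this data set: numeric missing values count as larger than every number, so a condition such as english > 60 is also true for cases where english is missing. The sketch below shows the difference on Stata's bundled auto data set, used here as a stand-in because its rep78 variable has some missing values.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* Missing values of rep78 satisfy "> 4", so they are counted here
count if rep78 > 4

* Excluding missing values explicitly gives a smaller count
count if rep78 > 4 & !missing(rep78)

* The same guard works in any command that accepts an if qualifier
list make rep78 if rep78 > 4 & !missing(rep78)
```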
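We can add wildcards to our selection criteria using regular expressions. To do this we use Stata’s regexm() function. The following example illustrates the use of regexm() in a compound condition on list.[6] The pattern "^B" matches surnames that begin with B, and the leading ! negates the match, so only surnames that do not begin with B are listed.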
list surname english maths if !regexm(surname,"^B") & english > 60 & maths < 60
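To see the pattern match and its negation side by side, here is a small sketch on Stata's bundled auto data set, where the string variable make plays the role of surname. The numeric cut-offs are arbitrary values chosen for illustration.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* regexm(s, re) returns 1 when string s matches the regular expression re
* "^B" anchors the match to the start of the string: makes beginning with B
list make price if regexm(make, "^B")

* Negate the match with ! and combine it with numeric conditions using &
list make price mpg if !regexm(make, "^B") & price > 10000 & mpg < 20
```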
10.4 Using a filter variable
A filter variable is a variable created to indicate membership of a sub-group of your data. Using the generate command with if conditions you can reduce a complex selection operation to a simpler expression. For example, suppose that you wish to select cases where
- teacher is three;
- maths is less than 55;
- history is greater than 55.
If we first try with a list command, we will write
list if teacher == 3 & maths < 55 & history > 55
You should find that this lists just three cases. If we want to operate repeatedly over just these cases for some part of our analysis, then rather than writing this complex expression each time, we can generate a variable to act as a filter
generate filtervar = 1 if teacher == 3 & maths < 55 & history > 55
and now the selection condition for further operations is reduced to
list if filtervar == 1
This method requires some discipline to remove filter variables (with drop) when their work is done.
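A common variation, offered here as a sketch rather than as the method used above, is to assign the logical expression itself, which gives a variable coded 1 for cases that match and 0 for all the others instead of 1 and missing. The example uses Stata's bundled auto data set; the variable name infilter and the conditions are hypothetical choices for illustration.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* The logical expression evaluates to 1 (true) or 0 (false) for each case
generate byte infilter = (foreign == 1 & mpg > 25 & price < 6000)

* Check how many cases fall in each group
tabulate infilter

* Reuse the filter in later commands
list make price mpg if infilter == 1
summarize price if infilter == 1

* Tidy up when the filter has done its work
drop infilter
```

With a 0/1 coding, tabulate shows the size of both groups, which is a quick check that the filter selects what you expect.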
10.5 The uses of _all
Stata has a built-in shorthand named _all that stands for all the variables currently in memory. It is a reserved name that commands accept wherever a varlist is expected.
The _all shorthand is useful on its own, but it can be more useful combined with drop and keep. The command drop varlist removes variables from the workspace, while keep varlist drops all but the named variables.
So, if we want to produce summary statistics for all continuous variables in our data, we can use keep followed by the list of names and then calculate the summaries for _all.
10.6 Exercise
Add the following lines to your script
preserve
keep maths english history
summarize _all
restore
What is the effect of these lines?
How many variables are in working memory after the keep command? How many variables are in working memory after the restore command?
Add the following lines to your script
preserve
keep maths english history
tabstat _all, statistics(mean sd var kurt skew)
restore
Answer these questions:
- What is the skewness of the mathematics scores?
- Which scores show more variability, English or History?
- Which subject has the lowest mean score?
10.7 Creating a custom variable list
Since many commands take a list of variables to operate on, it can be useful to create a specific list of variables that you can easily refer to repeatedly. We will do this with a Stata macro. Stata macros are programming variables or, if you prefer, containers for text. Stata has both local and global macros and for the most part you will use local macros in your scripts.
Add the following lines to your script
local conts maths english history
summarize `conts'
Since in this case we don’t drop any variables, we don’t need to use preserve and restore to work on a subset of our data unless we otherwise transform any values.
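The sketch below shows a few more things a macro-held variable list can do. It uses Stata's bundled auto data set as a stand-in, with price, mpg and weight in place of the exam scores, and the macro name conts is kept from the example above. Run the lines together as one do-file: a local macro exists only while the do-file or program that defined it is running, so a line run on its own will find the macro empty.

```stata
* Load Stata's bundled example data (this replaces the data in memory)
sysuse auto, clear

* Store a variable list in a local macro
local conts price mpg weight

* A macro is dereferenced with a left quote (`) and a right quote (')
display "`conts'"
summarize `conts'

* Loop over the variables named in the macro
foreach v of local conts {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean)
}
```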
[4] How do I select a subset of observations using a complicated criterion?
[5] Your rubber duck may be an actual rubber duck and it may require some imagination to talk to your duck about Stata, but it will work! Alternatively, you may have a learning partner or study buddy and you can exploit them to listen to your explanation.
[6] What are regular expressions and how can I use them in Stata?