10 Selecting

Selection of either variables or cases in Stata is often presented as applied to a single command or procedure. So, we use a list of variables after a command name and we may or may not then apply some criterion to filter the cases displayed. This works. It is, however, needlessly repetitive when the same selection is reused across several commands, and sometimes we should consider

  • temporarily reducing the data set in memory and operating over all data;
  • creating a filter variable to reduce complex selection expressions.

In what follows I mainly treat selecting data in the “traditional” way, but I will suggest that judicious use of preserve and restore, as well as the practice of storing selection criteria in filter variables, can improve your Stata experience.[4]

10.1 Selecting variables

We will use the command list, which displays rows of variable values, to illustrate the selection of variables.

With most commands, variables can be included in the varlist that follows a command name. So on the command line we can type

list maths english history

which displays all the values for those three variables. Much of the time this is the only selection of variables you need.

But there are times when you wish to work with a subset of variables over a whole series of operations. In this case we can use preserve and restore.

The command preserve takes a snapshot of the data set in memory. If we then manipulate or modify the data, we can return to the snapshot state with the command restore. After preserve, you can use drop varlist to remove variables from the workspace or keep varlist to specify the variables to be kept in the workspace.
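As a minimal sketch of this pattern, the lines below build a small toy data set (the five rows are invented for illustration and only borrow the variable names used in this chapter), keep three variables, and then restore the full set. The block is meant to be run from a do-file, and note that clear discards whatever is currently in memory.

```stata
* Hypothetical toy data standing in for the course data set.
* WARNING: clear discards the data currently in memory.
clear
input str10 surname str1 sex teacher maths english history
"Adams" "F" 1 48 72 60
"Baker" "M" 3 52 65 58
"Brown" "F" 2 45 81 49
"Clark" "M" 3 67 55 70
"Davis" "F" 3 50 62 61
end

preserve                       // snapshot of all six variables
keep surname maths english     // work with three variables only
describe, short                // reports 3 variables
list in 1/3                    // first three observations
restore                        // return to the snapshot
describe, short                // reports 6 variables again
```

Between preserve and restore every command sees only the three kept variables, so no varlist has to be repeated.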

10.2 Exercise

Use the preserve command to take a snapshot of your data and then drop the variables surname and sex. List the first ten observations in your data. Use restore to return to your original data. List the first ten observations in your data. Describe to your rubber duck[5] the effect of preserve and restore.

10.3 Selecting cases

If you need to select cases, that is, rows from your data, you should use the if qualifier in your command. We will use list again to illustrate. Type the following on the command line

 list surname english maths if english > 60 & maths < 50
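One point worth checking when you write conditions like this is how missing values behave: Stata treats a numeric missing value as larger than any number, so a test such as english > 60 is also true when english is missing. The sketch below uses invented toy data (not the course data set) to show the difference; clear discards the data in memory, and the block is intended for a do-file.

```stata
* Hypothetical toy data; Baker has a missing english score.
* WARNING: clear discards the data currently in memory.
clear
input str10 surname maths english
"Adams" 48 72
"Baker" 52 .
"Brown" 45 81
"Clark" 67 55
end

count if english > 60                       // 3: Adams, Brown and Baker (missing)
count if english > 60 & !missing(english)   // 2: Adams and Brown

* The same guard inside a compound condition
list surname english maths if english > 60 & !missing(english) & maths < 50
```

If your data have no missing values the guard changes nothing, but it is a cheap safeguard whenever a condition uses > or >=.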

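We can add wildcards to our selection criteria using regular expressions. To do this we use Stata’s regexm() function, which returns 1 when a string matches a pattern and 0 otherwise. The following example uses regexm() in a compound condition on list.[6] The pattern "^B" matches surnames that begin with B and the leading ! negates the match, so the command lists cases whose surname does not begin with B.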

list surname english maths if !regexm(surname,"^B") & english > 60 & maths < 60
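To see what each part of such a condition contributes, it can help to try the pieces separately. The following sketch does this on invented toy data (the names and scores are made up for illustration); clear discards the data in memory, and the block is meant to be run from a do-file. Matching with regexm() is case sensitive.

```stata
* Hypothetical toy data for trying out patterns.
* WARNING: clear discards the data currently in memory.
clear
input str10 surname maths english
"Adams"  48 72
"Baker"  52 65
"Brown"  45 81
"Clark"  67 55
"Abbott" 58 90
end

list surname if regexm(surname, "^B")      // begins with B: Baker, Brown
list surname if !regexm(surname, "^B")     // does not begin with B: Adams, Clark, Abbott
list surname if regexm(surname, "^[AC]")   // begins with A or C: Adams, Clark, Abbott
list surname if regexm(surname, "s$")      // ends with s: Adams

* Compound condition in the style of the example above: Adams and Abbott
list surname english maths if !regexm(surname, "^B") & english > 60 & maths < 60
```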

10.4 Using a filter variable

A filter variable is a variable created to indicate membership of a sub-group of your data. Using the generate command with if conditions you can reduce a complex selection operation to a simpler expression. For example, suppose that you wish to select cases where

  • teacher is three;
  • maths is less than 55;
  • history is greater than 55.

If we first try with a list command, we will write

list if teacher == 3 & maths < 55 & history > 55

You should find that this lists just three cases. If we want to operate repeatedly over just these cases for some part of our analysis, then rather than writing this complex expression each time, we can generate a variable to act as a filter

generate filtervar = 1 if teacher == 3 & maths < 55 & history > 55

and now the selection condition for further operations is reduced to

list if filtervar == 1

This method requires some discipline to remove filter variables (with drop) when their work is done.
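A common variation, sketched here on invented toy data rather than the course data set, stores the condition itself so that the filter is 1 for selected cases and 0 for all others instead of 1 and missing. The name filtervar follows the text above; the !missing() guard is an addition of this sketch, because a missing history score would otherwise count as greater than 55. The block is meant for a do-file, and clear discards the data in memory.

```stata
* Hypothetical toy data; Davis has a missing history score.
* WARNING: clear discards the data currently in memory.
clear
input str10 surname teacher maths history
"Adams" 3 48 60
"Baker" 3 52 58
"Brown" 2 45 70
"Clark" 3 67 70
"Davis" 3 50 .
end

* A logical expression evaluates to 1 (true) or 0 (false),
* so it can be stored directly as a 0/1 indicator.
generate byte filtervar = teacher == 3 & maths < 55 & history > 55 & !missing(history)

list if filtervar == 1          // Adams and Baker
summarize maths if filtervar    // shorthand, safe only because filtervar is never missing
drop filtervar                  // tidy up when the filter's work is done
```

With the 1-or-missing version created in the text, keep writing if filtervar == 1: a bare if filtervar would also select the missing values, because Stata treats any nonzero value, including missing, as true.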

10.5 The uses of _all

Stata has a built-in reserved name, _all, that stands for all the variables currently in memory wherever a list of variables is expected.

The _all keyword is useful on its own, but it can be more useful combined with drop and keep. The command drop varlist removes variables from the workspace, while keep varlist drops all but the named variables.

So, if we want to produce summary statistics for all continuous variables in our data, we can use keep followed by the list of names and then calculate the summaries for _all.

10.6 Exercise

Add the following lines to your script

    preserve
    keep maths english history
    summarize _all
    restore

What is the effect of these lines?

How many variables are in working memory after the keep command? How many variables are in working memory after the restore command?

Add the following lines to your script

preserve
keep maths english history
tabstat _all, statistics(mean sd var kurt skew)
restore

Answer these questions:

  1. What is the skewness of the mathematics scores?
  2. Which scores show more variability, English or History?
  3. Which subject has the lowest mean score?

10.7 Creating a custom variable list

Since many commands take a list of variables to operate on, it can be useful to create a specific list of variables that you can easily refer to repeatedly. We will do this with a Stata macro. Stata macros are programming variables or, if you prefer, containers for text. Stata has both local and global macros and for the most part you will use local macros in your scripts.

Add the following lines to your script

local conts maths english history
summarize `conts'

Since in this case the macro only names the variables and we don’t drop any, we don’t need to use preserve and restore to work on a subset of our data, unless we also transform values that we want to recover later.
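The same macro can be reused with other commands or looped over. The sketch below assumes invented toy data and reuses the macro name conts from the text; clear discards the data in memory. Because a local macro exists only while the do-file that defines it is running, run the whole block together rather than line by line.

```stata
* Hypothetical toy data standing in for the course data set.
* WARNING: clear discards the data currently in memory.
clear
input str10 surname teacher maths english history
"Adams" 1 48 72 60
"Baker" 3 52 65 58
"Brown" 2 45 81 49
"Clark" 3 67 55 70
"Davis" 3 50 62 61
end

local conts maths english history

* Reuse the list with different commands
summarize `conts'
tabstat `conts', statistics(mean sd) by(teacher)

* Loop over the variables named in the macro
foreach v of local conts {
    quietly summarize `v'
    display "`v': mean = " %5.1f r(mean)
}
```

If the list of variables changes, only the local line needs editing; every command that refers to the macro picks up the new list.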


  4. How do I select a subset of observations using a complicated criterion?

  5. Your rubber duck may be an actual rubber duck and it may require some imagination to talk to your duck about Stata, but it will work! Alternatively, you may have a learning partner or study buddy and you can exploit them to listen to your explanation.

  6. What are regular expressions and how can I use them in Stata?