9 Generating new values and recoding existing values

9.1 New values

There are two commands that generate new values in Stata: generate (which we have already seen, abbreviated to gen) and egen (extended generate). The first is a very fast, basic Stata command and should be used whenever simple arithmetic or other manipulation of your data will create the new values, for example

generate avxm = (maths + english + history)/3

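The second is a more complicated command that applies one of a list of pre-defined functions to your data. Its rowmean() function, for example, gives the same result as above whenever all three scores are present (it ignores missing scores, whereas the arithmetic version returns missing if any score is missing):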

egen avxm = rowmean(maths english history)

The full list of egen functions is given in the Stata help (type help egen).
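
The two commands differ in how they treat missing scores, and a toy dataset can make this concrete. The sketch below is illustrative only: the data values and the variable names avxm_gen and avxm_egen are made up, and preserve and restore are used so that the data already in memory are put back afterwards. It is best run from a do-file, where the // comments are allowed.

preserve
clear
input maths english history
60 70 80
50 . 70
end
generate avxm_gen = (maths + english + history)/3 // missing if any score is missing
egen avxm_egen = rowmean(maths english history) // mean of the non-missing scores
list
assert missing(avxm_gen) in 2
assert avxm_egen == 60 in 2
restore

The second student has no english score, so avxm_gen is missing for that row while avxm_egen is 60, the mean of 50 and 70.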

The gen command can be combined with replace to build and modify variables in several steps. For example, we may generate a new value and then conditionally set it to missing for some cases (don’t run this code!):

generate avxm = (maths + english + history)/3

replace avxm = .a if avxm < 40
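
The value .a is one of the extended missing values in Stata (.a to .z), which can be used to record why a value is missing. Here is a minimal sketch of the same generate-then-replace pattern on made-up data. It uses the hypothetical variable names score and flagged so that it does not create avxm, and wraps the toy data in preserve and restore so the data in memory are put back afterwards. Run it from a do-file.

preserve
clear
input score
35
55
72
end
generate flagged = score
replace flagged = .a if flagged < 40 // only the first case is below 40
count if missing(flagged) // missing() is true for . and for .a to .z
list
restore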

9.2 Exercise

Use either generate or egen to create a new variable avxm that is the mean of the three scores, maths, english, and history. Round the result to show no decimal places. (Read the Stata help on the round() function.)

9.3 Recoding values

It is not unusual to need to change the coding of a measure. For instance, we may have a continuous variable such as avxm that we wish to recode into groups (giving a ranked variable that we will call stream). We can do this in a number of ways in Stata. I will use the most obvious, if verbose, method first.

Suppose that, having calculated avxm as the average examination score for each student, we now want to group the students according to their avxm. We will use the Stata if qualifier to do this.

When you undertake a more complicated data management task, it is very helpful to write out in pseudo-English what you want to do. So, I want to apply this rule to my data:

    If avxm >= 60, stream = “high”,
        if avxm < 60 & avxm >= 50, stream = “mid”,
            if avxm < 50, stream = “low”

I have purposefully written this out in the most explicit way. Now I’m going to reduce the complexity a little:

    If avxm >= 60, stream = “high”,
        if avxm < 50, stream = “low”,
            else stream = “mid”

In this version I don’t have to spell out the compound condition. Win!

Now, to write this in Stata, I will move the default ‘else’ condition to the top and use gen and replace with if to get my new variable:

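gen stream = 2 if !missing(avxm) // the else or default condition; missing scores stay missing
replace stream = 3 if avxm >= 60 & !missing(avxm) // cut off the top (Stata treats missing as larger than any number)
replace stream = 1 if avxm < 50 // cut off the bottom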

I often prefer a numeric code to a string variable. Notice that I have used 3 for “high” against my own prejudice that it should be 1. In this way Stata and I will agree about ordering the data. I will explicitly label the data later to make it easier to read.
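
One way to check a recode like this is to ask Stata to confirm each branch of the rule. The lines below are a sketch that assumes stream and avxm exist as created above. The assert command stops with an error if any case breaks the stated condition, and tabstat shows the range of avxm and the number of cases within each stream.

assert stream == 3 if avxm >= 60 & !missing(avxm)
assert stream == 2 if avxm >= 50 & avxm < 60
assert stream == 1 if avxm < 50
tabstat avxm, by(stream) statistics(min max n)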

9.4 Exercise

Examine the first ten cases in the data. Do they appear to be assigned to the correct stream according to the rule above?

Read these instructions on using cut(), which is an egen function, and then use it to create a new variable with the same distribution as stream. Give the variable any new name you like.