9 Generating new values and recoding existing values
9.1 New values
There are two commands that generate new values in Stata: generate (which we have already seen, abbreviated to gen) and egen (extended generate). The first is a very fast, basic Stata command and should be used whenever simple arithmetic or other manipulation of your data will create the new values, for example:
generate avxm = (maths + english + history)/3
The second is a more complicated command that applies one of a list of pre-defined functions to your data. For example, when no scores are missing, the same result as above can be created by:
egen avxm = rowmean(maths english history)
The full list of functions for egen is given in the Stata help (type help egen).
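The two commands only agree when every score is present. The sketch below uses a small made-up dataset (the values and the names avxm_gen and avxm_egen are mine, for illustration only) to show where they differ. Note that clear drops whatever data is in memory, so try this in a fresh Stata session rather than on top of the course data.

```stata
* Toy data, not the course dataset. -clear- drops the data in memory.
clear
input maths english history
70 65 60
55 .  45
40 50 60
end

* Plain arithmetic: any missing score makes the whole result missing
generate avxm_gen = (maths + english + history)/3

* rowmean() averages whichever scores are present in each row
egen avxm_egen = rowmean(maths english history)

list
```

In the second row avxm_gen is missing, while avxm_egen is 50, the mean of the two scores that are present. Which behaviour you want depends on how you mean to treat students with an incomplete set of marks.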
The gen command can be used in conjunction with replace to carry out more complex generation and modification of variables. For example, we may generate a new value and then conditionally make it missing for some cases (don’t run this code!):
generate avxm = (maths + english + history)/3
replace avxm = .a if avxm < 40
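If you would like to see the generate-then-replace pattern in action without touching avxm, here is a minimal sketch on made-up data (the variables score and flagged are hypothetical, not part of the course dataset). As before, clear drops the data in memory, so use a fresh session.

```stata
* Toy data, not the course dataset. -clear- drops the data in memory.
clear
input score
35
40
72
end

* Copy the score, then set low values to the extended missing value .a
generate flagged = score
replace flagged = .a if flagged < 40

list
count if flagged == .a   // counts the one case that was below 40
```

Only the first case is changed: 40 is not less than 40, so it keeps its value. Using .a rather than the plain . lets you tell these deliberately blanked cases apart from values that were missing to begin with.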
9.2 Exercise
Use either generate or egen to create a new variable avxm that is the mean of the three scores, maths, english, and history. Round the result to show no decimal places. (Read the Stata help on the round() function.)
9.3 Recoding values
It is not unusual to need to change the coding of a measure. For instance, we may have a continuous variable like avxm that we wish to recode into groups (giving a ranked variable that we will call stream). We can do this in a number of ways in Stata. I will use the most obvious, if verbose, method first.
Suppose that, having calculated avxm as the average examination score for each student, we now want to group the students according to their avxm. We will use the Stata if qualifier to do this.
When you undertake a more complicated data management task, it is very helpful to write out in pseudo-English what you want to do. So, I want to apply this rule to my data:
If avxm >= 60, stream = “high”,
if avxm < 60 & avxm >= 50, stream = “mid”,
if avxm < 50, stream = “low”
I have purposefully written this out in the most explicit way. Now I’m going to reduce the complexity a little:
If avxm >= 60, stream = “high”,
if avxm < 50, stream = “low”,
else stream = “mid”
In this version I don’t have to spell out the compound condition. Win!
Now, to write this in Stata, I will move the default ‘else’ condition to the top and use gen and replace with if to get my new variable:
gen stream = 2 if !missing(avxm) //the else or default condition
replace stream = 3 if avxm >= 60 & !missing(avxm) //cut off the top; Stata treats missing as larger than any number
replace stream = 1 if avxm < 50 //cut off the bottom
I often prefer a numeric code to a string variable. Notice that I have used 3 for “high” against my own prejudice that it should be 1. In this way Stata and I will agree about ordering the data. I will explicitly label the data later to make it easier to read.
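Before trusting a recode like this, I find it useful to try it on a few hand-picked values that sit on or next to the cut points. The sketch below does that with made-up data (the values are mine, chosen for illustration) and includes one missing score. It assumes we want a missing avxm to give a missing stream. Stata treats missing values as larger than any number, so a bare avxm >= 60 would otherwise put that student in the top stream. As with any use of clear, run this in a fresh session.

```stata
* Toy data, not the course dataset. -clear- drops the data in memory.
clear
input avxm
49.9
50
59.9
60
.
end

generate stream = 2 if !missing(avxm)              // default: mid
replace stream = 3 if avxm >= 60 & !missing(avxm)  // top, excluding missing
replace stream = 1 if avxm < 50                    // bottom

list
```

The five cases should come out as 1, 2, 2, 3 and missing, which matches the rule: 50 belongs to the middle stream and 60 to the top one.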
9.4 Exercise
Examine the first ten cases in the data. Do they appear to be assigned to the correct stream according to the rule above?
Read these instructions on using cut() (an egen function) and then use it to create a new variable with the same distribution as stream. Give the variable any new name you like.