8 Detecting and correcting
In the data as you find it, there is an anomalous maths score. We can find this by simple inspection of the data because we have a small data set and few variables.
We could, for example, use list with an if condition, like this:
list surname maths if maths > 100
which would list any cases with a maths value greater than the allowable maximum.
The comparison operator we use here is greater than (>). We must remember that the operator for equality is == (two equals signs); a single = is used for assignment.
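As a minimal sketch of how the comparison operators behave in an if condition, the lines below build a small toy data set. The surnames and scores are invented for illustration and are not the tutorial's data.

```stata
* Toy data: hypothetical surnames and scores, not the tutorial's data set
clear
input str10 surname maths
"ADAMS" 45
"BAKER" 570
"CLARK" 100
end

* Greater than: lists only the case above the allowable maximum
list surname maths if maths > 100

* Equality uses two equals signs: lists only the case scoring exactly 100
list surname maths if maths == 100

* count reports how many cases satisfy the condition
count if maths > 100
```

The count command is useful on larger data sets, where a listing would be too long to read.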
If we had a larger data set with many variables this would be much more difficult. We will write some code to help us detect anomalous values.
In Stata we can use programming functions that return values (see footnote 3). Many functions return true (1) or false (0). There is a function inrange(variable, min, max) that returns true if variable is greater than or equal to min and less than or equal to max. We negate a function's result with the operator meaning not: !.
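One way to check how inrange treats its boundaries is to try it on single numbers with display. This is only a sketch; the values 0, 100 and 101 are chosen to match the score range used in this section.

```stata
* Shows 1: the lower bound itself counts as in range
display inrange(0, 0, 100)

* Shows 1: the upper bound itself counts as in range
display inrange(100, 0, 100)

* Shows 0: 101 is above the maximum
display inrange(101, 0, 100)

* Shows 1: ! turns the false result above into true
display !inrange(101, 0, 100)

* Shows 0: a missing value (.) is not in range
display inrange(., 0, 100)
```

Because a missing value is not in range, a condition written as !inrange(maths,0,100) is also true for cases where maths is missing.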
8.1 Exercise
In your script add the following lines:
gen anomaly = 0
replace anomaly = 1 if !inrange(maths,0,100)
The Stata command gen creates a new variable in our data set (for all cases) and gives it an initial value. The command replace changes the values of an existing variable, in this case only for the cases that meet a condition.
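The gen and replace pattern can be tried on a toy data set before running it on the real one. The sketch below uses invented surnames and scores, and also shows a one-step alternative (the variable name anomaly2 is hypothetical) that stores the 0/1 result of the expression directly.

```stata
* Toy data: hypothetical surnames and scores, not the tutorial's data set
clear
input str10 surname maths
"ADAMS" 45
"BAKER" 570
"CLARK" -3
"DAVIS" 100
end

* Two-step form: start every case at 0, then flag out-of-range cases
gen anomaly = 0
replace anomaly = 1 if !inrange(maths, 0, 100)

* One-step alternative: the expression itself evaluates to 0 or 1
gen anomaly2 = !inrange(maths, 0, 100)

* Stops with an error if the two flags ever disagree
assert anomaly == anomaly2

* Number of flagged cases in the toy data
count if anomaly == 1
```

Both forms give the same flag; the two-step form makes the initial value and the condition easier to read separately.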
8.2 Exercise
After you run the lines above, use list with if to determine if there are any anomalous cases in your data set. Have you detected any? How many?
The Stata symbol for or is |, sometimes called bar or pipe. In Stata you can create a complex condition (for example, after if) with the pipe, but note that each side of the pipe must be a complete expression. So, if we consider a comparison with English, you must say:
“Would you like coffee or would you like tea?”
and not
“Would you like coffee or tea?”
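In Stata terms, the analogy above might look like the following sketch. The data are invented for illustration, and the thresholds 0 and 100 are simply the score limits used earlier in this section.

```stata
* Toy data: hypothetical surnames and scores, not the tutorial's data set
clear
input str10 surname maths
"ADAMS" 45
"BAKER" -5
"CLARK" 570
"DAVIS" 60
end

* Each side of the pipe is a complete comparison that names the variable
list surname maths if maths < 0 | maths > 100

* The shortened form below is not valid Stata, so it is left commented out:
* list surname maths if maths < 0 | > 100

* Number of cases meeting either condition
count if maths < 0 | maths > 100
```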
List all cases where maths < 50 or maths > 60.
Alter the second line of the example in the previous exercise (the replace command) so that it checks not only maths but also the english and history variables.
8.3 Replacing values
In the data for this tutorial, there is one score in maths that is clearly out of range. In this case we need to replace the maths score for the student with surname DENCIK. We can do that on the Stata command line with a replace command. For the sake of this exercise, snapshot the current state of your data with preserve and then type this command on the console:
replace maths = 57 if surname == "DENCIK"
When you have inspected the data to ensure the correction has been made, restore the snapshot from before the correction. (This is to ensure that you can complete the next exercise. In a normal situation you might preserve only at the end of your analysis, or simply not save the results of changes you make to your data.)
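The whole preserve, replace, inspect, restore cycle can be rehearsed on a toy data set first. This is a sketch only: the surnames and scores are invented, and the corrected value 57 is reused from the command above purely for illustration.

```stata
* Toy data: hypothetical surnames and scores, not the tutorial's data set
clear
input str10 surname maths
"ADAMS" 45
"BAKER" 570
end

* Take a snapshot of the data as they are now
preserve

* Correct the out-of-range score and inspect the result
replace maths = 57 if surname == "BAKER"
list surname maths

* Return to the snapshot: the uncorrected score is back
restore
list surname maths
```

The second listing shows the original, uncorrected value again, which is what the next exercise relies on.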