8 Detecting and correcting

In the data as you find it, there is an anomalous maths score. We can find this by simple inspection of the data because we have a small data set and few variables.

We could, for example, use list with an if condition, like this:

list surname maths if maths > 100 

which would list any cases with a maths value greater than the allowable maximum.
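One caveat worth knowing: Stata treats a missing numeric value as larger than any number, so the condition maths > 100 is also true for cases where maths is missing. As a minimal sketch, assuming you want to see only recorded scores that are too high, you could combine the comparison with the missing() function and the and operator (&), neither of which has been introduced so far in this tutorial:

* list only non-missing maths scores above the allowable maximum
list surname maths if maths > 100 & !missing(maths)

If no maths scores are missing, this lists the same cases as the simpler command above.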

The comparison operator we use here is greater than (>). Remember that the operator for testing equality is ==, not a single =.

If we had a larger data set with many variables, inspection would be much more difficult. We will write some code to help us detect anomalous values.

In Stata we can use programming functions that return values³. Many functions return true or false, which Stata represents as 1 and 0. The function inrange(variable, min, max) returns true if variable is greater than or equal to min and less than or equal to max. We negate an expression with the not operator: !.
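To see how inrange() and ! behave before applying them to the data, you can try them on literal numbers with display. This is only an illustrative sketch; the numbers are arbitrary and are not taken from the tutorial data.

* a value inside the range: displays 1
display inrange(50, 0, 100)
* the boundaries themselves count as in range: displays 1
display inrange(0, 0, 100)
* a value above the maximum: displays 0
display inrange(101, 0, 100)
* negating the previous result: displays 1
display !inrange(101, 0, 100)
* a missing value is never in range: displays 0
display inrange(., 0, 100)

The last line matters in practice: a condition built on !inrange() will also flag cases where the value is missing.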

8.1 Exercise

In your script add the following lines:

gen anomaly = 0
replace anomaly = 1 if !inrange(maths,0,100)

The Stata command gen creates a new variable in our data set (for all cases) and gives it an initial value. The command replace changes the values of an existing variable, in this case only for the cases that satisfy the if condition.
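Because inrange() itself returns 1 or 0, the same flag can also be built in a single step. The sketch below is an alternative, not a replacement for the two lines above; the variable name anomaly2 is hypothetical and chosen only so that it does not clash with anomaly.

* create the flag directly from the negated range check
gen byte anomaly2 = !inrange(maths,0,100)
* check that both approaches agree; assert is silent when the condition holds for every case
assert anomaly2 == anomaly

The two-step version with gen and replace is easier to read when you are starting out, which is presumably why the exercise uses it.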

8.2 Exercise

After you run the lines above, use list with if to determine if there are any anomalous cases in your data set. Have you detected any? How many?

The Stata symbol for or is |, sometimes called bar or pipe. In Stata you can build a compound condition (for example after if) with the pipe, but note that each side of the pipe must be a complete expression. To draw a comparison with English, you must say:

“Would you like coffee or would you like tea?”

and not

“Would you like coffee or tea?”
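In Stata terms, the analogy might look like the sketch below. It uses the english variable, and the thresholds 40 and 90 are arbitrary values chosen only for illustration.

* correct: a complete comparison on each side of the pipe
list surname english if english < 40 | english > 90

* incorrect: the right-hand side is not a complete comparison, so Stata reports an error
* list surname english if english < 40 | > 90

A related trap: a condition such as english == 40 | 90 is accepted, but it is true for every case, because 90 on its own is nonzero and Stata treats any nonzero value as true.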

List all cases where maths < 50 or maths > 60.

Alter the second line of the example in Exercise 8.1 (the replace command) so that it checks not only maths but also the english and history variables.

8.3 Replacing values

In the data for this tutorial, there is one score in maths that is clearly out of range. In this case we need to replace the maths score for the student with surname DENCIK. We can do that on the Stata command line with a replace command. For the sake of this exercise, snapshot the current state of your data with preserve and then type this command on the console:

replace maths = 57 if surname == "DENCIK"

When you have inspected the data to ensure the correction has been made, restore the snapshot from before the correction with restore. This ensures that you can complete the next exercise. In a normal situation you might preserve only at the end of your analysis, or simply not save the results of changes you make to your data.
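Put together, the steps described above might look like this when typed on the console. The list command is just one way to inspect the result.

* take a snapshot of the data in memory
preserve
* make the correction and check it
replace maths = 57 if surname == "DENCIK"
list surname maths if surname == "DENCIK"
* return to the snapshot, undoing the correction
restore

If these lines are run from a do-file rather than typed on the console, Stata restores the preserved data automatically when the do-file finishes, even without an explicit restore.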

8.4 Exercise

Correct the anomalous maths score, but do not base the replacement on the surname variable, rather use only the maths values.