Introduction to Quantitative Methods

7. Interactions

7.1 Seminar

Setting a Working Directory

Before you begin, make sure to set your working directory to a folder where your course related files such as R scripts and datasets are kept. We recommend that you create a PUBLG100 folder for all your work. Create this folder on N: drive if you're using a UCL computer or somewhere on your local disk if you're using a personal laptop.

Once the folder is created, use the setwd() function in R to set your working directory.

Recommended Folder Location R Function
UCL Computers N: Drive setwd("N:/PUBLG100")
Personal Laptop (Windows) C: Drive setwd("C:/PUBLG100")
Personal Laptop (Mac) Home Folder setwd("~/PUBLG100")

After you've set the working directory, verify it by calling the getwd() function.

getwd()

Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100 folder.

If you're using a UCL computer, please make sure you're running R version 3.3.1 or newer. Some of the seminar tasks and exercises will not work with older versions of R.
First, lets load the required packages in this order:

library(foreign)
library(Zelig)
library(texreg)
library(dplyr)

Clear the environment.

rm(list = ls())

Loading Data

Download 'Institute for Quality of Governance' Dataset

We will use the Quality of Government data again using four variables:

Variable Description
former_col 0 = not a former colony
1 = former colony
undp_hdi UNDP Human Development Index. Higher values mean better quality of life
wbgi_cce Control of corruption. Higher values mean better control of corruption
wdi_gdpc GDP per capita in US dollars
world_data <- read.dta("QoG2012.dta")

names(world_data)
[1] "h_j"         "wdi_gdpc"    "undp_hdi"    "wbgi_cce"    "wbgi_pse"   
[6] "former_col"  "lp_lat_abst"

Let's select only the variables we care about and rename the cryptic ones:

world_data <- select(world_data, 
                     hdi = undp_hdi,
                     corruption_control = wbgi_cce,
                     gdp = wdi_gdpc,
                     former_col)

head(world_data)
    hdi corruption_control        gdp former_col
1    NA         -1.5453584   628.4074          0
2 0.781         -0.8538115  4954.1982          0
3 0.704         -0.7301510  6349.7207          1
4    NA          1.3267342         NA          0
5 0.381         -1.2065741  2856.7517          1
6 0.800          0.8624368 13981.9795          1

Now let's look at summary statistics

summary(world_data)
      hdi         corruption_control      gdp            former_col    
 Min.   :0.2730   Min.   :-1.69953   Min.   :  226.2   Min.   :0.0000  
 1st Qu.:0.5390   1st Qu.:-0.81965   1st Qu.: 1768.0   1st Qu.:0.0000  
 Median :0.7510   Median :-0.30476   Median : 5326.1   Median :1.0000  
 Mean   :0.6982   Mean   :-0.05072   Mean   :10184.1   Mean   :0.6289  
 3rd Qu.:0.8335   3rd Qu.: 0.50649   3rd Qu.:12976.5   3rd Qu.:1.0000  
 Max.   :0.9560   Max.   : 2.44565   Max.   :63686.7   Max.   :1.0000  
 NA's   :19       NA's   :2          NA's   :16                        

former_col is a categorical variable, let's see how many observations are in each category:

table(world_data$former_col)
  0   1 
 72 122 

Now let's create a scatterplot between corruption_control and hdi and color the points based on the value of former_col.

NOTE: We're using pch = 20 to plot solid cicles. You can see other available styles by typing ?points or help(points) at the console.

plot(hdi ~ corruption_control, 
     data = world_data, 
     col = as.factor(former_col), 
     pch = 20)

# add a legend
legend("bottomright", 
       c("former_col = 0", "former_col = 1"), 
       col=c("black","red"), 
       pch = 20)

Dummy Variables

The variable former_col is a dummy variable (also known as a binary variable or an indicator variable) and it can take on the values 0 or 1. Such a variable does not affect the slope. It shifts the regression line. You end up with a second regression line (when the value is 1) parallel to the first (when the value is 0).

So, while continuous variables affect the slope, binary variables affect the intercept. Let's run a linear regression model and see the results:

model1 <- lm(hdi ~ corruption_control + former_col, data = world_data)

screenreg(model1)
==============================
                    Model 1   
------------------------------
(Intercept)           0.77 ***
                     (0.02)   
corruption_control    0.10 ***
                     (0.01)   
former_col           -0.11 ***
                     (0.02)   
------------------------------
R^2                   0.54    
Adj. R^2              0.54    
Num. obs.           175       
RMSE                  0.12    
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05

The Effect of a Dummy Variable

Former Colony Intercept Slope
0 = not a former colony Intercept + (former_col * 0)
= 0.77 + (-0.11 * 0)
= 0.77
corruption_control
= 0.10
1 = former colony Intercept + (former_col * 1)
= 0.77 + (-0.11 * 1)
= 0.66
corruption_control
= 0.10

To illustrate the effect of a dummy we can access the coefficients of our lm model directly with the coefficients() function:

model_coef <- coefficients(model1)
model_coef
       (Intercept) corruption_control         former_col 
         0.7733939          0.1021898         -0.1122634 

We can use the square brackets [ ] to access individual coefficients.

To illustrate the effect of a dummy we draw a scatterplot and then use abline() to draw two regression lines, one with former_col = 0 and another with former_col = 1.

Instead of passing the model as the first argument to abline(), we can just pass the intercept and slope as two separate arguments.

plot(hdi ~ corruption_control, data = world_data,
     col = as.factor(former_col), 
     pch = 20,
     xlab = "Corruption Control",
     ylab = "Human Development Index",
     main = "Effect of a Dummy Variable")

# the regression line when former_col = 0
abline(model_coef[1], model_coef[2], col = "black")

# the regression line when former_col = 1
abline(model_coef[1] + model_coef[3], model_coef[2], col = "red")

# add a legend to the plot
legend("bottomright", 
       c("former_col = 0", "former_col = 1"), 
       col=c("black","red"), 
       lwd = 1,   # line width = 1 for adding a line to the legend
       pch = 20)

Interactions: Continuous and Binary

There are two ways to specify an interaction term, either using a colon (:) or with an asterisk (*)

Example
: A:B - Only the interaction between A and B is included
* A*B - In addition to the interaction term, both the constituents are also included. So A*B translates to A + B + A:B

We interact control of corruption (continuous) with having been colonized (binary). We estimate a second model using an interaction term and use the F-test to see whether our new model does better than the first one.

What to keep in mind when modelling interactions
Include interaction terms and constituents of the interaction in your model.
Constituent 1 corruption_control
Constituent 2 former_col
Interaction corruption_control:former_col
model2 <- lm(hdi ~ corruption_control + former_col + corruption_control:former_col, 
             data = world_data)

screenreg(list(model1, model2))
=====================================================
                               Model 1     Model 2   
-----------------------------------------------------
(Intercept)                      0.77 ***    0.79 ***
                                (0.02)      (0.02)   
corruption_control               0.10 ***    0.08 ***
                                (0.01)      (0.01)   
former_col                      -0.11 ***   -0.12 ***
                                (0.02)      (0.02)   
corruption_control:former_col                0.05 ** 
                                            (0.02)   
-----------------------------------------------------
R^2                              0.54        0.56    
Adj. R^2                         0.54        0.55    
Num. obs.                      175         175       
RMSE                             0.12        0.12    
=====================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
anova(model1, model2)
Analysis of Variance Table

Model 1: hdi ~ corruption_control + former_col
Model 2: hdi ~ corruption_control + former_col + corruption_control:former_col
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    172 2.5953                                
2    171 2.4930  1   0.10237 7.0215 0.008809 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, we now examine what our interaction in the second model does by plotting our results with zelig(). We will plot our results using simulation in zelig(). We first estimate a linear model the same way we did with lm():

z.out <- zelig(hdi ~ corruption_control + former_col + corruption_control:former_col, 
               data = world_data, 
               model = "ls")
How to cite this model in Zelig:
  R Core Team. 2007.
  ls: Least Squares Regression for Continuous Dependent Variables
  in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
  "Zelig: Everyone's Statistical Software," http://zeligproject.org/

The first scenario x.out1 represents countries that were never colonized. The second, x.out2 represents former colonies. We vary control of corruption from -2 to 3 which is roughly the minimum to the maximum of the variable.

NOTE: We know the range of values for corruption_control from the summary statistics we obtained after loading the dataset at the begining of the seminar. You can also use the range() function, but make sure to provide na.rm = TRUE argument as there are some missing values in the dataset.

range(world_data$corruption_control, na.rm = TRUE)
[1] -1.699529  2.445654
# set covariates for countries that weren't colonised
x.out1 <- setx(z.out, former_col = 0, corruption_control = -2:3)

# set covariates for colonised countries
x.out2 <- setx(z.out, former_col = 1, corruption_control = -2:3)

# simulate 
s.out <- sim(z.out, x = x.out1, x1 = x.out2)

Now let's plot with ci.plot

ci.plot(s.out, 
        ci = 95,
        xlab = "Control of Corruption",
        ylab = "Expected Value of Human Development Index",
        main = "Effect of Control of Corruption by Colonial Past"
        ) 

# add text to Zelig plot. 
# NOTE: x and y are just plot coordinates so you can change them to control where 
#       the text is displayed
text(x = -1, y = 0.8, labels = "former_col = 0")
text(x = 1, y = 0.7, labels = "former_col = 1")

Interactions: Two Continuous Variables

Now let's look at the relationship between GDP per capita and human development index

plot(hdi ~ gdp, data = world_data)

When looking at the scatter plot, we clearly see a non-linear relationship between the GDP per capita and human development index. We fit a linear regression model and plot a line through the point cloud using abline().

model1 <- lm(hdi ~ gdp, data = world_data)
summary(model1)
Call:
lm(formula = hdi ~ gdp, data = world_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38931 -0.10169  0.05228  0.10126  0.17392 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.918e-01  1.293e-02   45.78   <2e-16 ***
gdp         1.017e-05  7.976e-07   12.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1298 on 170 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  0.4888,    Adjusted R-squared:  0.4858 
F-statistic: 162.6 on 1 and 170 DF,  p-value: < 2.2e-16
plot(hdi ~ gdp, data = world_data)
abline(model1)

The linear model isn't a very good fit as we can see from the plot. We observe that additional wealth has a large effect on the quality of life in a poor country while the effect becomes smaller and smaller when higher levels of wealth are reached. That, however, implies that the slope of our line (the effect of the independent variable) should depend on the level of the independent variable.

We can model such a non-linear relationship by including a quadratic term of GDP per capita. The syntax is the following:

I(independent_variable^2)
Component Description
I The capital I is actually a function that allows us to pass the quadratic term "as is" to lm()
^2 The ^ followed by a 2 means: raised to the power of 2
model2 <- lm(hdi ~ gdp + I(gdp^2), data = world_data)
summary(model2)
Call:
lm(formula = hdi ~ gdp + I(gdp^2), data = world_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.268048 -0.064632 -0.001202  0.085496  0.227648 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.264e-01  1.249e-02  42.148   <2e-16 ***
gdp          2.537e-05  1.716e-06  14.781   <2e-16 ***
I(gdp^2)    -3.542e-10  3.707e-11  -9.556   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1049 on 169 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  0.6681,    Adjusted R-squared:  0.6642 
F-statistic: 170.1 on 2 and 169 DF,  p-value: < 2.2e-16

Do we need the quadratic term or is the relationship in fact linear? We can use an F-Test to check:

anova(model1, model2)
Analysis of Variance Table

Model 1: hdi ~ gdp
Model 2: hdi ~ gdp + I(gdp^2)
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    170 2.8649                                 
2    169 1.8600  1    1.0049 91.31 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now let's run a model with zelig(), run simulation and plot the results

z.out <- zelig(hdi ~ gdp + I(gdp^2), data = world_data, model = "ls")
How to cite this model in Zelig:
  R Core Team. 2007.
  ls: Least Squares Regression for Continuous Dependent Variables
  in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
  "Zelig: Everyone's Statistical Software," http://zeligproject.org/
# setting covariates; GDP/captia is a sequence from 0 to 60000 in steps of 1000
x.out <- setx(z.out, gdp = seq(0, 60000, 1000))

# simulate using our model and our covariates
s.out <- sim(z.out, x = x.out)

# plot the results
ci.plot(s.out, ci = 95)

Exercises

  1. Using our model z.out, where we included the square of GDP/capita, what is the effect of:

    • an increase of GDP/capita from 5000 to 15000?
    • an increase of GDP/capita from 25000 to 35000?
  2. Estimate a model where wbgi_pse (political stability) is the response variable and h_j (1 if country has an independent judiciary) and former_col are the explanatory variables. Include an interaction between your explanatory variables. What is the marginal effect of:

    • An independent judiciary when the country is a former colony?
    • An independent judiciary when the country was not colonized?
    • Does the interaction between h_j and former_col improve model fit?
  3. Run a model on the human development index (hdi), interacting an independent judiciary (h_j) and control of corruption (corruption_control). What is the effect of control of corruption:

    • In countries without an independent judiciary?
    • When there is an independent judiciary?
    • Illustrate your results.
    • Does the interaction improve model fit?
  4. Clear your workspace and download the California Test Score Data used by Stock and Watson.

    • Download 'caschool.dta' Dataset
    • Draw a scatterplot between avginc and testscr variables.
    • Run two regressions with testscr as the dependent variable.
      • In the first model use avginc as the independent variable.
      • In the second model use quadratic avginc as the independent variable.
    • Test whether the quadratic model fits the data better.