Before you begin, make sure to set your working directory to a folder where your course related files such as R scripts and datasets are kept. We recommend that you create a PUBLG100
folder for all your work. Create this folder on N: drive if you're using a UCL computer or somewhere on your local disk if you're using a personal laptop.
Once the folder is created, use the setwd()
function in R to set your working directory.
Recommended Folder Location | R Function | |
---|---|---|
UCL Computers | N: Drive | setwd("N:/PUBLG100") |
Personal Laptop (Windows) | C: Drive | setwd("C:/PUBLG100") |
Personal Laptop (Mac) | Home Folder | setwd("~/PUBLG100") |
After you've set the working directory, verify it by calling the getwd()
function.
getwd()
Now download the R script for this seminar from the "Download .R Script" button above, and save it to your PUBLG100
folder.
library(foreign)
library(Zelig)
library(texreg)
library(dplyr)
Clear the environment.
rm(list = ls())
Download 'Institute for Quality of Governance' Dataset
We will use the Quality of Government data again using four variables:
Variable | Description |
---|---|
former_col |
0 = not a former colony 1 = former colony |
undp_hdi |
UNDP Human Development Index. Higher values mean better quality of life |
wbgi_cce |
Control of corruption. Higher values mean better control of corruption |
wdi_gdpc |
GDP per capita in US dollars |
world_data <- read.dta("QoG2012.dta")
names(world_data)
[1] "h_j" "wdi_gdpc" "undp_hdi" "wbgi_cce" "wbgi_pse"
[6] "former_col" "lp_lat_abst"
Let's select only the variables we care about and rename the cryptic ones:
world_data <- select(world_data,
hdi = undp_hdi,
corruption_control = wbgi_cce,
gdp = wdi_gdpc,
former_col)
head(world_data)
hdi corruption_control gdp former_col
1 NA -1.5453584 628.4074 0
2 0.781 -0.8538115 4954.1982 0
3 0.704 -0.7301510 6349.7207 1
4 NA 1.3267342 NA 0
5 0.381 -1.2065741 2856.7517 1
6 0.800 0.8624368 13981.9795 1
Now let's look at summary statistics
summary(world_data)
hdi corruption_control gdp former_col
Min. :0.2730 Min. :-1.69953 Min. : 226.2 Min. :0.0000
1st Qu.:0.5390 1st Qu.:-0.81965 1st Qu.: 1768.0 1st Qu.:0.0000
Median :0.7510 Median :-0.30476 Median : 5326.1 Median :1.0000
Mean :0.6982 Mean :-0.05072 Mean :10184.1 Mean :0.6289
3rd Qu.:0.8335 3rd Qu.: 0.50649 3rd Qu.:12976.5 3rd Qu.:1.0000
Max. :0.9560 Max. : 2.44565 Max. :63686.7 Max. :1.0000
NA's :19 NA's :2 NA's :16
former_col
is a categorical variable, let's see how many observations are in each category:
table(world_data$former_col)
0 1
72 122
Now let's create a scatterplot between corruption_control
and hdi
and color the points based on the value of former_col
.
NOTE: We're using pch = 20
to plot solid cicles. You can see other available styles by typing ?points
or help(points)
at the console.
plot(hdi ~ corruption_control,
data = world_data,
col = as.factor(former_col),
pch = 20)
# add a legend
legend("bottomright",
c("former_col = 0", "former_col = 1"),
col=c("black","red"),
pch = 20)
The variable former_col
is a dummy variable (also known as a binary variable or an indicator variable) and it can take on the values 0
or 1
. Such a variable does not affect the slope. It shifts the regression line. You end up with a second regression line (when the value is 1
) parallel to the first (when the value is 0
).
So, while continuous variables affect the slope, binary variables affect the intercept. Let's run a linear regression model and see the results:
model1 <- lm(hdi ~ corruption_control + former_col, data = world_data)
screenreg(model1)
==============================
Model 1
------------------------------
(Intercept) 0.77 ***
(0.02)
corruption_control 0.10 ***
(0.01)
former_col -0.11 ***
(0.02)
------------------------------
R^2 0.54
Adj. R^2 0.54
Num. obs. 175
RMSE 0.12
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05
Former Colony | Intercept | Slope |
---|---|---|
0 = not a former colony | Intercept + (former_col * 0)= 0.77 + (-0.11 * 0) = 0.77 |
corruption_control = 0.10 |
1 = former colony | Intercept + (former_col * 1)= 0.77 + (-0.11 * 1) = 0.66 |
corruption_control = 0.10 |
To illustrate the effect of a dummy we can access the coefficients of our lm
model directly with the coefficients()
function:
model_coef <- coefficients(model1)
model_coef
(Intercept) corruption_control former_col
0.7733939 0.1021898 -0.1122634
We can use the square brackets [ ]
to access individual coefficients.
To illustrate the effect of a dummy we draw a scatterplot and then use abline()
to draw two regression lines, one with former_col = 0
and another with former_col = 1
.
Instead of passing the model as the first argument to abline()
, we can just pass the intercept and slope as two separate arguments.
plot(hdi ~ corruption_control, data = world_data,
col = as.factor(former_col),
pch = 20,
xlab = "Corruption Control",
ylab = "Human Development Index",
main = "Effect of a Dummy Variable")
# the regression line when former_col = 0
abline(model_coef[1], model_coef[2], col = "black")
# the regression line when former_col = 1
abline(model_coef[1] + model_coef[3], model_coef[2], col = "red")
# add a legend to the plot
legend("bottomright",
c("former_col = 0", "former_col = 1"),
col=c("black","red"),
lwd = 1, # line width = 1 for adding a line to the legend
pch = 20)
There are two ways to specify an interaction term, either using a colon (:
) or with an asterisk (*
)
Example | |
---|---|
: |
A:B - Only the interaction between A and B is included |
* |
A*B - In addition to the interaction term, both the constituents are also included. So A*B translates to A + B + A:B |
We interact control of corruption (continuous) with having been colonized (binary). We estimate a second model using an interaction term and use the F-test to see whether our new model does better than the first one.
What to keep in mind when modelling interactions | |
---|---|
Include interaction terms and constituents of the interaction in your model. | |
Constituent 1 | corruption_control |
Constituent 2 | former_col |
Interaction | corruption_control:former_col |
model2 <- lm(hdi ~ corruption_control + former_col + corruption_control:former_col,
data = world_data)
screenreg(list(model1, model2))
=====================================================
Model 1 Model 2
-----------------------------------------------------
(Intercept) 0.77 *** 0.79 ***
(0.02) (0.02)
corruption_control 0.10 *** 0.08 ***
(0.01) (0.01)
former_col -0.11 *** -0.12 ***
(0.02) (0.02)
corruption_control:former_col 0.05 **
(0.02)
-----------------------------------------------------
R^2 0.54 0.56
Adj. R^2 0.54 0.55
Num. obs. 175 175
RMSE 0.12 0.12
=====================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
anova(model1, model2)
Analysis of Variance Table
Model 1: hdi ~ corruption_control + former_col
Model 2: hdi ~ corruption_control + former_col + corruption_control:former_col
Res.Df RSS Df Sum of Sq F Pr(>F)
1 172 2.5953
2 171 2.4930 1 0.10237 7.0215 0.008809 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, we now examine what our interaction in the second model does by plotting our results with zelig()
. We will plot our results using simulation in zelig()
. We first estimate a linear model the same way we did with lm()
:
z.out <- zelig(hdi ~ corruption_control + former_col + corruption_control:former_col,
data = world_data,
model = "ls")
How to cite this model in Zelig:
R Core Team. 2007.
ls: Least Squares Regression for Continuous Dependent Variables
in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
"Zelig: Everyone's Statistical Software," http://zeligproject.org/
The first scenario x.out1
represents countries that were never colonized. The second, x.out2
represents former colonies. We vary control of corruption from -2
to 3
which is roughly the minimum to the maximum of the variable.
NOTE: We know the range of values for corruption_control
from the summary statistics we obtained after loading the dataset at the begining of the seminar. You can also use the range()
function, but make sure to provide na.rm = TRUE
argument as there are some missing values in the dataset.
range(world_data$corruption_control, na.rm = TRUE)
[1] -1.699529 2.445654
# set covariates for countries that weren't colonised
x.out1 <- setx(z.out, former_col = 0, corruption_control = -2:3)
# set covariates for colonised countries
x.out2 <- setx(z.out, former_col = 1, corruption_control = -2:3)
# simulate
s.out <- sim(z.out, x = x.out1, x1 = x.out2)
Now let's plot with ci.plot
ci.plot(s.out,
ci = 95,
xlab = "Control of Corruption",
ylab = "Expected Value of Human Development Index",
main = "Effect of Control of Corruption by Colonial Past"
)
# add text to Zelig plot.
# NOTE: x and y are just plot coordinates so you can change them to control where
# the text is displayed
text(x = -1, y = 0.8, labels = "former_col = 0")
text(x = 1, y = 0.7, labels = "former_col = 1")
Now let's look at the relationship between GDP per capita and human development index
plot(hdi ~ gdp, data = world_data)
When looking at the scatter plot, we clearly see a non-linear relationship between the GDP per capita and human development index. We fit a linear regression model and plot a line through the point cloud using abline()
.
model1 <- lm(hdi ~ gdp, data = world_data)
summary(model1)
Call:
lm(formula = hdi ~ gdp, data = world_data)
Residuals:
Min 1Q Median 3Q Max
-0.38931 -0.10169 0.05228 0.10126 0.17392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.918e-01 1.293e-02 45.78 <2e-16 ***
gdp 1.017e-05 7.976e-07 12.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1298 on 170 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.4888, Adjusted R-squared: 0.4858
F-statistic: 162.6 on 1 and 170 DF, p-value: < 2.2e-16
plot(hdi ~ gdp, data = world_data)
abline(model1)
The linear model isn't a very good fit as we can see from the plot. We observe that additional wealth has a large effect on the quality of life in a poor country while the effect becomes smaller and smaller when higher levels of wealth are reached. That, however, implies that the slope of our line (the effect of the independent variable) should depend on the level of the independent variable.
We can model such a non-linear relationship by including a quadratic term of GDP per capita. The syntax is the following:
I(independent_variable^2)
Component | Description |
---|---|
I |
The capital I is actually a function that allows us to pass the quadratic term "as is" to lm() |
^2 |
The ^ followed by a 2 means: raised to the power of 2 |
model2 <- lm(hdi ~ gdp + I(gdp^2), data = world_data)
summary(model2)
Call:
lm(formula = hdi ~ gdp + I(gdp^2), data = world_data)
Residuals:
Min 1Q Median 3Q Max
-0.268048 -0.064632 -0.001202 0.085496 0.227648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.264e-01 1.249e-02 42.148 <2e-16 ***
gdp 2.537e-05 1.716e-06 14.781 <2e-16 ***
I(gdp^2) -3.542e-10 3.707e-11 -9.556 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1049 on 169 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.6681, Adjusted R-squared: 0.6642
F-statistic: 170.1 on 2 and 169 DF, p-value: < 2.2e-16
Do we need the quadratic term or is the relationship in fact linear? We can use an F-Test to check:
anova(model1, model2)
Analysis of Variance Table
Model 1: hdi ~ gdp
Model 2: hdi ~ gdp + I(gdp^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 170 2.8649
2 169 1.8600 1 1.0049 91.31 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let's run a model with zelig()
, run simulation and plot the results
z.out <- zelig(hdi ~ gdp + I(gdp^2), data = world_data, model = "ls")
How to cite this model in Zelig:
R Core Team. 2007.
ls: Least Squares Regression for Continuous Dependent Variables
in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
"Zelig: Everyone's Statistical Software," http://zeligproject.org/
# setting covariates; GDP/captia is a sequence from 0 to 60000 in steps of 1000
x.out <- setx(z.out, gdp = seq(0, 60000, 1000))
# simulate using our model and our covariates
s.out <- sim(z.out, x = x.out)
# plot the results
ci.plot(s.out, ci = 95)
Using our model z.out
, where we included the square of GDP/capita, what is the effect of:
Estimate a model where wbgi_pse
(political stability) is the response variable and h_j
(1 if country has an independent judiciary) and former_col
are the explanatory variables. Include an interaction between your explanatory variables. What is the marginal effect of:
h_j
and former_col
improve model fit?Run a model on the human development index (hdi
), interacting an independent judiciary (h_j
) and control of corruption (corruption_control
). What is the effect of control of corruption:
Clear your workspace and download the California Test Score Data used by Stock and Watson.
avginc
and testscr
variables.testscr
as the dependent variable.avginc
as the independent variable.avginc
as the independent variable.