Use the World Development Indicators and Polity IV datasets from 2012 to estimate a model explaining the effects of the following variables on levels of democracy. Make sure to test and correct for heteroskedasticity and Omitted Variable Bias.
Dataset: - World Development Indicators 2012 - Polity IV 2012
NOTE: This exercise will require you to merge the two datasets together. Use iso3n
country code as the matching columns for merging.
Let's start by clearing the environment
rm(list = ls())
Next, we load the texreg
, lmtest
, sandwich
and dplyr
packages
library(texreg)
Warning: package 'texreg' was built under R version 3.2.3
library(lmtest)
library(sandwich)
library(dplyr)
Load the two datasets, and merge using the iso3n
column.
wdi <- read.csv("https://uclspp.github.io/PUBLG100/data/WDI_2012.csv")
polity <- read.csv("https://uclspp.github.io/PUBLG100/data/polity4_2012.csv")
merged_data <- merge(wdi, polity, by = "iso3n")
A number of observations have missing values so we need to remove them. Using na.omit()
is problematic because it removes observation if it finds NA
in any column. We're only interested in UnemploymentRate
, Inflation
, MilitaryExpenditure
, HealthExpenditure
, and MobileSubscribers
columns so we use the function complete.cases()
instead and get a subset of observations that do not have any missing values in the columns we care about.
merged_data <- subset(merged_data,
complete.cases(merged_data[, c("UnemploymentRate", "Inflation", "MilitaryExpenditure", "HealthExpenditure", "MobileSubscribers")]))
Since we have 5 explanatory variables, we need to somehow find a combination of variables that yields the best model. We will use Adjusted R-squared as the criterion for determining what is considered the best model. Next we need a systematic way to find and test various models with different combinations of explanatory variables. There are three possible approaches we can use:
Exhaustive search: Use an exhaustive search and try every possible combination of explanatory variables available to us. With 5
variables, an exhaustive search would require (2 ^ 5) -1
or 31
different models. As the number of variables increase, the it become extremely difficult to do an exhaustive search. For example, with 10
variables, we would need 1024
models, and with 15
variables, we would need over 32,768
models to compare.
Forward selection: The next option is to start with the simplest model of one variable and start adding variables one at a time until the addition of new variables no longer improves the model.
Backward elimination: The last option is to start with the most general model that includes every explanatory variable and remove one variable at a time as long as removing variables results in an improved model.
For this exercise we will use the forward selection method and start with simple models of one variable each:
model_1 = lm(polity2 ~ UnemploymentRate, data = merged_data)
model_2 = lm(polity2 ~ Inflation, data = merged_data)
model_3 = lm(polity2 ~ MilitaryExpenditure, data = merged_data)
model_4 = lm(polity2 ~ HealthExpenditure, data = merged_data)
model_5 = lm(polity2 ~ MobileSubscribers, data = merged_data)
Next we compare the models using screenreg()
. Notice how we're providing custom model names for screenreg()
to display:
screenreg(list(model_1, model_2, model_3, model_4, model_5),
digits = 3,
custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))
===========================================================================
Model 1 Model 2 Model 3 Model 4 Model 5
---------------------------------------------------------------------------
(Intercept) 6.691 *** 8.031 *** 10.700 *** 3.359 * 4.889
(0.942) (0.875) (0.692) (1.629) (2.515)
UnemploymentRate 0.092
(0.084)
Inflation -0.139
(0.206)
MilitaryExpenditure -0.478 ***
(0.084)
HealthExpenditure 0.528 **
(0.196)
MobileSubscribers 0.023
(0.021)
---------------------------------------------------------------------------
R^2 0.019 0.007 0.340 0.104 0.018
Adj. R^2 0.003 -0.009 0.330 0.089 0.003
Num. obs. 65 65 65 65 65
RMSE 4.110 4.133 3.369 3.927 4.110
===========================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Model 3 with MilitaryExpenditure
as the explanatory variable has the highest Adjusted R-squared, so let's use that as the baseline and add the remaining variables one at a time and do another comparison.
model_6 = lm(polity2 ~ MilitaryExpenditure + UnemploymentRate, data = merged_data)
model_7 = lm(polity2 ~ MilitaryExpenditure + Inflation, data = merged_data)
model_8 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure, data = merged_data)
model_9 = lm(polity2 ~ MilitaryExpenditure + MobileSubscribers, data = merged_data)
screenreg(list(model_3, model_6, model_7, model_8, model_9),
digits = 3,
custom.model.names = c("Model 3", "Model 6", "Model 7", "Model 8", "Model 9"))
===============================================================================
Model 3 Model 6 Model 7 Model 8 Model 9
-------------------------------------------------------------------------------
(Intercept) 10.700 *** 10.977 *** 11.064 *** 7.902 *** 8.703 ***
(0.692) (1.099) (0.895) (1.624) (2.171)
MilitaryExpenditure -0.478 *** -0.486 *** -0.476 *** -0.440 *** -0.473 ***
(0.084) (0.088) (0.084) (0.085) (0.084)
UnemploymentRate -0.024
(0.073)
Inflation -0.109
(0.169)
HealthExpenditure 0.321
(0.169)
MobileSubscribers 0.017
(0.017)
-------------------------------------------------------------------------------
R^2 0.340 0.342 0.345 0.377 0.350
Adj. R^2 0.330 0.320 0.324 0.356 0.329
Num. obs. 65 65 65 65 65
RMSE 3.369 3.393 3.385 3.302 3.370
===============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Adding HealthExpenditure
improves the model slightly so we'll use MilitaryExpenditure + HealthExpenditure
as the new baseline and repeat the process of forward selection.
model_10 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + UnemploymentRate, data = merged_data)
model_11 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + Inflation, data = merged_data)
model_12 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers, data = merged_data)
screenreg(list(model_8, model_10, model_11, model_12),
digits = 3,
custom.model.names = c("Model 8", "Model 10", "Model 11", "Model 12"))
===================================================================
Model 8 Model 10 Model 11 Model 12
-------------------------------------------------------------------
(Intercept) 7.902 *** 8.116 *** 7.914 *** 5.222
(1.624) (1.873) (1.991) (2.723)
MilitaryExpenditure -0.440 *** -0.446 *** -0.440 *** -0.431 ***
(0.085) (0.089) (0.085) (0.084)
HealthExpenditure 0.321 0.319 0.321 0.345 *
(0.169) (0.171) (0.182) (0.170)
UnemploymentRate -0.017
(0.071)
Inflation -0.002
(0.177)
MobileSubscribers 0.021
(0.017)
-------------------------------------------------------------------
R^2 0.377 0.377 0.377 0.392
Adj. R^2 0.356 0.347 0.346 0.362
Num. obs. 65 65 65 65
RMSE 3.302 3.327 3.328 3.288
===================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
The addition of MobileSubscribers as a control variable has improved the the model slightly, so now we're only left with two variables UnemploymentRate
and Inflation
to test.
model_13 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + UnemploymentRate, data = merged_data)
model_14 = lm(polity2 ~ MilitaryExpenditure + HealthExpenditure + MobileSubscribers + Inflation, data = merged_data)
screenreg(list(model_12, model_13, model_14),
digits = 3,
custom.model.names = c("Model 12", "Model 13", "Model 14"))
=======================================================
Model 12 Model 13 Model 14
-------------------------------------------------------
(Intercept) 5.222 5.272 4.957
(2.723) (3.032) (3.122)
MilitaryExpenditure -0.431 *** -0.432 *** -0.430 ***
(0.084) (0.090) (0.085)
HealthExpenditure 0.345 * 0.345 * 0.357
(0.170) (0.172) (0.183)
MobileSubscribers 0.021 0.020 0.021
(0.017) (0.017) (0.017)
UnemploymentRate -0.003
(0.072)
Inflation 0.032
(0.178)
-------------------------------------------------------
R^2 0.392 0.392 0.392
Adj. R^2 0.362 0.351 0.351
Num. obs. 65 65 65
RMSE 3.288 3.316 3.315
=======================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Adding UnemploymentRate
or Inflation
does not improve our model at all, so we can keep Model 12
as the best performing model based on adjusted r-squared.
Now, let's test for heteroskedasticity in model_12
using Breusch-Pagan test from the lmtest
package.
bptest(model_12)
studentized Breusch-Pagan test
data: model_12
BP = 13.168, df = 3, p-value = 0.004288
Since bptest()
is telling us that we're dealing with heteroskedastic errors, we must correct them using the coeftest()
function:
corrected_errors <- coeftest(model_12, vcov = vcovHC(model_12))
screenreg(model_12,
digits = 3,
custom.model.names = c("Model 12"),
override.se = corrected_errors[, "Std. Error"],
override.pval = corrected_errors[, "Pr(>|t|)"])
==============================
Model 12
------------------------------
(Intercept) 5.222
(3.752)
MilitaryExpenditure -0.431 **
(0.128)
HealthExpenditure 0.345
(0.261)
MobileSubscribers 0.021
(0.026)
------------------------------
R^2 0.392
Adj. R^2 0.362
Num. obs. 65
RMSE 3.288
==============================
*** p < 0.001, ** p < 0.01, * p < 0.05
Use the Massachusetts Test Scores dataset from Stock and Watson to fit a model and explain the effects of the following variables on test scores.
NOTE: Use the 8th grade test scores from the variable score8
. The names of explanatory variables are given in parenthesis above:
Dataset: - Massachusetts Test Scores
Let's clear the workspace to ensure that we don't accidentally use variables from previous exercise.
rm(list = ls())
Load the MA Schools dataset
ma_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/MA_Schools.csv")
Let's look at the summary of our dependent and independent variables:
summary(ma_schools$score8)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
641.0 685.0 698.0 698.4 712.0 747.0 40
summary(ma_schools$income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.686 15.220 17.130 18.750 20.380 46.860
summary(ma_schools$lunch)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.40 5.30 10.55 15.32 20.02 76.20
summary(ma_schools$english)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 1.1180 0.8859 24.4900
summary(ma_schools$stratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.40 15.80 17.10 17.34 19.03 27.00
In the first exercise we used the forward selection method for finding the model that best explained the relationship between our independent variables and the dependent variable. In this exercise, we'll use the backward elimination method and start by the the most general model that includes all variables of interest (i.e. stratio
, english
, lunch
, and income
)
model_1 <- lm(score8 ~ stratio + english + lunch + income, data = ma_schools)
summary(model_1)
Call:
lm(formula = score8 ~ stratio + english + lunch + income, data = ma_schools)
Residuals:
Min 1Q Median 3Q Max
-21.557 -5.754 -1.120 4.835 31.856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 684.79554 6.63042 103.281 <2e-16 ***
stratio -0.34361 0.31353 -1.096 0.275
english -0.23289 0.30761 -0.757 0.450
lunch -0.71728 0.07021 -10.216 <2e-16 ***
income 1.67359 0.14882 11.245 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.794 on 175 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.8294, Adjusted R-squared: 0.8255
F-statistic: 212.7 on 4 and 175 DF, p-value: < 2.2e-16
Next we remove one variable at a time to see the difference it makes in our model.
model_2 <- lm(score8 ~ stratio + english + lunch, data = ma_schools)
model_3 <- lm(score8 ~ stratio + english + income, data = ma_schools)
model_4 <- lm(score8 ~ stratio + lunch + income, data = ma_schools)
model_5 <- lm(score8 ~ english + lunch + income, data = ma_schools)
screenreg(list(model_1, model_2, model_3, model_4, model_5),
digits = 3,
custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))
============================================================================
Model 1 Model 2 Model 3 Model 4 Model 5
----------------------------------------------------------------------------
(Intercept) 684.796 *** 730.125 *** 665.233 *** 685.837 *** 678.590 ***
(6.630) (6.890) (7.998) (6.478) (3.452)
stratio -0.344 -0.810 * -0.635 -0.356
(0.314) (0.407) (0.393) (0.313)
english -0.233 0.648 -2.389 *** -0.250
(0.308) (0.389) (0.282) (0.307)
lunch -0.717 *** -1.155 *** -0.754 *** -0.724 ***
(0.070) (0.076) (0.051) (0.070)
income 1.674 *** 2.516 *** 1.645 *** 1.695 ***
(0.149) (0.156) (0.144) (0.148)
----------------------------------------------------------------------------
R^2 0.829 0.706 0.728 0.829 0.828
Adj. R^2 0.826 0.701 0.723 0.826 0.825
Num. obs. 180 180 180 180 180
RMSE 8.794 11.509 11.079 8.783 8.799
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Removing the variable that measures the percent of english learners gives us the best model based on adjusted R-squared.
Let's see whether we can get better results by simplifying the model further.
model_6 <- lm(score8 ~ stratio + lunch, data = ma_schools)
model_7 <- lm(score8 ~ stratio + income, data = ma_schools)
model_8 <- lm(score8 ~ lunch + income, data = ma_schools)
screenreg(list(model_4, model_6, model_7, model_8),
digits = 3,
custom.model.names = c("Model 4", "Model 6", "Model 7", "Model 8"))
===============================================================
Model 4 Model 6 Model 7 Model 8
---------------------------------------------------------------
(Intercept) 685.837 *** 729.338 *** 666.144 *** 679.474 ***
(6.478) (6.908) (9.461) (3.274)
stratio -0.356 -0.797 -1.159 *
(0.313) (0.409) (0.460)
lunch -0.754 *** -1.069 *** -0.764 ***
(0.051) (0.057) (0.050)
income 1.645 *** 2.790 *** 1.665 ***
(0.144) (0.181) (0.143)
---------------------------------------------------------------
R^2 0.829 0.702 0.617 0.828
Adj. R^2 0.826 0.698 0.612 0.826
Num. obs. 180 180 180 180
RMSE 8.783 11.566 13.107 8.790
===============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
The results tell us that when we control for lunch subsidies and district income, the student teacher ratio has no effect on test scores.
Can we simplify the model further?
model_9 <- lm(score8 ~ lunch, data = ma_schools)
model_10 <- lm(score8 ~ income, data = ma_schools)
screenreg(list(model_8, model_9, model_10),
digits = 3,
custom.model.names = c("Model 8", "Model 9", "Model 10"))
==================================================
Model 8 Model 9 Model 10
--------------------------------------------------
(Intercept) 679.474 *** 716.080 *** 643.894 ***
(3.274) (1.235) (3.461)
lunch -0.764 *** -1.100 ***
(0.050) (0.055)
income 1.665 *** 2.909 ***
(0.143) (0.177)
--------------------------------------------------
R^2 0.828 0.695 0.603
Adj. R^2 0.826 0.693 0.601
Num. obs. 180 180 180
RMSE 8.790 11.657 13.303
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
It turns out that the model that include lunch subsidies and district income best captures the relationship between our independent variables and the dependent variable.
Now let's try the Breusch-Pagan test for heteroskedasticity
bptest(model_8)
studentized Breusch-Pagan test
data: model_8
BP = 1.8209, df = 2, p-value = 0.4023
The Breusch-Pagan test tells us that we don't have heteroskedastic errors so we don't need to correct for them.
Can the same model be used for California? Why or why not? What assumptions would it violate and how would you correct for them?
Dataset: - California Test Scores
Let's clear the workspace again ensure that we don't accidentally use variables from previous exercise.
rm(list = ls())
Load the California test score dataset and see how it compares to the Massachusetts.
ca_schools <- read.csv("https://uclspp.github.io/PUBLG100/data/CA_Schools.csv")
Filter the results to only include grade 8 and calculate the student-teacher ratio and average score.
ca_schools <- ca_schools %>%
filter(grades == "KK-08") %>%
mutate(stratio = students / teachers,
avg_score = (read + math)/2)
Let's look at the summary of our dependent and independent variables:
summary(ca_schools$avg_score)
Min. 1st Qu. Median Mean 3rd Qu. Max.
605.6 639.4 653.5 652.9 665.8 706.8
summary(ca_schools$income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.335 10.470 13.440 14.840 16.340 55.330
summary(ca_schools$lunch)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 25.35 44.68 45.93 67.09 100.00
summary(ca_schools$english)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.530 7.857 15.360 22.320 85.540
summary(ca_schools$stratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.00 18.67 19.78 19.71 20.89 25.80
Now lets estimate a model with all 4 explanatory variable stratio
, english
, lunch
and income
and compare it against 4 other models with each variable removed.
model_1 <- lm(avg_score ~ stratio + english + lunch + income, data = ca_schools)
model_2 <- lm(avg_score ~ stratio + english + lunch, data = ca_schools)
model_3 <- lm(avg_score ~ stratio + english + income, data = ca_schools)
model_4 <- lm(avg_score ~ stratio + lunch + income, data = ca_schools)
model_5 <- lm(avg_score ~ english + lunch + income, data = ca_schools)
screenreg(list(model_1, model_2, model_3, model_4, model_5),
digits = 3,
custom.model.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"))
============================================================================
Model 1 Model 2 Model 3 Model 4 Model 5
----------------------------------------------------------------------------
(Intercept) 673.416 *** 696.957 *** 638.237 *** 684.309 *** 663.166 ***
(5.900) (5.206) (6.292) (5.947) (2.321)
stratio -0.479 -0.894 *** 0.025 -0.791 **
(0.253) (0.263) (0.302) (0.262)
english -0.214 *** -0.147 *** -0.489 *** -0.227 ***
(0.034) (0.035) (0.031) (0.033)
lunch -0.383 *** -0.527 *** -0.504 *** -0.374 ***
(0.030) (0.024) (0.024) (0.030)
income 0.659 *** 1.461 *** 0.493 *** 0.700 ***
(0.093) (0.083) (0.094) (0.091)
----------------------------------------------------------------------------
R^2 0.789 0.759 0.692 0.765 0.787
Adj. R^2 0.786 0.757 0.689 0.763 0.785
Num. obs. 359 359 359 359 359
RMSE 8.637 9.217 10.420 9.101 8.668
============================================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Comparing the model above we can see that the model with all 4 explanatory variable stratio
, english
, lunch
and income
gives us the highest Adjusted R-squared. This differs from the Massachusetts study where lunch subsidy and district income resulted in the best model.
Whether the model from Massachusetts can be applied to California is a matter of external validity and depends a great deal on whether the two populations are similar. Both datasets provides measurements at the district level using standardized test scores. However, the mean test score from Massachusetts and California are different because of the difference in how the tests are conducted in each state.
The student-teacher ratio also differs slightly between the two datasets. The income level in California has larger spread than in Massachusetts but the average income in Massachusetts is 20% higher. The number of students receiving lunch subsidy and the percentage of english language learners are both higher in California.