Using our model z.out
, where we included the square of GDP/capita, what is the effect of:
Load the necessary packages
library(foreign)
library(texreg)
library(Zelig)
library(dplyr)
Clear the workspace
rm(list = ls())
Load the dataset and rename the variables to something sensible
qog_data <- read.dta("QoG2012.dta")
qog_data <- rename(qog_data,
hdi = undp_hdi,
gdp = wdi_gdpc)
We will re-estimate the same model we had done in class.
z.out <- zelig(hdi ~ gdp + I(gdp^2), model = "ls", data = qog_data)
How to cite this model in Zelig:
R Core Team. 2007.
ls: Least Squares Regression for Continuous Dependent Variables
in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
"Zelig: Everyone's Statistical Software," http://zeligproject.org/
summary(z.out)
Model:
Call:
z5$zelig(formula = hdi ~ gdp + I(gdp^2), data = qog_data)
Residuals:
Min 1Q Median 3Q Max
-0.268048 -0.064632 -0.001202 0.085496 0.227648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.264e-01 1.249e-02 42.148 <2e-16
gdp 2.537e-05 1.716e-06 14.781 <2e-16
I(gdp^2) -3.542e-10 3.707e-11 -9.556 <2e-16
Residual standard error: 0.1049 on 169 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.6681, Adjusted R-squared: 0.6642
F-statistic: 170.1 on 2 and 169 DF, p-value: < 2.2e-16
Next step: Use 'setx' method
The model that we estimated is not linear because we included a quadratic term. The effect of a $10,000 increase on the quality of life, thus, depends on the initial level of wealth. We proceed like this:
We use Zelig to simulate the quality of life when GDP/captia is 5,000
and when it is 15,000
. Zelig takes the difference between these two values for us, called first differences. It also reports the 95%
confidence interval, telling us whether we can say that the difference is statistically significant.
We use Zelig to do the same thing again but this time for GDP/captia values of 25,000
and 35,000
. Again, we see whether the difference is statistically significant.
The differences are the effects of increasing GDP/captia by 10,000
, taking the initial level of wealth into account. We compare how large the differences are in relation to the range of the dependent variable, to say whether the effect is large or small.
x.low <- setx(z.out, gdp = 5000)
x.high <- setx(z.out, gdp = 15000)
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
1 0.6446471 0.008782749 0.6444935 0.6276184 0.6625812
pv
mean sd 50% 2.5% 97.5%
[1,] 0.6433824 0.1060944 0.643428 0.4271726 0.8509397
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
1 0.8276729 0.01243337 0.8281921 0.8024493 0.8515821
pv
mean sd 50% 2.5% 97.5%
[1,] 0.8312159 0.1066876 0.8343049 0.6348777 1.026377
fd
mean sd 50% 2.5% 97.5%
1 0.1830258 0.0103986 0.1831643 0.1646652 0.2024002
The first difference is 0.18
. To calculate what the effect is in percentage terms, we can compare it to the entire range of the variable hdi
.
summary(qog_data$hdi)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.2730 0.5390 0.7510 0.6982 0.8335 0.9560 19
The entire range of hdi
is 0.96 - 0.27 = 0.68
. So the percentage increase in our dependent variable is 0.18 / 0.68 = 0.27
or 27%
. You can easily calculate this in R using the diff()
and range()
functions:
0.182 / diff(range(qog_data$hdi, na.rm = TRUE))
[1] 0.2664715
Now let's do the same for an increase from 25,000
to 35,000
x.low <- setx(z.out, gdp = 25000)
x.high <- setx(z.out, gdp = 35000)
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
1 0.9392739 0.01570933 0.9395426 0.9089002 0.9694682
pv
mean sd 50% 2.5% 97.5%
[1,] 0.9417586 0.1024977 0.9432037 0.7396054 1.131924
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
1 0.9807222 0.01786632 0.9807791 0.9460799 1.014927
pv
mean sd 50% 2.5% 97.5%
[1,] 0.9788898 0.1125125 0.9790425 0.7563236 1.190987
fd
mean sd 50% 2.5% 97.5%
1 0.04144836 0.009033525 0.04122369 0.02398032 0.05904382
The percentage increase in HDI
in this case is 0.04 / 0.68 = 0.06
or 6%
. Compare that to the effect when moving from GDP/capita of 5000
to 15,000
and we can say that the effect of wealth levels when moving from being a rich country to a super rich country (GDP/captia from 25,000
to 35,000
)
Estimate a model where wbgi_pse
(political stability) is the response variable and h_j
(1 if country has an independent judiciary) and former_col
are the explanatory variables. Include an interaction between your explanatory variables. What is the marginal effect of:
h_j
and former_col
improve model fit?Rename the variables to something we can easily remember.
qog_data <- rename(qog_data, pol_stability = wbgi_pse, indep_judiciary = h_j)
Let's take a look at the indep_judiciary
variable:
table(qog_data$indep_judiciary)
-5 1
105 64
We will recode the -5
value of indep_judiciary
to 0
:
qog_data$indep_judiciary[qog_data$indep_judiciary == -5] = 0
Estimate the two models and compare the results:
base.model <- lm(pol_stability ~ indep_judiciary + former_col,
data = qog_data)
interaction.model <- lm(pol_stability ~ indep_judiciary + former_col +
indep_judiciary:former_col,
data = qog_data)
screenreg(list(base.model, interaction.model))
==================================================
Model 1 Model 2
--------------------------------------------------
(Intercept) -0.45 ** -0.74 ***
(0.15) (0.18)
indep_judiciary 1.01 *** 1.45 ***
(0.15) (0.22)
former_col -0.16 0.21
(0.15) (0.20)
indep_judiciary:former_col -0.84 **
(0.30)
--------------------------------------------------
R^2 0.28 0.31
Adj. R^2 0.27 0.30
Num. obs. 169 169
RMSE 0.86 0.84
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Now we run the F-test to see which model is better:
anova(base.model, interaction.model)
Analysis of Variance Table
Model 1: pol_stability ~ indep_judiciary + former_col
Model 2: pol_stability ~ indep_judiciary + former_col + indep_judiciary:former_col
Res.Df RSS Df Sum of Sq F Pr(>F)
1 166 121.85
2 165 116.47 1 5.3762 7.6163 0.006438 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In model 1, we find that countries with an independent judiciary are more stable than those without. And in addition, there is no difference in political stability between countries that have been colonised and those that have not been.
In the second model, the interaction is significant and the F-test is significant as well, indicating that our new model does better at explaining political stability. The effect of an independent judiciary is conditional on whether a country was colonised or not.
We get the effect of an independent judiciary when a country was not colonised directly from model 2 in our table. The difference between countries with an independent judiciary and without, conditional on not having been colonised is 1.45
on the political stability scale (the effect of the dummy indep_judiciary
). The difference that having an independent judiciary makes for countries that have been colonised is smaller. It is 1.45 + (-0.84) = 0.61
on the political stability scale (the effect of the dummy indep_judiciary
plus the effect of the interaction).
Are those differences large or small?
summary(qog_data$pol_stability)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.46700 -0.72900 0.02772 -0.03957 0.79850 1.67600
The entire range of pol_stability
is 1.68 - (-2.47) = 4.14
. You can also get it using diff()
and range()
functions.
diff(range(qog_data$pol_stability))
[1] 4.14307
Percentage change in pol_stability
of having independent judiciary if NOT colonised
1.45 / diff(range(qog_data$pol_stability))
[1] 0.349982
Percentage change in pol_stability
of having independent judiciary if colonised
0.61 / diff(range(qog_data$pol_stability))
[1] 0.1472338
Those countries that have an independent judiciary and that were not colonised are 1.45 / 4.14 = 0.35
or 35%
more stable (compared to the range of the political stability scale as observed in the sample) than those without an independent judiciary. Having an independent judiciary makes a difference of 0.61 / 4.14 = 0.15
or 15%
if the country was colonised.
Run a model on the human development index (undp_hdi
), interacting an independent judiciary (h_j
) and control of corruption (wbgi_cce
). What is the effect of control of corruption:
We rename wbgi_cce
and estimate the two models
qog_data <- rename(qog_data, corruption_control = wbgi_cce)
base.model <- lm(hdi ~ corruption_control + indep_judiciary, data = qog_data)
interaction.model <- lm(hdi ~ corruption_control * indep_judiciary, data = qog_data)
screenreg(list(base.model, interaction.model))
==========================================================
Model 1 Model 2
----------------------------------------------------------
(Intercept) 0.68 *** 0.67 ***
(0.02) (0.02)
corruption_control 0.11 *** 0.10 ***
(0.01) (0.02)
indep_judiciary 0.05 0.05
(0.03) (0.03)
corruption_control:indep_judiciary 0.01
(0.03)
----------------------------------------------------------
R^2 0.48 0.48
Adj. R^2 0.47 0.47
Num. obs. 160 160
RMSE 0.13 0.13
==========================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Now we run F-test
anova(base.model, interaction.model)
Analysis of Variance Table
Model 1: hdi ~ corruption_control + indep_judiciary
Model 2: hdi ~ corruption_control * indep_judiciary
Res.Df RSS Df Sum of Sq F Pr(>F)
1 157 2.8320
2 156 2.8285 1 0.0035607 0.1964 0.6583
Comparing the two models we see that there is an effect of control of corruption on the quality of life. The better the institutions (more control of corruption), the higher the quality of life. We test the theory that the effect of control of corruption is conditional on having an independent judiciary, i.e. the magnitude of the effect is different in the two sub-groups. Our interaction term is insignificant. The effect of control of corruption seems to be independent of whether an independent judiciary exists or not. Our F-test also clearly rejects that our model with the interaction is doing better at explaining quality of life.
Normally, we would stop here. There is no point in illustrating a conditional effect that does not exist. However, below we illustrate the effect of control of corruption conditional on having an independent judiciary. The effects overlap and we cannot tell a difference.
Note: The interpretation relies on statistical significance. We do not find support for the conditional effect of X on Y.
z.out <- zelig(hdi ~ corruption_control * indep_judiciary,
data = qog_data, model = "ls")
How to cite this model in Zelig:
R Core Team. 2007.
ls: Least Squares Regression for Continuous Dependent Variables
in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
"Zelig: Everyone's Statistical Software," http://zeligproject.org/
summary(qog_data$corruption_control)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-1.70000 -0.81960 -0.30480 -0.05072 0.50650 2.44600 2
x.out1 <- setx(z.out, indep_judiciary = 0, corruption_control = seq(-1.7, 2.5, 0.1) )
x.out2 <- setx(z.out, indep_judiciary = 1, corruption_control = seq(-1.7, 2.5, 0.1) )
s.out <- sim(z.out, x.out1, x.out2)
# plot results
ci.plot(s.out,
ci = 95,
main = "Effect of Control of Corruption Conditional on Indep. Judiciary",
xlab = "Control of Corruption",
ylab = "Quality of Life")
Clear your workspace and download the California Test Score Data used in Chapters 4-9 by Stock and Watson.
Clear workspace
rm(list = ls())
Load the dataset and plot district income against test scores
ca_schools <- read.dta("caschool.dta")
# scatter plot
plot(testscr ~ avginc, data = ca_schools,
main = "Relationship b/w Income and Test Scores",
xlab = "Average District Income in $1000's",
ylab = "Average Test Score of Reading and Math")
Estimate the two models -- one with quadratic income, and one without
model1 <- lm(testscr ~ avginc, data = ca_schools)
model2 <- lm(testscr ~ avginc + I(avginc^2), data = ca_schools)
screenreg(list(model1, model2))
===================================
Model 1 Model 2
-----------------------------------
(Intercept) 625.38 *** 607.30 ***
(1.53) (3.05)
avginc 1.88 *** 3.85 ***
(0.09) (0.30)
I(avginc^2) -0.04 ***
(0.01)
-----------------------------------
R^2 0.51 0.56
Adj. R^2 0.51 0.55
Num. obs. 420 420
RMSE 13.39 12.72
===================================
*** p < 0.001, ** p < 0.01, * p < 0.05
In both the models, average district income is statistically significant. Students from richer districts do better. Our quadratic term is also significant in the second model. Thus, the effect is stronger for poorer districts and smaller for richer districts. The F-test confirms that our non-linear model does better at explaining test scores.
anova(model1, model2)
Analysis of Variance Table
Model 1: testscr ~ avginc
Model 2: testscr ~ avginc + I(avginc^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 418 74905
2 417 67510 1 7394.9 45.677 4.713e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can go a step further and compare the effects of income increase in richer vs poorer districts.
Let's see what happens when we move from 10th percentile to 15th percentile of income:
z.out <- zelig(testscr ~ avginc + I(avginc^2), data = ca_schools, model = "ls")
How to cite this model in Zelig:
R Core Team. 2007.
ls: Least Squares Regression for Continuous Dependent Variables
in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
"Zelig: Everyone's Statistical Software," http://zeligproject.org/
x.low <- setx(z.out, avginc = quantile(ca_schools$avginc, 0.10))
x.high <- setx(z.out, avginc = quantile(ca_schools$avginc, 0.15))
s.out <- sim(z.out, x.low, x.high)
summary(s.out)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
1 638.2858 1.019334 638.2767 636.3581 640.2284
pv
mean sd 50% 2.5% 97.5%
[1,] 637.9882 12.81654 638.2065 611.6127 662.5857
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
1 640.8613 0.9015329 640.8714 639.1829 642.5643
pv
mean sd 50% 2.5% 97.5%
[1,] 640.8461 12.83206 640.9189 615.7214 664.4534
fd
mean sd 50% 2.5% 97.5%
1 2.57554 0.1648453 2.580249 2.261046 2.892726
The simulation shows that the effect of moving from the 10th percentile in district wealth to the 15th percentile is 2.58
points. The difference is statistically significant, we can say that on a 95% confidence level test scores would increase between 2.26
and 2.89
points.
The difference in income between the 10th percentil and 15th percentile is roughly $840
. Remember that the income is reported in x1000 US dollars, so we multiply the difference by 1000 to get the dollar amount:
quantile(ca_schools$avginc, c(0.10, 0.15))
10% 15%
8.929933 9.770200
So the difference is 9.77 - 8.93 = 0.84
or $840
.
We can also check for the effect that increasing income by the same amount ($840
) would have in a wealthier district, for example let's say in the 75th percentile.
x.low <- setx(z.out, avginc = quantile(ca_schools$avginc, 0.75))
x.high <- setx(z.out, avginc = quantile(ca_schools$avginc, 0.75) + 0.84)
s.out <- sim(z.out, x.low, x.high)
summary(s.out)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
1 662.0423 0.8242851 662.0282 660.5017 663.6549
pv
mean sd 50% 2.5% 97.5%
[1,] 661.8853 12.50672 661.4153 637.9642 687.1944
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
1 663.9952 0.8781278 663.9976 662.3599 665.7192
pv
mean sd 50% 2.5% 97.5%
[1,] 664.18 12.43611 663.8773 639.6473 688.6908
fd
mean sd 50% 2.5% 97.5%
1 1.952878 0.09072237 1.951245 1.776513 2.137769
When a district at the 75th percentile would become even richer by $840
average income, that would increase test scores by 1.95
points on average. On a 95% confidence level the increase would between 1.78
and 2.14
points. That effect is smaller than in poor districts.
Note: Here, the effect of X on Y was presented in the absolute change (point change of test scores) in this exercise as opposed to percentage increase as we did in previous examples. When the absolute increase of the dependent variable is easy to understand it is good to do that.