10 Analysis
In this lecture, you will learn how to do basic linear regressions in R.
10.1 Load packages
Here are some packages you will need for this lesson.
# load package
library("here")
library("haven")
# here are some useful data analysis packages
library("Rmisc")
library("magrittr")
library("tidyverse")
library("ggplot2")
library("broom")
library("MASS")
library("estimatr")
library("modelsummary")
10.2 Load data
In this lesson, we will be using the American National Election Survey (ANES) 2016 dataset. The ANES is a nationally representative dataset of eligible voters in the U.S. fielded by the University of Michigan before and after every presidential election to gauge what Americans thought about major issues.
Using the here
package, you can set up folders and directories.
# setup folders and directories
here("code")
here("data")
# read stata data in R
anes2016 <- read_dta(here("data", "anes_timeseries_2016.dta"))
10.3 Rename variables
Many of the variables in the ANES are just gibberish code. For example:
# DV: V162034: POST: Did R vote for President
# IV: V162256: PRE: R's interest in politics
# control: V161019: PRE: party of registration
You may want to change your variables of interests to names that are easier to reference.
# these names are hard to remember, so you can change them
anes2016$post.vote <- anes2016$V162034
anes2016$pre.interest <- anes2016$V162256
anes2016$pre.party <- anes2016$V161019
anes2016$pre.registered <- anes2016$V161011a
anes2016$pre.gender <- anes2016$V161342
10.4 Descriptive analysis
For non-regression analysis, we may just want to look at the confidence intervals. To do this, we can use the group.CI()
command. Since we are not grouping post.vote, we just put “1” instead of an actual group.
Note that >
precedes R commands. If you want to run the line in R, please remove >
.
# since we are not grouping post.vote, we just put "1" instead of an actual group
> group.CI(post.vote ~ 1, data = anes2016)
post.vote.upper post.vote.mean post.vote.lower
1 1.01874 1.014286 1.009832
To group the post.vote
variable by pre.gender
, we can do the following…
# now grouping vote for president...
> groupCI_anes <- group.CI(post.vote ~ pre.gender, data = anes2016)
> groupCI_anes
pre.gender post.vote.upper post.vote.mean post.vote.lower
1 -9 1.000000 1.000000 1.000000
2 1 1.023858 1.016746 1.009635
3 2 1.018220 1.012483 1.006745
4 3 1.000000 1.000000 1.000000
10.4.1 Descriptive graph
Since this is a dataframe so you can turn it into a graph pretty easily.
# this is a dataframe so you can turn it into a graph pretty easily
ggplot(groupCI_anes, aes(x = as.factor(pre.gender), y = post.vote.mean)) +
geom_pointrange(aes(ymin = post.vote.lower, ymax = post.vote.upper)) +
theme(panel.grid.minor = element_blank())

ggplot-groupCI
10.5 Data summary
The modelsummary
package includes a powerful suite of data summary tools. It has many tools that can create the tables you often see in published articles.
You may want to look at each varaible by category. Let’s use the iris
dataset because there are fewer varibles…the ANES has too many variables and may not make the best example…
datasummary_skim(iris)

datasummary-skim-iris
You can also look at the correlation between the variables.
datasummary_correlation(iris)

datasummary-correlation-iris
10.6 Regression analysis
For all your data analysis needs in R, I ghihly recommend that you check out the UCLA data analysis page. If you ever want to know whether you can use a particular regression or how to use a particular regression, go to the UCLA stats page! It is super helpful.
If you don’t know how to find what you are looking for, just type “[name of regression or type of dependent variable] UCLA R data analysis” into Google.
If the options provided to you on the UCLA stats page are confusing, you should use it as a stepping stone to search for something that is easier to understand or suit your needs better.
10.6.1 Linear regression
Run a simple ordinary least squared regression. The left-hand side is always the dependent variable (outcome), and the right-hand side should contain the independent variable and control variables.
You typically use an OLS when their dependent variable is continuous & unbounded. There are some arguments that you can use OLS for binary dependent variable in experiment, as Gerber and Green (2012) argue that the average treatment effect (ATE) is unbiased.
If you ever doubt your use of OLS, you should check out OLS assumptions and consider your dependnet variables.
For the sake of creating examples, I might be violating some assumptions and not thinking about the variables carefully.
You should think about your tests carefully in your own work and if you do want to use something that’s not the norm, you should make sure that you justify your case. One great way to do this is cite other people’s work that you’re borrowing from, or conduct some sort of validation mechanism.
ols <-
anes2016 %>%
lm(post.vote ~ pre.interest + pre.party, data = .)
Look at the regression table.
> summary(ols)
Call:
lm(formula = post.vote ~ pre.interest + pre.party, data = .)
Residuals:
Min 1Q Median 3Q Max
-0.02985 -0.01699 -0.01430 -0.01181 0.99338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.002978 0.005224 192.011 <2e-16 ***
pre.interest 0.005182 0.002323 2.231 0.0258 *
pre.party 0.001536 0.001235 1.244 0.2136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1186 on 2727 degrees of freedom
Multiple R-squared: 0.002354, Adjusted R-squared: 0.001622
F-statistic: 3.217 on 2 and 2727 DF, p-value: 0.04021
10.6.2 Diagnostic plots
Look at the diagnostic plots for residuals, fitted values, Cook’s distance, and leverage. It is important to look at diagnostic plots so you can figure out if there are data points that might influence your results.
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)

opar
You can use a simple base R command to examine the observations. You can see where these respondents are from, etc. and see why their observations are out of the norm.
For example, I want to see if they are all from the same state (V161010d
is the state variable).
> anes2016[c(1884, 1918, 2464, 2458), "V161010d"]
# A tibble: 4 x 1
V161010d
<dbl+lbl>
1 42 [42. Pennsylvania]
2 47 [47. Tennessee]
3 39 [39. Ohio]
4 34 [34. New Jersey]
10.6.3 Linear regression with standard errors
Robust standard errors are useful in social sciences where you don’t know the structure of the variations. So to get unnbiased standard errors for OLS, you want to insert your regression into lm_robust
. Learn more about the command here.
Note that the UCLA page suggests rlm()
but I think lm_robust
is easier to use.
> lm_robust(post.vote ~ pre.interest + pre.party, data = anes2016)
Estimate Std. Error t value Pr(>|t|) CI Lower
(Intercept) 1.002977920 0.004447216 225.529416 0.00000000 0.9942576670
pre.interest 0.005182496 0.002366346 2.190084 0.02860257 0.0005424841
pre.party 0.001535810 0.001336292 1.149308 0.25052992 -0.0010844361
CI Upper DF
(Intercept) 1.011698172 2727
pre.interest 0.009822509 2727
pre.party 0.004156056 2727
10.6.4 Logit regression
You may want to use a generalized linear model if your dependent variable contains binary outcomes.
You may want to first check the “family” of distributions of your outcome using a histogram. I like to use base R for this so I can quickly eyeball the distribution of the data.
hist(anes2016$post.vote)

hist-before
The data is a bit of a mess. You may want to only keep the respondents who either voted or did not vote.
anes2016 <-
anes2016 %>%
filter(post.vote == 1 | post.vote == 2)
You can look at the histogram again.
hist(anes2016$post.vote)
Annoyingly, 1
= voted for president and 2
= did not vote for president. Grrr…
Many binary variables are not perfectly 0
and 1
unless you convert it. I personally like to create a new variable if I am modifying an existing variable (so I always have the original variable around in case I need to go back and check something).
anes2016 <-
anes2016 %>%
mutate(
post.vote.binary = ifelse(post.vote == 1, 1, 0))
hist(anes2016$post.vote.binary)

hist-after
Now it is 1
= voted for president and 2
= did not vote for president. Yay!
For the gender variable, you may convert it to only look at the women vote. For a dataset like the ANES, the survey does not take into careful consideration of gender issues, but other surveys—and hopefully your own—take it into account. But one aspect of representative voting that people are interested in, and that a representative dataset like the ANES is able to evaluate, is how women vote. One way to do this is to convert the gender variable to “woman” or “not woman.”
Here, if pre.gender
== 2, then the respondent is a woman. From this variable, you want to create a variable for respondents who identify as women.
anes2016 <-
anes2016 %>%
mutate(
pre.female = ifelse(pre.gender == 2, 1, 0))
Now, you want to know if identifying as female makes a respondent more likely to vote for president. Since pre.female
is a binary variable, you will want to use a GLM.
logit <- glm(post.vote.binary ~ pre.female, family = binomial, data = anes2016)
Then, you want to look at the regression table.
> summary(logit)
Call:
glm(formula = post.vote.binary ~ pre.female, family = binomial,
data = anes2016)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9609 0.1585 0.1585 0.1813 0.1813
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.0999 0.2200 18.634 <2e-16 ***
pre.female 0.2710 0.3235 0.838 0.402
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 408.82 on 2729 degrees of freedom
Residual deviance: 408.12 on 2728 degrees of freedom
AIC: 412.12
Number of Fisher Scoring iterations: 7
You may ask yourself, “How do I transfer my regression table into R Markdown??”
Let me introduce you to the stargazer
package. Here is the best Stargazer tutorial.
Install and load stargazer
.
install.packages("stargazer")
library(stargazer)
First off, stargazer makes your tables look NICE.
stargazer(logit, type = "text",
# you can customize the labels
title = "Effect of gender on voting for president",
covariate.labels = c("Female"),
dep.var.labels = "Voted for president")
This is a nice stargazer
text table. It kinda looks like you ripped it out of a published journal!
Effect of gender on voting for president
=============================================
Dependent variable:
---------------------------
Voted for president
---------------------------------------------
Female 0.271
(0.324)
Constant 4.100***
(0.220)
---------------------------------------------
Observations 2,730
Log Likelihood -204.059
Akaike Inf. Crit. 412.119
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
But how do you insert this into your R Markdown document?
Well, you will want to print it as a LaTeX table because R Markdown has built-in tex options. I think tex is nicer for formatting and typesetting, but R markdown is more convenient (I think!).
Remove the beginning hashtags #
in the code…
# ```{r, results='asis'}
stargazer(logit, type='latex',
title = "Effect of gender on voting for president",
covariate.labels = c("Female"),
dep.var.labels = "Voted for president",
# you can cite the package in your References section, but it might be frustrating to see a citation every time you run stargazer...but you can just set it to header = FALSE and everything is okay
header = FALSE)
# ```
You’ll need to do this for your homework assignment!
Now, you can create a second regression with some controls…
# V165510: age group
# V165511: education
In reality, you should clean the variables like I did with the vote and gender variables.
logit2 <- glm(post.vote.binary ~ pre.female + V165510 + V165511, family = binomial, data = anes2016)
# check out the table
Here’s the GLM summary table.
> summary(logit2)
Call:
glm(formula = post.vote.binary ~ pre.female + V165510 + V165511,
family = binomial, data = anes2016)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0468 0.1590 0.1601 0.1817 0.2058
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.1517 0.2530 16.412 <2e-16 ***
pre.female 0.2688 0.3236 0.831 0.406
V165510 0.1548 0.3269 0.473 0.636
V165511 -0.1266 0.2767 -0.458 0.647
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 408.82 on 2729 degrees of freedom
Residual deviance: 407.89 on 2726 degrees of freedom
AIC: 415.89
Number of Fisher Scoring iterations: 7
Personally I like seeing the table in R first before I put it in R Markdown, but this is a matter of personal taste.
stargazer(logit, logit2, type='text',
title = "Effect of gender on voting for president",
covariate.labels = c("Female", "Age", "Education"),
dep.var.labels = "Voted for president",
header = FALSE)
This is how it looks…
Effect of gender on voting for president
==============================================
Dependent variable:
----------------------------
Voted for president
(1) (2)
----------------------------------------------
Female 0.271 0.269
(0.324) (0.324)
Age 0.155
(0.327)
Education -0.127
(0.277)
Constant 4.100*** 4.152***
(0.220) (0.253)
----------------------------------------------
Observations 2,730 2,730
Log Likelihood -204.059 -203.945
Akaike Inf. Crit. 412.119 415.890
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
For log odds in a GLM table, you may want to convert the coefficients into odds ratios so they’re easier to interpret. To do this, I recommend Jorge Cimentada’s nifty function.
I like tables but there are people out there who do not like tables. To appease society, you can use coefplots
package or the tools in the modelsummary
package. Here is an example of plotting coefficient plots in the modelsummary
package.
modelplot(ols)
# change variable names on coefplot
vars <- c('pre.interest' = 'Interest',
'pre.party' = 'Party')
Now for the graph…
# ta-da!
modelplot(ols, coef_map = vars)

modelsummary-coefplot