10 Analysis

In this lesson, you will learn how to run basic linear and logit regressions in R.

10.1 Load packages

Here are some packages you will need for this lesson.

# load packages
library("here")
library("haven")

# here are some useful data analysis packages
library("Rmisc")
library("magrittr")
library("tidyverse")
library("ggplot2")
library("broom")
library("MASS")    # careful: MASS masks dplyr::select(), so write dplyr::select() explicitly if you need it
library("estimatr")
library("modelsummary")

10.2 Load data

In this lesson, we will be using the American National Election Studies (ANES) 2016 dataset. The ANES is a nationally representative survey of eligible voters in the U.S., fielded by the University of Michigan before and after every presidential election to gauge what Americans think about major issues.

Using the here package, you can construct file paths relative to your project's root directory.

# construct paths relative to the project root
here("code")
here("data")

# read stata data in R
anes2016 <- read_dta(here("data", "anes_timeseries_2016.dta"))
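
Since read_dta() keeps Stata's variable labels as attributes, you can peek at a label to check what a cryptic variable name means. A quick sketch:

# haven stores each Stata variable label in the "label" attribute
attr(anes2016$V162034, "label")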

10.3 Rename variables

Many of the variables in the ANES have names that are just gibberish codes. For example:

# DV: V162034: POST: Did R vote for President
# IV: V162256: PRE: R's interest in politics
# control: V161019: PRE: party of registration

You may want to change your variables of interest to names that are easier to reference.

# these names are hard to remember, so you can change them
anes2016$post.vote <- anes2016$V162034
anes2016$pre.interest <- anes2016$V162256
anes2016$pre.party <- anes2016$V161019
anes2016$pre.registered <- anes2016$V161011a
anes2016$pre.gender <- anes2016$V161342
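
If you prefer to do all of this in one tidyverse call, dplyr's rename() (new name on the left, old name on the right) is an alternative to the base R assignments above. A minimal sketch:

# the same renames in a single rename() call (new_name = old_name)
# run this instead of, not after, the $ assignments above, or the names will collide
anes2016 <-
  anes2016 %>%
  rename(
    post.vote      = V162034,
    pre.interest   = V162256,
    pre.party      = V161019,
    pre.registered = V161011a,
    pre.gender     = V161342
  )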

10.4 Descriptive analysis

For non-regression analysis, we may just want to look at confidence intervals. To do this, we can use the group.CI() command from the Rmisc package. Since we are not grouping post.vote, we just put "1" instead of an actual group.

Note that lines beginning with > are commands typed at the R console. If you want to run a line yourself, remove the >.

# since we are not grouping post.vote, we just put "1" instead of an actual group

> group.CI(post.vote ~ 1, data = anes2016)
  post.vote.upper post.vote.mean post.vote.lower
1         1.01874       1.014286        1.009832

To group the post.vote variable by pre.gender, we can do the following…

# now grouping vote for president...

> groupCI_anes <- group.CI(post.vote ~ pre.gender, data = anes2016)
> groupCI_anes
  pre.gender post.vote.upper post.vote.mean post.vote.lower
1         -9        1.000000       1.000000        1.000000
2          1        1.023858       1.016746        1.009635
3          2        1.018220       1.012483        1.006745
4          3        1.000000       1.000000        1.000000

10.4.1 Descriptive graph

Since this is a dataframe, you can turn it into a graph pretty easily.

# this is a dataframe so you can turn it into a graph pretty easily
ggplot(groupCI_anes, aes(x = as.factor(pre.gender), y = post.vote.mean)) + 
  geom_pointrange(aes(ymin = post.vote.lower, ymax = post.vote.upper)) +
  theme(panel.grid.minor = element_blank())

[Figure: ggplot-groupCI]
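
If you want axis labels that are more readable than the raw survey codes, you can tack on labs(). The label text here is just my suggestion:

# same plot with human-readable labels (label text is a suggestion)
ggplot(groupCI_anes, aes(x = as.factor(pre.gender), y = post.vote.mean)) +
  geom_pointrange(aes(ymin = post.vote.lower, ymax = post.vote.upper)) +
  labs(x = "Gender (survey code)", y = "Mean of post.vote") +
  theme(panel.grid.minor = element_blank())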

10.5 Data summary

The modelsummary package includes a powerful suite of data summary tools that can create the tables you often see in published articles.

You may want to look at summary statistics for each variable by category. Let's use the iris dataset because it has fewer variables; the ANES has too many variables to make a good example.

datasummary_skim(iris)

[Figure: datasummary-skim-iris]

You can also look at the correlation between the variables.

datasummary_correlation(iris)

[Figure: datasummary-correlation-iris]

10.6 Regression analysis

For all your data analysis needs in R, I highly recommend that you check out the UCLA data analysis page. If you ever want to know whether you can use a particular regression or how to use it, go to the UCLA stats page! It is super helpful.

If you don’t know how to find what you are looking for, just type “[name of regression or type of dependent variable] UCLA R data analysis” into Google.

If the options provided on the UCLA stats page are confusing, use it as a stepping stone to search for something that is easier to understand or suits your needs better.

10.6.1 Linear regression

Run a simple ordinary least squares (OLS) regression. The left-hand side is always the dependent variable (the outcome), and the right-hand side should contain the independent variable and control variables.

You typically use OLS when your dependent variable is continuous and unbounded. There are arguments that you can use OLS for a binary dependent variable in experiments; Gerber and Green (2012) argue that the OLS estimate of the average treatment effect (ATE) is unbiased.

If you ever doubt your use of OLS, you should review the OLS assumptions and think carefully about your dependent variable.

For the sake of creating examples, I might be violating some assumptions and not thinking about the variables carefully.

You should think about your tests carefully in your own work, and if you do want to use something that's not the norm, make sure you justify your case. One great way to do this is to cite other people's work that you're borrowing from, or to run some sort of validation check.

# the magrittr pipe feeds anes2016 into the "." in data = .
ols <-
  anes2016 %>%
  lm(post.vote ~ pre.interest + pre.party, data = .)

Look at the regression table.

> summary(ols)

Call:
lm(formula = post.vote ~ pre.interest + pre.party, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.02985 -0.01699 -0.01430 -0.01181  0.99338 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.002978   0.005224 192.011   <2e-16 ***
pre.interest 0.005182   0.002323   2.231   0.0258 *  
pre.party    0.001536   0.001235   1.244   0.2136    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1186 on 2727 degrees of freedom
Multiple R-squared:  0.002354,	Adjusted R-squared:  0.001622 
F-statistic: 3.217 on 2 and 2727 DF,  p-value: 0.04021
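
Since we loaded broom earlier, you can also pull these coefficients out as a tidy data frame, which is handy for plotting or exporting. A quick sketch:

# broom::tidy() returns the coefficient table as a data frame
tidy(ols, conf.int = TRUE)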

10.6.2 Diagnostic plots

Look at the diagnostic plots for residuals, fitted values, Cook's distance, and leverage. These plots help you figure out whether there are data points that might unduly influence your results.

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)

[Figure: diagnostic plots]
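
The plots label the most extreme observations by row number. You can also pull those row numbers out programmatically; a minimal sketch using base R's cooks.distance():

# list the observations with the largest Cook's distance
cooks <- cooks.distance(ols)
head(sort(cooks, decreasing = TRUE))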

You can use simple base R indexing to examine those observations. You can see where these respondents are from, and so on, and work out why their observations are out of the norm.

For example, I want to see if they are all from the same state (V161010d is the state variable).

> anes2016[c(1884, 1918, 2464, 2458), "V161010d"]
# A tibble: 4 x 1
               V161010d
              <dbl+lbl>
1 42 [42. Pennsylvania]
2 47 [47. Tennessee]   
3 39 [39. Ohio]        
4 34 [34. New Jersey]  

10.6.3 Linear regression with robust standard errors

Robust standard errors are useful in the social sciences, where you often don't know the structure of the error variance. To get standard errors for your OLS that are robust to heteroskedasticity, you want to insert your regression into lm_robust() from the estimatr package. Learn more about the command here.

Note that the UCLA page suggests rlm(), but I think lm_robust() is easier to use.

> lm_robust(post.vote ~ pre.interest + pre.party, data = anes2016)
                Estimate  Std. Error    t value   Pr(>|t|)      CI Lower
(Intercept)  1.002977920 0.004447216 225.529416 0.00000000  0.9942576670
pre.interest 0.005182496 0.002366346   2.190084 0.02860257  0.0005424841
pre.party    0.001535810 0.001336292   1.149308 0.25052992 -0.0010844361
                CI Upper   DF
(Intercept)  1.011698172 2727
pre.interest 0.009822509 2727
pre.party    0.004156056 2727
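
lm_robust() can also cluster standard errors through its clusters argument. For example, to cluster by state using V161010d (the state variable from the diagnostics section; this is just an illustration, not a recommendation for this model):

# same model with standard errors clustered by state (illustration only)
lm_robust(post.vote ~ pre.interest + pre.party,
          clusters = V161010d, data = anes2016)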

10.6.4 Logit regression

You may want to use a generalized linear model if your dependent variable contains binary outcomes.

You may want to first check the “family” of distributions of your outcome using a histogram. I like to use base R for this so I can quickly eyeball the distribution of the data.

hist(anes2016$post.vote)

[Figure: hist-before]

The data is a bit of a mess. You may want to only keep the respondents who either voted or did not vote.
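
Before filtering, you can tabulate the variable to see exactly which codes are in there:

# see which response codes post.vote actually takes
table(anes2016$post.vote)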

anes2016 <-
  anes2016 %>%
  filter(post.vote == 1 | post.vote == 2)

You can look at the histogram again.

hist(anes2016$post.vote)

Annoyingly, 1 = voted for president and 2 = did not vote for president. Grrr…

Many binary variables are not perfectly 0 and 1 unless you convert them. I personally like to create a new variable when I am modifying an existing variable (so I always have the original around in case I need to go back and check something).

anes2016 <-
  anes2016 %>%
  mutate(
    post.vote.binary = ifelse(post.vote == 1, 1, 0))
hist(anes2016$post.vote.binary)

[Figure: hist-after]

Now it is 1 = voted for president and 0 = did not vote for president. Yay!

For the gender variable, you may convert it so you can look specifically at how women vote. A dataset like the ANES does not take gender issues into careful consideration, but other surveys (and hopefully your own) do. One aspect of representative voting that people are interested in, and that a representative dataset like the ANES can evaluate, is how women vote. One way to do this is to convert the gender variable to "woman" or "not woman."

Here, if pre.gender == 2, then the respondent is a woman. From this variable, you want to create a variable for respondents who identify as women.

anes2016 <-
  anes2016 %>%
  mutate(
    pre.female = ifelse(pre.gender == 2, 1, 0))
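
It is good practice to cross-tabulate a recode against the original variable to make sure it did what you expect. Note that the missing codes (like -9) also land in 0 with this ifelse(); in real work you may want to recode those to NA first.

# cross-tab the recode against the original codes
# (missing codes such as -9 end up in the 0 category here)
table(anes2016$pre.gender, anes2016$pre.female, useNA = "ifany")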

Now, you want to know if identifying as a woman makes a respondent more likely to vote for president. Since the dependent variable, post.vote.binary, is binary, you will want to use a GLM with a binomial family (a logit).

logit <- glm(post.vote.binary ~ pre.female, family = binomial, data = anes2016)

Then, you want to look at the regression table.

> summary(logit)

Call:
glm(formula = post.vote.binary ~ pre.female, family = binomial, 
    data = anes2016)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9609   0.1585   0.1585   0.1813   0.1813  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.0999     0.2200  18.634   <2e-16 ***
pre.female    0.2710     0.3235   0.838    0.402    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 408.82  on 2729  degrees of freedom
Residual deviance: 408.12  on 2728  degrees of freedom
AIC: 412.12

Number of Fisher Scoring iterations: 7

You may ask yourself, “How do I transfer my regression table into R Markdown??”

Let me introduce you to the stargazer package. Here is the best Stargazer tutorial.

Install and load stargazer.

install.packages("stargazer")
library(stargazer)

First off, stargazer makes your tables look NICE.

stargazer(logit, type = "text",
          # you can customize the labels
          title            = "Effect of gender on voting for president",
          covariate.labels = c("Female"),
          dep.var.labels   = "Voted for president")

This is a nice stargazer text table. It kinda looks like you ripped it out of a published journal!

Effect of gender on voting for president
=============================================
                      Dependent variable:    
                  ---------------------------
                      Voted for president    
---------------------------------------------
Female                       0.271           
                            (0.324)          
                                             
Constant                   4.100***          
                            (0.220)          
                                             
---------------------------------------------
Observations                 2,730           
Log Likelihood             -204.059          
Akaike Inf. Crit.           412.119          
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

But how do you insert this into your R Markdown document?

Well, you will want to print it as a LaTeX table because R Markdown has built-in TeX options. I think TeX is nicer for formatting and typesetting, but R Markdown is more convenient (I think!).

Remove the beginning hashtags # in the code…

# ```{r, results='asis'}
stargazer(logit, type='latex',
          title            = "Effect of gender on voting for president",
          covariate.labels = c("Female"),
          dep.var.labels   = "Voted for president",
          # header = FALSE suppresses the citation header that stargazer
          # prints on every run; cite the package in your References instead
          header = FALSE)
# ```

You’ll need to do this for your homework assignment!

Now, you can create a second regression with some controls…

# V165510: age group
# V165511: education

In reality, you should clean the variables like I did with the vote and gender variables.

logit2 <- glm(post.vote.binary ~ pre.female + V165510 + V165511, family = binomial, data = anes2016)
# check out the table

Here’s the GLM summary table.

> summary(logit2)

Call:
glm(formula = post.vote.binary ~ pre.female + V165510 + V165511, 
    family = binomial, data = anes2016)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0468   0.1590   0.1601   0.1817   0.2058  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.1517     0.2530  16.412   <2e-16 ***
pre.female    0.2688     0.3236   0.831    0.406    
V165510       0.1548     0.3269   0.473    0.636    
V165511      -0.1266     0.2767  -0.458    0.647    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 408.82  on 2729  degrees of freedom
Residual deviance: 407.89  on 2726  degrees of freedom
AIC: 415.89

Number of Fisher Scoring iterations: 7

Personally, I like seeing the table in R first before I put it in R Markdown, but this is a matter of personal taste.

stargazer(logit, logit2, type='text',
          title            = "Effect of gender on voting for president",
          covariate.labels = c("Female", "Age", "Education"),
          dep.var.labels   = "Voted for president",
          header = FALSE)

This is how it looks…

Effect of gender on voting for president
==============================================
                      Dependent variable:     
                  ----------------------------
                      Voted for president     
                       (1)            (2)     
----------------------------------------------
Female                0.271          0.269    
                     (0.324)        (0.324)   
                                              
Age                                  0.155    
                                    (0.327)   
                                              
Education                           -0.127    
                                    (0.277)   
                                              
Constant             4.100***      4.152***   
                     (0.220)        (0.253)   
                                              
----------------------------------------------
Observations          2,730          2,730    
Log Likelihood       -204.059      -203.945   
Akaike Inf. Crit.    412.119        415.890   
==============================================
Note:              *p<0.1; **p<0.05; ***p<0.01
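
As an alternative to stargazer, the modelsummary package we loaded earlier can produce a similar side-by-side table, and it knits nicely to HTML, Word, and LaTeX:

# side-by-side regression table with modelsummary
modelsummary(list("(1)" = logit, "(2)" = logit2))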

A logit table reports coefficients in log odds, so you may want to convert them into odds ratios to make them easier to interpret. To do this, I recommend Jorge Cimentada's nifty function.
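
For a quick look without any extra function, exponentiating the coefficients and their confidence intervals gives odds ratios directly. A minimal sketch:

# convert log odds to odds ratios with 95% confidence intervals
# (confint() on a glm uses profile likelihood, so it can take a moment)
exp(cbind(OR = coef(logit), confint(logit)))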

I like tables, but there are people out there who do not like tables. To appease society, you can use the coefplot package or the tools in the modelsummary package. Here is an example of plotting coefficients with modelplot() from the modelsummary package.

modelplot(ols)
# change the variable names on the coefficient plot
# (note: coefficients not listed in coef_map, like the intercept, are dropped)
vars <- c('pre.interest' = 'Interest',
          'pre.party' = 'Party')

Now for the graph…

# ta-da!
modelplot(ols, coef_map = vars)

[Figure: modelsummary-coefplot]