8 Data Manipulation

In this lecture, you will learn how to manipulate data. This is probably the most important and unexpectedly time-consuming thing you will do as a social scientist.

Much of this lesson is indebted to my favorite data manipulation page on the internet: Mike DeCrescenzo’s data manipulation tutorial from his 2018 PS811 course. It is the R tutorial I keep returning to again and again.

8.1 Install packages

# packages
library("here")
library("haven")
library("magrittr")
library("tidyverse")
library("tidyr")
library("dplyr")

8.2 Read data

Here, I am going to introduce you to the here package. To read more about here, I recommend that you check out this helpful love letter to the here package.

# setup folders and directories
here("_ps811", "data")
here("_ps811", "code")

# read stata data in R
scotus <- read_dta(here("_ps811", "data", "SCDB_2020_01_justiceCentered_Citation.dta"))

There are various ways you can check out your data in R.

# how does my data LOOK??
scotus
# that was a lot
head(scotus)
# that's a little better

# bottom six?
tail(scotus)

# i want to see more than six, like eight?
head(scotus, n = 8)
tail(scotus, n = 8)

# you can use the View() command
# this can be slow if it's a large dataset
View(scotus)

# if the file is, for some reason, NOT a dataframe, you can always force it become a dataframe:

as.data.frame(scotus)

# this doesn't do anything because it is already a dataframe

# if you are ever struggling with a function, go ahead and do ?function_name
# the help guide should load in the help pane

?as.data.frame

8.3 Understand data frames

You might get tired from referencing both the dataset and the variable every time you want to refer to a variable. I know, typing is exhausting! You might be tempted to use the attach(dataset) function. Don’t use it. It’s not worth the headaches down the road. You will likely use multiple datasets with the same variable names and it is always less confusing for yourself and future reviewers/collaborators if you refer to both the dataset and the variable.

# number of rows
nrow(scotus)

# number of columns
ncol(scotus)

# literally my favorite function: names()
# super helpful for other data manipulation tasks
names(scotus)

# you cannot simply reference the variable
justice
# Error: object 'justice' not found

# you have to refer to the dataset THEN variable
scotus$justice

I recommend that you download or reference the codebook whenever possible. Sometimes variables are simply a string of numbers and letters (e.g., the ANES dataset has weird codes for their variables that can only be identified by looking them up in the codebook). Thankfully, the SCOTUS database is better than that, but you might still want to clarify a few things so it is easier for you (and your collaborators!) to work with. In any case, the variables are case sensitive so you might want to change the cases.

You can access the online codebook for the SCOTUS database. Many codebooks for other datasets may be in PDF or Word format.

8.3.1 Data frame exercise

Creating your own data frame from scratch may help you understand data frames a bit better.

# take everything from the streaming service column
streaming <- c("Netflix", "Hulu", "Amazon Prime") 

# take everything from the approve column
approve <- c(50, 30, 15) 

# take everything from the disapprove column
disapprove <- c(50, 70, 85) 

# create variables from existing variables
difference <- approve-disapprove

# create your dataframe
streaming <- data_frame(streaming, approve, disapprove, difference) 
streaming

# this is a tibble
streaming_df <- as.data.frame(streaming_dataframe)

8.4 Use tidyverse for data manipulation

You can use base R to manipulate data if you want. You should learn and know base R because many replication materials use base R. But, I gotta tell you, tidyverse is the grand suite for manipulating data. Think of it as the Adobe for data manipulation. So, for your own data manipulation endeavors, I’m going to advocate that you use the tidyverse suite of packages.

With tidyverse, you can do the following:

  • rename(): rename variables
  • mutate(): create new variables
  • select(): look at specific variable
  • filter(): filter based on conditions
  • summarize(): look at summary statistics
  • group_by(): group by variables
  • arrange(): sort variables by ascending/descending
  • merge data using join functions
    • more advanced than the base R merge() options

With tidyr, you can do the following:

  • gather(): collate column names and turn them into values of a single variable (opposite of spread())
  • spread(): untangles a variable with multiple repeating values and turns them into their own columns (opposite of gather())

8.4.1 Rename data variables

# rename(dataset, 
#        new_name = old_name,
#        new_name2 = old_name2)

scotus <- rename(scotus, 
               case.id = caseId,
               docket.id = docketId, 
               case.issues.id = caseIssuesId, 
               vote.id = voteId)

8.4.1.1 Remember piping?

Remember Lesson 6 where we talked about piping? You can also do this using the piping technique.

scotus <- scotus %>%
              rename(case.id = caseId,
               docket.id = docketId, 
               case.issues.id = caseIssuesId, 
               vote.id = voteId)

This can make some of the code run faster. You will be asked to pipe on your homework assignment.

The problem with the rename() function is that it overwrites everything. This means you will no longer have the old variable name around anymore because you have changed it to a new variable name.

But what if you don’t want to overwrite anything?

8.4.2 Create new data variables

Instead of overwriting variables, you may simply want to keep your old variables and create new ones.

# let's look at decision directions
table(scotus$decisionDirection)

You will get the following result:

    1     2     3 
38531 40574  1390 

We know from the code book that: 1. conservative 2. liberal 3. unspecifiable

Say you want to create a dummy variable named “liberal” that has all liberal cases coded “1” and all non-liberal cases coded “0”. And, you want to create a dummy variable named “conservative that has all conservative cases coded”1" and all non-conservative cases coded “0”.

In pseudocode-logical speak (I made this up but stay with me), you’re thinking along the lines of the following code:

# new_variable <- ifelse(condition, if_true, if_false)
# liberal <- ifelse(decisionDirection == 2, 1, 0)

That’s not exactly it but the logic is certainly there. Let me turn it into real code for you:

scotus <- 
  mutate(scotus, 
         liberal = ifelse(decisionDirection == 2,
                          1,
                          0), 
         conservative = ifelse(decisionDirection == 1,
                               1,
                               0))

Say you want to do create a variable that identifies the decision dates by decade. Here’s your chance to work with date values, which many of you will have to deal with somewhere along the way.

# take a look at the variable
summary(scotus$year)

# turn the dateDecision variable into a date variable with the date structure you see in the summary() results
# extract the year and put it in a variable called year
scotus$year <- as.numeric(format(as.Date(scotus$dateDecision, "%Y-%m-%d"), '%Y'))
summary(scotus$year)

# using the year variable, create the decade variable
scotus <- mutate(scotus,
                 decade = ifelse(
                   year %in% 1940:1949, 1940, NA),
                 decade = ifelse(
                   year %in% 1950:1959, 1950, decade),
                 decade = ifelse(
                   year %in% 1960:1969, 1960, decade),
                 decade = ifelse(
                   year %in% 1970:1979, 1970, decade),
                 decade = ifelse(
                   year %in% 1980:1989, 1980, decade),
                 decade = ifelse(
                   year %in% 1990:1999, 1990, decade),
                 decade = ifelse(
                   year %in% 2000:2009, 2000, decade),
                 decade = ifelse(
                   year %in% 2010:2019, 2010, decade),
                 decade = ifelse(
                   year %in% 2020:2029, 2020, decade))

# check out the number of votes per decade
table(scotus$decade)

You may also want to create new variables. Here is an example of creating a new variable that shows the number of years since a particular vote was cast by a justice.

scotus <- mutate(scotus,
                 time_trends = 2020 - year)

8.4.3 Logical operators

You can access a complete list of logical operators on the Data Mentor page, but I’ve highlighted a few that you will likely encounter.

  • <: less than
  • <=: less than or equal to
  • >: greater than
  • >=: greater than or equal to
  • ==: exactly equal to
  • !=: not equal to
  • !x: Not x
  • x | y: x OR y
  • x & y: x AND y
  • x %in% c(y, z): x is equal to y or z
  • isTRUE(x): test if x is TRUE
  • isFALSE(x): test if x is false
  • is.na(x): test if x is NA
  • !is.na(x): test if x is NOT NA

8.4.4 Same function, multiple conditions

What if we want to check multiple conditions within the same function? You use the case_when() function.

scotus <- mutate(scotus,
               decisionDirection = case_when(
                 decisionDirection == 1 ~ conservative,
                 decisionDirection == 2 ~ liberal,
                 decisionDirection == 3 ~ unspecified))

8.4.5 Select columns

Say your question is whether the decade a case was decided influences whether it is liberal or not, so you might only care about two variables.

You use the select() function. The structure of the select() function is select(dataset, var1, var2).

var1, var2, etc. are the variables you want to select from the dataset. Think of it as grabbing these variables and deleting all the rest.

select(scotus, decade, liberal)

You can also turn it into an object so you can use it later.

scotus_select <- select(scotus, decade, liberal)

Say you only want to select caseId and all background variables (consecutively ordered).

select(scotus, caseId, caseName:lcDispositionDirection)

Say you only want to select caseid and any variable that contains “case” in the name.

select(scotus, caseId, contains("case"))

Say you want to drop the docketId and keep all other variables.

select(scotus, -docketId)

# more specifically, you can do:
select(scotus, -docketId, matches("."))
# matches(".") keeps all remaining variables
# though not having also keeps all remaining variables

8.5 Filter data

You know how you can filter data in Excel? You can do this in R as well. Why would you want to know how to do it in R when you already know how to do it in Excel? Well, if you have a dataset that has hundreds of thousands of observations, it will take forever to load and filter in Excel. It will be way faster if you do it in R.

Say you only want to filter to decisions from before 1990.

filter(scotus, year < 1990)

Say you only want to filter to decisions from 1990.

filter(scotus, year == 1990)

Say you want to filter to decisions from Ruth Bader Ginsburg in 2000

filter(scotus, year == 2000, justiceName == "RBGinsburg")

8.5.1 Summarize data

Think of this as a more advanced summary() command where you can specify what you want to summarize.

# look at the mean of liberal cases (1 being liberal, 0 being non-liberal)
scotus %>%
  summarise(mean = mean(liberal, na.rm = TRUE), n = n())
  
# group by decade, then look at the mean of liberal cases
scotus %>%
  group_by(decade) %>%
  summarise(mean = mean(liberal, na.rm = TRUE), n = n())

# group by decade and look at the quantiles of liberal cases
scotus %>%
  group_by(decade) %>%
  summarise(qs = quantile(liberal, na.rm = TRUE, c(0.25, 0.75)), prob = c(0.25, 0.75))

8.5.2 Sort data

Remember this object we created earlier in the select() section?

scotus_select <- select(scotus, decade, liberal)

Sort decade by ascending order.

arrange(scotus_select, decade, liberal)

Sort decade by descending order.

arrange(scotus_select, desc(decade), liberal)

8.5.2.1 Identify what to sort

Here’s a mini exercise to help you understand sorting.

# there's an order to this
# arrange(dataset, var1, var2)
# this means you sort by var1, then by var2
# for example

sort_ex_df <- data_frame(x = c(1, 2, 3, 3, 5, 6), 
                y = c("A", "C", "D", "E", "B", "F"))

arrange(sort_ex_df, x, y)
arrange(sort_ex_df, y, x)

8.5.3 Merge datasets

You might have two datasets that you would like to merge.

For some reason, you might obtain grades for a few students from their English and math classes and be asked to create a report card for each student.

english <- data_frame(studentID = c(990055, 889765, 189245, 346789,                                         534098, 132938, 789012), 
                     grade = c(90, 85, 60, 75, 67, 93, 82))

math <- data_frame(studentID = c(990055, 889765, 189245, 346789, 534098,                                  345890), 
                   grade = c(80, 90, 50, 85, 95, 66))
                   

You can do this really quickly in base R without much thought.

merge(english, math, by="studentID")
# english grade is "grade.x" and math grade is "grade.y"

The basic merge() function only merge variables that exist in BOTH datasets. If you want to keep all variables in x or y, you can do it in merge() (and it’s only a Google search away!) but it’s WAY easier to remember the functions in the dplyr package.

# imitate merge()
inner_join(english, math, by = "studentID")

# join all, keep all variables, even if NA
full_join(english, math, by="studentID")

# keep all students in x
left_join(english, math, by = "studentID")

# keep all students in y
right_join(english, math, by = "studentID")

# figure out what doesn't match
# shows studentID that is in english but not in math
anti_join(english, math, by = "studentID")

# shows studentID that is in math but not in english
anti_join(math, english, by = "studentID")

8.5.4 View tables

If you want to just take a look at a variable, the table() function is extremely helpful.

# sometimes you just want to see a table of what's going on with a variable
table(scotus$decade)
table(scotus$liberal)

You can also turn the tables into a proportional table and control the number of decimal points that appear.

# turn it into a proportional table
prop.table(table(scotus$decisionDirection))

# round to 2 decimal points
round(prop.table(table(scotus$decisionDirection)), 2)

You can also create a crosstab with two variables.

# basic crosstab
table(scotus$decade, scotus$decisionDirection)

# and a prop crosstab
prop.table(table(scotus$decade, scotus$decisionDirection))

# and with rounding!
round(prop.table(table(scotus$decade, scotus$decisionDirection)), 2)

Turn the basic table into an object.

scotus_cross_example <- table(scotus$decade, scotus$decisionDirection)

I am going to demonstrate the margin command.

Say I want to know (liberal decision in 1940s)/(all liberal decisions in every decade). I will want to look at margin = 1, which will add up the number of liberal decisions in the dataset (rows).

scotus_cross_prop_example_row <- prop.table(scotus_cross_example, margin = 1)

Say I want to know (liberal decision in 1940s)/(all decisions in the 1940s). I will want to look at margin = 2, which will add up the number decisions per type (columns).

scotus_cross_prop_example_col <- prop.table(tab, margin = 2)

Say you want to count the number of cases per decade per decision type and you want it in a dataframe.

count(scotus, decade, decisionDirection)

You can turn this data frame into an object!

scotus_count <- count(scotus, decade, decisionDirection)

8.5.5 Group variables

You may want to create descriptive counts per group.

# you want to see the proportion of type_of_decision per decade
# so you sum up ALL the decisions per decade
# divide the number of decision types per decade by the number of ALL decisions in decade
# then round them to 3 decimal places
mutate(group_by(scotus_count, decade), 
       n_in_decade = sum(n, na.rm = TRUE),
       p = n / n_in_decade,
       p = round(p, 3))

8.5.6 Reshape data

I can demonstrate this better if we create a new dataset.

scotus_df_example <- select(scotus, term, issueArea, decisionDirection)

Currently, our decade variable is what you would call a wide dataset because it is horizontally wide.

head(scotus_df_example)
scotus_df_example <- as.data.frame(scotus_df_example)

Let’s use what we’ve learned to wrangle this dataset a bit. We want to select the term, liberal, and conservative variables. We want to filter the terms from 2017 to 2018. Term is different from year because term covers an entire SCOTUS term from October to June (i.e., so when the case was heard), while the year of decision (the year variable that we created earlier) is whenever SCOTUS filed the decision.

# select our variables
scotus_df_example <- select(scotus, term, liberal, conservative)

# filter to terms 2017, 2018, 2019
scotus_df_example <- filter(scotus_df_example, term %in% c(2017, 2018, 2019))

# group by term
# create a variable for mean(liberal) and mean(conservative)
scotus_df_example <- summarize(group_by(scotus_df_example, term), 
               liberal = mean(liberal, na.rm = TRUE),
               conservative = mean(conservative, na.rm = TRUE))

scotus_df_example

Say we want to make it into a long dataset. We would use the gather() function.

The key variable allows us to create the variable that we want the values in our individual columns to collate as. This is what we mean…

# gather()
scotus_df_example_gather <- gather(scotus_df_example, key = ideology, value = average, liberal, conservative)

The result should look like this. As you can see, the liberal and conservative variables are now values in the ideology variable with their mean scores in the average column.

> scotus_df_example_gather
# A tibble: 6 x 3
   term ideology     average
  <dbl> <chr>          <dbl>
1  2017 liberal        0.44 
2  2018 liberal        0.493
3  2019 liberal        0.5  
4  2017 conservative   0.467
5  2018 conservative   0.466
6  2019 conservative   0.433

We can then turn it back into a wide dataset. We would use the spread() function. This spreads out the data once again into its original setup.

spread(scotus_df_example_gather, key = ideology, value = average)