Learning Objectives

  • Know how to load a package
  • Be able to describe what data frames and tibbles are
  • Know how to use read.csv() to read a CSV file into R
  • Be able to explore a data frame
  • Be able to subset a data frame
  • Know the distinctions between base R and the tidyverse

Now that we’ve learned a bit about how R is thinking about data under the hood, using different types of vectors to build more complicated data structures, let’s actually look at some data.

Presentation of the Survey Data

We are studying the species distribution and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:

Column           Description
record_id        Unique id for the observation
month            month of observation
day              day of observation
year             year of observation
plot_id          ID of a particular plot
species_id       2-letter code
sex              sex of animal (“M”, “F”)
hindfoot_length  length of the hindfoot in mm
weight           weight of the animal in grams
genus            genus of animal
species          species of animal
taxa             e.g. Rodent, Reptile, Bird, Rabbit
plot_type        type of plot

Loading the Data

Your current R project should already have a data folder with the surveys data CSV file in it. We can read it into R and assign it to an object by using the read.csv() function. The first argument to read.csv() is the path of the file you want to read, in quotes. This path will be relative to your current working directory, which in our case is the R Project folder. So from there, we want to access the “data” folder, and then the name of the CSV file.

surveys <- read.csv("data/portal_data_joined.csv")
  • Hint: tab-completion works with file paths too. Type the pair of quotes, and with your cursor between them, hit tab to bring up the files in your working directory.
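
If read.csv() complains that it cannot find the file, the working directory is usually the culprit. Here is a minimal sketch for checking the path before reading; the csv_path variable and the message text are ours, not part of the lesson:

```r
# Build the relative path; file.path() joins the pieces with "/".
csv_path <- file.path("data", "portal_data_joined.csv")

# Relative paths start from the working directory, so print it to check.
print(getwd())

# Only read the file if it is really there (csv_path is an illustrative name).
if (file.exists(csv_path)) {
  surveys <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path, " (check your working directory)")
}
```

If the message appears, confirm that you opened the R Project so that getwd() points at the project folder.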

Take a look at your Environment pane and you should see an object called “surveys”. We can print out the object to take a look at it by just running the name of the object. We can also check to see what class it is.

surveys
##   record_id month day year plot_id species_id sex hindfoot_length weight
## 1         1     7  16 1977       2         NL   M              32     NA
## 2        72     8  19 1977       2         NL   M              31     NA
## 3       224     9  13 1977       2         NL                  NA     NA
##     genus  species   taxa plot_type
## 1 Neotoma albigula Rodent   Control
## 2 Neotoma albigula Rodent   Control
## 3 Neotoma albigula Rodent   Control
##  [ reached 'max' / getOption("max.print") -- omitted 34783 rows ]
class(surveys)
## [1] "data.frame"

Wow, printing a data frame gives us quite a bit of output. This is a lot more data than the small vectors we worked with last lesson, but the basic principles remain the same.

Data frames are really just a collection of vectors: every column is a vector with a single data type, and every column is the exact same length. You can make a data frame “by hand”, but they’re usually created when you import some sort of tabular data into R using a function like read.csv().
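
To make the "collection of vectors" idea concrete, here is a small data frame built by hand. The values are made up for illustration and are not taken from the surveys data:

```r
# A toy data frame: each argument is one column (a vector of length 3).
toy <- data.frame(
  species_id = c("NL", "DM", "PF"),
  weight     = c(218, 40, 7),
  sex        = c("M", "F", "M")
)

str(toy)           # three columns, each with a single data type
class(toy$weight)  # "numeric"
nrow(toy)          # 3

# Columns whose lengths do not fit together are rejected:
# data.frame(a = 1:3, b = 1:2)  # error: differing number of rows
```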

Inspecting data.frame Objects

When working with a large data frame, it’s usually impractical to try to look at it all at once, so we’ll need to arm ourselves with a series of tools for inspecting them. Here is a non-exhaustive list of some common functions to do this:

  • Size:
    • nrow(surveys) - returns the number of rows
    • ncol(surveys) - returns the number of columns
  • Content:
    • head(surveys) - shows the first 6 rows
    • tail(surveys) - shows the last 6 rows
    • View(surveys) - opens a new tab in RStudio that shows the entire data frame. Useful at times, but you shouldn’t become overly reliant on checking data frames by eye, because it’s easy to make mistakes
  • Names:
    • colnames(surveys) - returns the column names
    • rownames(surveys) - returns the row names
  • Summary:
    • str(surveys) - structure of the object and information about the class, length and content of each column
    • summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”: they can be used on other types of objects besides data.frame.
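
The functions above can be tried on a small, made-up data frame before running them on surveys. This sketch also uses dim() and the n argument of head() and tail(), which are base R but not in the list above:

```r
# Made-up toy data, used only to keep the example self-contained.
toy <- data.frame(
  plot_id = c(2, 2, 3, 5, 5, 7, 8, 8),
  weight  = c(NA, 218, 40, 44, NA, 7, 9, 120)
)

dim(toy)             # rows and columns at once: 8 2
head(toy, n = 3)     # first 3 rows instead of the default 6
tail(toy, n = 2)     # last 2 rows
colnames(toy)        # "plot_id" "weight"
summary(toy$weight)  # summary() also works on a single vector

# head() and str() are not limited to data frames:
head(letters, n = 3)  # "a" "b" "c"
str(1:5)
```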

Challenge

Based on the output of str(surveys), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • How are our character data represented in this data frame?
  • How many species have been recorded during these surveys? (This may take more than just the str() function. Try searching for how to count the unique values in a character vector in R)
ANSWER
str(surveys)
## 'data.frame':    34786 obs. of  13 variables:
##  $ record_id      : int  1 72 224 266 349 363 435 506 588 661 ...
##  $ month          : int  7 8 9 10 11 11 12 1 2 3 ...
##  $ day            : int  16 19 13 16 12 12 10 8 18 11 ...
##  $ year           : int  1977 1977 1977 1977 1977 1977 1977 1978 1978 1978 ...
##  $ plot_id        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ species_id     : chr  "NL" "NL" "NL" "NL" ...
##  $ sex            : chr  "M" "M" "" "" ...
##  $ hindfoot_length: int  32 31 NA NA NA NA NA NA NA NA ...
##  $ weight         : int  NA NA NA NA NA NA NA NA 218 NA ...
##  $ genus          : chr  "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
##  $ species        : chr  "albigula" "albigula" "albigula" "albigula" ...
##  $ taxa           : chr  "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type      : chr  "Control" "Control" "Control" "Control" ...
## * class: data frame
## * how many rows: 34786,  how many columns: 13
## * the character data are characters if you have R version 4.0.0 or later, factors for older versions
length(unique(surveys$species))
## [1] 40
table(surveys$species)
## 
##        albigula       audubonii         baileyi       bilineata brunneicapillus 
##            1252              75            2891             303              50 
##       chlorurus          clarki        eremicus          flavus      fulvescens 
##              39               1            1299            1597              75 
##     fulviventer          fuscus       gramineus         harrisi        hispidus 
##              43               5               8             437             179 
##     intermedius     leucogaster      leucophrys        leucopus     maniculatus 
##               9            1006               2              36             899 
##       megalotis     melanocorys        merriami        montanus    ochrognathus 
##            2609              13           10596               8              43 
##           ordii    penicillatus      savannarum      scutalatus             sp. 
##            3027            3123               2               1              86 
##     spectabilis       spilosoma        squamata         taylori    tereticaudus 
##            2504             248              16              46               1 
##          tigris        torridus       undulatus       uniparens         viridis 
##               1            2249               5               1               1
## * how many species: 40 (unique values in the species column)


Indexing and subsetting data frames

When we wanted to extract particular values from a vector, we used square brackets and put index values in them. Since data frames are made out of vectors, we can use the square brackets again, but with one change. Data frames are 2-dimensional, so we need to specify row and column indices. Row numbers come first, then a comma, then column numbers. Leaving the row number blank will return all rows, and the same thing applies to column numbers.

One thing to note is that the different ways you write out these indices can give you back either a data frame or a vector.

# first element in the first column of the data frame (as a vector)
surveys[1, 1]   
## [1] 1
# first element in the 6th column (as a vector)
surveys[1, 6]   
## [1] "NL"
# first column of the data frame (as a vector)
surveys[, 1]    
##  [1]    1   72  224  266  349  363  435  506  588  661  748  845  990 1164 1261
## [16] 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859 5048 5180
## [31] 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022 8263 8387 8394
## [46] 8407 8514 8543 8657 8675
##  [ reached getOption("max.print") -- omitted 34736 entries ]
# first column of the data frame (as a data.frame)
surveys[1]      
##    record_id
## 1          1
## 2         72
## 3        224
## 4        266
## 5        349
## 6        363
## 7        435
## 8        506
## 9        588
## 10       661
## 11       748
## 12       845
## 13       990
## 14      1164
## 15      1261
## 16      1374
## 17      1453
## 18      1756
## 19      1818
## 20      1882
## 21      2133
## 22      2184
## 23      2406
## 24      2728
## 25      3000
## 26      3002
## 27      4667
## 28      4859
## 29      5048
## 30      5180
## 31      5299
## 32      5485
## 33      5558
## 34      5583
## 35      5966
## 36      6020
## 37      6023
## 38      6036
## 39      6167
## 40      6479
## 41      6500
## 42      8022
## 43      8263
## 44      8387
## 45      8394
## 46      8407
## 47      8514
## 48      8543
## 49      8657
## 50      8675
##  [ reached 'max' / getOption("max.print") -- omitted 34736 rows ]
# first three elements in the 7th column (as a vector)
surveys[1:3, 7] 
## [1] "M" "M" ""
# the 3rd row of the data frame (as a data.frame)
surveys[3, ]    
##   record_id month day year plot_id species_id sex hindfoot_length weight
## 3       224     9  13 1977       2         NL                  NA     NA
##     genus  species   taxa plot_type
## 3 Neotoma albigula Rodent   Control
# equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ] 

The : operator is a special function that creates numeric vectors of integers in increasing or decreasing order; try running 1:10 and 10:1 to check this out.

You can also exclude certain indices of a data frame using the “-” sign:

surveys[, -1]          # The whole data frame, except the first column
##   month day year plot_id species_id sex hindfoot_length weight   genus  species
## 1     7  16 1977       2         NL   M              32     NA Neotoma albigula
## 2     8  19 1977       2         NL   M              31     NA Neotoma albigula
## 3     9  13 1977       2         NL                  NA     NA Neotoma albigula
## 4    10  16 1977       2         NL                  NA     NA Neotoma albigula
##     taxa plot_type
## 1 Rodent   Control
## 2 Rodent   Control
## 3 Rodent   Control
## 4 Rodent   Control
##  [ reached 'max' / getOption("max.print") -- omitted 34782 rows ]
surveys[-c(7:34786), ] # Equivalent to head(surveys)
##   record_id month day year plot_id species_id sex hindfoot_length weight
## 1         1     7  16 1977       2         NL   M              32     NA
## 2        72     8  19 1977       2         NL   M              31     NA
## 3       224     9  13 1977       2         NL                  NA     NA
##     genus  species   taxa plot_type
## 1 Neotoma albigula Rodent   Control
## 2 Neotoma albigula Rodent   Control
## 3 Neotoma albigula Rodent   Control
##  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

surveys["species_id"]       # Result is a data.frame
surveys[, "species_id"]     # Result is a vector
surveys[["species_id"]]     # Result is a vector
surveys$species_id          # Result is a vector

In general, when you’re working with data frames, make sure you know whether your code returns a data frame or a vector, because different subsetting methods yield different results: sometimes you get a data frame with one column, sometimes you get a plain vector.
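
One way to check what you got back is is.data.frame(). The sketch below uses a made-up toy data frame; it also shows base R's drop = FALSE argument, which the lesson has not covered, as a way to keep a single column as a data frame:

```r
toy <- data.frame(id = 1:3, sex = c("M", "F", "M"))

one_col_df  <- toy["sex"]                  # data frame with one column
one_col_vec <- toy[, "sex"]                # character vector
kept_as_df  <- toy[, "sex", drop = FALSE]  # drop = FALSE keeps the data frame

is.data.frame(one_col_df)        # TRUE
is.data.frame(one_col_vec)       # FALSE
is.data.frame(kept_as_df)        # TRUE
identical(one_col_vec, toy$sex)  # TRUE
```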

You will probably end up using the $ subsetting quite a bit. What’s nice about it is that it supports tab-completion! Type out your data frame name, then a dollar sign, then hit tab to get a list of the column names that you can scroll through.

Challenge

We are going to create a few new data frames using our subsetting skills.

  1. Create a new data frame called surveys_200 containing row 200 of the surveys dataset.
  2. Create a new data frame called surveys_last, which extracts only the last row of surveys.
    • Hint: Remember that nrow() gives you the number of rows in a data frame
    • Compare your surveys_last data frame with what you see as the last row using tail() with the surveys data frame to make sure it’s meeting expectations.
  3. Use nrow() to identify the row that is in the middle of surveys. Subset this row and store it in a new data frame called surveys_middle.
  4. Reproduce the output of the head() function by using the - notation (i.e. removal) and the nrow() function, keeping just the first through 6th rows of the surveys dataset.
ANSWER
## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]
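
A note on answer 3: n_rows / 2 is a whole number for surveys because it has an even number of rows. With an odd number of rows the division gives a fraction, and R silently truncates fractional indices. The sketch below uses a made-up toy data frame and ceiling(), which is one possible choice for picking the middle row, not the lesson's official answer:

```r
# Made-up data frame with an odd number of rows (5).
toy <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
n_rows <- nrow(toy)

n_rows / 2                  # 2.5, not a whole number
toy[n_rows / 2, ]           # index is truncated to 2, so this is row 2
toy[ceiling(n_rows / 2), ]  # row 3, the true middle row
```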


Base R vs. tidyverse

Almost every time you work in R, you will be using different “packages” to work with data. A package is a collection of functions used for some common purpose; there are packages for manipulating data, plotting, interfacing with other programs, and much much more.

All of the stuff we’ve covered so far has been using R’s “base” functionality, the built in functions and techniques that come with R by default. There is a new-ish set of packages called the tidyverse which does a lot of the same stuff as base R, plus much much more. The tidyverse is what we will focus on primarily from here on out, as it is a very powerful set of tools with a philosophy that focuses on being readable and intuitive when working with data. There are a few reasons we’ve taught you a bunch of base R stuff so far:

  1. Base R can be quick and useful in a lot of ways
  2. The tidyverse still works with the same building blocks as base R: vectors!
  3. Some packages you need to use will only work with base R
  4. You will someday use Google and find a perfect solution to your problem, using base R
  5. You will probably have a collaborator at some point who uses base R
  6. The tidyverse is constantly evolving, which can be good (new features!) and bad (really old tidyverse code may behave differently when you update)

For example, using [] to subset data and using read.csv() are base R ways of doing things, but we’ll show you tidyverse ways of doing them as well.

In R, there are almost always several ways of accomplishing the same task. Showing you every single way of getting a job done seems like a waste of time, but we also don’t want you to feel lost when you come across some base R code, so that’s why there might be a bit of redundancy.

Loading Packages

For much of this course, we’ll be working with a series of packages collectively referred to as the tidyverse. They are packages designed to help you work with data, from cleaning and manipulation to plotting. They are all designed to work together nicely, and share a lot of similar principles. They are increasingly popular, have large user bases, and are generally very well-documented. You can install the core set of tidyverse packages with the install.packages() function:

install.packages("tidyverse")

It is usually recommended that you do NOT write this code into a script, or the package will be reinstalled every time you run the script. Instead, just run it once in your console, and it will be permanently installed so you can use it any time.
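
If a script really does need to guard against a missing package, one common pattern is to check for the package rather than reinstall it. This is a minimal sketch; is_installed() is a hypothetical helper we define here, not a base R or tidyverse function:

```r
# Hypothetical helper: requireNamespace() returns TRUE when the package
# can be loaded and FALSE otherwise, without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Remind the user instead of installing from inside the script.
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once in the console")
}
```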

Once a package has been installed on your computer, you can load it in order to use it:

library(tidyverse)

Loading the tidyverse package actually loads a whole bunch of commonly used tidyverse packages at once, which is pretty convenient.

A common feature of tidyverse functions is that they use underscores in the name. For example, the tidyverse function for reading a CSV file is read_csv() instead of read.csv(). Let’s try it:

t_surveys <- read_csv("data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ──────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now let’s take a look at how it prints and check the class:

t_surveys
## # A tibble: 34,786 × 13
##    record…¹ month   day  year plot_id speci…² sex   hindf…³ weight genus species
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>   <chr>   <dbl>  <dbl> <chr> <chr>  
##  1        1     7    16  1977       2 NL      M          32     NA Neot… albigu…
##  2       72     8    19  1977       2 NL      M          31     NA Neot… albigu…
##  3      224     9    13  1977       2 NL      <NA>       NA     NA Neot… albigu…
##  4      266    10    16  1977       2 NL      <NA>       NA     NA Neot… albigu…
##  5      349    11    12  1977       2 NL      <NA>       NA     NA Neot… albigu…
##  6      363    11    12  1977       2 NL      <NA>       NA     NA Neot… albigu…
##  7      435    12    10  1977       2 NL      <NA>       NA     NA Neot… albigu…
##  8      506     1     8  1978       2 NL      <NA>       NA     NA Neot… albigu…
##  9      588     2    18  1978       2 NL      M          NA    218 Neot… albigu…
## 10      661     3    11  1978       2 NL      <NA>       NA     NA Neot… albigu…
## # … with 34,776 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
## #   abbreviated variable names ¹​record_id, ²​species_id, ³​hindfoot_length
class(t_surveys)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Ooh, doesn’t that print out nicely? It only prints 10 rows by default, NAs are now colored red, and under the name of each column is the type of data! One important thing to notice is that the column types are only double and character, no factors here. By default, read_csv() keeps character data as character columns, which would be like setting stringsAsFactors=FALSE in read.csv().
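
To see what stringsAsFactors changes in read.csv(), here is a small sketch that writes a made-up two-row CSV to a temporary file and reads it both ways. The file contents and object names are for illustration only:

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("species_id,weight", "NL,218", "DM,40"), tmp)

as_chr <- read.csv(tmp, stringsAsFactors = FALSE)  # the default since R 4.0.0
as_fct <- read.csv(tmp, stringsAsFactors = TRUE)   # the default before R 4.0.0

class(as_chr$species_id)  # "character"
class(as_fct$species_id)  # "factor"
```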

Also, class() returned multiple things! You’ll notice one of them is data.frame, but there are things like tbl_df as well. The tidyverse has a special type of data.frame called a “tibble”. Tibbles behave much like data frames, but they print nicely as we just saw, and single-bracket subsetting such as t_surveys[, 1] returns a tibble by default rather than a vector. As always, just be sure to check whether you’re getting a tibble or a vector back.

surveys[,1] # gives a vector back
##  [1]    1   72  224  266  349  363  435  506  588  661  748  845  990 1164 1261
## [16] 1374 1453 1756 1818 1882 2133 2184 2406 2728 3000 3002 4667 4859 5048 5180
## [31] 5299 5485 5558 5583 5966 6020 6023 6036 6167 6479 6500 8022 8263 8387 8394
## [46] 8407 8514 8543 8657 8675
##  [ reached getOption("max.print") -- omitted 34736 entries ]
t_surveys[,1] # gives a tibble back
## # A tibble: 34,786 × 1
##    record_id
##        <dbl>
##  1         1
##  2        72
##  3       224
##  4       266
##  5       349
##  6       363
##  7       435
##  8       506
##  9       588
## 10       661
## # … with 34,776 more rows
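
When you do want a plain vector out of a tibble, double brackets or $ will give you one. A minimal sketch with a made-up toy tibble, assuming the tibble package (installed as part of the tidyverse) is available:

```r
# Runs only if the tibble package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  toy_tbl <- tibble::tibble(record_id = c(1, 72, 224), sex = c("M", "M", NA))

  still_tbl <- toy_tbl[, 1]            # single brackets keep a tibble
  as_vector <- toy_tbl[["record_id"]]  # double brackets extract a vector
  also_vec  <- toy_tbl$record_id       # $ does the same

  print(class(still_tbl))
  print(is.numeric(as_vector))
}
```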

This lesson is adapted from the “Starting With Data” materials of Data Carpentry: R for Data Analysis and Visualization of Ecological Data.