dplyr
and
tidyr
packages.dplyr
function select
.dplyr
function
filter
.dplyr
function
to the input of another function with the ‘pipe’ operator
%>%
.mutate
.summarize
, group_by
, and
tally
to split a data frame into groups of observations,
apply a summary statistics for each group, and then combine the
results.dplyr
and
tidyr
Bracket subsetting is handy, but it can be cumbersome and difficult
to read, especially for complicated operations. Enter
dplyr
. dplyr
is a package for making tabular data manipulation easier. It pairs
nicely with tidyr
which enables you to
swiftly convert between different data formats for plotting and
analysis.
Packages in R are basically sets of additional functions that let you
do more stuff. The functions we’ve been using so far, like
str()
or data.frame()
, come built into R;
packages give you access to more of them. Before you use a package for
the first time you need to install it on your machine, and then you
should import it in every subsequent R session when you need it. You
should already have installed the
tidyverse
package. This is an
“umbrella-package” that installs several packages useful for data
analysis which work together well such as
tidyr
,
dplyr
,
ggplot2
,
tibble
, etc.
The tidyverse
package tries to address
3 major problems with some of base R functions: 1. The results from a
base R function sometimes depend on the type of data. 2. Using R
expressions in a non standard way, which can be confusing for new
learners. 3. Hidden arguments, having default operations that new
learners are not aware of.
To load the package type:
library("tidyverse") ## load the tidyverse packages, incl. dplyr
dplyr
and
tidyr
?The package dplyr
provides easy tools
for the most common data manipulation tasks. It is built to work
directly with data frames, with many common tasks optimized by being
written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The
benefits of doing this are that the data can be managed natively in a
relational database, queries can be conducted on that database, and only
the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
The package tidyr
addresses the common
problem of wanting to reshape your data for plotting and use by
different R functions. Sometimes we want data sets where we have one row
per measurement. Sometimes we want a data frame where each measurement
type has its own column, and rows are instead more aggregated groups -
like plots or aquaria. Moving back and forth between these formats is
nontrivial, and tidyr
gives you tools for
this and more sophisticated data manipulation.
To learn more about dplyr
and
tidyr
after the workshop, you may want to
check out this handy
data transformation with dplyr
cheatsheet and this one
about tidyr
.
We’ll read in our data using the read_csv()
function,
from the tidyverse package readr
, instead
of read.csv()
, the base function for reading in data. The
data we are going to be using today should already be in your
R_DAVIS_2022 project in the folder data
.
surveys <- read_csv("data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ──────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## inspect the data
str(surveys)
Notice that the class of the data is now tbl_df
This is
referred to as a “tibble”. Tibbles are almost identical to R’s standard
data frames, but they tweak some of the old behaviors of data frames.
For our purposes the only differences between data frames and tibbles
are that:
character
are never automatically
converted into factors.We’re going to learn some of the most common
dplyr
functions: select()
,
filter()
, mutate()
, group_by()
,
summarize()
, and join
. To select columns of a
data frame, use select()
. The first argument to this
function is the data frame (surveys
), and the subsequent
arguments are the columns to keep.
select(surveys, plot_id, species_id, weight)
To choose rows based on a specific criteria, use
filter()
:
filter(surveys, year == 1995)
## # A tibble: 1,180 × 13
## record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 22314 6 7 1995 2 NL M 34 NA Neot… albigu…
## 2 22728 9 23 1995 2 NL F 32 165 Neot… albigu…
## 3 22899 10 28 1995 2 NL F 32 171 Neot… albigu…
## 4 23032 12 2 1995 2 NL F 33 NA Neot… albigu…
## 5 22003 1 11 1995 2 DM M 37 41 Dipo… merria…
## 6 22042 2 4 1995 2 DM F 36 45 Dipo… merria…
## 7 22044 2 4 1995 2 DM M 37 46 Dipo… merria…
## 8 22105 3 4 1995 2 DM F 37 49 Dipo… merria…
## 9 22109 3 4 1995 2 DM M 37 46 Dipo… merria…
## 10 22168 4 1 1995 2 DM M 36 48 Dipo… merria…
## # … with 1,170 more rows, 2 more variables: taxa <chr>, plot_type <chr>, and
## # abbreviated variable names ¹record_id, ²species_id, ³hindfoot_length
select
is used for rows and
filter
is used for columns.
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option is pipes
. Pipes let you
take the output of one function and send it directly to the next, which
is useful when you need to do many things to the same dataset. Pipes in
R look like %>%
and are made available via the
magrittr
package, installed automatically
with dplyr
. If you use RStudio, you can
type the pipe with Ctrl + Shift + M if
you have a PC or Cmd + Shift + M if you
have a Mac.
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
## # A tibble: 17 × 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
## 7 PP M 4
## 8 RM M 4
## 9 RM M 4
## 10 RM M 4
## 11 PF M 4
## 12 PF F 4
## 13 RM M 4
## 14 RM M 4
## 15 RM F 4
## 16 RM M 4
## 17 RM M 4
In the above code, we use the pipe to send the surveys
dataset first through filter()
to keep rows where
weight
is less than 5, then through select()
to keep only the species_id
, sex
, and
weight
columns. Since %>%
takes the object
on its left and passes it as the first argument to the function on its
right, we don’t need to explicitly include the data frame as an argument
to the filter()
and select()
functions any
more.
Some may find it helpful to read the pipe like the word “then”. For
instance, in the above example, we took the data frame
surveys
, then we filter
ed for rows
with weight < 5
, then we select
ed
columns species_id
, sex
, and
weight
. The dplyr
functions
by themselves are somewhat simple, but by combining them into linear
workflows with the pipe, we can accomplish more complex manipulations of
data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
## # A tibble: 17 × 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
## 7 PP M 4
## 8 RM M 4
## 9 RM M 4
## 10 RM M 4
## 11 PF M 4
## 12 PF F 4
## 13 RM M 4
## 14 RM M 4
## 15 RM F 4
## 16 RM M 4
## 17 RM M 4
Note that the final data frame is the leftmost part of this expression.
Using pipes, subset the surveys
data to include
individuals collected before 1995 and retain only the columns
year
, sex
, and weight
. Name this
dataframe surveys_challenge
surveys_challenge <- surveys %>%
filter(year < 1995) %>%
select(year, sex, weight)
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate()
.
To create a new column of weight in kg:
surveys %>%
mutate(weight_kg = weight / 1000)
## # A tibble: 34,786 × 14
## record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
## 2 72 8 19 1977 2 NL M 31 NA Neot… albigu…
## 3 224 9 13 1977 2 NL <NA> NA NA Neot… albigu…
## 4 266 10 16 1977 2 NL <NA> NA NA Neot… albigu…
## 5 349 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## 6 363 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## 7 435 12 10 1977 2 NL <NA> NA NA Neot… albigu…
## 8 506 1 8 1978 2 NL <NA> NA NA Neot… albigu…
## 9 588 2 18 1978 2 NL M NA 218 Neot… albigu…
## 10 661 3 11 1978 2 NL <NA> NA NA Neot… albigu…
## # … with 34,776 more rows, 3 more variables: taxa <chr>, plot_type <chr>,
## # weight_kg <dbl>, and abbreviated variable names ¹record_id, ²species_id,
## # ³hindfoot_length
You can also create a second new column based on the first new column
within the same call of mutate()
:
surveys %>%
mutate(weight_kg = weight / 1000,
weight_kg2 = weight_kg * 2)
## # A tibble: 34,786 × 15
## record…¹ month day year plot_id speci…² sex hindf…³ weight genus species
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
## 2 72 8 19 1977 2 NL M 31 NA Neot… albigu…
## 3 224 9 13 1977 2 NL <NA> NA NA Neot… albigu…
## 4 266 10 16 1977 2 NL <NA> NA NA Neot… albigu…
## 5 349 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## 6 363 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## 7 435 12 10 1977 2 NL <NA> NA NA Neot… albigu…
## 8 506 1 8 1978 2 NL <NA> NA NA Neot… albigu…
## 9 588 2 18 1978 2 NL M NA 218 Neot… albigu…
## 10 661 3 11 1978 2 NL <NA> NA NA Neot… albigu…
## # … with 34,776 more rows, 4 more variables: taxa <chr>, plot_type <chr>,
## # weight_kg <dbl>, weight_kg2 <dbl>, and abbreviated variable names
## # ¹record_id, ²species_id, ³hindfoot_length
If this runs off your screen and you just want to see the first few
rows, you can use a pipe to view the head()
of the data.
(Pipes work with non-dplyr
functions, too,
as long as the dplyr
or
magrittr
package is loaded).
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
## # A tibble: 6 × 14
## record_id month day year plot_id speci…¹ sex hindf…² weight genus species
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 7 16 1977 2 NL M 32 NA Neot… albigu…
## 2 72 8 19 1977 2 NL M 31 NA Neot… albigu…
## 3 224 9 13 1977 2 NL <NA> NA NA Neot… albigu…
## 4 266 10 16 1977 2 NL <NA> NA NA Neot… albigu…
## 5 349 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## 6 363 11 12 1977 2 NL <NA> NA NA Neot… albigu…
## # … with 3 more variables: taxa <chr>, plot_type <chr>, weight_kg <dbl>, and
## # abbreviated variable names ¹species_id, ²hindfoot_length
The first few rows of the output are full of NA
s, so if
we wanted to remove those we could insert a filter()
in the
chain:
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
## # A tibble: 6 × 14
## record_id month day year plot_id speci…¹ sex hindf…² weight genus species
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 588 2 18 1978 2 NL M NA 218 Neot… albigu…
## 2 845 5 6 1978 2 NL M 32 204 Neot… albigu…
## 3 990 6 9 1978 2 NL M NA 200 Neot… albigu…
## 4 1164 8 5 1978 2 NL M 34 199 Neot… albigu…
## 5 1261 9 4 1978 2 NL M 32 197 Neot… albigu…
## 6 1453 11 5 1978 2 NL M NA 218 Neot… albigu…
## # … with 3 more variables: taxa <chr>, plot_type <chr>, weight_kg <dbl>, and
## # abbreviated variable names ¹species_id, ²hindfoot_length
is.na()
is a function that determines whether something
is an NA
. The !
symbol negates the result, so
we’re asking for every row where weight is not an
NA
.
Create a new data frame from the surveys
data that meets
the following criteria: contains only the species_id
column
and a new column called hindfoot_half
containing values
that are half the hindfoot_length
values. In this
hindfoot_half
column, there are no NA
s and all
values are less than 30. Name this data frame
surveys_hindfoot_half
.
Hint: think about how the commands should be ordered to produce this data frame!
surveys_hindfoot_half <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_half = hindfoot_length / 2) %>%
filter(hindfoot_half < 30) %>%
select(species_id, hindfoot_half)
Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply
some analysis to each group, and then combine the results.
dplyr
makes this very easy through the use
of the group_by()
function.
group_by()
is often used together with
summarize()
, which collapses each group into a single-row
summary of that group. group_by()
takes as arguments the
column names that contain the categorical variables for
which you want to calculate the summary statistics. So to compute the
mean weight
by sex:
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
## sex mean_weight
## <chr> <dbl>
## 1 F 42.2
## 2 M 43.0
## 3 <NA> 64.7
You may also have noticed that the output from these calls doesn’t
run off the screen anymore. It’s one of the advantages of
tbl_df
over data frame.
You can also group by multiple columns:
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
## `summarise()` has grouped
## output by 'sex'. You can
## override using the `.groups`
## argument.
## # A tibble: 92 × 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 F BA 9.16
## 2 F DM 41.6
## 3 F DO 48.5
## 4 F DS 118.
## 5 F NL 154.
## 6 F OL 31.1
## 7 F OT 24.8
## 8 F OX 21
## 9 F PB 30.2
## 10 F PE 22.8
## # … with 82 more rows
When grouping both by sex
and species_id
,
the first rows are for individuals that escaped before their sex could
be determined and weighted. You may notice that the last column does not
contain NA
but NaN
(which refers to “Not a
Number”). To avoid this, we can remove the missing values for weight
before we attempt to calculate the summary statistics on weight. Because
the missing values are removed first, we can omit
na.rm = TRUE
when computing the mean:
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
## `summarise()` has grouped
## output by 'sex'. You can
## override using the `.groups`
## argument.
## # A tibble: 64 × 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 F BA 9.16
## 2 F DM 41.6
## 3 F DO 48.5
## 4 F DS 118.
## 5 F NL 154.
## 6 F OL 31.1
## 7 F OT 24.8
## 8 F OX 21
## 9 F PB 30.2
## 10 F PE 22.8
## # … with 54 more rows
Here, again, the output from these calls doesn’t run off the screen
anymore. If you want to display more data, you can use the
print()
function at the end of your chain with the argument
n
specifying the number of rows to display:
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
## `summarise()` has grouped
## output by 'sex'. You can
## override using the `.groups`
## argument.
## # A tibble: 64 × 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 F BA 9.16
## 2 F DM 41.6
## 3 F DO 48.5
## 4 F DS 118.
## 5 F NL 154.
## 6 F OL 31.1
## 7 F OT 24.8
## 8 F OX 21
## 9 F PB 30.2
## 10 F PE 22.8
## 11 F PF 7.97
## 12 F PH 30.8
## 13 F PL 19.3
## 14 F PM 22.1
## 15 F PP 17.2
## # … with 49 more rows
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
## `summarise()` has grouped
## output by 'sex'. You can
## override using the `.groups`
## argument.
## # A tibble: 64 × 4
## # Groups: sex [3]
## sex species_id mean_weight min_weight
## <chr> <chr> <dbl> <dbl>
## 1 F BA 9.16 6
## 2 F DM 41.6 10
## 3 F DO 48.5 12
## 4 F DS 118. 45
## 5 F NL 154. 32
## 6 F OL 31.1 10
## 7 F OT 24.8 5
## 8 F OX 21 20
## 9 F PB 30.2 12
## 10 F PE 22.8 11
## # … with 54 more rows
arrange()
) in
descending order, from heaviest to lightest. (This table should have 26
rows, one for each year)count()
. Group the data by
sex
and pipe the grouped data into the count()
function. How could you get the same result using
group_by()
and summarize()
? Hint: see
?n
.## Answer 1
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
summarize(max_weight_g = max(weight)) %>%
mutate(max_weight_kg = max_weight_g/1000) %>%
arrange()
## # A tibble: 26 × 3
## year max_weight_g max_weight_kg
## <dbl> <dbl> <dbl>
## 1 1977 149 0.149
## 2 1978 232 0.232
## 3 1979 274 0.274
## 4 1980 243 0.243
## 5 1981 264 0.264
## 6 1982 252 0.252
## 7 1983 256 0.256
## 8 1984 259 0.259
## 9 1985 225 0.225
## 10 1986 240 0.24
## # … with 16 more rows
## Answer 2
surveys %>%
group_by(sex) %>%
count()
## # A tibble: 3 × 2
## # Groups: sex [3]
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
surveys %>%
group_by(sex) %>%
summarize(n = n())
## # A tibble: 3 × 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748