?
and GoogleProgramming can make your science even better than it already is.
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
The benefits of programming parallel many of the cornerstones of science. Programming makes your workflow:
R is a free, open-source programming language that is designed for data analysis and statistics.
R also has a huge user-community and is highly extensible, with thousands of packages that add extra functionality. Lots of researchers use R, so it is also a common language between us and our colleagues. In short, for many researchers, it is the best tool to organize, visualize, and analyze data.
RStudio is essentially an interface to work with R, with lots of nice features and bells and whistles. It’s like the difference between driving a really old car and a nice new one with GPS and Bluetooth and heated seats and power steering: they both use 4 wheels to get you where you need to go, but one will get you there in comfort and style, and maybe keep you safer along the way.
RStudio is an IDE (integrated development environment) which we use to manage and execute R code. It is also free and open-source, it works on all platforms (e.g. you can use an Amazon Web Services cluster using RStudio), and it integrates version control and project management.
You write the same R code in RStudio as you would elsewhere, and it executes the same way. RStudio helps by keeping things nicely organized.
When you open RStudio you should see three panels:
The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
Console vs. script
You can think of working in the Console vs. working in the Script as something like cooking. The console is like making up a new recipe, but not writing anything down. You can carry out a series of steps and produce a nice, beautiful dish at the end, but you didn’t write anything down, so it’s harder to figure out exactly what you did, and in what order. Writing a script is like taking nice notes while cooking- you can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don’t have to try to remember what went well and what didn’t.
Enter
, R
will execute those commands and print the result.File -> New File -> R Script
or
‘ctrl/cmd-shift-N’cmd/ctrl-enter
executes the line the cursor is on by
copying that line and sending it to the Console
cmd/ctrl-enter
To run the line of your script where the cursor is, you can click on
the Run
button at the top-right of the script pane or use
the keyboard short cut: cmd/ctrl-enter
.
To run a block of code, select (highlight) it and click
Run
or cmd/ctrl-enter
.
You are working toward selecting a whole script and running it.
One of the biggest benefits in using RStudio is that you get to use R Projects, which are extremely helpful for keeping a given project neatly organized and consistent. An R Project is essentially a single folder/directory on your computer, containing all the files you need for that given project. It should be totally self-contained, with all your data, scripts, and results inside it. What makes it an “R Project” is a small .RProject file that sits inside the project folder, which tells RStudio that this folder is for a single R Project.
If you take a look in the very upper-right corner of the RStudio window, you should see a little cube icon with an R in it, possibly with “Project (None)” next to it. If you click on this, you’ll see some options to create a New Project, Open Project, etc.
By default, RStudio is set up so that every time you reopen a project, the environment is exactly as you left it. This might sound great, but it can actually get you into a bit of trouble! Think back to our recipe analogy for a second. You want to start the recipe fresh each time you make the dish- if you were good about writing down the directions (writing an R script), then you should be able to recreate everything the same as before. However, RStudio bringing stuff back in can be more like leaving residue on pots and pans, messing with your nice start-to-finish recipe.
We’re going to change the settings in RStudio so that you’ll be working with a nice clean environment each time. As long as you save your scripts, you shouldn’t lose any work, and you’ll avoid a lot of unforseen problems when you open a project up a few months later. In the menu bar at the top of your computer, navigate to Tools, then Global Options. Find the box that says “Restore .RData into workspace at startup” and make sure it is unchecked.
While you’re in Global Options, you can navigate to the Code tab on the lefthand side of the window. Find the box that says “Soft-wrap R source files” and make sure it is checked.
This will make long lines of code in an R script wrap around onto a new line so you don’t have to do a huge horizontal scroll, which is super annoying.
The simplest thing you can do with R is do arithmetic:
1 + 100
## [1] 101
And R will print out the answer, with a preceding “[1]”, which indicates the first item of output.
If you type in an incomplete command, R will wait for you to complete it:
> 1 -
+
Any time you execute code and the R session shows a “+” instead of a “>”, it means it’s waiting for you to complete the command. If you want to cancel a command you can simply hit “Esc” and RStudio will give you back the “>” prompt. You can also cancel commands with “Esc” if R is taking too long to finish a calculation.
R can use order of operations, just like you learned way back in algebra.
3 + 5 * 2
## [1] 13
Use parentheses to group to force the order of evaluation, and/or to make code easier to read.
(3 + 5) * 2
## [1] 16
Speaking of being easy to read, whitespace is ignored by R. Use it consistently to make code readable. For example, putting a single space on either side of an operator makes code easy to read.
(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2 # easier to read, once you know rules
3+5*2^2 # very hard to read
3 + 5 * (2 ^ 2) # to make order of operations clear, use parentheses
Really small or large numbers get a scientific notation:
2/10000
## [1] 2e-04
Which is shorthand for “multiplied by 10^XX
”. So
2e-4
is shorthand for 2 * 10^(-4)
.
You can write numbers in scientific notation too:
1e9 # One billion
## [1] 1e+09
R has many built in mathematical functions. To call a function, type its name, follow by open and closing parentheses. Anything we type inside those parentheses is an “argument” to that function.
Here we call the sin
function and provide it the
argument 3.14, or approximately \(\pi\).
sin(3.14) # trigonometry functions
## [1] 0.001592653
We can take a logarithm:
log(3) # natural logarithm
## [1] 1.098612
Or exponentiate:
exp(0.5) # e^(1/2)
## [1] 1.648721
You can even put functions inside each other. exp(0.5)
raised e
to the 1/2
power. Equivalently we
could take the square-root of e
. Expressions are
interpretted from the inside-out: In the following line, R first takes
e^4
, and then takes the square-root (that’s what the
sqrt
function does) of the result.
sqrt(exp(4))
## [1] 7.389056
You don’t need to remember function names. There are many ways to discover or rediscover them when you need them. Google is your friend, but we will discuss other ways soon.
We can do logical comparison in R. This will be important later, for example, when we want to filter a dataset based on a logical condition.
1 == 1 # equality (note two equals signs, read as "is equal to")
## [1] TRUE
1 != 2 # not-equal (read, "is not equal to")
## [1] TRUE
1 < 2 # less than
## [1] TRUE
1 >= -9 # greater than or equal to
## [1] TRUE
Now we’re getting to the actual coding stuff. You can store values/data in “objects”, also known as variables. An object has a name, and it has something stored in it, which could be a number, a string of characters, or even a whole dataset.
We can store values in objects using the assignment operator
<-
. You can also use a single equals sign,
=
, for assignment.
Note that unlike every other expression we have run so far, R doesn’t
print anything when we run this next line. Instead, it is stored for
later in a object, x
. x
now
contains the value 0.25
. Read this as
“Assign 1/4 to x.”
x <- 1/4
Look for the Environment
tab in one of the panes of
RStudio, and you will see that x
and its value have
appeared. Our object x
can be used in place of a number in
any calculation that expects a number:
x
## [1] 0.25
log(x)
## [1] -1.386294
This doesn’t change the value of x
or store the result
anywhere, it simply prints the answer to the console.
Objects can be reassigned:
x <- 99
x
used to contain the value 0.25 and and now it has the
value 99.
Assignment values can contain the object being assigned to:
x <- x + 1
Finally, objects can have values assigned using other objects:
y <- x*10
What does the following code print?
a <- 1
b <- 2
c <- a + b
b <- 4
a <- b
c <- a
c
Option 1) a
Option 2) 3
Option 3) 4
Option 4) ::nothing::
Object names can contain letters, numbers, underscores and periods. They cannot start with a number nor contain spaces at all. Different people use different conventions for long object names, especially:
What you use is up to you, but be consistent.
Use descriptive object names, as they make your code easier to
understand. It will save time because you’ll remember what each object
is: It’s easier to remember what domesticPopulation
is than
dp
or x
. A silly example:
theNumberNine <- 9
Tab-completion is a really nice feature of RStudio that saves typing
and avoid typos. After you assign 9 to theNumberNine
, if
you start typing t...
, th...
, etc., and then
pressing tab
, RStudio will pull up a box of all the valid
ways to finish that word. You can scroll through them using the up- and
down-arrows and press enter to choose the one you want. If you press tab
when there is only one valid way to complete something, RStudio will
automatically complete it.
Once you figure out what function you want, you need to figure out how to use it. Every function has an associated help-file. They can be hard to read, especially at first, but it is important to learn how to make sense of them.
?function
brings up help-file. E.g.
?log
Each help-file contains the following components.
??
searches the text of all R help files,
e.g. ??base
will find log
.log(3)
, log
knew
3
was x
because it was in the first position,
but we could have also told log
explicitly that
3
is the value x
should take. These are the
same:log(3)
## [1] 1.098612
log(x = 3)
## [1] 1.098612
log
’s
base
defaults to exp(1)
, e, unless
you tell it otherwise. So these are identical:log(x = 3)
## [1] 1.098612
log(x = 3, base = exp(1))
## [1] 1.098612
To get the base 10 logarithm of 3, you could do
log(x = 3, base = 10)
## [1] 0.4771213
If you provide a function with arguments by name, they can go in any order. Otherwise, they have to appear in the order specified by the function. These are all the same:
log(3, 10)
## [1] 0.4771213
log(x = 3, base = 10)
## [1] 0.4771213
log(base = 10, x = 3)
## [1] 0.4771213
Remember tab-completion? Well it works with functions too! Type the name of a function, followed by parentheses, and make sure your cursor is between the parentheses. If you hit tab, a window should pop up showing the names of the arguments, with little purple rectangles next to them.
log()
If you use your arrow keys, you can move around, and RStudio will display a little description of each argument. If you hit Enter, it will paste that argument and an equals sign in between the parentheses.
log(x = )
You can type in whatever you want for that argument, type a comma, then hit Tab again, and RStudio will bring up a list of all remaining arguments.
log(x = 3, )
Three of the following lines produce the same result. Without running
the code, which one will produce a different result than the others? The
helpfile for log
(?log
) may be helpful.
Option 1) log(x = 1000, base = 10)
Option 2) log10(1000)
Option 3) log(base = 10, x = 1000)
Option 4) log(10, 1000)
Besides the value of an expression R has executed, there are a few other kinds of responses you might get from R, including errors, warnings, and messages.
R returns an error when it cannot proceed. It stops you in your tracks. The error message will provide some information on what the problem was, but it is often cryptic. Learning to understand these messages is important but takes practice. Here’s an example of an error:
log_of_a_word <- log("a_word")
## Error in log("a_word"): non-numeric argument to mathematical function
R tell us that something has gone wrong: It got a non-number for a
function that needs a number. Note that errors prevent execution of the
line, so nothing got assigned to log_of_a_word
there. If we
ask R what it thinks log_of_a_word
is, it will return
another error. Practice understanding R’s communication style: Do you
understand how R is telling you what the problem is?
log_of_a_word
## Error in eval(expr, envir, enclos): object 'log_of_a_word' not found
Warnings appear in the same red font in the console, but they start with “Warning” instead of “Error”. Warnings are R’s way of telling you that it did something, but it suspects it may not have been what you wanted. Warnings can be more insidious than errors because you can keep going, but keep going with a mistake in your pipeline. Here’s an example:
log_of_a_negative <- log(-2)
## Warning in log(-2): NaNs produced
NaN
means “not a number”, and R has kindly told us,
“Hey, I think you probably wanted a number here – taking a log of a
negative is kind of a weird thing to do. I can do it if you really want,
I just want to be make sure it’s what you want.”
Note that it did work, so if we ask R what
log_of_a_negative
is, we won’t get an error. Note that we
don’t get a warning either, so you need to pay attention when warnings
first appear.
log_of_a_negative
## [1] NaN
There’s a third source of red text in R: messages. These are R’s way of telling you that something happened, but it’s probably nothing to worry about. These don’t start with “Message”; they just print the red text.
Which elephant weighs more? Convert one’s weight to the units of the other, and store the result in an appropriately-named new object. Write a bit of code to test whether elephant1 weights more than elephant2 (1 kg ≈ 2.2 lb).
elephant1_kg <- 3492
elephant2_lb <- 7757
This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Introduction to R and RStudio materials and the Data Carpentry: R for data analysis and visualization of Ecological Data Before We Start materials.
Comments
The text that appears to the right of each line of code above is called a comment. Anything that follows the hash symbol –
#
– is ignored by R.Liberally add comments to your code as you write. Things that are clear as you write them will be mysterious to others, including your-future-self! Commenting takes little time and will save you time and headaches in the long run.