Content from Before we start


Last updated on 2024-11-19 | Edit this page


Overview

Questions

  • What is R and RStudio?
  • What is a working directory?
  • How should files be set up to import into R?
  • How can I look for help with R functions?

Objectives

  • Explain what R and RStudio are, what they are used for, and how they relate to each other.
  • Describe the purpose of the RStudio Script, Console, Environment, and Plots panes.
  • Organize files and directories for a set of analyses as an R Project, and understand the purpose of the working directory.
  • Use the built-in RStudio help interface to search for more information on R functions.
  • Demonstrate how to provide sufficient information for troubleshooting with the R user community.

What is R? What is RStudio?


The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is a popular way to write R scripts and interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

Why learn R?


R does not involve lots of pointing and clicking, and that’s a good thing

In R, the results of your analysis rely on a series of written commands, and not on remembering a succession of pointing and clicking. That is a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results. With a stored series of commands in an R script, you can repeat running them and R will process the new dataset exactly the same way as before.

Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.

Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.

R code is great for reproducibility

Reproducibility is when someone else, including your future self, can obtain the same results from the same dataset when using the same analysis.

R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.

R is widely used in academia and in industries such as pharma and biotech. These organisations expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

R is interdisciplinary and extensible

With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.

R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.

R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.

R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

R produces high-quality graphics

The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to visualize your data more effectively.

R has a large and welcoming community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, RStudio community, and Slack channels such as the R for Data Science online community (https://www.rfordatasci.com/). In addition, there are numerous online and in person meetups organised globally through organisations such as R Ladies Global (https://rladies.org/).

Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

Knowing your way around RStudio


Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.

The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, PBC.

We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio interface screenshot. Clockwise from top left: Source,Environment/History, Files/Plots/Packages/Help/Viewer,Console.

RStudio is divided into 4 “panes”:

  • The Source for your scripts and documents (top-left, in the default layout)
  • Your Environment/History (top-right) which shows all the objects in your working space (Environment) and your command history (History)
  • Your Files/Plots/Packages/Help/Viewer (bottom-right)
  • The R Console (bottom-left)

The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). For ease of use, settings such as background color, font color, font size, and zoom level can also be adjusted in this menu (Global Options -> Appearance).

One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

Getting set up


It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory. All of the scripts within this folder can then use relative paths to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way allows you to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.

RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and (re-)open files to assist resume work after a break. Go through the steps for creating an “R Project” for this tutorial below.

  1. Start RStudio.
  2. Under the File menu, click on New Project. Choose New Directory, then New Project.
  3. Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/data-carpentry).
  4. Click on Create Project.
  5. Download the code handout, place it in your working directory and rename it (e.g., data-carpentry-script.R).
  6. (Optional) Set Preferences to ‘Never’ save workspace in RStudio.

A workspace is your current working environment in R which includes any user-defined object. By default, all of these objects will be saved, and automatically loaded, when you reopen your project. Saving a workspace to .RData can be cumbersome, especially if you are working with larger datasets, and it can lead to hard to debug errors by having objects in memory you forgot you had. Therefore, it is often a good idea to turn this off. To do so, go to Tools –> ‘Global Options’ and select the ‘Never’ option for ‘Save workspace to .RData’ on exit.’

Set 'Save workspace to .RData on exit' to'Never'

Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized, and will help you to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for scripts, data, and documents.

  • data_raw/ & data/ Use these folders to store raw data and intermediate datasets you may create for the need of a particular analysis. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is also a good idea. For example, you could have files data_raw/tree_survey.plot1.txt and ...plot2.txt kept separate from a data/tree.survey.csv file generated by the scripts/01.preprocess.tree_survey.R script.
  • documents/ This would be a place to keep outlines, drafts, and other text.
  • scripts/ This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).
  • Additional (sub)directories depending on your project needs.

For this workshop, we will need a data_raw/ folder to store our raw data, and we will use data/ for when we learn how to export data as CSV files, and a fig/ folder for the figures that we will save.

  • Under the Files tab on the right of the screen, click on New Folder and create a folder named data_raw within your newly created working directory (e.g., ~/data-carpentry/). (Alternatively, type dir.create("data_raw") at your R console.) Repeat these operations to create a data and a fig folder.

We are going to keep the script in the root of our working directory because we are only going to use one file. Later, when you start create more complex projects, it might make sense to organize scripts in sub-directories.

Your working directory should now look like this:

How it should look like at the beginning of this lesson
How it should look like at the beginning of this lesson

The working directory

The working directory is an important concept to understand. It is the place from where R will be looking for and saving the files. When you write code for your project, it should refer to files in relation to the root of your working directory and only need files within this structure.

RStudio assists you in this regard and sets the working directory automatically to the directory where you have placed your project in. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser where your working directory should be, and clicking on the blue gear icon “More”, and select “Set As Working Directory”. Alternatively you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line because it will fail on someone else’s computer.

Interacting with R


The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.

Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.

RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.

At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts allow you to jump between the script and the console panes.

If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.

If R is still waiting for you to enter more data because it isn’t complete yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. This is because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.

Seeking help


Searching function documentation with ? and ??

If you need help with a specific function, let’s say mean(), you can type ?mean or press F1 while your cursor is on the function name. If you are looking for a function to do a particular task, but don’t know the function name, you can use the double question mark ??, for example ??kruskall. Both commands will open matching help files in RStudio’s help panel in the lower right corner. You can also use the help panel to search help directly, as seen in the screenshot.

RStudio help panel. When typing a word in the search field, it will showrelated suggestions.

Automatic code completion

When you write code in RStudio, you can use its automatic code completion to remind yourself of a function’s name or arguments. Start typing the function name and pay attention to the suggestions that pop up. Use the up and down arrow to select a suggested code completion and Tab to apply it. You can also use code completion to complete function’s argument names, object, names and file names. It even works if you don’t get the spelling 100% correct.

Package vignettes and cheat sheets

In addition to the documentation for individual functions, many packages have vignettes – instructions for how to use the package to do certain tasks. Vignettes are great for learning by example. Vignettes are accessible via the package help and by using the function browseVignettes().

There is also a Help menu at the top of the RStudio window, that has cheat sheets for popular packages, RStudio keyboard shortcuts, and more.

Finding more functions and packages

RStudio’s help only searches the packages that you have installed on your machine, but there are many more available on CRAN and GitHub. To search across all available R packages, you can use the website rdocumentation.org. Often, a generic Google or internet search “R <task>” will send you to the appropriate package documentation or a forum where someone else has already asked your question. Many packages also have websites with additional help, tutorials, news and more (for example tidyverse.org).

Dealing with error messages

Don’t get discouraged if your code doesn’t run immediately! Error messages are common when programming, and fixing errors is part of any programmer’s daily work. Often, the problem is a small typo in a variable name or a missing parenthesis. Watch for the red x’s next to your code in RStudio. These may provide helpful hints about the source of the problem.

RStudio shows a red x next to a line of code that R doesn't understand.

If you can’t fix an error yourself, start by googling it. Some error messages are too generic to diagnose a problem (e.g. “subscript out of bounds”). In that case it might help to include the name of the function or package you’re using in your query.

Asking for help

If your Google search is unsuccessful, you may want to ask other R users for help. There are different places where you can ask for help. During this workshop, don’t hesitate to talk to your neighbor, compare your answers, and ask for help. You might also be interested in organizing regular meetings following the workshop to keep learning from each other. If you have a friend or colleague with more experience than you, they might also be able and willing to help you.

Besides that, there are a few places on the internet that provide help:

  • Stack Overflow: Many questions have already been answered, but the challenge is to use the right words in your search to find them. If your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on how to ask a good question.
  • The R-help mailing list: it is used by a lot of people (including most of the R core team). If your question is valid (read its Posting Guide), you are likely to get an answer very fast, but the tone can be pretty dry and it is not always very welcoming to new users.
  • If your question is about a specific package rather than a base R function, see if there is a mailing list for the package. Usually it’s included in the DESCRIPTION file of the package that can be accessed using packageDescription("<package-name>").
  • You can also try to contact the package author directly, by emailing them or opening an issue on the code repository (e.g., on GitHub).
  • There are also some topic-specific mailing lists (GIS, phylogenetics, etc…). The complete list is on the R mailing lists website.

The key to receiving help from someone is for them to rapidly grasp your problem. Thus, you should be as precise as possible when describing your problem and help others to pinpoint where the issue might be. Try to…

  • Use the correct words to describe your problem. Otherwise you might get an answer pointing to the misuse of your words rather than answering your question.

  • Generalize what you are trying to do, so people outside your field can understand the question.

  • Reduce what does not work to a simple reproducible example. For instance, instead of using your real data set, create a small generic one. For more information on how to write a reproducible example see this article from the reprex package. Learning how to use the reprex package is also very helpful for this.

  • Include the output of sessionInfo() in your question. It provides information about your platform, the versions of R and the packages that you are using. As an example, here you can see the versions of R and all the packages that we are using to run the code in this lesson:

R

sessionInfo()

OUTPUT

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] RSQLite_2.3.7   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1
#>  [9] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0 knitr_1.47
#>
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5        gtable_0.3.5     compiler_4.4.2   renv_1.0.11
#>  [5] highr_0.11       tidyselect_1.2.1 blob_1.2.4       scales_1.3.0
#>  [9] fastmap_1.2.0    yaml_2.3.8       R6_2.5.1         generics_0.1.3
#> [13] munsell_0.5.1    DBI_1.2.3        pillar_1.9.0     tzdb_0.4.0
#> [17] rlang_1.1.3      utf8_1.2.4       cachem_1.1.0     stringi_1.8.4
#> [21] xfun_0.44        bit64_4.0.5      memoise_2.0.1    timechange_0.3.0
#> [25] cli_3.6.2        withr_3.0.0      magrittr_2.0.3   grid_4.4.2
#> [29] hms_1.1.3        lifecycle_1.0.4  vctrs_0.6.5      evaluate_0.23
#> [33] glue_1.7.0       fansi_1.0.6      colorspace_2.1-0 tools_4.4.2
#> [37] pkgconfig_2.0.3

How to learn more after the workshop?


The material we cover during this workshop will give you a taste of how you can use R to analyze data for your own research. However, to do advanced operations such as cleaning your dataset, using statistical methods, or creating beautiful graphics you will need to learn more.

The best way to become proficient and efficient at R, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to have to write a script from scratch, and given that many people make their code available online, modifying existing code to suit your purpose might get first hands-on experience using R for your own work and help you become comfortable eventually creating your own scripts.

More resources


More about R

  • The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
  • The R FAQ is dense and technical but it is full of useful information.
  • To stay up to date, follow #rstats on twitter. Twitter can also be a way to get questions answered and learn about useful R packages and tipps (e.g., [@RLangTips])

How to ask good programming questions?

Key Points

  • R is a programming language and RStudio is the IDE that assists in using R.
  • There are many benefits to learning R, including writing reproducibile code, ability to use a variety of datasets, and a broad, open-source community of practioners.
  • Files related to analysis should be organized within a single working directory.
  • R uses commands containing functions to tell the computer what to do.
  • Documentation for each function is available within RStudio, or users can ask for help from one of many online forums, cheatsheets, or email lists.

Content from Introduction to R


Last updated on 2024-11-19 | Edit this page


Overview

Questions

  • How do you create objects in R?
  • How do you save R code for later use?
  • How do you manipulate data in R?

Objectives

  • Define the following terms as they relate to R: object, assign, call, function, arguments, options.
  • Create objects and assign values to them in R.
  • Learn how to name objects.
  • Save a script file for later use.
  • Use comments to inform script.
  • Solve simple arithmetic operations in R.
  • Call functions and use arguments to change their default options.
  • Inspect the content of vectors and manipulate their content.
  • Subset and extract values from vectors.
  • Analyze vectors with missing data.

Creating objects in R


You can get output from R simply by typing math in the console:

R

3 + 5
12 / 7

However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:

R

weight_kg <- 55

<- is the assignment operator we will use in this course. It assigns values on the right to objects on the left. So, after executing x <- 3, the value of x is 3. For historical reasons, you can also use = for assignments, but not in every context. Because of the slight differences in syntax, it is good practice to always use <- for assignments.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke in a PC, while typing Option + - (push Option at the same time as the - key) does the same in a Mac.

Objects can be given almost any name such as x, current_temperature, or subject_id. Here are some further guidelines on naming objects:

  • You want your object names to be explicit and not too long.
  • They cannot start with a number (2x is not valid, but x2 is).
  • R is case sensitive, so for example, weight_kg is different from Weight_kg.
  • There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use.
  • It’s best to avoid dots (.) within names. Many function names in R itself have them and dots also have a special meaning (methods) in R and other programming languages. To avoid confusion, don’t include dots in names.
  • It is recommended to use nouns for object names and verbs for function names.
  • Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Styles can include “lower_snake”, “UPPER_SNAKE”, “lowerCamelCase”, “UpperCamelCase”, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides come from Google, Jean Fan and the tidyverse. The tidyverse style is very comprehensive and may seem overwhelming at first. You can install the lintr package to automatically check for issues in the styling of your code.

Objects vs. variables

What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

R

weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parenthesis around the call prints the value of `weight_kg`
weight_kg          # and so does typing the name of the object

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

R

2.2 * weight_kg

We can also change an object’s value by assigning it a new one:

R

weight_kg <- 57.5
2.2 * weight_kg

This means that assigning a value to one object does not change the values of other objects. For example, let’s store the animal’s weight in pounds in a new object, weight_lb:

R

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

R

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 220?

Saving your code

Up to now, your code has been in the console. This is useful for quick queries but not so helpful if you want to revisit your work for any reason. A script can be opened by pressing Ctrl + Shift + N. It is wise to save your script file immediately. To do this press Ctrl + S. This will open a dialogue box where you can decide where to save your script file, and what to name it. The .R file extension is added automatically and ensures your file will open with RStudio.

Don’t forget to save your work periodically by pressing Ctrl + S.

Comments

The comment character in R is #. Anything to the right of a # in a script will be ignored by R. It is useful to leave notes and explanations in your scripts. For convenience, RStudio provides a keyboard shortcut to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.

Challenge

What are the values after each statement in the following?

R

mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?

Functions and their arguments

Functions are “canned scripts” that automate more complicated sets of commands including operations assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:

R

weight_kg <- sqrt(10)

Here, the value of 10 is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to the object weight_kg. This function takes one argument, other functions might take several.

The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.

Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.

Let’s try a function that can take multiple arguments: round().

R

round(3.14159)

OUTPUT

#> [1] 3

Here, we’ve called round() with just one argument, 3.14159, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.

R

args(round)

OUTPUT

#> function (x, digits = 0, ...)
#> NULL

R

?round

We see that if we want a different number of digits, we can type digits = 2 or however many we want.

R

round(3.14159, digits = 2)

OUTPUT

#> [1] 3.14

If you provide the arguments in the exact same order as they are defined you don’t have to name them:

R

round(3.14159, 2)

OUTPUT

#> [1] 3.14

And if you do name the arguments, you can switch their order:

R

round(digits = 2, x = 3.14159)

OUTPUT

#> [1] 3.14

It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to then specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing.

Vectors and data types


A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed by a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example we can create a vector of animal weights and assign it to a new object weight_g:

R

weight_g <- c(50, 60, 65, 82)
weight_g

A vector can also contain characters:

R

animals <- c("mouse", "rat", "dog")
animals

The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

R

length(weight_g)
length(animals)

An important feature of a vector, is that all of the elements are the same type of data. The function class() indicates what kind of object you are working with:

R

class(weight_g)
class(animals)

The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

R

str(weight_g)
str(animals)

You can use the c() function to add other elements to your vector:

R

weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g

In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.

We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.

An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:

  • "logical" for TRUE and FALSE (the boolean data type)
  • "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
  • "raw" for bitstreams that we won’t discuss further

You can check the type of your vector using the typeof() function and inputting your vector as the argument.

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).

Challenge

  • We’ve seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?

R implicitly converts them to all be the same type

Challenge (continued)

  • What will happen in each of these examples? (hint: use class() to check the data type of your objects):

    R

    num_char <- c(1, 2, 3, "a")
    num_logical <- c(1, 2, 3, TRUE)
    char_logical <- c("a", "b", "c", TRUE)
    tricky <- c(1, 2, 3, "4")
  • Why do you think it happens?

Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.

Challenge (continued)

  • How many values in combined_logical are "TRUE" (as a character) in the following example (reusing the 2 ..._logicals from above):

    R

    combined_logical <- c(num_logical, char_logical)

Only one. There is no memory of past data types, and the coercion happens the first time the vector is evaluated. Therefore, the TRUE in num_logical gets converted into a 1 before it gets converted into "1" in combined_logical.

Challenge (continued)

  • You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?

logical → numeric → character ← logical

Subsetting vectors


If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

R

animals <- c("mouse", "rat", "dog", "cat")
animals[2]

OUTPUT

#> [1] "rat"

R

animals[c(3, 2)]

OUTPUT

#> [1] "dog" "rat"

We can also repeat the indices to create an object with more elements than the original one:

R

more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals

OUTPUT

#> [1] "mouse" "rat"   "dog"   "rat"   "mouse" "cat"

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

R

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]

OUTPUT

#> [1] 21 54 55

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:

R

weight_g > 50    # will return logicals with TRUE for the indices that meet the condition

OUTPUT

#> [1] FALSE FALSE FALSE  TRUE  TRUE

R

## so we can use this to select only the values above 50
weight_g[weight_g > 50]

OUTPUT

#> [1] 54 55

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

R

weight_g[weight_g > 30 & weight_g < 50]

OUTPUT

#> [1] 34 39

R

weight_g[weight_g <= 30 | weight_g == 55]

OUTPUT

#> [1] 21 55

R

weight_g[weight_g >= 30 & weight_g == 21]

OUTPUT

#> numeric(0)

Here, > for “greater than”, < stands for “less than”, <= for “less than or equal to”, and == for “equal to”. The double equal sign == is a test for numerical equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The function %in% allows you to test if any of the elements of a search vector are found:

R

animals <- c("mouse", "rat", "dog", "cat", "cat")

# return both rat and cat
animals[animals == "cat" | animals == "rat"]

OUTPUT

#> [1] "rat" "cat" "cat"

R

# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")

OUTPUT

#> [1] FALSE  TRUE  TRUE  TRUE  TRUE

R

# use the logical vector created by %in% to return elements from animals
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")]

OUTPUT

#> [1] "rat" "dog" "cat" "cat"

Challenge (optional)

  • Can you figure out why "four" > "five" returns TRUE?

When using “>” or “<” on strings, R compares their alphabetical order. Here “four” comes after “five”, and therefore is “greater than” it.

Missing data


As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result as if the missing values were removed (rm stands for ReMoved) first.

R

heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.

R

## Extract those elements which are not missing values.
heights[!is.na(heights)]

## Returns the object with incomplete cases removed.
#The returned object is an atomic vector of type `"numeric"` (or #`"double"`).
na.omit(heights)

## Extract those elements which are complete cases.
#The returned object is an atomic vector of type `"numeric"` (or #`"double"`).
heights[complete.cases(heights)]

Recall that you can use the typeof() function to find the type of your atomic vector.

Challenge

  1. Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.

R

heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
  1. Use the function median() to calculate the median of the heights vector.

  2. Use R to figure out how many people in the set are taller than 67 inches.

R

heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)

# 1.
heights_no_na <- heights[!is.na(heights)]
# or
heights_no_na <- na.omit(heights)
# or
heights_no_na <- heights[complete.cases(heights)]

# 2.
median(heights, na.rm = TRUE)

# 3.
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)

Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with the Portal dataset we have been using in the other lessons, and learn about data frames.

Key Points

  • <- is used to assign values on the right to objects on the left
  • Code should be saved within the Source pane in RStudio to help you return to your code later.
  • ‘#’ can be used to add comments to your code.
  • Functions can automate more complicated sets of commands, and require arguments as inputs.
  • Vectors are composed by a series of values and can take many forms.
  • Data structures in R include ‘vector’, ‘list’, ‘matrix’, ‘data.frame’, ‘factor’, and ‘array’.
  • Vectors can be subset by indexing or through logical vectors.
  • Many functions exist to remove missing data from data structures.

Content from Starting with data


Last updated on 2024-11-19 | Edit this page

Overview

Questions

  • What is a data.frame?
  • How can I read a complete csv file into R?
  • How can I get basic summary information about my dataset?
  • How can extract specific information from a dataframe?
  • What are factors, and how are they different from other datatypes?
  • How can I rename factors?
  • How are dates represented in R and how can I change the format?

Objectives

  • Load external data from a .csv file into a data frame.
  • Install and load packages.
  • Describe what a data frame is.
  • Summarize the contents of a data frame.
  • Use indexing to subset specific portions of data frames.
  • Describe what a factor is.
  • Convert between strings and factors.
  • Reorder and rename factors.
  • Change how character strings are handled in a data frame.
  • Format dates.

Loading the survey data


We are investigating the animal species diversity and weights found within plots at our study site. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:

Column Description
record_id Unique id for the observation
month month of observation
day day of observation
year year of observation
plot_id ID of a particular experimental plot of land
species_id 2-letter code
sex sex of animal (“M”, “F”)
hindfoot_length length of the hindfoot in mm
weight weight of the animal in grams
genus genus of animal
species species of animal
taxon e.g. Rodent, Reptile, Bird, Rabbit
plot_type type of plot

Downloading the data

We created the folder that will store the downloaded data (data_raw) in the chapter “Before we start”. If you skipped that part, it may be a good idea to have a look now, to make sure your working directory is set up properly.

We are going to use the R function download.file() to download the CSV file that contains the survey data from Figshare, and we will use read_csv() to load the content of the CSV file into R.

Inside the download.file command, the first entry is a character string with the source URL (“https://ndownloader.figshare.com/files/2292169”). This source URL downloads a CSV file from figshare. The text after the comma (“data_raw/portal_data_joined.csv”) is the destination of the file on your local machine. You’ll need to have a folder on your machine called “data_raw” where you’ll download the file. So this command downloads a file from Figshare, names it “portal_data_joined.csv” and adds it to a preexisting folder named “data_raw”.

R

download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

Reading the data into R

The file has now been downloaded to the destination you specified, but R has not yet loaded the data from the file into memory. To do this, we can use the read_csv() function from the tidyverse package.

Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like round(), sqrt(), or c() come built into R. Packages give you access to additional functions beyond base R. A similar function to read_csv() from the tidyverse package is read.csv() from base R. We don’t have time to cover their differences but notice that the exact spelling determines which function is used. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.

To install the tidyverse package, we can type install.packages("tidyverse") straight into the console. In fact, it’s better to write this in the console than in our script for any package, as there’s no need to re-install packages every time we run the script. Then, to load the package type:

R

## load the tidyverse packages, incl. dplyr
library(tidyverse)

Now we can use the functions from the tidyverse package. Let’s use read_csv() to read the data into a data frame (we will learn more about data frames later):

R

surveys <- read_csv("data_raw/portal_data_joined.csv")

OUTPUT

#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

When you execute read_csv on a data file, it looks through the first 1000 rows of each column and guesses its data type. For example, in this dataset, read_csv() reads weight as col_double (a numeric data type), and species as col_character. You have the option to specify the data type for a column manually by using the col_types argument in read_csv.

Note

read_csv() assumes that fields are delineated by commas. However, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delineator. If you want to read in this type of files in R, you can use the read_csv2() function. It behaves like read_csv() but uses different parameters for the decimal and the field separators. There is also the read_tsv() for tab separated data files and read_delim() for less common formats. Check out the help for read_csv() by typing ?read_csv to learn more.

In addition to the above versions of the csv format, you should develop the habits of looking at and recording some parameters of your csv files. For instance, the character encoding, control characters used for line ending, date format (if the date is not split into three variables), and the presence of unexpected newlines are important characteristics of your data files. Those parameters will ease up the import step of your data in R.

We can see the contents of the first few lines of the data by typing its name: surveys. By default, this will show you as many rows and columns of the data as fit on your screen. If you wanted the first 50 rows, you could type print(surveys, n = 50)

We can also extract the first few lines of this data using the function head():

R

head(surveys)

OUTPUT

#> # A tibble: 6 × 13
#>   record_id month   day  year plot_id species_id sex   hindfoot_length weight
#>       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
#> 1         1     7    16  1977       2 NL         M                  32     NA
#> 2        72     8    19  1977       2 NL         M                  31     NA
#> 3       224     9    13  1977       2 NL         <NA>               NA     NA
#> 4       266    10    16  1977       2 NL         <NA>               NA     NA
#> 5       349    11    12  1977       2 NL         <NA>               NA     NA
#> 6       363    11    12  1977       2 NL         <NA>               NA     NA
#> # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

Unlike the print() function, head() returns the extracted data. You could use it to assign the first 100 rows of surveys to an object using surveys_sample <- head(surveys, 100). This can be useful if you want to try out complex computations on a subset of your data before you apply them to the whole data set. There is a similar function that lets you extract the last few lines of the data set. It is called (you might have guessed it) tail().

To open the dataset in RStudio’s Data Viewer, use the view() function:

R

view(surveys)

Note

There are two functions for viewing which are case-sensitive. Using view() with a lowercase ‘v’ is part of tidyverse, whereas using View() with an uppercase ‘V’ is loaded through base R in the utils package.

What are data frames?


When we loaded the data into R, it got stored as an object of class tibble, which is a special kind of data frame (the difference is not important for our purposes, but you can learn more about tibbles here). Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. Data frames can be created by hand, but most commonly they are generated by functions like read_csv(); in other words, when importing spreadsheets from your hard drive or the web.

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We can see this also when inspecting the structure of a data frame with the function str():

R

str(surveys)

Inspecting data frames


We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:

    • dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(surveys) - returns the number of rows
    • ncol(surveys) - returns the number of columns
  • Content:

    • head(surveys) - shows the first 6 rows
    • tail(surveys) - shows the last 6 rows
  • Names:

    • names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(surveys) - returns the row names
  • Summary:

    • str(surveys) - structure of the object and information about the class, length and content of each column
    • summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Challenge

Based on the output of str(surveys), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?

R

str(surveys)

OUTPUT

#> spc_tbl_ [34,786 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ record_id      : num [1:34786] 1 72 224 266 349 363 435 506 588 661 ...
#>  $ month          : num [1:34786] 7 8 9 10 11 11 12 1 2 3 ...
#>  $ day            : num [1:34786] 16 19 13 16 12 12 10 8 18 11 ...
#>  $ year           : num [1:34786] 1977 1977 1977 1977 1977 ...
#>  $ plot_id        : num [1:34786] 2 2 2 2 2 2 2 2 2 2 ...
#>  $ species_id     : chr [1:34786] "NL" "NL" "NL" "NL" ...
#>  $ sex            : chr [1:34786] "M" "M" NA NA ...
#>  $ hindfoot_length: num [1:34786] 32 31 NA NA NA NA NA NA NA NA ...
#>  $ weight         : num [1:34786] NA NA NA NA NA NA NA NA 218 NA ...
#>  $ genus          : chr [1:34786] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
#>  $ species        : chr [1:34786] "albigula" "albigula" "albigula" "albigula" ...
#>  $ taxa           : chr [1:34786] "Rodent" "Rodent" "Rodent" "Rodent" ...
#>  $ plot_type      : chr [1:34786] "Control" "Control" "Control" "Control" ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   record_id = col_double(),
#>   ..   month = col_double(),
#>   ..   day = col_double(),
#>   ..   year = col_double(),
#>   ..   plot_id = col_double(),
#>   ..   species_id = col_character(),
#>   ..   sex = col_character(),
#>   ..   hindfoot_length = col_double(),
#>   ..   weight = col_double(),
#>   ..   genus = col_character(),
#>   ..   species = col_character(),
#>   ..   taxa = col_character(),
#>   ..   plot_type = col_character()
#>   .. )
#>  - attr(*, "problems")=<externalptr>

R

## * class: data frame
## * how many rows: 34786,  how many columns: 13

Indexing and subsetting data frames


Our survey data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

R

# We can extract specific values by specifying row and column indices
# in the format: 
# data_frame[row_index, column_index]
# For instance, to extract the first row and column from surveys:
surveys[1, 1]

# First row, sixth column:
surveys[1, 6]   

# We can also use shortcuts to select a number of rows or columns at once
# To select all columns, leave the column index blank
# For instance, to select all columns for the first row:
surveys[1, ]

# The same shortcut works for rows --
# To select the first column across all rows:
surveys[, 1]

# An even shorter way to select first column across all rows:
surveys[1] # No comma! 

# To select multiple rows or columns, use vectors!
# To select the first three rows of the 5th and 6th column
surveys[c(1, 2, 3), c(5, 6)] 

# We can use the : operator to create those vectors for us:
surveys[1:3, 5:6] 

# This is equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]

# As we've seen, when working with tibbles 
# subsetting with single square brackets ("[]") always returns a data frame.
# If you want a vector, use double square brackets ("[[]]")

# For instance, to get the first column as a vector:
surveys[[1]]

# To get the first value in our data frame:
surveys[[1, 1]]

: is a special function that creates numeric vectors of integers in increasing or decreasing order, test 1:10 and 10:1 for instance.

You can also exclude certain indices of a data frame using the “-” sign:

R

surveys[, -1]                 # The whole data frame, except the first column
surveys[-(7:nrow(surveys)), ] # Equivalent to head(surveys)

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

R

# As before, using single brackets returns a data frame:
surveys["species_id"]
surveys[, "species_id"]

# Double brackets returns a vector:
surveys[["species_id"]]

# We can also use the $ operator with column names instead of double brackets
# This returns a vector:
surveys$species_id

In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.

Challenge

  1. Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.

  2. Notice how nrow() gave you the number of rows in a data.frame?

  • Use that number to pull out just that last row from the surveys dataset.
  • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
  • Pull out that last row using nrow() instead of the row number.
  • Create a new data frame (surveys_last) from that last row.
  1. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.

  2. Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.

R

## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]

Factors


When we did str(surveys) we saw that several of the columns consist of integers. The columns genus, species, sex, plot_type, … however, are of the class character. Arguably, these columns contain categorical data, that is, they can only take on a limited number of values.

R has a special class for working with categorical data, called factor. Factors are very useful and actually contribute to making R particularly well suited to working with data. So we are going to spend a little time introducing them.

Once created, factors can only contain a pre-defined set of values, known as levels. Factors are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

When importing a data frame with read_csv(), the columns that contain text are not automatically coerced (=converted) into the factor data type, but once we have loaded the data we can do the conversion using the factor() function:

R

surveys$sex <- factor(surveys$sex)

We can see that the conversion has worked by using the summary() function again. This produces a table with the counts for each factor level:

R

summary(surveys$sex)

By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

R

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can see this by using the function levels() and you can find the number of levels using nlevels():

R

levels(sex)
nlevels(sex)

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:

R

sex # current order

OUTPUT

#> [1] male   female female male
#> Levels: female male

R

sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering

OUTPUT

#> [1] male   female female male
#> Levels: male female

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "female", "male" is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset).

Challenge

  1. Change the columns taxa and genus in the surveys data frame into a factor.

  2. Using the functions you learned before, can you find out…

  • How many rabbits were observed?
  • How many different genera are in the genus column?

R

surveys$taxa <- factor(surveys$taxa)
surveys$genus <- factor(surveys$genus)
summary(surveys)
nlevels(surveys$genus)

## * how many genera: There are 26 unique genera in the `genus` column.
## * how many rabbts: There are 75 rabbits in the `taxa` column.

Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

R

as.character(sex)

In some cases, you may have to convert factors where the levels appear as numbers (such as concentration levels or years) to a numeric vector. For instance, in one part of your analysis the years might need to be encoded as factors (e.g., comparing average weights across years) but in another part of your analysis they may need to be stored as numeric values (e.g., doing math operations on the years). This conversion from factor to numeric is a little trickier. The as.numeric() function returns the index values of the factor, not its levels, so it will result in an entirely new (and unwanted in this case) set of numbers. One method to avoid this is to convert factors to characters, and then to numbers.

Another method is to use the levels() function. Compare:

R

year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)               # Wrong! And there is no warning...
as.numeric(as.character(year_fct)) # Works...
as.numeric(levels(year_fct))[year_fct]    # The recommended way.

Notice that in the levels() approach, three important steps occur:

  • We obtain all the factor levels using levels(year_fct)
  • We convert these levels to numeric values using as.numeric(levels(year_fct))
  • We then access these numeric values using the underlying integers of the vector year_fct inside the square brackets

Renaming factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of males and females captured over the course of the experiment:

R

## bar plot of the number of females and males captured during the experiment:
plot(surveys$sex)

However, as we saw when we used summary(surveys$sex), there are about 1700 individuals for which the sex information hasn’t been recorded. To show them in the plot, we can turn the missing values into a factor level with the addNA() function. We will also have to give the new factor level a label. We are going to work with a copy of the sex column, so we’re not modifying the working copy of the data frame:

R

sex <- surveys$sex
levels(sex)

OUTPUT

#> [1] "F" "M"

R

sex <- addNA(sex)
levels(sex)

OUTPUT

#> [1] "F" "M" NA

R

head(sex)

OUTPUT

#> [1] M    M    <NA> <NA> <NA> <NA>
#> Levels: F M <NA>

R

levels(sex)[3] <- "undetermined"
levels(sex)

OUTPUT

#> [1] "F"            "M"            "undetermined"

R

head(sex)

OUTPUT

#> [1] M            M            undetermined undetermined undetermined
#> [6] undetermined
#> Levels: F M undetermined

Now we can plot the data again, using plot(sex).

Challenge

  • Rename “F” and “M” to “female” and “male” respectively.
  • Now that we have renamed the factor level to “undetermined”, can you recreate the barplot such that “undetermined” is first (before “female”)?

R

levels(sex)[1:2] <- c("female", "male")
sex <- factor(sex, levels = c("undetermined", "female", "male"))
plot(sex)

Challenge

  1. We have seen how data frames are created when using read_csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!

R

animal_data <- data.frame(
          animal = c(dog, cat, sea cucumber, sea urchin),
          feel = c("furry", "squishy", "spiny"),
          weight = c(45, 8 1.1, 0.8)
          )
  1. Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate):
  • Are they what you expected? Why? Why not?
  • What would you need to change to ensure that each column had the accurate data type?

R

country_climate <- data.frame(
       country = c("Canada", "Panama", "South Africa", "Australia"),
       climate = c("cold", "hot", "temperate", "hot/temperate"),
       temperature = c(10, 30, 18, "15"),
       northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo = c(FALSE, FALSE, FALSE, 1)
       )

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (for instance, a letter in a column that should only contain numbers).

Learn more in this RStudio tutorial

Formatting dates


A common issue that new (and experienced!) R users have is converting date and time information into a variable that is suitable for analyses. One way to store date information is to store each component of the date in a separate column. Using str(), we can confirm that our data frame does indeed have a separate column for day, month, and year, and that each of these columns contains integer values.

R

str(surveys)

We are going to use the ymd() function from the package lubridate (which belongs to the tidyverse; learn more here). lubridate gets installed as part as the tidyverse installation. When you load the tidyverse (library(tidyverse)), the core packages (the packages used in most data analyses) get loaded. lubridate however does not belong to the core tidyverse, so you have to load it explicitly with library(lubridate)

Start by loading the required package:

R

library(lubridate)

The lubridate package has many useful functions for working with dates. These can help you extract dates from different string representations, convert between timezones, calculate time differences and more. You can find an overview of them in the lubridate cheat sheet.

Here we will use the function ymd(), which takes a vector representing year, month, and day, and converts it to a Date vector. Date is a class of data recognized by R as being a date and can be manipulated as such. The argument that the function requires is flexible, but, as a best practice, is a character vector formatted as “YYYY-MM-DD”.

Let’s create a date object and inspect the structure:

R

my_date <- ymd("2015-01-01")
str(my_date)

Now let’s paste the year, month, and day separately - we get the same result:

R

# sep indicates the character to use to separate each component
my_date <- ymd(paste("2015", "1", "1", sep = "-")) 
str(my_date)

Now we apply this function to the surveys dataset. Create a character vector from the year, month, and day columns of surveys using paste():

R

paste(surveys$year, surveys$month, surveys$day, sep = "-")

This character vector can be used as the argument for ymd():

R

ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))

WARNING

#> Warning: 129 failed to parse.

There is a warning telling us that some dates could not be parsed (understood) by the ymd() function. For these dates, the function has returned NA, which means they are treated as missing values. We will deal with this problem later, but first we add the resulting Date vector to the surveys data frame as a new column called date:

R

surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))

WARNING

#> Warning: 129 failed to parse.

R

str(surveys) # notice the new column, with 'date' as the class

Let’s make sure everything worked correctly. One way to inspect the new column is to use summary():

R

summary(surveys$date)

OUTPUT

#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "1977-07-16" "1984-03-12" "1990-07-22" "1990-12-15" "1997-07-29" "2002-12-31"
#>         NA's
#>        "129"

Let’s investigate why some dates could not be parsed.

We can use the functions we saw previously to deal with missing data to identify the rows in our data frame that are failing. If we combine them with what we learned about subsetting data frames earlier, we can extract the columns “year,”month”, “day” from the records that have NA in our new column date. We will also use head() so we don’t clutter the output:

R

missing_dates <- surveys[is.na(surveys$date), c("year", "month", "day")]

head(missing_dates)

OUTPUT

#> # A tibble: 6 × 3
#>    year month   day
#>   <dbl> <dbl> <dbl>
#> 1  2000     9    31
#> 2  2000     4    31
#> 3  2000     4    31
#> 4  2000     4    31
#> 5  2000     4    31
#> 6  2000     9    31

Why did these dates fail to parse? If you had to use these data for your analyses, how would you deal with this situation?

The answer is because the dates provided as input for the ymd() function do not actually exist. If we refer to the output we got above, September and April only have 30 days, not 31 days as it is specified in our dataset.

There are several ways you could deal with situation:

  • If you have access to the raw data (e.g., field sheets) or supporting information (e.g., field trip reports/logs), check them and ensure the electronic database matches the information in the original data source.
  • If you are able to contact the person responsible for collecting the data, you could refer to them and ask for clarification.
  • You could also check the rest of the dataset for clues about the correct value for the erroneous dates.
  • If your project has guidelines on how to correct this sort of errors, refer to them and apply any recommendations.
  • If it is not possible to ascertain the correct value for these observations, you may want to leave them as missing data.

Regardless of the option you choose, it is important that you document the error and the corrections (if any) that you apply to your data.

Key Points

  • Use read.csv to read tabular data in R.
  • A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length.
  • dplyr provides many methods for inspecting and summarizing data in data frames.
  • Use factors to represent categorical data in R.
  • The lubridate package has many useful functions for working with dates.

Content from Manipulating, analyzing and exporting data with tidyverse


Last updated on 2024-11-19 | Edit this page

Overview

Questions

  • What are dplyr and tidyr?
  • How can I select specific rows and/or columns from a dataframe?
  • How can I combine multiple commands into a single command?
  • How can I create new columns or remove existing columns from a dataframe?

Objectives

  • Describe the purpose of the dplyr and tidyr packages.
  • Select certain columns in a data frame with the dplyr function select.
  • Extract certain rows in a data frame according to logical (boolean) conditions with the dplyr function filter .
  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
  • Add new columns to a data frame that are functions of existing columns with mutate.
  • Use the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and count to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.
  • Describe what key-value pairs are.
  • Reshape a data frame from long to wide format and back with the pivot_wider and pivot_longer commands from the tidyr package.
  • Export a data frame to a .csv file.

Data manipulation using dplyr and tidyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter dplyr. dplyr is a package for helping with tabular data manipulation. It pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

The tidyverse package is an “umbrella-package” that installs tidyr, dplyr, and several other useful packages for data analysis, such as ggplot2, tibble, etc.

The tidyverse package tries to address 3 common issues that arise when doing data analysis in R:

  1. The results from a base R function sometimes depend on the type of data.
  2. R expressions are used in a non standard way, which can be confusing for new learners.
  3. The existence of hidden arguments having default operations that new learners are not aware of.

You should already have installed and loaded the tidyverse package. If you haven’t already done so, you can type install.packages("tidyverse") straight into the console. Then, type library(tidyverse) to load the package.

What are dplyr and tidyr?

The package dplyr provides helper tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and usage by different R functions. For example, sometimes we want data sets where we have one row per measurement. Other times we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups (e.g., a time period, an experimental unit like a plot or a batch number). Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

As before, we’ll read in our data using the read_csv() function from the tidyverse package readr.

R

surveys <- read_csv("data_raw/portal_data_joined.csv")

OUTPUT

#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R

## inspect the data
str(surveys)

R

## preview the data
view(surveys)

Next, we’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys), and the subsequent arguments are the columns to keep.

R

select(surveys, plot_id, species_id, weight)

To select all columns except certain ones, put a “-” in front of the variable to exclude it.

R

select(surveys, -record_id, -species_id)

This will select all the variables in surveys except record_id and species_id.

To choose rows based on a specific criterion, use filter():

R

filter(surveys, year == 1995)

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

R

surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

R

surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl

  • Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

R

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

In the above code, we use the pipe to send the surveys dataset first through filter() to keep rows where weight is less than 5, then through select() to keep only the species_id, sex, and weight columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then.” For instance, in the example above, we took the data frame surveys, then we filtered for rows with weight < 5, then we selected columns species_id, sex, and weight. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

R

surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

surveys_sml

Note that the final data frame is the leftmost part of this expression.

Challenge

Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.

R

surveys %>%
    filter(year < 1995) %>%
    select(year, sex, weight)

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

R

surveys %>%
  mutate(weight_kg = weight / 1000)

You can also create a second new column based on the first new column within the same call of mutate():

R

surveys %>%
  mutate(weight_kg = weight / 1000,
         weight_lb = weight_kg * 2.2)

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. (Pipes work with non-dplyr functions, too, as long as the dplyr or magrittr package is loaded).

R

surveys %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain:

R

surveys %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

is.na() is a function that determines whether something is an NA. The ! symbol negates the result, so we’re asking for every row where weight is not an NA.

Challenge

Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_cm containing the hindfoot_length values (currently in mm) converted to centimeters. In this hindfoot_cm column, there are no NAs and all values are less than 3.

Hint: think about how the commands should be ordered to produce this data frame!

R

surveys_hindfoot_cm <- surveys %>%
    filter(!is.na(hindfoot_length)) %>%
    mutate(hindfoot_cm = hindfoot_length / 10) %>%
    filter(hindfoot_cm < 3) %>%
    select(species_id, hindfoot_cm)

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. Key functions of dplyr for this workflow are group_by() and summarize().

The group_by() and summarize() functions

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean weight by sex:

R

surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))

You may also have noticed that the output from these calls doesn’t run off the screen anymore. It’s one of the advantages of tbl_df over data frame.

You can also group by multiple columns:

R

surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  tail()

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

Here, we used tail() to look at the last six rows of our summary. Before, we had used head() to look at the first six rows. We can see that the sex column contains NA values because some animals had escaped before their sex and body weights could be determined. The resulting mean_weight column does not contain NA but NaN (which refers to “Not a Number”) because mean() was called on a vector of NA values while at the same time setting na.rm = TRUE. To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed first, we can omit na.rm = TRUE when computing the mean:

R

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

Here, again, the output from these calls doesn’t run off the screen anymore. If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display:

R

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  print(n = 15)

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:

R

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_weight to put the lighter species first:

R

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight)) %>%
  arrange(min_weight)

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of mean weight:

R

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight)) %>%
  arrange(desc(mean_weight))

OUTPUT

#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each sex, we would do:

R

surveys %>%
    count(sex)

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, surveys %>% count() is equivalent to:

R

surveys %>%
    group_by(sex) %>%
    summarize(count = n())

For convenience, count() provides the sort argument:

R

surveys %>%
    count(sex, sort = TRUE)

Previous example shows the use of count() to count the number of rows/observations for one factor (i.e., sex). If we wanted to count combination of factors, such as sex and species, we would specify the first and the second factor as the arguments of count():

R

surveys %>%
  count(sex, species)

With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in (i) an alphabetical order of the levels of the species and (ii) in descending order of the count:

R

surveys %>%
  count(sex, species) %>%
  arrange(species, desc(n))

From the table above, we may learn that, for instance, there are 75 observations of the albigula species that are not specified for its sex (i.e. NA).

Challenge

  1. How many animals were caught in each plot_type surveyed?

R

surveys %>%
    count(plot_type)

Challenge (continued)

  1. Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id). Also add the number of observations (hint: see ?n).

R

surveys %>%
    filter(!is.na(hindfoot_length)) %>%
    group_by(species_id) %>%
    summarize(
        mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n = n()
    )

Challenge (continued)

  1. What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight.

R

surveys %>%
    filter(!is.na(weight)) %>%
    group_by(year) %>%
    filter(weight == max(weight)) %>%
    select(year, genus, species, weight) %>%
    arrange(year)

Reshaping with pivot_longer and pivot_wider

In the spreadsheet lesson, we discussed how to structure our data leading to the four rules defining a tidy dataset:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell
  4. Each type of observational unit forms a table

Here we examine the fourth rule: Each type of observational unit forms a table.

In surveys, the rows of surveys contain the values of variables associated with each record (the unit), values such as the weight or sex of each animal associated with each record. What if instead of comparing records, we wanted to compare the different mean weight of each genus between plots? (Ignoring plot_type for simplicity).

We’d need to create a new table where each row (the unit) is comprised of values of variables associated with each plot. In practical terms this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot.

Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.

The opposite transformation would be to transform column names into values of a variable.

We can do both these of transformations with two tidyr functions, pivot_wider() and pivot_longer().

These may sound like dramatically different data layouts, but there are some tools that make transitions between these layouts more straightforward than you might think! The gif below shows how these two formats relate to each other, and gives you an idea of how we can use R to shift from one format to the other.

Pivoting from long to wide format

pivot_wider() takes three principal arguments:

  1. the data
  2. the names_from column variable whose values will become new column names.
  3. the values_from column variable whose values will fill the new column variables.

Further arguments include values_fill which, if set, fills in missing values with the value provided.

Let’s use pivot_wider() to transform surveys to find the mean weight of each genus in each plot over the entire survey period. We use filter(), group_by() and summarize() to filter our observations and variables of interest, and create a new variable for the mean_weight.

R

surveys_gw <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(plot_id, genus) %>%
  summarize(mean_weight = mean(weight))

OUTPUT

#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.

R

str(surveys_gw)

This yields surveys_gw where the observations for each plot are distributed across multiple rows, 196 observations of 3 variables. Using pivot_wider() with the names from genus and with values from mean_weight this becomes 24 observations of 11 variables, one row for each plot.

R

surveys_wide <- surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight)

str(surveys_wide)

We could now plot comparisons between the weight of genera (one is called a genus, multiple are called genera) in different plots, although we may wish to fill in the missing values first.

R

surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>%
  head()

Pivoting from wide to long format

The opposing situation could occur if we had been provided with data in the form of surveys_wide, where the genus names are column names, but we wish to treat them as values of a genus variable instead.

In this situation we are reshaping the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.

pivot_longer() takes four principal arguments:

  1. the data
  2. the names_to column variable we wish to create from column names.
  3. the values_to column variable we wish to create and fill with values.
  4. cols are the name of the columns we use to make this pivot (or to drop).

To recreate surveys_gw from surveys_wide we would create a names variable called genus and value variable called mean_weight.

In pivoting longer, we also need to specify what columns to reshape. If the columns are directly adjacent as they are here, we don’t even need to list the all out: we can just use the : operator!

R

surveys_long <- surveys_wide %>%
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)

str(surveys_long)

Note that now the NA genera are included in the long format data frame. Pivoting wider and then longer can be a useful way to balance out a dataset so that every replicate has the same composition

We could also have used a specification for what columns to exclude. In this example, we will use all columns except plot_id for the names variable. By using the minus sign in the cols argument, we omit plot_id from being reshaped

R

surveys_wide %>%
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) %>%
  head()

Challenge

  1. Reshape the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.

R

surveys_wide_genera <- surveys %>%
  group_by(plot_id, year) %>%
  summarize(n_genera = n_distinct(genus)) %>%
  pivot_wider(names_from = year, values_from = n_genera)

OUTPUT

#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.

R

head(surveys_wide_genera)

Challenge (continued)

  1. Now take that data frame and pivot_longer() it, so each row is a unique plot_id by year combination.

R

surveys_wide_genera %>%
  pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)

Challenge (continued)

  1. The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use pivot_longer() to create a dataset where we have a names column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns will be part of the reshape.

R

surveys_long <- surveys %>%
  pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
  1. With this new data set, calculate the average of each measurement in each year for each different plot_type. Then pivot_wider() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the names and values columns for pivot_wider().

R

surveys_long %>%
  group_by(year, measurement, plot_type) %>%
  summarize(mean_value = mean(value, na.rm=TRUE)) %>%
  pivot_wider(names_from = measurement, values_from = mean_value)

OUTPUT

#> `summarise()` has grouped output by 'year', 'measurement'. You can override
#> using the `.groups` argument.

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data_raw folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data directory, so even if the files it contains are deleted, we can always re-generate them.

In preparation for our next lesson on plotting, we are going to prepare a cleaned up version of the data set that doesn’t include any missing data.

Let’s start by removing observations of animals for which weight and hindfoot_length are missing, or the sex has not been determined:

R

surveys_complete <- surveys %>%
  filter(!is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         !is.na(sex))                # remove missing sex

Because we are interested in plotting how species abundances have changed through time, we are also going to remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first we are going to create a data set that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species:

R

## Extract the most common species_id
species_counts <- surveys_complete %>%
    count(species_id) %>%
    filter(n >= 50)

## Only keep the most common species
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)

To make sure that everyone has the same data set, check that surveys_complete has 30463 rows and 13 columns by typing dim(surveys_complete).

Now that our data set is ready, we can save it as a CSV file in our data folder.

R

write_csv(surveys_complete, file = "data/surveys_complete.csv")

Key Points

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use mutate() to create new variables.
  • Use group_by() and summarize() to work with subsets of data.

Content from Data visualization with ggplot2


Last updated on 2024-11-19 | Edit this page


Overview

Questions

  • How do you make plots using R?
  • How do you customize and modify plots?

Objectives

  • Produce scatter plots, boxplots, and time series plots using ggplot.
  • Set universal plot settings.
  • Describe what faceting is and apply faceting in ggplot.
  • Modify the aesthetics of an existing ggplot plot (including axis labels and color).
  • Build complex and customized plots from data in a data frame.

We start by loading the required packages. ggplot2 is included in the tidyverse package.

R

library(tidyverse)

If not still in the workspace, load the data we saved in the previous lesson.

R

surveys_complete <- read_csv("data/surveys_complete.csv")

Plotting with ggplot2


ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 refers to the name of the package itself. When using the package we use the function ggplot() to generate the plots, and so references to using the function will be referred to as ggplot() and the package as a whole as ggplot2

ggplot2 plots work best with data in the ‘long’ format, i.e., a column for every variable, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2

ggplot graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()
  • use the ggplot() function and bind the plot to a specific data frame using the data argument

R

ggplot(data = surveys_complete)
  • define an aesthetic mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g., as x/y positions or characteristics such as size, shape, color, etc.

R

ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length))
  • add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:

    • geom_point() for scatter plots, dot plots, etc.
    • geom_boxplot() for, well, boxplots!
    • geom_line() for trend lines, time series, etc.

To add a geom to the plot use + operator. Because we have two continuous variables, let’s use geom_point() first:

R

ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot “templates” and conveniently explore different types of plots, so the above plot can also be generated with code like this:

R

# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete,
                       mapping = aes(x = weight, y = hindfoot_length))

# Draw the plot
surveys_plot +
    geom_point()

Notes

  • Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis you set up in aes().
  • You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function.
  • The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is added in the line before the other layer, ggplot2 will not add the new layer and will return an error message.
  • You may notice that we sometimes reference ‘ggplot2’ and sometimes ‘ggplot’. To clarify, ‘ggplot2’ is the name of the most recent version of the package. However, any time we call the function itself, it’s just called ‘ggplot’.
  • The previous version of the ggplot2 package, called ggplot, which also contained the ggplot() function is now unsupported and has been removed from CRAN in order to reduce accidental installations and further confusion.

R

# This is the correct syntax for adding layers
surveys_plot +
  geom_point()

# This will not add the new layer and will return an error message
surveys_plot
  + geom_point()

Challenge (optional)

Scatter plots can be useful exploratory tools for small datasets. For data sets with large numbers of observations, such as the surveys_complete data set, overplotting of points can be a limitation of scatter plots. One strategy for handling such settings is to use hexagonal binning of observations. The plot space is tessellated into hexagons. Each hexagon is assigned a color based on the number of observations that fall within its boundaries. To use hexagonal binning with ggplot2, first install the R package hexbin from CRAN:

R

install.packages("hexbin")
library(hexbin)

Then use the geom_hex() function:

R

surveys_plot +
 geom_hex()
  • What are the relative strengths and weaknesses of a hexagonal bin plot compared to a scatter plot? Examine the above scatter plot and compare it with the hexagonal bin plot that you created.

Building your plots iteratively


Building plots with ggplot2 is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom:

R

ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
    geom_point()

Then, we start modifying this plot to extract more information from it. For instance, we can add transparency (alpha) to avoid overplotting:

R

ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
    geom_point(alpha = 0.1)

We can also add colors for all the points:

R

ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
    geom_point(alpha = 0.1, color = "blue")

Or to color each species in the plot differently, you could use a vector as an input to the argument color. ggplot2 will provide a different color corresponding to different values in the vector. Here is an example where we color with species_id:

R

ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
    geom_point(alpha = 0.1, aes(color = species_id))

Challenge

Use what you just learned to create a scatter plot of weight over species_id with the plot types showing in different colors. Is this a good way to show this type of data?

R

ggplot(data = surveys_complete,
       mapping = aes(x = species_id, y = weight)) +
   geom_point(aes(color = plot_type))

Boxplot


We can use boxplots to visualize the distribution of weight within each species:

R

ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
    geom_boxplot()

By adding points to the boxplot, we can have a better idea of the number of measurements and of their distribution. Because the boxplot will show the outliers by default these points will be plotted twice – by geom_boxplot and geom_jitter. To avoid this we must specify that no outliers should be added to the boxplot by specifying outlier.shape = NA.

R

ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(alpha = 0.3, color = "tomato")

Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it’s not hidden?

Challenges

Boxplots are useful summaries, but hide the shape of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn.

  • Replace the box plot with a violin plot; see geom_violin().

R

ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_violin() 

Challenges (continued)

In many types of data, it is important to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making these modifications:

  • Represent weight on the log10 scale; see scale_y_log10().

R

ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
scale_y_log10() +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)

Challenges (continued)

So far, we’ve looked at the distribution of weight within species. Try making a new plot to explore the distribution of another variable within each species.

  • Create boxplot for hindfoot_length. Overlay the boxplot layer on a jitter layer to show actual measurements.

R

ggplot(data = surveys_complete, mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)

Challenges (continued)

  • Add color to the data points on your boxplot according to the plot from which the sample was taken (plot_id).

Hint: Check the class for plot_id. Consider changing the class of plot_id from integer to factor. Why does this change how R makes the graph?

Plotting time series data


Let’s calculate number of counts per year for each genus. First we need to group the data and count records within each group:

R

yearly_counts <- surveys_complete %>%
  count(year, genus)

Timelapse data can be visualized as a line plot with years on the x-axis and counts on the y-axis:

R

ggplot(data = yearly_counts, aes(x = year, y = n)) +
     geom_line()

Unfortunately, this does not work because we plotted data for all the genera together. We need to tell ggplot to draw a line for each genus by modifying the aesthetic function to include group = genus:

R

ggplot(data = yearly_counts, aes(x = year, y = n, group = genus)) +
    geom_line()

We will be able to distinguish genera in the plot if we add colors (using color also automatically groups the data):

R

ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
    geom_line()

Integrating the pipe operator with ggplot2


In the previous lesson, we saw how to use the pipe operator %>% to use different functions in a sequence and create a coherent workflow. We can also use the pipe operator to pass the data argument to the ggplot() function. The hard part is to remember that to build your ggplot, you need to use + and not %>%.

R

yearly_counts %>%
    ggplot(mapping = aes(x = year, y = n, color = genus)) +
    geom_line()

The pipe operator can also be used to link data manipulation with consequent data visualization.

R

yearly_counts_graph <- surveys_complete %>%
    count(year, genus) %>%
    ggplot(mapping = aes(x = year, y = n, color = genus)) +
    geom_line()

yearly_counts_graph

Faceting


ggplot has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset. We will use it to make a time series plot for each genus:

R

ggplot(data = yearly_counts, aes(x = year, y = n)) +
    geom_line() +
    facet_wrap(facets = vars(genus))

Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the data frame grouped by year, genus, and sex:

R

 yearly_sex_counts <- surveys_complete %>%
                      count(year, genus, sex)

We can now make the faceted plot by splitting further by sex using color (within a single plot):

R

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(facets =  vars(genus))

We can also facet both by sex and genus:

R

ggplot(data = yearly_sex_counts,
       mapping = aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_grid(rows = vars(sex), cols =  vars(genus))

You can also organise the panels only by rows (or only by columns):

R

# One column, facet by rows
ggplot(data = yearly_sex_counts,
       mapping = aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_grid(rows = vars(genus))

R

# One row, facet by column
ggplot(data = yearly_sex_counts,
       mapping = aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_grid(cols = vars(genus))

Note: ggplot2 before version 3.0.0 used formulas to specify how plots are faceted. If you encounter facet_grid/wrap(...) code containing ~, please read https://ggplot2.tidyverse.org/news/#tidy-evaluation.

ggplot2 themes


Usually plots with white background look more readable when printed. Every single component of a ggplot graph can be customized using the generic theme() function, as we will see below. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.

For example, we can change our previous graph to have a simpler white background using the theme_bw() function:

R

 ggplot(data = yearly_sex_counts,
        mapping = aes(x = year, y = n, color = sex)) +
     geom_line() +
     facet_wrap(vars(genus)) +
     theme_bw()

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.

The ggthemes package provides a wide variety of options.

Challenge

Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.

R

yearly_weight <- surveys_complete %>%
                group_by(year, species_id) %>%
                 summarize(avg_weight = mean(weight))

OUTPUT

#> `summarise()` has grouped output by 'year'. You can override using the
#> `.groups` argument.

R

ggplot(data = yearly_weight, mapping = aes(x=year, y=avg_weight)) +
   geom_line() +
   facet_wrap(vars(species_id)) +
   theme_bw()

Customization


Take a look at the ggplot2 cheat sheet, and think of ways you could improve the plot.

Now, let’s change names of axes to something more informative than ‘year’ and ‘n’ and add a title to the figure:

R

ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
    geom_line() +
    facet_wrap(vars(genus)) +
    labs(title = "Observed genera through time",
         x = "Year of observation",
         y = "Number of individuals") +
    theme_bw()

The axes have more informative names, but their readability can be improved by increasing the font size. This can be done with the generic theme() function:

R

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
    geom_line() +
    facet_wrap(vars(genus)) +
    labs(title = "Observed genera through time",
        x = "Year of observation",
        y = "Number of individuals") +
    theme_bw() +
    theme(text=element_text(size = 16))

Note that it is also possible to change the fonts of your plots. If you are on Windows, you may have to install the extrafont package, and follow the instructions included in the README for this package.

After our manipulations, you may notice that the values on the x-axis are still not properly readable. Let’s change the orientation of the labels and adjust them vertically and horizontally so they don’t overlap. You can use a 90 degree angle, or experiment to find the appropriate angle for diagonally oriented labels. We can also modify the facet label text (strip.text) to italicize the genus names:

R

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
    geom_line() +
    facet_wrap(vars(genus)) +
    labs(title = "Observed genera through time",
        x = "Year of observation",
        y = "Number of individuals") +
    theme_bw() +
    theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = 0.5, vjust = 0.5),
                        axis.text.y = element_text(colour = "grey20", size = 12),
                        strip.text = element_text(face = "italic"),
                        text = element_text(size = 16))

If you like the changes you created better than the default theme, you can save them as an object to be able to easily apply them to other plots you may create:

R

grey_theme <- theme(axis.text.x = element_text(colour="grey20", size = 12,
                                               angle = 90, hjust = 0.5,
                                               vjust = 0.5),
                    axis.text.y = element_text(colour = "grey20", size = 12),
                    text=element_text(size = 16))

ggplot(surveys_complete, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot() +
    grey_theme

Challenge

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet for inspiration.

Here are some ideas:

  • See if you can change the thickness of the lines.
  • Can you find a way to change the name of the legend? What about its labels?
  • Try using a different color palette (see https://r-graphics.org/chapter-colors).

Arranging plots


Faceting is a great tool for splitting one plot into multiple plots, but sometimes you may want to produce a single figure that contains multiple plots using different variables or even different data frames. The patchwork package allows us to combine separate ggplots into a single figure while keeping everything aligned properly. Like most R packages, we can install patchwork from CRAN, the R package repository:

R

install.packages("patchwork")

After you have loaded the patchwork package you can use + to place plots next to each other, / to arrange them vertically, and plot_layout() to determine how much space each plot uses:

R

library(patchwork)

plot_weight <- ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
  geom_boxplot() +
  labs(x = "Species", y = expression(log[10](Weight))) +
  scale_y_log10()

plot_count <- ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
  geom_line() +
  labs(x = "Year", y = "Abundance")

plot_weight / plot_count + plot_layout(heights = c(3, 2))

You can also use parentheses () to create more complex layouts. There are many useful examples on the patchwork website

Exporting plots


After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters. The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Instead, use the ggsave() function, which allows you to easily change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi):

R

my_plot <- ggplot(data = yearly_sex_counts,
                  aes(x = year, y = n, color = sex)) +
    geom_line() +
    facet_wrap(vars(genus)) +
    labs(title = "Observed genera through time",
        x = "Year of observation",
        y = "Number of individuals") +
    theme_bw() +
    theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90,
                                     hjust = 0.5, vjust = 0.5),
          axis.text.y = element_text(colour = "grey20", size = 12),
          text = element_text(size = 16))

ggsave("name_of_file.png", my_plot, width = 15, height = 10)

## This also works for plots combined with patchwork
plot_combined <- plot_weight / plot_count + plot_layout(heights = c(3, 2))
ggsave("plot_combined.png", plot_combined, width = 10, dpi = 300)

Note: The parameters width and height also determine the font size in the saved plot.

Key Points

  • start simple and build your plots iteratively
  • the ggplot() function initiates a plot, and geom_ functions add representations of your data
  • use aes() when mapping a variable from the data to a part of the plot
  • use facet_ to partition a plot into multiple plots based on a factor included in the dataset
  • use premade theme_ functions to broadly change appearance, and the theme() function to fine-tune
  • the patchwork library can combine separate plots into a single figure
  • use ggsave() to save plots in your favorite format and dimensions

Content from SQL databases and R


Last updated on 2024-11-19 | Edit this page


Overview

Questions

  • How can I connect and get data from a database in R?
  • How can I retrieve data from multiple database tables?
  • How can I create an SQLite database?

Objectives

  • Access a database from R.
  • Run SQL queries in R using RSQLite and dplyr.
  • Describe the lazy behavior of dplyr on data stored in a database outside of R.
  • Prototype queries and retrieve all final results.
  • Create complex queries across one or multiple database tables.
  • Create an SQLite database from existing .csv files.

Introduction


So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole? In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the chunks needed for the current analysis.

Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.

R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., RSQLite, RMySQL, etc). Furthermore, the dplyr package you used in the previous chapter, in conjunction with dbplyr supports connecting to the widely-used open source databases sqlite, mysql and postgresql, as well as Google’s bigquery, and it can also be extended to other database types (a vignette in the dplyr package explains how to do it). RStudio has created a website that provides documentation and best practices to work on database interfaces.

Interfacing with databases using dplyr focuses on retrieving and analyzing datasets by generating SELECT SQL statements, but it doesn’t modify the database itself. dplyr does not offer functions to UPDATE or DELETE entries. If you need these functionalities, you will need to use additional R packages (e.g., RSQLite). Here we will demonstrate how to interact with a database using dplyr, using both the dplyr’s verb syntax and the SQL syntax.

The portal_mammals database

We will continue to explore the surveys data you are already familiar with from previous lessons. First, we are going to install the dbplyr package:

R

install.packages(c("dbplyr", "RSQLite"))

The SQLite database is contained in a single file portal_mammals.sqlite that you generated during the SQL lesson. If you don’t have it, you can download it from Figshare into the data_raw subdirectory using:

R

dir.create("data_raw", showWarnings = FALSE)
download.file(url = "https://ndownloader.figshare.com/files/2292171",
              destfile = "data_raw/portal_mammals.sqlite", mode = "wb")

Connecting to databases


We can point R to this database using:

R

library(dplyr)
library(dbplyr)

OUTPUT

#>
#> Attaching package: 'dbplyr'

OUTPUT

#> The following objects are masked from 'package:dplyr':
#>
#>     ident, sql

R

mammals <- DBI::dbConnect(RSQLite::SQLite(), "data_raw/portal_mammals.sqlite")

This command uses 2 packages that helps dbplyr and dplyr talk to the SQLite database. DBI is not something that you’ll use directly as a user. It allows R to send commands to databases irrespective of the database management system used. The RSQLite package allows R to interface with SQLite databases.

This command does not load the data into the R session (as the read_csv() function did). Instead, it merely instructs R to connect to the SQLite database contained in the portal_mammals.sqlite file.

Using a similar approach, you could connect to many other database management systems that are supported by R including MySQL, PostgreSQL, BigQuery, etc.

Let’s take a closer look at the mammals database we just connected to:

R

src_dbi(mammals)

OUTPUT

#> src:  sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#> tbls: plots, species, surveys

Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case three of them are listed in the tbls row in the output above:

  • plots
  • species
  • surveys

Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.

Querying the database with the SQL syntax

To connect to tables within a database, you can use the tbl() function from dplyr. This function can be used to send SQL queries to the database. To demonstrate this functionality, let’s select the columns “year”, “species_id”, and “plot_id” from the surveys table:

R

tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))

With this approach you can use any of the SQL queries we have seen in the database lesson.

Querying the database with the dplyr syntax

One of the strengths of dplyr is that the same operation can be done using dplyr’s verbs instead of writing SQL. First, we select the table on which to do the operations by creating the surveys object, and then we use the standard dplyr syntax as if it were a data frame:

R

surveys <- tbl(mammals, "surveys")
surveys %>%
    select(year, species_id, plot_id)

In this case, the surveys object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. For instance, the head() function can be used to check the first 10 rows of the table:

R

head(surveys, n = 10)

OUTPUT

#> # Source:   SQL [10 x 9]
#> # Database: sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#>    record_id month   day  year plot_id species_id sex   hindfoot_length weight
#>        <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
#>  1         1     7    16  1977       2 NL         M                  32     NA
#>  2         2     7    16  1977       3 NL         M                  33     NA
#>  3         3     7    16  1977       2 DM         F                  37     NA
#>  4         4     7    16  1977       7 DM         M                  36     NA
#>  5         5     7    16  1977       3 DM         M                  35     NA
#>  6         6     7    16  1977       1 PF         M                  14     NA
#>  7         7     7    16  1977       2 PE         F                  NA     NA
#>  8         8     7    16  1977       1 DM         M                  37     NA
#>  9         9     7    16  1977       1 DM         F                  34     NA
#> 10        10     7    16  1977       6 PF         F                  20     NA

This output of the head command looks just like a regular data.frame: The table has 9 columns and the head() command shows us the first 10 rows. Note that the columns plot_type, taxa, genus, and species are missing. These are now located in the tables plots and species which we will join together in a moment.

However, some functions don’t work quite as expected. For instance, let’s check how many rows there are in total using nrow():

R

nrow(surveys)

OUTPUT

#> [1] NA

That’s strange - R doesn’t know how many rows the surveys table contains - it returns NA instead. You might have already noticed that the first line of the head() output included ?? indicating that the number of rows wasn’t known.

The reason for this behavior highlights a key difference between using dplyr on datasets in memory (e.g. loaded into your R session via read_csv()) and those provided by a database. To understand it, we take a closer look at how dplyr communicates with our SQLite database.

SQL translation

Relational databases typically use a special-purpose language, Structured Query Language (SQL), to manage and query data.

For example, the following SQL query returns the first 10 rows from the surveys table:

SQL

SELECT *
FROM `surveys`
LIMIT 10

Behind the scenes, dplyr:

  1. translates your R code into SQL
  2. submits it to the database
  3. translates the database’s response into an R data frame

To lift the curtain, we can use dplyr’s show_query() function to show which SQL commands are actually sent to the database:

R

show_query(head(surveys, n = 10))

The output shows the actual SQL query sent to the database; it matches our manually constructed SELECT statement above.

Instead of having to formulate the SQL query ourselves - and having to mentally switch back and forth between R and SQL syntax - we can delegate this translation to dplyr. (You don’t even need to know SQL to interact with a database via dplyr!)

dplyr, in turn, doesn’t do the real work of subsetting the table, either. Instead, it merely sends the query to the database, waits for its response and returns it to us.

That way, R never gets to see the full surveys table - and that’s why it could not tell us how many rows it contains. On the bright side, this allows us to work with large datasets - even too large to fit into our computer’s memory.

dplyr can translate many different query types into SQL allowing us to, e.g., select() specific columns, filter() rows, or join tables.

To see this in action, let’s compose a few queries with dplyr.

Simple database queries


First, let’s only request rows of the surveys table in which weight is less than 5 and keep only the species_id, sex, and weight columns.

R

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

OUTPUT

#> # Source:   SQL [?? x 3]
#> # Database: sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#>    species_id sex   weight
#>    <chr>      <chr>  <int>
#>  1 PF         M          4
#>  2 PF         F          4
#>  3 PF         <NA>       4
#>  4 PF         F          4
#>  5 PF         F          4
#>  6 RM         M          4
#>  7 RM         F          4
#>  8 RM         M          4
#>  9 RM         M          4
#> 10 RM         M          4
#> # ℹ more rows

Executing this command will return a table with 10 rows and the requested species_id, sex and weight columns. Great!

… but wait, why are there only 10 rows?

The last line:

# ... with more rows

indicates that there are more results that fit our filtering criterion. Why was R lazy and only retrieved 10 of them?

Laziness


Hadley Wickham, the author of dplyr explains:

When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.

When you construct a dplyr query, you can connect multiple verbs into a single pipeline. For example, we combined the filter() and select() verbs using the %>% pipe.

If we wanted to, we could add on even more steps, e.g. remove the sex column in an additional select call:

R

data_subset <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

data_subset %>%
  select(-sex)

OUTPUT

#> # Source:   SQL [?? x 2]
#> # Database: sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#>    species_id weight
#>    <chr>       <int>
#>  1 PF              4
#>  2 PF              4
#>  3 PF              4
#>  4 PF              4
#>  5 PF              4
#>  6 RM              4
#>  7 RM              4
#>  8 RM              4
#>  9 RM              4
#> 10 RM              4
#> # ℹ more rows

Just like the first select(species_id, sex, weight) call, the select(-sex) command is not executed by R. It is sent to the database instead. Only the final result is retrieved and displayed to you.

Of course, we could always add on more steps, e.g., we could filter by species_id or minimum weight. That’s why R doesn’t retrieve the full set of results - instead it only retrieves the first 10 results from the database by default. (After all, you might want to add an additional step and get the database to do more work…)

To instruct R to stop being lazy, e.g. to retrieve all of the query results from the database, we add the collect() command to our pipe. It indicates that our database query is finished: time to get the final results and load them into the R session.

R

data_subset <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>%
  collect()

Now we have all 17 rows that match our query in a data.frame and can continue to work with them exclusively in R, without communicating with the database.

Complex database queries


dplyr enables database queries across one or multiple database tables, using the same single- and multiple-table verbs you encountered previously. This means you can use the same commands regardless of whether you interact with a remote database or local dataset! This is a really useful feature if you work with large datasets: you can first prototype your code on a small subset that fits into memory, and when your code is ready, you can change the input dataset to your full database without having to change the syntax.

On the other hand, being able to use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.

To illustrate how to use dplyr with these complex queries, we are going to join the plots and surveys tables. The plots table in the database contains information about the different plots surveyed by the researchers. To access it, we point the tbl() command to it:

R

plots <- tbl(mammals, "plots")
plots

OUTPUT

#> # Source:   table<`plots`> [?? x 2]
#> # Database: sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#>    plot_id plot_type
#>      <int> <chr>
#>  1       1 Spectab exclosure
#>  2       2 Control
#>  3       3 Long-term Krat Exclosure
#>  4       4 Control
#>  5       5 Rodent Exclosure
#>  6       6 Short-term Krat Exclosure
#>  7       7 Rodent Exclosure
#>  8       8 Control
#>  9       9 Spectab exclosure
#> 10      10 Rodent Exclosure
#> # ℹ more rows

The plot_id column also features in the surveys table:

R

surveys

OUTPUT

#> # Source:   table<`surveys`> [?? x 9]
#> # Database: sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data_raw/portal_mammals.sqlite]
#>    record_id month   day  year plot_id species_id sex   hindfoot_length weight
#>        <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
#>  1         1     7    16  1977       2 NL         M                  32     NA
#>  2         2     7    16  1977       3 NL         M                  33     NA
#>  3         3     7    16  1977       2 DM         F                  37     NA
#>  4         4     7    16  1977       7 DM         M                  36     NA
#>  5         5     7    16  1977       3 DM         M                  35     NA
#>  6         6     7    16  1977       1 PF         M                  14     NA
#>  7         7     7    16  1977       2 PE         F                  NA     NA
#>  8         8     7    16  1977       1 DM         M                  37     NA
#>  9         9     7    16  1977       1 DM         F                  34     NA
#> 10        10     7    16  1977       6 PF         F                  20     NA
#> # ℹ more rows

Because plot_id is listed in both tables, we can use it to look up matching records, and join the two tables.

If we have two tables named x and y with a common column called “ID”, we can join them using ‘join’ functions, two of which are described and illustrated below.

  1. inner_join() : This returns all rows from x where there are matching values in y, and all columns from x and y.

  2. left_join() : This return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.

In both forms of join, if there are multiple matches between x and y, all combinations of the matches are returned. For the full list of ‘join’ functions, check out the tidyverse join page.

In our example, the two tables we want to join are ‘plots’ and ‘surveys’.

diagram illustrating inner and left joins

For example, to extract all surveys for the first plot, which has plot_id 1, we can do:

R

plots %>%
  filter(plot_id == 1) %>%
  inner_join(surveys) %>%
  collect()

OUTPUT

#> Joining with `by = join_by(plot_id)`

OUTPUT

#> # A tibble: 1,995 × 10
#>    plot_id plot_type         record_id month   day  year species_id sex
#>      <int> <chr>                 <int> <int> <int> <int> <chr>      <chr>
#>  1       1 Spectab exclosure         6     7    16  1977 PF         M
#>  2       1 Spectab exclosure         8     7    16  1977 DM         M
#>  3       1 Spectab exclosure         9     7    16  1977 DM         F
#>  4       1 Spectab exclosure        78     8    19  1977 PF         M
#>  5       1 Spectab exclosure        80     8    19  1977 DS         M
#>  6       1 Spectab exclosure       218     9    13  1977 PF         M
#>  7       1 Spectab exclosure       222     9    13  1977 DS         M
#>  8       1 Spectab exclosure       239     9    13  1977 DS         M
#>  9       1 Spectab exclosure       263    10    16  1977 DM         M
#> 10       1 Spectab exclosure       270    10    16  1977 DM         F
#> # ℹ 1,985 more rows
#> # ℹ 2 more variables: hindfoot_length <int>, weight <int>

Important Note: Without the collect() statement, only the first 10 matching rows are returned. By adding collect(), the full set of 1,985 is retrieved.

Challenge

Write a query that returns the number of rodents observed in each plot in each year.

Hint: Connect to the species table and write a query that joins the species and survey tables together to exclude all non-rodents. The query should return counts of rodents by year.

Optional: Write a query in SQL that will produce the same result. You can join multiple tables together using the following syntax where foreign key refers to your unique id (e.g., species_id):

SQL

SELECT table.col, table.col
FROM table1 JOIN table2
ON table1.key = table2.key
JOIN table3 ON table2.key = table3.key

R

## with dplyr syntax
species <- tbl(mammals, "species")

left_join(surveys, species) %>%
  filter(taxa == "Rodent") %>%
  group_by(taxa, year, plot_id) %>%
  tally() %>%
  collect()

OUTPUT

#> Joining with `by = join_by(species_id)`

R

## with SQL syntax
query <- paste("
SELECT a.year, b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
GROUP BY b.taxa, a.year, a.plot_id",
sep = "" )

tbl(mammals, sql(query))

Challenge

Write a query that returns the total number of rodents in each genus caught in the different plot types.

Hint: Write a query that joins the species, plot, and survey tables together. The query should return counts of genus by plot type.

R

species <- tbl(mammals, "species")
genus_counts <- left_join(surveys, plots) %>%
  left_join(species) %>%
  filter(taxa == "Rodent") %>%
  group_by(plot_type, genus) %>%
  tally() %>%
  collect()

This is useful if we are interested in estimating the number of individuals belonging to each genus found in each plot type. But what if we were interested in the number of genera found in each plot type? Using tally() gives the number of individuals, instead we need to use n_distinct() to count the number of unique values found in a column.

R

species <- tbl(mammals, "species")
unique_genera <- left_join(surveys, plots) %>%
    left_join(species) %>%
    group_by(plot_type) %>%
    summarize(
        n_genera = n_distinct(genus)
    ) %>%
    collect()

OUTPUT

#> Joining with `by = join_by(plot_id)`
#> Joining with `by = join_by(species_id)`

n_distinct, like the other dplyr functions we have used in this lesson, works not only on database connections but also on regular data frames.

Creating a new SQLite database


So far, we have used a previously prepared SQLite database. But we can also use R to create a new database, e.g. from existing csv files. Let’s recreate the mammals database that we’ve been working with, in R. First let’s download and read in the csv files. We’ll import tidyverse to gain access to the read_csv() function.

R

download.file("https://ndownloader.figshare.com/files/3299483",
              "data_raw/species.csv")
download.file("https://ndownloader.figshare.com/files/10717177",
              "data_raw/surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474",
              "data_raw/plots.csv")
library(tidyverse)
species <- read_csv("data_raw/species.csv")

OUTPUT

#> Rows: 54 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): species_id, genus, species, taxa
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R

surveys <- read_csv("data_raw/surveys.csv")

OUTPUT

#> Rows: 35549 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): species_id, sex
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R

plots <- read_csv("data_raw/plots.csv")

OUTPUT

#> Rows: 24 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): plot_type
#> dbl (1): plot_id
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Also, you can create new SQLite database with dplyr by adding an argument to the same command we used above to open an existing .sqlite file. The create = TRUE argument instructs R to create a new, empty database instead.

Caution: When create = TRUE is added, any existing database at the same location is overwritten without warning.

R

my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)

WARNING

#> Warning: `src_sqlite()` was deprecated in dplyr 1.0.0.
#> ℹ Please use `tbl()` directly with a database connection
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Currently, our new database is empty, it doesn’t contain any tables:

R

my_db

OUTPUT

#> src:  sqlite 3.46.0 [/home/runner/work/R-ecology-lesson-previous/R-ecology-lesson-previous/site/built/data/portal-database-output.sqlite]
#> tbls:

To add tables, we copy the existing data.frames into the database one by one:

R

copy_to(my_db, surveys)
copy_to(my_db, plots)
my_db

If you check the location of our database you’ll see that data is automatically being written to disk. R and dplyr not only provide ways to query existing databases, they also provide functionality to create your own databases from flat files!

Challenge

Add the remaining species table to the my_db database and run some of your queries from earlier in the lesson to verify that you have faithfully recreated the mammals database.

Note: In this example, we first loaded all of the data into the R session by reading the three csv files. Because all the data has to flow through R, this is not suitable for very large datasets.

Note: Finally, to close the connection to the mammals database you may use DBI::dbDisconnect(mammals); this discards all pending work and frees resources, e.g. memory.

Key Points

  • tbl connects to a database and can send SQL queries.
  • use dplyr syntax to extract information from SQL tables.
  • dplyr laziness only pulls the needed information, speeding up data retrieval.
  • use src_sqlite() to create a new empty SQLite database and copy_to() to add data to it.