Before we start
- R is a programming language and RStudio is the IDE that assists in using R.
- There are many benefits to learning R, including the ability to write reproducible code, the ability to use a variety of datasets, and a broad, open-source community of practitioners.
- Files related to analysis should be organized within a single working directory.
- R uses commands containing functions to tell the computer what to do.
- Documentation for each function is available within RStudio, or users can ask for help from one of many online forums, cheatsheets, or email lists.
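As a minimal sketch of the points above, the following shows a command built around a function call, the working directory R resolves file paths against, and two ways to look up what a function expects. The object name `rounded` is chosen only for illustration.

```r
# A command that calls a function: round() with one named argument.
rounded <- round(3.14159, digits = 2)
print(rounded)

# Show the working directory that relative file paths are resolved against.
print(getwd())

# Inspect a function's arguments without leaving the console.
print(args(round))

# In an interactive session, either of these opens the help page:
# ?round
# help("round")
```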
Introduction to R
- ‘<-’ is used to assign values on the right to objects on the left.
- Code should be saved within the Source pane in RStudio to help you return to your code later.
- ‘#’ can be used to add comments to your code.
- Functions can automate more complicated sets of commands, and require arguments as inputs.
- Vectors are composed of a series of values and can take many forms.
- Data structures in R include ‘vector’, ‘list’, ‘matrix’, ‘data.frame’, ‘factor’, and ‘array’.
- Vectors can be subset by indexing or through logical vectors.
- Several functions help remove missing data from data structures, for example na.omit(), or is.na() combined with subsetting.
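The points in this section can be sketched in a few lines. The vectors and the helper function `g_to_kg()` below are hypothetical and exist only to illustrate assignment, comments, subsetting, and handling of missing values.

```r
# Assign the values on the right to the objects on the left.
weights_g <- c(50, 60, NA, 82)                 # numeric vector with one missing value
animals   <- c("mouse", "rat", "dog", "cat")   # character vector

# Subset by index (R counts from 1).
first_two <- animals[c(1, 2)]

# Subset with a logical test; which() skips the NA produced by the comparison.
heavy <- weights_g[which(weights_g > 55)]

# Two ways to deal with missing data.
mean_weight <- mean(weights_g, na.rm = TRUE)   # ignore NA inside a function
complete    <- weights_g[!is.na(weights_g)]    # drop NA from the vector

# A small user-defined function (hypothetical) that takes arguments as inputs.
g_to_kg <- function(grams, digits = 2) {
  round(grams / 1000, digits = digits)
}
weights_kg <- g_to_kg(complete)
```

Using `which()` around the logical test is a design choice: without it, the `NA` comparison would carry an `NA` into the result.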
Starting with data
- Use read.csv to read tabular data in R.
- A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length.
- dplyr provides many methods for inspecting and summarizing data in data frames.
- Use factors to represent categorical data in R.
- The lubridate package has many useful functions for working with dates.
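A short sketch of these ideas follows. It assumes the lubridate package is installed, and the survey records are made up and inlined as text so that no file on disk is needed; with a real file, the path would be passed to `read.csv()` instead.

```r
library(lubridate)

# Hypothetical survey records, inlined so the example needs no file on disk.
csv_text <- "record_id,year,month,day,sex,weight
1,1977,7,16,M,40
2,1977,7,16,F,48
3,1978,1,8,M,29"

surveys <- read.csv(text = csv_text)

# Every column is a vector of the same length (one entry per row).
str(surveys)
n_rows <- nrow(surveys)

# Represent a categorical column as a factor.
surveys$sex <- factor(surveys$sex)
sex_levels <- levels(surveys$sex)

# Build a Date column from the year, month, and day columns with lubridate.
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
```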
Manipulating, analyzing and exporting data with tidyverse
Data manipulation using dplyr and tidyr
Exporting data
- Use the dplyr package to manipulate data frames.
- Use select() to choose variables from a data frame.
- Use filter() to choose data based on values.
- Use mutate() to create new variables.
- Use group_by() and summarize() to work with subsets of data.
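The verbs above chain together naturally with the pipe. The sketch below assumes dplyr is installed and uses a small made-up data frame; the column names and the 7.5 g cutoff are illustrative only. The final step exports the result with base R's `write.csv()`, which is one option among several.

```r
library(dplyr)

# Hypothetical measurements used only for illustration.
surveys <- data.frame(
  species_id = c("DM", "DM", "PF", "PF", "DO"),
  sex        = c("M", "F", "F", "M", "M"),
  weight     = c(40, 48, 8, 7, 52)
)

mean_weights <- surveys %>%
  select(species_id, weight) %>%          # keep only these columns
  filter(weight > 7.5) %>%                # keep rows that pass the test
  mutate(weight_kg = weight / 1000) %>%   # add a derived column
  group_by(species_id) %>%                # one group per species
  summarize(mean_weight_kg = mean(weight_kg), n = n())

print(mean_weights)

# Export the summary to a temporary CSV file.
out_file <- file.path(tempdir(), "mean_weights.csv")
write.csv(mean_weights, out_file, row.names = FALSE)
```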
Data visualization with ggplot2
- Start simple and build your plots iteratively.
- The ggplot() function initiates a plot, and geom_ functions add representations of your data.
- Use aes() when mapping a variable from the data to a part of the plot.
- Use facet_ functions to partition a plot into multiple plots based on a factor included in the dataset.
- Use premade theme_ functions to broadly change appearance, and the theme() function to fine-tune.
- The patchwork library can combine separate plots into a single figure.
- Use ggsave() to save plots in your favorite format and dimensions.
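One way to read these points is as a sequence of small steps, each adding a layer to the same plot object. The sketch below assumes ggplot2 is installed; the data frame, the column names, and the output file name are made up for illustration. Combining plots with patchwork is left out to keep the dependencies small.

```r
library(ggplot2)

# Hypothetical measurements used only for illustration.
surveys <- data.frame(
  weight          = c(40, 48, 8, 7, 52, 45),
  hindfoot_length = c(35, 37, 16, 15, 36, 34),
  sex             = c("M", "F", "F", "M", "M", "F"),
  species_id      = c("DM", "DM", "PF", "PF", "DO", "DO")
)

# Step 1: initiate the plot and map variables to aesthetics.
p <- ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length))

# Step 2: add a geom to represent the data.
p <- p + geom_point(aes(color = species_id))

# Step 3: split the plot into one panel per level of sex.
p <- p + facet_wrap(vars(sex))

# Step 4: apply a premade theme, then fine-tune a single element.
p <- p + theme_bw() + theme(legend.position = "bottom")

# Save to a temporary file; width and height are in inches by default.
out_file <- file.path(tempdir(), "weight_vs_hindfoot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

The file extension passed to `ggsave()` selects the output format, so changing `.pdf` to `.png` would produce a PNG instead.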
SQL databases and R
- tbl connects to a database and can send SQL queries.
- Use dplyr syntax to extract information from SQL tables.
- dplyr laziness only pulls the needed information, speeding up data retrieval.
- Use src_sqlite() to create a new empty SQLite database and copy_to() to add data to it.
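The workflow can be sketched end to end with an in-memory database. This example assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed, and it opens the connection with `DBI::dbConnect()` rather than `src_sqlite()`, a substitution made here on the assumption that newer dbplyr releases prefer a plain DBI connection. The table contents are made up.

```r
library(dplyr)

# Create a new, empty SQLite database held in memory.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical records used only for illustration.
surveys <- data.frame(
  species_id = c("DM", "DM", "PF", "DO"),
  weight     = c(40, 48, 8, 52)
)

# copy_to() adds the data frame to the database as a table.
copy_to(con, surveys, name = "surveys", temporary = FALSE)

# tbl() creates a lazy reference to the table; no rows are pulled yet.
surveys_db <- tbl(con, "surveys")

# dplyr verbs are translated to SQL and run inside the database.
heavy_query <- surveys_db %>%
  filter(weight > 45) %>%
  select(species_id, weight)

show_query(heavy_query)        # print the generated SQL
heavy <- collect(heavy_query)  # only now are the matching rows retrieved

# The same connection also accepts raw SQL.
n_rows <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM surveys")$n

DBI::dbDisconnect(con)
```

Calling `collect()` as the last step is what keeps retrieval cheap: filtering and column selection happen in the database, and only the result crosses into R.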