Chapter 4 Intro to the {tidyverse}

4.1 Why {tidyverse}?

The {tidyverse} is a collection of R packages that extend the functionality of base R. The packages are developed to simplify and accelerate data analysis with R, and they all share an underlying design philosophy, grammar, and data structures. You might say that the {tidyverse} equips R with superpowers. Some of the packages may be familiar, e.g. {ggplot2} or {tidyr}, and these may already be installed. But if you'd like to use several of them, you may as well install them all in one go:

# Install the tidyverse only if it cannot be loaded, then load it
if (!require(tidyverse)) {
  install.packages("tidyverse", repos = "https://cran.us.r-project.org")
  library(tidyverse)
}
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

4.2 Reading data from Excel

The {readxl} package, which is installed as part of the {tidyverse}, can be used to read data from Excel files. It is not attached by library(tidyverse), so it has to be loaded separately:

library(readxl)
df_xl <- read_excel("../datasets/allData.xlsx", sheet = 2)
## New names:
## * `` -> ...4
head(df_xl)
## # A tibble: 6 × 5
##   Hair  Eye   Sex   ...4   Freq
##   <chr> <chr> <chr> <lgl> <dbl>
## 1 <NA>  <NA>  <NA>  NA       NA
## 2 Black Brown Male  NA       32
## 3 Brown Brown Male  NA       53
## 4 Red   Brown Male  NA       10
## 5 Blond Brown Male  NA        3
## 6 Black Blue  Male  NA       11
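
When it is unclear which sheet number to pass, it can help to list the sheet names first. The following is a minimal sketch that assumes the example workbook bundled with {readxl} (located with readxl_example()) instead of the workshop file, so that it runs without the course datasets; the object names are only illustrative.

```r
library(readxl)

# Path to an example workbook that ships with {readxl} (not allData.xlsx)
path <- readxl_example("datasets.xlsx")

# List the sheet names, so a sheet can be chosen by position or by name
sheets <- excel_sheets(path)
print(sheets)

# Read the second sheet by position, as in the call above
second_sheet <- read_excel(path, sheet = 2)

# The same sheet can also be requested by its name
same_sheet <- read_excel(path, sheet = sheets[2])

head(second_sheet)
```

Selecting a sheet by name is usually less fragile than selecting it by position, because the call keeps working when sheets are reordered in the workbook.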

4.3 The pipe operator

Unlike the {openxlsx} package, {readxl} does not automatically detect and remove empty rows or columns. We can remove them ourselves by adding a function with the pipe operator %>%, which is used throughout the tidyverse. This operator takes the result of one function and feeds it into the next function as its first argument. The function that we will use is drop_na(), and we tell it to remove every row that has a missing value (NA) in the column Hair. The call below also passes skip = 1, which tells read_excel() to skip the first row of the sheet before it starts reading:

df_xl <- read_excel("../datasets/allData.xlsx", sheet = 2, skip = 1) %>% drop_na(Hair)
## New names:
## * `` -> ...4
head(df_xl)
## # A tibble: 6 × 5
##   Hair  Eye   Sex   ...4   Freq
##   <chr> <chr> <chr> <lgl> <dbl>
## 1 Black Brown Male  NA       32
## 2 Brown Brown Male  NA       53
## 3 Red   Brown Male  NA       10
## 4 Blond Brown Male  NA        3
## 5 Black Blue  Male  NA       11
## 6 Brown Blue  Male  NA       50
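
To make the role of the pipe explicit, the sketch below compares a piped call with the equivalent nested call. It assumes a small made-up table (toy) in place of the Excel sheet, so it runs without the workshop file.

```r
library(dplyr)  # provides %>% and tibble()
library(tidyr)  # provides drop_na()

# Small illustrative table with one completely empty row (made-up values)
toy <- tibble(
  Hair = c(NA, "Black", "Brown"),
  Freq = c(NA, 32, 53)
)

# Ordinary call: the data is the first argument of drop_na()
nested <- drop_na(toy, Hair)

# Piped call: %>% passes `toy` as the first argument of drop_na()
piped <- toy %>% drop_na(Hair)

# Both forms give the same table
identical(nested, piped)
```

The piped form reads from left to right in the order in which the steps are carried out, which is why it is preferred once several steps are chained.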

Several pipe operators can be chained to combine multiple functions in a single command. Here we add another function, select(), to get rid of the fourth column, since it is empty. The - indicates that column number 4 is excluded from the selection. The final head() prints only the first six rows; note that the result is printed here and not stored in a variable. Since commands become fairly long when multiple pipe operators are used, it is good practice to start each function on a new line:

read_excel("../datasets/allData.xlsx", sheet = 2, skip = 1) %>%
  drop_na(Hair) %>%
  select(-4) %>%
  head()
## New names:
## * `` -> ...4
## # A tibble: 6 × 4
##   Hair  Eye   Sex    Freq
##   <chr> <chr> <chr> <dbl>
## 1 Black Brown Male     32
## 2 Brown Brown Male     53
## 3 Red   Brown Male     10
## 4 Blond Brown Male      3
## 5 Black Blue  Male     11
## 6 Brown Blue  Male     50
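
A column can be dropped by name as well as by position. The sketch below assumes a small made-up table in which the hypothetical column Empty stands in for the unnamed fourth column of the Excel sheet.

```r
library(dplyr)

# Made-up table; `Empty` plays the role of the empty fourth column
toy <- tibble(
  Hair  = c("Black", "Brown"),
  Eye   = c("Brown", "Brown"),
  Sex   = c("Male", "Male"),
  Empty = c(NA, NA),
  Freq  = c(32, 53)
)

# Drop the fourth column by position, as in the pipeline above
by_position <- toy %>%
  select(-4)

# Drop the same column by name
by_name <- toy %>%
  select(-Empty)

names(by_position)
identical(by_position, by_name)
```

Dropping by name keeps working if the column order changes, whereas dropping by position is convenient when the column has no meaningful name.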

There are many more functions for data manipulation in the tidyverse, but in this workshop we will focus on the use of the {ggplot2} package for plotting.

4.4 Resources on tidyverse

Tidyverse website: https://www.tidyverse.org

R for Data Science (Hadley Wickham & Garrett Grolemund): https://r4ds.had.co.nz/index.html

A Modern Dive into R and the Tidyverse (Chester Ismay & Albert Y. Kim): https://moderndive.com