Chapter 3 Importing data - base R

In this chapter, we take a look at functions to import data into R.

3.1 Using base R

If we are relying only on base R, we need our data in common text formats, such as tab- or comma-separated files. It’s not unusual for laboratory machines, analysis programs, or bioinformaticians to supply data in such a format. If you’ve got data in Microsoft Excel, you can choose a comma-separated (.csv) or tab-delimited text format from the Save as... menu.

R’s base functions for reading files need to be told about the structure of the file; once they know it, they are actually pretty flexible about the input format. Here are some examples. Take a look at the files we import in a text editor, such as TextEdit or Notepad, to see what the original data looks like.

3.1.1 Tab-delimited data with a header and using standard decimal points

The first example is a pretty standard text file without any hidden problems. Remember, you can execute the chunks by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter (macOS) or Ctrl+Shift+Enter (Windows/Linux).

UA_v1 <- read.delim("../datasets/USArrests_tab.txt")
head(UA_v1)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7

3.1.2 Non-standard decimal points

If your computer is set up with Dutch regional settings, Excel most likely exports numbers with a comma instead of a point/full stop as the decimal separator. Here’s how to tell R about this:

UA_v2 <- read.delim("../datasets/USArrestsComma_tab.txt",dec=",")
head(UA_v2)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7

3.1.3 Comma-separated data with a header and using standard decimal points

Data with standard decimal points might also come separated by commas:

UA_v3 <- read.csv("../datasets/USArrests_comma.csv")
head(UA_v3)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7

In the chunks above, we used different functions to read the tab- and comma-separated files. This saves us some typing, because by default read.delim splits the input at tabs and read.csv splits it at commas. However, read.delim and read.csv (along with a whole bunch of other functions) belong to one family, built around read.table. They differ only in their defaults, so if you explicitly state the field separator and other formatting characteristics (like the decimal separator dec above), you can use these functions interchangeably. I find it a lot easier to remember the names of those arguments than the defaults of all the functions, so I will use read.delim for the rest of the notebook:

UA_v3b <- read.delim("../datasets/USArrests_comma.csv",sep=",")
head(UA_v3b)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7

3.1.4 Tables without headers

While not best practice, you sometimes get data without a set of column headers. If you read such a file with the default settings, the first row of data is mistaken for the header, as the mangled column names below show:

UA_v4 <- read.delim("../datasets/USArrests_commaNoHeader.csv",sep=",")
head(UA_v4)
##   X13.2 X236 X58 X21.2
## 1  10.0  263  48  44.5
## 2   8.1  294  80  31.0
## 3   8.8  190  50  19.5
## 4   9.0  276  91  40.6
## 5   7.9  204  78  38.7
## 6   3.3  110  77  11.1

You can overwrite the column names manually (though note that the first row of the original data has already been consumed as the header):

colnames(UA_v4) <- c("Murder", "Assault", "UrbanPop", "Rape")
head(UA_v4)
##   Murder Assault UrbanPop Rape
## 1   10.0     263       48 44.5
## 2    8.1     294       80 31.0
## 3    8.8     190       50 19.5
## 4    9.0     276       91 40.6
## 5    7.9     204       78 38.7
## 6    3.3     110       77 11.1
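To keep that first row as data, tell read.delim explicitly that there is no header; you can set the names in the same call via col.names. Here is a minimal, self-contained sketch using a temporary file whose two rows mirror the values above (the course dataset itself is not recreated here):

```r
# Write two rows of headerless, comma-separated data to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("13.2,236,58,21.2",
             "10.0,263,48,44.5"), tmp)

# header = FALSE keeps the first row as data;
# col.names supplies the column names directly
UA_v4b <- read.delim(tmp, sep = ",", header = FALSE,
                     col.names = c("Murder", "Assault", "UrbanPop", "Rape"))
UA_v4b
```

With header = FALSE (and no col.names), R would instead generate placeholder names V1, V2, and so on.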

3.1.5 Additional lines

Sometimes your data has some kind of preface that does not match the structure of the rest of the file. You can ignore it if it is marked with a specific character, like so:

UA_v5 <- read.delim("../datasets/USArrests_tabComments.txt",comment.char = "#")
head(UA_v5)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7

Such a comment character is also honoured further down in the file: everything from the character to the end of a line is ignored.

UA_v6 <- read.delim("../datasets/USArrests_tabManipulated.txt",comment.char = ";")
head(UA_v6)
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.2     113       65 21.0

There are many more options, e.g.:

  • skip : if you want to skip a certain number of lines at the top
  • quote : if fields with text are enclosed by quotation marks other than " or '
  • row.names : if one of the columns contains row names, which can be handled differently in R
  • na.strings : if missing data points are marked by something other than NA, e.g. "NaN" or "n.d."; you can also give several markers at once, e.g. c("NA", "n.d.", "NaN")
  • stringsAsFactors : if R should interpret columns containing words as factors instead of character vectors. Don’t worry if you don’t know the difference between factors and character vectors at this moment. But if you’re curious, take a look at the chapter on factors in the Advanced R book.
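Several of these options can be combined in a single call. A minimal, self-contained sketch using a temporary file (the preface lines, sample names, and values here are invented for illustration):

```r
# Write a small tab-delimited file with a two-line preface
# and a non-standard missing-value marker ("n.d.")
tmp <- tempfile(fileext = ".txt")
writeLines(c("exported by machine XY",        # preface line 1
             "2024-01-01",                    # preface line 2
             "sample\tvalue",
             "A\t1.5",
             "B\tn.d.",                       # missing value
             "C\t2.5"), tmp)

d <- read.delim(tmp,
                skip = 2,                      # ignore the two preface lines
                na.strings = c("NA", "n.d."),  # treat "n.d." as missing
                stringsAsFactors = TRUE)       # read the sample column as a factor
d
str(d)
```

Because "n.d." is declared as a missing-value marker, the value column is read as numeric (with an NA) rather than as text.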

3.2 Further methods

3.2.1 Excel files

There are several packages that can deal with the .xlsx format, e.g. readxl, which is installed automatically with the tidyverse package, or openxlsx, which is demonstrated below.

if(!require(openxlsx)){
  install.packages("openxlsx",repos = "http://cran.us.r-project.org")
  library(openxlsx)
}
pg <- read.xlsx("../datasets/plantGrowth.xlsx")
head(pg)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

You can see above that this function handles numbers correctly regardless of your locale’s decimal separator: .xlsx files store numbers in a locale-independent format, so no conversion is needed. By default, the first line becomes the column names.

It is even a bit smarter: take a look at the second sheet of allData.xlsx in the course material. Its formatting is not the cleanest, with empty lines interspersed, but read.xlsx understands that we don’t want those empty lines.

To load the second sheet, we need to use the sheet argument, which determines which of the sheets in an .xlsx file is read. You can give either the sheet name (sheet = "hairEyeColor") or its number:

hc <- read.xlsx("../datasets/allData.xlsx",sheet = 2)
head(hc)
##    Hair   Eye  Sex Freq
## 1 Black Brown Male   32
## 2 Brown Brown Male   53
## 3   Red Brown Male   10
## 4 Blond Brown Male    3
## 5 Black  Blue Male   11
## 6 Brown  Blue Male   50

3.2.2 Reading from URLs

You can use base R’s read.delim/read.csv/read.table family of functions to load data from a webpage by giving the URL:

turl <- read.delim("https://github.com/a-h-b/R_base_vis_course/raw/main/TestCSV.csv", 
                   sep = ",")
head(turl)
##   Header.1 Header.2
## 1     Text        1
## 2      and        2
## 3     more        3
## 4     text        4

3.2.3 Reading from the clipboard

R can actually read data that you’ve copied to the clipboard. However, we recommend never using this feature, because it is not reproducible.