Chapter 3 Importing data - base R
In this chapter, we take a look at functions to import data into R.
3.1 Using base R
If we are only relying on base R, we need our data in common text
formats, such as tab- or comma separated files. It’s not unusual for
laboratory machines, analysis programs, or bioinformaticians to supply
the data in such a format. If you’ve got data in Microsoft Excel, you
can choose comma-separated (.csv) and tab-separated (.tsv) format from
the Save as...
menu.
R’s base functions for reading files need to be told about the structure of the file. Then, it is actually pretty flexible with the input format. Here are some examples. Take a look at the files we import in a text editor, such as Textedit or Notepad, to see what the original data looks like.
3.1.1 Tab-delimited data with a header and using standard decimal points
The first example is a pretty standard text file without any hidden
problems. Remember, you can execute the chunks by clicking the Run
button within the chunk or by placing your cursor inside it and pressing
Cmd+Shift+Enter
.
<- read.delim("../datasets/USArrests_tab.txt")
UA_v1 head(UA_v1)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
3.1.2 Non-standard decimal points
If your computer is set up with Dutch standards, Excel most likely exports numbers with a comma instead of a point/full stop as decimal point. Here’s how you can tell R about this.
<- read.delim("../datasets/USArrestsComma_tab.txt",dec=",")
UA_v2 head(UA_v2)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
3.1.3 Comma-separated data with a header and using standard decimal points
Data with standard decimal points might also come separated by commas:
<- read.csv("../datasets/USArrests_comma.csv")
UA_v3 head(UA_v3)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
If you look at the chunk above, we used different functions to read the
tab- and comma-separated files. This saves us some typing, because by
default read.delim
separates the input at a tab and read.csv
separates the input at the comma. However, read.delim
and read.csv
(along with a whole bunch of other functions) belong to one family. They
only differ in their defaults, so if you explicitly state the field
separator and some other formatting characteristics (like the decimal
point dec
above), you can interchange these functions. I find it a lot
easier to remember the names of those arguments than the defaults of all
the functions. So, I will use read.delim
for the rest of the notebook:
<- read.delim("../datasets/USArrests_comma.csv",sep=",")
UA_v3b head(UA_v3b)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
3.1.4 Tables without headers
While not best practice, you sometimes get data without a set of column headers:
<- read.delim("../datasets/USArrests_commaNoHeader.csv",sep=",")
UA_v4 head(UA_v4)
## X13.2 X236 X58 X21.2
## 1 10.0 263 48 44.5
## 2 8.1 294 80 31.0
## 3 8.8 190 50 19.5
## 4 9.0 276 91 40.6
## 5 7.9 204 78 38.7
## 6 3.3 110 77 11.1
You can add column names manually:
colnames(UA_v4) <- c("Murder", "Assault", "UrbanPop", "Rape")
head(UA_v4)
## Murder Assault UrbanPop Rape
## 1 10.0 263 48 44.5
## 2 8.1 294 80 31.0
## 3 8.8 190 50 19.5
## 4 9.0 276 91 40.6
## 5 7.9 204 78 38.7
## 6 3.3 110 77 11.1
3.1.5 Additional lines
Sometimes your data has some kind of preface that does not match the rest of your data. You can ignore it, if it is labeled with a specific character like so:
<- read.delim("../datasets/USArrests_tabComments.txt",comment.char = "#")
UA_v5 head(UA_v5)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
Such a character is also ignored further down:
<- read.delim("../datasets/USArrests_tabManipulated.txt",comment.char = ";")
UA_v6 head(UA_v6)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.2 113 65 21.0
There are many more options, e.g.:
skip
: if you want to skip a certain number of lines at the topquote
: if fields with text are enclosed by quotation marks other than"
or'
row.names
: if one of the columns contains row names, which can be handled differently in Rna.strings
: if you need to mark non-available data points, e.g. “NaN” or “n.d.”; you can also use multiple words, e.g. c(“NA,”“n.d.”“Na”)stringsAsFactors
: if R should interpret columns with words in the data table as factors instead of a vector of words. Don’t worry, if you don’t know the difference between factors and character vectors at this moment. But if you’re curious, take a look at the chapter on factors in the advanced R book.
3.2 Further methods
3.2.1 Excel files
There are several functions/packages that can deal with the .xlsx
format, e.g. readxl
, which is
automatically installed with the tidyverse
package or
openxlsx
, which is
demonstrated below.
if(!require(openxlsx)){
install.packages("openxlsx",repos = "http://cran.us.r-project.org")
library(openxlsx)
}<- read.xlsx("../datasets/plantGrowth.xlsx")
pg head(pg)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
You can see above that this function is smart enough to take the decimal separator from Excel and change it to the standard full stop. By default, the first line becomes the column names.
It is even a bit more
smart: Take a look at the second sheet of allData.xlsx
in the course
material. It does not feature the cleanest formatting with those empty
lines. But read.xlsx
understands that we don’t want to have empty
lines.
To load the second sheet, we need to use the sheet
argument -
this determines which of the sheets in an .xlsx file to read. You can
either give the name sheet=hairEyeColor
or the number of the sheet:
<- read.xlsx("../datasets/allData.xlsx",sheet = 2)
hc head(hc)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
3.2.2 Reading from URLs
You can use base R’s read.delim
/read.csv
/read.table
family of
functions to load data from a webpage by giving the URL:
<- read.delim("https://github.com/a-h-b/R_base_vis_course/raw/main/TestCSV.csv",
turl sep = ",")
head(turl)
## Header.1 Header.2
## 1 Text 1
## 2 and 2
## 3 more 3
## 4 text 4
3.2.3 Reading from the clipboard
R can actually read data that you’ve copied ‘from the clipboard.’ But we recommend to never use this function, because it’s not reproducible.