The first step to any data analysis process is to get the data. Data can come from many sources but two of the most common include text and Excel files. This section covers how to import data from common files. I also cover how to load data from saved R object files for holding or transferring data that has been processed in R. In addition to the the commonly used base R functions to perform data importing, I will also cover functions from the popular
Text files are a popular way to hold and exchange tabular data as almost any data application supports exporting data to the .csv (or other text file) format. Text file formats use delimiters to separate the different elements in a line, and each line of data is in its own line in the text file. Therefore, importing different kinds of text files can follow a fairly consistent process once you’ve identified the delimiter.
There are three main groups of functions that we can use to read in text files:
All three functions will import a tabular file (.csv, .tsv, .txt, etc.) and convert it to a data frame in R.
read.table() is a multipurpose work-horse function in base R for importing data. The functions
read.delim() are special cases of
read.table() in which the defaults have been adjusted for efficiency. To illustrate these functions let’s work with this .csv file which looks like (be sure to save this file in your working directory):
variable 1,variable 2,variable 3 10,beer,TRUE 25,wine,TRUE 8,cheese,FALSE
To read in the CSV file we can use
mydata <- read.csv("mydata.csv")
Once you have imported the data there are several ways to get an initial view of this data prior to performing any analysis. First, you can view it in your console by evaluating the
mydata object you just created. Alternatively, you can use RStudio’s built-in data viewer to get a scroll-able view of the complete data set using
View or use
str to assess the structure of the data. Try the following:
Note that when we assess the structure of the data set,
variable.2 is automatically coerced to a factor variable and
variable.3 is automatically coerced to a logical variable. If you are unfamiliar with data types in R you can learn about them in the Data Types section of this website or, if you’re in one of my classes, we will cover these concepts later. Furthermore, any whitespace in the column names are replaced with a “.”.
str(mydata) ## 'data.frame': 3 obs. of 3 variables: ## $ variable.1: int 10 25 8 ## $ variable.2: Factor w/ 3 levels "beer","cheese",..: 1 3 2 ## $ variable.3: logi TRUE TRUE FALSE
However, we may want to read in
variable.2 as a character variable rather then a factor. We can take care of this by changing the
stringsAsFactors argument. The default has
stringsAsFactors = TRUE; however, setting it equal to
FALSE will read in the variable as a character variable.
mydata_2 <- read.csv("mydata.csv", stringsAsFactors = FALSE) mydata_2 ## variable.1 variable.2 variable.3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE str(mydata_2) ## 'data.frame': 3 obs. of 3 variables: ## $ variable.1: int 10 25 8 ## $ variable.2: chr "beer" "wine" "cheese" ## $ variable.3: logi TRUE TRUE FALSE
As previously stated
read.csv is just a wrapper function for
read.table but with adjusted default arguments. Therefore, we can use
read.table to read in this same data. The two arguments we need to be aware of are the field separator (
sep) and the argument indicating whether the file contains the names of the variables as its first line (
read.table the defaults are
sep = "" and
header = FALSE whereas in
read.csv the defaults are
sep = "," and
header = TRUE. There are multiple other arguments we can use for certain situations which we illustrate below:
# provides same results as read.csv above read.table("mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE) ## variable.1 variable.2 variable.3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE # set column and row names read.table("mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, col.names = c("Var 1", "Var 2", "Var 3"), row.names = c("Row 1", "Row 2", "Row 3")) ## Var.1 Var.2 Var.3 ## Row 1 10 beer TRUE ## Row 2 25 wine TRUE ## Row 3 8 cheese FALSE # manually set the classes of the columns set_classes <- read.table("mydata.csv", sep = ",", header = TRUE, colClasses = c("numeric", "character", "character")) str(set_classes) ## 'data.frame': 3 obs. of 3 variables: ## $ variable.1: num 10 25 8 ## $ variable.2: chr "beer" "wine" "cheese" ## $ variable.3: chr "TRUE" "TRUE" "FALSE" # limit the number of rows to read in read.table("mydata.csv", sep = ",", header = TRUE, nrows = 2) ## variable.1 variable.2 variable.3 ## 1 10 beer TRUE ## 2 25 wine TRUE
In addition to .csv files, there are other text files that
read.table works with. The primary difference is what separates the elements. For example, tab delimited text files typically end with the .txt and .tsv extensions. You can also use the
read.delim() function as, similiar to
read.delim() is a wrapper of
read.table() with defaults set specifically for tab delimited files. We can read in this .txt file with the following:
# reading in tab delimited text files read.delim("mydata.txt") ## variable.1 variable.2 variable.3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE # provides same results as read.delim read.table("mydata.txt", sep = "\t", header = TRUE) ## variable.1 variable.2 variable.3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE
Compared to the equivalent base functions,
readr functions are around 10x faster. They bring consistency to importing functions, they produce data frames in a
data.table format which are easier to view for large data sets, the default settings removes the “hassels” of
stringsAsFactors, and they have a more flexible column specification.
To illustrate, we can use
read_csv() which is equivalent to base R’s
read.csv() function. However, note that
read_csv() maintains the full variable name (whereas
read.csv eliminates any spaces in variable names and fills it with ‘.’). Also,
read_csv() automatically sets
stringsAsFactors = FALSE, which can be a controversial topic.
library(readr) mydata_3 <- read_csv("mydata.csv") mydata_3 ## variable 1 variable 2 variable 3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE str(mydata_3) ## Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 3 variables: ## $ variable 1: int 10 25 8 ## $ variable 2: chr "beer" "wine" "cheese" ## $ variable 3: logi TRUE TRUE FALSE
read_csv also offers many additional arguments for making adjustments to your data as you read it in:
# specify the column class using col_types read_csv("mydata.csv", col_types = list(col_double(), col_character(), col_character())) ## variable 1 variable 2 variable 3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE # we can also specify column classes with a string # in this example d = double, _ skips column, c = character read_csv("mydata.csv", col_types = "d_c") ## variable 1 variable 3 ## 1 10 TRUE ## 2 25 TRUE ## 3 8 FALSE # set column names read_csv("mydata.csv", col_names = c("Var 1", "Var 2", "Var 3"), skip = 1) ## Var 1 Var 2 Var 3 ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE # set the maximum number of lines to read in read_csv("mydata.csv", n_max = 2) ## variable 1 variable 2 variable 3 ## 1 10 beer TRUE ## 2 25 wine TRUE
Similar to base R,
readr also offers functions to import .txt files (
read_delim()), fixed-width files (
read_fwf()), general text files (
read_table()), and more.
The base R (
read.csv) and readr (
read_csv) functions work great for everyday, “normal” sized data sets. However, as analysts we also need to be able to quickly import and analyze large data sets. Although the
readr functions are about 10x faster than base R importing functions, importing large data sets can still be a bit slow.
For example, check out the train_1.csv web traffic data set provided by Kaggle. This data set has 145,063 observations and 551 variables which equates to 79,929,713 elements and 265.1 MB. Importing this data with base R
read.csv takes about 60 seconds and importing with readr’s
read_csv takes 9 seconds.
# time to import with base R system.time(df0 <- read.csv("train_1.csv")) ## user system elapsed ## 60.153 2.137 63.446 # time to import with readr system.time(df1 <- read_csv("train_1.csv")) ## user system elapsed ## 8.985 0.926 9.875
However, reading this in with data.table’s
fread function is much faster. The syntax for
fread is similar to
read_csv. We supply
fread with the path to the file, similar to our previous examples, and the output is a data frame.
library(data.table) fread("mydata.csv") ## variable 1 variable 2 variable 3 ## 1: 10 beer TRUE ## 2: 25 wine TRUE ## 3: 8 cheese FALSE
fread maintains the white space in the variable names and the default sets
stringsAsFactors = FALSE so all character variables will be imported as characters rather than factors. You can specify the
sep argument; however, a benefit of
fread is that it will identify the delimiter automatically so you should not have to change the default setting when importing .csv, .tsv, .txt., etc. files. Also, like the other functions you set parameters to skip lines, change variable names and types, drop columns, etc. Just check out
?fread for all the options.
So how much faster is
read_csv? It imports the train_1.csv file in under 3 seconds, which is 3x faster than
read_csv and 20x faster than
system.time(df2 <- fread("train_1.csv")) ## user system elapsed ## 2.717 0.229 2.973
These examples provide the basics for reading in text files. However, sometimes even text files can offer unanticipated difficulties with their formatting. The base R,
data.table functions offer many arguments to deal with different formatting issues and I suggest you take time to look at the help files for these functions to learn more (i.e.
With Excel still being the spreadsheet software of choice its important to be able to efficiently import and export data from these files. Often, R users will simply resort to exporting the Excel file as a CSV file and then import into R using
read.csv; however, this is far from efficient. This section will teach you how to eliminate the CSV step and to import data directly from Excel using the
Note that there are several packages available to connect R with Excel (i.e.
RExcel, etc.); however,
readxl provides all the fundamental requirements typically needed for dealing with Excel.
readxl is one of the newest packages for accessing Excel data with R and was developed by Hadley Wickham and the RStudio team who also developed the
readr package. This package works with both legacy .xls formats and the modern xml-based .xlsx format. Similar to
readxl functions are based on a C++ library so they are extremely fast. Unlike most other packages that deal with Excel,
readxl has no external dependencies, so you can use it to read Excel data on just about any platform. Additional benefits
readxl provides includes the ability to load dates and times as POSIXct formatted dates, automatically drops blank columns, and returns outputs as data.table formatted which provides easier viewing for large data sets.
To read in Excel data with
readxl you will commonly use the
excel_sheets() allows you to read the names of the different worksheets in the Excel workbook. You can then use
read_excel() to import the data.
read_excel() will auto detect the format from the file extension (.xls vs. .xlsx) so you need not worry about the extensions. Furthermore,
read_excel() will automatically convert date and date-time variables to POSIXct formatted variables, character variables will not be coerced to factors, and white spacing in variable names will not be removed.
The following illustrates how to use
readxl to import Excel data and you can tag along using this data.
library(readxl) excel_sheets("mydata.xlsx") ##  "PICK_ME_FIRST!" "Sheet2" "extra_header" "functions" ##  "date_time" "unique_NA" mydata <- read_excel("mydata.xlsx", sheet = "PICK_ME_FIRST!") mydata ## # A tibble: 3 x 3 ## `variable 1` `variable 2` `variable 3` ## <dbl> <chr> <lgl> ## 1 10 beer TRUE ## 2 25 wine TRUE ## 3 8 cheese FALSE str(mydata) ## Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 3 variables: ## $ variable 1: num 10 25 8 ## $ variable 2: chr "beer" "wine" "cheese" ## $ variable 3: logi TRUE TRUE FALSE
The available arguments allow you to change the data as you import it. Some examples are provided:
# change variable names by skipping the first row # and using col_names to set the new names read_excel("mydata.xlsx", sheet = "extra_header", skip = 3, col_names = c("Value", "Gender")) ## # A tibble: 4 x 2 ## Value Gender ## <dbl> <chr> ## 1 200 Male ## 2 225 Female ## 3 400 Female ## 4 310 Male # sometimes missing values are set as a sentinel value # rather than just left blank - (i.e. "999") read_excel("mydata.xlsx", sheet = "unique_NA") ## # A tibble: 3 x 4 ## `variable 1` `variable 2` `variable 3` `variable 4` ## <dbl> <chr> <lgl> <dbl> ## 1 10 beer TRUE 42328 ## 2 25 wine TRUE 999 ## 3 8 999 FALSE 42330 # we can change these to missing values with na argument read_excel("mydata.xlsx", sheet = "unique_NA", na = "999") ## # A tibble: 3 x 4 ## `variable 1` `variable 2` `variable 3` `variable 4` ## <dbl> <chr> <lgl> <dttm> ## 1 10 beer TRUE 2015-11-20 ## 2 25 wine TRUE NA ## 3 8 <NA> FALSE 2015-11-22
read_excel() allows you to change column types. these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. Here I use skip to not import the 2nd and 5th columns and to change the logical variable (variable 3) to a numeric data type.
read_excel("mydata.xlsx", sheet = "date_time", col_types = c("numeric", "skip", "numeric", "date", "skip")) ## # A tibble: 3 x 3 ## `variable 1` `variable 3` `variable 4` ## <dbl> <dbl> <dttm> ## 1 10 1 2015-11-20 ## 2 25 1 NA ## 3 8 0 2015-11-22
Sometimes you may need to save data or other R objects outside of your workspace. You may want to share R data/objects with co-workers, transfer between projects or computers, or simply archive them. There are three primary ways that people tend to save R data/objects: as .RData, .rda, or as .rds files. The differences behind when you use each will be covered in the Saving data as an R object file section. This section simply shows how to load these data/object forms.
load("mydata.RData") load(file = "mydata.rda") readRDS("mydata.rds") readr::read_rds("mydata.rds")
In addition to text and Excel files, there are multiple other ways that data are stored and exchanged. Commercial statistical software such as SPSS, SAS, Stata, and Minitab often have the option to store data in a specific format for that software. In addition, analysts commonly use databases to store large quantities of data. R has good support to work with these additional options which we did not cover here. The following provides a list of additional resources to learn about data importing for these specific cases:
readxlpackage with this RStudio webinar