Chapter 5 Lists and data frames
5.1 Lists
Lists are special vectors that can store elements of any mode (including other lists).
age <- c(33, 28, 33)
names <- c('Daniel', 'Jehanne', 'Romain')
my.list <- list(Names = names, Age = age)
Like any other vector, a list is indexed with the [...] operator. Note, however, that the result is itself a list whose only element is the requested item:
## $Names
## [1] "Daniel" "Jehanne" "Romain"
## [1] "list"
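Output of this kind can be reproduced with calls along the following lines. The exact expressions are an assumption, and the list is rebuilt so that the block runs on its own; first is an illustrative name.

```r
# Rebuild the list from above so the example is self-contained
my.list <- list(Names = c('Daniel', 'Jehanne', 'Romain'), Age = c(33, 28, 33))

# Single brackets return a sub-list, here a list of length one
first <- my.list[1]
first

# The result is still a list, not the character vector it contains
class(first)
```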
To get the desired item directly, we therefore use the [[...]] operator, or the $ operator followed by the name of the element (if it has one):
## [1] "Daniel" "Jehanne" "Romain"
## NULL
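As a hedged illustration of both operators, the sketch below extracts the same element in three ways. It also shows one way to obtain NULL: asking for a name that the list does not contain (City is a hypothetical name chosen for this example, not necessarily the call used above).

```r
# Rebuild the list so the example is self-contained
my.list <- list(Names = c('Daniel', 'Jehanne', 'Romain'), Age = c(33, 28, 33))

my.list[[1]]         # the character vector itself, by position
my.list[['Names']]   # the same element, by name
my.list$Names        # the same element, with the $ shortcut

class(my.list[[1]])  # a character vector, no longer a list

# Requesting an element that does not exist returns NULL, not an error
my.list$City
```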
The elements of a list can have different lengths:
## $Names
## [1] "Daniel" "Jehanne" "Romain"
##
## $Age
## [1] 33 28 33
##
## $city
## [1] "paris" "lyon" "lyon" "paris" "nantes"
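A list like the one printed above can be obtained by adding a third element of a different length. This is a minimal sketch; the assignment through $ is an assumption about how the element was added.

```r
# Rebuild the list so the example is self-contained
my.list <- list(Names = c('Daniel', 'Jehanne', 'Romain'), Age = c(33, 28, 33))

# Add a new element with five values; the other elements keep three
my.list$city <- c('paris', 'lyon', 'lyon', 'paris', 'nantes')

# lengths() reports the length of every element of the list
lengths(my.list)
```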
5.2 Data frames
The most widely used data container is the data frame, a special list of class data.frame in which all elements have the same length. For this reason, a data frame is displayed as a two-dimensional table whose columns are its elements. Typically, the columns of a data frame represent the variables and the rows the observations. Unlike the columns of a matrix, the columns of a data frame can have different modes.
id <- c('id.453', 'id.452', 'id.455', 'id.459', 'id.458', 'id.456', 'id.450', 'id. 451')
age <- c(19, 45, 67, 53, 17, 30, 27, 35)
smoker <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
sex <- c('f', 'f', 'h', 'h', 'f', 'h', 'f', 'f')
my.db <- data.frame(Id = id, Age = age, Smoker = smoker, Sex = sex)
my.db
##        Id Age Smoker Sex
## 1  id.453  19   TRUE   f
## 2  id.452  45  FALSE   f
## 3  id.455  67   TRUE   h
## 4  id.459  53   TRUE   h
## 5  id.458  17  FALSE   f
## 6  id.456  30   TRUE   h
## 7  id.450  27   TRUE   f
## 8 id. 451  35   TRUE   f
## [1] 8 4
## [1] 8
## [1] 4
## [1] "Id" "Age" "Smoker" "Sex"
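The four results above are what the usual size and name queries return for my.db. Which calls were actually used is an assumption; the sketch below rebuilds the data frame so that it runs on its own.

```r
# Rebuild the data frame so the example is self-contained
my.db <- data.frame(
  Id = c('id.453', 'id.452', 'id.455', 'id.459', 'id.458', 'id.456', 'id.450', 'id. 451'),
  Age = c(19, 45, 67, 53, 17, 30, 27, 35),
  Smoker = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  Sex = c('f', 'f', 'h', 'h', 'f', 'h', 'f', 'f')
)

dim(my.db)    # number of rows, then number of columns
nrow(my.db)   # number of observations
ncol(my.db)   # number of variables
names(my.db)  # column names
```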
Since a data frame is a list, we can extract a column with the $ operator, preceded by the name of the data frame and followed by the name of the column (or variable), or with the [...] operator:
## [1] f f h h f h f f
## Levels: f h
## [1] 19 45 67 53 17 30 27 35
## [1] 45 17
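A sketch of extractions that give results of this shape. The Levels line shows that Sex is stored as a factor; since R 4.0.0, data.frame() keeps strings as character vectors unless stringsAsFactors = TRUE is set, so the sketch sets it explicitly. The last call, which selects the ages of the non-smokers, is one assumed way to arrive at 45 and 17.

```r
# Rebuild the relevant columns; stringsAsFactors = TRUE reproduces the factor output
my.db <- data.frame(
  Age = c(19, 45, 67, 53, 17, 30, 27, 35),
  Smoker = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  Sex = c('f', 'f', 'h', 'h', 'f', 'h', 'f', 'f'),
  stringsAsFactors = TRUE
)

my.db$Sex         # column as a factor with levels f and h
my.db[, 'Age']    # column selected by name with [row, column]
my.db[['Age']]    # the same column, list-style

# Logical indexing: ages of the observations where Smoker is FALSE
non_smoker_age <- my.db$Age[!my.db$Smoker]
non_smoker_age
```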
The columns are directly accessible in the workspace (without having to type the name of the data frame and the $) after the data frame has been attached:
## [1] 19 45 67 53 17 30 27 35
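A minimal sketch of attaching and detaching a data frame, using only two of the columns. The closing with() call is an addition not taken from the text: it gives the same direct access to columns for a single expression without changing the search path. age_attached is an illustrative name.

```r
my.db <- data.frame(
  Age = c(19, 45, 67, 53, 17, 30, 27, 35),
  Smoker = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

attach(my.db)          # put the columns of my.db on the search path
age_attached <- Age    # Age is now found without the my.db$ prefix
age_attached
detach(my.db)          # remove them again to avoid name clashes later

# Alternative: evaluate one expression inside the data frame
with(my.db, mean(Age))
```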
To display only the first six rows:
##       Id Age Smoker Sex
## 1 id.453  19   TRUE   f
## 2 id.452  45  FALSE   f
## 3 id.455  67   TRUE   h
## 4 id.459  53   TRUE   h
## 5 id.458  17  FALSE   f
## 6 id.456  30   TRUE   h
Similarly, tail() returns a data frame with the last six rows.
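For illustration, the sketch below applies head() and tail() to a reduced version of my.db (two columns only, an assumption made to keep the block short). Both functions accept an argument n to change the number of rows returned.

```r
my.db <- data.frame(
  Id = c('id.453', 'id.452', 'id.455', 'id.459', 'id.458', 'id.456', 'id.450', 'id. 451'),
  Age = c(19, 45, 67, 53, 17, 30, 27, 35)
)

head(my.db)          # first six rows
tail(my.db)          # last six rows, with their original row numbers
head(my.db, n = 3)   # first three rows only
```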
5.3 Importing and exporting data
Importing data is a fundamental step in data analysis. To load the data stored in a file (texte, .csv, Excel, …) into the workspace (ie into memory), you can use the basic function read.table()
. The three most important arguments are:
- file: name (and path) of the file, in quotes
- header: are the elements of the first row the names of the columns?
- sep: character separating the columns
read.table() returns a data frame:
url1 <- 'https://raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/iris.txt'
d1 <- read.table(url1,
# the first line contains the name of the variables
header = TRUE,
# values are separated by ;
sep = ';')
head(d1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
url2 <- 'https://raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/heart.txt'
d2 <- read.table(url2,
header = TRUE,
# variables are separated by a tabulation
sep = '\t')
dim(d2); names(d2)
## [1] 270 13
## [1] "age" "sexe" "type_douleur" "pression" "cholester"
## [6] "sucre" "electro" "taux_max" "angine" "depression"
## [11] "pic" "vaisseau" "coeur"
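To see the effect of header and sep without a network connection, the sketch below reads a small in-memory table through the text argument of read.table(), which stands in for file. The data and the names raw_text, with_header and no_header are made up for this illustration.

```r
# Three lines of text: a header row followed by two observations
raw_text <- "name;age
Daniel;33
Jehanne;28"

# header = TRUE: the first row supplies the column names
with_header <- read.table(text = raw_text, header = TRUE, sep = ';')
with_header

# header = FALSE: the first row is treated as data and columns are named V1, V2
no_header <- read.table(text = raw_text, header = FALSE, sep = ';')
no_header
```

With header = FALSE the word age ends up in the second column, so that column can no longer be read as numeric. Setting header correctly therefore matters for the types of the columns as well as for their names.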
For data stored in the .Rda or .Rdata format, the import is done with load() with the argument file = filename. For instance, download the Iris.Rda file at github.com/vittorioperduca/Introduction-to-R/blob/master/data/Iris.Rda to your working directory and then try the following:
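A minimal sketch of that step, assuming Iris.Rda has been downloaded to the working directory. The names of the objects stored in the file are not given here, so the sketch simply records whatever load() restores; rda_file and restored are illustrative names.

```r
rda_file <- 'Iris.Rda'   # assumed to be in the working directory

# load() recreates the saved objects in the workspace and returns
# their names (invisibly) as a character vector
restored <- if (file.exists(rda_file)) {
  load(file = rda_file)
} else {
  character(0)           # nothing loaded: the file has not been downloaded
}

restored                 # names of the objects that are now available
```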
If you want to load .Rda or .Rdata files directly from a URL, don’t forget to wrap the address in the url() function (this was not necessary with read.table()).
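Data can be exported either to a text file (.csv, or another delimited format that a spreadsheet program such as Excel can open) using write.table(), or to .Rda and .Rdata files using save(). The main arguments are:
- x = the data to save (for write.table(); save() instead takes the objects to save as its first arguments)
- file = the name of the file (in quotes).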
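The sketch below exports a small data frame in both ways and reads it back. The data frame scores and the temporary file names are made up for this example; tempfile() is used so that nothing is written to the working directory.

```r
scores <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

txt_file <- tempfile(fileext = '.txt')
rda_file <- tempfile(fileext = '.Rda')

# Text export: one row per observation, columns separated by ;
write.table(x = scores, file = txt_file, sep = ';', row.names = FALSE)

# Binary export: the object is stored together with its name
save(scores, file = rda_file)

# Read the text file back with the matching header and sep
back <- read.table(txt_file, header = TRUE, sep = ';')

# Remove the object, then restore it under its original name
rm(scores)
load(file = rda_file)
scores
```

The text file can be opened by other programs, while the .Rda file preserves the R object exactly, including column types.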
If the dataset is stored (or must be saved up) locally, it is necessary to know (and be able to modify) the working directory:
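A minimal sketch of inspecting and changing the working directory. tempdir() is used here only as a stand-in for a real project folder; current and old are illustrative names.

```r
current <- getwd()         # path of the current working directory
current

# setwd() changes the directory and returns the previous one invisibly
old <- setwd(tempdir())
getwd()                    # now the session's temporary directory

setwd(old)                 # go back to where we started
```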
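Remember that ~/ is a shortcut for the home directory: /Users/username on macOS and /home/username on Linux. On Windows machines, the path syntax is slightly different: the separator is \ instead of /. Inside an R string a backslash must be doubled (\\), and R also accepts / as a separator on Windows.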
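The following sketch shows a few base R helpers for writing paths that work across systems. The folder and user names are placeholders, not real locations.

```r
# Expand the ~ shortcut to the full home directory of the current user
path.expand('~')

# file.path() joins path components with /, which R accepts on every system
data_path <- file.path('data', 'iris.txt')
data_path

# Two equivalent ways to write a Windows path in an R string
win_escaped <- 'C:\\Users\\username\\data'   # each backslash is doubled
win_forward <- 'C:/Users/username/data'      # forward slashes also work

# Converting backslashes to forward slashes gives the same path
gsub('\\', '/', win_escaped, fixed = TRUE)
```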
5.4 Exercise
Download the text file raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/hepatitis.txt to your working directory.
- Import the dataset in R. Warning: missing data were coded with a ?; read the documentation of read.table().
- Find the number of observations, display the names of the variables and the first six observations. Check that the value of STEROID for the fourth observation is missing using the appropriate function.
- Calculate the mean value of ALBUMIN in women and men.
- Create a variable NSYMP counting the number of times a variable is equal to yes between FATIGUE and MALAISE. Pay attention to the format of these two variables!