Chapter 5 Lists and data frames

5.1 Lists

Lists are special vectors that can store elements of any mode (including other lists).

age <- c(33, 28, 33)
names <- c('Daniel', 'Jehanne', 'Romain')
my.list <- list(Names = names, Age = age)

Like any other vector, a list is indexed with the [...] operator. Note, however, that the result is itself a list whose single element is the desired item:

my.list[1]

## $Names
## [1] "Daniel"  "Jehanne" "Romain"

mode(my.list[1])

## [1] "list"

To get the desired item directly, we therefore use the [[...]] operator or the $ operator followed by the name of the element (if available):

my.list[[1]]

## [1] "Daniel"  "Jehanne" "Romain"

my.list$Age

## [1] 33 28 33
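
As a small illustration of these access operators, the sketch below rebuilds a list like the one above under the hypothetical name people (this name and the nested list are assumptions made for the example). It also shows that element names are case-sensitive: asking for a name with the wrong case silently returns NULL.

```r
# Hypothetical list mirroring my.list from the text
people <- list(Names = c('Daniel', 'Jehanne', 'Romain'), Age = c(33, 28, 33))

by_position <- people[[2]]      # numeric vector: the second element
by_name     <- people[['Age']]  # the same element, selected by name
wrong_case  <- people$age       # NULL: 'age' does not match 'Age'

# Lists can contain other lists, and the operators can be chained
nested <- list(info = people, source = 'survey')
first_name <- nested$info$Names[1]
```

Checking is.null() on the result is a simple way to detect a misspelled element name.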

The elements of a list can have different lengths:

city <- c('paris', 'lyon', 'lyon', 'paris', 'nantes')
my.list$city <- city
my.list

## $Names
## [1] "Daniel"  "Jehanne" "Romain" 
## 
## $Age
## [1] 33 28 33
## 
## $city
## [1] "paris"  "lyon"   "lyon"   "paris"  "nantes"
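
To check how long each element is, one possible sketch uses length() for the list itself and lengths() for its elements. The list name trip is hypothetical; its content mirrors the example above.

```r
# Hypothetical list whose elements have different lengths
trip <- list(Names = c('Daniel', 'Jehanne', 'Romain'),
             Age = c(33, 28, 33),
             city = c('paris', 'lyon', 'lyon', 'paris', 'nantes'))

n_elements <- length(trip)   # number of elements in the list
sizes      <- lengths(trip)  # length of each element, as a named vector
```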

5.2 Data frames

The most widely used data container is the data frame, a special list of class data.frame in which all elements have the same length. For this reason, a data frame is displayed as a two-dimensional table whose columns are its elements. Typically, the columns of a data frame represent the variables and the rows represent the observations. Unlike the columns of a matrix, the columns of a data frame can have different modes.

id <- c('id.453', 'id.452', 'id.455', 'id.459', 'id.458', 'id.456', 'id.450', 'id.451')
age <- c(19, 45, 67, 53, 17, 30, 27, 35)
smoker <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
sex <- c('f', 'f', 'h', 'h', 'f', 'h', 'f', 'f')
my.db <- data.frame(Id = id, Age = age, Smoker = smoker, Sex = sex); my.db

##       Id Age Smoker Sex
## 1 id.453  19   TRUE   f
## 2 id.452  45  FALSE   f
## 3 id.455  67   TRUE   h
## 4 id.459  53   TRUE   h
## 5 id.458  17  FALSE   f
## 6 id.456  30   TRUE   h
## 7 id.450  27   TRUE   f
## 8 id.451  35   TRUE   f

dim(my.db); nrow(my.db); ncol(my.db)

## [1] 8 4

## [1] 8

## [1] 4

names(my.db)

## [1] "Id"     "Age"    "Smoker" "Sex"
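
The two defining properties of a data frame (it is a list, and its columns may have different modes but must have compatible lengths) can be checked with a short sketch. The data frame db and its values are hypothetical.

```r
db <- data.frame(Id = c('id.1', 'id.2', 'id.3'),
                 Age = c(19, 45, 67),
                 Smoker = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)  # keep Id as character in any R version

is_a_list   <- is.list(db)        # TRUE: a data frame is a list of columns
col_classes <- sapply(db, class)  # one class per column

# Columns of incompatible lengths (2 and 3 here) are rejected with an error
bad <- tryCatch(data.frame(a = 1:2, b = 1:3), error = function(e) 'error')
```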

Since a data frame is a list, we can extract a column with the $ operator, preceded by the name of the data frame and followed by the name of the column (or variable). Alternatively, we can use the [...] operator with a row index and a column index:

my.db$Sex # with R < 4.0.0 (or stringsAsFactors = TRUE), a character column is automatically converted to a factor

## [1] f f h h f h f f
## Levels: f h

my.db[, 2]

## [1] 19 45 67 53 17 30 27 35

my.db$Age[my.db$Smoker == FALSE] # simple example of selection

## [1] 45 17
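
A few more ways of selecting columns and rows are sketched below, on a hypothetical data frame db that reuses some of the values of my.db. The function subset() is an addition to the operators discussed in the text.

```r
db <- data.frame(Id = c('id.453', 'id.452', 'id.455', 'id.459'),
                 Age = c(19, 45, 67, 53),
                 Smoker = c(TRUE, FALSE, TRUE, TRUE))

ages             <- db[['Age']]                    # same as db$Age
smokers          <- db[db$Smoker, c('Id', 'Age')]  # rows where Smoker is TRUE, two columns
over_50          <- subset(db, Age > 50)           # rows satisfying a condition
mean_age_smokers <- mean(db$Age[db$Smoker])        # mean age among smokers
```

Because Smoker is already logical, db$Smoker can be used directly as a row index, and !db$Smoker selects the non-smokers.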

Once the data frame has been attached, its columns are directly accessible in the workspace (without typing the name of the data frame and the $):

attach(my.db)
Age

## [1] 19 45 67 53 17 30 27 35
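
attach() leaves the columns on the search path until detach() is called, and an attached column can be masked by a workspace object of the same name. As a hedged alternative (not used in the original text), with() evaluates an expression inside the data frame without attaching it. The data frame db below is a hypothetical stand-in for my.db.

```r
db <- data.frame(Age = c(19, 45, 67, 53), Smoker = c(TRUE, FALSE, TRUE, TRUE))

# Columns are visible only inside the call to with()
mean_age        <- with(db, mean(Age))
non_smoker_ages <- with(db, Age[!Smoker])

# If attach() is used, undo it afterwards with detach()
attach(db)
attached_ok <- exists('Smoker')  # TRUE while db is attached
detach(db)
```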

To display only the first six rows:

head(my.db)

##       Id Age Smoker Sex
## 1 id.453  19   TRUE   f
## 2 id.452  45  FALSE   f
## 3 id.455  67   TRUE   h
## 4 id.459  53   TRUE   h
## 5 id.458  17  FALSE   f
## 6 id.456  30   TRUE   h

Similarly, tail() returns a data frame with the last six rows.
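
Both functions accept an argument n that sets the number of rows to return. A minimal sketch, on a hypothetical data frame db with eight rows:

```r
db <- data.frame(Id = paste0('id.', 1:8),
                 Age = c(19, 45, 67, 53, 17, 30, 27, 35))

first_three <- head(db, n = 3)  # rows 1 to 3
last_two    <- tail(db, n = 2)  # rows 7 and 8, original row names kept
```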

5.3 Importing and exporting data

Importing data is a fundamental step in data analysis. To load the data stored in a file (text, .csv, …) into the workspace (i.e. into memory), you can use the base function read.table(). Excel workbooks cannot be read by read.table() directly and must first be exported to one of these formats. The three most important arguments are:

file: name (and path) of the file, in quotes
header: are the elements of the first row the names of the columns?
sep: character separating the columns

read.table() returns a data frame:

url1 <- 'https://raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/iris.txt'
d1 <- read.table(url1,
                 # the first line contains the name of the variables
                 header = TRUE,
                 # values are separated by ;
                 sep = ';') 

head(d1)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

url2 <- 'https://raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/heart.txt'
d2 <- read.table(url2,
                 header = TRUE,
                 # variables are separated by a tabulation
                 sep = '\t') 
dim(d2); names(d2)

## [1] 270  13

##  [1] "age"          "sexe"         "type_douleur" "pression"     "cholester"   
##  [6] "sucre"        "electro"      "taux_max"     "angine"       "depression"  
## [11] "pic"          "vaisseau"     "coeur"
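
The arguments header and sep can also be tried without any file or internet connection, because read.table() accepts the data as a character string through its text argument. The sketch below is only an illustration; the content of csv_text is made up for the example.

```r
# Made-up data: the first line holds the column names, values are separated by ;
csv_text <- 'Name;Age;City
Daniel;33;paris
Jehanne;28;lyon
Romain;33;lyon'

d <- read.table(text = csv_text,
                header = TRUE,  # the first line contains the variable names
                sep = ';')      # values are separated by ;

dim(d)
names(d)
```

With header = FALSE the first line would be read as an observation and the columns would receive the default names V1, V2, V3.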

For data stored in the .Rda or .RData format, the import is done with load() with the argument file = filename. For instance, download the Iris.Rda file from github.com/vittorioperduca/Introduction-to-R/blob/master/data/Iris.Rda to your working directory and then try the following:

iris_path <- 'data/Iris.Rda' # replace with the file path
load(iris_path)

If you want to load .Rda or .RData files directly from a URL, don’t forget to wrap the address in the url() function (this was not necessary with read.table()).

Data can be exported either to a text file (such as .txt or .csv) using write.table() or to .Rda and .RData files using save(). In both cases, the two main arguments are

x = data to save (with save(), the objects to save are simply listed first, without the x = name)
file = the name of the file (in quotes).
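
As a hedged illustration, the sketch below exports a small hypothetical data frame db in both ways and reads it back. It writes to temporary files created with tempfile() so that nothing is left in the working directory; in practice you would give your own file names.

```r
db <- data.frame(Id = c('id.1', 'id.2'), Age = c(19, 45))

# Text export: ; as separator, no column of row names
txt_file <- tempfile(fileext = '.txt')
write.table(x = db, file = txt_file, sep = ';', row.names = FALSE)
db_txt <- read.table(txt_file, header = TRUE, sep = ';')

# Binary export: save() stores the object together with its name
rda_file <- tempfile(fileext = '.Rda')
save(db, file = rda_file)
rm(db)                      # remove the object from the workspace
restored <- load(rda_file)  # load() recreates db and returns the restored names
```

Note that load() does not return the data: it recreates the saved objects under their original names, here db.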

If the dataset is stored (or must be saved) locally, it is necessary to know (and be able to modify) the working directory:

# getwd() # try on your machine!
# setwd('~/Documents') # to move to the Documents directory

Remember that ~/ is a shortcut for the home directory: /Users/username on macOS and /home/username on Linux. On Windows machines the path syntax is slightly different: paths are written with \ instead of /, but inside an R string you must type either / or a doubled \\.
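
A possible way of building paths that work on every platform is sketched below. The folder Documents, the file data.txt and the Windows path are hypothetical examples.

```r
current_dir <- getwd()           # current working directory
home_dir    <- path.expand('~')  # what ~ expands to on this machine

# file.path() joins its components with /, which R accepts on every platform
target <- file.path(home_dir, 'Documents', 'data.txt')

# Inside an R string, a Windows path needs / or a doubled backslash
win_path <- 'C:\\Users\\username\\Documents'
```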

5.4 Exercise

Download the text file raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/hepatitis.txt to your working directory.

Import the dataset into R. Warning: missing data are coded with a ?; read the documentation of read.table().
Find the number of observations, display the names of the variables and the first six observations. Check that the value of STEROID for the fourth observation is missing using the appropriate function.
Calculate the mean value of ALBUMIN in women and men.
Create a variable NSYMP counting the number of times a variable is equal to yes between FATIGUE and MALAISE. Pay attention to the format of these two variables!