Chapter 5 Lists and data frames

5.1 Lists

Lists are special vectors that can store elements of any mode (including other lists).

Like any other vector, a list can be indexed with the [...] operator. Note, however, that the result is itself a list whose single element is the requested item:

## $Names
## [1] "Daniel"  "Jehanne" "Romain"
## [1] "list"
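
Output of this kind could be produced by code along the following lines. This is only a minimal sketch: the object names L and res and the Age element are assumptions, since only the printed result is shown here.

```r
# Hypothetical list (the names L, res and the Age element are assumptions)
L <- list(Names = c("Daniel", "Jehanne", "Romain"),
          Age   = c(33, 28, 33))

res <- L["Names"]  # single brackets: the result is still a list
res                # prints the $Names element
class(res)         # "list"
```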

To get the desired item directly, we therefore use the [[...]] operator or the $ operator followed by the name of the element (if available):

## [1] "Daniel"  "Jehanne" "Romain"
## NULL
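
As a sketch of how the two operators behave, assuming a hypothetical list L with a single element Names: [[...]] and $ both return the element itself, and asking for a name that the list does not contain returns NULL. Whether the NULL shown above arose in exactly this way is an assumption.

```r
# Hypothetical list with a single named element
L <- list(Names = c("Daniel", "Jehanne", "Romain"))

by_position <- L[[1]]   # the element itself: a character vector
by_name     <- L$Names  # same element, selected by name
missing_el  <- L$Age    # no element named Age, so the result is NULL

by_name
missing_el
```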

The elements of a list can have different lengths:

## $Names
## [1] "Daniel"  "Jehanne" "Romain" 
## 
## $Age
## [1] 33 28 33
## 
## $city
## [1] "paris"  "lyon"   "lyon"   "paris"  "nantes"
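
A list like the one printed above could be built as follows. The object name L is an assumption; the values are copied from the output. The lengths() function is a convenient way to check the length of each element.

```r
# Elements of different lengths are allowed in a list
L <- list(Names = c("Daniel", "Jehanne", "Romain"),
          Age   = c(33, 28, 33),
          city  = c("paris", "lyon", "lyon", "paris", "nantes"))

L
n_per_element <- lengths(L)  # length of each element: 3, 3 and 5
n_per_element
```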

5.2 Data frames

The most widely used data container is the data frame, a special list of class data.frame in which all elements have the same length. For this reason, a data frame is displayed as a two-dimensional array whose columns are its elements. Typically, the columns of a data frame represent the variables and the rows represent the observations. Unlike in a matrix, the columns of a data frame can have different modes.

##        Id Age Smoker Sex
## 1  id.453  19   TRUE   f
## 2  id.452  45  FALSE   f
## 3  id.455  67   TRUE   h
## 4  id.459  53   TRUE   h
## 5  id.458  17  FALSE   f
## 6  id.456  30   TRUE   h
## 7  id.450  27   TRUE   f
## 8 id. 451  35   TRUE   f
## [1] 8 4
## [1] 8
## [1] 4
## [1] "Id"     "Age"    "Smoker" "Sex"
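
The output above could come from code of the following kind. This is a sketch: the object name patients is an assumption, the values are copied from the printed table as displayed, and Sex is assumed to be stored as a factor.

```r
# Hypothetical data frame; values copied from the printed output
patients <- data.frame(
  Id     = c("id.453", "id.452", "id.455", "id.459",
             "id.458", "id.456", "id.450", "id. 451"),
  Age    = c(19, 45, 67, 53, 17, 30, 27, 35),
  Smoker = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  Sex    = factor(c("f", "f", "h", "h", "f", "h", "f", "f"))
)

patients
dim(patients)    # number of rows and columns
nrow(patients)   # number of observations
ncol(patients)   # number of variables
names(patients)  # column names
```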

Since a data frame is a list, we can extract a column with the $ operator, preceded by the name of the data frame and followed by the name of the column (or variable), or with the [...] operator:

## [1] f f h h f h f f
## Levels: f h
## [1] 19 45 67 53 17 30 27 35
## [1] 45 17
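
One way to obtain results like those above, as a sketch under the assumption that the data frame is called patients (a hypothetical name) and that the last line shows the ages of the non-smokers:

```r
# Hypothetical data frame with the columns needed for this example
patients <- data.frame(
  Age    = c(19, 45, 67, 53, 17, 30, 27, 35),
  Smoker = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  Sex    = factor(c("f", "f", "h", "h", "f", "h", "f", "f"))
)

patients$Sex       # column selected with $
patients[, "Age"]  # same idea with [row, column] indexing

# Logical indexing: ages of the observations where Smoker is FALSE
age_nonsmokers <- patients$Age[!patients$Smoker]
age_nonsmokers
```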

Once the data frame has been attached with attach(), its columns are directly accessible in the workspace (without having to type the name of the data frame and the $):

## [1] 19 45 67 53 17 30 27 35
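
A minimal sketch of attaching and detaching, again with the hypothetical name patients. Because attached columns can mask other objects with the same name, it is usually safer to call detach() when done; with() is shown as an alternative that avoids attaching altogether.

```r
# Hypothetical data frame with a single column
patients <- data.frame(Age = c(19, 45, 67, 53, 17, 30, 27, 35))

attach(patients)  # columns become visible by name
Age               # no need to write patients$Age
detach(patients)  # remove the data frame from the search path

# Alternative: evaluate an expression inside the data frame
mean_age <- with(patients, mean(Age))
mean_age
```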

To display only the first six rows, use head():

##       Id Age Smoker Sex
## 1 id.453  19   TRUE   f
## 2 id.452  45  FALSE   f
## 3 id.455  67   TRUE   h
## 4 id.459  53   TRUE   h
## 5 id.458  17  FALSE   f
## 6 id.456  30   TRUE   h

Similarly, tail() returns a data frame with the last six rows.
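
Both functions can be tried on the built-in iris data set, used here only as a convenient example. The argument n changes the number of rows returned.

```r
first_rows <- head(iris)         # first six rows (default n = 6)
last_rows  <- tail(iris)         # last six rows
last_three <- tail(iris, n = 3)  # n controls how many rows are kept

first_rows
last_three
```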

5.3 Importing and exporting data

Importing data is a fundamental step in data analysis. To load the data stored in a file (text, .csv, Excel, …) into the workspace (i.e. into memory), you can use the basic function read.table(). The three most important arguments are:

  • file: name (and path) of the file, in quotes
  • header: are the elements of the first row the names of the columns?
  • sep: character separating the columns

read.table() returns a data frame:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## [1] 270  13
##  [1] "age"          "sexe"         "type_douleur" "pression"     "cholester"   
##  [6] "sucre"        "electro"      "taux_max"     "angine"       "depression"  
## [11] "pic"          "vaisseau"     "coeur"
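
The files read to produce the output above are not shown, so the following sketch uses a temporary file instead: it writes the first rows of the built-in iris data to a comma-separated file and reads it back. The names tmp and dat are assumptions made for the example.

```r
# Create a small comma-separated file to import (stand-in for a real file)
tmp <- tempfile(fileext = ".csv")
write.table(head(iris), file = tmp, sep = ",", row.names = FALSE)

# Import it: the first row holds the column names, columns are separated by ","
dat <- read.table(file = tmp, header = TRUE, sep = ",")

head(dat)
dim(dat)
names(dat)
```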

For data stored in the .Rda or .Rdata format, the import is done with load(), using the argument file = filename. For instance, download the Iris.Rda file from github.com/vittorioperduca/Introduction-to-R/blob/master/data/Iris.Rda to your working directory and then try the following:
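
With the file downloaded, the call would presumably be load(file = "Iris.Rda"). The sketch below can be run without the download: it first creates a temporary .Rda file (the names rda_file and iris_copy are assumptions) and then loads it. Note that load() restores the objects under the names they were saved with and returns those names.

```r
# Stand-in for the downloaded file: save a copy of iris to a temporary .Rda
rda_file  <- file.path(tempdir(), "Iris_demo.Rda")
iris_copy <- iris
save(iris_copy, file = rda_file)
rm(iris_copy)  # remove the object so that load() has something to restore

loaded <- load(file = rda_file)  # returns the names of the restored objects
loaded
dim(iris_copy)
```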

If you want to load .Rda or .Rdata files directly from a URL, don’t forget to use the url() function (this is not necessary with read.table()).

Data can be exported either to a text file (or .csv, Excel …) using write.table() or to .Rda and .Rdata files using save(). In both cases, the two main arguments are

  • the data to save (argument x in write.table(); in save(), the objects are simply listed by name)
  • file = the name of the file (in quotes).
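
A minimal sketch of both kinds of export, writing to a temporary directory so that nothing is left in the working directory. The object and file names are assumptions made for the example.

```r
out_dir  <- tempdir()
csv_file <- file.path(out_dir, "iris_head.csv")
rda_file <- file.path(out_dir, "iris_head.Rda")

small <- head(iris)

# Text export: comma-separated, without the row names
write.table(x = small, file = csv_file, sep = ",", row.names = FALSE)

# Binary export: the object is passed by name
save(small, file = rda_file)

file.exists(c(csv_file, rda_file))
```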

If the dataset is stored (or must be saved) locally, it is necessary to know (and be able to modify) the working directory:
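
The two functions involved are getwd() and setwd(). The sketch below moves to the session's temporary directory, chosen only so that the example runs anywhere, and then restores the original directory.

```r
old_wd <- getwd()  # current working directory
old_wd

setwd(tempdir())   # change the working directory
getwd()

setwd(old_wd)      # go back to the original directory
```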

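Remember that on Linux and macOS machines, ~/ is a shortcut for the user's home directory (/Users/username on macOS, /home/username on Linux). On Windows machines, the path syntax is slightly different: for example, \ is used instead of /. Inside an R character string, a backslash must be doubled (\\); forward slashes are also accepted.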
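
A few base functions help with writing paths that work across systems. In this sketch the directory and file names are hypothetical; file.path() joins the components with /, which R accepts on all platforms.

```r
path.expand("~")  # the home directory behind the ~ shortcut

# Build a path from its components
p <- file.path("data", "Iris.Rda")
p

# A Windows-style path: each backslash is doubled inside the string
win_style <- "C:\\Users\\username\\data"
fwd_style <- gsub("\\", "/", win_style, fixed = TRUE)  # same path with /
fwd_style
```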

5.4 Exercise

Download the text file raw.githubusercontent.com/vittorioperduca/Introduction-to-R/master/data/hepatitis.txt to your working directory.

  1. Import the dataset into R. Warning: missing data are coded with a ?; read the documentation of read.table().
  2. Find the number of observations, display the names of the variables and the first six observations. Check that the value of STEROID for the fourth observation is missing using the appropriate function.
  3. Calculate the mean value of ALBUMIN in women and men.
  4. Create a variable NSYMP counting, for each observation, how many of the variables FATIGUE and MALAISE are equal to yes. Pay attention to the format of these two variables!