7  Matrices and arrays

7.1 Introduction

So far we have seen various data types that can store a collection of elements. All these data types – vectors, lists, and factors – are one-dimensional. There is another set of data types in R that can store multiple elements in more than one dimension. These data types are ideally suited for handling “real-world” data, in other words data that can be represented as a table with rows and columns. Some of these data types can go beyond two dimensions! Let’s understand the construction and functions of these multi-dimensional data types.

7.2 Matrices

A matrix, as the name suggests, is a two-dimensional collection of elements. Matrices are essentially an extension of vectors (which are one-dimensional) to two dimensions. Just like vectors, matrices are homogeneous, i.e. all the elements in a matrix are of the same data type. To create a matrix we can use the matrix function with a collection of elements as an argument. When we print the class of the matrix object the output is matrix array (in R 4.0.0 and later); that is because all matrices are arrays. The difference is that a matrix has exactly two dimensions while an array can have any number of dimensions (see below). To check the dimensions of the matrix object use the dim function.

m1 <- matrix(1:5)
print(m1)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
cat("The class of m1 is:", class(m1),"\n")
The class of m1 is: matrix array 
cat("The dimensions of m1 is:", dim(m1))
The dimensions of m1 is: 5 1

We can also create a matrix of desired dimensions by using the nrow and ncol keyword arguments of the matrix function. By default the elements in a matrix are filled column-wise; to fill row-wise instead, set the logical keyword argument byrow to TRUE.

m1 <- matrix(1:10, nrow=2)
print(m1)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
cat("The dimensions of m1 is:", dim(m1), "\n")
The dimensions of m1 is: 2 5 
m1 <- matrix(1:10, nrow=2, byrow=TRUE)
print(m1)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
cat("The dimensions of m1 is:", dim(m1))
The dimensions of m1 is: 2 5

7.2.1 Naming matrices

Just like we can assign names to the elements of a vector, we can name the dimensions of a matrix. This can be done when creating a matrix using the dimnames argument of the matrix function. This keyword argument takes a list of two vectors: the first for the row names and the second for the column names. We can also assign names after a matrix has been created using the dimnames function. The same function, called without an assignment, returns the names of the rows and columns of a matrix.

dimnames(m1) <- list(c("R1", "R2"), c("C1", "C2", "C3", "C4", "C5"))
print(m1)
   C1 C2 C3 C4 C5
R1  1  2  3  4  5
R2  6  7  8  9 10
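
The names can also be supplied when the matrix is created and read back afterwards. The following is a minimal sketch; the object name m_named and the labels are made up for illustration.

```r
# Name the rows and columns at creation time via the dimnames argument
m_named <- matrix(1:6, nrow = 2, byrow = TRUE,
                  dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))

# dimnames() returns a list: row names first, column names second
print(dimnames(m_named))

# rownames() and colnames() access one set of names at a time
print(rownames(m_named))
print(colnames(m_named))
```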

7.2.2 Matrix from vectors

A matrix can also be created by combining a set of vectors. The vectors should be of equal length; a shorter vector is recycled to match the longest one. The combining of vectors can be done either column-wise or row-wise using cbind and rbind respectively.

v1 <- letters[1:5]
v2 <- c(1:5)
mat_r <- rbind(v1,v2)
mat_c <- cbind(v1,v2)
print(mat_r)
   [,1] [,2] [,3] [,4] [,5]
v1 "a"  "b"  "c"  "d"  "e" 
v2 "1"  "2"  "3"  "4"  "5" 
print(mat_c)
     v1  v2 
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"

In the output above, notice that the numbers are converted to characters; that is because matrices are homogeneous. Since the two vectors that were combined had different data types (v1 – character and v2 – integer), the result was coerced to the character data type.
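
The coercion can be checked with typeof. The sketch below uses illustrative vectors (chars, ints, and dbls are not part of the example above) and also shows what happens when the vectors differ in length.

```r
chars <- letters[1:3]
ints  <- 1:3
dbls  <- c(1.5, 2.5, 3.5)

# integer combined with double gives double
print(typeof(cbind(ints, dbls)))

# integer combined with character gives character
print(typeof(cbind(ints, chars)))

# A shorter vector is recycled; R warns only when the longer length
# is not a multiple of the shorter one
print(cbind(1:4, c(10, 20)))
```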

We can initialize a matrix filled with any scalar value e.g. a matrix of zeros or a matrix of some text.

mat_zeros <- matrix(0,2,3) 
print(mat_zeros)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
mat_hi <- matrix("hi",3,2)
print(mat_hi)
     [,1] [,2]
[1,] "hi" "hi"
[2,] "hi" "hi"
[3,] "hi" "hi"

7.2.3 Subsetting

m2 <- matrix(1:25, nrow=5)
print(m2)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
print("First two rows of m2")
[1] "First two rows of m2"
print(m2[1:2,])
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
cat("Second row of m2:", m2[2,], "\n")
Second row of m2: 2 7 12 17 22 
cat("Third column of m2:", m2[,3], "\n")
Third column of m2: 11 12 13 14 15 
cat("Element at fifth row and fourth column", m2[5,4])
Element at fifth row and fourth column 20

Subsetting can either be performed using indices as shown above or using the row and column names.

print(m1)
   C1 C2 C3 C4 C5
R1  1  2  3  4  5
R2  6  7  8  9 10
cat("The C3 column of m1 has values:", m1[,"C3"])
The C3 column of m1 has values: 3 8
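
Two further details of subsetting are worth a quick sketch: selecting a single row or column returns a plain vector unless drop = FALSE is given, and a logical condition can be used in place of indices. The name m_sub is used here only for illustration.

```r
m_sub <- matrix(1:25, nrow = 5)

# A single row is returned as a plain vector (no dimensions)
row2 <- m_sub[2, ]
print(dim(row2))                # NULL

# drop = FALSE keeps the result as a 1 x 5 matrix
row2_mat <- m_sub[2, , drop = FALSE]
print(dim(row2_mat))            # 1 5

# A logical condition selects the elements that satisfy it
print(m_sub[m_sub > 22])        # 23 24 25
```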

7.2.4 Matrix operations

The matrix data type makes it easy to perform common matrix operations such as addition, multiplication, and transposition. The diag function can be used to extract the diagonal of a given matrix. The same function can also be used to create a matrix with specific elements along its diagonal.

m3 <- matrix(1:9, nrow=3)
print(m3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
# diagonal elements
print(diag(m3)) 
[1] 1 5 9
# transpose of a matrix
print(t(m3)) 
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
# multiplication by a scalar
print(m3*5) 
     [,1] [,2] [,3]
[1,]    5   20   35
[2,]   10   25   40
[3,]   15   30   45
# matrix addition
print(m3+m3) 
     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18
# element-wise multiplication
print(m3*m3) 
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
# matrix multiplication
print(m3%*%m3) 
     [,1] [,2] [,3]
[1,]   30   66  102
[2,]   36   81  126
[3,]   42   96  150
# identity matrix
print(diag(1, nrow=3, ncol=3)) 
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
# diagonal matrix built from the diagonal of m3
print(diag(diag(m3)))
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    5    0
[3,]    0    0    9

When a matrix is multiplied by a vector using *, the vector is recycled down the columns until it matches the number of elements in the matrix, and the multiplication is then done element by element. In the code below the vector c(1:3) behaves like a 3x3 matrix whose every column is 1 2 3, so the first row of m3 is multiplied by 1, the second by 2, and the third by 3.

print(m3*c(1:3))
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   27
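
To make the recycling explicit, the sketch below writes out the recycled vector as a full matrix and checks that the result is the same. It also shows one way, under the same rules, to scale each column rather than each row. The names m_rec, recycled, and by_col are illustrative.

```r
m_rec <- matrix(1:9, nrow = 3)

# The recycled vector written out explicitly: every column is 1 2 3
recycled <- matrix(1:3, nrow = 3, ncol = 3)
print(identical(m_rec * 1:3, m_rec * recycled))

# To scale each column instead, fill the multiplier row-wise
by_col <- matrix(1:3, nrow = 3, ncol = 3, byrow = TRUE)
print(m_rec * by_col)
```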

7.3 Arrays

Arrays are an extension of matrices such that they can have more than two dimensions. Arrays are also homogeneous. The array function is used to create an array. It takes two arguments – the data that make up the array and the dimensions of the array. The keyword argument dim takes a vector in which the first two values specify the number of rows and columns and subsequent values specify the size of the higher dimensions. Note that the product of the values in dim should be equal to the total number of elements; if there are fewer elements they are repeated to fill up the array, and if there are more the extra ones are dropped. And R won’t give any warning! The code below creates a 3D array, which can be thought of as two matrices. Elements in an array can be accessed by subsetting.

v1 <- c(1:18)
arr_1 <- array(v1, dim = c(3,3,2))
print(arr_1)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18
print("Element at second row, third column of the second matrix is:") 
[1] "Element at second row, third column of the second matrix is:"
print(arr_1[2,3,2])
[1] 17
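
The silent recycling mentioned above is easy to reproduce. The following sketch uses made-up names (arr_short, dims, vals) and shows a simple length check that can be done before calling array.

```r
# 4 values for 2 x 3 x 2 = 12 cells: the data are recycled without a warning
arr_short <- array(1:4, dim = c(2, 3, 2))
print(arr_short)

# A simple guard: compare the data length with the product of the dimensions
dims <- c(2, 3, 2)
vals <- 1:4
print(length(vals) == prod(dims))
```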

An array’s dimensions can be changed using the dim function.

dim(arr_1) <- c(3,2,3)
print(arr_1)
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12

, , 3

     [,1] [,2]
[1,]   13   16
[2,]   14   17
[3,]   15   18
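
Changing dim only changes how the same sequence of values is laid out; the values themselves keep their order. The sketch below checks this and also shows that, unlike array, assigning to dim fails when the new dimensions do not match the number of elements. The names arr_re, before, and result are illustrative.

```r
arr_re <- array(1:18, dim = c(3, 3, 2))
before <- as.vector(arr_re)

# Reshaping changes only the dim attribute, not the order of the data
dim(arr_re) <- c(3, 2, 3)
print(identical(before, as.vector(arr_re)))

# The new dimensions must multiply to the number of elements;
# otherwise the assignment stops with an error
result <- tryCatch({
  dim(arr_re) <- c(4, 4)
  "reshaped"
}, error = function(e) "error")
print(result)
```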

7.3.1 Naming arrays

The different dimensions of an array can be named, similar to matrices. The list passed to dimnames must have one vector of names per dimension, and the length of each vector must match the size of that dimension. An array with named dimensions can be subsetted using names as well. The code below extracts all the values in Row_b from both matrices.

arr_1 <- array(v1, dim = c(3,3,2))
dimnames(arr_1) <- list(paste("Row",letters[1:3], sep="_"), paste("Col",letters[1:3], sep="_"),c("3D_1","3D_2"))
print(arr_1)
, , 3D_1

      Col_a Col_b Col_c
Row_a     1     4     7
Row_b     2     5     8
Row_c     3     6     9

, , 3D_2

      Col_a Col_b Col_c
Row_a    10    13    16
Row_b    11    14    17
Row_c    12    15    18
print(arr_1["Row_b",,])
      3D_1 3D_2
Col_a    2   11
Col_b    5   14
Col_c    8   17

7.3.2 Array operations

temp1 <- array(c(1:3), dim = c(3,3,2))
print(temp1)
, , 1

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

, , 2

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

The syntax for mathematical operations on arrays is similar to that for matrices. When we multiply an array by a vector, the vector is recycled to the dimensions of the array, as temp1 above illustrates, and the multiplication is done element by element.

print(arr_1*c(1:3))
, , 3D_1

      Col_a Col_b Col_c
Row_a     1     4     7
Row_b     4    10    16
Row_c     9    18    27

, , 3D_2

      Col_a Col_b Col_c
Row_a    10    13    16
Row_b    22    28    34
Row_c    36    45    54

We can execute a function over the margins of an array using the apply function. This function takes at least three arguments – an array, a margin (the dimension or dimensions to keep in the result), and the function that is to be applied. In the code below, the mean function is applied to an array with two different margins. The first call keeps the rows and columns, so it averages each row-column cell across the third dimension. The second call keeps only the third dimension, so it averages all the values within each of the two matrices.

print(arr_1)
, , 3D_1

      Col_a Col_b Col_c
Row_a     1     4     7
Row_b     2     5     8
Row_c     3     6     9

, , 3D_2

      Col_a Col_b Col_c
Row_a    10    13    16
Row_b    11    14    17
Row_c    12    15    18
arr_1.mean_cols <- apply(arr_1, c(1,2), mean)
print(arr_1.mean_cols)
      Col_a Col_b Col_c
Row_a   5.5   8.5  11.5
Row_b   6.5   9.5  12.5
Row_c   7.5  10.5  13.5
arr_1.mean_D <- apply(arr_1, 3, mean)
print(arr_1.mean_D)
3D_1 3D_2 
   5   14 
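
The effect of the margin is perhaps easiest to see with sum. The sketch below, on an illustrative array arr_ap, applies the same function with three different margins.

```r
arr_ap <- array(1:18, dim = c(3, 3, 2))

# MARGIN = 1: one result per row, pooling columns and layers
print(apply(arr_ap, 1, sum))        # 51 57 63

# MARGIN = 3: one result per layer (matrix)
print(apply(arr_ap, 3, sum))        # 45 126

# MARGIN = c(1, 2): one result per row-column cell, pooling layers
print(apply(arr_ap, c(1, 2), sum))
```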

7.4 Data frame

A data frame is a fundamental data structure used by various data science packages, so it is important to understand its syntax. Like matrices and tables, a data frame is a collection of elements in two dimensions, i.e. rows and columns; unlike a matrix, its columns can be of different data types. A data frame in R can be created using the data.frame function, which in its basic form takes the data to be added to the data frame. The data can be a set of vectors, lists, or a matrix. For example, below we create two dataframes, one from two vectors and another from a matrix.

v1 <- c(1:5) 
df1 <- data.frame(v1,v1**2)  
m1 <- matrix(c(1:15), nrow = 5, ncol = 3) 
df2 <- data.frame(m1)

print(df1)
print(df2)

The data.frame function has a row.names argument that can be used to specify the names for the rows in a dataframe. Column names can be specified by giving names for the vectors while creating a dataframe. Alternatively, the row names and column names can be specified after creating the data frame using the rownames and colnames functions. Let’s revisit the code above, this time with the addition of row names and column names in two different ways!

v1 <- c(1:5) 
df1 <- data.frame("Number"=v1,"Square"=v1**2, row.names = c("R1","R2","R3","R4", "R5"))  

m1 <- matrix(c(1:15), nrow = 5, ncol = 3) 
df2 <- data.frame(m1) 
rownames(df2) <- c("R1","R2","R3","R4", "R5") 
colnames(df2) <- c("C1", "C2", "C3")
print(df1)
   Number Square
R1      1      1
R2      2      4
R3      3      9
R4      4     16
R5      5     25
print(df2)
   C1 C2 C3
R1  1  6 11
R2  2  7 12
R3  3  8 13
R4  4  9 14
R5  5 10 15

The columns in a dataframe can be accessed by their names using the $ operator.

print(df1$Square)
[1]  1  4  9 16 25
print(df2$C2)
[1]  6  7  8  9 10

7.4.1 Subsetting a dataframe

We can select specific rows and columns from a dataframe by subsetting it with required indices. To do this, within the square brackets specify the row range and column range separated by a comma. E.g. to get 2nd and 3rd rows and 1st and 2nd columns the subset would be [2:3,1:2]. Similarly, to get 4th and 5th rows and all the columns, we’ll use [4:5,].

print(df2[2:3,1:2])
   C1 C2
R2  2  7
R3  3  8
print(df2[4:5,])
   C1 C2 C3
R4  4  9 14
R5  5 10 15
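
Besides numeric indices, a dataframe can be subsetted by row and column names or by a logical condition. The sketch below rebuilds a small dataframe under the illustrative name df_ex so that it can be run on its own.

```r
df_ex <- data.frame(C1 = 1:5, C2 = 6:10, C3 = 11:15,
                    row.names = c("R1", "R2", "R3", "R4", "R5"))

# Rows and columns selected by name
print(df_ex[c("R2", "R3"), c("C1", "C2")])

# Rows selected by a logical condition on a column
print(df_ex[df_ex$C2 > 8, ])

# A single column comes back as a vector unless drop = FALSE is given
print(df_ex[, "C3"])
print(df_ex[, "C3", drop = FALSE])
```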

7.4.2 Modifying a dataframe

To add a column to a dataframe, simply assign a vector to a new column in the dataframe. Note that the vector length should be the same as the number of rows in the dataframe (a single value is repeated for every row). We can also use the cbind function for this. Similarly, rows can be added to a dataframe using the rbind function.

The cbind and rbind functions can be used to combine multiple dataframes as well.

m1 <- matrix(c(1:15), nrow = 5, ncol = 3) 
df2 <- data.frame(m1) 
rownames(df2) <- c("R1","R2","R3","R4", "R5") 
colnames(df2) <- c("C1", "C2", "C3")
#Adding a column 
#using column name 
df2$C4 <- c(16:20) 
print(df2)
   C1 C2 C3 C4
R1  1  6 11 16
R2  2  7 12 17
R3  3  8 13 18
R4  4  9 14 19
R5  5 10 15 20
#Adding a column 
#using cbind 
df2 <- cbind(df2,C5=c(21:25)) 
print(df2)
   C1 C2 C3 C4 C5
R1  1  6 11 16 21
R2  2  7 12 17 22
R3  3  8 13 18 23
R4  4  9 14 19 24
R5  5 10 15 20 25
#Adding a row using rbind 
df3 <- rbind(df2,R6=letters[1:5]) 
print(df3)
   C1 C2 C3 C4 C5
R1  1  6 11 16 21
R2  2  7 12 17 22
R3  3  8 13 18 23
R4  4  9 14 19 24
R5  5 10 15 20 25
R6  a  b  c  d  e

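To remove row(s) or column(s) from a dataframe we can use negative indices, i.e. select everything except the given rows or columns. E.g., to remove the 6th row and the 4th and 5th columns the subset would be [-6,-(4:5)]. Note that adding the character row R6 converted every column of df3 to the character type, and removing that row does not convert the columns back to numbers.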

print(df3)
   C1 C2 C3 C4 C5
R1  1  6 11 16 21
R2  2  7 12 17 22
R3  3  8 13 18 23
R4  4  9 14 19 24
R5  5 10 15 20 25
R6  a  b  c  d  e
df3 <- df3[-6,-(4:5)] 
print(df3)
   C1 C2 C3
R1  1  6 11
R2  2  7 12
R3  3  8 13
R4  4  9 14
R5  5 10 15
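
The printed output does not show column types, so it is worth checking them with class after adding or removing rows. The following sketch uses a small illustrative dataframe (df_num, df_chr, and df_back are made-up names) to show the effect of appending a character row.

```r
df_num <- data.frame(C1 = 1:3, C2 = 4:6)
print(class(df_num$C1))           # "integer"

# Appending a character row converts every column to character
df_chr <- rbind(df_num, c("a", "b"))
print(class(df_chr$C1))           # "character"

# Dropping the row leaves the columns as character;
# as.integer() converts them back one column at a time
df_back <- df_chr[-4, ]
df_back$C1 <- as.integer(df_back$C1)
df_back$C2 <- as.integer(df_back$C2)
print(class(df_back$C1))          # "integer"
```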

7.5 The apply family of functions

When we need to run a function with each element of a collection as an argument we can use the lapply function. You can think of it as running a for loop and calling the function at each step, with the difference that the looping is handled for us, which makes the code more concise and, in some cases, faster than an explicit for (or while) loop. The lapply function takes two arguments – X, the data, and FUN, the function that needs to be applied. It returns a list of the same length as the input (since the function is applied to each element of the input).

get_square <- function(x){
  return(x**2)
}
v1 <- c(1:5)
lapply(v1, get_square)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25
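
To see the correspondence with a loop, the sketch below computes the same squares both ways and checks that the results agree. The names vals, squares_loop, and squares_lapply are illustrative.

```r
get_square <- function(x) {
  x^2
}
vals <- 1:5

# Loop version: pre-allocate a list, then fill it element by element
squares_loop <- vector("list", length(vals))
for (i in seq_along(vals)) {
  squares_loop[[i]] <- get_square(vals[i])
}

# lapply version: the same result in a single call
squares_lapply <- lapply(vals, get_square)
print(identical(squares_loop, squares_lapply))

# unlist() flattens the list into a plain vector
print(unlist(squares_lapply))
```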

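The sapply function works like lapply, with the difference that it tries to simplify the result: instead of always returning a list, it returns a vector or a matrix when the return values all have the same length. Like lapply, it accepts a vector, a list, or a dataframe as input. Since a dataframe is a list of columns, the function is applied column-wise. To apply a function along the rows or the columns of a matrix or a dataframe, the apply function seen earlier can be used with the MARGIN argument.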

get_square <- function(x){
  return(x**2)
}
v1 <- c(1:5)
v2 <- c(6:10)
sapply(data.frame(v1,v2), get_square)
     v1  v2
[1,]  1  36
[2,]  4  49
[3,]  9  64
[4,] 16  81
[5,] 25 100
sapply(data.frame(v1,v2), mean)
v1 v2 
 3  8 
# mean along rows
m1 <- matrix(c(1:15), nrow = 5, ncol = 3) 
df2 <- data.frame(m1) 
apply(df2, FUN = mean, MARGIN = 1)
[1]  6  7  8  9 10
# mean along columns
m1 <- matrix(c(1:15), nrow = 5, ncol = 3) 
df2 <- data.frame(m1) 
apply(df2, FUN = mean, MARGIN = 2)
X1 X2 X3 
 3  8 13 
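
The difference between the return types of lapply and sapply can be seen side by side. This is a minimal sketch on an illustrative dataframe df_sq.

```r
df_sq <- data.frame(v1 = 1:5, v2 = 6:10)

# lapply always returns a list, one component per column
res_list <- lapply(df_sq, mean)
print(class(res_list))            # "list"

# sapply simplifies results of length 1 into a named vector
res_vec <- sapply(df_sq, mean)
print(res_vec)

# Longer results of equal length are simplified into a matrix
res_mat <- sapply(df_sq, range)
print(res_mat)
```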

The tapply function offers an additional feature of subsetting vectors based on a factor variable. This is useful to apply the required function category-wise. This function takes three arguments – the data on which the function is to be applied, the factors (categorical variables), and the function that is to be applied. In the code below the mean function is applied to calculate average marks for the two programs.

marks <- c(40,50,60,60,80,70)
program <- c("UG","UG","UG","PG","PG","PG")
results <- data.frame(marks, program)
tapply(results$marks, results$program, FUN = mean)
PG UG 
70 50
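
Any function that summarises a vector can be passed to tapply. The sketch below uses max and then spells out the same grouping idea with split followed by sapply, which may help in seeing what tapply does internally at a conceptual level. The names marks_ex, program_ex, and groups are illustrative.

```r
marks_ex   <- c(40, 50, 60, 60, 80, 70)
program_ex <- c("UG", "UG", "UG", "PG", "PG", "PG")

# Highest mark in each program
print(tapply(marks_ex, program_ex, max))

# The same grouping written out: split by category, then apply
groups <- split(marks_ex, program_ex)
print(sapply(groups, mean))
```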