So far we have seen various data types that can store a collection of elements. All of these data types (vectors, lists, and factors) are one dimensional. R has another set of data types that can store multiple elements in more than one dimension. These data types are well suited to handling “real-world” data, that is, data that can be represented as a table with rows and columns. Some of them can even go beyond two dimensions! Let’s look at how these multi-dimensional data types are constructed and used.
A matrix, as the name suggests, is a two-dimensional collection of elements. Matrices are essentially an extension of vectors (which are one dimensional) into two dimensions. Just like vectors, matrices are homogeneous, i.e. all the elements in a matrix are of the same data type. To create a matrix we can use the matrix() function with a collection of elements as an argument. When we print the class of the matrix object the output is matrix array. That’s because all matrices are arrays; the difference is that a matrix has exactly two dimensions while an array can have any number of dimensions (see below). To check the dimensions of a matrix object, use the dim() function.
m1 <- matrix(1:5)
print(m1)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
cat("The class of m1 is:", class(m1),"\n")
The class of m1 is: matrix array
cat("The dimensions of m1 is:", dim(m1))
The dimensions of m1 is: 5 1
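To see how a matrix differs from the vector it is built from, the following minimal sketch compares the two. The names v and m are used only for illustration.

```r
# A plain vector has no dim attribute; a matrix does.
v <- 1:5
m <- matrix(v)

print(is.null(dim(v)))   # vectors carry no dimensions
print(is.matrix(m))      # m is a matrix ...
print(is.array(m))       # ... and every matrix is also an array
print(length(m))         # same five elements, now arranged as 5 x 1
```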
We can also create a matrix of desired dimensions by using the nrow and ncol keyword arguments of the matrix() function. By default the elements in a matrix are filled column-wise; to fill row-wise instead, set the Boolean keyword argument byrow to TRUE.
m1 <- matrix(1:10, nrow=2)
print(m1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
cat("The dimensions of m1 is:", dim(m1), "\n")
The dimensions of m1 is: 2 5
m1 <- matrix(1:10, nrow=2, byrow=TRUE)
print(m1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
cat("The dimensions of m1 is:", dim(m1))
The dimensions of m1 is: 2 5
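One way to reason about the two fill orders is that filling row-wise gives the same result as filling column-wise and then transposing. The sketch below checks this for the 2 x 5 case; by_row and by_col_t are illustrative names.

```r
# Row-wise fill of a 2 x 5 matrix ...
by_row <- matrix(1:10, nrow = 2, byrow = TRUE)

# ... equals the transpose of a column-wise filled 5 x 2 matrix.
by_col_t <- t(matrix(1:10, ncol = 2))
print(all(by_row == by_col_t))

# Only one of nrow/ncol has to be given; the other is inferred
# from the length of the data.
print(dim(matrix(1:10, ncol = 5)))
```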
Just as we can assign names to the elements of a vector, we can name the dimensions of a matrix. This can be done when creating a matrix using the dimnames argument of the matrix() function. This keyword argument takes a list of two vectors: the first holds the row names and the second the column names. We can also assign names after a matrix has been created using the dimnames() function. The same function, called without an assignment, returns the current row and column names.
dimnames(m1) <- list(c("R1", "R2"), c("C1", "C2", "C3", "C4", "C5"))
print(m1)
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
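The other two uses of dimnames mentioned above, naming at creation time and reading the names back, can be sketched as follows. The name m_named is only illustrative.

```r
# Name the rows and columns while creating the matrix.
m_named <- matrix(1:10, nrow = 2, byrow = TRUE,
                  dimnames = list(c("R1", "R2"),
                                  c("C1", "C2", "C3", "C4", "C5")))

# dimnames() returns a list: row names first, then column names.
print(dimnames(m_named))

# rownames() and colnames() return one set of names at a time.
print(rownames(m_named))
print(colnames(m_named))
```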
A matrix can also be created by combining a set of vectors. For this, all the vectors should be of equal length. The vectors can be combined column-wise using cbind() or row-wise using rbind().
v1 <- letters[1:5]
v2 <- c(1:5)
mat_r <- rbind(v1,v2)
mat_c <- cbind(v1,v2)
print(mat_r)
[,1] [,2] [,3] [,4] [,5]
v1 "a" "b" "c" "d" "e"
v2 "1" "2" "3" "4" "5"
print(mat_c)
v1 v2
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
In the output above, notice that the numbers have been converted to characters. That’s because matrices are homogeneous: the two vectors that were combined had different data types (v1 is character and v2 is integer), so all the values were coerced to the character data type.
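The coercion can be checked directly, and the numeric values can be recovered with an explicit conversion. This is a minimal sketch; v_chr, v_int, mat_mixed and recovered are illustrative names.

```r
v_chr <- letters[1:5]
v_int <- 1:5
mat_mixed <- rbind(v_chr, v_int)

# The whole matrix is stored as character.
print(typeof(mat_mixed))

# The numeric row can be recovered by converting it back explicitly.
recovered <- as.integer(mat_mixed["v_int", ])
print(recovered)
print(sum(recovered))
```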
We can initialize a matrix filled with any scalar value e.g. a matrix of zeros or a matrix of some text.
mat_zeros <- matrix(0,2,3)
print(mat_zeros)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
mat_hi <- matrix("hi",3,2)
print(mat_hi)
[,1] [,2]
[1,] "hi" "hi"
[2,] "hi" "hi"
[3,] "hi" "hi"
m2 <- matrix(1:25, nrow=5)
print(m2)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
print("First two rows of m2")
[1] "First two rows of m2"
print(m2[1:2,])
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
cat("Second row of m2:", m2[2,], "\n")
Second row of m2: 2 7 12 17 22
cat("Third column of m2:", m2[,3], "\n")
Third column of m2: 11 12 13 14 15
cat("Element at fifth row and fourth column", m2[5,4])
Element at fifth row and fourth column 20
Subsetting can either be performed using indices as shown above or using the row and column names.
print(m1)
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
cat("The C3 columns of m1 has values:", m1[,"C3"])
The C3 columns of m1 has values: 3 8
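Names can be used for rows and columns at the same time, and several names can be passed at once. The sketch below rebuilds a named 2 x 5 matrix (called m_named here purely for illustration) so that it runs on its own.

```r
m_named <- matrix(1:10, nrow = 2, byrow = TRUE,
                  dimnames = list(c("R1", "R2"), paste0("C", 1:5)))

# A single element by row name and column name.
print(m_named["R2", "C4"])

# Several columns at once; the result is still a matrix.
print(m_named[, c("C1", "C5")])

# Selecting a single row returns a named vector by default.
print(m_named["R1", ])
```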
The matrix data type makes it easy to perform common matrix operations such as addition, multiplication, and transposition. The diag() function can be used to extract the diagonal of a given matrix. The same function can also be used to create a matrix with specific elements along its diagonal.
m3 <- matrix(1:9, nrow=3)
print(m3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# diagonal elements
print(diag(m3))
[1] 1 5 9
# transpose of a matrix
print(t(m3))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
# multiplication by a scalar
print(m3*5)
[,1] [,2] [,3]
[1,] 5 20 35
[2,] 10 25 40
[3,] 15 30 45
# matrix addition
print(m3+m3)
[,1] [,2] [,3]
[1,] 2 8 14
[2,] 4 10 16
[3,] 6 12 18
# element-wise multiplication
print(m3*m3)
[,1] [,2] [,3]
[1,] 1 16 49
[2,] 4 25 64
[3,] 9 36 81
# matrix multiplication (matrix product)
print(m3%*%m3)
[,1] [,2] [,3]
[1,] 30 66 102
[2,] 36 81 126
[3,] 42 96 150
# identity matrix
print(diag(1, nrow=3, ncol=3))
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
# zero matrix with diagonal values from m3
print(diag(diag(m3)))
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 5 0
[3,] 0 0 9
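The difference between the * operator (element by element) and the %*% operator (matrix product) is easy to miss. The following sketch, which uses the illustrative names m and id3, contrasts the two with an identity matrix.

```r
m <- matrix(1:9, nrow = 3)
id3 <- diag(3)   # 3 x 3 identity matrix

# The matrix product with the identity leaves the matrix unchanged.
print(all(m %*% id3 == m))

# The element-wise product with the identity keeps only the diagonal.
print(m * id3)

# Entry [1, 1] of the matrix product is row 1 times column 1:
# 1*1 + 4*2 + 7*3
print(sum(m[1, ] * m[, 1]))
print((m %*% m)[1, 1])
```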
When a matrix is multiplied by a vector using *, R recycles the vector to build a matrix with the same dimensions as the matrix being multiplied, and then multiplies the two element by element. In the code below the vector c(1:3) is interpreted as a 3x3 matrix in which each of the three columns holds the values 1 2 3.
print(m3*c(1:3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 10 16
[3,] 9 18 27
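The recycling can be made explicit by building the implied matrix by hand. This is a sketch under the assumption that the matrix is 3 x 3, as in the example; the names m and implied are illustrative, and sweep() is shown as one possible way to scale columns instead of rows.

```r
m <- matrix(1:9, nrow = 3)

# Recycling fills an implied matrix column by column with 1, 2, 3.
implied <- matrix(rep(1:3, times = 3), nrow = 3)
print(all(m * 1:3 == m * implied))

# Net effect: row i of m is multiplied by the i-th element of the vector.
print(all((m * 1:3)[2, ] == m[2, ] * 2))

# To scale the columns instead, sweep() over margin 2 is one option.
print(sweep(m, 2, 1:3, "*"))
```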
Arrays are an extension of matrices that can have more than two dimensions. Arrays are also homogeneous. The array() function is used to create an array. It takes two arguments: the data that make up the array and the dimensions of the array. The keyword argument dim takes a vector in which the first two values specify the number of rows and columns, and subsequent values specify the sizes of the higher dimensions. Note that the product of the values in dim should equal the total number of elements; otherwise the elements are repeated to fill up the array, and R won’t give any warning! The code below creates a 3D array, which can be thought of as two matrices. Elements in an array can be accessed by subsetting.
v1 <- c(1:18)
arr_1 <- array(v1, dim = c(3,3,2))
print(arr_1)
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
print("Element at second row, third column of the second matrix is:")
[1] "Element at second row, third column of the second matrix is:"
print(arr_1[2,3,2])
[1] 17
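Because array() recycles its data silently, a small guard can help catch a mismatch between the data and the requested dimensions. The helper strict_array below is hypothetical and not part of R; it is only a sketch of one way to do such a check.

```r
# Hypothetical helper: refuse to build an array unless the data fills it exactly.
strict_array <- function(data, dim) {
  if (length(data) != prod(dim)) {
    stop("length(data) is ", length(data),
         " but prod(dim) is ", prod(dim))
  }
  array(data, dim = dim)
}

ok <- strict_array(1:18, dim = c(3, 3, 2))
print(dim(ok))

# array(1:3, dim = c(3, 3, 2)) would recycle without complaint;
# the helper stops with an error message instead.
result <- tryCatch(strict_array(1:3, dim = c(3, 3, 2)),
                   error = function(e) conditionMessage(e))
print(result)
```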
An array’s dimensions can be changed using the dim() function.
dim(arr_1) <- c(3,2,3)
print(arr_1)
, , 1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
, , 3
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18
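Unlike array(), assigning to dim() does not recycle: the new dimensions have to account for exactly the same number of elements. The sketch below shows both a valid and an invalid reshape; arr and msg are illustrative names, and the error text printed is our own rather than R's.

```r
arr <- array(1:18, dim = c(3, 3, 2))

# Reshaping keeps the elements in the same column-major order.
dim(arr) <- c(3, 2, 3)
print(arr[, , 1])

# The new dimensions must multiply to the same number of elements (18 here);
# otherwise R raises an error and leaves the array unchanged.
msg <- tryCatch({
  dim(arr) <- c(4, 4, 2)
  "no error"
}, error = function(e) "error: dimensions do not match the length")
print(msg)
print(dim(arr))
```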
The dimensions of an array can be named in the same way as those of a matrix. The list passed to dimnames() must have one vector of names for each dimension of the array, and the length of each vector must match the size of that dimension. An array with named dimensions can be subsetted using those names as well. The code below extracts all the values in Row_b across the other two dimensions.
arr_1 <- array(v1, dim = c(3,3,2))
dimnames(arr_1) <- list(paste("Row",letters[1:3], sep="_"), paste("Col",letters[1:3], sep="_"),c("3D_1","3D_2"))
print(arr_1)
, , 3D_1
Col_a Col_b Col_c
Row_a 1 4 7
Row_b 2 5 8
Row_c 3 6 9
, , 3D_2
Col_a Col_b Col_c
Row_a 10 13 16
Row_b 11 14 17
Row_c 12 15 18
print(arr_1["Row_b",,])
3D_1 3D_2
Col_a 2 11
Col_b 5 14
Col_c 8 17
temp1 <- array(c(1:3), dim = c(3,3,2))
print(temp1)
, , 1
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
, , 2
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
The syntax for mathematical operations on arrays is similar to that for matrices. When we multiply an array by a vector, the vector’s values are recycled to build a new array with the same dimensions as the array being multiplied, and the two are then multiplied element by element.
print(arr_1*c(1:3))
, , 3D_1
Col_a Col_b Col_c
Row_a 1 4 7
Row_b 4 10 16
Row_c 9 18 27
, , 3D_2
Col_a Col_b Col_c
Row_a 10 13 16
Row_b 22 28 34
Row_c 36 45 54
We can execute a function over the elements of an array using the apply() function. This function takes at least three arguments: an array, a margin (the dimension or dimensions that are kept in the result), and the function to be applied. In the code below, the mean() function is applied to an array with two different margins: first to calculate the mean of each row-column cell across the third dimension, and second to calculate the mean of each matrix along the third dimension.
print(arr_1)
, , 3D_1
Col_a Col_b Col_c
Row_a 1 4 7
Row_b 2 5 8
Row_c 3 6 9
, , 3D_2
Col_a Col_b Col_c
Row_a 10 13 16
Row_b 11 14 17
Row_c 12 15 18
arr_1.mean_cols <- apply(arr_1, c(1,2), mean)
print(arr_1.mean_cols)
Col_a Col_b Col_c
Row_a 5.5 8.5 11.5
Row_b 6.5 9.5 12.5
Row_c 7.5 10.5 13.5
# MARGIN = 3 keeps only the third dimension
arr_1.mean_D <- apply(arr_1, 3, mean)
print(arr_1.mean_D)
3D_1 3D_2
5 14
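To make the role of the margin concrete, the following sketch compares three margins on the same 3 x 3 x 2 array and checks one of them against a manual calculation. The variable names are illustrative.

```r
arr <- array(1:18, dim = c(3, 3, 2))

# MARGIN = c(1, 2): one mean per (row, column) cell, averaged over the slices.
cell_means <- apply(arr, c(1, 2), mean)
print(all(cell_means == (arr[, , 1] + arr[, , 2]) / 2))

# MARGIN = 3: one mean per slice along the third dimension.
slice_means <- apply(arr, 3, mean)
print(slice_means)

# MARGIN = 1: one mean per row, pooling all columns and slices.
row_means <- apply(arr, 1, mean)
print(row_means)
```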
A data frame is a fundamental data structure used by various data science packages, so it is important to understand its syntax. Like matrices and tables, a data frame is a two-dimensional collection of elements arranged in rows and columns. A data frame in R can be created using the data.frame() function, which in its basic form takes the data to be added to the data frame. The data can be a set of vectors, lists, or a matrix. For example, below we create two dataframes, one from two vectors and another from a matrix.
v1 <- c(1:5)
df1 <- data.frame(v1,v1**2)
m1 <- matrix(c(1:15), nrow = 5, ncol = 3)
df2 <- data.frame(m1)
print(df1)
print(df2)
The data.frame() function has a row.names argument that can be used to specify the names of the rows in a dataframe. Column names can be specified by naming the vectors while creating the dataframe. Alternatively, the row names and column names can be set after creating the dataframe using the rownames() and colnames() functions. Let’s revisit the code above, this time adding row names and column names in these two different ways!
v1 <- c(1:5)
df1 <- data.frame("Number"=v1,"Square"=v1**2, row.names = c("R1","R2","R3","R4", "R5"))

m1 <- matrix(c(1:15), nrow = 5, ncol = 3)
df2 <- data.frame(m1)
rownames(df2) <- c("R1","R2","R3","R4", "R5")
colnames(df2) <- c("C1", "C2", "C3")
print(df1)
Number Square
R1 1 1
R2 2 4
R3 3 9
R4 4 16
R5 5 25
print(df2)
C1 C2 C3
R1 1 6 11
R2 2 7 12
R3 3 8 13
R4 4 9 14
R5 5 10 15
The columns in a dataframe can be accessed by their names using the $ operator.
print(df1$Square)
[1] 1 4 9 16 25
print(df2$C2)
[1] 6 7 8 9 10
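The $ operator is not the only way to reach a column by name. As a brief sketch (df is an illustrative dataframe rebuilt here so the example runs on its own), double brackets and ordinary two-index subsetting give the same vector, while single brackets with just a name keep the dataframe structure.

```r
df <- data.frame(Number = 1:5, Square = (1:5)^2)

# Three equivalent ways to pull out one column as a vector.
by_dollar <- df$Square
by_double <- df[["Square"]]
by_index  <- df[, "Square"]
print(identical(by_dollar, by_double) && identical(by_double, by_index))

# Single brackets with just a name return a one-column dataframe.
print(class(df["Square"]))
```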
We can select specific rows and columns from a dataframe by subsetting it with the required indices. To do this, specify the row range and the column range within square brackets, separated by a comma. E.g., to get the 2nd and 3rd rows and the 1st and 2nd columns the subset would be [2:3,1:2]. Similarly, to get the 4th and 5th rows and all the columns, we’ll use [4:5,].
print(df2[2:3,1:2])
C1 C2
R2 2 7
R3 3 8
print(df2[4:5,])
C1 C2 C3
R4 4 9 14
R5 5 10 15
To add a column to a dataframe, simply assign a vector to a new column name in the dataframe. Note that the vector length must be the same as the number of rows in the dataframe. We can also use the cbind() function for this. Similarly, rows can be added to a dataframe using the rbind() function.
The cbind() and rbind() functions can be used to combine multiple dataframes as well.
m1 <- matrix(c(1:15), nrow = 5, ncol = 3)
df2 <- data.frame(m1)
rownames(df2) <- c("R1","R2","R3","R4", "R5")
colnames(df2) <- c("C1", "C2", "C3")
#Adding a column
#using column name
df2$C4 <- c(16:20)
print(df2)
C1 C2 C3 C4
R1 1 6 11 16
R2 2 7 12 17
R3 3 8 13 18
R4 4 9 14 19
R5 5 10 15 20
#Adding a column
#using cbind
df2 <- cbind(df2,C5=c(21:25))
print(df2)
C1 C2 C3 C4 C5
R1 1 6 11 16 21
R2 2 7 12 17 22
R3 3 8 13 18 23
R4 4 9 14 19 24
R5 5 10 15 20 25
#Adding a row using rbind
df3 <- rbind(df2,R6=letters[1:5])
print(df3)
C1 C2 C3 C4 C5
R1 1 6 11 16 21
R2 2 7 12 17 22
R3 3 8 13 18 23
R4 4 9 14 19 24
R5 5 10 15 20 25
R6 a b c d e
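One side effect of adding a row of text to numeric columns is worth checking: each column of a dataframe holds a single type, so the affected columns stop being numeric. The sketch below uses a smaller illustrative dataframe (df and df_chr are not names from the text above).

```r
df <- data.frame(C1 = 1:3, C2 = 4:6)
print(sapply(df, class))          # both columns start out as integer

# Adding a row of text changes the type of every affected column.
df_chr <- rbind(df, R4 = c("a", "b"))
print(sapply(df_chr, class))

# Such a column is no longer numeric, so arithmetic on it
# needs an explicit conversion first.
print(is.numeric(df_chr$C1))
```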
To remove row(s) or column(s) from a dataframe we can use a negative subset, i.e. select only the elements which are not part of the subset. E.g., to remove the 6th row and the 4th and 5th columns the negative subset would be [-6,-(4:5)].
print(df3)
C1 C2 C3 C4 C5
R1 1 6 11 16 21
R2 2 7 12 17 22
R3 3 8 13 18 23
R4 4 9 14 19 24
R5 5 10 15 20 25
R6 a b c d e
df3 <- df3[-6,-(4:5)]
print(df3)
C1 C2 C3
R1 1 6 11
R2 2 7 12
R3 3 8 13
R4 4 9 14
R5 5 10 15
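Negative subsetting works with positions but not with names. When only the names of the rows or columns to drop are known, one possible approach is to build the set of names to keep. This is a sketch with illustrative names; drop = FALSE is used so that a single remaining column is still returned as a dataframe.

```r
df <- data.frame(C1 = 1:3, C2 = 4:6, C3 = 7:9,
                 row.names = c("R1", "R2", "R3"))

# Negative indices refer to positions.
by_position <- df[-1, -(2:3), drop = FALSE]
print(by_position)

# To drop by name, keep everything that is not in the unwanted set.
keep_cols <- setdiff(colnames(df), c("C2", "C3"))
keep_rows <- !(rownames(df) %in% "R1")
by_name <- df[keep_rows, keep_cols, drop = FALSE]
print(by_name)
```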
apply family of functions
When we need to run a function with each element of a collection as an argument we can use the lapply() function. You can think of it as running a for loop that calls the function on each element, written more compactly and often running at least as fast as an explicit for (or while) loop. The lapply() function takes two arguments: the data, and FUN, the function that needs to be applied. It returns a list of the same length as the input vector (since the function is applied to each element of the input).
get_square <- function(x){
  return(x**2)
}
v1 <- c(1:5)
lapply(v1, get_square)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
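The comparison with a for loop can be spelled out. In the sketch below the loop and the lapply() call produce the same list; out_loop and out_lapply are illustrative names.

```r
get_square <- function(x) x^2
v <- 1:5

# Explicit loop: pre-allocate a list and fill it element by element.
out_loop <- vector("list", length(v))
for (i in seq_along(v)) {
  out_loop[[i]] <- get_square(v[i])
}

# lapply() does the same in a single call.
out_lapply <- lapply(v, get_square)
print(identical(out_loop, out_lapply))

# unlist() flattens the list back into a plain vector.
print(unlist(out_lapply))
```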
The sapply() function works like lapply(), with the difference that it tries to simplify its result: instead of a list it returns a vector or a matrix, depending on the dimensions of the values returned by the function. When the input is a dataframe, the function is applied column-wise.
get_square <- function(x){
  return(x**2)
}
v1 <- c(1:5)
v2 <- c(6:10)
sapply(data.frame(v1,v2), get_square)
v1 v2
[1,] 1 36
[2,] 4 49
[3,] 9 64
[4,] 16 81
[5,] 25 100
sapply(data.frame(v1,v2), mean)
v1 v2
3 8
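The difference in return type between lapply() and sapply() can be seen side by side. The sketch also shows vapply(), a related base R function that behaves like sapply() but checks every result against a template; the variable names are illustrative.

```r
df <- data.frame(v1 = 1:5, v2 = 6:10)

# lapply() always returns a list, one element per column.
as_list <- lapply(df, mean)
print(class(as_list))

# sapply() simplifies the same result to a named numeric vector.
as_vec <- sapply(df, mean)
print(as_vec)

# vapply() checks that each result is a single number.
checked <- vapply(df, mean, FUN.VALUE = numeric(1))
print(isTRUE(all.equal(as_vec, checked)))
```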
# mean along rows
m1 <- matrix(c(1:15), nrow = 5, ncol = 3)
df2 <- data.frame(m1)
apply(df2, FUN = mean, MARGIN = 1)
[1] 6 7 8 9 10
# mean along columns
m1 <- matrix(c(1:15), nrow = 5, ncol = 3)
df2 <- data.frame(m1)
apply(df2, FUN = mean, MARGIN = 2)
X1 X2 X3
3 8 13
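For the specific case of means, base R also provides rowMeans() and colMeans(), which give the same values as the two apply() calls above. A short check, using an illustrative dataframe df:

```r
df <- data.frame(matrix(1:15, nrow = 5, ncol = 3))

# MARGIN = 1 works along rows, MARGIN = 2 along columns.
print(all(apply(df, 1, mean) == rowMeans(df)))
print(all(apply(df, 2, mean) == colMeans(df)))

print(rowMeans(df))
print(colMeans(df))
```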
The tapply() function offers the additional feature of splitting a vector into groups based on a factor variable, which is useful for applying a function category-wise. This function takes three arguments: the data on which the function is to be applied, the factor (categorical variable) that defines the groups, and the function to be applied. In the code below the mean() function is applied to calculate the average marks for the two programs.
marks <- c(40,50,60,60,80,70)
program <- c("UG","UG","UG","PG","PG","PG")
results <- data.frame(marks, program)
tapply(results$marks, results$program, FUN = mean)
PG UG
70 50
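Any function that summarises a vector can be used in place of mean(). The sketch below reuses the same marks and programs, and also uses split() to show the groups themselves, which can help when checking a tapply() result by hand.

```r
marks <- c(40, 50, 60, 60, 80, 70)
program <- c("UG", "UG", "UG", "PG", "PG", "PG")

# Highest mark and number of students per program.
print(tapply(marks, program, FUN = max))
print(tapply(marks, program, FUN = length))

# split() returns the grouped values as a list, one element per program.
groups <- split(marks, program)
print(groups)

# Applying mean() to each group gives the same values as tapply().
print(sapply(groups, mean))
```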