Working with files is essential for applying coding skills to data analysis, because data almost always arrives in the form of files, and results usually need to be saved to a file as well. Files come in a variety of formats, so it helps to understand those formats and to know which libraries handle them. Here we’ll learn the basic concepts of file handling and also look at some of the libraries for working with well-defined file formats.
8.2 Connecting to a file
A file can be opened in different modes depending on the operations you need to perform. To open a file for reading, use the r mode; to write to a file connection, use the w mode. Keep in mind that when an existing file is opened in w mode, its contents are erased on opening. To write to an existing file while retaining its contents, use a, i.e. append mode. For all three modes, t or b can be appended to select text (the default) or binary mode, respectively; e.g., use rt to read in text mode and rb to read in binary mode. A + suffix on any of the three modes opens the file for both reading and writing, and +b does the same in binary mode (e.g. r+b). The table below summarizes the different modes for file handling and their uses.
| Mode | Purpose | Mode | Purpose | Mode | Purpose |
|---|---|---|---|---|---|
| r or rt | read text | rb | read binary | r+ / r+b | read and write. The file must exist. |
| w or wt | write text | wb | write binary | w+ / w+b | read and write. Remove existing contents. |
| a or at | append text | ab | append binary | a+ / a+b | read and append |
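The difference between the w and a modes can be seen in a minimal sketch like the one below. The file is a hypothetical scratch file created with tempfile, and the line contents are made up for illustration.

```r
# Hypothetical scratch file in the session's temporary directory
path <- tempfile(fileext = ".txt")

# "w" creates the file (or erases it if it already exists)
con <- file(path, open = "w")
writeLines("first line", con)
close(con)

# Opening with "w" again discards "first line"
con <- file(path, open = "w")
writeLines("replacement line", con)
close(con)

# "a" keeps the existing contents and adds to the end
con <- file(path, open = "a")
writeLines("appended line", con)
close(con)

# Read the file back to see what survived
contents <- readLines(path)
print(contents)
```

Only the replacement line and the appended line remain, since the second w open erased what the first one wrote.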
To get all the lines in a file as a character vector, use the readLines function with the name of the file as an argument. Optionally, we can also specify the number of lines to read.
v1 <- readLines("test123.txt")
cat(v1, sep = "\n")
Learning file handling in R.
This is the second line.
This is the third line.
v2 <- readLines("test123.txt", n = 2)
cat("\n**V2 has only first two lines.**", v2, sep = "\n")
**V2 has only first two lines.**
Learning file handling in R.
This is the second line.
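A file can also be read through an explicit connection opened in the read (r) mode with the file function.
con_read <- file("test123.txt", open = "r")
print(con_read)
A connection with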
description "test123.txt"
class "file"
mode "r"
text "text"
opened "opened"
can read "yes"
can write "no"
readLines(con_read)
[1] "Learning file handling in R." "This is the second line."
[3] "This is the third line."
close(con_read)
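One practical reason to keep a connection open is that readLines on an open connection continues from the current position instead of starting over. The sketch below assumes a temporary file with the same three lines as test123.txt; the variable names first and rest are illustrative.

```r
# Hypothetical temporary file mirroring the contents of test123.txt
path <- tempfile(fileext = ".txt")
writeLines(c("Learning file handling in R.",
             "This is the second line.",
             "This is the third line."), path)

con_read <- file(path, open = "r")

# Read only the first line; the connection remembers its position
first <- readLines(con_read, n = 1)

# A second call picks up where the first one stopped
rest <- readLines(con_read)

close(con_read)

print(first)
print(rest)
```

Passing the file name instead of an open connection would open and close the file on every call, so each call would start again from the first line.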
To write to a file, we first need to open it in the write (w) mode. If the file doesn’t exist, a new file is created. If it does exist, it is overwritten, i.e. all its contents are erased before writing. The writeLines function is used to write to a file connection opened in write mode. After writing the contents, the file connection must be closed using the close function.
con1 <- file("test1.txt", open = "w")
print(con1)
A connection with
description "test1.txt"
class "file"
mode "w"
text "text"
opened "opened"
can read "no"
can write "yes"
writeLines("This is a test file", con = con1)
writeLines("Another line in the file", con = con1)
close(con1)
readLines("test1.txt")
[1] "This is a test file" "Another line in the file"
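Because a forgotten close can leave a connection open, one possible pattern is to wrap the write in a function and register the close with on.exit. This is only a sketch: write_lines_safely is a hypothetical helper name, not a base R function, and the output goes to a temporary file.

```r
# Hypothetical helper: writes a character vector to a file and
# guarantees the connection is closed, even if writeLines fails
write_lines_safely <- function(lines, path) {
  con <- file(path, open = "w")
  on.exit(close(con), add = TRUE)  # runs when the function exits
  writeLines(lines, con = con)
  invisible(path)
}

out_path <- tempfile(fileext = ".txt")
write_lines_safely(c("This is a test file", "Another line in the file"),
                   out_path)

result <- readLines(out_path)
print(result)
```

Note that writeLines accepts a whole character vector, so several lines can be written in a single call.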
8.3 Utility functions
R has some useful functions to get information about files, directories, paths, etc. The getwd function returns the current working directory. The list.files and list.dirs functions return the files and the directories, respectively, under a given path, which defaults to the current working directory. Both functions have a keyword argument recursive, which defaults to FALSE for the former and TRUE for the latter. The pattern keyword argument of list.files can be used to list only selected files; it is interpreted as a regular expression, so to list only files with the .txt extension use list.files(pattern = "\\.txt$").
Options for list.files

| Keyword argument | Default value |
|---|---|
| path | . |
| pattern | NULL |
| all.files | FALSE |
| full.names | FALSE |
| recursive | FALSE |
| ignore.case | FALSE |
| include.dirs | FALSE |
| no.. | FALSE |
Options for list.dirs

| Keyword argument | Default value |
|---|---|
| path | . |
| full.names | TRUE |
| recursive | TRUE |
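The effect of these arguments can be tried out on a small directory tree. The sketch below builds a hypothetical tree under the session's temporary directory; the directory and file names are made up for illustration. Since pattern is treated as a regular expression, the .txt extension is matched with "\\.txt$".

```r
# Build a hypothetical directory tree in a temporary location
base_dir <- tempfile("file_demo_")
dir.create(file.path(base_dir, "sub"), recursive = TRUE)
file.create(file.path(base_dir, c("notes.txt", "data.csv")))
file.create(file.path(base_dir, "sub", "more.txt"))

# Current working directory of the R session
print(getwd())

# Default recursive = FALSE: only the top level is searched
top_txt <- list.files(base_dir, pattern = "\\.txt$")

# recursive = TRUE also descends into sub-directories
all_txt <- list.files(base_dir, pattern = "\\.txt$", recursive = TRUE)

# list.dirs is recursive by default; full.names = FALSE gives relative names
dirs <- list.dirs(base_dir, full.names = FALSE)

print(top_txt)
print(all_txt)
print(dirs)
```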
8.4 Working with directories
To iterate through the files in a directory, a for loop can be used over the vector returned by list.files; e.g., calling list.files with recursive = TRUE and a pattern matching the .txt extension yields all the .txt files in all the sub-directories.
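The loop described above could look like the following sketch. It assumes a hypothetical directory tree created under the temporary directory, and counting the lines in each file is just an illustrative task to perform inside the loop.

```r
# Hypothetical directory tree with two .txt files and one .csv file
base_dir <- tempfile("loop_demo_")
dir.create(file.path(base_dir, "a"), recursive = TRUE)
dir.create(file.path(base_dir, "b"))
writeLines(c("one", "two"), file.path(base_dir, "a", "first.txt"))
writeLines("three", file.path(base_dir, "b", "second.txt"))
writeLines("x,y", file.path(base_dir, "b", "skip.csv"))

# All .txt files in all sub-directories, with paths usable by readLines
txt_files <- list.files(base_dir, pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)

# Visit each file and record how many lines it has
line_counts <- integer(0)
for (f in txt_files) {
  n <- length(readLines(f))
  line_counts[basename(f)] <- n
  cat(basename(f), "has", n, "line(s)\n")
}
```

Setting full.names = TRUE matters here: without it, list.files returns paths relative to base_dir, which readLines could not open from a different working directory.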
There are dedicated functions for working with specific file types. For instance, the read.csv and write.csv functions are available for reading and writing csv files. The read.csv function returns a data.frame object in which the comma-separated values are stored as rows and columns.
csv_file <- read.csv("test1.csv")
csv_file
  S.No.  Name Grade
1     1   Sam     1
2     2  Mike     2
3     3 Rohan     1
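The write.csv function covers the opposite direction. As a minimal sketch, the data frame below reuses the values shown in the output above, writes them to a hypothetical temporary csv file, and reads them back. The row.names = FALSE argument is a choice made here so that the row numbers are not written as an extra column.

```r
# Data frame with the same values as the example output
grades <- data.frame(
  S.No. = 1:3,
  Name = c("Sam", "Mike", "Rohan"),
  Grade = c(1, 2, 1)
)

# Write to a hypothetical temporary file, without the row names
csv_path <- tempfile(fileext = ".csv")
write.csv(grades, csv_path, row.names = FALSE)

# Read the file back into a new data.frame
csv_back <- read.csv(csv_path)
print(csv_back)
```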
There are many libraries for working with image files; e.g., the magick library has various functions for image processing. It can be installed using install.packages('magick'). Similarly, the xlsx library can be used to work with Excel files.