8  File Handling

8.1 Introduction

Working with files is essential for applying coding skills to data analysis, because data almost always arrives in the form of files and results usually need to be saved to a file as well. Files come in a variety of formats, so it helps to understand those formats and to know which libraries handle them. Here we'll learn the basic concepts of file handling and also look at some of the libraries for working with well-defined file formats.

8.2 Connecting to a file

A file can be opened in different modes depending on the operations you need to perform. To open a file for reading, use the r mode; to open a file connection for writing, use the w mode. Keep in mind that when a file is opened in w mode and the file already exists, its contents are erased on opening. To write to an existing file while retaining its contents, use a, i.e. append mode. For all three modes, t or b can be appended to select text (the default) or binary mode, respectively. For example, use rt to open a file for reading in text mode and rb for reading in binary mode. A + suffix on any of the three modes opens the file for both reading and writing (or appending). To combine this with binary mode, add b after the +, e.g. r+b. The table below summarizes the different modes for file handling and their uses.

Mode        Purpose        Mode   Purpose          Mode         Purpose
r or rt     read text      rb     read binary      r+ or r+b    read and write; the file must exist
w or wt     write text     wb     write binary     w+ or w+b    read and write; existing contents are removed
a or at     append text    ab     append binary    a+ or a+b    read and append

To get all the lines of a file as a character vector, call the readLines function with the name of the file as its argument. Optionally, we can also specify the number of lines to read.

v1 <- readLines("test123.txt")
cat(v1,sep = "\n")
Learning file handling in R.
This is the second line.
This is the third line.
v2 <- readLines("test123.txt",2)
cat("\n**V2 has only first two lines.**",v2, sep = "\n")

**V2 has only first two lines.**
Learning file handling in R.
This is the second line.

File contents can also be read through a file connection, which is opened with the file function and must be closed after use.

con_read <- file("test123.txt", "r")
print(con_read)
A connection with                         
description "test123.txt"
class       "file"       
mode        "r"          
text        "text"       
opened      "opened"     
can read    "yes"        
can write   "no"         
readLines(con_read)
[1] "Learning file handling in R." "This is the second line."    
[3] "This is the third line."     
close(con_read)
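
One practical difference when reading through an open connection is that successive readLines calls continue from where the previous call stopped, so a file can be read in pieces. The sketch below illustrates this; it recreates the three-line file in a temporary location (the path variable is an assumption made for this example) so that it does not depend on test123.txt being present.

```r
# Recreate the three-line example file in a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("Learning file handling in R.",
             "This is the second line.",
             "This is the third line."), path)

con <- file(path, "r")
first <- readLines(con, n = 1)  # reads only the first line
rest  <- readLines(con)         # continues from the second line onwards
close(con)

print(first)
print(rest)
```

By contrast, calling readLines with a file name opens and closes the file on each call, so every call starts again from the first line.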

To write to a file, we first need to open it in write (w) mode. If the file doesn't exist, a new file is created. If the file does exist, it is overwritten, i.e. all its contents are erased before writing. The writeLines function is used to write to a file connection opened in write mode. After writing the contents, the file connection must be closed using the close function.

con1 <- file("test1.txt", open = "w")
print(con1)
A connection with                       
description "test1.txt"
class       "file"     
mode        "w"        
text        "text"     
opened      "opened"   
can read    "no"       
can write   "yes"      
writeLines("This is a test file", con = con1)
writeLines("Another line in the file", con = con1)
close(con1)
readLines("test1.txt")
[1] "This is a test file"      "Another line in the file"
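
The difference between the w and a modes described earlier in this section can be checked directly. The following is a minimal sketch that uses a temporary file (created with tempfile, an assumption made here so that no existing file is touched) and compares the file contents after appending and after re-opening in write mode.

```r
# Work in a temporary file so that no existing file is modified.
path <- tempfile(fileext = ".txt")

# "w" creates the file (or erases an existing one).
con <- file(path, open = "w")
writeLines("first line", con = con)
close(con)

# "a" keeps the existing contents and adds new lines at the end.
con <- file(path, open = "a")
writeLines("appended line", con = con)
close(con)
lines_after_append <- readLines(path)

# Re-opening with "w" discards everything written so far.
con <- file(path, open = "w")
writeLines("fresh start", con = con)
close(con)
lines_after_overwrite <- readLines(path)

print(lines_after_append)
print(lines_after_overwrite)
```

After the append step the file holds both lines; after the final write step only the last line remains.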

8.3 Utility functions

R has several useful functions for getting information about files, directories, and paths. The getwd function returns the current working directory. The list.files and list.dirs functions return the files and the directories, respectively, under a given path, which defaults to the current working directory. Both functions have a recursive argument, which defaults to FALSE for list.files and TRUE for list.dirs. The pattern argument of list.files restricts the listing to matching file names. It is interpreted as a regular expression rather than a shell wildcard, so to list only files with the .txt extension use list.files(pattern = "\\.txt$"); a wildcard such as "*.txt" can be converted to a regular expression with glob2rx.

Options for list.files

Keyword argument    Default value
path                "."
pattern             NULL
all.files           FALSE
full.names          FALSE
recursive           FALSE
ignore.case         FALSE
include.dirs        FALSE
no..                FALSE

Options for list.dirs

Keyword argument    Default value
path                "."
full.names          TRUE
recursive           TRUE
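
To see how these arguments interact, the sketch below builds a small throwaway directory tree and lists its contents in a few ways. The directory and file names (a.txt, b.csv, sub/c.txt) are hypothetical and exist only for this illustration.

```r
# Build a small temporary directory tree (hypothetical names).
root <- tempfile("file_handling_demo")
dir.create(file.path(root, "sub"), recursive = TRUE)
file.create(file.path(root, c("a.txt", "b.csv")))
file.create(file.path(root, "sub", "c.txt"))

# pattern is a regular expression: "\\.txt$" matches names ending in ".txt".
txt_top <- list.files(root, pattern = "\\.txt$")
txt_all <- list.files(root, pattern = "\\.txt$", recursive = TRUE)

# glob2rx() converts a shell-style wildcard into an equivalent regular expression.
txt_glob <- list.files(root, pattern = utils::glob2rx("*.txt"))

# full.names = FALSE returns names relative to the path instead of full paths.
dirs <- list.dirs(root, full.names = FALSE, recursive = TRUE)

print(txt_top)
print(txt_all)
print(dirs)
```

With recursive = TRUE, list.files also returns the files inside sub, with their path given relative to root.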

8.4 Working with directories

To iterate through the files in a directory, a for loop can be used; e.g., the code below lists the files with the .txt extension in each immediate sub-directory of the current working directory.

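# Loop over the immediate sub-directories of the working directory
for (x in list.dirs(recursive = FALSE)) {
  # pattern is a regular expression; "\\.txt$" matches names ending in .txt
  cat(x, "\t", list.files(x, pattern = "\\.txt$"), "\n")
}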

8.5 Working with specific file types

There are functions for working with specific file types. For instance, the read.csv and write.csv functions read and write csv files. The read.csv function returns a data.frame object in which the comma-separated values are stored as rows and columns.

csv_file <- read.csv("test1.csv")
csv_file
  S.No.  Name Grade
1     1   Sam     1
2     2  Mike     2
3     3 Rohan     1
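
The write.csv function works in the opposite direction. As a minimal sketch, the code below rebuilds the data frame shown above by hand, writes it to a temporary csv file (the path is an assumption made for this example), and reads it back.

```r
# Rebuild the example table as a data frame.
grades <- data.frame(
  S.No. = 1:3,
  Name  = c("Sam", "Mike", "Rohan"),
  Grade = c(1, 2, 1)
)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops R's row labels being written as an extra column.
write.csv(grades, csv_path, row.names = FALSE)

# Read the file back to confirm the round trip.
grades_back <- read.csv(csv_path)
print(grades_back)
```

Without row.names = FALSE, the row labels 1, 2, 3 would be written as an unnamed first column and would reappear as an extra column when the file is read again.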

There are many libraries for working with image files; e.g., the magick library has various functions for image processing. It can be installed using install.packages('magick'). Similarly, the xlsx library can be used to work with Excel files.