10  Data Visualization

10.1 Introduction

Effective data visualization is essential for understanding data. Done well, a graph is indeed worth a thousand words. As we have seen before, vectors can be plotted with the plot and lines functions. However, these functions are not designed to work with (large) dataframes. The tidyverse collection of packages includes ggplot2, which offers a robust way of performing visual analysis of data and is now a standard way of plotting graphs in R. It can be installed using install.packages("ggplot2").

10.2 The ggplot2 library

In its simplest form, the ggplot function takes two arguments: a dataset (generally a dataframe) and an aes object. If the dataset is not a dataframe, ggplot attempts to convert it to one. The aes function specifies which components of the dataframe constitute the aesthetics of the plot (more on this below). The ggplot function alone won’t generate the graph that we expect, because it only declares what data is to be plotted and does not specify how the data should be represented. For this we need to add a geom (geometric object) to the plot. A geom specifies the required representation of the data within the graph (e.g., scatter, line, or bar). For instance, to make a scatter plot, geom_point should be added to the ggplot call. Below, we use the iris data and plot petal width against petal length as a scatter plot.

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point()
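The point that ggplot on its own declares the data but draws nothing can be checked by counting the layers of a plot object. The following is a minimal sketch; the variable names (base, scatter, n_before, n_after) are illustrative, and it assumes the layers of a plot can be read through its layers element, as is the case in ggplot2 3.x.

```r
library(ggplot2)

# ggplot() alone only declares the data and the aesthetic mapping.
base <- ggplot(iris, aes(Petal.Length, Petal.Width))
n_before <- length(base$layers)   # layers present before any geom is added

# Adding a geom appends one layer that knows how to draw the data.
scatter <- base + geom_point()
n_after <- length(scatter$layers)

print(c(before = n_before, after = n_after))
```

Printing base on its own would show only the axes and an empty panel, while scatter shows the points.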

Suppose we want to color the points in the graph by the species of the Iris plant. This information is available in the iris dataset. Since we are now using an additional column of the dataframe, we need to modify the aes to get the required plot. Notice how the plot has changed: each species has a different color and a legend has been added.

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

It is generally a good idea not to specify the aes within the ggplot function, because that leaves us free to add multiple layers with different aesthetics and representations. In other words, the ggplot function should hold only the reference to the dataframe, and all other aspects of the graph, i.e. the aesthetics and the representation, should be added to it as layers. This gives us a lot of flexibility in rendering the final graph. All geom functions have a mapping keyword argument that can be used to specify the aesthetics. So an alternative, and better, way of plotting the graph above is to pass the aesthetics through the mapping argument of the geom function. This approach lets us display multiple representations of the data within one graph. The code below plots the squares and cubes of the numbers 1 to 10 in this way; it also shows an example of manually adding a legend and of modifying the x- and y-axis ticks.

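# df1 as printed below: the numbers 1 to 10 with their squares and cubes
df1 <- data.frame(Number = 1:10, Square = (1:10)^2, Cube = (1:10)^3)
df1

   Number Square Cube
1       1      1    1
2       2      4    8
3       3      9   27
4       4     16   64
5       5     25  125
6       6     36  216
7       7     49  343
8       8     64  512
9       9     81  729
10     10    100 1000

ggplot(data = df1) +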
  geom_point(mapping = aes(Number, Square)) +
  geom_line(mapping = aes(Number, Square,color="red")) +
  geom_point(mapping = aes(Number, Cube), pch=1) +
  geom_line(mapping = aes(Number, Cube, color="brown")) +
  labs(y="Value") +
  scale_color_identity(guide = "legend", name="Value",
                       labels = c("Cube", "Square")) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5))
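As a point of comparison, the same two curves can be drawn by first stacking the Square and Cube columns into a single "long" dataframe and mapping color to a column, in which case ggplot2 builds the legend itself. This is only a sketch of that alternative, not the approach used above; df_long, Value, Power and p_long are names invented for the illustration.

```r
library(ggplot2)

# Same numbers as df1 above, rebuilt here so the example is self-contained.
df1 <- data.frame(Number = 1:10, Square = (1:10)^2, Cube = (1:10)^3)

# Stack the two value columns into one long dataframe (hypothetical names):
# one row per (Number, Power) pair.
df_long <- data.frame(
  Number = rep(df1$Number, times = 2),
  Value  = c(df1$Square, df1$Cube),
  Power  = rep(c("Square", "Cube"), each = nrow(df1))
)

# Mapping color to the Power column lets ggplot2 create the legend.
p_long <- ggplot(data = df_long) +
  geom_line(mapping = aes(Number, Value, color = Power)) +
  geom_point(mapping = aes(Number, Value, color = Power)) +
  scale_color_manual(values = c(Square = "red", Cube = "brown"))
p_long
```

The trade-off is an extra reshaping step in exchange for not having to write the legend labels by hand.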

10.3 Aesthetics

Once we have defined the data and the coordinate system, the next layer in a ggplot is the aesthetics of the plot. The aesthetics describe the visual appearance of the graph; this includes marker size and shape, linetype, and color/fill. The corresponding keyword argument of the aes function can be modified to get the desired plot. Below are the options for the marker shapes and line styles. A color can be specified as a string with the name of the required color or as a hex value. An important point to note is that the aesthetics can also be driven by the data. For instance, to color the data points by some categorical variable in the data, color can be set to that categorical column. The same concept applies to marker size, which can vary with a column of the data instead of being the same for all data points. Below is a list of all the aesthetics that can be adjusted using the corresponding keywords of the aes function.

adj, alpha, angle, bg, cex, col, color, colour, fg, fill, group
hjust, label, linetype, lower, lty, lwd, max, middle, min, pch
radius, sample, shape, size, srt, upper, vjust, weight, width, x
xend, xmax, xmin, xintercept, y, yend, ymax, ymin, yintercept, z
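The difference between giving an aesthetic a fixed value and driving it from the data can be sketched as follows. The hex color and the choice of Sepal.Length for the marker size are arbitrary picks for illustration; p_set and p_map are hypothetical names.

```r
library(ggplot2)

# Fixed value: a constant given outside aes() applies to every point.
# The color is a hex string here; a name such as "steelblue" would also work.
p_set <- ggplot(data = iris) +
  geom_point(mapping = aes(Petal.Length, Petal.Width), color = "#1F77B4")

# Data-driven: inside aes(), color and size follow columns of the dataframe,
# so ggplot2 chooses the colors and sizes and adds legends for them.
p_map <- ggplot(data = iris) +
  geom_point(mapping = aes(Petal.Length, Petal.Width,
                           color = Species, size = Sepal.Length))
p_set
p_map
```

Note that a constant placed inside aes(), such as aes(color = "red"), is treated as data (a single category labelled "red") rather than as the color red, unless an identity scale is used as in the earlier example.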

Below are the different marker shapes corresponding to the numeric value for the shape keyword argument of the aes function.
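The shapes can also be drawn directly in R. The sketch below assumes the usual numeric shape codes 0 to 25 and lays them out on a small grid; the dataframe name shapes, its columns, and the grid layout are choices made for this illustration.

```r
library(ggplot2)

# One row per shape code, arranged on a grid with six columns.
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 6,
  y = 0:25 %/% 6
)

p_shapes <- ggplot(data = shapes, mapping = aes(x, y)) +
  # scale_shape_identity() makes ggplot2 use the numbers as shape codes as-is
  geom_point(mapping = aes(shape = shape), size = 4, fill = "grey70") +
  # print each code next to its marker
  geom_text(mapping = aes(label = shape), nudge_x = 0.3, size = 3) +
  scale_shape_identity() +
  theme_void()
p_shapes
```

The fill argument only affects shapes 21 to 25, which have a separate outline and interior.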

The size of a marker and the width of a line can be adjusted using the size and linewidth (or lwd) arguments respectively. Both of these keyword arguments take a number.
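A short sketch of both arguments in use is given below. It assumes ggplot2 3.4.0 or later, where linewidth is available for lines; the dataframe df_sq and the particular numbers are made up for the illustration.

```r
library(ggplot2)

# Small illustrative dataset (hypothetical): the first five squares.
df_sq <- data.frame(Number = 1:5, Square = (1:5)^2)

p_size <- ggplot(data = df_sq) +
  # linewidth sets the thickness of the line
  geom_line(mapping = aes(Number, Square), linewidth = 1.5, linetype = "dashed") +
  # size sets how large the point markers are drawn
  geom_point(mapping = aes(Number, Square), size = 4, shape = 17)
p_size
```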

10.4 Geometric Objects

The next important component, or layer, of a graph is the geometric object, i.e. the required representation of the data. A geom function declares this representation, e.g. whether we would like a scatter plot (geom_point), a line plot (geom_line), or a bar plot (geom_col). Since each of these functions adds a layer to the plot, it is straightforward to have multiple representations in one plot. The aesthetics of each geom layer can be adjusted individually through its mapping argument and the aes function. Version 3.4.3 of ggplot2 contains 53 geoms, as follows:

geom_abline, geom_area, geom_bar, geom_bin_2d, geom_bin2d, geom_blank, geom_boxplot
geom_col, geom_contour, geom_contour_filled, geom_count, geom_crossbar, geom_curve
geom_density, geom_density_2d, geom_density_2d_filled, geom_density2d, geom_density2d_filled
geom_dotplot, geom_errorbar, geom_errorbarh, geom_freqpoly, geom_function, geom_hex, geom_histogram
geom_hline, geom_jitter, geom_label, geom_line, geom_linerange, geom_map, geom_path
geom_point, geom_pointrange, geom_polygon, geom_qq, geom_qq_line, geom_quantile, geom_raster
geom_rect, geom_ribbon, geom_rug, geom_segment, geom_sf, geom_sf_label, geom_sf_text
geom_smooth, geom_spoke, geom_step, geom_text, geom_tile, geom_violin, geom_vline
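Since the set of geoms changes between releases, it can be useful to list the ones provided by the installed version. One possible way, sketched here, is to search the attached package for names that start with geom_; the variable name geoms is illustrative, and the count obtained will depend on the ggplot2 version.

```r
library(ggplot2)

# Names of all exported objects in the attached ggplot2 package
# that start with "geom_".
geoms <- ls("package:ggplot2", pattern = "^geom_")

length(geoms)   # number of geoms in the installed version
head(geoms)     # first few names
```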

10.5 Theme

To enhance the appearance of a graph, we can use the theme function. For example, we can change the background color of the graph and alter the grid lines, as shown below. Note how we can create a graph object and then change its theme as required. Different aspects of the legend can also be modified with the theme function. For example, in the code below, the legend.position keyword argument is used to change the position of the legend. The labs function is used to set or modify graph labels such as the x- and y-axis labels, title, subtitle, etc.

p <- ggplot(data = iris) + 
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species)) 
p + theme(panel.background = element_rect(fill = "azure")) +
  theme(panel.grid.major = element_line(colour = "grey")) + 
  theme(legend.position = "bottom") +
  labs(x = "Petal Length", y = "Petal Width")
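The three theme calls above do not have to be separate: theme accepts several settings at once. The sketch below passes the same three settings in a single call, which should give the same plot; p_themed is a hypothetical name.

```r
library(ggplot2)

p <- ggplot(data = iris) +
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species))

# One theme() call carrying all three settings.
p_themed <- p +
  theme(
    panel.background = element_rect(fill = "azure"),
    panel.grid.major = element_line(colour = "grey"),
    legend.position  = "bottom"
  ) +
  labs(x = "Petal Length", y = "Petal Width")
p_themed
```

Keeping the settings in separate calls can still be handy when they are added conditionally or at different points in a script.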

Apart from manually altering the theme of a plot, we can also use the built-in themes to “automatically” set various graph options for visual appeal. In addition, packages like ggthemes provide an additional set of well-known themes. Below are two examples of the themes available in ggthemes: theme_clean() and theme_wsj().

library(ggthemes)
p <- ggplot(data = iris) + 
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species)) 
p + theme_clean()

p + theme_wsj()
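The built-in themes mentioned earlier need no extra package. As a brief sketch, theme_bw() and theme_minimal() are two of the complete themes that ship with ggplot2 itself; p_bw and p_minimal are hypothetical names.

```r
library(ggplot2)

p <- ggplot(data = iris) +
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species))

# Complete themes shipped with ggplot2; each call sets the whole theme at once.
p_bw      <- p + theme_bw()        # white panel with a border
p_minimal <- p + theme_minimal()   # no panel background or border
p_bw
p_minimal
```

A further theme() call can be added after a complete theme to adjust individual elements.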

10.6 Facets

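Faceting allows us to make subplots within a single plot. The facet_grid function splits a plot into panels based on the values of a variable in the dataframe. This faceting can be done row-wise or column-wise. In the formula passed to facet_grid, the variable before ~ defines the rows and the variable after it defines the columns; a dot means no split in that direction.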

ggplot(data = iris) + 
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species)) +
  facet_grid(. ~ Species) + 
  theme(legend.position = "none") +
  labs(title = "Facet into Columns")

ggplot(data = iris) + 
  geom_point(mapping = aes(Petal.Length, Petal.Width, shape = Species, color = Species)) +
  facet_grid(Species ~ .) + 
  theme(legend.position = "none") +
  labs(title = "Facet into rows")
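Rows and columns can also be faceted at the same time by naming a variable on each side of the formula. Since iris has only one categorical column, this sketch switches to the mpg dataset that ships with ggplot2 and uses its drv and cyl columns; the choice of dataset and columns is an assumption made for the illustration.

```r
library(ggplot2)

# mpg ships with ggplot2; drv (drive train) and cyl (number of cylinders)
# are two of its columns.
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy)) +
  facet_grid(drv ~ cyl) +   # rows split by drv, columns split by cyl
  labs(title = "Facet into rows and columns")
p_facets
```

Combinations of drv and cyl that do not occur in the data appear as empty panels.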

10.7 The esquisse package

The esquisse package provides a graphical interface that generates ggplot code for plotting some data. This is a great utility for those who are new to the ggplot syntax, since it allows the code for a plot to be generated interactively. The package can be installed by running install.packages("esquisse"). Once installed, launch the GUI by executing esquisse::esquisser(). In the window that pops up, select the dataframe (it should already be present in the R session) that you would like to plot. Next, drag and drop columns onto the different ggplot aesthetics, such as the columns for the x and y axes and the column for coloring the plot. Select the desired representation (geom) and adjust other parameters using the options in the bottom panel. Finally, click on the code button on the bottom right to get the ggplot code for the graph, which can be inserted directly into an R script.

Figure: Screenshot for Esquisser.

The code for the graph shown in the screenshot above, as generated by esquisser, is given below.

# The dataframe has data for 
# India and United States only. 
ggplot(df_covid_IUS) +
  aes(
    x = date,
    y = total_cases,
    colour = location,
    size = new_cases
  ) +
  geom_point(shape = "circle") +
  scale_color_hue(direction = 1) +
  theme_minimal()