Data visualization in R using ggplot2
R has a beautiful set of plotting capabilities that allow it to produce publication-quality graphs very easily and quickly. A commonly used package for making graphs in R is called ggplot2. This “Grammar of Graphics” package (hence ‘gg’) has become very popular because it has great documentation, tutorials, and cheatsheets (see the Resources links at the bottom), and because it is fairly easy to learn and use.
So let’s get started. Load up the package:
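A minimal sketch of that step, assuming ggplot2 is already installed (if it isn't, install.packages("ggplot2") should take care of that):

```r
# Attach ggplot2 so that ggplot(), the geom_ functions, and the mpg dataset are available.
# Assumes the package has already been installed on this machine.
library(ggplot2)
```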
Before starting, there is a major assumption here for making plots: that your data is already cleaned up and tidy, ready for plotting and analysis. If it isn’t, finish that part of your work first!
Let’s look over two datasets. I’m using two here because one contains a lot of continuous data (swiss) and the other contains more discrete data (mpg). Use colnames() to look at the column names of your dataset:
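For the two datasets mentioned above, that could look like the following. This is a small sketch: swiss ships with base R, while mpg comes with ggplot2, so the package has to be loaded first.

```r
library(ggplot2)  # needed for the mpg dataset

# swiss: mostly continuous variables (built into R's datasets package)
colnames(swiss)

# mpg: more discrete variables such as class, drv, and cyl (bundled with ggplot2)
colnames(mpg)
```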
Alright, let’s do some simple plotting here. Standard plots include:
- Line graph
- Scatterplot
- Scatterplot with a regression/smoothing line
- Barplot
- Boxplot
The nice thing about ggplot2 is that it is based on layers. You start with the base ggplot() function and use + to add additional layers with the geom_ commands. Each layer command ends in the type of thing it draws; so a line graph would be geom_line(), a scatterplot would be geom_point(), a bar plot would be geom_bar(), and so on. The order in which you add the geom_ layers dictates the order in which they are drawn on the final plot, so later layers sit on top of earlier ones. The other thing to use in ggplot2 is the aes() command, which stands for the aesthetics… or rather, what data and values you actually want to plot. So aes(x = Height, y = Weight) would put Height on the x-axis and Weight on the y-axis. Let’s try it out.
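As a rough sketch of how the layers stack up, here is the Height/Weight idea using R's built-in women dataset. That dataset is my choice for illustration, and its columns are spelled in lower case (height, weight).

```r
library(ggplot2)

# Base layer: which data to use and which columns go on which axis
ggplot(women, aes(x = height, y = weight)) +
  geom_line() +   # first layer: a line through the observations
  geom_point()    # second layer: points drawn on top of the line
```

Swapping the order of the two geom_ lines would draw the line over the points instead.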
Common plots
Line graph: Fertility by Agriculture
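One way this could be written, assuming the heading means Fertility on the x-axis and Agriculture on the y-axis (swap the two inside aes() if you want it the other way around):

```r
library(ggplot2)

# Line graph of two continuous columns from the swiss dataset
ggplot(swiss, aes(x = Fertility, y = Agriculture)) +
  geom_line()
```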
Scatterplot: Education by Examination
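A sketch of the scatterplot, again assuming the first variable named goes on the x-axis:

```r
library(ggplot2)

# Scatterplot: one point per row (province) of the swiss dataset
ggplot(swiss, aes(x = Education, y = Examination)) +
  geom_point()
```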
Scatterplot with regression/smoothing line: Education by Examination
Using loess smoothing line:
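A possible version, with the axis assignment assumed as before. geom_smooth() adds the fitted curve and, by default, a shaded confidence band around it.

```r
library(ggplot2)

# Scatterplot plus a loess (local regression) smoothing line
ggplot(swiss, aes(x = Education, y = Examination)) +
  geom_point() +
  geom_smooth(method = "loess")
```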
… or a simple linear regression line:
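The only change needed is the method argument. A sketch under the same axis assumption:

```r
library(ggplot2)

# Scatterplot plus a straight line fitted with lm() (simple linear regression)
ggplot(swiss, aes(x = Education, y = Examination)) +
  geom_point() +
  geom_smooth(method = "lm")
```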
Barplot: Number of vehicle types (class)
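A minimal sketch: geom_bar() counts the rows in each group by default, so only the x aesthetic is needed.

```r
library(ggplot2)

# Bar height = number of vehicles in each class
ggplot(mpg, aes(x = class)) +
  geom_bar()
```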
Boxplot: Vehicle type (class) by highway miles/gallon (hwy)
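Something along these lines should work, with the grouping variable on the x-axis and the continuous variable on the y-axis:

```r
library(ggplot2)

# One box per vehicle class, showing the spread of highway miles per gallon
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()
```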
Subdividing your plot:
Let’s plot drive type (4-wheel, front, rear) by highway mpg by number of cylinders.
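The ‘facet_grid()’ function specifies which variable to subdivide by, in this case ‘cyl’. By coding (~ cyl), the figure will subdivide by columns, whereas if you coded it as (cyl ~ .), it will subdivide by rows. The dot is a placeholder meaning “nothing on this side”; R will not accept a formula with an empty right-hand side.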
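Here is a sketch of the faceted plot. The choice of a boxplot is my assumption; any geom_ could be faceted the same way.

```r
library(ggplot2)

# Highway mpg by drive type, with one panel (column) per number of cylinders
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  facet_grid(~ cyl)  # use facet_grid(cyl ~ .) to stack the panels as rows instead
```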
There are dozens of types of layers (geom_) that you can use and the documentation is incredible! So if there is a plot you want to make, you can almost certainly do it in R!
Customizing your plots:
Default, using density plot (which shows the distribution of a continuous variable, useful for assessing skewness):
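A sketch of a default density plot. The variables (hwy filled by drv) and the alpha transparency are my assumptions for illustration.

```r
library(ggplot2)

# Distribution of highway mpg, one semi-transparent density curve per drive type
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)
```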
Note: fill tells ggplot2 which variable to use when filling in groups with a colour.
To use a different set of colours, add one of the scale_ group of commands. Since fill is used, it would be scale_fill_, and since one of the colour palettes is called brewer, it turns into scale_fill_brewer(). You can see the different choices for palettes by running ?scale_fill_brewer to look at the help file.
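For example, the density plot could be recoloured roughly like this. The palette name "Dark2" is just one of the ColorBrewer palettes listed in the help file, picked here for illustration.

```r
library(ggplot2)

# Same kind of density plot, but with a ColorBrewer palette for the fill colours
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2")
```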
Similarly, you can use the scale_ function to make changes to the legend.
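A sketch of what that might look like: name sets the legend title and labels replaces the text next to each colour key. The wording of the title and labels is made up for this example.

```r
library(ggplot2)

ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(
    palette = "Dark2",
    name = "Drive type",                     # legend title
    labels = c("4-wheel", "Front", "Rear")   # same order as the drv values: "4", "f", "r"
  )
```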
Using labs() will allow you to modify or add titles to the figure.
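For instance (the title and axis text below are placeholders of my own):

```r
library(ggplot2)

ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Highway fuel economy by drive type",
    x = "Highway miles per gallon",
    y = "Density",
    fill = "Drive type"   # title of the fill legend
  )
```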
And to customize individual features of the plot, you use theme(). The theme() options are quite extensive, so if you want to look more into it, check out ?theme or the very detailed documentation here. There is a nice graphic right above the “Complete and incomplete theme objects” section, near the bottom of the web document.
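A small sketch of theme() in use. The three settings shown are arbitrary picks from the many options listed in ?theme.

```r
library(ggplot2)

ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5) +
  theme(
    legend.position = "bottom",                          # move the legend below the plot
    panel.grid.minor = element_blank(),                  # drop the minor grid lines
    axis.title = element_text(size = 12, face = "bold")  # larger, bold axis titles
  )
```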
Another example:
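One possibility, which is my own illustration, is to start from a complete theme such as theme_bw() and then adjust a single element on top of it:

```r
library(ggplot2)

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  theme_bw() +                                                # complete theme: white background, grey grid
  theme(axis.text.x = element_text(angle = 45, hjust = 1))   # tilt the class labels so they don't overlap
```

The theme() call has to come after theme_bw(), since a complete theme replaces any settings made before it.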
Finally, assign your figure code a name and save it as an individual file using ggsave().
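A sketch of that last step. The object name, file name, and dimensions are placeholders, and the file is written to R's temporary directory here so nothing lands in your project folder by accident.

```r
library(ggplot2)

# Give the figure a name instead of printing it straight away
hwy_density <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

# The file extension decides the format; width and height are in inches by default
out_file <- file.path(tempdir(), "hwy_density.pdf")
ggsave(out_file, plot = hwy_density, width = 6, height = 4)
```

To keep the file, replace out_file with a path inside your own project, such as "figures/hwy_density.pdf" (assuming that folder exists).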
Resources:
- ggplot2 cheatsheet (in RStudio ‘Help -> Cheatsheets -> Data Visualization’)
- ggplot2 documentation
- ggplot2 book
- Tutorial from a course
- DataCamp tutorials
- YouTube videos