This document is an introduction to plotting with ggplot. ggplot is a powerful tool for plotting almost anything you can think of, and it lets you modify every aspect of your plot, which makes it much more flexible than the plot function in base R. I hope this document helps you understand how ggplots are built. That should make it easier to see why problems occur when plotting and to solve them.

This document has two parts. The first part is a short introduction to the layers of a ggplot and what they do. The second part suggests additional packages that can help you improve your graphics. Examples with code and graphs are provided throughout the document.

1 The layers of ggplot

A ggplot consists of different layers that are combined into a graph. Each layer adds extra information to the plot. In most cases you can add several layers of the same type, e.g. two different geometries can be added to create one plot with two types of graphs in it. The different layers and their content are described briefly below, and examples follow afterwards.

  • Data: The data input for your plot can be in different formats, either a full dataset where each row is an observation, or summarized data, e.g. counts or percentages for different groups.
  • Mapping of data: The mapping of data is defined with the function aes(). It describes which part of your data is used for what, e.g. which variables go on the x and y axes.
  • Geometries: The geometry defines how the data should be drawn, e.g. as points, lines or bars.
  • Statistics: Statistics (stats) can be used to transform the original data into other measures, e.g. from counts to percentages. Stats are commonly used when the data input is not in the form you wish to plot, e.g. if you plot directly from a dataset where group percentages are not pre-calculated.
  • Scales: Scales control how data values are translated into visual properties, treating them as e.g. continuous or discrete. They serve many purposes, including translating categories into colors and defining axes and limits.
  • Coordinates: Coordinates define the physical mapping of the data onto the plotting area. They can be used to choose which part of the plot is shown and are very useful for geographic plots.
  • Facets: This feature can be used to split a plot by other variables, e.g. for analysing subgroups.
  • Theme: Themes can be used to modify anything that is unrelated to your data, including labels, fonts, text size, background colors etc.
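
As a rough illustration of how the layers in the list stack on top of each other, here is a minimal sketch. It uses the built-in mtcars dataset rather than the data used in the rest of this document, and the choice of variables is only an assumption made for the illustration.

```r
library(ggplot2)

# Illustration only: mtcars ships with R and is not the dataset used elsewhere here
layers_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) +           # data and mapping
  geom_point(aes(colour = factor(cyl))) +                       # geometry
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +     # geometry with a statistic (linear fit)
  scale_colour_brewer(name = "Cylinders", palette = "Dark2") +  # scale
  coord_cartesian(ylim = c(10, 35)) +                           # coordinates
  facet_wrap(~ am) +                                            # facets
  theme_minimal()                                               # theme

layers_demo
```

Each line after ggplot() can be removed or swapped without touching the others, which is what makes the layered approach convenient.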

1.1 Data and mapping of data

Data is most often defined in ggplot(), but the mapping of data with aes() can be written either in ggplot() or in the geometries. If all geometries use the same mapping, it does not matter whether aes() is defined in ggplot() or in the geometry. If they use different mappings, aes() has to be defined in each geometry.

Here a plot with two geometries is written in two different ways; both result in the same plot. If the geometries rely on different data sources, the data must be defined in each geometry. You can see an example in the section Maps.

#A
cloud1<-ggplot(dat,aes(x = age, y = bmi))+
  geom_point(alpha=0.3)+
  geom_density_2d()
#B
cloud2<-ggplot(dat)+  
  geom_point(aes(x = age, y = bmi), alpha=0.3)+
  geom_density_2d(aes(x = age, y = bmi)) 
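
A layer can also bring its own data and its own mapping. The following is a minimal sketch with simulated data; the objects people and means and the plot name cloud3 are hypothetical and only serve the illustration.

```r
library(ggplot2)

# Hypothetical data for illustration; 'people' and 'means' are not part of the original dataset
set.seed(1)
people <- data.frame(age = runif(200, min = 30, max = 80))
people$bmi <- 22 + 0.05 * people$age + rnorm(200, sd = 3)
people$decade <- floor(people$age / 10) * 10

# One row per age decade with the mean BMI
means <- aggregate(bmi ~ decade, data = people, FUN = mean)

cloud3 <- ggplot(people, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.3) +                   # inherits data and mapping from ggplot()
  geom_point(data = means,                    # this layer brings its own data ...
             aes(x = decade + 5, y = bmi),    # ... and its own mapping
             colour = "red", size = 4)
cloud3
```

The first layer draws all 200 simulated observations, while the second draws one red point per age decade, placed at the middle of the decade.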

1.2 Geometries

Geometry is the layer that defines how your data should be drawn. There are specific geometry functions for almost any type of plot, and a few examples are shown here. As you can see, more than one geometry can be used in each plot.

#A
boxplot<-ggplot(dat, aes(x=age))+
  geom_boxplot()
#B 
plot<-ggplot(dat,aes(x=sex))+
  geom_bar()
#C
plot1<-ggplot(dat,aes(x = age, y = bmi))+  
  geom_point(alpha=0.3)+
  geom_density_2d(colour='lightyellow')
#D
plot2<-ggplot(dat,aes(x = bmi, color=educ))+
  geom_histogram(bins=50, fill='white', alpha=0.5, position='identity')

1.3 Statistics

The statistics layer (stats) can be used to transform the data, e.g. counts into percentages or raw values into averages. You need the statistics layer when you have not done any pre-calculations on your data. You can also calculate the needed values before plotting, in which case the statistics layer is not needed. An example of how a bar plot with average values can be made using the two methods is shown here. As you can see, the two methods result in identical plots.

#A
barplot1<-ggplot(dat, aes(x=sex, y=age, fill=diabetes))+ #input of data is raw and includes no pre-calculations
  stat_summary(fun.data=mean_sdl, geom='bar', position=position_dodge()) # we use stats to calculate the mean age

#B
x<-dat[,.(mean_age=mean(age)), keyby=.(sex,diabetes)] 
# we calculate the needed variables before adding data to the ggplot

barplot2<-ggplot(x,aes(x=sex, y=mean_age, fill=diabetes))+ 
  geom_col(position=position_dodge()) 
# statistics are already calculated in the data, so we can use geom_col() - no need for stats
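
The other transformation mentioned above, from counts to percentages, can be sketched in a similar way. This is only an illustration with a hypothetical four-row dataset, and it assumes a ggplot2 version that provides after_stat() (3.3.0 or later).

```r
library(ggplot2)

# Hypothetical data for illustration: three men and one woman
toy <- data.frame(sex = c("Male", "Male", "Male", "Female"))

share_plot <- ggplot(toy, aes(x = sex)) +
  # the default stat of geom_bar() computes 'count' per category;
  # after_stat() turns the counts into shares of the total
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(name = "Share", labels = scales::percent_format())
share_plot
```

The bar heights are computed inside the plot, so the raw data never needs a percentage column.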

1.4 Scales

Scales are good for handling axis labels and ticks, but they can also be used for transformations such as reversing or log-transforming an axis.

When using scales you need to specify which axis (x or y) you wish to modify and whether the variable is continuous or discrete. Scales can also be defined for the legend using ‘fill’ or ‘color’, allowing you to change colors, labels and much more. scale_x_date() is handy when one of your axes holds dates.

First, an example of how to add labels to axes and legends using scales.

#A
barplot3<-ggplot(dat, aes(x=sex, y=age, fill=diabetes))+
  stat_summary(fun.data=mean_sdl, geom='bar', position=position_dodge()) + 
  scale_x_discrete(name='Sex',labels=c('Male','Female'))+
  scale_y_continuous(name='Average age',breaks=seq(0,70,by=10))+
  scale_fill_manual(values=c("0" = "orange", "1" ="darkgreen"),name='Diabetes',labels=c('No','Yes')) 
# use scale_fill_manual() to modify the fill
#B
hist<-plot2+
  scale_color_manual(values=c('darkgreen','#E69F00','#561313','#02416A'), name='Educational\nlevel', labels=c('Lowest','Medium','Higher','Highest'))

Second, an example of how to reverse or transform axes with scales.

#A
scale1<-plot1+
  scale_y_reverse(name='BMI from high to low')+
  scale_x_reverse(name='Age from high to low')
#B
scale2<-plot1+
  scale_y_log10(name='Log BMI')+
  scale_x_log10(name='Log Age', breaks=seq(30,80, by=5))
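
Since scale_x_date() is mentioned above but not shown, here is a minimal sketch of how it might be used. The monthly counts in visits are made up for the illustration.

```r
library(ggplot2)

# Hypothetical monthly counts for illustration
visits <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "month", length.out = 12),
  n    = c(12, 15, 9, 20, 22, 18, 25, 24, 19, 16, 14, 11)
)

date_plot <- ggplot(visits, aes(x = date, y = n)) +
  geom_line() +
  scale_x_date(name = "Month",
               date_breaks = "3 months",   # one tick every third month
               date_labels = "%b %Y")      # abbreviated month name followed by the year
date_plot
```

The variable on the x axis must be of class Date for this scale to apply.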

1.5 Coordinates

Coordinates control the grid on which you plot your graphics. This feature is especially important for plotting geographical data onto a map, but it can also be used for changing and modifying your axes as shown below.

hist+coord_flip() #swap axis

An important note is that modifying the axes with coord_cartesian() does not change the underlying plot; it only changes which part of the plot is visible. This is different from setting axis limits with scales, where points that fall outside the limits are excluded from the plot, which can change the result. As you can see in the example below, the density curves are preserved in plot A, but they are recalculated for plot B using only the data inside the limits.

#A
coord1<-plot1+
  coord_cartesian(xlim=c(30,60)) 
#change axis without excluding points from the plot
#B
coord2<-plot1+
  scale_x_continuous(limits=c(30,60)) 
#points are excluded from the plot and the density curves are changed accordingly
## Warning: Removed 3254 rows containing non-finite values (stat_density2d).
## Warning: Removed 3254 rows containing missing values (geom_point).
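
The same difference can be checked numerically on a small scale. The sketch below uses a hypothetical four-value dataset and layer_data() to read back what a summary statistic computes under each approach.

```r
library(ggplot2)

# Hypothetical data for illustration: one group with a single large value
toy <- data.frame(group = "a", value = c(1, 2, 3, 10))

base <- ggplot(toy, aes(x = group, y = value)) +
  stat_summary(fun = mean, geom = "point")

zoomed  <- base + coord_cartesian(ylim = c(0, 5))        # only zooms; the mean uses all four values
limited <- base + scale_y_continuous(limits = c(0, 5))   # drops the value 10 before the mean is computed

mean_zoomed  <- layer_data(zoomed)$y                      # mean of 1, 2, 3 and 10
mean_limited <- suppressWarnings(layer_data(limited)$y)   # mean of 1, 2 and 3 (a warning reports the removed row)
c(zoomed = mean_zoomed, limited = mean_limited)
```

With coord_cartesian() the plotted mean lies outside the visible window but is still computed from all the data; with scale limits it silently shifts.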

1.5.1 Maps

Here is a simple example of how to plot geographic information using coordinates. You can learn more about plotting geographic data here: https://ggplot2-book.org/maps.html and here: https://r-spatial.github.io/sf. The ggmap package allows you to integrate online map sources, e.g. Google Maps, in your plot.

dk_cities<-maps::world.cities%>%
  filter(country.etc=='Denmark')%>%
  select(-country.etc,lon=long)

dk <- st_as_sf(maps::map("world", regions = 'denmark', fill = TRUE, plot = FALSE))

map_plot<-ggplot() + 
  geom_sf(data=dk) + 
  geom_point(data=dk_cities, aes(x=lon, y=lat, size=pop),
             colour='darkgreen') +
  scale_size(name='Population', breaks=seq(0,1000000,by=250000))+
  coord_sf(xlim=c(8,13));map_plot

1.6 Facets

Faceting is a very useful feature when you need to create the same plot for different groups of your population. Two of the most commonly used facet functions are facet_grid() and facet_wrap().

facet_grid() creates a matrix of panels from the variables provided. It is possible to include more than two variables, although the plot easily becomes difficult to read when faceting across more than two.

facet_grid<-hist+
  facet_grid(sex~age_group);facet_grid

facet_wrap() works in a similar way, but the plots are not shown as a matrix.

facet_wrap<-hist+
  facet_wrap(sex~age_group);facet_wrap
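
One practical difference between the two functions shows up when some combinations of the faceting variables do not occur in the data. The following is a small sketch with a hypothetical three-row dataset that counts the panels each function creates.

```r
library(ggplot2)

# Hypothetical data for illustration: no observation has sex "M" and group "b"
toy <- data.frame(sex   = c("F", "F", "M"),
                  group = c("a", "b", "a"),
                  value = c(1, 2, 3))

base <- ggplot(toy, aes(x = value)) + geom_histogram(bins = 3)

grid_plot <- base + facet_grid(sex ~ group)   # full matrix, including the empty M/b panel
wrap_plot <- base + facet_wrap(sex ~ group)   # only the combinations present in the data

# Number of panels in each layout
n_grid <- nrow(ggplot_build(grid_plot)$layout$layout)
n_wrap <- nrow(ggplot_build(wrap_plot)$layout$layout)
c(grid = n_grid, wrap = n_wrap)
```

facet_grid() keeps the empty panel so that rows and columns stay aligned, while facet_wrap() simply lays out the panels that exist one after another.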

1.7 Theme

To improve the final visualization of your plot you can change the text, lines, colours and background using themes. The modifications can be done using pre-defined themes or by manually adjusting settings in theme().

Here is an example of how to manually change settings in theme(). There are endless options for modifying the theme, so only a few are shown here. When editing an element in theme() you have to specify what kind of element you wish to modify by choosing element_line(), element_rect(), or element_text(). If you set an element to element_blank(), it will be removed from the plot.

# make labels for the panels
label.age<-c('<40'='Under 40','40-50'='40-50','50-60'='50-60','>60'='60 and older')
label.sex<-c('1'='Male','2'='Female')

histo<-ggplot(dat,aes(x = bmi, color=educ))+
  geom_histogram(bins=50, fill='white', alpha=0.5, position='identity')+
  facet_grid(sex~age_group, 
             labeller = labeller(age_group=label.age,
                                 sex=label.sex))

# adding final descriptions to the plot and editing the theme
histogram<-histo+
  scale_color_discrete(name='Educational level',labels=c('Lowest','Medium','Higher','Highest'))+
  labs(x = 'BMI',
       y = 'Count',
       title = 'Histogram of educational level',
       subtitle = 'by age groups and sex',
       caption = 'Source: Data from Framingham Heart Study')+
  theme(strip.text = element_text(size=14), # adjusting the size of the panel-text
        strip.background = element_rect('white'), # insert white background of panels
        title = element_text(size=18), #adjust all title elements (plot, axis and legend titles)
        legend.position = 'top',
        legend.direction = 'horizontal',
        legend.box.background = element_rect(),# insert box around the legend
        legend.box.margin = margin(2,2,2,2), # add some margin space to the legend box
        panel.background = element_rect('white') #insert white background of the facet plots
        );histogram

Predefined themes make it very easy to give your plots a nice look, but at the cost of some flexibility, as some parameters are fixed. I have chosen two packages with themes, and some examples from each are shown in the following. A theme can be added to a ggplot object at the end, or you can define a theme for all your plots using theme_set().
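
A minimal sketch of the theme_set() approach is shown here. It uses the built-in mtcars dataset and themes that ship with ggplot2, which are my own choices for the illustration.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be restored later
old_theme <- theme_set(theme_minimal(base_size = 14))

# Plots drawn while this default is active use theme_minimal()
# unless they add a theme of their own
p_default  <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p_override <- p_default + theme_classic()   # a theme added to a single plot takes precedence

p_default
p_override

# Restore the previous default theme
theme_set(old_theme)
```

Keeping the return value of theme_set() is optional, but it makes it easy to switch back when only part of a script should use the new look.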

1.7.1 ggraph

The ggraph package includes a theme with a clean look that still allows you to adjust many factors.

histogram+
  theme_graph(title_size = 20,
              subtitle_size = 16,
              strip_text_size = 13,
              background = 'white',
              caption_size = 12,
              plot_margin = margin(10,10,10,10),
              base_size = 14)

1.7.2 ggthemes

The ggthemes package includes a range of themes, and only a few of them are shown here. They have a broad range of looks, but only allow the following modifications:

  • base_size: base font size
  • base_family: base font family
  • base_line_size: base size for line elements
  • base_rect_size: base size for rect elements

The ggthemes package also contains some predefined color scales (e.g. scale_color_gdocs()). These can be applied to the plot and will replace the colors previously chosen. Changing the color scale can overwrite earlier specifications of the legend, so you might need to repeat them. A selection of the different themes is shown in the following.

theme1<-histogram+
  scale_color_economist(name='Educational level',labels=c('Lowest','Medium','Higher','Highest'))+
  theme_economist_white(base_size = 18);theme1

theme2<-histogram+
  scale_color_gdocs(name='Educational level',labels=c('Lowest','Medium','Higher','Highest'))+
  theme_tufte(base_size = 18); theme2

histogram+
  scale_color_tableau(name='Educational level',labels=c('Lowest','Medium','Higher','Highest'))+
 theme_fivethirtyeight()

histogram+
  theme_clean(base_size = 14)

few<-histogram+
  theme_few(base_size = 16)+
  scale_colour_few(palette='Dark', name='Educational level',
                   labels=c('Lowest','Medium','Higher','Highest'));few

igray<-histogram+
  theme_igray();igray

map<-map_plot+
  theme_map();map

2 Additional packages

There are countless packages with additional features for ggplots. In the following I present a small sample of packages that I think are nice to know when working with ggplots.

2.1 survminer

Although it is possible to draw Kaplan-Meier curves in ggplot, it is not straightforward. Therefore, additional packages have been developed for that specific purpose. The package survminer contains functions for plotting Kaplan-Meier curves and includes a function for faceting too.

surv<-survfit(Surv(timedth,death==1)~educ,data=dat)
surv_plot<-ggsurvplot(surv,
           data=dat,
           censor.shape='|',
           size=1,
           conf.int=T,
           ylim=c(0.5,1),
           risk.table=T,
           risk.table.height=c(0.35),
           palette=c('darkgreen','brown',
                     'orange','darkseagreen'),
           legend.title='Educational level',
           legend.labs=c('Lowest','Medium',
                         'Higher','Highest'));surv_plot

surv<-survfit(Surv(timedth,death==1)~educ+age_group+sex,data=dat)
facet<-ggsurvplot_facet(surv,
                 data=dat,
                 palette=c('darkgreen','brown',
                           'orange','darkseagreen'),
                 legend.title='Educational level',
                 legend.labs=c('Lowest','Medium',
                               'Higher','Highest'),
                 facet.by=c('sex','age_group'),
                 short.panel.labs = T,
                 panel.labs = list(age_group=c('Under 40','40-50',
                                               '50-60','Over 60'),
                                   sex=c('Male', 'Female')),
                 break.time.by=2000,
                 ylim=c(0.5,1),
                 panel.labs.background = list('white'));facet

2.2 ggforce

The package ggforce has numerous nice features. I will show a few of them here, including how to add text boxes, how to zoom in on a plot and how to plot a matrix. You can read more about the package at https://ggforce.data-imaginist.com/index.html.

# Add a text object to your plot
  ggplot(dat,aes(x = age, y = bmi))+ 
      geom_mark_ellipse(aes(filter = bmi > 55 & 
                              age > 65,label = 'Obesity',
                        description = 
                          'These are the most obese people above age 65'))+
    geom_point()+
  scale_y_continuous(limits = c(10,80))

# Zoom in on a part of your plot
  hist + 
    facet_zoom(xlim = c(35, 45), 
               ylim = c(0, 100), 
               horizontal = FALSE)

#plot a matrix
ggplot(dat, aes(x = .panel_x, y = .panel_y,fill=sex, colour=sex)) + 
  geom_point(alpha = 0.2, shape = 16, size = 0.5) +
  geom_autodensity(alpha=0.2, position = 'identity') +
  facet_matrix(vars(educ,bmi,age,sysbp, diabp), 
               layer.diag = 2, 
               layer.mixed=-3, 
               layer.continuous = -4, 
               layer.discrete = 2)+
    theme(strip.background = element_blank(),
        strip.placement = 'outside',
        strip.text = element_text(size = 10),
        legend.text = element_text())

2.3 patchwork

The package patchwork can be used for grouping several plot objects together. It allows you to easily arrange plots and ensures that the axes of the different plots are nicely aligned. The code is very simple, as you can see from the examples below. You can also apply a theme to all the plots you combine using theme_set(). This way you can ensure that all plots have the same graphical presentation.

Here two plots are plotted on top of a third.

theme_set(theme_tufte())
new_plot<-((plot1+barplot3)/facet_grid);new_plot

Here three plots are stacked on top of each other.

plot1/barplot3/facet_grid

Another nice feature is that you can collect all the legends in one place and remove duplicate legends by using plot_layout().

new_plot+
  plot_layout(guides='collect')
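
Because the plots used above depend on data from earlier sections, here is a self-contained sketch of the patchwork operators. It uses the built-in mtcars dataset, and the plot names p1, p2 and p3 are hypothetical.

```r
library(ggplot2)
library(patchwork)

# Illustration only: three simple plots from the built-in mtcars dataset
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
p3 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 20)

side_by_side <- p1 | p2          # '|' places plots next to each other
stacked      <- p1 / p2          # '/' places plots on top of each other
combined     <- (p1 | p2) / p3   # parentheses group plots into nested layouts
combined
```

The combined object can be extended with plot_layout() and plot_annotation() in the same way as a single ggplot is extended with layers.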

Finally, annotation including title, subtitle, caption and tags can be added to the combined plot using plot_annotation().

theme_set(theme_clean())
new_plot+
  plot_layout(nrow=2,heights=c(1.5,1))+
  plot_annotation(title='Hey, this is a title',
                  subtitle = 'and this is a subtitle',
                  caption = 'and this is the caption', 
                  tag_levels='A')

2.4 ggfittext and ggrepel

Text often creates problems when plotting: it is either too small or overlaps other elements of the plot, and manual handling of text objects is very time consuming. The two packages ggfittext and ggrepel can help automate some of the editing of text objects. ggfittext can be used to shrink text to fit, and ggrepel can be used to ensure that text objects do not overlap by rearranging them.

#prepare data for plot
bars<-dat[,.N,keyby=.( age_group,educ)][,.(Percent=round(N/sum(N),2), age_group), keyby=educ]

ggplot(bars, aes(x=educ,y=Percent, fill=age_group, label=scales::percent(Percent)))+
  geom_bar(stat = 'identity')+
  scale_y_continuous(labels = scales::percent_format())+
  geom_bar_text(position='stack', min.size = 5, reflow = T)#allow the label text to shrink down to a minimum size of 5

setDT(dk_cities)
dk_cities[pop>15000,label:=name]#create label for the biggest cities

ggplot() + 
  geom_sf(data=dk) + 
  geom_point(dk_cities, mapping=aes(x=lon, y=lat, size=pop), colour='orange') +
  scale_size(name='Population')+
  coord_sf(xlim=c(7,15))+
  geom_text_repel(dk_cities ,mapping=aes(x=lon, y=lat,label=label), size=3)
## Warning: Removed 271 rows containing missing values (geom_text_repel).

2.5 gganimate

Maybe more fun than useful, but gganimate is a package that can help you create simple animations from your plots. I have made two examples here, so you can see some of the features of the gganimate package.

bars<-dat[,.N,keyby=.( age_group,educ)][,.(Percent=round(N/sum(N),2), age_group), keyby=educ]
ggplot(bars, aes(x=educ,y=Percent, fill=age_group,
                 label=scales::percent(Percent)))+
  geom_col(aes(x=educ,y=Percent, 
               fill=age_group), position='dodge')+
  scale_y_continuous(labels = scales::percent_format())+
  geom_bar_text(position='dodge', min.size = 5, reflow = T)+
  transition_states(age_group, 
                    transition_length = 2, 
                    state_length = 1)+
  enter_fade()+
  exit_shrink()

ggplot(dat, aes(x=bmi, color=educ))+
  geom_histogram(bins=50,fill='white', 
                 alpha=0.5, position='identity')+
  scale_color_manual(values=c('darkgreen','#E69F00',
                              '#561313','#02416A'), name='Educational level')+
  transition_states(period, 
                    transition_length = 2, 
                    state_length = 1)+
  ggtitle("Period {closest_state} of Framingham Heart Study")+
  enter_drift()+
  exit_drift()