Joyce Robbins originally published: 1/31/2018
tidyr::gather() has been superceded by tidyr::pivot_longer().
Please see this tutorial on
pivot_longer()
instead.
This tutorial focuses on one function: tidyr::gather(), which is used
to convert data from messy to tidy. I won’t get into the distinction
here – I’m going to assume that you know the terms and just need help
figuring how to get from A to B. If, however, this is all new to you, I
suggest reading “Tidy
Data” by Hadley
Wickham, the seminal paper on the topic, which lays out the concept in
detail.
Before getting into the nitty-gritty of gather() let’s study what our
messy data looks like, and what it will look like when it’s tidied:
Study the differences carefully. What did you observe?
Hopefully you noticed the following:
-
The
idcolumn appears in both versions, though it’s double the length in the tidy form. -
We have two new columns:
roadtypeandmpg. -
The
cityandhwycolumn names became the contents of the newroadtype(key) column. -
The values of the
cityandhwycolumns became the values of the newmpg(value) column.
Now here’s the trick: the only parts we need to concern ourselves with are 1. and 2.: deciding which columns to keep unchanged, and what names to give the new columns, which are called the key and value columns.
The columns to keep are the ones that are already tidy. The ones to dump
are the ones that aren’t true variables but in fact are levels of
another variable. In this case, city and hwy should be levels of a
new variable called roadtype (or something similar), according to the
theory of tidy data. On the other hand, id should remain as is.
The bottom line: keep id as is
The second (and final) step is to choose names for the new key and
value columns. We’ve already chosen roadtype for key; let’s pick
mpg for value. You can also choose to stick with the defaults for
the new key and value column names: key and value (surprise).
Using key and value may be helpful at first to keep track of which
is which, though ultimately I find more descriptive names to be more
useful.
The bottom line: key = roadtype, value = mpg
The code to carry out the transformation from messy to tidy is one call
to gather(), with parameters that reflect our conclusions about the
columns to keep and the names we’ve chosen:
tidydata <- messydata %>%
gather(key = "roadtype", value = "mpg", -id)The most important thing to note is that each parameter corresponds to one of the columns in the tidy data set.
Now let’s look at them one at a time in more detail:
key = "roadtype" sounds like it’s setting the key to a preexisting
column called “roadtype”, but that’s not what’s going on. Rather, think
of this as an instruction to create a new key column and call it
“roadtype”.
value = "mpg" likewise is an instruction to create a new value column
and call it “mpg”.
-id is the piece that says “Don’t touch my id column! Leave it as
is!”
Nothing else needs to be specified: every value in a column not marked “don’t touch” will be moved to the value (“mpg”) column, paired with its old column name (in this case “city” or “hwy”) from the messy data set, which now appears in the key column.
There are a few problems I’ve seen with the way in which people use
gather():
If we had made this mistake in the example above, the result would be something like:
messydata %>%
gather(key = "roadtype", value = "mpg")## # A tibble: 9 x 2
## roadtype mpg
## <chr> <chr>
## 1 id car1
## 2 id car2
## 3 id car3
## 4 city 19
## 5 city 20
## 6 city 29
## 7 hwy 24
## 8 hwy 30
## 9 hwy 35
Yikes, not what we wanted. Looking at the mpg column, we see that
“car1”, “car2”, and “car3” don’t belong. The solution is to add the
-id parameter – remember, think: “don’t touch id” – so that it isn’t
“dumped” into the key column.
Note: it is possible to specify the columns to dump rather then the columns to keep:
messydata %>%
gather(key = "roadtype", value = "mpg", city, hwy)## # A tibble: 6 x 3
## id roadtype mpg
## <chr> <chr> <dbl>
## 1 car1 city 19
## 2 car2 city 20
## 3 car3 city 29
## 4 car1 hwy 24
## 5 car2 hwy 30
## 6 car3 hwy 35
I find this method less intuitive, but of course it’s your choice.
Note in our example that the id column is important for linking the
city and highway miles per gallon. For example, we might want to plot
the data as follows, which requires knowing which car had which city and
highway mpg:
Often, however, there is no id column: it’s not necessary since each
row represents one item – a car in this case. If we try to tidy messy
data without an id, it looks like this:
The problem is that we’ve lost the natural connection we had between city and highway for each car since the two values no longer reside in the same row. The solution here is to move the row names – or row numbers in this case – to a column to be used as an id:
messy2 %>% rownames_to_column("id") %>%
gather(key, value, -id)## # A tibble: 6 x 3
## id key value
## <chr> <chr> <dbl>
## 1 1 city 19
## 2 2 city 20
## 3 3 city 29
## 4 1 hwy 24
## 5 2 hwy 30
## 6 3 hwy 35
Note that we must specify the “don’t touch” id column that we just
created. Also note that we used the default key and value column
names rather than choose our own.
Let’s take the painters dataset from the MASS package:
## Composition Drawing Colour Expression School
## Da Udine 10 8 16 3 A
## Da Vinci 15 16 4 14 A
## Del Piombo 8 13 16 7 A
## Del Sarto 12 16 9 8 A
## Fr. Penni 0 15 8 0 A
## Guilio Romano 15 16 4 14 A
…and tidy it into four columns as such:
## Name School Skill Score
## 1 Da Udine A Composition 10
## 2 Da Vinci A Composition 15
## 3 Del Piombo A Composition 8
## 4 Del Sarto A Composition 12
## 5 Fr. Penni A Composition 0
## 6 Guilio Romano A Composition 15
This example is a little more complex than the previous one, since it’s missing an id column, and there is more than one “keep” column. (You are encouraged to try this on your own before looking at the solution!)
-
Move rownames to a new column and call it
Name(see “missingidcolumn” in the Heads up section above to understand the need for this new column.) -
Keep
NameandSchoolcolumns as is. (“Don’t Touch!”) -
Since the four columns
Composition,Drawing,Colour, andExpressionare really levels of another variable, they do not get the “keep” as is designation. That means we are staging them to be dumped into a new key variable which we’ll callSkill. In addition, the values contained in these columns will move to a new value column, which we’ll callScore.
library(MASS)
library(tidyverse)
tidypaint <- painters %>%
rownames_to_column("Name") %>%
gather(key = "Skill", value = "Score", -Name, -School)The breakdown:
Finally, if the visuals aren’t your style, here’s a running commentary of the instructions:
“Take the (messy) dataset painters. Move the contents of the rownames
to a new column called Name. Now let’s start tidying by gathering
multiple columns into key-value pairs. Do this by creating a new key
column called Skill. While you’re at it, create a new value column
called Score. Absolutely do not touch or change the Name and
School columns (other than to replicate as necessary). It goes without
saying that the four other column names will fill the new Skill
column, and the values of those columns will fill the new Score
column, since it’s understood from what I’ve said already! Much
appreciated!”
Thank you to @angelotrivelli @dch4n @jschoeley @jspncr_ @kierisi @s_lump for providing feedback and helpful suggestions in response to this Twitter post.








