We often find ourselves tidying and reshaping data. Here we consider two packages, tidyr and reshape2. Our aim is to see where their purposes overlap and where they differ by comparing the functions gather(), separate() and spread(), from tidyr, with the functions melt(), colsplit() and dcast(), from reshape2.

Data tidying

Data tidying is the operation of transforming data into a clear and simple form that makes it easy to work with. “Tidy data” represent the information from a dataset as data frames where each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). Compare the two data frames below (cf. Wickham (2014)) to get an idea of the differences: example.tidy is the tidy version of example.messy, and the same information is organized in two different ways.
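As a minimal sketch of such a pair, the two data frames could be built as follows. The names and values are illustrative, in the spirit of the small treatment example in Wickham (2014), and are not taken from the original post.

```r
# Illustrative data only: one row per person, one column per treatment.
example.messy <- data.frame(
  name       = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

# The same information with one row per observation (person x treatment)
# and one column per variable.
example.tidy <- data.frame(
  name      = rep(c("John Smith", "Jane Doe", "Mary Johnson"), times = 2),
  treatment = rep(c("a", "b"), each = 3),
  result    = c(NA, 16, 3, 2, 11, 1)
)

example.messy
example.tidy
```

In example.messy the treatment is encoded in the column names, whereas in example.tidy it is an ordinary variable.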

From the wide to the long format: gather() vs melt()

We begin by seeing how we can bring data from the "wide" to the "long" format.

Let’s start by loading the packages we need:
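A minimal sketch, assuming both packages are already installed from CRAN:

```r
# tidyr provides gather(), separate() and spread();
# reshape2 provides melt(), colsplit() and dcast().
library(tidyr)
library(reshape2)
```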

Next we need some data (from RStudio Blog - Introducing tidyr): measurements of how much time people spend on their phones, taken at two locations (work and home) and at two times. Each person has been randomly assigned to either treatment or control.
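One way to simulate a data frame with this shape is sketched below. The seed and the use of runif() for the measurements are assumptions made for illustration, so the actual values carry no meaning.

```r
set.seed(10)  # arbitrary seed, only to make the sketch reproducible

messy <- data.frame(
  id      = 1:4,
  # two people per group, in random order
  trt     = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)

messy
```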

Our first step is to put the data in the tidy format. To do that we use tidyr’s functions gather() and separate(). Following Wickham’s tidy data definition, this data frame is not tidy because some variable values are in the column names. We bring this messy data frame from the wide to the long format by using the gather() function (have a look at Sean C. Anderson - An Introduction to reshape2 to get an idea of the wide/long format). We want to gather all the columns, except for id and trt, into two columns, key and value:
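A sketch of that call, with the simulated messy data frame rebuilt here so that the snippet runs on its own (the data values are illustrative):

```r
library(tidyr)

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4), home.T1 = runif(4),
  work.T2 = runif(4), home.T2 = runif(4)
)

# Gather every column except id and trt into key/value pairs.
gathered.messy <- gather(messy, key, value, -id, -trt)
head(gathered.messy)
```

The four measurement columns are stacked, so the 4 rows of messy become 16 rows.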

Note that in gather() we used bare variable names to specify the names of the key, value, id and trt columns.

We can get the same result with the melt() function from reshape2:
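A sketch of the equivalent melt() call on the same simulated data (rebuilt here so the snippet is self-contained); note that, unlike gather(), the column names are passed as strings:

```r
library(reshape2)

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4), home.T1 = runif(4),
  work.T2 = runif(4), home.T2 = runif(4)
)

# id.vars are kept as they are; all other columns are stacked.
molten.messy <- melt(
  messy,
  id.vars       = c("id", "trt"),
  variable.name = "key",
  value.name    = "value"
)
head(molten.messy)
```

One difference worth knowing: melt() stores the key column as a factor, while gather() returns it as character by default.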

We now compare the two functions by running them on the data without any further parameters to see what happens:
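A sketch of that experiment on the simulated data (rebuilt here so the snippet runs on its own). The exact messages and warnings printed may depend on the package versions and on whether trt is stored as a factor or as character.

```r
library(tidyr)
library(reshape2)

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4), home.T1 = runif(4),
  work.T2 = runif(4), home.T2 = runif(4)
)

# No column selection: gather() stacks all six columns, id and trt included.
gathered.default <- gather(messy)

# No id.vars/measure.vars: melt() picks the discrete column (trt) as id
# and stacks the remaining five columns, id included.
molten.default <- melt(messy)

dim(gathered.default)
dim(molten.default)
```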

We see different behaviour: gather() has brought messy into the long format, with a warning, by treating every column as a variable to gather, while melt() has treated trt as an “id variable”. Id columns are the columns that identify the observation represented by each row of our data set. Indeed, if melt() receives neither id.vars nor measure.vars, it uses the factor and character columns as id variables. gather(), by contrast, has to be told which columns to gather (or, with a minus sign, which columns to leave out as ids); every selected column is turned into key-value pairs.

Despite these last differences, we have seen that the two functions can be used to perform exactly the same operations on data frames, and only on data frames! Indeed, gather() cannot handle matrices or arrays, while melt() can, as shown below.
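A small sketch of melt() applied to a matrix and to a three-dimensional array. The objects m and a are made up for illustration.

```r
library(reshape2)

# A 2 x 3 matrix with named rows and columns.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

# One row per cell: the row index, the column index and the cell value.
molten.matrix <- melt(m)
molten.matrix

# The same idea extends to higher-dimensional arrays:
# one index column per dimension plus the value column.
a <- array(1:24, dim = c(2, 3, 4))
molten.array <- melt(a)
head(molten.array)
```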

Split a column: separate() vs colsplit()

Our next step is to split the column key into two different columns in order to separate the location and time variables and obtain a tidy data frame:
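With tidyr this can be sketched as a single separate() call on the gathered data (rebuilt here from the simulated messy data frame so the snippet is self-contained):

```r
library(tidyr)

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4), home.T1 = runif(4),
  work.T2 = runif(4), home.T2 = runif(4)
)
gathered.messy <- gather(messy, key, value, -id, -trt)

# Split key ("work.T1", ...) at the literal dot into location and time.
# sep is a regular expression, so the dot has to be escaped.
tidy <- separate(gathered.messy, key,
                 into = c("location", "time"), sep = "\\.")
head(tidy)
```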

Again, the result is the same, but we need a workaround: because colsplit() operates only on a single column, we use cbind() to insert the two new columns into the data frame. separate() performs the whole operation at once, reducing the possibility of making mistakes.
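The reshape2 workaround described above might look like the following sketch, again on the simulated data. The key column is converted to character before splitting as a precaution, since melt() returns it as a factor.

```r
library(reshape2)

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c("control", "treatment"), each = 2)),
  work.T1 = runif(4), home.T1 = runif(4),
  work.T2 = runif(4), home.T2 = runif(4)
)
molten.messy <- melt(messy, id.vars = c("id", "trt"),
                     variable.name = "key", value.name = "value")

# colsplit() returns only the new columns, so the data frame has to be
# reassembled by hand: id columns, split columns, then the values.
res.tidy <- cbind(
  molten.messy[c("id", "trt")],
  colsplit(as.character(molten.messy$key), pattern = "\\.",
           names = c("location", "time")),
  molten.messy["value"]
)
head(res.tidy)
```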

From the long to the wide format: spread() vs dcast()

Finally, we compare spread() with dcast() using the example data frame from the spread() documentation itself. Briefly, spread() is complementary to gather() and brings data from the long to the wide format.
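A sketch along the lines of that documentation example: simulated stock prices are first gathered into the long format and then spread back. The seed and the simulated prices are illustrative.

```r
library(tidyr)

set.seed(1)  # arbitrary seed for reproducibility
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Wide -> long: one row per (time, stock) pair.
stocksm <- gather(stocks, stock, price, -time)

# Long -> wide: one column per stock again.
spread.stock <- spread(stocksm, stock, price)
head(spread.stock)
```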

Again, the same result produced by spread() can be obtained using dcast() by specifying the correct formula.
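A sketch of the dcast() counterpart on the same kind of simulated stock data (rebuilt here so the snippet runs on its own). In the formula, the left-hand side names the columns that stay as rows and the right-hand side names the column whose values become new column names.

```r
library(tidyr)
library(reshape2)

set.seed(1)  # arbitrary seed for reproducibility
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocksm <- gather(stocks, stock, price, -time)

# One row per time, one column per stock, cells filled from price.
cast.stock <- dcast(stocksm, time ~ stock, value.var = "price")
head(cast.stock)
```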

In the next section, we are going to modify the formula parameter in order to perform some data aggregation and further compare the two packages.

Data aggregation

Up to now we have made reshape2 follow tidyr, showing that everything you can do with tidyr can also be achieved with reshape2, at the price of some workarounds. As we go on with our simple example, we move beyond the purposes of tidyr, which has no more functions available for our needs. We now have a tidy data set, with one observation per row and one variable per column, to work with. We show some aggregations that are possible with dcast() using the tips data frame from reshape2. tips contains the information one waiter recorded about each tip he received over a period of a few months working in one restaurant.

We use dcast() to get information on the average total bill, tip and group size per day and time:
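A sketch of this aggregation: tips is first melted (its factor columns become id variables), and dcast() then receives mean as the aggregation function. The object names are illustrative.

```r
library(reshape2)

data(tips, package = "reshape2")

# With no id.vars given, melt() uses the factor columns
# (sex, smoker, day, time) as ids and stacks total_bill, tip and size.
m.tips <- melt(tips)

# One row per day/time combination, one column per measured variable,
# each cell holding the mean over the matching rows.
tips.by.day.time <- dcast(m.tips, day + time ~ variable, mean)
tips.by.day.time
```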

The same approach gives the averages according to whether or not there was a smoker in the group:
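Only the left-hand side of the formula changes, as in this sketch (object names are illustrative):

```r
library(reshape2)

data(tips, package = "reshape2")
m.tips <- melt(tips)  # factor columns are used as id variables

# One row per smoker level, with the mean of each measured variable.
tips.by.smoker <- dcast(m.tips, smoker ~ variable, mean)
tips.by.smoker
```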

There is no function in the tidyr package that allows us to perform a similar operation. The reason is that tidyr is designed only for data tidying, not for general reshaping and aggregation.

Conclusions

At the beginning we saw tidyr and reshape2 functions performing the same operations, which suggested that the two packages are similar, if not equal, in what they do. Later, we saw that reshape2’s functions can perform data aggregation that is not possible with tidyr. Indeed, tidyr’s aim is data tidying, while reshape2 has the wider purpose of data reshaping and aggregating. As a consequence, tidyr’s syntax is easier to understand and to work with, but its functionality is limited. Therefore, we use tidyr’s gather() and separate() functions to quickly tidy our data and reshape2’s dcast() to aggregate them.

References:

Wickham, Hadley. 2014. “Tidy data.” Journal of Statistical Software 59 (10).