We use summarise() with aggregate functions, which take a vector of values and return a single number. The summarise_each() function offers an alternative approach that produces identical results.

This post compares the behavior of summarise() and summarise_each() with respect to two factors we can control:

  1. How many variables to manipulate
  • 1A. a single variable
  • 1B. more than one variable
  2. How many functions to apply to each variable
  • 2A. a single function
  • 2B. more than one function

resulting in the following four cases:

  • Case 1: apply one function to one variable
  • Case 2: apply many functions to one variable
  • Case 3: apply one function to many variables
  • Case 4: apply many functions to many variables

These four cases will also be tested with and without grouping via group_by().
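As a minimal sketch of what adding group_by() changes, the example below summarises mpg once per value of cyl instead of once for the whole table. The grouping column cyl and the output name max_mpg are choices made for this illustration.

```r
library(dplyr)

# Without grouping: one row for the whole data frame
overall <- mtcars %>%
  summarise(max_mpg = max(mpg))

# With grouping: one row per distinct value of cyl
# (cyl is an illustrative choice of grouping variable)
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(max_mpg = max(mpg))

overall
by_cyl
```

The summarising expression is the same in both calls; only the number of output rows differs.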

The mtcars data frame

For this article we will use the well-known mtcars data frame.

We will first transform it into a tbl_df object. The underlying data.frame is unchanged, but a much better print method becomes available.

Finally, to keep this article tidy, we will select only four variables of interest.
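The preparation steps might look like the sketch below. It uses tibble::as_tibble() as one way to obtain a tbl_df object. The name mtcars_tbl and the four selected columns are assumptions made for illustration, since the text does not list which variables were kept.

```r
library(dplyr)

# Convert to a tbl_df: the object is still a data.frame,
# it only gains a more compact print method
mtcars_tbl <- tibble::as_tibble(mtcars)

# Keep four columns (this particular choice is an assumption)
mtcars_tbl <- mtcars_tbl %>%
  select(cyl, mpg, disp, hp)

class(mtcars_tbl)
mtcars_tbl
```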

Case 1: apply one function to one variable

In this case, summarise() is the simplest candidate.
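A minimal example of this case, using max() on mpg (both picked here for illustration):

```r
library(dplyr)

# One function (max) applied to one variable (mpg)
res <- mtcars %>%
  summarise(max_mpg = max(mpg))

res
```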

We could use summarise_each() as well, but doing so results in a loss of clarity.
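For comparison, the same computation with summarise_each() could be sketched as follows. This assumes a dplyr release that still ships summarise_each() and funs(); both have been deprecated in later versions and emit warnings there.

```r
library(dplyr)

# Assumes summarise_each() and funs() are available in the installed dplyr
# (both are deprecated in later releases)
res_each <- mtcars %>%
  summarise_each(funs(max), mpg)

res_each
```

The function goes first, wrapped in funs(), and the variable follows, which reads less directly than max_mpg = max(mpg).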

Case 2: apply many functions to one variable

In this case we can use either summarise() or summarise_each().

The summarise() function has the more intuitive syntax, and the names of the output variables can be specified directly, in simple forms like max_mpg = max(mpg).
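With summarise(), applying two functions to mpg can be written as below. The choice of min() and max() follows the function names mentioned later in this section.

```r
library(dplyr)

# Two functions applied to one variable, each with an explicit output name
res <- mtcars %>%
  summarise(min_mpg = min(mpg),
            max_mpg = max(mpg))

res
```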

When we apply many functions to one variable, summarise_each() provides a more compact and tidy notation.
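A sketch of the compact form, again assuming a dplyr release where summarise_each() and funs() are still available:

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
# The functions are listed once in funs(); the variable is named once
res_each <- mtcars %>%
  summarise_each(funs(min, max), mpg)

res_each
```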

The names of the output variables are given by the names of the functions: min and max. In this case we lose the name of the variable the functions are applied to. If we prefer something like min_mpg and max_mpg, we need to name the functions we call within funs().
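Naming the functions inside funs() could look like this sketch, under the same assumption that summarise_each() and funs() are available in the installed dplyr:

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
# Each function is given the name we want for its output column
res_named <- mtcars %>%
  summarise_each(funs(min_mpg = min, max_mpg = max), mpg)

res_named
```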

Case 3: apply one function to many variables

This case is very similar to Case 2. Both summarise() and summarise_each() can be used.

Again, summarise() has the more intuitive syntax, and the names of the output variables can be specified in the usual simple form: max_mpg = max(mpg)
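For instance, applying mean() to mpg and disp with summarise() could be written as follows. The function and variables are taken from the names mentioned in this section.

```r
library(dplyr)

# One function (mean) applied to two variables, named explicitly
res <- mtcars %>%
  summarise(mean_mpg  = mean(mpg),
            mean_disp = mean(disp))

res
```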

When we apply one function to many variables, summarise_each() provides a more compact and tidy notation.
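The compact equivalent might be sketched like this, assuming a dplyr release where summarise_each() and funs() are still available:

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
# The function is written once; the variables are simply listed
res_each <- mtcars %>%
  summarise_each(funs(mean), mpg, disp)

res_each
```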

The names of the output variables are given by the names of the variables: mpg and disp. In this case we lose track of the name of the function applied to the variables: mean(). We would possibly prefer something like mean_mpg and mean_disp. To achieve this, we need to rename the variables we pass to ... within summarise_each().
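Renaming the variables passed through ... could look like the sketch below. It assumes a dplyr release where summarise_each() and funs() are available and where the variable selection accepts new_name = old_name pairs.

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
# and that variables can be renamed as new_name = old_name in `...`
res_named <- mtcars %>%
  summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)

res_named
```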

Case 4: apply many functions to many variables

As in the previous cases, both summarise() and summarise_each() are valid options.

Again, summarise() has the more intuitive syntax, and the names of the output variables can be specified in the usual simple form: max_mpg = max(mpg)
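With two functions and two variables, the summarise() version needs one expression per combination, as in this sketch (min(), max(), mpg and disp are the names used elsewhere in this section):

```r
library(dplyr)

# Two functions x two variables = four explicitly named expressions
res <- mtcars %>%
  summarise(min_mpg  = min(mpg),
            min_disp = min(disp),
            max_mpg  = max(mpg),
            max_disp = max(disp))

res
```

The number of expressions grows as the product of the number of functions and the number of variables.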

When we apply many functions to many variables, summarise_each() provides a more compact and tidy notation.

The names of the output variables follow the pattern variable_function, i.e. mpg_min, disp_min, etc.
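A sketch of the compact form for this case, assuming a dplyr release where summarise_each() and funs() are still available:

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
# Every function in funs() is applied to every listed variable
res_each <- mtcars %>%
  summarise_each(funs(min, max), mpg, disp)

# Column names combine variable and function, e.g. mpg_min
names(res_each)
res_each
```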

Naming the output variables with a different pattern, such as function_variable, does not appear to be possible within the call to summarise_each().

This has to be done with a separate instruction.
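One possible separate instruction is to rewrite the column names after the summary has been computed. The sketch below uses base R's sub() with a regular expression. It is an illustration of the idea, not necessarily the approach used in the original post, and it assumes a dplyr release where summarise_each() and funs() are available.

```r
library(dplyr)

# Assumes summarise_each() and funs() exist in the installed dplyr
res_each <- mtcars %>%
  summarise_each(funs(min, max), mpg, disp)

# Swap "variable_function" into "function_variable",
# e.g. "mpg_min" becomes "min_mpg"
names(res_each) <- sub("^(.*)_(.*)$", "\\2_\\1", names(res_each))

res_each
```

The pattern assumes each name contains a single underscore, which holds for the variables and functions used here.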

Conclusions

When using functions returning results of length one we have two possible candidate verbs:

  • summarise()
  • summarise_each()

Function summarise() has a simpler syntax while function summarise_each() has a more compact notation.

As a consequence, summarise() seems more appropriate when dealing with a single variable or a single function. As the number of variables or functions increases, summarise_each() becomes the better choice.

Function summarise_each() has its own way to assign names to the output variables:

Case 2: apply many functions to one variable

The names of the output variables are given by the names of the functions. In this case we lose the name of the variable the functions are applied to.

Case 3: apply one function to many variables

The names of the output variables are given by the names of the variables. In this case we lose track of the name of the function applied to the variables.

Case 4: apply many functions to many variables

The names of the output variables follow the pattern variable_function. Naming the output variables with a different pattern does not appear to be possible within the call to summarise_each().
