R can be connected with Hadoop through the rmr2 package. The core of this package is mapreduce() function that allows to write some custom MapReduce algorithms. The aim of this article is to show how it works and to provide an example.

As mentioned in the previous article, the R mapreduce() function requires some arguments, but now we will deal with only two of them. As a matter of fact, the most difficult part is about map() and reduce(). They both consist in R functions that take as input and output some key/value data, since it is a requirement of MapReduce paradigm.

Since now “k” and “v” are the matrices with the input key-value pairs and “key” and “val” are the output ones. Fortunately, rmr2 provides keyval() function that generates a list from the output key and value matrices. The condition is that the key must be a matrix with a column and the same number of rows as the value. As a matter of fact, each element of the key matrix is matched with a row of the value matrix. It's not mandatory, though recommended, to use the R matrix data type. This is the structure of a map() (or reduce()) function.

I used rmr2 package a lot for my academic thesis, but unfortunately I'm not allowed to publish the code. Anyway, you can find my thesis PDF here.

In order to show an example, I have produced some other code now. Unfortunately, I haven't tested it yet, but it's not complex and I'm pretty confident that it works. This example is about a wordcount, so the input is a text and the output is a list with each word with its number of occurrences. You can find an example here.

The assumption is it that both the input and the output are stored through hdfs native format. If the format was csv, the sintax would have been slightly different.

In this case the value has always one column only. Anyway, it's possible to have a matrix with any number of columns.

In conclusion, the rmr2 package is a good way to perform a data analysis in the Hadoop ecosystem. Its advantages are the flexibility and the integration within an R environment. Its disadvantages are the necessity of having a deep understanding of the MapReduce paradigm and the high amount of required time for writing code. I think that it's very useful to customize the algorithms only after having used some current ones first. For instance, the first stage of the analysis may consist in aggregating data through Hive and perform Machine Learning through Mahout. Afterwards, rmr2 allows to modify the algorithms in order to improve the performances and fit better the problems.

Print Friendly, PDF & Email

Related Post