R can be connected with Hadoop through the `rmr2`

package. The core of this package is `mapreduce()`

function that allows to write some custom MapReduce algorithms. The aim of this article is to show how it works and to provide an example.

As mentioned in the previous article, the R `mapreduce()`

function requires some arguments, but now we will deal with only two of them. As a matter of fact, the most difficult part is about `map()`

and `reduce()`

. They both consist in R functions that take as input and output some key/value data, since it is a requirement of MapReduce paradigm.

Since now “k” and “v” are the matrices with the input key-value pairs and “key” and “val” are the output ones. Fortunately, `rmr2`

provides `keyval()`

function that generates a list from the output key and value matrices. The condition is that the key must be a matrix with a column and the same number of rows as the value. As a matter of fact, each element of the key matrix is matched with a row of the value matrix. It's not mandatory, though recommended, to use the R matrix data type. This is the structure of a `map()`

(or `reduce()`

) function.

1 2 3 4 5 |
map = function(k, v){ key = ... val = ... return(keyval(key, val)) } |

I used `rmr2`

package a lot for my academic thesis, but unfortunately I'm not allowed to publish the code. Anyway, you can find my thesis PDF here.

In order to show an example, I have produced some other code now. Unfortunately, I haven't tested it yet, but it's not complex and I'm pretty confident that it works. This example is about a wordcount, so the input is a text and the output is a list with each word with its number of occurrences. You can find an example here.

The assumption is it that both the input and the output are stored through hdfs native format. If the format was csv, the sintax would have been slightly different.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
mapreduce( input = '/test/input.native', output = '/test/output.native', map = function(k, v){ key = v nData = dim(v)[1] val = matrix(data=1, nrow=nData, ncol=1) return(keyval(key, val) }, reduce = function(k, v){ key = k[1, 1] val = sum(k[, 2]) return(keyval(key, val) } ) |

In this case the value has always one column only. Anyway, it's possible to have a matrix with any number of columns.

In conclusion, the `rmr2`

package is a good way to perform a data analysis in the Hadoop ecosystem. Its advantages are the flexibility and the integration within an R environment. Its disadvantages are the necessity of having a deep understanding of the MapReduce paradigm and the high amount of required time for writing code. I think that it's very useful to customize the algorithms only after having used some current ones first. For instance, the first stage of the analysis may consist in aggregating data through Hive and perform Machine Learning through Mahout. Afterwards, `rmr2`

allows to modify the algorithms in order to improve the performances and fit better the problems.

Hi Mixed,

Thanks for explaining in simple way and impressed with your style of explanation.

I am Hadoop developer and from statistics graduate.

I would like to learn RMR3 and my dream is to become data Scientist. I searched in internet and tried to get some information from external sources but did not found any appropriate material.

Could you provide your valuable suggestions to learn RMR/become data scientist and help me with some documents or material or manuals so that I can prepare myself.

Thanks,

Anil