R User Group of Milano (Italy)

A Big Data introduction

Since R keeps its data in the computer's RAM, it can handle only fairly small datasets. There are some packages that allow it to process larger volumes, but the best solution is to connect R with a Big Data environment. This post introduces some Big Data concepts that are fundamental to understanding how R can work in such an environment. Later posts will explain in detail how R can be connected with Hadoop.

The phrase "Big Data" refers to a set of new challenges known as the "Three Vs", the most important of which is Volume: the need to deal with large amounts of data. The most popular solution is provided by Hadoop. This software handles large datasets by splitting them into chunks that are scattered across a cluster of computers. This distributed approach eases data processing, since the workload can be split without overloading any single computer. On the hardware side, the advantage of Hadoop is that it has no particular requirements: it can run on a cluster of cheap computers. In addition, larger volumes can be handled by adding new nodes.
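To make the idea of chunks scattered across a cluster more concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names `split_into_chunks` and `assign_to_nodes`, the node names and the round-robin placement are assumptions made for this example, not the way Hadoop actually places data.

```python
def split_into_chunks(records, chunk_size):
    """Cut a list of records into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def assign_to_nodes(chunks, node_names):
    """Spread the chunks over the nodes in turn (hypothetical round-robin placement)."""
    placement = {name: [] for name in node_names}
    for index, chunk in enumerate(chunks):
        target = node_names[index % len(node_names)]
        placement[target].append(chunk)
    return placement


records = list(range(10))                      # toy dataset of ten records
chunks = split_into_chunks(records, chunk_size=3)
placement = assign_to_nodes(chunks, ["node1", "node2"])

# Adding a node spreads the same chunks more thinly over the cluster.
larger_cluster = assign_to_nodes(chunks, ["node1", "node2", "node3"])
```

With two nodes each one holds two chunks; with three nodes no node holds more than two and one holds only a single chunk, which is the sense in which new nodes absorb larger volumes.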

However, this approach places some specific requirements on data processing. First, it must be possible to handle large volumes. The bottleneck is the master node, which manages the work by splitting it into tasks and assigning them to the other nodes. To avoid overloading the master, the computational cost of this step must be very low. Second, the required time must be reasonable, so the algorithms should be scalable. This means that the volume of data that can be processed in a given time grows about linearly with the available resources, that is, with the total power of the cluster. Finally, the techniques must deal with distributed data. For these reasons, the input and output of any algorithm must be stored in a specific way.
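The scalability requirement can be illustrated with a small calculation. The sketch below assumes an idealized cluster in which every node processes data at the same rate and coordination costs nothing; the function name `ideal_processing_time` and all the figures are hypothetical and chosen only to show the linear relationship.

```python
def ideal_processing_time(volume_gb, nodes, gb_per_hour_per_node):
    """Hours needed when the work is shared perfectly evenly among the nodes.

    Idealized assumption: no overhead on the master and no network cost.
    """
    return volume_gb / (nodes * gb_per_hour_per_node)


# Hypothetical figures: 1000 GB of data, each node handling 20 GB per hour.
hours_with_10_nodes = ideal_processing_time(1000, nodes=10, gb_per_hour_per_node=20)
hours_with_20_nodes = ideal_processing_time(1000, nodes=20, gb_per_hour_per_node=20)
```

Under these assumptions, doubling the number of nodes halves the time. A real cluster falls short of this ideal, and the gap is wider the more work the master has to do.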

Hadoop's solution is MapReduce, a programming model for writing algorithms. As its name suggests, an algorithm is divided into two steps, Map and Reduce. The Map step extracts all the necessary information from each chunk of data. Once the information has been extracted, the Reduce step aggregates it and computes the output. Every Hadoop procedure follows the MapReduce logic, and there are several high-level tools built on this structure that keep it hidden from the user. However, developing new algorithms requires following the MapReduce logic explicitly. As for R, there are some packages that connect it to Hadoop and allow writing low-level code.
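The two steps can be sketched with the classic word count example. This is plain Python running on a single machine, not Hadoop code and not one of the R packages mentioned above. The function names are made up for the illustration, and the intermediate grouping by key (usually called the shuffle) is written out by hand here, whereas a real framework performs it between the two steps.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word found in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def group_by_key(pairs):
    """Collect all the values emitted for the same key (the 'shuffle')."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: aggregate the values of one key into a single count."""
    return key, sum(values)


def map_reduce(chunks):
    """Run the Map step on every chunk, group the pairs, then reduce each group."""
    pairs = [pair for chunk in chunks for pair in map_chunk(chunk)]
    return dict(reduce_group(key, values) for key, values in group_by_key(pairs).items())


# Two toy chunks, as if they were stored on two different nodes.
chunks = [["big data with R", "R and Hadoop"], ["Hadoop stores big data"]]
word_counts = map_reduce(chunks)
```

Note that `map_chunk` only ever looks at one chunk, so each node could run it on its own data without talking to the others.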

The MapReduce structure is capable of handling Big Data problems. Indeed, mapping consists of performing the same chosen operation on each datum, so the computational cost grows linearly with the volume. Reducing is more complicated, but if it is designed properly it is scalable too. In conclusion, the approach scales, although its efficiency is generally low and depends on the algorithm.
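One common way to design a scalable Reduce step is to make it merge partial results, so that it never needs the raw data and the merges can happen in any order. The sketch below computes a mean in this way. It is an illustrative example in plain Python, and the names `partial_sum_count` and `merge_partials` are assumptions for this post rather than part of any Hadoop API.

```python
from functools import reduce


def partial_sum_count(chunk):
    """Map: a single pass over the chunk, so the cost is proportional to its size."""
    return (sum(chunk), len(chunk))


def merge_partials(a, b):
    """Reduce: merge two (sum, count) pairs; the operation is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


# Three toy chunks of different sizes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [partial_sum_count(chunk) for chunk in chunks]

total, count = reduce(merge_partials, partials)
mean = total / count
```

Averaging the chunk means directly would give the wrong answer here because the chunks have different sizes. Carrying the pair (sum, count) is what makes the Reduce step both correct and cheap.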

This brief description covers only the logic on which MapReduce is based. Stay tuned for a more detailed description.
