As mentioned in the previous article, a possibility for dealing with some Big Data problems is to integrate R within the Hadoop ecosystem. Therefore, it's necessary to have a bridge between the two environments. It means that R should be capable of handling data the are stored through the Hadoop Distributed File System (HDFS). In order to process the distributed data, all the algorithms must follow the MapReduce model. This allows to handle the data and to parallelize the jobs. Another requirement is to have an unique analysis procedure, so there must be a connection between in-memory and HDFS places.

Share: