R can be connected with Hadoop through the rmr2 package. The core of this package is the mapreduce() function, which allows writing custom MapReduce algorithms in R. The aim of this article is to show how it works and to provide an example.
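As a minimal sketch of how mapreduce() is called (assuming rmr2 and its dependencies are installed; the "local" backend is rmr2's documented way of running jobs without a Hadoop cluster, which is handy for testing), the following job squares a small vector of integers:

```r
library(rmr2)

# Use the local backend: jobs run in-process, no Hadoop cluster needed
rmr.options(backend = "local")

# Move an in-memory R vector into the (local) distributed file system
small.ints <- to.dfs(1:10)

# A minimal MapReduce job: the map step emits (value, value^2) pairs;
# no reduce step is needed here
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v ^ 2)
)

# Bring the result back into memory as a key/value structure
out <- from.dfs(result)
values(out)  # the squared values
```

Note that to.dfs() and from.dfs() are the bridge between ordinary R objects and HDFS storage, while mapreduce() itself only ever sees key/value pairs built with keyval().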
As mentioned in the previous article, one way of dealing with Big Data problems is to integrate R within the Hadoop ecosystem. This requires a bridge between the two environments: R must be able to handle data that are stored in the Hadoop Distributed File System (HDFS). In order to process the distributed data, all the algorithms must follow the MapReduce model, which makes it possible to handle the data and to parallelize the jobs. Another requirement is a unified analysis procedure, so there must be a connection between the in-memory R environment and HDFS.
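To make the MapReduce model itself concrete before involving Hadoop at all, here is a toy word count written in plain base R (the document's running example domain is assumed; the map phase emits (word, 1) pairs and the reduce phase sums the counts per key):

```r
# Toy MapReduce in plain R (no Hadoop): word count over three "documents"
docs <- c("big data with R", "R and Hadoop", "big data")

# Map phase: for each document, emit a (word, 1) pair per word
mapped <- unlist(lapply(docs, function(d) {
  words <- strsplit(d, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))

# Shuffle + reduce phase: group the pairs by key (the word) and sum them
counts <- tapply(mapped, names(mapped), sum)
counts["data"]  # 2, since "data" appears in two documents
```

On a real cluster, the map and reduce functions look much the same, but the grouping by key happens across machines, which is what allows the job to be parallelized.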
Since R keeps its objects in the computer's RAM, it can handle only rather small data sets. Nevertheless, some packages make it possible to work with larger volumes, and the best solution is to connect R with a Big Data environment. This post introduces some Big Data concepts that are fundamental to understanding how R can work in such an environment. Subsequent posts will explain in detail how R can be connected with Hadoop.