The post Performing Principal Components Regression (PCR) in R appeared first on MilanoR.

*This article was originally posted on Quantide blog - see here.*

Principal components regression (**PCR**) is a regression technique based on principal component analysis (**PCA**).

The basic idea behind **PCR** is to calculate the **principal components** and then use some of these components as predictors in a linear regression model fitted using the typical least squares procedure.

As you can easily notice, the core idea of PCR is very closely related to the one underlying PCA, and the “trick” is very similar. In some cases a small number of principal components are enough to explain the vast majority of the variability in the data. For instance, say you have a dataset of 50 variables that you would like to use to predict a single variable. By using PCR you might find out that 4 or 5 principal components are enough to explain 90% of the variance of your data. In this case, you might be better off running PCR with these 5 components instead of running a linear model on all the 50 variables. This is a rough example but I hope it helps to get the point through.
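Before committing to PCR, you can check how much variance the first components capture with a quick PCA. A minimal sketch using `prcomp()` on the built-in *USArrests* data (standing in for your own predictor matrix):

```r
# Sketch: inspect the cumulative variance explained by the components
pca <- prcomp(USArrests, scale. = TRUE)  # USArrests: built-in numeric dataset
summary(pca)  # the "Cumulative Proportion" row shows how many components
              # you need to reach, say, 90% of the variance
```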

A core assumption of **PCR** is that the directions in which the predictors show the most variation are exactly the directions associated with the response variable. This assumption is not guaranteed to hold 100% of the time; however, even when it is not completely true, it can be a good approximation and yield interesting results.

Some of the most notable advantages of performing PCR are the following:

- Dimensionality reduction
- Avoidance of multicollinearity between predictors
- Overfitting mitigation

Let’s briefly walk through each one of them:

By using PCR you can easily perform dimensionality reduction on a high dimensional dataset and then fit a linear regression model to a smaller set of variables, while at the same time keeping most of the variability of the original predictors. Since using only some of the principal components reduces the number of variables in the model, this helps in reducing the model complexity, which is always a plus. If you need a lot of principal components to explain most of the variability in your data, say roughly as many principal components as the number of variables in your dataset, then PCR might not perform well in that scenario; it might even be worse than plain vanilla linear regression.

PCR tends to perform well when the first principal components are enough to explain most of the variation in the predictors.

A significant benefit of PCR is its ability to sidestep multicollinearity: if there is some degree of multicollinearity between the variables in your dataset, PCR avoids the problem, since performing PCA on the raw data produces linear combinations of the predictors that are uncorrelated.
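You can verify the uncorrelatedness claim directly: the component scores returned by PCA have (numerically) zero pairwise correlation. A small sketch on the four numeric columns of *iris*:

```r
# Sketch: principal component scores are uncorrelated by construction
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # the four numeric columns of iris
round(cor(pca$x), 10)                      # off-diagonal entries are ~0
```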

If all the assumptions underlying PCR hold, then fitting a least squares model to the principal components will lead to better results than fitting a least squares model to the original data, since most of the variation and information related to the dependent variable is condensed in the principal components, and by estimating fewer coefficients you can reduce the risk of overfitting.

As always with potential benefits come potential risks and drawbacks.

For instance, a typical mistake is to consider PCR a feature selection method. PCR is not a feature selection method because each of the calculated principal components is a linear combination of the original variables. Using principal components instead of the actual features can make it harder to explain what is affecting what.

Another major drawback of PCR is that the directions that best represent each predictor are obtained in an unsupervised way. The dependent variable is not used to identify each principal component direction. This essentially means that it is not certain that the directions found will be the optimal directions to use when making predictions on the dependent variable.

There are a bunch of packages that perform PCR; however, in my opinion, the **pls package** offers the easiest option. It is very user friendly, and furthermore it can perform data standardization too. Let’s give it a try.

Before performing PCR, it is preferable to standardize your data. This step is not necessary, but strongly suggested, since PCA is not scale invariant. You might ask why it is important that each predictor is on the same scale as the others. The scaling prevents the algorithm from being skewed towards predictors that are dominant in absolute scale but perhaps not as relevant as others. In other words, variables with higher variance will influence the calculation of the principal components more and overall have a larger effect on the final results of the algorithm. Personally, I would prefer to standardize the data most of the time.
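If you prefer to standardize by hand instead of delegating to `pcr()`'s `scale` argument (shown below), base R's `scale()` does the job. A quick sketch on the numeric columns of *iris*:

```r
# Standardizing manually: each column gets mean 0 and standard deviation 1
iris_std <- scale(iris[, 1:4])
round(colMeans(iris_std), 10)  # all zeros
apply(iris_std, 2, sd)         # all ones
```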

Another thing to assess before running PCR is missing data: you should remove all the observations containing missing data, or approximate the missing observations with some technique before running the PCR function.
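Both options can be sketched in a couple of lines; here `df` is a hypothetical data frame containing NAs, and the mean-imputation helper is just one simple illustrative choice among many techniques:

```r
# Option 1: drop every observation containing missing data ('df' is hypothetical)
df_complete <- na.omit(df)

# Option 2 (sketch): impute, e.g. replacing NAs in numeric columns with column means
df_imputed <- as.data.frame(lapply(df, function(col) {
  if (is.numeric(col)) col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))
```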

For this toy example, I am using the evergreen *iris* dataset.

require(pls)
set.seed(1000)
pcr_model <- pcr(Sepal.Length~., data = iris, scale = TRUE, validation = "CV")

By setting the parameter *scale* equal to *TRUE*, the data is standardized before running the pcr algorithm on it. You can also perform validation by setting the argument *validation*. In this case I chose to perform 10-fold cross-validation, and therefore set the *validation* argument to “CV”; however, there are other validation methods available: just type *?pcr* in the R command window to gather some more information on the parameters of the *pcr* function.

In order to print out the results, simply use the *summary* function as below:

summary(pcr_model)

## Data:    X dimension: 150 5
##  Y dimension: 150 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV          0.8308   0.5141   0.5098   0.3947   0.3309   0.3164
## adjCV       0.8308   0.5136   0.5092   0.3941   0.3303   0.3156
##
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps
## X               56.20    88.62    99.07    99.73   100.00
## Sepal.Length    62.71    63.58    78.44    84.95    86.73

As you can see, two main results are printed, namely the **validation error** and the **cumulative percentage of variance explained** using n components.

The **cross validation** results are computed for each number of components used so that you can easily check the score with a particular number of components without trying each combination on your own.
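The same scores can also be extracted programmatically rather than read off the printed summary. A hedged sketch using the pls package's `RMSEP()` accessor (indexing into its `$val` array, whose layout I assume here to be estimate × response × model):

```r
# Sketch: extract the CV scores shown by summary()
cv_rmsep <- RMSEP(pcr_model)  # CV and adjCV estimates per number of components
cv_rmsep$val["CV", 1, ]       # cross-validated RMSEP, intercept-only model first
```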

The pls package also provides a set of methods to plot the results of PCR. For example, you can plot the results of cross validation using the **validationplot** function.

By default, the pcr function computes the root mean squared error, and the **validationplot** function plots this statistic; however, you can choose to plot the usual mean squared error or the R2 by setting the **val.type** argument equal to “MSEP” or “R2” respectively:

# Plot the root mean squared error
validationplot(pcr_model)

# Plot the cross validation MSE
validationplot(pcr_model, val.type="MSEP")

# Plot the R2
validationplot(pcr_model, val.type = "R2")

What you would like to see is a low **cross validation error** with a number of components lower than the number of variables in your dataset. If this is not the case, or if the smallest cross validation error occurs with a number of components close to the number of variables in the original data, then no dimensionality reduction occurs. In the example above, it looks like 3 components are enough to explain more than 90% of the **variability** in the data, although the CV score is a little higher than with 4 or 5 components. Finally, note that 5 components explain all the variability, as expected.
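Recent versions of the pls package (2.5-0 and later, if I recall correctly) also ship a helper that suggests a parsimonious number of components from the CV curve, so you don't have to eyeball the plot. A hedged sketch:

```r
# Sketch: let pls suggest the number of components ("one sigma" heuristic:
# the smallest model within one standard error of the CV optimum)
ncomp_onesigma <- selectNcomp(pcr_model, method = "onesigma")
ncomp_onesigma
```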

You can plot the predicted vs measured values using the *predplot* function as below

predplot(pcr_model)

while the regression coefficients can be plotted using the *coefplot* function

coefplot(pcr_model)

Now you can try to use PCR on a training-test split and evaluate its performance, for example using only 3 components.

# Train-test split
train <- iris[1:120,]
y_test <- iris[120:150, 1]
test <- iris[120:150, 2:5]

pcr_model <- pcr(Sepal.Length~., data = train, scale = TRUE, validation = "CV")
pcr_pred <- predict(pcr_model, test, ncomp = 3)
mean((pcr_pred - y_test)^2)

## [1] 0.213731

With the iris dataset there is probably no need to use PCR, in fact, it may even be worse using it. However, I hope this toy example was useful to introduce this model.

Thank you for reading this article, please feel free to leave a comment if you have any questions or suggestions and share the post with others if you find it useful.


The post dplyr do: Some Tips for Using and Programming appeared first on MilanoR.

*This post was originally posted on Quantide blog. Read the full article here.*

If you want to compute arbitrary operations on a data frame, returning more than one number back, use **dplyr `do()`**!

This post aims to explore some basic concepts of `do()`, along with giving some advice on using and programming it.

`do()` is a verb (function) of `dplyr`. `dplyr` is a powerful R package for data manipulation, written and maintained by Hadley Wickham. This package allows you to perform the common data manipulation tasks on data frames, like: filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.

First of all, you have to install the `dplyr` package:

install.packages("dplyr")

and to load it:

require(dplyr)
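With `dplyr` loaded, here is a quick taste of the main verbs mentioned above, chained with the `%>%` pipe. This is just an illustrative sketch on the built-in *iris* data, not part of the example that follows:

```r
# Sketch: filter rows, select columns, add a column, summarize by group
iris %>%
  filter(Species != "setosa") %>%            # filter rows
  select(Species, Sepal.Length) %>%          # select specific columns
  mutate(double_sl = Sepal.Length * 2) %>%   # add a new column
  group_by(Species) %>%                      # ...then summarize within groups
  summarise(mean_sl = mean(Sepal.Length))
```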

We will analyze the use of `do()` with the following dataset, created with random data:

set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

We firstly transform it into a `tbl_df` object to achieve a better print method. No changes occur on the input data frame.

ds <- tbl_df(ds)
ds

Source: local data frame [300 x 3]

    group        x           y
   (fctr)    (dbl)       (dbl)
1       a 1.995615 -1.71089045
2       a 3.263062 -0.03712943
3       a 2.842166 -0.09022217
4       a 4.773570  0.69742469
5       a 3.233943  2.76536531
6       a 3.637260  4.06379942
7       a 1.836419  2.26214995
8       a 4.429065  2.75438347
9       a 1.349481 -1.77539016
10      a 2.280276  3.04043881
..    ...      ...         ...

As we already said, `do()` computes arbitrary operations on a data frame, returning more than one number back.

To use `do()`, you must know that:

- it always **returns a data frame**;
- unlike the other data manipulation verbs of `dplyr`, `do()` needs the specification of a `.` placeholder inside the function to apply, referring to the data it has to work with.

# Head of ds
ds %>% do(head(.))

Source: local data frame [6 x 3]

   group        x           y
  (fctr)    (dbl)       (dbl)
1      a 1.995615 -1.71089045
2      a 3.263062 -0.03712943
3      a 2.842166 -0.09022217
4      a 4.773570  0.69742469
5      a 3.233943  2.76536531
6      a 3.637260  4.06379942

- it is conceived to be used with the dplyr `group_by()` verb, to compute operations within groups:

# Head of ds by group
ds %>% group_by(group) %>% do(head(.))

Source: local data frame [18 x 3]
Groups: group [3]

    group          x           y
   (fctr)      (dbl)       (dbl)
1       a 1.99561530 -1.71089045
2       a 3.26306233 -0.03712943
3       a 2.84216582 -0.09022217
4       a 4.77356962  0.69742469
5       a 3.23394254  2.76536531
6       a 3.63726018  4.06379942
7       b 2.33415330 -0.56965729
8       b 5.72622741  1.71643653
9       b 2.06170532  4.87756954
10      b 4.68575126 -0.08011508
11      b 0.08401255 -0.04767590
12      b 2.19938816  4.18954758
13      c 3.05634353 -0.89257491
14      c 2.28659319  2.63171152
15      c 4.70525275  1.31450497
16      c 4.02673050 -1.86270620
17      c 5.03640599  2.48564201
18      c 0.95704183  1.27446410

- the argument of `do()` can be **named** or **unnamed**:
  - named arguments (more than one supplied) become list-columns, with one element for each group:

# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(out=tail(.$x, 3))

Source: local data frame [3 x 2]
Groups: <by row>

   group      out
  (fctr)    (chr)
1      a <dbl[3]>
2      b <dbl[3]>
3      c <dbl[3]>

  - an unnamed argument (only one supplied) must be a data frame, and labels will be duplicated accordingly:

# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))

Source: local data frame [9 x 2]
Groups: group [3]

   group       out
  (fctr)     (dbl)
1      a 3.8270397
2      a 0.6426337
3      a 0.6519305
4      b 3.3238824
5      b 0.8290942
6      b 4.1538746
7      c 6.5861213
8      c 4.6280643
9      c 0.3599512

Its use is the same when working with customized functions.

Let us define the following function, which performs two simple operations returning a data frame:

my_fun <- function(x, y){
  res_x = mean(x) + 2
  res_y = mean(y) * 5
  return(data.frame(res_x, res_y))
}

If the argument is named the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))

Source: local data frame [3 x 2]
Groups: <by row>

   group                out
  (fctr)              (chr)
1      a <data.frame [1,2]>
2      b <data.frame [1,2]>
3      c <data.frame [1,2]>

Otherwise, if the argument is unnamed, the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))

Source: local data frame [3 x 3]
Groups: group [3]

   group    res_x     res_y
  (fctr)    (dbl)     (dbl)
1      a 5.005825  9.167546
2      b 5.022282  8.683619
3      c 5.025586 11.240558

How can we enclose the previous operations inside a function? Simple! Using **do_()** (the SE version of `do()`) and `interp()` from the `lazyeval` package.

Continue reading on Quantide blog...


The post Playing Around with Methods Overloading, C-language and Operators (1) appeared first on MilanoR.

*This post was originally posted on Quantide blog. Read the full article here.*

R is an object-oriented (OO) language. This basically means that R is able to recognize the type of objects generated by an analysis and to apply the right operation to different objects.

For example, the `summary(x)` method performs different operations depending on the so-called “class” of `x`:

x <- data.frame(first=1:10, second=letters[1:10])
class(x) # An object of class "data.frame"

## [1] "data.frame"

summary(x)

##      first       second
##  Min.   : 1.00   a      :1
##  1st Qu.: 3.25   b      :1
##  Median : 5.50   c      :1
##  Mean   : 5.50   d      :1
##  3rd Qu.: 7.75   e      :1
##  Max.   :10.00   f      :1
##                  (Other):4

ds <- data.frame(x=1:100, y=100 + 3 * 1:100 + rnorm(100, sd = 10))
md <- lm(formula = y ~ x, data = ds)
class(md) # An object of class "lm"

## [1] "lm"

summary(md)

##
## Call:
## lm(formula = y ~ x, data = ds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -20.1695  -5.8434  -0.4058   5.3611  27.9861
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.07892    1.99555   50.15   <2e-16 ***
## x             2.98890    0.03431   87.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.903 on 98 degrees of freedom
## Multiple R-squared:  0.9873, Adjusted R-squared:  0.9871
## F-statistic:  7590 on 1 and 98 DF,  p-value: < 2.2e-16

The outputs reported, and the calculations performed, by the `summary()` method are really different for the `x` and the `md` objects.

This behavior is one of the characteristics of OO languages, and it is called *methods overloading*.

The methods overloading can be applied not only to methods in “function form” (i.e., to methods like `summary()`), but also to operators; indeed, “behind the scenes”, the operators are functions/methods. For example, if we try to write `+` in the R console we obtain:

`+`

## function (e1, e2) .Primitive("+")

That means that the `+` operator actually is a function/method that requires two arguments: `e1` and `e2`, which are respectively the left and right argument of the operator itself.
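Since `+` is a function, it can be called in prefix form, and an S3 method can be written for it. A small sketch with a hypothetical "angle" class (my own illustrative example, not from the original post), where sums wrap around 360 degrees:

```r
# `+` called as an ordinary function:
`+`(2, 3)   # same as 2 + 3

# Overloading `+` for a hypothetical S3 class "angle":
`+.angle` <- function(e1, e2) {
  structure((unclass(e1) + unclass(e2)) %% 360, class = "angle")
}
a <- structure(350, class = "angle")
b <- structure(20, class = "angle")
unclass(a + b)   # 10: R dispatched to our `+.angle` method
```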

The `+` operator is present in base R, and can be overloaded as well, as in the `ggplot2` package, where the `+` operator is used to “build” the graph characteristics, as in the following example:

require(ggplot2)
prds <- predict(object = md, interval = "prediction", level = .9)
ds <- cbind(ds, prds)
ds$outliers <- as.factor(ds$y < ds$lwr | ds$y > ds$upr)
graph <- ggplot(data = ds, mapping = aes(x=x, y=y, color=outliers))
graph <- graph + geom_point()
graph <- graph + geom_line(aes(y=fit), col="blue")
graph <- graph + geom_line(aes(y=lwr), col="green")
graph <- graph + geom_line(aes(y=upr), col="green")
graph <- graph + ggtitle("Regression and 90% prediction bands")
print(graph)

The `+` operator, then, is applied differently to `ggplot2` objects (with respect to other object types), where it “concatenates” or “assembles” parts of the final graph.

In this small post, and following ones, I would like to produce some “jokes” with objects, operators, overloading, and similar “oddities”.

**The `+=` operator and its emulation in R**

In the *C* language there are several useful operators that allow the programmer to save some typing and to produce more efficient and easier-to-read code. The first operator that I would like to discuss is the `+=` one.

`+=` is an operator that performs operations like `a = a + k`.

In *C*, the above statement can be shortened to `a += k`. Of course, the statement can be something more complex, like `a += (x-log(2))^2`. In this case, the code line shall be “translated” to `a = a + (x-log(2))^2`.

If I would like to have in R a new operator that acts similarly to *C*’s `+=`, I would have to create it.

Unfortunately, not all names are allowed in R for new operators: if I want to produce a new operator I can only use names like `%mynewoperator%`, where the `%` symbols are mandatory.
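As a warm-up, any function bound to a `%...%` name becomes a binary operator. A trivial sketch (the name `%plus%` is just an illustrative choice):

```r
# A trivial user-defined binary operator:
`%plus%` <- function(a, b) a + b
3 %plus% 4   # 7
```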

Indeed, for this example, I will create a new `%+=%` operator that acts similarly to the *C*’s `+=`.

This new operator has to be able to get the values of variables passed as arguments, to sum them, and then, more importantly, to update the value of the first variable with the new value.
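The full article continues on the Quantide blog, but one possible sketch of such an operator (my own, under the assumption that the left-hand side is a plain variable name) uses `substitute()` to capture the target variable and `assign()` to update it in the caller's environment:

```r
# Sketch: an R emulation of C's += (assumes a plain variable name on the left)
`%+=%` <- function(lhs, rhs) {
  target <- substitute(lhs)                       # capture the variable, unevaluated
  value  <- eval(target, parent.frame()) + rhs    # current value + right-hand side
  assign(deparse(target), value, envir = parent.frame())  # update in the caller
  invisible(value)
}

a <- 1
a %+=% (3 - 1)^2   # equivalent to a = a + (3 - 1)^2; a is now 5
```

Note that `^` binds tighter than `%...%` operators in R, so the right-hand side is evaluated as a whole before the assignment.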

Continue reading on Quantide blog...


The post R Courses - Autumn Term Calendar appeared first on MilanoR.

Hi guys! We have just set the dates of our **Live R Courses for the Autumn term.**

There are five R courses (plus maybe two more, let's see) of **two days each**, open to 6 attendees and located in our premises close to Milano. Whether you are a beginner or a proficient R user, a PhD student or a company manager, there is something for you. Here they are:

**October 3-4:** **R for Beginners**. Learn the **basics of R**, and get an overview on methods for data import, data manipulation, data visualization and data analysis. Reserve now

**October 17-18: Efficient Data Manipulation with R.** Handle every kind of Data Management task, using the most modern R tools: **tidyr, dplyr and lubridate**. Even with backend databases. Reserve now

**October 25-26: Statistical Models with R.** Develop a wide variety of statistical models with R, from the simplest** Linear Regression** to the most sophisticated **GLM models**. Reserve now

**November 7-8: Data Mining with R.** Find patterns in large data sets using the R tools for **Dimensionality Reduction, Clustering, Classification and Prediction.** Reserve now

**November 15-16: R for Developers.** Move forward from being an R user to becoming an R developer. Discover the **R working mechanisms** and master your R programming skills. Reserve now

If you wish to get a wider overview of our training offer and clients, check out our Training presentation.

If you have questions, don't hesitate to contact us at training*[at]*quantide*[dot]*com.

Quantide is a provider of consulting services and training courses on Data Science and Big Data. The company is specialized in R, the open source software for statistical computing. Headquartered in Legnano, near Milan (Italy), Quantide has been supporting several customers from many industries all over the world for 9 years.

Quantide offers a wide range of R Courses. Each course strikes a good balance between **explanations and exercises**. Students are always invited to present real case studies in the classroom. All our teachers have taught hundreds of R courses in companies and universities.


The post How to reshape data in R: tidyr vs reshape2 appeared first on MilanoR.

We often find ourselves tidying and reshaping data. Here we consider the two packages *tidyr* and *reshape2*; our aim is to see where their purposes overlap and where they differ by comparing the functions `gather()`, `separate()` and `spread()`, from *tidyr*, with the functions `melt()`, `colsplit()` and `dcast()`, from *reshape2*.

Data tidying is the operation of transforming data into a clear and simple form that makes it easy to work with. “Tidy data” represent the information from a dataset as data frames where each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). Compare the two data frames below (cf. Wickham (2014)) to get an idea of the differences: `example.tidy` is the tidy version of `example.messy`; the same information is organized in two different ways.

example.messy

##              treatmenta treatmentb
## John Smith           NA          2
## Jane Doe             16         11
## Mary Johnson          3          1

example.tidy

##           name        trt result
## 1   John Smith treatmenta     NA
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

We now begin by seeing in action how we can bring data from the "wide" to the "long" format.

Let’s start loading the packages we need:

library(tidyr)
library(reshape2)

and some data (from RStudio Blog - Introducing tidyr): we have measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

set.seed(10)
messy <- data.frame(id = 1:4,
                    trt = sample(rep(c('control', 'treatment'), each = 2)),
                    work.T1 = runif(4),
                    home.T1 = runif(4),
                    work.T2 = runif(4),
                    home.T2 = runif(4))
messy

##   id       trt    work.T1   home.T1   work.T2    home.T2
## 1  1 treatment 0.08513597 0.6158293 0.1135090 0.05190332
## 2  2   control 0.22543662 0.4296715 0.5959253 0.26417767
## 3  3 treatment 0.27453052 0.6516557 0.3580500 0.39879073
## 4  4   control 0.27230507 0.5677378 0.4288094 0.83613414

Our first step is to put the data in the tidy format; to do that, we use *tidyr*’s functions `gather()` and `separate()`. Following Wickham’s tidy data definition, this data frame is not tidy because some variable values are in the column names. We bring this messy data frame from the wide to the long format by using the `gather()` function (give a look at Sean C. Anderson - An Introduction to reshape2 to get an idea of the wide/long format). We want to gather all the columns, except for the *id* and *trt* ones, in two columns *key* and *value*:

gathered.messy <- gather(messy, key, value, -id, -trt)
head(gathered.messy)

##   id       trt     key      value
## 1  1 treatment work.T1 0.08513597
## 2  2   control work.T1 0.22543662
## 3  3 treatment work.T1 0.27453052
## 4  4   control work.T1 0.27230507
## 5  1 treatment home.T1 0.61582931
## 6  2   control home.T1 0.42967153

Note that in `gather()` we used bare variable names to specify the names of the *key*, *value*, *id* and *trt* columns.

We can get the same result with the `melt()` function from *reshape2*:

molten.messy <- melt(messy,
                     variable.name = "key",
                     value.name = "value",
                     id.vars = c("id", "trt"))
head(molten.messy)

##   id       trt     key      value
## 1  1 treatment work.T1 0.08513597
## 2  2   control work.T1 0.22543662
## 3  3 treatment work.T1 0.27453052
## 4  4   control work.T1 0.27230507
## 5  1 treatment home.T1 0.61582931
## 6  2   control home.T1 0.42967153

We now compare the two functions by running them over the data without any further parameter and see what happens:

head(gather(messy))

## Warning: attributes are not identical across measure variables; they will
## be dropped

##   key     value
## 1  id         1
## 2  id         2
## 3  id         3
## 4  id         4
## 5 trt treatment
## 6 trt   control

head(melt(messy))

## Using trt as id variables

##         trt variable      value
## 1 treatment       id 1.00000000
## 2   control       id 2.00000000
## 3 treatment       id 3.00000000
## 4   control       id 4.00000000
## 5 treatment  work.T1 0.08513597
## 6   control  work.T1 0.22543662

We see a different behaviour: `gather()` has brought `messy` into a long data format, with a warning, by treating all columns as variables, while `melt()` has treated *trt* as an “id variable”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if `melt()` does not receive any id.variables specification, then it will use the factor or character columns as id variables. `gather()` requires the columns that need to be treated as ids; all the other columns are going to be used as key-value pairs.

Despite those last different results, we have seen that the two functions can be used to perform exactly the same operations on data frames, and only on data frames! Indeed, `gather()` cannot handle matrices or arrays, while `melt()` can, as shown below.

set.seed(3)
M <- matrix(rnorm(6), ncol = 3)
dimnames(M) <- list(letters[1:2], letters[1:3])
melt(M)

##   Var1 Var2       value
## 1    a    a -0.96193342
## 2    b    a -0.29252572
## 3    a    b  0.25878822
## 4    b    b -1.15213189
## 5    a    c  0.19578283
## 6    b    c  0.03012394

gather(M)

## Error in UseMethod("gather_"): no applicable method for 'gather_' applied to an object of class "c('matrix', 'double', 'numeric')"

Our next step is to split the column *key* into two different columns in order to separate the *location* and *time* variables and obtain a tidy data frame:

tidy <- separate(gathered.messy, key, into = c("location", "time"), sep = "\\.")
res.tidy <- cbind(molten.messy[1:2],
                  colsplit(molten.messy[, 3], "\\.", c("location", "time")),
                  molten.messy[4])
head(tidy)

##   id       trt location time      value
## 1  1 treatment     work   T1 0.08513597
## 2  2   control     work   T1 0.22543662
## 3  3 treatment     work   T1 0.27453052
## 4  4   control     work   T1 0.27230507
## 5  1 treatment     home   T1 0.61582931
## 6  2   control     home   T1 0.42967153

head(res.tidy)

##   id       trt location time      value
## 1  1 treatment     work   T1 0.08513597
## 2  2   control     work   T1 0.22543662
## 3  3 treatment     work   T1 0.27453052
## 4  4   control     work   T1 0.27230507
## 5  1 treatment     home   T1 0.61582931
## 6  2   control     home   T1 0.42967153

Again, the result is the same, but we need a workaround: because `colsplit()` operates only on a single column, we use `cbind()` to insert the two new columns in the data frame. `separate()` performs all the operations at once, reducing the possibility of making mistakes.

Finally, we compare `spread()` with `dcast()` using the example data frame from the `spread()` documentation itself. Briefly, `spread()` is complementary to `gather()` and brings data from the long to the wide format.

set.seed(14)
stocks <- data.frame(time = as.Date('2009-01-01') + 0:9,
                     X = rnorm(10, 0, 1),
                     Y = rnorm(10, 0, 2),
                     Z = rnorm(10, 0, 4))
stocksm <- gather(stocks, stock, price, -time)
spread.stock <- spread(stocksm, stock, price)
head(spread.stock)

##         time           X          Y          Z
## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590
## 2 2009-01-02  1.71895416  0.5988432 -0.7943331
## 3 2009-01-03  2.12166699  1.3484795  0.5554631
## 4 2009-01-04  1.49715368 -0.5856326 -1.1173440
## 5 2009-01-05 -0.03614058  0.9761067  2.8356777
## 6 2009-01-06  1.23194518  1.7656036 -3.0664418

cast.stock <- dcast(stocksm, formula = time ~ stock, value.var = "price")
head(cast.stock)

##         time           X          Y          Z
## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590
## 2 2009-01-02  1.71895416  0.5988432 -0.7943331
## 3 2009-01-03  2.12166699  1.3484795  0.5554631
## 4 2009-01-04  1.49715368 -0.5856326 -1.1173440
## 5 2009-01-05 -0.03614058  0.9761067  2.8356777
## 6 2009-01-06  1.23194518  1.7656036 -3.0664418

Again, the same result produced by `spread()` can be obtained using `dcast()` by specifying the correct `formula`.

In the next section, we are going to modify the `formula` parameter in order to perform some data aggregation and compare the two packages further.

Up to now we made *reshape2* follow *tidyr*, showing that everything you can do with *tidyr* can be achieved with *reshape2* too, at the price of some workarounds. As we now go on with our simple example, we will get out of the purposes of *tidyr* and have no more functions available for our needs. Now we have a tidy data set - one observation per row and one variable per column - to work with. We show some aggregations that are possible with `dcast()` using the `tips` data frame from *reshape2*. `tips` contains the information one waiter recorded about each tip he received over a period of a few months working in one restaurant.

head(tips)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

m.tips <- melt(tips)

## Using sex, smoker, day, time as id variables

head(m.tips)

##      sex smoker day   time   variable value
## 1 Female     No Sun Dinner total_bill 16.99
## 2   Male     No Sun Dinner total_bill 10.34
## 3   Male     No Sun Dinner total_bill 21.01
## 4   Male     No Sun Dinner total_bill 23.68
## 5 Female     No Sun Dinner total_bill 24.59
## 6   Male     No Sun Dinner total_bill 25.29

We use `dcast()` to get information on the average total bill, tip and group size per day and time:

dcast(m.tips, day+time ~ variable, mean)

##    day   time total_bill      tip     size
## 1  Fri Dinner   19.66333 2.940000 2.166667
## 2  Fri  Lunch   12.84571 2.382857 2.000000
## 3  Sat Dinner   20.44138 2.993103 2.517241
## 4  Sun Dinner   21.41000 3.255132 2.842105
## 5 Thur Dinner   18.78000 3.000000 2.000000
## 6 Thur  Lunch   17.66475 2.767705 2.459016

And the averages by whether or not there were smokers in the group:

dcast(m.tips, smoker ~ variable, mean)

##   smoker total_bill      tip     size
## 1     No   19.18828 2.991854 2.668874
## 2    Yes   20.75634 3.008710 2.408602

There is no function in the *tidyr* package that allows us to perform a similar operation; the reason is that *tidyr* is designed only for data tidying, not for data reshaping.

At the beginning we saw *tidyr* and *reshape2* functions performing the same operations, suggesting that the two packages are similar, if not equal, in what they do; later, we saw that *reshape2*’s functions can do data aggregation that is not possible with *tidyr*. Indeed, *tidyr*’s aim is data tidying, while *reshape2* has the wider purpose of data reshaping and aggregating. It follows that the *tidyr* syntax is easier to understand and to work with, but its functionalities are limited. Therefore, we use the *tidyr* `gather()` and `separate()` functions to quickly tidy our data and *reshape2*’s `dcast()` to aggregate them.

- RStudio Blog - Introducing tidyr
- R-blogger - Reshape and aggregate data with the R package reshape2.
- reshape2 - R Documentation
- tidyr - R Documentation

Wickham, Hadley. 2014. “Tidy data.” *Journal of Statistical Software* 59 (10).


The post MilanoR meeting | Call for presentations appeared first on MilanoR.

We are delighted to announce that the next MilanoR meeting will take place on **Thursday, October 27th**.

A MilanoR meeting is an occasion to bring together the R users in the Milano area to share knowledge and experiences. The meeting is open to beginners as well as expert R users.

A MilanoR meeting consists of two R talks of about 25 minutes, and a break offered by our main sponsor Quantide, to give plenty of room for discussions and exchange of ideas.

If you feel you have something to present, or you can recommend someone, please contact us at admin[at]milanor[dot]net. If you wish to see examples of our previous presentations, check the downloadable presentations from our previous meetings.

See you soon with new details!

**MilanoR** is a user group dedicated to bringing together local users of the popular open source language R. Our aim is to exchange knowledge, learn and share tricks and techniques, and provide R beginners with an opportunity to meet more experienced users. We wish to spur the adoption of R for innovative research and commercial applications, and to show case studies of real-life R applications in industry and government.

Anyone can join MilanoR: you can subscribe through this link. Anyone can also submit a post for the blog or participate in meetings as a speaker. **MilanoR is open to everyone!**

MilanoR is sponsored by **Quantide**, a Milan-based company focused on R training and consulting.


The post Inventing new words. Tribute to Umberto Eco appeared first on MilanoR.

*15 March 2016*

[Code and data used in the article can be downloaded here: InventingWords]

In the mid-1980s, while I was a first-year student in physics at the Bologna State University (Italy), a few friends of mine dragged me to the School of Drama, Arts and Music Studies (DAMS) to follow a couple of lessons by Umberto Eco. I still remember how happy and impressed I was. Actually, I didn't limit myself to listening to him: I literally drank his words, fascinated by him, his style, his erudition, his rigor, at the same time austere and brilliant. A couple of years later, during my Belgian university course in quantitative social sciences, I had the opportunity to enthusiastically study Eco's academic contributions for my exams in linguistics, semiotics, social sciences and philosophy. Well before the novelist Umberto Eco, I liked, and still like, the investigator, meticulous to a mind-boggling degree, but always ready to expand the domain of his thinking well outside and ahead of the limits of academic knowledge, into popular culture, in search of meaning. His ability to play with words, or to use them either as tools or as historical markers, was astonishing. As I still feel the emotion following his death, I would like to propose, as a kind of tribute and language game, *a method for inventing words* —actually, it is more an algorithm than a method. And it will be focused on the Italian language, which was Eco's mother tongue (and my second mother tongue as well).

The easiest and most immediate way to invent a new word is to sample letters randomly. Proceeding that way to create words of, say, 7 letters, I got: nevltur, capbowqa, grnsohy, tfgzuoq, birymdo^{4}. Nothing wrong with that, but the least one could say is that these words hardly "sound" Italian. How can I assert it? Perhaps because some sounds in these words are so rare in Italian (if not absolutely inexistent) that they do not sound familiar to an Italian ear by any means. Something in the back of our minds tells us that consonant clusters such as *nvl*, *cpbq*, *grns*, *tfgz* or *zuoq* or *rymdo* are not spontaneous in Italian —as a matter of fact, they don't exist in any Italian word. So, the words we use in each language are not a product of chance but follow rules which, in turn, they also contribute to set.

Therefore, if we want to invent new words that *sound* like Italian words we must first identify the rules of formation of phonemes in Italian (or any other language). And a statistical approach can here be of great help.

First of all, I built an Italian dictionary containing about 323,000 words collected from existing dictionaries and a few hundred free eBooks in Italian (novels, essays, etc.).^{5} Analysing the composition of the words in a statistical manner, it appears quite clearly that each letter has a specific probability of being followed by another letter. Likewise, each letter has a certain probability of being the first or the last letter of a word.

For example, let’s look at the letter **p**.

In Italian, the probability that the letter **p** is followed by the letter **a** is about 18%, about 14.5% by **r** or **e**, and so on, down to a 0.2% probability of being followed by **n** (as in the word *apnea*), or 0.01% of being followed by **z** (for example, in *opzione*); it is never followed by **g** or **b**. Perhaps this is the very reason why, intuitively, a word created by completely random sampling like the previous **capbowqa** doesn't seem Italian at all.

So, if we say that a letter is a statistical event, the sequence of events in a word follows a probability chain similar to what statisticians call a **Markov chain**.^{6}

Applying the same rationale to all the letters of the alphabet (including characters with accents), we can build a matrix which is specific to each language and details the probabilities of transition from any letter to any other in the following position. For the Italian language, the transition matrix looks like the chart below, and it reads like this: you choose a letter on the vertical axis, and the corresponding color on the horizontal axis gives an indication of the probability that the vertical-axis letter is followed by the horizontal-axis letter. Grey indicates a probability equal to zero (i.e. the combination of letters doesn't exist in Italian), light blue indicates a very low probability, shading up to dark red for a high probability of transition. Those who wonder in which Italian words the letter **é** precedes the letters **b** or **e**… well, these are words of French origin commonly used in Italian, such as *tournée* or *débacle*.

We can therefore see that in Italian the letter **q** is almost always followed by **u**; rarely, it can be followed by another **q** (*soqquadro*, *soqquadrare*) or by nothing (i.e. it is the last letter of the word, as in *Iraq*). Both *ù* and *à* appear followed only by nothing, so they always end a word and never appear in the middle of one. Predominantly, vowels follow **v**. On the contrary, the letter **a** accepts almost any other letter in its vicinity, except accented vowels.

At this point, it just takes feeding an algorithm with this matrix of transition probabilities from one letter to another to generate new words that will sound Italian despite their total absence of meaning. I can now close this brief article with a personal touch, halfway between serious and facetious, by inventing a story made of many invented words that would sound Italian to Italian ears. Kind of… ear candy for Italian language lovers.
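The generation step can be sketched in R. This is a toy version: a handful of real Italian words stands in for the author's 323,000-word dictionary, and the helper `invent_word()` is my own name for the sampler, so the output is far cruder than his:

```r
set.seed(1)

# Tiny toy vocabulary instead of a full Italian dictionary
words <- c("casa", "cane", "pane", "rana", "sale",
           "mare", "nave", "vela", "rame", "cala")

# Collect every adjacent letter pair, with "^" marking word start
# and "$" marking word end
bigrams <- do.call(rbind, lapply(strsplit(paste0("^", words, "$"), ""),
                                 function(w) cbind(head(w, -1), tail(w, -1))))

# Transition probability matrix: rows = current letter, cols = next letter
trans <- prop.table(table(from = bigrams[, 1], to = bigrams[, 2]), margin = 1)

# Generate one new word by walking the chain from "^" until "$"
invent_word <- function(trans) {
  letter <- "^"
  out <- character(0)
  repeat {
    p <- trans[letter, ]
    letter <- sample(colnames(trans), 1, prob = p)
    if (letter == "$") break
    out <- c(out, letter)
  }
  paste(out, collapse = "")
}

invent_word(trans)  # Italian-sounding nonsense such as "cala" or "rane"
```

With the full dictionary, `trans` would become the letter-by-letter matrix charted above, and the same walk would produce words like those in the story below.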

Ieri, passeggiando lungo la riva del *flumattico*, ho visto sul ramo di un *fregirio* solitario un *sidri* occupato a *cinotolare* il suo *zantaro* che *roromava* di piacere. Ma il *gotriolo* che feci avvicinandomi lo *spinesò* e scappò via, rapido e *iasto*, verso la cima *pravata* della collina. Scomparve presto dalla mia vista e mi rimase solo il *faniaco* di un *untiolo* raro nonché anche il *dolinori* al pensiero che questo *ospruto gollitello* fosse diventato così *cutro* nelle nostre *senioli* campagne.^{7}

- This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve in any way, either directly or indirectly, any of the employers, past and present, of the author. ↩
- eMail: salvino [dot] salvaggio [at] gmail [dot] com ; WebSite: http://www.salvaggio.net/ ↩
- This work was inspired by the blog of David Louapre, Science Etonnante - http://is.gd/zVTGEH ↩
- In R: `paste(sample(letters, 7, replace = TRUE), collapse = "")` ↩
- Novels and essays sometimes contain words in other languages, but in such a small proportion that it doesn't really impact the overall figures. Anyway, I manually cleaned the dictionary, removing foreign words that I spotted while flicking through it. ↩
- To be precise and accurate, the Markov property states that the future event depends *only* on the current state of the system and not on the previous states. In linguistics this is not entirely true, since the probability of finding a letter in a certain position in the word doesn't only depend on the previous letter, but also, although to a lesser extent, on the foregoing others —that is why a consonant which is doubled is never tripled. More on Markov chains in *R*: http://is.gd/nyoFXN ↩
- For the sake of transparency, it should be noted that with this basic version of the algorithm I had to generate hundreds of words to have a sufficient choice to be able to select those that seem *more* Italian. ↩


The post "R for Developers" Course | May 30-31 appeared first on MilanoR.

The **two-day course** **R for Developers** is organized by the R training and consulting company **Quantide**. **Next live class** is on **May 30-31** in **Legnano (Milan)**.

If you want to know more about Quantide, check out Quantide website.

If you are curious about the course, check course page or the full list of R courses.

If you wish to attend the class, reserve a seat on the course ticket page.

If you wish to move forward from being an R user to becoming an R developer, this is the right course for you. This two-day course provides an overview of several advanced R topics and gives you an inside perspective on R's working mechanisms.

This course illustrates a large spectrum of advanced R programming tools. During the first day you will quickly review the basic R objects, followed by an explanation of more advanced R objects such as environments, expressions and calls. Function objects, along with their structures, will then be analysed in detail. R as a functional programming language, including the use of functionals and function factories, will close the first day.

The second day will touch on several independent topics that together form the basis for solid R development know-how. You will explore R as an object-oriented language through S3 and S4. You will learn how to exploit modern computer architectures by learning about parallel computation. As a key asset of your development practice, you will be introduced to efficient programming tools: debugging, profiling and packaging. The second day will end with an analysis of NSE vs SE in R, and the lazyeval package as a tool for building clear and reusable R code.

Euro 800 + VAT

- How R works
- Basic R objects
- Advanced R objects
- Functions
- Functional Programming
- Object Oriented Programming
- Debugging and Profiling
- Building R packages
- Parallel Computation
- NSE vs SE

This class will be a good fit for you if you have a solid R knowledge and want to improve or consolidate your programming skills.

The cost includes lunch, comprehensive course materials + 1 hour of individual online post course support for each student within 30 days from course date.

We offer an academic discount for those engaged in full time studies or research. Please contact us for further information at training[at]quantide[dot]com

A laptop with the latest versions of R and RStudio.

Andrea Spanò is an RStudio certified instructor who has worked as an R trainer and consultant for over 20 years. Andrea graduated in Statistics from the University of Siena and obtained a Master's degree in Applied Statistics at University College London. He runs the Quantide consulting firm and teaches on the Luiss University postgraduate course on Big Data Management.

This course is taught in Italian. Course material is in English.

Legnano is about 30 min by train from Milano. Trains from Milano to Legnano are scheduled every 30 minutes, and Quantide premises are 3 walking minutes from Legnano train station.

You can contact us at training[at]quantide[dot]com


The post Preparing the data for modelling with R appeared first on MilanoR.

One of the first things I came across while studying data science was that three important steps in a data science project are data preparation, creating and testing the model, and reporting. It is a widely accepted fact that data preparation takes up most of the time, followed by creating the model and then reporting. There are opinions which say we should try to reduce the time taken for data preparation, so that we can use it for creating and testing the model. But a model is only as good as the data on which it is built. A simpler model based on clean data will most likely outperform a complicated model based on dirty or ambiguous data.

With an example from a regression problem to predict sales, we can go through some of the common situations we might face while creating a good dataset. The dataset I am using is taken from the Big Mart Sales prediction problem at http://www.analyticsvidhya.com/, which I have modified a bit to include some outliers in the response variable `Item_Outlet_Sales`. The variables are:

**Item_Identifier** : Unique product ID

**Item_Weight** : Weight of product

**Item_Fat_Content** : Whether the product is low fat or not

**Item_Visibility** : The % of total display area of all products in a store allocated to the particular product

**Item_Type** : The category to which the product belongs

**Item_MRP** : Maximum Retail Price (list price) of the product

**Outlet_Identifier** : Unique store ID

**Outlet_Establishment_Year** : The year in which store was established

**Outlet_Size** : The size of the store in terms of ground area covered

**Outlet_Location_Type** : The type of city in which the store is located

**Outlet_Type** : Whether the outlet is just a grocery store or some sort of supermarket

**Item_Outlet_Sales** : Sales of the product in the particular store. This is the outcome variable to be predicted

First let's take a look at the **missing values**. If the number of observations with missing values is much lower than the total number of observations, then there's not much loss of information in dropping them. I am using the function `complete.cases()` to check for rows without missing values. The function returns a logical vector indicating which cases are complete, i.e., have no missing values. Please note that this function looks for NULL/NA values; there might be missing values in other forms, like blanks in character or factor columns.
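To see why blanks slip through, here is a tiny illustration; the `df` data frame is made up for the example:

```r
# complete.cases() only flags NA/NaN; an empty string is a perfectly
# "complete" value, so blanks are not detected as missing
df <- data.frame(id   = 1:3,
                 size = c("Small", "", NA),
                 stringsAsFactors = FALSE)

complete.cases(df)
# [1]  TRUE  TRUE FALSE   <- the blank in row 2 is not caught

# Blanks have to be counted separately
sum(df$size == "", na.rm = TRUE)
# [1] 1
```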

```r
nrows <- nrow(Data)
ncomplete <- sum(complete.cases(Data))
ncomplete

## [1] 7060

ncomplete/nrows

## [1] 0.8283468
```

Here we can see that by dropping all the rows with missing values, we are losing about 18% of data. So we cannot drop them.

Now let's have a proper look into the dataset. We can begin with the response variable. From the information we have, it is a continuous variable. I am using the `ggplot2` package for data visualization, which I believe most of you will be familiar with. I will show the distribution of the dependent variable `Item_Outlet_Sales`.

```r
library(ggplot2)

# Plotting the dependent variable distribution
pl1 <- ggplot(Data, aes(Item_Outlet_Sales))
pl1 + geom_density(fill = "red", alpha = 0.7)
```

Here we can see that the distribution looks similar to a half-normal distribution. If we take a closer look, we can see that there is a sudden spike towards the right end of the graph. This might possibly be a sentinel value. A **sentinel value** is a special kind of bad numerical value: a value used to represent "unknown" or "not applicable" or other special cases in numeric data. One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values. We can now take a look at the summary of the `Item_Outlet_Sales` variable to confirm this.

```r
summary(Data$Item_Outlet_Sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   33.29  834.90 1798.00 2218.00 3104.00 33330.00
```

Here we can see that the maximum value is 33333 (rounded to 33330 in the summary output), which is nowhere close to the other values.

We can now examine these values to check whether they follow a pattern. If not, we can drop them.

I am now using the `dplyr` package. If anyone is not familiar with it, please go through the package help; the documentation is comprehensive.

The `filter()` function in dplyr helps us to subset the data based on column values.

```r
library(dplyr)

# Creating a data frame with only the outliers
outlier <- Data %>% filter(Item_Outlet_Sales == 33333)
outlier
```

```
##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            DRI11          NA          Low Fat      0.03423768
## 2            FDP25      15.200          Low Fat      0.02132748
## 3            FDU55      16.200          Low Fat      0.03598410
## 4            FDF44       7.170          Regular      0.05997133
## 5            FDE41          NA              reg      0.00000000
## 6            FDK04       7.360          Low Fat      0.05260793
## 7            FDJ53      10.500          Low Fat      0.07125791
## 8            FDX08      12.850          Low Fat      0.02264989
## 9            FDE53      10.895          Low Fat      0.02703220
## 10           FDA31       7.100          Low Fat      0.11023479
##                Item_Type Item_MRP Outlet_Identifier
## 1            Hard Drinks 113.2834            OUT027
## 2                 Canned 216.8824            OUT017
## 3  Fruits and Vegetables 260.6278            OUT045
## 4  Fruits and Vegetables 132.1968            OUT018
## 5           Frozen Foods  83.7566            OUT019
## 6           Frozen Foods  56.3588            OUT017
## 7           Frozen Foods 121.3098            OUT046
## 8  Fruits and Vegetables 179.3318            OUT045
## 9           Frozen Foods 106.3280            OUT017
## 10 Fruits and Vegetables 171.7080            OUT045
##    Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                       1985      Medium               Tier 3
## 2                       2007                           Tier 2
## 3                       2002                           Tier 2
## 4                       2009      Medium               Tier 3
## 5                       1985       Small               Tier 1
## 6                       2007                           Tier 2
## 7                       1997       Small               Tier 1
## 8                       2002                           Tier 2
## 9                       2007                           Tier 2
## 10                      2002                           Tier 2
##          Outlet_Type Item_Outlet_Sales
## 1  Supermarket Type3             33333
## 2  Supermarket Type1             33333
## 3  Supermarket Type1             33333
## 4  Supermarket Type2             33333
## 5      Grocery Store             33333
## 6  Supermarket Type1             33333
## 7  Supermarket Type1             33333
## 8  Supermarket Type1             33333
## 9  Supermarket Type1             33333
## 10 Supermarket Type1             33333
```

There are only 10 observations which have these sentinel values. We can see that `Item_Type`, `Item_MRP`, `Outlet_Location_Type`, `Item_Weight` and `Outlet_Type` all differ among these outliers, so it does not look like wrong data from a particular store or location. Let's drop them. Dealing with these types of values normally requires domain knowledge.

```r
# Removing the observations with the sentinel value
Data <- Data %>% filter(Item_Outlet_Sales != 33333)
pl2 <- ggplot(Data, aes(Item_Outlet_Sales))
pl2 + geom_density(fill = "blue", alpha = 0.5)
```

```r
summary(Data$Item_Outlet_Sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   33.29  833.60 1794.00 2182.00 3101.00 13090.00
```

Now we can explore the remaining variables.

```r
summary(Data)

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   LF     : 316     Min.   :0.00000
##  FDW13  :  10    1st Qu.: 8.775   low fat: 112     1st Qu.:0.02699
##  DRE49  :   9    Median :12.600   Low Fat:5081     Median :0.05395
##  DRN47  :   9    Mean   :12.860   reg    : 116     Mean   :0.06616
##  FDD38  :   9    3rd Qu.:16.850   Regular:2888     3rd Qu.:0.09466
##  FDF52  :   9    Max.   :21.350                    Max.   :0.32839
##  (Other):8457    NA's   :1461
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size  Outlet_Location_Type
##  Min.   :1985                    :2404  Tier 1:2386
##  1st Qu.:1987              High  : 932  Tier 2:2779
##  Median :1999              Medium:2791  Tier 3:3348
##  Mean   :1998              Small :2386
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##
```

Looking at the `Item_Weight` variable, we can see that there are 1461 missing values. We can also see that the `Item_Fat_Content` variable is coded incorrectly: the same factor levels are coded in different ways. There are 2404 missing values in the `Outlet_Size` variable. Another interesting thing is about the `Item_Visibility` variable: in my opinion, there can't be any item with 0 visibility, as no item in a supermarket or grocery store is supposed to be completely invisible to customers.
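Before treating them, it can help to count how many of these suspicious zeros there are. A minimal sketch on a toy data frame standing in for the real `Data` (values are invented):

```r
# A strictly positive variable should contain no zeros, so every zero is suspect
Data_toy <- data.frame(Item_Visibility = c(0, 0.034, 0, 0.021, 0.095))

sum(Data_toy$Item_Visibility == 0)         # how many zeros
# [1] 2
mean(Data_toy$Item_Visibility == 0) * 100  # share of affected rows, in percent
# [1] 40
```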

Let’s treat them one by one:

First let's recode the `Item_Fat_Content` variable. There are 2 levels, `Regular` and `low fat`, which are coded into 5 different levels named `LF`, `low fat`, `Low Fat`, `reg` and `Regular`. We can **recode** them into `lowfat` and `Regular`.

```r
# With gsub, replacing the levels with Regular or lowfat as required
Data$Item_Fat_Content <- gsub("LF", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("low fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("Low Fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("reg", "Regular", Data$Item_Fat_Content)
Data$Item_Fat_Content <- as.factor(Data$Item_Fat_Content)
summary(Data$Item_Fat_Content)

##  lowfat Regular
##    5509    3004
```

Now let us **replace the missing values** in the `Item_Weight` variable. There are many ways to deal with missing values in a continuous variable, including mean replacement, median replacement, replacing with an arbitrary constant, regression methods, etc. I will be using mean replacement and regression in this example: mean replacement for `Item_Weight` and regression for `Item_Visibility`. In real projects, these methods are chosen based on requirements. Normally we use mean replacement for variables which have lower predictive power for the final response variable.
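For skewed variables, median replacement is a common alternative to the mean used below, since the median is less sensitive to outliers; a sketch on a toy vector (values are invented):

```r
# Median replacement: fill NAs with the median of the observed values
w <- c(4.5, 8.7, NA, 12.6, NA, 21.3)
w[is.na(w)] <- median(w, na.rm = TRUE)
w
# [1]  4.50  8.70 10.65 12.60 10.65 21.30
```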

```r
# Using the mean to replace the missing values in the Item_Weight variable
MeanItem_Weight <- mean(Data$Item_Weight[!is.na(Data$Item_Weight)])
Data$Item_Weight[is.na(Data$Item_Weight)] <- MeanItem_Weight

# Using regression to replace the zeros in the Item_Visibility variable
Data_1 <- Data %>% filter(Item_Visibility != 0)
visibility_model <- lm(Item_Visibility ~ Item_Weight + Item_Fat_Content +
                         Item_Type + Item_MRP + Outlet_Establishment_Year +
                         Outlet_Size + Outlet_Location_Type + Item_Outlet_Sales,
                       data = Data_1)
Data$Item_Visibility[Data$Item_Visibility == 0] <-
  predict(visibility_model, newdata = Data[Data$Item_Visibility == 0, ])
```

Finally we have to **classify the missing values** in the `Outlet_Size` variable. I am using the random forest algorithm for classification. In my experience, the **random forest algorithm** has worked well for classification models, as it has the advantage of being an ensemble model. I am using the `randomForest` package, which has a very good implementation of the random forest algorithm. The dataset is split into train and test sets using the `caTools` package, a very good tool for splitting datasets for machine learning algorithms.

The function `sample.split()` is used for splitting. Two subsets are made, classified as `TRUE` and `FALSE`. Normally we use the `TRUE` subset for training and the `FALSE` subset for testing.

```r
library(caTools)
set.seed(100)
Data$Outlet_Size <- as.character(Data$Outlet_Size)
Storetypes <- subset(Data, Outlet_Size != "")
spl <- sample.split(Storetypes$Outlet_Size, SplitRatio = 0.8)
Train <- subset(Storetypes, spl == TRUE)
Test <- subset(Storetypes, spl == FALSE)

# Using Random Forest for classification
library(randomForest)
Train$Outlet_Size <- as.factor(Train$Outlet_Size)
Test$Outlet_Size <- as.factor(Test$Outlet_Size)

# Creating the model
SizeForest <- randomForest(Outlet_Size ~ . - Item_Outlet_Sales - Item_Identifier,
                           data = Train, nodesize = 25, ntree = 100)

# Predicting on the test set
PredictForest <- predict(SizeForest, newdata = Test)

# Confusion matrix for accuracy
table(Test$Outlet_Size, PredictForest)

##          PredictForest
##           High Medium Small
##   High     186      0     0
##   Medium     0    558     0
##   Small      0      0   477
```

```r
# Classifying the missing values in the dataset
Data$Outlet_Size <- predict(SizeForest, newdata = Data)
```

Now we can check the complete dataset once more. We can see that the problem of missing values is resolved and the factors are well coded.

```r
summary(Data)

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   lowfat :5509     Min.   :0.003575
##  FDW13  :  10    1st Qu.: 9.310   Regular:3004     1st Qu.:0.033088
##  DRE49  :   9    Median :12.860                    Median :0.060615
##  DRN47  :   9    Mean   :12.860                    Mean   :0.070478
##  FDD38  :   9    3rd Qu.:16.000                    3rd Qu.:0.096000
##  FDF52  :   9    Max.   :21.350                    Max.   :0.328391
##  (Other):8457
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size  Outlet_Location_Type
##  Min.   :1985              High  : 932  Tier 1:2386
##  1st Qu.:1987              Medium:5195  Tier 2:2779
##  Median :1999              Small :2386  Tier 3:3348
##  Mean   :1998
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##
```

The dataset is now ready for modelling…


The post "Data Mining with R" Course | May 17-18 appeared first on MilanoR.

The **two-day course** **Data Mining with R** is organized by the R training and consulting company **Quantide**. **Next live class** is on **May 17-18** in **Legnano (Milan)**.

If you want to know more about Quantide, check out Quantide's website.

If you wish to attend the class, reserve a seat on the course ticket page.

This course introduces some of most important and popular techniques in data-mining applications with R.

Data mining is the computational process of discovering patterns in large data sets.

During the two-day course we will review a wide variety of techniques to extract information from large amounts of data: dimensionality reduction, clustering, classification and prediction examples will be presented and explored in depth.

The course will start with an introduction to basic methods for data description. After that, we will review the most popular techniques for data/dimensionality reduction, such as Multidimensional Scaling, Principal Components Analysis and Correspondence Analysis. Next, we will focus on methods for searching for "natural subgroups" within data, such as hierarchical/non-hierarchical Cluster Analysis and Gaussian Mixture Models.

The end of the first day and the beginning of the second day will present techniques for classification analysis (Linear/Quadratic Discriminant Analysis, Logistic Regression, K-Nearest-Neighbours, …).

Finally, in the remaining part of the second day, we will review some techniques for variable selection, collinearity reduction, and best prediction for regression models (PCA Regression, Ridge Regression, Lasso Regression, Elastic-Net Regression, …).

Euro 800 + VAT

- Univariate Descriptive Statistics
- Reduction of Data Dimensions (MDS, PCA and EFA, CA)
- Clustering (HC, NHC, GMM)
- Classification (LDA, CLASS, KNN)
- Prediction (Several techniques to model data)

This class will be a good fit for you if you are already using R and want to get an overview of data-mining techniques with R. Some background in theoretical statistics, probability, and linear and logistic regression is required.

The cost includes lunch, comprehensive course materials + 1 hour of individual online post course support for each student within 30 days from course date.

We offer an academic discount for those engaged in full time studies or research. Please contact us for further information at training[at]quantide[dot]com

A laptop with the latest versions of R and RStudio.

Enrico Pegoraro works in R training and consulting activities, with a special focus on Six Sigma, industrial statistical analysis and corporate training courses. Enrico graduated in Statistics from the University of Padua.

He has taught statistical models and R for hundreds of hours during specialized and applied courses, in universities, masters and companies.

This course is taught in Italian. Course material is in English.

Legnano is about 30 min by train from Milano. Trains from Milano to Legnano are scheduled every 30 minutes, and Quantide premises are 3 walking minutes from Legnano train station.

You can contact us at training[at]quantide[dot]com

