One of the first things which I came across while studying about data science was that three important steps in a data science project is data preparation, creating & testing the model and reporting. It is a widely accepted fact that data preparation takes up most of the time followed by creating the model and then reporting. There were opinions which says we should try to reduce the time taken for data preparation which we can use for creating and testing the model. But a model is only as good as the data on which it is created. A simpler model based on clean data will most likely outperform a complicated model based on dirty or ambiguous data.

### Big Mart Sales dataset

With an example from a regression problem to predict the sales, we can go through some of the common situations we might face while creating a good data set. The dataset which I am using is taken from http://www.analyticsvidhya.com/ Big Mart Sales prediction problem which I have modified a bit to include some outliers in the response variable `Item_Outlet_Sales`

### Data description

Item_Identifier : Unique product ID
Item_Weight : Weight of product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particulat store. This is the outcome variable to be predicted

## Dealing with missing values

First let’s take a look at the missing values. If number of observations with missing values are much lower than the total number of observations, then there’s not much loss of information by dropping them. I am using the function `complete.cases()` to check for rows without missing values. The function returns a logical vector indicating which cases are complete, i.e., have no missing values. Please note that this function looks for NULL/NA value and there might be missing values in other forms like blanks in character factor columns.

Here we can see that by dropping all the rows with missing values, we are losing about 18% of data. So we cannot drop them.

## Data Exploration

Now let’s have a proper look into the data set. We can begin with the response variable. From the information we have, it is a continuos variable. I am using the `ggplot2` package for data visualization which I believe most of you would be familiar with. I will be showing the distribution of the dependent variable `Item_Outlet_Sales`

Here we can see that the distribution looks similar to a half normal distribution. If we take a closer look, we can see that there is a sudden spike towards the right end of the graph. This might possibly be a sentinel value. A sentinel value is a special kind of bad numerical value: a value that used to represent “unknown” or “not applicable”" or other special cases in numeric data. One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values. We can now take a look into the summary of the `Item_Outlet_Sales` variable to confirm this

Here we can see that the maximum value is 33333 which is nowhere closer to the other values.
We can now examine these values to check whether they follow a pattern. If not, we can drop them.
I am now using the `dplyr` package. If anyone is not familiar with it, please go through the package help. The documentation is comprehensive
The `filter()` function in dplyr helps us to subset the data based column values

There are only 10 observations which has these sentinel values. We can see that `Item_Type`, `Item_MRP`, `Outlet_Location_Type`, `Item_Weight`, `Outlet_Type` are all different among these outliers. So, it does not look like wrong data from a particular store or location. Let`s drop them as them. Dealing with these type of values normally require domain knowledge.

Now we can explore the remaining variables.

Looking at the `Item_Weight` variable, we can see that there are 1461 missing values. We can also see that the `Item_Fat_Content` variable is coded incorrectly. The same factor levels are coded in different ways.
There are 2404 missing values in the `Outlet_Size` variable. Another interesting thing is about the `Item_Visibility` variable. In my opinion, there can’t be any item with 0 visibility as no item in a supermarket or grocery store is supposed to be completely invisible to customers.

Let’s treat them one by one:

First let’s recode the `Item_Fat_Content` variable. There are 2 levels- `Regular` and `low fat` which are coded into 5 different levels named as `LF`, `low fat`, `Low Fat`, `reg` and `Regular`. We can recode them into lowfat

Now let us replace the missing values in the `Item_Weight` variable. There are many ways to deal with missing values in a continuous variable which includes mean replacement, median replacement, replacing with an arbitrary constant, regression methods etc. I will be using mean replacement and regression in this example. I am using mean replacement for `Item_Weight` and regression for `Item_Visibility.` In real projects, these methods are chosen based on requirements. Normally we use mean replacement for variable which have lower predictive power for the final response variable.

Finally we have to classify the missing values in the `Outlet_Size variable`.
I am using the random forest algorithm for classification. In my experience, the random forest algorithm has worked well for classification models as it has the advantage of being an ensemble model. I am using the `randomForest` package as it has a very good implementation of the random forest algorithm. The dataset is split to train and test set using the package `caTools`. The caTools package is a very good tool for splitting our dataset for machine learning algorithms.
The function `sample.split()` is used for splitting. Two subsets are made which are classified as `TRUE` and `FALSE`. Normally we use the `TRUE` subset for training and `FALSE` subset for testing

Now we can check the complete dataset once more. We can see that the problem of missing values are resolved and the factors are well coded.

The dataset is now ready for modelling….

Share: