Blog

Building Interactive Graphs with ggplot2 and Shiny

Some time ago, I was contacted by the team at Packt Publishing. They had just published the Building Interactive Graphs with ggplot2 and Shiny online course, and they asked for my (humble) opinion.

I am honored by their request, and I will briefly review the Building Interactive Graphs with ggplot2 and Shiny online course here. I'll publish a more in-depth review at the beginning of September, when Italian R users come back from vacation. In this post, I will describe the course. In the follow-up post, I will highlight what was new for me and share an example of what I learned from this useful course.

I discovered the online course a few days before I was contacted by Packt's team. The course sounded interesting to me because I was working on a project involving ggplot2 and Shiny. Moreover, I find an online course more effective and useful than a printed book. This is hardly surprising, since I work for a company that provides online (and on-site, too) R courses.

As the author says on his website:

The course consists of short videos (around 2 or 3 minutes) that explain one concept at the time. Each video comes with the relevant code, and pointers to go further in your own time.

As for the target audience of this course, I agree with Arthur's review:

I highly recommend it to a very wide audience, from students beginning data science or statistics to mature data analysts or even seasoned enterprise business intelligence professionals.

The course is about 90 minutes long, so you can watch it while your wife watches her favorite TV series (Italian TV networks usually broadcast two episodes at a time) or while your husband watches his unmissable soccer match.

The course consists of 40 short videos, grouped into eight sections. You can find the course outline on the official website of the video course. If you have never bought a course from Packt Publishing, note that you can download the whole course as a single zip file. Once you have downloaded and unzipped it, open the index.html page in your favorite browser. A pleasant offline website will direct you to the videos you are interested in. You can watch the course from start to finish, but its structure lets you jump straight to the topics you need now and postpone the others. Alternatively, you can watch each video online, even on your internet-connected TV. The third link lets you download the code. You will get the presentations too, but I did not find them very useful.

As you can see from my posts, my English is far from perfect. :-)
Nevertheless, I found the author's British English easy to understand, even for a non-native speaker.

The first five sections focus on ggplot2, from installation through several advanced topics such as faceting, big data and plot customization. All of that takes up the first hour.

Sections 6 and 7 show Shiny's capabilities. Unlike the first sections, each of which covers a well-defined subject, you can think of these two as a single section about Shiny, made up of ten short videos.

Finally, the last section shows how to put everything together.

If you already know both ggplot2 and Shiny, this course will not significantly improve your skills, although you may find something new, especially in the ggplot2 part. Even so, it makes a valuable review, and its structure lets you jump to the videos that interest you. If you are new to R, or new to ggplot2 and/or Shiny, you should buy this online course now. You will be productive in a short while.

Posted in R | Tagged , , , , | Leave a comment

Sales Dashboard in R with qplot and ggplot2 - Part 2

In Part 1 of this series we took the first steps toward building our Sales Dashboard in R. In this Part 2 we explore additional ways to display sales-related data.

If you haven't read Part 1, it is highly recommended that you do so first because we will build on what was covered there.

1. Bar charts

A useful way to visualize the total order intake by sales person is to produce a bar chart with the total order amount for each sales person. While this is also an easy task for qplot, we have to be careful about which additional parameters are needed to obtain exactly what we want. Here is the right syntax.
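
A minimal sketch of such a bar chart follows. It makes two assumptions: the orders data frame is a small made-up stand-in for the data set of this series, and the plot is written with ggplot() and geom_col(), which is the ggplot2 equivalent of geom="bar" with stat="identity" (recent ggplot2 releases deprecate the stat argument of qplot).

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Country      = c("UK", "UK", "USA", "USA", "USA"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# One bar segment per order, stacked on top of each other per sales person
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount)) +
  geom_col()
p
```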

Rplot04

Note how qplot has automatically calculated the total order amount per sales person. This is the default behavior for geom="bar" and stat="identity". What actually happens is that qplot generates one bar per order, with a height proportional to the order amount, and then groups and stacks all bars belonging to the same sales person. You can "see" the stacking by drawing the bar outlines in a different color.

Rplot05

Note that to specify a fixed color for the outline we used the color parameter with an argument of I("blue"). The function I() tells qplot to use the value as is, without any conversion or attempt to interpret it as a variable name.
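
As a rough illustration, the same outline effect can be sketched with ggplot() and geom_col(). The orders data frame is made up, and in this layer-based form a constant placed outside aes() plays the role that I("blue") plays in a qplot call.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# colour is set outside aes(), so it is a fixed value and not a variable
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount)) +
  geom_col(colour = "blue")
p
```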

Now that we better understand how qplot thinks, we can improve the way our data are visualized. For example, we could color code each bar according to the country where the order was taken. Since each sales person belongs to exactly one country, the effect is to color each bar uniformly, either for USA or for UK.

Rplot06

The parameter that colors the interior of each bar according to Country is fill.
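
A small sketch of the fill mapping, again on a made-up orders data frame and written with ggplot() and geom_col() instead of the original qplot call:

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Country      = c("UK", "UK", "USA", "USA", "USA"),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# fill is mapped inside aes(), so each Country gets its own interior color
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount, fill = Country)) +
  geom_col()
p
```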

Note that color coding by country has added a legend on the right, which in turn has reduced the area available for the chart, causing the names of the sales persons to overlap annoyingly. To fix this, we need a more advanced feature that goes beyond qplot, but bear with me because it is not that complex. We are going to rotate the labels for the x axis by 90 degrees and align them properly under the tick marks.

Rplot07

The call to the theme function overrides the default appearance of the specified element, in this case axis.text.x. Setting angle=90 rotates the text by 90 degrees (counterclockwise, so the labels read from bottom to top), while hjust=1 and vjust=0 align it properly under the tick marks. You can experiment with different values to see the effect. For example, vjust=0.5 centers the text under the axis tick mark.
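
A possible version of the rotated labels, sketched on made-up data with ggplot() and geom_col(). The theme() call uses the values discussed above.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Country      = c("UK", "UK", "USA", "USA", "USA"),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# Rotate the x axis labels and align them under the tick marks
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount, fill = Country)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
p
```

Changing vjust to 0.5 in the element_text() call is an easy way to compare the two alignments.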

In the case of this data set, color coding by Country doesn't add a lot of meaning to the visualization. It would be much more useful for example to color code each portion of the bar according to the calendar year in which the order was taken.

2. Stacked bars

Earlier we found out that each bar in our bar chart is actually made of a series of stacked bars, each with a height proportional to the order amount. Let's try to color code them by year instead of by country and see what it looks like. Experimenting (and, yes!, making mistakes) is often the best way to learn how qplot works!

Rplot08

 

This is indeed a fancy looking chart! I am sure any Sales Director would be absolutely pleased with it (ok, just kidding!)

What's the problem here? Well, we have told qplot to color code each stacked bar with a fill color corresponding to the Order.Date. Since almost all order dates differ from one another, qplot has used a large range of colors to try to code them all. The result is the Arlecchino chart above.

What we actually need to do is to have one different color for each year, which means we need to extract the year from each order date and pass it to fill in qplot. Having earlier converted Order.Date to a Date class allows us to use as.character to extract the year and convert it to a character format.
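
One way to sketch the year extraction is shown below. It assumes a made-up orders data frame and uses format() with "%Y" to turn each Date into a year string, which may differ from the exact call used for the chart in this post.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# format(..., "%Y") yields one character value per order, e.g. "2011"
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount,
                        fill = format(Order.Date, "%Y"))) +
  geom_col()
p
```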

RPlot09

 

While the result is graphically as expected, there are a couple of annoyances in this chart. First, the title of the legend includes the function used to extract the year and convert it to a character sequence. Second, the sequence of the colors in the legend is exactly the opposite of the sequence of colors in the bars.

To fix the first problem we have different possibilities. One would be to add a Year variable to the data set, containing the order year already in the needed format. However this would represent an unnecessary duplication of information. The second way is to assign a different title to the legend. This is straightforward to do through the labs function.

The call to labs is telling qplot that whatever variable is used to encode the fill attribute (or "aesthetic" in ggplot2 jargon) should be labeled as specified by the fill argument.
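
A hedged sketch of the legend title fix, on made-up data and with ggplot() and geom_col() standing in for the original qplot call:

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# labs(fill = ...) replaces the default legend title for the fill aesthetic
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount,
                        fill = format(Order.Date, "%Y"))) +
  geom_col() +
  labs(fill = "Year")
p
```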

Rplot10

To fix the order of colors in the legend, so that their sequence corresponds to the one in the bars, we can use the guides function.

Rplot11

 

guides has a similar logic to labs. It takes the name of the attribute (aesthetic) that we want to modify in the legend and uses guide_legend to set an attribute for it. A way to read it is: in the guides (aka the legend), reverse the sequence of colors for the fill attribute.
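
The combination of labs and guides described above might look roughly like this, assuming a made-up orders data frame and the ggplot() form of the plot:

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# reverse = TRUE flips the order of the legend keys for the fill aesthetic
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount,
                        fill = format(Order.Date, "%Y"))) +
  geom_col() +
  labs(fill = "Year") +
  guides(fill = guide_legend(reverse = TRUE))
p
```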

It works exactly as expected, but our plotting command is getting long! Is there any chance to simplify it? As it turns out, guide_legend supports, among many other attributes, one called title, which sets the title of the legend (or guide). In this case it is equivalent to what we achieved with labs, so we can omit the latter and move the definition of the legend title into guide_legend.

This produces the exact same chart as above.
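
A sketch of the shortened version, where the legend title moves into guide_legend and labs is dropped. As before, the data frame is made up and the plot is written with ggplot() and geom_col().

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# Title and key order are both set in a single guide_legend() call
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount,
                        fill = format(Order.Date, "%Y"))) +
  geom_col() +
  guides(fill = guide_legend(title = "Year", reverse = TRUE))
p
```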

3. A final touch

With little code we have generated a professional-looking sales chart. I guess your Sales Director will be very pleased with it. For the perfectionists out there, though, we could add a final touch.

First, the axis labels are still the names of the variables in the data set. We could surely do better. Second, we are missing a title for the chart. qplot can accommodate our needs through three additional parameters that can be specified directly in its call.

  • xlab sets the label for the x axis
  • ylab (you guessed it!) sets the label for the y axis
  • main sets the label for the chart title
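
In the layer-based ggplot() form, the counterparts of these three parameters are the xlab(), ylab() and ggtitle() functions. The sketch below uses them on a made-up orders data frame; the label texts are only examples.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Salesperson  = c("Anna", "Anna", "Bob", "Bob", "Carla"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# Axis labels and chart title (example texts, not taken from the post)
p <- ggplot(orders, aes(x = Salesperson, y = Order.Amount,
                        fill = format(Order.Date, "%Y"))) +
  geom_col() +
  guides(fill = guide_legend(title = "Year", reverse = TRUE)) +
  xlab("Sales person") +
  ylab("Order amount") +
  ggtitle("Order intake by sales person")
p
```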

Rplot12

That's it for Part 2. In Part 3 we will cover some more variations on bar charts and other types of data visualization. Till next time!

* This article originally appeared in Sales Dashboard in R with qplot and ggplot2 - Part 2


Sales Dashboard in R with qplot and ggplot2 - Part 1

In a previous post on my personal blog about creating Pivot Tables in R with melt and cast we covered a simple way to generate sales reports and summary tables from a data set consisting of orders. It is often said that a picture is worth 1000 words, so in this series of posts we will focus on how to create visual representations and summaries of the same data.

Our graphical library of choice for the job will be ggplot2 (what else?), even though we are mostly going to use it in its simplest format, which is through qplot. I have written other posts on ggplot2 which you may want to also read.

1. Getting started

If you haven't done it yet, please complete steps 1, 2 and 3 in my previous post Pivot Tables in R with melt and cast. The file with the data can be obtained from the link at the bottom of that post. Once completed, you should have your data set loaded in R and ready for the next steps.

2. Checking the data

Before starting to plot any data frame with ggplot2, it is a good idea to check the data structure and make sure all variables have the correct type. As a matter of fact ggplot2 is a very smart library and will attempt to plot your data even if they are not in the expected format. While this may or may not produce a warning message, the results may end up being far from what we expect. Better to check in advance and save us the pain of a long troubleshooting afterwards.

It has been pointed out that str is one of the most useful functions in R and this is surely true! Let's take a look at the structure of our data set.
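
As an illustration of what str reveals, here is a small made-up data frame in which the date column was read as a factor. The column names follow this series, while the values and the day/month/year layout are assumptions.

```r
# Hypothetical orders table; stringsAsFactors = TRUE mimics an import
# in which text columns, including the date, become factors
raw <- data.frame(
  Salesperson  = c("Anna", "Bob"),
  Order.Date   = c("01/03/2011", "15/07/2012"),
  Order.Amount = c(100, 250),
  stringsAsFactors = TRUE
)

# Prints one line per column with its type and first values
str(raw)
```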

The use of str highlights indeed a problem with our data set. Order.Date is currently regarded by R as a factor instead of a Date. If we are thinking of grouping our sales data by quarter for example, it would be useful to convert it to a Date class so we can use data manipulation functions such as quarter() to extract the quarter of the year. This is an easy fix.

Note that the format string used in as.Date has to match the format of the date in Order.Date. In this case %d represents the day of the month in digits (01-31), %m the month in digits (01-12) and %Y (capital Y) the year in four-digit format (for example, 2012).
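
A minimal sketch of the conversion, assuming the dates are stored as day/month/year separated by slashes (the separator is a guess, so adapt the format string to your file):

```r
# Hypothetical orders table with the date stored as a factor
raw <- data.frame(
  Order.Date   = c("01/03/2011", "15/07/2012"),
  Order.Amount = c(100, 250),
  stringsAsFactors = TRUE
)

# Convert factor -> character -> Date; the format must match the text
raw$Order.Date <- as.Date(as.character(raw$Order.Date), format = "%d/%m/%Y")

str(raw)

# Date columns work with date helpers such as base R's quarters()
quarters(raw$Order.Date)
```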

After the conversion, our data set structure looks like this.

We are now ready to create our sales dashboard.

3. A simple scatter plot of orders

Visualizing data in a simple and immediate format should always be the first step of a good visual data analysis. This allows us to spot anomalies (for example outliers) and to get an overview of the content of the data set before aggregating and manipulating it further.

Let's start with a plot of all Order.Amount in a temporal sequence, which means by Order.Date.

Rplot01

Note a few things here. First, we need to load the ggplot2 library before we can use qplot. This only needs to be done once in the same R session. Second, qplot is invoked with 3 arguments:

  • x is the variable we want to plot on the horizontal axis
  • y is the variable we want to plot on the vertical axis
  • data is the name of the data set the variables belong to, which allows us to specify them just by variable name (such as Order.Date or Order.Amount) instead of in the full format (which would be data$Order.Date or data$Order.Amount)

Third, if we do not specify any further parameter, qplot uses its defaults for all the rest. Which default is used depends also on whether only y is specified or both x and y. When both x and y are specified, the default is to produce a scatter plot of y values versus x values. Another default is to use the variable names as labels for the axis, as well as apply the standard theme. Enough technicalities, let's get back to data visualization.
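
The call described above can be sketched as follows. The orders data frame is a made-up stand-in for the data set of this series, and recent ggplot2 releases mark qplot as deprecated, so expect a notice when running it.

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Country      = c("UK", "UK", "USA", "USA", "USA"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# With both x and y given, qplot defaults to a scatter plot
p <- qplot(x = Order.Date, y = Order.Amount, data = orders)
p
```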

Let's say we are interested in showing which country the orders came from. Let's color code the points in the scatter plot according to the value of the Country variable in the data set, which is either USA or UK. With qplot this is as easy as adding an extra argument to the function call.

Note that the color parameter can also be used with its British spelling of colour. Here is the resulting chart.
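
A sketch of the extra argument, again on a made-up orders data frame:

```r
library(ggplot2)

# Hypothetical stand-in for the orders data set used in this series
orders <- data.frame(
  Country      = c("UK", "UK", "USA", "USA", "USA"),
  Order.Date   = as.Date(c("2011-03-01", "2012-07-15", "2011-05-20",
                           "2012-11-02", "2012-01-09")),
  Order.Amount = c(100, 250, 300, 150, 400)
)

# colour (or color) maps the Country variable to the point color
p <- qplot(x = Order.Date, y = Order.Amount, data = orders, colour = Country)
p
```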

Rplot02

Once more, qplot has applied some defaults. First, a standard high-contrast color scheme to distinguish between the orders coming from the two different countries. Second, a legend on the right of the chart specifying how to read each color. The title of the legend is, by default, the name of the variable used to color code the points. Sweet!

Let's try to color code the points according to the sales person who took the order. Another easy one with qplot. Just change the color parameter to use the Salesperson variable.

Rplot03

qplot has done a nice job of accommodating our request and color coding the points by Salesperson. However, there are too many colors and the chart is not really meaningful. Time to switch to a different view!

In Part 2 we will cover Bar Charts and how to make the best use of them. Till next time!

* This article originally appeared in Sales Dashboard in R with qplot and ggplot2 - Part 1


How to open an SPSS file in R

R is a powerful system for statistical analysis and data visualization. However, it's not exactly user-friendly for data storage, so for some time to come your data will probably still be archived using Excel, SPSS or similar programs.

How do you open in R a file stored in the SPSS (.sav) format? Some packages, such as foreign, allow you to perform this operation. The foreign package is already included in the standard R distribution, so you just need to load it with the library() function.

Once you have loaded the package, you can open your file, provided you know where it is located. The simplest way to locate a file (yes, I know, you can set the working directory, but I have abrupt manners) is to run the instruction file.choose():

R opens a file selection window, where you can browse to the folder in which you saved your file. R then returns the path to the file:

Now you can read the SPSS file using foreign, specifying the path to the file (yes, you guessed it: you need to copy and paste the path):

Want to avoid the copy and paste? You can assign the result of file.choose() to an object named db (short for database):

As before, you obtain the path to the file, but this time R does not show it, because you assigned it to the object db. The object db now contains a character string identifying the path that R will follow to retrieve the file. With this approach, you need to run file.choose() in every session, whereas if you write the path out you can reuse it every time. Ready to go?

The read.spss() function reads the dataset in .sav format. Be careful, however, to set the argument to.data.frame to TRUE, which tells the function to arrange the data in a data frame (i.e. the class of R object used for data tables).
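
The steps above can be sketched as follows. To keep the example reproducible, the path points to electric.sav, an example file shipped with the foreign package, instead of a file picked with file.choose(); the object name dataset is arbitrary.

```r
library(foreign)

# In an interactive session you would pick your own file by hand:
#   db <- file.choose()
# Here we use an example .sav file that ships with the foreign package
db <- system.file("files", "electric.sav", package = "foreign")

# to.data.frame = TRUE returns a data frame instead of a plain list
dataset <- read.spss(db, to.data.frame = TRUE)
```

For the one-line variant, the file.choose() call can be passed directly as the first argument of read.spss().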

Another very simple way to open an SPSS file in R is to save the file in a format that R handles very well: the .dat (tab-delimited) format. So, you save your SPSS file as .dat and proceed as before, locating the file with file.choose() and assigning the resulting string to an object.

The function to read the file is now read.table(). Pay attention to missing data: if there are missing values, you should tell R which code represents them (e.g. 999) by specifying a value for the argument na.strings.

Do you have your file in .dat format?

The argument header = TRUE specifies that the first row of the file contains the variable names, so these values should not be interpreted as data.
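
A self-contained sketch of the tab-delimited route. It writes a tiny made-up .dat file to a temporary location first, so the file contents, the column names and the 999 missing-value code are all assumptions for illustration.

```r
# Create a small tab-delimited file to read back (made-up content)
db <- tempfile(fileext = ".dat")
writeLines(c("id\tage\tscore",
             "1\t34\t12",
             "2\t999\t15",
             "3\t28\t999"), db)

# header = TRUE: first row holds the variable names
# na.strings = "999": treat the code 999 as a missing value
dataset <- read.table(db, header = TRUE, sep = "\t", na.strings = "999")
dataset
```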

In a hurry? Combine all the operations in just one line:

or, with .dat:

Once you import a file, it’s a good idea to verify that the reading was performed with accuracy.

To check the size of your database, use the dim() function. You will obtain two numbers, the first one refers to the cases (rows in your database), while the second one is the number of variables (the columns of your database).

It can also be useful to preview the data. To inspect the first six rows of the dataset, use the head() function:

To inspect the last six rows of the dataset, use the tail() function:

To inspect the structure of the dataset, use the str() function:
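
The checks above can be tried on any data frame. The sketch below uses a small made-up one in place of an imported SPSS file.

```r
# Hypothetical imported data: 10 cases, 2 variables
dataset <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

dim(dataset)   # number of rows (cases), then number of columns (variables)
head(dataset)  # first six rows
tail(dataset)  # last six rows
str(dataset)   # type and first values of each variable
```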

Do you want to view the entire data table? If the data table is large, it is advisable to use the View() function, or fix(), which also lets you manually edit the cell contents.

This post was originally written in Italian by Davide Massidda and Antonello Preti and published in InsulaR blog

How do you open a Microsoft Excel file in R? Please see the post Read Excel files from R.


R AND OOP - defining new classes

My previous article showed an example in which data analysis requires a structured framework built with R and OOP. This article describes in more detail how to build that framework.

Using OOP means creating new data structures and defining their methods, that is, functions that perform a specific task on the object. Defining a new data structure requires creating a new class, and this article shows how to create one through R's S4 classes.
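
As a rough illustration of the idea, and not the framework built in the full article, here is a tiny S4 class with one method. The class name ItemTable, its slots and the totalPrice generic are all hypothetical.

```r
library(methods)

# A new data structure: an item table with names and prices
setClass("ItemTable", slots = c(items = "character", prices = "numeric"))

# A method is a function that performs a specific task on the object
setGeneric("totalPrice", function(object) standardGeneric("totalPrice"))
setMethod("totalPrice", "ItemTable", function(object) sum(object@prices))

# Create an object of the new class and call its method
tab <- new("ItemTable", items = c("bread", "milk"), prices = c(1.5, 0.9))
totalPrice(tab)
```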


R framework with Object-Oriented Programming

Data analysis deals with different kinds of data.
For instance we can have supermarket sales with
- a transactional table, with customer ID, item ID, date of purchase
- an item table, with the item ID and its price
- a customer table, with the customer ID and demographic details (age, gender)
In this example, the data are tables with different structures.


DailyMeteo.org - 2014 Conference

Our friend Stefan has been participating in MilanoR since the beginning, and was one of the people who started using R intensively after the "Introduction to R" Quantide course. Since he is from Belgrade (Serbia), and takes part in the activities of the Belgrade R community, there is an interesting R event/conference which will take place in Belgrade in June, which he would like to share with us.


Merry Christmas

Dear R enthusiasts,
this is the last post of 2013.

I wish you all Merry Christmas.


My first... web application with Shiny

I had been thinking for some time about developing a web application with R and Shiny.

I have recently built my first application with Shiny. You can find it at http://spark.rstudio.com/nsturaro/pyramid0/


My first... plot (.ly): beautiful plots with Plotly

This article is also available in Italian

Dear R enthusiasts,
I discovered Plotly some days ago, and I was fascinated by it.

What is Plotly?
Plotly is a service for creating and sharing data visualizations that also offers statistical analysis tools, a robust API, the ability to graph custom functions and a built-in Python shell. Among its APIs there is one for R: Plotly interactive visualizations can be created directly from R.
