Access data quickly and easily: data.table package

This article gives a brief overview of the data.table package, written by M. Dowle, T. Short and S. Lianoglou.

A data.table is an extension of a data.frame designed to save the user time in two ways:

  1. programming time
  2. compute time

The data.table syntax is inspired by the R matrix syntax A[B], where A is a matrix and B is a 2-column matrix.

Because a data.table is a data.frame, it is compatible with all R functions and packages that accept a data.frame as input.
The big advantage of a data.table over a data.frame is that it treats tables as if they were tables in a database, with remarkably fast data access.

A data.table is created exactly like a data.frame; the syntax is the same.

library(data.table) # load the package first
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

DF and DT contain the same data, but on DT we can create an index by defining a key.

setkey(DT,x)
tables()
     NAME NROW MB COLS  KEY
[1,] DT      9  1 x,y,v x
Total: 1MB

DT has been re-ordered according to the values of the x column.

A key consists of one or more columns which may be integer, factor, character or some other class.
A data.table does not have rownames, but it may instead have a key of one or more columns, set with setkey. This key may be used for row indexing instead of rownames.
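As a minimal sketch of how a key behaves, assuming the data.table package is installed, the following uses a small, hypothetical table called DT2 (not part of the examples above) that starts out unsorted:

```r
library(data.table)

# Hypothetical example table, deliberately unsorted
DT2 <- data.table(id = c("c", "a", "b", "a"), val = 1:4)

haskey(DT2)        # no key has been set yet
setkey(DT2, id)    # sorts DT2 by id and marks id as the key
key(DT2)           # the name of the key column
DT2                # rows now appear in key order
```

Note that setkey() modifies DT2 in place, so its result does not need to be assigned back.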

Now we can subset the data:

DT["b",] # extract the rows whose key column equals "b"
DT[,v] # extract the v column

Key-based subsetting uses binary search and can be 100+ times faster than a vector scan with ==, such as DT[x=="b",].
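To check that subsetting by key and subsetting with == select the same rows, here is a small sketch on a hypothetical table DT3. It only compares results, not timings; any speed difference would show up on much larger tables, and recent data.table versions may optimise the == form internally.

```r
library(data.table)

# Hypothetical example table with a key on x
DT3 <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)
setkey(DT3, x)

by_key  <- DT3["b"]        # lookup through the key
by_scan <- DT3[x == "b"]   # the same selection written as a comparison

identical(by_key$v, by_scan$v)   # both forms return the same rows
```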

A data.table is like a data.frame, but in DT[i,j] both i and j can be expressions that use column names directly.
Furthermore i may itself be a data.table which invokes a fast table join using binary search in O(log n).

We can easily add new data:

DT[,w:=rep(1:3,3)] # add a w column (recent versions require explicit recycling with rep())

Assignment with := updates the table by reference and can be 500+ times faster than DF[i,j] = value.
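A short sketch of assignment by reference, using a hypothetical table DT4: the := operator adds or updates columns in place, so the result is never assigned back to DT4.

```r
library(data.table)

# Hypothetical example table
DT4 <- data.table(x = c("a", "b", "c"), v = 1:3)

DT4[, w := v * 2L]        # add a new column w, computed from v
DT4[x == "b", w := 0L]    # update w only in the rows where x is "b"

DT4                       # both changes are visible without any DT4 <- ...
```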

or join data.tables:

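setkey(DT,x,y) # the join below needs a key on both x and y
DT[J("a",3:6),nomatch=0] # inner join: nomatch=0 drops rows without a match (J is an alias of data.table)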
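The difference between the default join and an inner join can be sketched with two hypothetical tables, X and lookup. By default, rows of lookup that have no match in X are kept and filled with NA; nomatch = 0 drops them.

```r
library(data.table)

# Hypothetical keyed table
X <- data.table(x = c("a", "a", "b"), y = c(1L, 3L, 6L), v = 1:3)
setkey(X, x, y)

# Hypothetical lookup table: ("a",3), ("a",4), ("a",5), ("a",6)
lookup <- data.table(x = "a", y = 3:6)

outer_res <- X[lookup]                # one row per lookup row, NA where unmatched
inner_res <- X[lookup, nomatch = 0]   # only the rows that match

nrow(outer_res)
nrow(inner_res)
```

Here only the pair ("a", 3) exists in X, so the inner join keeps a single row.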

or do fast grouping:

DT[,sum(v),by=x]
DT[,list(vSum=sum(v),
         vMin=min(v),
         vMax=max(v)),
   by=list(x,y)]

Grouping this way can be 10+ times faster than tapply(), with a much simpler syntax than the data.frame equivalent.
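For comparison, here is a sketch of the same grouped sum written with data.table and with base R's tapply(), on a hypothetical table DT5. It illustrates the syntax only and makes no timing claim.

```r
library(data.table)

# Hypothetical example table
DT5 <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)

grouped <- DT5[, list(vSum = sum(v)), by = x]   # data.table: returns a data.table
base_r  <- tapply(DT5$v, DT5$x, sum)            # base R: returns a named array

grouped
base_r
```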

In a data.table each cell can be a different type:

  • each cell can be a vector
  • each cell can itself be a data.table
  • list columns can be combined with i and by


data.table(x=letters[1:3],
           y=list(1:10,
                  letters[1:4],
                  data.table(a=1:3,b=4:6)))
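As a sketch of how the cells of such a list column can be read back, the same table is stored here under the hypothetical name LC and its elements are extracted with [[ ]]:

```r
library(data.table)

# Same structure as above, stored under a hypothetical name
LC <- data.table(x = letters[1:3],
                 y = list(1:10,
                          letters[1:4],
                          data.table(a = 1:3, b = 4:6)))

class(LC$y)     # y is stored as a list column
LC$y[[1]]       # first cell: an integer vector
LC$y[[2]]       # second cell: a character vector
LC$y[[3]]$b     # third cell: a nested data.table, here its column b
```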

In conclusion, a data.table is identical to a data.frame except that:

  • it doesn't have rownames
  • selecting a single row always returns a single-row data.table, not a vector
  • the comma is optional inside [], so DT[3] returns the 3rd row as a 1-row data.table
  • [] is like a call to subset()
  • [,...] is like a call to with()

This implies:

  • up to 10 times less memory
  • up to 10 times faster to create and copy
  • simpler R code
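The indexing differences listed above can be sketched side by side, using a hypothetical data.frame DF6 and its data.table counterpart DT6:

```r
library(data.table)

# Hypothetical example data in both forms
DF6 <- data.frame(x = c("a", "b", "c"), v = 1:3)
DT6 <- as.data.table(DF6)

DF6[2]           # data.frame: a single index selects the 2nd column
DT6[2]           # data.table: a single index selects the 2nd row
DT6[v > 1]       # i uses column names directly, like subset(DF6, v > 1)
DT6[, sum(v)]    # j uses column names directly, like with(DF6, sum(v))
```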


About Anna Longari

Graduated in Mathematics from the University of Milan, worked in the field of Statistical and Mathematical Modelling and Data Mining. In particular data analysis at various levels, forecasting on demand planning, clustering, customer satisfaction, fraud detection and design and implementation of custom products and core prediction engine.

2 Responses to Access data quickly and easily: data.table package

  1. Stefan says:

    Great (p)review of the data.table package!
    It seems to me that we will all be moving to it very soon!
