This article gives a brief overview of the data.table package, written by M. Dowle, T. Short and S. Lianoglou.

A data.table is an extension of a data.frame created to reduce the working time of the user in two ways:

- programming time
- compute time

The data.table syntax is inspired by R's matrix indexing syntax `A[B]`, where `A` is a matrix and `B` is a 2-column matrix.
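For readers who have not met this base R idiom, here is a minimal sketch of indexing a matrix with a 2-column matrix. The contents of `A` and `B` are made up for illustration only.

```r
# Base R only: index a matrix with a 2-column matrix of (row, column) pairs.
A <- matrix(1:12, nrow = 3)     # 3 x 4 matrix, filled column by column
B <- cbind(c(1, 3), c(2, 4))    # each row of B is one (row, column) pair

picked <- A[B]                  # the elements A[1, 2] and A[3, 4]
print(picked)
```

Each row of `B` picks exactly one cell of `A`, so the result is a plain vector with one value per row of `B`.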

Because a data.table is a data.frame, it is compatible with the R functions and packages that accept a data.frame as input.
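A quick way to check this compatibility yourself is sketched below. The choice of `lm()` as the base R function is just an example, not something specific to data.table.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

cls <- class(DT)               # a data.table inherits from data.frame
ok  <- is.data.frame(DT)

# A base R modelling function that expects a data.frame accepts DT as is.
fit <- lm(v ~ y, data = DT)

print(cls)
print(ok)
print(coef(fit))
```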

The big advantage of a data.table over a data.frame is that it treats tables like tables in a database, which gives remarkably fast data access.

A data.table is **created** exactly like a data.frame; the syntax is the same.

library(data.table)

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

DF and DT contain identical data, but on DT we can create an **index** by defining a **key**.

setkey(DT,x)

tables()

     NAME NROW MB COLS  KEY
[1,] DT      9  1 x,y,v x
Total: 1MB

DT has been re-ordered according to the values of the x column.

A key consists of one or more columns which may be integer, factor, character or some other class.

A data.table does not have rownames, but it may instead have a key of one or more columns, set with setkey. This key may be used for row indexing instead of rownames.
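To make the re-ordering visible, here is a small sketch with a two-column key. The table is deliberately built in unsorted order; the values are illustrative and not taken from the example above.

```r
library(data.table)

# Rows are deliberately out of order so the effect of setkey() is visible.
DT <- data.table(x = rep(c("c", "a", "b"), each = 3), y = c(6, 1, 3), v = 1:9)

setkey(DT, x, y)   # sorts DT by x, then y, and records (x, y) as the key

k <- key(DT)
print(k)
print(DT)
```

After the call, `DT` is sorted by `x` and then by `y` within each `x`, and `key(DT)` reports both columns.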

Now we can **subset** the data:

DT["b",] # extract the rows where the key column x equals "b"

DT[,v] # extract the v column

Subsetting by key uses binary search, which can be 100+ times faster than a vector scan with == on a large table.
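The two styles of subsetting can be compared side by side. This is only a sketch on a nine-row table, so it shows that the results agree, not the speed difference, which depends on table size.

```r
library(data.table)

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT <- as.data.table(DF)
setkey(DT, x)

by_scan <- DF[DF$x == "b", ]   # vector scan: compares every value of x
by_key  <- DT["b"]             # keyed lookup: binary search on the key

print(by_scan)
print(by_key)
```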

A data.table is like a data.frame, but in DT[i,j] both i and j can be expressions that use column names directly.

Furthermore i may itself be a data.table which invokes a fast table join using binary search in O(log n).
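Both ideas fit in a few lines. In the sketch below, `lookup` is a hypothetical table invented for the example; the rest reuses the table from this article.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
setkey(DT, x)

# i and j are expressions on the columns y and v; no DT$ prefix is needed.
total <- DT[y > 1 & v < 8, sum(v)]

# i is itself a data.table: its first column is joined to the key of DT.
lookup <- data.table(x = c("a", "c"))   # hypothetical lookup table
joined <- DT[lookup]

print(total)
print(joined)
```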

We can easily **add** new data

DT[,w:=rep(1:3,3)] # add a w column (recent data.table versions require rep() to recycle)

Assigning with := modifies the table by reference and can be 500+ times faster than DF[i,j] = value.
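One consequence of updating by reference is worth seeing once. The sketch below uses made-up column names `w` and `z`; `copy()` is the data.table function for taking an independent copy.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

DT[, w := v * 2L]        # add a column in place, without copying DT
DT[x == "a", w := 0L]    # update only the rows selected by i

alias <- DT              # plain assignment does NOT copy a data.table
alias[, z := 1L]         # so this also adds z to DT
has_z <- "z" %in% names(DT)

snapshot <- copy(DT)     # an explicit, independent copy
snapshot[, z := NULL]    # removes z from the copy only

print(has_z)
print(names(DT))
print(names(snapshot))
```

If you want the old data.frame behaviour, where changes do not leak into other variables, take a `copy()` first.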

or **join** data.tables

setkey(DT,x,y) # key on x and y so that both columns of J() are matched

DT[J("a",3:6),nomatch=0] # inner join (J is an alias of data.table)
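The difference between the default join and an inner join is easiest to see on the result. This is a sketch that assumes a two-column key on `x` and `y`; the variable names `all_rows` and `matched` are made up.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
setkey(DT, x, y)

# Default: one result row per row of i, with NA where DT has no match.
all_rows <- DT[J("a", c(3, 4, 5, 6))]

# nomatch = 0 drops the unmatched rows, which gives an inner join.
matched <- DT[J("a", c(3, 4, 5, 6)), nomatch = 0]

print(all_rows)
print(matched)
```

Only `y = 3` and `y = 6` exist for `x = "a"`, so the inner join keeps two of the four requested rows.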

or fast **grouping**

DT[,sum(v),by=x]

DT[,list(vSum=sum(v),
         vMin=min(v),
         vMax=max(v)),
   by=list(x,y)]

Grouping this way can be 10+ times faster than tapply(), with a much simpler syntax than the data.frame equivalent.
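For comparison, here is the same grouped sum computed both ways. It is a sketch that only checks that the answers agree; the timing claim is not measured here.

```r
library(data.table)

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT <- as.data.table(DF)

dt_sums   <- DT[, list(vSum = sum(v)), by = x]   # data.table grouping
base_sums <- tapply(DF$v, DF$x, sum)             # base R equivalent

print(dt_sums)
print(base_sums)
```

The data.table version returns a table with one row per group, while `tapply()` returns a named vector.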

In a data.table each cell can be a different type

- each cell can be a vector
- each cell can itself be a data.table
- list columns can be combined with i and by

data.table(x=letters[1:3],
           y=list(1:10,
                  letters[1:4],
                  data.table(a=1:3,b=4:6)))
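A sketch of how such cells can be read back, and how a list column might be combined with i and by. The names `L`, `kinds` and `sizes` are made up for the example.

```r
library(data.table)

L <- data.table(x = letters[1:3],
                y = list(1:10,
                         letters[1:4],
                         data.table(a = 1:3, b = 4:6)))

# Each cell of the list column y keeps its own type.
kinds <- vapply(L$y, function(cell) class(cell)[1], character(1))

# [[ ]] extracts one cell; the third one is itself a data.table.
inner <- L$y[[3]]

# i and by together: for every group except "b", count the rows of its cell.
sizes <- L[x != "b", list(n = NROW(y[[1]])), by = x]

print(kinds)
print(inner)
print(sizes)
```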

In conclusion, a data.table behaves like a data.frame except that:

- it doesn't have rownames
- selecting a single row will always return a single row data.table not a vector
- the comma is optional inside [], so DT[3] returns the 3rd row as a 1 row data.table
- [] is like a call to subset()
- [,...] is like a call to with()
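A few of these differences can be checked directly. The sketch below reuses the table from this article; the variable names are made up.

```r
library(data.table)

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT <- as.data.table(DF)

third_row <- DT[3]    # no comma needed: the 3rd row, as a 1-row data.table
third_col <- DF[3]    # on a data.frame the same call selects the 3rd COLUMN

# i acts like subset() and j like with(), in a single call.
via_dt   <- DT[v > 5, sum(v)]
via_base <- with(subset(DF, v > 5), sum(v))

print(third_row)
print(names(third_col))
print(c(via_dt, via_base))
```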

This implies:

- up to 10 times less memory
- up to 10 times faster to create, and copy
- simpler R code

