Skip to main content

This page presents a side-by-side comparison of common data manipulation operations in R in three idioms: data.table, base, and dplyr. This allows you to compare syntax and understand how to accomplish tasks across these popular frameworks.

This reference guide covers everything from basic filtering and sorting to advanced operations like joins and reshaping data. Many of these examples were originally crafted by Atrebas. They were then reorganized and augmented with base examples by a team of contributors.

To begin, we create example data. The base R data frame is called DF, the data.table table is called DT, and the dplyr tibble is called TB. Data creation is wrapped in a refresh_data() function, which is called periodically throughout the document to ensure that the data is reset after modifications.

library(data.table)
library(dplyr)

refresh_data = function() {
    DT <<- data.table(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = c(0.5, 1.0, 1.5),
        V4 = rep(LETTERS[1:3], 3)
    )

    DF <<- data.frame(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = c(0.5, 1.0, 1.5),
        V4 = rep(LETTERS[1:3], 3)
    )

    TB <<- tibble(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = rep(c(0.5, 1.0, 1.5), 3),
        V4 = rep(LETTERS[1:3], 3)
    )
}

refresh_data()
πŸ’‘ Tip

When using the let() and set*() functions or := operator modifies a data.table β€œin place,” which means that it does not copy the object at all. This is more efficient than re-assigning the entire data set. However, when modified in place, the data table will not be printed immediately to the console after modification. You must call the object again to see the changes.

Filter

Integer index

Positive indices keep the specified rows.

DT[3:4,]
>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     4   0.5      A

Negative indices exclude the specified rows.

DT[-(3:7),]
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     2     8   1.0      B
> 4:     1     9   1.5      C

Logical index

Filter rows using a logical vector. Keep the rows where the condition is TRUE. %chin% is a fast version of %in%, optimized for strings. %like% is a convenient operator for regular expression matches.

DT[V2 > 5]
>       V1    V2    V3     V4
>       
> 1:     2     6   1.5      C
> 2:     1     7   0.5      A
> 3:     2     8   1.0      B
> 4:     1     9   1.5      C
DT[V4 %chin% c("A", "C")] # faster than %in% for strings
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A
> 4:     2     6   1.5      C
> 5:     1     7   0.5      A
> 6:     1     9   1.5      C
DT[V4 %like% c("A|C")]    # regular expressions
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A
> 4:     2     6   1.5      C
> 5:     1     7   0.5      A
> 6:     1     9   1.5      C

Filter rows based on multiple conditions.

DT[V1 == 1 & V4 == "A"]
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     7   0.5      A

Unique

unique(DT)
> Indices: , 
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C
unique(DT, by = c("V1", "V4"))
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C

Missing values

Drop rows with missing values in specified columns.

na.omit(DT, cols = 1:4)
> Indices: , 
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

Random sample

Draw a random sample of rows.

DT[sample(.N, 3)]
>       V1    V2    V3     V4
>       
> 1:     1     7   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A
DT[sample(.N, .N / 2)]
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     5   1.0      B
> 3:     2     6   1.5      C
> 4:     1     7   0.5      A

Other

DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)]

Sort

Rows

Sort rows in ascending order.

DT[order(V3)]
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     4   0.5      A
> 3:     1     7   0.5      A
> 4:     2     2   1.0      B
> 5:     1     5   1.0      B
> 6:     2     8   1.0      B
> 7:     1     3   1.5      C
> 8:     2     6   1.5      C
> 9:     1     9   1.5      C

Sort rows in decreasing order.

DT[order(-V3)]
>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     6   1.5      C
> 3:     1     9   1.5      C
> 4:     2     2   1.0      B
> 5:     1     5   1.0      B
> 6:     2     8   1.0      B
> 7:     1     1   0.5      A
> 8:     2     4   0.5      A
> 9:     1     7   0.5      A

Sort rows by multiple columns.

DT[order(V1, -V2)]
>       V1    V2    V3     V4
>       
> 1:     1     9   1.5      C
> 2:     1     7   0.5      A
> 3:     1     5   1.0      B
> 4:     1     3   1.5      C
> 5:     1     1   0.5      A
> 6:     2     8   1.0      B
> 7:     2     6   1.5      C
> 8:     2     4   0.5      A
> 9:     2     2   1.0      B

Sort dataset and the column passed in argument becomes key. Output is an object the same type as the object indexed. This code is not executed to avoid sorting the original data set in place.

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))
refresh_data()

Columns

Reorders the columns of a dataset. This code is not executed to avoid reordering columns in the original data set.

setcolorder(DT, c("V4", "V1", "V2"))
DT
>        V4    V1    V2    V3
>       
> 1:      A     1     1   0.5
> 2:      B     2     2   1.0
> 3:      C     1     3   1.5
> 4:      A     2     4   0.5
> 5:      B     1     5   1.0
> 6:      C     2     6   1.5
> 7:      A     1     7   0.5
> 8:      B     2     8   1.0
> 9:      C     1     9   1.5
refresh_data()

Select

Keep

Extract one column as a vector.

# DT[["V3"]]
# DT[, V3]
DT[[3]]
> [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5

Extract one column as a data frame.

# DT[, .SD, .SDcols = "V3"]
DT[, "V3"]
>       V3
>    
> 1:   0.5
> 2:   1.0
> 3:   1.5
> 4:   0.5
> 5:   1.0
> 6:   1.5
> 7:   0.5
> 8:   1.0
> 9:   1.5

Select several columns by column names.

DT[, .(V2, V3, V4)]
DT[, V2:V4]
DT[, .SD, .SDcols = V2:V4]
DT[, .SD, .SDcols = c("V2", "V3", "V4")]
cols = c("V2", "V3")
DT[, ..cols]

Drop

Exclude several columns by column name.

# DT[, .SD, .SDcols = !c("V2", "V3")]
DT[, !c("V2", "V3")]
>       V1     V4
>     
> 1:     1      A
> 2:     2      B
> 3:     1      C
> 4:     2      A
> 5:     1      B
> 6:     2      C
> 7:     1      A
> 8:     2      B
> 9:     1      C

Remove a column from the data set. Using let() is efficient because it modifies the data set in place. This code is not executed because the V5 column is not present in the data set.

DT[, let(V1 = NULL)]
>       V2    V3     V4
>      
> 1:     1   0.5      A
> 2:     2   1.0      B
> 3:     3   1.5      C
> 4:     4   0.5      A
> 5:     5   1.0      B
> 6:     6   1.5      C
> 7:     7   0.5      A
> 8:     8   1.0      B
> 9:     9   1.5      C
DT
>       V2    V3     V4
>      
> 1:     1   0.5      A
> 2:     2   1.0      B
> 3:     3   1.5      C
> 4:     4   0.5      A
> 5:     5   1.0      B
> 6:     6   1.5      C
> 7:     7   0.5      A
> 8:     8   1.0      B
> 9:     9   1.5      C

Remove several columns from the data set. Using := is efficient because it modifies the data set in place. This code is not executed because the V6 and V7 columns are not present in the data set.

cols = c("V2", "V3")
DT[, (cols) := NULL]
>        V4
>    
> 1:      A
> 2:      B
> 3:      C
> 4:      A
> 5:      B
> 6:      C
> 7:      A
> 8:      B
> 9:      C
DT
>        V4
>    
> 1:      A
> 2:      B
> 3:      C
> 4:      A
> 5:      B
> 6:      C
> 7:      A
> 8:      B
> 9:      C
refresh_data()

Rename

Select and rename.

DT[, .(X1 = V1, X2 = V2)]
>       X1    X2
>     
> 1:     1     1
> 2:     2     2
> 3:     1     3
> 4:     2     4
> 5:     1     5
> 6:     2     6
> 7:     1     7
> 8:     2     8
> 9:     1     9

Using the data.table::setnames() to rename columns is efficient because it renames column in place. This code is not executed to avoid renaming columns in the original data set.

setnames(DT, old = c("V1", "V2"), new = c("X1", "X2"))
DT
>       X1    X2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C
refresh_data()

Advanced selections

Complex selections using regular expressions or dedicated functions.

DT[, .SD, .SDcols = c("V1", "V2")]
DT[, .SD, .SDcols = patterns("^V[1-2]$")]
DT[, .SD, .SDcols = patterns("V")]
DT[, .SD, .SDcols = patterns("3$")]
DT[, .SD, .SDcols = patterns(".2")]
DT[, .SD, .SDcols = patterns("^V1$|^X$")]
DT[, .SD, .SDcols = patterns("^(?!V2)", perl = TRUE)]

Summarize

Single columns

Create a new data frame with a single row and a single column, summarizing the information of one column. Named or unnamed results.

DT[, sum(V1)]
> [1] 13
DT[, .(sumV1 = sum(V1))]
>    sumV1
>    
> 1:    13

Create a new data frame with a single row and two columns, summarizing the information of two manually specified columns.

DT[, .(sumV1 = sum(V1), sdV3 = sd(V3))]
>    sumV1      sdV3
>         
> 1:    13 0.4330127

Multiple columns

Apply a function to each column.

DT[, lapply(.SD, head, 1)]
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A

Apply a function to each column that matches a specific type.

DT[, lapply(.SD, mean), .SDcols = is.numeric]
>          V1    V2    V3
>         
> 1: 1.444444     5     1

By group

Count the number of observation by group.

DT[, .N, by = V4]
>        V4     N
>     
> 1:      A     3
> 2:      B     3
> 3:      C     3

Multiple named summaries

DT[, .(nobs = .N, meanV1 = mean(V1)), by = V4]
>        V4  nobs   meanV1
>         
> 1:      A     3 1.333333
> 2:      B     3 1.666667
> 3:      C     3 1.333333

Apply a function to the full data frame in each group. Here, we return the first row in each group using the head() function.

DT[, head(.SD, 1), by = V4]
>        V4    V1    V2    V3
>       
> 1:      A     1     1   0.5
> 2:      B     2     2   1.0
> 3:      C     1     3   1.5

Get the row number of first (and last) observation by group. In data.table, we use the .I operator, which reports the row number.

DT[, .I[1], by = V4]
>        V4    V1
>     
> 1:      A     1
> 2:      B     2
> 3:      C     3
DT[, range(.I), by = V4]
>        V4    V1
>     
> 1:      A     1
> 2:      A     7
> 3:      B     2
> 4:      B     8
> 5:      C     3
> 6:      C     9

List-columns are columns where each element is a vector, data frame, or other object.

DT[, .(.(V1)),  by = V4]
>        V4     V1
>     
> 1:      A  1,2,1
> 2:      B  2,1,2
> 3:      C  1,2,1
DT[, .(.(.SD)), by = V4]
>        V4                V1
>                
> 1:      A 
> 2:      B 
> 3:      C 

Modify

Single columns

Create or modify a column.

DT[, let(V5 = V1^2)]
>       V1    V2    V3     V4    V5
>        
> 1:     1     1   0.5      A     1
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     1
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     1
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     1
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     1
DT
>       V1    V2    V3     V4    V5
>        
> 1:     1     1   0.5      A     1
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     1
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     1
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     1
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     1

Create several new columns, each named explicitly.

DT[, let(
  V5 = sqrt(V1),
  V6 = "X")]
>       V1    V2    V3     V4       V5     V6
>            
> 1:     1     1   0.5      A 1.000000      X
> 2:     2     2   1.0      B 1.414214      X
> 3:     1     3   1.5      C 1.000000      X
> 4:     2     4   0.5      A 1.414214      X
> 5:     1     5   1.0      B 1.000000      X
> 6:     2     6   1.5      C 1.414214      X
> 7:     1     7   0.5      A 1.000000      X
> 8:     2     8   1.0      B 1.414214      X
> 9:     1     9   1.5      C 1.000000      X
DT
>       V1    V2    V3     V4       V5     V6
>            
> 1:     1     1   0.5      A 1.000000      X
> 2:     2     2   1.0      B 1.414214      X
> 3:     1     3   1.5      C 1.000000      X
> 4:     2     4   0.5      A 1.414214      X
> 5:     1     5   1.0      B 1.000000      X
> 6:     2     6   1.5      C 1.414214      X
> 7:     1     7   0.5      A 1.000000      X
> 8:     2     8   1.0      B 1.414214      X
> 9:     1     9   1.5      C 1.000000      X

Multiple columns

Take the square of every numeric column.

DT[, names(.SD) := lapply(.SD, \(x) x^2), .SDcols = is.numeric]
>       V1    V2    V3     V4    V5     V6
>         
> 1:     1     1  0.25      A     1      X
> 2:     4     4  1.00      B     2      X
> 3:     1     9  2.25      C     1      X
> 4:     4    16  0.25      A     2      X
> 5:     1    25  1.00      B     1      X
> 6:     4    36  2.25      C     2      X
> 7:     1    49  0.25      A     1      X
> 8:     4    64  1.00      B     2      X
> 9:     1    81  2.25      C     1      X
DT
>       V1    V2    V3     V4    V5     V6
>         
> 1:     1     1  0.25      A     1      X
> 2:     4     4  1.00      B     2      X
> 3:     1     9  2.25      C     1      X
> 4:     4    16  0.25      A     2      X
> 5:     1    25  1.00      B     1      X
> 6:     4    36  2.25      C     2      X
> 7:     1    49  0.25      A     1      X
> 8:     4    64  1.00      B     2      X
> 9:     1    81  2.25      C     1      X
refresh_data()

Replace

Replace values in rows that match a condition applied to a column.

DT[V2 <= 4, let(V2 = 0)]
>       V1    V2    V3     V4
>       
> 1:     1     0   0.5      A
> 2:     2     0   1.0      B
> 3:     1     0   1.5      C
> 4:     2     0   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C
DT
>       V1    V2    V3     V4
>       
> 1:     1     0   0.5      A
> 2:     2     0   1.0      B
> 3:     1     0   1.5      C
> 4:     2     0   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C
refresh_data()

If, else, case

Fast version of ifelse(). Return a vector the same length as the vector tested, with specific value for in case where the condition is TRUE, where it is FALSE, and where it is NA.

x = c(-3:3, NA)
fifelse(test = x < 0,
  yes  = "neg",
  no   = "pos",
  na   = "NA")
> [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

Recode several cases at once, based on a vector of conditions.

x = 1:10
fcase(
  x %% 6 == 0, "fizz buzz",
  x %% 2 == 0, "fizz",
  x %% 3 == 0, "buzz",
  default = as.character(x)
)
>  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
>  [7] "7"         "fizz"      "buzz"      "fizz"

Modify by group

Group summaries of a column in categories.

DT[, by = "V4", .(sumV2 = sum(V2))]
>        V4 sumV2
>     
> 1:      A    12
> 2:      B    15
> 3:      C    18

Group values of a column in groups while applying a function to each category.

DT[,
  by = tolower(V4),
  .(sumV1 = sum(V1))]
>    tolower sumV1
>      
> 1:       a     4
> 2:       b     5
> 3:       c     4

Group values of a column in two categories, TRUE (for rows matching the condition) and FALSE (For rows not matching the condition).

DT[,
  keyby = V4 == "A",
  sum(V1)]
> Key: 
>        V4    V1
>     
> 1:  FALSE     9
> 2:   TRUE     4

Group values of a column in several categories with some of the rows of the initial dataset removed.

DT[1:5,
  by = V4,
  .(sumV1 = sum(V1))]
>        V4 sumV1
>     
> 1:      A     3
> 2:      B     3
> 3:      C     1

Add a new column with the number of observations per group.

DT[, let(n = .N), by = V1]
>       V1    V2    V3     V4     n
>        
> 1:     1     1   0.5      A     5
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     5
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     5
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     5
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     5
DT
>       V1    V2    V3     V4     n
>        
> 1:     1     1   0.5      A     5
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     5
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     5
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     5
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     5
refresh_data()

Advanced

Summarise all the columns, typically using an aggregation function.

DT[, lapply(.SD, max)]
>       V1    V2    V3     V4
>       
> 1:     2     9   1.5      C

Summarise several columns, typically using an aggregation function.

DT[, lapply(.SD, mean),
  .SDcols = c("V1", "V2")]
>          V1    V2
>        
> 1: 1.444444     5

Summarise several columns by groups, typically using an aggregation function.

DT[, by = V4,
  lapply(.SD, mean),
  .SDcols = c("V1", "V2")]
>        V4       V1    V2
>         
> 1:      A 1.333333     4
> 2:      B 1.666667     5
> 3:      C 1.333333     6
DT[, by = V4,
  lapply(.SD, mean),
  .SDcols = patterns("V1|V2|Z0")]
>        V4       V1    V2
>         
> 1:      A 1.333333     4
> 2:      B 1.666667     5
> 3:      C 1.333333     6

Summarise several columns by group using multiple aggregation functions, grouping by one or more variables.

DT[, by = V4,
  c(lapply(.SD, sum),
  lapply(.SD, mean))]
>        V4    V1    V2    V3       V1    V2    V3
>             
> 1:      A     4    12   1.5 1.333333     4   0.5
> 2:      B     5    15   3.0 1.666667     5   1.0
> 3:      C     4    18   4.5 1.333333     6   1.5

Summarise a subset of columns by column type or condition.

DT[, lapply(.SD, mean), .SDcols = is.numeric]
>          V1    V2    V3
>         
> 1: 1.444444     5     1
foo = function(x) {is.numeric(x) && mean(x) > 3}
DT[, lapply(.SD, mean), .SDcols = foo]
>       V2
>    
> 1:     5

Modify all the columns using the same function.

DT[, lapply(.SD, rev)]
>       V1    V2    V3     V4
>       
> 1:     1     9   1.5      C
> 2:     2     8   1.0      B
> 3:     1     7   0.5      A
> 4:     2     6   1.5      C
> 5:     1     5   1.0      B
> 6:     2     4   0.5      A
> 7:     1     3   1.5      C
> 8:     2     2   1.0      B
> 9:     1     1   0.5      A

Apply a transformation to each element of the variables selected.

DT[, lapply(.SD, sqrt), .SDcols = V1:V2]
>          V1       V2
>           
> 1: 1.000000 1.000000
> 2: 1.414214 1.414214
> 3: 1.000000 1.732051
> 4: 1.414214 2.000000
> 5: 1.000000 2.236068
> 6: 1.414214 2.449490
> 7: 1.000000 2.645751
> 8: 1.414214 2.828427
> 9: 1.000000 3.000000

Apply a transformation for each element of the variables selected.

DT[, names(.SD) := lapply(.SD, sqrt), .SDcols = V1:V3]
>          V1       V2        V3     V4
>                 
> 1: 1.000000 1.000000 0.7071068      A
> 2: 1.414214 1.414214 1.0000000      B
> 3: 1.000000 1.732051 1.2247449      C
> 4: 1.414214 2.000000 0.7071068      A
> 5: 1.000000 2.236068 1.0000000      B
> 6: 1.414214 2.449490 1.2247449      C
> 7: 1.000000 2.645751 0.7071068      A
> 8: 1.414214 2.828427 1.0000000      B
> 9: 1.000000 3.000000 1.2247449      C
DT
>          V1       V2        V3     V4
>                 
> 1: 1.000000 1.000000 0.7071068      A
> 2: 1.414214 1.414214 1.0000000      B
> 3: 1.000000 1.732051 1.2247449      C
> 4: 1.414214 2.000000 0.7071068      A
> 5: 1.000000 2.236068 1.0000000      B
> 6: 1.414214 2.449490 1.2247449      C
> 7: 1.000000 2.645751 0.7071068      A
> 8: 1.414214 2.828427 1.0000000      B
> 9: 1.000000 3.000000 1.2247449      C

Apply a transformation to each element of the variables selected by condition.

DT[, .SD - 1, .SDcols = is.numeric]
>           V1        V2         V3
>                   
> 1: 0.0000000 0.0000000 -0.2928932
> 2: 0.4142136 0.4142136  0.0000000
> 3: 0.0000000 0.7320508  0.2247449
> 4: 0.4142136 1.0000000 -0.2928932
> 5: 0.0000000 1.2360680  0.0000000
> 6: 0.4142136 1.4494897  0.2247449
> 7: 0.0000000 1.6457513 -0.2928932
> 8: 0.4142136 1.8284271  0.0000000
> 9: 0.0000000 2.0000000  0.2247449

Apply a transformation to each element of the variables selected by condition.

DT[, names(.SD) := lapply(.SD, as.integer),
  .SDcols = is.numeric]
>       V1    V2    V3     V4
>       
> 1:     1     1     0      A
> 2:     1     1     1      B
> 3:     1     1     1      C
> 4:     1     2     0      A
> 5:     1     2     1      B
> 6:     1     2     1      C
> 7:     1     2     0      A
> 8:     1     2     1      B
> 9:     1     3     1      C
DT
>       V1    V2    V3     V4
>       
> 1:     1     1     0      A
> 2:     1     1     1      B
> 3:     1     1     1      C
> 4:     1     2     0      A
> 5:     1     2     1      B
> 6:     1     2     1      C
> 7:     1     2     0      A
> 8:     1     2     1      B
> 9:     1     3     1      C

Combine multiple functions in a single statement.

DT[, by = V4,
  .(V1[1:2], "X")]
>        V4    V1     V2
>      
> 1:      A     1      X
> 2:      A     1      X
> 3:      B     1      X
> 4:      B     1      X
> 5:      C     1      X
> 6:      C     1      X
refresh_data()

Chain (pipe)

Expression chaining allows you to perform multiple operations in sequence without creating intermediate objects.

# Chain operations using [][]
DT[, by = V4, .(V1sum = sum(V1))][
  V1sum > 4]
>        V4 V1sum
>     
> 1:      B     5

# Sort results in descending order
DT[, by = V4, .(V1sum = sum(V1))][
  order(-V1sum)]
>        V4 V1sum
>     
> 1:      B     5
> 2:      A     4
> 3:      C     4
refresh_data()

Join

First, let’s create example datasets for demonstrating joins.

x = data.table(
  Id  = c("A", "B", "C", "C"),
  X1  = c(1L, 3L, 5L, 7L),
  XY  = c("x2", "x4", "x6", "x8")
)
y = data.table(
  Id  = c("A", "B", "B", "D"),
  Y1  = c(1L, 3L, 5L, 7L),
  XY  = c("y1", "y3", "y5", "y7")
)

Basic Joins

There are two basic ways to perform joins in data.table. First, we can use the standard merge() function that should be familiar to most R users.

Second, we can use the square bracket notation with the on argument. Here, the logic is that we select every row from the table outside that match rows from the table inside the bracket. The table inside the brackets is used as an β€œindex”, so x[y] essentially means that we are merging x into y. The benefit of this syntax is that we can efficiently combine it with other operations like summaries, sorts, etc.

Note that, in data.table, the on argument specifies the joining columns, and the i. prefix refers to columns from the right table, while the x. prefix refers to columns from the left table.

# Left join (keep all rows from x)
# merge(x, y, all.x = TRUE, by = "Id")
y[x, on = "Id"]
>        Id    Y1     XY    X1   i.XY
>        
> 1:      A     1     y1     1     x2
> 2:      B     3     y3     3     x4
> 3:      B     5     y5     3     x4
> 4:      C    NA        5     x6
> 5:      C    NA        7     x8

# Right join (keep all rows from y)
# merge(x, y, all.y = TRUE, by = "Id")
x[y, on = "Id"]
>        Id    X1     XY    Y1   i.XY
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      D    NA        7     y7

# Inner join (keep only matching rows)
# merge(x, y, by = "Id"
x[y, on = "Id", nomatch = NULL]
>        Id    X1     XY    Y1   i.XY
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5

# Anti join (keep rows from x with no match in y)
x[!y, on = "Id"]
>        Id    X1     XY
>      
> 1:      C     5     x6
> 2:      C     7     x8

# Full join (keep all rows)
merge(x, y, by = "Id", all = TRUE)
> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA   
> 6:      D    NA        7     y7

Reshape

Wide to long

Converting data from wide format (where variables are in columns) to long format (where variables become rows).

melt(DT,
  id.vars       = "V4",
  variable.name = "Variable",
  value.name    = "Value")
>         V4 Variable Value
>         
>  1:      A       V1   1.0
>  2:      B       V1   2.0
>  3:      C       V1   1.0
>  4:      A       V1   2.0
>  5:      B       V1   1.0
>  6:      C       V1   2.0
>  7:      A       V1   1.0
>  8:      B       V1   2.0
>  9:      C       V1   1.0
> 10:      A       V2   1.0
> 11:      B       V2   2.0
> 12:      C       V2   3.0
> 13:      A       V2   4.0
> 14:      B       V2   5.0
> 15:      C       V2   6.0
> 16:      A       V2   7.0
> 17:      B       V2   8.0
> 18:      C       V2   9.0
> 19:      A       V3   0.5
> 20:      B       V3   1.0
> 21:      C       V3   1.5
> 22:      A       V3   0.5
> 23:      B       V3   1.0
> 24:      C       V3   1.5
> 25:      A       V3   0.5
> 26:      B       V3   1.0
> 27:      C       V3   1.5
>         V4 Variable Value
>         
'measure.vars' [V1, V2, V3] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

Long to wide

Converting data from long format back to wide format.

# Create example long data
long = CJ(a = 1:2, b = 1:2, c = c("x", "y"))
long[, let(d = rnorm(8))]
> Key: 
>        a     b      c          d
>            
> 1:     1     1      x -0.8425213
> 2:     1     1      y -0.0431396
> 3:     1     2      x -0.4273492
> 4:     1     2      y -0.3644490
> 5:     2     1      x -0.4764893
> 6:     2     1      y  0.2829335
> 7:     2     2      x  0.1844807
> 8:     2     2      y -0.5859007

dcast(long, a + b ~ c)
> Key: 
>        a     b          x          y
>                 
> 1:     1     1 -0.8425213 -0.0431396
> 2:     1     2 -0.4273492 -0.3644490
> 3:     2     1 -0.4764893  0.2829335
> 4:     2     2  0.1844807 -0.5859007
Using 'd' as value column. Use 'value.var' to override

Split rows

Separating data into groups based on a factor.

split(DT, by = "V4")
> $A
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     4   0.5      A
> 3:     1     7   0.5      A
> 
> $B
>       V1    V2    V3     V4
>       
> 1:     2     2     1      B
> 2:     1     5     1      B
> 3:     2     8     1      B
> 
> $C
>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     6   1.5      C
> 3:     1     9   1.5      C

Split columns

Separating a single column into multiple columns based on a delimiter.

# Create example data
tmp = data.table(a = c("A:a", "B:b", "C:c"))

tmp[, c("w", "z") := tstrsplit(a, split = ":")]
>         a      w      z
>      
> 1:    A:a      A      a
> 2:    B:b      B      b
> 3:    C:c      C      c
tmp
>         a      w      z
>      
> 1:    A:a      A      a
> 2:    B:b      B      b
> 3:    C:c      C      c

Bind rows

Combining multiple datasets by rows. To begin, we create example data.

x = data.table(1:3)
y = data.table(4:6)
z = data.table(7:9, 0L)
# Simple row bind
rbind(x, y)
>       V1
>    
> 1:     1
> 2:     2
> 3:     3
> 4:     4
> 5:     5
> 6:     6

# Simple row bind
rbindlist(list(x, y, x))
>       V1
>    
> 1:     1
> 2:     2
> 3:     3
> 4:     4
> 5:     5
> 6:     6
> 7:     1
> 8:     2
> 9:     3

# Row bind with different columns
rbind(x, z, fill = TRUE)
>       V1    V2
>     
> 1:     1    NA
> 2:     2    NA
> 3:     3    NA
> 4:     7     0
> 5:     8     0
> 6:     9     0

# Bind list of data.tables with ID column
rbindlist(list(x, y), idcol = TRUE)
>      .id    V1
>     
> 1:     1     1
> 2:     1     2
> 3:     1     3
> 4:     2     4
> 5:     2     5
> 6:     2     6

Bind columns

# Column bind
base::cbind(x, y)
>       V1    V1
>     
> 1:     1     4
> 2:     2     5
> 3:     3     6

Set Operations

Set operations treat datasets as sets and perform mathematical set operations on them.

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
# Intersection (rows present in both x and y)
fintersect(x, y)
>       V1
>    
> 1:     2
> 2:     3

# Set difference (rows in x but not in y)
fsetdiff(x, y)
>       V1
>    
> 1:     1
# Include duplicates
fsetdiff(x, y, all = TRUE)
>       V1
>    
> 1:     1
> 2:     3

# Union (combine unique rows)
funion(x, y)
>       V1
>    
> 1:     1
> 2:     2
> 3:     3
> 4:     4
# Include duplicates
funion(x, y, all = TRUE)
>        V1
>     
>  1:     1
>  2:     2
>  3:     2
>  4:     3
>  5:     3
>  6:     2
>  7:     2
>  8:     3
>  9:     4
> 10:     4

# Test for equality
fsetequal(x, x[order(-V1),])
> [1] TRUE
all.equal(x, x)
> [1] TRUE

Read and write

# Write to CSV
fwrite(DT, "DT.csv")

# Write tab-delimited
fwrite(DT, "DT.txt", sep = "\t")

# Read CSV or tab-delimited
DT1 = fread("DT.csv")
DT2 = fread("DT.txt", sep = "\t")

# Read specific columns
DT3 = fread("DT.csv", select = c("V1", "V4"))
DT4 = fread("DT.csv", drop = "V4")

# Combine multiple files
rbindlist(lapply(c("DT.csv", "DT.csv"), fread))
>        V1    V2    V3     V4
>        
>  1:     1     1   0.5      A
>  2:     2     2   1.0      B
>  3:     1     3   1.5      C
>  4:     2     4   0.5      A
>  5:     1     5   1.0      B
>  6:     2     6   1.5      C
>  7:     1     7   0.5      A
>  8:     2     8   1.0      B
>  9:     1     9   1.5      C
> 10:     1     1   0.5      A
> 11:     2     2   1.0      B
> 12:     1     3   1.5      C
> 13:     2     4   0.5      A
> 14:     1     5   1.0      B
> 15:     2     6   1.5      C
> 16:     1     7   0.5      A
> 17:     2     8   1.0      B
> 18:     1     9   1.5      C

Clean up local files.

file.remove(c("DT.csv", "TB.csv", "DF.csv", "DT.txt", "TB.txt", "DF.txt"))
> [1] TRUE TRUE TRUE TRUE TRUE TRUE
Loading source...