<code>data.table</code> vs. <code>base</code> vs. <code>dplyr</code>

This page presents a side-by-side comparison of common data manipulation operations in R in three idioms: data.table, base, and dplyr. This allows you to compare syntax and understand how to accomplish tasks across these popular frameworks.

This reference guide covers everything from basic filtering and sorting to advanced operations like joins and reshaping data. Many of these examples were originally crafted by Atrebas. They were then reorganized and augmented with base examples by a team of contributors.

To begin, we create example data. The base R data frame is called DF, the data.table table is called DT, and the dplyr tibble is called TB. Data creation is wrapped in a refresh_data() function, which is called periodically throughout the document to ensure that the data is reset after modifications.

library(data.table)
library(dplyr)

refresh_data = function() {
    DT <<- data.table(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = c(0.5, 1.0, 1.5),
        V4 = rep(LETTERS[1:3], 3)
    )

    DF <<- data.frame(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = c(0.5, 1.0, 1.5),
        V4 = rep(LETTERS[1:3], 3)
    )

    TB <<- tibble(
        V1 = rep(1:2, 5)[-10],
        V2 = 1:9,
        V3 = rep(c(0.5, 1.0, 1.5), 3),
        V4 = rep(LETTERS[1:3], 3)
    )
}

refresh_data()

💡 Tip

When using the let() and set*() functions or := operator modifies a data.table “in place,” which means that it does not copy the object at all. This is more efficient than re-assigning the entire data set. However, when modified in place, the data table will not be printed immediately to the console after modification. You must call the object again to see the changes.

Filter

Integer index

Positive indices keep the specified rows.

DT[3:4,]

>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     4   0.5      A

DF[3:4,]

>   V1 V2  V3 V4
> 3  1  3 1.5  C
> 4  2  4 0.5  A

TB |> slice(3:4)

> # A tibble: 2 × 4
>      V1    V2    V3 V4   
>      
> 1     1     3   1.5 C    
> 2     2     4   0.5 A

Negative indices exclude the specified rows.

DT[-(3:7),]

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     2     8   1.0      B
> 4:     1     9   1.5      C

DF[-(3:7),]

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 2  2  2 1.0  B
> 8  2  8 1.0  B
> 9  1  9 1.5  C

TB |> slice(-(3:7))

> # A tibble: 4 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     2   1   B    
> 3     2     8   1   B    
> 4     1     9   1.5 C

Logical index

Filter rows using a logical vector. Keep the rows where the condition is TRUE. %chin% is a fast version of %in%, optimized for strings. %like% is a convenient operator for regular expression matches.

DT[V2 > 5]

>       V1    V2    V3     V4
>       
> 1:     2     6   1.5      C
> 2:     1     7   0.5      A
> 3:     2     8   1.0      B
> 4:     1     9   1.5      C

DT[V4 %chin% c("A", "C")] # faster than %in% for strings

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A
> 4:     2     6   1.5      C
> 5:     1     7   0.5      A
> 6:     1     9   1.5      C

DT[V4 %like% c("A|C")]    # regular expressions

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A
> 4:     2     6   1.5      C
> 5:     1     7   0.5      A
> 6:     1     9   1.5      C

subset(DF, V2 > 5)

>   V1 V2  V3 V4
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

subset(DF, V4 %in% c("A", "C"))

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 9  1  9 1.5  C

subset(DF, grepl("A|C", V4))

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 9  1  9 1.5  C

TB |> filter(V2 > 5)

> # A tibble: 4 × 4
>      V1    V2    V3 V4   
>      
> 1     2     6   1.5 C    
> 2     1     7   0.5 A    
> 3     2     8   1   B    
> 4     1     9   1.5 C

TB |> filter(V4 %in% c("A", "C"))

> # A tibble: 6 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     1     3   1.5 C    
> 3     2     4   0.5 A    
> 4     2     6   1.5 C    
> 5     1     7   0.5 A    
> 6     1     9   1.5 C

TB |> filter(grepl("A|C", V4))

> # A tibble: 6 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     1     3   1.5 C    
> 3     2     4   0.5 A    
> 4     2     6   1.5 C    
> 5     1     7   0.5 A    
> 6     1     9   1.5 C

Filter rows based on multiple conditions.

DT[V1 == 1 & V4 == "A"]

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     7   0.5      A

subset(DF, V1 == 1 & V4 == "A")

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 7  1  7 0.5  A

TB |> filter(V1 == 1, V4 == "A")

> # A tibble: 2 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     1     7   0.5 A

Unique

unique(DT)

> Indices: , 
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

unique(DT, by = c("V1", "V4"))

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C

DF[!duplicated(DF), ]

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 2  2  2 1.0  B
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

DF[!duplicated(DF[c("V1", "V4")]), ]

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 2  2  2 1.0  B
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C

TB |> distinct()

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     2   1   B    
> 3     1     3   1.5 C    
> 4     2     4   0.5 A    
> 5     1     5   1   B    
> 6     2     6   1.5 C    
> 7     1     7   0.5 A    
> 8     2     8   1   B    
> 9     1     9   1.5 C

TB |> distinct(V1, V4, .keep_all = TRUE)

> # A tibble: 6 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     2   1   B    
> 3     1     3   1.5 C    
> 4     2     4   0.5 A    
> 5     1     5   1   B    
> 6     2     6   1.5 C

Missing values

Drop rows with missing values in specified columns.

na.omit(DT, cols = 1:4)

> Indices: , 
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

DF[complete.cases(DF[, 1:4]), ]

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 2  2  2 1.0  B
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

TB |> tidyr::drop_na(1:4)

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     2   1   B    
> 3     1     3   1.5 C    
> 4     2     4   0.5 A    
> 5     1     5   1   B    
> 6     2     6   1.5 C    
> 7     1     7   0.5 A    
> 8     2     8   1   B    
> 9     1     9   1.5 C

Random sample

Draw a random sample of rows.

DT[sample(.N, 3)]

>       V1    V2    V3     V4
>       
> 1:     1     7   0.5      A
> 2:     1     3   1.5      C
> 3:     2     4   0.5      A

DT[sample(.N, .N / 2)]

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     1     5   1.0      B
> 3:     2     6   1.5      C
> 4:     1     7   0.5      A

DF[sample(nrow(DF), 3), ]

>   V1 V2  V3 V4
> 7  1  7 0.5  A
> 9  1  9 1.5  C
> 6  2  6 1.5  C

DF[sample(nrow(DF), nrow(DF) / 2), ]

>   V1 V2  V3 V4
> 9  1  9 1.5  C
> 8  2  8 1.0  B
> 7  1  7 0.5  A
> 4  2  4 0.5  A

TB |> slice_sample(n = 3)

> # A tibble: 3 × 4
>      V1    V2    V3 V4   
>      
> 1     1     7   0.5 A    
> 2     2     2   1   B    
> 3     2     6   1.5 C

TB |> slice_sample(prop = 0.5)

> # A tibble: 4 × 4
>      V1    V2    V3 V4   
>      
> 1     1     3   1.5 C    
> 2     2     8   1   B    
> 3     2     4   0.5 A    
> 4     1     7   0.5 A

Other

DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)]

subset(DF, grepl("^B", V4))
subset(DF, V2 >= 3 & V2 <= 5)
subset(DF, V2 > 3 & V2 < 5)
subset(DF, V2 %in% c(-1:1, 1:3))

TB |> filter(grepl("^B", V4))
TB |> filter(dplyr::between(V2, 3, 5))
TB |> filter(V2 > 3 & V2 < 5)
TB |> filter(V2 >= -1:1 & V2 <= 1:3)

Sort

Rows

Sort rows in ascending order.

DT[order(V3)]

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     4   0.5      A
> 3:     1     7   0.5      A
> 4:     2     2   1.0      B
> 5:     1     5   1.0      B
> 6:     2     8   1.0      B
> 7:     1     3   1.5      C
> 8:     2     6   1.5      C
> 9:     1     9   1.5      C

sort_by(DF, ~V3)

>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 4  2  4 0.5  A
> 7  1  7 0.5  A
> 2  2  2 1.0  B
> 5  1  5 1.0  B
> 8  2  8 1.0  B
> 3  1  3 1.5  C
> 6  2  6 1.5  C
> 9  1  9 1.5  C

TB |> arrange(V3)

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     4   0.5 A    
> 3     1     7   0.5 A    
> 4     2     2   1   B    
> 5     1     5   1   B    
> 6     2     8   1   B    
> 7     1     3   1.5 C    
> 8     2     6   1.5 C    
> 9     1     9   1.5 C

Sort rows in decreasing order.

DT[order(-V3)]

>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     6   1.5      C
> 3:     1     9   1.5      C
> 4:     2     2   1.0      B
> 5:     1     5   1.0      B
> 6:     2     8   1.0      B
> 7:     1     1   0.5      A
> 8:     2     4   0.5      A
> 9:     1     7   0.5      A

sort_by(DF, ~list(-V3))

>   V1 V2  V3 V4
> 3  1  3 1.5  C
> 6  2  6 1.5  C
> 9  1  9 1.5  C
> 2  2  2 1.0  B
> 5  1  5 1.0  B
> 8  2  8 1.0  B
> 1  1  1 0.5  A
> 4  2  4 0.5  A
> 7  1  7 0.5  A

TB |> arrange(desc(V3))

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     3   1.5 C    
> 2     2     6   1.5 C    
> 3     1     9   1.5 C    
> 4     2     2   1   B    
> 5     1     5   1   B    
> 6     2     8   1   B    
> 7     1     1   0.5 A    
> 8     2     4   0.5 A    
> 9     1     7   0.5 A

Sort rows by multiple columns.

DT[order(V1, -V2)]

>       V1    V2    V3     V4
>       
> 1:     1     9   1.5      C
> 2:     1     7   0.5      A
> 3:     1     5   1.0      B
> 4:     1     3   1.5      C
> 5:     1     1   0.5      A
> 6:     2     8   1.0      B
> 7:     2     6   1.5      C
> 8:     2     4   0.5      A
> 9:     2     2   1.0      B

sort_by(DF, ~list(V1, -V2))

>   V1 V2  V3 V4
> 9  1  9 1.5  C
> 7  1  7 0.5  A
> 5  1  5 1.0  B
> 3  1  3 1.5  C
> 1  1  1 0.5  A
> 8  2  8 1.0  B
> 6  2  6 1.5  C
> 4  2  4 0.5  A
> 2  2  2 1.0  B

TB |> arrange(V1, desc(V2))

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     9   1.5 C    
> 2     1     7   0.5 A    
> 3     1     5   1   B    
> 4     1     3   1.5 C    
> 5     1     1   0.5 A    
> 6     2     8   1   B    
> 7     2     6   1.5 C    
> 8     2     4   0.5 A    
> 9     2     2   1   B

Sort dataset and the column passed in argument becomes key. Output is an object the same type as the object indexed. This code is not executed to avoid sorting the original data set in place.

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))

DF = DF[order(DF$V4, -DF$V1), ]

TB = TB |> arrange(V4, desc(V1))

refresh_data()

Columns

Reorders the columns of a dataset. This code is not executed to avoid reordering columns in the original data set.

setcolorder(DT, c("V4", "V1", "V2"))
DT

>        V4    V1    V2    V3
>       
> 1:      A     1     1   0.5
> 2:      B     2     2   1.0
> 3:      C     1     3   1.5
> 4:      A     2     4   0.5
> 5:      B     1     5   1.0
> 6:      C     2     6   1.5
> 7:      A     1     7   0.5
> 8:      B     2     8   1.0
> 9:      C     1     9   1.5

DF = DF[, c("V4", "V1", "V2")]
DF

>   V4 V1 V2
> 1  A  1  1
> 2  B  2  2
> 3  C  1  3
> 4  A  2  4
> 5  B  1  5
> 6  C  2  6
> 7  A  1  7
> 8  B  2  8
> 9  C  1  9

TB = TB |> select(V4, V1, V2)
TB

> # A tibble: 9 × 3
>   V4       V1    V2
>     
> 1 A         1     1
> 2 B         2     2
> 3 C         1     3
> 4 A         2     4
> 5 B         1     5
> 6 C         2     6
> 7 A         1     7
> 8 B         2     8
> 9 C         1     9

refresh_data()

Select

Keep

Extract one column as a vector.

# DT[["V3"]]
# DT[, V3]
DT[[3]]

> [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5

# DF[["V3"]]
# DF[, 3, drop = TRUE]
DF[[3]]

> [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5

# TB[["V3"]]
# TB |> pull(V3)
TB[[3]]

> [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5

Extract one column as a data frame.

# DT[, .SD, .SDcols = "V3"]
DT[, "V3"]

>       V3
>    
> 1:   0.5
> 2:   1.0
> 3:   1.5
> 4:   0.5
> 5:   1.0
> 6:   1.5
> 7:   0.5
> 8:   1.0
> 9:   1.5

# DF[, "V3", drop = FALSE]
DF[, 3, drop = FALSE]

>    V3
> 1 0.5
> 2 1.0
> 3 1.5
> 4 0.5
> 5 1.0
> 6 1.5
> 7 0.5
> 8 1.0
> 9 1.5

# TB |> select(V3)
TB[, "V3"]

> # A tibble: 9 × 1
>      V3
>   
> 1   0.5
> 2   1  
> 3   1.5
> 4   0.5
> 5   1  
> 6   1.5
> 7   0.5
> 8   1  
> 9   1.5

Select several columns by column names.

DT[, .(V2, V3, V4)]
DT[, V2:V4]
DT[, .SD, .SDcols = V2:V4]
DT[, .SD, .SDcols = c("V2", "V3", "V4")]
cols = c("V2", "V3")
DT[, ..cols]

DF[, c("V2", "V3", "V4")]
subset(DF, select = c("V2", "V3", "V4"))
cols = c("V2", "V3")
DF[, cols]
DF[ , names(DF) %in% cols]

TB |> select(V2, V3, V4)
TB |> select(V2:V4)
TB |> select(any_of(c("V2", "V3", "V4")))
cols = c("V2", "V3")
DF |> select(!!cols)

Drop

Exclude several columns by column name.

# DT[, .SD, .SDcols = !c("V2", "V3")]
DT[, !c("V2", "V3")]

>       V1     V4
>     
> 1:     1      A
> 2:     2      B
> 3:     1      C
> 4:     2      A
> 5:     1      B
> 6:     2      C
> 7:     1      A
> 8:     2      B
> 9:     1      C

DF[ , !(names(DF) %in% c("V2", "V3"))]

>   V1 V4
> 1  1  A
> 2  2  B
> 3  1  C
> 4  2  A
> 5  1  B
> 6  2  C
> 7  1  A
> 8  2  B
> 9  1  C

TB |> select(-V2, -V3)

> # A tibble: 9 × 2
>      V1 V4   
>    
> 1     1 A    
> 2     2 B    
> 3     1 C    
> 4     2 A    
> 5     1 B    
> 6     2 C    
> 7     1 A    
> 8     2 B    
> 9     1 C

Remove a column from the data set. Using let() is efficient because it modifies the data set in place. This code is not executed because the V5 column is not present in the data set.

DT[, let(V1 = NULL)]

>       V2    V3     V4
>      
> 1:     1   0.5      A
> 2:     2   1.0      B
> 3:     3   1.5      C
> 4:     4   0.5      A
> 5:     5   1.0      B
> 6:     6   1.5      C
> 7:     7   0.5      A
> 8:     8   1.0      B
> 9:     9   1.5      C

DT

>       V2    V3     V4
>      
> 1:     1   0.5      A
> 2:     2   1.0      B
> 3:     3   1.5      C
> 4:     4   0.5      A
> 5:     5   1.0      B
> 6:     6   1.5      C
> 7:     7   0.5      A
> 8:     8   1.0      B
> 9:     9   1.5      C

DF = DF[, !names(DF) %in% "V1"]
DF

>   V2  V3 V4
> 1  1 0.5  A
> 2  2 1.0  B
> 3  3 1.5  C
> 4  4 0.5  A
> 5  5 1.0  B
> 6  6 1.5  C
> 7  7 0.5  A
> 8  8 1.0  B
> 9  9 1.5  C

TB = TB |> select(-V1)
TB

> # A tibble: 9 × 3
>      V2    V3 V4   
>     
> 1     1   0.5 A    
> 2     2   1   B    
> 3     3   1.5 C    
> 4     4   0.5 A    
> 5     5   1   B    
> 6     6   1.5 C    
> 7     7   0.5 A    
> 8     8   1   B    
> 9     9   1.5 C

Remove several columns from the data set. Using := is efficient because it modifies the data set in place. This code is not executed because the V6 and V7 columns are not present in the data set.

cols = c("V2", "V3")
DT[, (cols) := NULL]

>        V4
>    
> 1:      A
> 2:      B
> 3:      C
> 4:      A
> 5:      B
> 6:      C
> 7:      A
> 8:      B
> 9:      C

DT

>        V4
>    
> 1:      A
> 2:      B
> 3:      C
> 4:      A
> 5:      B
> 6:      C
> 7:      A
> 8:      B
> 9:      C

DF = DF[, !(names(DF) %in% c("V2", "V3"))]
DF

> [1] "A" "B" "C" "A" "B" "C" "A" "B" "C"

TB = TB |> select(-V2, -V3)
TB

> # A tibble: 9 × 1
>   V4   
>   
> 1 A    
> 2 B    
> 3 C    
> 4 A    
> 5 B    
> 6 C    
> 7 A    
> 8 B    
> 9 C

refresh_data()

Rename

Select and rename.

DT[, .(X1 = V1, X2 = V2)]

>       X1    X2
>     
> 1:     1     1
> 2:     2     2
> 3:     1     3
> 4:     2     4
> 5:     1     5
> 6:     2     6
> 7:     1     7
> 8:     2     8
> 9:     1     9

setNames(
  DF[, c("V1", "V2")],
  c("X1", "X2"))

TB |> select(X1 = V1, X2 = V2)

> # A tibble: 9 × 2
>      X1    X2
>    
> 1     1     1
> 2     2     2
> 3     1     3
> 4     2     4
> 5     1     5
> 6     2     6
> 7     1     7
> 8     2     8
> 9     1     9

Using the data.table::setnames() to rename columns is efficient because it renames column in place. This code is not executed to avoid renaming columns in the original data set.

setnames(DT, old = c("V1", "V2"), new = c("X1", "X2"))
DT

>       X1    X2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     2   1.0      B
> 3:     1     3   1.5      C
> 4:     2     4   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

colnames(DF)[match(c("V1", "V2"), colnames(DF))] = c("X1", "X2")
DF

>   X1 X2  V3 V4
> 1  1  1 0.5  A
> 2  2  2 1.0  B
> 3  1  3 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

TB = TB |> rename(X1 = V1, X2 = V2)
TB

> # A tibble: 9 × 4
>      X1    X2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     2   1   B    
> 3     1     3   1.5 C    
> 4     2     4   0.5 A    
> 5     1     5   1   B    
> 6     2     6   1.5 C    
> 7     1     7   0.5 A    
> 8     2     8   1   B    
> 9     1     9   1.5 C

refresh_data()

Advanced selections

Complex selections using regular expressions or dedicated functions.

DT[, .SD, .SDcols = c("V1", "V2")]
DT[, .SD, .SDcols = patterns("^V[1-2]$")]
DT[, .SD, .SDcols = patterns("V")]
DT[, .SD, .SDcols = patterns("3$")]
DT[, .SD, .SDcols = patterns(".2")]
DT[, .SD, .SDcols = patterns("^V1$|^X$")]
DT[, .SD, .SDcols = patterns("^(?!V2)", perl = TRUE)]

DF[, c("V1", "V2")]
DF[ , grep("^V[1-2]$", names(DF))]
DF[ , c("V4", setdiff(names(DF), "V4"))]
DF[ , grep("V", names(DF))]
DF[ , grep("3$", names(DF))]
DF[ , grep(".2", names(DF))]
DF[ , c("V1", "X")]
DF[ , !grepl("^V2", names(DF))]

TB |> select(V1, V2)
TB |> select(num_range("V", 1:2))
TB |> select(contains("V"))
TB |> select(ends_with("3"))
TB |> select(matches(".2"))
TB |> select(one_of(c("V1", "X")))
TB |> select(-starts_with("V2"))

Summarize

Single columns

Create a new data frame with a single row and a single column, summarizing the information of one column. Named or unnamed results.

DT[, sum(V1)]

> [1] 13

DT[, .(sumV1 = sum(V1))]

>    sumV1
>    
> 1:    13

sum(DF$V1)

> [1] 13

data.frame(sumV1 = sum(DF$V1))

>   sumV1
> 1    13

TB |> pull(V1) |> sum()

> [1] 13

TB |> summarise(sumV1 = sum(V1))

> # A tibble: 1 × 1
>   sumV1
>   
> 1    13

Create a new data frame with a single row and two columns, summarizing the information of two manually specified columns.

DT[, .(sumV1 = sum(V1), sdV3 = sd(V3))]

>    sumV1      sdV3
>         
> 1:    13 0.4330127

data.frame(sumV1 = sum(DF$V1), sdV3 = sd(DF$V3))

>   sumV1      sdV3
> 1    13 0.4330127

TB |> summarise(sumV1 = sum(V1), sdV3 = sd(V3))

> # A tibble: 1 × 2
>   sumV1  sdV3
>    
> 1    13 0.433

Multiple columns

Apply a function to each column.

DT[, lapply(.SD, head, 1)]

>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A

data.frame(lapply(DF, head, 1))

>   V1 V2  V3 V4
> 1  1  1 0.5  A

TB |> summarize(across(everything(), \(x) head(x, 1)))

> # A tibble: 1 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A

Apply a function to each column that matches a specific type.

DT[, lapply(.SD, mean), .SDcols = is.numeric]

>          V1    V2    V3
>         
> 1: 1.444444     5     1

data.frame(lapply(DF[sapply(DF, is.numeric)], mean))

>         V1 V2 V3
> 1 1.444444  5  1

TB |> summarize(across(where(is.numeric), mean))

> # A tibble: 1 × 3
>      V1    V2    V3
>     
> 1  1.44     5     1

By group

Count the number of observation by group.

DT[, .N, by = V4]

>        V4     N
>     
> 1:      A     3
> 2:      B     3
> 3:      C     3

as.data.frame(table(DF$V4))

>   Var1 Freq
> 1    A    3
> 2    B    3
> 3    C    3

TB |>
  group_by(V4) |>
  tally()

> # A tibble: 3 × 2
>   V4        n
>    
> 1 A         3
> 2 B         3
> 3 C         3

Multiple named summaries

DT[, .(nobs = .N, meanV1 = mean(V1)), by = V4]

>        V4  nobs   meanV1
>         
> 1:      A     3 1.333333
> 2:      B     3 1.666667
> 3:      C     3 1.333333

do.call(rbind, by(DF, ~V4, \(x) {
  data.frame(nobs = nrow(x), meanV1 = mean(x$V1))
  }
))

>   nobs   meanV1
> A    3 1.333333
> B    3 1.666667
> C    3 1.333333

TB |>
  group_by(V4) |>
  summarize(nobs = n(), meanV1 = mean(V1)) |>
  ungroup()

> # A tibble: 3 × 3
>   V4     nobs meanV1
>      
> 1 A         3   1.33
> 2 B         3   1.67
> 3 C         3   1.33

Apply a function to the full data frame in each group. Here, we return the first row in each group using the head() function.

DT[, head(.SD, 1), by = V4]

>        V4    V1    V2    V3
>       
> 1:      A     1     1   0.5
> 2:      B     2     2   1.0
> 3:      C     1     3   1.5

do.call(rbind, by(DF, DF$V4, \(x) head(x, 1)))

>   V1 V2  V3 V4
> A  1  1 0.5  A
> B  2  2 1.0  B
> C  1  3 1.5  C

TB |>
  group_by(V4) |>
  summarize(across(everything(), head, 1))

> # A tibble: 3 × 4
>   V4       V1    V2    V3
>      
> 1 A         1     1   0.5
> 2 B         2     2   1  
> 3 C         1     3   1.5

There was 1 warning in `summarize()`.
ℹ In argument: `across(everything(), head, 1)`.
ℹ In group 1: `V4 = "A"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

Get the row number of first (and last) observation by group. In data.table, we use the .I operator, which reports the row number.

DT[, .I[1], by = V4]

>        V4    V1
>     
> 1:      A     1
> 2:      B     2
> 3:      C     3

DT[, range(.I), by = V4]

>        V4    V1
>     
> 1:      A     1
> 2:      A     7
> 3:      B     2
> 4:      B     8
> 5:      C     3
> 6:      C     9

do.call(rbind, by(DF, ~V4, \(x) x[1,]))

>   V1 V2  V3 V4
> A  1  1 0.5  A
> B  2  2 1.0  B
> C  1  3 1.5  C

do.call(rbind, by(DF, ~V4, \(x) x[c(1, nrow(x)),]))

>     V1 V2  V3 V4
> A.1  1  1 0.5  A
> A.7  1  7 0.5  A
> B.2  2  2 1.0  B
> B.8  2  8 1.0  B
> C.3  1  3 1.5  C
> C.9  1  9 1.5  C

TB |>
  group_by(V4) |>
  summarize(cur_group_rows()[1]) |>
  ungroup()

> # A tibble: 3 × 2
>   V4    `cur_group_rows()[1]`
>                    
> 1 A                         1
> 2 B                         2
> 3 C                         3

TB |>
  group_by(V4) |>
  reframe(cur_group_rows()[c(1, n())]) |>
  ungroup()

> # A tibble: 6 × 2
>   V4    `cur_group_rows()[c(1, n())]`
>                            
> 1 A                                 1
> 2 A                                 7
> 3 B                                 2
> 4 B                                 8
> 5 C                                 3
> 6 C                                 9

List-columns are columns where each element is a vector, data frame, or other object.

DT[, .(.(V1)),  by = V4]

>        V4     V1
>     
> 1:      A  1,2,1
> 2:      B  2,1,2
> 3:      C  1,2,1

DT[, .(.(.SD)), by = V4]

>        V4                V1
>                
> 1:      A 
> 2:      B 
> 3:      C

tapply(DF$V1, DF$V4,
  function(x) list(x))

> $A
> [1] 1 2 1
> 
> $B
> [1] 2 1 2
> 
> $C
> [1] 1 2 1

split(DF, DF$V4)

> $A
>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 4  2  4 0.5  A
> 7  1  7 0.5  A
> 
> $B
>   V1 V2 V3 V4
> 2  2  2  1  B
> 5  1  5  1  B
> 8  2  8  1  B
> 
> $C
>   V1 V2  V3 V4
> 3  1  3 1.5  C
> 6  2  6 1.5  C
> 9  1  9 1.5  C

TB |>
  group_by(V4) |>
  summarise(list(V1))

> # A tibble: 3 × 2
>   V4    `list(V1)`
>        
> 1 A      
> 2 B      
> 3 C

TB |>
  group_by(V4) |>
  group_nest()

> # A tibble: 3 × 2
>   V4                  data
>    >
> 1 A                [3 × 3]
> 2 B                [3 × 3]
> 3 C                [3 × 3]

Modify

Single columns

Create or modify a column.

DT[, let(V5 = V1^2)]

>       V1    V2    V3     V4    V5
>        
> 1:     1     1   0.5      A     1
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     1
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     1
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     1
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     1

DT

>       V1    V2    V3     V4    V5
>        
> 1:     1     1   0.5      A     1
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     1
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     1
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     1
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     1

DF$V5 = DF$V1^2
DF

>   V1 V2  V3 V4 V5
> 1  1  1 0.5  A  1
> 2  2  2 1.0  B  4
> 3  1  3 1.5  C  1
> 4  2  4 0.5  A  4
> 5  1  5 1.0  B  1
> 6  2  6 1.5  C  4
> 7  1  7 0.5  A  1
> 8  2  8 1.0  B  4
> 9  1  9 1.5  C  1

TB = TB |> mutate(V5 = V1^2)
TB

> # A tibble: 9 × 5
>      V1    V2    V3 V4       V5
>       
> 1     1     1   0.5 A         1
> 2     2     2   1   B         4
> 3     1     3   1.5 C         1
> 4     2     4   0.5 A         4
> 5     1     5   1   B         1
> 6     2     6   1.5 C         4
> 7     1     7   0.5 A         1
> 8     2     8   1   B         4
> 9     1     9   1.5 C         1

Create several new columns, each named explicitly.

DT[, let(
  V5 = sqrt(V1),
  V6 = "X")]

>       V1    V2    V3     V4       V5     V6
>            
> 1:     1     1   0.5      A 1.000000      X
> 2:     2     2   1.0      B 1.414214      X
> 3:     1     3   1.5      C 1.000000      X
> 4:     2     4   0.5      A 1.414214      X
> 5:     1     5   1.0      B 1.000000      X
> 6:     2     6   1.5      C 1.414214      X
> 7:     1     7   0.5      A 1.000000      X
> 8:     2     8   1.0      B 1.414214      X
> 9:     1     9   1.5      C 1.000000      X

DT

>       V1    V2    V3     V4       V5     V6
>            
> 1:     1     1   0.5      A 1.000000      X
> 2:     2     2   1.0      B 1.414214      X
> 3:     1     3   1.5      C 1.000000      X
> 4:     2     4   0.5      A 1.414214      X
> 5:     1     5   1.0      B 1.000000      X
> 6:     2     6   1.5      C 1.414214      X
> 7:     1     7   0.5      A 1.000000      X
> 8:     2     8   1.0      B 1.414214      X
> 9:     1     9   1.5      C 1.000000      X

DF$V5 = sqrt(DF$V1)
DF$V6 = "X"
DF

>   V1 V2  V3 V4       V5 V6
> 1  1  1 0.5  A 1.000000  X
> 2  2  2 1.0  B 1.414214  X
> 3  1  3 1.5  C 1.000000  X
> 4  2  4 0.5  A 1.414214  X
> 5  1  5 1.0  B 1.000000  X
> 6  2  6 1.5  C 1.414214  X
> 7  1  7 0.5  A 1.000000  X
> 8  2  8 1.0  B 1.414214  X
> 9  1  9 1.5  C 1.000000  X

TB = TB |> mutate(
  V5 = sqrt(V1),
  V6 = "X")
TB

> # A tibble: 9 × 6
>      V1    V2    V3 V4       V5 V6   
>        
> 1     1     1   0.5 A      1    X    
> 2     2     2   1   B      1.41 X    
> 3     1     3   1.5 C      1    X    
> 4     2     4   0.5 A      1.41 X    
> 5     1     5   1   B      1    X    
> 6     2     6   1.5 C      1.41 X    
> 7     1     7   0.5 A      1    X    
> 8     2     8   1   B      1.41 X    
> 9     1     9   1.5 C      1    X

Multiple columns

Take the square of every numeric column.

DT[, names(.SD) := lapply(.SD, \(x) x^2), .SDcols = is.numeric]

>       V1    V2    V3     V4    V5     V6
>         
> 1:     1     1  0.25      A     1      X
> 2:     4     4  1.00      B     2      X
> 3:     1     9  2.25      C     1      X
> 4:     4    16  0.25      A     2      X
> 5:     1    25  1.00      B     1      X
> 6:     4    36  2.25      C     2      X
> 7:     1    49  0.25      A     1      X
> 8:     4    64  1.00      B     2      X
> 9:     1    81  2.25      C     1      X

DT

>       V1    V2    V3     V4    V5     V6
>         
> 1:     1     1  0.25      A     1      X
> 2:     4     4  1.00      B     2      X
> 3:     1     9  2.25      C     1      X
> 4:     4    16  0.25      A     2      X
> 5:     1    25  1.00      B     1      X
> 6:     4    36  2.25      C     2      X
> 7:     1    49  0.25      A     1      X
> 8:     4    64  1.00      B     2      X
> 9:     1    81  2.25      C     1      X

TB = TB |> 
  mutate(across(where(is.numeric), \(x) x^2))
TB

> # A tibble: 9 × 6
>      V1    V2    V3 V4       V5 V6   
>        
> 1     1     1  0.25 A         1 X    
> 2     4     4  1    B         2 X    
> 3     1     9  2.25 C         1 X    
> 4     4    16  0.25 A         2 X    
> 5     1    25  1    B         1 X    
> 6     4    36  2.25 C         2 X    
> 7     1    49  0.25 A         1 X    
> 8     4    64  1    B         2 X    
> 9     1    81  2.25 C         1 X

refresh_data()

Replace

Replace values in rows that match a condition applied to a column.

DT[V2 <= 4, let(V2 = 0)]

>       V1    V2    V3     V4
>       
> 1:     1     0   0.5      A
> 2:     2     0   1.0      B
> 3:     1     0   1.5      C
> 4:     2     0   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

DT

>       V1    V2    V3     V4
>       
> 1:     1     0   0.5      A
> 2:     2     0   1.0      B
> 3:     1     0   1.5      C
> 4:     2     0   0.5      A
> 5:     1     5   1.0      B
> 6:     2     6   1.5      C
> 7:     1     7   0.5      A
> 8:     2     8   1.0      B
> 9:     1     9   1.5      C

DF$V2 = replace(DF$V2, DF$V2 < 4, 0)
DF

>   V1 V2  V3 V4
> 1  1  0 0.5  A
> 2  2  0 1.0  B
> 3  1  0 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

TB = DF |> 
  mutate(V2 = base::replace(V2, V2 < 4, 0))
TB

>   V1 V2  V3 V4
> 1  1  0 0.5  A
> 2  2  0 1.0  B
> 3  1  0 1.5  C
> 4  2  4 0.5  A
> 5  1  5 1.0  B
> 6  2  6 1.5  C
> 7  1  7 0.5  A
> 8  2  8 1.0  B
> 9  1  9 1.5  C

refresh_data()

If, else, case

Fast version of ifelse(). Return a vector the same length as the vector tested, with specific value for in case where the condition is TRUE, where it is FALSE, and where it is NA.

x = c(-3:3, NA)
fifelse(test = x < 0,
  yes  = "neg",
  no   = "pos",
  na   = "NA")

> [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

x = c(-3:3, NA)
result = ifelse(is.na(x),
  "NA",
  ifelse(x < 0, "neg", "pos"))
result

> [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

x = c(-3:3, NA)
if_else(condition = x < 0,
  true      = "neg",
  false     = "pos",
  missing   = "NA")

> [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

Recode several cases at once, based on a vector of conditions.

x = 1:10
fcase(
  x %% 6 == 0, "fizz buzz",
  x %% 2 == 0, "fizz",
  x %% 3 == 0, "buzz",
  default = as.character(x)
)

>  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
>  [7] "7"         "fizz"      "buzz"      "fizz"

x = 1:10
result = ifelse(x %% 6 == 0, "fizz buzz",
  ifelse(x %% 2 == 0, "fizz",
  ifelse(x %% 3 == 0, "buzz", as.character(x))))
result

>  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
>  [7] "7"         "fizz"      "buzz"      "fizz"

x = 1:10
case_when(
  x %% 6 == 0 ~ "fizz buzz",
  x %% 2 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)

>  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
>  [7] "7"         "fizz"      "buzz"      "fizz"

Modify by group

Group summaries of a column in categories.

DT[, by = "V4", .(sumV2 = sum(V2))]

>        V4 sumV2
>     
> 1:      A    12
> 2:      B    15
> 3:      C    18

aggregate(V2 ~ V4, data = DF, FUN = sum)

>   V4 V2
> 1  A 12
> 2  B 15
> 3  C 18

TB |>
  group_by(V4) |>
  summarise(sumV2 = sum(V2)) |>
  ungroup()

> # A tibble: 3 × 2
>   V4    sumV2
>    
> 1 A        12
> 2 B        15
> 3 C        18

Group values of a column in groups while applying a function to each category.

DT[,
  by = tolower(V4),
  .(sumV1 = sum(V1))]

>    tolower sumV1
>      
> 1:       a     4
> 2:       b     5
> 3:       c     4

aggregate(
  V1 ~ tolower(V4),
  data = DF,
  FUN = sum)

>   tolower(V4) V1
> 1           a  4
> 2           b  5
> 3           c  4

TB |>
  group_by(tolower(V4)) |>
  summarise(sumV1 = sum(V1))

> # A tibble: 3 × 2
>   `tolower(V4)` sumV1
>            
> 1 a                 4
> 2 b                 5
> 3 c                 4

Group values of a column in two categories, TRUE (for rows matching the condition) and FALSE (For rows not matching the condition).

DT[,
  keyby = V4 == "A",
  sum(V1)]

> Key: 
>        V4    V1
>     
> 1:  FALSE     9
> 2:   TRUE     4

aggregate(V1 ~ groupA,
  data = transform(DF, groupA = V4 == "A"),
  FUN = sum)

>   groupA V1
> 1  FALSE  9
> 2   TRUE  4

TB |>
  group_by(V4 == "A") |>
  summarise(sum(V1))

> # A tibble: 2 × 2
>   `V4 == "A"` `sum(V1)`
>              
> 1 FALSE               9
> 2 TRUE                4

Group values of a column in several categories with some of the rows of the initial dataset removed.

DT[1:5,
  by = V4,
  .(sumV1 = sum(V1))]

>        V4 sumV1
>     
> 1:      A     3
> 2:      B     3
> 3:      C     1

aggregate(V1 ~ V4,
  data = DF[1:5,],
  FUN = sum)

>   V4 V1
> 1  A  3
> 2  B  3
> 3  C  1

TB |>
  slice(1:5) |>
  group_by(V4) |>
  summarise(sumV1 = sum(V1))

> # A tibble: 3 × 2
>   V4    sumV1
>    
> 1 A         3
> 2 B         3
> 3 C         1

Add a new column with the number of observations per group.

DT[, let(n = .N), by = V1]

>       V1    V2    V3     V4     n
>        
> 1:     1     1   0.5      A     5
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     5
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     5
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     5
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     5

DT

>       V1    V2    V3     V4     n
>        
> 1:     1     1   0.5      A     5
> 2:     2     2   1.0      B     4
> 3:     1     3   1.5      C     5
> 4:     2     4   0.5      A     4
> 5:     1     5   1.0      B     5
> 6:     2     6   1.5      C     4
> 7:     1     7   0.5      A     5
> 8:     2     8   1.0      B     4
> 9:     1     9   1.5      C     5

DF$n <-
  ave(DF$V1,
  DF$V1,
  FUN = length)
DF

>   V1 V2  V3 V4 n
> 1  1  1 0.5  A 5
> 2  2  2 1.0  B 4
> 3  1  3 1.5  C 5
> 4  2  4 0.5  A 4
> 5  1  5 1.0  B 5
> 6  2  6 1.5  C 4
> 7  1  7 0.5  A 5
> 8  2  8 1.0  B 4
> 9  1  9 1.5  C 5

TB = TB |>
  group_by(V1) |>
  add_tally()
TB

> # A tibble: 9 × 5
> # Groups:   V1 [2]
>      V1    V2    V3 V4        n
>       
> 1     1     1   0.5 A         5
> 2     2     2   1   B         4
> 3     1     3   1.5 C         5
> 4     2     4   0.5 A         4
> 5     1     5   1   B         5
> 6     2     6   1.5 C         4
> 7     1     7   0.5 A         5
> 8     2     8   1   B         4
> 9     1     9   1.5 C         5

refresh_data()

Advanced

Summarise all the columns, typically using an aggregation function.

DT[, lapply(.SD, max)]

>       V1    V2    V3     V4
>       
> 1:     2     9   1.5      C

apply(DF, 2, max)

>    V1    V2    V3    V4 
>   "2"   "9" "1.5"   "C"

TB |> summarise(across(everything(), max))

> # A tibble: 1 × 4
>      V1    V2    V3 V4   
>      
> 1     2     9   1.5 C

Summarise several columns, typically using an aggregation function.

DT[, lapply(.SD, mean),
  .SDcols = c("V1", "V2")]

>          V1    V2
>        
> 1: 1.444444     5

data.frame(
  mean_V1 = mean(DF$V1),
  mean_V2 = mean(DF$V2))

>    mean_V1 mean_V2
> 1 1.444444       5

Summarise several columns by groups, typically using an aggregation function.

DT[, by = V4,
  lapply(.SD, mean),
  .SDcols = c("V1", "V2")]

>        V4       V1    V2
>         
> 1:      A 1.333333     4
> 2:      B 1.666667     5
> 3:      C 1.333333     6

DT[, by = V4,
  lapply(.SD, mean),
  .SDcols = patterns("V1|V2|Z0")]

>        V4       V1    V2
>         
> 1:      A 1.333333     4
> 2:      B 1.666667     5
> 3:      C 1.333333     6

cols = intersect(c("V1", "V2", "Z0"), names(DF))
aggregate(DF[cols], by = list(DF$V4), FUN = mean, na.rm = TRUE)

>   Group.1       V1 V2
> 1       A 1.333333  4
> 2       B 1.666667  5
> 3       C 1.333333  6

TB |>
  group_by(V4) |>
  summarise(across(c(V1, V2), mean)) |>
  ungroup()

> # A tibble: 3 × 3
>   V4       V1    V2
>     
> 1 A      1.33     4
> 2 B      1.67     5
> 3 C      1.33     6

Summarise several columns by group using multiple aggregation functions, grouping by one or more variables.

DT[, by = V4,
  c(lapply(.SD, sum),
  lapply(.SD, mean))]

>        V4    V1    V2    V3       V1    V2    V3
>             
> 1:      A     4    12   1.5 1.333333     4   0.5
> 2:      B     5    15   3.0 1.666667     5   1.0
> 3:      C     4    18   4.5 1.333333     6   1.5

aggregate(cbind(DF$V1, DF$V2, DF$V3) ~ V4,
  data = DF,
  FUN = function(x) c(sum = sum(x), mean = mean(x)))

>   V4   V1.sum  V1.mean V2.sum V2.mean V3.sum V3.mean
> 1  A 4.000000 1.333333     12       4    1.5     0.5
> 2  B 5.000000 1.666667     15       5    3.0     1.0
> 3  C 4.000000 1.333333     18       6    4.5     1.5

TB |>
  group_by(V4) |>
  summarise(across(everything(),
  list(sum = sum, mean = mean)))

> # A tibble: 3 × 7
>   V4    V1_sum V1_mean V2_sum V2_mean V3_sum V3_mean
>                  
> 1 A          4    1.33     12       4    1.5     0.5
> 2 B          5    1.67     15       5    3       1  
> 3 C          4    1.33     18       6    4.5     1.5

Summarise a subset of columns by column type or condition.

DT[, lapply(.SD, mean), .SDcols = is.numeric]

>          V1    V2    V3
>         
> 1: 1.444444     5     1

foo = function(x) {is.numeric(x) && mean(x) > 3}
DT[, lapply(.SD, mean), .SDcols = foo]

>       V2
>    
> 1:     5

sapply(DF[sapply(DF, is.numeric)],
  mean)

>       V1       V2       V3 
> 1.444444 5.000000 1.000000

sapply(DF[sapply(DF, \(x) {
   is.numeric(x) && mean(x) > 3
})], mean)

> V2 
>  5

TB |>
  summarise(across(where(is.numeric),
  mean))

> # A tibble: 1 × 3
>      V1    V2    V3
>     
> 1  1.44     5     1

TB |> summarise(across(
  where(~ is.numeric(.x) && mean(.x) > 3), mean))

> # A tibble: 1 × 1
>      V2
>   
> 1     5

Modify all the columns using the same function.

DT[, lapply(.SD, rev)]

>       V1    V2    V3     V4
>       
> 1:     1     9   1.5      C
> 2:     2     8   1.0      B
> 3:     1     7   0.5      A
> 4:     2     6   1.5      C
> 5:     1     5   1.0      B
> 6:     2     4   0.5      A
> 7:     1     3   1.5      C
> 8:     2     2   1.0      B
> 9:     1     1   0.5      A

data.frame(lapply(DF, rev))

>   V1 V2  V3 V4
> 1  1  9 1.5  C
> 2  2  8 1.0  B
> 3  1  7 0.5  A
> 4  2  6 1.5  C
> 5  1  5 1.0  B
> 6  2  4 0.5  A
> 7  1  3 1.5  C
> 8  2  2 1.0  B
> 9  1  1 0.5  A

TB |> mutate(across(everything(), rev))

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     9   1.5 C    
> 2     2     8   1   B    
> 3     1     7   0.5 A    
> 4     2     6   1.5 C    
> 5     1     5   1   B    
> 6     2     4   0.5 A    
> 7     1     3   1.5 C    
> 8     2     2   1   B    
> 9     1     1   0.5 A

Apply a transformation to each element of the variables selected.

DT[, lapply(.SD, sqrt), .SDcols = V1:V2]

>          V1       V2
>           
> 1: 1.000000 1.000000
> 2: 1.414214 1.414214
> 3: 1.000000 1.732051
> 4: 1.414214 2.000000
> 5: 1.000000 2.236068
> 6: 1.414214 2.449490
> 7: 1.000000 2.645751
> 8: 1.414214 2.828427
> 9: 1.000000 3.000000

Apply a transformation for each element of the variables selected.

DT[, names(.SD) := lapply(.SD, sqrt), .SDcols = V1:V3]

>          V1       V2        V3     V4
>                 
> 1: 1.000000 1.000000 0.7071068      A
> 2: 1.414214 1.414214 1.0000000      B
> 3: 1.000000 1.732051 1.2247449      C
> 4: 1.414214 2.000000 0.7071068      A
> 5: 1.000000 2.236068 1.0000000      B
> 6: 1.414214 2.449490 1.2247449      C
> 7: 1.000000 2.645751 0.7071068      A
> 8: 1.414214 2.828427 1.0000000      B
> 9: 1.000000 3.000000 1.2247449      C

DT

>          V1       V2        V3     V4
>                 
> 1: 1.000000 1.000000 0.7071068      A
> 2: 1.414214 1.414214 1.0000000      B
> 3: 1.000000 1.732051 1.2247449      C
> 4: 1.414214 2.000000 0.7071068      A
> 5: 1.000000 2.236068 1.0000000      B
> 6: 1.414214 2.449490 1.2247449      C
> 7: 1.000000 2.645751 0.7071068      A
> 8: 1.414214 2.828427 1.0000000      B
> 9: 1.000000 3.000000 1.2247449      C

Apply a transformation to each element of the variables selected by condition.

DT[, .SD - 1, .SDcols = is.numeric]

>           V1        V2         V3
>                   
> 1: 0.0000000 0.0000000 -0.2928932
> 2: 0.4142136 0.4142136  0.0000000
> 3: 0.0000000 0.7320508  0.2247449
> 4: 0.4142136 1.0000000 -0.2928932
> 5: 0.0000000 1.2360680  0.0000000
> 6: 0.4142136 1.4494897  0.2247449
> 7: 0.0000000 1.6457513 -0.2928932
> 8: 0.4142136 1.8284271  0.0000000
> 9: 0.0000000 2.0000000  0.2247449

data.frame(lapply(DF,
  \(x) if (is.numeric(x)) x - 1 else x))

>   V1 V2   V3 V4
> 1  0  0 -0.5  A
> 2  1  1  0.0  B
> 3  0  2  0.5  C
> 4  1  3 -0.5  A
> 5  0  4  0.0  B
> 6  1  5  0.5  C
> 7  0  6 -0.5  A
> 8  1  7  0.0  B
> 9  0  8  0.5  C

TB |>
  transmute(across(where(is.numeric),
  ~ '-'(., 1L)))

> # A tibble: 9 × 3
>      V1    V2    V3
>     
> 1     0     0  -0.5
> 2     1     1   0  
> 3     0     2   0.5
> 4     1     3  -0.5
> 5     0     4   0  
> 6     1     5   0.5
> 7     0     6  -0.5
> 8     1     7   0  
> 9     0     8   0.5

Apply a transformation to each element of the variables selected by condition.

DT[, names(.SD) := lapply(.SD, as.integer),
  .SDcols = is.numeric]

>       V1    V2    V3     V4
>       
> 1:     1     1     0      A
> 2:     1     1     1      B
> 3:     1     1     1      C
> 4:     1     2     0      A
> 5:     1     2     1      B
> 6:     1     2     1      C
> 7:     1     2     0      A
> 8:     1     2     1      B
> 9:     1     3     1      C

DT

>       V1    V2    V3     V4
>       
> 1:     1     1     0      A
> 2:     1     1     1      B
> 3:     1     1     1      C
> 4:     1     2     0      A
> 5:     1     2     1      B
> 6:     1     2     1      C
> 7:     1     2     0      A
> 8:     1     2     1      B
> 9:     1     3     1      C

DF[sapply(DF, is.numeric)] <-
  lapply(DF[sapply(DF, is.numeric)],
  as.integer)
DF

>   V1 V2 V3 V4
> 1  1  1  0  A
> 2  2  2  1  B
> 3  1  3  1  C
> 4  2  4  0  A
> 5  1  5  1  B
> 6  2  6  1  C
> 7  1  7  0  A
> 8  2  8  1  B
> 9  1  9  1  C

TB = TB |>
  mutate(across(where(is.numeric),
  as.integer))
TB

> # A tibble: 9 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1     0 A    
> 2     2     2     1 B    
> 3     1     3     1 C    
> 4     2     4     0 A    
> 5     1     5     1 B    
> 6     2     6     1 C    
> 7     1     7     0 A    
> 8     2     8     1 B    
> 9     1     9     1 C

Combine multiple functions in a single statement.

DT[, by = V4,
  .(V1[1:2], "X")]

>        V4    V1     V2
>      
> 1:      A     1      X
> 2:      A     1      X
> 3:      B     1      X
> 4:      B     1      X
> 5:      C     1      X
> 6:      C     1      X

DF = do.call(rbind,
  by(DF, DF$V4, function(sub) {
  head(data.frame(V1 = sub$V1,
  V2 = "X"), 2)
}))

TB |>
  group_by(V4) |>
  slice(1:2) |>
  transmute(V1 = V1,
  V2 = "X")

> # A tibble: 6 × 3
> # Groups:   V4 [3]
>   V4       V1 V2   
>     
> 1 A         1 X    
> 2 A         2 X    
> 3 B         2 X    
> 4 B         1 X    
> 5 C         1 X    
> 6 C         2 X

refresh_data()

Chain (pipe)

Expression chaining allows you to perform multiple operations in sequence without creating intermediate objects.

# Chain operations using [][]
DT[, by = V4, .(V1sum = sum(V1))][
  V1sum > 4]

>        V4 V1sum
>     
> 1:      B     5


# Sort results in descending order
DT[, by = V4, .(V1sum = sum(V1))][
  order(-V1sum)]

>        V4 V1sum
>     
> 1:      B     5
> 2:      A     4
> 3:      C     4

# Chain using intermediate assignment
subset(aggregate(V1 ~ V4,
  data = DF,
  FUN = sum), V1 > 4)

>   V4 V1
> 2  B  5


# Sort results
# TODO

# Chain using pipe operator
TB |>
  group_by(V4) |>
  summarise(V1sum = sum(V1)) |>
  filter(V1sum > 4)

> # A tibble: 1 × 2
>   V4    V1sum
>    
> 1 B         5


# Sort results
TB |>
  group_by(V4) |>
  summarise(V1sum = sum(V1)) |>
  arrange(desc(V1sum))

> # A tibble: 3 × 2
>   V4    V1sum
>    
> 1 B         5
> 2 A         4
> 3 C         4

refresh_data()

Join

First, let’s create example datasets for demonstrating joins.

x = data.table(
  Id  = c("A", "B", "C", "C"),
  X1  = c(1L, 3L, 5L, 7L),
  XY  = c("x2", "x4", "x6", "x8")
)
y = data.table(
  Id  = c("A", "B", "B", "D"),
  Y1  = c(1L, 3L, 5L, 7L),
  XY  = c("y1", "y3", "y5", "y7")
)

Basic Joins

There are two basic ways to perform joins in data.table. First, we can use the standard merge() function that should be familiar to most R users.

Second, we can use the square bracket notation with the on argument. Here, the logic is that we select every row from the table outside that match rows from the table inside the bracket. The table inside the brackets is used as an “index”, so x[y] essentially means that we are merging x into y. The benefit of this syntax is that we can efficiently combine it with other operations like summaries, sorts, etc.

Note that, in data.table, the on argument specifies the joining columns, and the i. prefix refers to columns from the right table, while the x. prefix refers to columns from the left table.

# Left join (keep all rows from x)
# merge(x, y, all.x = TRUE, by = "Id")
y[x, on = "Id"]

>        Id    Y1     XY    X1   i.XY
>        
> 1:      A     1     y1     1     x2
> 2:      B     3     y3     3     x4
> 3:      B     5     y5     3     x4
> 4:      C    NA        5     x6
> 5:      C    NA        7     x8


# Right join (keep all rows from y)
# merge(x, y, all.y = TRUE, by = "Id")
x[y, on = "Id"]

>        Id    X1     XY    Y1   i.XY
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      D    NA        7     y7


# Inner join (keep only matching rows)
# merge(x, y, by = "Id"
x[y, on = "Id", nomatch = NULL]

>        Id    X1     XY    Y1   i.XY
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5


# Anti join (keep rows from x with no match in y)
x[!y, on = "Id"]

>        Id    X1     XY
>      
> 1:      C     5     x6
> 2:      C     7     x8


# Full join (keep all rows)
merge(x, y, by = "Id", all = TRUE)

> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA   
> 6:      D    NA        7     y7

# Left join (keep all rows from x)
merge(x, y, by = "Id", all.x = TRUE)

> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA


# Right join (keep all rows from y)
merge(x, y, by = "Id", all.y = TRUE)

> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      D    NA        7     y7


# Inner join (keep only matching rows)
merge(x, y, by = "Id")

> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5


# Full join (keep all rows)
merge(x, y, by = "Id", all = TRUE)

> Key: 
>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA   
> 6:      D    NA        7     y7


# Anti join (more complex in base R)
x[!(x$Id %in% y$Id), ]

>        Id    X1     XY
>      
> 1:      C     5     x6
> 2:      C     7     x8

# Left join (keep all rows from x)
left_join(x, y, by = "Id")

>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA


# Right join (keep all rows from y)
right_join(x, y, by = "Id")

>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      D    NA        7     y7


# Inner join (keep only matching rows)
inner_join(x, y, by = "Id")

>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5


# Full join (keep all rows)
full_join(x, y, by = "Id")

>        Id    X1   XY.x    Y1   XY.y
>        
> 1:      A     1     x2     1     y1
> 2:      B     3     x4     3     y3
> 3:      B     3     x4     5     y5
> 4:      C     5     x6    NA   
> 5:      C     7     x8    NA   
> 6:      D    NA        7     y7


# Anti join
anti_join(x, y, by = "Id")

>        Id    X1     XY
>      
> 1:      C     5     x6
> 2:      C     7     x8

Reshape

Wide to long

Converting data from wide format (where variables are in columns) to long format (where variables become rows).

melt(DT,
  id.vars       = "V4",
  variable.name = "Variable",
  value.name    = "Value")

>         V4 Variable Value
>         
>  1:      A       V1   1.0
>  2:      B       V1   2.0
>  3:      C       V1   1.0
>  4:      A       V1   2.0
>  5:      B       V1   1.0
>  6:      C       V1   2.0
>  7:      A       V1   1.0
>  8:      B       V1   2.0
>  9:      C       V1   1.0
> 10:      A       V2   1.0
> 11:      B       V2   2.0
> 12:      C       V2   3.0
> 13:      A       V2   4.0
> 14:      B       V2   5.0
> 15:      C       V2   6.0
> 16:      A       V2   7.0
> 17:      B       V2   8.0
> 18:      C       V2   9.0
> 19:      A       V3   0.5
> 20:      B       V3   1.0
> 21:      C       V3   1.5
> 22:      A       V3   0.5
> 23:      B       V3   1.0
> 24:      C       V3   1.5
> 25:      A       V3   0.5
> 26:      B       V3   1.0
> 27:      C       V3   1.5
>         V4 Variable Value
>

'measure.vars' [V1, V2, V3] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

reshape(DF,
  varying = setdiff(names(DF), "V4"),
  v.names = "value",
  timevar = "variable",
  times = setdiff(names(DF), "V4"),
  direction = "long")

>      V4 variable value id
> 1.V1  A       V1   1.0  1
> 2.V1  B       V1   2.0  2
> 3.V1  C       V1   1.0  3
> 4.V1  A       V1   2.0  4
> 5.V1  B       V1   1.0  5
> 6.V1  C       V1   2.0  6
> 7.V1  A       V1   1.0  7
> 8.V1  B       V1   2.0  8
> 9.V1  C       V1   1.0  9
> 1.V2  A       V2   1.0  1
> 2.V2  B       V2   2.0  2
> 3.V2  C       V2   3.0  3
> 4.V2  A       V2   4.0  4
> 5.V2  B       V2   5.0  5
> 6.V2  C       V2   6.0  6
> 7.V2  A       V2   7.0  7
> 8.V2  B       V2   8.0  8
> 9.V2  C       V2   9.0  9
> 1.V3  A       V3   0.5  1
> 2.V3  B       V3   1.0  2
> 3.V3  C       V3   1.5  3
> 4.V3  A       V3   0.5  4
> 5.V3  B       V3   1.0  5
> 6.V3  C       V3   1.5  6
> 7.V3  A       V3   0.5  7
> 8.V3  B       V3   1.0  8
> 9.V3  C       V3   1.5  9

TB |> tidyr::pivot_longer(
  cols = c("V1", "V2", "V3"),
  names_to = "Variable",
  values_to = "Value")

> # A tibble: 27 × 3
>    V4    Variable Value
>         
>  1 A     V1         1  
>  2 A     V2         1  
>  3 A     V3         0.5
>  4 B     V1         2  
>  5 B     V2         2  
>  6 B     V3         1  
>  7 C     V1         1  
>  8 C     V2         3  
>  9 C     V3         1.5
> 10 A     V1         2  
> # ℹ 17 more rows

Long to wide

Converting data from long format back to wide format.

# Create example long data
long = CJ(a = 1:2, b = 1:2, c = c("x", "y"))
long[, let(d = rnorm(8))]

> Key: 
>        a     b      c          d
>            
> 1:     1     1      x -0.8425213
> 2:     1     1      y -0.0431396
> 3:     1     2      x -0.4273492
> 4:     1     2      y -0.3644490
> 5:     2     1      x -0.4764893
> 6:     2     1      y  0.2829335
> 7:     2     2      x  0.1844807
> 8:     2     2      y -0.5859007


dcast(long, a + b ~ c)

> Key: 
>        a     b          x          y
>                 
> 1:     1     1 -0.8425213 -0.0431396
> 2:     1     2 -0.4273492 -0.3644490
> 3:     2     1 -0.4764893  0.2829335
> 4:     2     2  0.1844807 -0.5859007

Using 'd' as value column. Use 'value.var' to override

# Create example long data
long = expand.grid(a = 1:2, b = 1:2, c = c("x", "y"))
long$d = rnorm(8)

reshape(long, 
  idvar = c("a", "b"),
  timevar = "c",
  direction = "wide")

>   a b        d.x        d.y
> 1 1 1 -0.5331285 -0.6236030
> 2 2 1 -2.5042215  0.8518563
> 3 1 2  0.6166996 -0.6863949
> 4 2 2 -0.7550700  1.3738420

# Create example long data
long = tidyr::expand_grid(a = 1:2, b = 1:2, c = c("x", "y"))
long$d = rnorm(8)

tidyr::pivot_wider(long,
  id_cols = c("a", "b"),
  names_from = "c",
  values_from = "d")

> # A tibble: 4 × 4
>       a     b      x      y
>        
> 1     1     1  0.442  0.143
> 2     1     2  1.52   0.120
> 3     2     1 -0.289 -0.605
> 4     2     2 -0.233 -1.53

Split rows

Separating data into groups based on a factor.

split(DT, by = "V4")

> $A
>       V1    V2    V3     V4
>       
> 1:     1     1   0.5      A
> 2:     2     4   0.5      A
> 3:     1     7   0.5      A
> 
> $B
>       V1    V2    V3     V4
>       
> 1:     2     2     1      B
> 2:     1     5     1      B
> 3:     2     8     1      B
> 
> $C
>       V1    V2    V3     V4
>       
> 1:     1     3   1.5      C
> 2:     2     6   1.5      C
> 3:     1     9   1.5      C

split(DF, DF$V4)

> $A
>   V1 V2  V3 V4
> 1  1  1 0.5  A
> 4  2  4 0.5  A
> 7  1  7 0.5  A
> 
> $B
>   V1 V2 V3 V4
> 2  2  2  1  B
> 5  1  5  1  B
> 8  2  8  1  B
> 
> $C
>   V1 V2  V3 V4
> 3  1  3 1.5  C
> 6  2  6 1.5  C
> 9  1  9 1.5  C

TB |> group_split(V4)

>    tbl_df<
>     V1: integer
>     V2: integer
>     V3: double
>     V4: character
>   >
> >[3]>
> [[1]]
> # A tibble: 3 × 4
>      V1    V2    V3 V4   
>      
> 1     1     1   0.5 A    
> 2     2     4   0.5 A    
> 3     1     7   0.5 A    
> 
> [[2]]
> # A tibble: 3 × 4
>      V1    V2    V3 V4   
>      
> 1     2     2     1 B    
> 2     1     5     1 B    
> 3     2     8     1 B    
> 
> [[3]]
> # A tibble: 3 × 4
>      V1    V2    V3 V4   
>      
> 1     1     3   1.5 C    
> 2     2     6   1.5 C    
> 3     1     9   1.5 C

Split columns

Separating a single column into multiple columns based on a delimiter.

# Create example data
tmp = data.table(a = c("A:a", "B:b", "C:c"))

tmp[, c("w", "z") := tstrsplit(a, split = ":")]

>         a      w      z
>      
> 1:    A:a      A      a
> 2:    B:b      B      b
> 3:    C:c      C      c

tmp

>         a      w      z
>      
> 1:    A:a      A      a
> 2:    B:b      B      b
> 3:    C:c      C      c

# Create example data
tmp = data.frame(a = c("A:a", "B:b", "C:c"))

tmp$w = sapply(strsplit(as.character(tmp$a), ":"), `[`, 1)
tmp$z = sapply(strsplit(as.character(tmp$a), ":"), `[`, 2)
tmp

>     a w z
> 1 A:a A a
> 2 B:b B b
> 3 C:c C c

# Create example data
tmp = tibble(a = c("A:a", "B:b", "C:c"))

tmp = tidyr::separate(tmp, a, c("w", "z"), remove = FALSE)
tmp

> # A tibble: 3 × 3
>   a     w     z    
>     
> 1 A:a   A     a    
> 2 B:b   B     b    
> 3 C:c   C     c

Bind rows

Combining multiple datasets by rows. To begin, we create example data.

x = data.table(1:3)
y = data.table(4:6)
z = data.table(7:9, 0L)

# Simple row bind
rbind(x, y)


# Simple row bind
rbindlist(list(x, y, x))


# Row bind with different columns
rbind(x, z, fill = TRUE)

>       V1    V2
>     
> 1:     1    NA
> 2:     2    NA
> 3:     3    NA
> 4:     7     0
> 5:     8     0
> 6:     9     0


# Bind list of data.tables with ID column
rbindlist(list(x, y), idcol = TRUE)

>      .id    V1
>     
> 1:     1     1
> 2:     1     2
> 3:     1     3
> 4:     2     4
> 5:     2     5
> 6:     2     6

# Simple row bind
rbind(x, y)


# Simple row bind
do.call(rbind, list(x, y, x))


# Bind with ID column
rbind(
  cbind(x, id = 1),
  cbind(y, id = 2)
)

>       V1    id
>     
> 1:     1     1
> 2:     2     1
> 3:     3     1
> 4:     4     2
> 5:     5     2
> 6:     6     2

# Simple row bind
bind_rows(x, y)


# Simple row bind
bind_rows(list(x, y, x))


# Row bind with different columns
bind_rows(x, z)

>       V1    V2
>     
> 1:     1    NA
> 2:     2    NA
> 3:     3    NA
> 4:     7     0
> 5:     8     0
> 6:     9     0


# Bind list with ID column
bind_rows(list(x, y), .id = "id")

>        id    V1
>     
> 1:      1     1
> 2:      1     2
> 3:      1     3
> 4:      2     4
> 5:      2     5
> 6:      2     6

Bind columns

# Column bind
base::cbind(x, y)

>       V1    V1
>     
> 1:     1     4
> 2:     2     5
> 3:     3     6

# Column bind
cbind(x, y)

>       V1    V1
>     
> 1:     1     4
> 2:     2     5
> 3:     3     6

# Column bind
bind_cols(x, y)

>    V1...1 V1...2
>       
> 1:      1      4
> 2:      2      5
> 3:      3      6

New names:
• `V1` -> `V1...1`
• `V1` -> `V1...2`

Set Operations

Set operations treat datasets as sets and perform mathematical set operations on them.

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))

# Intersection (rows present in both x and y)
fintersect(x, y)


# Set difference (rows in x but not in y)
fsetdiff(x, y)

>       V1
>    
> 1:     1

# Include duplicates
fsetdiff(x, y, all = TRUE)


# Union (combine unique rows)
funion(x, y)

# Include duplicates
funion(x, y, all = TRUE)

>        V1
>     
>  1:     1
>  2:     2
>  3:     2
>  4:     3
>  5:     3
>  6:     2
>  7:     2
>  8:     3
>  9:     4
> 10:     4


# Test for equality
fsetequal(x, x[order(-V1),])

> [1] TRUE

all.equal(x, x)

> [1] TRUE

# Intersection
intersect(x, y)


# Set difference
setdiff(x, y)

>       V1
>    
> 1:     1


# Union (unique rows)
union(x, y)

# With duplicates
unlist(c(x, y))

> V11 V12 V13 V14 V15 V11 V12 V13 V14 V15 
>   1   2   2   3   3   2   2   3   4   4


# Test for equality
identical(x, x)

> [1] TRUE

all.equal(x, x)

> [1] TRUE

# Intersection
dplyr::intersect(x, y)


# Set difference
dplyr::setdiff(x, y)

>       V1
>    
> 1:     1


# Union (unique rows)
dplyr::union(x, y)

# With duplicates
union_all(x, y)

>        V1
>     
>  1:     1
>  2:     2
>  3:     2
>  4:     3
>  5:     3
>  6:     2
>  7:     2
>  8:     3
>  9:     4
> 10:     4


# Test for equality
setequal(x, x[order(-V1),])

> [1] TRUE

all.equal(x, x)

> [1] TRUE

Read and write

# Write to CSV
fwrite(DT, "DT.csv")

# Write tab-delimited
fwrite(DT, "DT.txt", sep = "\t")

# Read CSV or tab-delimited
DT1 = fread("DT.csv")
DT2 = fread("DT.txt", sep = "\t")

# Read specific columns
DT3 = fread("DT.csv", select = c("V1", "V4"))
DT4 = fread("DT.csv", drop = "V4")

# Combine multiple files
rbindlist(lapply(c("DT.csv", "DT.csv"), fread))

>        V1    V2    V3     V4
>        
>  1:     1     1   0.5      A
>  2:     2     2   1.0      B
>  3:     1     3   1.5      C
>  4:     2     4   0.5      A
>  5:     1     5   1.0      B
>  6:     2     6   1.5      C
>  7:     1     7   0.5      A
>  8:     2     8   1.0      B
>  9:     1     9   1.5      C
> 10:     1     1   0.5      A
> 11:     2     2   1.0      B
> 12:     1     3   1.5      C
> 13:     2     4   0.5      A
> 14:     1     5   1.0      B
> 15:     2     6   1.5      C
> 16:     1     7   0.5      A
> 17:     2     8   1.0      B
> 18:     1     9   1.5      C

# Write to CSV
write.csv(DF, "DF.csv", row.names = FALSE)

# Write tab-delimited
write.table(DF, "DF.txt", sep = "\t", 
  row.names = FALSE, col.names = TRUE)

# Read CSV or tab-delimited
DF1 = read.csv("DF.csv")
DF2 = read.table("DF.txt", sep = "\t", header = TRUE)

# Read specific columns
DF3 = read.csv("DF.csv")[, c("V1", "V4")]
DF4 = read.csv("DF.csv")[, !(names(read.csv("DF.csv")) %in% "V4")]

# Combine multiple files
do.call(rbind, lapply(c("DF.csv", "DF.csv"), read.csv))

>    V1 V2  V3 V4
> 1   1  1 0.5  A
> 2   2  2 1.0  B
> 3   1  3 1.5  C
> 4   2  4 0.5  A
> 5   1  5 1.0  B
> 6   2  6 1.5  C
> 7   1  7 0.5  A
> 8   2  8 1.0  B
> 9   1  9 1.5  C
> 10  1  1 0.5  A
> 11  2  2 1.0  B
> 12  1  3 1.5  C
> 13  2  4 0.5  A
> 14  1  5 1.0  B
> 15  2  6 1.5  C
> 16  1  7 0.5  A
> 17  2  8 1.0  B
> 18  1  9 1.5  C

# Write to CSV
TB |> readr::write_csv("TB.csv")

# Write tab-delimited
TB |> readr::write_delim("TB.txt", delim = "\t")

# Read CSV or tab-delimited
TB1 = readr::read_csv("TB.csv")
TB2 = readr::read_delim("TB.txt", delim = "\t")

# Combine multiple files
c("TB.csv", "TB.csv") |>
  purrr::map_dfr(readr::read_csv)

> # A tibble: 18 × 4
>       V1    V2    V3 V4   
>       
>  1     1     1   0.5 A    
>  2     2     2   1   B    
>  3     1     3   1.5 C    
>  4     2     4   0.5 A    
>  5     1     5   1   B    
>  6     2     6   1.5 C    
>  7     1     7   0.5 A    
>  8     2     8   1   B    
>  9     1     9   1.5 C    
> 10     1     1   0.5 A    
> 11     2     2   1   B    
> 12     1     3   1.5 C    
> 13     2     4   0.5 A    
> 14     1     5   1   B    
> 15     2     6   1.5 C    
> 16     1     7   0.5 A    
> 17     2     8   1   B    
> 18     1     9   1.5 C

Rows: 9 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): V4
dbl (3): V1, V2, V3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 9 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): V4
dbl (3): V1, V2, V3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 9 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): V4
dbl (3): V1, V2, V3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 9 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): V4
dbl (3): V1, V2, V3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean up local files.

file.remove(c("DT.csv", "TB.csv", "DF.csv", "DT.txt", "TB.txt", "DF.txt"))

> [1] TRUE TRUE TRUE TRUE TRUE TRUE

Loading source...