ctapply is a fast replacement for tapply that assumes contiguous input, i.e. all occurrences of a given value in the index are adjacent and never separated by other values. This avoids an expensive split step, since both the value chunks and the index chunks can be created on the fly. This can make it orders of magnitude faster than the classical lapply(split(), ...) implementation.

ctapply(X, INDEX, FUN, ..., MERGE=c)

Arguments

X

an atomic object, typically a vector

INDEX

numeric or character vector of the same length as X

FUN

the function to be applied

...

additional arguments to FUN. They are passed as-is, i.e., without replication or recycling

MERGE

function used to merge the per-group results into the final value, or NULL if the unmerged results (i.e. the arguments that would be passed to such a function) are to be returned instead

Details

ctapply supports integer, real or character vectors as indices. Factors are integer vectors and are thus supported, and character vectors do not need to be converted. Unlike tapply, it does not take a list of factors; if you want to use a cross-product of factors, create the product first, e.g. using paste(i1, i2, i3, sep='\01') or multiplication, whichever method is convenient for the input types.
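
As a minimal sketch of the cross-product approach, the following base R snippet builds a single combined key from two index vectors. The names i1, i2, key and n_groups are hypothetical and the data are made up for illustration.

```r
# Two hypothetical index vectors describing the same six observations
i1 <- c("a", "a", "a", "b", "b", "b")
i2 <- c(1L, 1L, 2L, 2L, 2L, 3L)

# Combine them into one character key; '\01' is a separator that is
# unlikely to occur in real data, so distinct pairs give distinct keys
key <- paste(i1, i2, sep = "\01")

# Number of distinct (i1, i2) combinations
n_groups <- length(unique(key))
```

The resulting key has the same length as the inputs and could be passed as INDEX, provided that equal keys are adjacent.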

ctapply requires INDEX to be contiguous. One (slow) way to achieve that is to use sort or order, but in typical use cases it is applied to data that is already structured: sharded by key, but not necessarily sorted.
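
The contiguity requirement can be checked in base R before calling ctapply. The sketch below is not part of the package: is_contiguous is a hypothetical helper, and it assumes the index contains no missing values.

```r
# Hypothetical helper: TRUE if every distinct value of idx forms a single run.
# Assumes idx contains no missing values.
is_contiguous <- function(idx) {
  runs <- rle(as.character(idx))
  anyDuplicated(runs$values) == 0L
}

sharded <- c("b", "b", "a", "a", "c")   # contiguous, but not sorted
mixed   <- c("a", "b", "a")             # "a" appears in two separate runs

ok_sharded <- is_contiguous(sharded)
ok_mixed   <- is_contiguous(mixed)

# Reordering makes any index contiguous, at the cost of a sort
o <- order(mixed)
ok_sorted <- is_contiguous(mixed[o])
```

When reordering is needed, the same permutation o would have to be applied to X as well so that values and index stay aligned.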

X may also be a matrix, in which case it is split row-wise based on INDEX. The number of rows must match the length of INDEX. Note that the indexed matrices behave as if drop=FALSE was used, and currently dimnames are only honored if rownames are present.

If the output is multi-dimensional, you probably want to use MERGE=rbind or MERGE=cbind instead of the default.
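
The effect of different merge functions can be illustrated in base R, under the assumption that MERGE is applied to the collection of per-group results in the manner of do.call. The list res below is a hand-made stand-in for such results, not actual ctapply output.

```r
# Hand-made stand-in for per-group results, e.g. what FUN = range might give
res <- list(A = c(1, 5), B = c(2, 9), C = c(0, 3))

flat   <- do.call(c, res)      # one flat vector of length 6
by_row <- do.call(rbind, res)  # 3 x 2 matrix, one row per group
by_col <- do.call(cbind, res)  # 2 x 3 matrix, one column per group
```

With the default merge, results of length greater than one are flattened into a single vector, whereas rbind or cbind keeps one row or column per group.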

Author

Simon Urbanek

Note

This function has been moved to the fastmatch package!

See also

tapply

Examples

# contiguous names = LETTERS with ~350k values each
l <- rep(LETTERS, rnorm(length(LETTERS), 350000, 10000))
# random values
i <- rnorm(length(l))

system.time(rt <- tapply(i, l, sum))
#>    user  system elapsed 
#>   0.594   0.096   0.690 
system.time(rc <- ctapply(i, l, sum))
#>    user  system elapsed 
#>   0.123   0.008   0.131 
## tapply always returns an array so compare the same structure
identical(rt, as.array(rc))
#> [1] TRUE

## ctapply() also works on matrices (unlike tapply)
m <- matrix(c("A","A","B","B","B","C","A","B","C","D","E","F","","X","X","Y","Y","Z"),,3)
ctapply(m, m[,1], identity, MERGE=list)
#> $A
#>      [,1] [,2] [,3]
#> [1,] "A"  "A"  ""  
#> [2,] "A"  "B"  "X" 
#> 
#> $B
#>      [,1] [,2] [,3]
#> [1,] "B"  "C"  "X" 
#> [2,] "B"  "D"  "Y" 
#> [3,] "B"  "E"  "Y" 
#> 
#> $C
#>      [,1] [,2] [,3]
#> [1,] "C"  "F"  "Z" 
#> 
ctapply(m, m[,1], identity, MERGE=rbind)
#>      [,1] [,2] [,3]
#> [1,] "A"  "A"  ""  
#> [2,] "A"  "B"  "X" 
#> [3,] "B"  "C"  "X" 
#> [4,] "B"  "D"  "Y" 
#> [5,] "B"  "E"  "Y" 
#> [6,] "C"  "F"  "Z" 
m2 <- m[,-1]
rownames(m2) <- m[,1]
colnames(m2) <- c("V1","V2")
ctapply(m2, rownames(m2), identity, MERGE=list)
#> $A
#>   V1  V2 
#> A "A" "" 
#> A "B" "X"
#> 
#> $B
#>   V1  V2 
#> B "C" "X"
#> B "D" "Y"
#> B "E" "Y"
#> 
#> $C
#>   V1  V2 
#> C "F" "Z"
#> 
ctapply(m2, rownames(m2), identity, MERGE=rbind)
#>   V1  V2 
#> A "A" "" 
#> A "B" "X"
#> B "C" "X"
#> B "D" "Y"
#> B "E" "Y"
#> C "F" "Z"