Split binary or character input into a matrix

mstrsplit takes either a raw or a character vector and splits it into a matrix according to the separators. The result is a character matrix by default; the type argument selects a different storage mode.

mstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA,
          type=c("character", "numeric", "logical", "integer",  "complex", "raw"),
          skip=0L, nrows=-1L, quote="")

Arguments

x: character vector (each element is treated as a row) or a raw vector (LF characters '\n' separate rows) to split
sep: single character: field (column) separator. Set to NA for no separator, in which case the input is treated as a single column.
nsep: row name separator (single character) or NA if no row names are included
strict: logical; if TRUE (the default), input that does not match the format (e.g. more columns than expected) causes an error. If FALSE, mstrsplit does not fail on parsing errors.
ncol: number of columns to expect. If NA then the number of columns is guessed from the first line.
type: a character string representing one of the 6 atomic types: 'character', 'numeric', 'logical', 'integer', 'complex', or 'raw'. The output matrix will use this as its storage mode and the input will be parsed directly into this format without using intermediate strings.
skip: integer: the number of lines of the data file to skip before parsing records.
nrows: integer: the maximum number of rows to read in. Negative and other invalid values are ignored and indicate that the entire input should be processed.
quote: the set of quoting characters as a length 1 vector. To disable quoting altogether, use quote = "" (the default). Quoting is only considered for columns read as character.
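
As a minimal sketch of how type, skip and nrows combine, the following assumes that mstrsplit is available from the iotools package and uses a made-up input with a header line. The variable names are illustrative only.

```r
library(iotools)  # assumption: mstrsplit is provided by the iotools package

# Hypothetical input: one header line followed by three data lines
lines <- c("x|y|z", "1|2|3", "4|5|6", "7|8|9")

# Skip the header, read at most two rows, and parse directly as numeric
m <- mstrsplit(lines, sep = "|", type = "numeric", skip = 1L, nrows = 2L)

storage.mode(m)  # storage mode follows the type argument
dim(m)           # rows read, columns expected
```

Because the fields are parsed directly into the requested type, no separate as.numeric() conversion step should be needed afterwards.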

Details

If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content with LF ('\n') characters separating lines. If the input is a character vector then each element is treated as a line.

If nsep is specified then all characters up to (but excluding) the first occurrence of nsep are treated as the row name. The remaining characters are split using the sep character into fields (columns). If ncol is NA then the first line of the input determines the number of columns. mstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE, in which case the excess columns are ignored. Lines may contain fewer columns, in which case the missing fields are set to NA.
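
The handling of short and long lines described above can be illustrated with a small sketch. It assumes mstrsplit comes from the iotools package; the inputs are invented for illustration, and the tryCatch wrapper is only there to show the error without stopping the script.

```r
library(iotools)  # assumption: mstrsplit is provided by the iotools package

# Second line has fewer fields than the first: the missing cell becomes NA
short <- c("a|b|c", "d|e")
m1 <- mstrsplit(short)
is.na(m1[2, 3])

# Second line has more fields than the first line announced
long <- c("a|b", "c|d|e")

# With the default strict = TRUE this is an error
res <- tryCatch(mstrsplit(long), error = function(e) "error")

# With strict = FALSE the extra field is dropped
m2 <- mstrsplit(long, strict = FALSE)
dim(m2)
```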

The processing is geared towards efficiency: no string re-coding is performed, and a raw input vector is processed directly, avoiding the creation of intermediate string representations.

Note that it is legal to use the same separator for sep and nsep in which case the first field is treated as a row name and subsequent fields as data columns.
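
A short sketch of the shared-separator case, again assuming mstrsplit is loaded from the iotools package and using invented keys r1 and r2:

```r
library(iotools)  # assumption: mstrsplit is provided by the iotools package

# The same character separates the row name from the data and the fields from each other
x <- c("r1|a|b", "r2|c|d")
m <- mstrsplit(x, sep = "|", nsep = "|")

rownames(m)  # the first field of each line
dim(m)       # the remaining fields form the data columns
```

This layout is convenient for key/value style input where the key is simply the first delimited field.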

Value

A matrix with as many rows as there are lines in the input and as many columns as there are fields in the first line. The storage mode of the matrix is determined by the type argument.

Author

Simon Urbanek

Examples

  c <- c("A\tB|C|D", "A\tB|B|B", "B\tA|C|E")
  m <- mstrsplit(gsub("\t","|",c))
  dim(m)
#> [1] 3 4
  m
#>      [,1] [,2] [,3] [,4]
#> [1,] "A"  "B"  "C"  "D" 
#> [2,] "A"  "B"  "B"  "B" 
#> [3,] "B"  "A"  "C"  "E" 
  m <- mstrsplit(c,, "\t")
  rownames(m)
#> [1] "A" "A" "B"
  m
#>   [,1] [,2] [,3]
#> A "B"  "C"  "D" 
#> A "B"  "B"  "B" 
#> B "A"  "C"  "E" 

  ## use raw vectors instead
  r <- charToRaw(paste(c, collapse="\n"))
  mstrsplit(r)
#>      [,1]   [,2] [,3]
#> [1,] "A\tB" "C"  "D" 
#> [2,] "A\tB" "B"  "B" 
#> [3,] "B\tA" "C"  "E" 
  mstrsplit(r, nsep="\t")
#>   [,1] [,2] [,3]
#> A "B"  "C"  "D" 
#> A "B"  "B"  "B" 
#> B "A"  "C"  "E"