imstrsplit.Rd
imstrsplit takes a binary connection or character vector (which is interpreted as a file name) and splits it into a character matrix according to the separator.
imstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA,
type=c("character", "numeric", "logical", "integer", "complex",
"raw"), max.line = 65536L, max.size = 33554432L)
x: character vector (each element is treated as a row) or a raw vector (LF characters '\n' separate rows) to split.
sep: single character, the field (column) separator. Set to NA for no separator; in other words, a single column.
nsep: row name separator (single character), or NA if no row names are included.
strict: logical. If FALSE then mstrsplit will not fail on parsing errors; otherwise input not matching the format (e.g. more columns than expected) will cause an error.
ncol: number of columns to expect. If NA then the number of columns is guessed from the first line.
type: a character string naming one of the 6 atomic types: 'character', 'numeric', 'logical', 'integer', 'complex', or 'raw'. The output matrix will use this as its storage mode, and the input will be parsed directly into this format without using intermediate strings.
max.line: maximum length of one line (in bytes); determines the size of the read buffer. The default is 65536 bytes (64 KiB).
max.size: maximum size of the chunk (in bytes). The default is 33554432 bytes (32 MiB).
If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content with LF ('\n') characters separating lines. If the input is a character vector then each element is treated as a line.
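The two input forms described above can be compared with a short sketch. It assumes the iotools package (the package that provides mstrsplit) is installed, and the sample strings are made up for illustration; the call is skipped if the package is missing.

```r
# Assumption: iotools is installed; otherwise this block does nothing.
if (requireNamespace("iotools", quietly = TRUE)) {
  # Character input: each element is one line.
  from_chr <- iotools::mstrsplit(c("a|b", "c|d"))
  # Raw input: lines are separated by LF ('\n').
  from_raw <- iotools::mstrsplit(charToRaw("a|b\nc|d\n"))
  print(from_chr)
  print(from_raw)
}
```

Both calls describe the same two lines, so they should yield matrices of the same shape.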
If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the row name. The remaining characters are split using the sep character into fields (columns). If ncol is NA then the first line of the input determines the number of columns. mstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE, in which case the excess columns are ignored. Lines may contain fewer columns than expected, in which case the missing fields are set to NA.
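The rules in this paragraph can be modeled in a few lines of base R. This is only a sketch of the documented behavior, not the package's implementation: split_rows_sketch is a hypothetical helper, it works on character input only, and it always returns a character matrix.

```r
# Hypothetical helper (not part of any package): models the documented
# rules for nsep, sep, ncol and strict on a character vector of lines.
split_rows_sketch <- function(lines, sep = "|", nsep = NA,
                              strict = TRUE, ncol = NA) {
  keys <- NULL
  if (!is.na(nsep)) {
    # Everything before the first nsep is the row name.
    pos   <- regexpr(nsep, lines, fixed = TRUE)
    keys  <- substr(lines, 1L, pos - 1L)
    lines <- substr(lines, pos + 1L, nchar(lines))
  }
  fields <- strsplit(lines, sep, fixed = TRUE)
  # With ncol = NA the first line fixes the number of columns.
  if (is.na(ncol)) ncol <- length(fields[[1L]])
  if (strict && any(lengths(fields) > ncol))
    stop("line has more columns than expected")
  # Drop excess fields (strict = FALSE) and pad short lines with NA.
  rows <- lapply(fields, function(f) f[seq_len(ncol)])
  matrix(unlist(rows), ncol = ncol, byrow = TRUE,
         dimnames = list(keys, NULL))
}

# Second line is short (padded with NA), third line is long (truncated).
m <- split_rows_sketch(c("r1\t1|2|3", "r2\t4|5", "r3\t6|7|8|9"),
                       nsep = "\t", strict = FALSE)
print(m)
```

With strict left at TRUE, the same input stops with an error because the third line has four fields where three are expected.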
The processing is geared towards efficiency: no string re-coding is performed, and a raw input vector is processed directly, avoiding the creation of intermediate string representations.
Note that it is legal to use the same separator for sep and nsep, in which case the first field is treated as a row name and subsequent fields as data columns.
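A minimal illustration of the shared-separator case, assuming the iotools package (which provides mstrsplit) is installed; the input lines are invented for the example and the call is skipped if the package is missing.

```r
# Assumption: iotools is installed; otherwise this block does nothing.
if (requireNamespace("iotools", quietly = TRUE)) {
  # "|" serves as both the row name separator and the field separator.
  m <- iotools::mstrsplit(c("a|1|2", "b|3|4"), sep = "|", nsep = "|",
                          type = "numeric")
  print(rownames(m))  # taken from the leading field of each line
  print(dim(m))       # the remaining fields form the data columns
}
```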
A matrix with as many rows as there are lines in the input and as many columns as there are fields in the first line. The storage mode of the matrix is determined by the type argument.
mm <- model.matrix(~., iris)
f <- file("iris_mm.io", "wb")
writeBin(as.output(mm), f)
close(f)
it <- imstrsplit("iris_mm.io", type="numeric", nsep="\t")
iris_mm <- it$nextElem()
print(head(iris_mm))
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> 1 1 5.1 3.5 1.4 0.2 0 0
#> 2 1 4.9 3.0 1.4 0.2 0 0
#> 3 1 4.7 3.2 1.3 0.2 0 0
#> 4 1 4.6 3.1 1.5 0.2 0 0
#> 5 1 5.0 3.6 1.4 0.2 0 0
#> 6 1 5.4 3.9 1.7 0.4 0 0
## remove iterator, connections and files
rm("it")
gc(FALSE)
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 824549 44.1 1555265 83.1 1555265 83.1
#> Vcells 11565222 88.3 20697285 158.0 20504336 156.5
unlink("iris_mm.io")