idstrsplit.Rd
idstrsplit
idstrsplit takes a binary connection or a character vector (which is
interpreted as a file name) and splits the input into a series of data
frames according to the separator.
idstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE,
max.line = 65536L, max.size = 33554432L)
x: character vector (each element is treated as a row) or a raw vector (newlines separate rows)
col_types: required character vector or list giving the classes to be assumed for the columns of the output data frame. If it is a list, class(x)[1] of each contained element is used to determine the class of the corresponding column. It is not recycled, and must be at least as long as the longest row if strict is TRUE.
Possible values are "NULL" (the column is skipped), one of the six atomic vector types ('character', 'numeric', 'logical', 'integer', 'complex', 'raw'), or 'POSIXct'.
'POSIXct' parses date-times of the form "YYYY-MM-DD hh:mm:ss.sss", assuming the GMT time zone. The separators between digits can be any non-digit characters, and only the date part is mandatory. See also fasttime::asPOSIXct for details.
sep: single character used as the field (column) separator. Set to NA for no separator; in other words, a single column.
nsep: index name separator (single character), or NA if no index names are included
strict: logical; if FALSE then dstrsplit will not fail on parsing errors, otherwise input not matching the format (e.g. more columns than expected) will cause an error.
max.line: maximum length of one line (in bytes); determines the size of the read buffer. The default is 64 KiB (65536 bytes).
max.size: maximum size of a chunk (in bytes). The default is 32 MiB (33554432 bytes).
If nsep is specified, then all characters up to (but excluding) the first occurrence of nsep are treated as the index name. The remaining characters are split into fields (columns) using the sep character. dstrsplit will fail with an error if any line contains more columns than expected, unless strict is FALSE, in which case the excess columns are ignored. Lines may contain fewer columns than expected; the missing fields are set to NA.
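As a rough illustration of the column-count rules above, the following sketch calls dstrsplit directly on in-memory character vectors instead of going through the iterator. The input strings are made up for the example, and it assumes the iotools package is installed.

```r
library(iotools)

# Three integer columns are declared, but the second line has only two fields.
short <- dstrsplit(c("1|2|3", "4|5"), col_types = rep("integer", 3), sep = "|")
# The missing third field of the second row should come back as NA.
is.na(short$V3[2])

# A line with four fields exceeds the three declared columns.
long <- "1|2|3|4"

# With strict = TRUE (the default) this is reported as an error ...
strict_result <- tryCatch(
  dstrsplit(long, col_types = rep("integer", 3), sep = "|"),
  error = function(e) "error"
)

# ... while strict = FALSE drops the extra field and keeps three columns.
lenient <- dstrsplit(long, col_types = rep("integer", 3), sep = "|",
                     strict = FALSE)
ncol(lenient)
```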
Note that it is legal to use the same separator for sep and nsep, in which case the first field is treated as a row name and subsequent fields as data columns.
If nsep is specified, the output of dstrsplit contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a data frame, unlike the case with a matrix).
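A small sketch of the row-index behaviour described above, again using dstrsplit on a made-up character vector. The tab used for nsep and the keys "a" and "b" are illustrative choices, not something the function requires.

```r
library(iotools)

# Hypothetical input: a key, a tab, then two pipe-separated integer fields.
# The key "a" appears twice on purpose.
lines <- c("a\t1|2", "a\t3|4", "b\t5|6")

df <- dstrsplit(lines, col_types = c("integer", "integer"),
                sep = "|", nsep = "\t")

# The keys are stored in an ordinary column, so duplicates are allowed.
keys <- as.character(df$rowindex)
keys
names(df)
```

Because the keys live in a regular column rather than in the rownames, they can be passed to functions such as split() or tapply() without first being made unique.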
idstrsplit returns an iterator (closure). When nextElem is called on the iterator, a data.frame is returned with as many rows as there are lines in the current chunk of input and as many columns as there are non-NULL values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number, unless col_types is a named vector, in which case the names are inherited.
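The sketch below shows one way to drain the iterator chunk by chunk. It assumes that the iterator signals exhaustion with a "StopIteration" error, which is the convention of the iterators package but is not stated in this page; the temporary file, its contents, and the column names id and value are made up for the example.

```r
library(iotools)

# Write a small pipe-separated file (made-up data) to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("%d|%.1f", 1:5, (1:5) / 2), tmp)

# Named col_types supply the column names of each returned data frame.
it <- idstrsplit(tmp, col_types = c(id = "integer", value = "numeric"),
                 sep = "|")

rows <- 0L
cols <- NULL
repeat {
  # Assumption: the end of input is reported as a "StopIteration" error.
  # Any other error is passed on unchanged.
  chunk <- tryCatch(it$nextElem(), error = function(e) {
    if (identical(conditionMessage(e), "StopIteration")) NULL else stop(e)
  })
  if (is.null(chunk) || nrow(chunk) == 0L) break
  rows <- rows + nrow(chunk)   # count the rows seen so far
  cols <- names(chunk)         # column names inherited from col_types
}

# Remove the iterator and the temporary file.
rm(it)
invisible(gc(FALSE))
unlink(tmp)
```

With input this small everything is expected to arrive in a single chunk; larger files are split according to max.size, and the loop body then runs once per chunk.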
col_names <- names(iris)
write.csv(iris, file="iris.csv", row.names=FALSE)
it <- idstrsplit("iris.csv", col_types=c(rep("numeric", 4), "character"),
                 sep=",")
# Get the first chunk; drop row 1, which holds the CSV header line
iris_read <- it$nextElem()[-1,]
# or with the iterators package
# nextElem(it)
names(iris_read) <- col_names
print(head(iris_read))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 2          5.1         3.5          1.4         0.2 "setosa"
#> 3          4.9         3.0          1.4         0.2 "setosa"
#> 4          4.7         3.2          1.3         0.2 "setosa"
#> 5          4.6         3.1          1.5         0.2 "setosa"
#> 6          5.0         3.6          1.4         0.2 "setosa"
#> 7          5.4         3.9          1.7         0.4 "setosa"
## remove iterator, connections and files
rm("it")
gc(FALSE)
#>           used (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells  818454 43.8    1555265  83.1  1262721  67.5
#> Vcells 7355520 56.2   20697285 158.0 20504336 156.5
unlink("iris.csv")