dstrfw.Rd
dstrfw
takes raw or character vector and splits it
into a dataframe according to a vector of fixed widths.
dstrfw(x, col_types, widths, nsep = NA, strict=TRUE, skip=0L, nrows=-1L)
character vector (each element is treated as a row) or a raw vector (newlines separate rows)
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
class(x)[1]
will be used to determine the class of the
contained element. It will not be recycled, and must
be at least as long as the longest row if strict
is TRUE
.
Possible values are "NULL"
(when the column is skipped) one of
the six atomic vector types ('character'
, 'numeric'
,
'logical'
, 'integer'
, 'complex'
, 'raw'
)
or POSIXct
.
'POSIXct' will parse date format in the form "YYYY-MM-DD hh:mm:ss.sss"
assuming GMT time zone. The separators between digits can be any
non-digit characters and only the date part is mandatory. See also
fasttime::asPOSIXct
for details.
a vector of widths of the columns. Must be the same length
as col_types
.
index name separator (single character) or NA
if no
index names are included
logical, if FALSE
then dstrsplit
will not
fail on parsing errors, otherwise input not matching the format
(e.g. more columns than expected) will cause an error.
integer: the number of lines of the data file to skip before beginning to read data.
integer: the maximum number of rows to read in. Negative and other invalid values are ignored.
If nsep
is specified, the output of dstrsplit
contains
an extra column called 'rowindex' containing the row index. This is
used instead of the rownames to allow for duplicated indicies (which
are checked for and not allowed in a dataframe, unlike the case with
a matrix).
If nsep
is specified then all characters up to (but excluding)
the occurrence of nsep
are treated as the index name. The
remaining characters are split using the widths
vector into
fields (columns). dstrfw
will fail with an error if any
line does not contain enough characters to fill all expected columns,
unless strict
is FALSE
. Excessive columns are ignored
in that case. Lines may contain fewer columns (but not partial ones
unless strict
is FALSE
) in which case they are set to
NA
.
dstrfw
returns a data.frame with as many rows as
they are lines in the input and as many columns as there are
non-NA values in col_types
, plus an additional column if
nsep
is specified. The colnames (other than the row index)
are set to 'V' concatenated with the column number unless
col_types
is a named vector in which case the names are
inherited.
input = c("bear\t22.7horse+3", "pear\t 3.4mouse-3", "dogs\t14.8prime-8")
z = dstrfw(x = input, col_types = c("numeric", "character", "integer"),
width=c(4L,5L,2L), nsep="\t")
z
#> rowindex V1 V2 V3
#> 1 bear 22.7 horse 3
#> 2 pear 3.4 mouse -3
#> 3 dogs 14.8 prime -8
# Now without row names (treat seperator as a 1 char width column with type NULL)
z = dstrfw(x = input,
col_types = c("character", "NULL", "numeric", "character", "integer"),
width=c(4L,1L,4L,5L,2L))
z
#> V1 V3 V4 V5
#> 1 bear 22.7 horse 3
#> 2 pear 3.4 mouse -3
#> 3 dogs 14.8 prime -8