chunk.Rd
chunk.reader creates a reader that will read from a binary connection in chunks while preserving the integrity of lines.
read.chunk reads the next chunk using the specified reader.
chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)
binary connection or character (which is interpreted as file name) specifying the source
maximum length of one line (in bytes); determines the size of the read buffer, default is 64 KiB
optional string: key separator if key-aware chunking is to be used. The prefix of each line up to the separator
character is considered a key and subsequent records holding the same key are guaranteed to be returned in the same chunk
reader object as returned by chunk.reader
maximum size of the chunk (in bytes), default is 32 MiB
numeric, timeout (in seconds) for reads if source is a raw file descriptor.
chunk.reader is essentially a filter that converts a binary
connection into chunks that can subsequently be parsed into data while
preserving the integrity of input lines. read.chunk is used to
read the actual chunks. The implementation is kept very thin to prevent
copying of large vectors, for best efficiency.
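The idea of chunking without splitting lines can be illustrated with a small base-R sketch. This is not the package's implementation; make_reader and next_chunk are hypothetical names used only for illustration, and the sketch ignores the max.line limit described above.

```r
# Hypothetical sketch: read fixed-size blocks from a binary connection and
# return only whole lines, carrying any incomplete trailing line forward.
make_reader <- function(con) {
  reader <- new.env()
  reader$con <- con
  reader$tail <- raw(0)              # bytes of an unfinished line
  reader
}

next_chunk <- function(reader, max.size = 1024L) {
  repeat {
    new <- readBin(reader$con, what = "raw", n = max.size)
    buf <- c(reader$tail, new)
    if (length(new) == 0L) {         # end of input: flush whatever is left
      reader$tail <- raw(0)
      return(buf)
    }
    nl <- which(buf == as.raw(10L))  # positions of newline bytes
    if (length(nl)) {
      cut <- nl[length(nl)]          # the last complete line ends here
      reader$tail <- buf[-seq_len(cut)]
      return(buf[seq_len(cut)])
    }
    reader$tail <- buf               # no complete line yet, keep reading
  }
}

con <- rawConnection(charToRaw("a,1\nb,2\nc,3\n"))
rd <- make_reader(con)
chunks <- list()
while (length(ch <- next_chunk(rd, max.size = 5L))) {
  chunks[[length(chunks) + 1L]] <- ch
}
close(con)
```

With a deliberately tiny max.size of 5 bytes, every returned chunk still ends on a line boundary, because the partial line is held back until its newline arrives.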
If sep is set to a string, it is treated as a single-character
separator. If specified, the prefix of each input line up to the
specified character is treated as a key and subsequent lines with the
same key are guaranteed to be processed in the same chunk. Note that
this implies that the chunk size is practically unbounded, since
multiple chunks may have to be accumulated to satisfy this condition.
This increases the processing and memory overhead.
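The key-aware rule can be sketched in base R as follows. This is an illustration of the condition only, not the package's code; split_at_key_boundary is a hypothetical helper that works on already-split lines and assumes a tab as the separator.

```r
# Hypothetical sketch: given the lines read so far, emit everything up to the
# last key change and carry the trailing run of the final key forward, since
# later input may still contain more lines with that key.
split_at_key_boundary <- function(lines, sep = "\t") {
  stopifnot(length(lines) > 0L)
  # key = prefix of each line up to (not including) the first separator
  keys <- substr(lines, 1L, regexpr(sep, lines, fixed = TRUE) - 1L)
  n <- length(keys)
  run_start <- n
  while (run_start > 1L && keys[run_start - 1L] == keys[n]) {
    run_start <- run_start - 1L
  }
  list(chunk = lines[seq_len(run_start - 1L)],  # safe to process now
       carry = lines[run_start:n])              # must wait for more input
}

res <- split_at_key_boundary(c("a\t1", "a\t2", "b\t3", "b\t4"))
same <- split_at_key_boundary(c("a\t1", "a\t2"))
```

Here res$chunk holds the two "a" lines and res$carry the two "b" lines. When every line shares one key, as in same, nothing can be emitted yet and the whole buffer is carried, which is why the effective chunk size has no fixed upper bound.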
In addition to connections, chunk.reader supports raw file
descriptors (integers of the class "fileDescriptor"). In that
case the reads are performed directly by chunk.reader and
timeout can be used to perform non-blocking or timed
reads (Unix only, not supported on Windows).
chunk.reader returns an object that can be used by
read.chunk. If source is a string, it is equivalent to
calling chunk.reader(file(source, "rb"), ...).
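A minimal usage sketch of the two functions, using the signatures shown in the usage section. It assumes the functions are provided by the iotools package and that read.chunk signals the end of input with a zero-length result; both are assumptions not stated on this page. The temporary file and the total counter are illustrative only.

```r
# Write a small input file in binary mode so the byte count is predictable.
tmp <- tempfile()
writeBin(charToRaw("a,1\nb,2\nc,3\n"), tmp)

total <- 0L
if (requireNamespace("iotools", quietly = TRUE)) {
  # A file name is accepted directly as the source.
  reader <- iotools::chunk.reader(tmp)
  # Assumption: an empty result marks the end of input; length(NULL) is
  # also 0, so the loop stops on a timeout as well.
  while (length(chunk <- iotools::read.chunk(reader))) {
    total <- total + length(chunk)   # chunk is a raw vector
  }
}
```

Each chunk is a raw vector, so the loop only counts bytes here; real code would hand every chunk to a parser instead.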
read.chunk returns a raw vector holding the next chunk, or
NULL if the timeout was reached. It is deliberate that
read.chunk does NOT return a character vector since that
would result in a high performance penalty. Please use the
appropriate parser to convert the chunk into data, see