Functions for very fast chunk-wise processing

chunk.reader creates a reader that will read from a binary connection in chunks while preserving integrity of lines.

read.chunk reads the next chunk using the specified reader.

chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)

Arguments

source: binary connection or character (which is interpreted as file name) specifying the source
max.line: maximum length of one line (in bytes) - determines the size of the read buffer, default is 64 KiB
sep: optional string: key separator if key-aware chunking is to be used. The part of each line up to this character is considered a key and subsequent records holding the same key are guaranteed to be in the same chunk.

reader: reader object as returned by chunk.reader
max.size: maximum size of the chunk (in bytes), default is 32 MiB
timeout: numeric, timeout (in seconds) for reads if source is a raw file descriptor.

Details

chunk.reader is essentially a filter that converts a binary connection into chunks that can subsequently be parsed into data while preserving the integrity of input lines. read.chunk is used to read the actual chunks. The implementation is very thin to prevent copying of large vectors, for best efficiency.
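A minimal sketch of the read loop described above. The sample file and the variable names are made up for illustration, and the loop assumes that read.chunk returns an empty result once the input is exhausted, which this page does not state explicitly.

```r
library(iotools)

# Hypothetical sample data written to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(sprintf("row%d|%d", 1:1000, 1:1000), path)

reader <- chunk.reader(path)   # a file name is opened as a binary connection

total_bytes <- 0
n_chunks <- 0L
# Assumption: an empty (zero-length) result marks the end of the input
while (length(chunk <- read.chunk(reader)) > 0L) {
  # each chunk is a raw vector holding complete lines
  total_bytes <- total_bytes + length(chunk)
  n_chunks <- n_chunks + 1L
}
```

With the default max.size a file this small will most likely arrive as a single chunk; the loop shape is the same for larger inputs.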

If sep is set to a string, it is treated as a single-character separator. If specified, the prefix of each input line up to that character is treated as a key, and subsequent lines with the same key are guaranteed to be processed in the same chunk. Note that this means the chunk size is effectively unbounded, since several chunks may have to be accumulated to satisfy this condition, which increases the processing and memory overhead.
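The key guarantee can be illustrated with a small base R sketch. This is not the package's implementation: key_of and key_aware_chunks are hypothetical helpers, and the limit is counted in lines rather than bytes to keep the example short. It only shows why a chunk may grow past its nominal limit when consecutive lines share a key.

```r
# Hypothetical helper: extract the key, i.e. the prefix up to the separator
key_of <- function(lines, sep) {
  pos <- regexpr(sep, lines, fixed = TRUE)
  # assumption: a line without a separator is treated as its own key
  ifelse(pos > 0L, substr(lines, 1L, pos - 1L), lines)
}

# Hypothetical helper: cut `lines` into chunks of about `max_lines` lines,
# never separating consecutive lines that hold the same key
key_aware_chunks <- function(lines, sep, max_lines) {
  keys <- key_of(lines, sep)
  n <- length(lines)
  chunks <- list()
  start <- 1L
  while (start <= n) {
    end <- min(start + max_lines - 1L, n)
    # extend past the nominal limit while the next line has the same key
    while (end < n && keys[end + 1L] == keys[end]) end <- end + 1L
    chunks[[length(chunks) + 1L]] <- lines[start:end]
    start <- end + 1L
  }
  chunks
}

lines <- c("a\t1", "a\t2", "a\t3", "b\t4", "c\t5", "c\t6")
chunks <- key_aware_chunks(lines, sep = "\t", max_lines = 2L)
# The first chunk holds all three "a" lines although the limit is 2
```

With the real reader, the corresponding call would be chunk.reader(source, sep = "\t").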

In addition to connections, chunk.reader supports raw file descriptors (integers of the class "fileDescriptor"). In that case the reads are performed directly by chunk.reader and timeout can be used to perform non-blocking or timed reads (Unix only, not supported on Windows).

Value

chunk.reader returns an object that can be used by read.chunk. If source is a string, it is equivalent to calling chunk.reader(file(source, "rb"), ...).
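As a quick check of the equivalence stated above, the following sketch reads the same hypothetical file once via its name and once via an explicit binary connection, then compares the first chunk from each reader.

```r
library(iotools)

# Hypothetical sample file, written in binary mode so the bytes are fixed
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("x,1\ny,2\n"), path)

r1 <- chunk.reader(path)        # source given as a file name
con <- file(path, "rb")
r2 <- chunk.reader(con)         # source given as a binary connection

c1 <- read.chunk(r1)
c2 <- read.chunk(r2)
same <- identical(c1, c2)       # both readers should yield the same bytes

close(con)
```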

read.chunk returns a raw vector holding the next chunk, or NULL if the timeout was reached. It is deliberate that read.chunk does NOT return a character vector, since that would result in a high performance penalty. Please use the appropriate parser to convert the chunk into data, see mstrsplit.
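A short sketch of turning a raw chunk into data with mstrsplit. The file contents and the "|" separator are chosen only for illustration.

```r
library(iotools)

# Hypothetical sample file with three "|"-separated fields per line
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("a|1|x\nb|2|y\nc|3|z\n"), path)

chunk <- read.chunk(chunk.reader(path))   # raw vector, not character
m <- mstrsplit(chunk, sep = "|")          # character matrix, one row per line
```

Parsing the raw vector directly avoids building an intermediate character vector of lines.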

Author

Simon Urbanek