chunk {iotools}R Documentation

Functions for very fast chunk-wise processing


chunk.reader creates a reader that will read from a binary connection in chunks while preserving integrity of lines.

read.chunk reads the next chunk using the specified reader.


chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)



binary connection or character (which is interpreted as file name) specifying the source


maximum length of one line (in byets) - determines the size of the read buffer, default is 64kb


optional string: key separator if key-aware chunking is to be used

character is considered a key and subsequent records holding the same key are guaranteed to be


reader object as returned by chunk.reader


maximum size of the chunk (in bytes), default is 32Mb


numeric, timeout (in seconds) for reads if source is a raw file descriptor.


chunk.reader is essentially a filter that converts binary connection into chunks that can be subsequently parsed into data while preserving the integrity of input lines. read.chunk is used to read the actual chunks. The implementation is very thin to prevert copying of large vectors for best efficiency.

If sep is set to a string, it is treated as a single-character separator character. If specified, prefix in the input up to the specified character is treated as a key and subsequent lines with the same key are guaranteed to be processed in the same chunk. Note that this implies that the chunk size is practically unlimited, since this may force accumulation of multiple chunks to satisfy this condition. Obviously, this increases the processing and memory overhead.

In addition to connections chunk.reader supports raw file descriptors (integers of the class "fileDescriptor"). In that case the reads are preformed directly by chunk.reader and timeout can be used to perform non-blocking or timed reads (unix only, not supported on Windows).


chunk.reader returns an object that can be used by read.chunk. If source is a string, it is equivalent to calling chunk.reader(file(source, "rb"), ...).

read.chunk returns a raw vector holding the next chunk or NULL if timeout was reached. It is deliberate that read.chunk does NOT return a character vector since that would reasult in a high performance penalty. Please use the appropriate parser to convert the chunk into data, see mstrsplit.


Simon Urbanek

[Package iotools version 0.3-3 Index]