---------------------------------------------------------------------------
Adding extra symbols to a byte stream

Problem: sometimes we have to merge multiple streams into one, in which case it's necessary to provide a way to identify block boundaries within a stream.

1. From the decoding side, the best way is to have length prefixes for blocks. But on the encoding side, it requires either random access to the output file (seek to the stream start and write the header), or being able to cache whole streams, which is, in general, impossible.

2. Alternatively, we can add length headers (+ some flags) to blocks of cacheable size. It's surely a solution, but handling is much more complex than [1], especially at encoding (presuming that i/o operations are done with aligned fixed-size blocks).

Well, one possible implementation is to write a 0 byte into the buffer, then stream data until it's filled. So prefix byte = 0 would mean that there are bufsize-1 bytes of stream data next, and !=0 would mean that there's less... in which case we would be able to insert another prefix byte if end-of-stream is reached.

This would only work with bufsize=32k or so, because otherwise the block length would require 3+ bytes to store, and there would be a problem with handling the case where end-of-stream is reached with only one byte of free space left in the buffer. (One solution to that would be storing 2-byte prefixes in each buffer and adding a 3rd byte when necessary; another is to provide a 2-byte encoding for some special block lengths like bufsize-2).

Either way it's not so good, because even 1 extra byte per 64k would accumulate to a noticeable number with large files (1526 bytes per 100M). Also, hardcoding the block size into the format is bad too.

3. Escape prefix. E.g. EC 4B A7 00 = EC 4B A7, EC 4B A7 01 = end-of-stream. Now this is really easy to encode, but decoding is pretty painful: it requires a messy state machine even to extract single bytes.
But overall it adds the least overhead, so it seems that we still need to find a good implementation for buffered decoding.

3a. Escape prefix with all same bytes (e.g. FF FF FF). Much easier to check, but runs of the same byte in the stream would produce a huge overhead (like 25%), and such runs are not unlikely for any byte value chosen as the escape code.

3b. Escape postfix. Store the payload byte before the marker - then the decoder just has to skip 1 byte before a masked marker, and 4 bytes for a control code. So this basically introduces a fixed 4-byte delay for the decoder, while [3] has a complex path where marker bytes have to be returned one by one. Still, with [3] the encoder is much simpler (it just has to write an extra 0 when the marker matches), and this doesn't really simplify the buffer processing.
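
To make the encode/decode asymmetry of [3] concrete, here is a minimal Python sketch of the escape-prefix scheme with the marker EC 4B A7 and the codes 00 (literal marker) and 01 (end-of-stream) from the example above. The function names, the whole-buffer interface and the error handling are assumptions made for illustration, not part of the notes. The matching logic also assumes a marker whose first byte does not repeat inside it (true for EC 4B A7, not true for the FF FF FF marker of [3a]).

```python
MARKER = bytes.fromhex("EC 4B A7")  # escape marker from the example in [3]
ESC_LITERAL = 0x00                  # MARKER 00 -> the marker bytes were payload
ESC_END = 0x01                      # MARKER 01 -> end-of-stream


def encode(payload: bytes) -> bytes:
    """Copy payload, writing an extra 00 after every marker occurrence."""
    out = bytearray()
    k = 0  # number of marker bytes matched so far
    for b in payload:
        out.append(b)
        if b == MARKER[k]:
            k += 1
            if k == len(MARKER):
                out.append(ESC_LITERAL)
                k = 0
        elif b == MARKER[0]:
            k = 1
        else:
            k = 0
    out += MARKER
    out.append(ESC_END)
    return bytes(out)


def decode(data: bytes, pos: int = 0):
    """Decode one stream starting at pos; return (payload, next position).

    Bytes that match a marker prefix are held back, because they may turn
    out to belong to the end-of-stream code rather than to the payload.
    """
    out = bytearray()
    k = 0  # number of held-back marker bytes
    i = pos
    while i < len(data):
        b = data[i]
        i += 1
        if k == len(MARKER):          # full marker seen: b is the control code
            if b == ESC_LITERAL:
                out += MARKER         # the marker was payload after all
                k = 0
            elif b == ESC_END:
                return bytes(out), i
            else:
                raise ValueError("unknown control code")
        elif b == MARKER[k]:
            k += 1                    # keep holding back
        else:
            out += MARKER[:k]         # false alarm: release held-back bytes
            if b == MARKER[0]:
                k = 1
            else:
                out.append(b)
                k = 0
    raise ValueError("missing end-of-stream code")


# Two streams merged into one and separated again.
merged = encode(b"first") + encode(bytes.fromhex("00 EC 4B A7 FF"))
first, pos = decode(merged)
second, pos = decode(merged, pos)
```

The encoder only ever appends, while the decoder has to hold back up to three bytes and release them one by one on a mismatch. That held-back state is what makes a buffered, byte-at-a-time decoder awkward.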
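
The overhead figures quoted in [2] and [3a] can be checked with a few lines of Python. This is only a sanity check under assumptions not stated in the notes: "100M" is read as 10^8 bytes, "64k" as 65536 bytes, and the [3a] case is a pure run of the escape byte with one extra byte written after every 3 matching bytes. The function names are made up for this sketch.

```python
def header_overhead(total_bytes, block_size, header_bytes=1):
    """Header bytes spent when every block of block_size carries a header."""
    blocks = -(-total_bytes // block_size)  # ceiling division
    return blocks * header_bytes


def run_escape_overhead(run_length, marker_length=3):
    """Escape bytes added to a run of the escape byte itself, as in [3a]."""
    return run_length // marker_length


# [2]: one prefix byte per 64k block over a 100M file.
print(header_overhead(100_000_000, 64 * 1024))  # 1526

# [3a]: a run of 3000 escape bytes gains 1000 extra bytes.
extra = run_escape_overhead(3000)
print(extra / 3000)            # share relative to the input, about 0.333
print(extra / (3000 + extra))  # share of the output, 0.25
```

Under these assumptions the "like 25%" in [3a] matches the share of escape bytes in the output; measured against the input, the same run grows by about a third.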