For instance, consider the case of a backup gone wrong and the first 100k of your backup.tar.gz went missing for some reason. Why isn’t is possible to start reading a gzip file (or many other compressed formats) from some position in the middle of the file?
I want to give two answers. The first one is a general one. Compressed formats like gzip are often stream formats. The stream encoder (the sender) and the decoder (the receiver) of the data are both stateful. At the beginning of the file they start from a defined initial state and with every byte (or sometimes even bit) processed both advance their state in lockstep.
Why is that a good idea? For one thing, it is efficient because the state doesn't need to be transferred explicitly between encoder and decoder. Streaming formats usually also limit the amount of state to keep at any time, so you can decode it without having to keep all of the decoded data as it would be for a random access data format.
The downside of a streaming format is that if you start to read from an arbitrary location inside the stream you don’t know with which state to initialize your decoder, so the decoder doesn’t know how to decode the stream. So, this is the first answer.
The second answer is specific for a gzip / deflate compressed stream. gzip is just a thin wrapper around the deflate algorithm which is specified in RFC 1951. Remember, the deflate algorithm is a combination of the LZ77 algorithm and Huffman coding. LZ77 tries to replace a sequence of bytes with a back reference into the past of the stream. I recommend Julia Evan's cool visualization about how "The Raven" compresses with LZ77.
The output of the LZ77 algorithm is a stream of symbols where each symbol either represents a literal uncompressed byte or a back reference of a certain length into the last 32kb data of the input. These symbols are then efficiently encoded into bits using Huffman coding. The idea of Huffman coding is to compress a byte with less than 8 bits. Sounds magic? Yes, usually you need 8 bits to transfer one of 256 symbols. But only if these symbols are uniformly distributed. If there's any other distribution, the distribution of symbols already contains some information that you can transfer beforehand (as the Huffman tree in this case) and then transfer the actual bytes with less than 8 bits per byte in average by using short codes for frequent symbols and long codes for infrequent ones.
Here is how gzip/deflate encoders and decoders work in theory:
- Encoder: find back references (LZ77) -> Huffman encode
- Decoder: Huffman decode -> resolve back references
A bit we read on a deflate compressed stream is first passed into the Huffman decoder state machine until that reaches a leaf in the Huffman tree in which case an LZ77 symbol is passed on to the LZ77 decoder. The symbol either represents either a literal byte which can be directly emitted or a back reference into the compression window.
Therefore, the decompressor has two stateful components: a state in the Huffman decoder and the contents of the decompression window to resolve LZ77 back references against. Both these states are missing if we start to read in the middle of a deflate stream. We cannot know where we are in Huffman state machine (though, it seems possible to make good guesses). And even if we could Huffman-decode the stream we could only decode literal bytes but wouldn't be able to resolve back references because we missed the past bytes of the stream.
With the question answered, I wondered if it is inevitable that stream formats cannot be read from the middle. Of course, it is not :)
Many stream formats feature special synchronization byte sequences (which are otherwise unlikely to appear) to mark positions from where the decoder can restart decoding from a defined state. Adding synchronization is a trade-off, however. Streams are often used as an efficient way to represent data. This is in conflict with adding extra data. The synchronization markers add redundant data but, more importantly, each synchronization point also has to reset the internal state. So, every extra knowledge a decoder may have acquired while reading from the beginning must be discarded for the benefit of a decoder that starts from later in the stream.
Popular stream formats that employ synchronization markers are live audio or video streams for which a consumer might want to join the stream later. Here, it makes perfectly sense to add synchronization points, e.g. every second of the data for the following reasons:
- a late consumer probably doesn't care about skipping to the next synchronization point
- (video) compression uses correlation between subsequent frames to shrink data, state that is useful for compression will probably only be found in frames of the same scene, so it is no loss to discard any state at a cut between scenes (which offers itself as an synchronization point)
This is different for gzip compressed data for which no general statement about data correlation can be made. Nevertheless, there is some half-official way to synchronize inside of a gzip/deflate stream. This document about "ZLIB flush modes" describes various ways the zlib library can insert synchronization points into a deflate stream. To be actually able to start reading from any synchronization point, aside from emitting a synchronization sequence, you must also reset both LZ77 (i.e. discard the compression window) and the Huffman encoding. A "Sync Flush" emits an empty uncompressed block whose header contains the well-known byte sequence
00 00 FF FF and then starts a new compressed block. This resets the Huffman encoding. However, it doesn't reset the LZ77 encoding so you may still encounter back references to data that appeared before the synchronization point. For this reason a Sync Flush is not used as an actual synchronization point to allow starting to read from this position but instead it is used to fill the output buffer with padding bits so that full bytes can emitted. To reset the LZ77 compression window, as well, you need to emit a "Full Flush" which looks exactly the same but which guarantees that the encoder won't emit back references that point to data from before the synchronization point.
Here are some links to read more about gzip/deflate compression.
- RFC 1951 "DEFLATE Compressed Data Format Specification": https://tools.ietf.org/html/rfc1951
- Julia Evan's blog posts about gzip: http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/ and http://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/
- A very detailed post about "Dissecting the GZIP format": http://www.infinitepartitions.com/art001.html
- Zlib Flush Modes: http://www.bolet.org/~pornin/deflate-flush.html
- Reconstructing corrupt DEFLATEd files: http://www.dfrws.org/2011/proceedings/19-351.pdf
- A presentation about synchronization of Huffman codes: http://www.mimuw.edu.pl/~fmurlak/na_katedre/Marek.ppt
- Information about "Efficient Huffman Decoding": http://www.commandlinefanatic.com/cgi-bin/showarticle.cgi?article=art007
In fact, in deflate it's not bytes (256 symbols) that are Huffman-encoded but either a byte or a back reference. These are represented by one of 285 symbols for which you would usually need 9 bits (with several values unused). See the spec for the details. ↩︎
In practice, it is a little bit more complex as the complete stream can be divided in separate blocks on the Huffman level to allow different Huffman trees for different parts of the file which makes sense as symbol distribution may change over the course of a stream of data. ↩︎