Maximum compression ratio with Zip/Gzip/Deflate

Quite a while ago an issue was reported for Akka HTTP where incoming data, compressed using Content-Encoding: gzip, was uncompressed without applying any limits, in the worst case leading to an OutOfMemoryError and a shutdown of the server. This can be used for a classic denial-of-service attack, a "zip bomb", where compression massively reduces the amount of data an attacker has to send to bring down a machine.[1]
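The underlying safeguard is simply to cap how much data a piece of compressed input may expand to. Here is a minimal, hypothetical sketch in Python using zlib (not the actual Akka HTTP fix); the 10 MB limit is an arbitrary example value:

import zlib

MAX_OUTPUT = 10 * 1024 * 1024  # arbitrary example limit for the decompressed size

def bounded_gunzip(data: bytes) -> bytes:
    decompressor = zlib.decompressobj(wbits=47)   # 32 + 15: accept gzip or zlib headers
    out = decompressor.decompress(data, MAX_OUTPUT)
    if decompressor.unconsumed_tail:
        # producing more output would exceed the limit: likely a zip bomb
        raise ValueError("decompressed data exceeds limit")
    return out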

To reproduce the attack I gzip-compressed files of various sizes that contained a repetition of a single character for maximum compression. It turned out that there seems to be a hard compression ratio limit with gzip. This is somewhat surprising as, in general, with run-length encoding you could represent such a repetition in O(log n) compressed bytes.

Here's a table of various compressed file sizes (all bytes 0):

Original size (bytes)   Gzip size (bytes)   Ratio
1000                    29                  34.5 : 1
1000000                 1003                997 : 1
1000000000              970501              1030 : 1
10000000000             9704731             1030 : 1

So, the ratio seems to converge to approximately 1030 : 1. Why is that the case? We'll try to find out in the rest of this post.
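If you want to reproduce these numbers yourself, a quick sketch (assuming Python's gzip module; the exact sizes will differ slightly depending on the gzip implementation and compression level, and the 1 GB case needs that much memory):

import gzip

for n in (1000, 1000000, 1000000000):
    compressed = gzip.compress(b"\x00" * n)
    print(n, len(compressed), round(n / len(compressed), 1))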

Let's have a look at what these 1000000 bytes look like in compressed form:

$ dd if=/dev/zero bs=1000 count=1000 status=none|gzip|hexdump -C 
00000000  1f 8b 08 00 19 85 d9 5b  00 03 ed c1 01 01 00 00  |.......[........|
00000010  00 82 20 ff af 6e 48 40  01 00 00 00 00 00 00 00  |.. ..nH@........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000003e0  00 af 06 9e cb 79 12 40  42 0f 00                 |.....y.@B..|
000003eb

Weird, most of the file is just zero bytes (which hexdump abbreviates with *). Let's have a look at the specifications for gzip, RFC 1951 (DEFLATE) and RFC 1952 (GZIP) to make sense of it.

When I do these manual decoding exercises, I usually copy the hex data of a file into an editor and then snip off bytes from the beginning and annotate them with their meaning, keeping the tail of undecoded data at the bottom. If you want to follow along you can copy the data from the above hexdump into an editor and follow the steps.

Gzip is just a small wrapper around deflate compressed data. The Gzip header is at least 10 bytes:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

For our file:
         
ID1 + ID2: 1f 8b 
CM:        08 -> DEFLATE compressed
FLG:       00 -> no extra data
MTIME:     19 85 d9 5b -> time of compression
XFL:       00 -> no extra flags
OS:        03 -> Unix OS

As no flags (FLG) are set, the rest of the file is DEFLATE compressed data (if you run gzip <filename> instead of piping the input in, the gzip header also contains the original filename). At the end of a gzip file there's a trailer containing a CRC32 over the uncompressed data (here: 9e cb 79 12) and 4 bytes with the size of the uncompressed data modulo 2^32 (here: 40 42 0f 00 => 0xf4240 = 1000000).
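Here's a small sketch of picking those fields apart programmatically (assuming Python; struct's "<" prefix gives the little-endian layout gzip uses for MTIME and the trailer, and the asserts only cover the flag-free header of our particular file):

import struct

def parse_gzip_frame(data: bytes):
    id1, id2, cm, flg, mtime, xfl, os_ = struct.unpack("<BBBBIBB", data[:10])
    assert (id1, id2) == (0x1F, 0x8B) and cm == 8   # gzip magic, DEFLATE
    assert flg == 0                                  # no optional header fields present
    crc32, isize = struct.unpack("<II", data[-8:])   # trailer
    return data[10:-8], crc32, isize                 # raw DEFLATE stream plus trailer values

# For our example file, isize comes out as 1000000.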

Now that we've dealt with all the Gzip-related data, we can decode the DEFLATE compressed data which starts at offset 0x0a:

ed c1 01 01 00 00 00 82 20 ff af 6e 48 40 01 00
# Lots of 00 bytes
00 af 06

DEFLATE compressed data needs to be interpreted as a bitstream. The order in which to read the bits from the bytes and how to interpret them is the single most confusing thing about the DEFLATE specification. RFC 1951 says the bits of a byte are ordered like this:

         +--------+
         |76543210|
         +--------+

That is, the least-significant bit of each byte needs to be read first. Let's convert our data byte by byte into a bit stream:

ed       c1       01       01       00       00       00       82

11101101 11000001 00000001 00000001 00000000 00000000 00000000 10000010

20       ff       af       6e       48       40       01       00
00100000 11111111 10101111 01101110 01001000 01000000 00000001 00000000

Lots of 00 bytes

af       06
10101111 00000110

Now we have to read the bits in each byte from right to left to get the final bitstream:

10110111 10000011 10000000 10000000 00000000 00000000 00000000 01000001

00000100 11111111 11110101 01110110 00010010 00000010 10000000 00000000

Lots of 0 bits

11110101 01100000

We can now just concatenate the bits and start reading DEFLATE blocks.

1011011110000011100000001000000000000000000000000000000001000001
0000010011111111111101010111011000010010000000101000000000000000

Lots of 0 bits

1111010101100000
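The conversion we just did by hand is mechanical, so here's a small sketch that produces the same bit string (least-significant bit of each byte first):

def to_bitstring(data: bytes) -> str:
    # format each byte MSB-first, then reverse it so its LSB comes first
    return "".join(format(b, "08b")[::-1] for b in data)

to_bitstring(bytes.fromhex("ed c1 01 01 00 00 00 82"))
# '1011011110000011100000001000000000000000000000000000000001000001'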

The first bit of a DEFLATE block defines if it is the last block and the next two bits specify which compression type was used for that block. DEFLATE defines 3 compression types:

  • 00 = uncompressed
  • 01 = predefined Huffman tables
  • 10 = custom Huffman tables
  • 11 = reserved

For our data:

1  - BFINAL => Last block
01 - BTYPE = 10 => custom Huffman tables

As you can see, here's another oddity: when we read a fixed number of bits (like for the compression type), we need to reverse them again before interpreting them. In our case, the compression uses custom Huffman tables. That makes sense because our original file contains very specific data (only zero bytes) and is long enough to warrant the extra overhead of including a set of custom Huffman tables. As you will see, most of the following non-zero data is the definition of those custom Huffman tables.
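To make the bit-reading rules concrete, here is a tiny reader sketch (names are my own, not from any library): single bits come out of each byte least-significant bit first, and an n-bit field is assembled with the first bit read as its least-significant bit, which is exactly the "reverse before interpreting" effect described above:

class BitReader:
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0              # pos counts bits, not bytes

    def read_bit(self) -> int:
        bit = (self.data[self.pos >> 3] >> (self.pos & 7)) & 1   # LSB of each byte first
        self.pos += 1
        return bit

    def read_bits(self, n: int) -> int:
        # the first bit read becomes the least-significant bit of the result
        return sum(self.read_bit() << i for i in range(n))

reader = BitReader(bytes.fromhex("ed c1"))
reader.read_bit()     # 1    -> BFINAL: last block
reader.read_bits(2)   # 0b10 -> BTYPE: custom Huffman tables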

DEFLATE is a combination of dictionary compression (LZ77) and Huffman coding. LZ77 represents data as a sequence of literal bytes and references to data seen before (as a tuple of offset into previous data and length of data to copy). The LZ77 sequences are then efficiently encoded into a binary stream using Huffman coding. DEFLATE uses two Huffman tables. The first table is the LITLEN table. A LITLEN symbol either encodes a literal byte directly or announces a reference of a particular length. If it is a reference, we use the second Huffman table, the distance table, to determine the offset of the reference.
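The interplay of literals and references is easiest to see in code. A reference copies byte by byte, so its length may exceed its distance; with a distance of 1 it simply repeats the previous byte over and over. A small sketch (my own helper, not from any library):

def resolve_lz77(ops):
    # ops: a sequence of literal bytes or (distance, length) references
    out = bytearray()
    for op in ops:
        if isinstance(op, tuple):
            distance, length = op
            for _ in range(length):          # byte by byte, so the copy may overlap
                out.append(out[-distance])   # the bytes it is producing itself
        else:
            out.append(op)
    return bytes(out)

resolve_lz77([0x61, 0x62, (2, 6)])   # b'abababab': "copy 6 bytes from distance 2"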

The standard defines a set of fixed tables to be used for blocks where including the tables themselves would be too much overhead (BTYPE = 01). In our case, however, we use custom Huffman tables and so the bitstream starts with a definition of the custom tables. These tables themselves are defined using another type of Huffman table, the code length codes.

Section 3.2.7 of RFC 1951 defines how the custom Huffman tables should be constructed.

It starts with some constants to define how big the tables will be. Let's continue reading:

10111 = HLIT:  number of LITLEN codes - 257; reversed 11101 = 29, so 29 + 257 = 286
10000 = HDIST: number of distance codes - 1; reversed 00001 = 1, so 1 + 1 = 2
0111  = HCLEN: number of code length codes - 4; reversed 1110 = 14, so 14 + 4 = 18

So, our custom definition contains 286 LITLEN codes and 2 distance codes. They are represented using the 18 code length codes of the code length Huffman table used to transmit the other Huffman tables. This kind of recursive usage of Huffman tables to specify Huffman tables is a bit mind-boggling at first but will hopefully clear up later.

The code length symbols are defined like this:

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

Next we read the 18 code lengths for the code length table, 3 fixed bits each. The lengths read correspond to the symbols in the order 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. Let's read those:

000 = 0 (symbol 16)
000 = 0 (17)
010 = 2 (18)
000 = 0 (0)
000 = 0 (8)
000 = 0 (7)

000 = 0 (9)
000 = 0 (6)
000 = 0 (10)
000 = 0 (5)
000 = 0 (11)
000 = 0 (4)

000 = 0 (12)
010 = 2 (3)
000 = 0 (13)
010 = 2 (2)
000 = 0 (14)
010 = 2 (1)

We don't read anything for the last symbol 15 as we only read lengths for the 18 symbols given by HCLEN.
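In code, this step is just a fixed loop over the permuted symbol order (reusing the BitReader sketched above; the order array is taken from section 3.2.7 of the RFC):

CODE_LENGTH_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

def read_code_length_lengths(reader, hclen):
    # 3 fixed bits per code; only the first HCLEN entries are present in the stream
    lengths = {symbol: 0 for symbol in CODE_LENGTH_ORDER}
    for symbol in CODE_LENGTH_ORDER[:hclen]:
        lengths[symbol] = reader.read_bits(3)
    return lengths   # for our file: {1: 2, 2: 2, 3: 2, 18: 2}, everything else 0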

So, our code length Huffman table only contains the four symbols 1, 2, 3, and 18, all with bit length 2. The specification contains rules for how to build a Huffman table from the bit lengths. The result for the code length table is this:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
1        2            00             -                    Code length "1"
2        2            01             -                    Code length "2"
3        2            10             -                    Code length "3"
18       2            11             7                    Repeat code length "0", 11 - 138 times
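The rules from section 3.2.2 of RFC 1951 translate into a short routine: count how many codes exist per bit length, derive the smallest code for each length, then hand out codes to the used symbols in increasing symbol order. A sketch:

def canonical_codes(lengths):
    # lengths: {symbol: bit length}, a bit length of 0 means the symbol is unused
    max_len = max(lengths.values())
    bl_count = [0] * (max_len + 1)
    for length in lengths.values():
        if length:
            bl_count[length] += 1
    next_code, code = [0] * (max_len + 1), 0
    for bits in range(1, max_len + 1):           # smallest code for each bit length
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for symbol in sorted(lengths):               # assign codes in symbol order
        length = lengths[symbol]
        if length:
            codes[symbol] = format(next_code[length], "0%db" % length)
            next_code[length] += 1
    return codes

canonical_codes({1: 2, 2: 2, 3: 2, 18: 2})
# {1: '00', 2: '01', 3: '10', 18: '11'}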

Next we use the code length Huffman table to read 286 LITLEN code lengths and 2 DIST code lengths. Note that symbol 18 (and others not in use here) expands to multiple code lengths, so you need to keep track of how many code lengths you have read overall.

01           => Symbol 2 => 1 x code length 2 (1 read so far)
11 + 1111111 => Symbol 18 + 7 extra bits => 138 x code length 0 (139 read so far)
11 + 0101011 => Symbol 18 + 7 extra bits => 11 + 1101010 = 117 x code length 0 (256 read so far)
10           => Symbol 3 => 1 x code length 3 (257 codes read so far)
11 + 0000100 => Symbol 18 + 7 extra bits => 11 + 0010000 = 27 x code length 0 (284 read so far)
10           => Symbol 3 => 1 x code length 3 (285 read so far)
00           => Symbol 1 => 1 x code length 1 (286 read so far, i.e. all LITLEN codes)
00           => Symbol 1 => 1 x code length 1 (287 read so far)
00           => Symbol 1 => 1 x code length 1 (288 read so far)
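Mechanically, this step expands the (code length symbol, extra bits) pairs we just read into one flat list of code lengths that is then split into the LITLEN and DIST parts. A sketch with the values from above (extra bits already converted to integers):

reads = [(2, None), (18, 127), (18, 106), (3, None), (18, 16), (3, None),
         (1, None), (1, None), (1, None)]

lengths = []
for symbol, extra in reads:
    if symbol <= 15:                             # a literal code length
        lengths.append(symbol)
    elif symbol == 16:                           # repeat previous length 3 - 6 times
        lengths.extend([lengths[-1]] * (3 + extra))
    elif symbol == 17:                           # repeat code length 0, 3 - 10 times
        lengths.extend([0] * (3 + extra))
    else:                                        # 18: repeat code length 0, 11 - 138 times
        lengths.extend([0] * (11 + extra))

litlen_lengths, dist_lengths = lengths[:286], lengths[286:]
assert len(litlen_lengths) == 286 and len(dist_lengths) == 2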

This translates into these Huffman tables (we only read the bit lengths, but I have already added the actual codes based on the algorithm from the specification and omitted unused symbols).

LITLEN Huffman table:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
0        2            10             -                    Literal character '\0'
256      3            110            -                    End of block
284      3            111            5                    Copy 227 + extra bytes
285      1            0              -                    Copy 258 bytes

DIST Huffman table:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
0        1            0              -                    Distance 1
1        1            1              -                    Distance 2

If you compare these tables with the predefined ones in the specification, you can already see how few codes are needed for our special-case file and with how few bits those can be encoded.

Now that we have finally created the Huffman tables for decompression, we can start decoding the actual data.

If you followed along, this is the remaining bit stream; we have almost reached the long stretch of 0 bits:

101000000000000000

Lots of 0 bits (7736 0 bits to be precise)

1111010101100000

Let's decode:

10 => LITLEN Symbol 0   => Literal byte '\0'
10 => LITLEN Symbol 0   => Literal byte '\0'
0  => LITLEN Symbol 285 => Copy 258 bytes from ...
0  => DIST Symbol 0     => ... distance 1

00 => The same, LITLEN Symbol 285, DIST Symbol 0 => Copy 258 bytes from distance 1
00 => ...
00 => ...
00 => ...
00 => ...
00 => ...

3868 times the same

111 + 10101          => LITLEN Symbol 284 => Copy 227 + 10101 = 248 bytes from ...
0                    => DIST Symbol 0 => ... distance 1
110                  => LITLEN Symbol 256 => End Of Block
0000                 => Padding to align to bytes

So the sequence we decoded is this:

  • 2 times the literal byte '\0'
  • 3875 times copy 258 bytes from distance 1 (long stretch of 0 bits in compressed data)
  • copy 248 bytes from distance 1
  • end of block

In total we have decoded 2 + 3875 * 258 + 248 = 1000000 bytes. Mission accomplished!
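Replaying that sequence makes the bookkeeping easy to check; distance 1 means "repeat the previous byte", so everything stays '\0':

out = bytearray(b"\x00" * 2)          # 2 literal '\0' bytes
for _ in range(3875):                 # 3875 times: copy 258 bytes from distance 1
    for _ in range(258):
        out.append(out[-1])
for _ in range(248):                  # once: copy 248 bytes from distance 1
    out.append(out[-1])
assert len(out) == 1000000 and not any(out)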

Now, back to our original question of what limits Gzip compression. As you can see here (and also when looking into the specification), we achieve the highest compression when we can copy 258 bytes at a time. In the best case, like with these very specific Huffman tables, we can encode such a copy with one bit for the LITLEN code and one bit for the distance code. The compression ratio is therefore 258 bytes compressed into 2 bits = 8 * 258 bits / 2 bits = 1032 : 1. In the beginning we measured about 1030 : 1. The difference between that and 1032 : 1 is the overhead of having to include the custom Huffman tables.

A follow-up question is why the specification caps copies at 258 bytes; there would have been room for greater lengths. I once read an explanation on StackOverflow (cannot find the link right now) that the length value (range 3 to 258, stored as length minus 3) should fit into 8 bits for performance reasons on 8-bit machines (DEFLATE is really old by now). A great thing is that you can ask questions about deflate / gzip on StackOverflow and it's likely that Mark Adler (who still maintains zlib) will answer, as for the question "gzip compression ratio for zeros", which answers the same question as this post.


  1. We thought that we would be safe against such attacks because of our streaming infrastructure, which in theory is correct. However, as so often with bugs like this, unfortunate circumstances led to collecting the stream without limiting the amount of collected data. ↩︎
