The Virtual Void
A blog by Johannes Rudolph

Building a blinky sign for Twäng (part 1)

Twäng released its award-winning debut album POP! in September 2021. To celebrate the release I built a small piece of eye-catching hardware.

The sign was shown in a local news item (German): screenshot


Background Music

Before continuing, please head over to one of the online places where you can listen to the music and put it on in the background ;)

POP! Cover

or watch a recent live performance (without my participation).


Materials

  • WS2812B RGB LEDs on a small PCB (e.g. from ebay or aliexpress)
  • Raspberry Pi Pico
  • Another Raspberry Pi Pico (using picoprobe for debugging)
  • 0.6mm copper wire
  • connection wires
  • picture frame
  • bubble wrap


Tools

  • Soldering Iron
  • Side Cutters
  • Third-Hand Tool

Building Log

First version assembly

After one of the rehearsals for our small release concerts I had the idea of creating a small blinking sign.

Should I order an LED matrix to program, or could I build something simpler myself? Looking around on YouTube for inspiration on how to get creative with WS2812B LEDs, I found someone handling a string of these LEDs, each mounted on a small PCB. The LEDs were held together by short pieces of wire soldered between them.

That sounded promising, so I ordered a pack of 50 WS2812B LEDs from ebay on a whim. They arrived two days later and came as a single PCB panel with v-cuts to break them off easily.

I got to work and soldered together the first few pixels. At the input side, I soldered a few wires with pin connectors to try them out with the Raspberry Pi Pico.

First Attempts

The pico-examples repository contains an example for driving WS2812B LEDs. I compiled it without changes and uploaded it to the Pico using UF2. For the experiments, the data-in pin was connected to the GPIO configured in the example. The +5V connection was hooked up to 3V3 for now.1

Each WS2812B RGB LED, in fact, contains three LEDs of red, green, and blue color, together with a small IC for driving these LEDs.

They only use a single digital data line that needs to be chained through all the LEDs. Since there’s no physical clock line, the data rate is predefined and needs to be adhered to accurately to ensure proper transmission. Internally, each LED has a shift register which receives the 24 initial bits (the color value ordered as GRB), and propagates all the remaining bits. This happens until a RESET code is received (low voltage level for at least 50µs). Then all LEDs latch the received values into their display register and use PWM to set the perceived brightness per color led.
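To make the framing concrete, here is a toy sketch in plain Python (not firmware; the helper names are made up for illustration) of the 24-bit GRB frame per pixel and how frames map onto the chain:

```python
def grb_frame(r, g, b):
    """24-bit wire frame for one WS2812B pixel: green byte first, then red, then blue."""
    return f"{g:08b}{r:08b}{b:08b}"

def chain_frames(colors):
    """Concatenate frames in wiring order: the first 24 bits latch into the
    first pixel's shift register, which forwards all remaining bits down the chain."""
    return "".join(grb_frame(r, g, b) for (r, g, b) in colors)

# Pure red on one pixel: green byte 0, red byte 255, blue byte 0
print(grb_frame(255, 0, 0))  # 000000001111111100000000
```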

Connecting to the Pico and running the pico example without changes worked well and showed the predefined animation patterns of the example on the first three pixels.

This looked promising, so I came up with a design for how to wire more of those LEDs to form the letters “pop!”.

First Letter Design

The copper wires are quite stiff, so you can form arbitrary shapes and the resulting form will be stiff enough for careful use.

Producing rounded forms like the letters ‘o’ or ‘p’ was challenging. You need to bend three small pieces of copper wire into a curved form while making sure that the wires, which are only a few millimeters apart, do not touch. After a bit of trial and error, the best technique turned out to be to start with the innermost of the three wires and solder it first.

In my first attempts I tried to precut the wire to the right size, but I could not handle soldering these small parts with my fumbly fingers. A better way was to keep the wire on its reel while soldering the first connection and only cut it right before soldering the other end. The third-hand tool was indispensable, as you can imagine!

First Letter Done

The first test on the full first letter revealed that the color scheme looked weird.

First Letter Test

It turned out that the pico WS2812B example expects a different LED type by default, one that uses a fourth color channel for a white LED. The boolean rgbw parameter passed to ws2812_program_init controls how many bits are sent per LED, so it was an easy fix to run with the RGB variant of the pixels.

I continued with my quest and soldered another letter.

Second Letter Done

As you can see, some bridges are needed: first, to connect the inner power rail of one letter to the inner rail of the next (I chose to run all letters in the same direction, so the inner rail was +5V). Also, the Dout of the last pixel of one letter needs to be connected to the Din of the first pixel of the next letter.

After three letters were done, I started on the first little bit of programming, just to give each letter its own color:

Pop Letters Done

Finally, I had also finished the exclamation mark and real programming could begin.

All Letters Done

Simple colors firmware

First, constants were defined for the colors per letter.

// COLOR_BRG converts R,G,B value to a 24bit BRG value as expected by WS2812B
const uint32_t p_1_color = COLOR_BRG(240, 5, 2);
const uint32_t p_1_o_color = COLOR_BRG(20, 0, 40);
const uint32_t o_color = COLOR_BRG(12, 35, 120);
const uint32_t o_p2_color = COLOR_BRG(28, 70, 6);
const uint32_t p_2_color = COLOR_BRG(200, 40, 0);
const uint32_t p_2_excl_color = COLOR_BRG(90, 13, 1);
const uint32_t excl_color = COLOR_BRG(150, 12, 8);

Determining the color values was not as straightforward as expected. Turning on all LEDs with the same brightness value did not give a neutral white but was heavily shifted towards blue. Two potential explanations come to mind:

  • colors are just not calibrated well
  • a voltage supply smaller than 5V might affect different colors in different ways (red LEDs have a smaller voltage drop than blue ones and there’s no information about how WS2812B implement the constant current source for the LEDs and how it is affected by input voltage)

In the end, I found that a balanced white had an RGB value of 255, 130, 80.

Pixels are chained one after another, and when sending data to them you send color values in the same order as the pixels are wired. Here is a schematic of how the pixels are wired (this schema belongs to a later revision while the code and pictures above belong to the first, sorry for that):

Pixel IDs

The association between pixel IDs and letters was done using a simple bitmap:

const int num_leds = 32;
const int p_1_leds[] = {1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_leds[] =   {0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_p_2_leds[]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; // only want that middle one
const int p_2_leds[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,0,0};
const int excl_leds[]= {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1};

By design, some pixels are shared between two letters and get their own colors.

Transmitting the colors works like this:

void show_all(uint t) {
    for (int i = 0; i < num_leds; ++i) {
        // put_pixel (from the pico-examples ws2812 example) transmits one color value
        if (p_1_leds[i] && o_leds[i])         put_pixel(p_1_o_color);
        else if (p_1_leds[i])                 put_pixel(p_1_color);
        else if (o_p_2_leds[i])               put_pixel(o_p2_color);
        else if (o_leds[i])                   put_pixel(o_color);
        else if (p_2_leds[i] && excl_leds[i]) put_pixel(p_2_excl_color);
        else if (p_2_leds[i])                 put_pixel(p_2_color);
        else if (excl_leds[i])                put_pixel(excl_color);
        else                                  put_pixel(0); // unused pixel stays dark
    }
}

That is, we cycle over all the LEDs by id, figure out which letter the pixel belongs to, and transmit the desired color.

Here’s the result:

Initial Letter Colors

As a finishing touch I put it into a frame and added a sheet of paper and some bubble wrap for diffusion and style.

Framed Version v1

Improved hardware version

User testing revealed that the letter shapes were not easily readable. This is no surprise as the initial version only had ~30 pixels, which doesn’t allow for much detail. I had tried to mimic the album art by fusing adjacent letters. The result was nicely colorful but still not an obvious representation of the album cover.

Only a few days before the concert, I decided to redo the hardware slightly. Last-minute changes to hardware are much more frightening than software fixups.

On the one hand, there’s only one piece of hardware available, which already works; if it gets broken in the process, there’s no way back other than fixing whatever breaks. On the other hand, the software already depends on the topology of the hardware and also needs to be changed accordingly.

In the end I overcame my fears and attempted some cosmetic surgery…

Letter Surgery

I added a few pixels here and there and made the separation between the letters and the exclamation mark clear.

Final setup v2

The backside:

Final setup v2

Leading to the final result:

Framed v2


With the hardware done, I could focus on animations. There are three kinds of animations:

  • Static colors (as seen above) with a slight shimmer by randomly fluctuating pixel colors
  • Fireworks with rockets and flashing letters
  • A snake wandering through the letters

Here’s a shot from the fireworks animation:

Fireworks still

More about the animations in a follow-up blog post.

In the meantime, you can find the code at

  1. For full brightness you can run WS2812B at +5V, but then you need to make sure that the signal level of the data line gets to 0.7 × 5V = 3.5V, which is slightly too high to be driven by a 3.3V microcontroller. With my pixels, both the VBUS pin (nominally at the 5V USB voltage) and the VSYS pin (VBUS minus the ~0.3V drop of a Schottky diode, i.e. ~4.7V) of the Pico worked flawlessly. 

Maximum compression ratio with Zip/Gzip/Deflate

Quite a while ago an issue was reported for Akka HTTP where incoming data, compressed using Content-Encoding: gzip, was decompressed without applying any limits, in the worst case leading to an OutOfMemoryError and a shutdown of the server. This can be used for a classical denial of service attack, a “zip bomb”, where compression is used to amplify the amount of data an attacker has to send to bring down a machine.1

To reproduce the attack I gzip-compressed files of various sizes that contained a repetition of a single character for maximum compression. It turned out that there seems to be a hard compression ratio limit with gzip. This is somewhat surprising as, in general, with run-length encoding you could represent such a repetition with O(log n) compressed bytes.

Here’s a table of various compressed file sizes (all bytes 0):

Original Size   Gzip Size   Ratio
1000            29          34.5 : 1
1000000         1003        997 : 1
1000000000      970501      1030 : 1
10000000000     9704731     1030 : 1

So, the ratio seems to converge to approximately 1030 : 1. Why is that the case? We’ll try to find out in the rest of this post.
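The measurement is easy to reproduce with Python’s stdlib gzip module (exact sizes vary slightly between gzip implementations and settings):

```python
import gzip

# Compress runs of zero bytes and print the resulting compression ratios.
for n in (1_000, 1_000_000):
    compressed = gzip.compress(b"\x00" * n, mtime=0)
    print(f"{n:>10} -> {len(compressed):>6} bytes, ratio {n / len(compressed):7.1f} : 1")
```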

Let’s have a look at how these 1000000 bytes look in compressed form:

$ dd if=/dev/zero bs=1000 count=1000 status=none|gzip|hexdump -C 
00000000  1f 8b 08 00 19 85 d9 5b  00 03 ed c1 01 01 00 00  |.......[........|
00000010  00 82 20 ff af 6e 48 40  01 00 00 00 00 00 00 00  |.. ..nH@........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000003e0  00 af 06 9e cb 79 12 40  42 0f 00                 |.....y.@B..|

Weird, most of the file is just zero bytes (which hexdump abbreviates with *). Let’s have a look at the specifications for gzip, RFC 1951 (DEFLATE) and RFC 1952 (GZIP) to make sense of it.

When I do these manual decoding exercises, I usually copy the hex data of a file into an editor and then snip off bytes from the beginning and annotate them with their meaning, keeping the tail of undecoded data at the bottom. If you want to follow along you can copy the data from the above hexdump into an editor and follow the steps.

Gzip is just a small wrapper around deflate compressed data. The Gzip header is at least 10 bytes:

         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)

For our file:
ID1 + ID2: 1f 8b 
CM:        08 -> DEFLATE compressed
FLG:       00 -> no extra data
MTIME:     19 85 d9 5b -> time of compression
XFL:       00 -> no extra flags
OS:        03 -> Unix OS

As no flags (FLG) are specified the rest of the file is DEFLATE compressed data (if you use gzip <filename>, instead of piping the input data in, the gzip header also contains the original filename). At the end of a gzip file there’s a trailer containing a CRC32 over the uncompressed data (here: 9e cb 79 12) and 4 bytes of the size of the uncompressed data modulo 2^32 (here: 40 42 0f 00 => 0xf4240 = 1000000).
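These header and trailer fields are easy to verify programmatically; a quick check in Python (field layout per RFC 1952):

```python
import gzip, struct, zlib

original = b"\x00" * 1_000_000
data = gzip.compress(original, mtime=0)

# Header: |ID1|ID2|CM |FLG|     MTIME     |XFL|OS |
assert data[0] == 0x1F and data[1] == 0x8B   # ID1, ID2: gzip magic bytes
assert data[2] == 8                          # CM = DEFLATE
flg = data[3]                                # 0 => no optional fields follow

# Trailer: CRC32 and size modulo 2^32, both over the *uncompressed* data
crc32, isize = struct.unpack("<II", data[-8:])
assert isize == 1_000_000
assert crc32 == zlib.crc32(original)
```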

So now that we’ve dealt with all the gzip-related data, we can decode the DEFLATE compressed data, which starts at offset 0x0a:

ed c1 01 01 00 00 00 82 20 ff af 6e 48 40 01 00
# Lots of 00 bytes
00 af 06

DEFLATE compressed data needs to be interpreted as a bitstream. The order in which to read the bits from the bytes and how to interpret them is the single most confusing thing about the DEFLATE specification. RFC 1951 says the bits of a byte are ordered like this:


That is, the least-significant bit of each byte needs to be read first. Let’s convert our data byte-by-byte into a bit stream

ed       c1       01       01       00       00       00       82

11101101 11000001 00000001 00000001 00000000 00000000 00000000 10000010

20       ff       af       6e       48       40       01       00
00100000 11111111 10101111 01101110 01001000 01000000 00000001 00000000

Lots of 00 bytes

af       06
10101111 00000110

Now we have to read the bits in each byte from right to left to get the final bitstream:

10110111 10000011 10000000 10000000 00000000 00000000 00000000 01000001

00000100 11111111 11110101 01110110 00010010 00000010 10000000 00000000

Lots of 0 bits

11110101 01100000

We can now just concatenate the bits and start reading DEFLATE blocks.
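This byte-by-byte bit reversal can be captured in a small helper (plain Python, just for following along):

```python
def bits_lsb_first(data: bytes) -> str:
    """DEFLATE bitstream order: within each byte, the least-significant bit comes first."""
    return "".join(f"{byte:08b}"[::-1] for byte in data)

print(bits_lsb_first(bytes([0xED, 0xC1])))  # 1011011110000011
```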


Lots of 0 bits


The first bit of a DEFLATE block defines whether it is the last block, and the next two bits specify which compression type was used for that block. DEFLATE defines three compression types (one bit pattern is reserved):

  • 00 = uncompressed
  • 01 = predefined Huffman tables
  • 10 = custom Huffman tables
  • 11 = reserved

For our data:

1  - BFINAL => Last block
01 - BTYPE = 10 => custom Huffman tables

As you can see, here’s another oddity: when we read a fixed number of bits (like for the compression type), we need to reverse them again before interpreting them. In our case, the compression uses custom Huffman tables. That makes sense because our original file contains very specific data (only zero bytes) and also it’s long enough to warrant the extra overhead of specifying a set of custom Huffman tables. As you will see most of the following non-zero data will be the definition of the custom Huffman tables.

DEFLATE is a combination of dictionary compression (LZ77) and Huffman coding. LZ77 represents data as a sequence of literal bytes and references to data seen before (as a tuple of offset into previous data and length of data to copy). The LZ77 sequences are then efficiently encoded into a binary stream using Huffman coding. DEFLATE uses two Huffman tables. The first is the LITLEN table: a LITLEN symbol either encodes a literal byte (and which one) or marks a reference of a particular length. If it is a reference, we use the second Huffman table, for distances, to determine the offset of the reference.

The standard defines a set of fixed tables to be used for blocks where including the tables themselves would be too much overhead (BTYPE = 01). In our case, however, we use custom Huffman tables and so the bitstream starts with a definition of the custom tables. These tables themselves are defined using another type of Huffman table, the code length codes.

Section 3.2.7 of RFC 1951 defines how the custom Huffman tables should be constructed.

It starts with some constants to define how big the tables will be. Let’s continue reading:

10111 = HLIT: number of LITLEN codes defined - 257, here 11101 = 29 + 257 = 286
10000 = HDIST: number of distance codes - 1, here 00001 = 1 + 1 = 2
0111  = HCLEN: number of code length codes - 4, here 1110 = 14 + 4 = 18

So, our custom definition contains 286 LITLEN codes and 2 distance codes. They are represented using the 18 code length codes of the code length Huffman table used to transmit the other Huffman tables. This kind of recursive usage of Huffman tables to specify Huffman tables is a bit mind-boggling at first but will hopefully clear up later.

The code length symbols are defined like this:

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

Next we read the 18 code lengths for the code length table by reading 18 times 3 fixed bits of lengths. The lengths read correspond to the symbols in the order 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. Let’s read those:

000 = 0 (symbol 16)
000 = 0 (17)
010 = 2 (18)
000 = 0 (0)
000 = 0 (8)
000 = 0 (7)

000 = 0 (9)
000 = 0 (6)
000 = 0 (10)
000 = 0 (5)
000 = 0 (11)
000 = 0 (4)

000 = 0 (12)
010 = 2 (3)
000 = 0 (13)
010 = 2 (2)
000 = 0 (14)
010 = 2 (1)

We don’t read anything for the last symbol 15 as we should only read lengths of 18 symbols.

So, our code length Huffman table only contains the four symbols 1, 2, 3, and 18, all with bit length 2. The specification contains rules for how to build a Huffman table from the bit lengths. The result for the code length table is this:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
1        2            00             -                    Code length “1”
2        2            01             -                    Code length “2”
3        2            10             -                    Code length “3”
18       2            11             7                    Repeat code length “0”, 11 - 138 times
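The rule set for turning bit lengths into canonical codes (RFC 1951, section 3.2.2: count the codes per length, then hand out consecutive code values) is small enough to sketch in Python:

```python
def canonical_codes(lengths):
    """Build canonical Huffman codes (as bit strings) from {symbol: bit_length},
    following the algorithm in RFC 1951, section 3.2.2."""
    max_len = max(lengths.values())
    bl_count = [0] * (max_len + 1)           # how many codes exist per bit length
    for length in lengths.values():
        bl_count[length] += 1
    code, next_code = 0, [0] * (max_len + 1)
    for bits in range(1, max_len + 1):       # smallest code value for each length
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for symbol in sorted(lengths):           # assign codes in symbol order
        length = lengths[symbol]
        codes[symbol] = format(next_code[length], f"0{length}b")
        next_code[length] += 1
    return codes

print(canonical_codes({1: 2, 2: 2, 3: 2, 18: 2}))
# {1: '00', 2: '01', 3: '10', 18: '11'}
```

Feeding it the LITLEN lengths we are about to read ({0: 2, 256: 3, 284: 3, 285: 1}) reproduces exactly the codes in the tables further down.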

Next we use the code length Huffman table to read 286 LITLEN code lengths and 2 DIST code lengths. Note that symbol 18 (and others not used here) expands to multiple code lengths, so you need to keep track of how many code lengths you have read overall.

01           => Symbol 2 => 1 x code length 2 (1 read so far)
11 + 1111111 => Symbol 18 + 7 extra bits => 138 x code length 0 (139 read so far)
11 + 0101011 => Symbol 18 + 7 extra bits => 11 + 1101010 = 117 x code length 0 (256 read so far)
10           => Symbol 3 => 1 x code length 3 (257 codes read so far)
11 + 0000100 => Symbol 18 + 7 extra bits => 11 + 0010000 = 27 x code length 0 (284 read so far)
10           => Symbol 3 => 1 x code length 3 (285 read so far)
00           => Symbol 1 => 1 x code length 1 (286 read so far, i.e. all LITLEN codes)
00           => Symbol 1 => 1 x code length 1 (287 read so far)
00           => Symbol 1 => 1 x code length 1 (288 read so far)

This translates into these Huffman tables (we read only the bit lengths but I already added the actual codes based on the algorithm from the specification and omitted unused symbols).

LITLEN Huffman table:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
0        2            10             -                    Literal character ‘\0’
256      3            110            -                    End Of Block
284      3            111            5                    Copy 227 + extra bytes
285      1            0              -                    Copy 258 bytes

DIST Huffman table:

Symbol   Bit length   Huffman code   Extra bits to read   Interpretation
0        1            0              -                    Distance 1
1        1            1              -                    Distance 2

If you compare these tables with the predefined ones in the specification, you can already see how few codes are needed for our special-case file and with how few bits they can be encoded.

Now that we have finally created the Huffman tables for decompression, we can start decoding the actual data.

If you followed along, this is the remaining bit stream; we have almost reached the long stretch of 0 bits:


Lots of 0 bits (7736 0 bits to be precise)


Let’s decode:

10 => LITLEN Symbol 0   => Literal byte '\0'
10 => LITLEN Symbol 0   => Literal byte '\0'
0  => LITLEN Symbol 285 => Copy 258 bytes from ...
0  => DIST Symbol 0     => ... distance 1

00 => The same, LITLEN Symbol 285, DIST Symbol 0 => Copy 258 bytes from distance 1
00 => ...
00 => ...
00 => ...
00 => ...
00 => ...

3868 times the same

111 + 10101          => LITLEN Symbol 284 => Copy 227 + 10101 = 248 bytes from ...
0                    => DIST Symbol 0 => ... distance 1
110                  => LITLEN Symbol 256 => End Of Block
0000                 => Padding to align to bytes

So the sequence we decoded is this:

  • 2 times the literal byte ‘\0’
  • 3875 times copy 258 bytes from distance 1 (long stretch of 0 bits in compressed data)
  • copy 248 bytes from distance 1
  • end of block

In total we have decoded 2 + 3875 * 258 + 248 = 1000000 bytes. Mission accomplished!

Now, back to our original question of what limits gzip compression. As you can see here (and also when looking into the specification), we achieve the highest compression when we can copy 258 bytes at once. In the best case, like with these very specific Huffman tables, we can do that with one bit for the LITLEN code and one bit for the distance code. The compression ratio is therefore 258 bytes compressed into 2 bits = (8 * 258 bits) / 2 bits = 1032 : 1. In the beginning we measured about 1030 : 1; the difference from 1032 : 1 is the overhead of having to include the custom Huffman tables.

A follow-up question is why the specification caps copies at 258 bytes; there would have been room for greater lengths. I once read an explanation on StackOverflow (cannot find the link right now) that the length value (range 3 to 258) should fit into 8 bits, as length - 3, for performance reasons on 8-bit machines (deflate is really old by now). A great thing is that you can ask questions about deflate / gzip on StackOverflow and it’s likely that Mark Adler (who still maintains zlib) will answer, like for the question “gzip compression ratio for zeros” which answers the same question as I have here.

  1. We thought that we would be safe against such attacks because of our streaming infrastructure, which in theory is correct. However, as so often with bugs like this, unfortunate circumstances led to collecting the stream without limiting the amount of collected data. 

Base64 vs. Gzip

For personal projects I mostly try to avoid using databases. Instead, I try to use what I call the “JSON repository” pattern. I create a file structure that contains my data in JSON format. I often use the directory structure as an index by placing the JSON data at well-known places for random access.

For example, I wrote a small web crawler a while ago. To make the process incremental between different runs I created this file structure for caching intermediate results until the next run:

data/                          # Repository root
  head/                        # Results of HEAD requests
    http/                      # protocol
      .../                     # host
        docs/                  # URI path element 0
          ...                  # ...
            /abc.html.json.gz  # result of HEAD request in gzipped JSON format
  response/                    # Results of full GET requests
    # same structure as above
            /abc.html.json.gz  # result of GET request in gzipped JSON format
  outlinks/                    # Results of outgoing link analysis
    # same structure as above
            /abc.html.outlinks-v1.json.gz # result of outgoing link analysis in gzipped JSON format

I like this kind of “file system database” because you can browse it with regular tools on the command line. But this shouldn’t be the topic of this post. What I want to talk about in this one is a sub-aspect of storing data in this form.

The good thing about JSON is that it’s human-readable. However, it is less efficient to write, store, and read than a binary format. Also, when you need to embed binary data inside JSON files you need to encode it in (UTF-8) strings, which is often done using Base64 encoding. To mitigate at least part of the storage inefficiency of JSON files, I always run gzip on the resulting JSON. That’s still easy to read on the command line using tools like zless while saving storage space.

Using this scheme, I wondered, how efficient is the encoding of binary data with Base64 first, and then with gzip? More precisely, Base64 wastes space by only using 6 bits out of the 8 bits of a byte. How much of that can be recovered by gzipping Base64-encoded data afterwards? Also, for compressible binary data, how much of the original compression rate can be achieved if data is first Base64-encoded? So, I ran a few experiments.

We have two test bodies:

  • The UTF-8 plain text version of Alice in Wonderland from Project Gutenberg (although there’s no good reason to base64-encode it in JSON, it serves as a good example of compressible data)
  • 1,000,000 random bytes

Let’s start with plain Base64 and plain gzip encoding:

Encoding   Random bytes   x of unencoded   Alice bytes   x of unencoded
raw        1000000        x 1              74703         x 1
base64     1333336        x 1.333          99604         x 1.333
gzip       1000173        x 1.0001         27842         x 0.373

The results are unsurprising. Base64 has a fixed overhead of 33%: as Base64 only uses 6 of the 8 bits per byte, we expect an overhead of 8/6 - 1 = 1/3 ≈ 33%.

Gzip worked well on the text file but slightly increased the size of the random data (which is guaranteed to happen on average for random data).

Now, let’s try base64 encoding first and then apply gzip:

Encoding        Random bytes   x of unencoded   Alice bytes   x of unencoded
base64 + gzip   1009020        x 1.009          38064         x 0.510

So, on random, uncompressible bytes, gzip mitigates the effect of base64 encoding, so that only 0.9% overhead is left.
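These measurements are easy to approximate with Python’s stdlib (using seeded pseudo-random bytes in place of the original corpus; exact sizes vary by gzip implementation and level):

```python
import base64, gzip, random

random.seed(42)                     # deterministic stand-in for "random bytes"
data = random.randbytes(1_000_000)

raw = len(data)
b64 = len(base64.b64encode(data))
gz = len(gzip.compress(data, mtime=0))
b64_gz = len(gzip.compress(base64.b64encode(data), mtime=0))

print(f"base64:        {b64} bytes (x {b64 / raw:.3f})")
print(f"gzip:          {gz} bytes (x {gz / raw:.4f})")
print(f"base64 + gzip: {b64_gz} bytes (x {b64_gz / raw:.3f})")
```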

On compressible data, we need less space than the text would have taken raw, but we lose quite a bit of compression compared to raw compression. The probable reason is that both of gzip’s compression stages, LZ77 and Huffman coding, process input data one byte at a time. Base64’s 8-bit-to-6-bit encoding, however, turns every 3 bytes into 4, so LZ77 cannot find repetitions as easily because they have to be aligned to those 4-character base64 blocks.

Huffman coding is probably also affected, for two reasons. First, every character from the Base64 alphabet represents 6 bits chopped from one or two original bytes, so the symbol histogram no longer reflects the original bytes’ histogram. Second, as the input data takes more space, LZ77 length and distance encoding may require more space (more “extra bits” in the encoding).

To illustrate this effect, I ran a few examples on even simpler input data. Let’s compare gzip compression of raw versus base64-encoded data for the simple string consisting of 1024 times the character ‘e’.

If it is compressed raw it can be represented with LZ77 as

  • Literal character e
  • At distance 1 copy 1024 characters (somewhat unintuitive that you can reference data that is not yet decoded but it actually works like this)

This can be very efficiently encoded.

Even if you disabled LZ77 in gzip, Huffman coding alone could represent this string efficiently because it would only have to encode two symbols:

  • Literal character e
  • End of block

Even though that would be a quite degenerate case of Huffman encoding, you could still encode it in 1025 bits plus the Huffman table (e.g. 1024 1-bits to represent the ‘e’s, followed by a single 0 to represent the end of the block).

Now look at the Base64 encoding of the string of 1024 ‘e’s (shown truncated):

ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
...
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZQo=
It’s quite obvious that it still has some structure but as you can see it exhibits the predicted 4-byte pattern. It can be encoded with LZ77 as

  • Literal string "ZWVl" (base64 representation of "eee")
  • At distance 4 copy 1020/3*4 = 1360 characters (base64 representation of 1020 'e's)
  • Literal string "ZQo=" (the trailer which is necessary because 1024 doesn’t divide by 3, so that the base64 encoding has a bit of padding at the end)

This can still be quite efficiently compressed but has more complexity than the plain encoding without base64.

However, this is still just a special case because the repetitions of the character ‘e’ are aligned to 3-byte boundaries which translate into periods of 4 characters with base64.

You can see this when looking at the base64 encoding of "123456712345671234567123456712345671234567", i.e. 6 times the string "1234567":

MTIzNDU2NzEyMzQ1NjcxMjM0NTY3MTIzNDU2NzEyMzQ1NjcxMjM0NTY3
If you look closely you can see that the result is still periodic, but only with a period of half the output. This is because 7 characters don’t divide by 3, so the next time the string "1234567" aligns at a 3-byte boundary is after 3 repetitions, that is after 21 input characters or 28 output characters.

It’s pretty clear that this will negatively affect gzip compression afterwards.

Finally, here’s a table showing the difference between compressing the same number of bytes of either a single repeated character or the string "1234567":

Encoding       4200 * "e"   x of unencoded   600 * "1234567"   x of unencoded
raw            4200         x 1              4200              x 1
base64         5600         x 1.333          5600              x 1.333
gzip           40           x 0.0095         48                x 0.0114
base64+gzip    47           x 0.0112         81                x 0.0193

In the last line you can see that adding base64 encoding has only little effect on aligned, compressible data but a much bigger effect when the repetitions in the original data don’t align to the base64 3-byte boundaries.
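This last experiment is easy to re-run with Python’s stdlib (exact byte counts depend on the gzip implementation, but the relative ordering is the interesting part):

```python
import base64, gzip

def sizes(data):
    """Return (gzip size of raw data, gzip size of the base64-encoded data)."""
    b64 = base64.b64encode(data)
    return len(gzip.compress(data, mtime=0)), len(gzip.compress(b64, mtime=0))

e_gz, e_b64_gz = sizes(b"e" * 4200)
s_gz, s_b64_gz = sizes(b"1234567" * 600)
print("4200 * 'e':      gzip", e_gz, " base64+gzip", e_b64_gz)
print("600 * '1234567': gzip", s_gz, " base64+gzip", s_b64_gz)
```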


  • Putting binary data as base64 into json files and Gzipping afterwards works well because gzip mitigates the overhead of base64 almost completely.
  • If the binary data is compressible, you’ll lose some potential for compression by using base64 first. (If you want to offset this, you can gzip the binary data first before applying base64.)
  • Alternatively, you could put binary data not in the json file but next to it on the file system.
  • Or, use another encoding better suited to json, e.g. see this stackoverflow question

See also:

Richard Startin recently did a similar investigation in his blog post about Obfuscated Compressibility.

Why isn't it possible to read gzip files from the middle?

For instance, consider the case of a backup gone wrong where the first 100kB of your backup.tar.gz went missing for some reason. Why isn’t it possible to start reading a gzip file (or many other compressed formats) from some position in the middle of the file?

I want to give two answers. The first one is a general one. Compressed formats like gzip are often stream formats. The stream encoder (the sender) and the decoder (the receiver) of the data are both stateful. At the beginning of the file they start from a defined initial state and with every byte (or sometimes even bit) processed both advance their state in lockstep.

Why is that a good idea? For one thing, it is efficient because the state doesn’t need to be transferred explicitly between encoder and decoder. Streaming formats usually also limit the amount of state to keep at any time, so you can decode a stream without having to keep all of the decoded data around, as you would have to with a random-access data format.

The downside of a streaming format is that if you start to read from an arbitrary location inside the stream you don’t know with which state to initialize your decoder, so the decoder doesn’t know how to decode the stream. So, this is the first answer.

The second answer is specific to a gzip / deflate compressed stream. gzip is just a thin wrapper around the deflate algorithm, which is specified in RFC 1951. Remember, the deflate algorithm is a combination of the LZ77 algorithm and Huffman coding. LZ77 tries to replace a sequence of bytes with a back reference into the past of the stream. I recommend Julia Evans’s cool visualization of how “The Raven” compresses with LZ77.

The output of the LZ77 algorithm is a stream of symbols where each symbol represents either a literal uncompressed byte or a back reference of a certain length into the last 32kB of the input. These symbols are then efficiently encoded into bits using Huffman coding. The idea of Huffman coding is to compress a byte1 with fewer than 8 bits. Sounds like magic? Usually you need 8 bits to transfer one of 256 symbols, but only if these symbols are uniformly distributed. If there’s any other distribution, the distribution of symbols already contains information that you can transfer beforehand (as the Huffman tree in this case) and then transfer the actual bytes with fewer than 8 bits per byte on average, using short codes for frequent symbols and long codes for infrequent ones.

Here is how gzip/deflate encoders and decoders work in theory2:

  • Encoder: find back references (LZ77) -> Huffman encode
  • Decoder: Huffman decode -> resolve back references

Each bit read from a deflate compressed stream is first fed into the Huffman decoder state machine until it reaches a leaf in the Huffman tree, at which point an LZ77 symbol is passed on to the LZ77 decoder. The symbol represents either a literal byte, which can be emitted directly, or a back reference into the compression window.

Therefore, the decompressor has two stateful components: the state of the Huffman decoder and the contents of the decompression window against which LZ77 back references are resolved. Both of these states are missing if we start to read in the middle of a deflate stream. We cannot know where we are in the Huffman state machine (though it seems possible to make good guesses). And even if we could Huffman-decode the stream, we could only decode literal bytes but would be unable to resolve back references because we missed the earlier bytes of the stream.

With the question answered, I wondered if it is inevitable that stream formats cannot be read from the middle. Of course, it is not :)

Many stream formats feature special synchronization byte sequences (which are otherwise unlikely to appear) to mark positions from where the decoder can restart decoding from a defined state. Adding synchronization is a trade-off, however. Streams are often used as an efficient way to represent data. This is in conflict with adding extra data. The synchronization markers add redundant data but, more importantly, each synchronization point also has to reset the internal state. So, every extra knowledge a decoder may have acquired while reading from the beginning must be discarded for the benefit of a decoder that starts from later in the stream.

Popular stream formats that employ synchronization markers are live audio or video streams which a consumer might want to join late. Here it makes perfect sense to add synchronization points, e.g. every second of data, for the following reasons:

  • a late consumer probably doesn’t mind skipping ahead to the next synchronization point
  • (video) compression exploits the correlation between subsequent frames to shrink data; state that is useful for compression will usually only be found in frames of the same scene, so nothing is lost by discarding state at a cut between scenes (which offers itself as a synchronization point)

This is different for gzip compressed data, for which no general statement about data correlation can be made. Nevertheless, there is a half-official way to synchronize inside a gzip/deflate stream. This document about “ZLIB flush modes” describes various ways the zlib library can insert synchronization points into a deflate stream. To actually be able to start reading from a synchronization point, it is not enough to emit a synchronization sequence; you must also reset both the LZ77 state (i.e. discard the compression window) and the Huffman encoding. A “Sync Flush” emits an empty uncompressed block whose header contains the well-known byte sequence 00 00 FF FF and then starts a new compressed block. This resets the Huffman encoding. However, it doesn’t reset the LZ77 encoding, so you may still encounter back references to data that appeared before the synchronization point. For this reason, a Sync Flush is not used as an actual synchronization point from which to start reading; instead it is used to pad the output buffer so that full bytes can be emitted. To also reset the LZ77 compression window, you need to emit a “Full Flush”, which looks exactly the same but guarantees that the encoder won’t emit back references pointing to data from before the synchronization point.
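The zlib flush behavior is also exposed through the JDK. Here is a sketch, assuming java.util.zip.Deflater (Java 7 or later), showing that a sync flush makes the well-known 00 00 FF FF marker appear at the end of the flushed output:

```scala
import java.util.zip.Deflater

// raw deflate stream (nowrap = true), as used inside gzip
val deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true)
deflater.setInput("the quick brown fox jumps over the lazy dog".getBytes("UTF-8"))

val buf = new Array[Byte](256)
val n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH)

// the flush pads to a byte boundary and emits an empty stored block: 00 00 FF FF
val tail = buf.slice(n - 4, n).map(b => f"${b & 0xff}%02X").mkString(" ")
println(s"flushed $n bytes, ending in: $tail")
```

Deflater.FULL_FLUSH produces the same marker but additionally resets the compression window, corresponding to the “Full Flush” described above.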

Here are some links to read more about gzip/deflate compression.

  1. In fact, in deflate it’s not bytes (256 symbols) that are Huffman-encoded but either a byte or a back reference. These are represented by one of 285 symbols for which you would usually need 9 bits (with several values unused). See the spec for the details. 

  2. In practice, it is a little more complex as the complete stream can be divided into separate blocks on the Huffman level to allow different Huffman trees for different parts of the file. This makes sense as the symbol distribution may change over the course of a stream of data. 

Losing track (part 2)

In part 1, I introduced the problem of debugging cycling code using stack traces. We have seen that we already lose debugging information if tail call optimization kicks in.

Let’s see what happens if the control flow isn’t linear but there are concurrently running threads of execution.

Let’s run concurrently

Here’s an example that uses reciprocally dependent lazy vals that are evaluated concurrently:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object CyclicLazyValsComplex {
  object A {
    lazy val a: Int = {
      Thread.sleep(100) // make concurrent evaluation more likely
      B.b + 5
    }
  }
  object B {
    lazy val b: Int = {
      Thread.sleep(100) // make concurrent evaluation more likely
      A.a + 5
    }
  }

  def main(args: Array[String]): Unit = {
    val fa = Future(A.a)
    val fb = Future(B.b)

    val sum: Future[Int] =
      for {
        a <- fa
        b <- fb
      } yield a + b

    println("Waiting for completion....")
    Thread.sleep(10000) // give the futures some time to complete
    sum.value match {
      case Some(x) => println(s"Got result: $x")
      case None    => println("Future was never completed.")
    }
  }
}
Again, we’ve got the same bug: A.a depends on the value of B.b and vice versa. In the simplest case, we would just access A.a and see what happens. Mildly surprisingly, it would again fail with a StackOverflowError because the initializers of A.a and B.b would call each other recursively (huh, recursive evaluation of a val?). But this is not the case I’m interested in here.

Instead, we are interested in what happens if A.a and B.b are both evaluated concurrently from different threads. To make that happen we use Future.apply to run the initial evaluation of those lazy vals from different threads. I also added those sleep times to make it more likely that the lazy val initializers are actually executed concurrently.

The Future results of those computations are combined in a for-comprehension into a final future value. We sleep in the main thread for 10 seconds and then check whether we got a result in the meantime.

Here’s what happens:

[info] Running example.CyclicLazyValsComplex 
Waiting for completion....
Future was never completed.

It turns out that we do not get any result (but also no CPU time is spent). What’s going on? Once again, jstack to the rescue; we find that both worker threads are blocked:

> jps
7480 CyclicLazyValsComplex

> jstack 7480

   java.lang.Thread.State: BLOCKED (on object monitor)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
	- waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
	- locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)

   java.lang.Thread.State: BLOCKED (on object monitor)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
	- waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
	- locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)

"main" prio=10 tid=0x00007f6ebc00a000 nid=0x1d3d waiting on condition [0x00007f6ec3e6a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at example.CyclicLazyValsComplex$.main(3complexLazyVals.scala:33)
	at example.CyclicLazyValsComplex.main(3complexLazyVals.scala)


Found one Java-level deadlock:
  waiting to lock monitor 0x00007f6e74006218 (object 0x00000007ade6c800, a example.CyclicLazyValsComplex$A$),
  which is held by "ForkJoinPool-1-worker-13"
  waiting to lock monitor 0x00007f6e60003c88 (object 0x00000007adfb6e80, a example.CyclicLazyValsComplex$B$),
  which is held by "ForkJoinPool-1-worker-9"

Java stack information for the threads listed above:
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
	- waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
	- locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
	- waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
	- locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)

Found 1 deadlock.

Very helpfully, in this case, the program ran itself into a stuck state which is again observable from stack traces, and the JVM even provides a nice deadlock analysis that points right at the problem: we have two lazily evaluated values that reciprocally need to access each other to initialize themselves. Luckily, access was guarded by locks, which again show up in the stack trace, so it was easy to figure out what the problem was.

It may not be clear why this is a realistic example. A lazy val is a mechanism often used to implement a cache. A computation is delayed until its result is first needed. A subsequent access waits for the completion of the initial computation, if necessary, and then returns the result.

This is a useful pattern that is used in some form or other quite often. In more practical cases than this one, the cycle may be much less obvious, but you may still run into a similar deadlock. Unfortunately, these kinds of situations are more likely to happen under load, when threads are preempted to do other work, which introduces delays similar to the artificial delays introduced by the sleep(100) lines.

As seen here, caching based on lazy vals relies on blocking calls to wait for the result of the computation. Nowadays it’s more fashionable (and more reasonable, of course) to use an asynchronous API that lets consumers subscribe to the result of a lazy computation. A consumer is then notified about the completion of a computation without blocking a thread.
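As a minimal illustration of the asynchronous style (a generic sketch, not code from the examples above): instead of blocking on a result, a consumer registers a callback with the eventual result and keeps its thread free.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Promise

val computation = Promise[Int]() // stands in for a cached computation in progress

// subscribe instead of blocking: no thread parks while the result is pending
computation.future.foreach(res => println(s"result available: $res"))
println("callback registered, current thread continues immediately")

computation.success(42) // completing the promise eventually triggers the callback
Thread.sleep(100)       // crude wait so the callback can run before exit
```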

A Future-based cache

spray-caching contains a clever caching solution (introduced to spray by Mathias a few years ago) that doesn’t store the results of a computation but rather the eventual result of the computation, i.e. a Future of the result. This is ingenious because it allows putting computations into the cache that are not yet complete. It sidesteps the so-called thundering herd problem in a very elegant way. The thundering herd problem describes what happens when lots of similar requests arrive at a cold cache (e.g. after a server restart). A simplistic cache may not note that it has already started computing the value for a key and may spawn redundant calculations for the same key until finally one of them puts a value in the cache. In the worst case this may bring the server down again because of too much load.

The Future-based solution, on the other hand, can immediately put a result Future into the cache to which subsequent requests for the same key can subscribe. This prevents redundant computations and also keeps the cache unconcerned with managing subscriptions to an ongoing computation, because that is handled naturally by the Future.

Here’s a super simple implementation of such a cache:

import scala.concurrent.{ExecutionContext, Future, Promise}

object Cache {
  def apply(f: Int => Future[Int])(implicit ec: ExecutionContext): Int => Future[Int] = {
    var cache = Map.empty[Int, Future[Int]]
    val lock = new Object

    key =>
      lock.synchronized {
        cache.get(key) match {
          // cache hit
          case Some(result) => result
          // cache miss
          case None =>
            val result = Promise[Int]()
            Future { // run `f` concurrently
              result.completeWith(f(key))
            }
            cache += key -> result.future
            result.future
        }
      }
  }
}
The cache constructor takes a function Int => Future[Int] and returns a function of the same type. The resulting function runs the underlying function only once per key and puts the resulting Future into a Map. Calls with the same key will instantly return the cached Future.

A more practical implementation would use something more scalable than a simple synchronized block; spray-caching uses ConcurrentLinkedHashMap, for instance. It would also need to enforce some limits on cache size, etc. This simple implementation, though, should serve its demonstration purpose here.
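To see the thundering-herd protection in action, here is a quick self-contained check (a stripped-down cache of the same shape, with hypothetical helper names): many requests for the same key all subscribe to a single computation, so the expensive function runs only once.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}

val computations = new AtomicInteger(0)
def expensive(i: Int): Future[Int] =
  Future { computations.incrementAndGet(); Thread.sleep(50); i * 2 }

// same shape as the cache above: one Future per key, stored on first request
var cache = Map.empty[Int, Future[Int]]
val lock = new Object
def cached(i: Int): Future[Int] = lock.synchronized {
  cache.getOrElse(i, { val fut = expensive(i); cache += i -> fut; fut })
}

// ten requests for the same key all subscribe to the same incomplete Future
val results = Await.result(Future.sequence((1 to 10).map(_ => cached(21))), 5.seconds)
println(results)          // ten times 42
println(computations.get) // 1: no redundant computations were spawned
```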

Let’s put it into use:

object CyclicFutureCache {
  val f: Int => Future[Int] = Cache(calculateF)

  // the underlying heavy computation
  def calculateF(i: Int): Future[Int] =
    i match {
      case 42 => f(i - 19)    // f(23)
      case 23 => f(i * 2 - 4) // f(42)
      case x  => Future.successful(123)
    }

  def main(args: Array[String]): Unit = {
    val sum: Future[Int] =
      for {
        aRes <- f(23)
        bRes <- f(42)
      } yield aRes + bRes

    println("Waiting for completion....")

    sum.onComplete { res =>
      println(s"Program was completed with $res")
    }

    // sleep while waiting for completion
    while (true) Thread.sleep(1)
  }
}

We have an expensive recursive function calculateF (similar to the one shown in part 1). For efficiency reasons we want to cache results of this function, so we wrap it in the Cache, which gives us a new cached function f. This is also what we call for the recursive invocations. Again, we have introduced a cyclic dependency bug similar to the ones seen in the previous examples.

Here’s the output of the program:

Waiting for completion....

Once again, the futures are never completed and the program is stuck. Let’s see what jstack says:

> jps
7918 CyclicFutures

> jstack 7918

"ForkJoinPool-1-worker-13" #10 daemon prio=5 os_prio=0 tid=0x00007f60642a3000 nid=0x1ee9 waiting on condition [0x00007f601b2d3000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000076db30338> (a scala.concurrent.forkjoin.ForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(


"main" #1 prio=5 os_prio=0 tid=0x00007f606400a000 nid=0x1ebb waiting on condition [0x00007f606a856000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at example.CyclicFutureCache$.main(4cyclicFutures.scala:54)
	at example.CyclicFutureCache.main(4cyclicFutures.scala)


This time we are completely left in the dark. Two threads are just waiting and there’s no sign on the stack that a computation is not yet complete. The program is stuck but we cannot see why.

What’s going on? Remember that a stack trace is a representation of the future of a thread or, in other words, the stack represents the continuation of a thread. The point of an asynchronous API like the Future based one from Scala is, however, to detach the continuation, the “what to do after a computation is complete”, from the thread which starts or runs the original computation.

Therefore, no thread or its stack contains any trace of the continuation. Instead, the continuation is explicitly managed on the heap. When you use onComplete (or any of the other Future combinators) you pass a function value to be invoked when the original computation is complete. This function value is put into a list of callbacks kept by the Promise that implements the Future.

In our case, the cache created two Promises, one for the result of f(23) and one for the result of f(42). It then called promise.completeWith(...) which makes the promise subscribe to the Future given as an argument. The call sequence was roughly like this:

val f23Promise = Promise[Int]()
val f42Promise = Promise[Int]()

// calculateF(23) evaluates to f(42) and calculateF(42) evaluates to f(23), so:
f23Promise.completeWith(f(42)) // i.e. f42Promise.future
f42Promise.completeWith(f(23)) // i.e. f23Promise.future

So, both Promises subscribed to the respective other but neither could make any progress.
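The cycle is easy to reproduce in isolation. A minimal sketch: two promises completing with each other never complete, and, unlike the lazy val deadlock, no thread is blocked while they wait, so nothing shows up on any stack.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Promise

val p1 = Promise[Int]()
val p2 = Promise[Int]()

// each promise subscribes to the other: the callbacks just sit on the heap
p1.completeWith(p2.future)
p2.completeWith(p1.future)

Thread.sleep(100) // nothing will happen, no matter how long we wait
println(s"p1 completed: ${p1.future.isCompleted}, p2 completed: ${p2.future.isCompleted}")
```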

In summary, when working with Futures, we neither keep track of the history of the computation, nor can we easily find out what the future of the computation is, because we cannot inspect the list of still-waiting callbacks. Also, a Future consumer usually doesn’t track what it is currently waiting on; instead, it just registers a callback with the producing Future. So, there’s a reference from the producer to the consumer but not the other way round.

I don’t have any ready-made solutions at this point. What is needed is better support for tracing Future call chains and execution. At some point I started the better-future-exceptions experiment to see what it would take to improve the Future tracing infrastructure. akka-stream faced similar problems and @drewhk implemented a graph output showing who waits for what in the streaming scenario. It would be cool if we had something like this for Futures as well. If you have ideas, please share them in the comments!

You can get the examples from this post in a project on github. They also include an example I have not discussed here which replicates a lazy calculation with actors, but there is not much insight to be gained compared to the Future example.

Thanks for reading.