The Virtual Void A blog by Johannes Rudolph

Experiments with io_uring

This is a quick writeup of my experiments with io_uring.

What is io_uring?

io_uring is a generic, consistent API that allows to (batch) submit Linux kernel IO operations in an asynchronous way ultimately requiring few or zero actual syscalls. It uses a pair of buffer rings for submission and completion data shared between user and kernel space (but that’s just an implementation detail).

To deconstruct the above sentence:

  • generic: io_uring is generic because it supports a growing list of IO-related syscalls
  • consistent: The submission and completion mechanism is the same for different kinds of APIs. Submission entries must by default fit into the 64-byte submission entry, completions into a 32-byte completion entry. For some operations the size of a single entry must be doubled which can be configured during creation of the ring.
  • batch submit: Any number of calls (up to the configured size of the submission ring) can be submitted in one go.
  • asynchronous: The API itself is asynchronous in that submission of an operation will not always lead to immediate completion. Instead, some operations may only be completed later. By default, submission are directly executed inline (in the same thread) during submission if possible. Otherwise, operations may only be completed later.
  • few or zero syscalls: Submissions and completions are serialized into shared memory and can (in theory) be accessed by the kernel and the user application concurrently. The default behavior requires a single io_uring_enter call to make the kernel aware of the latest submissions. An optional SQPOLL mode allows the kernel to run a dedicated thread to poll submission from the application in which ultimately no syscalls are required any more (though, then kernel and application need to (busy-) poll the queues).
  • buffer rings: Application and kernel communicate using shared memory and a submission/completion protocol. In theory, this can be faster than syscall based IO submission.

Why io_uring?

syscalls can become a bottleneck in applications that require lots of IO. Batch submission and the potential polling mode means you can trade off different latency and throughput requirements.

Until io_uring came along, Linux did not have a proper asynchronous interface for doing buffered 1 file IO. In general, io_uring can hide an internally blocking operation inside its own blocking thread pool just providing an asynchronous interface to an actually blocking operation (similar as e.g. Java NIO asynchronous file operations were implemented). Adding more (internal or not) thread pools somewhat defeat the purpose of asynchronous execution (which should be as light-weight as possible). So, later an asynchronous buffered read mode was proposed and added to the kernel (and io_uring) for file-systems that support it.

Registering buffers and file descriptors

Using user-space buffers and file descriptors in the kernel comes with some overhead. Therefore, it is possible to preregister buffers and descriptors using the io_uring_register syscall.

In many server applications that mutiplex handling of multiple connections with a single thread, the connection is first polled to be ready for reading, then a buffer is allocated, and finally the data is read. With io_uring it is possible to avoid the polling step completely and read data asynchronously directly. However, that means that buffers need to be provided upfront for every waiting read call (which in a web server would be all idle connections). To avoid a per-read (per-connection) read buffer for every outstanding read, with io_uring buffers can also be provided more dynamically using a special operation, IORING_OP_PROVIDE_BUFFERS.

You can then use IOSQE_BUFFER_SELECT for submissions to have io_uring select a buffer from the given set when needed.

A downside of this method of providing buffers is that the kernel may choose buffers arbitrarily. After use, the application has to provide them again. While the original IORING_OP_PROVIDE_BUFFERS call can use a contiguous amount of memory to provide multiple buffers at once, spent buffers will fragment that memory over time. The application will either have to

  • provide them back to io_uring in a one-by-one fashion after usage (spending one SQE per buffer after usage), or
  • keep track of which buffers have been redeemed, try to coalesce their memory regions and provide them later with fewer SQEs

In later kernels, it is possible to use IORING_REGISTER_PBUF_RING to create yet another ring just for buffer management, so that providing buffers can happen more efficiently without spending SQEs just for providing single small buffers.

Execution Modes

As shown in the Missing Manuals - io_uring worker pool, an IO operation can go through three different paths when submitted:

  • Be executed directly inline during the execution of io_uring_enter
  • Is not ready but can be polled for readiness. In that case, it is automatically polled. When the operation is ready, it is retried and completed.
  • If it only can be executed in a blocking fashion, it is enqueued to a run queue to the blocking kernel worker thread pool and executed and completed when done.

Inline execution

Simple networking applications (like an echo server or a simple HTTP server) will spent a significant portion of CPU time in the kernel doing TCP work. Whether you use regular syscalls or io_uring, by default much of the TCP processing time will be spent on a user thread (especially on the write side).2

The test application is a simple HTTP server which fakes request processing:

  • Has a simple (single-threaded) event processing loop which blocks on completion events from io_uring (setting the min_complete argument of the io_uring_enter syscall to 1).
  • Accepts any amount of incoming connections.
  • Tries to read an incoming HTTP request (but does not validate it at all) and sends out a response (it will never have to wait on network writes).

Here is a flame graph of simple HTTP server processing using io_uring (click for interactive version):

Flamegraph of inline execution of socket read/write

The blueish and green frames on the left side are the application. Since this application does not do any actual processing, it is only a very small share of total processing time.

The big red humps are frames executing in the kernel. All of those frames are on top of the io_uring_enter call, i.e. executing while doing the submission call. The biggest share of that call is the big hump in the middle which is the actual TCP sendmsg implementation. You will see the same frames on a blocking server as well.

The small part on the left seems to be the overhead of actual submitting events using io_uring.

The small hump on the right is part of the TCP recvmsg implementation.

Async submission

There’s an alternative mode that can be selected using the IOSQE_ASYNC flag per SQE where the task is not executed inline at all but queued for io_uring’s internal worker queue.

One problem when used together with IOSQE_BUFFER_SELECT shows up in the flamegraph (click through for interactive flame graph):

Flamegraph of buffer selection contention with async flag

io_uring will spawn a massive amount of worker threads and all these threads contend on acquiring a pre-provided buffer which will slow everything down.

I filed an issue about that and it was confirmed that using this usage pattern in particular but more generally IOSQE_ASYNC is not recommended. In any case, you should probably stay away from using IOSQE_BUFFER_SELECT when you expect the requests to end up on the internal worker pool.

Inline vs. Async

It may be surprising that using an asynchronous interface may not automatically lead to asynchronous execution or more parallelism. For application servers that are single-threaded, there is the option to force spreading the internal kernel load to the internal worker queue (if you have cores to spare) by using IOSQE_ASYNC. However, as seen, this comes with extra overhead and potential extra scalability issues.

For CPU-bound workloads the inline execution behavior seems useful as it will provide the fastest execution path. For low-latency applications the time spent in the kernel might be an unwanted interruption but for lowest-latency you will anyway need spare CPU resources to avoid any extra delays. In that case, distributing work over multiple worker threads each owning a separate ring (and maybe polling) should give the best solution and best control over how resources are spent.

Usage from the JVM

To support io_uring from the JVM, only a few native calls need to be implemented:

  • Access to the io_uring_setup, io_uring_enter, and io_uring_register syscalls (e.g. by supporting all syscalls by linking to libc’s syscall symbol)
  • Memory mapping the io_uring data structures into memory
  • A way to manipulate the native (non-JVM-heap) memory of the rings and IO buffers.

The JVM’s standard way of interfacing with the native world is using JNI. We only need to provide simple wrappers for three syscalls with simple interfaces so this is simple to do. Alternatively, using a simple JNA declaration like this also works:

import com.sun.jna.Library;
import com.sun.jna.Pointer;
import com.sun.jna.Structure;

public interface LibC extends Library {
  int syscall(int number, Object... args);
  Pointer mmap(Pointer addr, long length, int prot, int flags, int fd, long offset);

Using JNA will introduce some overhead over JNI, but it seems relatively fine compared to all the other work.

After mapping a few of io_uring’s internal data structures, creating a uring can look like this:

val params = new IoUringParams // not in the fast-path, use JNA mapped data structured
val uringFd = libC.syscall(LibC.SYSCALL_IO_URING_SETUP, sizeOfRing, params)

val length = params.sq_off.array + params.sq_entries * 4

val sqRingPointer = libC.mmap(new Pointer(0), length, LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQ_RING)
val cqRingPointer = libC.mmap(new Pointer(0), params.cq_off.cqes + params.cq_entries * 16 /* sizeOf(IoUringCqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_CQ_RING)
val sqePointer = libC.mmap(new Pointer(0), params.sq_entries * 64 /* sizeOf(IoUringSqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQES)

// FIXME: check all result values

// convert JNA pointers to DirectByteBuffer
val sqRingPointerByteBuffer = sqRingPointer.getByteBuffer(0, length)
val cqRingPointerByteBuffer = cqRingPointer.getByteBuffer(0, params.cq_off.cqes + params.cq_entries * 16)
val sqes = sqePointer.getByteBuffer(0, 64 * params.sq_entries)

From here on, all datastructures are manipulated using the DirectByteBuffers whose operations (hopefully) are intrinsified and therefore cheap. JNA’s getByteBuffer is expensive, though, so make sure to create buffers spanning the whole interesting region and reuse them later. For example, to access some parameters, you can use code like this:

def sqHead() = sqRingPointerByteBuffer.getInt(params.sq_off.head)
def sqTail() = sqRingPointerByteBuffer.getInt(params.sq_off.tail)
val sqMask = sqRingPointerByteBuffer.getInt(params.sq_off.ring_mask)

def cqHead() = cqRingPointerByteBuffer.getInt(params.cq_off.head)
def cqTail() = cqRingPointerByteBuffer.getInt(params.cq_off.tail)
val cqMask = cqRingPointerByteBuffer.getInt(params.cq_off.ring_mask)

Here is one way of representing an operation with a case class (which is not quite optimal because of the extra allocation):

case class ReadOp(
    flags:             Byte,
    fd:                Int,
    destinationOffset: Long,
    targetBufferAddr:  Long,
    targetBufferSize:  Int,
    userData:          Long,
    bufferGroup:       Short = 0 // FIXME: buffer selection should be abstracted
) extends Op(IORING_OP_READ) {
  override def prepareSQE(buffer: ByteBuffer): Unit = {
    buffer.putShort(0 /* ioprio */ )

An operation can be submitted like this:

def submitGlobal(op: Op): Unit = {
  // FIXME: check against overflow
  val curTail = sqTail()
  val index = curTail & sqMask
  sqes.position(64 * index)
  sqRingPointerByteBuffer.putInt(params.sq_off.array + 4 * index, index)
  // FIXME: write barrier needed?
  sqRingPointerByteBuffer.putInt(params.sq_off.tail, curTail + 1)
  // FIXME: write barrier needed?

That is, we first figure out the tail of the ring to write to, then prepare an operation using the above method, put the index into the SQE array, and finally increase the tail index. (I’m not sure if the barriers are needed in a simple application, but if multiple threads access the ring, you definitely should add the barriers, e.g. using Unsafe).

Finally, submitting all queued SQEs to the kernel is done by invoking io_uring_enter (the only necessary syscall in the hot path):

val toSubmit = sqTail() - sqHead()
libC.syscall(LibC.SYSCALL_IO_URING_ENTER, uringFd, toSubmit, 1, IORING_ENTER_GETEVENTS, 0)

After the call returns, you can check for completions:

var cqIdx = cqHead()
val last = cqTail()

while (cqIdx < last) {
  val idx = cqIdx & cqMask
  // accessing cqe data via JNA is more expensive but also possible:
  // val cqe = new IoUringCqe(cqRingPointer.share(params.cq_off.cqes + idx * 16))
  // debug(cqe.toString)
  val off = params.cq_off.cqes + idx * 16
  val userData = cqRingPointerByteBuffer.getLong(off)
  val result = cqRingPointerByteBuffer.getInt(off + 8)
  val flags = cqRingPointerByteBuffer.getInt(off + 12)
  handleCompletion(userData, result, flags)
  cqIdx += 1
// give back all handled entries (FIXME: memory barriers)
cqRingPointerByteBuffer.putInt(params.cq_off.head, last)

Here we loop through all new CQEs, extract their data (again using the previously created DirectByteBuffers) and defer handling to a completion method.

A final problem is creating or accessing file descriptors from the JVM. It is possible to extract file descriptors from Java RandomAccessFile and Socket using Java reflection from their internals.

In newer kernel versions, more syscalls are available and it might be possible to open files and sockets using io_uring.

The full source code can be found at It contains more of infrastructure to provide a more idiomatic API, however, everything should be considered experimental (checking return values, anyone?).

Much more

There’s, of course, a lot more to io_uring not covered here:

  • More flags like IOSQE_IO_LINK or IOSQE_IO_DRAIN
  • Full set of supported ops
  • Cancelling requests


  1. Buffered means going through the page cache. I.e. when calling read the file is mapped into memory and then copied into user buffers. 

  2. If you ever wondered where the 5 times difference between the “Json” and the “Plaintext” tests in the Techempower Benchmark comes from, that’s the reason. The json example uses regular HTTP requests where only a single request can be ongoing per TCP connection. In that case you always require at least one TCP/IP packet per request and response. The plaintext benchmark uses HTTP pipelining, which ultimately allows to transmit multiple HTTP requests and responses in a single TCP/IP packet which decreases the in-kernel CPU overhead of TCP massively. 

Building a blinky sign for Twäng (part 1)

Twäng released its award-winning debut album POP! in September 2021. To celebrate the release I built a small piece of eye-catching hardware.

The sign was shown in a local news item (German): screenshot


Background Music

Before continuing, please head over to one of the online places where you can listen to the music and put it on in the background ;)

POP! Cover

or watch a recent live performance (without my participation).


  • WS2812B RGB LEDs on a small PCB (e.g. from ebay or aliexpress)
  • Raspberry Pi Pico
  • Another Raspberry Pi Pico (using picoprobe for debugging)
  • 0.6mm copper wire
  • connection wires
  • picture frame
  • bubble wrap


  • Soldering Iron
  • Side Cutters
  • Third-Hand Tool

Building Log

First version assembly

After one of the rehearsals for our small release concerts I had the idea of creating a small blinking sign.

Should I order a LED matrix to program or could I build something simpler myself? Looking around on youtube for inspiration about how to get creative with WS2812B, I found someone handling a string of these LEDs each mounted on a small PCB. The LEDs were hold soldered together with a piece of wire.

That sounded promising, so I ordered a pack of 50 WS2812B LEDs from ebay on a whim. They arrived two days later and came as a single PCB panel with v-cuts to break them off easily.

I got to work and soldered together the first few pixels. At the input side, I soldered a few wires with pin connectors to try them out with the Raspberry Pi Pico.

First Attempts

The pico-examples repository contains examples for driving WS2812B. I compiled it without changes and uploaded it to the Pico using UF2. For the experiments, the data in pin was connected to the GPIO that was configured in the example. The +5V connection was connected to the 3V3 for now.1

Each WS2812B RGB LED, in fact, contains three LEDs of red, green, and blue color, together with a small IC for driving these LEDs.

They only use a single digital data line that needs to be chained through all the LEDs. Since there’s no physical clock line, the data rate is predefined and needs to be adhered to accurately to ensure proper transmission. Internally, each LED has a shift register which receives the 24 initial bits (the color value ordered as GRB), and propagates all the remaining bits. This happens until a RESET code is received (low voltage level for at least 50µs). Then all LEDs latch the received values into their display register and use PWM to set the perceived brightness per color led.

Connecting to the Pico and running the pico example without changes worked well and showed the predefined animation patterns of the example on the first three pixels.

This looked promising so I came up with a design about how to wire more of those LEDs to form the letters “pop!”.

First Letter Design

The copper wires are quite stiff, so you can form arbitrary shapes and the resulting form will be stiff enough for careful use.

Producing the rounded forms like the letter ‘o’ or ‘p’ was challenging. You need to bend three small pieces of copper wire into a curved form making sure that those three wires which are only a few millimeters apart do not touch. After a bit of trial and error, the best technique was to start with the inner of those three wires and solder those first.

In my first attempts I tried to precut the wire into the right size but I could not handle the soldering of these small parts with my fumbly fingers. A better way was to keep the wire on its reel while soldering the first connection, and only cutting it right before soldering the other end. The third hand tool was indispensable, as you can imagine!

First Letter Done

The first test on the full first letter revealed that the color scheme looked weird.

First Letter Test

It turned out that the pico WS2812B example, expects a different LED type by default that uses a fourth color channel for a white LED. The boolean rgbw parameter passed to ws2812_program_init controls how many bits are sent per LED, so it was an easy fix to run with the RGB variant of the pixels.

I continued with my quest and soldered another letter.

Second Letter Done

As you can see, some bridges are needed, first, to connect the inner rail of the power supply to the inner rail of the next letter (I chose to run all letters in the same direction, so the inner rail was +5V). Also, the Dout of the last pixel of one letter needs to be connected to the Din of the first letter of the next letter.

After three letters done, I started doing the first little bit of programming just to give each letter its own color:

Pop Letters Done

Finally, I had also finished the exclamation mark and real programming could begin.

All Letters Done

Simple colors firmware

First, constants were defined for the colors per letter.

// COLOR_BRG converts R,G,B value to a 24bit BRG value as expected by WS2812B
const uint32_t p_1_color = COLOR_BRG(240, 5, 2);
const uint32_t p_1_o_color = COLOR_BRG(20, 0, 40);
const uint32_t o_color = COLOR_BRG(12, 35, 120);
const uint32_t o_p2_color = COLOR_BRG(28, 70, 6);
const uint32_t p_2_color = COLOR_BRG(200, 40, 0);
const uint32_t p_2_excl_color = COLOR_BRG(90, 13, 1);
const uint32_t excl_color = COLOR_BRG(150, 12, 8);

Determining the color values was not as straight-forward as expected. Turning on all LEDs with the same brightness value, did not gave a neutral white but was heavily shifted to blue. Two potential explanations come to mind:

  • colors are just not calibrated well
  • a voltage supply smaller than 5V might affect different colors in different ways (red LEDs have a smaller voltage drop than blue ones and there’s no information about how WS2812B implement the constant current source for the LEDs and how it is affected by input voltage)

In the end, I found that a balanced white had an RGB value of 255, 130, 80.

Pixel are chained one after each other and when sending data to them. You send color values in the same order as the pixels are wired. Here is a schematic of how the pixels are wired (this schema belongs to a latter revision while code and pictures above to the first, sorry for that):

Pixel IDs

The association between pixel IDs and letters was done using a simple bitmap:

const int num_leds = 32;
const int p_1_leds[] = {1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_leds[] =   {0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_p_2_leds[]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; // only want that middle one
const int p_2_leds[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,0,0};
const int excl_leds[]= {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1};

By design, some pixels are shared between two letter and get their own colors.

Transmitting the colors works like this:

void show_all(uint t) {
    for (int i = 0; i < num_leds; ++i) {
        if (p_1_leds[i] && o_leds[i])
        else if (p_1_leds[i])
        else if (o_p_2_leds[i])
        else if (o_leds[i])
        else if (p_2_leds[i] && excl_leds[i])
        else if (p_2_leds[i])
        else if (excl_leds[i])

That is, we cycle over all the LEDs by id, figure out which letter the pixel belongs to, and transmit the desired color.

Here’s the result:

Initial Letter Colors

As a finishing touch I put it into a frame and added a sheet of paper and some bubble wrap for diffusion and style.

Framed Version v1

Improved hardware version

User testing revealed that the letter shaping was not easily readable. This is no surprise as the initial version only had ~ 30 pixels which doesn’t allow for much details. I had tried to mimic the album art by fusing adjacent letters. The result was nicely colorful but still not an obvious representation of the album cover.

Only a few days before the concert, I decided to redo the hardware slightly. Last minute changes to hardware are much more frightening than doing software fixups.

On one hand, there’s only one piece of hardware available which already works, and if it gets broken in the process, there’s no way of going back other than fixing whatever breaks. On the other hand, the software already depends on the topology of the hardware and also needs to be changed accordingly.

After all I overcame my fears and attempted some cosmetic surgery…

Letter Surgery

I added a few pixels here and there and made the separation between the letters and the exclamation mark clear.

Final setup v2

The backside:

Final setup v2

Leading to the final result:

Framed v2


With the hardware done, I could focus on animations. There are three kinds of animations:

  • Static colors (as seen above) with a slight shimmer by randomly fluctuating pixel colors
  • Fireworks with rockets and flashing letters
  • A snake wandering through the letters

Here’s a shot from the fireworks animation:

Fireworks still

More about the animations in a follow-up blog post.

In the meantime, you can find the code at

  1. For full brightness you can run WS2812B at +5V, but then you need to make sure that the signal level of the data line get to 0.7 times 5V which is 3.5V and therefore slightly too high to be driven by a 3V3 microcontroller. With my pixels both using the VBUS pin (nominally at 5V USB voltage) and VSYS pin (VBUS - the voltage drop of a Zener Diode of ~0.3V, i.e. at ~ 4.7V) of the Pico both also work flawlessly. 

Maximum compression ratio with Zip/Gzip/Deflate

Quite a while ago an issue was reported for Akka HTTP where incoming data, compressed using Content-Type: gzip, was uncompressed without applying any limits, in the worst case leading to an OutOfMemoryException and a shutdown of the server. This can be used for a classical denial of server attack, a “zip bomb”, where compression is used to amplify the amount of data an attacker has to send to bring down a machine.1

To reproduce the attack I gzip-compressed files of various sizes that contained a repitition of a single character for maximum compression. It turned out that there seems to be a hard compression ratio limit with gzip. This is somewhat surprising as, in general, with run length encoding you could represent such a repetition with log(n) compressed bytes.

Here’s a table of various compressed file sizes (all bytes 0):

Original Size Gzip Size Ratio
1000 29 34.5 : 1
1000000 1003 997 : 1
1000000000 970501 1030 : 1
10000000000 9704731 1030 : 1

So, the ratio seems to converge to approximately 1030 : 1. Why is that the case? We’ll try to find out in the rest of this post.

Let’s have a look at how these 1000000 bytes look in compressed form:

$ dd if=/dev/zero bs=1000 count=1000 status=none|gzip|hexdump -C 
00000000  1f 8b 08 00 19 85 d9 5b  00 03 ed c1 01 01 00 00  |.......[........|
00000010  00 82 20 ff af 6e 48 40  01 00 00 00 00 00 00 00  |.. ..nH@........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000003e0  00 af 06 9e cb 79 12 40  42 0f 00                 |.....y.@B..|

Weird, most of the file is just zero bytes (which hexdump abbreviates with *). Let’s have a look at the specifications for gzip, RFC 1951 (DEFLATE) and RFC 1952 (GZIP) to make sense of it.

When I do these manual decoding exercises, I usually copy the hex data of a file into an editor and then snip off bytes from the beginning and annotate them with their meaning, keeping the tail of undecoded data at the bottom. If you want to follow along you can copy the data from the above hexdump into an editor and follow the steps.

Gzip is just a small wrapper around deflate compressed data. The Gzip header is at least 10 bytes:

         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)

For our file:
ID1 + ID2: 1f 8b 
CM:        08 -> DEFLATE compressed
FLG:       00 -> no extra data
MTIME:     19 85 d9 5b -> time of compression
XFL:       00 -> no extra flags
OS:        03 -> Unix OS

As no flags (FLG) are specified the rest of the file is DEFLATE compressed data (if you use gzip <filename>, instead of piping the input data in, the gzip header also contains the original filename). At the end of a gzip file there’s a trailer containing a CRC32 over the uncompressed data (here: 9e cb 79 12) and 4 bytes of the size of the uncompressed data modulo 2^32 (here: 40 42 0f 00 => 0xf4240 = 1000000).

So now we we’ve have dealt with all the Gzip related data, we can decode the DEFLATE compressed data which starts at offset 0x0a:

ed c1 01 01 00 00 00 82 20 ff af 6e 48 40 01 00
# Lots of 00 bytes
00 af 06

DEFLATE compressed data needs to be interpreted as a bitstream. The order in which to read the bits from the bytes and how to interpret them is the single most confusing thing about the DEFLATE specification. RFC 1951 says the bits of a byte are ordered like this:


That is, the least-significant bit of each byte needs to be read first. Let’s convert our data byte-by-byte into a bit stream

ed       c1       01       01       00       00       00       82

11101101 11000001 00000001 00000001 00000000 00000000 00000000 10000010

20       ff       af       6e       48       40       01       00
00100000 11111111 10101111 01101110 01001000 01000000 00000001 00000000

Lots of 00 bytes

af       06
10101111 00000110

Now we have to read the bits in each byte from right to left to get the final bitstream:

10110111 10000011 10000000 10000000 00000000 00000000 00000000 01000001

00000100 11111111 11110101 01110110 00010010 00000010 10000000 00000000

Lots of 0 bits

11110101 01100000

We can now just concatenate the bits and start reading DEFLATE blocks.


Lots of 0 bits


The first bit of a DEFLATE block defines if it is the last block and the next two bits specify which compression type was used for that block. DEFLATE defines 3 compression types:

  • 00 = uncompressed
  • 01 = predefined Huffman tables
  • 10 = custom Huffman tables
  • 11 = reserved

For our data:

1  - BFINAL => Last block
01 - BTYPE = 10 => custom Huffman tables

As you can see, here’s another oddity: when we read a fixed number of bits (like for the compression type), we need to reverse them again before interpreting them. In our case, the compression uses custom Huffman tables. That makes sense because our original file contains very specific data (only zero bytes) and also it’s long enough to warrant the extra overhead of specifying a set of custom Huffman tables. As you will see most of the following non-zero data will be the definition of the custom Huffman tables.

DEFLATE is a combination of dictionary compression (LZ77) and Huffman coding. LZ77 represents data as a sequence of literal bytes and references to data seen before (as a tuple of offset into previous data and length of data to copy). The LZ77 sequences are then efficiently encoded into a binary stream by using Huffman encoding. DEFLATE uses two Huffman tables. The first table is the LITLEN table. A LITLEN symbol determines either if a literal byte should be decoded (and which) or if this is a reference of a particular length. If it is a reference we will use the second Huffman table for distances to determine the offset of the reference.

The standard defines a set of fixed tables to be used for blocks where including the tables themselves would be too much overhead (BTYPE = 01). In our case, however, we use custom Huffman tables and so the bitstream starts with a definition of the custom tables. These tables themselves are defined using another type of Huffman table, the code length codes.

Section 3.2.7 of RFC 1951 defines how the custom Huffman tables should be constructed.

It starts with some constants to define how big the tables will be. Let’s continue reading:

10111 = HLIT: number of LITLEN codes defined - 257, here 11101 = 29 + 257 = 286
10000 = HDIST: number of distance codes - 1, here 00001 = 1 + 1 = 2
0111  = HCLEN: number of code length codes - 4, here 1110 = 14 + 4 = 18

So, our custom definition contains 286 LITLEN codes and 2 distance codes. They are represented using the 18 code length codes of the code length Huffman table used to transmit the other Huffman tables. This kind of recursive usage of Huffman tables to specify Huffman tables is a bit mind-boggling first but will hopefully clear up later.

The code length symbols are defined like this:

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

Next we read the 18 code lengths for the code length table by reading 18 times 3 fixed bits of lengths. The lengths read correspond to the symbols in the order 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. Let’s read those:

000 = 0 (symbol 16)
000 = 0 (17)
010 = 2 (18)
000 = 0 (0)
000 = 0 (8)
000 = 0 (7)

000 = 0 (9)
000 = 0 (6)
000 = 0 (10)
000 = 0 (5)
000 = 0 (11)
000 = 0 (4)

000 = 0 (12)
010 = 2 (3)
000 = 0 (13)
010 = 2 (2)
000 = 0 (14)
010 = 2 (1)

We don’t read anything for the last symbol 15 as we should only read lengths of 18 symbols.

So, our code length Huffman table only contains the four symbols 1, 2, 3, and 18 all with bit lengths 2. The specification contains rules how to build a Huffman table from the bit lengths. The result for the code length table is this:

Symbol Bit length Huffman code Extra bits to read Interpretation
1 2 00 - Code length “1”
2 2 01 - Code length “2”
3 2 10 - Code length “3”
18 2 11 7 Repeat code length “0”, 11 - 138 times

Next we use the code length Huffman table to read 286 LITLEN code lengths and 2 DIST code lengths. Note, that the symbol 18 (and others not in use here) above will read multiple ones so you need to keep track of how many code lengths you have read overall.

01           => Symbol 2 => 1 x code length 2 (1 read so far)
11 + 1111111 => Symbol 18 + 7 extra bits => 138 x code length 0 (139 read so far)
11 + 0101011 => Symbol 18 + 7 extra bits => 11 + 1101010 = 117 x code length 0 (256 read so far)
10           => Symbol 3 => 1 x code length 3 (257 codes read so far)
11 + 0000100 => Symbol 18 + 7 extra bits => 11 + 0010000 = 27 x code length 0 (284 read so far)
10           => Symbol 3 => 1 x code length 3 (285 read so far)
00           => Symbol 1 => 1 x code length 1 (286 read so far, i.e. all LITLEN codes)
00           => Symbol 1 => 1 x code length 1 (287 read so far)
00           => Symbol 1 => 1 x code length 1 (288 read so far)

This translates into these Huffman tables (we read only the bit lengths but I already added the actual codes based on the algorithm from the specification and omitted unused symbols).

LITLEN Huffman table:

Symbol Bit length Huffman code Extra bits to read Interpretation
0 2 10 - Literal Character ‘\0’
256 3 110 - End Of Block
284 3 111 5 Copy 227 + extra bytes
285 1 0 - Copy 258 bytes

DIST Huffman table:

Symbol Bit length Huffman code Extra bits to read Interpretation
0 1 0 - Distance 1
1 1 1 - Distance 2

If you compare these tables with the predefined ones in the specification you can already see how little codes are needed for our special case file and with how few bits those can be encoded.

Now that we have finally created the Huffman tables for decompression, we can start decoding the actual data.

If you followed along, this is the remaining bit stream, we have already almost reached the long stretch of 0 bits:


Lots of 0 bits (7736 0 bits to be precise)


Let’s decode:

10 => LITLEN Symbol 0   => Literal byte '\0'
10 => LITLEN Symbol 0   => Literal byte '\0'
0  => LITLEN Symbol 285 => Copy 258 bytes from ...
0  => DIST Symbol 0     => ... distance 1

00 => The same, LITLEN Symbol 285, DIST Symbol 0 => Copy 258 bytes from distance 1
00 => ...
00 => ...
00 => ...
00 => ...
00 => ...

3868 times the same

111 + 10101          => LITLEN Symbol 284 => Copy 227 + 10101 = 248 bytes from ...
0                    => DIST Symbol 0 => ... distance 1
110                  => LITLEN Symbol 256 => End Of Block
0000                 => Padding to align to bytes

So the sequence we decoded is this:

  • 2 times the literal byte ‘\0’
  • 3875 times copy 258 bytes from distance 1 (long stretch of 0 bits in compressed data)
  • copy 248 bytes from distance 1
  • end of block

In total we have decoded 2 + 3875 * 258 + 248 = 1000000 bytes. Mission accomplished!

Now, back to our original question of what limits Gzip compression. As you can see here (and also when looking into the specification), we can achieve the highest compression when we can copy 258 bytes. In the best case, like with these very specific Huffman tables, we can achieve that with one bit for the LITLEN code and one bit for the distance code. The compression ratio is therefore 258 bytes compressed into 2 bits = 8 * 258 bits / 2 bits = 1032 / 1. In the beginning we measured about 1030 : 1. The difference between that and 1032 : 1 is the overhead of having to include the custom Huffman tables.

A follow-up question is why there is the maximum of copying only 258 bytes in the specification. There would have been room for adding greater amounts. I once read an explanation on StackOverflow (cannot find the link right now), that the length value (range 3 to 258) should be able to fit into 8 bits for performance reasons on 8 bit machines (deflate is really old by now). A great thing is that you can ask question about deflate / gzip on StackOverflow and it’s likely that Mark Adler (who still maintains zlib) will answer. Like for the question “gzip compression ratio for zeros” which answers the same question as I have here.

  1. We thought that we would be safe against such attacks because of our streaming infrastructure, which in theory is correct. However, as so often with bugs like this, unfortunate circumstances led to collecting the stream without limiting the amount of collected data. 

Base64 vs. Gzip

For personal projects I mostly try to avoid using databases. Instead, I try to use what I call the “JSON repository” pattern. I create a file structure that contains my data in JSON format. I often use the directory structure as an index by placing the JSON data at well-known places for random access.

For example, I wrote a small web crawler a while ago. To make the process incremental between different runs I created this file structure for caching intermediate results until the next run:

data/                          # Repository root
  head/                        # Results of HEAD requests
    http/                      # protocol             # host
         docs/                 # URI path element 0
           ...                 # ...
             /abc.html.json.gz # result of HEAD request in gzipped JSON format
  response/                    # Results of full GET requests
    # same structure as above
             /abc.html.json.gz # result of GET request in gzipped JSON format
  outlinks/                    # Results of outgoing link analysis
    # same structure as above
             /abc.html.outlinks-v1.json.gz # result of outgoing link analysis in gzipped JSON format

I like this kind of “file system database” because you can browse it with regular tools on the command line. But this shouldn’t be the topic of this post. What I want to talk about in this one is a sub-aspect of storing data in this form.

The good thing about JSON is that it’s human-readable. However, it is less efficient to write, store, and read than a binary format. Also, when you need to embed binary data inside of JSON files you need to encode it in UTF-8 strings which is often done using Base64 encoding. To mitigate at least part of the storage inefficiency of JSON files, I always run gzip on the resulting json. That’s still easy to read on the command line using tools like zless while improving storage space.

Using this scheme, I wondered, how efficient is the encoding of binary data with Base64 first, and then with gzip? More precisely, Base64 wastes space by only using 6 bits out of the 8 bits of a byte. How much of that can be recovered by gzipping Base64-encoded data afterwards? Also, for compressible binary data, how much of the original compression rate can be achieved if data is first Base64-encoded? So, I ran a few experiments.

We have two test bodies:

  • The UTF8 plain text version of Alice In Wonderland from Project Gutenberg (although, there’s no good reason to base64 encode it in json, it serves as a good example of compressible data)
  • 1,000,000 random bytes

Let’s start with plain Base64 and plain gzip encoding:

Encoding Random Bytes of unencoded Alice Bytes of unencoded
raw 1000000 x 1 74703 x 1
base64 1333336 x 1.333 99604 x 1.333
gzip 1000173 x 1.0001 27842 x 0.373

The results are unsurprising. Base64 has a fixed overhead of 33%. As Base64 only uses 6 bits per byte, we would expect an overhead of ca. 8 / 6 - 1 = 1 / 3 = 33%.

Gzip worked well on the text file but slightly increased the size of the random data (which is guaranteed to happen in average on random data).

Now, let’s try base64 encoding first and then apply gzip:

Encoding Random Bytes of unencoded Alice Bytes of unencoded
base64 + gzip 1009020 x 1.009 38064 x 0.510

So, on random, uncompressible bytes, gzip mitigates the effect of base64 encoding, so that only 0.9% overhead are left.

On compressible data, we need less space than the text would have taken in raw, but we lose quite a bit of compression compared to raw compression. The probable reason is that both of gzip’s compression algorithms, LZ77 and Huffman, process input data one byte at a time. Base64’s 8-bit to 6-bit encoding, however, splits up every 3 bytes into 4, so that LZ77 cannot find repetitions as easily because they have to be aligned to those 4-byte base64 blocks.

The Huffman coding is probably also affected for two reasons: First, every character from the Base64 encoding character set will represent 6 bits that were chopped from one or two original bytes. That means that the symbol histogram will not represent the original data bytes symbol histogram perfectly. Second, as the input data takes more space, LZ77 length and distance encoding might require more space (more “extra bits” in the encoding).

To illustrate this effect, I ran a few examples on even simpler input data. Let’s compare gzip compression of raw data versus base64 encoding of the simple string consisting of 1024 times the character e.

If it is compressed raw it can be represented with LZ77 as

  • Literal character e
  • At distance 1 copy 1024 characters (somewhat unintuitive that you can reference data that is not yet decoded but it actually works like this)

This can be very efficiently encoded.

Even, if you would disable LZ77 in Gzip, Huffman coding alone could also represent this string efficiently because it would only have to represent the two symbols:

  • Literal character e
  • End of block

Even if that would be a quite degenerate case of Huffman encoding, you could still encode it in 1025 bits (e.g. 1024 bits of 1 to represent e followed by a single 0 to represent the end of the block, plus the Huffman table).

Now look at the Base64 encoding of the string of 1024 ‘e’s:


It’s quite obvious that it still has some structure but as you can see it exhibits the predicted 4-byte pattern. It can be encoded with LZ77 as

  • Literal string "ZWVl" (base64 representation of "eee")
  • At distance 4 copy 1020/3*4 characters (base64 representation of 1020 'e's)
  • Literal string "ZQo=" (the trailer which is necessary because 1024 doesn’t divide by 3, so that the base64 encoding has a bit of padding at the end)

This can still be quite efficiently compressed but has more complexity than the plain encoding without base64.

However, this is still just a special case because the repetitions of the character ‘e’ are aligned to 3-byte boundaries which translate into periods of 4 characters with base64.

You can see this when looking at the base64 encoding of "123456712345671234567123456712345671234567", i.e. 6 times the string "1234567":


If you look closely you can see that the result is still periodic but only after half of the output. This is because the 7 characters don’t divide by 3 so the next time the string "1234567" aligns at a 3-byte boundary is after 3 periods, that is after 21 input characters or 28 output characters.

It’s pretty clear that this will negatively affect gzip compression afterwards.

Finally, here’s a table showing the difference between compressing the same amount of bytes either of a single character of the string "1234567":

Encoding 4200 * "e" of unencoded 600 * "1234567" of unencoded
raw 4200 x 1 4200 x 1
base64 5600 x 1.333 5600 x 1.333
gzip 40 x 0.0095 48 x 0.0114
base64+gzip 47 x 0.0112 81 x 0.0193

In the last line you can see that adding base64 encoding has only little effect on aligned, compressible data but a much bigger effect when the repetitions in the original data don’t align to the base64 3-byte boundaries.


  • Putting binary data as base64 into json files and Gzipping afterwards works well because gzip mitigates the overhead of base64 almost completely.
  • If the binary data is compressible, you’ll lose some potential for compression by using base64 first. (If you want to offset this, you can gzip the binary data first before applying base64.)
  • Alternatively, you could put binary data not in the json file but next to it on the file system.
  • Or, use another encoding better suited to json, e.g. see this stackoverflow question

See also:

Richard Startin recently did a similar investigation in his blog post about Obfuscated Compressibility.

Why isn't it possible to read gzip files from the middle?

For instance, consider the case of a backup gone wrong and the first 100k of your backup.tar.gz went missing for some reason. Why isn’t is possible to start reading a gzip file (or many other compressed formats) from some position in the middle of the file?

I want to give two answers. The first one is a general one. Compressed formats like gzip are often stream formats. The stream encoder (the sender) and the decoder (the receiver) of the data are both stateful. At the beginning of the file they start from a defined initial state and with every byte (or sometimes even bit) processed both advance their state in lockstep.

Why is that a good idea? For one thing, it is efficient because the state doesn’t need to be transferred explicitly between encoder and decoder. Streaming formats usually also limit the amount of state to keep at any time, so you can decode it without having to keep all of the decoded data as it would be for a random access data format.

The downside of a streaming format is that if you start to read from an arbitrary location inside the stream you don’t know with which state to initialize your decoder, so the decoder doesn’t know how to decode the stream. So, this is the first answer.

The second answer is specific for a gzip / deflate compressed stream. gzip is just a thin wrapper around the deflate algorithm which is specified in RFC 1951. Remember, the deflate algorithm is a combination of the LZ77 algorithm and Huffman coding. LZ77 tries to replace a sequence of bytes with a back reference into the past of the stream. I recommend Julia Evan’s cool visualization about how “The Raven” compresses with LZ77.

The output of the LZ77 algorithm is a stream of symbols where each symbol either represents a literal uncompressed byte or a back reference of a certain length into the last 32kb data of the input. These symbols are then efficiently encoded into bits using Huffman coding. The idea of Huffman coding is to compress a byte1 with less than 8 bits. Sounds magic? Yes, usually you need 8 bits to transfer one of 256 symbols. But only if these symbols are uniformly distributed. If there’s any other distribution, the distribution of symbols already contains some information that you can transfer beforehand (as the Huffman tree in this case) and then transfer the actual bytes with less than 8 bits per byte in average by using short codes for frequent symbols and long codes for infrequent ones.

Here is how gzip/deflate encoders and decoders work in theory2:

  • Encoder: find back references (LZ77) -> Huffman encode
  • Decoder: Huffman decode -> resolve back references

A bit we read on a deflate compressed stream is first passed into the Huffman decoder state machine until that reaches a leaf in the Huffman tree in which case an LZ77 symbol is passed on to the LZ77 decoder. The symbol either represents either a literal byte which can be directly emitted or a back reference into the compression window.

Therefore, the decompressor has two stateful components: a state in the Huffman decoder and the contents of the decompression window to resolve LZ77 back references against. Both these states are missing if we start to read in the middle of a deflate stream. We cannot know where we are in Huffman state machine (though, it seems possible to make good guesses). And even if we could Huffman-decode the stream we could only decode literal bytes but wouldn’t be able to resolve back references because we missed the past bytes of the stream.

With the question answered, I wondered if it is inevitable that stream formats cannot be read from the middle. Of course, it is not :)

Many stream formats feature special synchronization byte sequences (which are otherwise unlikely to appear) to mark positions from where the decoder can restart decoding from a defined state. Adding synchronization is a trade-off, however. Streams are often used as an efficient way to represent data. This is in conflict with adding extra data. The synchronization markers add redundant data but, more importantly, each synchronization point also has to reset the internal state. So, every extra knowledge a decoder may have acquired while reading from the beginning must be discarded for the benefit of a decoder that starts from later in the stream.

Popular stream formats that employ synchronization markers are live audio or video streams for which a consumer might want to join the stream later. Here, it makes perfectly sense to add synchronization points, e.g. every second of the data for the following reasons:

  • a late consumer probably doesn’t care about skipping to the next synchronization point
  • (video) compression uses correlation between subsequent frames to shrink data, state that is useful for compression will probably only be found in frames of the same scene, so it is no loss to discard any state at a cut between scenes (which offers itself as an synchronization point)

This is different for gzip compressed data for which no general statement about data correlation can be made. Nevertheless, there is some half-official way to synchronize inside of a gzip/deflate stream. This document about “ZLIB flush modes” describes various ways the zlib library can insert synchronization points into a deflate stream. To be actually able to start reading from any synchronization point, aside from emitting a synchronization sequence, you must also reset both LZ77 (i.e. discard the compression window) and the Huffman encoding. A “Sync Flush” emits an empty uncompressed block whose header contains the well-known byte sequence 00 00 FF FF and then starts a new compressed block. This resets the Huffman encoding. However, it doesn’t reset the LZ77 encoding so you may still encounter back references to data that appeared before the synchronization point. For this reason a Sync Flush is not used as an actual synchronization point to allow starting to read from this position but instead it is used to fill the output buffer with padding bits so that full bytes can emitted. To reset the LZ77 compression window, as well, you need to emit a “Full Flush” which looks exactly the same but which guarantees that the encoder won’t emit back references that point to data from before the synchronization point.

Here are some links to read more about gzip/deflate compression.

  1. In fact, in deflate it’s not bytes (256 symbols) that are Huffman-encoded but either a byte or a back reference. These are represented by one of 285 symbols for which you would usually need 9 bits (with several values unused). See the spec for the details. 

  2. In practice, it is a little bit more complex as the complete stream can be divided in separate blocks on the Huffman level to allow different Huffman trees for different parts of the file which makes sense as symbol distribution may change over the course of a stream of data.