The Virtual Void A blog by Johannes Rudolph

Finding the top-p elements as used in Nucleus Sampling

Here’s a quick report on some experiments on algorithms to figure out the top-p elements of a distribution.

Why Nucleus Sampling?

The final stage of the current LLMs need to figure out how to generate the next token given a distribution over all tokens. Using the maximum element directly (greedy search) is the simplest solution but does not lead to interesting results. On the other end of the spectrum, sampling from the full distribution is risky because a significant share of the probability mass may consist of a large number of low probability tokens and accidentally selecting an unlikely token may lead the further text generation astray.1

Many potential improvements have been proposed. A simple, and widely-supported one is top-k sampling, where the k most probable tokens are selected and the probability mass is redistributed over those k tokens. It is easy to see when this can go wrong, e.g. in a common scenario where there’s a single token with a very high probability and all other tokens are similarly unlikely. Or, in the opposite case, of a somewhat flat distribution over many tokens where that flat part is cut short seemingly arbitrarily.

A more sophisticated approach is top-p sampling, also known as Nucleus Sampling. Here, the top tokens are selected until the cumulative probability mass exceeds a threshold p.

While working on llama2.scala, the practical problem of how to figure out the top-p elements out of the 32000 potential tokens (of the commonly used vocabulary), can be selected in a performant way. This is only relevant for small models where sampling time might become significant.

I implemented a few algorithms and compared them.

Naive Sorting

The naive approach is to sort the tokens by their probability and then select the top-p elements keeping track of the cumulative probability. This is the baseline to beat. This is simple to implement using a sorting implementations from your runtime of choice.

Since we need to keep track of the original indices of the elements (i.e. the identity of the token in the vocabulary), sorting can be done on the list of indices and then the original elements can be looked up. Alternatively, for better locality, the probabilities and indices could be kept together (leading to better locality but a more expensive swap operation).

Drawbacks of this approach:

  • In the common case only few elements will be selected, so sorting the whole array is wasteful.
  • The sorting algorithm does not take some properties of the distribution into account:
    • The distribution is likely to be skewed (expecting an inverse power law distribution)
    • The distribution sums up to 1

Filter out some elements and then sort

The first improvement is to filter out some elements that are impossible to be selected and then sort the remaining ones going on with the naive approach.

To figure out a conservative bound, consider this: In the edge case, only one element is selected that has exactly the probability p. In that case, all the remaining n-1 elements will have to share the remaining probability of 1-p. Thus, all elements below (1-p) / (n-1) can be filtered out immediately because they will never be selected. Note, how the denominator will only get smaller when more elements are selected. Also, when the single element has a higher probability than p, the numerator will be bigger. Thus, any other scenario will lead to an even higher value, so the given value is a conservative bound.

Filter out some elements and then iteratively select the next element

Instead of using a full-blown sort, the idea is here that after filtering we will be left with a small enough number of elements that using something like a selection sort becomes feasible (especially in the common case). In this algorithm, we keep track of the cumulative probability and select the next biggest element as long as the cumulative probability is still below p.

The runtime depends on the numbers of elements that will ultimately be selected. In the worst case, all elements will be selected and the list is traversed k times. One way to speed up the iterations, is to take into account, that we know the probability mass that is still available in each iteration. If we find a new maximum element that is bigger than half of the remaining probability mass, we can stop this iteration early and go on with selecting the next element.

Build a histogram of elements

Observing that selecting the top-p elements is equivalent to finding the probability of the last element that is still included in the selection, one idea is to build a histogram of the elements to filter in and out whole buckets after a single iteration over the values.

After the first step, we know a more precise bound on the probability of the last element. We can then only operate on the bucket that contains the last element and find the exact result.

A difficult question in this algorithm is how to set up the buckets. Even with top-p, the last element might already be part of a long, flat tail, so that regardless of exact choice of buckets, the last element might be part of a big bucket. Another consideration has to be how to practically calculate the bucket without spending to much computation time on it. A seemingly obvious choice might be to use a logarithmic scale, but calculating logarithm is expensive. Another possibility would be to extract the exponent of the floating point number and use that as the bucket.

Use a quick select algorithm

The quick select algorithm is a well-known algorithm to find the k-th smallest element in an array. It is a variant of the quick sort algorithm that only recurses into the part of the array that contains the k-th smallest element (and avoids sorting as much as possible).

I found it hard to implement, it’s easy to get the indices slightly wrong and choosing the right pivot is difficult. Also, care needs to be taken that the algorithm actually terminates. Depending on the choice of pivot it can happen that the algorithm does not make progress (WHY? Is that even true or a bug in the implementation?) It seems the way this is dealt with is by choosing the pivot randomly which means that eventually the right pivot will be chosen to make progress (there’s always one that works).


Running bench/jmh:run -f1 -wi 2 TopP* at 7f0774b3 on my machine gives the following results:

             sorting  thrpt    5    122.276 ±   1.498  ops/s
       filterAndSort  thrpt    5   2417.850 ± 135.264  ops/s
   filterAndFastSort  thrpt    5  17899.529 ± 600.464  ops/s
       filterAndScan  thrpt    5  23479.921 ± 895.475  ops/s
quick-find-max-top-p  thrpt    5   5294.213 ± 302.202  ops/s
           histogram  thrpt    5  19456.664 ± 210.921  ops/s
        quick select -- has still bugs


  • sorting = Naive idiomatic Scala sorting
  • filterAndSort = Filter out some elements and then sort, idiomatic Scala
  • filterAndFastSort = Filter out some elements and then sort, avoid expensive Scala collections
  • filterAndScan = Filter out some elements and then iteratively select the next element
  • quick-find-max-top-p = Tries to use filtering and selecting in each step, iterating over all elements all the time
  • histogram = Build a histogram of elements, then use previous algorithm on the last bucket

The results are an average over 100 different distributions (generated by running the stories15M model).


  • Just doing the filtering does most of the work. Due to its simplicity this is what I contributed to llama2.c.
  • Doing filtering and sorting in Scala requires a bit of work to avoiding the overhead of Scala collections.
  • Avoiding a full-blown sort and replacing it with a selection sort gives a little bit of extra performance.
  • Building a histogram before doing the selection sort steps, does not seem to be quite worth it. In theory, it should further reduce the number of elements to consider for the selection sort, but the additional complexity is not well amortized.
  • In general, a first filtering step is worth it, maybe just because it reduces the amount of data by such a degree that further iterations over all elements are much cheaper than before.
  • The results are an average over 100 distributions. In many cases, only a handful of elements is selected (for top-p = 0.9) A heuristic could be developed to detect cases with very few elements (e.g. while filtering count elements above a certain threshold), and use a fast-path for those cases.
  • More work is needed to get the quick select algorithm right.


You can find the implementations and benchmarks in the top-p-benchmarking branch.


This was a fun experiment doing some algorithm engineering and benchmarking. I hope you enjoyed it as well!

  1. This is a consequence of autoregressive LLMs “just predicting the next word/token”. This is often stated as an seemingly intuitive drawback of how LLMs are trained and how they generate text. For me, it is not all that clear, why predicting just the next token is obviously weak. That said, of course, this main architectural property has consequences. As I see it, a (2023 generation) LLM is a very complex state machine that can solve a (maybe surprisingly) wide range of tasks given enough depths/width/parameters. The state is built up by calculating key/value pairs for each token for each layer (and multiple heads) of transformers. The whole machine is instructed and clocked token-by-token of the sequence. In generation mode, the machine generates its own next instruction auto-regressively. So, the choice of the next token is very important as the machine cannot take back its choice. The only thing it might do, is try to weasel itself out of a bad choice by generating more text. Note, how this kind of backtracking by adding apologies, is only applicable to some language generation situations. For example, in a chat situation, the machine can apologize and use more user input as ways to take the conversation further getting meaningful back on track. However, e.g. when generating code or generating under a stricter regime, this might not be feasible and will have to lead to incoherence, hallucinations, or outright garbage. 

Experiments with io_uring

This is a quick writeup of my experiments with io_uring.

What is io_uring?

io_uring is a generic, consistent API that allows to (batch) submit Linux kernel IO operations in an asynchronous way ultimately requiring few or zero actual syscalls. It uses a pair of buffer rings for submission and completion data shared between user and kernel space (but that’s just an implementation detail).

To deconstruct the above sentence:

  • generic: io_uring is generic because it supports a growing list of IO-related syscalls
  • consistent: The submission and completion mechanism is the same for different kinds of APIs. Submission entries must by default fit into the 64-byte submission entry, completions into a 32-byte completion entry. For some operations the size of a single entry must be doubled which can be configured during creation of the ring.
  • batch submit: Any number of calls (up to the configured size of the submission ring) can be submitted in one go.
  • asynchronous: The API itself is asynchronous in that submission of an operation will not always lead to immediate completion. Instead, some operations may only be completed later. By default, submission are directly executed inline (in the same thread) during submission if possible. Otherwise, operations may only be completed later.
  • few or zero syscalls: Submissions and completions are serialized into shared memory and can (in theory) be accessed by the kernel and the user application concurrently. The default behavior requires a single io_uring_enter call to make the kernel aware of the latest submissions. An optional SQPOLL mode allows the kernel to run a dedicated thread to poll submission from the application in which ultimately no syscalls are required any more (though, then kernel and application need to (busy-) poll the queues).
  • buffer rings: Application and kernel communicate using shared memory and a submission/completion protocol. In theory, this can be faster than syscall based IO submission.

Why io_uring?

syscalls can become a bottleneck in applications that require lots of IO. Batch submission and the potential polling mode means you can trade off different latency and throughput requirements.

Until io_uring came along, Linux did not have a proper asynchronous interface for doing buffered 1 file IO. In general, io_uring can hide an internally blocking operation inside its own blocking thread pool just providing an asynchronous interface to an actually blocking operation (similar as e.g. Java NIO asynchronous file operations were implemented). Adding more (internal or not) thread pools somewhat defeat the purpose of asynchronous execution (which should be as light-weight as possible). So, later an asynchronous buffered read mode was proposed and added to the kernel (and io_uring) for file-systems that support it.

Registering buffers and file descriptors

Using user-space buffers and file descriptors in the kernel comes with some overhead. Therefore, it is possible to preregister buffers and descriptors using the io_uring_register syscall.

In many server applications that mutiplex handling of multiple connections with a single thread, the connection is first polled to be ready for reading, then a buffer is allocated, and finally the data is read. With io_uring it is possible to avoid the polling step completely and read data asynchronously directly. However, that means that buffers need to be provided upfront for every waiting read call (which in a web server would be all idle connections). To avoid a per-read (per-connection) read buffer for every outstanding read, with io_uring buffers can also be provided more dynamically using a special operation, IORING_OP_PROVIDE_BUFFERS.

You can then use IOSQE_BUFFER_SELECT for submissions to have io_uring select a buffer from the given set when needed.

A downside of this method of providing buffers is that the kernel may choose buffers arbitrarily. After use, the application has to provide them again. While the original IORING_OP_PROVIDE_BUFFERS call can use a contiguous amount of memory to provide multiple buffers at once, spent buffers will fragment that memory over time. The application will either have to

  • provide them back to io_uring in a one-by-one fashion after usage (spending one SQE per buffer after usage), or
  • keep track of which buffers have been redeemed, try to coalesce their memory regions and provide them later with fewer SQEs

In later kernels, it is possible to use IORING_REGISTER_PBUF_RING to create yet another ring just for buffer management, so that providing buffers can happen more efficiently without spending SQEs just for providing single small buffers.

Execution Modes

As shown in the Missing Manuals - io_uring worker pool, an IO operation can go through three different paths when submitted:

  • Be executed directly inline during the execution of io_uring_enter
  • Is not ready but can be polled for readiness. In that case, it is automatically polled. When the operation is ready, it is retried and completed.
  • If it only can be executed in a blocking fashion, it is enqueued to a run queue to the blocking kernel worker thread pool and executed and completed when done.

Inline execution

Simple networking applications (like an echo server or a simple HTTP server) will spent a significant portion of CPU time in the kernel doing TCP work. Whether you use regular syscalls or io_uring, by default much of the TCP processing time will be spent on a user thread (especially on the write side).2

The test application is a simple HTTP server which fakes request processing:

  • Has a simple (single-threaded) event processing loop which blocks on completion events from io_uring (setting the min_complete argument of the io_uring_enter syscall to 1).
  • Accepts any amount of incoming connections.
  • Tries to read an incoming HTTP request (but does not validate it at all) and sends out a response (it will never have to wait on network writes).

Here is a flame graph of simple HTTP server processing using io_uring (click for interactive version):

Flamegraph of inline execution of socket read/write

The blueish and green frames on the left side are the application. Since this application does not do any actual processing, it is only a very small share of total processing time.

The big red humps are frames executing in the kernel. All of those frames are on top of the io_uring_enter call, i.e. executing while doing the submission call. The biggest share of that call is the big hump in the middle which is the actual TCP sendmsg implementation. You will see the same frames on a blocking server as well.

The small part on the left seems to be the overhead of actual submitting events using io_uring.

The small hump on the right is part of the TCP recvmsg implementation.

Async submission

There’s an alternative mode that can be selected using the IOSQE_ASYNC flag per SQE where the task is not executed inline at all but queued for io_uring’s internal worker queue.

One problem when used together with IOSQE_BUFFER_SELECT shows up in the flamegraph (click through for interactive flame graph):

Flamegraph of buffer selection contention with async flag

io_uring will spawn a massive amount of worker threads and all these threads contend on acquiring a pre-provided buffer which will slow everything down.

I filed an issue about that and it was confirmed that using this usage pattern in particular but more generally IOSQE_ASYNC is not recommended. In any case, you should probably stay away from using IOSQE_BUFFER_SELECT when you expect the requests to end up on the internal worker pool.

Inline vs. Async

It may be surprising that using an asynchronous interface may not automatically lead to asynchronous execution or more parallelism. For application servers that are single-threaded, there is the option to force spreading the internal kernel load to the internal worker queue (if you have cores to spare) by using IOSQE_ASYNC. However, as seen, this comes with extra overhead and potential extra scalability issues.

For CPU-bound workloads the inline execution behavior seems useful as it will provide the fastest execution path. For low-latency applications the time spent in the kernel might be an unwanted interruption but for lowest-latency you will anyway need spare CPU resources to avoid any extra delays. In that case, distributing work over multiple worker threads each owning a separate ring (and maybe polling) should give the best solution and best control over how resources are spent.

Usage from the JVM

To support io_uring from the JVM, only a few native calls need to be implemented:

  • Access to the io_uring_setup, io_uring_enter, and io_uring_register syscalls (e.g. by supporting all syscalls by linking to libc’s syscall symbol)
  • Memory mapping the io_uring data structures into memory
  • A way to manipulate the native (non-JVM-heap) memory of the rings and IO buffers.

The JVM’s standard way of interfacing with the native world is using JNI. We only need to provide simple wrappers for three syscalls with simple interfaces so this is simple to do. Alternatively, using a simple JNA declaration like this also works:

import com.sun.jna.Library;
import com.sun.jna.Pointer;
import com.sun.jna.Structure;

public interface LibC extends Library {
  int syscall(int number, Object... args);
  Pointer mmap(Pointer addr, long length, int prot, int flags, int fd, long offset);

Using JNA will introduce some overhead over JNI, but it seems relatively fine compared to all the other work.

After mapping a few of io_uring’s internal data structures, creating a uring can look like this:

val params = new IoUringParams // not in the fast-path, use JNA mapped data structured
val uringFd = libC.syscall(LibC.SYSCALL_IO_URING_SETUP, sizeOfRing, params)

val length = params.sq_off.array + params.sq_entries * 4

val sqRingPointer = libC.mmap(new Pointer(0), length, LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQ_RING)
val cqRingPointer = libC.mmap(new Pointer(0), params.cq_off.cqes + params.cq_entries * 16 /* sizeOf(IoUringCqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_CQ_RING)
val sqePointer = libC.mmap(new Pointer(0), params.sq_entries * 64 /* sizeOf(IoUringSqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQES)

// FIXME: check all result values

// convert JNA pointers to DirectByteBuffer
val sqRingPointerByteBuffer = sqRingPointer.getByteBuffer(0, length)
val cqRingPointerByteBuffer = cqRingPointer.getByteBuffer(0, params.cq_off.cqes + params.cq_entries * 16)
val sqes = sqePointer.getByteBuffer(0, 64 * params.sq_entries)

From here on, all datastructures are manipulated using the DirectByteBuffers whose operations (hopefully) are intrinsified and therefore cheap. JNA’s getByteBuffer is expensive, though, so make sure to create buffers spanning the whole interesting region and reuse them later. For example, to access some parameters, you can use code like this:

def sqHead() = sqRingPointerByteBuffer.getInt(params.sq_off.head)
def sqTail() = sqRingPointerByteBuffer.getInt(params.sq_off.tail)
val sqMask = sqRingPointerByteBuffer.getInt(params.sq_off.ring_mask)

def cqHead() = cqRingPointerByteBuffer.getInt(params.cq_off.head)
def cqTail() = cqRingPointerByteBuffer.getInt(params.cq_off.tail)
val cqMask = cqRingPointerByteBuffer.getInt(params.cq_off.ring_mask)

Here is one way of representing an operation with a case class (which is not quite optimal because of the extra allocation):

case class ReadOp(
    flags:             Byte,
    fd:                Int,
    destinationOffset: Long,
    targetBufferAddr:  Long,
    targetBufferSize:  Int,
    userData:          Long,
    bufferGroup:       Short = 0 // FIXME: buffer selection should be abstracted
) extends Op(IORING_OP_READ) {
  override def prepareSQE(buffer: ByteBuffer): Unit = {
    buffer.putShort(0 /* ioprio */ )

An operation can be submitted like this:

def submitGlobal(op: Op): Unit = {
  // FIXME: check against overflow
  val curTail = sqTail()
  val index = curTail & sqMask
  sqes.position(64 * index)
  sqRingPointerByteBuffer.putInt(params.sq_off.array + 4 * index, index)
  // FIXME: write barrier needed?
  sqRingPointerByteBuffer.putInt(params.sq_off.tail, curTail + 1)
  // FIXME: write barrier needed?

That is, we first figure out the tail of the ring to write to, then prepare an operation using the above method, put the index into the SQE array, and finally increase the tail index. (I’m not sure if the barriers are needed in a simple application, but if multiple threads access the ring, you definitely should add the barriers, e.g. using Unsafe).

Finally, submitting all queued SQEs to the kernel is done by invoking io_uring_enter (the only necessary syscall in the hot path):

val toSubmit = sqTail() - sqHead()
libC.syscall(LibC.SYSCALL_IO_URING_ENTER, uringFd, toSubmit, 1, IORING_ENTER_GETEVENTS, 0)

After the call returns, you can check for completions:

var cqIdx = cqHead()
val last = cqTail()

while (cqIdx < last) {
  val idx = cqIdx & cqMask
  // accessing cqe data via JNA is more expensive but also possible:
  // val cqe = new IoUringCqe(cqRingPointer.share(params.cq_off.cqes + idx * 16))
  // debug(cqe.toString)
  val off = params.cq_off.cqes + idx * 16
  val userData = cqRingPointerByteBuffer.getLong(off)
  val result = cqRingPointerByteBuffer.getInt(off + 8)
  val flags = cqRingPointerByteBuffer.getInt(off + 12)
  handleCompletion(userData, result, flags)
  cqIdx += 1
// give back all handled entries (FIXME: memory barriers)
cqRingPointerByteBuffer.putInt(params.cq_off.head, last)

Here we loop through all new CQEs, extract their data (again using the previously created DirectByteBuffers) and defer handling to a completion method.

A final problem is creating or accessing file descriptors from the JVM. It is possible to extract file descriptors from Java RandomAccessFile and Socket using Java reflection from their internals.

In newer kernel versions, more syscalls are available and it might be possible to open files and sockets using io_uring.

The full source code can be found at It contains more of infrastructure to provide a more idiomatic API, however, everything should be considered experimental (checking return values, anyone?).

Much more

There’s, of course, a lot more to io_uring not covered here:

  • More flags like IOSQE_IO_LINK or IOSQE_IO_DRAIN
  • Full set of supported ops
  • Cancelling requests


  1. Buffered means going through the page cache. I.e. when calling read the file is mapped into memory and then copied into user buffers. 

  2. If you ever wondered where the 5 times difference between the “Json” and the “Plaintext” tests in the Techempower Benchmark comes from, that’s the reason. The json example uses regular HTTP requests where only a single request can be ongoing per TCP connection. In that case you always require at least one TCP/IP packet per request and response. The plaintext benchmark uses HTTP pipelining, which ultimately allows to transmit multiple HTTP requests and responses in a single TCP/IP packet which decreases the in-kernel CPU overhead of TCP massively. 

Building a blinky sign for Twäng (part 1)

Twäng released its award-winning debut album POP! in September 2021. To celebrate the release I built a small piece of eye-catching hardware.

The sign was shown in a local news item (German): screenshot


Background Music

Before continuing, please head over to one of the online places where you can listen to the music and put it on in the background ;)

POP! Cover

or watch a recent live performance (without my participation).


  • WS2812B RGB LEDs on a small PCB (e.g. from ebay or aliexpress)
  • Raspberry Pi Pico
  • Another Raspberry Pi Pico (using picoprobe for debugging)
  • 0.6mm copper wire
  • connection wires
  • picture frame
  • bubble wrap


  • Soldering Iron
  • Side Cutters
  • Third-Hand Tool

Building Log

First version assembly

After one of the rehearsals for our small release concerts I had the idea of creating a small blinking sign.

Should I order a LED matrix to program or could I build something simpler myself? Looking around on youtube for inspiration about how to get creative with WS2812B, I found someone handling a string of these LEDs each mounted on a small PCB. The LEDs were hold soldered together with a piece of wire.

That sounded promising, so I ordered a pack of 50 WS2812B LEDs from ebay on a whim. They arrived two days later and came as a single PCB panel with v-cuts to break them off easily.

I got to work and soldered together the first few pixels. At the input side, I soldered a few wires with pin connectors to try them out with the Raspberry Pi Pico.

First Attempts

The pico-examples repository contains examples for driving WS2812B. I compiled it without changes and uploaded it to the Pico using UF2. For the experiments, the data in pin was connected to the GPIO that was configured in the example. The +5V connection was connected to the 3V3 for now.1

Each WS2812B RGB LED, in fact, contains three LEDs of red, green, and blue color, together with a small IC for driving these LEDs.

They only use a single digital data line that needs to be chained through all the LEDs. Since there’s no physical clock line, the data rate is predefined and needs to be adhered to accurately to ensure proper transmission. Internally, each LED has a shift register which receives the 24 initial bits (the color value ordered as GRB), and propagates all the remaining bits. This happens until a RESET code is received (low voltage level for at least 50µs). Then all LEDs latch the received values into their display register and use PWM to set the perceived brightness per color led.

Connecting to the Pico and running the pico example without changes worked well and showed the predefined animation patterns of the example on the first three pixels.

This looked promising so I came up with a design about how to wire more of those LEDs to form the letters “pop!”.

First Letter Design

The copper wires are quite stiff, so you can form arbitrary shapes and the resulting form will be stiff enough for careful use.

Producing the rounded forms like the letter ‘o’ or ‘p’ was challenging. You need to bend three small pieces of copper wire into a curved form making sure that those three wires which are only a few millimeters apart do not touch. After a bit of trial and error, the best technique was to start with the inner of those three wires and solder those first.

In my first attempts I tried to precut the wire into the right size but I could not handle the soldering of these small parts with my fumbly fingers. A better way was to keep the wire on its reel while soldering the first connection, and only cutting it right before soldering the other end. The third hand tool was indispensable, as you can imagine!

First Letter Done

The first test on the full first letter revealed that the color scheme looked weird.

First Letter Test

It turned out that the pico WS2812B example, expects a different LED type by default that uses a fourth color channel for a white LED. The boolean rgbw parameter passed to ws2812_program_init controls how many bits are sent per LED, so it was an easy fix to run with the RGB variant of the pixels.

I continued with my quest and soldered another letter.

Second Letter Done

As you can see, some bridges are needed, first, to connect the inner rail of the power supply to the inner rail of the next letter (I chose to run all letters in the same direction, so the inner rail was +5V). Also, the Dout of the last pixel of one letter needs to be connected to the Din of the first letter of the next letter.

After three letters done, I started doing the first little bit of programming just to give each letter its own color:

Pop Letters Done

Finally, I had also finished the exclamation mark and real programming could begin.

All Letters Done

Simple colors firmware

First, constants were defined for the colors per letter.

// COLOR_BRG converts R,G,B value to a 24bit BRG value as expected by WS2812B
const uint32_t p_1_color = COLOR_BRG(240, 5, 2);
const uint32_t p_1_o_color = COLOR_BRG(20, 0, 40);
const uint32_t o_color = COLOR_BRG(12, 35, 120);
const uint32_t o_p2_color = COLOR_BRG(28, 70, 6);
const uint32_t p_2_color = COLOR_BRG(200, 40, 0);
const uint32_t p_2_excl_color = COLOR_BRG(90, 13, 1);
const uint32_t excl_color = COLOR_BRG(150, 12, 8);

Determining the color values was not as straight-forward as expected. Turning on all LEDs with the same brightness value, did not gave a neutral white but was heavily shifted to blue. Two potential explanations come to mind:

  • colors are just not calibrated well
  • a voltage supply smaller than 5V might affect different colors in different ways (red LEDs have a smaller voltage drop than blue ones and there’s no information about how WS2812B implement the constant current source for the LEDs and how it is affected by input voltage)

In the end, I found that a balanced white had an RGB value of 255, 130, 80.

Pixel are chained one after each other and when sending data to them. You send color values in the same order as the pixels are wired. Here is a schematic of how the pixels are wired (this schema belongs to a latter revision while code and pictures above to the first, sorry for that):

Pixel IDs

The association between pixel IDs and letters was done using a simple bitmap:

const int num_leds = 32;
const int p_1_leds[] = {1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_leds[] =   {0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0};
const int o_p_2_leds[]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; // only want that middle one
const int p_2_leds[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,0,0};
const int excl_leds[]= {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1};

By design, some pixels are shared between two letter and get their own colors.

Transmitting the colors works like this:

void show_all(uint t) {
    for (int i = 0; i < num_leds; ++i) {
        if (p_1_leds[i] && o_leds[i])
        else if (p_1_leds[i])
        else if (o_p_2_leds[i])
        else if (o_leds[i])
        else if (p_2_leds[i] && excl_leds[i])
        else if (p_2_leds[i])
        else if (excl_leds[i])

That is, we cycle over all the LEDs by id, figure out which letter the pixel belongs to, and transmit the desired color.

Here’s the result:

Initial Letter Colors

As a finishing touch I put it into a frame and added a sheet of paper and some bubble wrap for diffusion and style.

Framed Version v1

Improved hardware version

User testing revealed that the letter shaping was not easily readable. This is no surprise as the initial version only had ~ 30 pixels which doesn’t allow for much details. I had tried to mimic the album art by fusing adjacent letters. The result was nicely colorful but still not an obvious representation of the album cover.

Only a few days before the concert, I decided to redo the hardware slightly. Last minute changes to hardware are much more frightening than doing software fixups.

On one hand, there’s only one piece of hardware available which already works, and if it gets broken in the process, there’s no way of going back other than fixing whatever breaks. On the other hand, the software already depends on the topology of the hardware and also needs to be changed accordingly.

After all I overcame my fears and attempted some cosmetic surgery…

Letter Surgery

I added a few pixels here and there and made the separation between the letters and the exclamation mark clear.

Final setup v2

The backside:

Final setup v2

Leading to the final result:

Framed v2


With the hardware done, I could focus on animations. There are three kinds of animations:

  • Static colors (as seen above) with a slight shimmer by randomly fluctuating pixel colors
  • Fireworks with rockets and flashing letters
  • A snake wandering through the letters

Here’s a shot from the fireworks animation:

Fireworks still

More about the animations in a follow-up blog post.

In the meantime, you can find the code at

  1. For full brightness you can run WS2812B at +5V, but then you need to make sure that the signal level of the data line get to 0.7 times 5V which is 3.5V and therefore slightly too high to be driven by a 3V3 microcontroller. With my pixels both using the VBUS pin (nominally at 5V USB voltage) and VSYS pin (VBUS - the voltage drop of a Zener Diode of ~0.3V, i.e. at ~ 4.7V) of the Pico both also work flawlessly. 

Maximum compression ratio with Zip/Gzip/Deflate

Quite a while ago an issue was reported for Akka HTTP where incoming data, compressed using Content-Type: gzip, was uncompressed without applying any limits, in the worst case leading to an OutOfMemoryException and a shutdown of the server. This can be used for a classical denial of server attack, a “zip bomb”, where compression is used to amplify the amount of data an attacker has to send to bring down a machine.1

To reproduce the attack I gzip-compressed files of various sizes that contained a repitition of a single character for maximum compression. It turned out that there seems to be a hard compression ratio limit with gzip. This is somewhat surprising as, in general, with run length encoding you could represent such a repetition with log(n) compressed bytes.

Here’s a table of various compressed file sizes (all bytes 0):

Original Size Gzip Size Ratio
1000 29 34.5 : 1
1000000 1003 997 : 1
1000000000 970501 1030 : 1
10000000000 9704731 1030 : 1

So, the ratio seems to converge to approximately 1030 : 1. Why is that the case? We’ll try to find out in the rest of this post.

Let’s have a look at how these 1000000 bytes look in compressed form:

$ dd if=/dev/zero bs=1000 count=1000 status=none|gzip|hexdump -C 
00000000  1f 8b 08 00 19 85 d9 5b  00 03 ed c1 01 01 00 00  |.......[........|
00000010  00 82 20 ff af 6e 48 40  01 00 00 00 00 00 00 00  |.. ..nH@........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000003e0  00 af 06 9e cb 79 12 40  42 0f 00                 |.....y.@B..|

Weird, most of the file is just zero bytes (which hexdump abbreviates with *). Let’s have a look at the specifications for gzip, RFC 1951 (DEFLATE) and RFC 1952 (GZIP) to make sense of it.

When I do these manual decoding exercises, I usually copy the hex data of a file into an editor and then snip off bytes from the beginning and annotate them with their meaning, keeping the tail of undecoded data at the bottom. If you want to follow along you can copy the data from the above hexdump into an editor and follow the steps.

Gzip is just a small wrapper around deflate compressed data. The Gzip header is at least 10 bytes:

         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)

For our file:
ID1 + ID2: 1f 8b 
CM:        08 -> DEFLATE compressed
FLG:       00 -> no extra data
MTIME:     19 85 d9 5b -> time of compression
XFL:       00 -> no extra flags
OS:        03 -> Unix OS

As no flags (FLG) are specified the rest of the file is DEFLATE compressed data (if you use gzip <filename>, instead of piping the input data in, the gzip header also contains the original filename). At the end of a gzip file there’s a trailer containing a CRC32 over the uncompressed data (here: 9e cb 79 12) and 4 bytes of the size of the uncompressed data modulo 2^32 (here: 40 42 0f 00 => 0xf4240 = 1000000).

So now we we’ve have dealt with all the Gzip related data, we can decode the DEFLATE compressed data which starts at offset 0x0a:

ed c1 01 01 00 00 00 82 20 ff af 6e 48 40 01 00
# Lots of 00 bytes
00 af 06

DEFLATE compressed data needs to be interpreted as a bitstream. The order in which to read the bits from the bytes and how to interpret them is the single most confusing thing about the DEFLATE specification. RFC 1951 says the bits of a byte are ordered like this:


That is, the least-significant bit of each byte needs to be read first. Let’s convert our data byte-by-byte into a bit stream

ed       c1       01       01       00       00       00       82

11101101 11000001 00000001 00000001 00000000 00000000 00000000 10000010

20       ff       af       6e       48       40       01       00
00100000 11111111 10101111 01101110 01001000 01000000 00000001 00000000

Lots of 00 bytes

af       06
10101111 00000110

Now we have to read the bits in each byte from right to left to get the final bitstream:

10110111 10000011 10000000 10000000 00000000 00000000 00000000 01000001

00000100 11111111 11110101 01110110 00010010 00000010 10000000 00000000

Lots of 0 bits

11110101 01100000

We can now just concatenate the bits and start reading DEFLATE blocks.


Lots of 0 bits


The first bit of a DEFLATE block defines if it is the last block and the next two bits specify which compression type was used for that block. DEFLATE defines 3 compression types:

  • 00 = uncompressed
  • 01 = predefined Huffman tables
  • 10 = custom Huffman tables
  • 11 = reserved

For our data:

1  - BFINAL => Last block
01 - BTYPE = 10 => custom Huffman tables

As you can see, here’s another oddity: when we read a fixed number of bits (like for the compression type), we need to reverse them again before interpreting them. In our case, the compression uses custom Huffman tables. That makes sense because our original file contains very specific data (only zero bytes) and also it’s long enough to warrant the extra overhead of specifying a set of custom Huffman tables. As you will see most of the following non-zero data will be the definition of the custom Huffman tables.

DEFLATE is a combination of dictionary compression (LZ77) and Huffman coding. LZ77 represents data as a sequence of literal bytes and references to data seen before (as a tuple of offset into previous data and length of data to copy). The LZ77 sequences are then efficiently encoded into a binary stream by using Huffman encoding. DEFLATE uses two Huffman tables. The first table is the LITLEN table. A LITLEN symbol determines either if a literal byte should be decoded (and which) or if this is a reference of a particular length. If it is a reference we will use the second Huffman table for distances to determine the offset of the reference.

The standard defines a set of fixed tables to be used for blocks where including the tables themselves would be too much overhead (BTYPE = 01). In our case, however, we use custom Huffman tables and so the bitstream starts with a definition of the custom tables. These tables themselves are defined using another type of Huffman table, the code length codes.

Section 3.2.7 of RFC 1951 defines how the custom Huffman tables should be constructed.

It starts with some constants to define how big the tables will be. Let’s continue reading:

10111 = HLIT: number of LITLEN codes defined - 257, here 11101 = 29 + 257 = 286
10000 = HDIST: number of distance codes - 1, here 00001 = 1 + 1 = 2
0111  = HCLEN: number of code length codes - 4, here 1110 = 14 + 4 = 18

So, our custom definition contains 286 LITLEN codes and 2 distance codes. They are represented using the 18 code length codes of the code length Huffman table used to transmit the other Huffman tables. This kind of recursive usage of Huffman tables to specify Huffman tables is a bit mind-boggling first but will hopefully clear up later.

The code length symbols are defined like this:

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

Next we read the 18 code lengths for the code length table by reading 18 times 3 fixed bits of lengths. The lengths read correspond to the symbols in the order 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. Let’s read those:

000 = 0 (symbol 16)
000 = 0 (17)
010 = 2 (18)
000 = 0 (0)
000 = 0 (8)
000 = 0 (7)

000 = 0 (9)
000 = 0 (6)
000 = 0 (10)
000 = 0 (5)
000 = 0 (11)
000 = 0 (4)

000 = 0 (12)
010 = 2 (3)
000 = 0 (13)
010 = 2 (2)
000 = 0 (14)
010 = 2 (1)

We don’t read anything for the last symbol 15 as we should only read lengths of 18 symbols.

So, our code length Huffman table only contains the four symbols 1, 2, 3, and 18 all with bit lengths 2. The specification contains rules how to build a Huffman table from the bit lengths. The result for the code length table is this:

Symbol Bit length Huffman code Extra bits to read Interpretation
1 2 00 - Code length “1”
2 2 01 - Code length “2”
3 2 10 - Code length “3”
18 2 11 7 Repeat code length “0”, 11 - 138 times

Next we use the code length Huffman table to read 286 LITLEN code lengths and 2 DIST code lengths. Note, that the symbol 18 (and others not in use here) above will read multiple ones so you need to keep track of how many code lengths you have read overall.

01           => Symbol 2 => 1 x code length 2 (1 read so far)
11 + 1111111 => Symbol 18 + 7 extra bits => 138 x code length 0 (139 read so far)
11 + 0101011 => Symbol 18 + 7 extra bits => 11 + 1101010 = 117 x code length 0 (256 read so far)
10           => Symbol 3 => 1 x code length 3 (257 codes read so far)
11 + 0000100 => Symbol 18 + 7 extra bits => 11 + 0010000 = 27 x code length 0 (284 read so far)
10           => Symbol 3 => 1 x code length 3 (285 read so far)
00           => Symbol 1 => 1 x code length 1 (286 read so far, i.e. all LITLEN codes)
00           => Symbol 1 => 1 x code length 1 (287 read so far)
00           => Symbol 1 => 1 x code length 1 (288 read so far)

This translates into these Huffman tables (we read only the bit lengths but I already added the actual codes based on the algorithm from the specification and omitted unused symbols).

LITLEN Huffman table:

Symbol Bit length Huffman code Extra bits to read Interpretation
0 2 10 - Literal Character ‘\0’
256 3 110 - End Of Block
284 3 111 5 Copy 227 + extra bytes
285 1 0 - Copy 258 bytes

DIST Huffman table:

Symbol Bit length Huffman code Extra bits to read Interpretation
0 1 0 - Distance 1
1 1 1 - Distance 2

If you compare these tables with the predefined ones in the specification you can already see how little codes are needed for our special case file and with how few bits those can be encoded.

Now that we have finally created the Huffman tables for decompression, we can start decoding the actual data.

If you followed along, this is the remaining bit stream, we have already almost reached the long stretch of 0 bits:


Lots of 0 bits (7736 0 bits to be precise)


Let’s decode:

10 => LITLEN Symbol 0   => Literal byte '\0'
10 => LITLEN Symbol 0   => Literal byte '\0'
0  => LITLEN Symbol 285 => Copy 258 bytes from ...
0  => DIST Symbol 0     => ... distance 1

00 => The same, LITLEN Symbol 285, DIST Symbol 0 => Copy 258 bytes from distance 1
00 => ...
00 => ...
00 => ...
00 => ...
00 => ...

3868 times the same

111 + 10101          => LITLEN Symbol 284 => Copy 227 + 10101 = 248 bytes from ...
0                    => DIST Symbol 0 => ... distance 1
110                  => LITLEN Symbol 256 => End Of Block
0000                 => Padding to align to bytes

So the sequence we decoded is this:

  • 2 times the literal byte ‘\0’
  • 3875 times copy 258 bytes from distance 1 (long stretch of 0 bits in compressed data)
  • copy 248 bytes from distance 1
  • end of block

In total we have decoded 2 + 3875 * 258 + 248 = 1000000 bytes. Mission accomplished!

Now, back to our original question of what limits Gzip compression. As you can see here (and also when looking into the specification), we can achieve the highest compression when we can copy 258 bytes. In the best case, like with these very specific Huffman tables, we can achieve that with one bit for the LITLEN code and one bit for the distance code. The compression ratio is therefore 258 bytes compressed into 2 bits = 8 * 258 bits / 2 bits = 1032 / 1. In the beginning we measured about 1030 : 1. The difference between that and 1032 : 1 is the overhead of having to include the custom Huffman tables.

A follow-up question is why there is the maximum of copying only 258 bytes in the specification. There would have been room for adding greater amounts. I once read an explanation on StackOverflow (cannot find the link right now), that the length value (range 3 to 258) should be able to fit into 8 bits for performance reasons on 8 bit machines (deflate is really old by now). A great thing is that you can ask question about deflate / gzip on StackOverflow and it’s likely that Mark Adler (who still maintains zlib) will answer. Like for the question “gzip compression ratio for zeros” which answers the same question as I have here.

  1. We thought that we would be safe against such attacks because of our streaming infrastructure, which in theory is correct. However, as so often with bugs like this, unfortunate circumstances led to collecting the stream without limiting the amount of collected data. 

Base64 vs. Gzip

For personal projects I mostly try to avoid using databases. Instead, I try to use what I call the “JSON repository” pattern. I create a file structure that contains my data in JSON format. I often use the directory structure as an index by placing the JSON data at well-known places for random access.

For example, I wrote a small web crawler a while ago. To make the process incremental between different runs I created this file structure for caching intermediate results until the next run:

data/                          # Repository root
  head/                        # Results of HEAD requests
    http/                      # protocol             # host
         docs/                 # URI path element 0
           ...                 # ...
             /abc.html.json.gz # result of HEAD request in gzipped JSON format
  response/                    # Results of full GET requests
    # same structure as above
             /abc.html.json.gz # result of GET request in gzipped JSON format
  outlinks/                    # Results of outgoing link analysis
    # same structure as above
             /abc.html.outlinks-v1.json.gz # result of outgoing link analysis in gzipped JSON format

I like this kind of “file system database” because you can browse it with regular tools on the command line. But this shouldn’t be the topic of this post. What I want to talk about in this one is a sub-aspect of storing data in this form.

The good thing about JSON is that it’s human-readable. However, it is less efficient to write, store, and read than a binary format. Also, when you need to embed binary data inside of JSON files you need to encode it in UTF-8 strings which is often done using Base64 encoding. To mitigate at least part of the storage inefficiency of JSON files, I always run gzip on the resulting json. That’s still easy to read on the command line using tools like zless while improving storage space.

Using this scheme, I wondered, how efficient is the encoding of binary data with Base64 first, and then with gzip? More precisely, Base64 wastes space by only using 6 bits out of the 8 bits of a byte. How much of that can be recovered by gzipping Base64-encoded data afterwards? Also, for compressible binary data, how much of the original compression rate can be achieved if data is first Base64-encoded? So, I ran a few experiments.

We have two test bodies:

  • The UTF8 plain text version of Alice In Wonderland from Project Gutenberg (although, there’s no good reason to base64 encode it in json, it serves as a good example of compressible data)
  • 1,000,000 random bytes

Let’s start with plain Base64 and plain gzip encoding:

Encoding Random Bytes of unencoded Alice Bytes of unencoded
raw 1000000 x 1 74703 x 1
base64 1333336 x 1.333 99604 x 1.333
gzip 1000173 x 1.0001 27842 x 0.373

The results are unsurprising. Base64 has a fixed overhead of 33%. As Base64 only uses 6 bits per byte, we would expect an overhead of ca. 8 / 6 - 1 = 1 / 3 = 33%.

Gzip worked well on the text file but slightly increased the size of the random data (which is guaranteed to happen in average on random data).

Now, let’s try base64 encoding first and then apply gzip:

Encoding Random Bytes of unencoded Alice Bytes of unencoded
base64 + gzip 1009020 x 1.009 38064 x 0.510

So, on random, uncompressible bytes, gzip mitigates the effect of base64 encoding, so that only 0.9% overhead are left.

On compressible data, we need less space than the text would have taken in raw, but we lose quite a bit of compression compared to raw compression. The probable reason is that both of gzip’s compression algorithms, LZ77 and Huffman, process input data one byte at a time. Base64’s 8-bit to 6-bit encoding, however, splits up every 3 bytes into 4, so that LZ77 cannot find repetitions as easily because they have to be aligned to those 4-byte base64 blocks.

The Huffman coding is probably also affected for two reasons: First, every character from the Base64 encoding character set will represent 6 bits that were chopped from one or two original bytes. That means that the symbol histogram will not represent the original data bytes symbol histogram perfectly. Second, as the input data takes more space, LZ77 length and distance encoding might require more space (more “extra bits” in the encoding).

To illustrate this effect, I ran a few examples on even simpler input data. Let’s compare gzip compression of raw data versus base64 encoding of the simple string consisting of 1024 times the character e.

If it is compressed raw it can be represented with LZ77 as

  • Literal character e
  • At distance 1 copy 1024 characters (somewhat unintuitive that you can reference data that is not yet decoded but it actually works like this)

This can be very efficiently encoded.

Even, if you would disable LZ77 in Gzip, Huffman coding alone could also represent this string efficiently because it would only have to represent the two symbols:

  • Literal character e
  • End of block

Even if that would be a quite degenerate case of Huffman encoding, you could still encode it in 1025 bits (e.g. 1024 bits of 1 to represent e followed by a single 0 to represent the end of the block, plus the Huffman table).

Now look at the Base64 encoding of the string of 1024 ‘e’s:


It’s quite obvious that it still has some structure but as you can see it exhibits the predicted 4-byte pattern. It can be encoded with LZ77 as

  • Literal string "ZWVl" (base64 representation of "eee")
  • At distance 4 copy 1020/3*4 characters (base64 representation of 1020 'e's)
  • Literal string "ZQo=" (the trailer which is necessary because 1024 doesn’t divide by 3, so that the base64 encoding has a bit of padding at the end)

This can still be quite efficiently compressed but has more complexity than the plain encoding without base64.

However, this is still just a special case because the repetitions of the character ‘e’ are aligned to 3-byte boundaries which translate into periods of 4 characters with base64.

You can see this when looking at the base64 encoding of "123456712345671234567123456712345671234567", i.e. 6 times the string "1234567":


If you look closely you can see that the result is still periodic but only after half of the output. This is because the 7 characters don’t divide by 3 so the next time the string "1234567" aligns at a 3-byte boundary is after 3 periods, that is after 21 input characters or 28 output characters.

It’s pretty clear that this will negatively affect gzip compression afterwards.

Finally, here’s a table showing the difference between compressing the same amount of bytes either of a single character of the string "1234567":

Encoding 4200 * "e" of unencoded 600 * "1234567" of unencoded
raw 4200 x 1 4200 x 1
base64 5600 x 1.333 5600 x 1.333
gzip 40 x 0.0095 48 x 0.0114
base64+gzip 47 x 0.0112 81 x 0.0193

In the last line you can see that adding base64 encoding has only little effect on aligned, compressible data but a much bigger effect when the repetitions in the original data don’t align to the base64 3-byte boundaries.


  • Putting binary data as base64 into json files and Gzipping afterwards works well because gzip mitigates the overhead of base64 almost completely.
  • If the binary data is compressible, you’ll lose some potential for compression by using base64 first. (If you want to offset this, you can gzip the binary data first before applying base64.)
  • Alternatively, you could put binary data not in the json file but next to it on the file system.
  • Or, use another encoding better suited to json, e.g. see this stackoverflow question

See also:

Richard Startin recently did a similar investigation in his blog post about Obfuscated Compressibility.