For personal projects I mostly try to avoid using databases. Instead, I try to use what I call the "JSON repository" pattern. I create a file structure that contains my data in JSON format. I often use the directory structure as an index by placing the JSON data at well-known places for random access.

For example, I wrote a small web crawler a while ago. To make the process incremental between runs, I created this file structure to cache intermediate results from one run to the next:

data/                          # Repository root
  head/                        # Results of HEAD requests
    http/                      # protocol
      doc.akka.io/             # host
         docs/                 # URI path element 0
           ...                 # ...
             /abc.html.json.gz # result of HEAD request in gzipped JSON format
  response/                    # Results of full GET requests
    # same structure as above
             /abc.html.json.gz # result of GET request in gzipped JSON format
  outlinks/                    # Results of outgoing link analysis
    # same structure as above
             /abc.html.outlinks-v1.json.gz # result of outgoing link analysis in gzipped JSON format
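
To make the pattern concrete, here's a minimal sketch of what such a store could look like in Python. The helper names (cache_path, store, load) and the exact URL-to-path mapping are illustrative assumptions, not the crawler's actual code:

  import gzip
  import json
  from pathlib import Path
  from urllib.parse import urlsplit

  ROOT = Path("data")  # repository root, as in the layout above

  def cache_path(kind: str, url: str) -> Path:
      # map a URL to a well-known place below the root,
      # e.g. data/head/http/doc.akka.io/docs/.../abc.html.json.gz
      parts = urlsplit(url)
      return ROOT / kind / parts.scheme / parts.netloc / (parts.path.lstrip("/") + ".json.gz")

  def store(kind: str, url: str, result: dict) -> None:
      path = cache_path(kind, url)
      path.parent.mkdir(parents=True, exist_ok=True)
      with gzip.open(path, "wt", encoding="utf-8") as f:
          json.dump(result, f)

  def load(kind: str, url: str):
      path = cache_path(kind, url)
      if not path.exists():
          return None  # not cached yet
      with gzip.open(path, "rt", encoding="utf-8") as f:
          return json.load(f)

A crawler run could then call store("head", url, result) after each request and load("head", url) on the next run to skip work that has already been done.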

I like this kind of "file system database" because you can browse it with regular command-line tools. But that's not the topic of this post. What I want to talk about here is one sub-aspect of storing data in this form.

The good thing about JSON is that it's human-readable. However, it is less efficient to write, store, and read than a binary format. Also, when you need to embed binary data inside JSON files, you have to encode it as a string, which is usually done with Base64. To mitigate at least part of the storage inefficiency of JSON files, I always gzip the resulting JSON. That's still easy to read on the command line using tools like zless while saving storage space.
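
As a sketch of what that looks like (the field names here are made up for illustration):

  import base64
  import gzip
  import json

  payload = bytes(range(256))  # some binary data to embed

  record = {
      "url": "http://doc.akka.io/docs/abc.html",
      "body": base64.b64encode(payload).decode("ascii"),  # binary data as a Base64 string
  }

  # gzip the resulting JSON document before writing it to disk
  compressed = gzip.compress(json.dumps(record).encode("utf-8"))

  # ... and read it back
  restored = json.loads(gzip.decompress(compressed))
  assert base64.b64decode(restored["body"]) == payload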

Using this scheme, I wondered: how efficient is it to encode binary data with Base64 first and then gzip the result? More precisely, Base64 wastes space by using only 6 out of the 8 bits of each byte. How much of that can be recovered by gzipping the Base64-encoded data afterwards? Also, for compressible binary data, how much of the original compression ratio can still be achieved if the data is Base64-encoded first? So I ran a few experiments.

We have two test bodies:

  • The UTF-8 plain-text version of Alice in Wonderland from Project Gutenberg (there's no good reason to Base64-encode it in JSON, but it serves as a good example of compressible data)
  • 1,000,000 random bytes
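
The numbers below can be reproduced with a few lines of Python. This is just a sketch; the exact byte counts will vary slightly with the gzip level and the exact copy of the Alice text used:

  import base64
  import gzip
  import os

  def sizes(data: bytes) -> dict:
      # byte counts for the different encodings of `data`
      return {
          "raw": len(data),
          "base64": len(base64.b64encode(data)),
          "gzip": len(gzip.compress(data)),
          "base64+gzip": len(gzip.compress(base64.b64encode(data))),
      }

  print(sizes(os.urandom(1_000_000)))           # the random test body
  print(sizes(open("alice.txt", "rb").read()))  # path to the Gutenberg text is an assumption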

Let's start with plain Base64 and plain gzip encoding:

Encoding   Random (bytes)   x of unencoded   Alice (bytes)   x of unencoded
raw               1000000   x 1                      74703   x 1
base64            1333336   x 1.333                  99604   x 1.333
gzip              1000173   x 1.0001                 27842   x 0.373

The results are unsurprising: Base64 has a fixed overhead of 33%. As Base64 carries only 6 bits of payload per output byte, we would expect an overhead of about 8 / 6 - 1 = 1 / 3 ≈ 33%.

Gzip worked well on the text file but slightly increased the size of the random data (which is guaranteed to happen on average for random data).

Now, let's try base64 encoding first and then apply gzip:

Encoding        Random (bytes)   x of unencoded   Alice (bytes)   x of unencoded
base64 + gzip          1009020   x 1.009                  38064   x 0.510

So, on random, incompressible bytes, gzip mitigates the effect of Base64 encoding, so that only 0.9% of overhead remains.

On compressible data, we need less space than the raw text would have taken, but we lose quite a bit of compression compared to gzipping the raw data directly. The probable reason is that both of gzip's compression stages, LZ77 and Huffman coding, process input data one byte at a time. Base64's 8-bit-to-6-bit encoding, however, splits every 3 input bytes into 4 output bytes, so LZ77 cannot find repetitions as easily: a repetition in the original data only shows up as a repetition in the output when it is aligned to those 4-byte Base64 blocks.

The Huffman coding is probably also affected, for two reasons: First, every character of the Base64 alphabet represents 6 bits that were chopped out of one or two original bytes, which means the symbol histogram no longer matches the symbol histogram of the original data. Second, as the input data takes more space, the LZ77 length and distance encoding might also require more space (more "extra bits" in the encoding).

To illustrate this effect, I ran a few examples on even simpler input data. Let's compare gzip compression of the raw data with gzip compression of its Base64 encoding, for the simple string consisting of 1024 times the character 'e'.

If it is compressed raw it can be represented with LZ77 as

  • Literal character e
  • At distance 1, copy 1023 characters (it's somewhat unintuitive that you can reference data that has not been decoded yet, but it actually works like this)

This can be very efficiently encoded.

Even if you disabled LZ77 in gzip, Huffman coding alone could still represent this string efficiently because it would only have to encode two symbols:

  • Literal character e
  • End of block

Even though that would be quite a degenerate case of Huffman coding, you could still encode the data in 1025 bits (e.g. 1024 one-bits to represent the 'e's followed by a single zero-bit to represent the end of the block), plus the Huffman table.

Now look at the Base64 encoding of the string of 1024 'e's:

ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVl
ZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZWVlZQo=
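
For reference, this listing can be reproduced with Python's base64 module. Note that the trailer "ZQo=" decodes to "e" followed by a newline, which suggests the encoded input actually ended with a newline character (as e.g. echo would append):

  import base64

  data = b"e" * 1024 + b"\n"  # 1024 'e's plus the trailing newline implied by "ZQo="

  # encodebytes() inserts a line break after every 76 output characters,
  # matching the line length of the listing above
  print(base64.encodebytes(data).decode("ascii"), end="")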

It's quite obvious that it still has some structure, but as you can see, it exhibits the predicted 4-byte pattern. It can be encoded with LZ77 as

  • Literal string "ZWVl" (base64 representation of "eee")
  • At distance 4 copy 1020/3*4 characters (base64 representation of 1020 'e's)
  • Literal string "ZQo=" (the trailer which is necessary because 1024 doesn't divide by 3, so that the base64 encoding has a bit of padding at the end)

This can still be compressed quite efficiently, but it has more complexity than the plain encoding without Base64.

However, this is still just a special case because the repetitions of the character 'e' are aligned to 3-byte boundaries which translate into periods of 4 characters with base64.

You can see this when looking at the base64 encoding of "123456712345671234567123456712345671234567", i.e. 6 times the string "1234567":

MTIzNDU2NzEyMzQ1NjcxMjM0NTY3MTIzNDU2NzEyMzQ1NjcxMjM0NTY3

If you look closely you can see that the result is still periodic, but only with a period of half the output. This is because 7 isn't divisible by 3, so the next time the string "1234567" aligns with a 3-byte boundary is after 3 repetitions, that is, after 21 input characters or 28 output characters.

It's pretty clear that this will negatively affect gzip compression afterwards.
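
A quick sketch to check the period:

  import base64

  encoded = base64.b64encode(b"1234567" * 6).decode("ascii")

  print(len(encoded))                  # 56 output characters for 42 input bytes
  print(encoded[:28] == encoded[28:])  # True: the output repeats after 28 characters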

Finally, here's a table showing the difference between compressing the same number of bytes consisting either of a single repeated character or of repetitions of the string "1234567":

Encoding 4200 * "e" of unencoded 600 * "1234567" of unencoded
raw 4200 x 1 4200 x 1
base64 5600 x 1.333 5600 x 1.333
gzip 40 x 0.0095 48 x 0.0114
base64+gzip 47 x 0.0112 81 x 0.0193

In the last line you can see that adding Base64 encoding has only a small effect on aligned, compressible data but a much bigger effect when the repetitions in the original data don't align with the Base64 3-byte boundaries.
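
Again, a small sketch to reproduce these numbers (exact counts may differ slightly depending on the gzip level):

  import base64
  import gzip

  for label, data in [('4200 * "e"', b"e" * 4200),
                      ('600 * "1234567"', b"1234567" * 600)]:
      raw_gz = len(gzip.compress(data))                    # gzip on the raw data
      b64_gz = len(gzip.compress(base64.b64encode(data)))  # gzip after Base64
      print(label, raw_gz, b64_gz)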

Summary

  • Putting binary data as Base64 into JSON files and gzipping afterwards works well because gzip mitigates the overhead of Base64 almost completely.
  • If the binary data is compressible, you'll lose some compression potential by applying Base64 first. (If you want to offset this, you can gzip the binary data before applying Base64; see the sketch after this list.)
  • Alternatively, you could keep the binary data out of the JSON file and store it next to it on the file system.
  • Or use another encoding better suited to JSON, e.g. see this Stack Overflow question.
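
For the second point, here's a sketch of the gzip-before-Base64 variant; the field name is again made up:

  import base64
  import gzip
  import json

  payload = b"some compressible binary data " * 100

  # compress the binary data first, so its redundancy is exploited
  # before Base64 obscures it, then embed the result as a Base64 string
  record = {"body_gz": base64.b64encode(gzip.compress(payload)).decode("ascii")}

  # the JSON document as a whole can still be gzipped as before
  document = gzip.compress(json.dumps(record).encode("utf-8"))

  # reading it back
  restored = json.loads(gzip.decompress(document))
  assert gzip.decompress(base64.b64decode(restored["body_gz"])) == payload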

See also:

Richard Startin recently did a similar investigation in his blog post about Obfuscated Compressibility.