TinyBits: Smaller, faster serialization for Ruby apps and beyond


Introduction

In my last blog post, I went over several serialization solutions for Ruby, including two JSON-based flavors (Oj and the standard library JSON gem) alongside CBOR and MessagePack as binary alternatives.

If you recall my verdict, MessagePack came out on top in encoded size and encoding performance, while the JSON gem was the best at decoding. That left the question of which serializer to use for your apps with the infamous “it depends” answer.

Today I am introducing a preview release of TinyBits, my take on serializing JSON-like objects in a schema-less, binary format.

What is TinyBits

TinyBits is a C library that implements a serializer and a deserializer from a specific set of data types to a binary format and back. The supported data types are:

Integer numbers using int64 capacity
Floating point numbers using IEEE 754 double precision (including NaN, +/-INF)
Strings
Blobs (strings with binary data)
Booleans
NULL values
Arrays
Maps (key/value pairs)
Datetime (stored as seconds since epoch, with fractional data and a time zone offset)

TinyBits packs those values in a very tight binary format, achieved mostly due to:

Single byte type headers
Integer compression using the SQLite4 (yes SQLite4, not SQLite3) varint format.
Floating point compression via scaling + varint storage.
String deduplication using backward referencing (this is very effective with arrays of maps with similar keys).
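To make the varint item more concrete, here is a minimal Ruby sketch of the SQLite4 varint scheme for unsigned 64-bit integers. It illustrates the published SQLite4 encoding only; how TinyBits combines it with its type headers is not shown here, and the method names `varint_encode` and `varint_decode` are made up for this example.

```ruby
# SQLite4-style varint: small values take one byte, larger ones grow as needed.
# Returns a binary String of 1 to 9 bytes.
def varint_encode(value)
  raise ArgumentError, "expected 0..2**64-1" unless value.between?(0, 2**64 - 1)

  if value <= 240
    [value].pack("C")                          # the byte is the value itself
  elsif value <= 2287
    v = value - 240
    [241 + (v >> 8), v & 0xff].pack("CC")      # first byte in 241..248
  elsif value <= 67_823
    v = value - 2288
    [249, v >> 8, v & 0xff].pack("CCC")
  else
    # Marker byte 250..255 followed by 3..8 big-endian payload bytes.
    nbytes = 3
    nbytes += 1 while value >= (1 << (8 * nbytes))
    payload = (nbytes - 1).downto(0).map { |i| (value >> (8 * i)) & 0xff }
    ([247 + nbytes] + payload).pack("C*")
  end
end

# Decodes one varint from the start of a binary String.
# Returns [value, number_of_bytes_consumed].
def varint_decode(data)
  b = data.bytes
  a0 = b[0]
  if a0 <= 240
    [a0, 1]
  elsif a0 <= 248
    [240 + 256 * (a0 - 241) + b[1], 2]
  elsif a0 == 249
    [2288 + 256 * b[1] + b[2], 3]
  else
    nbytes = a0 - 247
    value = b[1, nbytes].reduce(0) { |acc, byte| (acc << 8) | byte }
    [value, nbytes + 1]
  end
end

[31, 557, 2540, 70_000, 2**40].each do |n|
  encoded = varint_encode(n)
  puts "#{n} -> #{encoded.unpack1('H*')} (#{encoded.bytesize} bytes)"
end
```

The appeal of this scheme is that anything up to 240 costs a single byte and anything up to 2287 costs two, which covers most IDs, counters and ranks in small documents.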

Design Tradeoffs

TinyBits makes the following design tradeoffs:

Documents with repeated strings, like arrays of hashes (e.g. returned from a database) will have the best size reduction
Size reduction is balanced so that it does not hurt encoding performance much, even if the output ends up slightly larger
Decoding performance is prioritized over encoding performance
Small to medium sized documents are prioritized over larger ones

Currently a Ruby extension of TinyBits is available as a Ruby gem. Source code is available here. The C implementation can be found here.

Usage

TinyBits is straightforward to use. You just need to install the gem.

    gem install tinybits
    # or
    bundle add tinybits
    bundle install

Then you can either use the class methods

    require 'tinybits'

    document = { name: "TinyBits", version: 1.0, features: [:simple, :fast, :compact] }
    packed = TinyBits.pack(document)
    unpacked = TinyBits.unpack(packed)

Or you can use the faster object interface

    require 'tinybits'

    document = { name: "TinyBits", version: 1.0, features: [:simple, :fast, :compact] }

    packer = TinyBits::Packer.new
    packed = packer.pack(document)

    unpacker = TinyBits::Unpacker.new
    unpacked = unpacker.unpack(packed)

If you use the packer/unpacker objects to pack/unpack individual documents then you don’t need to reset the packer between invocations.

Compactness

Let’s look at an example to see how compact TinyBits can be compared to other solutions.

    [
      { "user_id" : 7, "score" : 3.1, "rank" : 23, "name" : "droid" },
      { "user_id" : 23, "score" : 2.01, "rank" : 2540, "name" : "griffin" },
      { "user_id" : 23, "score" : 2.6, "rank" : 1891, "name" : "ghoul" }
    ]

This simple document translates to the following sizes:

Format	Size
JSON	163 bytes
CBOR	134 bytes
MessagePack	133 bytes
TinyBits	80 bytes

As can be seen, TinyBits was able to compress the above document greatly since it had multiple repeated strings, integer and floating point numbers.
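To see why repeated strings matter so much for this document, here is a toy Ruby sketch of deduplication via back-references. It does not reproduce the TinyBits wire format (the post does not describe how references are laid out); `dedup_tokens` and the token names are invented for illustration. It only counts how many strings would be written in full versus replaced by a reference to an earlier occurrence.

```ruby
# Walks a document and emits a flat token list. A string that was already
# emitted is replaced by a [:ref, index] token pointing at its first occurrence.
def dedup_tokens(obj, table = {}, out = [])
  case obj
  when Hash
    out << [:map, obj.size]
    obj.each do |key, value|
      dedup_tokens(key.to_s, table, out)
      dedup_tokens(value, table, out)
    end
  when Array
    out << [:array, obj.size]
    obj.each { |item| dedup_tokens(item, table, out) }
  when String
    if table.key?(obj)
      out << [:ref, table[obj]]      # seen before: emit a back-reference
    else
      table[obj] = table.size        # first occurrence: remember its index
      out << [:str, obj]
    end
  else
    out << [:value, obj]             # numbers, booleans, nil
  end
  out
end

records = [
  { "user_id" => 7,  "score" => 3.1,  "rank" => 23,   "name" => "droid" },
  { "user_id" => 23, "score" => 2.01, "rank" => 2540, "name" => "griffin" },
  { "user_id" => 23, "score" => 2.6,  "rank" => 1891, "name" => "ghoul" }
]

tokens = dedup_tokens(records)
literal = tokens.count { |kind, _| kind == :str }
refs = tokens.count { |kind, _| kind == :ref }
puts "literal strings: #{literal}, back-references: #{refs}"
```

Of the 15 strings in the document, only 7 are written out in this sketch (the four keys once, plus the three names); the other 8 become references, and that share grows with every additional record.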

Needless to say, if the document had less redundancy, the difference would be less pronounced, for example:

{ "user_id" : 7, "score" : 3.1, "hp" : 55.7, "name" : "droid" }

Which translates to the following sizes:

Format	Size
JSON	50 bytes
CBOR	48 bytes
MessagePack	48 bytes
TinyBits	35 bytes

The difference between TinyBits and the other binary formats here comes mostly from the two floating point values, 3.1 and 55.7. Each takes 9 bytes (1 tag + 8) to store in CBOR and MessagePack, while TinyBits stores 3.1 in only 2 bytes and 55.7 in 3.

This is what the data looks like in hex for the three binary formats:

CBOR	a467757365725f6964076573636f7265fb4008cccccccccccd626870fb404bd9999999999a646e616d656564726f6964
MessagePack	84a7757365725f696407a573636f7265cb4008cccccccccccda26870cb404bd9999999999aa46e616d65a564726f6964
TinyBits	1447757365725f6964874573636f7265211f42687021f23d446e616d654564726f6964

Notice how similar the CBOR and MessagePack sequences are. The TinyBits sequence is very similar too; it differs mainly in the middle, where the numbers 3.1 and 55.7 are encoded.

If we take the data set from the previous post, which had 11 different document types with 10 records each, these are the average document sizes when serialized using the different serializers:

Format	Average Size
JSON	3,096 bytes
CBOR	2,625 bytes
MessagePack	2,239 bytes
TinyBits	1,458 bytes

Fast Encoding

TinyBits strives to streamline the encoding process, and it manages to do so. Of course there are some overheads associated with string deduplication and integer/floating point compression, but these were kept to a minimum to achieve speeds that are on par with, or even better than, other binary encoders.

These are the average encoding rates of the 11 document types mentioned above

Format	Average Encoding Rate
Oj	63,872 docs/sec
JSON	73,877 docs/sec
CBOR	105,488 docs/sec
MessagePack	124,081 docs/sec
TinyBits	135,054 docs/sec

As can be seen, for the test data set, TinyBits is the fastest encoder, with a ~9% advantage over MessagePack while doing a lot more work to compress and deduplicate.
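The benchmark harness behind these numbers is not shown in this post, so the following is only a rough Ruby sketch of how a docs/sec figure can be measured. `docs_per_second` and the sample documents are hypothetical, and the standard library JSON encoder stands in for the serializer under test.

```ruby
require "json"

# Runs the given block over the documents repeatedly for at least
# min_seconds and returns the number of documents processed per second.
def docs_per_second(docs, min_seconds: 0.2)
  count = 0
  elapsed = 0.0
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  while elapsed < min_seconds
    docs.each { |doc| yield doc }
    count += docs.size
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  count / elapsed
end

# Made-up sample data, not the data set used for the tables in this post.
docs = Array.new(10) { |i| { "user_id" => i, "score" => i * 1.5, "name" => "user#{i}" } }

rate = docs_per_second(docs) { |doc| JSON.generate(doc) }
puts format("JSON: %d docs/sec", rate)
```

With the gem installed, swapping the block for `{ |doc| TinyBits.pack(doc) }` would measure TinyBits the same way. Absolute numbers depend heavily on the machine and the documents, so only relative differences on the same setup are meaningful.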

Fast Decoding

As a nice side effect of its compact encoded size, the decoding process is much faster, and thanks to string deduplication, TinyBits doesn’t need to create many Ruby strings. As a result we see the following decoding performance.

Format	Average Decoding Rate
Oj	25,281 docs/sec
JSON	40,590 docs/sec
CBOR	32,126 docs/sec
MessagePack	35,428 docs/sec
TinyBits	55,880 docs/sec

In this test, the binary formats generally lag behind the standard library JSON parser. But TinyBits beats the JSON parser in decoding, delivering ~38% more performance.

Compressibility

Encoding to smaller sizes is one thing; ultimately delivering the smallest size is another. Case in point: MessagePack usually encodes to a smaller size than JSON, which is great if you are sending the data as is. If you try to compress the data further, though, it turns out that in most cases, especially when using Zstd, JSON compresses to a smaller absolute size. This is also true of CBOR.

TinyBits, on the other hand, stays very competitive with JSON compressibility, trading blows with it over the documents in the data set and edging it slightly on average. If you are using LZ4 specifically, then TinyBits is always more compressible than JSON.

Here are the average sizes of the documents when compressed using LZ4 and Zstd

Format	Raw Size	LZ4 Size	Zstd Size
JSON	3,096 bytes	1,339 bytes	952 bytes
CBOR	2,625 bytes	1,294 bytes	992 bytes
MessagePack	2,239 bytes	1,294 bytes	1,012 bytes
TinyBits	1,458 bytes	1,160 bytes	942 bytes

Note that for these documents, raw TinyBits, without any compression applied, is just 53% larger than Zstd-compressed JSON (952 bytes), the smallest compressed size among the other formats.
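If you want to run this kind of comparison on your own data, the measurement itself is simple. The Ruby sketch below uses Zlib (Deflate) from the standard library as a stand-in, since LZ4 and Zstd require extra gems; the sample records are made up, so the numbers it prints are not comparable to the table above.

```ruby
require "json"
require "zlib"

# Made-up records with repeated keys, loosely similar in shape to database rows.
records = Array.new(50) do |i|
  { "user_id" => i, "score" => (i * 0.1).round(1), "rank" => i * 7, "name" => "user#{i}" }
end

raw = JSON.generate(records)
compressed = Zlib::Deflate.deflate(raw, Zlib::BEST_COMPRESSION)

puts "raw: #{raw.bytesize} bytes, deflated: #{compressed.bytesize} bytes"
puts format("ratio: %.2f", compressed.bytesize.fdiv(raw.bytesize))
```

The same two measurements (`bytesize` before and after compression) apply to any serializer's output, which is all that is needed to build a table like the one above for your own documents.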

Memory Usage

As we have seen before, most of these serializers are memory efficient. Still, the TinyBits Ruby extension allocates the least amount of memory of all the serializers tested, even if by a small margin.

Format	Encoding Memory
Oj	3.21 KBs/doc
JSON	4.19 KBs/doc
CBOR	2.81 KBs/doc
MessagePack	2.74 KBs/doc
TinyBits	1.53 KBs/doc

Format	Decoding Memory
Oj	11.13 KBs/doc
JSON	10.55 KBs/doc
CBOR	16.46 KBs/doc
MessagePack	10.5 KBs/doc
TinyBits	9.62 KBs/doc

An All Rounder

As we have seen, TinyBits offers the best encoding and decoding performance, at least for the provided data set, all while delivering the smallest encoded sizes without sacrificing compressibility in the process.

In summary TinyBits is:

~9% faster than the fastest existing schema-less encoder (MessagePack)
~38% faster than the fastest existing schema-less decoder (stdlib JSON)
All while producing a roughly 35% reduction in size compared to the most compact existing format (MessagePack)
Not to mention that it compresses as well as, if not slightly better than, JSON, which is usually the most compressible format

End to end performance

Consider what this means for data transmission, assuming a 100Mbps connection:

Format	Encoding Time (us)	Transmission Time (us)	Decoding Time (us)	Total Time (us)
JSON	13.54	236.2	24.64	274.37
CBOR	9.48	200.29	31.13	240.89
MessagePack	8.06	201.32	28.23	237.60
TinyBits	7.40	111.24	17.9	136.54

In this (hypothetical) scenario, TinyBits delivers roughly 43% time savings over the second best alternative (MessagePack).
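The TinyBits row of the table can be reconstructed from the earlier figures: encoding and decoding times are the inverse of the docs/sec rates, and transmission time is the average document size divided by the link speed. The Ruby sketch below makes one assumption of its own, namely that "100Mbps" means 100 × 2^20 bits per second; that is a guess that happens to land close to the listed transmission time. `end_to_end_us` is a hypothetical helper.

```ruby
# Assumption (not stated in the post): 100Mbps = 100 * 2**20 bits per second.
LINK_BITS_PER_SECOND = 100 * 2**20

# Returns per-stage and total times in microseconds for one average document.
def end_to_end_us(encode_rate:, decode_rate:, size_bytes:)
  encode = 1_000_000.0 / encode_rate
  transmit = size_bytes * 8 * 1_000_000.0 / LINK_BITS_PER_SECOND
  decode = 1_000_000.0 / decode_rate
  { encode: encode, transmit: transmit, decode: decode, total: encode + transmit + decode }
end

# TinyBits averages from the tables above: 135,054 docs/sec encoding,
# 55,880 docs/sec decoding, 1,458 bytes per document.
tinybits = end_to_end_us(encode_rate: 135_054, decode_rate: 55_880, size_bytes: 1_458)
tinybits.each { |stage, us| puts format("%-9s %8.2f us", stage, us) }
```

Transmission dominates the total at this link speed, which is why the size advantage translates into end to end savings more directly than the encoding or decoding gains do.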

Bonus Feature: Multi-Object Packing (experimental)

The TinyBits Ruby gem comes with a feature that was purpose built for my use case, but I believe it could be useful for others as well.

The packer can stitch multiple objects into the same buffer
The objects can be added to the buffer at separate points in time (e.g. once each is generated)
The objects will all share the same deduplication dictionary
The unpacker can unpack these objects one after the other

I had a specific need for this feature as I wanted to:

Capture objects on the fly as they are being generated, without copying them or keeping them around
Benefit from shared strings across those different objects
Still be able to unpack each object individually on the other end

Here’s an example code snippet

    require 'tinybits'

    packer = TinyBits::Packer.new
    objects = [{obj1: 1}, {obj2: 3.1}, ['obj1', 'obj2']]
    objects.each { |obj| packer << obj }
    packed = packer.to_s

    unpacker = TinyBits::Unpacker.new
    unpacker.buffer = packed
    while !unpacker.finished?
      pp unpacker.pop
    end

State Of Development

The TinyBits C library and the format spec are almost finalized. Currently only an (experimental) Ruby extension is available. The Python extension is progressing nicely, and work is underway to produce extensions for other languages and platforms. Of course, any help in that regard will be greatly appreciated.

Conclusion

Born from a real need to encode->transmit->decode schema-less data efficiently, TinyBits turns out to be the most space efficient and generally the fastest of the existing options for serializing schema-less data. Give it a try and see for yourself!
