TinyBits: Smaller, faster serialization for Ruby apps and beyond


Introduction

In my last blog, I went over multiple serialization solutions for Ruby including two JSON based flavors (Oj and the standard library JSON gem) alongside CBOR and MessagePack as binary alternatives.

If you recall my verdict, MessagePack came out on top in encoded size and encoding performance, while the JSON gem was the best at decoding. That left the question of which serializer to use for your apps with the infamous “it depends” answer.

Today I am introducing a preview release of TinyBits, my take on serializing JSON-like objects in a schema-less binary format.

What is TinyBits

TinyBits is a C library that implements a serializer and a deserializer from a specific set of data types to a binary format and back. The supported data types are:

  • Integers with int64 capacity
  • Floating point numbers using IEEE 754 double precision (including NaN, +/-INF)
  • Strings
  • Blobs (strings with binary data)
  • Booleans
  • NULL values
  • Arrays
  • Maps (key/value pairs)
  • Datetime (stored as seconds since epoch, with fractional data and a time zone offset)

TinyBits packs those values in a very tight binary format, achieved mostly due to:

  • Single byte type headers
  • Integer compression using the SQLite4 (yes SQLite4, not SQLite3) varint format.
  • Floating point compression via scaling + varint storage.
  • String deduplication using backward references (this is very effective with arrays of maps that share similar keys).
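To make the varint item concrete, here is a minimal Ruby sketch of the SQLite4 varint layout for unsigned integers. The module name `Varint4` is made up for illustration, and how TinyBits wires this into its type headers or handles negative numbers is not covered here, so treat it as a sketch of the general scheme and not as the library's code.

```ruby
# Sketch of the SQLite4 varint layout for unsigned 64-bit integers.
# Varint4 is a hypothetical name; it is not part of the tinybits gem.
module Varint4
  module_function

  # Returns the encoded value as a binary String.
  def encode(value)
    unless value.is_a?(Integer) && value.between?(0, 2**64 - 1)
      raise ArgumentError, "expected an integer in 0..2**64-1"
    end

    # 0..240 fit in a single byte.
    return [value].pack("C") if value <= 240

    # 241..2287 take two bytes.
    if value <= 2287
      rest = value - 240
      return [241 + rest / 256, rest % 256].pack("C2")
    end

    # 2288..67823 take three bytes.
    if value <= 67_823
      rest = value - 2288
      return [249, rest / 256, rest % 256].pack("C3")
    end

    # Larger values: a marker byte (250..255) followed by a
    # 3..8 byte big-endian integer.
    nbytes = (value.bit_length + 7) / 8
    [247 + nbytes].pack("C") + [value].pack("Q>")[-nbytes, nbytes]
  end

  # Returns [value, bytes_consumed].
  def decode(data)
    bytes = data.bytes
    first = bytes[0]
    return [first, 1] if first <= 240
    return [240 + 256 * (first - 241) + bytes[1], 2] if first <= 248
    return [2288 + 256 * bytes[1] + bytes[2], 3] if first == 249

    nbytes = first - 247
    [bytes[1, nbytes].inject(0) { |acc, b| (acc << 8) | b }, nbytes + 1]
  end
end

puts Varint4.encode(31).unpack1("H*")   # prints "1f"
puts Varint4.encode(557).unpack1("H*")  # prints "f23d"
```

The appeal of this layout is that small values, which dominate typical documents, cost a single byte, and the first byte alone tells the decoder how many bytes follow.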

Design Tradeoffs

TinyBits makes the following design tradeoffs:

  • Documents with repeated strings, like arrays of hashes (e.g. returned from a database) will have the best size reduction
  • Size reduction is balanced to not affect encoding performance very much, even if we end up slightly larger
  • Decoding performance is prioritized over encoding performance
  • Small to medium sized documents are prioritized over larger ones

Currently a Ruby extension of TinyBits is available as a Ruby gem. Source code is available here. The C implementation can be found here.

Usage

TinyBits is straightforward to use. You just need to install the gem.

    gem install tinybits
    # or
    bundle add tinybits
    bundle install

Then you can either use the class methods

    require 'tinybits'

    document = { name: "TinyBits", version: 1.0, features: [:simple, :fast, :compact] }
    packed = TinyBits.pack(document)
    unpacked = TinyBits.unpack(packed)

Or you can use the faster object interface

    require 'tinybits'

    document = { name: "TinyBits", version: 1.0, features: [:simple, :fast, :compact] }

    packer = TinyBits::Packer.new
    packed = packer.pack(document)

    unpacker = TinyBits::Unpacker.new
    unpacked = unpacker.unpack(packed)

If you use the packer/unpacker objects to pack/unpack individual documents then you don’t need to reset the packer between invocations.

Compactness

Let’s look at an example to see how compact TinyBits can be compared to other solutions.

    [
      { "user_id" : 7, "score" : 3.1, "rank" : 23, "name" : "droid" },
      { "user_id" : 23, "score" : 2.01, "rank" : 2540, "name" : "griffin" },
      { "user_id" : 23, "score" : 2.6, "rank" : 1891, "name" : "ghoul" }
    ]

This simple document translates to the following sizes:

Format        Size
JSON          163 bytes
CBOR          134 bytes
MessagePack   133 bytes
TinyBits      80 bytes

As can be seen, TinyBits was able to compress the above document greatly since it had multiple repeated strings, integer and floating point numbers.
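To see why repeated strings matter so much for this document, here is a toy Ruby sketch of deduplication by back-reference. It works on abstract tokens, not on TinyBits' binary format, and `dedup_tokens` is a made-up helper: the real encoder may well limit which strings are tracked or how far back a reference can point.

```ruby
# Toy illustration: the first occurrence of a string is emitted in full,
# later occurrences become a small reference to that first occurrence.
# dedup_tokens is hypothetical and not part of the tinybits gem.
def dedup_tokens(value, seen = {}, out = [])
  case value
  when Hash
    out << [:map, value.size]
    value.each do |key, val|
      dedup_tokens(key.to_s, seen, out)
      dedup_tokens(val, seen, out)
    end
  when Array
    out << [:array, value.size]
    value.each { |item| dedup_tokens(item, seen, out) }
  when String
    if seen.key?(value)
      out << [:ref, seen[value]]   # already seen: emit a back-reference
    else
      seen[value] = seen.size      # remember the order of first appearance
      out << [:str, value]
    end
  else
    out << [:scalar, value]
  end
  out
end

docs = [
  { "user_id" => 7,  "score" => 3.1,  "rank" => 23,   "name" => "droid" },
  { "user_id" => 23, "score" => 2.01, "rank" => 2540, "name" => "griffin" },
  { "user_id" => 23, "score" => 2.6,  "rank" => 1891, "name" => "ghoul" }
]

tokens = dedup_tokens(docs)
full_strings = tokens.count { |tag, _| tag == :str }
references   = tokens.count { |tag, _| tag == :ref }
puts "strings written in full: #{full_strings}, back-references: #{references}"
# prints "strings written in full: 7, back-references: 8"
```

Of the 15 strings in the document, only 7 need to be written out; the 8 repeated keys in the second and third records collapse into references.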

Needless to say, if the document has less redundancy, the difference will be less pronounced. For example:

{ "user_id" : 7, "score" : 3.1, "hp" : 55.7, "name" : "droid" }

Which translates to the following sizes:

Format        Size
JSON          50 bytes
CBOR          48 bytes
MessagePack   48 bytes
TinyBits      35 bytes

The difference between TinyBits and the other binary formats here comes from the floating point values 3.1 and 55.7. Each takes 9 bytes (1 tag + 8) to store in CBOR and MessagePack, while TinyBits stores them in only 2 and 3 bytes respectively.

This is what the data looks like in hex for the three binary formats:

CBOR          a467757365725f6964076573636f7265fb4008cccccccccccd626870fb404bd9999999999a646e616d656564726f6964
MessagePack   84a7757365725f696407a573636f7265cb4008cccccccccccda26870cb404bd9999999999aa46e616d65a564726f6964
TinyBits      1447757365725f6964874573636f7265211f42687021f23d446e616d654564726f6964

Note how similar the CBOR and MessagePack sequences are. Even the TinyBits sequence is very similar; it differs mainly in the middle, where the numbers 3.1 and 55.7 are encoded.
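The TinyBits hex above shows 3.1 as the bytes 21 1f and 55.7 as 21 f2 3d, which is consistent with a one-byte header followed by the varints for 31 and 557. A minimal Ruby sketch of the "scaling" idea could look like the following. The helper name `decimal_scale`, the 12-digit limit, and the round-trip check are all assumptions made for illustration; the article does not spell out the exact rule TinyBits uses.

```ruby
# Hypothetical helper: find the smallest number of decimal digits such that
# the float can be stored as an integer and recovered exactly.
# Returns [scaled_integer, digits], or nil when no small scale round-trips
# (such a value would have to be stored as a full 8-byte double instead).
def decimal_scale(number, max_digits = 12)
  return nil unless number.finite?

  (0..max_digits).each do |digits|
    scaled = (number * 10**digits).round
    # Only accept the scale if dividing back yields the identical float.
    return [scaled, digits] if scaled.fdiv(10**digits) == number
  end
  nil
end

p decimal_scale(3.1)      # prints [31, 1]
p decimal_scale(55.7)     # prints [557, 1]
p decimal_scale(2.01)     # prints [201, 2]
p decimal_scale(1.0 / 3)  # prints nil
```

The exact round-trip check is what keeps this lossless: a value like 1.0 / 3 has no short decimal form, so the sketch gives up and leaves it to the plain double representation.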

If we take our data set from the last blog, which had 11 different document types, 10 records each, these are average document sizes when serialized using the different serializers:

Format        Average Size
JSON          3,096 bytes
CBOR          2,625 bytes
MessagePack   2,239 bytes
TinyBits      1,458 bytes

Fast Encoding

TinyBits strives to streamline the encoding process, and it manages to do so. Of course there are some overheads associated with string deduplication and integer/floating point compression, but these were kept to a minimum to achieve speeds that are on par with, or even better than, other binary encoders.

These are the average encoding rates of the 11 document types mentioned above

Format        Average Encoding Rate
Oj            63,872 docs/sec
JSON          73,877 docs/sec
CBOR          105,488 docs/sec
MessagePack   124,081 docs/sec
TinyBits      135,054 docs/sec

As can be seen, for the test data set, TinyBits is the fastest encoder with ~9% advantage over MessagePack while doing a lot more work to compress and deduplicate.
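For readers who want to reproduce this kind of docs/sec figure on their own data, here is a small harness sketch using only Ruby's standard library. It is not the benchmark used for the numbers above; `rate_per_second` is a made-up helper and the sample document is invented. To compare serializers, the same block would be run with each library's pack and unpack calls in place of the JSON ones.

```ruby
require "json"

# Hypothetical harness: runs the block repeatedly for roughly `seconds`
# and returns the number of calls completed per second.
def rate_per_second(seconds = 0.2)
  count = 0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  now = start
  while now - start < seconds
    100.times { yield }  # batch the calls so the clock is read less often
    count += 100
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
  count / (now - start)
end

# Invented sample document: 10 records, loosely like a database result set.
document = Array.new(10) do |i|
  { "user_id" => i, "score" => i * 1.5, "name" => "user#{i}" }
end
encoded = JSON.generate(document)

encode_rate = rate_per_second { JSON.generate(document) }
decode_rate = rate_per_second { JSON.parse(encoded) }

puts format("encode: %d docs/sec, decode: %d docs/sec", encode_rate, decode_rate)
```

A monotonic clock is used so that wall-clock adjustments do not skew short runs; for serious comparisons, longer runs and a warm-up pass would be advisable.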

Fast Decoding

As a nice side effect of its compact encoded size, the decoding process is much faster, and thanks to string deduplication, TinyBits doesn’t need to create many Ruby strings. As a result we see the following decoding performance.

Format        Average Decoding Rate
Oj            25,281 docs/sec
JSON          40,590 docs/sec
CBOR          32,126 docs/sec
MessagePack   35,428 docs/sec
TinyBits      55,880 docs/sec

In this test, the binary formats generally lag behind the standard library JSON parser. But TinyBits beats the JSON parser in decoding, delivering ~38% more performance.

Compressibility

Encoding to smaller sizes is one thing; ultimately delivering the smallest size is another. Case in point: MessagePack usually encodes to a smaller size than JSON, which is great if you are sending the data as is. If you try to compress the data further, though, it turns out that in most cases, especially when using Zstd, JSON compresses to a smaller absolute size. The same is true of CBOR.

TinyBits, on the other hand, stays very competitive with JSON compressibility, trading blows with it over the documents in the data set and edging it slightly on average. If you are using LZ4 specifically, then TinyBits is always more compressible than JSON.

Here are the average sizes of the documents when compressed using LZ4 and Zstd

Format        Raw Size      LZ4 Size      Zstd Size
JSON          3,096 bytes   1,339 bytes   952 bytes
CBOR          2,625 bytes   1,294 bytes   992 bytes
MessagePack   2,239 bytes   1,294 bytes   1,012 bytes
TinyBits      1,458 bytes   1,160 bytes   942 bytes

Note that for these documents, raw TinyBits, without any compression applied, is just 53% larger than the smallest compressed output of the other formats (JSON + Zstd).
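The raw versus compressed comparison is easy to try on your own documents. LZ4 and Zstd are not part of Ruby's standard library, so the sketch below swaps in zlib's DEFLATE purely as a stand-in; its numbers will not match the table above, and the sample records are invented.

```ruby
require "json"
require "zlib"

# Invented sample data with the kind of key repetition discussed above.
records = Array.new(10) do |i|
  { "user_id" => i, "score" => i * 1.5, "rank" => i * 7, "name" => "user#{i}" }
end

raw = JSON.generate(records)

# DEFLATE via zlib stands in for LZ4/Zstd here (an assumption of this sketch).
compressed = Zlib::Deflate.deflate(raw, Zlib::BEST_COMPRESSION)

ratio = compressed.bytesize.fdiv(raw.bytesize)
puts format("raw: %d bytes, deflated: %d bytes (%.0f%% of raw)",
            raw.bytesize, compressed.bytesize, ratio * 100)
```

Running the same measurement on the output of each serializer is what produces a table like the one above: the format that is smallest before compression is not necessarily the smallest after it.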

Memory Usage

As we have seen before, most of these serializers are memory efficient. Still, TinyBits' Ruby extension allocates the least amount of memory of all the serializers tested, even if by a small margin.

Format        Encoding Memory
Oj            3.21 KB/doc
JSON          4.19 KB/doc
CBOR          2.81 KB/doc
MessagePack   2.74 KB/doc
TinyBits      1.53 KB/doc

Format        Decoding Memory
Oj            11.13 KB/doc
JSON          10.55 KB/doc
CBOR          16.46 KB/doc
MessagePack   10.5 KB/doc
TinyBits      9.62 KB/doc

An All Rounder

As we have seen, TinyBits, at least for the provided data set, offers the best encoding/decoding performance while delivering the smallest encoded sizes, without sacrificing compressibility in the process.


In summary TinyBits is:

  • ~9% faster than the fastest existing schema-less encoder (MessagePack)
  • ~38% faster than the fastest existing schema-less decoder (stdlib JSON)
  • All while producing roughly a 35% reduction in size compared to the most compact existing format (MessagePack)
  • Not to mention that it compresses as well as, if not slightly better than, JSON, which is usually the most compressible format

End to end performance

Consider the combined advantage when a document is encoded, transmitted (assuming a 100Mbps connection) and then decoded:

Format        Encoding Time (us)   Transmission Time (us)   Decoding Time (us)   Total Time (us)
JSON          13.54                236.2                    24.64                274.37
CBOR          9.48                 200.29                   31.13                240.89
MessagePack   8.06                 201.32                   28.23                237.60
TinyBits      7.40                 111.24                   17.9                 136.54

In this (hypothetical) scenario, TinyBits delivers roughly 43% time savings over the second best alternative (MessagePack): 136.54 us versus 237.60 us in total.
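The arithmetic behind such an end-to-end estimate can be sketched in a few lines of Ruby. The helper `end_to_end_us` is made up for illustration, and one detail is an assumption on my side: the TinyBits transmission time of 111.24 us is reproduced when "100Mbps" is read as 100 × 2^20 bits per second, so that is the constant used here.

```ruby
# Assumption: "100Mbps" interpreted as 100 * 2^20 bits per second.
BITS_PER_SECOND = 100 * 2**20

# Hypothetical helper: per-document times in microseconds, derived from
# throughput (docs/sec) and average encoded size (bytes).
def end_to_end_us(encode_rate:, decode_rate:, size_bytes:, bits_per_second: BITS_PER_SECOND)
  encode_us   = 1_000_000.0 / encode_rate
  transmit_us = size_bytes * 8 * 1_000_000.0 / bits_per_second
  decode_us   = 1_000_000.0 / decode_rate
  { encode: encode_us, transmit: transmit_us, decode: decode_us,
    total: encode_us + transmit_us + decode_us }
end

# TinyBits averages taken from the tables in this post.
tinybits = end_to_end_us(encode_rate: 135_054, decode_rate: 55_880, size_bytes: 1_458)
tinybits.each { |stage, us| puts format("%-9s %7.2f us", stage, us) }
```

At this bandwidth the transmission term dominates the total, which is why encoded size ends up mattering more than raw encode or decode speed in this scenario.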

Bonus Feature: Multi-Object Packing (experimental)

The TinyBits Ruby gem comes with a feature that was purpose built for my use case, but I believe it could be useful for others as well.

  • The packer can stitch multiple objects into the same buffer
  • The objects can be added to the buffer at separate points in time (e.g. once each is generated)
  • The objects will all share the same deduplication dictionary
  • The unpacker can unpack these objects one after the other

I had a specific need for this feature as I wanted to:

  • Capture objects on the fly as they are being generated, without copying them or keeping them around
  • Benefit from shared strings across those different objects
  • Still be able to unpack each object individually on the other end

Here’s an example code snippet

    require 'tinybits'

    packer = TinyBits::Packer.new
    objects = [{obj1: 1}, {obj2: 3.1}, ['obj1', 'obj2']]
    objects.each { |obj| packer << obj }
    packed = packer.to_s

    unpacker = TinyBits::Unpacker.new
    unpacker.buffer = packed
    until unpacker.finished?
      pp unpacker.pop
    end
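To illustrate what a shared deduplication dictionary buys across separately added objects, here is a toy model in plain Ruby. `ToyPacker` and `ToyUnpacker` are invented names that only mimic the shape of the interface above; they store token trees in an array, not TinyBits' binary buffer, and nothing here reflects the gem's internals.

```ruby
# Toy model only: objects appended at different times share one string table.
class ToyPacker
  def initialize
    @seen = {}    # string => index of first occurrence, shared by all objects
    @tokens = []  # one token tree per appended object
  end

  def <<(obj)
    @tokens << encode(obj)
    self
  end

  def to_a
    @tokens.dup
  end

  private

  def encode(value)
    case value
    when Hash  then [:map, value.map { |k, v| [encode(k.to_s), encode(v)] }]
    when Array then [:array, value.map { |v| encode(v) }]
    when String
      return [:ref, @seen[value]] if @seen.key?(value)

      @seen[value] = @seen.size
      [:str, value]
    else
      [:scalar, value]
    end
  end
end

class ToyUnpacker
  def initialize(tokens)
    @queue = tokens.dup
    @strings = []  # rebuilt in the same order the packer first saw them
  end

  def finished?
    @queue.empty?
  end

  # Decodes and returns the next object in the stream.
  def pop
    decode(@queue.shift)
  end

  private

  def decode(token)
    tag, payload = token
    case tag
    when :map   then payload.map { |k, v| [decode(k), decode(v)] }.to_h
    when :array then payload.map { |t| decode(t) }
    when :str
      @strings << payload
      payload
    when :ref   then @strings[payload]
    else payload
    end
  end
end

packer = ToyPacker.new
[{ obj1: 1 }, { obj2: 3.1 }, %w[obj1 obj2]].each { |obj| packer << obj }

unpacker = ToyUnpacker.new(packer.to_a)
results = []
results << unpacker.pop until unpacker.finished?
p packer.to_a.last  # prints [:array, [[:ref, 0], [:ref, 1]]]
```

The third object carries no string data of its own: both of its elements resolve through the table built while the first two objects were packed. The flip side is that objects in such a stream have to be unpacked in order, since later ones may refer to strings introduced by earlier ones.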

State Of Development

The TinyBits C library and the format spec are almost finalized. Currently only an (experimental) Ruby extension is available. The Python extension is also progressing nicely, and work is underway to produce extensions for other languages and platforms. Of course, any help in that regard will be greatly appreciated.

Conclusion

Born from a real need to encode->transmit->decode schema-less data efficiently, TinyBits turns out to be the most space efficient and generally the fastest of the existing options for serializing schema-less data. Give it a try and see for yourself!
