I realise the title is a bit of a mouthful. It was either that or “No-copy IPC lock-free queue for variable length messages”. Naming is hard.
DISCLAIMER: It’s not uncommon that one needs to send messages to another process as fast as possible, but in 99% of cases the best answer for that is boost::interprocess[1]. Even in the remaining 1%, in production one should obviously be very careful before attempting to re-invent the wheel. However, as the spirit of this blog is to learn and show how the sausage is made, we shall do just that 😉
Here is the code for those who just want to get to the point.
Intro
It’s easy(ish) to create a ring-based lock-free queue that is Single Producer-Single Consumer[2], but it gets tricky if one also needs variable-size messages. The problem with variable-size messages is that the end of the ring and the beginning of the ring could hold the two halves of a single message, so we’d have to copy it into a contiguous buffer before reading it. This is usually okay, but for a (very) small number of users it is not really an option: needless calls to memcpy should be avoided and messages should be read straight off the queue’s buffer.
There is one way (that I know of) to have a contiguous address space that loops around, and it boils down to mmapping the ring region twice and making sure the two mappings are next to one another. This is what this project accomplishes.
Implementation
To abstract the calls to mmap and friends we use a small RAII class called mapped_memory. This will take care of unmapping things for us when we are done using them. The double mapping magic happens in a factory function that builds our reader and writer queues:
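A minimal sketch of what that factory might look like, reconstructed from the description below; the names, the signature and the omitted error handling are my assumptions, not the repo’s verbatim code:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <string>

template <typename Derived>
static Derived queue_factory(const std::string& control_file,
                             const std::string& ring_file,
                             std::size_t size)
{
    // Reserve a contiguous anonymous region twice the ring size, so that
    // the two real mappings below are guaranteed to fit side by side.
    void* base = ::mmap(nullptr, 2 * size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Map the same ring file twice with the same size, but pin the second
    // mapping to the address where the first one ends: the address space
    // now "loops around" transparently.
    int fd = ::open(ring_file.c_str(), O_RDWR);
    void* first  = ::mmap(base, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_FIXED, fd, 0);
    void* second = ::mmap(static_cast<char*>(base) + size, size,
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_FIXED, fd, 0);
    (void)second;
    ::close(fd);

    // The control block file is mapped separately (elided here), and this
    // constructor call is hypothetical.
    return Derived{control_file, first, size};
}
```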
Ignoring for the moment the return type Derived, the first step is creating an anonymous mapping which is twice the size of the ring buffer; this ensures that we can actually map two regions one next to the other. Right after that, we create the two mappings, passing the same file and size to both, but crucially passing the address of the end of the first mapping as the fixed address of the second.
One notable detail of the above function is that it takes a file path to a “control block”: this is a file containing the following data structure:
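Something along these lines; the member names are my guesses, the alignment trick is the point:

```cpp
#include <atomic>
#include <cstdint>
#include <new>  // std::hardware_destructive_interference_size

struct control_block
{
    // Each offset sits on its own cache line to avoid false sharing
    // between the reader and writer processes.
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> reader_offset;

    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> writer_offset;
};
```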
We note that we use a feature of C++17: std::hardware_destructive_interference_size[3]. This helps us split the two important members of the header, the reader and writer pointers, into two different cache lines. This is a common technique in lock-free data structures, as you don’t want the reader thread/process to invalidate the writer thread/process’s cache line while moving its pointer, and vice versa.
As mentioned, we use our queue_factory function to construct the two “queue” classes: queue_reader and queue_writer. The first one is:
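In sketch form, with the RAII members and error handling stripped; the offset arithmetic is illustrative rather than the repo’s exact code:

```cpp
class queue_reader
{
public:
    // Attempt to view the next `size` bytes; returns an empty view if
    // the writer has not produced that much data yet.
    const_view get_buffer(std::size_t size) const
    {
        const auto read  = header_->reader_offset.load(std::memory_order_relaxed);
        const auto write = header_->writer_offset.load(std::memory_order_acquire);
        if (write - read < size)
            return {};
        // Thanks to the double mapping this view is contiguous even when
        // it straddles the physical end of the ring.
        return {data_ + (read % capacity_), size};
    }

    // Consume `size` bytes and publish the new reader offset, telling
    // the writer that space has been freed.
    void pop(std::size_t size)
    {
        header_->reader_offset.fetch_add(size, std::memory_order_release);
    }

private:
    control_block* header_;
    const char*    data_;
    std::size_t    capacity_;
};
```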
The reader application has to call get_buffer to attempt reading N bytes, and the return value could be an empty buffer if N bytes are not available yet (e.g. the writer is being slow). The returned const_view is a non-owning buffer akin to asio::const_buffer (which could be a drop-in replacement in production), containing a pointer and a size member. Its interface is:
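Roughly the following, with the member names being my guesses:

```cpp
struct const_view
{
    const char* data = nullptr;
    std::size_t size = 0;

    // Enables `if (auto buf = reader.get_buffer(n)) { ... }` checks.
    explicit operator bool() const { return data != nullptr; }
};
```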
The buffer overloads operator bool, so it’s easy to check whether the call to get_buffer was successful.
Notably, the user is handed a pointer straight into the internal buffer, so no copy is made. Once the user has read the message, they have to call pop to advance the internal reader pointer; this notifies the writer that some space has just been freed up.
Similarly, the writer class is:
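Again a sketch; mutable_view, the writable sibling of const_view, is an assumed name:

```cpp
struct mutable_view
{
    char*       data = nullptr;
    std::size_t size = 0;
    explicit operator bool() const { return data != nullptr; }
};

class queue_writer
{
public:
    // Attempt to grab a writable region of `size` bytes; returns an
    // empty view if the reader has not freed enough space yet.
    mutable_view get_buffer(std::size_t size)
    {
        const auto write = header_->writer_offset.load(std::memory_order_relaxed);
        const auto read  = header_->reader_offset.load(std::memory_order_acquire);
        if (capacity_ - (write - read) < size)
            return {};
        return {data_ + (write % capacity_), size};
    }

    // Publish `size` freshly written bytes to the reader.
    void push(std::size_t size)
    {
        header_->writer_offset.fetch_add(size, std::memory_order_release);
    }

private:
    control_block* header_;
    char*          data_;
    std::size_t    capacity_;
};
```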
The only other noteworthy detail is the conspicuous CRTP[4]: that’s just there as a trick to inherit the factory function we described above, which can then use the Derived template parameter as its return type.
Examples
There are two apps in the repo: a reader and a writer. Starting with the writer, after a bit of boilerplate we find:
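It looks something like this; the paths, the ring size and the queue_factory call are placeholders of mine rather than the repo’s exact code:

```cpp
// Build the writer over the shared mappings (hypothetical paths/size).
auto writer = queue_factory<queue_writer>("/tmp/control", "/tmp/ring", ring_size);

// Hard-coded message; in a real scenario this is arbitrary data.
const std::string message = "hello, reader!";

message_header header{};
header.length = message.size();
// header.timestamp would be filled from the TSC in the real app.

// Reserve space for header + payload, copy both in, then publish.
if (auto buffer = writer.get_buffer(sizeof(header) + message.size()))
{
    std::memcpy(buffer.data, &header, sizeof(header));
    std::memcpy(buffer.data + sizeof(header), message.data(), message.size());
    writer.push(sizeof(header) + message.size());
}
```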
In the first couple of lines we set up the writer and hard-code a message to send. In a real scenario the message could be anything of any length, and obviously not the same every time.
The following lines mention a header. This is a customizable struct that is used by the reader to make sense of the incoming message. In our example the header is:
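With fields along these lines; the exact members are my assumption, the packing is the point:

```cpp
struct __attribute__((packed)) message_header
{
    std::uint64_t timestamp;  // TSC value when the message was sent
    std::uint64_t length;     // payload size in bytes
};
```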
We notice right away that the struct is packed, even though the members are carefully placed so as not to leave any padding anyway. What we actually want to achieve with packing is changing its alignment[6] to 1. This is because we need to read it straight off the buffer via reinterpret_cast; doing so is already technically UB (at least until we get std::bless/std::start_lifetime_as[7][8]), but leaving the struct with its natural alignment could additionally lead to crashes on misaligned reads.
After that we just call get_buffer and memcpy into the buffer, followed by a push of the written bytes.
The reader is symmetric as expected:
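Sketched, with the same assumed names as above:

```cpp
auto reader = queue_factory<queue_reader>("/tmp/control", "/tmp/ring", ring_size);

for (;;)
{
    // First peek at the fixed-size header...
    auto head_buf = reader.get_buffer(sizeof(message_header));
    if (!head_buf)
        continue;  // nothing to read yet

    // ...which is alignment-1, so reading it in place is safe(ish).
    auto* header = reinterpret_cast<const message_header*>(head_buf.data);

    // Now that we know the payload length, ask for the whole message.
    auto msg_buf = reader.get_buffer(sizeof(message_header) + header->length);
    if (!msg_buf)
        continue;

    const char* payload = msg_buf.data + sizeof(message_header);
    // ... consume the payload straight off the queue's buffer ...

    reader.pop(sizeof(message_header) + header->length);
}
```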
In the above you can see that first we reinterpret_cast the header; then we know how long the incoming message is, so we can call get_buffer again.
Performance Analysis
In the reader’s and writer’s code you can find a timer and a vector that stores the TSC[9] deltas between sending and receiving for a million messages. Running the reader and the writer, we are greeted with the following output:
```
Receiving 1000000 messages took: 59490ms
INFO: Average: 110.163
INFO: st. dev: 1534.65
INFO: 99th percentile: 238
```

The total time it took is irrelevant, since we are artificially stalling the writer after every message to simulate “sparse events”. Feel free to modify the code of the writer to send messages in a loop and measure throughput.
Then we can use an included gnuplot script (see the “plot” directory) to produce the following histograms of the TSC deltas:

The TSC deltas in the graph are in “cycles”, and we notice that the distribution is somewhat discrete. This kind of spectrum reflects the different kinds of “cache misses” our application is experiencing. Regardless, we achieved an impressive speed: considering that my test machine has an invariant TSC of about 1696 MHz, an average of ~110 cycles puts our latencies at around 65 nanoseconds (110 / 1.696 GHz ≈ 65 ns)!
Lots more could be done to investigate the performance of this queue, and perhaps we will in a follow-up post. To name a few things we should definitely do:
- Run with “perf stat”. Perf can quickly tell us how many cache misses and branch mispredictions we have, and whether the CPU is stalled waiting for instructions or data (here almost certainly the latter)
- Check the generated assembly. It could be fun to see how the code gets optimised on the hot path on an x86 platform
- Plug the code into a more realistic application, and measure performance again. There is only so much information that micro-benchmarking can give you; you need to profile in a real environment.
That’s it for the moment. As usual, if you have any comments and/or corrections and/or suggestions, please do let me know 😉
References
[1] https://www.boost.org/doc/libs/1_63_0/doc/html/interprocess.html
[2] https://github.com/facebook/folly/blob/main/folly/ProducerConsumerQueue.h
[3] https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
[4] https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
[5] https://www.oreilly.com/library/view/97-things-every/9780596809515/ch55.html
[6] https://en.wikipedia.org/wiki/Data_structure_alignment
[7] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0593r4.html#166x-implicit-object-creation-ptrbless
[8] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2590r0.pdf
[9] https://en.wikipedia.org/wiki/Time_Stamp_Counter