Memory Maps (MMAP) Deliver 25x Faster File Access in Go

2 days ago 1

One of the slowest things you can do in an application is making system calls. They're slow because you do have to enter the kernel, which is quite expensive. What should you do when you need to do a lot of disk I/O but you care about performance? One solution is to use memory maps.

Memory maps are a modern Unix mechanism where you can take a file and make it part of the virtual memory. In Unix context, modern means that it was introduced in the 1980s or later. You have a file, containing data, you mmap it and you'll get a pointer to where this resides. Now, instead of seeking and reading, you just read from this pointer, adjusting the offset to get to the right data.

Performance

To show what kind of performance you can get using memory maps, I've written a little Go library that allows you to read from a file using a memory map or a ReaderAt. ReaderAt will do a pread(), which is a seek/read combo, while mmap will just read from the memory map.

This almost feels like magic. Initially, when we launched Varnish Cache back in 2006, this was one of the features that made Varnish Cache very fast when delivering content. Varnish Cache would use memory maps to deliver content at blistering speeds.

Also, since you can operate with pointers into memory that is allocated by the memory map, you'll reduce memory pressure as well as raw latency.

The Downside of Memory Maps

The downside of memory maps is that you really can't write to the memory map. The reason is due to the way virtual memory works. When you're writing to a part of virtual memory that isn't mapped into physical memory, the CPU will generate a page fault. On a modern computer, the CPU is responsible for tracking what virtual memory pages are mapped onto what physical memory. Since you're writing to a page that isn't mapped, the CPU needs help.

So, when the page fault occurs, the OS will 1) allocate a new memory page, 2) read the contents of the file at the correct offset, 3) write this to the new memory page. Then control is returned to the application. The application will now overwrite the virtual memory page with new data.

Can we stop and appreciate how extremely inefficient this is? I think it is fairly safe to say that writing through a memory map is never a good idea when considering performance. At least if there is any risk, the file isn't mapped up in physical memory.

Let me illustrate this with a few more benchmarks.

As you can see, whether or not the pages are in cache is crucial for performance. WriterAt, which uses the pwrite call, is a much more predictable bet.

Still, writing through memory maps, was what Varnish Cache did initially. It somehow got away with it, but mostly because the competition was pretty bad.

This is why Varnish Cache got the malloc backend and why Varnish Enterprise got the various Massive Storage Engines. The malloc backend resolved the problem by just allocating system memory through the malloc system call, and the Massive Storage Engine uses io_uring, which is so new that support for it is still somewhat limited.

Using Memory Maps to Solve Real-world Performance Problems

The last couple of weeks I've been working on an HTTP-backed filesystem. This is part of our AI Storage Acceleration solution, geared towards high performance computing environments. In this filesystem we needed a way to transfer folder data over HTTP. A folder is really just a listing of files, symbolic links and directories. The naive approach would be just to use JSON encoding, but JSON is notorious for being slow.

Our priority is performance. We made a benchmarking suite, comparing various databases with each other. CDB was overall the fastest. Looking at the numbers, we'd still see that CDB would spend something like 1200ns on a database lookup that was entirely in the page cache. This seems very slow to me. After all, everything should be in memory and spending 1200ns reading memory sounds at least 100x too slow. I started looking into the CDB implementation I was using. It was the above ReaderAt implementation. So, most of the time is likely spent waiting for the operating system.

Some hours later, I was able to replace the seek/read with a memory map. This resulted in a 25x improvement in performance. Again, it feels like magic. Unlike the original file stevedore in Varnish Cache, this performance improvement has no downside.

Benchmarks: https://github.com/perbu/mmaps-in-go 
CDB64 files with memory maps: https://github.com/perbu/cdb 

Read Entire Article