Dot Product on Misaligned Data


One of my most popular blog posts of all time is Data alignment for speed: myth or reality? According to my dashboard, hundreds of people a week still load that old post. A few times a year, I get an email from someone who disagrees.

The blog post makes a simple point. Programmers are often told to worry about ‘unaligned loads’ for performance. The point of my blog post is that you should generally not worry about alignment when optimizing your code.

An unaligned load occurs when a processor attempts to read data from memory at an address that is not properly aligned. Most computer architectures require data to be accessed at addresses that are multiples of the data’s size (e.g., 4-byte data should be accessed at addresses divisible by 4). For example, a 4-byte integer or float should be loaded from an address like 0x1000 or 0x1004 (aligned), but if the load is attempted from 0x1001 (not divisible by 4), it is unaligned. In some conditions, an unaligned load can crash your system, and it generally leads to undefined behavior in C or C++.

Related to this alignment issue is that data is typically organized in cache lines (64 bytes or 128 bytes on most systems) that are loaded together. If you load data randomly from memory, you might touch two cache lines, which could cause an additional cache miss. If you need to load data spanning two cache lines, there might be a penalty (say one cycle) as the processor needs to access the two cache lines and reassemble the data. Further, there is also the concept of a page of memory (4 kB or more). Accessing an additional page can be costly, and you typically want to avoid accessing more pages than you need to. However, you have to be somewhat unlucky to frequently cross two pages with one load operation.

How can you end up with unaligned loads? It often happens when you access low-level data structures, mapping fields onto raw bytes. For example, you might load a binary file from disk whose format says that all the bytes after the first one are 32-bit integers. Without copying the data, it could be difficult to align it. You might also be packing data: imagine that you have a pair of values, one that fits in one byte and another that requires 4 bytes. You could pack these values into 5 bytes instead of 8 bytes.

There are cases where you should worry about alignment: if you are crafting your own memory-copy function and want to remain standard compliant (in C/C++), or if you need atomic operations (for multithreaded code).

However, my general point is that it is unlikely to be a performance concern.

I decided to run a new test, given that I haven’t revisited this problem since 2012. Back then I used a hash function; this time I use SIMD-based dot products with either ARM NEON intrinsics or AVX2 intrinsics. I build two large arrays of 32-bit floats and compute the scalar product. That is, I multiply the elements pairwise and sum the products.

I run benchmarks on an Apple M4 processor as well as on an Intel Ice Lake processor.

On the Apple M4… we cannot see the alignment.

Byte Offset   ns/float   ins/float   instructions/cycle
0             0.18       2.00        2.40
1             0.18       2.00        2.40
2             0.18       2.00        2.40
3             0.18       2.00        2.40
4             0.18       2.00        2.40
5             0.18       2.00        2.40
6             0.18       2.00        2.40
7             0.18       2.00        2.40

And neither can we on the Intel Ice Lake processor.

Byte Offset   ns/float   ins/float   instructions/cycle
0             0.42       0.75        0.56
1             0.42       0.75        0.56
2             0.42       0.75        0.56
3             0.42       0.75        0.56
4             0.42       0.75        0.56
5             0.42       0.75        0.55
6             0.42       0.75        0.56
7             0.42       0.75        0.56

My point is not that you cannot somehow detect the performance difference due to alignment in some tests. My point is that it is simply not something that you should generally worry about as far as performance goes.

My source code is available.
