Blazing Matrix Products


The first step towards higher performance is blocking, which improves cache access patterns. Using a straightforward square partitioning of the input matrices (no specialized assembly kernels, just the native BQN idiom), speed-ups of approximately sixfold are achievable for matrices that exceed the machine's cache size:

mat‿mbt ← ⟨⋈˜2⥊500, ⋈˜5⥊600⟩ /¨⊸⊔¨ ma‿mb ← •rand.Range⟜0¨1e3×⟨1‿1, 3‿3⟩
>⟨ma‿ma‿mat, mb‿mb‿mbt⟩ {𝕎˜•_timed𝕩}¨¨˜ <⟨Dgemm, +˝∘×⎉1‿∞, ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞⟩
┌─
╵ 0.008988871 0.646108393 0.37081367400000004
  0.16528436400000002 45.110128999000004 7.460860705000001
                                                           ┘
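
As a quick sanity check, the blocked idiom should reproduce the flat product exactly on a small integer example, where no floating-point rounding can mask a mistake (the names below are just throwaway test values):

a‿b ← ⟨6‿6 •rand.Range 10, 6‿6 •rand.Range 10⟩   # small integer test matrices
at‿bt ← ⟨⋈˜2⥊3, ⋈˜2⥊3⟩ /¨⊸⊔¨ a‿b                 # 2×2 arrangements of 3×3 blocks
(a +˝∘×⎉1‿∞ b) ≡ ∾ at (+˝+˝∘×⎉1‿∞¨)⎉1‿∞ bt       # expected: 1 (blocked and flat products agree)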

Moreover, there is only a modest 10-character leap from +˝∘×⎉1‿∞ to ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞. As a bonus, here's a convenient function that squares a matrix 𝕩 (the building block for matrix powers, which are particularly useful in graph theory) in blocks of size 𝕨, padding with zeros as needed:

MPB ← {𝕩≢⊸↑∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞˜𝕩(⥊⟜𝕨¨∘⊢/¨⊸⊔𝕨⊸×↑⊣)⌈𝕨÷˜≢𝕩}
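A quick usage sketch: since the blocked product is applied with ˜, MPB multiplies the padded, blocked matrix by itself, and thanks to the padding the size need not be a multiple of the block size. On integer input it should agree exactly with the flat idiom:

x ← 7‿7 •rand.Range 10     # 7 is not a multiple of the block size, so padding kicks in
(x +˝∘×⎉1‿∞ x) ≡ 3 MPB x   # expected: 1 (padded, blocked square matches the flat one)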

Here I could have used a fancier but slower formulation based on Under: 𝔽˜⌾((/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩)⊸⊔). Or even the memory-hungry outer-product formulation +˝⍉∘⊢(+˝∘×⎉1‿∞¨)⌜˘⊢, which is only marginally slower. A naïve search for the optimal block size yields:

(300+50×↕8) {𝕨⊸MPB•_timed𝕩}¨ <3e3‿3e3 •rand.Range 0
⟨ 8.30279774 10.112563361000001 9.781014477000001 9.670085717000001 7.556631647000001 10.970897867000001 7.570657628 10.231164773000001 ⟩

Although nested tiling can be easily implemented, I found that it does not improve performance at all:

MPB2 ← {∾∾×_p¨_p¨(_p←{+˝∘𝔽⎉1‿∞})˜𝕩{𝕩⊔˜/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩}´𝕨}
⟨10‿60, 4‿250, 3‿500⟩ {𝕨⊸MPB2•_timed𝕩}¨ <3e3‿3e3•rand.Range 0
⟨ 14.096323785000001 9.16644102 7.668334754000001 ⟩
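
Still, a small equivalence check is reassuring. Note that MPB2, unlike MPB, does no padding, so each block size is assumed to divide its grid evenly (the test value below is illustrative):

y ← 100‿100 •rand.Range 10     # 10 divides 100, and 5 divides the resulting 10×10 block grid
(y +˝∘×⎉1‿∞ y) ≡ 5‿10 MPB2 y   # expected: 1 (nested tiling computes the same square)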

But we're not finished yet. Here is a little divide-and-conquer (and cache-oblivious) Strassen multiplication in its classic radix-2 form. It works for square matrices of any size: if a dimension is odd, we pad with an extra row and column and trim the result back to the original shape.

_strassen_ ← {𝕘≥≠𝕩 ? 𝕨𝔽𝕩;   # once the size is at most 𝕘, fall back to the base multiplication 𝔽
  [a‿b,c‿d]‿[e‿f,g‿h] ← (2⊸⥊¨∘⊢/¨⊸⊔2⊸×↑⊣)¨⟜(⌈2÷˜≢¨)𝕨‿𝕩   # pad to even size and split into quadrants
  p1‿p2‿p3‿p4‿p5‿p6‿p7 ← 𝕊´¨⟨a+d,e+h⟩‿⟨c+d,e⟩‿⟨a,f-h⟩‿⟨d,g-e⟩‿⟨a+b,h⟩‿⟨c-a,e+f⟩‿⟨b-d,g+h⟩   # the seven recursive products
  𝕩≢⊸↑∾⟨p1+p4+p7-p5, p3+p5⟩≍⟨p2+p4, p1+p3+p6-p2⟩   # recombine and trim to the original shape
}
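
Before benchmarking, a quick correctness check with an odd dimension exercises the padding path; with integer entries the comparison should be exact (the threshold 32 below is arbitrary):

p‿q ← ⟨101‿101 •rand.Range 10, 101‿101 •rand.Range 10⟩
(p +˝∘×⎉1‿∞ q) ≡ p (+˝∘×⎉1‿∞ _strassen_ 32) q   # expected: 1 (padding and recombination preserve the result)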

Let's go somewhat big for a solid 9x speed-up over the naive implementation:

⟨+˝∘×⎉1‿∞, 600⊸MPB, +˝∘×⎉1‿∞ _strassen_ 256, Dgemm _strassen_ 256, Dgemm⟩ {𝕎˜•_timed𝕩}¨ <4096‿4096•rand.Range 0
⟨ 121.21441014300001 23.299975492 13.688074838 2.1399266160000003 0.400549596 ⟩

To the best of my ability, this marks the limit of what can be achieved with a pure, single-threaded BQN implementation.
