The first step towards higher performance is blocking, which optimizes cache access patterns. Using a straightforward square partitioning of the input matrices (no specialized assembly kernels, just the native BQN idiom), speed-ups of approximately sixfold are achievable for matrices that exceed the machine's cache size.
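In outline, and with plumbing of my own (the names `MulN`, `Tile`, and `_mulB` are illustrative, not from the original code), the blocked product can be sketched like this, assuming the tile size divides the matrix side:

```
MulN ← +˝∘×⎉1‿∞                  # the plain dot-product idiom

# Cut 𝕩 into 𝕨×𝕨 tiles; assumes 𝕨 divides both axes of 𝕩
Tile ← {(/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩)⊔𝕩}

# Blocked product: tile both operands, apply the same idiom one level
# up on the matrix of blocks (𝕗 is the tile size), then flatten with ∾
_mulB ← {∾ 𝕨 ((+˝MulN¨)⎉1‿∞)○(𝕗⊸Tile) 𝕩}
```

Here `a (64 _mulB) b` multiplies with 64×64 tiles; the `∾(+˝MulN¨)⎉1‿∞` part is exactly the expression discussed next, split into named pieces.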
Moreover, there is only a modest 10-character leap from `+˝∘×⎉1‿∞` to `∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞`. As a bonus, here's a convenient function to compute powers of a square matrix `𝕩` (particularly useful in graph theory) using blocks of size `𝕨`, which pads the matrix with zeros as needed.
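The original definition isn't reproduced here; a sketch of the idea (zero-pad up to a multiple of the block size, multiply, crop back), reusing the names above and taking the exponent as an operand:

```
# 𝕗: exponent (≥1), 𝕨: block size, 𝕩: square matrix; names illustrative
_pow ← {
  s ← ≠𝕩                        # original side length
  p ← 𝕨×⌈s÷𝕨                   # padded side, the next multiple of 𝕨
  s‿s ↑ (𝕨 _mulB)´ 𝕗⥊<p‿p↑𝕩   # zero-pad, fold the blocked product, crop
}
```

For instance, `64 (5 _pow) g` would give the fifth power of an adjacency matrix `g` using 64×64 blocks; the zero border survives every multiplication, so cropping at the end recovers the true result.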
Here I could have used a fancier but slower Under formulation, `𝔽˜⌾((/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩)⊸⊔)`. Or even the memory-hungry outer-product formulation `+˝⍉∘⊢(+˝∘×⎉1‿∞¨)⌜˘⊢`, which is only marginally slower. A naïve search for the optimal block size is easy to run.
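The measured numbers aren't reproduced here, but with `•_timed` such a scan might look like this (matrix size and candidate blocks are illustrative):

```
n ← 1024 ⋄ a ← n‿n •rand.Range 0                   # random test matrix
{⟨𝕩, a⊸(𝕩 _mulB) •_timed a⟩}¨ 8‿16‿32‿64‿128‿256  # ⟨block, seconds⟩ pairs
```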
Although nested tiling can be implemented easily, I found that it does not improve performance at all.
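For completeness, a two-level version under the same assumptions (outer tiles of size `𝕗`, each tile product using the `𝕘`-blocked kernel, where `𝕘` must divide `𝕗`):

```
# Nested tiling: outer 𝕗×𝕗 tiles whose products use the 𝕘-blocked kernel
_mul2_ ← {∾ 𝕨 ((+˝(𝕘 _mulB)¨)⎉1‿∞)○(𝕗⊸Tile) 𝕩}
```

Called as, say, `a (256 _mul2_ 64) b`.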
But we're not finished yet. Here is a little divide-and-conquer (and cache-oblivious) algorithm in its classic radix-2 form. It works for any square matrix, regardless of dimension: if the dimension is odd, we pad with an extra row and column, and crop back to the original size at the end.
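A sketch of such a scheme, with a recursion cutoff and names of my own choosing:

```
MulDC ← {
  64 ≥ ≠𝕩 ? 𝕨 MulN 𝕩 ;          # small enough: plain kernel (cutoff is a guess)
  s ← ≠𝕩 ⋄ p ← s + 2|s           # odd side: pad by one row and column
  Q ← ((p÷2)⊸Tile)∘(p‿p⊸↑)       # zero-pad, cut into a 2×2 block matrix
  s‿s ↑ ∾ (Q𝕨) (+˝𝕊¨)⎉1‿∞ Q𝕩   # recurse on half-size blocks, crop
}
```

Because each level only halves the matrices, the recursion adapts to every cache level without ever being told a block size, which is what makes it cache-oblivious.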
Let's go somewhat big, for a solid 9x speed-up over the naive implementation.
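The benchmark harness isn't shown; with `•_timed` the comparison might be as simple as (size illustrative):

```
n ← 4096 ⋄ a ← n‿n •rand.Range 0           # a "somewhat big" random matrix
(a⊸MulN •_timed a) ÷ a⊸MulDC •_timed a     # naive time over D&C time
```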
To the best of my ability, this marks the limit of what can be achieved with a pure, single-threaded BQN implementation³.


