Data-Driven Loop Fusion


Loop Fusion

In scientific and machine learning programs, computers often perform repetitive tasks using loops. These loops frequently need to use the same information, which is usually stored in arrays (like a list or grid of numbers). When a computer runs these loops one after another, it has to fetch that shared information multiple times. Imagine going to your fridge, grabbing an ingredient, putting it back, and then immediately going back to get it again for the next step of a recipe – it's inefficient! In the computer's case, "fetching" this data into its super-fast temporary memory (caches and registers) takes a lot of time and energy.

To make things faster, we like to combine these separate loops into a single, bigger loop. This way, the computer only has to grab the shared information once. This technique is called loop fusion, and it's like grabbing all the ingredients you need from the fridge at once, rather than making multiple trips. Let's look at a simple example to see how much this helps.

Loop Fusion Example

General matrix multiplication (GEMM) is a common operation, and in machine learning it is often followed by an element-wise operation such as ReLU. The result of the matrix multiplication is fed into the ReLU, which applies the non-linear transformation. The following figure shows these operations with their computation code. The matrix D1 is shared, as indicated in the corresponding code: GEMM computes and stores D1, and ReLU then fetches D1 back from memory:

Figure: GEMM and ReLU computation and code. The two operations share matrix D1, and each accesses D1 separately.
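
To make the extra memory traffic concrete, here is a minimal C sketch of the unfused version; the operand names (A1, B1, D2), the shapes (M, N, K), and the row-major layout are illustrative assumptions rather than the exact code in the figure.

```c
// Unfused version: GEMM writes D1 out to memory, then ReLU reads D1 back in.
// Names and shapes are assumptions for illustration.
void gemm_then_relu(int M, int N, int K,
                    const float *A1, const float *B1,
                    float *D1, float *D2) {
  // GEMM: D1 = A1 * B1
  for (int i1 = 0; i1 < M; i1++)
    for (int j1 = 0; j1 < N; j1++) {
      float acc = 0.0f;
      for (int k1 = 0; k1 < K; k1++)
        acc += A1[i1 * K + k1] * B1[k1 * N + j1];
      D1[i1 * N + j1] = acc;
    }
  // ReLU: D2 = max(D1, 0); D1 is fetched from memory a second time here.
  for (int i2 = 0; i2 < M; i2++)
    for (int j2 = 0; j2 < N; j2++)
      D2[i2 * N + j2] = D1[i2 * N + j2] > 0.0f ? D1[i2 * N + j2] : 0.0f;
}
```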

Instead of having two loop nests, a single fused loop, shown below, performs the same operation by fetching a vector d from the shared matrix D1 into fast memory once and reusing that vector in the second computation, i.e., the ReLU:

Figure: Fused GEMM and ReLU computation and code. A row of D1 is fetched once into the vector d and reused between the two computations.
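
A corresponding sketch of the fused version, under the same assumed names and shapes, might look like the following; the local buffer d stands in for the row of D1 that stays in fast memory.

```c
#include <stdlib.h>

// Fused version: each row of D1 is produced into a local buffer d and
// immediately reused by the ReLU step, so it is read from fast memory
// instead of being fetched again from main memory.
void fused_gemm_relu(int M, int N, int K,
                     const float *A1, const float *B1,
                     float *D1, float *D2) {
  float *d = malloc(N * sizeof(float));   // row buffer, stays cache-resident
  for (int i1 = 0; i1 < M; i1++) {
    for (int j1 = 0; j1 < N; j1++) {
      float acc = 0.0f;
      for (int k1 = 0; k1 < K; k1++)
        acc += A1[i1 * K + k1] * B1[k1 * N + j1];
      d[j1] = acc;
      D1[i1 * N + j1] = acc;              // still stored for later consumers
    }
    for (int j2 = 0; j2 < N; j2++)        // ReLU reuses d instead of re-reading D1
      D2[i1 * N + j2] = d[j2] > 0.0f ? d[j2] : 0.0f;
  }
  free(d);
}
```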

Other fusion opportunities can also be explored, such as fusing loops i2 and j2, but these are not discussed here.

Fusion in Matrix Multiplication Operations

Let’s look at a more complex case where two matrix multiplications are performed back-to-back: D1 = B*C and then D = A*D1. This is a common operation in Graph Convolutional Networks. The following figure shows the two operations:

Figure: A pair of matrix multiplications, D1 = B*C and D = A*D1, and their corresponding code. Immediate reuse of a row of D1 leads to incorrect results.

In this sequence of multiplications, matrix D1 is used in both operations. The key difference from the earlier GEMM-ReLU example is that matrix multiplication is not an element-wise operation, so the two computations cannot safely share a single vector of D1. Because of this, directly combining the two loops into one would not produce the correct result. For instance, trying to reuse an element D1[i1,i3] immediately after calculating it in the inner loop (i3) would give incorrect results, because all rows of D1 are needed to compute a single row of D. This is why these two matrix multiplications are usually performed separately.
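
The dependence is easier to see in a C sketch of the unfused pair; square NxN matrices and row-major layout are assumed for brevity.

```c
// Two back-to-back dense multiplications: D1 = B*C, then D = A*D1.
// Square NxN matrices are assumed for brevity.
void two_gemms(int N, const float *A, const float *B, const float *C,
               float *D1, float *D) {
  // D1 = B * C
  for (int i1 = 0; i1 < N; i1++)
    for (int j1 = 0; j1 < N; j1++) {
      float acc = 0.0f;
      for (int k1 = 0; k1 < N; k1++)
        acc += B[i1 * N + k1] * C[k1 * N + j1];
      D1[i1 * N + j1] = acc;
    }
  // D = A * D1: row i2 of D reads every row of D1 (via index k2), so D1 must
  // be fully computed before this loop starts; naively merging it with the
  // loop above would read rows of D1 that have not been written yet.
  for (int i2 = 0; i2 < N; i2++)
    for (int j2 = 0; j2 < N; j2++) {
      float acc = 0.0f;
      for (int k2 = 0; k2 < N; k2++)
        acc += A[i2 * N + k2] * D1[k2 * N + j2];
      D[i2 * N + j2] = acc;
    }
}
```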

Data-Driven Loop Fusion with Tile Fusion

However, if matrix A in the pair of multiplication operations contains many zeros (known as a sparse matrix), we can sometimes fuse parts of the multiplications based on nonzero locations. Consider the sparse matrix example below. Here, D1 is shared between the two operations. Once we've calculated the first two rows of D1, we can immediately use those results to calculate rows one and three of D.

Figure: A pair of matrix multiplications, D1 = B*C and D = A*D1, where A is a sparse matrix. The first two rows of D1 can be reused thanks to A's sparsity pattern.
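
To illustrate the idea, here is a minimal C sketch of one possible data-driven schedule. It assumes A is stored in CSR form (the arrays Ap, Ai, Ax are hypothetical names) and finishes each row of D as soon as all the D1 rows it depends on are available; it is only an illustration of the concept, not the actual Tile Fusion scheduler discussed below.

```c
#include <stdlib.h>

// Data-driven fusion sketch for D1 = B*C and D = A*D1 with a sparse A in CSR
// (Ap/Ai/Ax). B is NxK, C is KxN, D1 and D are NxN; all names are assumptions.
// A row of D is computed as soon as every D1 row its nonzeros reference has
// been produced, while those rows are still warm in cache.
void fused_spmm_pair(int N, int K,
                     const int *Ap, const int *Ai, const float *Ax,
                     const float *B, const float *C,
                     float *D1, float *D) {
  char *done = calloc(N, 1);                 // which rows of D are finished
  for (int r = 0; r < N; r++) {
    // Dense step: row r of D1 = B * C.
    for (int j = 0; j < N; j++) {
      float acc = 0.0f;
      for (int k = 0; k < K; k++)
        acc += B[r * K + k] * C[k * N + j];
      D1[r * N + j] = acc;
    }
    // Fused step: finish every row i of D whose nonzeros in A all reference
    // D1 rows <= r (a real compiler would precompute this schedule offline).
    for (int i = 0; i < N; i++) {
      if (done[i]) continue;
      int ready = 1;
      for (int p = Ap[i]; p < Ap[i + 1]; p++)
        if (Ai[p] > r) { ready = 0; break; }
      if (!ready) continue;
      for (int j = 0; j < N; j++) {
        float acc = 0.0f;
        for (int p = Ap[i]; p < Ap[i + 1]; p++)
          acc += Ax[p] * D1[Ai[p] * N + j];
        D[i * N + j] = acc;
      }
      done[i] = 1;
    }
  }
  free(done);
}
```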

This type of fusion is driven by data, i.e., the sparsity pattern of the input matrices, making it a data-driven loop fusion technique. This property is commonly observed in sparse matrices used in graph neural networks and scientific computing. The figure below illustrates the degree of computation sharing possible across SuiteSparse matrices, a widely used collection in scientific and graph applications. As the figure shows, a significant portion of the computations in most matrices share data and operations, presenting clear opportunities for loop fusion.

Figure: Fused ratio across sparse matrices and graphs.

However, fully leveraging this property requires respecting the data dependencies between operations, coupled with an efficient scheduling mechanism that interleaves iterations that share data. Tile Fusion, a recent work by Salehi and Cheshmi accepted at ICS'25, proposes a domain-specific compiler designed to enable data-driven fusion in machine learning and scientific computing. Tile Fusion first creates tiles and then selectively enables fusion within those tiles that do not violate data dependencies, thereby ensuring full utilization of computing resources on both CPUs and GPUs. As a result, Tile Fusion has demonstrated significant improvements in GNN training, achieving speedups of up to 3.84x and a geometric mean (gmean) of 2.33x. For further details on Tile Fusion and its implementation, please refer to the paper and code repository linked below:

Tile Fusion paper: https://www.cheshmi.cc/KazemCheshmi_files/tile_fusion_ics25.pdf
Tile Fusion code repository: https://github.com/SwiftWare-Lab/tile-fusion
