The operation of SLO is best illustrated by comparing it with other cache profiling tools, using the example code below. The program calculates the inner products and the sums of the elements of a number of arrays.

```c
double inproduct (double *X, double *Y, int len)
{
  int i; double result=0.0;
  for(i=0; i<len; i++)
    result += X[i]*Y[i];
  return result;
}

double sum (double *X, int len)
{
  int i; double result=0.0;
  for(i=0; i<len; i++)
    result += X[i];
  return result;
}

void f (double **X, double **Y, double **Z, int len, int N)
{
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      double inp  = inproduct (X[i],Y[j],len);
      double sumX = sum (X[i],len);
      double sumY = sum (Y[j],len);
      Z[i][j] = inp+sumX+sumY;
    }
}
```

Most other cache profiling tools highlight the source code lines where most cache misses occur. For example, Valgrind reports the following source code lines where cache misses occur:

```
  L1     L2
   .      .   double inproduct (double *X, double *Y, int len)
   .      .   {
   .      .     int i; double result=0.0;
  0.0%   0.0%   for(i=0; i<len; i++)
 50.0%  50.0%     result += X[i]*Y[i];
   .      .     return result;
   .      .   }

   .      .   double sum (double *X, int len)
   .      .   {
   .      .     int i; double result=0.0;
  0.0%   0.0%   for(i=0; i<len; i++)
 50.0%  50.0%     result += X[i];
   .      .     return result;
   .      .   }

   .      .   void f
   .      .   (double **X, double **Y, double** Z, int len, int N)
   .      .   {
   .      .     int i,j;
  0.0%   0.0%   for (i=0; i<N; i++)
  0.0%   0.0%     for (j=0; j<N; j++) {
  0.0%   0.0%       double inp = inproduct (X[i],Y[j],len);
   .      .         double sumX = sum (X[i],len);
   .      .         double sumY = sum (Y[j],len);
  0.0%   0.0%       Z[i][j] = inp+sumX+sumY;
   .      .       }
   .      .   }
```

Valgrind shows that half of the cache misses occur inside the loop in `inproduct`, while the other half occur inside the loop in `sum`. While this information clearly shows where the misses occur, it is not directly obvious how to refactor the program so that these misses disappear.

In contrast, SLO analyzes cache misses at a more abstract level. Most cache misses occur on data that is reused, but whose previous use lies so far in the past that the data has since been evicted from the cache by other accesses. SLO profiles these data reuses and generates a histogram of reuses and their corresponding reuse distances, like the histogram below.

The histogram shows that for this particular run of the program, all reuses occur at distances larger than 2^{19}.

The reuse distance estimates the minimum cache size needed for the data to remain in the cache between use and reuse. E.g., for a reuse with distance 2^{19}, the cache must be able to hold at least 2^{19} data elements. Therefore, the cache-missing reuses can be read directly from the histogram: they are all the reuses at a distance larger than the cache size.

The coloured background areas, indicated with `L1` and `L2`, contain no reuses at those distances, meaning no reuses produce cache hits in either the L1 or the L2 cache. The different colors of the bars in the histogram indicate the different refactorings needed to bring the reuses closer together. Bringing reuses closer together moves them to the left in the histogram, into the areas where L2 or even L1 cache hits occur. Clicking one of the colored bars with the mouse highlights the corresponding refactoring in the code.

The screenshot shows that the blue reuses (at distance 2^{20} in the histogram) can be shortened by merging the computations in `inproduct` with those in the second call to `sum`. Similarly, the green reuses can be shortened by merging the computations of `inproduct` with those in the first call to `sum`. After merging the computations of `inproduct` and `sum`, the code looks as follows:

```c
void f (double **X, double **Y, double **Z, int len, int N)
{
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      double inp=0.0, sumX=0.0, sumY=0.0;
      double *Xi=X[i], *Yj=Y[j];
      int i2;
      for(i2=0; i2<len; i2++) {
        inp  += Xi[i2]*Yj[i2];
        sumX += Xi[i2];
        sumY += Yj[i2];
      }
      Z[i][j] = inp+sumX+sumY;
    }
}
```

SLO also allows the histograms to be compared visually. In the screenshot below, the reuse distance histogram of the original code is shown in blue, while that of the optimized code is shown in red. As a result of the fusion, the large blue peak at distance 2^{20} has moved to distance 2^{2}. Those reuses therefore now generate L1 cache hits instead of L2 misses (see the background colors). The resulting code runs about twice as fast on a Pentium 4.