1. Comparison between using SLO and more traditional cache profilers

The operation of SLO is best illustrated by comparing it to other cache profiling tools, using the example code below. The program calculates the inner products and element sums of a number of arrays.

double inproduct (double *X, double *Y, int len)
{
   int i; double result=0.0;
   for(i=0; i<len; i++)
     result += X[i]*Y[i];
   return result;
}

double sum (double *X, int len)
{
  int i; double result=0.0;
  for(i=0; i<len; i++)
    result += X[i];
  return result;
}

void f 
(double **X, double **Y, double** Z, int len, int N)
{
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      double inp = inproduct (X[i],Y[j],len);
      double sumX = sum (X[i],len);
      double sumY = sum (Y[j],len);
      Z[i][j] = inp+sumX+sumY;
    }
}

Most other cache profiling tools highlight the source code lines where most cache misses occur. For example, Valgrind reports the fraction of L1 and L2 cache misses occurring on each source line:

   L1     L2  

     .      .  double inproduct (double *X, double *Y, int len)
     .      .  {
     .      .    int i; double result=0.0;
  0.0%   0.0%    for(i=0; i<len; i++)
 50.0%  50.0%      result += X[i]*Y[i];
     .      .    return result;
     .      .  }
     .      .  
     .      .  double sum (double *X, int len)
     .      .  {
     .      .    int i; double result=0.0;
  0.0%   0.0%    for(i=0; i<len; i++)
 50.0%  50.0%      result += X[i];
     .      .    return result;
     .      .  }
     .      .  
     .      .  void f 
     .      .  (double **X, double **Y, double** Z, int len, int N)
     .      .  {
     .      .    int i,j;
  0.0%   0.0%    for (i=0; i<N; i++)
  0.0%   0.0%      for (j=0; j<N; j++) {
  0.0%   0.0%        double inp = inproduct (X[i],Y[j],len);
     .      .        double sumX = sum (X[i],len);
     .      .        double sumY = sum (Y[j],len);
  0.0%   0.0%        Z[i][j] = inp+sumX+sumY;
     .      .      }
     .      .  }

Valgrind shows that half of the cache misses occur inside the loop in inproduct, while the other half occur inside the loop in sum. While this information clearly shows where the misses occur, it is not directly obvious how to refactor the program to make these misses disappear.

In contrast, SLO analyzes cache misses at a more abstract level. Most cache misses occur on data that is reused, but whose previous use lies so far in the past that the data has since been evicted from the cache by other accesses. SLO profiles these data reuses and generates a histogram of reuses and their corresponding distances, like the histogram below.

Screenshot of SLO showing the histogram of reuse distances.

The histogram shows that for this particular run of the program, all reuses occur at distances larger than 2^19.

Note

The reuse distance estimates the minimum cache size needed for the data to remain in the cache between use and reuse. E.g., for a reuse with distance 2^19, the cache must be able to hold 2^19 data elements (for 8-byte doubles, 2^19 × 8 bytes = 4 MB). Therefore, the cache-missing reuses can easily be read from the histogram: they are all reuses at a distance larger than the cache size.

The coloured background areas, labelled L1 and L2, contain no reuses at those distances, meaning that no reuses result in cache hits in either the L1 or the L2 cache. The different colours of the bars in the histogram indicate the different refactorings needed to bring the reuses closer together. Bringing reuses closer together moves them to the left in the histogram, into the areas where L2 or even L1 cache hits occur. Clicking one of the coloured bars with the mouse highlights the corresponding refactoring in the code.

Screenshot of SLO showing the suggested refactorings for the blue and green reuses.

The screenshot shows that the blue reuses (at distance 2^20 in the histogram) can be shortened by merging the computations in inproduct with the second call to sum. Similarly, the green reuses can be shortened by merging the computations in inproduct with the first call to sum. After merging the computations of inproduct and sum, the code looks as follows:

void f
(double **X, double **Y, double** Z, int len, int N)
{
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      double inp=0.0, sumX=0.0, sumY=0.0;
      double *Xi=X[i], *Yj=Y[j];
      int i2;
      for(i2=0; i2<len; i2++) {
        inp += Xi[i2]*Yj[i2];
        sumX += Xi[i2];
        sumY += Yj[i2];
      }
      Z[i][j] = inp+sumX+sumY;
    }
}

SLO also allows the histograms to be compared visually. In the screenshot below, the reuse distance histogram of the original code is shown in blue, while that of the optimized code is shown in red. As a result of the fusion, the large blue peak at distance 2^20 has moved to distance 2^2. Those reuses therefore now generate L1 cache hits instead of L2 cache misses (see the background colours). The resulting code runs about twice as fast on a Pentium 4.

Screenshot of comparison between reuse distance histograms before (blue) and after (red) fusing inproduct and sum.