SLO - Suggestions for Locality Optimizations

The SLO tool analyzes the causes of poor temporal data locality, and suggests program refactorings that are needed to increase locality. As a result, the number of data cache misses may be reduced, and execution speed may be enhanced.

The following PowerPoint presentation (presented at the HPCC2006 conference) gives a short introduction to the underlying ideas and the operation of the SLO tool.

The SLO tool is discussed in the following papers:

Beyls, K.; D'Hollander, E. Refactoring for Data Locality. IEEE Computer. Vol. 42, no. 2. 2009. pp. 62-71. doi:10.1109/MC.2009.57
Beyls, K.; D'Hollander, E. Discovery of Locality-Improving Refactorings by Reuse Path Analysis. Proceedings of the 2nd International Conference on High Performance Computing and Communications (HPCC). Springer. Lecture Notes in Computer Science. Vol. 4208. 2006. pp. 220-229.
Beyls, K.; D'Hollander, E. Intermediately Executed Code is the Key to Find Refactorings that Improve Temporal Data Locality. Proceedings of the 3rd conference on Computing frontiers. ACM. 2006. pp. 373-382.

Documentation on both exploring locality optimizations with SLO and instrumenting programs with GCC-SLO is available as HTML and PDF.

Principles of Analysis performed by SLO

Effectively using SLO to optimize a program's locality requires understanding the program and locality model used by SLO to analyze programs. These are explained here.

Want to give it a try?

The SLO tool consists of two parts:

GCC-SLO: an extended version of the GCC compiler that can instrument programs to perform the analysis described above.
SLO: a Java-based visualizer of the data analyzed by GCC-SLO.

For both GCC-SLO and SLO, the most recent version can be downloaded by following the download link from http://sourceforge.net/projects/slo.

You can download slo-1.1.jar, and run SLO from the command line using a command like java -jar slo-1.1.jar example1.slo.zip.

Example input files

The table below provides some examples of .slo.zip files produced by GCC-SLO. The left column lists the .slo.zip files that serve as input to the SLO tool. The middle column gives a short description of the program that was instrumented. The right column gives a link to an HTML page automatically produced by SLO, so that you can browse the results of the locality analysis without needing to install SLO. (To have full functionality in these pages, enable JavaScript in your browser.)

GCC-SLO

The GCC compiler has been adapted to instrument programs using the command line option -fslo-instrument. To build your own GCC compiler that can instrument programs, download gcc-slo-1.1.0-4.1.0.tar.gz, extract it, and execute the script build-gcc-slo.sh in the directory gcc-slo-1.1.0-4.1.0. This will build the gcc-slo compiler and install it in $HOME/gcc-slo.

Instrumenting programs with GCC-SLO

See documentation, available as HTML and PDF.

Other information about the way GCC-SLO instruments programs.

GCC-SLO samples reuses to reduce the overhead of profiling. The sampling is based on reservoir sampling. The implementation makes sure that at the end of program execution, about 4 million reuses are present in the sample buffer, and that each reuse that occurred at run-time has an equal probability of being present in the sample buffer. Consequently, at the start of execution, the instrumented program runs slowly, as each data reuse is sampled, and the code executed between the reuses is analyzed. After 4 million reuses have been seen, more and more data reuses can be skipped, resulting in faster execution. Shortly after the instrumented program has started, the typical slow-down is about a factor of 1000. After the program has been running for a few hundred seconds, the slow-down starts to drop quickly to a typical steady-state slow-down of about a factor of 5. (More details are given in our HPCC06 paper.)
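
The sampling idea described above can be illustrated with the textbook form of reservoir sampling (often called Algorithm R). This is only a minimal sketch under stated assumptions, not the code GCC-SLO uses: the function names reservoir_sample and keep_probability are hypothetical, the stream of integers stands in for observed data reuses, and the demo uses a small buffer instead of about 4 million entries.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Return a uniform random sample of at most `capacity` items from `stream`.

    Textbook reservoir sampling (Algorithm R); an illustration only,
    not the GCC-SLO implementation.
    """
    if rng is None:
        rng = random.Random()
    buffer = []
    for seen, item in enumerate(stream, start=1):
        if seen <= capacity:
            # While the buffer is not yet full, every item is kept.
            buffer.append(item)
        else:
            # Keep the new item with probability capacity / seen by
            # overwriting a uniformly chosen slot; otherwise skip it.
            slot = rng.randrange(seen)
            if slot < capacity:
                buffer[slot] = item
    return buffer


def keep_probability(seen, capacity):
    """Probability that the `seen`-th item (1-based) is placed in the buffer."""
    return min(1.0, capacity / seen)


if __name__ == "__main__":
    # Hypothetical stand-in for a stream of observed reuses.
    sample = reservoir_sample(range(1_000_000), capacity=1000,
                              rng=random.Random(0))
    print(len(sample))
    # With a 4-million-entry buffer, the 4-billionth item is kept
    # with probability 4e6 / 4e9.
    print(keep_probability(4_000_000_000, 4_000_000))
```

In this scheme the probability of keeping an item shrinks as more items are seen, which matches the observation above that more and more reuses can be skipped as execution proceeds.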


Results after using SLO to refactor selected SPEC2000 programs

Program	Speedup on Pentium4 2.66 GHz	Speedup on Itanium1 733 MHz	Speedup on Alpha EV67 667 MHz	Speedup on PA-RISC 8500 400 MHz	Speedup on UltraSPARC IV 1.05 GHz
173.applu	1.63	2.46	1.69	1.17	2.71
175.vpr	1.51	1.40	1.41	1.17	1.09
178.galgel	2.14	2.63	2.48	1.23	1.46
179.art	4.11	1.54	1.16	2.30	1.89
183.equake	1.10	2.93	3.09	1.54	1.57

Comments can be sent to Kristof Beyls, e-mail: Kristof.Beyls at www.elis.ugent.be

SLO input file	Description	HTML-output
example1.slo.zip	The example used in the explanation of the principles behind SLO.	No HTML output available right now.
applu_orig_ref1_reuse_distance.slo.zip	173.applu from the SPEC2000 benchmark suite, run with the reference input.	HTML-output generated by SLO
vpr_orig_ref2_reuse_distance.slo.zip	175.vpr from the SPEC2000 benchmark suite, run with the second reference input.	HTML-output generated by SLO
galgel_orig_ref1_reuse_distance.slo.zip	178.galgel from the SPEC2000 benchmark suite, run with the reference input.	HTML-output generated by SLO
art_orig_ref2_reuse_distance.slo.zip	179.art from the SPEC2000 benchmark suite, run with the reference input.	HTML-output generated by SLO
optimized_histograms/equake_orig_ref1_reuse_distance.slo.zip	183.equake from the SPEC2000 benchmark suite, run with the reference input.	HTML-output generated by SLO