The "smart" features in modern microprocessors that enhance the runtime of typical applications actually increase the effort and expertise needed to hand-tune high performance applications. It is difficult for a programmer to fully anticipate dynamic effects: the overlapping interactions of caches, branch predictors, load/store units and queues, TLBs, prefetchers, and many unexposed mechanisms in the underlying hardware. Without such knowledge, seemingly sound code/ algorithm optimizations to correct a supposed performance bottleneck can perform below expectations, or worse, have the opposite outcome.
High-resolution dynamic performance profiles from hardware performance counters can help in pinpointing bottlenecks and in disambiguating the root cause from overlapping hardware activities.
Unfortunately, performance counters [1] today have two inherent limitations that hinder the collection of high-resolution measurements by periodic interruption of unmodified executables. First, the overhead of logging performance counter data limits measurement frequency so as to avoid performance perturbations. Second, residual effects from varying initial system state, interrupts, and context switches introduce over-count errors into measurements. Commercial performance analysis tools such as Intel's VTune have a recommended lower bound of one millisecond as the minimum interval in between taking counter measurements to constrain the impact of these two types of error.
We apply the concept of super-resolution [2] to reconstruct higher resolution performance profiles that attribute events to dynamic locations in a program's execution trace. High-resolution performance profiles have potential uses in hand-tuning software, automated software optimization, and performance comparisons and analysis of real computer architectures. We plan to determine how much super- resolution can improve resolution and error tolerance beyond current limits using redundant, overlapping performance counter measurements of lower resolution.
The image processing technique of super-resolution is the reconstruction of a higher-resolution image from multiple low- resolution images with varying sub-pixel offsets. There are two parts to super-resolution: first, registering (a.k.a. aligning) the low- resolution input images, and second, fusing the inputs to produce a high-resolution output image. General-purpose super-resolution algorithms produce mixed results in which maximum resolution increases below ten-times magnification in practice [3]. In addition, the algorithms tolerate only small amounts of random noise.
Super-resolution can be applied analogously to dynamic performance profiling of reproducible program executions. (Note: this is not a requirement that the program be deterministic.) In theory, applying existing image processing super-resolution algorithms to performance counter measurements only requires modifying the algorithms to work in a single dimension instead of two dimensions. However, our intended uses demand much more than ten-times magnification and we wish to tolerate potentially large over-count errors. Existing algorithms do not meet our super-resolution requirements, so we developed our own method that is suited to the properties of performance counter measurements.
We have created a graphical user interface to display the collected measurements, allow a user to specify reconstruction parameters, and view the resulting super-resolution reconstruction. The rapid feedback of results as we change the reconstruction parameters has been a great help in designing our linear programming formulation and debugging the entire super-resolution approach. The tool is written in Java, and uses the Coin-CLP library (written in C++) for the linear programming optimizations.
There is a strong need for high-resolution performance profiles beyond the capabilities of performance counters. High-resolution dynamic performance data allows programmers to determine the events incurred by very small regions of code, on the order of thousands of instructions or less. This capability enables exact pinpointing of performance bottlenecks. A second application is linear regression modeling of CPI. Regression works best when the input contains many different execution scenarios. The performance of general purpose CPUs often looks steady state at low-resolution, but our super- resolution method may greatly improve the effectiveness of regression even on tightly repetitive code.
We did not publish this work yet. Below are a few references that our work builds upon.
Copyrights to many of the above papers are held by the publishers. The attached PDF files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder. Some links to papers above connect to IEEE Xplore with permission from IEEE, and viewers must follow all of IEEE's copyright policies.
For more information please send email to Roland Wunderlich: rolandw (at) cmu (dot) edu