Skip to content

Spanner Project Background

Patrick Lavin edited this page May 19, 2026 · 7 revisions

Spatter is a memory benchmark we have used 1,2 to study how sparse and irregular memory access perform on CPUs and GPUs. In the Spanner project, we seek to enhance the way Spatter represents memory access patterns.

Memory access patterns

A memory access pattern is a trace of the reads and write produced by a program. For instance, it may look something like this:

R 0x0080
R 0x0088
W 0x0090
W 0x0098

Memory traces can be quite large. Consider a program that runs for 1 second on a 2GHz processor where roughly 1/2 of the instructions access memory. Tracing this program will yield roughly 1e9 memory accesses. Each address is 8 bytes. If we also use 1 byte to store whether it was a read or a write, we will have 9e9 bytes, or 9GB. Of course, compression will help this, but it should be clear that naively tracing entire programs that run for minutes or hours, typically on more than just 1 core, is infeasible, especially considering that the tracing itself slows the program down.

Nevertheless, we need to collect some traces for the purposes of training a neural network to reproduce them. We need to consider:

  1. what inputs to the program are appropriate,
  2. which parts we need to trace,
  3. whether we can edit the program in some way (e.g. reduce the number of times loops run), and
  4. how we store the trace.

Anyways, let us now look at how Spatter currently represents patterns.

The old way (stencil-like patterns)

The original Spatter paper represents a single pattern as a 4-tuple (kernel, offset, delta, length).

It may look something like this:

(kernel: gather, offset: [0,2,4,6], delta: 8, length:2)

The gather kernel, which "decodes" or "replays" this pattern, looks like this:

for(i: 0..length):
  for(j: 0..len(offset))
    read(data[i*delta + offset[j]])

In Spatter, the default data type is a double (8 bytes). This will produce the following memory trace:

R 0x0000
R 0x0010
R 0x0020
R 0x0030
R 0x0040
R 0x0050
R 0x0060
R 0x0070

The 4-tuple representation is somewhat restrictive. We want to address the following restrictions:

  1. We want to model more complex patterns that do not neatly fit into this mold
  2. We want to be able to interleave reads and writes

The new way (neural nets)

In this project, we will investigate using LSTMs to create bespoke representations for each trace. I am not an expert in NNs, so take the rest of this section with a grain of salt, and verify for yourself that what I said makes sense. And let me know when I am wrong!

LSTMs are well-suited to compressing sequences. The architecture features a sort of "funnel" where layers get smaller, forcing the NN to identify which features of the input matter. Before the funnel is the encoder which transforms an input trace into the compressed or latent representation. After the funnel, the decoder aims to reproduce the input trace from the latent representation.

We want to train such a neural net. Then, we would like to distribute the latent representations as well as the decoder portion of the neural net, so that we are able to generate memory access traces for use in benchmarking, or, as we will discuss shortly, simulation.

Research Questions

  1. What RNN or LSTM architecture is appropriate for the data we have?
  2. How small can the latent representation be? How does the size of the latent affect the size of the trace?;
  3. What metrics are important for us to understand during this process?
  4. Can we parameterize a neural net, so that it can learn how a trace should change with changing program inputs?
  5. How can we ensure that the trace we end up with obeys important cache stats such as footprint and locality?

Simulation

As it is extremely expensive and time consuming to construct new computer systems, it is common to simulate them at many different levels of detail at different points in the design process. Initially, coarse analytical models may be constructed such as the roofline model, which help us to understand systems from a high level. However, analytical models fail to capture many important details about computers, which is when simulation enters the picture.

We will use simulation as it will be easy to evaluate how our generated traces differ from the originals. For instance, we can examine how statistics such as cache hit rates differ between the traces.

The Structural Simulation Toolkit (SST) is a collection of simulation components that can be combined to simulation all parts of computer systems at various levels of detail. For the purpose of this project, we will just need a few of the many elements included with SST.

Primarily, we need to look at Prospero CPU model. Prospero is a trace-based CPU model, meaning it takes an instruction trace as input (containing only memory instructions, in Prospero's case) and simply issues the reads and writes to the memory system. This is in contrast to an execution-based simulator, which would take a compiled program as input and would simulate each instruction and what it does.

Prospero includes a tracing tool which is used to collect traces from programs of interest. We can then use those traces (1) as input to Prospero-based simulations and (2) as input to our neural networks.

Tasks

  1. Get the SST container
  2. Run the Prospero tests
  3. Examine stats available from the simulation
  4. Try tracing a program of your choice
  5. Use Vtune or Caliper to identify hotspots in the code
  6. Compile a program, use the Prospero API, and link against libprospero so that you can trace just hotspots

Beyond memory tracing

Memory traces contain a wealth of information about a program, but they are missing a key piece of information: dependency. It is common for one memory access to depend on previous instructions. We can sometimes determine this from a memory trace, such as the following:

R r1 0x04
W r2 0x04

If we allow the write to begin execution before the read, we will get incorrect execution.

However, if we strip out all non-memory instructions, we lose the ability to understand the dependency information in the following trace:

R r1 0x04
R r2 0x08
add r3 r1 r2
W r3 0x0B

Without the add, we cannot see that the write needs to wait for both reads to complete before beginning.

I won't include an example here, but we also need to consider how branch instructions affect dependencies.

One way to capture all of the information is to collect a Control and Data Flow Graph, or CDFG. You can read up about this data structure and a tool to collect them in this report. The tool will be available on GitHub in the next couple weeks.

RNNs for Graph Generation

Research Questions

  1. What NN architectures are appropriate for representing CDFGs?
  2. What metrics are important to understand for this research?

Clone this wiki locally