About:
I am Jackson Vanover. I received my Ph.D. in Computer Science from UC Davis where I researched practical tools for testing and tuning numerical code.
Contact:
jacksonvanover [at] gmail [dot] com
DISCUSSED: questionably memory-bound matmuls, warm vs cold caches, Agner Fog's meticulously constructed instruction tables, my trash test system, grad student email listservs
During grad school at UC Davis, I spent a lot of time with students from the various ecology departments who introduced me to a strange and wonderful mailing list called eco-social through which I acquired a desktop tower for dirt cheap. My only goal with such a system was to have a personal machine at my house that I could SSH to for development. It has served its purpose well! In fact, it is the machine I used to perform all of the experiments for the EXCVATE paper.
How fast can this thing execute SGEMM?
To get an idea of what’s possible on this trash test system, let’s benchmark the BLIS implementation of SGEMM and compare its performance to the machine’s roofline model.
We are considering the operation \(C=AB+C\) with \(C \in \mathbb{F}^{m \times n}, A \in \mathbb{F}^{m \times k}, B \in \mathbb{F}^{k \times n}\) where \(\mathbb{F}\) is the set of 32-bit floating-point numbers.
Compiling BLIS: We compile for the sandybridge architecture for which BLIS has a pre-configured multithreaded build.
Spanning shapes and sizes: We run the BLIS implementation of SGEMM spanning different shapes and sizes per the following pseudo code:
int min_dim = 16;
int max_dim = 1024;
for ( int m = min_dim; m < max_dim; m = 2*m ){
for ( int n = min_dim; n < max_dim; n = 2*n ){
for ( int k = min_dim; k < max_dim; k = 2*k ){
...
}
}
}
Measuring performance: For each parameterization of SGEMM, we measure the FLOPs/second using \(\frac{2mnk}{t}\) where \(t\) is the minimum execution time in seconds over 10 executions; this is intended to account for run-to-run performance variability and to exclude any outliers due to cold caches. Subsequent multiplication by \(10^{-9}\) gives us GFLOPs/second.
Calculating arithmetic intensity: We calculate arithmetic intensity by dividing the GFLOPS/second measured above by \(4(2mn + mk + kn)\), i.e., \(2mn\) reads/writes of C, \(mk\) reads of A, \(kn\) reads of B, and each element is a 4 byte float.
Initializing the A and B matrices: Elements of A and B are initialized with values drawn from \(U(0,1)\).
Additional Information: To get an idea of how “rectangular” the matmul is, we will also measure the maximum dimension ratio. If m=k=n (all matrices are square), this quantity will be 1. If m=1, k=2, and n=3, this quantity will be 3.
The theoretical peak performance of the i7-3770K processor (measured in GFLOPs/second) can be calculated according to (Processor Frequency in GHz) x (# of Operations per cycle) x (# of Operands per Register) x (# of Physical Cores). The details below come from the output of lscpu and from Agner Fog’s instruction tables.
→Plugging these values in yields a peak GFLOPs/second of 224 or 249.6 with Turbo Boost.
The theoretical peak memory bandwidth (measured in GB/second) can be calculated according to (Memory Bus Width in Bytes) x (Memory Transfer Rate in MT/second) x (# active channels) / 1000. The details below come from the output of dmidecode -t memory:
→Plugging these values in yields a peak memory bandwidth of 21.328 GB/second.
Why do some parameterizations of BLIS SGEMM surpass the theoretical memory bound?
From the color of the markers, we can see these are rectangular GEMMs. Zooming in, there are three offenders. Described as (m,k,n) tuples, they are:
( 512, 16, 1024)
(1024, 16, 512)
(1024, 16, 1024)
Since the analogous rectangular GEMM for the next largest size of k – the tuple (1024, 32, 1024) – is below the roofline, let’s gather more data by looking at smaller sizes of k.
Lowering the min_dim to 2 and rerunning our experiments, patterns emerge for memory-bound GEMMs:
Now our offenders are:
(1024, 2, 1024) (1024, 4, 1024) (1024, 8, 1024) (1024, 16, 1024)
( 512, 2, 1024) ( 512, 4, 1024) ( 512, 8, 1024) ( 512, 16, 1024)
(1024, 2, 512) (1024, 4, 512) (1024, 8, 512) (1024, 16, 512)
( 512, 2, 512) ( 512, 4, 512) ( 512, 8, 512)
( 256, 2, 1024) ( 256, 4, 1024) ( 256, 8, 1024)
( 256, 2, 512) ( 256, 4, 512) ( 256, 8, 512)
( 512, 2, 256) ( 512, 4, 256) ( 512, 8, 256)
(1024, 2, 256) (1024, 4, 256)
I feel confident in the theoretical memory bound that I calculated. However, upon reflection, I realize that this bound is derived from the bandwidth of the DRAM moving data into the last level cache, not to the actual arithmetic unit. Therefore, the “issue” I am observing likely stems from the fact that cache memory is faster than DRAM. Let’s test this.
To control for cache behavior, we will execute each GEMM with a cold cache. Before each GEMM execution, we will allocate a buffer that exceeds the capacity of our 8 MiB Last-Level Cache and loop over the elements of that buffer with some trivial read or write. If we wanted more performant flushing, we could be more precise about the alignment of the buffer and divide it up into cache lines so that we only touch one element per cache line, but this will suffice for now. Rerunning our experiments, we get the following:
It looks like we’ve found our culprit!
Looking into this led me to a 2014 paper called “Cache-aware Roofline model: Upgrading the loft“. In that work, the authors point out that the original formulation for the roofline model misses a key aspect of modern applications and architectures which rely on the on-chip memory hierarchy: the calculated DRAM memory bandwidth only accurately describes the behavior of moving data from the DRAM to the Lowest Level Cache (LLC). In reality, the highest memory bound is actually set by the peak L1-to-Execution Unit bandwidth.
Fun bonus: the “Upgrading the Loft” paper is dated enough that they even construct and experimentally validate a cache-aware roofline model using the same i7-3770K processor as my Trash Test System! Taking the numbers from that paper, let’s plot the SGEMM executions with a warm cache using an extended, cache-aware roofline model:
Now, courtesy of the “upgraded loft”, our roof is no longer “shattered” by BLIS SGEMM.