ARTICLE

Performance Limits and Profiling

From Parallel and High Performance Computing by Robert Robey

This article discusses benchmarking: bandwidth and flops.

__________________________________________________________________

Save 37% off Parallel and High Performance Computing. Just enter fccrobey into the discount code box at checkout at manning.com.
__________________________________________________________________

Determine your hardware capabilities: benchmarking

Let’s imagine that you have prepared your application and your test suites. Now you can begin characterizing the hardware that you’ve been tasked with working on. To do this, you need to develop a conceptual model of the hardware that allows you to understand its performance. Performance can be characterized by a number of metrics:

  • The rate at which floating-point operations can be executed (FLOPs/sec)
  • The rate at which data can be moved between various levels of memory (GB/sec)
  • The rate at which energy is used by your application (Watts)

The conceptual models allow you to estimate the theoretical peak performance of various components of the computer hardware. The metrics you work with in these models, and those you aim to optimize, depend on what you and your team value in your application. To complement this conceptual model, you can also make empirical measurements on your target hardware. The empirical measurements are made with micro-benchmark applications. One example of a micro-benchmark is the stream benchmark used for bandwidth-limited cases.

Tools for gathering system characteristics

In determining hardware performance, we use a mixture of theoretical and empirical measurements. These are complementary: the theoretical value provides an upper bound on performance, and the empirical value confirms what can be achieved by a simplified kernel in close to actual operating conditions.

It’s surprisingly difficult to get hardware performance specifications. The explosion of processor models and the focus of marketing and media reviews for the broader public often obscure the technical details. For Intel processors, https://ark.intel.com is a good resource. For AMD processors, the site is https://www.amd.com/en/products/specifications/processors.


One of the best tools for understanding the hardware you’re running on is the lstopo program. It’s bundled with the hwloc package that comes with nearly every MPI distribution. This command outputs a view of the hardware on the system, either graphical or as text. Figure 1 shows the output for the Mac laptop. Getting the graphical output in figure 1 on the Mac laptop currently requires a custom installation of hwloc and the cairo package to enable the X11 interface. The text version works with the standard package manager installs. Linux and Unix versions of hwloc usually work as long as an X11 window can be displayed. A new command, netloc, is being added to the hwloc package to display the network connections.

Custom Install for Mac

Install Cairo (version 1.16.0)

  1. Download Cairo from https://www.cairographics.org/releases/
  2. Configure with ./configure --with-x --prefix=/usr/local
  3. make
  4. make install

Install hwloc (2.1.0a1-git)

  1. git clone https://github.com/open-mpi/hwloc.git
  2. Configure with ./configure --prefix=/usr/local
  3. make
  4. make install

Some other commands for probing hardware details are ‘lscpu’ on Linux systems, ‘wmic’ on Windows, and ‘sysctl’ or ‘system_profiler’ on Mac. The Linux lscpu command outputs a consolidated report of the information from the /proc/cpuinfo file. You can see the full information for every logical core by viewing /proc/cpuinfo directly.

The information from the lscpu command and /proc/cpuinfo file can help determine the number of processors, the processor model, the cache sizes and the clock frequency for the system. The flags contain important information on the vector instruction set for the chip. In the report in figure 2, we see that the AVX2 and various forms of the SSE vector instruction set are available.

Architecture:    x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
Stepping: 3
CPU MHz: 871.241
CPU max MHz: 3600.0000
CPU min MHz: 800.0000
BogoMIPS: 6384.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d

The above is the lscpu output for a Linux desktop, showing a 4-core i5-6500 CPU @ 3.2 GHz with AVX2 instructions.
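The check for vector instruction sets can be automated. The following is a minimal Python sketch, assuming lscpu-style text that contains a `Flags:` line. The function name `vector_extensions` and the list of flags it looks for are illustrative choices, not part of any tool.

```python
# Illustrative helper (hypothetical name): pick out SIMD-related flags
# from lscpu-style output. The set of flags checked is an assumption.
SIMD_FLAGS = ("sse", "sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma")


def vector_extensions(lscpu_text):
    """Return the SIMD flags found on the 'Flags:' line, in SIMD_FLAGS order."""
    for line in lscpu_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "flags":
            present = set(value.split())
            return [flag for flag in SIMD_FLAGS if flag in present]
    return []  # no Flags line found


# A shortened excerpt of the report above, used as sample input.
sample = (
    "Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz\n"
    "Flags: fpu sse sse2 ssse3 fma sse4_1 sse4_2 avx avx2\n"
)
print(vector_extensions(sample))
```

On a Linux system, the same function could be fed the real report, for example the text returned by `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`.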

00:00.0 Host bridge: Intel Corporation Skylake Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 0fba (rev a1)

The above is output from the lspci command on a Linux desktop showing an NVIDIA GeForce GTX 960 GPU.

Obtaining information on the devices on the PCI bus can be helpful, particularly for identifying the number and type of the graphics processors. The ‘lspci’ command reports all the devices, as shown in figure 3. From the output in the figure, we can see that there’s one GPU and it’s an NVIDIA GeForce GTX 960.
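Picking the GPU out of a long device list can also be scripted. This is a rough sketch, assuming each lspci line has the form "address class: description". The function name `find_gpus` and the set of device classes treated as GPUs are assumptions made for illustration.

```python
# Device classes treated as graphics processors (an assumed, non-exhaustive set).
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")


def find_gpus(lspci_text):
    """Return the descriptions of devices whose class is in GPU_CLASSES."""
    gpus = []
    for line in lspci_text.splitlines():
        # Drop the bus address, then split "class: description".
        _, _, rest = line.partition(" ")
        device_class, sep, description = rest.partition(": ")
        if sep and device_class in GPU_CLASSES:
            gpus.append(description.strip())
    return gpus


# Three lines taken from the lspci listing above.
sample = (
    "00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)\n"
    "01:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)\n"
    "01:00.1 Audio device: NVIDIA Corporation Device 0fba (rev a1)\n"
)
print(find_gpus(sample))
```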

Calculating theoretical maximum FLOPS

Let’s run through the numbers for a mid-2017 MacBook Pro laptop with an Intel Core i7-7920HQ processor. This is a 4-core processor running at a nominal frequency of 3.1 GHz and with hyperthreading. With its turbo boost feature, it can run at 3.7 GHz when using four cores and up to 4.1 GHz when using a single core. The theoretical maximum flops, FT, can be calculated from

FT = Cv x fc x Ic = Virtual Cores x Clock Rate x flops/cycle

The number of cores includes the effects of hyperthreads, which make the physical cores (Ch) appear to be a greater number of virtual or logical cores (Cv). Here we have two hyperthreads per core, which make the virtual number of processors appear to be eight.

The clock rate is the turbo boost rate when all the cores are engaged. For the example processor, it is 3.7 GHz.

The flops per cycle, or more generally, instructions per cycle, Ic, includes the number of simultaneous operations that can be executed by the vector unit. To determine the number of operations which can be performed, we take the vector width (VW) and divide by the word size in bits (Wbits). We also include the Fused Multiply Add (FMA) instruction as another factor of two operations per cycle. We refer to this as fused operations (Fops) in the equation. For this specific processor we get

Ic = VW/Wbits x Fops = (256 bit vector unit/64 bits) x (2 FMA) = 8 flops/cycle

Cv = Ch x HT = (4 hardware cores x 2 hyperthreads)

FT = (8 virtual cores) x (3.7 GHz) x (8 flops/cycle) = 236.8 GFlops/sec.
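The arithmetic above is easy to repeat for other processors. The sketch below mirrors the three formulas; the function and parameter names are illustrative, and the inputs are the values quoted in the text for the example processor.

```python
def theoretical_gflops(physical_cores, hyperthreads, clock_ghz,
                       vector_bits, word_bits, fused_ops):
    """Theoretical peak in GFlops/sec, following the formulas in the text."""
    virtual_cores = physical_cores * hyperthreads               # Cv = Ch x HT
    flops_per_cycle = (vector_bits // word_bits) * fused_ops    # Ic = VW/Wbits x Fops
    return virtual_cores * clock_ghz * flops_per_cycle          # FT = Cv x fc x Ic


# 4 cores, 2 hyperthreads, 3.7 GHz, 256-bit vectors, 64-bit words, FMA factor of 2
ft = theoretical_gflops(4, 2, 3.7, 256, 64, 2)
print(f"{ft:.1f} GFlops/sec")
```

For single-precision work, passing `word_bits=32` doubles the flops per cycle under the same model.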

The memory hierarchy and theoretical memory bandwidth

For most large computational problems, we can assume that there are large arrays that need to be loaded from main memory through the cache hierarchy as shown in figure 4. The memory hierarchy has grown deeper over the years with the addition of more levels of cache to compensate for the increase in processing speed relative to the main memory access times.


We can calculate the theoretical memory bandwidth of the main memory using the specs of the memory chips. The general formula is

BT = MTR x Mc x Tw x Ns = Data Transfer Rate x memory channels x bytes per access x sockets

Processors are installed in a socket on the motherboard. The motherboard is the main system board of the computer and the socket is the location where the processor is inserted. Most motherboards are single socket where only one processor can be installed, but there are some dual socket motherboards. Two processors can be installed in a dual socket motherboard, giving more processing cores, but also giving more memory bandwidth.

The data, or memory, transfer rate (MTR) is usually given in millions of transfers per second (MT/s). Double data rate (DDR) memory performs transfers at the top and bottom of the cycle for two transactions per cycle. This means that the memory bus clock rate in MHz is half of the transfer rate. The memory transfer width (Tw) is 64 bits and, because there are 8 bits/byte, 8 bytes are transferred per access. Most desktop and laptop architectures have two memory channels (Mc).

For the 2017 MacBook Pro, as above, with LPDDR3-2133 memory and two channels, the theoretical memory bandwidth (BT) can be calculated from the memory transfer rate (MTR) of 2133 MT/s, the number of channels (Mc), the transfer width (Tw), and the number of sockets (Ns) on the motherboard.

BT = 2133 MT/s x 2 channels x 8 bytes x 1 socket = 34,128 MB/sec or 34.1 GB/sec.
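The same calculation as a small sketch, with illustrative function and parameter names. Because MT/s counts millions of transfers, the product comes out in millions of bytes per second.

```python
def theoretical_bandwidth(transfer_rate_mts, channels, bytes_per_access, sockets):
    """BT = MTR x Mc x Tw x Ns, in millions of bytes per second."""
    return transfer_rate_mts * channels * bytes_per_access * sockets


transfer_width_bits = 64
bt = theoretical_bandwidth(2133, 2, transfer_width_bits // 8, 1)

# DDR memory makes two transfers per cycle, so the bus clock is half the rate.
bus_clock_mhz = 2133 / 2

print(bt, bus_clock_mhz)
```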

The achievable memory bandwidth is lower than the theoretical bandwidth due to the effects of the rest of the memory hierarchy. Complex theoretical models for estimating the effects of the memory hierarchy exist, but they’re beyond what we want to consider in our simplified processor model. Instead, we turn to empirical measurements of bandwidth at the CPU.

Empirical measurement of bandwidth and flops

The empirical bandwidth is the measurement of the fastest rate that memory can be loaded from main memory into the processor. If a single byte of memory is requested, it takes one cycle to retrieve it from a CPU register. If it isn’t in a CPU register, it has to come from the L1 cache. If it isn’t in the L1 cache, the L1 cache loads it from L2, and so on, to main memory. If the request has to go all the way to main memory for that single byte, it can take around 400 clock cycles. The time required for the first byte of data from each level of memory is called the memory latency. Once the value is in a higher level of cache, it can be retrieved faster until it gets evicted from that level of the cache. If all memory had to be loaded a byte at a time, this would be painfully slow. Instead, when a byte of memory is loaded, a whole chunk of data, called a cache line, is loaded at the same time. If nearby values are subsequently accessed, they’re already in the higher levels of the cache.
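To make the benefit of cache lines concrete, the sketch below counts how many distinct cache lines are touched when reading values at different strides. The 64-byte line size and 8-byte element size are assumptions (the text doesn’t give them), and the function name is illustrative.

```python
def cache_lines_touched(n_elements, stride, element_bytes=8, line_bytes=64):
    """Count distinct cache lines touched when reading n_elements values
    spaced `stride` elements apart, starting at a line-aligned address.
    The 64-byte line size is an assumed value, not taken from the text."""
    lines = {(i * stride * element_bytes) // line_bytes
             for i in range(n_elements)}
    return len(lines)


contiguous = cache_lines_touched(1024, stride=1)  # eight values share each line
strided = cache_lines_touched(1024, stride=8)     # every value is on its own line
print(contiguous, strided)
```

Under these assumptions, the strided access loads eight times as many cache lines for the same number of values, which is why contiguous access makes better use of the available bandwidth.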

The cache lines, cache sizes and number of cache levels are sized to try to use as much of the theoretical bandwidth of the main memory as possible. If we load contiguous data as fast as possible to make the best use of the caches, we get the maximum possible data transfer rate at the CPU. This maximum data transfer rate is called the bandwidth of the memory system. To determine the memory bandwidth, we can measure the time for reading and writing a large array. From the empirical measurements below, the measured bandwidth is about 22 GB/sec.

Two methods are used for measuring the bandwidth: the stream benchmark and the roofline model measured by the empirical roofline toolkit. The stream benchmark was created by John McCalpin in 1995 to support his argument that memory bandwidth is far more important than the peak floating point capability. The roofline model, in comparison, integrates both the memory bandwidth limit and the peak flop rate into a single plot with regions which show each performance limit. The empirical roofline toolkit was created to measure and plot the roofline model.

The stream benchmark measures the time to read and write a large array. Depending on the operations performed on the data by the CPU as it is being read, there are four variants. These are the copy, scale, add and triad measurements. The copy does no floating point work, the scale and add do one arithmetic operation, and the triad does two. These each give a slightly different measure of the maximum rate that data can be expected to be loaded from main memory for the case that each data value is only used once. In this regime, the flop rate is limited by how fast memory can be loaded.

Kernel                          Bytes    Arithmetic Operations
Copy:   a(i) = b(i)              16               0
Scale:  a(i) = q*b(i)            16               1
Add:    a(i) = b(i) + c(i)       24               1
Triad:  a(i) = b(i) + q*c(i)     24               2
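Dividing the arithmetic operations by the bytes moved gives each kernel’s flops per byte, the quantity that the roofline discussion later in this article calls arithmetic intensity. The sketch below just tabulates that ratio from the numbers in the table; the names are illustrative.

```python
# (bytes moved per iteration, arithmetic operations per iteration)
STREAM_KERNELS = {
    "Copy": (16, 0),
    "Scale": (16, 1),
    "Add": (24, 1),
    "Triad": (24, 2),
}


def arithmetic_intensity(bytes_moved, flops):
    """Flops performed per byte of memory traffic."""
    return flops / bytes_moved


for name, (bytes_moved, flops) in STREAM_KERNELS.items():
    print(f"{name:6s} {arithmetic_intensity(bytes_moved, flops):.4f} flops/byte")
```

All four kernels perform well under one flop per byte, which is why they are used to probe the bandwidth limit and not the flop limit.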

The following exercise shows how to use the stream benchmark to measure bandwidth on a given CPU.

Measuring bandwidth using the Stream Benchmark

Jeff Hammond, a scientist at Intel, has put the McCalpin stream benchmark code into a Git repository to make it more convenient to retrieve. We’ll use his version in this example:

  1. git clone https://github.com/jeffhammond/STREAM.git
  2. Edit Makefile and change compile line to -O3 -march=native -fstrict-aliasing -ftree-vectorize -fopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20
  3. make
  4. ./stream_c.exe

Here are the results for the 2017 Mac laptop:

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:          22086.5        0.060570    0.057954    0.062090
Scale:         16156.6        0.081041    0.079225    0.082322
Add:           16646.0        0.116622    0.115343    0.117515
Triad:         16605.8        0.117036    0.115622    0.118004

We can select the best bandwidth from one of the four measurements as our empirical value of maximum bandwidth.
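The reported rates can be cross-checked against the timings. The sketch below recomputes the best rate from the minimum time, assuming 8-byte double-precision elements and the array size of 80,000,000 from the compile line. The function name is illustrative.

```python
ARRAY_SIZE = 80_000_000   # -DSTREAM_ARRAY_SIZE from the compile line above
BYTES_PER_ELEMENT = 8     # assumes double-precision array elements


def best_rate_mb_per_s(arrays_touched, min_time_s, n=ARRAY_SIZE):
    """Bytes moved in one pass divided by the fastest time, in MB/s (10^6 bytes)."""
    bytes_moved = arrays_touched * BYTES_PER_ELEMENT * n
    return bytes_moved / min_time_s / 1.0e6


copy_rate = best_rate_mb_per_s(2, 0.057954)    # copy reads b and writes a
triad_rate = best_rate_mb_per_s(3, 0.115622)   # triad reads b and c, writes a
print(f"Copy {copy_rate:.1f} MB/s, Triad {triad_rate:.1f} MB/s")
```

The results agree with the reported best rates to within the rounding of the printed times.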

If a calculation can reuse the data in cache, much higher flop rates are possible. If we assume that all data being operated on is in a CPU register or maybe the L1 cache, then the maximum flop rate is determined by the clock frequency of the CPU and how many flops it can do per cycle. This is the theoretical maximum flop rate calculated above.

Now we can put these two together to create a plot of the roofline model. The roofline model has a vertical axis of the flops per second and a horizontal axis of arithmetic intensity. For high arithmetic intensity where there are a lot of flops compared to the data loaded, the theoretical maximum flop rate is the limit. This produces a horizontal line on the plot at the maximum flop rate. As the arithmetic intensity decreases, the time for the memory loads starts to dominate and we no longer can get the maximum theoretical flops. This then creates the sloped roof in the roofline model where the achievable flop rate slopes down as the arithmetic intensity drops. The horizontal line on the right of the plot and the sloped line on the left produce the characteristic shape reminiscent of a roofline and what has become known as the roofline model or plot. The roofline plot can be determined for a CPU or even a GPU as shown in the following exercise.
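The shape described above follows from taking the lower of the two limits at each arithmetic intensity. The sketch below uses the theoretical peak of 236.8 GFlops/sec and the measured bandwidth of about 22 GB/sec from this article; the function and variable names are illustrative.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline limit: the lower of the flop ceiling and the bandwidth slope."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


PEAK_GFLOPS = 236.8      # theoretical maximum flops calculated earlier
BANDWIDTH_GB_S = 22.0    # approximate stream measurement

# Arithmetic intensity at which the sloped and horizontal lines meet.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

for intensity in (0.0625, 1.0, 10.0, 100.0):
    limit = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)
    print(f"{intensity:8.4f} flops/byte -> {limit:6.1f} GFlops/sec")
print(f"ridge point at {ridge_point:.2f} flops/byte")
```

Kernels to the left of the ridge point are limited by bandwidth, and kernels to the right of it are limited by the flop rate.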

Measuring bandwidth using the Empirical Roofline Toolkit

To prepare for this exercise, install either OpenMPI or MPICH to get a working MPI. Install gnuplot version 4.2 and make sure you have Python version 2.0. On Macs, you should download the GCC compiler to replace the default compiler. These installs can be done using a package manager (brew on Mac; apt or synaptic on Ubuntu Linux).

  1. Get the roofline toolkit
    git clone https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git
  2. cd cs-roofline-toolkit/Empirical_Roofline_Tool-1.1.0
  3. cp Config/config.madonna.lbl.gov.01 Config/MacLaptop2017
  4. edit Config/MacLaptop2017. Below is the file for the 2017 Mac laptop
  5. Run tests ./ert Config/MacLaptop2017
  6. View Results.MacLaptop2017/Run.001/roofline.ps

Shown in figure 5 is the roofline for the 2017 Mac laptop. The empirical measurement of the maximum flops is a little higher than we calculated analytically. This is probably due to a higher clock frequency for a short period of time. Trying different configuration parameters, such as turning off vectorization or running one process, can help to determine whether you found the right hardware specifications. The sloped lines are the bandwidth limits at different arithmetic intensities. Because these are determined empirically, the labels for each slope may not be correct and extra lines may be present.


From these two empirical measurements, we get a similar maximum bandwidth through the cache hierarchy of around 22 GB/s, or about 65% of the theoretical bandwidth at the DRAM chips (22 GB/s / 34.1 GB/s).

Calculating the machine balance between flops and bandwidth

Now we can calculate the machine balance, which is the flops divided by the memory bandwidth. We can calculate both a theoretical machine balance (MBT) and an empirical machine balance (MBE).

MBT = FT / BT = 236.8 GFlops/sec / 34.1 GB/sec x (8 bytes/word) = 56 Flops/word

MBE = FE / BE = 264.4 GFlops/sec / 22 GB/sec x (8 bytes/word) = 96 Flops/word

In figure 5, the machine balance is the intersection of the DRAM bandwidth line with the horizontal flop limit line. We see that the intersection is above 10 Flops/Byte. Multiplying by eight gives a machine balance above 80 Flops/word. We get a range of values for the machine balance, but for most applications the conclusion is that we’re in the bandwidth-bound regime.
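The two machine balance values can be reproduced with a short sketch. It uses the flop rates and bandwidths quoted above and assumes 8-byte words; the function name is illustrative.

```python
def machine_balance(gflops, bandwidth_gb_s, bytes_per_word=8):
    """Flops that can be performed in the time it takes to load one word."""
    return gflops / bandwidth_gb_s * bytes_per_word


mbt = machine_balance(236.8, 34.1)   # theoretical flops and bandwidth
mbe = machine_balance(264.4, 22.0)   # empirical flops and bandwidth
print(f"MBT = {mbt:.0f} Flops/word, MBE = {mbe:.0f} Flops/word")
```

A kernel needs to perform roughly this many flops for every word it loads before the flop rate, not the memory bandwidth, becomes the limit.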

If you want to learn more about the book, check it out on liveBook here and see this slide deck.
