INSPECT

Intra Node Stencil Performance Evaluation Collection

Stencil Properties

dimension 3D
radius 2
weighting homogeneous
kind star
coefficients constant
datatype double
machine SkylakeSP_Gold-6148
FLOP per LUP 13

Review status: might be okay...

Julian Hammer says:

The correct way to measure and predict L2-Memory and L3-Memory traffic is unknown.


The raw benchmark data shown on this page can be found in the corresponding folder of the git repository.

If you have feedback or have found errors on this page, please submit an issue on the GitHub page.

Kernel Source Code

double a[M][N][P];
double b[M][N][P];
double c0;

for (long k = 2; k < M - 2; ++k) {
  for (long j = 2; j < N - 2; ++j) {
    for (long i = 2; i < P - 2; ++i) {
      b[k][j][i] =
          c0 * (a[k][j][i] + a[k][j][i - 1] + a[k][j][i + 1] +
                a[k - 1][j][i] + a[k + 1][j][i] + a[k][j - 1][i] +
                a[k][j + 1][i] + a[k][j][i - 2] + a[k][j][i + 2] +
                a[k - 2][j][i] + a[k + 2][j][i] + a[k][j - 2][i] +
                a[k][j + 2][i]);
    }
  }
}
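
The kernel above uses statically sized arrays. As a minimal sketch for trying it on small grids, the same loop nest can be wrapped in a function that works on flat, row-major buffers. The names stencil_3d_r2_star and idx3 are hypothetical helpers introduced here for illustration and are not part of the generated benchmark code. Counting the operations in the update (12 additions and 1 multiplication) gives the 13 FLOP per LUP listed in the stencil properties.

```c
#include <assert.h>

/* Flat index into a row-major M x N x P array (i is the fastest dimension).
 * Hypothetical helper, not part of the stempel output. */
static inline long idx3(long k, long j, long i, long N, long P) {
    return (k * N + j) * P + i;
}

/* Radius-2 3D star stencil with a single constant coefficient c0.
 * Reads a and writes the interior of b; the two outermost layers of b
 * in every dimension are left untouched. */
void stencil_3d_r2_star(const double *a, double *b,
                        long M, long N, long P, double c0) {
    assert(a != b); /* out-of-place update, in line with -fno-alias */
    for (long k = 2; k < M - 2; ++k) {
        for (long j = 2; j < N - 2; ++j) {
            for (long i = 2; i < P - 2; ++i) {
                /* 12 additions + 1 multiplication = 13 FLOP per LUP */
                b[idx3(k, j, i, N, P)] = c0 * (
                    a[idx3(k, j, i, N, P)] +
                    a[idx3(k, j, i - 1, N, P)] + a[idx3(k, j, i + 1, N, P)] +
                    a[idx3(k - 1, j, i, N, P)] + a[idx3(k + 1, j, i, N, P)] +
                    a[idx3(k, j - 1, i, N, P)] + a[idx3(k, j + 1, i, N, P)] +
                    a[idx3(k, j, i - 2, N, P)] + a[idx3(k, j, i + 2, N, P)] +
                    a[idx3(k - 2, j, i, N, P)] + a[idx3(k + 2, j, i, N, P)] +
                    a[idx3(k, j - 2, i, N, P)] + a[idx3(k, j + 2, i, N, P)]);
            }
        }
    }
}
```

With an input filled with 1.0 and c0 = 0.5, every interior point of b becomes 0.5 * 13 = 6.5, which makes a convenient sanity check.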

Layer Conditions

1D Layer Condition:

  • L1: unconditionally fulfilled
  • L2: unconditionally fulfilled
  • L3: unconditionally fulfilled

2D Layer Condition:

  • L1: P <= 2048/5, that is P <= 400
  • L2: P <= 65536/5, that is P <= 13100
  • L3: P <= 1835008/5, that is P <= 367000

3D Layer Condition:

  • L1: 32*N*P + 16*P*(N - 2) + 32*P <= 32768, that is N*P <= 20²
  • L2: 32*N*P + 16*P*(N - 2) + 32*P <= 1048576, that is N*P <= 80²
  • L3: 32*N*P + 16*P*(N - 2) + 32*P <= 29360128, that is N*P <= 680²

Have a look at the kernel source code for dimension naming.
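
To check a concrete grid size against the conditions listed above, the inequalities can be evaluated directly. The following is a minimal sketch that copies the left-hand sides and limits verbatim from this page. The function names are hypothetical and do not belong to kerncraft.

```c
#include <stdbool.h>

/* Right-hand sides of the 3D layer conditions as printed on this page. */
#define LC3D_LIMIT_L1 32768L
#define LC3D_LIMIT_L2 1048576L
#define LC3D_LIMIT_L3 29360128L

/* Numerators of the 2D layer conditions (P <= limit / 5) as printed. */
#define LC2D_LIMIT_L1 2048L
#define LC2D_LIMIT_L2 65536L
#define LC2D_LIMIT_L3 1835008L

/* Left-hand side of the 3D layer condition: 32*N*P + 16*P*(N - 2) + 32*P. */
long lc3d_lhs(long N, long P) {
    return 32 * N * P + 16 * P * (N - 2) + 32 * P;
}

/* True if the 3D layer condition holds for the given limit. */
bool lc3d_fulfilled(long N, long P, long limit) {
    return lc3d_lhs(N, P) <= limit;
}

/* True if the 2D layer condition P <= limit / 5 holds.
 * Written as 5 * P <= limit to avoid integer division. */
bool lc2d_fulfilled(long P, long limit) {
    return 5 * P <= limit;
}
```

For example, lc3d_fulfilled(20, 20, LC3D_LIMIT_L1) holds because the left-hand side evaluates to 19200, whereas N = P = 30 gives 43200 and violates the L1 condition.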

How to test this stencil

Generate this stencil with:

stempel gen -D 3 -r 2 -t "double" -C constant -k star -o --store stencil.c

and generate the compilable benchmark code with:

stempel bench stencil.c -m SkylakeSP_Gold-6148.yml --store

Compiler flags

icc -O3 -fno-alias -xCORE-AVX512 -qopenmp -DLIKWID_PERFMON -Ilikwid-4.3.3/include -Llikwid-4.3.3/lib -Iheaders/dummy.c stencil_compilable.c -o stencil -llikwid

Single Core Grid Scaling

ECM Prediction vs. Performance

[Plot: Cycles / Cacheline over Grid Size (N^(1/3)). Series: T_L3-MEM, T_L2-L3, T_L1-L2, T_Reg-L1, T_comp, Benchmark, Roofline w/ LC.]

Comparison of the measured stencil performance (in cycles per cache line), roofline prediction and the (stacked) contributions of the ECM Performance Model predicted by kerncraft using Layer Conditions to model the cache behavior. The calculated layer conditions shown above correspond to the jumps in the ECM prediction in this plot.
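
The plot above reports cycles per cache line, while the thread scaling and blocking plots report MLUP/s. As a rough sketch of how the two units relate: with double precision data, one cache line holds 64 / 8 = 8 lattice updates, assuming a 64-byte cache line. The clock frequency is not stated on this page, so it is a parameter here, and the function name is a hypothetical helper.

```c
/* Convert a runtime in cycles per cache line into MLUP/s.
 * Assumptions: 64-byte cache lines and one double written per lattice
 * update, so one cache line corresponds to 8 LUPs. clock_hz must be
 * supplied by the caller, since the page does not state it. */
double cy_per_cl_to_mlups(double cy_per_cl, double clock_hz) {
    const double lup_per_cl = 64.0 / sizeof(double);
    return clock_hz / cy_per_cl * lup_per_cl / 1.0e6;
}
```

For instance, at an illustrative clock of 2.4 GHz, 24 cycles per cache line corresponds to 800 MLUP/s.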

Data Transfers between Caches

[Plot: Data Transfers [Byte/LUP] over Grid Size (N^(1/3)). Series: L1-L2 Benchmark, L2-L3 Benchmark, L3-MEM Benchmark, L1-L2 with LC, L2-L3 with LC, L3-MEM with LC.]

Data transfers between the different cache levels and main memory. The data shown for each level includes both evicted and loaded data. Measured values are shown as points, and the transfer rates predicted by kerncraft are shown as lines.

Multi Core Thread Scaling

[Plot: Performance [MLUP/s] over Number of Threads. Series: Benchmark, ECM LC Prediction, Roofline LC Prediction.]

Something is wrong...

Julian Hammer says:

This is based on a faulty machine file. Needs to be rerun.


Single Core Spatial Blocking

[Plot: Performance [MLUP/s] over Grid Size (N^(1/3)). Series: Benchmark, Benchmark w/ L3-3D blocking, Roofline LC, ECM.]

How to replicate this data

Single Core Measurements

Using the generated stencil and kerncraft, all single core performance data shown on this page can be reproduced by:

Layer Condition Data

kerncraft -p ECM -p RooflineIACA -p Benchmark -p LC -P LC -m SkylakeSP_Gold-6148.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores 1 --compiler icc

Cache Simulator Data

kerncraft -p ECM -p RooflineIACA -p Benchmark -p LC -P CS -m SkylakeSP_Gold-6148.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores 1 --compiler icc

Thread Scaling Measurements

The generated benchmark code can be used to reproduce the thread scaling data shown on this page by:

kerncraft -p ECM -p RooflineIACA -p Benchmark -P LC -m SkylakeSP_Gold-6148.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores $CORES --compiler icc

The roofline prediction can be obtained with kerncraft and the generated stencil:

kerncraft -p RooflineIACA -P LC -m SkylakeSP_Gold-6148.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores $CORES --compiler icc

Spatial Blocking Measurements

Generate the benchmark code with blocking enabled, compile it with the compiler flags shown above, and run it pinned to a single core:

stempel bench stencil.c -m SkylakeSP_Gold-6148.yml -b 2 --store
OMP_NUM_THREADS=1 likwid-pin -C S0:0 ./stencil $GRID_SIZE $GRID_SIZE $GRID_SIZE $BLOCKING_M $BLOCKING_N $BLOCKING_P
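
For orientation, the following is a minimal sketch of what spatial blocking of this loop nest might look like. It is not the code that stempel generates with -b 2, which may be structured differently. The function name, the helpers, and the parameters BM, BN and BP (mirroring $BLOCKING_M, $BLOCKING_N and $BLOCKING_P above) are assumptions made for illustration.

```c
#include <assert.h>

/* Flat index into a row-major M x N x P array (hypothetical helper). */
static inline long idx3(long k, long j, long i, long N, long P) {
    return (k * N + j) * P + i;
}

static inline long min_l(long x, long y) { return x < y ? x : y; }

/* Radius-2 3D star stencil, traversed in blocks of BM x BN x BP points.
 * Produces the same result as the unblocked loop nest; only the
 * traversal order of the interior points changes. */
void stencil_3d_r2_star_blocked(const double *a, double *b,
                                long M, long N, long P,
                                long BM, long BN, long BP, double c0) {
    assert(BM > 0 && BN > 0 && BP > 0); /* block loops must make progress */
    for (long kb = 2; kb < M - 2; kb += BM) {
        const long kend = min_l(kb + BM, M - 2);
        for (long jb = 2; jb < N - 2; jb += BN) {
            const long jend = min_l(jb + BN, N - 2);
            for (long ib = 2; ib < P - 2; ib += BP) {
                const long iend = min_l(ib + BP, P - 2);
                for (long k = kb; k < kend; ++k)
                    for (long j = jb; j < jend; ++j)
                        for (long i = ib; i < iend; ++i)
                            b[idx3(k, j, i, N, P)] = c0 * (
                                a[idx3(k, j, i, N, P)] +
                                a[idx3(k, j, i - 1, N, P)] + a[idx3(k, j, i + 1, N, P)] +
                                a[idx3(k - 1, j, i, N, P)] + a[idx3(k + 1, j, i, N, P)] +
                                a[idx3(k, j - 1, i, N, P)] + a[idx3(k, j + 1, i, N, P)] +
                                a[idx3(k, j, i - 2, N, P)] + a[idx3(k, j, i + 2, N, P)] +
                                a[idx3(k - 2, j, i, N, P)] + a[idx3(k + 2, j, i, N, P)] +
                                a[idx3(k, j - 2, i, N, P)] + a[idx3(k, j + 2, i, N, P)]);
            }
        }
    }
}
```

The idea is to pick block sizes small enough that the layer conditions listed earlier hold for the block rather than for the full grid. Which block sizes work best is an empirical question that the benchmark run above is meant to answer.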