Intra Node Stencil Performance Evaluation Collection

Stencil Properties

dimension 3D
radius 1
weighting heterogeneous
kind box
coefficients variable
datatype double
machine IvyBridgeEP_E5-2660v2
FLOP per LUP 53
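
The FLOP count follows from the stencil shape. As a minimal sketch, assuming one multiplication per stencil point (each point has its own coefficient) and one addition for every product after the first, a box stencil of radius r in d dimensions has (2r+1)^d points and 2*(2r+1)^d - 1 FLOP per lattice update (LUP). The function names are illustrative and are not part of stempel or kerncraft.

```c
#include <assert.h>

/* Number of points in a box stencil: (2*radius + 1)^dimension. */
long box_stencil_points(int dimension, int radius)
{
    assert(dimension >= 0 && radius >= 0);
    long points = 1;
    for (int d = 0; d < dimension; ++d)
        points *= 2L * radius + 1;
    return points;
}

/* FLOP per lattice update (LUP), assuming one multiply per point
 * (heterogeneous weighting) and points - 1 additions to sum the products. */
long box_stencil_flop_per_lup(int dimension, int radius)
{
    long points = box_stencil_points(dimension, radius);
    return points + (points - 1);
}
```

For this stencil, box_stencil_flop_per_lup(3, 1) gives 27 multiplications plus 26 additions, which is the 53 FLOP per LUP listed above.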

might be okay...

Julian Hornich says:

Ivy Bridge and Sandy Bridge have problems with split loads in small stencils that contain a central point. The issue was reported to Intel, but no solution was offered. As of Haswell, the problem is resolved. Including `#pragma vector aligned` reduces the problem but does not solve it completely. Removing the central point solves the issue.

The raw benchmark data shown on this page can be found in the corresponding folder of the git repository.

If you have feedback or issues, or have found errors on this page, please submit an issue on the GitHub page.

Kernel Source Code

double a[M][N][P];
double b[M][N][P];
double W[27][M][N][P];

for(long k=1; k < M-1; ++k){
  for(long j=1; j < N-1; ++j){
    for(long i=1; i < P-1; ++i){
      b[k][j][i] = W[0][k][j][i] * a[k][j][i]
        + W[1][k][j][i] * a[k-1][j-1][i-1]
        + W[2][k][j][i] * a[k][j-1][i-1]
        + W[3][k][j][i] * a[k+1][j-1][i-1]
        + W[4][k][j][i] * a[k-1][j][i-1]
        + W[5][k][j][i] * a[k][j][i-1]
        + W[6][k][j][i] * a[k+1][j][i-1]
        + W[7][k][j][i] * a[k-1][j+1][i-1]
        + W[8][k][j][i] * a[k][j+1][i-1]
        + W[9][k][j][i] * a[k+1][j+1][i-1]
        + W[10][k][j][i] * a[k-1][j-1][i]
        + W[11][k][j][i] * a[k][j-1][i]
        + W[12][k][j][i] * a[k+1][j-1][i]
        + W[13][k][j][i] * a[k-1][j][i]
        + W[14][k][j][i] * a[k+1][j][i]
        + W[15][k][j][i] * a[k-1][j+1][i]
        + W[16][k][j][i] * a[k][j+1][i]
        + W[17][k][j][i] * a[k+1][j+1][i]
        + W[18][k][j][i] * a[k-1][j-1][i+1]
        + W[19][k][j][i] * a[k][j-1][i+1]
        + W[20][k][j][i] * a[k+1][j-1][i+1]
        + W[21][k][j][i] * a[k-1][j][i+1]
        + W[22][k][j][i] * a[k][j][i+1]
        + W[23][k][j][i] * a[k+1][j][i+1]
        + W[24][k][j][i] * a[k-1][j+1][i+1]
        + W[25][k][j][i] * a[k][j+1][i+1]
        + W[26][k][j][i] * a[k+1][j+1][i+1];
    }
  }
}
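
For experiments outside the benchmark harness, the same update can be written with offset loops over flat arrays. This is a minimal sketch and not the code that stempel generates: the function name stencil_sweep, the IDX macro and the flat row-major layout are assumptions made for illustration. The coefficient order reproduces the listing above (W[0] is the central point; for the remaining points the i offset varies slowest and the k offset fastest).

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index into an M x N x P grid (i is the fastest dimension). */
#define IDX(k, j, i, N, P) (((size_t)(k) * (N) + (j)) * (P) + (i))

/* One sweep of the 3D radius-1 box stencil with variable coefficients.
 * a, b: M*N*P doubles each; W: 27 coefficient grids of M*N*P doubles,
 * stored one after another (hypothetical layout for this sketch). */
void stencil_sweep(const double *a, double *b, const double *W,
                   long M, long N, long P)
{
    assert(M >= 3 && N >= 3 && P >= 3);
    const size_t grid = (size_t)M * N * P;

    for (long k = 1; k < M - 1; ++k)
    for (long j = 1; j < N - 1; ++j)
    for (long i = 1; i < P - 1; ++i) {
        const size_t c = IDX(k, j, i, N, P);
        double sum = W[c] * a[c];      /* W[0]: central point */
        size_t w = 1;                  /* index of the next coefficient grid */
        for (long di = -1; di <= 1; ++di)
        for (long dj = -1; dj <= 1; ++dj)
        for (long dk = -1; dk <= 1; ++dk) {
            if (di == 0 && dj == 0 && dk == 0)
                continue;              /* central point already added */
            sum += W[w * grid + c] * a[IDX(k + dk, j + dj, i + di, N, P)];
            ++w;
        }
        b[c] = sum;                    /* only interior points are written */
    }
}
```

The offset loops trade the fully unrolled expression for brevity; a compiler may or may not vectorize them the same way, so performance figures on this page apply only to the generated kernel.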

Layer Conditions

1D Layer Condition:

  • L1: unconditionally fulfilled
  • L2: unconditionally fulfilled
  • L3: unconditionally fulfilled

2D Layer Condition:

  • L1: P <= 4152/37
  • L2: P <= 32824/37
  • L3: P <= 3276856/37

3D Layer Condition:

  • L1: 248*N*P - 448*P - 448 <= 32768
  • L2: 248*N*P - 448*P - 448 <= 262144
  • L3: 248*N*P - 448*P - 448 <= 26214400

Have a look at the kernel source code for dimension naming.
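
The inequalities above can be evaluated for a concrete grid size. The sketch below hard-codes the bounds listed on this page; the function names and the level enumeration are illustrative only and do not correspond to any kerncraft API.

```c
#include <assert.h>
#include <stdbool.h>

/* Cache levels in the order used by the lists above. */
enum cache_level { CACHE_L1 = 0, CACHE_L2 = 1, CACHE_L3 = 2 };

/* 2D layer condition: P <= numerator / 37, with the numerators listed above.
 * Evaluated as 37 * P <= numerator to stay in integer arithmetic. */
bool layer_condition_2d(enum cache_level level, long P)
{
    static const long numerator[3] = { 4152, 32824, 3276856 };
    assert((int)level >= 0 && (int)level <= 2 && P > 0);
    return 37 * P <= numerator[level];
}

/* 3D layer condition: 248*N*P - 448*P - 448 <= bound, bounds as listed above. */
bool layer_condition_3d(enum cache_level level, long N, long P)
{
    static const long bound[3] = { 32768, 262144, 26214400 };
    assert((int)level >= 0 && (int)level <= 2 && N > 0 && P > 0);
    return 248 * N * P - 448 * P - 448 <= bound[level];
}
```

For a cubic grid with N = P = 12 the 3D condition holds in L1 (29888 <= 32768); with N = P = 13 it no longer does (35640 > 32768). The 2D condition holds in L1 up to P = 112.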

How to test this stencil

Generate this stencil with:

stempel gen -D 3 -r 1 -t "double" -C variable -k box -e --store stencil.c

and generate the compilable benchmark code with:

stempel bench stencil.c -m IvyBridgeEP_E5-2660v2.yml --store

Compiler flags

icc -O3 -xAVX -fno-alias -qopenmp -DLIKWID_PERFMON -I/apps/likwid/4.3.4/include -L/apps/likwid/4.3.4/lib -I/headers/dummy.c stencil_compilable.c -o stencil -llikwid

Single Core Grid Scaling

ECM Prediction vs. Performance

Comparison of the measured stencil performance (in cycles per cache line) with the roofline prediction and the stacked contributions of the ECM performance model, as predicted by kerncraft using layer conditions to model the cache behavior. The layer conditions calculated above correspond to the jumps in the ECM prediction in this plot.

Data Transfers between Caches

Data transfers between the different cache levels and main memory. The data shown for each level includes both evicted and loaded data. Measured values are shown as points, and the transfer rates predicted by kerncraft are shown as lines.
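
A rough estimate of the expected transfer volume per lattice update can be derived from the kernel alone. The sketch below is not kerncraft's model. It assumes that each of the 27 coefficient values is loaded exactly once, that b is stored with regular (write-allocate) stores, and that the number of rows of a missing a cache level depends only on which layer condition holds there. The function name and constants are illustrative.

```c
#include <assert.h>

enum { BYTES_PER_DOUBLE = 8, COEFFICIENT_ARRAYS = 27 };

/* Estimated bytes per lattice update (LUP) crossing the boundary below a
 * cache level, counting loaded and evicted data together.
 * a_miss_streams: rows of a[] that miss in this cache per update:
 * 1 if the 3D layer condition holds, 3 if only the 2D condition holds,
 * 9 if only the 1D condition holds (radius-1 box: 3 x 3 rows in j and k). */
long estimated_bytes_per_lup(int a_miss_streams)
{
    assert(a_miss_streams == 1 || a_miss_streams == 3 || a_miss_streams == 9);
    long loaded  = COEFFICIENT_ARRAYS   /* every W[n] value is used once */
                 + a_miss_streams       /* a[] rows not reused from cache */
                 + 1;                   /* write-allocate of b[] (assumed) */
    long evicted = 1;                   /* b[] written back */
    return BYTES_PER_DOUBLE * (loaded + evicted);
}
```

Under these assumptions the estimate is 240 bytes per LUP where the 3D layer condition holds, 256 where only the 2D condition holds, and 304 where only the 1D condition holds. The coefficient arrays dominate in every case.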

Multi Core Thread Scaling

How to replicate this data

Single Core Measurements

Using the generated stencil and kerncraft, all single core performance data shown on this page can be reproduced by:

Layer Condition Data

kerncraft -p ECM -p RooflineIACA -p Benchmark -p LC -P LC -m IvyBridgeEP_E5-2660v2.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores 1 --compiler icc

Cache Simulator Data

kerncraft -p ECM -p RooflineIACA -p Benchmark -p LC -P CS -m IvyBridgeEP_E5-2660v2.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores 1 --compiler icc

Thread Scaling Measurements

The generated benchmark code can be used to reproduce the thread scaling data shown on this page by:

kerncraft -p ECM -p RooflineIACA -p Benchmark -P LC -m IvyBridgeEP_E5-2660v2.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores $CORES --compiler icc

The roofline prediction can be obtained with kerncraft and the generated stencil:

kerncraft -p RooflineIACA -P LC -m IvyBridgeEP_E5-2660v2.yml stencil.c -D N $GRID_SIZE -D M $GRID_SIZE -D P $GRID_SIZE -vvv --cores ${threads} --compiler icc

Spatial Blocking Measurements

Generate benchmark code from the stencil with blocking and compile it as shown before:

stempel bench stencil.c -m IvyBridgeEP_E5-2660v2.yml -b 2 --store

IACA Output

Throughput Analysis Report
Block Throughput: 67.00 Cycles       Throughput Bottleneck: Backend. Port2_DATA, Port3_DATA

Port Binding In Cycles Per Iteration:
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
| Cycles | 27.0   0.0  | 26.0 | 51.0   67.0 | 51.0   67.0 | 2.0  | 22.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x378]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm9, xmmword ptr [r11+rdi*8+0x8]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm0, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm1, ymm0, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x380]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm4, ymm1, ymmword ptr [rdx+rdi*8+0x8]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm1, xmmword ptr [r14+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm2, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm3, ymm2, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x3e8]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm5, ymm3, ymmword ptr [r10+rdi*8]
|   1    |           |     |           | 1.0   2.0 |     |     | CP | vmovupd ymm6, ymmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm7, ymm4, ymm5
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm8, ymm6, ymmword ptr [r9+rdi*8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x368]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm11, ymm7, ymm8
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm13, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm14, ymm13, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x3f8]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm0, ymm14, ymmword ptr [rcx+rdi*8]
|   1    |           |     | 1.0   2.0 |           |     |     | CP | vmovupd ymm5, ymmword ptr [r15+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x358]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm7, ymm5, ymmword ptr [rax+rdi*8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm8, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm10, ymm9, xmmword ptr [r11+rdi*8+0x18], 0x1
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm12, ymm10, ymmword ptr [rsi+rdi*8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm15, ymm11, ymm12
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm3, ymm15, ymm0
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm9, ymm8, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x400]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm11, ymm9, ymmword ptr [r13+rdi*8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm12, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm13, ymm12, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x348]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm15, ymm13, ymmword ptr [r12+rdi*8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm0, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm2, ymm1, xmmword ptr [r14+rdi*8+0x18], 0x1
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm4, ymm2, ymmword ptr [rdx+rdi*8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm6, ymm3, ymm4
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm10, ymm6, ymm7
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm14, ymm10, ymm11
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm2, ymm14, ymm15
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm1, ymm0, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x3f0]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm3, ymm1, ymmword ptr [r8+rdi*8]
|   1    |           |     |           | 1.0   2.0 |     |     | CP | vmovupd ymm4, ymmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm5, ymm2, ymm3
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm6, ymm4, ymmword ptr [r10+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x330]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm9, ymm5, ymm6
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm7, xmmword ptr [r15+rdi*8+0x8]
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm8, ymm7, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x3e0]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm10, ymm8, ymmword ptr [r9+rdi*8+0x8]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm11, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm13, ymm9, ymm10
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm12, ymm11, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x310]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm14, ymm12, ymmword ptr [rsi+rdi*8+0x8]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm15, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm1, ymm13, ymm14
|   1*   |           |     |           |           |     |     |    | nop
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm0, ymm15, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x3d8]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm2, ymm0, ymmword ptr [rcx+rdi*8+0x8]
|   1    |           |     | 1.0   2.0 |           |     |     | CP | vmovupd ymm3, ymmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm4, ymm1, ymm2
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm5, ymm3, ymmword ptr [rax+rdi*8+0x8]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x300]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm8, ymm4, ymm5
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm6, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm7, ymm6, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x2f8]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm9, ymm7, ymmword ptr [r13+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm10, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm12, ymm8, ymm9
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm11, ymm10, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x318]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm13, ymm11, ymmword ptr [r12+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm14, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm1, ymm12, ymm13
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm0, ymm14, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x320]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm2, ymm0, ymmword ptr [r8+rdi*8+0x8]
|   1    |           |     |           | 1.0   2.0 |     |     | CP | vmovupd ymm3, ymmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm4, ymm1, ymm2
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm5, ymm3, ymmword ptr [r10+rdi*8+0x10]
|   1*   |           |     |           |           |     |     |    | nop
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x328]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm8, ymm4, ymm5
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm6, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm7, ymm6, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x3d0]
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm9, ymm7, ymmword ptr [r9+rdi*8+0x10]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm10, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm12, ymm8, ymm9
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm11, ymm10, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x338]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm13, ymm11, ymmword ptr [rsi+rdi*8+0x10]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm10, xmmword ptr [rbx+rdi*8+0x8]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | vmovupd xmm15, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm1, ymm12, ymm13
|   2    |           |     | 1.0   1.0 |           |     | 1.0 | CP | vinsertf128 ymm0, ymm15, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x3c8]
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm2, ymm0, ymmword ptr [rcx+rdi*8+0x10]
|   1    |           |     |           | 1.0   2.0 |     |     | CP | vmovupd ymm3, ymmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm4, ymm1, ymm2
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm5, ymm3, ymmword ptr [rdx+rdi*8+0x10]
|   1    |           |     |           | 1.0   1.0 |     |     | CP | mov r15, qword ptr [rsp+0x340]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm8, ymm4, ymm5
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm6, xmmword ptr [r15+rdi*8+0x8]
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm7, ymm6, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x308]
|   1*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm9, ymm7, ymmword ptr [rax+rdi*8+0x10]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | vmovupd xmm14, xmmword ptr [r15+rdi*8+0x8]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm12, ymm8, ymm9
|   1*   |           |     |           |           |     |     |    | nop
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm0, ymm14, xmmword ptr [r15+rdi*8+0x18], 0x1
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x388]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm2, ymm0, ymmword ptr [r12+rdi*8+0x10]
|   1    |           |     | 1.0   2.0 |           |     |     | CP | vmovupd ymm3, ymmword ptr [r15+rdi*8+0x8]
|   2    | 1.0       |     |           | 1.0   2.0 |     |     | CP | vmulpd ymm5, ymm3, ymmword ptr [r8+rdi*8+0x10]
|   1    |           |     | 1.0   1.0 |           |     |     | CP | mov r15, qword ptr [rsp+0x350]
|   1*   |           |     |           |           |     |     |    | nop
|   2    |           |     |           | 1.0   1.0 |     | 1.0 | CP | vinsertf128 ymm11, ymm10, xmmword ptr [rbx+rdi*8+0x18], 0x1
|   2    | 1.0       |     | 1.0   2.0 |           |     |     | CP | vmulpd ymm13, ymm11, ymmword ptr [r13+rdi*8+0x10]
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm1, ymm12, ymm13
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm4, ymm1, ymm2
|   1    |           | 1.0 |           |           |     |     |    | vaddpd ymm6, ymm4, ymm5
|   2    |           |     | 1.0       |           | 1.0 |     |    | vmovupd xmmword ptr [r15+rdi*8+0x8], xmm6
|   2    |           |     |           | 1.0       | 1.0 |     |    | vextractf128 xmmword ptr [r15+rdi*8+0x18], ymm6, 0x1
|   1    |           |     |           |           |     | 1.0 |    | add rdi, 0x4
|   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp rdi, qword ptr [rsp+0x2f0]
|   0F   |           |     |           |           |     |     |    | jb 0xfffffffffffffc8b
Total Num Of Uops: 189

Detected pointer increment: 32
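
The block throughput reported above refers to one iteration of the assembly loop. With a detected pointer increment of 32 bytes, one iteration covers 4 double-precision lattice updates. The following sketch converts this to cycles per cache line, assuming 64-byte cache lines. The rate uses the nominal 2.2 GHz clock from the topology output below as an assumption; the actual core frequency during a run may differ. Function names are illustrative.

```c
#include <assert.h>

/* Scale a per-loop-iteration cycle count to cycles per cache line. */
double cycles_per_cache_line(double block_cycles,
                             int pointer_increment_bytes,
                             int cache_line_bytes)
{
    assert(pointer_increment_bytes > 0 && cache_line_bytes > 0);
    return block_cycles * cache_line_bytes / pointer_increment_bytes;
}

/* Lattice updates per second implied by a cycles-per-cache-line figure. */
double lups_per_second(double cy_per_cl, double clock_hz,
                       int cache_line_bytes, int element_bytes)
{
    assert(cy_per_cl > 0.0 && element_bytes > 0);
    return clock_hz * (cache_line_bytes / element_bytes) / cy_per_cl;
}
```

With the values on this page, 67 cycles per 32 bytes corresponds to 134 cycles per cache line. At 2.2 GHz that is roughly 131 MLUP/s as an in-core bound, provided data transfers do not limit performance further.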

System Information

# Hostname

# Operating System
CentOS Linux release 7.6.1810 (Core)
Derived from Red Hat Enterprise Linux 7.6 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID_LIKE="rhel fedora"
PRETTY_NAME="CentOS Linux 7 (Core)"


# Operating System (LSB)
/home/hpc/iwia/iwia84/INSPECT-repo/scripts/Artifact-description/ line 149: lsb_release: command not found

# Operating System Kernel
Linux e0451 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# Logged in users
 10:35:11 up 1 day, 19:56,  0 users,  load average: 0.10, 0.76, 2.98
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT

# CPUset
Domain N:

Domain S0:

Domain S1:

Domain C0:

Domain C1:

Domain M0:

Domain M1:

# CGroups
Allowed CPUs: 0-39
Allowed Memory controllers: 0-1

# Topology
CPU name:	Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPU type:	Intel Xeon IvyBridge EN/EP/EX processor
CPU stepping:	4
Hardware Thread Topology
Sockets:		2
Cores per socket:	10
Threads per core:	2
HWThread	Thread		Core		Socket		Available
0		0		0		0		*
1		0		1		0		*
2		0		2		0		*
3		0		3		0		*
4		0		4		0		*
5		0		5		0		*
6		0		6		0		*
7		0		7		0		*
8		0		8		0		*
9		0		9		0		*
10		0		10		1		*
11		0		11		1		*
12		0		12		1		*
13		0		13		1		*
14		0		14		1		*
15		0		15		1		*
16		0		16		1		*
17		0		17		1		*
18		0		18		1		*
19		0		19		1		*
20		1		0		0		*
21		1		1		0		*
22		1		2		0		*
23		1		3		0		*
24		1		4		0		*
25		1		5		0		*
26		1		6		0		*
27		1		7		0		*
28		1		8		0		*
29		1		9		0		*
30		1		10		1		*
31		1		11		1		*
32		1		12		1		*
33		1		13		1		*
34		1		14		1		*
35		1		15		1		*
36		1		16		1		*
37		1		17		1		*
38		1		18		1		*
39		1		19		1		*
Socket 0:		( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 )
Socket 1:		( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
Cache Topology
Level:			1
Size:			32 kB
Cache groups:		( 0 20 ) ( 1 21 ) ( 2 22 ) ( 3 23 ) ( 4 24 ) ( 5 25 ) ( 6 26 ) ( 7 27 ) ( 8 28 ) ( 9 29 ) ( 10 30 ) ( 11 31 ) ( 12 32 ) ( 13 33 ) ( 14 34 ) ( 15 35 ) ( 16 36 ) ( 17 37 ) ( 18 38 ) ( 19 39 )
Level:			2
Size:			256 kB
Cache groups:		( 0 20 ) ( 1 21 ) ( 2 22 ) ( 3 23 ) ( 4 24 ) ( 5 25 ) ( 6 26 ) ( 7 27 ) ( 8 28 ) ( 9 29 ) ( 10 30 ) ( 11 31 ) ( 12 32 ) ( 13 33 ) ( 14 34 ) ( 15 35 ) ( 16 36 ) ( 17 37 ) ( 18 38 ) ( 19 39 )
Level:			3
Size:			25 MB
Cache groups:		( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 ) ( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
NUMA Topology
NUMA domains:		2
Domain:			0
Processors:		( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 )
Distances:		10 21
Free memory:		30744.9 MB
Total memory:		32734.2 MB
Domain:			1
Processors:		( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
Distances:		21 10
Free memory:		30438.1 MB
Total memory:		32768 MB

# NUMA Topology
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 32734 MB
node 0 free: 30755 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 32768 MB
node 1 free: 30438 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

# Frequencies
Cannot read frequency data from cpufreq module

# Prefetchers
likwid-features not available

# Load
0.10 0.76 2.98 1/537 12040

# Performance energy bias
Performance energy bias: 7 (0=highest performance, 15 = lowest energy)

# NUMA balancing
Enabled: 1

# General memory info
MemTotal:       65936896 kB
MemFree:        62661116 kB
MemAvailable:   62438116 kB
Buffers:               0 kB
Cached:          1877788 kB
SwapCached:            0 kB
Active:            37820 kB
Inactive:        1856952 kB
Active(anon):      35296 kB
Inactive(anon):  1817692 kB
Active(file):       2524 kB
Inactive(file):    39260 kB
Unevictable:       50044 kB
Mlocked:           54140 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:            40 kB
AnonPages:         66084 kB
Mapped:            47636 kB
Shmem:           1836072 kB
Slab:             229120 kB
SReclaimable:      55688 kB
SUnreclaim:       173432 kB
KernelStack:        9760 kB
PageTables:         4368 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32968448 kB
Committed_AS:    2020592 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      783248 kB
VmallocChunk:   34324873212 kB
HardwareCorrupted:     0 kB
AnonHugePages:     26624 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      289396 kB
DirectMap2M:     9113600 kB
DirectMap1G:    59768832 kB

# Transparent huge pages
Enabled: [always] madvise never
Use zero page: 1

# Hardware power limits
RAPL domain package-1
- Limit0 long_term MaxPower 95000000uW Limit 95000000uW TimeWindow 9994240us
- Limit1 short_term MaxPower 150000000uW Limit 114000000uW TimeWindow 7808us
RAPL domain core
- Limit0 long_term MaxPower NAuW Limit 0uW TimeWindow 976us
RAPL domain dram
- Limit0 long_term MaxPower 39000000uW Limit 0uW TimeWindow 976us
RAPL domain package-0
- Limit0 long_term MaxPower 95000000uW Limit 95000000uW TimeWindow 9994240us
- Limit1 short_term MaxPower 150000000uW Limit 114000000uW TimeWindow 7808us
RAPL domain core
- Limit0 long_term MaxPower NAuW Limit 0uW TimeWindow 976us
RAPL domain dram
- Limit0 long_term MaxPower 39000000uW Limit 0uW TimeWindow 976us

# Compiler
icc (ICC) 20190117
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

Intel(R) MPI Library for Linux* OS, Version 2019 Update 2 Build 20190123 (id: e2d820d49)
Copyright 2003-2019, Intel Corporation.

# dmidecode
dmidecode not executable, so ask your administrator to put the
dmidecode output to a file (configured /etc/dmidecode.txt)

# environment variables
MKL_LIB=-Wl,--start-group /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_sequential.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_core.a -Wl,--end-group -lpthread -lm
MKL_CDFT=-Wl,--start-group  /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_cdft_core.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_thread.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_core.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -lpthread -lm -openmp
MKL_SCALAPACK=/apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.a -Wl,--start-group  /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_thread.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_core.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -lpthread -lm -openmp
MKL_SLIB_THREADED=-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -lpthread -lm -openmp
LESSOPEN=||/usr/bin/ %s
MKL_LIB_THREADED=-Wl,--start-group  /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_intel_thread.a /apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_core.a -Wl,--end-group -lpthread -lm -openmp
MKL_SHLIB=-L/apps/intel/ComposerXE2019/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm