INSPECT

Intra Node Stencil Performance Evaluation Collection

Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz

Data has been reviewed

Julian Hornich says:

The level-two (L2) cache bandwidth in this machine file is optimistic: Intel states up to 64 B/cy, but this value is rarely reached in practice.
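To put the reviewer's note in numbers, the following minimal sketch converts an L2 bandwidth in B/cy into cycles per cache line and into GB/s at the nominal clock. The 64 B/cy figure, the 64 B cache line and the 2.3 GHz clock come from this page; the 32 B/cy value is only an illustrative assumption for a lower sustained rate, not a measurement, and the function names are hypothetical.

```python
CACHELINE_BYTES = 64   # "cacheline size" from the General section
CLOCK_GHZ = 2.3        # nominal clock from the General section

def cycles_per_cacheline(bytes_per_cycle):
    """Cycles needed to move one cache line at the given bandwidth."""
    return CACHELINE_BYTES / bytes_per_cycle

def bandwidth_gbs(bytes_per_cycle):
    """Per-core bandwidth in GB/s, assuming the nominal clock is sustained."""
    return bytes_per_cycle * CLOCK_GHZ

# 64 B/cy is the documented maximum; 32 B/cy is an assumed example value.
for bw in (64, 32):
    print(bw, "B/cy:", cycles_per_cacheline(bw), "cy per cache line,",
          round(bandwidth_gbs(bw), 1), "GB/s")
```

A model built on the optimistic 64 B/cy value will therefore tend to underestimate the L1-L2 transfer time whenever the sustained rate is lower.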


General

model type Intel Xeon Haswell EN/EP/EX processor
model name Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
micro-architecture  
micro-architecture modeler  
cores per socket 14
cores per NUMA domain 7
cacheline size 64 B
clock 2.3 GHz
NUMA domains per socket 2

This machine file was generated for kerncraft version 0.8.6.dev0.

Compiler Flags

icc -O3 -xCORE-AVX2 -fno-alias -qopenmp -ffreestanding -nolib-inline
clang -O3 -mavx2 -D_POSIX_C_SOURCE=200809L -fopenmp -ffreestanding
gcc -O3 -march=core-avx2 -D_POSIX_C_SOURCE=200809L -fopenmp -lm -ffreestanding

Flops per Cycle

                  ADD  MUL  FMA  total
Single Precision    8    8   16     32
Double Precision    4    4    8     16
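The "total" column can be turned into a theoretical peak performance by multiplying with the clock and the core count listed under General. This is a minimal sketch under the assumption that the nominal 2.3 GHz clock is sustained under AVX2 load, which may not hold in practice; the function name is hypothetical.

```python
CLOCK_GHZ = 2.3                          # nominal clock from the General section
CORES_PER_SOCKET = 14                    # from the General section
FLOPS_PER_CYCLE = {"SP": 32, "DP": 16}   # "total" column of the table

def peak_gflops(precision, cores=1):
    """Theoretical peak in GFLOP/s for the given precision and core count."""
    return FLOPS_PER_CYCLE[precision] * CLOCK_GHZ * cores

print("DP, one core:   ", round(peak_gflops("DP"), 1), "GFLOP/s")
print("DP, full socket:", round(peak_gflops("DP", CORES_PER_SOCKET), 1), "GFLOP/s")
print("SP, full socket:", round(peak_gflops("SP", CORES_PER_SOCKET), 1), "GFLOP/s")
```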

Memory Hierarchy

L1

groups 28
cores per group 1
threads per group 2
transfers overlap false

Cache Per Group

sets 64
ways 8
cl_size 64
replacement_policy LRU
write_allocate true
write_back true
load_from L2
store_to L2

Performance Counter Metrics

accesses MEM_UOPS_RETIRED_LOADS:PMC[0-3] + MEM_UOPS_RETIRED_STORES:PMC[0-3]
misses L1D_REPLACEMENT:PMC[0-3]
evicts L1D_M_EVICT:PMC[0-3]

L2

groups 28
cores per group 1
threads per group 2
transfers overlap false

Cache Per Group

sets 512
ways 8
cl_size 64
replacement_policy LRU
write_allocate true
write_back true
load_from L3
store_to L3

Performance Counter Metrics

accesses L1D_REPLACEMENT:PMC[0-3] + L1D_M_EVICT:PMC[0-3]
misses L2_LINES_IN_ALL:PMC[0-3]
evicts L2_TRANS_L2_WB:PMC[0-3]
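The metric definitions for L1 and L2 chain together: what L1 reports as misses and evicts is exactly what L2 counts as accesses. The sketch below evaluates both sets of formulas on hypothetical counter readings. The event names are taken from the listings above (without the ":PMC[0-3]" register suffix); the numbers and the function names are made up for illustration.

```python
# Hypothetical raw counter readings for one measurement (illustrative only).
counters = {
    "MEM_UOPS_RETIRED_LOADS": 1_000_000,
    "MEM_UOPS_RETIRED_STORES": 500_000,
    "L1D_REPLACEMENT": 125_000,
    "L1D_M_EVICT": 62_500,
    "L2_LINES_IN_ALL": 40_000,
    "L2_TRANS_L2_WB": 20_000,
}

def l1_metrics(c):
    """L1 metrics as defined in the L1 'Performance Counter Metrics' listing."""
    return {
        "accesses": c["MEM_UOPS_RETIRED_LOADS"] + c["MEM_UOPS_RETIRED_STORES"],
        "misses": c["L1D_REPLACEMENT"],
        "evicts": c["L1D_M_EVICT"],
    }

def l2_metrics(c):
    """L2 metrics as defined in the L2 'Performance Counter Metrics' listing."""
    return {
        "accesses": c["L1D_REPLACEMENT"] + c["L1D_M_EVICT"],
        "misses": c["L2_LINES_IN_ALL"],
        "evicts": c["L2_TRANS_L2_WB"],
    }

l1, l2 = l1_metrics(counters), l2_metrics(counters)
print("L1:", l1)
print("L2:", l2)
```

Note that the L1 accesses are counted in retired memory uops, while misses and evicts are counted in cache lines, so ratios between them mix units.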

L3

groups 4
cores per group 7
threads per group 14
transfers overlap false

Cache Per Group

sets 9216
ways 16
cl_size 64
replacement_policy LRU
write_allocate true
write_back true
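The capacity of each cache group follows from the listed geometry as sets × ways × cl_size. The sketch below applies that product to the values given for L1, L2 and L3 on this page; the dictionary layout and function name are hypothetical.

```python
# Geometry per cache group, copied from the "Cache Per Group" listings.
CACHES = {
    "L1": {"sets": 64, "ways": 8, "cl_size": 64},
    "L2": {"sets": 512, "ways": 8, "cl_size": 64},
    "L3": {"sets": 9216, "ways": 16, "cl_size": 64},
}

def cache_size_bytes(level):
    """Capacity of one cache group in bytes: sets * ways * cache line size."""
    c = CACHES[level]
    return c["sets"] * c["ways"] * c["cl_size"]

for level in CACHES:
    print(level, cache_size_bytes(level) // 1024, "KiB per group")
```

With these inputs, L1 and L2 come out at 32 KiB and 256 KiB per core, and each L3 group (shared by 7 cores) at 9216 KiB.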

Performance Counter Metrics

accesses L2_LINES_IN_ALL:PMC[0-3] + L2_TRANS_L2_WB:PMC[0-3]
misses (CAS_COUNT_RD:MBOX0C[01] + CAS_COUNT_RD:MBOX1C[01] + CAS_COUNT_RD:MBOX2C[01] + CAS_COUNT_RD:MBOX3C[01] + CAS_COUNT_RD:MBOX4C[01] + CAS_COUNT_RD:MBOX5C[01] + CAS_COUNT_RD:MBOX6C[01] + CAS_COUNT_RD:MBOX7C[01])
evicts (CAS_COUNT_WR:MBOX0C[01] + CAS_COUNT_WR:MBOX1C[01] + CAS_COUNT_WR:MBOX2C[01] + CAS_COUNT_WR:MBOX3C[01] + CAS_COUNT_WR:MBOX4C[01] + CAS_COUNT_WR:MBOX5C[01] + CAS_COUNT_WR:MBOX6C[01] + CAS_COUNT_WR:MBOX7C[01])

MEM

cores per group 14
threads per group 28
transfers overlap false

Overlapping Model

Ports:

IACA: 0, 0DV, 1, 2, 3, 4, 5, 6, 7
OSACA: 0, 0DV, 1, 2, 3, 4, 5, 6, 7
LLVM-MCA: HWDivider, HWFPDivider, HWPort0, HWPort1, HWPort2, HWPort3, HWPort4, HWPort5, HWPort6, HWPort7

Performance Counter Metric

Max(UOPS_EXECUTED_PORT_PORT_0:PMC[0-3], UOPS_EXECUTED_PORT_PORT_1:PMC[0-3], UOPS_EXECUTED_PORT_PORT_4:PMC[0-3], UOPS_EXECUTED_PORT_PORT_5:PMC[0-3], UOPS_EXECUTED_PORT_PORT_6:PMC[0-3], UOPS_EXECUTED_PORT_PORT_7:PMC[0-3])
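The overlapping metric takes the busiest of the listed execution ports. The sketch below applies that Max to hypothetical per-port uop counts. Ports 2 and 3 are not part of the metric above, so they are deliberately excluded here even though the sample data contains them; the counts and the function name are made up for illustration.

```python
# Ports that appear in the overlapping Max(...) metric above.
OVERLAPPING_PORTS = (0, 1, 4, 5, 6, 7)

def overlapping_metric(uops_per_port):
    """Maximum uop count over the ports used by the overlapping metric."""
    return max(uops_per_port[p] for p in OVERLAPPING_PORTS)

# Hypothetical UOPS_EXECUTED_PORT_PORT_<n> readings (illustrative only).
sample = {0: 400, 1: 380, 2: 900, 3: 880, 4: 450, 5: 120, 6: 90, 7: 200}
print(overlapping_metric(sample))
```

In this sample the load ports 2 and 3 carry the most uops, but the metric reports port 4 as the busiest overlapping port.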

Non-Overlapping Model

Ports:

IACA: 2D, 3D
OSACA: 2D, 3D
LLVM-MCA: HWPort2, HWPort3

Performance Counter Metric

T_nOL + T_L1L2 + T_L2L3 + T_L3MEM
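The sum above adds the in-core non-overlapping time to the transfer times between adjacent memory levels, consistent with "transfers overlap false" at every level. The sketch below evaluates that sum and, as an assumption borrowed from the usual ECM formulation for Intel cores rather than from this page, combines it with an overlapping time T_OL by taking the maximum. All cycle counts and function names are hypothetical.

```python
def non_overlapping_time(t_nol, t_l1l2, t_l2l3, t_l3mem):
    """T_nOL + T_L1L2 + T_L2L3 + T_L3MEM, all in cycles per unit of work."""
    return t_nol + t_l1l2 + t_l2l3 + t_l3mem

def ecm_prediction(t_ol, t_nol, t_l1l2, t_l2l3, t_l3mem):
    """Assumed ECM-style combination: overlapping work hides behind data transfers."""
    return max(t_ol, non_overlapping_time(t_nol, t_l1l2, t_l2l3, t_l3mem))

# Hypothetical cycle counts for a data set residing in main memory.
print(non_overlapping_time(4, 3, 6, 10))
print(ecm_prediction(8, 4, 3, 6, 10))
# A compute-heavy case where the overlapping part dominates.
print(ecm_prediction(30, 4, 3, 6, 10))
```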

Benchmarks

Kernels

copy

FLOPs per iteration 0
read streams 1 stream with 8.00 B
write streams 1 stream with 8.00 B
read+write streams 0 streams with 0.00 B

daxpy

FLOPs per iteration 2
read streams 2 streams with 16.00 B
write streams 1 stream with 8.00 B
read+write streams 1 stream with 8.00 B

load

FLOPs per iteration 0
read streams 1 stream with 8.00 B
write streams 0 streams with 0.00 B
read+write streams 0 streams with 0.00 B

triad

FLOPs per iteration 2
read streams 3 streams with 24.00 B
write streams 1 stream with 8.00 B
read+write streams 0 streams with 0.00 B

update

FLOPs per iteration 0
read streams 1 stream with 8.00 B
write streams 1 stream with 8.00 B
read+write streams 1 stream with 8.00 B
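The per-kernel stream figures can be combined into a data volume and an arithmetic intensity per iteration. The sketch below assumes that the byte counts are totals per iteration and that read+write streams are already included in both the read and the write figures, so the explicit traffic is simply read plus write bytes. It also ignores write-allocate transfers, which would add traffic for streams that are only written. The dictionary layout and function names are hypothetical.

```python
# FLOPs and stream bytes per iteration, copied from the kernel listings.
KERNELS = {
    "copy":   {"flops": 0, "read": 8,  "write": 8},
    "daxpy":  {"flops": 2, "read": 16, "write": 8},
    "load":   {"flops": 0, "read": 8,  "write": 0},
    "triad":  {"flops": 2, "read": 24, "write": 8},
    "update": {"flops": 0, "read": 8,  "write": 8},
}

def bytes_per_iteration(name):
    """Explicit read plus write volume per iteration, without write-allocate."""
    k = KERNELS[name]
    return k["read"] + k["write"]

def arithmetic_intensity(name):
    """FLOPs per byte of explicit traffic."""
    return KERNELS[name]["flops"] / bytes_per_iteration(name)

for name in KERNELS:
    print(name, bytes_per_iteration(name), "B/it,",
          round(arithmetic_intensity(name), 4), "FLOP/B")
```

All five kernels have a very low intensity under these assumptions, which is what makes them suitable as bandwidth benchmarks.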