INSPECT

Intra Node Stencil Performance Evaluation Collection

Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz

General

model type Intel Xeon Broadwell EN/EP/EX processor
model name Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
micro-architecture BDW
micro-architecture modeler  
cores per socket 10
cores per NUMA domain 10
cacheline size 64 B
clock 2.2 GHz
NUMA domains per socket 1

This machine file was generated for kerncraft version 0.8.0.

Compiler Flags

icc -O3 -xCORE-AVX2 -fno-alias -qopenmp
gcc -Ofast -march=core-avx2 -fargument-noalias -ffast-math -D_POSIX_C_SOURCE=200112L -fopenmp
clang -O3 -mavx2 -D_POSIX_C_SOURCE=200112L -fopenmp

Flops per Cycle

                  ADD  MUL  FMA  total
Single Precision    8    8   16     32
Double Precision    4    4    8     16
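
The "total" column, the clock, and the core count together give a theoretical peak. The following is a minimal sketch of that arithmetic, not part of the machine file. The helper name peak_gflops is hypothetical, and it assumes the nominal 2.2 GHz clock is sustained, which may not hold under heavy AVX load.

```python
# Values copied from this machine file.
CLOCK_HZ = 2.2e9                         # "clock 2.2 GHz"
CORES_PER_SOCKET = 10                    # "cores per socket 10"
FLOPS_PER_CYCLE = {"SP": 32, "DP": 16}   # "total" column of the table

def peak_gflops(precision, cores=1):
    """Theoretical peak in GFLOP/s for `cores` cores (hypothetical helper)."""
    return FLOPS_PER_CYCLE[precision] * CLOCK_HZ * cores / 1e9

if __name__ == "__main__":
    print(f"DP, one core:   {peak_gflops('DP'):.1f} GFLOP/s")
    print(f"DP, one socket: {peak_gflops('DP', CORES_PER_SOCKET):.1f} GFLOP/s")
    print(f"SP, one socket: {peak_gflops('SP', CORES_PER_SOCKET):.1f} GFLOP/s")
```

Measured kernel performance is normally well below these numbers, since the peak assumes every cycle issues full-width FMA instructions.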

Memory Hierarchy

L1

groups 20
cores per group 1
threads per group 1
transfers overlap  

Cache Per Group

cl_size 64
load_from L2
replacement_policy LRU
sets 64
store_to L2
ways 8
write_allocate true
write_back true

Performance Counter Metrics

accesses MEM_UOPS_RETIRED_LOADS_ALL:PMC[0-3]
misses L1D_REPLACEMENT:PMC[0-3]
evicts L2_TRANS_L1D_WB:PMC[0-3]

L2

groups 20
cores per group 1
threads per group 1
transfers overlap  
non-overlap upstream throughput 64 B/cy, half-duplex

Cache Per Group

cl_size 64
load_from L3
replacement_policy LRU
sets 512
store_to L3
ways 8
write_allocate true
write_back true

Performance Counter Metrics

accesses L1D_REPLACEMENT:PMC[0-3]
misses L2_LINES_IN_ALL:PMC[0-3]
evicts L2_TRANS_L2_WB:PMC[0-3]

L3

groups 2
cores per group 10
threads per group 10
transfers overlap  
non-overlap upstream throughput 32 B/cy, half-duplex

Cache Per Group

cl_size 64
replacement_policy LRU
sets 6400
ways 64
write_allocate true
write_back true
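
The capacity of each cache level follows from the listed parameters as sets x ways x cl_size. A small sketch of that calculation, using only the values given in the "Cache Per Group" blocks of this file (the helper name capacity_bytes is hypothetical):

```python
# (sets, ways, cl_size in bytes) per cache level, as listed in this file.
CACHES = {
    "L1": (64, 8, 64),
    "L2": (512, 8, 64),
    "L3": (6400, 64, 64),
}

def capacity_bytes(sets, ways, cl_size):
    """Capacity of one cache instance (one group) in bytes."""
    return sets * ways * cl_size

if __name__ == "__main__":
    for level, params in CACHES.items():
        print(f"{level}: {capacity_bytes(*params) // 1024} KiB per group")
```

The result is per group: L1 and L2 are private to one core each, while each L3 instance is shared by the 10 cores of its group.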

Performance Counter Metrics

accesses L2_LINES_IN_ALL:PMC[0-3]
misses (LLC_LOOKUP_DATA_READ:CBOX0C[01] + LLC_LOOKUP_DATA_READ:CBOX1C[01] + LLC_LOOKUP_DATA_READ:CBOX2C[01] + LLC_LOOKUP_DATA_READ:CBOX3C[01] + LLC_LOOKUP_DATA_READ:CBOX4C[01] + LLC_LOOKUP_DATA_READ:CBOX5C[01] + LLC_LOOKUP_DATA_READ:CBOX6C[01] + LLC_LOOKUP_DATA_READ:CBOX7C[01] + LLC_LOOKUP_DATA_READ:CBOX8C[01] + LLC_LOOKUP_DATA_READ:CBOX9C[01] + LLC_LOOKUP_DATA_READ:CBOX10C[01] + LLC_LOOKUP_DATA_READ:CBOX11C[01] + LLC_LOOKUP_DATA_READ:CBOX12C[01] + LLC_LOOKUP_DATA_READ:CBOX13C[01] + LLC_LOOKUP_DATA_READ:CBOX14C[01] + LLC_LOOKUP_DATA_READ:CBOX15C[01] + LLC_LOOKUP_DATA_READ:CBOX16C[01] + LLC_LOOKUP_DATA_READ:CBOX17C[01] + LLC_LOOKUP_DATA_READ:CBOX18C[01] + LLC_LOOKUP_DATA_READ:CBOX19C[01] + LLC_LOOKUP_DATA_READ:CBOX20C[01] + LLC_LOOKUP_DATA_READ:CBOX21C[01])
evicts (LLC_VICTIMS_M:CBOX0C[01] + LLC_VICTIMS_M:CBOX1C[01] + LLC_VICTIMS_M:CBOX2C[01] + LLC_VICTIMS_M:CBOX3C[01] + LLC_VICTIMS_M:CBOX4C[01] + LLC_VICTIMS_M:CBOX5C[01] + LLC_VICTIMS_M:CBOX6C[01] + LLC_VICTIMS_M:CBOX7C[01] + LLC_VICTIMS_M:CBOX8C[01] + LLC_VICTIMS_M:CBOX9C[01] + LLC_VICTIMS_M:CBOX10C[01] + LLC_VICTIMS_M:CBOX11C[01] + LLC_VICTIMS_M:CBOX12C[01] + LLC_VICTIMS_M:CBOX13C[01] + LLC_VICTIMS_M:CBOX14C[01] + LLC_VICTIMS_M:CBOX15C[01] + LLC_VICTIMS_M:CBOX16C[01] + LLC_VICTIMS_M:CBOX17C[01] + LLC_VICTIMS_M:CBOX18C[01] + LLC_VICTIMS_M:CBOX19C[01] + LLC_VICTIMS_M:CBOX20C[01] + LLC_VICTIMS_M:CBOX21C[01])
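
Both L3 metrics are the same event summed over CBOX0 through CBOX21. As an illustration of that pattern only, the sketch below rebuilds the two expressions from the event name. The function sum_over_cboxes is a hypothetical helper and not part of kerncraft or LIKWID.

```python
def sum_over_cboxes(event, n_boxes=22, counters="C[01]"):
    """Build '(EVENT:CBOX0C[01] + ... + EVENT:CBOX<n-1>C[01])' as a string."""
    terms = [f"{event}:CBOX{i}{counters}" for i in range(n_boxes)]
    return "(" + " + ".join(terms) + ")"

if __name__ == "__main__":
    print("misses", sum_over_cboxes("LLC_LOOKUP_DATA_READ"))
    print("evicts", sum_over_cboxes("LLC_VICTIMS_M"))
```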

MEM

cores per group 10
threads per group 10
transfers overlap  
non-overlap upstream throughput full socket memory bandwidth, half-duplex

Overlapping Model

Ports:

0, 0DV, 1, 2, 3, 4, 5, 6, 7

Performance Counter Metric

Max(UOPS_EXECUTED_PORT_PORT_0:PMC[0-3], UOPS_EXECUTED_PORT_PORT_1:PMC[0-3], UOPS_EXECUTED_PORT_PORT_4:PMC[0-3], UOPS_EXECUTED_PORT_PORT_5:PMC[0-3], UOPS_EXECUTED_PORT_PORT_6:PMC[0-3], UOPS_EXECUTED_PORT_PORT_7:PMC[0-3])

Non-Overlapping Model

Ports:

2D, 3D

Performance Counter Metric

T_OL + T_L1L2 + T_L2L3 + T_L3MEM
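
As a rough sketch of how such terms are commonly combined in an ECM-style prediction: the transfer terms are derived from the upstream throughputs listed under L2 and L3, and the prediction is the larger of the overlapping time and the sum of the non-overlapping contributions. This follows the usual ECM formulation and is an assumption about how the values in this file are meant to be used. The name t_nol (non-overlapping in-core time) and all cycle counts in the usage example are hypothetical placeholders, not measurements.

```python
CL_SIZE = 64     # B, "cacheline size" from this file
L1_L2_BW = 64    # B/cy, upstream throughput listed under L2
L2_L3_BW = 32    # B/cy, upstream throughput listed under L3

def transfer_cycles(cachelines, bytes_per_cycle, cl_size=CL_SIZE):
    """Cycles needed to move `cachelines` cache lines across one link."""
    return cachelines * cl_size / bytes_per_cycle

def ecm_cycles(t_ol, t_nol, t_l1l2, t_l2l3, t_l3mem):
    """ECM-style prediction: overlapping time vs. summed data-path time."""
    return max(t_ol, t_nol + t_l1l2 + t_l2l3 + t_l3mem)

if __name__ == "__main__":
    # Example: 3 cache lines per unit of work cross each level.
    t_l1l2 = transfer_cycles(3, L1_L2_BW)   # 3.0 cy
    t_l2l3 = transfer_cycles(3, L2_L3_BW)   # 6.0 cy
    # t_ol, t_nol and t_l3mem below are made-up placeholder values.
    print(ecm_cycles(t_ol=4, t_nol=2, t_l1l2=t_l1l2, t_l2l3=t_l2l3, t_l3mem=10))
```

The memory term is left as an input because this file gives the memory throughput only as "full socket memory bandwidth", without a number.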

Benchmarks

Kernels

copy

FLOPs per iteration 0
read streams 1 Stream with 8.00 B
write streams 1 Stream with 8.00 B
read+write streams 0 Streams with 0.00 B

daxpy

FLOPs per iteration 2
read streams 2 Streams with 16.00 B
write streams 1 Stream with 8.00 B
read+write streams 1 Stream with 8.00 B

load

FLOPs per iteration 0
read streams 1 Stream with 8.00 B
write streams 0 Streams with 0.00 B
read+write streams 0 Streams with 0.00 B

triad

FLOPs per iteration 2
read streams 3 Streams with 24.00 B
write streams 1 Stream with 8.00 B
read+write streams 0 Streams with 0.00 B

update

FLOPs per iteration 0
read streams 1 Stream with 8.00 B
write streams 1 Stream with 8.00 B
read+write streams 1 Stream with 8.00 B
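
The stream counts can be turned into an estimate of data traffic per iteration. The sketch below makes two assumptions that are not stated in this file: a read+write stream is counted in both the read and the write totals (consistent with the daxpy entry), and every write stream that is not also read causes one extra write-allocate load, since all cache levels list write_allocate true. The helper name traffic_bytes is hypothetical.

```python
ELEMENT_BYTES = 8  # each stream moves 8.00 B per iteration in this file

# (FLOPs, read streams, write streams, read+write streams) per iteration.
KERNELS = {
    "copy":   (0, 1, 1, 0),
    "daxpy":  (2, 2, 1, 1),
    "load":   (0, 1, 0, 0),
    "triad":  (2, 3, 1, 0),
    "update": (0, 1, 1, 1),
}

def traffic_bytes(reads, writes, read_writes, write_allocate=True):
    """Estimated bytes moved per iteration, including write-allocate loads."""
    # Streams that are written but never read must be loaded before the store.
    wa = (writes - read_writes) if write_allocate else 0
    return (reads + writes + wa) * ELEMENT_BYTES

if __name__ == "__main__":
    for name, (flops, r, w, rw) in KERNELS.items():
        print(f"{name:6s} {traffic_bytes(r, w, rw):3d} B/iteration, {flops} FLOPs")
```

Under these assumptions copy and daxpy both move 24 B per iteration, even though daxpy has one more read stream, because copy pays for a write-allocate load of its target array.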