Information

likwid-bench is a benchmark suite of low-level (assembly) benchmarks that measure bandwidth and instruction throughput for specific instruction code on x86 systems. The included benchmarks cover common data access patterns such as load and store as well as calculations such as vector triad and sum. likwid-bench provides architecture-specific benchmarks for x86, x86_64 and x86 for Intel Xeon Phi coprocessors. The performance values can either be calculated by likwid-bench itself or measured with hardware performance counters by using likwid-perfctr as a wrapper around likwid-bench. The latter requires building likwid-bench with instrumentation enabled in config.mk (INSTRUMENT_BENCH).

Options

Option Description

-h Print help message

-a List all available benchmarks

-p List all available thread affinity domains

-i <iters> Use <iters> iterations of the benchmark kernel

-d <delim> Use <delim> instead of ',' for the output of -p

-l <test> List characteristics of <test>, such as the number of streams and the data used per loop iteration

-t <test> Perform assembly benchmark <test>

-s <min_time> Minimum time in seconds to run the benchmark.
Based on this time, the iteration count is determined automatically to provide reliable results. The default is 1. If the determined iteration count is below 10, it is raised to 10.

-w <workgroup>

Set a workgroup for the benchmark. A workgroup can have different formats:

Format	Description
<affinity_domain>:<size>	Allocate <size> in total in affinity domain <affinity_domain>. `likwid-bench` starts as many threads as are available in affinity domain <affinity_domain>.
<affinity_domain>:<size>:<num_threads>	Allocate <size> in total in affinity domain <affinity_domain>. `likwid-bench` starts <num_threads> threads in affinity domain <affinity_domain>.
<affinity_domain>:<size>:<num_threads>:<chunk_size>:<stride>	Allocate <size> in total in affinity domain <affinity_domain>. `likwid-bench` starts <num_threads> threads in affinity domain <affinity_domain>, selecting <chunk_size> threads in a row with a distance of <stride>. See CPU_expressions on the `likwid-pin` page for further information.
<above_formats>-<streamID>:<stream_domain>	In combination with any of the formats above, the test streams (arrays, vectors) can be placed in affinity domains different from those of the threads. This is achieved by appending a stream placement option -<streamID>:<stream_domain> for all streams of the test to the workgroup definition. The stream with <streamID> is placed in affinity domain <stream_domain>. The number of streams of a test can be determined with the -l <test> command-line option.
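
To make the workgroup formats above more concrete, the following Python sketch splits a workgroup string into its fields. It is not part of likwid-bench; the names `Workgroup` and `parse_workgroup` are hypothetical. It assumes that affinity domain names contain no `-` and that several stream placements are separated by commas, as in the `-0:S1,1:S1` example on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Workgroup:
    """Fields of a workgroup string (hypothetical helper type)."""
    domain: str
    size: str
    num_threads: Optional[int] = None
    chunk_size: Optional[int] = None
    stride: Optional[int] = None
    # Maps <streamID> to <stream_domain>; empty if no placement was given.
    streams: Dict[int, str] = field(default_factory=dict)


def parse_workgroup(spec: str) -> Workgroup:
    """Split a workgroup string into its documented fields."""
    # Assumption: the first '-' starts the optional stream placement part.
    main, _, placement = spec.partition("-")
    parts = main.split(":")
    # Only the three documented field counts are accepted.
    if len(parts) not in (2, 3, 5) or not all(parts):
        raise ValueError(f"unsupported workgroup format: {spec!r}")

    wg = Workgroup(domain=parts[0], size=parts[1])
    if len(parts) >= 3:
        wg.num_threads = int(parts[2])
    if len(parts) == 5:
        wg.chunk_size = int(parts[3])
        wg.stride = int(parts[4])

    if placement:
        for item in placement.split(","):
            stream_id, sep, stream_domain = item.partition(":")
            if not sep or not stream_domain:
                raise ValueError(f"invalid stream placement: {item!r}")
            wg.streams[int(stream_id)] = stream_domain
    return wg


if __name__ == "__main__":
    print(parse_workgroup("S0:1GB:2:1:2-0:S1,1:S1"))
```

The size is kept as a string (for example `1GB`) because the set of accepted size units is not specified here.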

Examples

likwid-bench -t copy -w S0:100kB
Run test copy using all threads in affinity domain S0. The input and output streams of the copy benchmark sum up to 100kB and are placed in affinity domain S0. The iteration count is calculated automatically.
likwid-bench -t triad -i 100 -w S0:1GB:2:1:2
Run test triad using 2 threads in affinity domain S0. Assuming S0 = 0,4,1,5, the threads are pinned to CPUs 0 and 1, skipping one thread during selection. The streams of the triad benchmark sum up to 1GB and are placed in affinity domain S0. The number of iterations is explicitly set to 100.
likwid-bench -t update -w S0:100kB -w S1:100kB
Run test update using all threads in affinity domains S0 and S1. The threads scheduled on S0 use streams that sum up to 100kB. Likewise, the threads for S1 are placed there, so each workgroup works only on its socket-local streams. The results of both workgroups are combined.
likwid-perfctr -c E:S0:4 -g MEM -m likwid-bench -t update -w S0:100kB:4
Run test update using 4 threads in affinity domain S0. The streams of the update benchmark sum up to 100kB and are placed in affinity domain S0. The benchmark execution is measured using the Marker_API. It measures the MEM performance group on the first four CPUs of the S0 affinity domain. For further information about hardware performance counters, see likwid-perfctr.
Note: Currently, the threads cannot be pinned by likwid-perfctr. The pinning is done by likwid-bench.
likwid-bench -t copy -w S0:1GB:2:1:2-0:S1,1:S1
Run test copy using 2 threads in affinity domain S0, skipping one thread during selection. The two streams used in the copy benchmark have the IDs 0 and 1 and a combined size of 1GB. Both streams are placed in affinity domain S1.
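
The thread selection used in the `S0:1GB:2:1:2` examples above can be illustrated with a short Python sketch. It is not the likwid-bench implementation; the function `select_cpus` is hypothetical. It assumes that <stride> is the distance between the starts of successive chunks in the CPU list of the affinity domain, which is consistent with the example S0 = 0,4,1,5 yielding CPUs 0 and 1.

```python
from typing import List


def select_cpus(domain_cpus: List[int], num_threads: int,
                chunk_size: int = 1, stride: int = 1) -> List[int]:
    """Pick num_threads CPUs from domain_cpus in chunks of chunk_size.

    Assumption: each chunk starts stride entries after the previous one.
    """
    if chunk_size < 1 or stride < 1:
        raise ValueError("chunk_size and stride must be positive")
    if chunk_size > stride:
        # Overlapping chunks would select the same CPU twice in this sketch.
        raise ValueError("chunk_size must not exceed stride")

    selected: List[int] = []
    start = 0
    while len(selected) < num_threads and start < len(domain_cpus):
        # Take up to chunk_size consecutive entries, then jump by stride.
        for cpu in domain_cpus[start:start + chunk_size]:
            if len(selected) < num_threads:
                selected.append(cpu)
        start += stride

    if len(selected) < num_threads:
        raise ValueError("affinity domain has too few CPUs for this selection")
    return selected


if __name__ == "__main__":
    # S0 = 0,4,1,5 with 2 threads, chunk size 1 and stride 2
    print(select_cpus([0, 4, 1, 5], 2, 1, 2))  # [0, 1]
```

The sketch raises an error when the domain is exhausted before <num_threads> CPUs are found; how likwid-bench itself handles that case is not covered here.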

*/