Information

likwid-perfctr is a lightweight command line application to configure and read out hardware performance monitoring data on supported x86 processors. It can measure either as a wrapper, without changing the measured application, or with Marker API functions inside the code that turn the counters on and off. In addition, it provides a timeline mode and a stethoscope mode. There are preconfigured performance groups with useful event sets and derived metrics. Additionally, arbitrary events can be measured with custom event sets. The Marker API can measure multiple named regions, and the results are accumulated over multiple calls of the same region.

Note that likwid-perfctr measures all events on the specified CPUs and not only the context of the executable. On a highly loaded system it will be hard to determine which part of the given application caused the counter increment. Moreover, it is necessary to ensure that processes and threads are pinned to dedicated resources. You can either pin the application yourself or use the builtin pin functionality.

Options

Option Description

-h, --help Print help message.

-v, --version Print version information.

-V, --verbose <level>

Verbose output during execution for debugging. Possible values for <level>:

0	Output only errors
1	Output some information
2	Output detailed information
3	Output developer information

-i, --info Print CPUID information about the processor and about Intel Performance Monitoring features.

-g, --group <arg> Specify which event string or performance group should be measured.

-c <arg> Defines the CPUs that should be measured
See CPU_expressions on the likwid-pin page for information about the syntax.

-C <arg> Defines the CPUs that should be measured and pins the executable to these CPUs
See CPU_expressions on the likwid-pin page for information about the syntax.

-H Print information about a performance group given with the -g, --group option.

-m Run in marker API mode

-a Print available performance groups for current processor.

-e Print available counters and performance events and suitable options of current processor.

-E <pattern> Print available performance events matching <pattern> and print the usable counters for the found events.
The matching is done with *<pattern>*, so all events matching the substring are returned.

-o, --output <file> Store all output to file instead of stdout. LIKWID can reformat output files according to their suffix.
You can place additional output formatters in folder <PREFIX>/share/likwid/filter. LIKWID ships with one filter script xml written in Perl and a Perl template for developing own output scripts. If the suffix is .csv, the internal CSV printer is used for file output.
Moreover, there are substitutions possible in the output filename. %h is replaced by the host name, %p by the PID, %j by the job ID of batch systems and %r by the MPI rank.

-S <time> Specify the time between starting and stopping of counters. Can be used to monitor applications. Option does not require an executable
Examples for <time> are 1s, 250ms, 500us.

-t <time> Activates the timeline mode, which reads the counters at the given interval <time> during the whole run of the executable.
Examples for <time> are 1s, 250ms, 500us.

-T <time> If multiple event sets are given on the command line, switch to the next group every <time>. Default is 2s.
Examples for <time> are 1s, 250ms, 500us.
If only a single event set is given, the counters are read every 30s by default to catch overflows.

-O Print output in CSV format (conforming to RFC 4180). The output contains some markers that help to parse the output.

-f, --force Configure events even if the counter registers are already in use.

-s, --skip <arg> <arg> must be a bitmask in hex. Threads whose ID corresponds to a set bit in the bitmask are skipped during pinning.
Example: 0x1 = Thread 0 is skipped.

--stats Always print the statistics table.
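
The skip mask of the -s option can be read as one bit per thread ID. The following is a minimal sketch of that interpretation in C, not LIKWID's own implementation; the function names parse_skip_mask and thread_is_skipped are hypothetical, and the limit of 64 threads is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hexadecimal skip mask such as "0x1" (hypothetical helper).
   strtoull with base 16 accepts an optional "0x" prefix. */
uint64_t parse_skip_mask(const char *text)
{
    return (uint64_t)strtoull(text, NULL, 16);
}

/* Return true if the bit for the given thread ID is set in the mask,
   which means the thread is skipped during pinning. */
bool thread_is_skipped(uint64_t mask, unsigned thread_id)
{
    assert(thread_id < 64); /* this sketch handles at most 64 threads */
    return ((mask >> thread_id) & 1u) != 0;
}
```

For example, with the mask 0x5 the bits 0 and 2 are set, so thread_is_skipped returns true for thread IDs 0 and 2 and false for thread ID 1.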

Examples

likwid-perfctr -C 0-2 -g TLB ./a.out
Pin the executable ./a.out to CPUs 0,1,2 and measure on the specified CPUs the performance group TLB. If not set, the environment variable OMP_NUM_THREADS is set to 3.
likwid-perfctr -C 0-4 -g INSTRUCTIONS_RETIRED_SSE:PMC0,CPU_CLOCKS_UNHALTED:PMC3 ./a.out
Pin the executable ./a.out to CPUs 0,1,2,3,4 and measure on the specified CPUs the event set INSTRUCTIONS_RETIRED_SSE:PMC0,CPU_CLOCKS_UNHALTED:PMC3.
The event set consists of two event definitions:
- INSTRUCTIONS_RETIRED_SSE:PMC0 measures event INSTRUCTIONS_RETIRED_SSE using counter register named PMC0
- CPU_CLOCKS_UNHALTED:PMC3 measures event CPU_CLOCKS_UNHALTED using counter register named PMC3. This event can be used to calculate the run time of the application.
likwid-perfctr -C 0 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,UNC_L3_LINES_IN_ANY:UPMC0 ./a.out
Run and pin executable ./a.out on CPU 0 with a custom event set containing three events.
The event set consists of three event definitions:
- INSTR_RETIRED_ANY:FIXC0 measures event INSTR_RETIRED_ANY using Intel's fixed-purpose counter register named FIXC0.
- CPU_CLK_UNHALTED_CORE:FIXC1 measures event CPU_CLK_UNHALTED_CORE using Intel's fixed-purpose counter register named FIXC1. This event can be used to calculate the run time of the application.
- UNC_L3_LINES_IN_ANY:UPMC0 measures event UNC_L3_LINES_IN_ANY using Uncore counter register named UPMC0. Uncore counters are socket-specific, hence LIKWID reads the counter registers only on one CPU per socket.
likwid-perfctr -m -C 0-4 -g INSTRUCTIONS_RETIRED_SSE:PMC0,CPU_CLOCKS_UNHALTED:PMC3 ./a.out
Run and pin the executable to CPUs 0,1,2,3,4 and activate the Marker API. The code in a.out is assumed to be instrumented with LIKWID's Marker API. Only the marked code regions are measured.
- INSTRUCTIONS_RETIRED_SSE:PMC0 measures event INSTRUCTIONS_RETIRED_SSE using counter register named PMC0.
- CPU_CLOCKS_UNHALTED:PMC3 measures event CPU_CLOCKS_UNHALTED using counter register named PMC3. This event can be used to calculate the run time of the application.
The Marker API for C/C++ offers functions to measure named regions. You can use instrumented code with and without LIKWID. In order to activate the Marker API, -DLIKWID_PERFMON needs to be added to the compiler call. The following list briefly describes six of these functions (for the complete list see Marker API):
- LIKWID_MARKER_INIT: Initialize LIKWID globally. Must be called in serial region and only once.
- LIKWID_MARKER_THREADINIT: Initialize LIKWID for each thread. Must be called in parallel region and executed by every thread.
- LIKWID_MARKER_START("compute"): Start a code region and associate it with the name "compute". The names are freely selectable and are used for grouping and outputting regions.
- LIKWID_MARKER_STOP("compute"): Stop the code region associated with the name "compute".
- LIKWID_MARKER_SWITCH: Switches to the next performance group or event set in a round-robin fashion. Can be used to measure the same region with multiple events. If called inside a code region, the results for all groups will be faulty. Be aware that each programming of the config registers causes overhead.
- LIKWID_MARKER_CLOSE: Finalize LIKWID globally. Should be called in the end of your application. This writes out all region results to a file that is picked up by likwid-perfctr for evaluation.
likwid-perfctr -c 0-3 -g FLOPS_DP -t 300ms ./a.out 2> out.txt
Runs the executable a.out and measures the performance group FLOPS_DP on CPUs 0,1,2,3 every 300 ms. Since -c is used, the application is not pinned to the CPUs and OMP_NUM_THREADS is not set. The performance group FLOPS_DP is not available on every architecture; use likwid-perfctr -a for a complete list. Please note that likwid-perfctr writes the timeline measurements to stderr, while the application's output and LIKWID's final results are printed to stdout.
The syntax of the timeline mode output lines is:
<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> <Event1_Thread1> <Event1_Thread2> ... <EventN_ThreadN>
You can also use the tool likwid-perfscope to print the measured values live with gnuplot.
likwid-perfctr -c 0-3 -g FLOPS_DP -S 2s
Measures the performance group FLOPS_DP on CPUs 0,1,2,3 for 2 seconds. This option can be used to measure an application from the outside or to perform low-level system monitoring.
likwid-perfctr -c S0:0@S1:0 -g LLC_LOOKUPS_DATA_READ:CBOX0C0:STATE=0x9 -S 2s
Measures the event LLC_LOOKUPS_DATA_READ on the first CPU of socket 0 and the first CPU on socket 1 for 2 seconds using the counter 0 in CBOX 0 (LLC cache coherency engine). The counting is filtered to only lookups in the 'invalid' and 'modified' state. Look at the microarchitecture Uncore documentation for possible bitmasks. Which option is available for which counter class can be found in section Supported Architectures.
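
The timeline output syntax given with the -t example above (<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> followed by the values) can be parsed with a few lines of C. The following is a sketch under the assumption that the fields are separated by whitespace and that the values are ordered event by event, as the syntax line suggests; the type TimelineSample and the functions parse_timeline_line and timeline_value are hypothetical names and not part of LIKWID.

```c
#include <assert.h>
#include <stdlib.h>

/* One parsed timeline line; the field names are illustrative. */
typedef struct {
    int group_id;
    int num_events;
    int num_threads;
    double timestamp;
    double *values; /* num_events * num_threads entries, event-major */
} TimelineSample;

/* Parse one timeline line. Returns 0 on success and -1 on malformed
   input. On success the caller must free(sample->values). */
int parse_timeline_line(const char *line, TimelineSample *sample)
{
    const char *pos = line;
    char *end = NULL;
    long header[3];

    sample->values = NULL;
    /* <groupID> <numberOfEvents> <numberOfThreads> */
    for (int i = 0; i < 3; i++) {
        header[i] = strtol(pos, &end, 10);
        if (end == pos) return -1;
        pos = end;
    }
    if (header[1] <= 0 || header[2] <= 0) return -1;

    /* <Timestamp> */
    double timestamp = strtod(pos, &end);
    if (end == pos) return -1;
    pos = end;

    /* One value per event and thread. */
    size_t count = (size_t)header[1] * (size_t)header[2];
    double *values = malloc(count * sizeof *values);
    if (values == NULL) return -1;
    for (size_t i = 0; i < count; i++) {
        values[i] = strtod(pos, &end);
        if (end == pos) { free(values); return -1; }
        pos = end;
    }

    sample->group_id = (int)header[0];
    sample->num_events = (int)header[1];
    sample->num_threads = (int)header[2];
    sample->timestamp = timestamp;
    sample->values = values;
    return 0;
}

/* Value of the given event on the given thread (both 0-based). */
double timeline_value(const TimelineSample *s, int event, int thread)
{
    assert(event >= 0 && event < s->num_events);
    assert(thread >= 0 && thread < s->num_threads);
    return s->values[(size_t)event * (size_t)s->num_threads + (size_t)thread];
}
```

Since the timeline lines are written to stderr, a file such as out.txt from the example above would be read line by line and each line passed to parse_timeline_line. Lines that do not follow the syntax are rejected with -1.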

Performance groups

One of the outstanding features of LIKWID is its performance groups. Each microarchitecture has its own set of events and related counters, and finding the suitable events in the documentation is tedious. Moreover, the raw results of the events are often not meaningful on their own; they need to be combined with other quantities such as run time or clock speed. LIKWID addresses these problems by providing performance groups that specify a set of event and counter combinations as well as a set of derived metrics. Starting with LIKWID 4, the performance group definitions are no longer compiled in; they are read on the fly when they are selected on the command line. This enables users to define their own performance groups without recompiling and reinstalling LIKWID.
Please note that performance groups are a feature of the Lua API and are not available in the C/C++ API.

Directory structure

During the installation of LIKWID, the performance groups are copied to the path ${INSTALL_PREFIX}/share/likwid. In this folder there is one subfolder per microarchitecture that contains all performance groups for that microarchitecture. The folder names are not freely selectable; they are defined in src/topology.c. For every microarchitecture supported at the time of release, there is already a folder that can be extended with your own performance groups. You can change the path to the performance group directory structure by setting the variable likwid.groupfolder in your Lua application; the default is ${INSTALL_PREFIX}/share/likwid.

Syntax of performance group files

SHORT <string> // Short description of the performance group

EVENTSET // Starts the event set definition
<counter>(:<options>) <event> // Each line defines one event/counter combination with optional options.
FIXC0 INSTR_RETIRED_ANY // Example

METRICS // Starts the derived metric definitions
<metricname> <formula> // Each line defines one derived metric. <metricname> can contain spaces, <formula> must be free of spaces. The counter names (with options) and the variables time and inverseClock can be used as variables in <formula>.
CPI FIXC1/FIXC0 // Example

LONG // Starts the detailed description of the performance group
<TEXT> // <TEXT> is displayed with -H commandline option
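
To make the METRICS lines more concrete, the following C sketch evaluates two formulas by hand: the CPI example FIXC1/FIXC0 from the syntax above and a run time computed as FIXC1*inverseClock. The sketch hard-codes the formulas instead of parsing them from a group file. The type FixedCounts and both function names are hypothetical, and treating inverseClock as the reciprocal of the clock frequency in Hz is an assumption.

```c
#include <assert.h>

/* Raw counts of the two fixed-purpose counters used in the formulas. */
typedef struct {
    double fixc0; /* INSTR_RETIRED_ANY */
    double fixc1; /* CPU_CLK_UNHALTED_CORE */
} FixedCounts;

/* Cycles per instruction, i.e. the formula FIXC1/FIXC0. */
double metric_cpi(FixedCounts c)
{
    assert(c.fixc0 > 0.0); /* avoid a division by zero */
    return c.fixc1 / c.fixc0;
}

/* Run time in seconds from unhalted cycles, i.e. FIXC1*inverseClock.
   Assumption: inverse_clock is 1 / (clock frequency in Hz). */
double metric_runtime_unhalted(FixedCounts c, double inverse_clock)
{
    return c.fixc1 * inverse_clock;
}
```

With 2e9 retired instructions and 3e9 unhalted cycles, metric_cpi yields 1.5. At an assumed clock of 2 GHz (inverse_clock = 0.5e-9), the same cycle count corresponds to roughly 1.5 seconds.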

Marker API

The Marker API enables measurement of user-defined code regions in order to get deeper insight into what is happening at a specific point in the application. The Marker API itself has 8 commands. In order to activate the Marker API, the code must be compiled with -DLIKWID_PERFMON. If the code is compiled without this define, the Marker API functions perform no operation and cause no overhead. You can also run code compiled with LIKWID_PERFMON defined without measurements, but a message will be printed.
Even pure serial applications have to call LIKWID_MARKER_THREADINIT to initialize the accessDaemon or the direct accesses.
The names for the regions can be freely chosen but whitespaces are not allowed.

C/C++ Code

Original code

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int i=0;
    double sum = 0;
#pragma omp parallel for reduction(+:sum)
    for(i=0;i<100000;i++)
    {
        sum += 1.0/(omp_get_thread_num()+1);
    }
    printf("Sum is %f\n", sum);
    return 0;
}

Instrumented code

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <likwid.h>

int main(int argc, char* argv[])
{
    int i=0;
    double sum = 0;
    LIKWID_MARKER_INIT;
#pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;
    }
#pragma omp parallel
    {
        LIKWID_MARKER_START("sum");
#pragma omp for reduction(+:sum)
        for(i=0;i<100000;i++)
        {
            sum += 1.0/(omp_get_thread_num()+1);
        }
        LIKWID_MARKER_STOP("sum");
    }
    printf("Sum is %f\n", sum);
    LIKWID_MARKER_CLOSE;
    return 0;
}

The LIKWID package contains an example code: see Marker API in a C/C++ application or Marker API in a Fortran90 application.

Running code

With the help of likwid-perfctr the counters are configured to the selected events. The counters are also started and stopped by likwid-perfctr, the Marker API only reads the counters to minimize the overhead of the instrumented application. Only if you use LIKWID_MARKER_SWITCH the Marker API itself configures a new event set to the registers. Basically, likwid-perfctr exports the whole configuration needed by the Marker API through environment variables that are evaluated during LIKWID_MARKER_INIT. In the end, likwid-perfctr picks up the file with the results of the Marker API run and prints out the performance results.
In order to build your instrumented application:
$CC -openmp -DLIKWID_PERFMON -L<PATH_TO_LIKWID_LIBRARY> -I<PATH_TO_LIKWID_INCLUDES> <SRC_CODE> -o <EXECUTABLE> -llikwid
The OpenMP flag depends on the compiler (for example, GCC uses -fopenmp). With a standard installation, the paths are <PATH_TO_LIKWID_LIBRARY>=/usr/local/lib and <PATH_TO_LIKWID_INCLUDES>=/usr/local/include
Example Marker API call:
likwid-perfctr -C 0-4 -g L3 -m ./a.out

Fortran Code

Besides the Marker API for C/C++ programs, LIKWID can build a Fortran module to access the Marker API functions from Fortran. Only the Marker API calls are exported, not the whole API. In config.mk the variable FORTRAN_INTERFACE must be set to true. By default, LIKWID uses the Intel Fortran compiler to build the interface, but this can be changed to GCC's Fortran compiler in make/include_<COMPILER>.
The LIKWID package contains an example code: see Marker API in a Fortran90 application.

Hints for the usage of the Marker API

Since the calls to the LIKWID library are executed by your application, its runtime will increase, and in specific circumstances other problems can arise, for example with the time measurement. You can execute LIKWID_MARKER_THREADINIT and LIKWID_MARKER_START inside the same parallel region, but put a barrier between the calls to ensure that there is no big timing difference between the threads. The common way is to initialize LIKWID and the participating threads inside an initialization routine, use only START and STOP in your code, and close the Marker API in a finalization routine. Be aware that at the first start of a region, the thread-local hash table gets a new entry to store the measured values. If the code inside the region is short or you execute the region only once, the overhead of creating the hash table entry can be significant compared to the execution of the region code. This overhead can be paid in advance by using the LIKWID_MARKER_REGISTER function. It must be called by each thread, once for each compute region. It is completely optional; if it is omitted, the first LIKWID_MARKER_START performs the same operations.
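
The hints above can be sketched in C as follows: an initialization routine that sets up LIKWID and registers the region in every thread, a compute routine that only uses START and STOP, and a finalization routine. This is an illustrative layout, not code from the LIKWID package. The function names marker_setup, measured_sum and marker_teardown are hypothetical, and the fallback macros in the #else branch are an assumption added so that the sketch also builds on a system without the LIKWID header.

```c
#include <assert.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* Assumed fallback: empty macros so the code builds without LIKWID. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Initialization routine: set up LIKWID globally and per thread, and
   create the hash table entry for region "sum" ahead of time. */
void marker_setup(void)
{
    LIKWID_MARKER_INIT;
#pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;
        LIKWID_MARKER_REGISTER("sum");
    }
}

/* Compute routine: only START and STOP appear here. */
double measured_sum(int n)
{
    assert(n >= 0);
    double sum = 0.0;
#pragma omp parallel
    {
        /* Optional barrier: the threads enter the region together. */
#pragma omp barrier
        LIKWID_MARKER_START("sum");
#pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += 0.5;
        }
        LIKWID_MARKER_STOP("sum");
    }
    return sum;
}

/* Finalization routine: write out the region results. */
void marker_teardown(void)
{
    LIKWID_MARKER_CLOSE;
}
```

An application would call marker_setup once at startup, measured_sum as often as needed, and marker_teardown once before exiting. When built with -DLIKWID_PERFMON and linked against LIKWID, the binary is then run under likwid-perfctr with the -m option.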
