Add pstats macro #6

bicycle1885 · 2020-06-20T06:01:42Z

This is a port of my @pstats macro I post on Twitter. This is still a work in progress, but I would like to get your feedback.

Here is a simple test function called tarai.

function tarai(x, y, z)
    if x ≤ y
        return y
    else
        return tarai(tarai(x-1, y, z), tarai(y-1, z, x), tarai(z-1, x, y))
    end
end

And this is an example of default behavior compared to @measure:

julia> @pstats tarai(14, 2, 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.29e+09   49.8%  #  3.7 cycles per ns
│ stalled-cycles-frontend  4.16e+06   49.8%  #  0.2% of cycles
└ stalled-cycles-backend   1.11e+09   49.8%  # 48.5% of cycles
┌ instructions             6.19e+09   50.2%  #  2.7 insns per cycle
│ branch-instructions      7.73e+08   50.2%  # 12.5% of instructions
└ branch-misses            3.60e+06   50.2%  #  0.5% of branch instructions
┌ task-clock               6.12e+08  100.0%
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @measure tarai(14, 2, 0)
┌───────────────────────┬───────────────┬─────────────┐
│                       │ Events        │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│             hw:cycles │ 1,522,873,900 │ 66.5 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_access │ 1,183,473     │ 66.6 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_misses │ 311,611       │ 66.6 %      │
├───────────────────────┼───────────────┼─────────────┤
│           hw:branches │ 519,874,250   │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 2,595,129     │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:instructions │ 4,142,317,183 │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│       sw:ctx_switches │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│        sw:page_faults │ 1             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:minor_page_faults │ 1             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:major_page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│     sw:cpu_migrations │ 0             │ 100.0 %     │
└───────────────────────┴───────────────┴─────────────┘

The most important difference is @pstats scales the counter by the reciprocal of the ratio of time_running and time_enabled while @measure shows the raw counter value. In my opinion, scaled values are easier to understand and can be easily compared to other runs.

Also, some comments are added on the right hand side of the table. These are useful because you can easily notice stalled cycles and/or branch prediction misses. These values are computed on demand and not explicitly stored in a statistics object.

You can focus on specific events you're interested in by passing a string to the macro as follows:

julia> @pstats "(instructions,cpu-cycles),task-clock" tarai(14, 2, 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ instructions             6.19e+09  100.0%  #  2.6 insns per cycle
└ cpu-cycles               2.36e+09  100.0%  #  3.8 cycles per ns
╶ task-clock               6.15e+08  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Events in a paired parens are grouped together: they are measured as a single unit of execution and thus their values can be compared to each other. In the case above, "instructions" and "cpu-cycles" are grouped, and "task-clock" is a singleton group. Groups are also shown in the reported result using box-drawing characters (the leftmost column of the table).

The event names are taken from the perf command-line tool because they would be more familiar to users of the tool. Aliases are not supported yet.

I'm going to add more tweaks and docs to this pull request. I appreciate your feedback.

vchuravy

Looks good. @NHDaly & @YingboMa do you guys think we should delete the @measure macros?

vchuravy · 2020-06-21T20:17:41Z

src/LinuxPerf.jl

+            error("unknown option: $(opt)")
+        end
+    end
+    return (events = events, exclude_kernel = exclude_kernel,)


The keyword argument right now is called userspace_only

I wonder if there is a rationale behind renaming exclude_kernel to userspace_only. The original name of the perf_event_attr struct is exclude_kernel, which seems enough natural to me. Also, we have exclude_user, exclude_hv (hypervisor), exclude_idle, and so on. I'm not sure the meaning of userspace_only in this situation.

vchuravy · 2020-06-21T20:19:38Z

src/LinuxPerf.jl

@@ -334,18 +345,378 @@ const reasonable_defaults =
     [EventType(:cache, :L1_data, :write, :access),
      EventType(:cache, :L1_data, :write, :miss)]=#]

-function make_bench(x)
+function make_bench(x; kwargs...)


We probably should follow perf here and add proper modifiers to the events https://perf.wiki.kernel.org/index.php/Tutorial#Modifiers

so allowing :u for userspace only, instead of making it a keywordargument to the EventGroup

I found the logic behind these event modifiers: https://elixir.bootlin.com/linux/v5.7.2/source/tools/perf/util/parse-events.c#L1731. So, a better design to support these modifiers would be removing the userspace_only keyword argument of the EventGroup constructor and augmenting the EventType struct with modifier(s).

YingboMa · 2020-06-21T20:33:23Z

Yes, we can remove @measure. It will be cool to make @pstats more flexible and able to use raw events #3.

bicycle1885 · 2020-06-22T05:22:40Z

Supporting raw events should be easy. I'd like to follow the syntax of the perf command, which specifies event numbers like rNNN (quoted from perf list --help):

RAW HARDWARE EVENT DESCRIPTOR
       Even when an event is not available in a symbolic form within perf right now, it
       can be encoded in a per processor specific way.

       For instance For x86 CPUs NNN represents the raw register encoding with the layout
       of IA32_PERFEVTSELx MSRs (see [Intel® 64 and IA-32 Architectures Software
       Developer’s Manual Volume 3B: System Programming Guide] Figure 30-1 Layout of
       IA32_PERFEVTSELx MSRs) or AMD’s PerfEvtSeln (see [AMD64 Architecture Programmer’s
       Manual Volume 2: System Programming], Page 344, Figure 13-7 Performance
       Event-Select Register (PerfEvtSeln)).

       Note: Only the following bit fields can be set in x86 counter registers: event,
       umask, edge, inv, cmask. Esp. guest/host only and OS/user mode flags must be setup
       using EVENT MODIFIERS.

       Example:

       If the Intel docs for a QM720 Core i7 describe an event as:

           Event  Umask  Event Mask
           Num.   Value  Mnemonic    Description                        Comment

           A8H      01H  LSD.UOPS    Counts the number of micro-ops     Use cmask=1 and
                                     delivered by loop stream detector  invert to count
                                                                        cycles

       raw encoding of 0x1A8 can be used:

           perf stat -e r1a8 -a sleep 1
           perf record -e r1a8 ...

       You should refer to the processor specific documentation for getting these details.
       Some of them are referenced in the SEE ALSO section below.

YingboMa

LGTM. @bicycle1885 do you want to finish up this PR and add raw events in a separate PR?

bicycle1885 · 2020-06-30T02:25:20Z

Yes, I think adding raw events to this pull request is too much. I found a few problems in the current pr so please give me a moment.

bicycle1885 · 2020-06-30T04:27:29Z

I think it is now in good shape!

Here is a highlight of changes I added yesterday. Before those changes, count statistics except from the main thread are just ignored. This is not good because some multithreaded routines such as matrix multiplication just lose some valuable information. The current @pstats macro aggregates statistics from all threads and thus it is more precise and useful. The LinuxPerf.printsummary function can print per-thread statistics along with aggregated ones.

julia> using LinuxPerf

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> A = randn(3000, 3000); B = copy(A);

julia> @time A*B;
  1.545560 seconds (2.71 M allocations: 199.055 MiB, 8.22% gc time)

julia> @time A*B;
  0.631846 seconds (2 allocations: 68.665 MiB, 1.44% gc time)

julia> stats = @pstats A*B
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               7.51e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  6.03e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   5.44e+09   50.1%  # 72.5% of cycles
┌ instructions             1.31e+10   49.9%  #  1.7 insns per cycle
│ branch-instructions      9.35e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.18e+06   49.9%  #  1.3% of branch instructions
┌ task-clock               2.46e+09  100.0%  #  2.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.20e+04  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> LinuxPerf.printsummary(stats, expandthreads = true)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 109445)
┌ cpu-cycles               1.86e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  1.23e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.39e+09   50.1%  # 74.6% of cycles
┌ instructions             3.28e+09   49.9%  #  1.8 insns per cycle
│ branch-instructions      2.32e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.57e+05   49.9%  #  0.7% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.03e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 109450)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  6.85e+05   50.1%  #  0.0% of cycles
└ stalled-cycles-backend   1.35e+09   50.1%  # 71.4% of cycles
┌ instructions             3.29e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.29e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.33e+05   49.9%  #  0.6% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.1 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.91e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 109451)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  2.81e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.40e+09   50.1%  # 74.6% of cycles
┌ instructions             3.27e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.52e+07   49.9%  #  0.8% of instructions
└ branch-misses            7.36e+05   49.9%  #  2.9% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.54e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 109452)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  1.30e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.30e+09   50.1%  # 69.2% of cycles
┌ instructions             3.22e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.22e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.52e+05   49.9%  #  0.7% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.1 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              6.52e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.51e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  6.03e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   5.44e+09   50.1%  # 72.5% of cycles
┌ instructions             1.31e+10   49.9%  #  1.7 insns per cycle
│ branch-instructions      9.35e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.18e+06   49.9%  #  1.3% of branch instructions
┌ task-clock               2.46e+09  100.0%  #  2.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.20e+04  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

vchuravy reviewed Jun 21, 2020

View reviewed changes

bicycle1885 added 10 commits June 28, 2020 14:08

Add pstats macro

d20ed16

Export pstats macro

fb4e206

Add list function to list events

6fc53d4

Write docs

78bdb4e

Close PerfBench

b6a85c5

Add exclude_kernel option

c4077ed

Update examples

033e29e

Refactor show

da8de67

Show warning if events are not measured

cc755eb

Show human-readable time

8602a74

bicycle1885 force-pushed the pstats branch from adee27f to 8602a74 Compare June 28, 2020 05:11

bicycle1885 added 13 commits June 28, 2020 14:12

Fix PERF_TYPE_BREAKPOINT

dc26110

Add event modifiers

7473c03

Add user, kernel, and hypervisor flags

dfa787c

Add dump_groups for debug

768deb1

Fix docs

36ec2fc

Add group-level modifiers

34e0d2b

Fix bug

7f73836

Add tests on events parser

b913820

Update examples

fbf77fa

Make sure to close PerfBench

882d700

Measure user space only by default

affbb6e

Support multithreads

bd1dfc6

Fix

e838806

vchuravy requested review from YingboMa and NHDaly June 29, 2020 18:18

YingboMa approved these changes Jun 29, 2020

View reviewed changes

bicycle1885 added 2 commits June 30, 2020 11:34

Fix printsummary

444edea

Tweaks

1ea871b

bicycle1885 merged commit b67de03 into JuliaPerf:master Jul 1, 2020

bicycle1885 deleted the pstats branch July 1, 2020 15:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pstats macro #6

Add pstats macro #6

bicycle1885 commented Jun 20, 2020

vchuravy left a comment

vchuravy Jun 21, 2020

bicycle1885 Jun 22, 2020

vchuravy Jun 21, 2020

bicycle1885 Jun 22, 2020 •

edited

Loading

YingboMa commented Jun 21, 2020

bicycle1885 commented Jun 22, 2020

YingboMa left a comment

bicycle1885 commented Jun 30, 2020

bicycle1885 commented Jun 30, 2020 •

edited

Loading

Add pstats macro #6

Add pstats macro #6

Conversation

bicycle1885 commented Jun 20, 2020

vchuravy left a comment

Choose a reason for hiding this comment

vchuravy Jun 21, 2020

Choose a reason for hiding this comment

bicycle1885 Jun 22, 2020

Choose a reason for hiding this comment

vchuravy Jun 21, 2020

Choose a reason for hiding this comment

bicycle1885 Jun 22, 2020 • edited Loading

Choose a reason for hiding this comment

YingboMa commented Jun 21, 2020

bicycle1885 commented Jun 22, 2020

YingboMa left a comment

Choose a reason for hiding this comment

bicycle1885 commented Jun 30, 2020

bicycle1885 commented Jun 30, 2020 • edited Loading

bicycle1885 Jun 22, 2020 •

edited

Loading

bicycle1885 commented Jun 30, 2020 •

edited

Loading