Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pstats macro #6

Merged
merged 25 commits into from
Jul 1, 2020
Merged

Add pstats macro #6

merged 25 commits into from
Jul 1, 2020

Conversation

bicycle1885
Copy link
Member

This is a port of my @pstats macro I post on Twitter. This is still a work in progress, but I would like to get your feedback.

Here is a simple test function called tarai.

function tarai(x, y, z)
    if x  y
        return y
    else
        return tarai(tarai(x-1, y, z), tarai(y-1, z, x), tarai(z-1, x, y))
    end
end

And this is an example of default behavior compared to @measure:

julia> @pstats tarai(14, 2, 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.29e+09   49.8%  #  3.7 cycles per ns
│ stalled-cycles-frontend  4.16e+06   49.8%  #  0.2% of cycles
└ stalled-cycles-backend   1.11e+09   49.8%  # 48.5% of cycles
┌ instructions             6.19e+09   50.2%  #  2.7 insns per cycle
│ branch-instructions      7.73e+08   50.2%  # 12.5% of instructions
└ branch-misses            3.60e+06   50.2%  #  0.5% of branch instructions
┌ task-clock               6.12e+08  100.0%
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @measure tarai(14, 2, 0)
┌───────────────────────┬───────────────┬─────────────┐
│                       │ Events        │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│             hw:cycles │ 1,522,873,900 │ 66.5 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_access │ 1,183,473     │ 66.6 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_misses │ 311,611       │ 66.6 %      │
├───────────────────────┼───────────────┼─────────────┤
│           hw:branches │ 519,874,250   │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 2,595,129     │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:instructions │ 4,142,317,183 │ 66.9 %      │
├───────────────────────┼───────────────┼─────────────┤
│       sw:ctx_switches │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│        sw:page_faults │ 1             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:minor_page_faults │ 1             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:major_page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│     sw:cpu_migrations │ 0             │ 100.0 %     │
└───────────────────────┴───────────────┴─────────────┘

The most important difference is @pstats scales the counter by the reciprocal of the ratio of time_running and time_enabled while @measure shows the raw counter value. In my opinion, scaled values are easier to understand and can be easily compared to other runs.

Also, some comments are added on the right hand side of the table. These are useful because you can easily notice stalled cycles and/or branch prediction misses. These values are computed on demand and not explicitly stored in a statistics object.

You can focus on specific events you're interested in by passing a string to the macro as follows:

julia> @pstats "(instructions,cpu-cycles),task-clock" tarai(14, 2, 0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ instructions             6.19e+09  100.0%  #  2.6 insns per cycle
└ cpu-cycles               2.36e+09  100.0%  #  3.8 cycles per ns
╶ task-clock               6.15e+08  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Events in a paired parens are grouped together: they are measured as a single unit of execution and thus their values can be compared to each other. In the case above, "instructions" and "cpu-cycles" are grouped, and "task-clock" is a singleton group. Groups are also shown in the reported result using box-drawing characters (the leftmost column of the table).

The event names are taken from the perf command-line tool because they would be more familiar to users of the tool. Aliases are not supported yet.

I'm going to add more tweaks and docs to this pull request. I appreciate your feedback.

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. @NHDaly & @YingboMa do you guys think we should delete the @measure macros?

src/LinuxPerf.jl Outdated
error("unknown option: $(opt)")
end
end
return (events = events, exclude_kernel = exclude_kernel,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keyword argument right now is called userspace_only

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a rationale behind renaming exclude_kernel to userspace_only. The original name of the perf_event_attr struct is exclude_kernel, which seems enough natural to me. Also, we have exclude_user, exclude_hv (hypervisor), exclude_idle, and so on. I'm not sure the meaning of userspace_only in this situation.

src/LinuxPerf.jl Outdated
@@ -334,18 +345,378 @@ const reasonable_defaults =
[EventType(:cache, :L1_data, :write, :access),
EventType(:cache, :L1_data, :write, :miss)]=#]

function make_bench(x)
function make_bench(x; kwargs...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should follow perf here and add proper modifiers to the events https://perf.wiki.kernel.org/index.php/Tutorial#Modifiers

so allowing :u for userspace only, instead of making it a keywordargument to the EventGroup

Copy link
Member Author

@bicycle1885 bicycle1885 Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the logic behind these event modifiers: https://elixir.bootlin.com/linux/v5.7.2/source/tools/perf/util/parse-events.c#L1731. So, a better design to support these modifiers would be removing the userspace_only keyword argument of the EventGroup constructor and augmenting the EventType struct with modifier(s).

@YingboMa
Copy link
Member

Yes, we can remove @measure. It will be cool to make @pstats more flexible and able to use raw events #3.

@bicycle1885
Copy link
Member Author

Supporting raw events should be easy. I'd like to follow the syntax of the perf command, which specifies event numbers like rNNN (quoted from perf list --help):

RAW HARDWARE EVENT DESCRIPTOR
       Even when an event is not available in a symbolic form within perf right now, it
       can be encoded in a per processor specific way.

       For instance For x86 CPUs NNN represents the raw register encoding with the layout
       of IA32_PERFEVTSELx MSRs (see [Intel® 64 and IA-32 Architectures Software
       Developer’s Manual Volume 3B: System Programming Guide] Figure 30-1 Layout of
       IA32_PERFEVTSELx MSRs) or AMD’s PerfEvtSeln (see [AMD64 Architecture Programmer’s
       Manual Volume 2: System Programming], Page 344, Figure 13-7 Performance
       Event-Select Register (PerfEvtSeln)).

       Note: Only the following bit fields can be set in x86 counter registers: event,
       umask, edge, inv, cmask. Esp. guest/host only and OS/user mode flags must be setup
       using EVENT MODIFIERS.

       Example:

       If the Intel docs for a QM720 Core i7 describe an event as:

           Event  Umask  Event Mask
           Num.   Value  Mnemonic    Description                        Comment

           A8H      01H  LSD.UOPS    Counts the number of micro-ops     Use cmask=1 and
                                     delivered by loop stream detector  invert to count
                                                                        cycles

       raw encoding of 0x1A8 can be used:

           perf stat -e r1a8 -a sleep 1
           perf record -e r1a8 ...

       You should refer to the processor specific documentation for getting these details.
       Some of them are referenced in the SEE ALSO section below.

@vchuravy vchuravy requested review from YingboMa and NHDaly June 29, 2020 18:18
Copy link
Member

@YingboMa YingboMa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @bicycle1885 do you want to finish up this PR and add raw events in a separate PR?

@bicycle1885
Copy link
Member Author

Yes, I think adding raw events to this pull request is too much. I found a few problems in the current pr so please give me a moment.

@bicycle1885
Copy link
Member Author

bicycle1885 commented Jun 30, 2020

I think it is now in good shape!

Here is a highlight of changes I added yesterday. Before those changes, count statistics except from the main thread are just ignored. This is not good because some multithreaded routines such as matrix multiplication just lose some valuable information. The current @pstats macro aggregates statistics from all threads and thus it is more precise and useful. The LinuxPerf.printsummary function can print per-thread statistics along with aggregated ones.

julia> using LinuxPerf

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> A = randn(3000, 3000); B = copy(A);

julia> @time A*B;
  1.545560 seconds (2.71 M allocations: 199.055 MiB, 8.22% gc time)

julia> @time A*B;
  0.631846 seconds (2 allocations: 68.665 MiB, 1.44% gc time)

julia> stats = @pstats A*B
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               7.51e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  6.03e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   5.44e+09   50.1%  # 72.5% of cycles
┌ instructions             1.31e+10   49.9%  #  1.7 insns per cycle
│ branch-instructions      9.35e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.18e+06   49.9%  #  1.3% of branch instructions
┌ task-clock               2.46e+09  100.0%  #  2.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.20e+04  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> LinuxPerf.printsummary(stats, expandthreads = true)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Thread #1 (TID = 109445)
┌ cpu-cycles               1.86e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  1.23e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.39e+09   50.1%  # 74.6% of cycles
┌ instructions             3.28e+09   49.9%  #  1.8 insns per cycle
│ branch-instructions      2.32e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.57e+05   49.9%  #  0.7% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.03e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #2 (TID = 109450)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  6.85e+05   50.1%  #  0.0% of cycles
└ stalled-cycles-backend   1.35e+09   50.1%  # 71.4% of cycles
┌ instructions             3.29e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.29e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.33e+05   49.9%  #  0.6% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.1 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.91e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #3 (TID = 109451)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  2.81e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.40e+09   50.1%  # 74.6% of cycles
┌ instructions             3.27e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.52e+07   49.9%  #  0.8% of instructions
└ branch-misses            7.36e+05   49.9%  #  2.9% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              5.54e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Thread #4 (TID = 109452)
┌ cpu-cycles               1.88e+09   50.1%  #  3.1 cycles per ns
│ stalled-cycles-frontend  1.30e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   1.30e+09   50.1%  # 69.2% of cycles
┌ instructions             3.22e+09   49.9%  #  1.7 insns per cycle
│ branch-instructions      2.22e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.52e+05   49.9%  #  0.7% of branch instructions
┌ task-clock               6.16e+08  100.0%  # 616.1 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              6.52e+03  100.0%
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Aggregated
┌ cpu-cycles               7.51e+09   50.1%  #  3.0 cycles per ns
│ stalled-cycles-frontend  6.03e+06   50.1%  #  0.1% of cycles
└ stalled-cycles-backend   5.44e+09   50.1%  # 72.5% of cycles
┌ instructions             1.31e+10   49.9%  #  1.7 insns per cycle
│ branch-instructions      9.35e+07   49.9%  #  0.7% of instructions
└ branch-misses            1.18e+06   49.9%  #  1.3% of branch instructions
┌ task-clock               2.46e+09  100.0%  #  2.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              2.20e+04  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

@bicycle1885 bicycle1885 merged commit b67de03 into JuliaPerf:master Jul 1, 2020
@bicycle1885 bicycle1885 deleted the pstats branch July 1, 2020 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants