Implausible cycle count when PMU muxing is active #53

Open
topolarity opened this issue Dec 19, 2024 · 6 comments

topolarity commented Dec 19, 2024

It's not uncommon for me to get a reading like this:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               8.60e+04   62.9%  #  0.0 cycles per ns
╶ instructions             7.89e+06  100.0%  # 91.8 insns per cycle
╶ branch-instructions      1.29e+05  100.0%  #  1.6% of insns
╶ branch-misses            9.09e+02  100.0%  #  0.7% of branch insns
╶ cache-misses             3.44e+03  100.0%  #  0.5% of cache refs
╶ cache-references         6.77e+05   37.5%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

roughly 10-30% of the time, compared to a more reasonable measurement:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.59e+07   82.6%  #  1.6 cycles per ns
╶ instructions             4.35e+07   82.6%  #  1.2 insns per cycle
╶ branch-instructions      7.04e+06   82.6%  # 16.2% of insns
╶ branch-misses            1.93e+05   82.6%  #  2.7% of branch insns
╶ cache-misses             5.19e+05   82.6%  # 11.7% of cache refs
╶ cache-references         4.46e+06   87.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

~100 instructions per cycle seems... a little high
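
As a quick sanity check on the two tables above, here is a minimal sketch that recomputes the derived ratio from the printed counts. The helpers `scale_count` and `ipc` are hypothetical names for illustration, not part of LinuxPerf. `scale_count` shows the usual extrapolation for a multiplexed counter (raw count times `time_enabled / time_running`, as `perf stat` does); whether the values printed by `@pstats` are computed exactly this way is an assumption here.

```julia
# Hypothetical helpers for illustration only; not LinuxPerf API.

# Usual extrapolation for a multiplexed counter: the raw count is scaled up
# by the ratio of time the event was enabled to time it was actually running.
scale_count(raw, time_enabled, time_running) =
    time_running == 0 ? 0.0 : raw * time_enabled / time_running

# Instructions per cycle from two counts.
ipc(instructions, cycles) = instructions / cycles

# Counts copied from the two tables above (rounded as printed).
suspicious = ipc(7.89e6, 8.60e4)   # close to the 91.8 printed in the first table
plausible  = ipc(4.35e7, 3.59e7)   # close to the 1.2 printed in the second table

println("suspicious IPC ≈ ", round(suspicious, digits=1))
println("plausible IPC  ≈ ", round(plausible, digits=1))
```

The small difference from the printed 91.8 comes from using the rounded counts shown in the table rather than the unrounded ones.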

topolarity commented Dec 19, 2024

The above measurements are with the diff from #50 (comment), but this also happens using disable_all! / enable_all!

Here's a strange measurement from current master:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.43e+04   59.4%  #  0.0 cycles per ns
╶ instructions             1.26e+07  100.0%  # 877.8 insns per cycle
╶ branch-instructions      1.07e+06  100.0%  #  8.5% of insns
╶ branch-misses            1.19e+04  100.0%  #  1.1% of branch insns
╶ cache-misses             1.09e+05  100.0%  #  8.2% of cache refs
╶ cache-references         1.33e+06   40.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

topolarity commented Dec 19, 2024

I haven't been able to get perf stat to make the same mistake, so LinuxPerf may be doing something wrong here:

$ perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses echo


 Performance counter stats for 'echo':

         1,417,515      cycles                                                        (81.44%)
         1,048,902      instructions              #    0.74  insn per cycle
           240,734      branches
            31,285      branch-misses             #   13.00% of all branches
            87,314      cache-references
            25,054      cache-misses              #   28.694 % of all cache refs      (18.56%)

       0.001116123 seconds time elapsed

       0.000000000 seconds user
       0.001277000 seconds sys

topolarity commented Dec 19, 2024

This seems to be affected by the number of events we're creating for threading (I'm running on a 64-core / 128-hyper-thread machine). This is the worst outlier I get with threads=false, which is orders of magnitude better:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" threads=false rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               4.27e+06   73.0%  #  1.2 cycles per ns
╶ instructions             1.03e+07  100.0%  #  2.4 insns per cycle
╶ branch-instructions      6.23e+05  100.0%  #  6.0% of insns
╶ branch-misses            4.94e+03  100.0%  #  0.8% of branch insns
╶ cache-misses             8.85e+04  100.0%  #  7.6% of cache refs
╶ cache-references         1.17e+06   27.4%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Using OPENBLAS_NUM_THREADS=1 also dramatically reduces the noise, since most of the threads we register perf counters for belong to OpenBLAS.

Weirdly enough, the situation also improves dramatically if I use -t auto (with threads=true)

topolarity commented

If I group the cpu-cycles together with the instructions then both estimates seem to have these very low readings:

julia> @pstats "(cpu-cycles,instructions),branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               8.71e+04   40.7%  #  0.0 cycles per ns
└ instructions             2.60e+04   40.7%  #  0.3 insns per cycle
╶ branch-instructions      1.01e+06  100.0%  # 3900.4% of insns
╶ branch-misses            1.33e+04  100.0%  #  1.3% of branch insns
╶ cache-misses             1.04e+05  100.0%  # 11.7% of cache refs
╶ cache-references         8.91e+05   59.6%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Somehow these counters were active ~40% of the time, but the reading was ~100x lower than expected
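
To put rough numbers on that gap, here is a minimal sketch using only counts printed in this thread. It compares the grouped cpu-cycles reading against the two more plausible cpu-cycles readings above and against the largest correction that a 40.7% running fraction could account for. The variable names are illustrative, and treating the earlier runs as a fair reference is an assumption, since each `rand(1000,1000)` call is a separate measurement.

```julia
# All counts are copied from tables in this thread (rounded as printed).
grouped_cycles   = 8.71e4   # cpu-cycles in the grouped run
ref_cycles_low   = 4.27e6   # cpu-cycles with threads=false
ref_cycles_high  = 3.59e7   # cpu-cycles in the "more reasonable" run
running_fraction = 0.407    # fraction of time the group was running

# How far below each reference the grouped reading falls.
shortfall_low  = ref_cycles_low  / grouped_cycles
shortfall_high = ref_cycles_high / grouped_cycles

# Largest factor that a missed or doubled multiplexing correction could
# explain at this running fraction.
max_scale = 1 / running_fraction

println("shortfall: ", round(shortfall_low, digits=1), "x to ",
        round(shortfall_high, digits=1), "x")
println("max correction from scaling: ", round(max_scale, digits=2), "x")
```

Under these assumptions the shortfall is well over an order of magnitude larger than anything the running fraction alone could explain, so a missing or doubled scaling step would not be enough to account for the reading.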

Zentrik commented Dec 19, 2024

All the issues you've had have been on a single machine, right?

topolarity commented

Yeah, that's right. I'm going to try to take some measurements tomorrow on other HW.
