
feat: CodSpeed Benchmarks #4243

Open · wants to merge 24 commits into base: main

Conversation


@erikwrede erikwrede commented Oct 17, 2024

Summary

As discussed in the Discord #wg channel, this is the prototype for CodSpeed benchmarks using Vitest.

Some of the benchmarks we currently have may not be suitable for CodSpeed's instrumentation and may still show variance in results. For now, CodSpeed is just meant to supplement the fully-fledged benchmark suite, to prevent accidental regressions and give a quick impact overview on each PR. We can always remove individual benchmarks from the CodSpeed suite and keep them in the more powerful main benchmark suite.

Additionally, the introduction of Vitest for benchmarking provides a path forward to using Vitest for the tests, too.

A sample run of CodSpeed on my fork can be found here: erikwrede/graphql-js#3 (comment)

Changes in this PR

  • Add CodSpeed
  • Add Vitest
  • Remove @types/chai because Vitest bundles it; no other way I'm aware of to fix this, unfortunately. No impact on development.
  • Refactor all benchmarks as CodSpeed + Vitest benchmarks (see the sketch after this list)
  • Add GitHub workflows
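
For illustration, here is a minimal sketch of what one of the refactored CodSpeed + Vitest benchmark files could look like. The file name and query are made up, and inside the repo the benchmarks would import from the local source rather than the published graphql package:

import { bench, describe } from 'vitest';
import { parse } from 'graphql';

const queryText = `
  query {
    __schema {
      types {
        name
      }
    }
  }
`;

describe('parser', () => {
  // With the CodSpeed plugin enabled in vitest.config.ts, this bench is
  // measured via CodSpeed's instrumentation instead of wall-clock timing.
  bench('parse a small introspection-style query', () => {
    parse(queryText);
  });
});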

Administrative steps before merging

@erikwrede erikwrede requested a review from a team as a code owner October 17, 2024 16:24
@erikwrede erikwrede changed the base branch from 16.x.x to main October 17, 2024 16:24
Copy link

linux-foundation-easycla bot commented Oct 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Comment on lines +4 to +7
export default defineConfig({
  plugins: [codspeedPlugin()],
  // ...
});
erikwrede (Author):

Basic config for now; it will need adjustment if we decide to also run the tests with Vitest.
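
For reference, a sketch of what the full vitest.config.ts around the quoted lines might look like, assuming the plugin is imported from @codspeed/vitest-plugin (the imports are not part of the quoted snippet):

import { defineConfig } from 'vitest/config';
import codspeedPlugin from '@codspeed/vitest-plugin';

export default defineConfig({
  // Only the CodSpeed plugin is wired up for now; test-related options would
  // be added here if the test suite moves to Vitest later.
  plugins: [codspeedPlugin()],
});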

Comment on lines 59 to 62
IGNORED_FILES_UNPROCESSED=$(git ls-files --cached --ignored --exclude-from=all.gitignore)
IGNORED_FILES=$(grep -v -F "patches/@codspeed+core+3.1.0.patch" <<< "$IGNORED_FILES_UNPROCESSED" || true)

echo "IGNORED_FILES: $IGNORED_FILES"
erikwrede (Author):

We can revert this as soon as CodSpeed no longer requires the patch.

Comment on lines 1 to 16
diff --git a/node_modules/@codspeed/core/dist/index.cjs.js b/node_modules/@codspeed/core/dist/index.cjs.js
index 1c40cda..4a5d588 100644
--- a/node_modules/@codspeed/core/dist/index.cjs.js
+++ b/node_modules/@codspeed/core/dist/index.cjs.js
@@ -26,7 +26,10 @@ const getV8Flags = () => {
     "--no-opt",
     "--predictable",
     "--predictable-gc-schedule",
-    "--interpreted-frames-native-stack"
+    "--interpreted-frames-native-stack",
+    // "--jitless",
+    '--no-concurrent-sweeping',
+    '--max-old-space-size=4096',
   ];
   if (nodeVersionMajor < 18) {
     flags.push("--no-randomize-hashes");
erikwrede (Author):

As recommended by the CodSpeed maintainers.

Contributor:

For the existing benchmark suite, we have the following Node options:

  '--predictable',
  '--no-concurrent-sweeping',
  '--no-minor-gc-task',
  '--min-semi-space-size=1024', // 1GB
  '--max-semi-space-size=1024', // 1GB
  '--trace-gc', // no gc calls should happen during benchmark, so trace them

Do we want to use similar flags? Are they equivalent? Asking from a place of ignorance here.

Tiny nit in terms of uniform quoting for the options, double quote vs single quote, just because I can't help myself. :)

erikwrede (Author):

Completely missed this comment. For now, I'd prefer to keep the current options, as my tests have shown an acceptable level of variance between runs with them. I have, however, sent these options to the CodSpeed maintainers, and they're testing whether they see any improvements. As mentioned below, since CodSpeed and our benchmark suite differ in terms of instrumentation, we may also need different flags.

erikwrede (Author):

And regarding the quotes: since this is a patch applied to @codspeed/core, I tried to align with the code style in that folder.

Comment on lines +11 to +12
- src/__benchmarks__/github-schema.json
- src/__benchmarks__/github-schema.graphql
erikwrede (Author):

We could merge these two with the existing benchmark resource files into a shared folder (see the loading sketch below).
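
For context, a hedged sketch of how a benchmark might load these resource files; the paths and the import from the published graphql package are assumptions, and in-repo code would use relative imports instead:

import { readFileSync } from 'node:fs';
import { buildSchema } from 'graphql';

// Load the GitHub schema SDL used by some of the benchmarks.
const sdl = readFileSync('src/__benchmarks__/github-schema.graphql', 'utf8');
const schema = buildSchema(sdl);

// Quick sanity check that the schema was built.
console.log(Object.keys(schema.getTypeMap()).length);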

@@ -51,14 +51,17 @@
     "build:deno": "node --loader ts-node/esm resources/build-deno.ts",
     "diff:npm": "node --loader ts-node/esm resources/diff-npm-package.ts",
     "gitpublish:npm": "bash ./resources/gitpublish.sh npm npmDist",
-    "gitpublish:deno": "bash ./resources/gitpublish.sh deno denoDist"
+    "gitpublish:deno": "bash ./resources/gitpublish.sh deno denoDist",
+    "postinstall": "patch-package"
erikwrede (Author):

For patching CodSpeed with added stability

@yaacovCR (Contributor):

Thanks so much for working on this, @erikwrede !!!

Some of the benchmarks we currently have may not be suitable for CodSpeed's instrumentation and may still show variance in results.

Just taking a quick look, this is my biggest point of concern. Is there anything strange about our benchmarks that leads to unusual variance? Is there any suggestion that we might be able to eventually migrate all benchmarks to CodSpeed? It would be great to be able to deprecate the old benchmarks entirely with a solution that provides similar coverage.


netlify bot commented Oct 18, 2024

Deploy Preview for compassionate-pike-271cb3 failed.

Latest commit: 0574c6b
Latest deploy log: https://app.netlify.com/sites/compassionate-pike-271cb3/deploys/67163557bdab330008e9b1cc

@erikwrede (Author):

I can fully relate to your concerns. While I'd also love to see CodSpeed become the only benchmarking setup long-term, the nondeterministic nature of the JIT can cause performance differences in some of the benchmarks:

https://codspeed.io/erikwrede/graphql-js/benchmarks

For now, I'd suggest keeping all benchmarks and simply ignoring the unstable cases. With CodSpeed, we can freely choose our regression threshold, but too many false positives for regressions or improvements will certainly degrade the experience. Once we see an improvement in stability over the ignored benchmarks, we can re-evaluate.

My take: let's try it. If it proves unreliable, even for the benchmarks we thought were stable, we can always remove it again and fall back to our other benchmarking solution.

@yaacovCR (Contributor):

For now, I'd suggest keeping all benchmarks and simply ignoring the unstable cases.
Once we see an improvement in stability over the ignored benchmarks, we can re-evaluate.

I guess I'm not totally understanding the big picture here. From a place of ignorance, I will try to ask some more specific questions:

  1. How do we tell from the codspeed UI which benches are unstable and should be ignored?
  2. When you write "once we see an improvement in stability over the ignored benchmarks", do you mean that the unstable benchmarks are expected to become stable? Or that we can write better benchmarks that are more stable? Or that the usefulness of the stable ones is so great that the fact that there are unstable ones is not going to bother us?

In my head, I am comparing this proposed solution to setting up a non-shared, privately hosted GitHub Actions runner with a dedicated CPU at a cost of about $20 a month, and trying to understand the differences.

Does our current benchmarking solution have the same variance problem for some of the benchmarks between runs, but get around this by always rerunning the pre-change and post-change right away?

As you might be able to tell from the above, I am a bit uncertain as to the trade-offs here => feel free to enlighten as much of my ignorance as you can with the time you have available; @JoviDeCroock might also have some thoughts, of course!

@erikwrede (Author):

First of all, let me say I don't consider your questions to come from a place of ignorance but rather from a desire for rigor, which is what a project like this needs ;)

Let me reiterate the rationale behind suggesting CodSpeed and why I use it for other open-source projects:

  • It always runs in your pipeline.
  • Brief reporting.
  • Comes with flame graphs -> quick overview of what changed.
  • Monitoring of what changed over time.

I consider CodSpeed to be a "linter" for performance. It will catch regressions you might not think about when making code changes, and when things improve, it will give you a satisfying comment about the improvement. Continuously seeing that pushes you to think of other cases to benchmark and to be performance-conscious. The USP of CodSpeed is its instrumentation, which enables it to run on GitHub Actions runners or just about any hardware while providing consistent results.

However, to ensure perfect code quality, more than linting is required; a more rigorous review would be best. In terms of benchmarking graphql-js, that is what our current DIY benchmarking solution provides: it tests the built package and also covers memory usage. For now, I don't see CodSpeed replacing it, but supplementing it.

Could we still host a $20/month Hetzner bare-metal machine and run our DIY script there? Sure! A custom solution will always best suit our purpose. However, to get the same benefits, we'd also need reporting, a dashboard, and a way to extract flame graphs. If we want to invest in this and build our own solution, I wouldn't oppose it.

So, now that we have the use case straight, let's take a deeper look at performance.
In Node benchmarking, a couple of factors affect benchmark consistency:

  • Garbage Collection
  • JIT
  • Syscalls, Context Switching

CodSpeed's instrumentation excludes syscalls from its measurements; the DIY solution ignores runs with too many context switches. The remaining problems are mostly GC- and JIT-related: we want our benchmarks to run on optimized hot code, and we don't want GC to interrupt them at different points in time.
I saw some variance between runs of the same codebase in earlier versions of this PR, but I made some changes to the CodSpeed patch today, and now everything seems stable. If, over time, a benchmark turns out to be flaky on CodSpeed, we can always exclude it from monitoring. Flaky benchmarks diminish the value of any solution, because they lead to ignoring the results in the long run. Whatever solution we choose should have measures in place to avoid flakiness and variance.

So, please see this as a PoC and feel free to discuss the tradeoffs. I put this on the table as an option, and I'm curious to see what you think.

@erikwrede (Author):

And to be very specific about your questions:

  1. How do we tell from the codspeed UI which benches are unstable and should be ignored?

By looking at changes in benchmark performance over time and seeing that a benchmark changed even though the corresponding code wasn't modified. For example, for this benchmark over multiple readme-only commits:
[screenshot: benchmark performance history across readme-only commits]
Ideally, we would want a near-straight line here.

  2. When you write "once we see an improvement in stability over the ignored benchmarks", do you mean that the unstable benchmarks are expected to become stable? Or that the usefulness of the stable ones is so great that the fact that there are unstable ones is not going to bother us?

CodSpeed for Node is still relatively new, hence my patches to it in this PR. Future changes could help bring down variance on GitHub Actions runners and make the tests stable. Reiterating my previous comment: I don't see any unexpected variance between commits anymore; it seems my newest patch fixed this problem.

Or that we can write better benchmarks that are more stable?

At the current moment, I don't see a way for us to write benchmarks that are more stable on CodSpeed. The ones where I saw variance are, as of now, less compatible with CodSpeed's instrumented approach.

Does our current benchmarking solution have the same variance problem for some of the benchmarks between runs, but get around this by always rerunning the pre-change and post-change right away?

Our current benchmarking solution takes a different approach to measurement, so it's not directly comparable. CodSpeed instruments the measurements, filtering out noise like context switches; we run the benchmark several times and drop any results that contain GC pauses or context switches.
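
To make the contrast concrete, here is an illustrative sketch of the rerun-and-filter idea; it is not the actual implementation in this repo, and all names are hypothetical:

import { performance } from 'node:perf_hooks';

interface Sample {
  durationMs: number;
  noisy: boolean; // e.g. GC ran, or too many context switches were observed
}

function measureOnce(run: () => void): Sample {
  const start = performance.now();
  run();
  const durationMs = performance.now() - start;
  // Hypothetical: a real harness would feed GC / context-switch detection in
  // here (for example from --trace-gc output or OS counters).
  return { durationMs, noisy: false };
}

function collectCleanSamples(run: () => void, wanted: number): number[] {
  const clean: number[] = [];
  while (clean.length < wanted) {
    const sample = measureOnce(run);
    if (!sample.noisy) {
      clean.push(sample.durationMs);
    }
  }
  return clean;
}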

@yaacovCR (Contributor) left a comment:

I very much appreciate the depth of the response! From a maintenance perspective, if we are confident in the full suite of benchmarks with CodSpeed, it seems like a better long-term solution than maintaining our own self-hosted runner.

package.json review comment: outdated, resolved.