Add JS method to get the outstanding job count + queue depth #380
Comments
Hi Jake; the first thing to do would be benchmarking how much doing this would reduce performance. If it reduces performance by a non-negligible amount (2% or more) then it would need to be opt-in; if by more than 10% then I'm unlikely to add it. Have you considered using the row count estimation feature in Postgres instead?

```sql
select reltuples::bigint as rough_row_count
from pg_class
where oid = 'graphile_worker.jobs'::regclass;
```

I would expect it to be more "swingy" in a Graphile Worker context than on a regular table, but again it might be worth looking into - especially if you already know what proportion of your jobs get queued in the future versus queued to execute now.
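If it helps, a minimal sketch of calling that estimate from JS (assumptions: node-postgres, a `DATABASE_URL` pointing at the same database Graphile Worker uses; not part of any existing API):

```ts
// Rough, fast row-count estimate for graphile_worker.jobs.
// Uses pg_class.reltuples, so the number is approximate and lags ANALYZE/autovacuum.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function roughJobCount(): Promise<number> {
  const { rows } = await pool.query(
    `select reltuples::bigint as rough_row_count
       from pg_class
      where oid = 'graphile_worker.jobs'::regclass`
  );
  // pg returns bigint as a string by default
  return Number(rows[0].rough_row_count);
}
```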
No, but you can use other heuristics, for example looking at …
Hey @benjie ! Thanks for the quick response.
Happy to look into this, but can you clarify what you mean by opt-in? I'm imagining a function that a user could call, something like:
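(sketched here against a plain pg Pool purely to make the idea concrete; the function name and the exact "outstanding" conditions are hypothetical, not anything that exists today)

```ts
// Hypothetical helper -- not an existing graphile-worker API.
// Counts jobs that are due, still have attempts remaining, and are not
// currently locked (treating locks older than 4 hours as expired, per the
// discussion further down this thread).
import { Pool } from "pg";

export async function getOutstandingJobCount(pool: Pool): Promise<number> {
  const { rows } = await pool.query(
    `select count(*) as outstanding
       from graphile_worker.jobs
      where run_at <= now()
        and attempts < max_attempts
        and (locked_at is null or locked_at < now() - interval '4 hours')`
  );
  return Number(rows[0].outstanding);
}
```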
(except maybe as a workerUtil helper method in the JS API?) Given it wouldn't actually be used by any of the worker internals, I'm not sure how it wouldn't be opt-in. As for perf, for my use-case I'd be calling this every 30s to 1m. Are you thinking something like "what's the effect of running this query rapid-fire while the existing perf tests are running?" The …
Makes sense 👍 I was thinking more like automatic job queue depth monitoring, which if this were to get landed would be a very likely next request!
Yes. I assume from this you'd want quite tight granularity because if you're only looking at doing it every 30 seconds then a 15 second lag in jobs executing would be a clear and (performance-wise) free signal that you need to run more workers. Then again, depends how long it takes you to spin up new workers... Actually running node should be under a second, but getting the environment up to run node in might take longer 😉
Excellent, would love to hear your results!
@benjie you're right that the next thing wanted is going to be queue depth monitoring... FYI I'm currently looking at handling this (using the v13 schema, but it shouldn't be hard to bump to the new schema) as:

```sql
select
  count(*),
  case
    when jobs.locked_at is not null and jobs.locked_at >= (now() - interval '4 hours') then 'leased'
    when jobs.queue_name is not null and exists (
      select 1
      from graphile_worker.job_queues
      where job_queues.queue_name = jobs.queue_name
        and (job_queues.locked_at is not null or job_queues.locked_at >= (now() - interval '4 hours'))
    ) then 'waiting_on_queue'
    when attempts >= max_attempts then 'permanently_failed'
    when attempts = 0 and run_at >= now() then 'future'
    when attempts > 1 and run_at >= now() then 'waiting_to_retry'
    else 'ready'
  end as status
from graphile_worker.jobs
group by status;
```

It turns out the clause to know how many jobs are ready is about the same amount of work as knowing all the other "status"es. WDYT about having a more-canonical derived …?
It can't be … I don't see much value adding this to every request to …
Ah, of course. I wrote the above in a hurry on my way to a funeral and wasn't thinking about how this changes over time. I guess what I'm really asking is: would you accept a PR that added an official function for this kind of query? Something like …
I'm extremely hesitant to add anything like that that may encourage people to follow bad patterns (e.g. polling it) and result in reduced queue performance.
Understood. It seems you're trying to keep graphile-worker as lean as possible, and letting folks build whatever additional tooling they need using only performant hooks (like the existing event emitters). I guess I balk at the idea of making the next dev discover their own path forward organically without some additional direction, even if that's just entries in the docs about how they might build that tooling. You have lightweight entries around shadow jobs tables, and you now have the public jobs view.

At this point, I'd be happy to contribute additions to the docs to point out, e.g., the job lag time and total number of jobs in the table. A query like the above seems like it'd also be useful in the docs, labelled appropriately with BIG RED WARNINGS about performance implications. I'd also be happy to share / open-source the OpenTelemetry gauges work I've done if it would help with future efforts to provide automated job queue depth monitoring. Let me know what would be useful.
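To make the gauges idea concrete, a stripped-down sketch might look like the following (illustrative only: it assumes @opentelemetry/api with a MeterProvider already configured, uses a simplified version of the status query above, and carries exactly the performance caveats discussed in this thread):

```ts
// Illustrative only: export per-status job counts as an OpenTelemetry observable gauge.
// Polling graphile_worker.jobs like this has a cost -- see the caveats in this thread.
import { metrics } from "@opentelemetry/api";
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const meter = metrics.getMeter("graphile-worker-monitoring");

// Simplified variant of the status query above (queue handling omitted for brevity).
const statusBreakdownSql = `
  select count(*) as count,
         case
           when locked_at >= now() - interval '4 hours' then 'leased'
           when attempts >= max_attempts then 'permanently_failed'
           when run_at > now() then 'future'
           else 'ready'
         end as status
  from graphile_worker.jobs
  group by status`;

const jobsByStatus = meter.createObservableGauge("graphile_worker.jobs", {
  description: "Jobs in graphile_worker.jobs, grouped by derived status",
});

jobsByStatus.addCallback(async (observableResult) => {
  const { rows } = await pool.query<{ count: string; status: string }>(statusBreakdownSql);
  for (const row of rows) {
    observableResult.observe(Number(row.count), { status: row.status });
  }
});
```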
All these things sound like great documentation additions 👍 Please file each one separately so we can merge the easy ones and discuss the less easy ones ❤️
I'm currently struggling with the same issues and couldn't find anything in the docs. Would you mind sharing some learnings? Even just some SQL queries dumped into a gist would be very helpful!
Essentially the performance of Graphile Worker comes down to very careful querying of the database, and as you scale up it's incredibly easy to upset that if you're issuing additional queries - Worker is already incredibly heavy on IO if you have a lot of jobs and a lot of workers (hence the work on #474). The act of monitoring the performance of Worker, if not done carefully, could result in the performance of Worker tanking.

I'd advise that rather than "queue depth" you think in terms of "job latency" - the time between a job being due to execute and it actually executing. Generally a job latency up to …
I threw together a gist showing how we're using OpenTelemetry metrics with graphile-worker. I had to rip a fair bit of specialization out of that code, so this variant is untested. @i-tu I hope it's useful, at least as a jumping-off point for your own needs. @benjie I'd love to get your eyes on that -- hopefully we're calculating the delays etc. correctly. If it's to your liking, feel free to link to it or include the code directly in your docs etc.
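In condensed form, the approach looks roughly like this (not the gist itself; a sketch assuming @opentelemetry/api with a configured MeterProvider, graphile-worker's run() API, and a hypothetical tasks directory):

```ts
// Illustrative sketch: record job start lag as an OpenTelemetry histogram using
// graphile-worker's event emitter, so no extra queries hit the database.
import { metrics } from "@opentelemetry/api";
import { run } from "graphile-worker";

const meter = metrics.getMeter("graphile-worker-monitoring");
const startLag = meter.createHistogram("graphile_worker.job.start_lag_ms", {
  description: "Milliseconds between a job's run_at and it actually starting",
});

async function main() {
  const runner = await run({
    connectionString: process.env.DATABASE_URL,
    taskDirectory: `${__dirname}/tasks`,
  });

  runner.events.on("job:start", ({ job }) => {
    // job.run_at is a Date; the difference is the time the job spent waiting.
    startLag.record(Date.now() - job.run_at.getTime(), {
      task: job.task_identifier,
    });
  });

  await runner.promise;
}

main().catch(console.error);
```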
Thanks for the reply! This was a great pointer. I wrote the following code to instrument job events in Sentry:

```js
// Assumes @sentry/node and date-fns are already set up elsewhere:
// import * as Sentry from '@sentry/node'
// import { differenceInMilliseconds } from 'date-fns'

runner.events.on('job:start', ({ job }) => {
  // Time spent waiting on queue before starting
  Sentry.metrics.distribution('job_start_lag', differenceInMilliseconds(new Date(), job.run_at), {
    tags: { type: 'worker', task: job.task_identifier },
    unit: 'millisecond',
  })
})

runner.events.on('job:complete', async ({ job }) => {
  // Total time from when the job was due until it completed
  Sentry.metrics.distribution(
    'job_completion_time',
    differenceInMilliseconds(new Date(), job.run_at),
    {
      tags: { type: 'worker', task: job.task_identifier },
      unit: 'millisecond',
    }
  )
})
```
Thanks so much @jakebiesinger-storyhealth ! 🤩 |
@benjie is there a recommended (performant) measure for determining when to scale workers from 0->1 and 1->0, without measuring queue depth or active tasks? The …
Graphile Worker is not designed to scale to zero; it's designed to always have at least one running worker. Technically you can scale it to zero (you won't lose tasks), but that's not an aim. Of course, if you are scaled to zero then there's no worries about reading from the jobs table.
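As a rough illustration of how that could be combined with runOnce (the entrypoint name and the "ready" condition are assumptions based on the queries earlier in this thread, not an official recipe):

```ts
// Sketch of a scale-from-zero entrypoint: it only runs while no workers exist,
// so the jobs-table query here cannot slow down a live worker.
import { runOnce } from "graphile-worker";
import { Pool } from "pg";

async function maybeProcessBurst() {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  try {
    const { rows } = await pool.query(
      `select count(*) as ready
         from graphile_worker.jobs
        where run_at <= now()
          and attempts < max_attempts
          and locked_at is null`
    );
    if (Number(rows[0].ready) > 0) {
      // Process jobs until the queue is drained, then exit
      // (e.g. invoked from a cron or serverless trigger).
      await runOnce({
        connectionString: process.env.DATABASE_URL,
        taskDirectory: `${__dirname}/tasks`,
        concurrency: 10,
      });
    }
  } finally {
    await pool.end();
  }
}

maybeProcessBurst().catch(console.error);
```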
Feature description
I would like to be able to tell how many jobs are ready to be run in the jobs table. There is no public interface for doing that at this point.

Motivating example
Our background workers run on a "serverless" architecture with a minimal number of workers. We are occasionally faced with bursty background tasks (thousands or tens of thousands of jobs added) and would like to tell the underlying infrastructure that it needs to spin up additional workers temporarily to handle the load.
A public interface for querying the current count of outstanding jobs (ready to run + not locked or past their expiry time) would enable us to report graphile_worker metrics (useful for monitoring purposes anyway) and to spin up additional (temporary) workers (e.g., via having them runOnce).

Alternatives considered
The existing WorkerEvents lets us know that there are no work items to be run but gives no insight into the job / queue depth. We could instrument a running job count ourselves using some separate table, but many of our jobs are fired off from Postgres triggers, which makes instrumentation a little more awkward and less explicit / clear to a new developer.

The new public jobs view is helpful and we can use that... I just thought a JS helper function might allow your users to navigate some of the more complex SQL gotchas and get counts that are more likely to be correct.