Add JS method to get the outstanding job count + queue depth #380

Open · 4 of 6 tasks
jakebiesinger-storyhealth opened this issue Oct 30, 2023 · 16 comments

jakebiesinger-storyhealth commented Oct 30, 2023

Feature description

I would like to be able to tell how many jobs are ready to be run in the jobs table. There is no public interface for doing that at this point.

Motivating example

Our background workers run on a "serverless" architecture with a minimal number of workers. We are occasionally faced with bursty background tasks (thousands or tens of thousands of jobs added at once) and would like to tell the underlying infrastructure that it needs to spin up additional workers temporarily to handle the load.

A public interface for querying the current count of outstanding jobs (ready to run + not locked or past their expiry time) would enable us to report graphile_worker metrics (useful for monitoring purposes anyway) and to spin up additional (temporary) workers (e.g., via having them runOnce).

Alternatives considered

The existing WorkerEvents lets us know that there are no work items to be run but gives no insight into the job / queue depth. We could instrument a running job count ourselves using some separate table, but many of our jobs are fired off from postgres triggers which makes instrumentation a little more awkward and less explicit / clear to a new developer.

The new public jobs view is helpful and we can use that... I just thought a JS helper function might allow your users to navigate some of the more complex SQL gotchas and get counts that are more likely to be correct.

Supporting development

I [tick all that apply]:

  • am interested in building this feature myself
  • am interested in collaborating on building this feature
  • am willing to help testing this feature before it's released
  • am willing to write a test-driven test suite for this feature (before it exists)
  • am a Graphile sponsor ❤️
  • have an active support or consultancy contract with Graphile

benjie (Member) commented Oct 31, 2023

Hi Jake; the first thing to do would be benchmarking how much doing this would reduce performance. If it reduces by a non-negligible amount (2% or more) then it would need to be opt-in. If more than 10% then I'm unlikely to add it.

Have you considered using the row count estimation feature in Postgres instead?

select reltuples::bigint as rough_row_count from pg_class where oid = 'graphile_worker.jobs'::regclass;

I would expect it to be more "swingy" in a Graphile Worker context than on a regular table, but again it might be worth looking into - especially if you already know what proportion of your jobs get queued in the future versus queued to execute now.
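
For illustration, a minimal sketch (not part of Graphile Worker's API) of how that estimate could be polled from Node and turned into a scale-up signal; the pg Pool, the 30-second interval, the 10,000-job threshold, and the scaleUpWorkers helper are all assumptions:

import { Pool } from "pg"

const pool = new Pool() // assumes PG* env vars point at the same database graphile-worker uses

// Cheap, approximate row count for the jobs table: this reads pg_class, not the
// jobs table itself, so it doesn't contend with the workers' own queries.
async function estimatedJobCount(): Promise<number> {
  const { rows } = await pool.query(
    "select reltuples::bigint as rough_row_count from pg_class where oid = 'graphile_worker.jobs'::regclass"
  )
  return Number(rows[0].rough_row_count)
}

async function scaleUpWorkers(): Promise<void> {
  // Placeholder: call your platform's API here (e.g. launch extra runOnce workers).
}

// Poll the estimate periodically and ask for more capacity past a threshold.
setInterval(async () => {
  if ((await estimatedJobCount()) > 10_000) {
    await scaleUpWorkers()
  }
}, 30_000)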

The existing WorkerEvents lets us know that there are no work items to be run but gives no insight into the job / queue depth.

No, but you can use other heuristics, for example looking at runAt and comparing it to the current time. Once there's a gap of more than twice the poll interval between runAt and the time the job actually started, that's a hint we need more workers. If this gap gets really big then we clearly need a lot of workers to work through the backlog. Once we get the idle event we know we can reduce the number of workers back down again.
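
As a rough sketch of that heuristic, using the job:start and worker:getJob:empty events that appear later in this thread; the 2x-pollInterval threshold and the requestMoreWorkers/requestScaleDown helpers are illustrative assumptions, not part of the Worker API:

import { run } from "graphile-worker"

const POLL_INTERVAL_MS = 2000 // keep in sync with your runner's pollInterval

function requestMoreWorkers(latencyMs: number): void {
  // Placeholder: ask your infrastructure for additional (temporary) workers.
}

function requestScaleDown(): void {
  // Placeholder: release the temporary workers again.
}

async function main() {
  const runner = await run({ taskDirectory: `${__dirname}/tasks` })

  runner.events.on("job:start", ({ job }) => {
    // Gap between when the job was due (run_at) and when a worker picked it up.
    const latencyMs = Date.now() - new Date(job.run_at).getTime()
    if (latencyMs > 2 * POLL_INTERVAL_MS) {
      requestMoreWorkers(latencyMs)
    }
  })

  runner.events.on("worker:getJob:empty", () => {
    // No jobs were found to run; a hint that capacity can come back down.
    requestScaleDown()
  })
}

main()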

jakebiesinger-storyhealth (Author) commented

Hey @benjie ! Thanks for the quick response.

the first thing to do would be benchmarking how much doing this would reduce performance. If it reduces by a non-negligible amount (2% or more) then it would need to be opt-in. If more than 10% then I'm unlikely to add it.

Happy to look into this, but can you clarify what you mean by opt-in? I'm imagining a function that a user could call, something like:

CREATE FUNCTION graphile_worker.get_pending_job_count(
    -- All of these parameters are optional and would narrow down the list of jobs being counted
    queue_name text[] DEFAULT NULL::text[],
    task_identifiers text[] DEFAULT NULL::text[],
    job_expiry interval DEFAULT '04:00:00'::interval,
    forbidden_flags text[] DEFAULT NULL::text[]
)
    RETURNS bigint

(except maybe as a workerUtil helper method in the JS API?)

Given it wouldn't actually be used by any of the worker internals, I'm not sure how it wouldn't be opt-in. As for perf, for my use-case, I'd be calling this every 30s to 1m. Are you thinking something like "what's the effect of running this query rapid-fire while the existing perf tests are running?"

The runAt comparison and rough row count ideas are interesting. I'll have to play around a bit to see what they feel like for our usage.

benjie (Member) commented Nov 1, 2023

Given it wouldn't actually be used by any of the worker internals, I'm not sure how it wouldn't be opt-in.

Makes sense 👍 I was thinking more like automatic job queue depth monitoring, which if this were to get landed would be a very likely next request!

Are you thinking something like "what's the effect of running this query rapid-fire while the existing perf tests are running?"

Yes. I assume from this you'd want quite tight granularity because if you're only looking at doing it every 30 seconds then a 15 second lag in jobs executing would be a clear and (performance-wise) free signal that you need to run more workers. Then again, depends how long it takes you to spin up new workers... Actually running node should be under a second, but getting the environment up to run node in might take longer 😉

The runAt comparison and rough row count ideas are interesting. I'll have to play around a bit to see what they feel like for our usage.

Excellent, would love to hear your results!

jakebiesinger-storyhealth (Author) commented Nov 10, 2023

@benjie you're right that the next request is going to be queue depth monitoring... FYI I'm currently looking at handling this (using the v13 schema, but it shouldn't be hard to bump to the new schema) as:

select
  count(*),
  case
    when jobs.locked_at is not null and jobs.locked_at >= (now() - interval '4 hours') then 'leased'
    when jobs.queue_name is not null and exists (
        select 1
        from graphile_worker.job_queues
        where job_queues.queue_name = jobs.queue_name
        and (job_queues.locked_at is not null and job_queues.locked_at >= (now() - interval '4 hours'))
    ) then 'waiting_on_queue'
    when attempts >= max_attempts then 'permanently_failed'
    when attempts = 0 and run_at >= now() then 'future'
    when attempts > 0 and run_at >= now() then 'waiting_to_retry'
    else 'ready'
  end as status
from graphile_worker.jobs
group by status

It turns out the clause for counting how many jobs are ready is about the same amount of work as deriving all the other "status"es. WDYT about having a more-canonical derived status like this? Is this too inefficient? Should there be a (possibly GENERATED) column on the jobs / _private_jobs table instead?

benjie (Member) commented Nov 13, 2023

It can't be generated since that only happens at row write time, and thus cannot factor in the current time (it's only allowed to use immutable functions).

I don't see much value adding this to every request to jobs when it will only be needed in specific circumstances; I think it's fine (performance concerns aside) for you to do this ad-hoc when you need it.

jakebiesinger-storyhealth (Author) commented

Ah, of course. I wrote the above in a hurry on my way to a funeral and wasn't thinking about how this changes over time.

I guess what I'm really asking is would you accept a PR that added an official function for this kind of query? Something like SELECT * FROM graphile_worker.get_job_count_by_status() which would run a more-official version of the above, and still operate in the event of future schema changes.

benjie (Member) commented Nov 14, 2023

I'm extremely hesitant to add anything like that that may encourage people to follow bad patterns (e.g. polling it) and result in reduced queue performance.

jakebiesinger-storyhealth (Author) commented

Understood. It seems you're trying to keep graphile-worker as lean as possible, and letting folks build whatever additional tooling they need using only performant hooks (like the existing event emitters). I guess I balk at the idea of making the next dev discover their own path forward organically without some additional direction, even if that's just entries in the docs about how they might build that tooling. You have lightweight entries around shadow jobs tables and you now have the public jobs table with caveats around performance implications for querying. It took me some time + work to get a monitoring system built on top of graphile-worker -- queue depth monitoring + job scaleup were both requirements before we could consider using graphile-worker in production. I'd love to save the next dev that effort, particularly if I can benefit from having my work migrated for me through the next schema iteration :D

At this point, I'd be happy to contribute additions to the docs to point out, e.g., the job lag time and total number of jobs in the table. A query like the above seems like it'd also be useful in the docs, labelled appropriately with BIG RED WARNINGS about performance implications. I'd also be happy to share / open-source the OpenTelemetry gauges work I've done if it would help with future efforts to provide automated job queue depth monitoring.

Let me know what would be useful.

benjie (Member) commented Nov 14, 2023

All these things sound like great documentation additions 👍 Please file each one separately so we can merge the easy ones and discuss the less easy ones ❤️

i-tu commented Jun 14, 2024

At this point, I'd be happy to contribute additions to the docs to point out, e.g., the job lag time and total number of jobs in the table. A query like the above seems like it'd also be useful in the docs, labelled appropriately with BIG RED WARNINGS about performance implications. I'd also be happy to share / open-source the OpenTelemetry gauges work I've done if it would help with future efforts to provide automated job queue depth monitoring.

Let me know what would be useful.

@jakebiesinger-storyhealth

I'm currently struggling with the same issues and couldn't find anything in the docs. Would you mind sharing some learnings? Even just some sql queries dumped into a gist would be very helpful!

benjie (Member) commented Jun 14, 2024

Essentially the performance of Graphile Worker comes down to very careful querying of the database, and as you scale up it's incredibly easy to upset that by issuing additional queries - Worker is already incredibly heavy on IO if you have a lot of jobs and a lot of workers (hence the work on #474). The act of monitoring Worker's performance, if not done carefully, could result in Worker's performance tanking.

I'd advise that rather than "queue depth" you think in terms of "job latency" - the time between a job being due to execute, and it actually executing. Generally a job latency up to pollInterval + 1 second is probably fine, but if it goes beyond this, it's probably a sign you need more workers (or better IO in your database). This can be monitored using the events system and is effectively free (incredibly low JS cost).

jakebiesinger-storyhealth (Author) commented

I threw together a gist showing how we're using OpenTelemetry metrics with Graphile Worker. I had to rip a fair bit of specialization out of that code, so this variant is untested.

@i-tu I hope it's useful, at least as a jumping-off point for your own needs.

@benjie I'd love to get your eyes on that -- hopefully we're calculating the delays etc correctly. If it's to your liking, feel free to link to it or include the code directly in your docs etc.

i-tu commented Jun 14, 2024

Thanks for the reply! This was a great pointer. I wrote the following code to instrument job events in Sentry:

import * as Sentry from '@sentry/node' // assumed Sentry SDK; swap in whichever Sentry package you use
import { differenceInMilliseconds } from 'date-fns'

runner.events.on('job:start', ({ job }) => {
  // Time spent waiting on queue before starting
  Sentry.metrics.distribution('job_start_lag', differenceInMilliseconds(new Date(), job.run_at), {
    tags: { type: 'worker', task: job.task_identifier },
    unit: 'millisecond',
  })
})

runner.events.on('job:complete', async ({ job }) => {
  // Total time from the scheduled run_at to completion (queue wait + execution)
  Sentry.metrics.distribution(
    'job_completion_time',
    differenceInMilliseconds(new Date(), job.run_at),
    {
      tags: { type: 'worker', task: job.task_identifier },
      unit: 'millisecond',
    }
  )
})

i-tu commented Jun 14, 2024

Thanks so much @jakebiesinger-storyhealth ! 🤩

owlas commented Jun 14, 2024

@benjie is there a recommended (performant) measure for determining when to scale workers from 0->1 and 1->0, without measuring queue depth or active tasks? The worker:getJob:empty event gets me close to 1->0 as long as I can measure how many active jobs there are. What about scaling from 0->1 workers? Is there a way to determine the queue size without polling the jobs table manually?

benjie (Member) commented Jun 14, 2024

Graphile Worker is not designed to scale to zero; it's designed to always have at least one running worker. Technically you can scale it to zero (you won't lose tasks), but that's not an aim. Of course, if you are scaled to zero then there are no worries about reading from the jobs view causing read contention, because there's nothing to contend with, so have at it! Just only do it when you've got no workers running. TBH it's probably fine with one or two workers running too; it's when you scale higher that it'd become a concern.
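
A minimal sketch of that zero-workers check, run only by the scaling controller while no workers are up; the query assumes the public graphile_worker.jobs view exposes run_at, and startWorker is a placeholder for your platform's "launch one worker" call:

import { Pool } from "pg"

const pool = new Pool()

// Safe to run a direct count here because, with zero workers, there is
// nothing to contend with (per the comment above).
// Note: this also counts permanently-failed jobs; filter further if that matters to you.
async function hasRunnableJobs(): Promise<boolean> {
  const { rows } = await pool.query(
    "select count(*)::int as n from graphile_worker.jobs where run_at <= now()"
  )
  return rows[0].n > 0
}

async function startWorker(): Promise<void> {
  // Placeholder: start a single worker (e.g. a runOnce container) on your platform.
}

export async function maybeScaleFromZero(): Promise<void> {
  if (await hasRunnableJobs()) {
    await startWorker()
  }
}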
