conductor: add heartbeat monitor for background workers #1023
base: main
#[derive(Clone)]
pub struct HeartbeatUpdater {
    shared_heartbeat: Arc<AtomicU64>,
Went with a lock-free approach rather than something like Arc<RwLock<Instant>>, because the issue with workers getting stuck might stem from lock contention, so adding more lock contention probably wouldn't help.
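To make the lock-free idea concrete, here is a minimal sketch of what such an updater could look like. Only the struct, the `shared_heartbeat` field, and the `store(current_timestamp(), Ordering::Relaxed)` call appear in the diff; the `new`, `update`, and `last_update` methods and the body of `current_timestamp` are assumptions made for illustration and may differ from the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Seconds since the Unix epoch. Assumed implementation: the PR's
/// `current_timestamp` is not shown and may work differently.
fn current_timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

#[derive(Clone)]
pub struct HeartbeatUpdater {
    shared_heartbeat: Arc<AtomicU64>,
}

impl HeartbeatUpdater {
    /// Hypothetical constructor: starts with the current time as the first beat.
    pub fn new() -> Self {
        Self {
            shared_heartbeat: Arc::new(AtomicU64::new(current_timestamp())),
        }
    }

    /// Records "now" with a single atomic store; it never waits on a lock.
    pub fn update(&self) {
        self.shared_heartbeat
            .store(current_timestamp(), Ordering::Relaxed);
    }

    /// Hypothetical reader: loads the last recorded timestamp without locking.
    pub fn last_update(&self) -> u64 {
        self.shared_heartbeat.load(Ordering::Relaxed)
    }
}
```

Because every clone shares the same `Arc<AtomicU64>`, a worker can hold one clone and a health endpoint another, and neither side can block the other.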
        self.shared_heartbeat
            .store(current_timestamp(), Ordering::Relaxed);
    }
}
I think it's better to keep this as something manually updated rather than being updated by yet another background thread
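A rough sketch of why the manual update matters: if the worker stores the heartbeat itself after each unit of work, the value stops advancing as soon as a job hangs, whereas a separate ticker thread would keep beating regardless. The `run_worker` and `demo_manual_heartbeat` functions, the job list, and the injected `now` clock are all hypothetical and exist only to keep the example deterministic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical worker loop: the heartbeat is stored by the worker itself,
/// once per completed job, so it stops advancing if `process` never returns.
fn run_worker(
    heartbeat: Arc<AtomicU64>,
    jobs: Vec<u64>,
    mut process: impl FnMut(u64),
    mut now: impl FnMut() -> u64,
) {
    for job in jobs {
        process(job);
        // Manual update, performed on the worker's own thread.
        heartbeat.store(now(), Ordering::Relaxed);
    }
}

/// Runs the worker on its own thread and returns the last heartbeat it stored.
fn demo_manual_heartbeat() -> u64 {
    let heartbeat = Arc::new(AtomicU64::new(0));
    let worker_heartbeat = Arc::clone(&heartbeat);
    let worker = thread::spawn(move || {
        // Fake clock that advances by one per call (an assumption for the demo).
        let mut tick = 100u64;
        run_worker(
            worker_heartbeat,
            vec![1, 2, 3],
            |_job| {},
            move || {
                tick += 1;
                tick
            },
        );
    });
    worker.join().expect("worker thread panicked");
    heartbeat.load(Ordering::Relaxed)
}
```

In a real worker, `now` would be a wall-clock read and `process` the actual job handler; the point is only that the store happens on the same thread that does the work.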
if current_time >= last_update {
    let elapsed = Duration::from_secs(current_time - last_update);
    elapsed < self.update_interval * 2
What is the * 2 part for?
To check whether there has been an update within twice the expected update interval.
I figured that, but why 2x? Maybe that should just be part of the update interval config?
Keep in mind there is health check config on the Kubernetes side too, such as how many consecutive failed requests will restart the pod. I
Maybe we could replace self.update_interval with timeout_interval and then just use elapsed < self.timeout_interval. What do you think?
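The proposal could be sketched roughly as follows. `HeartbeatMonitor`, its constructor, and the `is_healthy` signature are hypothetical names; only the `elapsed < self.timeout_interval` comparison comes from the discussion. The branch for a heartbeat that is newer than `current_time` is not shown in the diff, so treating it as healthy here is an assumption.

```rust
use std::time::Duration;

/// Hypothetical monitor holding the single configurable timeout
/// proposed in the review, instead of `update_interval * 2`.
pub struct HeartbeatMonitor {
    timeout_interval: Duration,
}

impl HeartbeatMonitor {
    pub fn new(timeout_interval: Duration) -> Self {
        Self { timeout_interval }
    }

    /// `last_update` and `current_time` are seconds since the Unix epoch.
    pub fn is_healthy(&self, last_update: u64, current_time: u64) -> bool {
        if current_time >= last_update {
            let elapsed = Duration::from_secs(current_time - last_update);
            elapsed < self.timeout_interval
        } else {
            // Assumption: a heartbeat stamped after `current_time` (for example,
            // the worker updated between the two reads) counts as healthy.
            true
        }
    }
}
```

With this shape, the previous behavior for a 30-second update interval corresponds to configuring `timeout_interval` as 60 seconds, and the multiplier becomes an operator's choice rather than a constant in the code.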