conductor: add heartbeat monitor for background workers #1023

Open
wants to merge 2 commits into
base: main

vrmiguel
Member

No description provided.


#[derive(Clone)]
pub struct HeartbeatUpdater {
    shared_heartbeat: Arc<AtomicU64>,
Member Author

I went with a lock-free approach rather than something like Arc<RwLock<Instant>>. The workers getting stuck might stem from lock contention, so adding another lock probably wouldn't help.
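For readers following the thread, here is a minimal sketch of what the lock-free variant might look like. Only the struct, its field, and the store call with `Ordering::Relaxed` appear in the diff; the method names `new`, `update_heartbeat`, and `last_heartbeat`, as well as the body of `current_timestamp`, are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Assumed helper: seconds since the Unix epoch (0 if the clock is before it).
fn current_timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

#[derive(Clone)]
pub struct HeartbeatUpdater {
    shared_heartbeat: Arc<AtomicU64>,
}

impl HeartbeatUpdater {
    /// Hypothetical constructor: starts with the current time as the first heartbeat.
    pub fn new() -> Self {
        Self {
            shared_heartbeat: Arc::new(AtomicU64::new(current_timestamp())),
        }
    }

    /// Records a heartbeat with a single atomic store; no lock is taken.
    pub fn update_heartbeat(&self) {
        self.shared_heartbeat
            .store(current_timestamp(), Ordering::Relaxed);
    }

    /// Hypothetical reader: returns the last recorded heartbeat with an atomic load.
    pub fn last_heartbeat(&self) -> u64 {
        self.shared_heartbeat.load(Ordering::Relaxed)
    }
}
```

Clones share the same `Arc<AtomicU64>`, so a worker can hold one clone and the health check another. `Ordering::Relaxed` is enough here because the timestamp is the only value being exchanged between threads.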

        self.shared_heartbeat
            .store(current_timestamp(), Ordering::Relaxed);
    }
}
Member Author

I think it's better to keep this as something the worker updates manually, rather than having it updated by yet another background thread.
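A rough sketch of why the manual update matters, under assumptions: `run_worker` and its injected `now` clock are hypothetical names, not code from this PR. Because the worker records the heartbeat itself after each job, a job that hangs also stops the heartbeat, which a separate ticker thread would not guarantee.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical worker loop: runs each job, then records a heartbeat itself.
/// `now` is injected so the sketch does not depend on the wall clock.
pub fn run_worker<J, C>(heartbeat: &AtomicU64, jobs: Vec<J>, now: C)
where
    J: FnOnce(),
    C: Fn() -> u64,
{
    for job in jobs {
        // If this call hangs, the store below never runs and the heartbeat goes stale.
        job();
        heartbeat.store(now(), Ordering::Relaxed);
    }
}
```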

@vrmiguel vrmiguel marked this pull request as ready for review October 21, 2024 18:22

        if current_time >= last_update {
            let elapsed = Duration::from_secs(current_time - last_update);
            elapsed < self.update_interval * 2
Member

What is the * 2 part for?

Member Author

To check whether there has been an update within twice the expected update interval.

Member

I figured that, but why 2x? Maybe that should just be part of the update interval config?

Keep in mind there is health check config on the Kubernetes side too, like how many consecutive failed requests will restart the pod. I

Member Author

Maybe we could replace self.update_interval with timeout_interval and then just use elapsed < self.timeout_interval. What do you think?
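A minimal sketch of the proposal above, assuming a hypothetical `HeartbeatMonitor` type with an `is_alive` method (neither name appears in the diff). The comparison is taken from the hunk under review with the `* 2` folded into a single configurable `timeout_interval`. The `else` branch is not visible in the diff, so treating a backwards clock as healthy is an assumption.

```rust
use std::time::Duration;

/// Hypothetical monitor: one explicit timeout instead of `update_interval * 2`.
pub struct HeartbeatMonitor {
    timeout_interval: Duration,
}

impl HeartbeatMonitor {
    pub fn new(timeout_interval: Duration) -> Self {
        Self { timeout_interval }
    }

    /// `last_update` and `current_time` are seconds since the Unix epoch.
    pub fn is_alive(&self, last_update: u64, current_time: u64) -> bool {
        if current_time >= last_update {
            let elapsed = Duration::from_secs(current_time - last_update);
            elapsed < self.timeout_interval
        } else {
            // Assumption: if the clock went backwards, report healthy
            // rather than trigger a restart.
            true
        }
    }
}
```

With this shape, the tolerance lives in one config value, which can then be chosen together with the Kubernetes probe settings rather than multiplied implicitly in code.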
