agent: Add scaling event reporting #1107
No changes to the coverage.
This is part 2 of 2; see #1078 for the ground work. In short, this commit:
* Adds a new package: 'pkg/agent/scalingevents'
* Adds new callbacks to core.State to allow it to report on scaling events and changes in desired CU.
Remaining items for me, on this:
In the meantime, it should be ok to review.
Looks good, some questions and suggestions.
}

-goalCU := max(cpuGoalCU, memGoalCU, memTotalGoalCU, lfcGoalCU)
+goalCU := uint32(math.Ceil(max(
+	math.Round(cpuGoalCU), // for historical compatibility, use round() instead of ceil()
Do we really need the historical compatibility?
Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.
I'd be happy to change it in a separate PR?
> Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.
Makes sense!
> I'd be happy to change it in a separate PR?
Up to you.
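For readers following the rounding discussion, here is a minimal sketch of the behaviour being preserved. It assumes the truncated diff continues with the same three non-CPU components as the removed line (memGoalCU, memTotalGoalCU, lfcGoalCU) and that only the CPU component is rounded to nearest; the function name and parameters are hypothetical, not taken from the PR.

```go
package main

import (
	"fmt"
	"math"
)

// goalCU folds the fractional per-component goals into one integer goal CU.
// Assumption: only the CPU component is rounded to nearest before taking the
// max (the "historical compatibility" case); the result is then rounded up.
func goalCU(cpu, mem, memTotal, lfc float64) uint32 {
	return uint32(math.Ceil(max(math.Round(cpu), mem, memTotal, lfc)))
}

func main() {
	fmt.Println(goalCU(2.4, 1.0, 1.0, 1.0)) // CPU-bound: 2.4 rounds to 2
	fmt.Println(goalCU(1.0, 2.4, 1.0, 1.0)) // memory-bound: 2.4 is rounded up to 3
}
```

Under these assumptions the same fractional value yields a different goal depending on which component it comes from, which is the asymmetry a follow-up PR would remove.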
// This exists because Neon allows fractional compute units, while the autoscaler-agent acts on
// integer multiples of a smaller compute unit.
I never paid attention to this difference before, I'd like to discuss it more broadly.
pkg/agent/scalingevents/reporter.go
Outdated
ClusterName string `json:"clusterName"`
RegionName string `json:"regionName"`
What do those mean?
Clarified in d8397d4. As for "why", refer to neondatabase/cloud#15939.
Neither explains the difference between them. Are they equal at all times? Maybe we need only one?
// Returns a function to generate keys for the placement of scaling events data into blob storage.
//
// Example: prefix/2024/10/31/23/events_{uuid}.ndjson.gz (11pm on halloween, UTC)
func newBlobStorageKeyGenerator(prefix string) func() string {
Can we reuse the same key generator as we have for billing?
We could, but we'd end up changing the format. I figured basically:
- It's a little annoying to make the same key generator available to both packages; and
- It's probably better to use the same key format as what proxy provides, rather than matching what we use for billing
Not 100% sure though. WDYT?
IIRC, the billing key generator was modeled after billing in proxy. So I guess we have separate formats for billing and reporting, but those formats are the same between autoscaling and proxy?
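To make the key format under discussion concrete, here is a rough sketch of a generator producing keys shaped like the example in the doc comment. It is not the PR's implementation: the injected clock, the halloween helper, and the hex-encoded random ID standing in for a UUID are all assumptions made so the example needs only the standard library.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// halloween is a fixed clock matching the example in the doc comment under review.
func halloween() time.Time {
	return time.Date(2024, time.October, 31, 23, 0, 0, 0, time.UTC)
}

// newKeyGenerator returns a function that builds keys shaped like
// prefix/2024/10/31/23/events_{id}.ndjson.gz. The clock is injected so the
// output can be checked deterministically.
func newKeyGenerator(prefix string, now func() time.Time) func() string {
	return func() string {
		// Hypothetical stand-in for a UUID: 16 random bytes, hex-encoded.
		var id [16]byte
		if _, err := rand.Read(id[:]); err != nil {
			panic(err)
		}
		// Year/month/day/hour path segments, always in UTC.
		hour := now().UTC().Format("2006/01/02/15")
		return fmt.Sprintf("%s/%s/events_%s.ndjson.gz", prefix, hour, hex.EncodeToString(id[:]))
	}
}

func main() {
	gen := newKeyGenerator("prefix", halloween)
	fmt.Println(gen()) // e.g. prefix/2024/10/31/23/events_<32 hex chars>.ndjson.gz
}
```

Grouping keys by hour means a consumer can list one prefix per hour, while the random suffix keeps concurrent writers from colliding.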
lastParts *scalingevents.GoalCUComponents
}

func (rl *desiredScalingReportLimiter) report(
A potential alternative would be to have
type Skipper interface {
	Skip(event ScalingEvent) bool
}
injected into Reporter, and it would be called in Submit().
This would allow implementing the limiter in a more generic way, for any type of event. Plus, there would be no need to pass so many arguments when we can pass only ScalingEvent.
Hm, that's a fair point. I agree the current situation is overly complex; I'm not sure that adding a Skipper interface makes it simpler.
I'll think about it and get back to you here...
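As a reference point for this thread, the Skipper idea could be sketched roughly as follows. Only the Skipper interface itself comes from the suggestion above; the simplified ScalingEvent, the Reporter fields, and the unchangedGoalSkipper policy are hypothetical and much smaller than what the PR needs.

```go
package main

import "fmt"

// ScalingEvent is a simplified, hypothetical event type.
type ScalingEvent struct {
	Kind   string
	GoalCU uint32
}

// Skipper decides whether an event should be dropped before it is enqueued.
type Skipper interface {
	Skip(event ScalingEvent) bool
}

// unchangedGoalSkipper skips events whose goal CU equals the last accepted one.
type unchangedGoalSkipper struct {
	last *ScalingEvent
}

func (s *unchangedGoalSkipper) Skip(event ScalingEvent) bool {
	if s.last != nil && s.last.GoalCU == event.GoalCU {
		return true
	}
	s.last = &event
	return false
}

// Reporter is a hypothetical stand-in for the real reporter; it only queues events.
type Reporter struct {
	skipper Skipper
	queue   []ScalingEvent
}

// Submit enqueues the event unless the injected Skipper rejects it.
func (r *Reporter) Submit(event ScalingEvent) {
	if r.skipper != nil && r.skipper.Skip(event) {
		return
	}
	r.queue = append(r.queue, event)
}

func main() {
	r := &Reporter{skipper: &unchangedGoalSkipper{}}
	for _, cu := range []uint32{2, 2, 3} {
		r.Submit(ScalingEvent{Kind: "hypothetical", GoalCU: cu})
	}
	fmt.Println(len(r.queue)) // 2: the repeated goal of 2 is skipped
}
```

The trade-off raised in the reply still applies: the limiter state moves into the Skipper implementation rather than disappearing, so this is simpler only if the event type already carries everything the policy needs.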
@@ -322,6 +333,102 @@ func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util
}
}

func (r *Runner) reportScalingEvent(timestamp time.Time, currentCU, targetCU uint32) {
Suggestion: use more consistent names. Right now there is Real vs Hypothetical and ScalingEvent vs DesiredScaling.
reportRealScaling and reportHypotheticalScaling would be fine, but that suggests a hypothetical value cannot be real. Maybe Actual vs Desired?
+1 on more consistent naming. Let me explain where I'm coming from; I want your thoughts:
IIRC, the thing I was trying to distinguish here is that the "hypothetical"/"desired" scaling events can go far beyond the endpoint limits, and also contain the fractional CU values for each "part" (cpu/mem/lfc). We also change these events if the components change significantly, even if the overall scaling is still the same (so: it's not like each "hypothetical"/"desired" scaling event constitutes actual scaling)
I think I opted not to go fully for "desired" scaling also because IIRC "desired" is used elsewhere in the autoscaler-agent to mean "the scaling value we should be working towards", and is restricted by endpoint CU limits.
Thoughts? Happy to go with Actual vs Desired if you think that makes more sense.
"desired" is used elsewhere in the autoscaler-agent
Good point, let's not use desired then. I like Actual and Hypothetical.
One more option I just thought about is Actual vs SuggestedScaling/ScalingSuggestion.
ScalingEvent ReportScalingEventCallback
DesiredScaling ReportDesiredScalingCallback
Suggestion: maybe instead of having those as callbacks, define a new adapter interface, and pass it like this?
Passing interface feels more idiomatic.
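The snippet attached to this suggestion did not survive on this page, so the following is only a guess at its shape. Every name here is hypothetical: the ScalingReporter interface, its method signatures (the first mirrors reportScalingEvent from the diff), the recordingReporter test double, and the tiny state type standing in for core.State.

```go
package main

import (
	"fmt"
	"time"
)

// ScalingReporter is a hypothetical adapter replacing the two callbacks.
type ScalingReporter interface {
	ReportScalingEvent(timestamp time.Time, currentCU, targetCU uint32)
	ReportDesiredScaling(timestamp time.Time, goalCU uint32)
}

// recordingReporter is a test double that stores one line per report.
type recordingReporter struct{ lines []string }

func (r *recordingReporter) ReportScalingEvent(_ time.Time, currentCU, targetCU uint32) {
	r.lines = append(r.lines, fmt.Sprintf("scaling %d -> %d", currentCU, targetCU))
}

func (r *recordingReporter) ReportDesiredScaling(_ time.Time, goalCU uint32) {
	r.lines = append(r.lines, fmt.Sprintf("desired %d", goalCU))
}

// state stands in for core.State; it depends only on the interface.
type state struct {
	reporter ScalingReporter
	maxCU    uint32
}

// update reports the uncapped goal, then the capped change if there is one.
func (s *state) update(now time.Time, currentCU, goalCU uint32) uint32 {
	s.reporter.ReportDesiredScaling(now, goalCU)
	target := min(goalCU, s.maxCU)
	if target != currentCU {
		s.reporter.ReportScalingEvent(now, currentCU, target)
	}
	return target
}

// exampleTime is a fixed timestamp for the demo.
var exampleTime = time.Date(2024, time.October, 31, 23, 0, 0, 0, time.UTC)

func main() {
	rec := &recordingReporter{}
	s := &state{reporter: rec, maxCU: 4}
	s.update(exampleTime, 2, 8)
	fmt.Println(rec.lines) // [desired 8 scaling 2 -> 4]
}
```

One practical upside of the interface form is that tests can pass a single recording double instead of wiring two closures.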
@@ -727,8 +736,20 @@ func (s *state) desiredResourcesFromMetricsOrRequestedUpscaling(now time.Time) (
// 2. Cap the goal CU by min/max, etc
// 3. that's it!

reportGoals := func(goalCU uint32, parts scalingevents.GoalCUComponents) {
Suggestion: Instead of having a callback here, we could merge scalingevents.GoalCUComponents into scalingGoal, define a method GoalCU() on it that returns the max, and pass this object to the DesiredScaling callback.
WDYT?
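A minimal sketch of what this merge might look like, assuming scalingGoal simply holds the four fractional components named in the earlier diff. The field names, the callback type, and the decision to keep CPU rounding inside GoalCU() are assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"math"
)

// scalingGoal carries the fractional per-component goals (field names are hypothetical).
type scalingGoal struct {
	CPU, Mem, MemTotal, LFC float64
}

// GoalCU returns the overall goal: the largest component, rounded up.
// The CPU component is rounded to nearest first, as in the diff under review.
func (g scalingGoal) GoalCU() uint32 {
	return uint32(math.Ceil(max(math.Round(g.CPU), g.Mem, g.MemTotal, g.LFC)))
}

// reportDesiredScaling is a hypothetical callback taking the whole goal object.
type reportDesiredScaling func(goal scalingGoal)

func main() {
	var report reportDesiredScaling = func(goal scalingGoal) {
		fmt.Printf("goal=%d parts=%+v\n", goal.GoalCU(), goal)
	}
	report(scalingGoal{CPU: 1.2, Mem: 2.5, MemTotal: 0.5, LFC: 0.25})
}
```

With this shape the reporter receives one value and derives the integer goal itself, so the separate reportGoals closure would no longer need two arguments.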
Co-authored-by: Oleg Vasilev <[email protected]>
A few more questions + some previous are still open. Should be good to go soon.
@@ -26,6 +26,10 @@ const (
AnnotationAutoscalingBounds = "autoscaling.neon.tech/bounds"
AnnotationAutoscalingConfig = "autoscaling.neon.tech/config"
AnnotationBillingEndpointID = "autoscaling.neon.tech/billing-endpoint-id"

// ref cloud#15939; to be removed after rollout is complete.
> labels being opt-in means the release process doesn't need to be blocked on testing scaling event reporting
But the global configuration is also opt-in, so we can always roll out with scaling reporting disabled.
> being able to enable only for a single VM should make it easier to test
Why? I think it is unlikely to disrupt normal operations, so we might as well enable it globally.
> It's also in part because I'd like to get #1108 released without needing to coordinate testing of this PR 😅
Is it because, to make #1108 work, we have to enable the current changes globally but don't have to set labels?
if c.ScalingEvents.Clients.S3 != nil {
validateBaseReportingConfig(&c.ScalingEvents.Clients.S3.BaseClientConfig, "scalingEvents.clients.s3")
validateS3ReportingConfig(&c.ScalingEvents.Clients.S3.S3ClientConfig, ".scalingEvents.clients.s3")
erc.Whenf(ec, c.ScalingEvents.Clients.S3.PrefixInBucket == "", emptyTmpl, ".scalingEvents.clients.s3.prefixInBucket")
Can PrefixInBucket be in validateS3ReportingConfig?
func NewPromMetrics() PromMetrics {
return PromMetrics{
reporting: reporting.NewEventSinkMetrics("autoscaling_agent_events"),
totalCount: prometheus.NewGauge(prometheus.GaugeOpts{
Is totalCount incremented anywhere?
This is part 2 of 2; see #1078 for the ground work and neondatabase/cloud#15939 for the full context.
In short, this PR:
- Adds a new package: pkg/agent/scalingevents
- Adds new callbacks to core.State to allow it to report on scaling events and changes in desired CU.

Notes for review:
I'd like to add minio-based S3 tests to this, but it seemed like it'd be non-trivial, particularly because scaling events actually require that there's scaling that happens — unlike the existing billing tests.
So I figured I'd open this for review in the meantime.
Also note: This PR builds on #1078 and must not be merged before it.