agent: Add scaling event reporting #1107

sharnoff · 2024-10-12T01:48:23Z

This is part 2 of 2; see #1078 for the groundwork and neondatabase/cloud#15939 for the full context.

In short, this PR:

* Adds a new package: pkg/agent/scalingevents
* Adds new callbacks to core.State to allow it to report on scaling events and changes in desired CU.

Notes for review:

I'd like to add minio-based S3 tests to this, but it seemed like it'd be non-trivial, particularly because scaling events actually require that there's scaling that happens — unlike the existing billing tests.

So I figured I'd open this for review in the meantime.

~~Also note: This PR builds on #1078 and must not be merged before it.~~

pkg/agent/scalingevents/reporter.go

github-actions · 2024-10-12T01:51:51Z

No changes to the coverage.


This is part 2 of 2; see #1078 for the groundwork. In short, this commit: * Adds a new package: 'pkg/agent/scalingevents' * Adds new callbacks to core.State to allow it to report on scaling events and changes in desired CU.

sharnoff · 2024-11-19T19:43:53Z

Remaining items for me, on this:

Add more thorough e2e tests
Test on staging

In the meantime, it should be ok to review.

Omrigan

Looks good, some questions and suggestions.

pkg/agent/config.go

Omrigan · 2024-11-27T11:32:42Z

pkg/agent/core/goalcu.go

 	}

-	goalCU := max(cpuGoalCU, memGoalCU, memTotalGoalCU, lfcGoalCU)
+	goalCU := uint32(math.Ceil(max(
+		math.Round(cpuGoalCU), // for historical compatibility, use round() instead of ceil()


Do we really need the historical compatibility?

Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.

I'd be happy to change it in a separate PR?

> Probably not, but I didn't want a scaling algorithm change to be a side-effect of this PR.

Makes sense!

> I'd be happy to change it in a separate PR?

Up to you.
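For illustration, a minimal sketch of what the "historical compatibility" rounding changes in practice. Only the `math.Round(cpuGoalCU)` line is visible in the diff; passing the other components through unrounded, and collapsing them to three arguments, are assumptions made for this example. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// goalCUCompat follows the shape of the diff above: the CPU component is
// rounded (historical behavior), then the overall maximum is rounded up.
// Assumption: the non-CPU components are passed through unrounded.
func goalCUCompat(cpu, mem, lfc float64) uint32 {
	return uint32(math.Ceil(max(math.Round(cpu), mem, lfc)))
}

// goalCUCeilOnly is the hypothetical alternative discussed in the thread,
// with the compatibility rounding removed.
func goalCUCeilOnly(cpu, mem, lfc float64) uint32 {
	return uint32(math.Ceil(max(cpu, mem, lfc)))
}

func main() {
	// A fractional CPU goal of 2.4 CU rounds down to 2 under the compat rule,
	// but becomes 3 once only ceil() is applied.
	fmt.Println(goalCUCompat(2.4, 1.2, 0.5), goalCUCeilOnly(2.4, 1.2, 0.5))
}
```

The two functions only disagree when the CPU component is the largest one and its fractional part is below 0.5, which is why dropping the rounding would count as a scaling algorithm change.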

Omrigan · 2024-11-27T13:29:31Z

pkg/agent/scalingevents/reporter.go

+	// This exists because Neon allows fractional compute units, while the autoscaler-agent acts on
+	// integer multiples of a smaller compute unit.


I never paid attention to this difference before, I'd like to discuss it more broadly.

Omrigan · 2024-11-27T13:30:40Z

pkg/agent/scalingevents/reporter.go

+	ClusterName string `json:"clusterName"`
+	RegionName  string `json:"regionName"`


What do those mean?

Clarified in d8397d4. As for "why", refer to neondatabase/cloud#15939.

Neither explains the difference between them. Are they equal at all times? Maybe we need only one?

pkg/agent/runner.go

Omrigan · 2024-11-27T14:20:47Z

pkg/agent/scalingevents/clients.go

+// Returns a function to generate keys for the placement of scaling events data into blob storage.
+//
+// Example: prefix/2024/10/31/23/events_{uuid}.ndjson.gz (11pm on halloween, UTC)
+func newBlobStorageKeyGenerator(prefix string) func() string {


Can we reuse the same key generator as we have for billing?

We could, but we'd end up changing the format. I figured basically:

1. It's a little annoying to make the same key generator available to both packages; and
2. It's probably better to use the same key format as what proxy provides, rather than matching what we use for billing

Not 100% sure though. WDYT?

IIRC, the billing key generator was modeled after billing in proxy. So I guess we have separate formats for billing and reporting, but those formats are the same between autoscaling and proxy?
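As a reference point for the key format under discussion, here is a rough sketch of a generator that produces keys shaped like the example in the doc comment (`prefix/2024/10/31/23/events_{uuid}.ndjson.gz`). It is not the PR's implementation: the name `newKeyGenerator`, the injected clock, and the injected ID function are assumptions made so the example is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// newKeyGenerator returns a function that builds blob storage keys of the form
// prefix/YYYY/MM/DD/HH/events_{id}.ndjson.gz, using UTC time.
// The clock and ID source are injected here purely for testability (assumption).
func newKeyGenerator(prefix string, now func() time.Time, newID func() string) func() string {
	return func() string {
		t := now().UTC()
		return fmt.Sprintf("%s/%04d/%02d/%02d/%02d/events_%s.ndjson.gz",
			prefix, t.Year(), int(t.Month()), t.Day(), t.Hour(), newID())
	}
}

// fixedClock returns a clock that always reports the same instant.
func fixedClock(t time.Time) func() time.Time {
	return func() time.Time { return t }
}

// halloween matches the example from the doc comment: 11pm on 2024-10-31, UTC.
var halloween = time.Date(2024, time.October, 31, 23, 0, 0, 0, time.UTC)

func main() {
	gen := newKeyGenerator("prefix", fixedClock(halloween), func() string { return "example-id" })
	fmt.Println(gen()) // prefix/2024/10/31/23/events_example-id.ndjson.gz
}
```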

Omrigan · 2024-11-27T14:26:22Z

pkg/agent/runner.go

+	lastParts  *scalingevents.GoalCUComponents
+}
+
+func (rl *desiredScalingReportLimiter) report(


A potential alternative would be to have

type Skipper interface { Skip(event ScalingEvent) bool }

injected into Reporter and called in Submit().

This would allow the limiter to be implemented in a more generic way, for any type of event. It would also avoid passing so many arguments, since only the ScalingEvent would need to be passed.

Hm, that's a fair point. I agree the current situation is overly complex; I'm not sure that adding a Skipper interface makes it simpler.

I'll think about it and get back to you here...
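To make the suggestion concrete, a minimal sketch of the Skipper idea. Only the `Skipper` interface itself comes from the comment above; the `ScalingEvent` fields, the `Reporter` shape, the `sink` callback, and the `skipRepeats` implementation are all hypothetical.

```go
package main

import "fmt"

// ScalingEvent is a trimmed-down, hypothetical event type for this sketch.
type ScalingEvent struct {
	CurrentCU uint32
	TargetCU  uint32
}

// Skipper is the interface proposed in the review comment.
type Skipper interface {
	Skip(event ScalingEvent) bool
}

// Reporter forwards events to a sink unless the injected Skipper rejects them.
type Reporter struct {
	skipper Skipper
	sink    func(ScalingEvent)
}

// Submit consults the Skipper (if any) before passing the event on.
func (r *Reporter) Submit(event ScalingEvent) {
	if r.skipper != nil && r.skipper.Skip(event) {
		return
	}
	r.sink(event)
}

// skipRepeats is one possible Skipper: it drops an event that is identical
// to the previous event it let through.
type skipRepeats struct {
	last *ScalingEvent
}

func (s *skipRepeats) Skip(event ScalingEvent) bool {
	if s.last != nil && *s.last == event {
		return true
	}
	s.last = &event
	return false
}

func main() {
	var sent []ScalingEvent
	r := &Reporter{
		skipper: &skipRepeats{},
		sink:    func(e ScalingEvent) { sent = append(sent, e) },
	}
	r.Submit(ScalingEvent{CurrentCU: 1, TargetCU: 2})
	r.Submit(ScalingEvent{CurrentCU: 1, TargetCU: 2}) // skipped as a repeat
	r.Submit(ScalingEvent{CurrentCU: 2, TargetCU: 2})
	fmt.Println(len(sent)) // 2
}
```

The trade-off raised in the reply still applies: the filtering logic moves behind an interface, but it is not obviously simpler than a limiter living next to the caller.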

Omrigan · 2024-11-27T14:43:35Z

pkg/agent/runner.go

@@ -322,6 +333,102 @@ func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util
 	}
 }

+func (r *Runner) reportScalingEvent(timestamp time.Time, currentCU, targetCU uint32) {


Suggestion: use more consistent names; right now there are Real vs Hypothetical and ScalingEvent vs DesiredScaling.

reportRealScaling and reportHypotheticalScaling would be fine, but that suggests a hypothetical value cannot be real. Maybe Actual vs Desired?

+1 on more consistent naming. Let me explain where I'm coming from; I want your thoughts:

IIRC, the thing I was trying to distinguish here is that the "hypothetical"/"desired" scaling events can go far beyond the endpoint limits, and also contain the fractional CU values for each "part" (cpu/mem/lfc). We also change these events if the components change significantly, even if the overall scaling is still the same (so: it's not like each "hypothetical"/"desired" scaling event constitutes actual scaling)

I think I opted not to go fully for "desired" scaling also because IIRC "desired" is used elsewhere in the autoscaler-agent to mean "the scaling value we should be working towards", and is restricted by endpoint CU limits.

Thoughts? Happy to go with Actual vs Desired if you think that makes more sense.

> "desired" is used elsewhere in the autoscaler-agent

Good point, let's not use desired then. I like Actual and Hypothetical.

One more option I just thought about is Actual vs SuggestedScaling/ScalingSuggestion.
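The behavior described above (re-reporting when the components change significantly, even if the overall goal is unchanged) can be sketched roughly as follows. This is not the PR's `desiredScalingReportLimiter`: the field names, the three-component struct, and the absolute-difference threshold are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// GoalCUComponents holds the fractional per-part goals (hypothetical fields).
type GoalCUComponents struct {
	CPU, Mem, LFC float64
}

// reportLimiter decides whether a hypothetical/desired scaling value is worth
// reporting, based on the last value that was actually reported.
type reportLimiter struct {
	threshold float64 // minimum per-component change that counts as significant
	lastGoal  uint32
	lastParts *GoalCUComponents
}

// shouldReport returns true for the first value, when the overall goal CU
// changes, or when any component moves by at least the threshold.
func (rl *reportLimiter) shouldReport(goalCU uint32, parts GoalCUComponents) bool {
	changed := rl.lastParts == nil || goalCU != rl.lastGoal
	if !changed {
		changed = math.Abs(parts.CPU-rl.lastParts.CPU) >= rl.threshold ||
			math.Abs(parts.Mem-rl.lastParts.Mem) >= rl.threshold ||
			math.Abs(parts.LFC-rl.lastParts.LFC) >= rl.threshold
	}
	if changed {
		rl.lastGoal = goalCU
		rl.lastParts = &parts
	}
	return changed
}

func main() {
	rl := &reportLimiter{threshold: 0.5}
	fmt.Println(rl.shouldReport(2, GoalCUComponents{CPU: 1.6, Mem: 1.0, LFC: 0.5})) // true: first value
	fmt.Println(rl.shouldReport(2, GoalCUComponents{CPU: 1.7, Mem: 1.0, LFC: 0.5})) // false: small change
	fmt.Println(rl.shouldReport(2, GoalCUComponents{CPU: 2.2, Mem: 1.0, LFC: 0.5})) // true: CPU moved by 0.6
}
```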

Omrigan · 2024-11-27T14:46:52Z

pkg/agent/core/state.go

+	ScalingEvent   ReportScalingEventCallback
+	DesiredScaling ReportDesiredScalingCallback


Suggestion: maybe instead of having those as callbacks, define a new adapter interface, and pass it like this?

Passing an interface feels more idiomatic.
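One way the adapter-interface idea might look, as a sketch only. The `ReportScalingEvent` parameters follow the `reportScalingEvent(timestamp, currentCU, targetCU)` signature shown elsewhere in this PR; the interface name, the `ReportDesiredScaling` parameters, and the `state.update` method are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// ScalingReporter is a hypothetical adapter interface replacing the two
// callback fields (ScalingEvent and DesiredScaling).
type ScalingReporter interface {
	ReportScalingEvent(timestamp time.Time, currentCU, targetCU uint32)
	ReportDesiredScaling(timestamp time.Time, goalCU uint32)
}

// recordingReporter is a test double that records what it was asked to report.
type recordingReporter struct {
	lines []string
}

func (r *recordingReporter) ReportScalingEvent(ts time.Time, currentCU, targetCU uint32) {
	r.lines = append(r.lines, fmt.Sprintf("scaling %d->%d", currentCU, targetCU))
}

func (r *recordingReporter) ReportDesiredScaling(ts time.Time, goalCU uint32) {
	r.lines = append(r.lines, fmt.Sprintf("desired %d", goalCU))
}

// state stands in for core.State; it only knows about the interface.
type state struct {
	reporter ScalingReporter
}

// update reports the uncapped goal, then the actual scaling after applying the
// endpoint maximum.
func (s *state) update(now time.Time, currentCU, goalCU, maxCU uint32) {
	s.reporter.ReportDesiredScaling(now, goalCU)
	targetCU := min(goalCU, maxCU)
	if targetCU != currentCU {
		s.reporter.ReportScalingEvent(now, currentCU, targetCU)
	}
}

// exampleTime is an arbitrary fixed timestamp for the demo.
var exampleTime = time.Date(2024, time.November, 27, 14, 46, 52, 0, time.UTC)

func main() {
	rec := &recordingReporter{}
	s := &state{reporter: rec}
	s.update(exampleTime, 1, 6, 4)
	fmt.Println(rec.lines) // [desired 6 scaling 1->4]
}
```

A single interface value also makes it straightforward to substitute a recording implementation in tests, as `recordingReporter` does here.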

Omrigan · 2024-11-27T14:49:20Z

pkg/agent/core/state.go

@@ -727,8 +736,20 @@ func (s *state) desiredResourcesFromMetricsOrRequestedUpscaling(now time.Time) (
 	// 2. Cap the goal CU by min/max, etc
 	// 3. that's it!

+	reportGoals := func(goalCU uint32, parts scalingevents.GoalCUComponents) {


Suggestion: Instead of having a callback here, we could merge scalingevents.GoalCUComponents into scalingGoal, define a GoalCU() method on it that returns the max, and pass this object to the DesiredScaling callback.

WDYT?

Hm, how will that interact with things like #1129 / #1140 ? Otherwise I like this idea, I think it's a lot simpler.

> Hm, how will that interact with things like #1129 / #1140 ?

I don't think there is significant interaction: both PRs make scalingGoal public, and that should be it.
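A sketch of the merged type under discussion, assuming the components are plain fractional fields and that `GoalCU()` keeps the rounding shown in the goalcu.go diff. The exported `ScalingGoal` name, its fields, and the `DesiredScalingCallback` signature are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// ScalingGoal is a hypothetical merged type: it carries the fractional
// per-component goals and derives the overall goal CU from them.
type ScalingGoal struct {
	CPU, Mem, LFC float64
}

// GoalCU returns the overall goal: the maximum component, rounded up.
// The CPU component is rounded first, matching the goalcu.go diff (assumption
// that this compatibility behavior is kept).
func (g ScalingGoal) GoalCU() uint32 {
	return uint32(math.Ceil(max(math.Round(g.CPU), g.Mem, g.LFC)))
}

// DesiredScalingCallback receives the whole goal object instead of a goal CU
// plus a separate components argument.
type DesiredScalingCallback func(goal ScalingGoal)

func main() {
	var report DesiredScalingCallback = func(goal ScalingGoal) {
		fmt.Printf("goal=%d cpu=%.1f mem=%.1f lfc=%.1f\n",
			goal.GoalCU(), goal.CPU, goal.Mem, goal.LFC)
	}
	report(ScalingGoal{CPU: 1.2, Mem: 2.3, LFC: 0.4}) // goal=3 cpu=1.2 mem=2.3 lfc=0.4
}
```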

Co-authored-by: Oleg Vasilev <[email protected]>

Omrigan

A few more questions, and some previous ones are still open. Should be good to go soon.

Omrigan · 2025-01-16T13:20:33Z

pkg/api/vminfo.go

@@ -26,6 +26,10 @@ const (
 	AnnotationAutoscalingBounds   = "autoscaling.neon.tech/bounds"
 	AnnotationAutoscalingConfig   = "autoscaling.neon.tech/config"
 	AnnotationBillingEndpointID   = "autoscaling.neon.tech/billing-endpoint-id"
+
+	// ref cloud#15939; to be removed after rollout is complete.


> labels being opt-in means the release process doesn't need to be blocked on testing scaling event reporting

But the global configuration is also opt-in, so we can always roll out with scaling reporting disabled.

> being able to enable only for a single VM should make it easier to test

Why? I think it is unlikely to disrupt normal operations, so we might as well enable it globally.

> It's also in part because I'd like to get #1108 released without needing to coordinate testing of this PR 😅

Is it because, to make #1108 work, we have to enable the current changes globally, but we don't have to set labels?

Omrigan · 2025-01-16T17:05:08Z

pkg/agent/config.go

+	if c.ScalingEvents.Clients.S3 != nil {
+		validateBaseReportingConfig(&c.ScalingEvents.Clients.S3.BaseClientConfig, "scalingEvents.clients.s3")
+		validateS3ReportingConfig(&c.ScalingEvents.Clients.S3.S3ClientConfig, ".scalingEvents.clients.s3")
+		erc.Whenf(ec, c.ScalingEvents.Clients.S3.PrefixInBucket == "", emptyTmpl, ".scalingEvents.clients.s3.prefixInBucket")


Can the PrefixInBucket check be moved into validateS3ReportingConfig?

Omrigan · 2025-01-17T09:07:35Z

pkg/agent/runner.go

@@ -322,6 +333,102 @@ func (r *Runner) Run(ctx context.Context, logger *zap.Logger, vmInfoUpdated util
 	}
 }

+func (r *Runner) reportScalingEvent(timestamp time.Time, currentCU, targetCU uint32) {


> "desired" is used elsewhere in the autoscaler-agent

Good point, let's not use desired then. I like Actual and Hypothetical.

One more option I just thought about is Actual vs SuggestedScaling/ScalingSuggestion.

Omrigan · 2025-01-17T11:45:13Z

pkg/agent/scalingevents/clients.go

+// Returns a function to generate keys for the placement of scaling events data into blob storage.
+//
+// Example: prefix/2024/10/31/23/events_{uuid}.ndjson.gz (11pm on halloween, UTC)
+func newBlobStorageKeyGenerator(prefix string) func() string {


IIRC, the billing key generator was modeled after billing in proxy. So I guess we have separate formats for billing and reporting, but those formats are the same between autoscaling and proxy?

Omrigan · 2025-01-17T11:50:23Z

pkg/agent/scalingevents/prommetrics.go

+func NewPromMetrics() PromMetrics {
+	return PromMetrics{
+		reporting: reporting.NewEventSinkMetrics("autoscaling_agent_events"),
+		totalCount: prometheus.NewGauge(prometheus.GaugeOpts{


Is totalCount incremented anywhere?


sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 693b601 to a3cf0fa Compare October 12, 2024 21:39

sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from b70150d to 54bfb21 Compare October 12, 2024 21:53

sharnoff mentioned this pull request Oct 12, 2024

agent,billing: Move clients and queue into new pkg/reporting #1078

Merged

sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from f608569 to 1c71a57 Compare October 12, 2024 22:06

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from a3cf0fa to 16c0917 Compare October 12, 2024 22:16

sharnoff force-pushed the sharnoff/scaling-event-reporting-1 branch from a46466d to df54b37 Compare October 17, 2024 17:13

sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from 16c0917 to d2b4d45 Compare October 17, 2024 17:13

sharnoff mentioned this pull request Oct 21, 2024

agent: Add per-VM metric for desired CU(s) #1108

Open

Base automatically changed from sharnoff/scaling-event-reporting-1 to main November 13, 2024 16:50

agent: Add scaling event reporting

8c60b7f


sharnoff force-pushed the sharnoff/scaling-event-reporting-2 branch from d2b4d45 to 8c60b7f Compare November 18, 2024 04:01

sharnoff requested review from a team and Omrigan and removed request for a team November 19, 2024 19:42

Omrigan reviewed Nov 27, 2024


sharnoff and others added 2 commits November 27, 2024 12:18

Merge branch 'main' into scaling-event-reporting-2

8d6df58

Update pkg/agent/runner.go

a6b5960

Co-authored-by: Oleg Vasilev <[email protected]>

Omrigan assigned sharnoff Dec 10, 2024

sharnoff added 5 commits December 27, 2024 10:37

Merge branch 'main' into scaling-event-reporting-2

05c5bff

fix goalcu_test compile error

a802341

unify s3 config validation

2ed3492

add comments on ClusterName/RegionName

d8397d4

fix 2ed3492

a91fb73

sharnoff assigned Omrigan and unassigned sharnoff Jan 6, 2025

Omrigan reviewed Jan 17, 2025
