v0.1 API Review #154
base: v0.0
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: kfswain. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
// The number must be in the range 1 to 65535.
//
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
If applicable, I suggest reducing this port range to a smaller list of well-known ports that users can rely on for firewall configuration purposes. Also, don't allow overlap with other well-known ports like those used for DNS, HTTP/S, etc.
Holding off on changing this one, just to gather consensus on what range we should limit it to. But I do agree with the idea.
It's possible that we could start with a small range and relax it as needed, since going the other direction would be nigh impossible.
@candita this is meant to be a reference to a port number on a Pod. I can't think of any reasonable way to limit that, since Kubernetes has likely scaled far enough that there's probably at least one case of each individual port being in use across the many Kubernetes clusters that exist.
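For illustration only, here is a minimal sketch of what the suggested narrower range could look like using the same kubebuilder markers. The struct name, field name, and the 1024 lower bound are assumptions made for this example, not the actual API:

```go
package v1alpha1

// Hypothetical example: a port field restricted to the non-well-known range,
// as suggested above. Names and bounds are illustrative only.
type PortRangeExampleSpec struct {
	// TargetPortNumber is the port on the backing Pods that inference
	// traffic is forwarded to. Restricting it to 1024-65535 avoids overlap
	// with well-known ports such as DNS (53) and HTTP/S (80/443).
	//
	// +kubebuilder:validation:Minimum=1024
	// +kubebuilder:validation:Maximum=65535
	TargetPortNumber int32 `json:"targetPortNumber"`
}
```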
Thanks to everyone for the help reviewing this! Any/all comments are very appreciated. We also went over this at a high level in a meeting with SIG-Net TLs (and others), walking through some slides as a reference point. Since this is starting as an [...], copying SIG-Net TLs and chairs in case any have time for additional review cycles.
@kfswain: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferenceModel is the Schema for the InferenceModels API.
InferenceModel appears to be topical, and of particular importance, so it may be one of the first things a newcomer reads when learning these APIs. It may be beneficial to expand the documentation here to explain more thoroughly the "what" and "why" of it.
Is it a cop-out to reference our site or the docs proposal that goes into a bit more detail? I could see a brief blurb being valuable, with a more detailed explanation offloaded.
We also go into more detail in the spec. Maybe it could be as simple as 'a more detailed description is affixed to the InferenceModelSpec field below'?
A link may be sufficient. As far as the spec section goes, I was thinking maybe higher level, but maybe that's OK as well.
I would like to see something more in the documentation here. I trust your judgement on what that is; just please consider what a newcomer might be looking for when they come here and try to accommodate that. Otherwise, please feel free to consider this comment resolved at your discretion.
Took a stab! LMKWYT
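As a rough illustration of the kind of expanded doc comment being asked for (the wording below is a sketch for this review thread, not the text that actually landed in the PR):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferenceModel is the Schema for the InferenceModels API.
//
// An InferenceModel represents a single model use case: it maps the model
// name that clients request onto one or more backing model versions (for
// example, LoRA adapters) served from an InferencePool, along with serving
// preferences such as criticality. A more detailed description is attached
// to the InferenceModelSpec field below.
type InferenceModel struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec InferenceModelSpec `json:"spec,omitempty"`
}

// InferenceModelSpec is defined elsewhere in the API; declared empty here
// only so this sketch is self-contained.
type InferenceModelSpec struct{}
```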
Items []InferenceModel `json:"items"`
}

// InferenceModelSpec represents the desired state of a specific model use case. This resource is
The "model use case" language wasn't immediately clear to me here. Is it fair to just say:
- // InferenceModelSpec represents the desired state of a specific model use case. This resource is
+ // InferenceModelSpec represents the desired state of an InferenceModel. This resource is
Or are we trying to make some additional distinction?
I was just trying not to use a circular definition and to clarify what an InferenceModel is intended to represent. Open to changing it.
Small thing; please consider it resolved at your discretion.
Adding more high-level documentation to the InferenceModel obj allowed me to make the model use case -> InferenceModel linkage above. Rewording.
// managed by the "Inference Workload Owner" persona.
//
// The Inference Workload Owner persona is someone that trains, verifies, and
// leverages a large language model from a model frontend, drives the lifecycle
I think we can assume anyone who's made it here understands "inference", "training", "models", etc., but might it be worth explaining more or enumerating some examples of a "model frontend" if we're going to mention that here?
Tried rewording here.
// Criticality defines how important it is to serve the model compared to other models referencing the same pool.
// The lack of defaulting is intentional, the behavior of not setting criticality future-proofs the API without complicating.
//
// +optional
Criticality *Criticality `json:"criticality,omitempty"`
I've been seeing and hearing a lot of discussion about Criticality in terms of linguistics and its place in this API. I see the coordination between the criticality of multiple models, as you add more and more, having the potential to become a bit confusing, particularly if you're trying to deploy a new model and you have to kind of look at what's out there and make decisions about your new model. For instance, will it be common to get into a weird shape where you have a new model whose criticality really needs to be higher than that of anything that came before it? Then, as part of deploying it, there's a sort of impetus to update (perhaps downgrade) the criticality of a bunch of other models? Would this serve to complicate the job of the "Inference Workload Owner"?
This doesn't mean I'm strictly against it or anything, mind you; I'm just trying to think through how this plays out in the real world. It might help me personally (since I'm very new to this project) to see some of the motivation and user stories that influenced criticality, if someone can spare a link.
There is some more recent discussion here: https://kubernetes.slack.com/archives/C071WA7R9LY/p1736906518995639
But Criticality has been a hot topic; agreed that it might not be in its final state. Since criticality is used in a load-balancing aspect, we are trying to limit options to something that we can guarantee to support out of the box. We expect iteration in the future as we (hopefully) increase usage.
Would opening an issue about criticality to centralize the conversation be acceptable?
Yes, a TODO issue just to make sure the conversation remains topical and continues is a reasonable deferral at this stage so that we can keep velocity up and test it out in its current state and see what that teaches us. 👍
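For context, here is a small sketch of the shape being discussed. The enum values below (Critical, Standard, Sheddable) are assumptions for the example rather than values guaranteed by the API:

```go
package v1alpha1

// Criticality expresses how important it is to serve a model relative to
// other models referencing the same InferencePool.
type Criticality string

const (
	// Critical traffic should be served ahead of everything else in the pool.
	Critical Criticality = "Critical"
	// Standard is the ordinary serving tier.
	Standard Criticality = "Standard"
	// Sheddable traffic may be queued or dropped first when the pool is
	// under load.
	Sheddable Criticality = "Sheddable"
)
```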

// TargetModels allow multiple versions of a model for traffic splitting.
// If not specified, the target model name is defaulted to the modelName parameter.
// modelName is often in reference to a LoRA adapter.
It might be worth expanding on this piece in particular in this documentation to help future code readers.
Done, added examples to more clearly explain the variability.
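As a hedged illustration of that variability, the sketch below splits traffic across two LoRA adapters; the types are local stand-ins approximating the API, and the model/adapter names are made up:

```go
package main

import "fmt"

// Local stand-ins for the API types so the example is self-contained; the
// real field names and types in v1alpha1 may differ.
type TargetModel struct {
	Name   string
	Weight int32
}

type InferenceModelSpec struct {
	ModelName    string
	TargetModels []TargetModel
}

func main() {
	// Clients always request "chatbot"; traffic is split across two LoRA
	// adapters, e.g. to canary a new fine-tune at 10%.
	spec := InferenceModelSpec{
		ModelName: "chatbot",
		TargetModels: []TargetModel{
			{Name: "chatbot-lora-v1", Weight: 90},
			{Name: "chatbot-lora-v2", Weight: 10},
		},
	}
	fmt.Printf("%+v\n", spec)

	// If TargetModels were omitted entirely, the target would default to
	// the ModelName itself, per the comment quoted above.
}
```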
}

// InferencePoolList contains a list of InferencePool.
//
Similar to the above, it might be nice to provide more documentation about what (and why) InferencePool is.
// Weight is used to determine the proportion of traffic that should be
// sent to this model when multiple target models are specified.
//
// Weight defines the proportion of requests forwarded to the specified
One question: if I have one client using HTTP/2 that sends 1000 requests (all pipelined over the same connection) and two models weighted 50 and 50, is the result 500 requests for each model? Is a "request" a connection request, a token, or an HTTP request?
Gateway implementations handle the actual connection (ext-proc just uses gRPC communication with the GW). But yes, assuming equal weighting for two underlying models, the mathematical probability should be a 50:50 split over a large enough sample pool.
//
// +optional
// +kubebuilder:validation:MaxItems=10
TargetModels []TargetModel `json:"targetModels,omitempty"`
Is targetModels mutable? Can it be updated to add or remove models? If that happens, are the weights recalculated?
Yep, targetModels can be added/removed.
The weights do recalculate; they are all relative to one another. Link to how the weights are consumed here:
func RandomWeightedDraw(model *v1alpha1.InferenceModel, seed int64) string {
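For readers without the repo handy, here is a rough, self-contained sketch of how such a weighted draw can work; the real RandomWeightedDraw implementation, signature handling, and types in v1alpha1 may differ:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Local stand-ins so the sketch compiles on its own; the real types live in
// the v1alpha1 package.
type TargetModel struct {
	Name   string
	Weight int32
}

type InferenceModel struct {
	ModelName    string
	TargetModels []TargetModel
}

// randomWeightedDraw picks one target per request with probability
// proportional to its weight relative to the sum of all weights.
func randomWeightedDraw(model *InferenceModel, rng *rand.Rand) string {
	var total int32
	for _, tm := range model.TargetModels {
		total += tm.Weight
	}
	if total == 0 {
		// No targets specified: fall back to the use-case model name.
		return model.ModelName
	}
	n := rng.Int31n(total)
	for _, tm := range model.TargetModels {
		n -= tm.Weight
		if n < 0 {
			return tm.Name
		}
	}
	return model.ModelName
}

func main() {
	model := &InferenceModel{
		ModelName: "chatbot",
		TargetModels: []TargetModel{
			{Name: "chatbot-lora-v1", Weight: 50},
			{Name: "chatbot-lora-v2", Weight: 50},
		},
	}
	rng := rand.New(rand.NewSource(42))

	// With equal weights, roughly 500 of 1000 draws land on each target:
	// the 50:50 split discussed above is per request, not per connection.
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[randomWeightedDraw(model, rng)]++
	}
	fmt.Println(counts)
}
```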
This PR is not intended to be merged, merely a point of reference for review.
Slides: https://docs.google.com/presentation/d/1gtOJS1YA0Ax8KvsGPrHiyZR2dBoWnlLd9aACyojmk68/edit#slide=id.p