v0.1 API Review #154
base: v0.0
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: kfswain. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
// The number must be in the range 1 to 65535.
//
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
If applicable, I suggest reducing this port range to a smaller list of well-known ports that users can rely on for firewall configuration purposes. Also, don't allow overlap with other well-known ports like those used for DNS, HTTP/S, etc.
Holding off on changing this one, just to gather consensus on what range we should limit it to. But I do agree with the idea.
It's possible that we could start with a small range and relax it as needed, since going the other direction would be nigh impossible.
@candita this is meant to be a reference to a port number on a Pod. I can't think of any reasonable way to limit that, since Kubernetes has likely scaled far enough that there's probably at least one case of each individual port being in use across the many Kubernetes clusters that exist.
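For illustration only, here is a minimal sketch of what the suggested narrower range could look like using the same kubebuilder markers. The struct name, field name, and the 1024 lower bound are assumptions made for this example, not the actual API:

```go
package v1alpha1

// Hypothetical example: a port field restricted to the non-well-known range,
// as suggested above. Names and bounds are illustrative only.
type PortRangeExampleSpec struct {
	// TargetPortNumber is the port on the backing Pods that inference
	// traffic is forwarded to. Restricting it to 1024-65535 avoids overlap
	// with well-known ports such as DNS (53) and HTTP/S (80/443).
	//
	// +kubebuilder:validation:Minimum=1024
	// +kubebuilder:validation:Maximum=65535
	TargetPortNumber int32 `json:"targetPortNumber"`
}
```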
Thanks to everyone for the help reviewing this! Any/all comments are very appreciated. We also went over this at a high level in a meeting with SIG-Net TLs (and others), walking through some slides as a reference point. Since this is starting as an [...], copying SIG-Net TLs and chairs in case any have time for additional review cycles.
@kfswain: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferenceModel is the Schema for the InferenceModels API.
InferenceModel appears to be topical, and of particular importance, so it may be one of the first things a newcomer reads when learning these APIs. It may be beneficial to expand the documentation here to explain more thoroughly the "what" and "why" of it.
Is it a cop-out to reference our site or the docs proposal that goes into a bit more detail? I could see a brief blurb being valuable, with a more detailed explanation offloaded.
We also go into more detail in the spec. Maybe it could be as simple as 'a more detailed description is affixed to the InferenceModelSpec field below'?
A link may be sufficient. As far as the spec section goes, I was thinking maybe higher level, but maybe that's OK as well.
I would like to see something more in the documentation here. I trust your judgement on what that is; just please consider what a newcomer might be looking for when they come here and try to accommodate that. Otherwise, please feel free to consider this comment resolved at your discretion.
Took a stab! LMKWYT
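As a rough illustration of the kind of expanded doc comment being asked for (the wording below is a sketch for this review thread, not the text that actually landed in the PR):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferenceModel is the Schema for the InferenceModels API.
//
// An InferenceModel represents a single model use case: it maps the model
// name that clients request onto one or more backing model versions (for
// example, LoRA adapters) served from an InferencePool, along with serving
// preferences such as criticality. A more detailed description is attached
// to the InferenceModelSpec field below.
type InferenceModel struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec InferenceModelSpec `json:"spec,omitempty"`
}

// InferenceModelSpec is defined elsewhere in the API; declared empty here
// only so this sketch is self-contained.
type InferenceModelSpec struct{}
```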
Items []InferenceModel `json:"items"`
}

// InferenceModelSpec represents the desired state of a specific model use case. This resource is
The "model use case" language wasn't immediately clear to me here. Is it fair to just say:
- // InferenceModelSpec represents the desired state of a specific model use case. This resource is
+ // InferenceModelSpec represents the desired state of an InferenceModel. This resource is
Or are we trying to make some additional distinction?
I was just trying not to use a circular definition and to clarify what an InferenceModel is intended to represent. Open to changing it.
Small thing; please consider it resolved at your discretion.
Adding more high-level documentation to the InferenceModel obj allowed me to make the model use case -> InferenceModel linkage above. Rewording.
// managed by the "Inference Workload Owner" persona.
//
// The Inference Workload Owner persona is someone that trains, verifies, and
// leverages a large language model from a model frontend, drives the lifecycle
I think we can assume anyone who's made it here understands "inference", "training", "models", etc., but might it be worth explaining more or enumerating some examples of a "model frontend" if we're going to mention that here?
Tried rewording here.
// Criticality defines how important it is to serve the model compared to other models referencing the same pool.
// The lack of defaulting is intentional, the behavior of not setting criticality future-proofs the API without complicating.
//
// +optional
Criticality *Criticality `json:"criticality,omitempty"`
I've been seeing and hearing a lot of discussion about Criticality in terms of linguistics and its place in this API. I see the coordination between the criticality of multiple models, as you add more and more, having the potential to become a bit confusing, particularly if you're trying to deploy a new model and you have to kind of look at what's out there and make decisions about your new model. For instance, will it be common to get into a weird shape where you have a new model whose criticality really needs to be higher than that of anything that came before it? Then, as part of deploying it, there's a sort of impetus to update (perhaps downgrade) the criticality of a bunch of other models? Would this serve to complicate the job of the "Inference Workload Owner"?
This doesn't mean I'm strictly against it or anything, mind you; I'm just trying to think through how this plays out in the real world. It might help me personally (since I'm very new to this project) to see some of the motivation and user stories that influenced criticality, if someone can spare a link.
There is some more recent discussion here: https://kubernetes.slack.com/archives/C071WA7R9LY/p1736906518995639
But Criticality has been a hot topic; agreed that it might not be in its final state. Since criticality is used in a load-balancing aspect, we are trying to limit options to something that we can guarantee to support out of the box. We expect iteration in the future as we (hopefully) increase usage.
Would opening an issue about criticality to centralize the conversation be acceptable?
Yes, a TODO issue just to make sure the conversation remains topical and continues is a reasonable deferral at this stage so that we can keep velocity up and test it out in its current state and see what that teaches us. 👍
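For context, here is a small sketch of the shape being discussed. The enum values below (Critical, Standard, Sheddable) are assumptions for the example rather than values guaranteed by the API:

```go
package v1alpha1

// Criticality expresses how important it is to serve a model relative to
// other models referencing the same InferencePool.
type Criticality string

const (
	// Critical traffic should be served ahead of everything else in the pool.
	Critical Criticality = "Critical"
	// Standard is the ordinary serving tier.
	Standard Criticality = "Standard"
	// Sheddable traffic may be queued or dropped first when the pool is
	// under load.
	Sheddable Criticality = "Sheddable"
)
```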

// TargetModels allow multiple versions of a model for traffic splitting.
// If not specified, the target model name is defaulted to the modelName parameter.
// modelName is often in reference to a LoRA adapter.
It might be worth expanding on this piece in particular in this documentation to help future code readers.
Done, added examples to more clearly explain the variability.
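As a hedged illustration of that variability, the sketch below splits traffic across two LoRA adapters; the types are local stand-ins approximating the API, and the model/adapter names are made up:

```go
package main

import "fmt"

// Local stand-ins for the API types so the example is self-contained; the
// real field names and types in v1alpha1 may differ.
type TargetModel struct {
	Name   string
	Weight int32
}

type InferenceModelSpec struct {
	ModelName    string
	TargetModels []TargetModel
}

func main() {
	// Clients always request "chatbot"; traffic is split across two LoRA
	// adapters, e.g. to canary a new fine-tune at 10%.
	spec := InferenceModelSpec{
		ModelName: "chatbot",
		TargetModels: []TargetModel{
			{Name: "chatbot-lora-v1", Weight: 90},
			{Name: "chatbot-lora-v2", Weight: 10},
		},
	}
	fmt.Printf("%+v\n", spec)

	// If TargetModels were omitted entirely, the target would default to
	// the ModelName itself, per the comment quoted above.
}
```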
}

// InferencePoolList contains a list of InferencePool.
//
Similar to the above, it might be nice to provide more documentation about what (and why) InferencePool is.
// Weight is used to determine the proportion of traffic that should be
// sent to this model when multiple target models are specified.
//
// Weight defines the proportion of requests forwarded to the specified
One question: if I have one client using HTTP/2 that sends 1000 requests (all pipelined over the same connection) and two models weighted 50 and 50, is the result 500 requests for each model? Is a "request" a connection request, a token, or an HTTP request?
Gateway implementations handle the actual connection (ext-proc just uses gRPC communication with the GW). But yes, assuming equal weighting for two underlying models, the mathematical probability should be a 50:50 split over a large enough sample pool.
//
// +optional
// +kubebuilder:validation:MaxItems=10
TargetModels []TargetModel `json:"targetModels,omitempty"`
Is targetModels mutable? Can it be updated to add or remove models? If that happens, are the weights recalculated?
Yep, targetModels can be added/removed.
The weights do recalculate; they are all relative to one another. Link to how the weights are consumed here:
func RandomWeightedDraw(model *v1alpha1.InferenceModel, seed int64) string {
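For readers without the repo handy, here is a rough, self-contained sketch of how such a weighted draw can work; the real RandomWeightedDraw implementation, signature handling, and types in v1alpha1 may differ:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Local stand-ins so the sketch compiles on its own; the real types live in
// the v1alpha1 package.
type TargetModel struct {
	Name   string
	Weight int32
}

type InferenceModel struct {
	ModelName    string
	TargetModels []TargetModel
}

// randomWeightedDraw picks one target per request with probability
// proportional to its weight relative to the sum of all weights.
func randomWeightedDraw(model *InferenceModel, rng *rand.Rand) string {
	var total int32
	for _, tm := range model.TargetModels {
		total += tm.Weight
	}
	if total == 0 {
		// No targets specified: fall back to the use-case model name.
		return model.ModelName
	}
	n := rng.Int31n(total)
	for _, tm := range model.TargetModels {
		n -= tm.Weight
		if n < 0 {
			return tm.Name
		}
	}
	return model.ModelName
}

func main() {
	model := &InferenceModel{
		ModelName: "chatbot",
		TargetModels: []TargetModel{
			{Name: "chatbot-lora-v1", Weight: 50},
			{Name: "chatbot-lora-v2", Weight: 50},
		},
	}
	rng := rand.New(rand.NewSource(42))

	// With equal weights, roughly 500 of 1000 draws land on each target:
	// the 50:50 split discussed above is per request, not per connection.
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[randomWeightedDraw(model, rng)]++
	}
	fmt.Println(counts)
}
```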
This PR is not intended to be merged, merely a point of reference for review.
Slides: https://docs.google.com/presentation/d/1gtOJS1YA0Ax8KvsGPrHiyZR2dBoWnlLd9aACyojmk68/edit#slide=id.p