Add model server protocol proposal #164

Open
wants to merge 1 commit into
base: main
from

Conversation

liu-cong
Contributor

@liu-cong liu-cong commented Jan 7, 2025

This is adapted from the initial doc

I didn't include the ORCA load reporting section as it's not currently required by the inference extension, though parallel efforts are happening. The intention is to keep the scope of this protocol small and expand in the future if needed.

Many of the protocol items are "SHOULD" rather than "MUST", for the following reasons:

  • Existing model servers have already chosen different implementations, e.g., different metric names. While we would like to drive unification, we don't want to dictate.
  • Emerging features like LoRA serving have yet to converge on a common understanding. For example, different model servers currently have very different configurations for LoRA serving. It's premature to settle on a well-defined contract.

I see this as a first effort to define the contract. It will be an evolving process of monitoring industry trends and driving more unification.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain January 7, 2025 04:43
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from robscott January 7, 2025 04:43
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 7, 2025
@liu-cong liu-cong marked this pull request as draft January 7, 2025 19:01
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@ahg-g
Contributor

ahg-g commented Jan 7, 2025

@smarterclayton

@@ -0,0 +1,72 @@
# Model Server Protocol for Gateway API Inference Extension
Contributor

@smarterclayton @robscott I think we need to explicitly define two "contracts":

Contract 1 is the one between the gateway and the endpoint picker extension; this contract is defined by the InferenceModel/InferencePool APIs.

Contract 2 is the one between the reference implementation of the endpoint picker and the model server.

For each contract, we need to separately establish versioning and define conformance tests, and my view is that we should not make them dependent on each other.

The InferencePool API could have a way to communicate to the endpoint picker which protocol to use (e.g., as part of the extension config API), but for the most part it should be part of some raw parameters config (e.g., a map or a generic reference to a CRD) that the extension understands and knows how to interpret.
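
To make the idea concrete, here is a minimal sketch of what that could look like, written in Python purely for illustration. Every name in it (`ExtensionConfig`, `protocol`, `parameters`, `example-protocol/v1`, `metricsPath`) is hypothetical and not part of the InferencePool API or of this PR. It only shows the shape: the pool carries a protocol identifier plus an opaque parameter map, and the endpoint picker alone decides how to interpret them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ExtensionConfig:
    # Hypothetical field: names the endpoint picker <-> model server
    # protocol (Contract 2) that the extension should speak.
    protocol: str
    # Raw parameters that only the extension interprets; they are
    # opaque to the gateway and to the InferencePool API itself.
    parameters: Dict[str, str] = field(default_factory=dict)


def interpret_v1(params: Dict[str, str]) -> Dict[str, str]:
    # Hypothetical interpretation: fall back to a default metrics path
    # when the raw parameters do not set one.
    return {"metrics_path": params.get("metricsPath", "/metrics")}


# Registry owned by the endpoint picker. It is versioned independently
# of the InferenceModel/InferencePool APIs (Contract 1).
PROTOCOL_HANDLERS: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {
    "example-protocol/v1": interpret_v1,
}


def resolve(config: ExtensionConfig) -> Dict[str, str]:
    # Look up the handler for the requested protocol and let it
    # interpret the raw parameters; reject protocols it does not know.
    handler = PROTOCOL_HANDLERS.get(config.protocol)
    if handler is None:
        raise ValueError(f"unsupported protocol: {config.protocol!r}")
    return handler(config.parameters)
```

With this shape, a new protocol version only adds an entry to the picker's registry; the pool-level API does not need to change.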

Contributor

The InferencePool API could have a way to communicate to the endpoint picker which protocol to use (e.g., as part of the extension config API)

+1, see #162 (comment) for an example of exposing the protocol used between the InferencePool implementation and the extension. See also #162 (comment) for UDS support.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2025
@liu-cong liu-cong marked this pull request as ready for review January 7, 2025 22:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 7, 2025 22:59
@kfswain
Collaborator

kfswain commented Jan 9, 2025

This is awesome!
I'm gonna put a hold on here so we can discuss this in our Contributors meeting before it merges. Does next week sound good to have this on the agenda? Or is a little more time preferable?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants