Do you know of a good example for continuous batching? We would like to combine that with the paged attention kernel to build our own simple serving solution.
Thanks!
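For reference, here is a minimal sketch of what a continuous-batching decode loop might look like. It is not tied to any framework; `Request`, the batched `model.decode_step` call, and `model.eos_token_id` are illustrative assumptions, not an existing API.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list               # token ids of the prompt
    max_new_tokens: int
    generated: list = field(default_factory=list)

def serve_loop(model, incoming: deque, max_batch_size: int = 8):
    """Interleave requests at token granularity (continuous batching)."""
    active = []
    while incoming or active:
        # Admit new requests whenever a slot frees up, instead of waiting
        # for the whole batch to drain (this is the "continuous" part).
        while incoming and len(active) < max_batch_size:
            active.append(incoming.popleft())

        # One decode step for every active sequence. With paged attention the
        # per-request KV cache would live in non-contiguous fixed-size blocks,
        # so newly admitted requests don't force a cache re-layout.
        next_tokens = model.decode_step(active)    # hypothetical batched call

        still_active = []
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)
            done = tok == model.eos_token_id or len(req.generated) >= req.max_new_tokens
            if done:
                yield req                          # hand the finished request back
            else:
                still_active.append(req)
        active = still_active                      # freed slots are reused next step
```

The key property is that finished sequences leave the batch immediately and queued ones take their place on the very next decode step, rather than waiting for the slowest request in a static batch.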
Hi @lessw2020 - first I want to say thank you for your YouTube videos on FSDP!!!
For continuous/dynamic batching, we really want something that's in Python :) where it's easy to tweak the server. Since the main bottleneck is GPU-side generation (at least for LLMs), there is only a marginal benefit to using a Rust- or Java-based web server framework. Nevertheless, the main frameworks (e.g., TGI and vLLM) are not pure Python. Thanks!
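To illustrate that point, a pure-Python web layer can stay very thin: it only enqueues requests and awaits results while the GPU-bound decode loop does the real work. The sketch below assumes the batching loop above runs as a background task that resolves each request's future; FastAPI is used only as an example of an asyncio-friendly framework.

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()   # consumed by the decode loop

@app.post("/generate")
async def generate(prompt: str, max_new_tokens: int = 128):
    # Hand the request to the (assumed) background batching loop and wait
    # for it to resolve the future with the generated text.
    done = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, max_new_tokens, done))
    text = await done
    return {"completion": text}
```

Because the endpoint only awaits, the web framework's own throughput is rarely the limiting factor; swapping it for another Python framework would not change the serving architecture.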
Hi @vgoklani - got it, thanks for your feedback.
This has generated a discussion about possibly making a reference architecture to showcase these types of features.
I'll leave this issue open and update it if this turns into a real effort.