Hi everyone 😃
Today I would like to walk you through some experiments we ran on load balancing requests to LLMs in Kubernetes. Let’s dive in!
Well, Kubernetes has established itself as a leading technology for workload orchestration and, over time, its support for accelerators such as GPUs has improved, making it easier to run workloads that exploit the massive parallelism those accelerators provide.
Furthermore, many projects built to serve LLMs efficiently are also distributed as containers, which makes deploying them nearly effortless. Some examples include, but are not limited to:
But these are probably not the main reasons to bring Kubernetes into the loop for LLM serving. One of the main advantages of a Kubernetes-like environment is the mindset of deploying workloads in a GitOps way: the running application stays consistent with its definition, and it can be redeployed on a different machine or after a reinstallation.
In our case, we decided to try Kubeflow, a project that gives users separate Kubernetes namespaces in which to run their workloads, with support for accelerators. Its modular approach lets the operator decide which Kubeflow components to deploy, for example skipping those that are not needed if you are not training your own models.
We deployed it using the official manifest, with one small trick: updating KServe, the component we can recognize as a full-fledged model serving platform, to version 0.17, which was not the default in Kubeflow at the time.
Why 0.17? It all comes down to… llm-d!
Okay, when it comes to serving requests efficiently, one of the main ideas, whatever the type of application or architecture, is to cache everything that can be cached and reuse it, maximizing the number of cache hits. It turns out this also plays a role for LLMs, but to understand how, we need to explore a bit how the decode phase in modern LLMs works!
Generating text in an LLM works roughly like this:

This is what an attention layer does. LLMs stack many of these layers, each followed by a small feed-forward network, refining the representation step by step. After the final layer, the result is projected onto the vocabulary to get a probability distribution over possible next tokens — one of which is sampled (or picked) as the output.
That new token gets appended to the sequence, and the whole process repeats — one token at a time — until a special end-of-sequence token is produced.
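To make the loop concrete, here is a minimal sketch of token-by-token generation. The vocabulary and the `toy_logits` function are hypothetical stand-ins for a real model (they are not part of any library), and sampling is replaced by a greedy pick to keep the example deterministic.

```python
import math

# Hypothetical four-token vocabulary; index 0 plays the role of end-of-sequence.
VOCAB = ["<eos>", "hello", "world", "!"]
EOS = 0


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(tokens):
    """Stand-in for the stacked layers plus the final projection (an assumption).

    It simply favours a fixed successor of the last token.
    """
    successor = {1: 2, 2: 3, 3: EOS}.get(tokens[-1], 1)
    return [5.0 if i == successor else 0.0 for i in range(len(VOCAB))]


def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        # Greedy decoding: pick the most likely token instead of sampling.
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)  # the new token joins the sequence
        if next_token == EOS:      # stop at end-of-sequence
            break
    return tokens


print([VOCAB[t] for t in generate([1])])
```

The important part is the shape of the loop: every new token requires another pass over the whole sequence so far, which is what makes caching attractive.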
Given this, where could we fit an optimization in the form of a cache? Well, the keys and values computed for the sequence up to a certain step can be reused when computing the next step, with no need to recompute them!
This is exactly the concept behind the KV cache, which reuses those already computed values.
Let’s imagine we would like to deploy two instances of a model on two different GPUs. How can we do this? Until recent versions, the main resource KServe offered for this was the InferenceService, which makes it easy to serve any ML model, not just an LLM.
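As a rough illustration, the sketch below runs a single attention head over a short sequence twice: once recomputing keys and values for the whole prefix at every step, and once appending them to a cache. The 2x2 weight matrices and the input vectors are made-up toy values, and the token embeddings are given up front rather than produced by the model, so this only shows the caching idea, not a real decoder.

```python
import math

# Toy projection matrices (assumed values, for illustration only).
WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[0.5, 0.1], [0.2, 0.3]]
WV = [[0.3, -0.2], [0.1, 0.4]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def decode_without_cache(embeddings):
    """Recompute K and V for the entire prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]
        keys = [matvec(WK, x) for x in prefix]
        values = [matvec(WV, x) for x in prefix]
        projections += 2 * step
        outputs.append(attend(matvec(WQ, prefix[-1]), keys, values))
    return outputs, projections


def decode_with_cache(embeddings):
    """Compute K and V once per token and keep them in the cache."""
    keys, values = [], []  # this pair of lists is the KV cache
    outputs, projections = [], 0
    for x in embeddings:
        keys.append(matvec(WK, x))
        values.append(matvec(WV, x))
        projections += 2
        outputs.append(attend(matvec(WQ, x), keys, values))
    return outputs, projections


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_cost = decode_without_cache(sequence)
fast_out, fast_cost = decode_with_cache(sequence)
print(slow_cost, fast_cost)  # number of K/V projections in each variant
```

Both variants produce the same attention outputs, but the number of key and value projections grows quadratically with the sequence length without the cache and only linearly with it.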
For example, to serve Qwen 2.5 on our two GPUs, we could have done something similar to this:

Given that we have two instances, we could load balance users’ requests between them using standard techniques such as round robin, or using metrics such as load. However, this would not let us really exploit the KV cache we just saw. That’s why recent versions of KServe also offer a more specialized LLMInferenceService which, among other features, integrates llm-d, which in turn provides a KV cache aware router (either approximate or precise, exploiting for example cache information pushed by vLLM through ZeroMQ).
We could, for example, design the following configuration for the EndpointPicker, the llm-d component that chooses among the available instances (endpoints). It combines different metrics with different weights, including the precise prefix cache scorer, to maximize cache hits, in this case in a rather extreme way 😃
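To see why plain round robin wastes the cache, here is a small simulation. Everything in it is hypothetical: the `Endpoint` class, the token lists and the routing functions are illustrative only and do not reflect how llm-d is implemented. Each replica remembers the token sequences it has served, and a "hit" is the number of leading tokens of a new request that a replica has already seen.

```python
from itertools import cycle


class Endpoint:
    """Hypothetical model replica that remembers the prompts it has served."""

    def __init__(self, name):
        self.name = name
        self.cached = []  # token sequences whose K/V we pretend are cached

    def prefix_hit(self, tokens):
        """Length of the longest cached prefix shared with `tokens`."""
        best = 0
        for seen in self.cached:
            shared = 0
            for a, b in zip(seen, tokens):
                if a != b:
                    break
                shared += 1
            best = max(best, shared)
        return best

    def serve(self, tokens):
        hit = self.prefix_hit(tokens)
        self.cached.append(list(tokens))
        return hit


def run_round_robin(requests, names):
    endpoints = [Endpoint(n) for n in names]
    turn = cycle(endpoints)
    return sum(next(turn).serve(r) for r in requests)


def run_prefix_aware(requests, names):
    endpoints = [Endpoint(n) for n in names]
    reused = 0
    for r in requests:
        # Route to the replica holding the longest matching prefix.
        target = max(endpoints, key=lambda e: e.prefix_hit(r))
        reused += target.serve(r)
    return reused


SYSTEM_A = [1, 2, 3]  # two different shared prompt prefixes (toy token ids)
SYSTEM_B = [7, 8, 9]
requests = [SYSTEM_A + [10], SYSTEM_B + [20], SYSTEM_B + [21], SYSTEM_A + [11]]

print(run_round_robin(requests, ["gpu-0", "gpu-1"]))   # tokens reused
print(run_prefix_aware(requests, ["gpu-0", "gpu-1"]))  # tokens reused
```

With this request order, round robin sends every follow-up to the replica that does not hold its prefix, while the prefix-aware variant reuses it. Note that this naive version happily piles all traffic onto one replica, which is why a real picker also has to weigh load.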
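Independently of the actual configuration format, the idea of combining scorers with weights can be sketched in a few lines. The scorer names, score values and weights below are made up for illustration and are not the identifiers used by llm-d; each score is assumed to be normalized to the range 0 to 1, with higher meaning "better candidate".

```python
def pick_endpoint(endpoints, weights):
    """Return the name of the endpoint with the highest weighted score sum."""

    def weighted_total(endpoint):
        return sum(weight * endpoint["scores"][scorer]
                   for scorer, weight in weights.items())

    return max(endpoints, key=weighted_total)["name"]


# Hypothetical snapshot: replica-a holds most of the prompt prefix in its
# KV cache but is busy, replica-b is idle but has almost nothing cached.
endpoints = [
    {"name": "replica-a",
     "scores": {"prefix_cache": 0.9, "queue": 0.2, "kv_utilization": 0.3}},
    {"name": "replica-b",
     "scores": {"prefix_cache": 0.1, "queue": 0.9, "kv_utilization": 0.8}},
]

balanced = {"prefix_cache": 1.0, "queue": 1.0, "kv_utilization": 1.0}
cache_heavy = {"prefix_cache": 10.0, "queue": 1.0, "kv_utilization": 1.0}

print(pick_endpoint(endpoints, balanced))     # load wins
print(pick_endpoint(endpoints, cache_heavy))  # cache hits win
```

Giving the prefix cache scorer a much larger weight than the others is what makes a setup "extreme": the picker will almost always follow the cache, even towards a busier replica.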

Given that we are using the precise prefix cache scorer, we then need to configure our inference runtime, such as vLLM, to pass information about the cache content back to the EndpointPicker, so that it can make great routing decisions!

And in this way we are able to serve two instances of a model behind a service that routes requests based also on the content of the KV cache, to maximize performance and reduce the time to first token!
Today we saw some of the benefits of using Kubernetes for our ML workflows too, along with some interesting tools to serve LLMs efficiently and easily, with an eye on KV cache aware routing!
See you soon 😉