Deploying an OpenAI-Compatible Endpoint on RunPod with vLLM and K6 Load Testing
This post walks through renting a cloud GPU from RunPod, using the vLLM inference engine to serve a Large Language Model behind an OpenAI-compatible endpoint, and then load testing that endpoint with K6.
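Concretely, "OpenAI compatible" means that once the server is running, standard OpenAI client libraries can talk to it unchanged, just with a different base URL. A minimal sketch with the openai Python package, where the base URL and model name are placeholders for whatever you end up deploying:

```python
from openai import OpenAI

# Placeholder URL: RunPod exposes pod ports through a proxy; the exact
# address comes from your pod's connection details.
client = OpenAI(
    base_url="https://<pod-id>-8000.proxy.runpod.net/v1",
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```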
What is RunPod?
RunPod is a paid cloud GPU provider. It offers:
Pods
We will use a pod in this example.
A pod is a container with one or more GPUs attached. We specify the Docker image and its configuration.
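For readers who prefer scripting this over clicking through the console, a pod can also be created with the runpod Python SDK. The following is a minimal sketch, assuming the official vllm/vllm-openai image; the GPU type, disk size, and model are illustrative placeholders, not recommendations:

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # from your RunPod account settings

# Request a single-GPU pod running vLLM's OpenAI-compatible server image.
pod = runpod.create_pod(
    name="vllm-openai-demo",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA GeForce RTX 4090",  # placeholder GPU type
    gpu_count=1,
    container_disk_in_gb=40,
    ports="8000/http",  # vLLM serves on port 8000 by default
    # Arguments appended to the image entrypoint, i.e. passed to vLLM itself.
    docker_args="--model mistralai/Mistral-7B-Instruct-v0.2",
    env={"HF_TOKEN": "YOUR_HUGGINGFACE_TOKEN"},  # needed for gated models
)

print(pod["id"])  # use this id to look up the pod's proxy URL
```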