This post explores renting a cloud GPU from RunPod, using the vLLM inference engine to run a Large Language Model exposed via an OpenAI-compatible endpoint, and then load testing that endpoint with K6.

## What is RunPod?

RunPod is a paid cloud GPU provider. It offers:

### Pods

We will utilise a pod in this example. A pod is a container with one or more GPUs attached; we specify the Docker image and the configuration.
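To make this concrete, here is a minimal sketch of launching a pod programmatically with RunPod's Python SDK. The pod name, Docker image, and GPU type below are illustrative placeholders rather than recommendations, and the `create_pod` call follows the SDK's documented usage; creating the pod through the RunPod web console works just as well.

```python
# A minimal sketch, assuming the RunPod Python SDK is installed (pip install runpod).
import runpod

# API key is found under Settings in the RunPod console (placeholder value here).
runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Launch a pod: a container image running on the requested GPU type.
# All three values are assumptions for illustration.
pod = runpod.create_pod(
    name="vllm-demo",                      # display name for the pod
    image_name="vllm/vllm-openai:latest",  # Docker image the pod will run
    gpu_type_id="NVIDIA A40",              # GPU type to attach
)

# The SDK returns the newly created pod's details, including its ID.
print(pod)
```

Whether created via the SDK or the console, the result is the same: a running container with the specified image and GPU attached, ready to serve requests.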