Posts for: #Linux

Deploying an OpenAI-Compatible Endpoint on RunPod with vLLM and K6 Load Testing

This post explores renting a cloud GPU from RunPod, using the vLLM inference engine to run a Large Language Model behind an OpenAI-compatible endpoint, and then load testing that endpoint with K6. What is RunPod? RunPod is a paid cloud GPU provider. Among its offerings are pods, which is what we will use in this example. A pod is a container with one or more GPUs attached; we specify the Docker image and the configuration.
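As a taste of what the endpoint consumes, here is a minimal Python sketch of the request body a K6 load-test script would fire in a loop at a vLLM OpenAI-compatible chat completions endpoint. The base URL and model name below are placeholders, not values from the post.

```python
import json

# Placeholders -- substitute your own pod's proxy URL and served model.
BASE_URL = "https://<pod-id>-8000.proxy.runpod.net/v1"
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

def build_chat_request(prompt: str, max_tokens: int = 128) -> str:
    """Build the JSON body that an OpenAI-compatible
    /v1/chat/completions endpoint expects."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# The K6 script POSTs a body like this to BASE_URL + "/chat/completions".
payload = build_chat_request("Say hello in one sentence.")
print(payload)
```

The same payload shape works from any HTTP client, which is what makes an OpenAI-compatible server convenient to load test.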
[Read more]

Converting a PyTorch Model to Safetensors Format and Quantising to Exl2

A set of notes on converting a transformers model from PyTorch format to Safetensors format and then quantising to ExLlamaV2 (Exl2) format using a code-based calibration dataset. This was inspired by posts reporting that coding LLMs quantised to Exl2 with the default wikitext calibration dataset produced relatively lower-quality outputs. ExLlamaV2 is the excellent work of Turboderp. It is an inference library for running LLMs on consumer GPUs; it is fast and supports multi-GPU hosts.
[Read more]

Automating Virtual Machine Creation on Proxmox with Terraform and bpg

A guide to using the Terraform bpg provider to create virtual machines on a Proxmox instance. The bpg provider is a wrapper for the Proxmox API; it enables the provisioning of infrastructure on Proxmox using Terraform. bpg is one of two Terraform providers available for Proxmox at the time of writing, the other being telmate. Both are active based on their GitHub repos, but at a quick glance bpg appeared a little more active, and a few positive posts about it swayed the decision.
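As a sketch of the shape such a configuration takes (the endpoint, node name, and sizing below are assumptions; check the bpg provider documentation for your version), a minimal VM definition might look like:

```terraform
terraform {
  required_providers {
    proxmox = {
      source = "bpg/proxmox"
    }
  }
}

provider "proxmox" {
  endpoint = "https://proxmox.example.local:8006/" # assumption: your API URL
  # credentials supplied via environment variables or an api_token argument
}

resource "proxmox_virtual_environment_vm" "example" {
  name      = "demo-vm"
  node_name = "pve" # assumption: target Proxmox node

  cpu {
    cores = 2
  }

  memory {
    dedicated = 2048
  }
}
```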
[Read more]

Scaling GitHub Actions with Kubernetes: A Guide to ARC Deployment

Let us walk through setting up an Actions Runner Controller (ARC) for GitHub in a Kubernetes cluster. This will enable running continuous integration and continuous deployment (CI/CD) pipelines using GitHub Actions on our own infrastructure, or on cloud-based Kubernetes. First, we'll introduce a bit of the terminology. A runner is a container which runs code in response to a trigger; runners may be used to test, build, and deploy code, as well as for far more creative use cases.
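ARC's runner scale sets are typically configured through Helm values. A minimal sketch, assuming the gha-runner-scale-set chart and with hypothetical org/repo placeholders, might look like:

```yaml
# values.yaml for the gha-runner-scale-set Helm chart
# (value names hedged; check the chart documentation for your ARC version)
githubConfigUrl: "https://github.com/<org>/<repo>"
githubConfigSecret:
  github_token: "<PAT with appropriate scopes>"
minRunners: 1
maxRunners: 5
```

Runners then scale up and down between the configured bounds as workflow jobs arrive.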
[Read more]

Streamlining Secret Management with Vault in K3s Kubernetes

This post will explore deploying HashiCorp Vault to K3s (a Kubernetes distribution) using Helm and then configuring it with Terraform. This will enable us to store our secret state in Vault and make those secrets available to our K3s resources. Vault is an enterprise-level secrets manager, configurable for high availability, which integrates with Kubernetes and many CI toolsets. In the previous two posts journaling the evolution of this site's delivery, we have been managing a single secret: the Cloudflared tunnel token.
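As a sketch of the Terraform side, storing that single tunnel token in a KV v2 secrets engine might look like the following; the resource types come from the HashiCorp `vault` provider, while the in-cluster address and secret layout are assumptions.

```terraform
provider "vault" {
  address = "http://vault.vault.svc:8200" # assumption: in-cluster service address
}

variable "tunnel_token" {
  sensitive = true
}

# Enable a KV version 2 secrets engine at path "secret"
resource "vault_mount" "kv" {
  path = "secret"
  type = "kv-v2"
}

# Store the Cloudflared tunnel token as a secret
resource "vault_kv_secret_v2" "tunnel" {
  mount = vault_mount.kv.path
  name  = "cloudflared"
  data_json = jsonencode({
    token = var.tunnel_token
  })
}
```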
[Read more]

Migrating from Docker Compose to Kubernetes (K3s)

In this post, we will look at migrating Docker Compose run services to K3s, a lightweight version of Kubernetes. K3s provides an approachable way to experience Kubernetes. It is quick to spin up and takes care of a lot of boilerplate, which suits a test environment. We can work our way up to full Kubernetes (K8s) in the future. We will continue using this site as an example, building upon the previous post, which got our GitHub repo to this point.
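As an illustration of the kind of translation involved (the image name and ports are hypothetical), a single Compose service maps roughly onto a Deployment plus a Service:

```yaml
# A Compose service such as:
#   web:
#     image: myorg/hugo-site:latest
#     ports: ["8080:80"]
# becomes, approximately:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myorg/hugo-site:latest # assumption: your built site image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
```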
[Read more]

Self-Hosted Website with Hugo, Docker, and Cloudflare Tunnels

This post will step through the process of building a Hugo-based website image using Docker on Ubuntu Linux, setting up a Cloudflare tunnel, and using a Docker Compose stack to bring up the website and Cloudflared containers. This will make a website available on the internet using an existing top-level domain. Some basic knowledge of Linux is required. At the time of writing, this is how this site is being hosted.
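As a sketch of the Compose stack described (the website image name and environment variable wiring are assumptions), the two containers might be declared as:

```yaml
services:
  website:
    image: hugo-site:latest # assumption: image built from your Hugo project
    restart: unless-stopped

  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${TUNNEL_TOKEN} # tunnel token from the Cloudflare dashboard
    restart: unless-stopped
```

The tunnel carries traffic outbound from the Cloudflared container, so no inbound ports need to be opened on the host.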
[Read more]