GPU Node — LLM Inference Server
The GPU Node runs the local LLM that Logster consults to reach a verdict on each window of endpoint activity. It is shipped as a Docker image tarball that you load and run on a VM with GPU access.
Provisioning the VM with working GPU access is your responsibility — this page guides you through the prerequisites and then walks you through loading and running the image.
[!NOTE] Bring the GPU Node up before the App Node. You will need this node's endpoint URL when you configure the App Node.
Hardware
| Resource | Recommended | Minimum |
|---|---|---|
| CPU | 8 vCPU | 8 vCPU |
| RAM | 64 GB | 64 GB |
| GPU | 2 × NVIDIA H100 80 GB | 2 × NVIDIA RTX A6000 |
Step 1 — Provision the VM with GPU access
The model runs inside Docker and needs direct access to the GPU. Provision a Linux VM (Ubuntu Server 22.04 is a good default) with the GPU(s) passed through, then install the prerequisites so Docker can use them:
-
GPU passthrough — configure PCIe passthrough for the GPU(s) in your hypervisor so the VM sees the physical cards. This is hypervisor-specific (Proxmox, ESXi, KVM, Hyper-V, etc.) — follow your platform's documentation.
-
NVIDIA driver — install the NVIDIA driver inside the VM and confirm the GPUs are visible:
You should see a table listing your GPU(s). If this command fails, stop here and fix the driver/passthrough before continuing.
-
Docker Engine + Docker Compose — install Docker following the official guide.
-
NVIDIA Container Toolkit — this is what lets Docker containers reach the GPU. Install it following the NVIDIA Container Toolkit guide, then verify a container can see the GPU:
If this prints the same GPU table from inside a container, the GPU Node host is ready.
Step 2 — Load the model image
You have been shipped the model as a Docker image tarball. Load it into the VM's local Docker image store:
When the load finishes, Docker prints the image name and tag. Confirm it is present:
Step 3 — Run the model server
Start the LLM server. It exposes an OpenAI-compatible API on port 8000:
sudo docker run -d --name vllm-gemma4 --restart unless-stopped \
--gpus '"device=0"' \
--ipc=host \
--shm-size 16g \
-p 8000:8000 \
-v /var/lib/vllm-cache:/root/.cache/huggingface \
vllm/vllm-openai:gemma4-0505-cu130 \
--model cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
--max-model-len 65536 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8 \
--max-num-seqs 16 \
--max-num-batched-tokens 8192 \
--enable-prefix-caching \
--async-scheduling \
--limit-mm-per-prompt '{"image": 0, "audio": 0}' \
--host 0.0.0.0 --port 8000
Notes:
--gpus '"device=0"'selects the first GPU. Adjust the device selection to match the GPU(s) you want the server to use.- The container is set to
--restart unless-stopped, so it comes back automatically after a reboot.
First start downloads the model weights and can take several minutes. Follow the logs until the server reports it is ready to serve:
Step 4 — Verify the endpoint
From the GPU Node itself, confirm the server is answering:
You should get a JSON response listing the loaded model.
Step 5 — Note the endpoint URL for the App Node
The App Node talks to this server through its Chat Completions URL. Using the GPU Node's network address (the address the App Node can route to):
You will set this as LOCAL_LLM_ENDPOINT on the App Node — see
App Node → Step 2.
[!IMPORTANT] Make sure the App Node can reach the GPU Node on port
8000. If a firewall sits between the two nodes, open that port.