Skip to content

GPU Node — LLM Inference Server

The GPU Node runs the local LLM that Logster consults to reach a verdict on each window of endpoint activity. It is shipped as a Docker image tarball that you load and run on a VM with GPU access.

Provisioning the VM with working GPU access is your responsibility — this page guides you through the prerequisites and then walks you through loading and running the image.

[!NOTE] Bring the GPU Node up before the App Node. You will need this node's endpoint URL when you configure the App Node.


Hardware

Resource Recommended Minimum
CPU 8 vCPU 8 vCPU
RAM 64 GB 64 GB
GPU 2 × NVIDIA H100 80 GB 2 × NVIDIA RTX A6000

Step 1 — Provision the VM with GPU access

The model runs inside Docker and needs direct access to the GPU. Provision a Linux VM (Ubuntu Server 22.04 is a good default) with the GPU(s) passed through, then install the prerequisites so Docker can use them:

  1. GPU passthrough — configure PCIe passthrough for the GPU(s) in your hypervisor so the VM sees the physical cards. This is hypervisor-specific (Proxmox, ESXi, KVM, Hyper-V, etc.) — follow your platform's documentation.

  2. NVIDIA driver — install the NVIDIA driver inside the VM and confirm the GPUs are visible:

    nvidia-smi
    

    You should see a table listing your GPU(s). If this command fails, stop here and fix the driver/passthrough before continuing.

  3. Docker Engine + Docker Compose — install Docker following the official guide.

  4. NVIDIA Container Toolkit — this is what lets Docker containers reach the GPU. Install it following the NVIDIA Container Toolkit guide, then verify a container can see the GPU:

    sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
    

    If this prints the same GPU table from inside a container, the GPU Node host is ready.


Step 2 — Load the model image

You have been shipped the model as a Docker image tarball. Load it into the VM's local Docker image store:

sudo docker load -i /path/to/logster-model-image.tar

When the load finishes, Docker prints the image name and tag. Confirm it is present:

sudo docker images

Step 3 — Run the model server

Start the LLM server. It exposes an OpenAI-compatible API on port 8000:

sudo docker run -d --name vllm-gemma4 --restart unless-stopped \
  --gpus '"device=0"' \
  --ipc=host \
  --shm-size 16g \
  -p 8000:8000 \
  -v /var/lib/vllm-cache:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-0505-cu130 \
    --model cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.92 \
    --kv-cache-dtype fp8 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --async-scheduling \
    --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
    --host 0.0.0.0 --port 8000

Notes:

  • --gpus '"device=0"' selects the first GPU. Adjust the device selection to match the GPU(s) you want the server to use.
  • The container is set to --restart unless-stopped, so it comes back automatically after a reboot.

First start downloads the model weights and can take several minutes. Follow the logs until the server reports it is ready to serve:

sudo docker logs -f vllm-gemma4

Step 4 — Verify the endpoint

From the GPU Node itself, confirm the server is answering:

curl http://localhost:8000/v1/models

You should get a JSON response listing the loaded model.


Step 5 — Note the endpoint URL for the App Node

The App Node talks to this server through its Chat Completions URL. Using the GPU Node's network address (the address the App Node can route to):

http://<gpu-node-ip>:8000/v1/chat/completions

You will set this as LOCAL_LLM_ENDPOINT on the App Node — see App Node → Step 2.

[!IMPORTANT] Make sure the App Node can reach the GPU Node on port 8000. If a firewall sits between the two nodes, open that port.