Admin Guide: Daily Operations

This page is the day-to-day runbook for keeping a Logster deployment healthy. It is written for the platform engineer on call — the person who answers pages when a service falls over, when Elasticsearch goes yellow, or when the fleet doubles and the inference service starts falling behind.

For installation, see Installation. For the full failure-mode reference, see the Troubleshooting Guide.


What "healthy" looks like

All service state is managed by Docker Compose from the deploy/ directory. The authoritative list of containers is deploy/docker-compose.yml.

cd deploy/
docker compose --profile services ps

Every container below should report running (or healthy where a healthcheck is defined):

| Service | Expected state |
|---|---|
| kafka | running, healthy |
| elasticsearch | running, healthy |
| redis | running, healthy |
| logstash | running |
| kibana | running |
| prometheus | running |
| grafana | running |
| tempo | running |
| otel-collector | running |
| normalizer | running |
| inference | running |
| alerts | running |
| api | running |
| dashboard | running |

A kafka-init container should also appear as exited with code 0: it runs once on first start to create the topics, then exits. Stopped containers are hidden by default, so add -a to the ps command to see it.
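
The expected-state check above can be automated. The following is a minimal sketch, assuming your Compose version supports `docker compose ps -a --format json` and emits objects with `Service`, `State`, `Health`, and `ExitCode` fields. Some versions print a JSON array and others print one object per line, so the parser accepts both. The function names are illustrative and not part of Logster.

```python
import json

# Long-running services from the table above.
LONG_RUNNING = {
    "kafka", "elasticsearch", "redis", "logstash", "kibana", "prometheus",
    "grafana", "tempo", "otel-collector", "normalizer", "inference",
    "alerts", "api", "dashboard",
}


def parse_ps_output(text):
    """Parse `docker compose ps --format json` output (JSON array or one object per line)."""
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(ps_text):
    """Return (service, reason) pairs for anything not in its expected state."""
    seen = {row["Service"]: row for row in parse_ps_output(ps_text)}
    problems = []
    for name in sorted(LONG_RUNNING):
        row = seen.get(name)
        if row is None:
            problems.append((name, "missing"))
        elif row.get("State") != "running":
            problems.append((name, f"state={row.get('State')}"))
        elif row.get("Health") not in ("", None, "healthy"):
            # Only flag services that define a healthcheck and are failing it.
            problems.append((name, f"health={row.get('Health')}"))
    # kafka-init is a one-shot container: it should have exited with code 0.
    init = seen.get("kafka-init")
    if init is not None and (init.get("State") != "exited" or init.get("ExitCode") != 0):
        problems.append(("kafka-init", "did not exit 0"))
    return problems
```

Feed it the captured stdout of the ps command, for example via `subprocess.run(..., capture_output=True, text=True).stdout`. An empty list means every listed service matches the table.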

Quick liveness checks

# Elasticsearch — status green or yellow
curl http://localhost:9200/_cluster/health

# API — status ok
curl http://localhost:8080/health

# Dashboard — any JSON response
curl http://localhost:5001/api/summary

# Kafka — eight topics listed
docker compose exec kafka \
    kafka-topics --bootstrap-server localhost:9092 --list
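
The HTTP checks above can be folded into one pass/fail summary. This sketch uses the same three URLs and only the standard library. It assumes the API health body is JSON shaped like `{"status": "ok"}`, which is a guess based on the comment above. `run_checks` and its helpers are hypothetical names.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the body as JSON. Raises on connection or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def es_ok(health):
    """Elasticsearch passes when the cluster status is green or yellow."""
    return health.get("status") in ("green", "yellow")


def api_ok(health):
    """Assumes the API health body looks like {"status": "ok"}; adjust if yours differs."""
    return health.get("status") == "ok"


def run_checks(fetch=fetch_json):
    """Return {check_name: bool}. A failed request or non-JSON body counts as a failure."""
    checks = {
        "elasticsearch": ("http://localhost:9200/_cluster/health", es_ok),
        "api": ("http://localhost:8080/health", api_ok),
        # Any JSON response counts as a pass for the dashboard.
        "dashboard": ("http://localhost:5001/api/summary", lambda body: True),
    }
    results = {}
    for name, (url, passes) in checks.items():
        try:
            results[name] = bool(passes(fetch(url)))
        except (OSError, ValueError, AttributeError):
            results[name] = False
    return results
```

Calling `run_checks()` with no arguments performs the real requests. Passing a custom `fetch` makes the logic testable without a running stack.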

Observability

Metrics (Prometheus + Grafana)

Every Python service exposes Prometheus metrics. The scrape config lives in deploy/prometheus.yml; Grafana reads from Prometheus and has pre-provisioned dashboards.

Metrics to watch during an incident:

| Service | Metric | Meaning |
|---|---|---|
| normalizer | raw_events_total | Raw events consumed. Should climb monotonically. |
| normalizer | normalized_events_total | Normalized events published. Gap vs raw_events_total means parse failures. |
| normalizer | parse_errors | Unparseable events. A rising count points at a schema change in the upstream shipper. |
| inference | inferences_run | GNN runs completed. Should climb steadily in proportion to endpoint count. |
| inference | active_endpoints | Endpoints with an active sliding window. Dropping to zero means the service is not seeing events. |
| inference | inference_time_ms | Per-run GNN latency. A sudden jump means larger graphs or CPU saturation. |
| alerts | alerts_created | New alerts opened. |
| alerts | deduplicated | Inference results merged into existing alerts. |
| alerts | lateral_movement_detected | Correlations fired across endpoints. |
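
The gap between raw_events_total and normalized_events_total can be computed directly from a scrape of the normalizer's metrics output. The sketch below parses the Prometheus text exposition format with a deliberately small hand-rolled parser. The label names in the sample are hypothetical, and a small transient gap from in-flight events is to be expected.

```python
def sum_metric(exposition_text, metric_name):
    """Sum every sample of `metric_name` in Prometheus text exposition format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample is `name{labels} value [timestamp]` or `name value [timestamp]`.
        name, _, rest = line.partition("{")
        if rest:
            _, _, rest = rest.rpartition("}")
        else:
            name, _, rest = line.partition(" ")
        if name.strip() != metric_name:
            continue
        total += float(rest.split()[0])
    return total


def parse_gap(exposition_text):
    """Return (gap, ratio): raw minus normalized events, and that gap as a share of raw."""
    raw = sum_metric(exposition_text, "raw_events_total")
    normalized = sum_metric(exposition_text, "normalized_events_total")
    gap = raw - normalized
    ratio = gap / raw if raw else 0.0
    return gap, ratio


sample = """\
# TYPE raw_events_total counter
raw_events_total{source="edr"} 900
raw_events_total{source="syslog"} 100
normalized_events_total 950
parse_errors 50
"""
gap, ratio = parse_gap(sample)
print(f"gap={gap:.0f} events ({ratio:.1%} of raw)")
```

A ratio that keeps growing alongside parse_errors is the signal worth paging on. A gap that stays flat is more likely pipeline latency.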

Traces (Tempo + OpenTelemetry)

Every Python service initializes OTLP tracing via libs/logster-common/logster_common/tracing.py. Traces ship to the in-stack OpenTelemetry Collector, which forwards to Tempo. View them in Grafana → Explore → Tempo.

Traces answer request-path questions that metrics cannot: "Why did this alert take 12 seconds from event to dashboard?"

Logs

Every service logs to stdout, which makes docker compose logs the primary debugging tool:

# Tail one service
docker compose logs -f inference

# Last 200 lines, all services
docker compose --profile services logs --tail=200

# Narrow to errors
docker compose --profile services logs --tail=500 | grep -i error

For long-term retention, ship stdout to your log aggregator of choice. Logster does not bundle a log shipper for its own service logs.
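
When grep output is too coarse, a per-service error count shows which container is noisy. This sketch assumes the usual Compose log layout of a container-name prefix, a `|` separator, and then the message. Run the logs command with `--no-color` so the prefix contains no escape codes. `errors_per_service` is a hypothetical helper.

```python
from collections import Counter


def errors_per_service(log_text, needle="error"):
    """Count lines containing `needle` (case-insensitive), grouped by compose log prefix."""
    counts = Counter()
    for line in log_text.splitlines():
        prefix, sep, message = line.partition("|")
        if not sep:
            continue  # not a prefixed compose log line
        if needle.lower() in message.lower():
            counts[prefix.strip()] += 1
    return dict(counts)


sample = (
    "inference-1   | ERROR model file missing\n"
    "api-1         | GET /health 200\n"
    "inference-1   | Error: retrying\n"
    "alerts-1      | started\n"
)
print(errors_per_service(sample))
```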


Scaling

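The pipeline services (normalizer, inference, and alerts) are Kafka consumers, and messages are keyed by tenant_id:endpoint_id, so the scaling pattern is the same for all three: run more replicas and let Kafka rebalance partitions across them. Within a consumer group, replicas beyond a topic's partition count sit idle, which is where the ceilings below come from.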
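
Two properties drive this pattern: a given key always maps to the same partition, and a partition is owned by at most one replica in a consumer group. The sketch below illustrates both. It uses CRC32 as a stand-in hash (the real hash depends on the producer client, for example murmur2 in the Java client) and a simplified round-robin assignment rather than Kafka's actual assignors, so the partition numbers it prints will not match a live cluster.

```python
import zlib


def partition_for(tenant_id, endpoint_id, num_partitions):
    """Map a tenant_id:endpoint_id key to a partition.

    Stand-in hash: any deterministic hash gives the property shown here,
    namely that the same key always lands on the same partition.
    """
    key = f"{tenant_id}:{endpoint_id}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


def assign(partitions, replicas):
    """Spread partition numbers round-robin over replicas; extra replicas get nothing."""
    owners = {r: [] for r in range(replicas)}
    for p in range(partitions):
        owners[p % replicas].append(p)
    return owners


# Every event for one endpoint goes to one partition, so ordering is preserved.
first = partition_for("tenant-a", "host-17", 12)
again = partition_for("tenant-a", "host-17", 12)
print(first == again)

# Eight replicas on a six-partition topic: two replicas receive no partitions.
owners = assign(6, 8)
print(sum(1 for parts in owners.values() if not parts), "idle replicas")
```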

Horizontal scaling ceilings

| Service | Parallelism ceiling | Notes |
|---|---|---|
| Normalizer | 6 | Bounded by the partition count of each raw topic. Stateless, so scale freely up to that ceiling. |
| Inference | 12 | Bounded by normalized-endpoint-events partitions. Per-endpoint ordering is preserved because all events for a given endpoint hash to the same partition. |
| Alerts | 6 | Bounded by logster-inference-results partitions. See caveat below. |
| API | Unbounded | Stateless. Scale behind a load balancer. See caveat below. |
| Dashboard | Unbounded | Stateless. Every replica queries ES directly. |

[!CAUTION] Alerts service dedup state is per-replica. Running two replicas means an alert that would normally dedup may duplicate if its inferences land on different partitions. Start with one replica; only scale if you see inference-topic lag.

[!CAUTION] API alert store is in-memory by default. Scaling the API horizontally without first swapping in the Postgres store means replicas will not share alert state. See Important Considerations.
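
The Alerts caveat is easy to see in a toy model. The sketch below is not Logster's dedup logic. It assumes a hypothetical dedup key and a replica that keeps its open alerts in local memory, then routes inference results to replicas by partition.

```python
class AlertsReplica:
    """Toy replica that keeps dedup state in local memory (hypothetical model)."""

    def __init__(self):
        self.open_alerts = set()
        self.created = 0
        self.deduplicated = 0

    def handle(self, dedup_key):
        if dedup_key in self.open_alerts:
            self.deduplicated += 1  # merged into an alert this replica already knows
        else:
            self.open_alerts.add(dedup_key)
            self.created += 1  # looks new from this replica's point of view


def simulate(results, replicas):
    """Route (partition, dedup_key) pairs to replicas by partition; return alerts created."""
    workers = [AlertsReplica() for _ in range(replicas)]
    for partition, dedup_key in results:
        workers[partition % replicas].handle(dedup_key)
    return sum(w.created for w in workers)


# Three results for the same logical alert, arriving on two different partitions.
results = [(0, "tenant-a:lateral-move"), (1, "tenant-a:lateral-move"), (0, "tenant-a:lateral-move")]
print(simulate(results, replicas=1))
print(simulate(results, replicas=2))
```

With one replica the three results collapse into a single alert. With two, each replica opens its own copy because neither can see the other's state.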

Vertical scaling

The only service that routinely benefits from vertical scaling is inference. Give it more CPU (or switch model.device to cuda and provision a GPU) before giving it more replicas.


Backup and retention

| Data | Where it lives | What to do |
|---|---|---|
| Kafka topics | kafka_data volume | Retention is Kafka's default (7 days for log segments). Adjust with log.retention.hours. |
| Elasticsearch indices | es_data volume | No ILM policy is configured by default. Configure one before long-lived deployments. |
| Redis state | redis_data volume | Ephemeral. Losing it means the inference service re-populates sliding windows from live traffic; the first 3 minutes after recovery are cold. |
| Alerts (in-memory) | In-memory in the API container | Lost on API restart. Swap to the Postgres store for durability; see Important Considerations. |
| Prometheus / Grafana / Tempo | Their volumes | Standard backup rules for each tool apply. |

Simple disaster-recovery baseline — snapshot volumes:

docker run --rm -v logster_es_data:/data -v "$PWD":/backup \
    alpine tar czf /backup/es_data.tar.gz -C /data .

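Restore is the inverse: extract the archive back into the volume. Stop the affected service before snapshotting or restoring a volume; a tarball taken from a live Elasticsearch data directory may not be consistent.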
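
For scripting backups, it helps to build the snapshot command and its inverse from the same inputs. This sketch only constructs argument lists and runs nothing. The function names are hypothetical, and the volume name assumes the `logster_` project prefix used in the command above.

```python
import shlex


def snapshot_cmd(volume, backup_dir, archive):
    """argv for archiving a named volume into backup_dir/archive (same shape as above)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data",
        "-v", f"{backup_dir}:/backup",
        "alpine", "tar", "czf", f"/backup/{archive}", "-C", "/data", ".",
    ]


def restore_cmd(volume, backup_dir, archive):
    """argv for the inverse: extract backup_dir/archive into the volume."""
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data",
        "-v", f"{backup_dir}:/backup:ro",  # the backup directory is only read
        "alpine", "tar", "xzf", f"/backup/{archive}", "-C", "/data",
    ]


def render(cmd):
    """Quote an argv list so it can be pasted into a shell."""
    return shlex.join(cmd)


print(render(snapshot_cmd("logster_es_data", "/srv/backups", "es_data.tar.gz")))
print(render(restore_cmd("logster_es_data", "/srv/backups", "es_data.tar.gz")))
```

Pass either list to `subprocess.run(cmd, check=True)` to execute it. Extracting into a non-empty volume overlays files rather than replacing them, so restore into an empty volume.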


Upgrading

Logster does not ship a formal migration framework. Upgrades today look like this:

cd deploy/

# Stop the Logster services by name (infrastructure stays up)
docker compose --profile services stop normalizer inference alerts api dashboard

# Pull the new code
git pull

# Rebuild service images
docker compose --profile services build

# Bring services back up
docker compose --profile services up -d

# Verify
curl http://localhost:8080/health

Things to watch during an upgrade:

  • Schema changes. Check libs/logster-common/logster_common/schemas/ for changes to NormalizedEvent, InferenceResult, or Alert.
  • Model changes. If a release bundles new model weights, update model.path / model.linux_model_path to match. See Model Deployment.
  • Config changes. Diff deploy/service-config.yaml against your working copy before rebuilding.
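  • Kafka consumer group resets. Avoid changing kafka.group_id across an upgrade. A new group ID has no committed offsets, so consumers fall back to their auto.offset.reset policy: with earliest they reprocess the entire retained log, and with latest they skip anything published while the services were down.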
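
The config check in the list above can be done with any diff tool. As a minimal sketch, the standard library's `difflib` produces a unified diff between your working copy and the incoming file. The YAML values in the sample are made up for illustration, and `config_diff` is a hypothetical helper.

```python
import difflib


def config_diff(old_text, new_text,
                old_name="service-config.yaml (working copy)",
                new_name="service-config.yaml (new release)"):
    """Return a unified diff between two versions of a config file as one string."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


old = "kafka:\n  group_id: logster-inference\nmodel:\n  device: cpu\n"
new = "kafka:\n  group_id: logster-inference\nmodel:\n  device: cuda\n"
print(config_diff(old, new))
```

An empty string means the two files are identical. A changed `group_id` line in the output is the case the last bullet warns about.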

[!WARNING] docker compose down -v wipes every volume — Kafka data, ES indices, Redis, Prometheus, Grafana, Tempo. Never use it as part of an upgrade unless you intend a full reset.


Common failure modes (at a glance)

Most operational incidents match one of a handful of patterns. The full symptom → cause → fix reference is in the Troubleshooting Guide. The quick list:

| Symptom | Likely cause | Fix page |
|---|---|---|
| Dashboard tiles empty | No data in ES (time range or pipeline stall) | Troubleshooting Guide |
| inferences_run metric flat | Consumer lag on normalized-endpoint-events | Troubleshooting Guide |
| ES status: red | Disk full or low memory | Troubleshooting Guide |
| Inference container crash | Model file missing or wrong path | Model Deployment |
| High error prediction rate | Windows too small for graph construction | Troubleshooting Guide |
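
To confirm the consumer-lag row, sum the lag per topic from the output of Kafka's consumer-group describe tool. This sketch assumes the usual column layout with `TOPIC` and `LAG` headers and locates columns by header name rather than position. The group name in the sample is hypothetical.

```python
def lag_by_topic(describe_output):
    """Sum the LAG column per topic from consumer-group describe output."""
    header = None
    totals = {}
    for line in describe_output.splitlines():
        cols = line.split()
        if not cols:
            continue
        if "TOPIC" in cols and "LAG" in cols:
            header = cols  # remember column order from the header row
            continue
        if header is None or len(cols) < len(header):
            continue
        row = dict(zip(header, cols))
        # Partitions with no committed offset report "-" and are skipped.
        if row["LAG"].isdigit():
            totals[row["TOPIC"]] = totals.get(row["TOPIC"], 0) + int(row["LAG"])
    return totals


sample = """\
GROUP      TOPIC                       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
inference  normalized-endpoint-events  0          100             150             50   c-1          /10.0.0.2  c
inference  normalized-endpoint-events  1          200             200             0    c-1          /10.0.0.2  c
inference  normalized-endpoint-events  2          -               10              -    -            -          -
"""
print(lag_by_topic(sample))
```

Lag that grows between two samples taken a minute apart means the inference service is falling behind. Lag that is large but shrinking means it is catching up.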

Next steps