Admin Guide: Daily Operations

This page is the day-to-day runbook for keeping a Logster deployment healthy. It is written for the platform engineer on call — the person who answers pages when a service falls over, when Elasticsearch goes yellow, or when the fleet doubles and the inference service starts falling behind.

For installation, see Installation. For the full failure-mode reference, see the Troubleshooting Guide.

What "healthy" looks like

All service state is managed by Docker Compose from the deploy/ directory. The authoritative list of containers is deploy/docker-compose.yml.

cd deploy/
docker compose --profile services ps

Every container below should report running (or healthy where a healthcheck is defined):

Service	Expected state
`kafka`	running, healthy
`elasticsearch`	running, healthy
`redis`	running, healthy
`logstash`	running
`kibana`	running
`prometheus`	running
`grafana`	running
`tempo`	running
`otel-collector`	running
`normalizer`	running
`inference`	running
`alerts`	running
`api`	running
`dashboard`	running

In addition, the one-shot kafka-init container should show as exited with code 0 (visible with ps --all). It runs once on first start to create the topics, then exits.
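
The expectations in the table above can be checked mechanically. The following is a minimal sketch, assuming `docker compose ps --all --format json` emits objects with `Service`, `State`, `Health`, and `ExitCode` fields (recent Compose releases print one JSON object per line, older ones a single array; both are handled). The helper names are hypothetical and not part of Logster.

```python
import json

# Expected services, taken from the table above.
HEALTHCHECKED = {"kafka", "elasticsearch", "redis"}
RUNNING_ONLY = {
    "logstash", "kibana", "prometheus", "grafana", "tempo", "otel-collector",
    "normalizer", "inference", "alerts", "api", "dashboard",
}


def parse_ps_output(text):
    """Parse `docker compose ps --format json` output (array or one object per line)."""
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_unhealthy(ps_text):
    """Return a list of human-readable problems; an empty list means healthy."""
    by_service = {row.get("Service"): row for row in parse_ps_output(ps_text)}
    problems = []
    for name in sorted(HEALTHCHECKED | RUNNING_ONLY):
        row = by_service.get(name)
        if row is None:
            problems.append(f"{name}: container missing")
        elif row.get("State") != "running":
            problems.append(f"{name}: state is {row.get('State')}")
        elif name in HEALTHCHECKED and row.get("Health") != "healthy":
            problems.append(f"{name}: health is {row.get('Health') or 'unknown'}")
    # kafka-init is a one-shot container: it should have exited with code 0.
    init = by_service.get("kafka-init")
    if init is not None and (init.get("State") != "exited" or init.get("ExitCode") != 0):
        problems.append("kafka-init: did not exit 0")
    return problems
```

To use it, capture the output of `docker compose --profile services ps --all --format json` and pass the text to `find_unhealthy`; any returned line names the container to look at first.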

Quick liveness checks

# Elasticsearch — status green or yellow
curl http://localhost:9200/_cluster/health

# API — status ok
curl http://localhost:8080/health

# Dashboard — any JSON response
curl http://localhost:5001/api/summary

# Kafka — eight topics listed
docker compose exec kafka \
    kafka-topics --bootstrap-server localhost:9092 --list
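
The Elasticsearch and Kafka checks above can be turned into pass/fail helpers. This is a sketch only: it relies on the `status` field of the `_cluster/health` response and on `kafka-topics --list` printing one topic name per line. Whether the expected count of eight includes Kafka's internal topics is not stated in this guide, so the hypothetical `include_internal` flag lets you count either way.

```python
import json


def es_status_ok(health_body):
    """True when the _cluster/health JSON body reports green or yellow."""
    status = json.loads(health_body).get("status")
    return status in ("green", "yellow")


def count_topics(list_output, include_internal=False):
    """Count topic names in `kafka-topics --list` output (one name per line)."""
    names = [line.strip() for line in list_output.splitlines() if line.strip()]
    if not include_internal:
        # Kafka's internal topics start with a double underscore,
        # for example __consumer_offsets.
        names = [name for name in names if not name.startswith("__")]
    return len(names)
```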

Observability

Metrics (Prometheus + Grafana)

Every Python service exposes Prometheus metrics. The scrape config lives in deploy/prometheus.yml; Grafana reads from Prometheus and has pre-provisioned dashboards.

Prometheus UI: http://localhost:9090
Grafana UI: http://localhost:3000 (login admin / logster)

Metrics to watch during an incident:

Service	Metric	Meaning
normalizer	`raw_events_total`	Raw events consumed. Should climb monotonically.
normalizer	`normalized_events_total`	Normalized events published. A growing gap versus `raw_events_total` means parse failures.
normalizer	`parse_errors`	Unparseable events. A rising count points at a schema change in the upstream shipper.
inference	`inferences_run`	GNN runs completed. Should climb steadily in proportion to endpoint count.
inference	`active_endpoints`	Endpoints with an active sliding window. Dropping to zero means the service is not seeing events.
inference	`inference_time_ms`	Per-run GNN latency. A sudden jump means larger graphs or CPU saturation.
alerts	`alerts_created`	New alerts opened.
alerts	`deduplicated`	Inference results merged into existing alerts.
alerts	`lateral_movement_detected`	Correlations fired across endpoints.
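
The gap between `raw_events_total` and `normalized_events_total` can be computed from a scrape of the normalizer's metrics endpoint. The sketch below parses the Prometheus text exposition format and sums each counter across label sets. The metric names come from the table above; the exported names in a real deployment may carry a prefix or suffix, and the endpoint address is not given in this guide, so treat both as assumptions to verify.

```python
import re

# One sample line of the Prometheus text exposition format:
# name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)(\s+-?\d+)?$")


def sum_metric(exposition_text, metric_name):
    """Sum every sample of metric_name across label sets."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == metric_name:
            total += float(match.group(3))
    return total


def parse_gap(exposition_text):
    """Return (gap, ratio) between raw and normalized event counters."""
    raw = sum_metric(exposition_text, "raw_events_total")
    normalized = sum_metric(exposition_text, "normalized_events_total")
    gap = raw - normalized
    return gap, (gap / raw if raw else 0.0)
```

A ratio that stays small and flat is normal noise; a ratio that grows between two scrapes is the signal to check `parse_errors` and the upstream shipper.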

Traces (Tempo + OpenTelemetry)

Every Python service initializes OTLP tracing via libs/logster-common/logster_common/tracing.py. Traces ship to the in-stack OpenTelemetry Collector, which forwards to Tempo. View them in Grafana → Explore → Tempo.

Traces answer request-path questions that metrics cannot: "Why did this alert take 12 seconds from event to dashboard?"

Logs

Every service logs to stdout, which makes docker compose logs the primary debugging tool:

# Tail one service
docker compose logs -f inference

# Last 200 lines, all services
docker compose --profile services logs --tail=200

# Narrow to errors
docker compose --profile services logs --tail=500 | grep -i error

For long-term retention, ship stdout to your log aggregator of choice. Logster does not bundle a log shipper for its own service logs.
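
When `grep -i error` returns too much to read, a per-container count shows where to look first. This is a small sketch that assumes the usual `docker compose logs` layout, where each line starts with the container name followed by a `|` separator; the function name is hypothetical.

```python
from collections import Counter


def errors_by_container(log_text, needle="error"):
    """Count lines containing needle (case-insensitive), grouped by log prefix."""
    counts = Counter()
    for line in log_text.splitlines():
        prefix, sep, message = line.partition("|")
        if not sep:
            continue  # not a prefixed compose log line
        if needle.lower() in message.lower():
            counts[prefix.strip()] += 1
    return counts
```

Feed it the captured output of `docker compose --profile services logs --tail=500` and call `.most_common()` on the result to rank containers by error volume.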

Scaling

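Because each pipeline service (normalizer, inference, alerts) consumes Kafka topics whose messages are keyed by tenant_id:endpoint_id, the scaling pattern is the same for all of them: run more replicas, and let Kafka rebalance partitions across them. The API and dashboard are not partition-bound and scale as ordinary stateless web services.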

Horizontal scaling ceilings

Service	Parallelism ceiling	Notes
Normalizer	6	Bounded by the partition count of each raw topic. Stateless — scale freely.
Inference	12	Bounded by `normalized-endpoint-events` partitions. Per-endpoint ordering preserved because all events for a given endpoint hash to the same partition.
Alerts	6	Bounded by `logster-inference-results` partitions. See caveat below.
API	Unbounded	Stateless. Scale behind a load balancer. See caveat below.
Dashboard	Unbounded	Stateless. Every replica queries ES directly.
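
The ceilings in the table follow from how consumer groups work: within one group, each partition is read by at most one consumer, so replicas beyond the partition count sit idle. The sketch below illustrates that arithmetic and the "same key, same partition" property. The hash used here is a stand-in chosen for illustration; Kafka's default partitioner uses murmur2, so real partition numbers will differ.

```python
import zlib


def effective_parallelism(replicas, partitions):
    """Return (working replicas, idle replicas) for one consumer group."""
    busy = min(replicas, partitions)
    return busy, replicas - busy


def partition_for(key, partitions):
    """Illustrative stand-in partitioner: CRC32 of the key modulo partition count.

    Kafka's default partitioner uses murmur2, so real partition numbers differ;
    the property shown (same key, same partition) is the same.
    """
    return zlib.crc32(key.encode("utf-8")) % partitions
```

For example, `effective_parallelism(8, 6)` reports six working normalizer replicas and two idle ones, which is why scaling past the ceiling adds cost without adding throughput.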

[!CAUTION] Alerts service dedup state is per-replica. With two replicas, inference results that would normally merge into one alert may open duplicate alerts if they land on different partitions. Start with one replica; only scale if you see inference-topic lag.
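
The duplication risk can be illustrated with a toy model. This is not the alerts service's actual logic: the dedup key, the routing of partitions to replicas, and the class name are all hypothetical, chosen only to show what happens when each replica keeps private in-memory state.

```python
class AlertsReplica:
    """Toy stand-in for one alerts replica with private, in-memory dedup state."""

    def __init__(self):
        self.open_alerts = {}  # dedup key -> number of merged inference results

    def handle(self, dedup_key):
        """Return 'created' for a new alert, 'deduplicated' for a merge."""
        if dedup_key in self.open_alerts:
            self.open_alerts[dedup_key] += 1
            return "deduplicated"
        self.open_alerts[dedup_key] = 1
        return "created"


def replay(results, replica_count):
    """Route (partition, dedup_key) pairs to replicas; return alerts created."""
    replicas = [AlertsReplica() for _ in range(replica_count)]
    created = 0
    for partition, dedup_key in results:
        replica = replicas[partition % replica_count]
        if replica.handle(dedup_key) == "created":
            created += 1
    return created
```

Replaying three results for the same key that arrive on partitions 0, 1, and 0 yields one alert with a single replica and two alerts with two replicas, because the second replica never sees the first replica's state.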

[!CAUTION] API alert store is in-memory by default. Scaling the API horizontally without first swapping in the Postgres store means replicas will not share alert state. See Important Considerations.

Vertical scaling

The only service that routinely benefits from vertical scaling is inference. Give it more CPU (or switch model.device to cuda and provision a GPU) before giving it more replicas.

Backup and retention

Data	Where it lives	What to do
Kafka topics	`kafka_data` volume	Retention is Kafka's default (7 days for log segments). Adjust with `log.retention.hours`.
Elasticsearch indices	`es_data` volume	No ILM policy is configured by default. Configure one before running a long-lived deployment.
Redis state	`redis_data` volume	Ephemeral. Losing it means the inference service re-populates sliding windows from live traffic; the first 3 minutes after recovery are cold.
Alerts (in-memory)	In-memory in the API container	Lost on API restart. Swap to the Postgres store for durability — see Important Considerations.
Prometheus / Grafana / Tempo	Their volumes	Standard backup rules for each tool apply.
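
For the Elasticsearch row, an index lifecycle policy is the usual way to bound disk usage. The sketch below builds a request body for Elasticsearch's `PUT _ilm/policy/<name>` endpoint with a hot phase that rolls over by age and a delete phase. The ages are placeholder values, not Logster recommendations, and rollover only takes effect for indices written through a rollover alias or data stream, which this guide does not describe.

```python
import json


def build_ilm_policy(rollover_max_age="1d", delete_after="30d"):
    """Build a request body for PUT _ilm/policy/<name> with hot and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": rollover_max_age}}},
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


def render(policy):
    """Serialize the policy as indented JSON, ready to send as the request body."""
    return json.dumps(policy, indent=2)
```

The rendered JSON can be sent with curl using `-X PUT`, a `Content-Type: application/json` header, and a policy name of your choosing.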

Simple disaster-recovery baseline — snapshot volumes:

# Quote $PWD so the bind mount survives paths that contain spaces.
docker run --rm -v logster_es_data:/data -v "$PWD":/backup \
    alpine tar czf /backup/es_data.tar.gz -C /data .

Restore is the inverse: mount the same volume and backup directory, then extract the archive into the volume with tar xzf. Always stop the affected service before restoring a volume.
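
If you script backups, it helps to build the snapshot and restore commands from the same inputs so they cannot drift apart. A minimal sketch follows; it only constructs argument lists and leaves execution to the caller. The function names are hypothetical, and the volume name depends on your Compose project name.

```python
def snapshot_command(volume, backup_dir, archive):
    """argv for archiving a named volume into backup_dir/archive."""
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data",
        "-v", f"{backup_dir}:/backup",
        "alpine", "tar", "czf", f"/backup/{archive}", "-C", "/data", ".",
    ]


def restore_command(volume, backup_dir, archive):
    """argv for extracting backup_dir/archive back into a named volume."""
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:/data",
        "-v", f"{backup_dir}:/backup",
        "alpine", "tar", "xzf", f"/backup/{archive}", "-C", "/data",
    ]
```

Run either list with `subprocess.run(cmd, check=True)`. Note that the restore extracts over whatever is already in the volume; it does not clear stale files first.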

Upgrading

Logster does not ship a formal migration framework. Upgrades today look like this:

cd deploy/

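# Stop only the Logster services so infrastructure stays up.
# (`down` with the profile active also removes any containers that are not behind a profile.)
docker compose --profile services stop normalizer inference alerts api dashboard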

# Pull the new code
git pull

# Rebuild service images
docker compose --profile services build

# Bring services back up
docker compose --profile services up -d

# Verify
curl http://localhost:8080/health
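
A single curl right after `up -d` can fail simply because the API is still starting. The sketch below polls until the endpoint answers. It assumes the health endpoint returns a JSON object with a `status` field equal to `ok`, which matches the liveness check earlier in this guide but is not otherwise specified; the function names and retry defaults are hypothetical.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout=3.0):
    """GET url and return the JSON body's "status" field, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8")).get("status")
    except (OSError, ValueError, AttributeError):
        # Connection refused, HTTP errors, invalid JSON, or a non-object body.
        return None


def wait_until_ok(url, attempts=30, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll until fetch(url) returns "ok"; return True on success, False on timeout."""
    for attempt in range(attempts):
        if fetch(url) == "ok":
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ok("http://localhost:8080/health")` waits up to about a minute with the defaults. The `fetch` and `sleep` parameters exist so the loop can be exercised without a network.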

Things to watch during an upgrade:

Schema changes. Check libs/logster-common/logster_common/schemas/ for changes to NormalizedEvent, InferenceResult, or Alert.
Model changes. If a release bundles new model weights, update model.path / model.linux_model_path to match. See Model Deployment.
Config changes. Diff deploy/service-config.yaml against your working copy before rebuilding.
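Kafka consumer group resets. Avoid changing kafka.group_id across an upgrade. A new group ID has no committed offsets, so consumers start wherever Kafka's auto.offset.reset setting points: earliest reprocesses the entire retained log, and latest skips anything produced while the services were down.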
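
The config and consumer-group checks above can be combined into one pre-upgrade step. This sketch diffs two versions of the config as plain text and flags any changed line that mentions `group_id`. It does not parse YAML, so it reports textual changes only, and the function names are hypothetical.

```python
import difflib


def config_diff(old_text, new_text):
    """Unified diff between your working copy and the incoming config."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="working-copy", tofile="incoming", lineterm="",
    ))


def group_id_changes(diff_lines):
    """Added or removed lines that mention group_id (worth a second look)."""
    return [
        line for line in diff_lines
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
        and "group_id" in line
    ]
```

Read your current deploy/service-config.yaml and the version from the new release, pass both to `config_diff`, and treat a non-empty `group_id_changes` result as a reason to pause before rebuilding.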

[!WARNING] docker compose down -v wipes every volume — Kafka data, ES indices, Redis, Prometheus, Grafana, Tempo. Never use it as part of an upgrade unless you intend a full reset.

Common failure modes (at a glance)

Most operational incidents match one of a handful of patterns. The full symptom → cause → fix reference is in the Troubleshooting Guide. The quick list:

Symptom	Likely cause	Fix page
Dashboard tiles empty	No data in ES (time range or pipeline stall)	Troubleshooting Guide
`inferences_run` metric flat	Consumer lag on `normalized-endpoint-events`	Troubleshooting Guide
ES `status: red`	Disk full or low memory	Troubleshooting Guide
Inference container crash	Model file missing or wrong path	Model Deployment
High `error` prediction rate	Windows too small for graph construction	Troubleshooting Guide

Next steps

Important Considerations: production hardening checklist.
Troubleshooting Guide: the full failure-mode reference.
Installation Parameters: tune behavior without resorting to panicked restarts.