Admin Guide: Daily Operations
This page is the day-to-day runbook for keeping a Logster deployment healthy. It is written for the platform engineer on call — the person who answers pages when a service falls over, when Elasticsearch goes yellow, or when the fleet doubles and the inference service starts falling behind.
For installation, see Installation. For the full failure-mode reference, see the Troubleshooting Guide.
What "healthy" looks like
All service state is managed by Docker Compose from the deploy/
directory. The authoritative list of containers is
deploy/docker-compose.yml.
Every container below should report running (or healthy where a
healthcheck is defined):
| Service | Expected state |
|---|---|
| kafka | running, healthy |
| elasticsearch | running, healthy |
| redis | running, healthy |
| logstash | running |
| kibana | running |
| prometheus | running |
| grafana | running |
| tempo | running |
| otel-collector | running |
| normalizer | running |
| inference | running |
| alerts | running |
| api | running |
| dashboard | running |
Plus a completed kafka-init container (it only runs once on first
start to create the topics, then exits 0).
Quick liveness checks
# Elasticsearch — status green or yellow
curl http://localhost:9200/_cluster/health
# API — status ok
curl http://localhost:8080/health
# Dashboard — any JSON response
curl http://localhost:5001/api/summary
# Kafka — eight topics listed
docker compose exec kafka \
kafka-topics --bootstrap-server localhost:9092 --list
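The Elasticsearch check above can be turned into a pass/fail test for use in a cron job or a pre-deploy gate. This is a minimal sketch, assuming the compact JSON body that `_cluster/health` returns by default; the function name `es_status_ok` is hypothetical and not part of Logster.

```shell
#!/usr/bin/env bash
# es_status_ok: read an Elasticsearch _cluster/health JSON body on stdin and
# succeed only if the cluster status is green or yellow.
# Hypothetical helper for illustration; not shipped with Logster.
es_status_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"(green|yellow)"'
}

# Usage (with the stack up):
#   curl -fsS http://localhost:9200/_cluster/health | es_status_ok \
#     && echo "elasticsearch ok" || echo "elasticsearch NOT ok"
```

If `curl` fails, the function receives empty input and reports failure, so an unreachable cluster is treated the same as a red one.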
Observability
Metrics (Prometheus + Grafana)
Every Python service exposes Prometheus metrics. The scrape config lives in deploy/prometheus.yml; Grafana reads from Prometheus and has pre-provisioned dashboards.
- Prometheus UI: http://localhost:9090
- Grafana UI: http://localhost:3000 (login admin/logster)
Metrics to watch during an incident:
| Service | Metric | Meaning |
|---|---|---|
| normalizer | raw_events_total | Raw events consumed. Should climb monotonically. |
| normalizer | normalized_events_total | Normalized events published. Gap vs raw_events_total means parse failures. |
| normalizer | parse_errors | Unparseable events. A rising count points at a schema change in the upstream shipper. |
| inference | inferences_run | GNN runs completed. Should climb steadily in proportion to endpoint count. |
| inference | active_endpoints | Endpoints with an active sliding window. Dropping to zero means the service is not seeing events. |
| inference | inference_time_ms | Per-run GNN latency. A sudden jump means larger graphs or CPU saturation. |
| alerts | alerts_created | New alerts opened. |
| alerts | deduplicated | Inference results merged into existing alerts. |
| alerts | lateral_movement_detected | Correlations fired across endpoints. |
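The gap between raw_events_total and normalized_events_total is easier to judge as a percentage than as two raw counters. The helper below is a rough sketch of that arithmetic; `parse_gap_pct` is a hypothetical name, and the two values are assumed to have been read from the Prometheus UI or its HTTP query API (`/api/v1/query`). The metric names as exposed by your build may carry a prefix or labels not shown in this guide.

```shell
#!/usr/bin/env bash
# parse_gap_pct RAW NORMALIZED: print the share of raw events that never
# became normalized events, as a percentage with two decimals.
# Hypothetical helper; the inputs are plain counter values.
parse_gap_pct() {
  awk -v raw="$1" -v norm="$2" 'BEGIN {
    if (raw <= 0) { print "0.00"; exit }   # avoid dividing by zero
    printf "%.2f\n", (raw - norm) * 100 / raw
  }'
}

# Usage:
#   parse_gap_pct 1000 950
```

Counters restart from zero when a container restarts, so compare values taken from the same process lifetime.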
Traces (Tempo + OpenTelemetry)
Every Python service initializes OTLP tracing via libs/logster-common/logster_common/tracing.py. Traces ship to the in-stack OpenTelemetry Collector, which forwards to Tempo. View them in Grafana → Explore → Tempo.
Traces answer request-path questions that metrics cannot: "Why did this alert take 12 seconds from event to dashboard?"
Logs
Every service logs to stdout, which makes docker compose logs the
primary debugging tool:
# Tail one service
docker compose logs -f inference
# Last 200 lines, all services
docker compose --profile services logs --tail=200
# Narrow to errors
docker compose --profile services logs --tail=500 | grep -i error
For long-term retention, ship stdout to your log aggregator of choice. Logster does not bundle a log shipper for its own service logs.
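When the grep for errors returns hundreds of lines, a per-service count shows where to look first. A small sketch, assuming the usual `<container> | <message>` prefix that `docker compose logs` puts on each line; `count_errors_by_service` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_errors_by_service: read `docker compose logs` output on stdin and
# print "<count> <prefix>" for lines containing "error" (case-insensitive),
# busiest first. Assumes each line starts with "<container> | ".
count_errors_by_service() {
  grep -i 'error' \
    | awk -F'|' '{ gsub(/[[:space:]]+$/, "", $1); print $1 }' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Usage:
#   docker compose --profile services logs --no-color --tail=500 \
#     | count_errors_by_service
```

The match is a plain substring search, so a log line that merely mentions the word "error" is counted too.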
Scaling
Because every service is a Kafka consumer reading topics keyed by
tenant_id:endpoint_id, the scaling pattern is the same for all of
them: run more replicas, and let Kafka rebalance partitions across
them.
Horizontal scaling ceilings
| Service | Parallelism ceiling | Notes |
|---|---|---|
| Normalizer | 6 | Bounded by the partition count of each raw topic. Stateless — scale freely. |
| Inference | 12 | Bounded by normalized-endpoint-events partitions. Per-endpoint ordering preserved because all events for a given endpoint hash to the same partition. |
| Alerts | 6 | Bounded by logster-inference-results partitions. See caveat below. |
| API | Unbounded | Stateless. Scale behind a load balancer. See caveat below. |
| Dashboard | Unbounded | Stateless. Every replica queries ES directly. |
[!CAUTION] Alerts service dedup state is per-replica. Running two replicas means an alert that would normally dedup may duplicate if its inferences land on different partitions. Start with one replica; only scale if you see inference-topic lag.
[!CAUTION] API alert store is in-memory by default. Scaling the API horizontally without first swapping in the Postgres store means replicas will not share alert state. See Important Considerations.
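Replicas beyond a service's partition count sit idle, so it can help to cap a requested replica count at the ceilings in the table above before scaling. The sketch below hard-codes those ceilings; `max_replicas` and `clamp_replicas` are hypothetical helpers, and the numbers need updating if the topic partition counts change.

```shell
#!/usr/bin/env bash
# max_replicas SERVICE: print the parallelism ceiling from the scaling table,
# or an empty line for services without a partition-bound ceiling.
max_replicas() {
  case "$1" in
    normalizer) echo 6 ;;
    inference)  echo 12 ;;
    alerts)     echo 6 ;;
    *)          echo "" ;;
  esac
}

# clamp_replicas SERVICE REQUESTED: print REQUESTED, capped at the ceiling.
clamp_replicas() {
  local ceiling
  ceiling="$(max_replicas "$1")"
  if [ -n "$ceiling" ] && [ "$2" -gt "$ceiling" ]; then
    echo "$ceiling"
  else
    echo "$2"
  fi
}

# Usage:
#   docker compose --profile services up -d \
#     --scale inference="$(clamp_replicas inference 16)"
```

The `--scale` flag only works if the compose file does not pin a fixed container name or host port for that service, which is an assumption about deploy/docker-compose.yml that this guide does not confirm. The cautions above about the alerts and API services still apply regardless of the ceiling.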
Vertical scaling
The only service that routinely benefits from vertical scaling is
inference. Give it more CPU (or switch model.device to cuda
and provision a GPU) before giving it more replicas.
Backup and retention
| Data | Where it lives | What to do |
|---|---|---|
| Kafka topics | kafka_data volume | Retention is Kafka's default (7 days for log segments). Adjust with log.retention.hours. |
| Elasticsearch indices | es_data volume | No ILM policy configured by default; configure one before long-lived deployments. |
| Redis state | redis_data volume | Ephemeral. Losing it means the inference service re-populates sliding windows from live traffic; the first 3 minutes after recovery are cold. |
| Alerts (in-memory) | In-memory in the API container | Lost on API restart. Swap to the Postgres store for durability; see Important Considerations. |
| Prometheus / Grafana / Tempo | Their volumes | Standard backup rules for each tool apply. |
Simple disaster-recovery baseline — snapshot volumes:
docker run --rm -v logster_es_data:/data -v $PWD:/backup \
alpine tar czf /backup/es_data.tar.gz -C /data .
Restore is the inverse. Always stop the affected service before restoring a volume.
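Spelled out, the snapshot and its inverse could look like the pair of helpers below. This is a sketch rather than a supported tool: `backup_volume` and `restore_volume` are hypothetical names, and the volume name `logster_es_data` is taken from the example above, so confirm the real names with `docker volume ls` first.

```shell
#!/usr/bin/env bash
# Both helpers honor $DOCKER so the command can be dry-run with DOCKER=echo.

# backup_volume VOLUME ARCHIVE: tar the volume contents into $PWD/ARCHIVE.
backup_volume() {
  ${DOCKER:-docker} run --rm -v "$1:/data" -v "$PWD:/backup" \
    alpine tar czf "/backup/$2" -C /data .
}

# restore_volume VOLUME ARCHIVE: unpack $PWD/ARCHIVE back into the volume.
# Stop the service that owns the volume before calling this.
restore_volume() {
  ${DOCKER:-docker} run --rm -v "$1:/data" -v "$PWD:/backup" \
    alpine tar xzf "/backup/$2" -C /data
}

# Usage:
#   docker compose stop elasticsearch
#   restore_volume logster_es_data es_data.tar.gz
#   docker compose start elasticsearch
```

Note that the restore unpacks over whatever is already in the volume; it does not clear existing files first.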
Upgrading
Logster does not ship a formal migration framework. Upgrades today look like this:
cd deploy/
# Stop services (infrastructure stays up)
docker compose --profile services down
# Pull the new code
git pull
# Rebuild service images
docker compose --profile services build
# Bring services back up
docker compose --profile services up -d
# Verify
curl http://localhost:8080/health
Things to watch during an upgrade:
- Schema changes. Check libs/logster-common/logster_common/schemas/ for changes to NormalizedEvent, InferenceResult, or Alert.
- Model changes. If a release bundles new model weights, update model.path / model.linux_model_path to match. See Model Deployment.
- Config changes. Diff deploy/service-config.yaml against your working copy before rebuilding.
- Kafka consumer group resets. Avoid changing kafka.group_id across an upgrade; doing so resets consumer offsets and reprocesses the entire retained log.
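The schema and config checks in this list can be partly automated by scanning the incoming changes before pulling them. The following is a sketch under assumptions: `flag_upgrade_risks` is a hypothetical helper, it only knows the two paths named in this guide, and the usage line assumes the current branch tracks an upstream.

```shell
#!/usr/bin/env bash
# flag_upgrade_risks: read changed file paths on stdin (one per line, relative
# to the repository root) and print a note for each path that this guide says
# to review before rebuilding.
flag_upgrade_risks() {
  while IFS= read -r path; do
    case "$path" in
      libs/logster-common/logster_common/schemas/*)
        echo "schema change: $path" ;;
      deploy/service-config.yaml)
        echo "config change: $path" ;;
    esac
  done
}

# Usage (before `git pull`):
#   git fetch
#   git diff --name-only HEAD '@{u}' | flag_upgrade_risks
```

An empty result only means those two locations are untouched; model weight changes still need the manual check described in the list.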
[!WARNING]
docker compose down -v wipes every volume: Kafka data, ES indices, Redis, Prometheus, Grafana, Tempo. Never use it as part of an upgrade unless you intend a full reset.
Common failure modes (at a glance)
Most operational incidents match one of a handful of patterns. The full symptom → cause → fix reference is in the Troubleshooting Guide. The quick list:
| Symptom | Likely cause | Fix page |
|---|---|---|
| Dashboard tiles empty | No data in ES (time range or pipeline stall) | Troubleshooting Guide |
| inferences_run metric flat | Consumer lag on normalized-endpoint-events | Troubleshooting Guide |
| ES status: red | Disk full or low memory | Troubleshooting Guide |
| Inference container crash | Model file missing or wrong path | Model Deployment |
| High error prediction rate | Windows too small for graph construction | Troubleshooting Guide |
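To confirm the consumer-lag row in the table above, the standard `kafka-consumer-groups --describe` output can be reduced to a single number. The sketch below sums its LAG column; `total_lag` is a hypothetical helper, and `<group_id>` in the usage line stands for whatever kafka.group_id the service is configured with, which this page does not list.

```shell
#!/usr/bin/env bash
# total_lag: read `kafka-consumer-groups --describe` output on stdin and print
# the summed LAG column. The column is located by its header name; rows whose
# LAG is "-" (no committed offset yet) are skipped.
total_lag() {
  awk '
    /LAG/ && !col { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    col && $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }
  '
}

# Usage:
#   docker compose exec -T kafka \
#     kafka-consumer-groups --bootstrap-server localhost:9092 \
#     --describe --group <group_id> | total_lag
```

A total that keeps growing while inferences_run stays flat suggests the inference service is falling behind rather than idle.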
Next steps
- Important Considerations — production hardening checklist.
- Troubleshooting Guide — the full failure-mode reference.
- Installation Parameters — tune behavior without restart-panicking.