Admin Guide: Important Considerations

This page lists the things about Logster's default build that you must know before going into production. None of the items below break the development stack — they are deliberate simplifications that make the local Compose experience smooth. Leaving them in place on a production deployment is how teams get surprised.

Read this page end-to-end. Track each item as a real piece of ops work.


The production hardening checklist

Before Logster sees a single packet from a production endpoint, work through every item below. The pages linked off each item have the detailed procedures.

Security and access

  • [ ] Put every user-facing port behind an auth-enforcing reverse proxy. Dashboard, API, Grafana, Kibana, Prometheus, Tempo — all of them. See Authentication.
  • [ ] Bind published ports to 127.0.0.1 in deploy/docker-compose.yml so Logster itself is never directly reachable from outside the host.
  • [ ] Tighten api.cors_origins in deploy/service-config.yaml to the actual dashboard origin instead of "*".
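  • [ ] Change the Grafana admin password from the default, logster, by updating GF_SECURITY_ADMIN_PASSWORD in the Compose file. Grafana applies this variable only when it first initializes its database, so on an existing grafana_data volume change the password through Grafana itself as well.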
  • [ ] Enable Elasticsearch security (xpack.security.enabled: true) and configure TLS if ES is reachable beyond the Compose network.
  • [ ] Move Kafka off PLAINTEXT listeners if Kafka is reachable beyond the Compose network. Use SASL/SSL.
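
The loopback-binding item above can be spot-checked with a small helper. This is a minimal sketch, not part of Logster: it assumes the Compose file uses the short ports syntax ("IP:HOST:CONTAINER"), and the function names and the sample mappings in the usage note are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def is_loopback_only(port_spec: str) -> bool:
    """Return True if a short-syntax Compose port mapping binds to loopback only."""
    spec = port_spec.split("/", 1)[0]  # drop a trailing "/tcp" or "/udp"
    try:
        if spec.startswith("["):
            # Bracketed IPv6 host address, e.g. "[::1]:8080:80"
            host_ip = spec[1:spec.index("]")]
        else:
            parts = spec.split(":")
            if len(parts) < 3:
                # "HOST:CONTAINER" or "CONTAINER": no host IP, so Docker
                # publishes the port on all interfaces.
                return False
            host_ip = parts[0]
        return ipaddress.ip_address(host_ip).is_loopback
    except ValueError:
        # Unparseable mapping: treat it as exposed so it gets reviewed.
        return False


def exposed_mappings(ports: Iterable[str]) -> List[str]:
    """Return the mappings that are reachable from outside the host."""
    return [p for p in ports if not is_loopback_only(p)]
```

For example, exposed_mappings(["127.0.0.1:5001:5000", "8080:8080"]) flags only "8080:8080". Feed it the ports entries read from deploy/docker-compose.yml; an empty result means every published port is bound to loopback.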

Durability and persistence

  • [ ] Swap the InMemoryAlertStore for the Postgres-backed store. The default build loses every alert on API restart. See "Alert store is in-memory" below.
  • [ ] Configure Elasticsearch index lifecycle management (ILM) for logster-events and logster-inferences. Without ILM, both indices grow forever.
  • [ ] Schedule volume backups for es_data, grafana_data, and (if swapped in) the Postgres volume.
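
As a starting point for the ILM item above, the sketch below builds the JSON body that Elasticsearch accepts on PUT _ilm/policy/<policy-name>. The retention values are placeholders, not Logster recommendations, and the function name is hypothetical. The rollover action only takes effect if the index is written through a rollover alias or data stream, which is an assumption about how your Logstash output is configured.

```python
import json


def build_ilm_policy(rollover_age: str = "7d",
                     rollover_size: str = "50gb",
                     delete_after: str = "30d") -> dict:
    """Build an ILM policy body: roll over in the hot phase, then delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new backing index when either limit is hit.
                        "rollover": {
                            "max_age": rollover_age,
                            "max_primary_shard_size": rollover_size,
                        }
                    }
                },
                "delete": {
                    # Age is measured from rollover, not from index creation.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be reviewed before it is sent to the cluster.
    print(json.dumps(build_ilm_policy(), indent=2))
```

Attach the resulting policy to logster-events and logster-inferences through an index template so that newly created indices pick it up automatically.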

Scale and reliability

  • [ ] Run Kafka as a real multi-broker cluster, not single-broker KRaft. The dev stack sets replication.factor=1 and min.insync.replicas=1 everywhere, so losing the single broker can mean losing data.
  • [ ] Provision GPU hosts if the inference load from your endpoint fleet requires it. Set model.device: cuda and use the NVIDIA container runtime.
  • [ ] Wire alerts on the Prometheus metrics that matter to you — inferences_run flatlining, parse_errors rising, ES disk over 80%, active_endpoints dropping.

Observability

  • [ ] Ship service logs to a real aggregator. The default build only has docker compose logs, which is fine for debugging but not for audit or long-term analysis.
  • [ ] Retain traces for long enough to investigate incidents. Tempo's default retention is short — configure retention_period in deploy/tempo.yaml.

Known limitations of the default build

These are not bugs — they are known, documented shortcuts that the dev stack takes to stay simple.

Alert store is in-memory

services/api/src/logster_api/main.py constructs an InMemoryAlertStore and passes it to create_app():

store = InMemoryAlertStore()
app = create_app(alert_store=store, cors_origins=cors_origins)

Implication: every alert is lost when the API container restarts. Analyst verdicts recorded via POST /feedback vanish with the container.

Fix: The Postgres-backed store exists in the alerts package. See LOGSTER_ARCHITECTURE.md §3.3 for the class structure. The swap is a one-line change in main.py plus a Postgres container in the Compose file.

REST API has no authentication

The API is a FastAPI app with CORS middleware and nothing else. There is no token validation, no identity check, no audit log of who transitioned which alert.

Fix: Use a reverse proxy as described in Authentication.

Dashboard has no authentication

DISABLE_AUTH=true is set in the default Compose file. The dashboard has a pluggable auth layer but it is not wired to any identity provider.

Fix: Reverse proxy with OIDC/SAML. Same pattern as the API.

Single-broker Kafka with replication factor 1

The Compose file brings up one Kafka broker in KRaft mode. Topic replication factor is 1 and minimum in-sync replicas is 1. A single disk failure or container crash means message loss.

Fix: Run Kafka as a proper multi-broker cluster before going into production. The normalizer / inference / alerts services already use confluent-kafka and will happily connect to any standards-compliant cluster.

No TTP analyzer deployed

ttp.service_url is null in the default deploy/service-config.yaml. Alerts get created without ttp_techniques or ttp_explanation populated.

Fix: Deploy the TTP analyzer service separately and point ttp.service_url at it. This does not affect threat detection — only alert enrichment.

Only Windows data flows in the demo dataset

The sample inference history shows 100% Windows events from one endpoint. The Linux pipeline is fully built and tested, but no Linux endpoints are connected in the default demo. You must deploy Linux collectors yourself — see Key Features: Data Ingestion.

Polling-based dashboard

The React dashboard polls the Express backend on an interval. It does not use WebSockets or Server-Sent Events. For real-time "push" updates, you need to build that layer yourself.


Things that will trip you up

The dashboard port is 5001, not 5000

The dashboard container listens on 5000 internally but is published to host port 5001. From the host, always use 5001; port 5000 exists only inside the Compose network.

Kafka auto-creates topics

KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true" is set in the Compose file. If you misspell a topic name in service-config.yaml, Kafka will silently create it with the default partition count (KAFKA_NUM_PARTITIONS: 6). Your pipeline will appear to work but message ordering and throughput will be subtly wrong.

[!TIP] Always verify topic partition counts after any config change:

docker compose exec kafka \
    kafka-topics --bootstrap-server localhost:9092 \
    --describe --topic normalized-endpoint-events
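
The same check can be scripted across all topics. The sketch below uses the confluent-kafka AdminClient that the Logster services already depend on. The expected partition counts are something you must supply yourself; the value 6 in the usage note is only an assumption taken from the default above, and both function names are hypothetical.

```python
from typing import Dict, List


def partition_counts(bootstrap_servers: str) -> Dict[str, int]:
    """Fetch {topic name: partition count} from a live cluster."""
    # Imported lazily so audit_topics() can be used without confluent-kafka.
    from confluent_kafka.admin import AdminClient

    client = AdminClient({"bootstrap.servers": bootstrap_servers})
    metadata = client.list_topics(timeout=10)
    return {name: len(topic.partitions) for name, topic in metadata.topics.items()}


def audit_topics(actual: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Compare the cluster's topics with the ones you intended to have."""
    problems = []
    for name, want in expected.items():
        if name not in actual:
            problems.append(f"missing topic: {name}")
        elif actual[name] != want:
            problems.append(f"{name}: {actual[name]} partitions, expected {want}")
    for name in sorted(set(actual) - set(expected)):
        # Skip Kafka-internal topics such as __consumer_offsets.
        if not name.startswith("__"):
            problems.append(f"unexpected topic (possible typo): {name}")
    return problems
```

A typical call is audit_topics(partition_counts("localhost:9092"), {"normalized-endpoint-events": 6}). Any "unexpected topic" entry is a strong hint that a misspelled name in service-config.yaml was auto-created.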

Windows hostnames are lowercased

The normalizer lowercases endpoint_id. DESKTOP-1NNIMRR becomes desktop-1nnimrr. Queries and filters must use the lowercased form.
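
If you build queries programmatically, apply the same lowercasing before filtering. This is a minimal sketch: the helper names are hypothetical, and it assumes endpoint_id is indexed as an exact-match (keyword) field in Elasticsearch, which this page does not confirm.

```python
def normalize_endpoint_id(hostname: str) -> str:
    """Mirror the normalizer's behavior: endpoint IDs are stored lowercased."""
    return hostname.lower()


def endpoint_filter(hostname: str) -> dict:
    """Build an Elasticsearch term filter for a single endpoint."""
    # Assumption: endpoint_id is a keyword field, so the match is case-sensitive.
    return {"term": {"endpoint_id": normalize_endpoint_id(hostname)}}
```

With this helper, a hostname copied from the Windows machine itself (DESKTOP-1NNIMRR) still matches the stored value.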

Inference errors are not pipeline failures

If the GNN cannot build a valid graph (typically because the window contains fewer than two nodes), the inference service emits an InferenceResult with prediction="error". These are expected and do not indicate a bug — they indicate under-sized windows. If you see a high error rate, widen inference.window or investigate whether the normalizer is dropping events.
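
To decide whether the error rate is "high", it helps to measure it. The sketch below assumes each InferenceResult has been loaded as a dictionary with a prediction field (for example from the logster-inferences index); the function name and the non-error labels in the usage note are hypothetical.

```python
from typing import Iterable, Mapping


def error_rate(results: Iterable[Mapping[str, str]]) -> float:
    """Return the fraction of inference results whose prediction is "error"."""
    total = 0
    errors = 0
    for result in results:
        total += 1
        if result.get("prediction") == "error":
            errors += 1
    # An empty batch has no errors by definition; avoid dividing by zero.
    return errors / total if total else 0.0
```

For a batch such as [{"prediction": "error"}, {"prediction": "benign"}], the function returns 0.5. What threshold counts as too high depends on your window size and event volume.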

docker compose down -v is destructive

The -v flag does more than stop containers: it also deletes every volume declared in the Compose file. Kafka logs, ES indices, Redis state, Prometheus/Grafana data: all gone. Only use this flag when you want a complete factory reset.


Pre-flight checklist

A good final sanity check before calling a Logster deployment "ready for production":

  1. nmap the host from outside — only the reverse proxy's ports should be reachable. 8080, 5001, 3000, 5601, 9090 must all be closed.
  2. Attempt to PATCH /alerts/<id> on the API from outside the reverse proxy — it must return 401 or be completely unreachable.
  3. Restart the API container and check that alerts survive. (If they don't, you haven't swapped to Postgres yet.)
  4. Delete logster-events and verify that the index is recreated on the next write with its ILM policy attached (via the index template), and that Logstash keeps writing without intervention.
  5. Kill one Kafka broker (in a multi-broker deployment) and verify that the pipeline keeps running.
  6. Confirm in Grafana that the inferences_run metric is climbing in proportion to the number of endpoints you have.
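
Check 1 can be complemented with a scripted TCP probe that is easy to rerun after every deploy. This is a minimal sketch rather than a replacement for nmap: it only tests the five ports listed above, must be run from a machine outside the host, and the function name is hypothetical.

```python
import socket
from typing import Iterable, List

# Internal service ports from check 1 that must not be reachable externally.
INTERNAL_PORTS = (8080, 5001, 3000, 5601, 9090)


def reachable_ports(host: str,
                    ports: Iterable[int] = INTERNAL_PORTS,
                    timeout: float = 2.0) -> List[int]:
    """Return the ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            # Refused, filtered, or timed out: the port is not reachable.
            pass
    return open_ports
```

An empty list from reachable_ports("<your-host>") is the passing result. Any port number in the list is still published beyond the reverse proxy.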

If any of these fail, go back to the appropriate Admin Guide page and address the root cause. Do not ship until every check passes.


Next steps