Admin Guide: Important Considerations
This page lists the things about Logster's default build that you must know before going into production. None of the items below break the development stack — they are deliberate simplifications that make the local Compose experience smooth. Leaving them in place on a production deployment is how teams get surprised.
Read this page end-to-end. Track each item as a real piece of ops work.
The production hardening checklist
Before Logster sees a single packet from a production endpoint, work through every item below. The pages linked off each item have the detailed procedures.
Security and access
- [ ] Put every user-facing port behind an auth-enforcing reverse proxy. Dashboard, API, Grafana, Kibana, Prometheus, Tempo — all of them. See Authentication.
- [ ] Bind published ports to 127.0.0.1 in deploy/docker-compose.yml so Logster itself is never directly reachable from outside the host.
- [ ] Tighten api.cors_origins in deploy/service-config.yaml to the actual dashboard origin instead of "*".
- [ ] Change the Grafana admin password from logster. Update GF_SECURITY_ADMIN_PASSWORD in the Compose file.
- [ ] Enable Elasticsearch security (xpack.security.enabled: true) and configure TLS if ES is reachable beyond the Compose network.
- [ ] Move Kafka off PLAINTEXT listeners if Kafka is reachable beyond the Compose network. Use SASL/SSL.
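The loopback-binding item above can be audited mechanically. The following is a minimal sketch, assuming the Compose file uses the short `[HOST_IP:]HOST_PORT:CONTAINER_PORT` port syntax; the function name and the sample mappings are illustrative and not taken from the Logster repository.

```python
# Prefixes that pin a published port to the loopback interface.
LOOPBACK_PREFIXES = ("127.0.0.1:", "[::1]:")


def exposed_port_mappings(ports):
    """Return the short-syntax port mappings that are NOT bound to loopback.

    A mapping without a host IP (for example "8080:8080") is published on
    all interfaces, so it is reported as exposed.
    """
    return [p for p in ports if not str(p).startswith(LOOPBACK_PREFIXES)]


# Hypothetical mappings, as they might appear under a service's `ports:` key.
sample = ["8080:8080", "127.0.0.1:5001:5000", "3000:3000"]
print(exposed_port_mappings(sample))  # ['8080:8080', '3000:3000']
```

Feed it the `ports:` entries from each service; an empty result means every published port is loopback-only.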
Durability and persistence
- [ ] Swap the InMemoryAlertStore for the Postgres-backed store. The default build loses every alert on API restart. See Alert store section below.
- [ ] Configure Elasticsearch index lifecycle management (ILM) for logster-events and logster-inferences. Without ILM, both indices grow forever.
- [ ] Schedule volume backups for es_data, grafana_data, and (if swapped in) the Postgres volume.
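For the ILM item, the sketch below builds a policy body of the shape Elasticsearch accepts at `PUT _ilm/policy/<policy-name>`. The ages and sizes are placeholder assumptions, not Logster recommendations, and a rollover action only takes effect when the index is managed through a rollover alias or data stream, which this sketch does not set up.

```python
import json


def build_ilm_policy(rollover_max_age="1d",
                     rollover_max_size="50gb",
                     delete_after="30d"):
    """Build an ILM policy body: roll over in the hot phase, then delete.

    All three defaults are illustrative values; choose them from your own
    retention and disk budget.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_primary_shard_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    # Age is counted from rollover, not from index creation.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Print the JSON body so it can be reviewed before it is sent to the cluster.
print(json.dumps(build_ilm_policy(), indent=2))
```

The same body can be attached to both indices, or built twice with different `delete_after` values if events and inferences need different retention.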
Scale and reliability
- [ ] Run Kafka as a real multi-broker cluster, not single-broker KRaft. The dev stack sets replication.factor=1 and ISR=1 everywhere, so a single broker outage is data loss.
- [ ] Provision GPU hosts if your fleet requires it. Set model.device: cuda and use the NVIDIA container runtime.
- [ ] Wire alerts on the Prometheus metrics that matter to you: inferences_run flatlining, parse_errors rising, ES disk over 80%, active_endpoints dropping.
Observability
- [ ] Ship service logs to a real aggregator. The default build only has docker compose logs, which is fine for debugging but not for audit or long-term analysis.
- [ ] Retain traces for long enough to investigate incidents. Tempo's default retention is short; configure retention_period in deploy/tempo.yaml.
Known limitations of the default build
These are not bugs — they are known, documented shortcuts that the dev stack takes to stay simple.
Alert store is in-memory
create_app() in services/api/src/logster_api/main.py wires the API to
an InMemoryAlertStore.
Implication: every alert is lost when the API container restarts.
Analyst verdicts recorded via POST /feedback vanish with the
container.
Fix: The Postgres-backed store exists in the alerts package. See
LOGSTER_ARCHITECTURE.md §3.3 for the
class structure. The swap is a one-line change in main.py plus a
Postgres container in the Compose file.
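To make the implication concrete, here is a toy model of the failure mode. `ToyInMemoryAlertStore` and `toy_create_app` are hypothetical stand-ins written for this illustration; they are not the classes or signatures in the Logster codebase.

```python
class ToyInMemoryAlertStore:
    """Illustrative stand-in: all state lives in a dict owned by the process."""

    def __init__(self):
        self._alerts = {}

    def save(self, alert_id, alert):
        self._alerts[alert_id] = alert

    def get(self, alert_id):
        return self._alerts.get(alert_id)


def toy_create_app():
    """Mimic an app factory that builds a fresh store on every start."""
    return {"alert_store": ToyInMemoryAlertStore()}


# First "boot": an alert is recorded along with an analyst verdict.
app = toy_create_app()
app["alert_store"].save("a-1", {"status": "open", "verdict": "true_positive"})
print(app["alert_store"].get("a-1") is not None)  # True

# Simulated container restart: the factory runs again with an empty store.
app = toy_create_app()
print(app["alert_store"].get("a-1"))  # None
```

A store backed by an external database avoids this because the factory reconnects to existing state instead of creating it.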
REST API has no authentication
The API is a FastAPI app with CORS middleware and nothing else. There is no token validation, no identity check, no audit log of who transitioned which alert.
Fix: Use a reverse proxy as described in Authentication.
Dashboard has no authentication
DISABLE_AUTH=true is set in the default Compose file. The dashboard
has a pluggable auth layer but it is not wired to any identity
provider.
Fix: Reverse proxy with OIDC/SAML. Same pattern as the API.
Single-broker Kafka with replication factor 1
The Compose file brings up one Kafka broker in KRaft mode. Topic
replication factor is 1 and minimum in-sync replicas is 1. A
single disk failure or container crash means message loss.
Fix: Run Kafka as a proper multi-broker cluster before going into
production. The normalizer / inference / alerts services already use
confluent-kafka and will happily connect to any standards-compliant
cluster.
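The arithmetic behind the replication settings is worth spelling out. The sketch below assumes producers use `acks=all`, in which case a partition keeps accepting writes only while at least `min.insync.replicas` replicas are in sync. The 3/2 combination is a common choice used here for illustration, not a value taken from the Logster documentation.

```python
def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Brokers a partition can lose while still accepting acks=all writes."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication.factor")
    return replication_factor - min_insync_replicas


# Dev stack: replication factor 1, min ISR 1.
print(tolerated_broker_failures(1, 1))  # 0

# Illustrative production setting: replication factor 3, min ISR 2.
print(tolerated_broker_failures(3, 2))  # 1
```

With the dev defaults the result is zero, which is why one broker outage halts the pipeline and one disk failure loses messages.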
No TTP analyzer deployed
ttp.service_url is null in the default
deploy/service-config.yaml. Alerts
get created without ttp_techniques or ttp_explanation populated.
Fix: Deploy the TTP analyzer service separately and point
ttp.service_url at it. This does not affect threat detection — only
alert enrichment.
Only Windows data flows in the demo dataset
The sample inference history shows 100% Windows events from one endpoint. The Linux pipeline is fully built and tested, but no Linux endpoints are connected in the default demo. You must deploy Linux collectors yourself — see Key Features: Data Ingestion.
Polling-based dashboard
The React dashboard polls the Express backend on an interval. It does not use WebSockets or Server-Sent Events. For real-time "push" updates, you need to build that layer yourself.
Things that will trip you up
The dashboard port is 5001, not 5000
The dashboard container listens on 5000 internally but is published
to host port 5001. Update bookmarks and forget the 5000 port ever
existed.
Kafka auto-creates topics
KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true" is set in the Compose file.
If you misspell a topic name in service-config.yaml, Kafka will
silently create it with the default partition count (KAFKA_NUM_PARTITIONS:
6). Your pipeline will appear to work but message ordering and
throughput will be subtly wrong.
[!TIP] Always verify topic partition counts after any config change.
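One way to run that verification is to parse the summary lines printed by Kafka's `kafka-topics.sh --describe` and flag any topic you did not expect. This is a sketch: the topic names are hypothetical, and the line format (`Topic: ... PartitionCount: ...`) is assumed from recent Kafka releases, so check it against your version.

```python
import re

# Matches only the per-topic summary line, not the per-partition detail lines.
TOPIC_LINE = re.compile(r"Topic:\s*(\S+).*?PartitionCount:\s*(\d+)")


def partition_counts(describe_output):
    """Map each topic name to its partition count."""
    counts = {}
    for line in describe_output.splitlines():
        match = TOPIC_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def unexpected_topics(describe_output, expected):
    """Return non-internal topics that are missing from the expected set."""
    return {
        topic: count
        for topic, count in partition_counts(describe_output).items()
        if topic not in expected and not topic.startswith("__")
    }


# Hypothetical output: "raw-evnets" is a misspelling that Kafka auto-created.
SAMPLE_DESCRIBE = (
    "Topic: raw-events\tTopicId: abc\tPartitionCount: 12\tReplicationFactor: 3\tConfigs: \n"
    "\tTopic: raw-events\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "Topic: raw-evnets\tTopicId: def\tPartitionCount: 6\tReplicationFactor: 1\tConfigs: \n"
    "Topic: __consumer_offsets\tPartitionCount: 50\tReplicationFactor: 1\tConfigs: \n"
)

print(unexpected_topics(SAMPLE_DESCRIBE, {"raw-events"}))  # {'raw-evnets': 6}
```

An unexpected topic carrying the broker's default partition count is a strong hint that a name in service-config.yaml was misspelled.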
Windows hostnames are lowercased
The normalizer lowercases endpoint_id. DESKTOP-1NNIMRR becomes
desktop-1nnimrr. Queries and filters must use the lowercased form.
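A small helper keeps the lowercasing in one place when building filters. The sketch assumes the normalizer's transformation matches Python's `str.lower()` and that `endpoint_id` is indexed as an exact-match field in Elasticsearch; neither detail is confirmed by this page.

```python
def endpoint_term_query(hostname):
    """Build an exact-match Elasticsearch query for a host as operators type it."""
    return {"query": {"term": {"endpoint_id": hostname.lower()}}}


query = endpoint_term_query("DESKTOP-1NNIMRR")
print(query["query"]["term"]["endpoint_id"])  # desktop-1nnimrr
```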
Inference errors are not pipeline failures
If the GNN cannot build a valid graph (typically because the window
contains fewer than two nodes), the inference service emits an
InferenceResult with prediction="error". These are expected and
do not indicate a bug — they indicate under-sized windows. If you
see a high error rate, widen inference.window or investigate
whether the normalizer is dropping events.
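To judge whether the error rate is high, compute it over a recent batch of results. The sketch below treats each result as a dict with a `prediction` key; the non-error labels in the sample are made up, and what counts as "high" is left to you because this page does not give a threshold.

```python
def error_rate(results):
    """Fraction of inference results whose prediction is "error"."""
    results = list(results)
    if not results:
        return 0.0
    errors = sum(1 for r in results if r.get("prediction") == "error")
    return errors / len(results)


# Hypothetical batch: one under-sized window out of four.
batch = [
    {"prediction": "benign"},
    {"prediction": "error"},
    {"prediction": "benign"},
    {"prediction": "malicious"},
]
print(error_rate(batch))  # 0.25
```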
docker compose down -v is destructive
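The -v flag does more than stop containers: it also deletes the named
volumes declared in the Compose file and any anonymous volumes attached
to the containers. Kafka logs, ES indices, Redis state, and
Prometheus/Grafana data are all gone. Only use this flag when you want
a complete factory reset.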
Pre-flight checklist
A good final sanity check before calling a Logster deployment "ready for production":
- nmap the host from outside: only the reverse proxy's ports should be reachable. 8080, 5001, 3000, 5601, 9090 must all be closed.
- Attempt to PATCH /alerts/<id> on the API from outside the reverse proxy. It must return 401 or be completely unreachable.
- Restart the API container and check that alerts survive. (If they don't, you haven't swapped to Postgres yet.)
- Delete logster-events and verify that ILM recreates the index and Logstash keeps writing without intervention.
- Kill one Kafka broker (in a multi-broker deployment) and verify that the pipeline keeps running.
- Confirm that Grafana's inferences_run metric is climbing in proportion to the number of endpoints you have.
If any of these fail, go back to the appropriate Admin Guide page and address the root cause. Do not ship until every check passes.
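The port check from the first item can be scripted as a lightweight complement to nmap. This sketch attempts a plain TCP connection to each port named in the checklist; run it from a machine outside the host. The function name is illustrative, and a TCP connect test is not a substitute for a full scan.

```python
import socket

# Ports the checklist says must be closed to the outside world.
INTERNAL_PORTS = (8080, 5001, 3000, 5601, 9090)


def reachable_ports(host, ports=INTERNAL_PORTS, timeout=2.0):
    """Return the subset of ports that accept a TCP connection on host."""
    open_ports = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            # Refused, filtered, timed out, or unresolvable: not reachable.
            pass
    return open_ports


# Usage, with a hypothetical hostname; an empty list is the passing result:
#     reachable_ports("logster.example.internal")
```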
Next steps
- Troubleshooting Guide — the full failure-mode reference.
- Security Guide — detailed security validations and configuration.
- Daily Operations — the day-to-day runbook.