Troubleshooting Guide
This page is the failure-mode reference for Logster. It is organized by symptom rather than by component, because that's how operators actually hit problems. Find your symptom in the index, jump to its section, and work through the cause/fix recipe.
For day-to-day operations and health checks, see Admin Guide: Daily Operations.
Symptom index
| Symptom | Jump to |
|---|---|
| Services fail with "connection refused to kafka:9092" | Services fail with "connection refused to kafka:9092" |
| Elasticsearch reports status: red | Elasticsearch disk full |
| Inference service crashes at startup | Model load failure |
| Dashboard tiles are empty | Dashboard tiles are empty |
| High error prediction rate | High error prediction rate |
| inferences_run metric is flat | Kafka consumer lag |
| Dashboard can't reach Elasticsearch | Dashboard can't reach Elasticsearch |
| Topics have wrong partition counts | Auto-created topics with wrong partition counts |
| Port already in use | Port already in use |
| API returns 503 | API returns 503 |
| Alerts disappear on restart | Alerts disappear on restart |
| Same alert appears every 30 seconds | Same alert appears every 30 seconds |
Services fail with "connection refused to kafka:9092"
Symptom. During docker compose --profile services up -d, one or
more services (usually normalizer or inference) crash in a restart
loop with KafkaError: connection refused in their logs.
Cause. Kafka was not fully ready when the service tried to connect. The Kafka container has a healthcheck, but service startup proceeds once the healthcheck passes for the first time — which is sooner than "able to accept producer/consumer connections."
Fix.
cd deploy/
# Bring up infrastructure first, then wait ~60 seconds for Kafka to finish initializing
docker compose up -d
sleep 60
docker compose --profile services up -d
# Or, if you already hit the error, just restart the services
docker compose --profile services restart
After restart, the services should connect cleanly because Kafka is now fully initialized.
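A fixed 60-second sleep is a guess at how long Kafka needs. As a minimal sketch, a small retry helper can poll the broker instead and start the services only once a client call succeeds. The helper name `retry_until` and the choice of `kafka-topics --list` as the readiness probe are assumptions for illustration, not part of the Logster tooling.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
# The body runs in a subshell so its variables do not leak into the caller.
retry_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)

# Hypothetical usage, run from deploy/ (probe command is an assumption):
#   docker compose up -d
#   retry_until 30 5 docker compose exec -T kafka \
#     kafka-topics --bootstrap-server localhost:9092 --list \
#     && docker compose --profile services up -d
```

A successful topic listing only shows that the broker answers admin requests, so treat it as a closer proxy for readiness than the healthcheck rather than a guarantee.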
Elasticsearch disk full
Symptom. curl http://localhost:9200/_cluster/health returns
status: red or yellow, with "high disk watermark" or "flood
stage" warnings in the ES container logs.
Cause. Elasticsearch ran out of disk space. In the default
Compose stack, there is no index lifecycle management, so
logster-events and logster-inferences grow forever.
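To see how close a node is to the limits, the disk usage percentage can be compared against Elasticsearch's default watermarks (85% low, 90% high, 95% flood stage). The sketch below assumes those defaults are unchanged in your cluster settings; `disk_watermark_state` is a hypothetical helper name.

```shell
# disk_watermark_state PERCENT_USED
# Classifies a disk usage percentage against the default Elasticsearch
# watermarks. Assumes the cluster has not overridden the defaults.
disk_watermark_state() {
  case "$1" in
    ''|*[!0-9]*) echo unknown; return 1 ;;  # e.g. the UNASSIGNED row has no value
  esac
  if [ "$1" -ge 95 ]; then
    echo flood
  elif [ "$1" -ge 90 ]; then
    echo high
  elif [ "$1" -ge 85 ]; then
    echo low
  else
    echo ok
  fi
}

# Possible usage against the local node:
#   curl -s 'http://localhost:9200/_cat/allocation?h=disk.percent,node' |
#     while read -r pct node; do
#       echo "$node: $(disk_watermark_state "$pct")"
#     done
```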
Immediate fix.
# Delete both indices. This discards all stored events and inferences; the indices are recreated automatically.
curl -X DELETE 'http://localhost:9200/logster-events?ignore_unavailable=true'
curl -X DELETE 'http://localhost:9200/logster-inferences?ignore_unavailable=true'
Logstash will recreate the indices and resume writing on its next batch.
Structural fix. Configure an ILM policy that rolls over indices daily and deletes them after N days. This is required for any long-lived deployment; see Admin Guide: Important Considerations.
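As a rough illustration of such a policy, the helper below prints an ILM policy body with a daily rollover and a delete phase after N days. The function name `ilm_policy_json` and the policy name `logster-retention` are hypothetical. The policy alone is not sufficient: rollover only applies to indices written through a rollover alias or data stream with a matching index template, which the default stack is not assumed to have.

```shell
# ilm_policy_json [DAYS]
# Prints an ILM policy body: roll over daily, delete DAYS days after rollover.
# DAYS defaults to 7 (an arbitrary example value).
ilm_policy_json() {
  printf '{"policy":{"phases":{"hot":{"actions":{"rollover":{"max_age":"1d"}}},"delete":{"min_age":"%sd","actions":{"delete":{}}}}}}\n' "${1:-7}"
}

# Possible usage (policy name is an assumption):
#   ilm_policy_json 14 | curl -X PUT 'http://localhost:9200/_ilm/policy/logster-retention' \
#     -H 'Content-Type: application/json' -d @-
```

Note that in the delete phase, `min_age` is measured from the rollover time, not from index creation.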
Model load failure
Symptom. The inference container exits at startup with a
traceback referencing torch.load or FileNotFoundError.
Cause. The path in model.path or model.linux_model_path
(in deploy/service-config.yaml)
points at a file that doesn't exist inside the container. Remember
that the inference container mounts the host's
models/models/ directory at /app/models.
Fix.
# From the repo root, check what's actually in models/
ls -la models/models/
# Verify the expected model files exist
ls -la models/models/balanced_run_20260114_143653/best_model.pt
ls -la models/models/balanced_run_20260222_142924/best_model.pt
Every path referenced in service-config.yaml must resolve to a
real best_model.pt file. If you have retrained models or moved
them, update the config accordingly — see
Model Deployment.
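Because the config is read inside the container, a path there has to be translated back to the host before you can check it with ls. The sketch below assumes the config stores container-side paths under /app/models (the mount described above); `to_host_path` and `check_model_path` are hypothetical helper names, and the script should be run from the repo root.

```shell
# to_host_path CONTAINER_PATH
# Maps a path under /app/models to the host's models/models/ directory.
# Paths outside /app/models are printed unchanged.
to_host_path() {
  case "$1" in
    /app/models/*) printf 'models/models/%s\n' "${1#/app/models/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# check_model_path CONTAINER_PATH
# Reports whether the corresponding host file exists; returns 1 if missing.
check_model_path() {
  host_path=$(to_host_path "$1")
  if [ -f "$host_path" ]; then
    echo "ok: $host_path"
  else
    echo "MISSING: $host_path"
    return 1
  fi
}

# Possible usage with a path copied from service-config.yaml:
#   check_model_path /app/models/balanced_run_20260114_143653/best_model.pt
```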
Dashboard tiles are empty
Symptom. The dashboard at http://localhost:5001 loads but
every tile shows 0, —, or "no data."
Cause. Most of the time this means the logster-inferences
index has no documents in the selected time range. Less frequently,
it means the Express backend cannot reach Elasticsearch.
Diagnosis.
# Are there any inferences at all?
curl 'http://localhost:9200/logster-inferences/_count'
# In the last hour?
curl 'http://localhost:9200/logster-inferences/_count' \
-H 'Content-Type: application/json' \
-d '{"query":{"range":{"@timestamp":{"gte":"now-1h"}}}}'
# Can the dashboard backend reach ES?
docker compose logs dashboard | tail -20
Fix paths.
- Zero count, no time range restriction. No events are flowing. Verify upstream:
  - curl http://localhost:9200/logster-events/_count: are raw events reaching ES?
  - Check that the normalizer is consuming from Kafka.
  - Check that the inference service is publishing to logster-inference-results.
- Nonzero total count, zero in time range. Widen the dashboard time range. The default is now-30m, which may be too narrow for a freshly started stack.
- Dashboard logs show connection errors. Restart the dashboard (docker compose restart dashboard). See Dashboard can't reach Elasticsearch.
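The decision between these fix paths depends only on two numbers: the total inference count and the count in the recent time range. A minimal sketch of that decision follows; `extract_count` and `diagnose_tiles` are hypothetical helpers, and the sed expression assumes the usual `{"count":N,...}` shape of an Elasticsearch _count response.

```shell
# extract_count
# Reads an Elasticsearch _count response on stdin and prints the count.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# diagnose_tiles TOTAL RECENT
# Maps the two counts onto the fix paths listed above.
diagnose_tiles() {
  if [ "$1" -eq 0 ]; then
    echo "no inferences at all: verify the upstream pipeline"
  elif [ "$2" -eq 0 ]; then
    echo "no inferences in range: widen the dashboard time range"
  else
    echo "inferences present: check dashboard logs for connection errors"
  fi
}

# Possible usage:
#   total=$(curl -s 'http://localhost:9200/logster-inferences/_count' | extract_count)
#   recent=$(curl -s 'http://localhost:9200/logster-inferences/_count' \
#     -H 'Content-Type: application/json' \
#     -d '{"query":{"range":{"@timestamp":{"gte":"now-1h"}}}}' | extract_count)
#   diagnose_tiles "$total" "$recent"
```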
High error prediction rate
Symptom. The dashboard's Distribution panel shows a large
error slice alongside attack and benign, typically more than 10%
of all predictions.
Cause. The GNN is being asked to classify windows that are too small to build a meaningful graph — typically fewer than 2 nodes or 1 edge after parsing.
Fix paths.
- Widen the window. Larger windows contain more events and produce larger graphs. Edit the window setting in deploy/service-config.yaml, then run docker compose --profile services restart inference.
- Check if the normalizer is dropping events. Look at the parse_errors Prometheus metric. A rising count means a schema change in one of the raw ingestion sources; check docker compose logs normalizer | grep -i parse.
- Confirm events are flowing at all. curl http://localhost:9200/logster-events/_count should be climbing over time. If it isn't, the pipeline is stalled upstream of the inference service.
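To put a number on the symptom, the error share can be computed from two counts and compared with the rough 10% threshold mentioned above. This is only an arithmetic sketch: `error_share` and `exceeds_threshold` are hypothetical helpers, and how you obtain the two counts (the Distribution panel, or a query against logster-inferences) depends on field names not covered here.

```shell
# error_share ERROR_COUNT TOTAL_COUNT
# Prints the error share as a percentage with one decimal place.
error_share() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * e / t
  }'
}

# exceeds_threshold SHARE [LIMIT]
# Succeeds when SHARE is above LIMIT (default 10, in percent).
exceeds_threshold() {
  awk -v s="$1" -v limit="${2:-10}" 'BEGIN { exit !(s > limit) }'
}

# Possible usage with counts read off the dashboard:
#   share=$(error_share 150 1000)
#   exceeds_threshold "$share" && echo "error share ${share}% is above threshold"
```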
Kafka consumer lag
Symptom. The inferences_run Prometheus metric is flat even
though events are flowing into Kafka. The dashboard stops updating.
Cause. The inference service's consumer group is behind on
normalized-endpoint-events. Lag grows, and new events aren't
being processed in time.
Diagnosis.
docker compose exec kafka \
kafka-consumer-groups --bootstrap-server localhost:9092 \
--describe --group logster-inference
The LAG column tells you how far behind each partition is. If
lag is rising rather than falling, you have a throughput problem.
Fix paths.
- Slow GNN runs. Check inference_time_ms in Prometheus. Widen inference.interval (to give each run more breathing room), or vertically scale the inference container.
- Too many endpoints for one replica. Scale the inference service horizontally. Each replica owns a subset of the 12 partitions. See Admin Guide: Daily Operations.
- A poison event. Check docker compose logs inference | grep -i error. A single malformed event can stall a partition. Temporarily drop inference.use_idle_allowlist, or add a targeted exception in the parser.
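To tell rising lag from falling lag, it helps to reduce the describe output to a single number and sample it a few times. The sketch below sums the LAG column, locating it by header name because the column position differs between Kafka versions; `sum_lag` is a hypothetical helper name.

```shell
# sum_lag
# Reads `kafka-consumer-groups --describe` output on stdin and prints the
# total lag across partitions. Rows whose LAG is "-" are skipped.
sum_lag() {
  awk '
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}

# Possible usage: sample twice, a minute apart, and compare.
#   docker compose exec -T kafka \
#     kafka-consumer-groups --bootstrap-server localhost:9092 \
#     --describe --group logster-inference | sum_lag
```

If the second sample is larger than the first, the consumer group is falling further behind and one of the fix paths above applies.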
Dashboard can't reach Elasticsearch
Symptom. Dashboard tiles show "Failed to fetch …" errors.
Dashboard logs show connection errors against
http://elasticsearch:9200.
Cause. Usually one of:
- Elasticsearch container stopped or crashed.
- ES_HOST environment variable set to the wrong address.
- Elasticsearch took longer than usual to become healthy, and the dashboard gave up.
Fix paths.
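- Is ES running? Check with curl http://localhost:9200/_cluster/health, as in Elasticsearch disk full.
- Is ES_HOST correct? Check the dashboard's environment block in deploy/docker-compose.yml. Default should be http://elasticsearch:9200 for intra-Compose DNS.
- Restart the dashboard (docker compose restart dashboard).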
Auto-created topics with wrong partition counts
Symptom. A Kafka topic exists but has fewer partitions than
expected, for example 6 instead of 12. Throughput is bad.
Cause. Kafka auto-created the topic before kafka-init got to
it, using the default KAFKA_NUM_PARTITIONS: 6. This mostly
affects normalized-endpoint-events, which should have 12
partitions.
Verify.
docker compose exec kafka \
kafka-topics --bootstrap-server localhost:9092 \
--describe --topic normalized-endpoint-events
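The describe output is verbose, and only the PartitionCount field matters here. As a small sketch, the helper below extracts it so the value can be compared with the expected 12; `partition_count` is a hypothetical name, and the pattern tolerates both the `PartitionCount:12` and `PartitionCount: 12` spellings seen across Kafka versions.

```shell
# partition_count
# Reads `kafka-topics --describe` output on stdin and prints the
# PartitionCount value from the topic summary line.
partition_count() {
  sed -n 's/.*PartitionCount:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Possible usage:
#   count=$(docker compose exec -T kafka \
#     kafka-topics --bootstrap-server localhost:9092 \
#     --describe --topic normalized-endpoint-events | partition_count)
#   [ "$count" -eq 12 ] || echo "normalized-endpoint-events has $count partitions, expected 12"
```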
Fix.
docker compose exec kafka \
kafka-topics --bootstrap-server localhost:9092 \
--alter --topic normalized-endpoint-events --partitions 12
[!WARNING] Increasing partitions on a keyed topic disrupts per-endpoint ordering for the rehash duration. Prefer deleting and recreating the topic on a dev stack. On production, schedule a maintenance window.
Port already in use
Symptom. docker compose up -d fails with
port is already allocated or bind: address already in use.
Cause. Another process on the host is bound to one of the ports listed in deploy/docker-compose.yml.
Fix. Either stop the conflicting process, or change the
host-side port mapping. Only edit the left side of the
"HOST:CONTAINER" string — the container-side port is referenced
elsewhere and should not change.
For example, if you move the dashboard's host-side port from 5001 to
5002, update your bookmark from http://localhost:5001 to
http://localhost:5002.
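The "left side only" rule can be illustrated with a small string helper that rewrites the host port of a mapping and leaves the container port untouched. `remap_host_port` is a hypothetical helper, and the mapping values in the usage lines are illustrative rather than copied from deploy/docker-compose.yml.

```shell
# remap_host_port MAPPING NEW_HOST_PORT
# Rewrites only the host-side port of a Compose port mapping.
# Handles both "HOST:CONTAINER" and "IP:HOST:CONTAINER" forms.
remap_host_port() {
  mapping=$1
  new_port=$2
  container=${mapping##*:}   # text after the last colon
  host_side=${mapping%:*}    # text before the last colon
  case "$host_side" in
    *:*) printf '%s:%s:%s\n' "${host_side%:*}" "$new_port" "$container" ;;
    *)   printf '%s:%s\n' "$new_port" "$container" ;;
  esac
}

# Illustrative usage (mapping values are assumptions):
#   remap_host_port "5001:5001" 5002            # prints 5002:5001
#   remap_host_port "127.0.0.1:5001:5001" 5002  # prints 127.0.0.1:5002:5001
#
# To find the conflicting process before remapping:
#   lsof -nP -iTCP:5001 -sTCP:LISTEN
```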
API returns 503
Symptom. curl http://localhost:8080/health returns
503 Service Unavailable with {"detail":"Alert store not configured"}.
Cause. The FastAPI app was constructed without an alert store.
This almost always means the API container was launched with a
misconfigured main.py or a missing dependency.
Fix. Check the API container's logs for a startup traceback.
The most common root cause is an import failure in main.py; for
example, the Postgres alert store package was swapped in without
installing its dependencies. Revert to the InMemoryAlertStore
temporarily, then fix the Postgres setup properly.
Alerts disappear on restart
Symptom. After restarting the API container, GET /alerts
returns an empty list even though you had alerts before.
Cause. The default API uses InMemoryAlertStore, which is
exactly what it says — in-memory. Every container restart starts
from an empty alert set.
Fix. This is not a bug, it's the default. For durability, swap in the Postgres-backed store. See Admin Guide: Important Considerations.
[!NOTE] Even without durable alerts, the dashboard still works, because the dashboard reads from Elasticsearch (logster-events and logster-inferences) directly, not from the API's alert store. Only the REST API is affected.
Same alert appears every 30 seconds
Symptom. On the dashboard's Recent Attacks panel, the same host keeps showing the same attack inference every 30 seconds.
Cause. This is not a bug — the dashboard shows every
inference, not the deduplicated alert. The alerts service has
already deduplicated it at the alert level (logster-alerts), but
the dashboard reads from logster-inferences directly.
Fix. For the deduplicated view, query the REST API's GET /alerts endpoint.
Each open alert represents a deduplicated group of inferences. The
inference_ids field on the alert tells you how many underlying
inferences contributed.
Still stuck?
If you hit something that's not in this guide:
1. Check the logs. docker compose --profile services logs --tail=500 | grep -i error
2. Check Prometheus. Are any service metrics flat, missing, or spiking?
3. Check the two ES indices directly. They are the source of truth for every dashboard tile.
4. Check Kafka consumer group lag.
5. Still nothing? Escalate to Logster Support (Enterprise and Premium tiers) with the output of steps 1–4 attached.
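Gathering that output by hand is error-prone, so a small collection script can help. The sketch below bundles the log errors, the two index counts, and the consumer group lag into one directory, using only commands that appear elsewhere in this guide. `collect_diagnostics` and the output file names are hypothetical, and Prometheus is left out because its address is not specified here.

```shell
# collect_diagnostics [OUTPUT_DIR]
# Writes log errors, ES index counts, and consumer lag into OUTPUT_DIR
# and prints the directory name. Run from deploy/.
collect_diagnostics() (
  out=${1:-logster-diagnostics}
  mkdir -p "$out" || return 1

  # Recent errors across all services
  docker compose --profile services logs --tail=500 2>&1 |
    grep -i error > "$out/errors.log"

  # Document counts for the two indices behind the dashboard
  for index in logster-events logster-inferences; do
    curl -s "http://localhost:9200/$index/_count" > "$out/$index.count.json"
  done

  # Consumer group lag for the inference service
  docker compose exec -T kafka \
    kafka-consumer-groups --bootstrap-server localhost:9092 \
    --describe --group logster-inference > "$out/consumer-lag.txt" 2>&1

  echo "$out"
)

# Possible usage:
#   dir=$(collect_diagnostics) && tar -czf "$dir.tar.gz" "$dir"
```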
Where to go next
- Admin Guide: Daily Operations — proactive health checks.
- Admin Guide: Important Considerations — known limitations of the default build.
- Security Guide: Overview — threat model and security validations.