Troubleshooting Guide

This page is the failure-mode reference for Logster. It is organized by symptom rather than by component, because that's how operators actually hit problems. Find your symptom in the index, jump to its section, and work through the cause/fix recipe.

For day-to-day operations and health checks, see Admin Guide: Daily Operations.

Symptom index

Symptom	Jump to
Services fail with "connection refused to kafka:9092"	Kafka not ready
`elasticsearch` reports `status: red`	Elasticsearch disk full
Inference service crashes at startup	Model load failure
Dashboard tiles are empty	Dashboard tiles are empty
High `error` prediction rate	High error prediction rate
`inferences_run` metric is flat	Kafka consumer lag
Dashboard can't reach Elasticsearch	Dashboard can't reach Elasticsearch
Topics have wrong partition counts	Auto-created topics with wrong partition counts
Port already in use	Port already in use
API returns 503	API returns 503
Alerts disappear on restart	Alerts disappear on restart
Same alert appears every 30 seconds	Same alert appears every 30 seconds

Services fail with "connection refused to kafka:9092"

Symptom. During docker compose --profile services up -d, one or more services (usually normalizer or inference) crash in a restart loop with KafkaError: connection refused in their logs.

Cause. Kafka was not fully ready when the service tried to connect. The Kafka container has a healthcheck, but service startup proceeds once the healthcheck passes for the first time — which is sooner than "able to accept producer/consumer connections."

Fix.

cd deploy/
# Bring up infrastructure first, then wait ~60 seconds for Kafka to finish initializing
docker compose up -d
sleep 60
docker compose --profile services up -d

# Or, if you already hit the error, just restart the services
docker compose --profile services restart

After restart, the services should connect cleanly because Kafka is now fully initialized.
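A fixed 60-second sleep can be too short on a slow host and wasteful on a fast one. The following is a minimal sketch of a polling alternative, assuming the kafka container ships the kafka-topics CLI (as the other commands on this page do). wait_until is a hypothetical helper, not part of Logster.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every 2 seconds until it succeeds or the timeout elapses.
wait_until() {
    local timeout="$1"
    shift
    local deadline=$(( $(date +%s) + timeout ))
    until "$@" >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 2
    done
}

# Usage (not run here): treat a successful topic listing as "Kafka accepts
# client connections", then start the services.
#   docker compose up -d
#   wait_until 120 docker compose exec -T kafka \
#       kafka-topics --bootstrap-server localhost:9092 --list
#   docker compose --profile services up -d
```

A successful topic listing is a stronger readiness signal than the container healthcheck, because it requires the broker to answer a real client request.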

Elasticsearch disk full

Symptom. curl http://localhost:9200/_cluster/health returns status: red or yellow, with "high disk watermark" or "flood stage" warnings in the ES container logs.

Cause. Elasticsearch ran out of disk space. In the default Compose stack, there is no index lifecycle management, so logster-events and logster-inferences grow forever.

Immediate fix.

# Delete both indices. This discards ALL stored events and inferences;
# the indices themselves are recreated automatically.
curl -X DELETE 'http://localhost:9200/logster-events?ignore_unavailable=true'
curl -X DELETE 'http://localhost:9200/logster-inferences?ignore_unavailable=true'

Logstash will recreate the indices and resume writing on its next batch.

Structural fix. Configure an ILM policy that rolls over indices daily and deletes after N days. This is required for any long-lived deployment — see Admin Guide: Important Considerations.
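As a rough illustration of the structural fix, the sketch below generates an ILM policy body that rolls over daily and deletes indices N days after rollover. The policy name logster-retention and the helper ilm_policy_json are hypothetical. The policy only takes effect once the indices are attached to it (for example through an index template and a rollover alias), which this sketch does not cover.

```shell
# ilm_policy_json RETENTION_DAYS
# Prints an ILM policy body: roll over daily, delete RETENTION_DAYS after rollover.
ilm_policy_json() {
    cat <<EOF
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "1d" } }
      },
      "delete": {
        "min_age": "${1}d",
        "actions": { "delete": {} }
      }
    }
  }
}
EOF
}

# Usage (not run here). "logster-retention" is a hypothetical policy name:
#   ilm_policy_json 7 | curl -X PUT 'http://localhost:9200/_ilm/policy/logster-retention' \
#       -H 'Content-Type: application/json' --data-binary @-
```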

Model load failure

Symptom. The inference container exits at startup with a traceback referencing torch.load or FileNotFoundError.

Cause. The path in model.path or model.linux_model_path (in deploy/service-config.yaml) points at a file that doesn't exist inside the container. Remember that the inference container mounts the host's models/models/ directory at /app/models.

Fix.

# From the repo root, check what's actually in models/
ls -la models/models/

# Verify the expected model files exist
ls -la models/models/balanced_run_20260114_143653/best_model.pt
ls -la models/models/balanced_run_20260222_142924/best_model.pt

Every path referenced in service-config.yaml must resolve to a real best_model.pt file. If you have retrained models or moved them, update the config accordingly — see Model Deployment.
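The check above can be scripted. This is a minimal sketch that assumes the paths in service-config.yaml are container paths under /app/models and that you run it from the repo root; container_to_host and check_model_paths are hypothetical helpers.

```shell
# container_to_host PATH
# Maps a container path under /app/models to the host path under models/models/.
# Paths outside /app/models are printed unchanged.
container_to_host() {
    case "$1" in
        /app/models/*) printf 'models/models/%s\n' "${1#/app/models/}" ;;
        *) printf '%s\n' "$1" ;;
    esac
}

# check_model_paths PATH...
# Reports each path as ok or MISSING; returns non-zero if any file is missing.
check_model_paths() {
    local missing=0 p host
    for p in "$@"; do
        host="$(container_to_host "$p")"
        if [ -f "$host" ]; then
            echo "ok       $host"
        else
            echo "MISSING  $host"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (not run here), with the two model paths from this section:
#   check_model_paths \
#       /app/models/balanced_run_20260114_143653/best_model.pt \
#       /app/models/balanced_run_20260222_142924/best_model.pt
```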

Dashboard tiles are empty

Symptom. The dashboard at http://localhost:5001 loads but every tile shows 0, —, or "no data."

Cause. Most of the time this means the logster-inferences index has no documents in the selected time range. Less frequently, it means the Express backend cannot reach Elasticsearch.

Diagnosis.

# Are there any inferences at all?
curl 'http://localhost:9200/logster-inferences/_count'

# In the last hour?
curl 'http://localhost:9200/logster-inferences/_count' \
    -H 'Content-Type: application/json' \
    -d '{"query":{"range":{"@timestamp":{"gte":"now-1h"}}}}'

# Can the dashboard backend reach ES?
docker compose logs dashboard | tail -20

Fix paths.

Zero count, no time range restriction. No events are flowing. Verify upstream:
- curl http://localhost:9200/logster-events/_count — are raw events reaching ES?
- Check that the normalizer is consuming from Kafka.
- Check that the inference service is publishing to logster-inference-results.
Nonzero total count, zero in time range. Widen the dashboard time range. The default is now-30m, which may be too narrow for a freshly-started stack.
Dashboard logs show connection errors. Restart the dashboard (docker compose restart dashboard). See Dashboard can't reach Elasticsearch.
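The decision between these fix paths can be sketched as a small script. It assumes the _count response contains a top-level "count" field, as Elasticsearch returns it; extract_count and triage_empty_tiles are hypothetical helper names.

```shell
# extract_count: read an Elasticsearch _count response on stdin, print the count.
extract_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# triage_empty_tiles TOTAL RECENT
# Maps the two counts from the Diagnosis step onto a fix path.
triage_empty_tiles() {
    if [ "$1" -eq 0 ]; then
        echo "no inferences at all: verify the upstream pipeline"
    elif [ "$2" -eq 0 ]; then
        echo "no inferences in range: widen the dashboard time range"
    else
        echo "data present: check dashboard logs for Elasticsearch connection errors"
    fi
}

# Usage (not run here):
#   total="$(curl -s 'http://localhost:9200/logster-inferences/_count' | extract_count)"
#   recent="$(curl -s 'http://localhost:9200/logster-inferences/_count' \
#       -H 'Content-Type: application/json' \
#       -d '{"query":{"range":{"@timestamp":{"gte":"now-1h"}}}}' | extract_count)"
#   triage_empty_tiles "$total" "$recent"
```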

High error prediction rate

Symptom. The dashboard's Distribution panel shows a large error slice alongside attack and benign, typically more than 10% of all predictions.

Cause. The GNN is being asked to classify windows that are too small to build a meaningful graph — typically fewer than 2 nodes or 1 edge after parsing.

Fix paths.

Widen the window. Larger windows contain more events and produce larger graphs. Edit deploy/service-config.yaml:

inference:
  window: 5.0    # was 3.0 (minutes)

Then docker compose --profile services restart inference.

Check if the normalizer is dropping events. Look at the parse_errors Prometheus metric. A rising count usually points to a schema change in one of the raw ingestion sources; check docker compose logs normalizer | grep -i parse.
Confirm events are flowing at all. curl http://localhost:9200/logster-events/_count should be climbing over time. If it isn't, the pipeline is stalled upstream of the inference service.

Kafka consumer lag

Symptom. The inferences_run Prometheus metric is flat even though events are flowing into Kafka. The dashboard stops updating.

Cause. The inference service's consumer group is behind on normalized-endpoint-events. Lag grows, and new events aren't being processed in time.

Diagnosis.

docker compose exec kafka \
    kafka-consumer-groups --bootstrap-server localhost:9092 \
    --describe --group logster-inference

The LAG column tells you how far behind each partition is. If lag is rising rather than falling, you have a throughput problem.
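To tell rising lag from falling lag, it helps to reduce the describe output to a single number and sample it twice. A minimal sketch follows, assuming the output has a header row containing a LAG column; sum_lag is a hypothetical helper.

```shell
# sum_lag: read `kafka-consumer-groups --describe` output on stdin and print
# the total lag across all partitions.
sum_lag() {
    awk '
        # Until the header row is seen, look for the LAG column position.
        lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
        # Add numeric lag values; skip "-" placeholders and repeated headers.
        NF >= lag_col && $lag_col ~ /^[0-9]+$/ { total += $lag_col }
        END { print total + 0 }
    '
}

# Usage (not run here): sample twice, a minute apart, and compare.
#   docker compose exec -T kafka kafka-consumer-groups \
#       --bootstrap-server localhost:9092 --describe --group logster-inference | sum_lag
```

If the second sample is larger than the first, the consumer group is falling further behind.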

Fix paths.

Slow GNN runs. Check inference_time_ms in Prometheus. Widen inference.interval (to give each run more breathing room), or vertically scale the inference container.
Too many endpoints for one replica. Scale the inference service horizontally. Each replica owns a subset of the 12 partitions. See Admin Guide: Daily Operations.
A poison event. Check docker compose logs inference | grep -i error. A single malformed event can stall a partition. Temporarily drop inference.use_idle_allowlist, or add a targeted exception in the parser.

Dashboard can't reach Elasticsearch

Symptom. Dashboard tiles show "Failed to fetch …" errors. Dashboard logs show connection errors against http://elasticsearch:9200.

Cause. Usually one of:

Elasticsearch container stopped or crashed.
ES_HOST environment variable set to the wrong address.
Elasticsearch took longer than usual to become healthy, and the dashboard gave up.

Fix paths.

Is ES running?

docker compose ps elasticsearch
curl http://localhost:9200/_cluster/health

Is ES_HOST correct? Check the dashboard's environment block in deploy/docker-compose.yml. The default is http://elasticsearch:9200, which resolves through intra-Compose DNS.
Restart the dashboard.

docker compose restart dashboard
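If Elasticsearch was merely slow to start, restarting the dashboard too early repeats the failure. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200: the cluster health API can block until the cluster reaches at least yellow. es_ready is a hypothetical helper and ES_URL an optional override.

```shell
# es_ready: succeed once the cluster reports at least yellow status,
# waiting up to 30 seconds on the Elasticsearch side.
es_ready() {
    curl -fsS "${ES_URL:-http://localhost:9200}/_cluster/health?wait_for_status=yellow&timeout=30s" \
        | grep -q '"timed_out":false'
}

# Usage (not run here):
#   es_ready && docker compose restart dashboard
```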

Auto-created topics with wrong partition counts

Symptom. A Kafka topic exists but has fewer partitions than expected, for example 6 instead of 12. Throughput suffers because fewer consumers can read in parallel.

Cause. Kafka auto-created the topic before kafka-init got to it, using the default KAFKA_NUM_PARTITIONS: 6. This mostly affects normalized-endpoint-events, which should have 12 partitions.

Verify.

docker compose exec kafka \
    kafka-topics --bootstrap-server localhost:9092 \
    --describe --topic normalized-endpoint-events

Fix.

docker compose exec kafka \
    kafka-topics --bootstrap-server localhost:9092 \
    --alter --topic normalized-endpoint-events --partitions 12

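[!WARNING] Increasing partitions on a keyed topic changes which partition each key maps to. Kafka does not move existing records, so per-endpoint ordering is not guaranteed until the records written before the change have been consumed. Prefer deleting and recreating the topic on a dev stack. On production, schedule a maintenance window.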
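To see why ordering is affected, consider how a keyed record is assigned to a partition: the key is hashed and taken modulo the partition count. The sketch below uses cksum as a stand-in hash (Kafka's default partitioner uses murmur2, so real partition numbers will differ), and the host-a to host-d keys are hypothetical.

```shell
# partition_for KEY NUM_PARTITIONS
# Stand-in for a hash partitioner: hash the key, then take the result modulo
# the partition count. cksum's CRC replaces Kafka's murmur2 for illustration.
partition_for() {
    local hash
    hash="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
    echo $(( hash % $2 ))
}

# Compare the assignment of the same keys before and after the change.
for key in host-a host-b host-c host-d; do
    echo "$key: 6 partitions -> $(partition_for "$key" 6), 12 partitions -> $(partition_for "$key" 12)"
done
```

Any key whose two numbers differ has older records on one partition and newer records on another, and Kafka only orders records within a single partition.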

Port already in use

Symptom. docker compose up -d fails with port is already allocated or bind: address already in use.

Cause. Another process on the host is bound to one of the ports listed in deploy/docker-compose.yml.

Fix. Either stop the conflicting process, or change the host-side port mapping. Only edit the left side of the "HOST:CONTAINER" string — the container-side port is referenced elsewhere and should not change.

# Example: move dashboard from host port 5001 to 5002
dashboard:
    ports:
        - "5002:5000"

Then update your bookmark from http://localhost:5001 to http://localhost:5002.
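To find which process holds a port, it can help to list the host ports the stack wants and check each one. This is a minimal sketch that only understands the short "HOST:CONTAINER" form shown above and may pick up other list items of the same shape; host_ports is a hypothetical helper, and lsof is assumed to be installed on the host.

```shell
# host_ports: read a Compose file on stdin and print the host side of each
# short-form "HOST:CONTAINER" port mapping. Mappings with an IP prefix, a
# protocol suffix, or the long syntax are not handled by this sketch.
host_ports() {
    sed -n 's/^[[:space:]]*-[[:space:]]*"\{0,1\}\([0-9][0-9]*\):[0-9][0-9]*"\{0,1\}[[:space:]]*$/\1/p'
}

# Usage (not run here): show the listener, if any, on each host port.
#   host_ports < deploy/docker-compose.yml | while read -r port; do
#       lsof -nP -iTCP:"$port" -sTCP:LISTEN
#   done
```

Seeing listeners owned by other users may require running lsof with elevated privileges.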

API returns 503

Symptom. curl http://localhost:8080/health returns 503 Service Unavailable with {"detail":"Alert store not configured"}.

Cause. The FastAPI app was constructed without an alert store. This almost always means the API container was launched with a misconfigured main.py or a missing dependency.

Fix. Check the API logs:

docker compose logs api | tail -50

The most common root cause is an import failure in main.py — for example, the Postgres alert store package was swapped in without installing its dependencies. Revert to the InMemoryAlertStore temporarily, then fix the Postgres setup properly.

Alerts disappear on restart

Symptom. After restarting the API container, GET /alerts returns an empty list even though you had alerts before.

Cause. The default API uses InMemoryAlertStore, which is exactly what it says — in-memory. Every container restart starts from an empty alert set.

Fix. This is not a bug, it's the default. For durability, swap in the Postgres-backed store. See Admin Guide: Important Considerations.

[!NOTE] Even without durable alerts, the dashboard still works, because the dashboard reads from Elasticsearch (logster-events and logster-inferences) directly, not from the API's alert store. Only the REST API is affected.

Same alert appears every 30 seconds

Symptom. On the dashboard's Recent Attacks panel, the same host keeps showing the same attack inference every 30 seconds.

Cause. This is not a bug — the dashboard shows every inference, not the deduplicated alert. The alerts service has already deduplicated it at the alert level (logster-alerts), but the dashboard reads from logster-inferences directly.

Fix. For the deduplicated view, use the REST API:

curl 'http://localhost:8080/alerts?status=open'

Each open alert represents a deduplicated group of inferences. The inference_ids field on the alert tells you how many underlying inferences contributed.

Still stuck?

If you hit something that's not in this guide:

Check the logs. docker compose --profile services logs --tail=500 | grep -i error
Check Prometheus. Are any service metrics flat, missing, or spiking?

Check the two ES indices directly. They are the source of truth for every dashboard tile.

curl 'http://localhost:9200/logster-events/_search?pretty&size=5'
curl 'http://localhost:9200/logster-inferences/_search?pretty&size=5'

Check Kafka consumer group lag.

docker compose exec kafka \
    kafka-consumer-groups --bootstrap-server localhost:9092 \
    --describe --group logster-inference

Still nothing? Escalate to Logster Support (Enterprise and Premium tiers) with the output of the four checks above attached.

Where to go next

Admin Guide: Daily Operations — proactive health checks.
Admin Guide: Important Considerations — known limitations of the default build.
Security Guide: Overview — threat model and security validations.