Enterprise Architecture

This page explains the end-to-end architecture of a Logster deployment — every component, how they communicate, and where data lives at each stage. It is written for platform engineers, security architects, and anyone evaluating Logster for a production rollout.

For the exact schemas, GNN layer details, and Kafka partition counts, see the contributor-facing LOGSTER_ARCHITECTURE.md.


High-level view

Logster's pipeline is a chain of Kafka-coupled microservices. No pipeline stage calls another directly: each stage reads from and writes to Kafka topics, so every stage can be scaled and restarted independently of the others.

```mermaid
flowchart TB
    subgraph Endpoints["MONITORED ENDPOINTS"]
        WIN["Windows<br/>Winlogbeat / Sysmon"]
        LNX["Linux<br/>auditd + eBPF"]
    end

    subgraph Kafka["KAFKA (KRaft mode)"]
        direction TB
        subgraph RawTopics["Raw topics"]
            T1[sysmon-logs]
            T2[linux-auditd-logs]
            T3[linux-ebpf-process-logs]
            T4[linux-ebpf-file-logs]
            T5[linux-ebpf-network-logs]
        end
        subgraph ProcTopics["Processed topics"]
            P1[normalized-endpoint-events]
            P2[logster-inference-results]
            P3[logster-alerts]
        end
    end

    NORM[NORMALIZER]
    INF[INFERENCE]
    ALR[ALERTS]
    REDIS[(Redis<br/>window state)]
    API[/"REST API<br/>:8080"/]

    subgraph Serving["LOGSTASH → ELASTICSEARCH → DASHBOARD"]
        direction TB
        LS[Logstash]
        ES[("Elasticsearch<br/>logster-events<br/>logster-inferences")]
        DASH["Dashboard<br/>React + Express :5001"]
        KIBANA["Kibana :5601"]
        GRAF["Grafana :3000"]
        PROM["Prometheus :9090"]
    end

    WIN --> T1
    LNX --> T2 & T3 & T4 & T5

    RawTopics --> NORM
    NORM --> P1
    P1 --> INF
    INF --> P2
    P2 --> ALR
    ALR --> P3

    INF <--> REDIS
    ALR --> API

    P1 --> LS
    P2 --> LS
    LS --> ES
    ES --> DASH
    ES --> KIBANA

    classDef topic fill:#2b3a55,stroke:#8fa8d8,color:#e6ecff;
    classDef svc fill:#2f5148,stroke:#7bc9a8,color:#e6fff3;
    classDef store fill:#55432b,stroke:#d8b98f,color:#fff3e6;
    class T1,T2,T3,T4,T5,P1,P2,P3 topic;
    class NORM,INF,ALR,LS,DASH,KIBANA,GRAF,PROM,API svc;
    class REDIS,ES store;
```

The Docker Compose file that wires this together for local development lives at deploy/docker-compose.yml.


Pipeline services

Each service is a thin Python package (or, for the dashboard, a Node/React project). They share two libraries under libs/ and communicate exclusively via Kafka and Elasticsearch.

Normalizer

Location: services/normalizer/

  • Consumes: the five raw topics (sysmon-logs, linux-auditd-logs, linux-ebpf-{process,file,network}-logs).
  • Produces: normalized-endpoint-events.
  • State: none — fully stateless and horizontally scalable.

The normalizer routes each raw event to a platform-specific parser (Sysmon, auditd, or eBPF), transforms it into the unified NormalizedEvent schema, and publishes it with a Kafka key of tenant_id:endpoint_id so all events for a given host land on the same partition. Downstream services never have to know whether an event originated from Sysmon, auditd, or eBPF — the normalizer is the one place in the system that cares.
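The routing and keying described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `NormalizedEvent` fields shown, the raw field names `tenant` and `host`, the `_parse` helper, and the lowercasing of the host name are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class NormalizedEvent:
    # Hypothetical subset of the unified schema; the real fields may differ.
    tenant_id: str
    endpoint_id: str
    platform: str
    source: str
    payload: dict


def _parse(raw: dict, platform: str, source: str) -> NormalizedEvent:
    # Placeholder parser. The raw keys "tenant" and "host" are assumptions.
    return NormalizedEvent(
        tenant_id=raw.get("tenant", "default"),
        endpoint_id=raw["host"].lower(),
        platform=platform,
        source=source,
        payload=raw,
    )


# One entry per raw topic; the three eBPF topics share a parser here.
PARSERS: Dict[str, Callable[[dict], NormalizedEvent]] = {
    "sysmon-logs": lambda raw: _parse(raw, "windows", "sysmon"),
    "linux-auditd-logs": lambda raw: _parse(raw, "linux", "auditd"),
    "linux-ebpf-process-logs": lambda raw: _parse(raw, "linux", "ebpf"),
    "linux-ebpf-file-logs": lambda raw: _parse(raw, "linux", "ebpf"),
    "linux-ebpf-network-logs": lambda raw: _parse(raw, "linux", "ebpf"),
}


def normalize(topic: str, raw: dict) -> Tuple[str, NormalizedEvent]:
    """Return (kafka_key, event) for one raw record."""
    event = PARSERS[topic](raw)
    # Same key for every event from a host, so they share a partition.
    key = f"{event.tenant_id}:{event.endpoint_id}"
    return key, event


key, event = normalize("sysmon-logs", {"host": "DESKTOP-1NNIMRR"})
print(key)
```

Keeping the topic-to-parser mapping in one table is what lets downstream code work with `NormalizedEvent` alone.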

Inference

Location: services/inference/

  • Consumes: normalized-endpoint-events.
  • Produces: logster-inference-results.
  • State: per-endpoint sliding windows, optionally persisted in Redis.

The inference service buffers normalized events into per-endpoint windows keyed by (tenant_id, endpoint_id, platform). Every inference.interval seconds (default 30s), for every endpoint with activity, it:

  1. Collects events from the last inference.window minutes (default 3 minutes).
  2. On Linux, runs a cheap idle-daemon pre-filter — if every process in the window is on an allow-list (systemd, cron, sshd, ...), the window is labeled benign without invoking the model.
  3. Otherwise, builds a heterogeneous temporal graph (NetworkX → PyTorch Geometric) with platform-specific node and edge types.
  4. Runs the platform's pre-trained GNN forward pass (Windows or Linux model).
  5. Publishes an InferenceResult with prediction, attack_prob, confidence, and graph statistics.
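Steps 1 and 2 of the loop above can be sketched as below. This is a simplified in-memory version under stated assumptions: events are reduced to a `(timestamp, process name)` pair, they arrive in timestamp order, the allow-list holds only the three daemons named above, and the function names are hypothetical. Graph construction and the GNN call are represented by a string label.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

WINDOW_SECONDS = 180  # inference.window default (3 minutes)
IDLE_ALLOW_LIST = {"systemd", "cron", "sshd"}  # illustrative subset only

WindowKey = Tuple[str, str, str]  # (tenant_id, endpoint_id, platform)
Event = Tuple[float, str]         # (timestamp, process name), simplified

windows: Dict[WindowKey, Deque[Event]] = defaultdict(deque)


def add_event(key: WindowKey, ts: float, process: str) -> None:
    windows[key].append((ts, process))


def collect(key: WindowKey, now: float) -> List[Event]:
    """Evict events older than the window and return what remains."""
    buf = windows[key]
    while buf and buf[0][0] < now - WINDOW_SECONDS:
        buf.popleft()
    return list(buf)


def is_idle(key: WindowKey, events: List[Event]) -> bool:
    """Linux-only pre-filter: every process in the window is allow-listed."""
    return key[2] == "linux" and all(p in IDLE_ALLOW_LIST for _, p in events)


def tick(now: float) -> Dict[WindowKey, str]:
    """One inference.interval tick: decide what happens to each active window."""
    decisions: Dict[WindowKey, str] = {}
    for key in list(windows):
        events = collect(key, now)
        if not events:
            continue  # no activity inside the window, nothing to score
        decisions[key] = "benign (pre-filter)" if is_idle(key, events) else "run GNN"
    return decisions


add_event(("default", "web01", "linux"), 100.0, "sshd")
add_event(("default", "web01", "linux"), 110.0, "cron")
add_event(("default", "desktop-1nnimrr", "windows"), 120.0, "powershell.exe")
decisions = tick(now=130.0)
print(decisions)
```

The pre-filter runs before any graph is built, which is why it is cheap: an idle Linux host costs one set-membership check per event.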

The model architecture is a 3-layer Heterogeneous Graph Attention Network (GAT) with 128 hidden dimensions, 4 attention heads, and a 2-class softmax output (benign / attack).

Alerts

Location: services/alerts/

  • Consumes: logster-inference-results.
  • Produces: logster-alerts; also writes each alert to the alert store (in-memory or Postgres-backed).
  • State: dedup windows, per-tenant correlation windows.

The alerts service turns raw attack predictions into actionable alerts. Each inference result goes through up to four steps (results that fail the first step go no further):

  1. Threshold filter. Results below alerts.min_threshold (default 0.7) are dropped immediately.
  2. Deduplication. If the same (tenant, endpoint, platform) has fired within alerts.dedup_window_seconds (default 300s), the new result is merged into the existing open alert instead of opening a new one.
  3. Lateral-movement correlation. If multiple endpoints in the same tenant alert within alerts.correlation_window_seconds (default 60s), they are cross-linked via the related_endpoints field.
  4. Severity and enrichment. Severity is bucketed from attack_prob (>=0.95 CRITICAL, >=0.85 HIGH, >=0.7 MEDIUM). If a TTP analyzer is configured, an async call enriches the alert with MITRE ATT&CK technique IDs.
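The threshold, deduplication, and severity steps can be sketched as follows. Correlation and TTP enrichment are left out. The `Alert` fields and the `handle` function are hypothetical, and two behaviors are assumptions the text above does not settle: the dedup window is measured from the most recent hit, and a merged alert's severity follows its highest `attack_prob`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

MIN_THRESHOLD = 0.7          # alerts.min_threshold default
DEDUP_WINDOW_SECONDS = 300   # alerts.dedup_window_seconds default

AlertKey = Tuple[str, str, str]  # (tenant, endpoint, platform)


def severity(attack_prob: float) -> str:
    if attack_prob >= 0.95:
        return "CRITICAL"
    if attack_prob >= 0.85:
        return "HIGH"
    return "MEDIUM"  # only reached by results that passed the threshold


@dataclass
class Alert:
    key: AlertKey
    severity: str
    max_prob: float
    last_seen: float
    hits: int = 1


open_alerts: Dict[AlertKey, Alert] = {}


def handle(key: AlertKey, attack_prob: float, now: float) -> Optional[Alert]:
    # 1. Threshold filter.
    if attack_prob < MIN_THRESHOLD:
        return None
    # 2. Deduplication: merge into the open alert if it fired recently.
    existing = open_alerts.get(key)
    if existing is not None and now - existing.last_seen <= DEDUP_WINDOW_SECONDS:
        existing.hits += 1
        existing.max_prob = max(existing.max_prob, attack_prob)
        existing.severity = severity(existing.max_prob)  # assumption: escalate on merge
        existing.last_seen = now
        return existing
    # 4. Severity for a newly opened alert.
    alert = Alert(key, severity(attack_prob), attack_prob, now)
    open_alerts[key] = alert
    return alert


k = ("default", "desktop-1nnimrr", "windows")
first = handle(k, 0.91, now=0.0)      # opens a HIGH alert
repeat = handle(k, 0.96, now=30.0)    # merged into the same alert
dropped = handle(k, 0.50, now=60.0)   # below threshold, no alert
later = handle(k, 0.80, now=1000.0)   # dedup window elapsed, new alert
```

With the default 30-second inference interval, a sustained attack on one endpoint produces one alert with a growing hit count rather than a new alert on every tick.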

API

Location: services/api/

  • Port: 8080
  • Framework: FastAPI

The API exposes /alerts, /alerts/{id}, /feedback, /endpoints, and /health. Analysts and integrations call it to list alerts, transition alert states, and record true/false-positive verdicts. See the API User Guide for full endpoint documentation.

[!NOTE] The default build wires the API to an in-memory alert store, which means alerts are lost on API restart. For production use, swap in the Postgres-backed store — see Admin Guide: Important Considerations.

Dashboard

Location: services/dashboard/

  • Port: 5001 (published from container port 5000)
  • Stack: React SPA + Express backend

The dashboard reads directly from Elasticsearch (not via the REST API) and serves aggregated views for SOC analysts: host summaries, attack timelines, inference detail, process trees, endpoint insights, and MITRE TTP distributions. See the Dashboard User Guide for the analyst walkthrough.


Message topology — Kafka

All inter-service communication flows through Kafka. The topic layout:

| Topic | Partitions | Producer | Consumers |
|---|---|---|---|
| sysmon-logs | 6 | Winlogbeat on Windows endpoints | normalizer |
| linux-auditd-logs | 6 | auditd shipper | normalizer |
| linux-ebpf-process-logs | 6 | eBPF collector | normalizer |
| linux-ebpf-file-logs | 6 | eBPF collector | normalizer |
| linux-ebpf-network-logs | 6 | eBPF collector | normalizer |
| normalized-endpoint-events | 12 | normalizer | inference, logstash |
| logster-inference-results | 6 | inference | alerts, logstash |
| logster-alerts | 3 | alerts | (reserved for downstream sinks) |

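Every event is keyed by tenant_id:endpoint_id, so all events for an endpoint land on the same partition of a topic and keep their relative order at every stage of the pipeline. This also enables partition-aware scaling: add more inference replicas, and Kafka rebalances partitions across the consumer group without breaking per-endpoint ordering. Replicas beyond a topic's partition count sit idle, so the partition count caps useful parallelism.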

[!IMPORTANT] Do not increase partition counts on a running topic once real traffic is flowing. Kafka allows it, but existing endpoint keys will rehash to different partitions and break per-endpoint ordering during the transition. Plan partition counts up front.
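The effect of changing a partition count can be seen with a small simulation. This is a sketch only: it uses CRC32 as a stand-in hash, whereas real Kafka clients use their own partitioner (the Java client's default hashes keys with murmur2), and the 24-partition target and host names are hypothetical. The hash-modulo principle is the same.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for a client partitioner: hash the key, take it modulo the count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"default:host-{i}" for i in range(1000)]

# With a fixed partition count, a key always maps to the same partition.
stable = all(partition_for(k, 12) == partition_for(k, 12) for k in keys)

# After growing the topic from 12 to 24 partitions, many keys map elsewhere.
moved = sum(partition_for(k, 12) != partition_for(k, 24) for k in keys)
print(f"{moved} of {len(keys)} keys change partition when going from 12 to 24")
```

Any key that moves has older events on its previous partition and newer events on another, and two consumers may process them concurrently. That is the ordering break the warning describes.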


Data stores

| Store | Purpose | Volatility |
|---|---|---|
| Kafka | Event transport and durable buffer for every topic. | Persistent (configurable retention). |
| Redis | Per-endpoint inference state (sliding windows). | Volatile. Losing it means the first inference.window of events after recovery is cold. |
| Elasticsearch | Long-term query store for logster-events and logster-inferences. Powers the dashboard. | Persistent (configure ILM for retention). |
| Alert store | Backing store for the REST API. In-memory by default; Postgres optional. | In-memory loses alerts on restart; Postgres is durable. |

Logstash is the bridge between Kafka and Elasticsearch. Two pipelines defined under deploy/logstash/ ship normalized-endpoint-events to the logster-events index and logster-inference-results to the logster-inferences index, using the event / inference id as the ES document id.
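Using the event or inference id as the document id means that re-delivering the same record overwrites the existing document instead of creating a duplicate. In a Logstash elasticsearch output this is normally configured through the `document_id` option; the repository's pipeline files are not reproduced here. The Python sketch below only illustrates the idea by building an Elasticsearch bulk-API `index` action by hand. The sample event, its `endpoint_id` field, and the `bulk_lines` helper are hypothetical.

```python
import json


def bulk_lines(index: str, doc_id: str, doc: dict) -> str:
    """Build one bulk-API 'index' action: an action line, then the source line."""
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


event = {"event_id": "evt-123", "endpoint_id": "desktop-1nnimrr"}

# Sending this body twice targets the same _id, so the second write
# replaces the first document rather than adding a new one.
body = bulk_lines("logster-events", event["event_id"], event)
print(body)
```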


Infrastructure components

| Component | Image | Default port | Purpose |
|---|---|---|---|
| Kafka | confluentinc/cp-kafka:7.7.1 | 9092 internal, 29092 external | Event streaming (KRaft mode, no Zookeeper) |
| Elasticsearch | elasticsearch:8.13.0 | 9200 | Long-term event + inference storage |
| Redis | redis:7-alpine | 6379 | Per-endpoint sliding-window state |
| Logstash | logstash:8.13.0 | | Kafka → Elasticsearch ingestion |
| Kibana | kibana:8.13.0 | 5601 | Raw log exploration |
| Prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics scraping |
| Grafana | grafana/grafana:10.4.0 | 3000 | Metrics visualization (admin / logster) |
| Tempo | grafana/tempo:2.4.1 | 3200 / 4317 | Trace backend |
| OpenTelemetry Collector | otel/opentelemetry-collector-contrib:0.96.0 | 4317 / 4318 | Trace ingestion for every service |

End-to-end example

Walk through a single PowerShell attack on a Windows host to see every component in action:

  1. Endpoint. Winlogbeat on DESKTOP-1NNIMRR captures Sysmon Event ID 1 (Process Create) for powershell.exe -enc <base64> and publishes the raw JSON to the sysmon-logs Kafka topic.
  2. Normalizer. Consumes the event, extracts image, command_line, parent_image, user, and hashes into a NormalizedEvent, and publishes to normalized-endpoint-events with key default:desktop-1nnimrr.
  3. Logstash. Tails normalized-endpoint-events and indexes the document into logster-events (document id = event_id).
  4. Inference. Appends the event to the desktop-1nnimrr sliding window. At the next 30-second tick:
       • Collects the last 3 minutes of events for that endpoint.
       • Builds a heterogeneous graph (process → file, process → network, ...).
       • Runs the Windows GNN → attack_prob=0.91, prediction=attack.
       • Publishes an InferenceResult to logster-inference-results.
  5. Logstash (again). Indexes the inference result into logster-inferences.
  6. Alerts. Consumes the attack prediction:
       • Creates a new alert with severity HIGH (since 0.91 >= 0.85).
       • Checks whether any other endpoints in default alerted in the last 60 seconds.
       • Calls the TTP analyzer (if configured) → T1059.001 (PowerShell).
       • Writes to its store and publishes to logster-alerts.
  7. Dashboard. Sees the new inference record in logster-inferences and surfaces it on the host card, attack timeline, inference detail, process tree, and TTP distribution views.
  8. Analyst. Opens the alert, confirms it as a true positive via POST /feedback, and the alert transitions to resolved.

Design rationale

Three design decisions shape almost everything downstream:

Kafka at the center. Every service is a Kafka consumer/producer and never depends directly on another service. Inference can crash and restart while events accumulate in its input topic (within the topic's retention) until it catches up. Logstash reads the same topics the services consume, so the Elasticsearch indices behind the dashboard reflect what the pipeline itself saw.

Per-endpoint windows, not per-event inference. The GNN needs behavioral context — a single powershell.exe invocation is not interesting on its own. Running the model on 3-minute sliding windows gives it enough events to build a graph with real structure. This also makes inference.window and inference.interval the main tuning knobs for the latency-vs-accuracy trade-off.
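A rough way to reason about the two knobs is to work out how much consecutive windows overlap. The helper below is a back-of-the-envelope sketch with hypothetical names. It ignores processing time and boundary effects, so treat the numbers as approximations.

```python
def window_stats(window_seconds: int, interval_seconds: int) -> dict:
    return {
        # Roughly how many consecutive ticks will include a given event.
        "windows_per_event": window_seconds // interval_seconds,
        # Longest wait between an event arriving and the next tick.
        "max_wait_for_next_tick_s": interval_seconds,
        # Share of each window that was already part of the previous window.
        "overlap_fraction": 1 - interval_seconds / window_seconds,
    }


defaults = window_stats(window_seconds=180, interval_seconds=30)
print(defaults)
```

With the defaults, each event is scored in about six successive windows and consecutive windows share about five sixths of their span. Shortening inference.interval lowers the wait before the first score at the cost of more model invocations; lengthening inference.window adds context but keeps old events influencing results for longer.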

Dedup first, correlate second. Without dedup, a noisy endpoint would spam analysts with identical alerts every 30 seconds. Without correlation, cross-host lateral movement would look like unrelated incidents. The alerts service handles both before anything reaches the API or the dashboard.
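The correlation half can be sketched as a per-tenant list of recent alerts. This is a minimal in-memory illustration: the `correlate` function and its data structures are hypothetical, and it assumes alerts for a tenant are processed in timestamp order. Only the `related_endpoints` name and the 60-second default come from the description of the alerts service.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

CORRELATION_WINDOW_SECONDS = 60  # alerts.correlation_window_seconds default

# tenant -> [(timestamp, endpoint_id)] for alerts still inside the window
recent: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
# (tenant, endpoint_id) -> endpoints cross-linked to it
related_endpoints: Dict[Tuple[str, str], Set[str]] = defaultdict(set)


def correlate(tenant: str, endpoint: str, now: float) -> Set[str]:
    """Cross-link this alert with other endpoints in the tenant that alerted recently."""
    # Drop alerts that have aged out of the correlation window.
    recent[tenant] = [
        (ts, ep) for ts, ep in recent[tenant]
        if now - ts <= CORRELATION_WINDOW_SECONDS
    ]
    peers = {ep for _, ep in recent[tenant] if ep != endpoint}
    # Link in both directions so each alert lists the other.
    for peer in peers:
        related_endpoints[(tenant, peer)].add(endpoint)
    related_endpoints[(tenant, endpoint)] |= peers
    recent[tenant].append((now, endpoint))
    return set(related_endpoints[(tenant, endpoint)])


a = correlate("default", "host-a", now=0.0)    # nothing to link yet
b = correlate("default", "host-b", now=20.0)   # within 60s of host-a
c = correlate("default", "host-c", now=200.0)  # earlier alerts have aged out
print(a, b, c)
```

Because dedup has already collapsed repeats from one endpoint, the recent list stays short and each entry represents a distinct host.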


Where to go next