Monitoring
klite is designed to be easy to operate. This guide covers logging, health checks, and monitoring options.
klite uses structured logging via Go’s slog package. Output is JSON by default when running in a container, or human-readable text on a terminal.
Log levels
Set the log level with --log-level or KLITE_LOG_LEVEL:

    ./klite --log-level debug

| Level | Description |
|---|---|
| debug | Verbose protocol-level details, request/response tracing |
| info | Startup, shutdown, topic creation, group events (default) |
| warn | Recoverable issues: client disconnects, invalid requests |
| error | Unrecoverable errors: disk failures, S3 errors |
Key log messages
    INFO broker started listen=:9092 cluster_id=abc123 node_id=0
    INFO topic created topic=my-topic partitions=3
    INFO group rebalance group=my-group members=2 generation=3
    WARN client disconnected addr=192.168.1.5:43210 error="read: connection reset"
    ERROR WAL write failed error="disk full"

Log filtering in production
With JSON output, filter logs using jq:

    # Show only errors
    ./klite 2>&1 | jq 'select(.level == "ERROR")'

    # Show group coordinator events
    ./klite 2>&1 | jq 'select(.msg | contains("group"))'

In Kubernetes:

    kubectl logs klite-0 | jq 'select(.level == "ERROR")'

Health checks
klite currently exposes health via TCP connectivity. If the broker is accepting connections on port 9092, it’s healthy.
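The TCP check described above can also be scripted for environments where neither Docker nor Kubernetes runs the probe. The following is a minimal sketch in Python, assuming the broker listens on localhost:9092; the function name `broker_is_healthy` and the 3-second default timeout are illustrative choices, not part of klite.

```python
import socket


def broker_is_healthy(host: str = "localhost", port: int = 9092,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    which is the same signal the TCP probes in this guide rely on.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `broker_is_healthy()` from a cron job or a supervisor script gives the same pass/fail signal as `nc -z localhost 9092`.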
Docker health check
    healthcheck:
      test: ["CMD", "nc", "-z", "localhost", "9092"]
      interval: 10s
      timeout: 3s
      retries: 3

Kubernetes probes

    livenessProbe:
      tcpSocket:
        port: 9092
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 9092
      initialDelaySeconds: 5
      periodSeconds: 5

Metrics
Prometheus metrics support is planned. Key metrics to be exposed:
| Metric | Type | Description |
|---|---|---|
| klite_messages_produced_total | Counter | Total messages produced |
| klite_messages_consumed_total | Counter | Total messages fetched |
| klite_produce_latency_seconds | Histogram | Produce request latency |
| klite_fetch_latency_seconds | Histogram | Fetch request latency |
| klite_active_connections | Gauge | Current open connections |
| klite_consumer_groups_active | Gauge | Active consumer groups |
| klite_wal_size_bytes | Gauge | WAL data size on disk |
| klite_s3_flush_lag_seconds | Gauge | Time since last S3 flush |
Prometheus + Grafana (planned)
    # ServiceMonitor for Prometheus Operator
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: klite
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: klite
      endpoints:
        - port: metrics
          interval: 30s

Alerting recommendations
Even without a dedicated metrics endpoint, you can alert on the basics:
| Alert | Condition | Severity |
|---|---|---|
| klite down | TCP probe fails for 30s | Critical |
| Disk usage high | Data dir > 80% capacity | Warning |
| S3 flush failing | Error logs with “s3” or “flush” | Critical |
| Consumer group stuck | Consumer lag increasing for > 5 min | Warning |
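The "S3 flush failing" row above relies on log scanning rather than metrics. One way to approximate it is sketched below in Python, assuming JSON log output with `level` and `msg` fields (the fields the jq filters in this guide use) and an optional `error` field (as in the key log messages). The function name `s3_flush_errors` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def s3_flush_errors(lines: Iterable[str]) -> Iterator[dict]:
    """Yield ERROR records whose message or error text mentions "s3" or "flush".

    Assumes one JSON object per line; anything else is skipped.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. text-mode output on a terminal
        if not isinstance(record, dict) or record.get("level") != "ERROR":
            continue
        # Match case-insensitively across both fields.
        text = f'{record.get("msg", "")} {record.get("error", "")}'.lower()
        if "s3" in text or "flush" in text:
            yield record
```

Passing `sys.stdin` as `lines` lets the function sit at the end of a pipe such as `kubectl logs klite-0 | python3 scan.py`, where `scan.py` is a placeholder script name; any record it yields can then trigger a critical alert.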
Example alert (Prometheus)
    groups:
      - name: klite
        rules:
          - alert: KliteDown
            expr: up{job="klite"} == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              summary: "klite broker is down"

          - alert: KliteDiskUsageHigh
            expr: node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"} < 0.2
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "klite data directory disk usage above 80%"

Disk usage
Monitor the data directory size:
    # Check WAL size
    du -sh ./data/

    # Watch it over time
    watch -n 5 du -sh ./data/

With S3 storage enabled, the WAL is periodically flushed and old segments are removed. Without S3, the WAL keeps growing and is bounded only by the retention policy.
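For scripted checks, the same numbers can be collected without shelling out to `du`. This is a rough sketch in Python, assuming the data directory is `./data/` as in the commands above; `dir_size_bytes` and `disk_used_fraction` are illustrative names. It sums apparent file sizes, so the result can differ slightly from `du`, which counts allocated blocks.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Sum the apparent sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # A file may disappear mid-walk, e.g. a segment being removed.
                continue
    return total


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

A value above 0.8 from `disk_used_fraction("./data/")` corresponds to the "Disk usage high" condition in the alerting table.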
Configure retention to control disk usage:
    ./klite --retention-ms 86400000  # 24 hours

See Storage for details on WAL lifecycle and S3 flushing.
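Because --retention-ms takes milliseconds, it is easy to be off by a factor of 1000 when picking a value. A small conversion helper, sketched in Python, can make the arithmetic explicit; `retention_ms` is a hypothetical name and not something klite ships.

```python
from datetime import timedelta


def retention_ms(**duration) -> int:
    """Convert a duration (days=, hours=, minutes=, ...) to whole milliseconds.

    Keyword arguments are passed straight to datetime.timedelta.
    """
    return timedelta(**duration) // timedelta(milliseconds=1)


print(retention_ms(hours=24))  # 86400000, the value used above
print(retention_ms(days=7))    # 604800000
```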
Next steps
- Configuration reference: all flags and environment variables
- Architecture: understand WAL, S3, and the data path
- Troubleshooting: common operational issues