# Monitoring

klite is designed to be easy to operate. This guide covers logging, health checks, and monitoring options.

## Logging

klite uses structured logging via Go's slog package. Output is JSON by default when running in a container, and human-readable text on a terminal.

Set the log level with `--log-level` or the `KLITE_LOG_LEVEL` environment variable:

```sh
./klite --log-level debug
```
| Level | Description |
| --- | --- |
| `debug` | Verbose protocol-level details, request/response tracing |
| `info` | Startup, shutdown, topic creation, group events (default) |
| `warn` | Recoverable issues: client disconnects, invalid requests |
| `error` | Unrecoverable errors: disk failures, S3 errors |

Example log output:

```
INFO broker started listen=:9092 cluster_id=abc123 node_id=0
INFO topic created topic=my-topic partitions=3
INFO group rebalance group=my-group members=2 generation=3
WARN client disconnected addr=192.168.1.5:43210 error="read: connection reset"
ERROR WAL write failed error="disk full"
```

With JSON output, filter logs using jq:

```sh
# Show only errors
./klite 2>&1 | jq 'select(.level == "ERROR")'

# Show group coordinator events
./klite 2>&1 | jq 'select(.msg | contains("group"))'
```

In Kubernetes:

```sh
kubectl logs klite-0 | jq 'select(.level == "ERROR")'
```

## Health checks

klite does not yet expose a dedicated health endpoint; health is inferred from TCP connectivity. If the broker is accepting connections on port 9092, it is considered healthy.

Docker Compose:

```yaml
healthcheck:
  test: ["CMD", "nc", "-z", "localhost", "9092"]
  interval: 10s
  timeout: 3s
  retries: 3
```

Kubernetes:

```yaml
livenessProbe:
  tcpSocket:
    port: 9092
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: 9092
  initialDelaySeconds: 5
  periodSeconds: 5
```

## Metrics

Prometheus metrics support is planned. Key metrics to be exposed:

| Metric | Type | Description |
| --- | --- | --- |
| `klite_messages_produced_total` | Counter | Total messages produced |
| `klite_messages_consumed_total` | Counter | Total messages fetched |
| `klite_produce_latency_seconds` | Histogram | Produce request latency |
| `klite_fetch_latency_seconds` | Histogram | Fetch request latency |
| `klite_active_connections` | Gauge | Current open connections |
| `klite_consumer_groups_active` | Gauge | Active consumer groups |
| `klite_wal_size_bytes` | Gauge | WAL data size on disk |
| `klite_s3_flush_lag_seconds` | Gauge | Time since last S3 flush |
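
Once these metrics land, queries along these lines could drive dashboards. This is illustrative only: the metric names come from the table above, but the label sets (and the `_bucket` histogram series) are assumptions about a not-yet-shipped endpoint.

```promql
# Messages produced per second, averaged over 5 minutes
rate(klite_messages_produced_total[5m])

# p99 produce latency from the latency histogram
histogram_quantile(0.99, sum by (le) (rate(klite_produce_latency_seconds_bucket[5m])))
```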
```yaml
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: klite
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: klite
  endpoints:
    - port: metrics
      interval: 30s
```

## Alerting

Even without a dedicated metrics endpoint, you can alert on the basics:

| Alert | Condition | Severity |
| --- | --- | --- |
| klite down | TCP probe fails for 30s | Critical |
| Disk usage high | Data dir > 80% capacity | Warning |
| S3 flush failing | Error logs with "s3" or "flush" | Critical |
| Consumer group stuck | Consumer lag increasing for > 5 min | Warning |

Sample Prometheus alerting rules:

```yaml
groups:
  - name: klite
    rules:
      - alert: KliteDown
        expr: up{job="klite"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "klite broker is down"
      - alert: KliteDiskUsageHigh
        expr: node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"} < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "klite data directory disk usage above 80%"
```

## Disk usage

Monitor the data directory size:

```sh
# Check WAL size
du -sh ./data/

# Watch it over time
watch -n 5 du -sh ./data/
```

With S3 storage enabled, the WAL is periodically flushed to object storage and flushed segments are removed from local disk. Without S3, the WAL stays on local disk and grows until the retention policy removes old segments.

Configure retention to control disk usage:

```sh
./klite --retention-ms 86400000 # 24 hours
```

See Storage for details on WAL lifecycle and S3 flushing.