Observability
Metrics, logs, traces, and alerts for production Spice.ai Enterprise deployments.
Spice.ai Enterprise exposes a comprehensive set of metrics, structured logs, and OpenTelemetry traces. This page documents how to wire each into a production observability stack and which signals to alert on.
Metrics
The runtime exposes Prometheus metrics on port 9090 at /metrics. The Helm chart and the Kubernetes operator both ship with first-class scrape integration:
- Helm chart: set monitoring.podMonitor.enabled: true to deploy a PodMonitor.
- Operator: set servicemonitor.enabled: true to deploy a ServiceMonitor for the operator itself.
# values.yaml (Spice Helm chart)
monitoring:
  podMonitor:
    enabled: true
    additionalLabels:
      release: prometheus

Key metrics
| Metric | Type | Description |
| --- | --- | --- |
| spiced_query_duration_seconds | Histogram | End-to-end query latency. |
| spiced_acceleration_refresh_duration_seconds | Histogram | Per-dataset acceleration refresh latency. |
| spiced_acceleration_refresh_total{result} | Counter | Refresh successes and failures per dataset. |
| spiced_acceleration_rows | Gauge | Row count per accelerated dataset. |
| spiced_http_request_duration_seconds | Histogram | HTTP API latency by route. |
| spiced_flight_request_duration_seconds | Histogram | Arrow Flight RPC latency. |
| spiced_cluster_executor_count | Gauge | (SpicepodCluster) Number of executors registered with the scheduler. |
| spiced_cluster_certificate_expiry_seconds | Gauge | (SpicepodCluster) Time-to-expiry for the per-node leaf certificate. |
| spiceai_operator_reconcile_duration_seconds | Histogram | Operator reconcile loop latency. |
| spiceai_operator_pod_dead_total | Counter | Dead pod observations. Triggers crashloop pause when above the configured threshold. |
For the full list, query the running runtime: curl localhost:9090/metrics | grep -E '^# HELP'.
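Where the Prometheus Operator is not in use, the same endpoint can be scraped with a static job. The fragment below is a minimal sketch of a prometheus.yml scrape job. The port and path come from this page; the job name and target hostname are hypothetical placeholders.

```yaml
# prometheus.yml (fragment): scrape the Spice runtime directly
scrape_configs:
  - job_name: spice                 # hypothetical job name
    metrics_path: /metrics          # path documented above
    scrape_interval: 30s            # assumption; align with the global interval
    static_configs:
      - targets:
          - spiced.example.internal:9090   # hypothetical host, documented port
```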
Grafana dashboard
Spice.ai publishes a maintained Grafana dashboard with the panels operations teams need most often (query rate / latency / errors, acceleration freshness and row counts, executor registration, certificate expiry).
Import via dashboard ID or copy the JSON from the Spice.ai Grafana dashboard. The dashboard is compatible with Prometheus, Amazon Managed Prometheus, Azure Managed Prometheus, and Google Cloud Managed Service for Prometheus.
Logs
Spice emits structured JSON logs on stdout. The log level is controlled by SPICED_LOG:
| Level | Recommended use |
| --- | --- |
| ERROR | Production default. |
| WARN | Production with elevated visibility. |
| INFO | Default during cutover and incident investigation. |
| DEBUG | Development and targeted debugging only; not for production. |
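As a sketch, the level can be set as an environment variable on the runtime container. This is a generic Kubernetes pod-spec fragment, not a documented Helm value; the container name is a hypothetical placeholder, and where the variable is set depends on how the deployment is templated.

```yaml
# Pod spec fragment: raise log verbosity during a cutover
containers:
  - name: spiced                # hypothetical container name
    env:
      - name: SPICED_LOG        # variable documented above
        value: "INFO"           # revert to ERROR or WARN after the cutover
```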
Log routing
| Destination | Collector |
| --- | --- |
| CloudWatch Logs | Fluent Bit DaemonSet with the cloudwatch_logs output plugin. |
| Azure Monitor | |
| Google Cloud Logging | The GKE-managed logging agent (default on Autopilot / Standard). |
| Datadog | Datadog Agent with the Kubernetes integration enabled. |
| Elastic / OpenSearch | Filebeat or Fluent Bit with the elasticsearch / opensearch output. |
| Loki | Grafana Alloy or Promtail. |
Always retain query-error and acceleration-refresh-failure log lines for at least 30 days for incident review.
Distributed tracing
When the runtime is started with --otel-endpoint or the SPICED_OTEL_ENDPOINT environment variable, Spice exports OpenTelemetry traces over OTLP/gRPC. Traces cover query parsing, optimization, and execution; for SpicepodCluster, scheduler-to-executor RPCs are linked into a single trace.
Pair with an OpenTelemetry Collector configured for the organization's tracing backend (Tempo, Jaeger, Datadog, Honeycomb, X-Ray).
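A minimal Collector configuration for this pairing might look like the sketch below. It receives OTLP over gRPC on the conventional port 4317, batches spans, and forwards them to a backend. The backend endpoint is a hypothetical placeholder, and the insecure TLS setting is an assumption suitable only for in-cluster testing.

```yaml
# otel-collector-config.yaml: receive traces from Spice, forward to a backend
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # point SPICED_OTEL_ENDPOINT at this address

processors:
  batch: {}                         # batch spans before export

exporters:
  otlp:
    endpoint: tempo.observability.svc:4317   # hypothetical backend address
    tls:
      insecure: true                # assumption: plaintext inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Other backends swap only the exporter; the receiver and pipeline stay the same.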
Alerts
The following alerts are the minimum recommended set for production. Tune thresholds against the deployment's observed baseline.
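To illustrate how alerts on the metrics documented above can be expressed, here is a sketch of a Prometheus rule file. It is not the recommended set itself: the alert names, thresholds, and durations are placeholder assumptions, and the _bucket suffix follows the standard Prometheus histogram naming convention.

```yaml
# spice-alerts.yaml: illustrative rules only; tune every threshold
groups:
  - name: spice-example-alerts
    rules:
      - alert: SpiceQueryLatencyHigh
        # p99 end-to-end query latency over the last 5 minutes
        expr: histogram_quantile(0.99, sum by (le) (rate(spiced_query_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 query latency above 1s (placeholder threshold)"

      - alert: SpiceCertificateExpiringSoon
        # leaf certificate has less than 7 days of validity left
        expr: spiced_cluster_certificate_expiry_seconds < 7 * 24 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node leaf certificate expires within 7 days"

      - alert: SpiceNoExecutorsRegistered
        # SpicepodCluster only: scheduler sees no executors
        expr: spiced_cluster_executor_count < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No executors registered with the scheduler"
```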
Health endpoints
| Endpoint | Purpose |
| --- | --- |
| GET /health | Liveness. Returns 200 OK when the process is responsive. Use for container liveness probes. |
| GET /v1/ready | Readiness. Returns 200 OK only once datasets and accelerations have completed initial load. Use for container readiness probes and load balancer health checks. |
| GET /metrics | Prometheus metrics on port 9090. Should never be exposed externally. |
The Spice Helm chart and the operator configure liveness and readiness probes against these endpoints by default. Tune the probe parameters on heavy initial-load workloads via probes:
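As a rough guide to the parameters involved, the sketch below shows plain Kubernetes container probes against the two endpoints. It is a pod-spec fragment, not the chart's probes schema; the port (8090) and all timing values are assumptions to adjust per deployment.

```yaml
# Container probes (Kubernetes pod spec fragment)
livenessProbe:
  httpGet:
    path: /health
    port: 8090               # assumption: the runtime's HTTP port
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3        # restart after ~30s of unresponsiveness
readinessProbe:
  httpGet:
    path: /v1/ready
    port: 8090               # assumption: same HTTP port
  initialDelaySeconds: 15    # assumption: skip checks during early startup
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

A failing readiness probe only removes the pod from service endpoints and does not restart it, so a long initial load keeps the pod out of rotation without triggering restarts as long as /health keeps responding.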