Observability

Metrics, logs, traces, and alerts for production Spice.ai Enterprise deployments.

Spice.ai Enterprise exposes a comprehensive set of metrics, structured logs, and OpenTelemetry traces. This page documents how to wire each into a production observability stack and which signals to alert on.

Metrics

The runtime exposes Prometheus metrics on port 9090 at /metrics. The Helm chart and the Kubernetes operator both ship with first-class scrape integration:

  • Helm chart: set monitoring.podMonitor.enabled: true to deploy a PodMonitor.

  • Operator: set servicemonitor.enabled: true to deploy a ServiceMonitor for the operator itself.

```yaml
# values.yaml (Spice Helm chart)
monitoring:
  podMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
```
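
Where Prometheus is not managed by the Prometheus Operator, the same endpoint can be scraped with an ordinary static job. The following is a minimal sketch; the job name and the target hostname are hypothetical placeholders, and only the port and path come from this page.

```yaml
# prometheus.yml fragment (job name and target host are placeholders)
scrape_configs:
  - job_name: spice
    metrics_path: /metrics          # path exposed by the runtime
    scrape_interval: 30s
    static_configs:
      - targets:
          - spiced.example.internal:9090   # hypothetical host, metrics port 9090
```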

Key metrics

| Metric | Type | Meaning |
| --- | --- | --- |
| spiced_query_duration_seconds | Histogram | End-to-end query latency. |
| spiced_query_total{result="error"} | Counter | Failed queries. Drives the error rate alert. |
| spiced_acceleration_refresh_duration_seconds | Histogram | Per-dataset acceleration refresh latency. |
| spiced_acceleration_refresh_total{result} | Counter | Refresh successes and failures per dataset. |
| spiced_acceleration_rows | Gauge | Row count per accelerated dataset. |
| spiced_http_request_duration_seconds | Histogram | HTTP API latency by route. |
| spiced_flight_request_duration_seconds | Histogram | Arrow Flight RPC latency. |
| spiced_cluster_executor_count | Gauge | (SpicepodCluster) Number of executors registered with the scheduler. |
| spiced_cluster_certificate_expiry_seconds | Gauge | (SpicepodCluster) Time-to-expiry for the per-node leaf certificate. |
| spiceai_operator_reconcile_duration_seconds | Histogram | Operator reconcile loop latency. |
| spiceai_operator_pod_dead_total | Counter | Dead pod observations. Triggers crashloop pause when above the configured threshold. |

For the full list, query the running runtime: curl localhost:9090/metrics | grep -E '^# HELP'.
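
The query metrics above are typically consumed as a latency percentile and an error ratio. The recording rules below are a sketch of that, assuming the histogram is exposed as a classic Prometheus histogram (so a `_bucket` series with an `le` label exists). The group name, rule names, and 5-minute window are illustrative choices, not part of the product.

```yaml
# Prometheus rule file fragment (group and rule names are hypothetical)
groups:
  - name: spice-query-recording
    interval: 30s
    rules:
      # p99 end-to-end query latency over the last 5 minutes.
      - record: spice:query_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(spiced_query_duration_seconds_bucket[5m]))
          )
      # Fraction of queries that failed over the last 5 minutes.
      - record: spice:query_error_ratio:rate5m
        expr: |
          sum(rate(spiced_query_total{result="error"}[5m]))
          /
          sum(rate(spiced_query_total[5m]))
```

Precomputing these series keeps dashboard panels and alert expressions short and makes them cheaper to evaluate.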

Grafana dashboard

Spice.ai publishes a maintained Grafana dashboard with the panels operations teams need most often (query rate / latency / errors, acceleration freshness and row counts, executor registration, certificate expiry).

Import via dashboard ID or copy the JSON from the Spice.ai Grafana dashboard. The dashboard is compatible with Prometheus, Amazon Managed Prometheus, Azure Managed Prometheus, and Google Cloud Managed Service for Prometheus.

Logs

Spice emits structured JSON logs on stdout. The log level is controlled by SPICED_LOG:

| Level | Use |
| --- | --- |
| ERROR | Production default. |
| WARN | Production with elevated visibility. |
| INFO | Default during cutover and incident investigation. |
| DEBUG | Development and targeted debugging only; not for production. |
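
As a minimal sketch, the level can be set through the container environment. The fragment below is a generic Kubernetes container `env` entry. Where it belongs in the Helm values, and whether the value is case-sensitive, are not specified on this page and are assumed here.

```yaml
# Fragment of a Kubernetes container spec (placement within the chart is assumed)
env:
  - name: SPICED_LOG
    value: "INFO"   # raise from the ERROR default during cutover or an incident
```

Return the value to ERROR or WARN once the investigation is over, so log volume goes back to the production baseline.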

Log routing

| Destination | Recommended forwarder |
| --- | --- |
| CloudWatch Logs | Fluent Bit DaemonSet with the cloudwatch_logs output plugin. |
| Google Cloud Logging | The GKE-managed logging agent (default on Autopilot / Standard). |
| Datadog | Datadog Agent with the Kubernetes integration enabled. |
| Elastic / OpenSearch | Filebeat or Fluent Bit with the elasticsearch / opensearch output. |
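
For the CloudWatch Logs row, a Fluent Bit pipeline could look roughly like the sketch below, written in Fluent Bit's YAML configuration format. The region, log group name, stream prefix, and container log path are placeholder assumptions; only the `cloudwatch_logs` output plugin is named on this page.

```yaml
# fluent-bit.yaml fragment (region, group, prefix, and path are placeholders)
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log   # typical node path for container stdout
      tag: kube.*
  outputs:
    - name: cloudwatch_logs
      match: kube.*
      region: us-east-1
      log_group_name: /spice/runtime
      log_stream_prefix: spiced-
      auto_create_group: true
```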

Always retain query-error and acceleration-refresh-failure log lines for at least 30 days for incident review.

Distributed tracing

When the runtime is started with --otel-endpoint or the SPICED_OTEL_ENDPOINT environment variable, Spice exports OpenTelemetry traces over OTLP/gRPC. Traces cover query parsing, optimization, and execution; for SpicepodCluster, scheduler-to-executor RPCs are linked into a single trace.

Pair with an OpenTelemetry Collector configured for the organization's tracing backend (Tempo, Jaeger, Datadog, Honeycomb, X-Ray).
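
A minimal Collector pipeline for this setup might look like the sketch below: it receives OTLP/gRPC from the runtime and forwards traces to a backend that also speaks OTLP. The exporter endpoint is a hypothetical in-cluster address, and the plaintext TLS setting is an assumption suitable only for a trusted network.

```yaml
# otel-collector-config.yaml (backend address is a placeholder)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # standard OTLP/gRPC port

processors:
  batch: {}                      # batch spans before export

exporters:
  otlp:
    endpoint: tempo.observability.svc:4317   # hypothetical tracing backend
    tls:
      insecure: true             # plaintext; configure TLS outside a trusted network

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Point --otel-endpoint or SPICED_OTEL_ENDPOINT at the Collector's OTLP/gRPC address. The exact URL format the runtime expects is not stated on this page.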

Alerts

The following alerts are the minimum recommended set for production. Tune thresholds against the deployment's observed baseline.
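
The sketch below shows how alerts on the metrics listed under Key metrics could be written as a Prometheus rule file. The alert names, thresholds, durations, severity labels, and the `result="error"` label value on the refresh counter are illustrative assumptions, not the recommended set itself.

```yaml
# Prometheus rule file fragment (all names and thresholds are placeholders)
groups:
  - name: spice-alerts-example
    rules:
      # More than 5% of queries failing for 10 minutes.
      - alert: SpiceQueryErrorRateHigh
        expr: |
          sum(rate(spiced_query_total{result="error"}[5m]))
          /
          sum(rate(spiced_query_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
      # Any acceleration refresh failure in the last 30 minutes
      # (assumes failures are labelled result="error").
      - alert: SpiceAccelerationRefreshFailing
        expr: increase(spiced_acceleration_refresh_total{result="error"}[30m]) > 0
        labels:
          severity: ticket
      # (SpicepodCluster) No executors registered with the scheduler.
      - alert: SpiceClusterNoExecutors
        expr: spiced_cluster_executor_count < 1
        for: 5m
        labels:
          severity: page
      # (SpicepodCluster) Leaf certificate expires in under 7 days.
      - alert: SpiceClusterCertificateExpiringSoon
        expr: spiced_cluster_certificate_expiry_seconds < 7 * 24 * 3600
        labels:
          severity: ticket
```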

Health endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /health | Liveness. Returns 200 OK when the process is responsive. Use for container liveness probes. |
| GET /v1/ready | Readiness. Returns 200 OK only once datasets and accelerations have completed initial load. Use for container readiness probes and load balancer health checks. |
| GET /metrics | Prometheus metrics on port 9090. Should never be exposed externally. |

The Spice Helm chart and the operator configure liveness and readiness probes against these endpoints by default. For workloads with a heavy initial load, tune the probe parameters via the probes setting.
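
For orientation, the fragment below shows the standard Kubernetes probe fields that such tuning ultimately controls. It is a generic container-spec sketch, not the chart's values schema: how the chart's probes values map onto these fields, the port name `http`, and all timing values are assumptions.

```yaml
# Fragment of a Kubernetes container spec (port name and timings are assumed)
livenessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3        # restart after about 30s of failed checks
readinessProbe:
  httpGet:
    path: /v1/ready
    port: http
  initialDelaySeconds: 30    # skip checks while the initial load has barely started
  periodSeconds: 10
  timeoutSeconds: 5
```

Because /v1/ready stays non-200 until the initial load completes, a slow load keeps the pod out of rotation without restarting it, provided the liveness probe targets /health.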
