
# Observability

Spice.ai Enterprise exposes Prometheus metrics, structured JSON logs, and OpenTelemetry traces. This page describes how to wire each signal into a production observability stack and which signals to alert on.

## Metrics

The runtime exposes Prometheus metrics on port `9090` at `/metrics`. The Helm chart and the Kubernetes operator both ship with first-class scrape integration:

* **Helm chart**: set `monitoring.podMonitor.enabled: true` to deploy a `PodMonitor`.
* **Operator**: set `servicemonitor.enabled: true` to deploy a `ServiceMonitor` for the operator itself.

```yaml
# values.yaml (Spice Helm chart)
monitoring:
  podMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
```
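Where the Prometheus Operator CRDs (`PodMonitor`, `ServiceMonitor`) are not installed, the same endpoint can be scraped with a plain Prometheus job. The fragment below is a minimal sketch, not part of the Spice charts: the job name and target host are hypothetical, and only the port and path come from this page.

```yaml
# prometheus.yml fragment (sketch; job name and target host are assumptions)
scrape_configs:
  - job_name: spiceai                 # hypothetical job name
    metrics_path: /metrics            # path documented above
    scrape_interval: 30s
    static_configs:
      - targets:
          - spiced.example.svc:9090   # hypothetical host; 9090 is the metrics port
```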

### Key metrics

| Metric                                         | Type      | Meaning                                                                              |
| ---------------------------------------------- | --------- | ------------------------------------------------------------------------------------ |
| `spiced_query_duration_seconds`                | Histogram | End-to-end query latency.                                                            |
| `spiced_query_total{result="error"}`           | Counter   | Failed queries. Drives the [error rate alert](#alerts).                              |
| `spiced_acceleration_refresh_duration_seconds` | Histogram | Per-dataset acceleration refresh latency.                                            |
| `spiced_acceleration_refresh_total{result}`    | Counter   | Refresh successes and failures per dataset.                                          |
| `spiced_acceleration_rows`                     | Gauge     | Row count per accelerated dataset.                                                   |
| `spiced_http_request_duration_seconds`         | Histogram | HTTP API latency by route.                                                           |
| `spiced_flight_request_duration_seconds`       | Histogram | Arrow Flight RPC latency.                                                            |
| `spiced_cluster_executor_count`                | Gauge     | (`SpicepodCluster`) Number of executors registered with the scheduler.               |
| `spiced_cluster_certificate_expiry_seconds`    | Gauge     | (`SpicepodCluster`) Time-to-expiry for the per-node leaf certificate.                |
| `spiceai_operator_reconcile_duration_seconds`  | Histogram | Operator reconcile loop latency.                                                     |
| `spiceai_operator_pod_dead_total`              | Counter   | Dead pod observations. Triggers crashloop pause when above the configured threshold. |

For the full list, query a running runtime: `curl -s localhost:9090/metrics | grep -E '^# HELP'`.
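Dashboards and alerts tend to re-evaluate the same histogram and ratio expressions. One way to avoid that is a pair of Prometheus recording rules that precompute them. The sketch below assumes the metric names in the table above; the group name and the `record` names are hypothetical and simply follow the Prometheus `level:metric:operations` naming convention.

```yaml
# recording-rules.yaml (sketch; group and record names are assumptions)
groups:
  - name: spiceai-recording
    interval: 30s
    rules:
      # p95 end-to-end query latency over a 5-minute window
      - record: spiceai:query_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(spiced_query_duration_seconds_bucket[5m])
          ))
      # Fraction of queries that failed over a 5-minute window
      - record: spiceai:query_errors:ratio_rate5m
        expr: |
          sum(rate(spiced_query_total{result="error"}[5m]))
            /
          sum(rate(spiced_query_total[5m]))
```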

## Grafana dashboard

Spice.ai publishes a maintained Grafana dashboard with the panels operations teams need most often (query rate / latency / errors, acceleration freshness and row counts, executor registration, certificate expiry).

Import via dashboard ID or copy the JSON from the [Spice.ai Grafana dashboard](https://spiceai.org/docs/monitoring/grafana). The dashboard is compatible with Prometheus, Amazon Managed Prometheus, Azure Managed Prometheus, and Google Cloud Managed Service for Prometheus.

## Logs

Spice emits structured JSON logs on stdout. The log level is controlled by `SPICED_LOG`:

| Level   | Use                                                                |
| ------- | ------------------------------------------------------------------ |
| `ERROR` | Production default.                                                |
| `WARN`  | Production with elevated visibility.                               |
| `INFO`  | Default during cutover and incident investigation.                 |
| `DEBUG` | Development and targeted debugging only; not for production.       |

```yaml
spec:
  env:
    - name: SPICED_LOG
      value: INFO
```

### Log routing

| Destination              | Recommended forwarder                                                                                         |
| ------------------------ | ------------------------------------------------------------------------------------------------------------- |
| **CloudWatch Logs**      | Fluent Bit DaemonSet with the `cloudwatch_logs` output plugin.                                                |
| **Azure Monitor**        | [Container insights](https://learn.microsoft.com/azure/azure-monitor/containers/container-insights-overview). |
| **Google Cloud Logging** | The GKE-managed logging agent (default on Autopilot / Standard).                                              |
| **Datadog**              | Datadog Agent with the Kubernetes integration enabled.                                                        |
| **Elastic / OpenSearch** | Filebeat or Fluent Bit with the `elasticsearch` / `opensearch` output.                                        |
| **Loki**                 | [Grafana Alloy](https://grafana.com/docs/alloy/) or Promtail.                                                 |

Always retain query-error and acceleration-refresh-failure log lines for at least 30 days for incident review.
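As an illustration of the CloudWatch Logs row, the following is a minimal sketch of a Fluent Bit pipeline in its YAML configuration format. It assumes container logs are written under `/var/log/containers/` in CRI format; the region, log group name, and stream prefix are hypothetical placeholders, and retention for the log group is configured in CloudWatch rather than here.

```yaml
# fluent-bit.yaml (sketch; region, group, and prefix are assumptions)
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: cri            # assumes a CRI container runtime
  filters:
    - name: kubernetes
      match: kube.*
      merge_log: on                    # lift the JSON log body into structured fields
  outputs:
    - name: cloudwatch_logs
      match: kube.*
      region: us-east-1                # hypothetical region
      log_group_name: /spiceai/prod    # hypothetical log group
      log_stream_prefix: spiced-       # hypothetical stream prefix
      auto_create_group: true
```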

## Distributed tracing

When the runtime is started with `--otel-endpoint` or the `SPICED_OTEL_ENDPOINT` environment variable, Spice exports OpenTelemetry traces over OTLP/gRPC. Traces cover query parsing, optimization, and execution; for `SpicepodCluster`, scheduler-to-executor RPCs are linked into a single trace.

```yaml
spec:
  env:
    - name: SPICED_OTEL_ENDPOINT
      value: http://otel-collector.observability:4317
    - name: SPICED_OTEL_RESOURCE_ATTRIBUTES
      value: service.name=spiceai,deployment.environment=prod
```

Pair with an OpenTelemetry Collector configured for the organization's tracing backend (Tempo, Jaeger, Datadog, Honeycomb, X-Ray).
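A minimal Collector configuration for this setup might look like the sketch below. It receives OTLP/gRPC on port `4317`, matching the `SPICED_OTEL_ENDPOINT` value above, and forwards traces over OTLP. The backend address is a hypothetical placeholder, and disabling TLS is an assumption that only suits in-cluster traffic.

```yaml
# otel-collector config (sketch; exporter endpoint is an assumption)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317          # port used by SPICED_OTEL_ENDPOINT above
processors:
  batch: {}                             # batch spans before export
exporters:
  otlp:
    endpoint: tempo.observability:4317  # hypothetical tracing backend
    tls:
      insecure: true                    # assumption: plaintext inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```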

## Alerts

The following alerts are the minimum recommended set for production. Tune thresholds against the deployment's observed baseline.

```yaml
groups:
  - name: spiceai
    interval: 30s
    rules:
      - alert: SpiceAIQueryErrorRateHigh
        expr: |
          sum(rate(spiced_query_total{result="error"}[5m]))
            /
          sum(rate(spiced_query_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Query error rate above 5% for 10 minutes."

      - alert: SpiceAIAccelerationRefreshFailing
        expr: |
          increase(spiced_acceleration_refresh_total{result="error"}[15m]) > 3
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Acceleration refresh failing for {{ $labels.dataset }}."

      - alert: SpiceAIQueryLatencyP95
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(spiced_query_duration_seconds_bucket[5m])
          )) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 query latency above 2s."

      - alert: SpiceAIExecutorMissing
        expr: |
          spiced_cluster_executor_count < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.cluster }} has fewer than 2 executors registered."

      - alert: SpiceAICertificateExpiringSoon
        expr: |
          spiced_cluster_certificate_expiry_seconds < 7 * 24 * 3600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cluster certificate expires in less than 7 days."

      - alert: SpiceAICrashLoop
        expr: |
          increase(spiceai_operator_pod_dead_total[15m]) > 5
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.spicepodset }} crashlooping; operator may pause it."
```
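Threshold changes are easier to review when the rules are covered by a `promtool` unit test. The sketch below exercises `SpiceAIExecutorMissing` and assumes the rule group above is saved as `alerts.yaml` (a hypothetical file name) and that the metric carries a `cluster` label, as the alert's annotation implies. Run it with `promtool test rules alerts_test.yaml`.

```yaml
# alerts_test.yaml (sketch; file names and the "demo" label value are assumptions)
rule_files:
  - alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One executor registered, sampled every minute from 0m to 15m
      - series: 'spiced_cluster_executor_count{cluster="demo"}'
        values: '1x15'
    alert_rule_test:
      # After 10 minutes the 5-minute `for` window has elapsed, so the alert fires
      - eval_time: 10m
        alertname: SpiceAIExecutorMissing
        exp_alerts:
          - exp_labels:
              severity: critical
              cluster: demo
            exp_annotations:
              summary: "demo has fewer than 2 executors registered."
```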

## Health endpoints

| Endpoint        | Purpose                                                                                                                                                           |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `GET /health`   | Liveness. Returns `200 OK` when the process is responsive. Use for container liveness probes.                                                                     |
| `GET /v1/ready` | Readiness. Returns `200 OK` only once datasets and accelerations have completed initial load. Use for container readiness probes and load balancer health checks. |
| `GET /metrics`  | Prometheus metrics on port `9090`. Should never be exposed externally.                                                                                            |

The Spice Helm chart and the operator configure liveness and readiness probes against these endpoints by default. For workloads with a heavy initial load, tune the probe parameters via [`probes`](/docs/enterprise/kubernetes-operator/spicepodset.md):

```yaml
spec:
  probes:
    readiness:
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 6
    liveness:
      initialDelaySeconds: 60
      periodSeconds: 30
      failureThreshold: 5
```

### Gating readiness on executor availability (scheduler role)

In distributed (`SpicepodCluster`) deployments a scheduler can finish loading its own datasets and accelerations before enough executors have connected to actually serve queries. To keep a scheduler out of rotation until the cluster has capacity, `GET /v1/ready` accepts two optional executor gates. They apply only to the **scheduler** role; supplying a non-zero value on a non-scheduler node returns `400`.

| Query parameter               | Description                                                                                                                                                                                                         |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `min_ready_executors`         | Minimum number of currently-ready executors required for the probe to succeed. "Ready" means the scheduler holds a live FlightSQL client for the executor (i.e. it can route queries to it). `0` disables the gate. |
| `min_ready_executors_percent` | Minimum percentage (`0`–`100`) of ready executors relative to the executors currently registered (control stream open). `0` disables the gate. Values above `100` return `400`.                                     |
| `verbose`                     | When `true`, the response body becomes a multi-line diagnostic listing the result of each gate. The HTTP status code is unchanged — useful for `kubectl describe` and `curl` debugging.                             |

When both gates are supplied, **both** must pass for the probe to return `200 OK`. If a gate is not yet satisfied, the endpoint returns `503`. An invalid parameter (for example, a percentage outside `0`–`100`, or a non-zero gate requested outside the scheduler role) returns `400`. Executor gating never bypasses the base check: datasets and accelerations must still have completed their initial load.

Pass the gates as query parameters on the readiness probe path:

```yaml
readinessProbe:
  httpGet:
    path: /v1/ready?min_ready_executors=3&min_ready_executors_percent=80
    port: 8090
```
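While tuning the gates, it can help to add `verbose=true` so that a failing probe reports which gate was not met. The variant below is a sketch that reuses the parameters documented above; the `periodSeconds` and `failureThreshold` values are illustrative and mirror the readiness settings shown earlier on this page.

```yaml
readinessProbe:
  httpGet:
    # Same gates as above, plus the diagnostic body; the status code is unchanged
    path: /v1/ready?min_ready_executors=3&min_ready_executors_percent=80&verbose=true
    port: 8090
  periodSeconds: 10       # illustrative value
  failureThreshold: 6     # illustrative value
```

As a worked example, with 5 executors registered and 4 of them ready, the count gate passes (4 is at least 3) and the percentage gate passes (4 of 5 is exactly 80%), so the probe returns `200 OK` once the initial load is also complete. With only 3 of 5 ready, the count gate still passes but the percentage gate does not (60%), so the probe returns `503`.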

