Get started with the Spice.ai Cloud Platform in 5 mins.
Welcome to the Spice.ai Cloud Platform documentation!
The Spice.ai Cloud Platform is an AI application and agent cloud; an AI-backend-as-a-service comprising composable, ready-to-use AI and agent building blocks including high-speed SQL query, LLM inference, Vector Search, and RAG, built on cloud-scale, managed Spice.ai OSS.
With the Spice.ai Cloud Platform, powered by Spice.ai OSS, you can:
Query and accelerate data: Run high-performance SQL queries across multiple data sources with results optimized for AI applications and agents.
Use AI Models: Perform large language model (LLM) inference with major providers including OpenAI, Anthropic, and Grok for chat, completion, and generative AI workflows.
Collaborate: Share, fork, and manage datasets, models, embeddings, evals, and tools in a collaborative, community-driven hub.
Use-Cases
Fast, virtualized data views: Build specialized “small data” warehouses to serve fast, virtualized views across large datasets for applications, APIs, dashboards, and analytics.
Performance and reliability: Manage replicas of hot data, cache SQL queries and AI results, and load-balance AI services to improve resiliency and scalability.
Production-grade AI workflows: Use Spice.ai Cloud as a data and AI proxy for secure, monitored, and compliant production environments, complete with advanced observability and performance management.
Take it for a spin by starting with the getting-started steps below.
Create your first Spice app
Once logged in, you will be redirected to the new application page. Set a name, add a model provider, and optionally select one of the ready-to-use datasets.
Enter a name for the application.
Optionally choose a model provider, such as OpenAI, and provide an API key.
Optionally select one or more of the available datasets. Datasets can also be added later.
Learn more about building AI applications and agents with the Spice.ai Cloud Platform.


Click Create application.
It will take up to 30 seconds to create and provision a dedicated Spice.ai instance for the application.
Once the application instance is deployed and ready, you will be redirected to the Playground.
Executing the show tables SQL query will show the default datasets available for the app.
🎉 Congrats, you've created your first Spice app!
Continue to Step 3 to add a dataset and query it.

Go to spice.ai and click on Login or Try for Free in the top right corner.
You can also navigate directly by URL to spice.ai/login
Click Continue with GitHub to login with your GitHub account.
Click Authorize Spice.ai Cloud Platform.
You will be redirected to the new application page.
Continue to Step 2 to configure your first Spice application.
Frequently asked questions
Spice.ai OSS is an open-source project created by the Spice AI team that provides a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
The Spice.ai Cloud Platform is a data and AI application platform that provides a set of building-blocks to create AI and agentic applications. Building blocks include a cloud-data-warehouse, ML model training and inference, and a cloud-scale, managed Spice.ai OSS cloud-hosted service.
It's free to use the Spice.ai Cloud Platform.
For customers who need resource limits, service-level guarantees, or priority support, we offer paid plans based on usage.
We offer enterprise-grade support with an SLA.
For standard plans, we offer community support in Discord.
The Spice.ai Cloud Platform is SOC 2 Type II compliant.
Spice.ai OSS is built on Apache DataFusion and uses the PostgreSQL dialect.








Use Spice for Enterprise Search and Retrieval
Enterprises face the challenge of accessing data from various disparate and legacy systems to provide AI with comprehensive context. Speed is crucial for this process to be effective.
Spice offers a fast knowledge index into both structured and unstructured data, enabling efficient vector similarity search across multiple data sources. This ensures that AI applications have access to the necessary data for accurate and timely responses.
In the Components sidebar, click the Datasets tab.
Select and add the NYC Taxi Trips dataset.
Note that the configuration has been added to the editor.
Click Save in the code toolbar and then Deploy on the popup card that appears in the bottom right.
Navigate to the Playground tab, open the dataset reference, and click on the spice.samples.taxi_trips dataset to insert a sample query into the SQL editor. Then, click Run Selection.
Go to the app Settings and copy one of the app API Keys.
Replace [API-KEY] in the sample below with your API Key and execute from a terminal.
🎉 Congratulations, you've now added a dataset and queried it.
Continue to Step 4 to add an AI Model and chat with the dataset.
Add an OpenAI model and chat with the NYC Taxi Trips dataset
Navigate to the Code tab.
In the Components sidebar, click the Model Providers tab, and select OpenAI.
Enter the Model name.
Enter the Model ID (e.g., gpt-4o).
Set the OpenAI API Key secret.
API keys and other secrets are securely stored and encrypted.
Insert tools: auto in the params section of the gpt-4o Model to automatically connect datasets to the model.
The final Spicepod configuration in the editor should be as follows:
Click Save in the code toolbar and then Deploy in the popup card that appears in the bottom right to deploy the changes.
Navigate to Playground and select AI Chat in the sidebar.
Ask a question about the NYC Taxi Trips dataset in the chat. For example:
Replace [API-KEY] in the sample below with the app API Key and execute in a terminal.
🎉 Congratulations, you've now added an OpenAI model and can use it to ask questions of the NYC Taxi Trips dataset.
Continue on to explore use cases and do more with the Spice.ai Cloud Platform.
The In-Memory Arrow Data Accelerator is the default data accelerator in Spice. It uses Apache Arrow to store data in-memory for fast access and query performance.
To use the In-Memory Arrow Data Accelerator, no additional configuration is required beyond enabling acceleration.
Example:
However, Arrow can be specified explicitly by setting arrow as the engine for acceleration.
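For example, a minimal sketch of an acceleration block (the dataset path and name are placeholders), with the engine stated explicitly even though Arrow is the default:
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      enabled: true
      engine: arrow # optional; arrow is the default accelerator engine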
Memory Data Connector Documentation
Instructions for using models hosted on the Spice Cloud Platform with Spice.
A light or dark mode portal theme can be set:
Click the profile picture on the top right corner of the portal interface.
Select Light, Dark, or System mode using the theme toggle.
Each new app deployment automatically retrieves the most recent stable Spice OSS release. Visit the Spice OSS release notes to check for the latest runtime updates.
All Spice Apps are now powered by the latest, next-generation Spice.ai Open Source data and AI engine. Existing apps have been migrated but require a manual setup step to connect datasets and/or model providers.
Learn More: Read the blog post for details on this upgrade, and visit the Spice.ai Cookbook for over 50 quickstarts and examples.
curl --request POST \
--url 'https://data.spiceai.io/v1/sql' \
--header 'Content-Type: text/plain' \
--header 'X-API-KEY: [API-KEY]' \
--data 'select * from spice.samples.taxi_trips limit 3'
Spice provides a drop-in solution that offers a single, unified endpoint to multiple data systems without requiring changes to the application.
Maintain local replicas of data with the application to significantly enhance application resilience and availability.
Create a materialization layer for visualization tools like Power BI, Tableau, or Superset to achieve faster, more responsive dashboards without incurring massive compute costs.







Use Spice for Retrieval-Augmented-Generation (RAG)
Use Spice to access data across various data sources for Retrieval-Augmented-Generation (RAG).
Spice enables developers to combine structured data via SQL queries and unstructured data through built-in vector similarity search. This combined data can then be fed to large language models (LLMs) through a native AI gateway, enhancing the models' ability to generate accurate and contextually relevant responses.
"What is the average fare amount of a taxi trip?"



datasets:
- from: memory:store
name: llm_memory
mode: read_write
columns:
- name: value
embeddings: # Easily make your LLM learnings searchable.
- from: all-MiniLM-L6-v2
embeddings:
- name: all-MiniLM-L6-v2
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Examples:
spice.ai/lukekim/smart/models/drive_stats:latest: Refers to the latest version of the drive_stats model in the smart application by the user or organization lukekim.
spice.ai/lukekim/smart/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf: Specifies a model with a unique training run ID.
Prefix (Optional): The value must start with spice.ai/.
Organization/User: The name of the organization or user (org) hosting the model.
Application Name: The name of the application (app) which the model belongs to.
Model Name: The name of the model (model).
Version (Optional): A colon (:) followed by the version identifier (version), which could be a semantic version, latest for the most recent version, or a specific training run ID.
name: my-first-app
kind: Spicepod
version: v1beta1
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: samples.taxi_trips
description: Taxi trips dataset from Spice.ai demo datasets.
params:
file_format: parquet
models:
- from: openai:gpt-4o
name: gpt-4o
params:
endpoint: https://api.openai.com/v1
openai_api_key: ${secrets:OPENAI_API_KEY}
tools: auto
curl --request POST \
--url 'https://data.spiceai.io/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: 31393037|8f2f6125e7b8487f80964041c123d3c3' \
--data '{ "messages": [{ "role": "user", "content": "Hello!" }], "model": "gpt-4o" }'models:
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats
name: drive_stats
datasets:
- drive_stats_inferencing
models:
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:latest # Label
name: drive_stats_a
datasets:
- drive_stats_inferencing
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf # Training Run ID
name: drive_stats_b
datasets:
- drive_stats_inferencing
\A(?:spice\.ai\/)?(?<org>[\w\-]+)\/(?<app>[\w\-]+)(?:\/models)?\/(?<model>[\w\-]+):(?<version>[\w\d\-\.]+)\z
Limitations
The In-Memory Arrow Data Accelerator does not support persistent storage. Data is stored in-memory and will be lost when the Spice runtime is stopped.
The In-Memory Arrow Data Accelerator does not support Decimal256 (76 digits), as it exceeds Arrow's maximum Decimal width of 38 digits.
The In-Memory Arrow Data Accelerator does not support .
The In-Memory Arrow Data Accelerator only supports primary-key constraints, not unique constraints.
With Arrow acceleration, mathematical operations like value1 / value2 are treated as integer division if the values are integers. For example, 1 / 2 will result in 0 instead of the expected 0.5. Use casting to FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT) (or CAST(1 AS FLOAT) / 2).
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
enabled: true
Navigate to the Spicepod tab and click on the Logs for the selected instance.


Flight SQL Data Connector Documentation
Connect to any Flight SQL compatible server (e.g. Influx 3.0, CnosDB, other Spice runtimes!) as a connector for federated SQL queries.
- from: flightsql:my_catalog.good_schemas.cool_dataset
name: cool_dataset
params:
flightsql_endpoint: http://127.0.0.1:50051
flightsql_username: spicy
flightsql_password: ${secrets:my_flightsql_pass}
from
The from field takes the form flightsql:dataset where dataset is the fully qualified name of the dataset to read from.
name
The dataset name. This will be used as the table name within Spice.
params
Federated SQL Query documentation
Spice supports federated queries, enabling you to join and combine data from multiple sources, including databases (PostgreSQL, MySQL), data warehouses (Databricks, Snowflake, BigQuery), and data lakes (S3, MinIO). For a full list of supported sources, see Data Connectors.
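As an illustrative sketch only (the dataset names, connection parameters, and join keys below are hypothetical), a Spicepod can register tables from two different sources, and a single SQL query can then join them:
datasets:
  - from: postgres:public.orders        # hypothetical PostgreSQL table
    name: orders
    params:
      pg_host: localhost
      pg_db: shop
      pg_pass: ${secrets:pg_pass}
  - from: s3://my-bucket/customers/     # hypothetical S3 path
    name: customers
    params:
      file_format: parquet
-- federated query joining the PostgreSQL table with the S3 dataset
SELECT c.name, SUM(o.total) AS total_spend
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name;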
The Playground SQL Explorer is the fastest way to get started with federated queries, debugging queries, and iterating quickly. The SQL Query Editor can be accessed by clicking on the SQL Explorer tab after selecting Playground in the app navigation bar.
See for further documentation on using the SQL Query Editor.
For production applications, leveraging the high-performance endpoint is recommended. The Spice SDKs always query using Arrow Flight.
See for further documentation on using Apache Arrow Flight APIs.
SQL Query is also accessible via a standard HTTP API.
See for further documentation on using the HTTP SQL API.
Instructions for using Azure OpenAI models
To use a language model hosted on Azure OpenAI, specify the azure path in the from field and the following parameters from the Azure OpenAI Model Deployment page:
azure_api_key
The Azure OpenAI API key from the models deployment page.
azure_api_version
The API version used for the Azure OpenAI service.
Only one of azure_api_key or azure_entra_token can be provided for model configuration.
Example:
Refer to the Azure OpenAI documentation for more details on available models and configurations.
Follow the quickstart to try Azure OpenAI models for vector-based search and chat functionalities with structured (taxi trips) and unstructured (GitHub files) data.
AI Gateway documentation
Spice provides a high-performance, OpenAI API-compatible AI Gateway optimized for managing and scaling large language models (LLMs). Additionally, Spice offers tools for Enterprise Retrieval-Augmented Generation (RAG), such as SQL query across federated datasets and an advanced search feature (see Search).
Spice supports full OpenTelemetry observability, enabling detailed tracking of data flows and requests for full transparency and easier debugging.
Spice supports a variety of LLMs, including OpenAI, Azure OpenAI, Anthropic, Groq, Hugging Face, and more (see Model Providers for all supported models).
Custom Tools: Equip models with tools to interact with the Spice runtime.
System Prompts: Customize system prompts and override defaults.
For detailed configuration and API usage, refer to the API documentation.
To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from.
Example spicepod.yml:
For details, see .
Define semantic data models in Spice to improve dataset understanding for AI
A semantic model is a structured representation of data that captures the meaning and relationships between elements in a dataset.
In Spice, semantic models transform raw data into meaningful business concepts by defining metadata, descriptions, and relationships at both the dataset and column level. This makes the data more interpretable for both AI language models and human analysis.
The semantic model is automatically used by language models as context to produce more accurate and context-aware AI responses.
Semantic data models are defined within the spicepod.yaml file, specifically under the datasets section. Each dataset supports description, metadata, and a columns field where individual columns are described with metadata and features for utility and clarity.
Example spicepod.yaml:
Datasets can be defined with the following metadata:
instructions: Optional. Instructions to provide to a language model when using this dataset.
reference_url_template: Optional. A URL template for citation links.
For detailed metadata configuration, see the Spice OSS documentation.
Each column in the dataset can be defined with the following attributes:
description: Optional. A description of the column's contents and purpose.
embeddings: Optional. Vector embeddings configuration for this column.
For detailed columns configuration, see the Spice OSS documentation.
Instructions for using xAI models
To use a language model hosted on xAI, specify the xai path in the from field and the associated xai_api_key parameter:
xai_api_key
The xAI API key.
-
Example:
models:
- from: xai:grok-2-latest
name: xai
params:
xai_api_key: ${secrets:SPICE_GROK_API_KEY}
Refer to the xAI documentation for more details on available models and configurations.
Apache Spark Connector Documentation
Apache Spark as a connector for federated SQL query against a Spark Cluster using Spark Connect
datasets:
- from: spark:spiceai.datasets.my_awesome_table
name: my_table
params:
spark_remote: sc://my-spark-endpoint
spark_remote
A Spark Connect remote connection URI. Refer to the Spark Connect client connection string documentation for URI parameters.
Correlated scalar subqueries are only supported in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands.
The Spark connector does not yet support streaming query results from Spark.
Instructions for using language models hosted on Anthropic with Spice.
To use a language model hosted on Anthropic, specify anthropic in the from field.
To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is claude-3-5-sonnet-latest.
The following parameters are specific to Anthropic models:
Example spicepod.yml configuration:
See for a list of supported model names.
Export observability traces from Spice into Zipkin
In addition to the built-in runtime.task_history SQL table, Spice can export the observability traces it collects into Zipkin.
Zipkin export is defined in the spicepod.yaml under the runtime.tracing section:
runtime:
tracing:
zipkin_enabled: true
zipkin_endpoint: http://localhost:9411/api/v2/spans
zipkin_enabled: Optional. Default false. Enables or disables the Zipkin trace export.
zipkin_endpoint: Required if zipkin_enabled is true. The path to the /api/v2/spans endpoint on the Zipkin instance to export to.
Localpod Data Connector Documentation
The Localpod Data Connector enables setting up a parent/child relationship between datasets in the current Spicepod. This can be used for configuring multiple/tiered accelerations for a single dataset, and ensuring that the data is only downloaded once from the remote source. For example, you can use the localpod connector to create a child dataset that is accelerated in-memory, while the parent dataset is accelerated to a file.
The dataset created by the localpod connector will logically have the same data as the parent dataset.
The localpod connector supports synchronized refreshes, which ensures that the child dataset is refreshed from the same data as the parent dataset. Synchronized refreshes require that both the parent and child datasets are accelerated with refresh_mode: full (which is the default).
When synchronization is enabled, the following logs will be emitted:
Publish app to https://spicerack.org
Public Spice Apps can be forked, added as a dependency, or connected to using the Spice.ai data connector in Spice OSS.
To make your app public, go to your app settings and click Make public.
After that, the app will be visible to all users at https://spice.ai/<org-name>/<app-name> and searchable at spicerack.org.
Query with SQL via the HTTP API
Data may be queried by posting SQL to the /v1/sql API, or to the /v1/firesql API for Firecached data. For documentation on the Spice Firecache, see the Firecache documentation.
See Tables for a list of tables to query or browse the example queries listed in the menu.
An API key is required for all SQL queries.
Results are limited to 500 rows. Use the to fetch up to 1M rows in a single query or the to fetch results with paging.
Requests are limited to 90 seconds.
The @spiceai/spice SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
The Client.query function takes an optional onData callback that will be passed partial results as they become available.
public async query(
queryText: string,
onData: ((data: Table) => void) | undefined = undefined
): Promise<Table>
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient(process.env.API_KEY);
const query = `
SELECT s_suppkey, s_name
FROM tpch.supplier
`
const allSuppliers = await spiceClient.query(query)
allSuppliers.toArray().forEach((row) => {
processSupplier(row.toJSON())
});
This call will wait for the promise returned by query() to complete, returning all 10,000 suppliers.
Alternatively, data can be processed as it is streamed to the SDK. Provide a callback function to the onData parameter, which will be called with every partial set of data streamed to the SDK:
Configuring Spice.ai runtime for your Spice application
Navigate to Settings -> Runtime to configure the runtime settings for your Spice application.
The runtime version determines the Spice.ai Open Source version for your Spice application. Each new deployment automatically adopts the latest stable version of the Spice runtime to ensure access to the most recent features and optimizations.
The runtime region specifies the geographic location of the data center hosting your Spice application. Region selection optimizes latency, compliance, and performance based on your business needs.
Availability: Region selection is exclusive to Enterprise plan customers.
Supported Regions:
North America:
Compute settings define the resource allocation for your Spice application, balancing performance and cost.
Standard Compute Instances:
Developer: 2 CPU / 4 GB
Pro for Teams: 4 CPU / 8 GB
Enterprise: Dedicated Instances with multiple high-availability replicas
Provides persistent storage for the runtime to save data acceleration files. Data remains intact across restarts and redeployments.
Availability: Storage is exclusive to Enterprise plan customers.
Mount Path: /data.
Size: Configured per request. Contact your account executive to set or update capacity.
First-class, built-in observability to understand the operations Spice performs.
Observability in Spice enables task tracking and performance monitoring through a built-in distributed tracing system that can be exported (for example, to Zipkin) or viewed via the runtime.task_history SQL table.
Spice records detailed information about runtime operations through trace IDs, timings, and labels - from SQL queries to AI completions. This task history system helps operators monitor performance, debug issues, and understand system behavior across individual requests and overall patterns.
SQL Query (Cloud Data Warehouse) API
The SQL Query API provides powerful capabilities for querying data managed by the Spice.ai Cloud Data Warehouse and connected external data sources using federated SQL queries. Results can be fetched through either the high-performance Apache Arrow Flight API or a standard HTTP API.
Instructions for using language models hosted on Perplexity with Spice.
To use a language model hosted on Perplexity, specify perplexity in the from field.
To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is sonar.
The following parameters are specific to Perplexity models:
Each Spice app has two pre-generated API keys, which can be used with the platform's APIs and SDKs.
Select Spice app and navigate to Settings -> General.
Click the API Key 1 or API Key 2 field to copy the key value.
To update a Spice app's spicepod.yaml, navigate to the Code tab. Use the Components sidebar to add data connectors, model providers, and preconfigured datasets, or manually edit spicepod.yaml in the code editor.
After saving the spicepod changes, a new deployment must be triggered. Learn more about Spicepod Deployments.
The spicepy SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
spicepy enables streaming through the use of the read_chunk() method on the FlightStreamReader.
The object returned from spicepy.Client.query() is a FlightStreamReader.
Calling read_pandas() on the FlightStreamReader will wait for the stream to return all of the data before returning a pandas DataFrame.
Use the Playground's SQL editor to easily explore data
Continuously identify and fix issues by tracking process actions with Spice monitoring and request logs.
Under the Monitoring tab, track your app requests, their status codes, and duration, in addition to the existing usage monitoring metrics dashboard.
Monitor the success, failures, and durations of SQL queries, AI completions, Vector Searches, Embedding calculations, and accelerated dataset refreshes.
App Secrets are key-value pairs that are passed to the Spice Runtime instance as environment secrets. Secrets are securely encrypted and accessible only through the app in which they were created.
Once a secret is saved, its value cannot be retrieved through Spice Cloud. If you need to update the secret value, you must delete the existing secret and create a new one.
Select your app.
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
enabled: true
engine: arrow
For production use, leverage the high-performance Apache Arrow Flight API, which is optimized for large-scale data workloads. The Spice SDKs default to querying via Arrow Flight.
• Endpoint: grpc+tls://flight.spiceai.io
• For additional details, refer to the Apache Arrow Flight API documentation.
The SQL Query API is also accessible via HTTP, offering standard integration for web applications.
• Core Endpoint: https://data.spiceai.io/v1/sql
• For more details, consult the HTTP SQL API documentation.
flightsql_endpoint
The Apache Flight endpoint used to connect to the Flight SQL server.
flightsql_username
Optional. The username to use in the underlying Apache flight Handshake Request to authenticate to the server (see reference).
flightsql_password
Optional. The password to use in the underlying Apache flight Handshake Request to authenticate to the server. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_flightsql_pass}.








US East (Ohio) - us-east-2 (AWS)
US West (Oregon) - us-west-2 (AWS)
US East (N. Virginia) - us-east (Azure)





azure_deployment_name
The name of the model deployment. Defaults to the model name.
endpoint
The Azure OpenAI resource endpoint, e.g., https://resource-name.openai.azure.com.
azure_entra_token
The Azure Entra token for authentication.
anthropic_api_key
The Anthropic API key.
endpoint
The Anthropic API base endpoint. Default: https://api.anthropic.com/v1.
models:
- from: anthropic:claude-3-5-sonnet-latest
name: claude_3_5_sonnet
params:
anthropic_api_key: ${ secrets:SPICE_ANTHROPIC_API_KEY }
2024-10-28T15:45:24.220665Z INFO runtime::datafusion: Localpod dataset test_local synchronizing refreshes with parent table test
datasets:
- from: postgres:cleaned_sales_data
name: test
params: ...
acceleration:
enabled: true # This dataset will be accelerated into a DuckDB file
engine: duckdb
mode: file
refresh_check_interval: 10s
- from: localpod:test
name: test_local
acceleration:
enabled: true # This dataset accelerates the parent `test` dataset into in-memory Arrow records and is synchronized with the parent
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient(process.env.API_KEY);
const query = `
SELECT s_suppkey, s_name
FROM tpch.supplier
`
await spiceClient.query(query, (partialData) => {
partialData.toArray().forEach((row) => {
processSupplier(row.toJSON())
});
})
The read_chunk() method on FlightStreamReader returns a FlightStreamChunk, which has a data attribute that is a RecordBatch. Once we have the RecordBatch, we can call to_pandas() on it to return the partial data as a pandas DataFrame. When the stream has ended, calling read_chunk() will raise a StopIteration exception that we can catch.
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
This call will return a pandas DataFrame with all 10,000 suppliers, and is a synchronous call that waits for all data to arrive before returning.
Alternatively, to process chunks of data as they arrive instead of waiting for all data to arrive, FlightStreamReader supports reading chunks of data as they become available with read_chunk(). Using the same query example above, but processing data chunk by chunk:
params.http_url (string, optional): URL of the HTTP endpoint to use
params.flight_url (string, optional): URL of the Flight endpoint to use (default: localhost:50051, using the local Spice Runtime)
Default connection to local Spice Runtime:
Connect to Spice.AI Cloud Platform:
Or using shorthand:
queryText: (string, required): The SQL query to execute
onData: (callback, optional): The callback function that is used for handling streaming data.
query returns an Apache Arrow Table.
To get the data in JSON format, iterate over each row by calling toArray() on the table and call toJSON() on each row.
Get all of the elements for a column by calling getChild(name: string) and then calling toJSON() on the result.
from
The Spice.ai Cloud Platform dataset URI. To query a dataset in a public Spice.ai App, use the format spice.ai/<org>/<app>/datasets/<dataset_name>.
models:
- from: azure:gpt-4o-mini
name: gpt-4o-mini
params:
endpoint: ${ secrets:SPICE_AZURE_AI_ENDPOINT }
azure_api_version: 2024-08-01-preview
azure_deployment_name: gpt-4o-mini
azure_api_key: ${ secrets:SPICE_AZURE_API_KEY }
models:
- from: openai:gpt-4o-mini
name: openai
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
datasets:
- name: taxi_trips
description: NYC taxi trip rides
metadata:
instructions: Always provide citations with reference URLs.
reference_url_template: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_<YYYY-MM>.parquet
columns:
- name: tpep_pickup_time
description: 'The time the passenger was picked up by the taxi'
- name: notes
description: 'Optional notes about the trip'
embeddings:
- from: hf_minilm # A defined Spice Model
chunking:
enabled: true
target_chunk_size: 512
overlap_size: 128
trim_whitespace: true>>> from spicepy import Client
>>> import os
>>> client = Client(os.environ["API_KEY"])
>>> rdr = client.query("SELECT * FROM eth.recent_blocks")
<pyarrow._flight.FlightStreamReader object at 0x1059c9980>
from spicepy import Client
client = Client(os.environ["API_KEY"])
query = """
SELECT s_suppkey, s_name
FROM tpch.supplier
"""
reader = client.query(query)
suppliers = reader.read_pandas()
reader = client.query(query)
has_more = True
while has_more:
try:
flight_batch = reader.read_chunk()
record_batch = flight_batch.data
processChunk(record_batch.to_pandas())
except StopIteration:
has_more = False
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient();
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient({
api_key: 'API_KEY',
http_url: 'https://data.spiceai.io',
flight_url: 'flight.spiceai.io:443'
});
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient('API_KEY');
const table = await spiceClient.query("SELECT * from tpch.lineitem LIMIT 10")
table.toArray().forEach((row) => {
console.log(row.toJSON());
});
const table = await client.query(
'SELECT sum(l_extendedprice) as sum_extendedprice FROM tpch.lineitem'
);
let sumExtendedPrice = table.getChild("sum_extendedprice");
console.log(sumExtendedPrice?.toJSON())
- from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
- from: spice.ai/spiceai/tpch/datasets/customer
name: tpch.customer
- from: spice.ai/spiceai/tpch/datasets/customer
name: tpch.customer
params:
spiceai_api_key: ${secrets:spiceai_api_key}
acceleration:
enabled: true
The from field must contain a valid URI to the location of a supported file. For example, http://static_username@my-http-api/report.csv.
The dataset name. This will be used as the table name within Spice.
Example:
The connector supports Basic HTTP authentication via param values.
http_port
Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
http_username
Optional. Username to provide connection for HTTP basic authentication. Default: None.
http_password
Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_http_pass}.
client_timeout
Optional. Specifies timeout for HTTP operations. Default value is 30s E.g. client_timeout: 60s
Trace AI chat completion steps and tool interactions to identify why a request isn't responding as expected
Investigate failed queries and other task errors
Track SQL query/tool use execution times
Identify slow-running tasks
Track usage patterns by protocol and dataset
Understand how AI models are using tools to retrieve data from the datasets available to them
The Spice platform provides a built-in UI for visualizing the observability traces that Spice OSS generates.
busy_timeout: Optional. Specifies the duration for the SQLite busy timeout when connecting to the database file. Default: 5000 ms.
Configuration params are provided in the acceleration section of a dataset. Other common acceleration fields can be configured for sqlite; see datasets.
LIMITATIONS
The SQLite accelerator doesn't support advanced grouping features such as ROLLUP and GROUPING.
In SQLite, CAST(value AS DECIMAL) doesn't convert an integer to a floating-point value if the casted value is an integer. Operations like CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL) will be treated as integer division, resulting in 0 instead of the expected 0.5. Use FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT).
Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will cause SQLite accelerator query federation to disable until the Runtime is restarted.
The SQLite accelerator doesn't support Arrow Interval types, as SQLite doesn't have a native interval type.
The SQLite accelerator only supports Arrow List types of primitive data types; lists with structs are not supported.
MEMORY CONSIDERATIONS
When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
perplexity_auth_token
The Perplexity API authentication token.
perplexity_*
Additional Perplexity-specific parameters to use on all requests.
Note: Like other models in Spice, Perplexity can set default overrides for OpenAI parameters. See Parameter Overrides.
models:
- name: webs
from: perplexity:sonar
params:
perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }
If the Spice App is connected to a GitHub repository (learn more about how to connect), the only way to update the Spicepod configuration is to edit the spicepod.yaml file in the root of your repository and push it to the default branch.
To apply the updated spicepod, a new deployment must be triggered. Learn more about Spicepod Deployments.
The app Code tab allows you to explore and preview files in the connected repository.


To transfer an app, click Settings in the app navigation.
In the Danger Zone section of App Settings, click the Transfer app button.
On the Transfer application page, select the New owner organization from the menu.
Type the full app name into the text box to confirm and click Transfer Application to complete the process.
The App will now be accessible by the receiving organization and its members.
Start typing a SQL command, such as SELECT * FROM
As you type, the Query Editor will suggest possible completions based on the query context. You can use the arrow keys or mouse to select a completion, and then press Enter or Tab to insert it into the editor.
Examples of using the SQL suggestions:
Select the spice.runtime.metrics table:
Type SELECT * FROM and press Tab. The editor will suggest spice.runtime.metrics as a possible table. Press Enter to insert it into the query.
Show the fields in the spice.runtime.metrics table:
Type SELECT * FROM spice.runtime.metrics WHERE. The editor will list the fields in the spice.runtime.metrics table along with their type.
The datasets reference displays all available datasets from the current app and allows you to search through them. Clicking on a dataset will insert a sample query into the SQL editor, which will be automatically selected for execution.

Track request volume, data usage, and query time at 1 hour, 24 hours, 7 days and 28 days granularity. Start by going to your app and navigating to the Monitoring tab.
Under the Monitoring tab, select Logs. You can then toggle between Metrics and Logs views.
Within Logs, you have the option to retrieve API requests from the past hour, 8 hours, 24 hours, or up to the previous three days.

Navigate to Settings tab and select Secrets section.
Fill in the Secret Name and Secret Value fields and click Add.
Saved secrets can be referenced in the Spicepod configuration as
${secrets:<SECRET_NAME>}, for example:
To apply secrets, you must initiate a new spicepod deployment. Learn more about deployments.
models:
- from: openai:gpt-4o
name: gpt-4o
params:
openai_api_key: ${secrets:OPENAI_API_KEY}
Organizations enable you to share apps, datasets, users, billing, and settings with your team. Organization administrators can set who has access to their organization's resources and data.
When you create an account on Spice.ai, a single member organization of the same name as your username is created for you and you are automatically added as a member and the owner of the organization.
Spice.ai organizations are created by connecting an existing GitHub organization to Spice.ai.
Click on the organization dropdown icon in the application selector. Next, select the Create Org option from the menu.
Check to accept the terms and conditions for the new organization, then proceed by clicking the Connect GitHub organization button.
A window will pop up from GitHub where you can select the organization to install the Spice.ai app into.
On the confirmation page proceed by clicking the Install button.
Upon successful connection, you will be automatically redirected to the newly created Spice.ai organization.
To view your organizations, click the dropdown icon from the application selector.
All organizations you have access to are listed.
Click on the first tab to access the details of your current organization or select another organization from the menu to view its information. On this page, you will see all the applications that have been created within the selected organization.
Click the Settings tab to view information about the organization, including members and billing information.
To add an existing Spice.ai user to an organization:
Navigate to the organization's settings.
Click the Add Member button.
Enter the Spice.ai username of the user you wish to add to the organization.
Click the Add Member button to confirm.
The user will be added to the organization and they will receive an email notifying them of the new membership.
To invite GitHub user to a Spice.ai organization:
Enter the GitHub username of the user you wish to invite to the organization and select the user from the search results. Only users with a public email address can be invited.
The invited user will receive an invitation link. Once they accept the invitation, they will be granted access to the organization.
To invite anyone by email
Enter the email address of the user you want to invite to the organization, then click Send invite.
To remove a member from an organization:
Navigate to the organization's settings.
Locate the user you wish to remove from the list of members.
Click the ellipsis on the right of the user's card.
Confirm the removal by clicking Remove member from organization.
Using Spice.ai for Agentic AI Applications
Build intelligent autonomous agents that act contextually by grounding AI models in secure, full-knowledge datasets with fast, iterative feedback loops.
Spice.ai helps in building intelligent autonomous agents by leveraging several key features:
Spice.ai enables federated querying across databases, data warehouses, and lakes. With advanced query push-down optimizations, it ensures efficient retrieval and processing of data across disparate sources, reducing latency and operational complexity. Learn more about Federated SQL Query. For practical implementation, refer to the Federated SQL Query recipe.
Spice.ai materializes application-specific datasets close to the point of use, reducing query and retrieval times as well as infrastructure costs. It supports Change Data Capture (CDC), keeping materialized datasets up-to-date with minimal overhead and enabling real-time, reliable data access.
Integrate AI into your applications with Spice.ai's AI Gateway. It supports hosted models like OpenAI and Anthropic and local models such as OSS Llama and NVIDIA NIM. Fine-tuning and model distillation are simplified, enabling faster cycles of development and deployment.
Spice.ai provides advanced search capabilities, including vector similarity search (VSS), enabling efficient retrieval of unstructured data, embeddings, and AI model outputs. This is critical for applications like RAG and intelligent search systems.
Built-in semantic models allow Spice.ai to align AI operations with enterprise data, ensuring that applications are grounded in contextual, full-knowledge datasets. This enhances the accuracy and reliability of AI outputs while reducing the risk of irrelevant or untrustworthy results.
Spice.ai includes robust monitoring and observability tools tailored for AI applications. These tools provide end-to-end visibility into data flows and AI workflows, LLM-specific observability to monitor model performance, track usage, and manage drift, and security and compliance auditing for data and model interactions.
Connect your Spice.ai app to a GitHub repository
Before connecting:
Admin access: Ensure you have administrative access to the GitHub repository. This level of access is required to install the Spice.ai GitHub app.
Matching repository name: The Spice.ai app and the GitHub repository names must match. For example:
Spice.ai app:
GitHub repository:
To quickly set up a new repository, use the template repository as a starting point:
Make sure to copy app spicepod.yaml contents from the Code tab and place it in the root of the repository before linking.
Ensure the repository is set up as per instructions above.
In the context of the Spice app to connect, navigate to Settings, then click the Connect repository button.
Follow GitHub App installation instructions.
Ensure that you select all repositories or specifically the repository you intend to connect.
Finally, link the repository to your Spice.ai app.
Snowflake Data Connector Documentation
The Snowflake Data Connector enables federated SQL queries across datasets in the Snowflake Cloud Data Warehouse.
from
A Snowflake fully qualified table name (database.schema.table). For instance snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM or snowflake:TAXI_DATA."2024".TAXI_TRIPS
name
The dataset name. This will be used as the table name within Spice.
params
The connector supports password-based and key-pair authentication. Login requires the Snowflake account identifier ('orgname-accountname' format).
Limitations
The account identifier does not support the legacy account locator format. Use the 'orgname-accountname' format.
The connector supports password-based and key-pair authentication.
To use DuckDB as Data Accelerator, specify duckdb as the engine for acceleration.
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: duckdb
Spice.ai currently only supports mode: memory for the DuckDB accelerator.
Configuration params are provided in the acceleration section for a data store. Other common acceleration fields can be configured for DuckDB; see datasets.
LIMITATIONS
The DuckDB accelerator does not support nested lists, or structs with nested structs/lists. For example:
Supported:
MEMORY CONSIDERATIONS
When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Dotnet SDK for Spice.ai
The Dotnet SDK spiceai is the easiest way to query Spice.ai from Dotnet.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
.Net 6.0+ or .Net Standard 2.0+
Add Spice SDK
Create a SpiceClient by providing your API key to SpiceClientBuilder. Get your free API key at spice.ai.
Execute a query and get back an Apache Arrow record batch stream.
Iterate through the reader to access the records.
Follow the quickstart to install and run spice locally.
Contribute to or file an issue with the spice-dotnet library at:
Create custom dashboards and visualizations using the FlightSQL or Infinity Grafana Plugins:
Use the for an installation reference.
In the Grafana installation, navigate to Administration -> Plugins and data -> Plugins. Select "State: All" and search for "FlightSQL" in the Search box. Select "FlightSQL" in the list of plugins.
Click the "Install" button to install the plugin. Once installed, a new data source can be added.
Click the "Add new data source" button from the FlightSQL plugin menu.
In the data source menu, provide the following options:
Host:Port - set to flight.spiceai.io:443 to connect to the Spice.ai Cloud Platform
Auth Type - set to token
Token - set to your Spice.ai app API key.
Once these options are set, click "Save & Test". Grafana should report "OK" if the configuration is correct.
Create a visualization using the FlightSQL Spice.AI data source. Select the configured datasource from the list of datasources in the visualization creator.
Create your query using the SQL builder, or switch to a plain SQL editor by clicking the "Edit SQL" button. In this example, the query retrieves the latest query execution times from the configured App instance and displays them as a line graph.
Grafana may not automatically update the visualization when changes are made to the query. To force the visualization to update with new query changes, click the "Query Inspector" button, and click "Refresh".
The Java SDK is the easiest way to query the Spice Cloud Platform from Java.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
This library supports the following Java implementations:
OpenJDK 11
OpenJDK 17
OpenJDK 21
OracleJDK 11
OracleJDK 17
OracleJDK 21
OracleJDK 22
1. Import the package.
2. Create a SpiceClient by providing your API key. Get your free API key at spice.ai.
3. Execute a query and get back a FlightStream.
5. Iterate through the FlightStream to access the records.
Check to learn more.
Follow the quickstart to install and run spice locally.
Or using custom flight address:
Check or to learn more
The SpiceClient implements connection retry mechanism (3 attempts by default). The number of attempts can be configured with withMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the spice-java library at:
Every Spice app is powered by a managed instance of the Spice OSS Runtime deployed to the platform.
A Spicepod is a package that encapsulates application-centric datasets and machine learning (ML) models.
Spicepods are analogous to code packaging systems, like NPM, however differ by expanding the concepts to data and ML models.
A Spicepod is described by a YAML manifest file, typically named spicepod.yaml, which includes the following key sections:
Metadata: Basic information about the Spicepod, such as its name and version.
Datasets: Definitions of datasets that are used or produced within the Spicepod.
Catalogs: Definitions of catalogs that are used within the Spicepod.
Models: Definitions of ML models that the Spicepod manages, including their sources and associated datasets.
Datasets in a Spicepod can be sourced from various locations, including local files or remote databases. They can be materialized and accelerated using different engines such as DuckDB, SQLite, or PostgreSQL to optimize performance ().
Catalogs in a Spicepod can contain multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored.
ML models are integrated into the Spicepod similarly to datasets. The models can be specified using paths to local files or remote locations. ML inference can be performed using the models and datasets defined within the Spicepod.
To learn more, please refer to the full .
FTP/SFTP Data Connector Documentation
FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols used for transferring files between a client and server, with FTP being less secure and SFTP providing encrypted file transfer over SSH.
The FTP/SFTP Data Connector enables federated/accelerated SQL query across files stored in FTP/SFTP servers.
Dremio Data Connector Documentation
Dremio is a data lake engine that enables high-performance SQL queries directly on data lake storage. It provides a unified interface for querying and analyzing data from various sources without the need for complex data movement or transformation.
This connector enables using Dremio as a data source for federated SQL queries.
Configure local acceleration for datasets in Spice for faster queries
Datasets can be locally accelerated by the Spice runtime, pulling data from any connected data source and storing it locally in a data accelerator for faster access. The data can be kept up-to-date in real-time or on a refresh schedule, ensuring users always have the latest data locally for querying.
Dataset acceleration is enabled by setting the acceleration configuration. Spice currently supports In-Memory Arrow, DuckDB, SQLite, and PostgreSQL as accelerators. For engine-specific configuration, see the accelerator documentation.
datasets:
- from: http://my-http-api.com/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
datasets:
- from: http://[email protected]/report.csv
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
name: laion_freesound
datasets:
- from: http://[email protected]/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: sqlite
models:
- name: webs
from: perplexity:sonar
params:
perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }
perplexity_search_domain_filter:
- docs.spiceai.org
- huggingface.co
openai_temperature: 0.3141595
datasets:
- from: snowflake:DATABASE.SCHEMA.TABLE
name: table
params:
snowflake_warehouse: COMPUTE_WH
snowflake_role: accountadmin
SELECT {'x': 1, 'y': 2, 'z': 3}
Unsupported:
SELECT [['duck', 'goose', 'heron'], ['frog', 'toad']]
SELECT {'x': [1, 2, 3]}
The DuckDB accelerator does not support enum, dictionary, or map field types. For example:
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB accelerator does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
Updating a dataset with DuckDB acceleration while the Spice Runtime is running (hot-reload) will cause the DuckDB accelerator query federation to disable until the Runtime is restarted.
































snowflake_private_key_passphrase
Optional, specifies the Snowflake private key passphrase
snowflake_warehouse
Optional, specifies the Snowflake Warehouse to use
snowflake_role
Optional, specifies the role to use for accessing Snowflake data
snowflake_account
Required, specifies the Snowflake account-identifier
snowflake_username
Required, specifies the Snowflake username to use for accessing Snowflake data
snowflake_password
Optional, specifies the Snowflake password to use for accessing Snowflake data
snowflake_private_key_path
Optional, specifies the path to Snowflake private key
<dependency>
<groupId>ai.spice</groupId>
<artifactId>spiceai</artifactId>
<version>0.3.0</version>
<scope>compile</scope>
</dependency>
implementation 'ai.spice:spiceai:0.3.0'
Spice supports three modes to refresh/update locally accelerated data from a connected data source. full is the default mode. Refer to Data Refresh documentation for detailed refresh usage and configuration.
full
Replace/overwrite the entire dataset on each refresh
A table of users
append
Append/add data to the dataset on each refresh
Append-only, immutable datasets, such as time-series or log data
changes
Apply incremental changes
Customer order lifecycle table
Database indexes are essential for optimizing query performance. Configure indexes for accelerators via indexes field. For detailed configuration, refer to the index documentation.
Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.
Constraints are specified using column references in the Spicepod via the primary_key field in the acceleration configuration. Additional unique constraints are specified via the indexes field with the value unique. Data that violates these constraints will result in a conflict. For constraints configuration details, visit Constraints Documentation.
datasets:
- from: snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
name: lineitem
params:
snowflake_warehouse: COMPUTE_WH
snowflake_role: accountadmin
dotnet add package spiceai
using Spice;
var client = new SpiceClientBuilder()
.WithApiKey("API_KEY")
.WithSpiceCloud("http://my_remote_spice_instance:50051")
.Build();
var reader = await client.Query("SELECT * FROM tpch.lineitem LIMIT 10;");
var enumerator = reader.GetAsyncEnumerator();
while (await enumerator.MoveNextAsync())
{
var batch = enumerator.Current;
// Process batch
}
using Spice;
var client = new SpiceClientBuilder()
.Build();
var data = await client.Query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;");
import ai.spice.SpiceClient;
SpiceClient spice = SpiceClient.builder()
.withApiKey(ApiKey)
.withSpiceCloud()
.build();
FlightStream stream = spice.query("SELECT * FROM tpch.lineitem LIMIT 10");
while (stream.next()) {
try (VectorSchemaRoot batches = stream.getRoot()) {
System.out.println(batches.contentToTSVString());
}
}
SpiceClient spice = SpiceClient.builder()
.build();
SpiceClient spice = SpiceClient.builder()
.withFlightAddress(new URI("grpc://my_remote_spice_instance:50051"))
.build();
SpiceClient client = SpiceClient.builder()
.withMaxRetries(5) // Setting to 0 will disable retries
.build();
version: v1beta1
kind: Spicepod
name: my_spicepod
datasets:
- from: spice.ai/spiceai/quickstart
name: qs
acceleration:
enabled: true
refresh_mode: append
models:
- from: openai:gpt-4o
name: gpt-4o
datasets:
- from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 10s
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
acceleration:
refresh_mode: full
refresh_check_interval: 10m
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
indexes:
number: enabled # Index the `number` column
'(hash, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash # Define a primary key on the `hash` column
indexes:
        '(number, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns

The from field for the ClickHouse connector takes the form of from:db.dataset where db.dataset is the path to the Dataset within ClickHouse. In the example above it would be my.dataset.
If db is not specified in either the from field or the clickhouse_db parameter, it will default to the default database.
The dataset name. This will be used as the table name within Spice.
The ClickHouse data connector can be configured by providing the following params:
clickhouse_connection_string
The connection string to use to connect to the ClickHouse server. This can be used instead of providing individual connection parameters.
clickhouse_host
The hostname of the ClickHouse server.
clickhouse_tcp_port
The port of the ClickHouse server.
clickhouse_db
The name of the database to connect to.
clickhouse_user
The username to connect with.
clickhouse_pass
The password to connect with.
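For illustration, a sketch using the individual connection parameters listed above instead of a connection string (host, port, and credentials are placeholders):

datasets:
  - from: clickhouse:my.dataset
    name: my_dataset
    params:
      clickhouse_host: clickhouse.example.com
      clickhouse_tcp_port: 9000
      clickhouse_db: my_database
      clickhouse_user: my_user
      clickhouse_pass: ${secrets:my_clickhouse_pass}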
fromThe from field takes one of two forms: ftp://<host>/<path> or sftp://<host>/<path> where <host> is the host to connect to and <path> is the path to the file or directory to read from.
If a folder is provided, all child files will be loaded.
The dataset name. This will be used as the table name within Spice.
Example:
file_format
Specifies the data file format. Required if the format cannot be inferred from the from path. See .
ftp_port
Optional, specifies the port of the FTP server. Default is 21. E.g. ftp_port: 21
ftp_user
The username for the FTP server. E.g. ftp_user: my-ftp-user
ftp_pass
The password for the FTP server. Use the to load the password from a secret store, e.g. ${secrets:my_ftp_pass}.
client_timeout
Optional. Specifies timeout for FTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for FTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
file_format
Specifies the data file format. Required if the format cannot be inferred from the from path. See .
sftp_port
Optional, specifies the port of the SFTP server. Default is 22. E.g. sftp_port: 22
sftp_user
The username for the SFTP server. E.g. sftp_user: my-sftp-user
sftp_pass
The password for the SFTP server. Use the to load the password from a secret store, e.g. ${secrets:my_sftp_pass}.
client_timeout
Optional. Specifies timeout for SFTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for SFTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
fromThe from field takes the form dremio:dataset where dataset is the fully qualified name of the dataset to read from.
Limitations
Currently, only up to three levels of nesting are supported for dataset names (e.g., a.b.c). Additional levels are not supported at this time.
The dataset name. This will be used as the table name within Spice.
Example:
dremio_endpoint
The endpoint used to connect to the Dremio server.
dremio_username
The username used to connect to the Dremio endpoint.
dremio_password
The password used to connect to the Dremio endpoint. Use the to load the password from a secret store, e.g. ${secrets:my_dremio_pass}.
The table below shows the Dremio data types supported, along with the type mapping to Apache Arrow types in Spice.
INT
Int32
BIGINT
Int64
FLOAT
Float32
DOUBLE
Float64
DECIMAL
Decimal128
VARCHAR
Utf8
Limitations
Dremio connector does not support queries with the EXCEPT and INTERSECT keywords in Spice REPL. Use DISTINCT and IN/NOT IN instead. See the example below.
Python 3.11+
The following packages are required and will be automatically installed by pip:
pyarrow
pandas
certifi
requests
Install the spicepy package directly from the Spice Github Repository at https://github.com/spiceai/spicepy:
Import spicepy and create a Client by providing your API Key.
You can then submit queries using the query function.
Querying data is done through a Client object that initializes the connection with the Spice.ai endpoint. Client has the following arguments:
api_key (string, optional): Spice.ai API key to authenticate with the endpoint.
url (string, optional): URL of the endpoint to use (default: grpc+tls://flight.spiceai.io)
tls_root_cert (Path or string, optional): Path to the TLS certificate to use for the secure connection (omit for automatic detection)
Once a Client is obtained queries can be made using the query() function. The query() function has the following arguments:
query (string, required): The SQL query.
timeout (int, optional): The timeout in seconds.
A custom timeout can be set by passing the timeout parameter in the query function call. If no timeout is specified, the query defaults to a 10-minute timeout, after which it is cancelled and a TimeoutError exception is raised.
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spicepy library at: https://github.com/spiceai/spicepy
Instructions for using machine learning models hosted on HuggingFace with Spice.
To use a model hosted on HuggingFace, specify the huggingface.co path in the from field and, when needed, the files to include.
fromThe from key takes the form of huggingface:model_path. Two common forms of the from key are shown below.
huggingface:username/modelname: Implies the latest version of modelname hosted by username.
huggingface:huggingface.co/username/modelname:revision: Specifies a particular revision of modelname by username, including the optional domain.
The from key must match the regular expression below and consists of five components:
Prefix: The value must start with huggingface:.
Domain (Optional): Optionally includes huggingface.co/ immediately after the prefix. Currently no other Huggingface compatible services are supported.
Organization/User: The HuggingFace organization (org).
nameThe model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.
paramsfilesThe specific file path for Huggingface model. For example, GGUF model formats require a specific file path, other varieties (e.g. .safetensors) are inferred.
Access tokens can be provided for Huggingface models in two ways:
In the Huggingface token cache (i.e. ~/.cache/huggingface/token). Default.
Via .
For more details on authentication, see .
Limitations
The throughput, concurrency & latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports and for accelerated inference.
ML models currently only support ONNX file format.
Instructions for using language models hosted on OpenAI or compatible services with Spice.
To use a language model hosted on OpenAI (or compatible), specify the openai path in the from field.
For a specific model, include it as the model ID in the from field (see example below). The default model is gpt-4o-mini.
fromThe from field takes the form openai:model_id where model_id is the model ID of the OpenAI model, valid model IDs are found in the {endpoint}/v1/models API response.
Example:
nameThe model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.
paramsSee for additional configuration options.
Spice supports several OpenAI compatible providers. Specify the appropriate endpoint in the params section.
Follow instructions.
Groq provides OpenAI compatible endpoints. Use the following configuration:
NVidia NIM models are OpenAI compatible endpoints. Use the following configuration:
View the Spice cookbook for an example of setting up NVidia NIM with Spice .
Parasail also offers OpenAI compatible endpoints. Use the following configuration:
Refer to the respective provider documentation for more details on available models and configurations.
Spice Machine Learning (ML) Models
Spice Models enable the training and use of ML models natively on the Spice platform.
The platform currently supports time-series forecasting models, with other categories of models planned.
Hosted models have first-class access to co-located data for training and inferencing including: Spice managed datasets, user managed datasets, and custom datasets and views. Additionally, Spice Firecache can be leveraged to train and infer up to 10x faster.
Models are defined using a YAML file. Model details such as data requirements, architecture, training parameters, and other important hyperparameters are defined in the model.yaml.
Add a model.yaml file to the repository path /models/[model_name]/model.yaml of a , replacing [model_name] with the desired model name. For example, the uses the path /models/gas-fees/model.yaml.
Refer to the for all available configuration options.
For example model manifests, see the .
In the , navigate to the Models tab of the Spice app.
model.yaml files committed to the connected repository will be automatically detected and imported as Spice Models.
Navigating to a specific Model will show detailed information as defined in the model.yaml.
A training run can be started using the Train button.
Training runs in progress will be shown and updated, along with historical training runs.
The Training Status will be updated to Complete for successfully completed training runs. Details and the Training Report are available on the Training Run page.
A successfully trained model can be used to make predictions.
The lookback data (inferencing data) is automatically provided by the platform and wired up to the inference, enabling a prediction to be made using a simple API call.
Navigate to AI Predictions in the Playground.
Successfully trained models will be available for selection from the model selector drop down on the right.
Clicking Predict will demonstrate calling the predictions API using lookback data within the Spice platform. A graph of the predicted value(s) along with the lookback data will be displayed.
The Training Runs page provides training details including a copyable curl command to make a prediction from the command line.
For details on the API, see .
Microsoft SQL Server Data Connector
Microsoft SQL Server is a relational database management system developed by Microsoft.
The Microsoft SQL Server Data Connector enables federated/accelerated SQL queries on data stored in MSSQL databases.
Limitations
The connector supports SQL Server authentication (SQL Login and Password) only.
Spatial types (geography) are not supported, and columns with these types will be ignored.
fromThe from field takes the form mssql:database.schema.table where database.schema.table is the fully-qualified table name in the SQL server.
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsThe data connector supports the following params. Use the to load the secret from a secret store, e.g. ${secrets:my_mssql_conn_string}.
SQL Query Apache Arrow Flight API
SQL query results can be served via a high-performance Apache Arrow Flight endpoint. Arrow Flight uses the gRPC protocol for efficient data transfer.
This setup enables high-speed access to your data in Python, Go, C++, C#, Rust, and Java, and makes it easy to use libraries like Pandas and NumPy.
It's recommended to use the Spice.ai SDKs to connect and query the Arrow Flight endpoint. SDKs are available for , , , , , and .
In Python, query results can be easily converted to Pandas or NumPy formats.
You may also use Apache's pyarrow library directly.
Endpoint URL: grpc+tls://flight.spiceai.io
Basic Authentication:
Username can be set to an empty string
Dataset names must be fully-qualified. For example spiceai.quickstart
Find code samples in Python in .
If you get this error:
Could not get default pem root certs
Install the .
Golang SDK for Spice.ai
The Go SDK gospice is the easiest way to query Spice.ai from Go.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
GoDocs are available at: pkg.go.dev/github.com/spiceai/gospice.
(or later)
Get the gospice package.
1. Import the package.
2. Create a SpiceClient by providing your API key. Get your free API key at .
3. Initialize the SpiceClient.
4. Execute a query and get back an .
5. Iterate through the reader to access the records.
Follow the to install and run spice locally.
Or using custom flight address:
Check to learn more.
Run go run . to execute a sample query and print the results to the console.
The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via SetMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the gospice library at:
DuckDB Data Connector Documentation
DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed for analytical query workloads. It is optimized for fast execution and can be embedded directly into applications, providing efficient data processing without the need for a separate database server.
This connector supports DuckDB as a data source for federated SQL queries.
How to use the Spice.ai for GitHub Copilot Extension
The Spice.ai for GitHub Copilot Extension makes it easy to access and chat with external data in GitHub Copilot, enhancing AI-assisted research, Q&A, code, and documentation suggestions for greater accuracy.
Access structured and unstructured data from any Spice data source like GitHub, PostgreSQL, MySQL, Snowflake, Databricks, GraphQL, data lakes (S3, Delta Lake, OneLake), HTTP(s), SharePoint, and even FTP.
Some example prompts:
@spiceai What documentation is relevant to this file?
datasets:
- from: clickhouse:my.dataset
    name: my_dataset

datasets:
- from: clickhouse:my.dataset
    name: cool_dataset

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: clickhouse:my.dataset
name: my_dataset
params:
clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database
connection_timeout: 10000
      clickhouse_secure: true

datasets:
- from: clickhouse:my.dataset
name: my_dataset
params:
      clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database?connection_timeout=10000&secure=true

datasets:
- from: sftp://remote-sftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
sftp_port: 22
sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}

datasets:
- from: sftp://remote-sftp-server.com/path/to/folder/
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: ftp://remote-ftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
ftp_user: my-ftp-user
ftp_pass: ${secrets:my_ftp_password}
    hive_partitioning_enabled: false

- from: sftp://remote-sftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
sftp_port: 22
sftp_user: my-sftp-user
sftp_pass: ${secrets:my_sftp_password}
    hive_partitioning_enabled: false

# fail
SELECT ws_item_sk FROM web_sales
INTERSECT
SELECT ss_item_sk FROM store_sales;
# success
SELECT DISTINCT ws_item_sk FROM web_sales
WHERE ws_item_sk IN (
SELECT DISTINCT ss_item_sk FROM store_sales
);
# fail
SELECT ws_item_sk FROM web_sales
EXCEPT
SELECT ss_item_sk FROM store_sales;
# success
SELECT DISTINCT ws_item_sk FROM web_sales
WHERE ws_item_sk NOT IN (
SELECT DISTINCT ss_item_sk FROM store_sales
);

- from: dremio:datasets.dremio_dataset
name: dremio_dataset
params:
dremio_endpoint: grpc://127.0.0.1:32010
dremio_username: demo
    dremio_password: ${secrets:my_dremio_pass}

datasets:
- from: dremio:datasets.dremio_dataset
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: dremio:datasets.dremio_dataset
name: dremio_dataset
params:
dremio_endpoint: grpc://127.0.0.1:32010
dremio_username: demo
    dremio_password: ${secrets:my_dremio_pass}

pip install git+https://github.com/spiceai/[email protected]

from spicepy import Client
client = Client('API_KEY')
data = client.query('SELECT * FROM eth.recent_blocks LIMIT 10;', timeout=5*60)
pd = data.read_pandas()

from spicepy import Client
client = Client()
data = client.query('SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;', timeout=5*60)
pd = data.read_pandas()

models:
- from: openai:gpt-4o-mini
name: openai_model
params:
openai_api_key: ${ secrets:OPENAI_API_KEY } # Required for official OpenAI models
tools: auto # Optional. Connect the model to datasets via SQL query/vector search tools
system_prompt: "You are a helpful assistant." # Optional.
# Optional parameters
endpoint: https://api.openai.com/v1 # Override to use a compatible provider (i.e. NVidia NIM)
openai_org_id: ${ secrets:OPENAI_ORG_ID }
openai_project_id: ${ secrets:OPENAI_PROJECT_ID }
# Override default chat completion request parameters
openai_temperature: 0.1
    openai_response_format: { "type": "json_object" }

clickhouse_secure
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.
false: This mode will not attempt to use an SSL connection, even if the server supports it.
connection_timeout
Optional. Specifies the connection timeout in milliseconds.
brew install --cask miniforge

conda init "$(basename "${SHELL}")"

conda install pyarrow pandas

VARBINARY
Binary
BOOL
Boolean
DATE
Date64
TIME
Time32
TIMESTAMP
Timestamp(Millisecond, None)
INTERVAL
Interval
LIST
List
STRUCT
Struct
MAP
Map

model.yaml files automatically detected and imported in the Portal.




Model Name: After a /, the model name (model).
Revision (Optional): A colon (:) followed by the git-like revision identifier (revision).
hf_token
The Huggingface access token.
-
model_type
The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseekv3
-
tools
Which [tools] should be made available to the model. Set to auto to use all available tools.
-
system_prompt
An additional system prompt used for all chat completions to this model.
-
mssql_encrypt
(Optional) Specifies whether encryption is required for the connection.
true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.
false: This mode will not attempt to use an SSL connection, even if the server supports it. Only the login procedure is encrypted.
mssql_trust_server_certificate
(Optional) Specifies whether the server certificate should be trusted without validation when encryption is enabled.
true: The server certificate will not be validated and it is accepted as-is.
false: (default) Server certificate will be validated against system's certificate storage.
mssql_connection_string
The ADO connection string to use to connect to the server. This can be used instead of providing individual connection parameters.
mssql_host
The hostname or IP address of the Microsoft SQL Server instance.
mssql_port
(Optional) The port of the Microsoft SQL Server instance. Default value is 1433.
mssql_database
(Optional) The name of the database to connect to. The default database (master) will be used if not specified.
mssql_username
The username for the SQL Server authentication.
mssql_password
The password for the SQL Server authentication.
$PWD is a bash-specific variable that will be replaced by the current directory path. You can download the certificate file isrgrootx1.pem to a specific location and provide that path instead of $PWD.

Add Spice SDK
1. Create a SpiceClient by providing your API key to ClientBuilder. Get your free API key at spice.ai.
2. Execute a query and get back an Apache Arrow Flight Record Batch Stream.
3. Iterate through the reader to access the records.
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spice-rs library at: https://github.com/spiceai/spice-rs
\A(huggingface:)(huggingface\.co\/)?(?<org>[\w\-]+)\/(?<model>[\w\-]+)(:(?<revision>[\w\d\-\.]+))?\z

models:
- from: huggingface:huggingface.co/lmstudio-community/Qwen2.5-Coder-3B-Instruct-GGUF
name: sloth-gguf
files:
      - path: Qwen2.5-Coder-3B-Instruct-Q3_K_L.gguf

models:
- name: llama_3.2_1B
from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
params:
      hf_token: ${ secrets:HF_TOKEN }

models:
- from: huggingface:huggingface.co/spiceai/darts:latest
name: hf_model
files:
- path: model.onnx
datasets:
      - taxi_trips

models:
- from: huggingface:huggingface.co/microsoft/Phi-3.5-mini-instruct
    name: phi

models:
- name: llama_3.2_1B
from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
params:
      hf_token: ${ secrets:HF_TOKEN }

datasets:
- from: mssql:path.to.my_dataset
name: my_dataset
params:
      mssql_connection_string: ${secrets:mssql_connection_string}

datasets:
- from: mssql:path.to.my_dataset
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: mssql:SalesLT.Customer
name: customer
params:
mssql_host: mssql-host.database.windows.net
mssql_database: my_catalog
mssql_username: my_user
mssql_password: ${secrets:mssql_pass}
mssql_encrypt: true
      mssql_trust_server_certificate: true

curl -Lo isrgrootx1.pem https://letsencrypt.org/certs/isrgrootx1.pem
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="$PWD/isrgrootx1.pem"

@powershell -NoProfile -ExecutionPolicy unrestricted -Command ^
(new-object System.Net.WebClient).Downloadfile( ^
'https://letsencrypt.org/certs/isrgrootx1.pem', 'isrgrootx1.pem')
set GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=%cd%\isrgrootx1.pem

go get github.com/spiceai/gospice/v7

import "github.com/spiceai/gospice/v7"

spice := gospice.NewSpiceClient()
defer spice.Close()

if err := spice.Init(
spice.WithApiKey(ApiKey),
spice.WithSpiceCloudAddress()
); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

reader, err := spice.Query(context.Background(), "SELECT * FROM tpch.lineitem LIMIT 10")
if err != nil {
panic(fmt.Errorf("error querying: %w", err))
}
defer reader.Release()

for reader.Next() {
record := reader.Record()
defer record.Release()
fmt.Println(record)
}

spice := gospice.NewSpiceClient()
defer spice.Close()
if err := spice.Init(); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

spice := gospice.NewSpiceClient()
defer spice.Close()
if err := spice.Init(
spice.WithFlightAddress("grpc://localhost:50052")
); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

spice := gospice.NewSpiceClient()
spice.SetMaxRetries(5) // Setting to 0 will disable retries

cargo add spiceai

use spiceai::ClientBuilder;
#[tokio::main]
async fn main() {
let client = ClientBuilder::new()
.api_key("API_KEY")
.use_spiceai_cloud()
.build()
.await
.unwrap();
}

let mut flight_data_stream = client.query("SELECT * FROM tpch.lineitem LIMIT 10;").await.expect("Error executing query");

while let Some(batch) = flight_data_stream.next().await {
match batch {
Ok(batch) => {
/* process batch */
println!("{:?}", batch)
},
Err(e) => {
/* handle error */
},
};
}

use spiceai::ClientBuilder;
#[tokio::main]
async fn main() {
let client = ClientBuilder::new()
.build()
.await
.unwrap();
let data = client.query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;").await;
}

openai_org_id
The OpenAI organization ID.
-
openai_project_id
The OpenAI project ID.
-
openai_temperature
Set the default temperature to use on chat completions.
-
openai_response_format
An object specifying the format that the model must output, see .
-
endpoint
The OpenAI API base endpoint. Can be overridden to use a compatible provider (i.e. Nvidia NIM).
https://api.openai.com/v1
tools
Which tools should be made available to the model. Set to auto to use all available tools.
-
system_prompt
An additional system prompt used for all chat completions to this model.
-
openai_api_key
The OpenAI API key.
-
sharepoint_client_id
Yes
The client ID of the Azure AD (Entra) application
sharepoint_tenant_id
Yes
The tenant ID of the Azure AD (Entra) application.
sharepoint_client_secret
Optional
For service principal authentication. The client secret of the Azure AD (Entra) application.
The from field in a SharePoint dataset takes the following format:
drive_type in a SharePoint Connector from field supports the following types:
drive
The SharePoint drive's name
from: sharepoint:drive:Documents/...
driveId
The SharePoint drive's ID
from: sharepoint:driveId:b!Mh8opUGD80ec7zGXgX9r/...
site
A SharePoint site's name
from: sharepoint:site:MySite/...
siteId
A SharePoint site's ID
For a name-based drive_id, the connector will attempt to resolve the name to an ID at startup.
Within a drive, the SharePoint connector can load documents from:
The root of the drive
from: sharepoint:me/root
A specific path within the drive
from: sharepoint:drive:Documents/path:/top_secrets
A specific folder ID
from: sharepoint:group:MyGroup/id:01QM2NJSNHBISUGQ52P5AJQ3CBNOXDMVNT
To use the SharePoint connector with service principal authentication, you will need to create an Azure AD application and grant it the necessary permissions. This will also support OAuth2 authentication for users within the tenant (i.e. sharepoint_bearer_token).
Create a new Azure AD application in the Azure portal.
Under the application's API permissions, add the following permissions: Sites.Read.All, Files.Read.All, User.Read, GroupMember.Read.All
For service principal authentication, Application permissions are required.
For user authentication, only delegated permissions are required.
Add sharepoint_client_id (from the Application (Client) ID field) and sharepoint_tenant_id to the connector configuration.
Under the application's Certificates & secrets, create a new client secret. Use this for the sharepoint_client_secret parameter.
fromThe from field supports one of two forms:
from
Description
duckdb:database.schema.table
Read data from a table named database.schema.table in the DuckDB file
duckdb:*
Read data using any DuckDB function that produces a table. For example one of the functions such as read_json, read_parquet or read_csv.
The dataset name. This will be used as the table name within Spice.
Example:
The DuckDB data connector can be configured by providing the following params:
duckdb_open
The name of the DuckDB database to open.
Configuration params are provided either in the top level dataset for a dataset source, or in the acceleration section for a data store.
A generic example of DuckDB data connector configuration.
Common data import DuckDB functions can also define datasets. Instead of a fixed table reference (e.g. database.schema.table), a DuckDB function is provided in the from: key. For example
Datasets created from DuckDB functions are similar to a standard SELECT query. For example:
is equivalent to:
Many DuckDB data imports can be rewritten as DuckDB functions, making them usable as Spice datasets. For example:
Limitations
The DuckDB connector does not support enum, dictionary, or map field types. For example:
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB connector does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
@spiceai Write documentation about the user authentication issue
@spiceai Who are the top 5 committers to this repository?
@spiceai What are the latest error logs from my web app?
To install the extension, visit the GitHub Marketplace and search for Spice.ai.
Scroll down, and click Install it for free.
Once installed, open Copilot Chat and type @spiceai. Press enter.
A prompt will appear to connect to the Spice.ai Cloud Platform.
You will need to authorize the extension. Click Authorize spiceai.
To create an account on the Spice.ai Cloud Platform, click Authorize Spice AI Platform.
Once your account is created, you can configure the extension. Select from a set of ready-to-use datasets to get started. You can configure other datasets after setup.
The extension will take up to 30 seconds to deploy and load the initial dataset.
When complete, proceed back to GitHub Copilot Chat.
To chat with the Spice.ai for GitHub Copilot extension, prefix the message with @spiceai
To list the datasets available to Copilot, try @spiceai What datasets do I have access to?

To use PostgreSQL as Data Accelerator, specify postgres as the engine for acceleration.
The connection to PostgreSQL can be configured by providing the following params:
pg_host: The hostname of the PostgreSQL server.
pg_port: The port of the PostgreSQL server.
pg_db: The name of the database to connect to.
pg_user: The username to connect with.
pg_pass: The password to connect with. Use the to load the password from a secret store, e.g. ${secrets:my_pg_pass}.
pg_sslmode: Optional. Specifies the SSL/TLS behavior for the connection, supported values:
verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
verify-ca: This mode requires a TLS connection and a valid root certificate.
pg_sslrootcert: Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
connection_pool_size: Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
Configuration params are provided in the acceleration section of a dataset.
The table below lists the supported Apache Arrow data types and their mappings to PostgreSQL types when stored in a PostgreSQL accelerator.
LIMITATIONS
Postgres federated queries may return unexpected result types due to differences between DataFusion and Postgres type-widening rules. Explicitly specify the expected output type of aggregation functions when writing queries that involve Postgres tables in Spice. For example, rewrite SUM(int_col) as CAST(SUM(int_col) AS BIGINT).
Delta Lake Data Connector Documentation
The Delta Lake Data Connector enables SQL queries on Delta Lake tables.
fromThe from field for the Delta Lake connector takes the form of delta_lake:path where path is any supported path, either local or to a cloud storage location. See the section below.
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsUse the to reference a secret, e.g. ${secrets:aws_access_key_id}.
The table below shows the Delta Lake data types supported, along with the type mapping to Apache Arrow types in Spice.
Delta Lake connector does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Delta Lake connector with such tables, drop the V2Checkpoint feature by executing the following command:
For more details on dropping Delta table features, refer to the official documentation:
Learn how to use Data Connector to query external data.
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Supported Data Connectors include:
databricks (mode: delta_lake)
Databricks
S3/Delta Lake
delta_lake
Delta Lake
Delta Lake
For data connectors that are object store compatible, if a folder is provided, the file format must be specified with params.file_format.
If a file is provided, the file format will be inferred, and params.file_format is unnecessary.
File formats currently supported are:
File formats support additional parameters in the params (like csv_has_header) described in
If a format is a document format, each file will be treated as a document, as per below.
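For illustration, a sketch of the folder-versus-file distinction using the S3 connector (bucket and paths are placeholders):

datasets:
  - from: s3://my-bucket/data/                 # folder: file_format must be specified
    name: folder_dataset
    params:
      file_format: parquet
  - from: s3://my-bucket/data/file.parquet     # single file: format inferred from the extension
    name: file_dataset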
The Node.js SDK spice.js is the easiest way to use and query Spice.ai with Node.js.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to JavaScript objects/arrays or JSON.
Import SpiceClient and instantiate a new instance with an API Key.
You can then submit queries using the query function.
SpiceClient has the following arguments:
apiKey (string, required): API key to authenticate with the endpoint.
url (string, optional): URL of the endpoint to use (default: flight.spiceai.io:443)
sqlJson(query: string) - Execute SQL queries with JSON results

The sqlJson() method executes SQL queries and returns results in JSON format with schema information.
The response includes:
row_count: Number of rows returned
schema: Schema information with field names and types
data: Array of row objects
Follow the to install and run spice locally.
Check to learn more.
The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via setMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the spice.js library at: .
S3 Data Connector Documentation
The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
If a folder path is specified as the dataset source, all files within the folder will be loaded.
File formats are specified using the file_format parameter, as described in .
Azure BlobFS Data Connector Documentation
The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (abfss://) and Azure Data Lake (adl://) endpoints.
When a folder path is provided, all the contained files will be loaded.
File formats are specified using the file_format parameter, as described in .
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models{
"object": "list",
"data": [
{
"id": "gpt-4o-mini",
"object": "model",
"created": 1727389042,
"owned_by": "system"
},
...
}models:
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }models:
- from: openai:my_nim_model_id
name: my_nim_model
params:
endpoint: https://my_nim_host.com/v1
openai_api_key: ${ secrets:SPICE_NIM_API_KEY }models:
- from: openai:parasail-model-id
name: parasail_model
params:
endpoint: https://api.parasail.com/v1
openai_api_key: ${ secrets:SPICE_PARASAIL_API_KEY }datasets:
- from: sharepoint:drive:Documents/path:/top_secrets/
name: important_documents
params:
sharepoint_client_id: ${secrets:SPICE_SHAREPOINT_CLIENT_ID}
sharepoint_tenant_id: ${secrets:SPICE_SHAREPOINT_TENANT_ID}
      sharepoint_client_secret: ${secrets:SPICE_SHAREPOINT_CLIENT_SECRET}

SELECT * FROM important_documents limit 1

[
{
"created_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
"created_by_name": "Jack Eadie",
"created_at": "2024-09-09T04:57:00",
"c_tag": "\"c:{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
"e_tag": "\"{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
"id": "01YRH3MPAPCNG33FJMLFHJ7E4FXUFJ4GYZ",
"last_modified_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
"last_modified_by_name": "Jack Eadie",
"last_modified_at": "2024-09-09T04:57:00",
"name": "ngx_google_perftools_module.md",
"size": 959,
"web_url": "https://spiceai.sharepoint.com/Shared%20Documents/md/ngx_google_perftools_module.md",
"content": "# Module ngx_google_perftools_module\n\nThe `ngx_google_perftools_module` module (0.6.29) enables profiling of nginx worker processes using [Google Performance Tools](https://github.com/gperftools/gperftools). The module is intended for nginx developers.\n\nThis module is not built by default, it should be enabled with the `--with-google_perftools_module` configuration parameter.\n\n> **Note:** This module requires the [gperftools](https://github.com/gperftools/gperftools) library.\n\n## Example Configuration\n\n```nginx\ngoogle_perftools_profiles /path/to/profile;\n```\n\nProfiles will be stored as `/path/to/profile.<worker_pid>`.\n\n## Directives\n\n### google_perftools_profiles\n\n- **Syntax:** `google_perftools_profiles file;`\n- **Default:** —\n- **Context:** `main`\n\nSets a file name that keeps profiling information of nginx worker process. The ID of the worker process is always a part of the file name and is appended to the end of the file name, after a dot.\n"
}
]

from: 'sharepoint:<drive_type>:<drive_id>/<subpath_type>:<subpath_value>'

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
      duckdb_open: path/to/duckdb_file.duckdb

datasets:
- from: duckdb:database.schema.table
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
      duckdb_open: path/to/duckdb_file.duckdb

datasets:
- from: duckdb:sample_data.nyc.rideshare
name: nyc_rideshare
params:
      duckdb_open: /my/path/my_database.db

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
duckdb_open: path/to/duckdb_file.duckdb
- from: duckdb:read_csv('test.csv', header = false)
    name: from_function

datasets:
  - from: duckdb:read_csv('test.csv', header = false)

-- from_function
SELECT * FROM read_csv('test.csv', header = false);

SELECT * FROM 'todos.json';
-- As a DuckDB function
SELECT * FROM read_json('todos.json');

datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
      engine: postgres

datasets:
- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
name: my_delta_lake_table
params:
delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
      delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}

from: sharepoint:siteId:b!Mh8opUGD80ec7zGXgX9r/...
group
A SharePoint group's name
from: sharepoint:group:MyGroup/...
groupId
A SharePoint group's ID
from: sharepoint:groupId:b!Mh8opUGD80ec7zGXgX9r/...
me
A user's OneDrive
from: sharepoint:me/...

JSON
file_format: json
Roadmap
❌
Microsoft Excel
file_format: xlsx
Roadmap
❌
Markdown
file_format: md
✅
✅
Text
file_format: txt
✅
✅
PDF
file_format: pdf
Alpha
✅
Microsoft Word
file_format: docx
Alpha
✅
dremio
Dremio
Arrow Flight
duckdb
DuckDB
Embedded
github
GitHub
GitHub API
postgres
PostgreSQL
s3
S3
Parquet, CSV
mysql
MySQL
delta_lake
Delta Lake
Delta Lake
graphql
GraphQL
JSON
databricks (mode: spark_connect)
Databricks
Spark Connect
flightsql
FlightSQL
Arrow Flight SQL
mssql
Microsoft SQL Server
Tabular Data Stream (TDS)
snowflake
Snowflake
Arrow
spark
Spark
Spark Connect
spice.ai
Spice.ai
Arrow Flight
iceberg
Apache Iceberg
Parquet
abfs
Azure BlobFS
Parquet, CSV
clickhouse
Clickhouse
debezium
Debezium CDC
Kafka + JSON
dynamodb
DynamoDB
ftp, sftp
FTP/SFTP
Parquet, CSV
http, https
HTTP(s)
Parquet, CSV
sharepoint
Microsoft SharePoint
Unstructured UTF-8 documents
file_format: parquet
✅
❌
file_format: csv
✅
❌
file_format: iceberg
Roadmap
❌
require: This mode requires a TLS connection.
prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
disable: This mode will not attempt to use a TLS connection, even if the server supports it.
UInt8
TinyUnsigned
smallint
UInt16
SmallUnsigned
smallint
UInt32
Unsigned
bigint
UInt64
BigUnsigned
numeric
Decimal128 / Decimal256
Decimal
decimal
Float32
Float
real
Float64
Double
double precision
Utf8 / LargeUtf8
Text
text
Boolean
Boolean
bool
Binary / LargeBinary
VarBinary
bytea
FixedSizeBinary
Binary
bytea
Timestamp (no Timezone)
Timestamp
timestamp without time zone
Timestamp (with Timezone)
TimestampWithTimeZone
timestamp with time zone
Date32 / Date64
Date
date
Time32 / Time64
Time
time
Interval
Interval
interval
Duration
BigInteger
bigint
List / LargeList / FixedSizeList
Array
array
Struct
N/A
Composite (Custom type)
Int8
TinyInteger
smallint
Int16
SmallInteger
smallint
Int32
Integer
integer
Int64
BigInteger
bigint
Double
Float64
Boolean
Boolean
Binary
Binary
Date
Date32
Timestamp
Timestamp(Microsecond, Some("UTC"))
TimestampNtz
Timestamp(Microsecond, None)
Decimal
Decimal128
Array
List
Struct
Struct
Map
Map
client_timeout
Optional. Specifies timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s
delta_lake_aws_region
Optional. The AWS region for the S3 object store. E.g. us-west-2.
delta_lake_aws_access_key_id
The access key ID for the S3 object store.
delta_lake_aws_secret_access_key
The secret access key for the S3 object store.
delta_lake_aws_endpoint
Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.
delta_lake_azure_storage_account_name
The Azure Storage account name.
delta_lake_azure_storage_account_key
The Azure Storage master key for accessing the storage account.
delta_lake_azure_storage_client_id
The service principal client id for accessing the storage account.
delta_lake_azure_storage_client_secret
The service principal client secret for accessing the storage account.
delta_lake_azure_storage_sas_key
The shared access signature key for accessing the storage account.
delta_lake_azure_storage_endpoint
Optional. The endpoint for the Azure Blob storage account.
google_service_account
Filesystem path to the Google service account JSON key file.
String
Utf8
Long
Int64
Integer
Int32
Short
Int16
Byte
Int8
Float
Float32
execution_time_ms: Query execution time in milliseconds

npm install @spiceai/spice@latest --save

yarn add @spiceai/spice

datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: postgres
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
        pg_sslmode: require

datasets:
- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: delta_lake:/path/to/local/delta/table # A local filesystem path to a Delta Lake table
  name: my_delta_lake_table

- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/ # A reference to a table in S3
name: my_delta_lake_table
params:
delta_lake_aws_region: us-west-2 # Optional
delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}
    delta_lake_aws_endpoint: s3.us-west-2.amazonaws.com # Optional

- from: delta_lake:abfss://my_container@my_account.dfs.core.windows.net/path/to/azure/delta/table/ # A reference to a table in Azure Blob
name: my_delta_lake_table
params:
# Account Name + Key
delta_lake_azure_storage_account_name: my_account
delta_lake_azure_storage_account_key: ${secrets:my_key}
# OR Service Principal + Secret
delta_lake_azure_storage_client_id: my_client_id
delta_lake_azure_storage_client_secret: ${secrets:my_secret}
# OR SAS Key
    delta_lake_azure_storage_sas_key: my_sas_key

params:
  delta_lake_google_service_account_path: /path/to/service-account.json

ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];

import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient("API_KEY");
const table = await spiceClient.sql(
'SHOW TABLES;'
);
console.table(table.toArray());

const result = await spiceClient.sqlJson('SELECT name, age FROM users LIMIT 5');
console.log(`Returned ${result.row_count} rows`);
console.log('Schema:', result.schema);
console.log('Data:', result.data);
console.log(`Query took ${result.execution_time_ms}ms`);
// Access individual rows
result.data.forEach((row) => {
console.log(`${row.name} is ${row.age} years old`);
});

import { SpiceClient } from '@spiceai/spice';
const main = async () => {
// uses connection to local runtime by default
const spiceClient = new SpiceClient();
// or use custom connection params:
// const spiceClient = new SpiceClient({
// httpUrl: 'http://my_spice_http_host',
// flightUrl: 'my_spice_flight_host',
// });
const table = await spiceClient.sql(
'SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;'
);
console.table(table.toArray());
};
main();

const spiceClient = new SpiceClient('API_KEY');
spiceClient.setMaxRetries(5); // Setting to 0 will disable retries

fromS3-compatible URI to a folder or file, in the format s3://<bucket>/<path>
Example: from: s3://my-bucket/path/to/file.parquet
The dataset name. This will be used as the table name within Spice.
Example:
file_format
Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to for details.
s3_endpoint
S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server
s3_region
S3 bucket region. Default: us-east-1.
client_timeout
Timeout for S3 operations. Default: 30s.
hive_partitioning_enabled
Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
s3_auth
Authentication type. Options: public, key and iam_role. Defaults to public if s3_key and s3_secret are not provided, otherwise defaults to key.
For additional CSV parameters, see CSV Parameters
No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth to iam_role. If using iam_role, the AWS IAM role of the running instance is used.
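For illustration, a sketch of a private bucket accessed via the instance's IAM role (bucket and path are placeholders):

datasets:
  - from: s3://my-private-bucket/reports/
    name: reports
    params:
      file_format: parquet
      s3_auth: iam_role   # use the IAM role assigned to the running instance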
Minimum IAM policy for S3 access:
Refer to Object Store Data Types for data type mapping from object store files to arrow data type.
Create a dataset named taxi_trips from a public S3 folder.
Create a dataset named cool_dataset from a Parquet file stored in MinIO.
Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
For example, a dataset partitioned by year, month, and day might have a directory structure like:
Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.
Performance Considerations
When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.
Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.
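A minimal sketch of the disk-backed acceleration described above, assuming the duckdb accelerator with mode: file (dataset path and names are placeholders):

datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file   # store accelerated data on disk instead of in memory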
Start to use AI Chat by typing in the question and clicking send.
The ability of AI Chat depends on the model configuration, including Language Model Overrides, Model Runtime Tools, etc. Refer to the Model Documentation for details of customizing the model used in AI Chat.
Below is an example demonstrating how to configure the OpenAI gpt-4o model with auto access to runtime tools and system prompts overrides. This model is customized to answer questions relevant to spicepod datasets.
Ask questions regarding datasets configured in spicepod within the AI Chat.
Spice.ai provides observability into the AI Chat, showing full tool usage traces and chat completion history.
Navigate to the Observability section in the portal.
Select an ai_chat task history and view details over the chat completion history, including timestamps, tool usage, intermediate outputs, etc.
fromDefines the ABFS-compatible URI to a folder or object:
from: abfs://<container>/<path> with the account name configured using abfs_account parameter, or
from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>
Defines the dataset name, which is used as the table name within Spice.
Example:
file_format
Specifies the data format. Required if not inferrable from from. Options: parquet, csv. Refer to for details.
abfs_account
Azure storage account name
abfs_sas_string
SAS (Shared Access Signature) Token to use for authorization
abfs_endpoint
Storage endpoint, default: https://{account}.blob.core.windows.net
abfs_use_emulator
Use true or false to connect to a local emulator
abfs_authority_host
Alternative authority host, default: https://login.microsoftonline.com
The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:
abfs_access_key
abfs_bearer_token
abfs_client_secret
abfs_skip_signature
If none of these are set the connector will default to using a managed identity
abfs_access_key
Secret access key
abfs_bearer_token
BEARER access token for user authentication. The token can be obtained from the OAuth2 flow (see ).
abfs_client_id
Client ID for client authentication flow
abfs_client_secret
Client Secret to use for client authentication flow
abfs_tenant_id
Tenant ID to use for client authentication flow
abfs_skip_signature
Skip credentials and request signing for public containers
abfs_max_retries
Maximum retries
abfs_retry_timeout
Total timeout for retries (e.g., 5s, 1m)
abfs_backoff_initial_duration
Initial retry delay (e.g., 5s)
abfs_backoff_max_duration
Maximum retry delay (e.g., 1m)
abfs_backoff_base
Exponential backoff base (e.g., 0.1)
ABFS connector supports three types of authentication, as detailed in the authentication parameters
Configure service principal authentication by setting the abfs_client_secret parameter.
Create a new Azure AD application in the Azure portal and generate a client secret under Certificates & secrets.
Grant the Azure AD application read access to the storage account under Access Control (IAM), this can typically be done using the Storage Blob Data Reader built-in role.
Configure service principal authentication by setting the abfs_access_key parameter to Azure Storage Account Access Key
Specify the file format using file_format parameter. More details in Object Store File Formats.
GraphQL Data Connector Documentation
The GraphQL Data Connector enables federated SQL queries on any GraphQL endpoint by specifying graphql as the selector in the from value for the dataset.
Limitations
The GraphQL data connector does not support variables in the query.
Filter pushdown, with the exclusion of LIMIT, is not currently supported. Using a LIMIT will reduce the amount of data requested from the GraphQL server.
fromThe from field takes the form of graphql:your-graphql-endpoint.
nameThe dataset name. This will be used as the table name within Spice.
paramsThe GraphQL data connector can be configured by providing the following params. Use the to load the password from a secret store, e.g. ${secrets:my_graphql_auth_token}.
Example using the GitHub GraphQL API and Bearer Auth. The following will use json_pointer to retrieve all of the nodes in starredRepositories:
The GraphQL Data Connector supports automatic pagination of the response for queries using .
The graphql_query must include the pageInfo field as per . The connector will parse the graphql_query, and when pageInfo is present, will retrieve data until pagination completes.
The query must have the correct pagination arguments in the associated paginated field.
Forward Pagination:
Backward Pagination:
Tips for working with JSON data. For more information see .
You can access the fields of the object using the square bracket notation. Arrays are indexed from 1.
Example for the stargazers query from :
You can use Datafusion unnest function to pipe values from array into rows. We'll be using as an example.
Example query:
You can also use the unnest_depth parameter to control automatic unnesting of objects from GraphQL responses.
This example uses the GitHub stargazers endpoint:
If unnest_depth is set to 0, or unspecified, object unnesting is disabled. When enabled, unnesting automatically moves nested fields to the parent level.
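A sketch of enabling object unnesting for a GitHub stargazers dataset (the endpoint, json_pointer value, and query body are illustrative; authentication parameters are omitted):

datasets:
  - from: graphql:https://api.github.com/graphql
    name: stargazers
    params:
      unnest_depth: 2          # flatten nested objects up to two levels deep
      json_pointer: /data/repository/stargazers/edges
      graphql_query: |
        {
          repository(owner: "spiceai", name: "spiceai") {
            stargazers(first: 100) {
              edges {
                starredAt
                node { login name }
              }
              pageInfo { hasNextPage endCursor }
            }
          }
        }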
Without unnesting, stargazers data looks like this in a query:
With unnesting, these properties are automatically placed into their own columns:
By default, the Spice Runtime will error when a duplicate column is detected during unnesting.
For example, this example spicepod.yml query would fail due to name fields:
This example would fail with a runtime error:
Avoid this error by using GraphQL aliases where possible. In the example above, a duplicate error was introduced from emergency_contact { name }.
The example below uses a GraphQL alias to rename emergency_contact.name as emergencyContactName.
MySQL Data Connector Documentation
MySQL is an open-source relational database management system that uses structured query language (SQL) for managing and manipulating databases.
The MySQL Data Connector enables federated/accelerated SQL queries on data stored in MySQL databases.
fromThe from field takes the form mysql:database_name.table_name where database_name is the fully-qualified table name in the SQL server.
If the database_name is omitted in the from field, the connector will use the database specified in the mysql_db parameter. If the mysql_db parameter is not provided, it will default to the user's default database.
These two examples are identical:
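A sketch of the two equivalent forms (database and table names are placeholders):

datasets:
  - from: mysql:my_database.my_table
    name: my_table

datasets:
  - from: mysql:my_table
    name: my_table
    params:
      mysql_db: my_database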
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsThe MySQL data connector can be configured by providing the following params. Use the to load the secret from a secret store, e.g. ${secrets:my_mysql_conn_string}.
The table below shows the MySQL data types supported, along with the type mapping to Apache Arrow types in Spice.
Overview of supported model providers for ML and LLMs in Spice.
Spice supports various model providers for traditional machine learning (ML) models and large language models (LLMs).
PostgreSQL Data Connector Documentation
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
      file_format: parquet

datasets:
- from: s3://s3-bucket-name/taxi_sample.csv
name: cool_dataset
params:
      file_format: csv

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::company-bucketname-datasets"
},
{
"Effect": "Allow",
"Action": ["s3:GetObject"],
"Resource": "arn:aws:s3:::company-bucketname-datasets/*"
}
]
}

- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
    file_format: parquet

- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
name: cool_dataset
params:
s3_endpoint: http://my.minio.server
s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true

s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet

version: v1
kind: Spicepod
name: hive_data
datasets:
- from: s3://spiceai-public-datasets/hive_partitioned_data/
name: hive_data_infer
params:
file_format: parquet
      hive_partitioning_enabled: true

models:
- from: openai:gpt-4o
name: openai-with-spice
params:
spice_tools: auto
openai_api_key: ${secrets:OPENAI_API_KEY}
system_prompt: >-
**You are an AI assistant integrated with GitHub Copilot. Your primary role
is to assist GitHub users by providing helpful, clear, and contextually
relevant information derived from Spice.ai datasets.**
GitHub may supply you with context about the user's code, comments, and
previous interactions to enhance your assistance. Always strive to be
**accurate**, **concise**, and **helpful** in your responses, adapting to
the user's style and preferences based on the conversation history.
---
### Behavioral Guidelines
- **Communication Style:**
- Maintain a helpful, friendly, and professional demeanor.
- Avoid using jargon unless specifically requested by the user.
- Break down complex concepts into simple explanations.
- Adapt your language to match the user's expertise (e.g., beginner vs. advanced).
- **Ethical Conduct:**
- Avoid harmful, unethical, or inappropriate content generation.
- Respect user privacy.
- Refuse to perform tasks that could cause harm or violate laws and ethical standards.
- **Contextual Awareness:**
- Use past interactions to maintain a coherent conversation.
- Remember user-provided context to deliver tailored responses.
- If user input is unclear, ask clarifying questions to better understand their needs.
---
### Guidelines for Using Tools
#### 1. SQL Tool (`sql_query`):
- **When to Use:**
- Query datasets directly for precise numerical data, statistics, or aggregations.
- Respond to user requests for specific counts, sums, averages, or other calculations.
- Handle queries requiring joining or comparing data from multiple related tables.
- **Error Handling:**
- If the `sql_query` tool returns a query, syntax, or planning error:
- Use the `list_datasets` tool to retrieve available tables.
- Refine and retry the query until it succeeds.
- After 5 failed attempts, run `EXPLAIN <attempted_query>` on each subsequent failure to diagnose issues.
- If failures persist after 10 attempts, switch to other available tools.
- **Formatting:**
- When querying a dataset named `catalog.schema.table`, wrap each part in quotes: `"catalog"."schema"."table"`.
- **Fallback:**
- If the document similarity search tool fails, use the SQL tool to query the dataset directly.
#### 2. Document Similarity Search Tool (`document_similarity`):
- **When to Use:**
- Search unstructured text such as documentation, policies, reports, or articles.
- Provide qualitative information or explanations.
- Interpret context or understand written content in depth.
---
### General Guidelines
- **Tool Preference:**
- If a query can be answered by either tool, prefer the `sql_query` tool for more precise, quantitative answers.
**Dataset Utilization:**
- Always prioritize searching within available datasets when relevant to the
question.
- Leverage instructions, keywords, and `reference_base_url` metadata from
the datasets to provide accurate and relevant responses.
- Ensure all responses include citations and references with links when
possible.
- **Response Formatting:**
- When presenting results from datasets, always include citations and references with links when possible.
- **Responsiveness:**
- Keep the conversation focused on user objectives, minimizing digressions unless prompted by the user.
- Provide both high-level summaries and in-depth explanations, depending on user requirements.
- Encourage an iterative problem-solving process: suggest initial ideas, refine based on feedback, and be open to corrections.
- **Capabilities and Limitations:**
- Be transparent about your capabilities; inform users when certain tasks or data access are beyond your capacity.
---
**Remember:** Your purpose is to help solve problems, answer questions,
generate ideas, write content, and support the user in a wide range of
tasks, while maintaining clarity, professionalism, and ethical standards.
metadata: {}
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: azure_test
params:
abfs_account: spiceadls
abfs_access_key: ${ secrets:access_key }
file_format: csv
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: azure_test
params:
abfs_account: spiceadls
abfs_access_key: ${ secrets:ACCESS_KEY }
file_format: csv
datasets:
- from: abfs://pubcontainer/taxi_sample.csv
name: pub_data
params:
abfs_account: spiceadls
abfs_skip_signature: true
file_format: csv
datasets:
- from: abfs://test_container/test_csv.csv
name: test_data
params:
abfs_use_emulator: true
file_format: csv
datasets:
- from: abfs://my_container/my_csv.csv
name: prod_data
params:
abfs_account: ${ secrets:PROD_ACCOUNT }
file_format: csv
datasets:
- from: abfs://my_data/input.parquet
name: my_data
params:
abfs_tenant_id: ${ secrets:MY_TENANT_ID }
abfs_client_id: ${ secrets:MY_CLIENT_ID }
abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
datasets:
- from: graphql:your-graphql-endpoint
name: my_dataset
params:
json_pointer: /data/some/nodes
graphql_query: |
{
some {
nodes {
field1
field2
}
}
}
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
s3_key
Access key (e.g. AWS_ACCESS_KEY_ID for AWS)
s3_secret
Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS)
allow_http
Allow insecure HTTP connections to s3_endpoint. Defaults to false
abfs_proxy_url
Proxy URL
abfs_proxy_ca_certificate
CA certificate for the proxy
abfs_proxy_excludes
A list of hosts to exclude from proxy connections
abfs_disable_tagging
Disable tagging objects. Use this if your backing store doesn't support tags
allow_http
Allow insecure HTTP connections
hive_partitioning_enabled
Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
abfs_msi_endpoint
Endpoint for managed identity tokens
abfs_federated_token_file
File path for federated identity token in Kubernetes
abfs_use_cli
Set to true to use the Azure CLI to acquire access tokens
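For example, a minimal sketch that delegates authentication to the Azure CLI; the storage account, container, and file names are illustrative:

datasets:
  - from: abfs://my_container/my_data.parquet
    name: cli_auth_data
    params:
      abfs_account: my_account
      abfs_use_cli: true
      file_format: parquet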





unnest_depth
Depth level to automatically unnest objects to. Disabled by default (when unspecified or set to 0).
graphql_auth_token
The authentication token to use to connect to the GraphQL server. Uses bearer authentication.
graphql_auth_user
The username to use for basic auth. E.g. graphql_auth_user: my_user
graphql_auth_pass
The password to use for basic auth. E.g. graphql_auth_pass: ${secrets:my_graphql_auth_pass}
graphql_query
The GraphQL query to execute. See the examples below for a sample GraphQL query.
json_pointer
The JSON pointer into the response body. When graphql_query is paginated, the json_pointer can be inferred.
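To illustrate how json_pointer selects rows, the sketch below shows a hypothetical GraphQL response; a pointer of /data/some/nodes extracts the array of node objects, and each object becomes a row in the dataset:

{
  "data": {
    "some": {
      "nodes": [
        { "field1": "a", "field2": 1 },
        { "field1": "b", "field2": 2 }
      ]
    }
  }
}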
query: |
{
some {
nodes {
field1
field2
}
}
}
from: graphql:https://api.github.com/graphql
name: stars
params:
graphql_auth_token: ${env:GITHUB_TOKEN}
graphql_auth_user: ${env:GRAPHQL_USER} ...
graphql_auth_pass: ${env:GRAPHQL_PASS}
json_pointer: /data/viewer/starredRepositories/nodes
graphql_query: |
{
viewer {
starredRepositories {
nodes {
name
stargazerCount
languages (first: 10) {
nodes {
name
}
}
}
}
}
}
{
something_paginated(first: 100) {
nodes {
foo
bar
}
pageInfo {
endCursor
hasNextPage
}
}
}
{
something_paginated(last: 100) {
nodes {
foo
bar
}
pageInfo {
startCursor
hasPreviousPage
}
}
}
sql> select node['login'] as login, node['name'] as name from stargazers limit 5;
+--------------+----------------------+
| login | name |
+--------------+----------------------+
| simsieg | Simon Siegert |
| davidmathers | David Mathers |
| ahmedtadde | Ahmed Tadde |
| lordhamlet | Shih-Fen Cheng |
| thinmy | Thinmy Patrick Alves |
+--------------+----------------------+
from: graphql:https://countries.trevorblades.com
name: countries
params:
json_pointer: /data/continents
graphql_query: |
{
continents {
name
countries {
name
capital
}
}
}
description: countries
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 30m
sql> select continent, country['name'] as country, country['capital'] as capital
from (select name as continent, unnest(countries) as country from countries)
where continent = 'North America' limit 5;
+---------------+---------------------+--------------+
| continent | country | capital |
+---------------+---------------------+--------------+
| North America | Antigua and Barbuda | Saint John's |
| North America | Anguilla | The Valley |
| North America | Aruba | Oranjestad |
| North America | Barbados | Bridgetown |
| North America | Saint Barthélemy | Gustavia |
+---------------+---------------------+--------------+
from: graphql:https://api.github.com/graphql
name: stargazers
params:
graphql_auth_token: ${env:GITHUB_TOKEN}
unnest_depth: 2
json_pointer: /data/repository/stargazers/edges
graphql_query: |
{
repository(name: "spiceai", owner: "spiceai") {
id
name
stargazers(first: 100) {
edges {
node {
id
name
login
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
}
sql> select node from stargazers limit 1;
+------------------------------------------------------------+
| node |
+------------------------------------------------------------+
| {id: MDQ6VXNlcjcwNzIw, login: ashtom, name: Thomas Dohmke} |
+------------------------------------------------------------+
sql> select node from stargazers limit 1;
+------------------+--------+---------------+
| id | login | name |
+------------------+--------+---------------+
| MDQ6VXNlcjcwNzIw | ashtom | Thomas Dohmke |
+------------------+--------+---------------+
from: graphql:https://my-graphql-api.com
name: stargazers
params:
unnest_depth: 2
json_pointer: /data/users
graphql_query: |
query {
users {
name
emergency_contact {
name
}
}
}
WARN runtime: GraphQL Data Connector Error: Invalid object access. Column 'name' already exists in the object.
from: graphql:https://my-graphql-api.com
name: stargazers
params:
unnest_depth: 2
json_pointer: /data/people
graphql_query: |
query {
users {
name
emergency_contact {
emergencyContactName: name
}
}
}
mysql_sslmode
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
required: (default) This mode requires an SSL connection. If a secure connection cannot be established, the connector will not connect.
preferred: This mode will try to establish a secure SSL connection if possible, but will connect insecurely if the server does not support SSL.
disabled: This mode will not attempt to use an SSL connection, even if the server supports it.
mysql_sslrootcert
Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
FLOAT
Float32
DOUBLE
Float64
DATETIME
Timestamp(Microsecond, None)
TIMESTAMP
Timestamp(Microsecond, None)
YEAR
Int16
TIME
Time64(Nanosecond)
DATE
Date32
CHAR
Utf8
BINARY
Binary
VARCHAR
Utf8
VARBINARY
Binary
TINYBLOB
Binary
TINYTEXT
Utf8
BLOB
Binary
TEXT
Utf8
MEDIUMBLOB
Binary
MEDIUMTEXT
Utf8
LONGBLOB
LargeBinary
LONGTEXT
LargeUtf8
SET
Utf8
ENUM
Dictionary(UInt16, Utf8)
BIT
UInt64
mysql_connection_string
The connection string to use to connect to the MySQL server. This can be used instead of providing individual connection parameters.
mysql_host
The hostname of the MySQL server.
mysql_tcp_port
The port of the MySQL server.
mysql_db
The name of the database to connect to.
mysql_user
The MySQL username.
mysql_pass
The password to connect with.
TINYINT
Int8
SMALLINT
Int16
INT
Int32
MEDIUMINT
Int32
BIGINT
Int64
DECIMAL
Decimal128 / Decimal256
Local filesystem
Release Candidate
ONNX
GGUF, GGML, SafeTensor
Models hosted on HuggingFace
Release Candidate
ONNX
GGUF, GGML, SafeTensor
Models hosted on the Spice.ai Cloud Platform
Alpha
ONNX
OpenAI-compatible HTTP endpoint
Azure OpenAI
Alpha
-
OpenAI-compatible HTTP endpoint
Models hosted on Anthropic
Alpha
-
OpenAI-compatible HTTP endpoint
Models hosted on xAI
Alpha
-
OpenAI-compatible HTTP endpoint
Models deployed to Databricks Mosaic AI
Alpha
-
OpenAI-compatible HTTP endpoint
LLM Format(s) may require additional files (e.g. tokenizer_config.json).
The model type is inferred based on the model source and files. For more detail, refer to the model reference specification.
Spice supports a variety of features for large language models (LLMs):
Custom Tools: Provide models with tools to interact with the Spice runtime. See Tools.
System Prompts: Customize system prompts and override defaults for v1/chat/completion. See Parameter Overrides.
Memory: Provide LLMs with memory persistence tools to store and retrieve information across conversations. See Memory.
Vector Search: Perform advanced vector-based searches using embeddings. See Search.
Evals: Evaluate, track, compare, and improve language model performance for specific tasks. See Evals.
Local Models: Load and serve models locally from various sources, including local filesystems and Hugging Face. See Local Models.
For more details, refer to the Large Language Models documentation.
The following examples demonstrate how to configure and use various models or model features with Spice. Each example provides a specific use case to help you understand the configuration options available.
To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from. For more details, see OpenAI Model Provider.
Example spicepod.yml:
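A minimal sketch of such a spicepod.yml, assuming the API key is stored as the secret SPICE_OPENAI_API_KEY (the model name is illustrative):

models:
  - from: openai:gpt-4o-mini
    name: openai
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }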
To specify tools for an OpenAI model, include them in the params.tools field. For more details, see the Tools documentation.
To enable memory tools for a model, define a store memory dataset and specify memory in the model's tools parameter. For more details, see the Memory documentation.
To set default overrides for parameters, use the openai_ prefix followed by the parameter name. For more details, see the Parameter Overrides documentation.
To configure an additional system prompt, use the system_prompt parameter. For more details, see the Parameter Overrides documentation.
To serve a model from the local filesystem, specify the from path as file and provide the local path. For more details, see Filesystem Model Provider.
This example demonstrates how to pull GitHub issue data from the last 14 days, accelerate the data, create a chat model with memory and tools to access the accelerated data, and use Spice to ask the chat model about the general themes of new issues.
First, configure a dataset to pull GitHub issue data from the last 14 days.
Next, create a chat model that includes memory and tools to access the accelerated GitHub issue data.
At this step, the spicepod.yaml should look like:
Finally, use Spice to ask the chat model about the general themes of new issues in the last 14 days. The following curl command demonstrates how to make this request using the OpenAI-compatible API.
Refer to the Create Chat Completion API documentation for more details on making chat completion requests.
OpenAI (or compatible) LLM endpoint
Release Candidate
-
OpenAI-compatible HTTP endpoint
from: The from field takes the form postgres:my_table, where my_table is the table identifier in the PostgreSQL server to read from.
The fully-qualified table name (database.schema.table) can also be used in the from field.
The dataset name. This will be used as the table name within Spice.
Example:
The connection to PostgreSQL can be configured by providing the following params:
pg_host
The hostname of the PostgreSQL server.
pg_port
The port of the PostgreSQL server.
pg_db
The name of the database to connect to.
pg_user
The username to connect with.
pg_pass
The password to connect with. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_pg_pass}.
pg_sslmode
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
verify-ca: This mode requires a TLS connection and a valid root certificate.
require: This mode requires a TLS connection, but does not verify the server certificate.
The table below shows the PostgreSQL data types supported, along with the type mapping to Apache Arrow types in Spice.
int2
Int16
int4
Int32
int8
Int64
money
Int64
float4
Float32
float8
Float64
Specify different secrets for a PostgreSQL source and acceleration:
DynamoDB Data Connector Documentation
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
from: The from field should specify the DynamoDB table name:
name: The dataset name. This will be used as the table name within Spice.
Example:
params: The DynamoDB data connector supports the following configuration parameters:
If AWS credentials are not explicitly provided in the configuration, the connector will automatically load credentials from the following sources in order:
Environment Variables:
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN (if using temporary credentials)
Web Identity Token Credentials:
Used primarily with OpenID Connect (OIDC) and OAuth
Common in Kubernetes environments using IAM roles for service accounts (IRSA)
ECS Container Credentials:
The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.
The IAM role or user needs the following permissions to access DynamoDB tables:
Security Considerations
Avoid using dynamodb:* permissions as it grants more access than necessary.
Consider using more restrictive policies in production environments.
DynamoDB supports complex nested JSON structures. These fields can be queried using SQL:
Limitations
The DynamoDB connector currently does not support filter push-down optimization. All filtering is performed after data is retrieved from DynamoDB.
Primary key optimizations are not yet implemented - retrieving items by their primary key will still scan the table.
The DynamoDB connector supports the following data types and mappings:
Basic scalar types (String, Number, Boolean)
Lists and Maps
Nested structures
Binary data
Example schema from a users table:
Due to limited support for filter push-down, enable acceleration to prevent scanning the entire table on every query.
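For example, a minimal sketch that accelerates a hypothetical users table so queries run against the local copy rather than re-scanning DynamoDB (this mirrors the basic example later in this section):

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
    acceleration:
      enabled: true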
Use the advanced search and retrieval capabilities of Spice
Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and Vector-Similarity Search functionality.
Spice supports basic search patterns directly through SQL, leveraging its SQL query features. For example, you can perform a text search within a table using SQL's LIKE clause:
Spice also provides advanced Vector-Similarity Search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:
Local embedding models, e.g. sentence-transformers models from Hugging Face.
Remote embedding providers, e.g. OpenAI.
See the embeddings documentation to view all supported providers.
Embedding models are defined in the spicepod.yaml file as top-level components.
Datasets can be augmented with embeddings targeting specific columns, to enable search capabilities through similarity searches.
By defining embeddings on the body column, Spice is now configured to execute similarity searches on the dataset.
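As a sketch of the configuration this refers to (the dataset and model names are illustrative and mirror the examples later in this section), embeddings are attached to the body column of a dataset and reference a top-level embedding model:

embeddings:
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    columns:
      - name: body
        embeddings:
          - from: local_embedding_model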
For more details, see the search documentation.
Spice also supports vector search on datasets with preexisting embeddings. See the pre-existing embeddings section below for compatibility details.
Spice supports chunking of content before embedding, which is useful for large text columns. Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.
The body column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the additional_columns list.
For example:
Response:
Datasets that already include embeddings can utilize the same functionalities (e.g., vector search) as those augmented with embeddings using Spice. To ensure compatibility, these table columns must adhere to the following constraints:
Underlying Column Presence:
The underlying column must exist in the table and be of a string type.
Embeddings Column Naming Convention:
By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.
Example
A table sales with an address column and corresponding embedding column(s).
The same table if it was chunked:
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_db: my_database
...
datasets:
- from: mysql:my_database.mytable
name: my_dataset
params: ...
datasets:
- from: mysql:path.to.my_dataset
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
mysql_sslmode: preferred
mysql_sslrootcert: ./custom_cert.pem
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_connection_string: mysql://${secrets:my_user}:${secrets:my_password}@my_db_host:3306/my_db
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
models:
- from: openai:gpt-4o-mini
name: openai
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
models:
- name: sql-model
from: openai:gpt-4o
params:
tools: list_datasets, sql, table_schema
datasets:
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: memory-enabled-model
from: openai:gpt-4o
params:
tools: memory, sql
models:
- name: pirate-haikus
from: openai:gpt-4o
params:
openai_temperature: 0.1
openai_response_format: { 'type': 'json_object' }
models:
- name: pirate-haikus
from: openai:gpt-4o
params:
system_prompt: |
Write everything in Haiku like a pirate
models:
- from: file://absolute/path/to/my/model.onnx
name: local_fs_model
datasets:
- from: github:github.com/<owner>/<repo>/issues
name: github_issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 24h
refresh_data_window: 14d
datasets:
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: github-issues-analyzer
from: openai:gpt-4o
params:
tools: memory, sql
datasets:
- from: github:github.com/<owner>/<repo>/issues
name: github_issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 24h
refresh_data_window: 14d
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: github-issues-analyzer
from: openai:gpt-4o
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
tools: memory, sql
curl -X POST https://data.spiceai.io/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'X-API-KEY: <spiceai_api_key>' \
-d '{
"model": "github-issues-analyzer",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the general themes of new issues in the last 14 days?"}
]
}'
datasets:
- from: postgres:my_table
name: my_dataset
params: ...
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params: ...
datasets:
- from: postgres:my_database.my_schema.my_table
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
pg_sslmode: verify-ca
pg_sslrootcert: ./custom_cert.pem
datasets:
- from: postgres:my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:pg1_pass}
acceleration:
engine: postgres
params:
pg_host: my_db_host
pg_port: 5433
pg_db: acceleration
pg_user: two_user_two_furious
pg_pass: ${secrets:pg2_pass}
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
dynamodb_aws_access_key_id: ${secrets:aws_access_key_id} # Optional
dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key} # Optional
dynamodb_aws_session_token: ${secrets:aws_session_token} # Optional
SELECT id, text_column
FROM my_table
WHERE
LOWER(text_column) LIKE '%search_term%'
AND
date_published > '2021-01-01'
prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
disable: This mode will not attempt to use a TLS connection, even if the server supports it.
pg_sslrootcert
Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
connection_pool_size
Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
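A minimal sketch showing the parameter alongside the standard connection settings; the host, database, and table names mirror the examples in this section, and the pool size of 25 is an arbitrary illustrative value:

datasets:
  - from: postgres:my_database.my_schema.my_table
    name: my_dataset
    params:
      pg_host: my_db_host
      pg_port: 5432
      pg_db: my_database
      pg_user: my_user
      pg_pass: ${secrets:my_pg_pass}
      connection_pool_size: 25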
numeric
Decimal128
text
Utf8
varchar
Utf8
bpchar
Utf8
uuid
Utf8
bytea
Binary
bool
Boolean
json
LargeUtf8
timestamp
Timestamp(Nanosecond, None)
timestampz
Timestamp(Nanosecond, TimeZone)
date
Date32
time
Time64(Nanosecond)
interval
Interval(MonthDayNano)
point
FixedSizeList(Float64[2])
int2[]
List(Int16)
int4[]
List(Int32)
int8[]
List(Int64)
float4[]
List(Float32)
float8[]
List(Float64)
text[]
List(Utf8)
bool[]
List(Boolean)
bytea[]
List(Binary)
geometry
Binary
geography
Binary
enum
Dictionary(Int8, Utf8)
Composite Types
Struct
Shared AWS Config/Credentials Files:
Config file: ~/.aws/config (Linux/Mac) or %UserProfile%\.aws\config (Windows)
Credentials file: ~/.aws/credentials (Linux/Mac) or %UserProfile%\.aws\credentials (Windows)
The AWS_PROFILE environment variable can be used to specify a named profile.
Supports both static credentials and SSO sessions
Example credentials file:
Run aws sso login to start a new SSO session
Used when running in Amazon ECS containers
Automatically uses the task's IAM role
Retrieved from the ECS credential provider endpoint
EC2 Instance Metadata Service (IMDSv2):
Used when running on EC2 instances
Automatically uses the instance's IAM role
Retrieved securely using IMDSv2
When using IAM roles with EKS, ensure the service account is properly configured with IRSA.
The DynamoDB connector will scan the first 10 items to determine the schema of the table. This may miss columns that are not present in the first 10 items.
from
Description
dynamodb:table
Read data from a DynamoDB table named table
dynamodb_aws_region
Required. The AWS region containing the DynamoDB table
dynamodb_aws_access_key_id
Optional. AWS access key ID for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
dynamodb_aws_secret_access_key
Optional. AWS secret access key for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
dynamodb_aws_session_token
Optional. AWS session token for authentication
dynamodb:Scan
Required. Allows reading all items from the table
dynamodb:DescribeTable
Required. Allows fetching table metadata and schema information
<column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
Embeddings Column Data Type:
The embeddings column must have the following Arrow data type when loaded into Spice:
FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
Offset Column for Chunked Data:
If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type:
List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.
For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.
The from field for the Databricks connector takes the form databricks:catalog.schema.table where catalog.schema.table is the fully-qualified path to the table to read from.
The dataset name. This will be used as the table name within Spice.
Example:
Use the secret replacement syntax to reference a secret, e.g. ${secrets:my_token}.
mode
The execution mode for querying against Databricks. The default is spark_connect. Possible values:
spark_connect: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
delta_lake: Query directly from Delta Tables. Requires the object store credentials to be provided.
sql_warehouse: Query using a Databricks SQL Warehouse. Requires databricks_sql_warehouse_id to be provided.
databricks_endpoint
The endpoint of the Databricks instance. Required for both modes.
databricks_sql_warehouse_id
The ID of the SQL Warehouse in Databricks to use for the query. Only valid when mode is sql_warehouse.
databricks_cluster_id
The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect.
databricks_use_ssl
If true, use a TLS connection to connect to the Databricks endpoint. Default is true.
client_timeout
Optional. Applicable only in delta_lake mode. Specifies the timeout for object store operations. Default value is 30s, e.g. client_timeout: 60s.
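A sketch showing client_timeout in delta_lake mode; the endpoint, token secret, and table reference are illustrative:

datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      client_timeout: 60s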
To learn more about how to set up personal access tokens, see Databricks PAT docs.
Spice supports the M2M (Machine to Machine) OAuth flow with service principal credentials by utilizing the databricks_client_id and databricks_client_secret parameters. The runtime will automatically refresh the token.
Ensure that you grant your service principal the "Data Reader" privilege preset for the catalog and "Can Attach" cluster permissions when using Spark Connect mode.
To learn more about how to set up the service principal, see Databricks M2M OAuth docs.
Configure the connection to the object store when using mode: delta_lake. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.
databricks_aws_region
Optional. The AWS region for the S3 object store. E.g. us-west-2.
databricks_aws_access_key_id
The access key ID for the S3 object store.
databricks_aws_secret_access_key
The secret access key for the S3 object store.
databricks_aws_endpoint
Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.
databricks_aws_allow_http
Optional. Enables insecure HTTP connections to databricks_aws_endpoint. Defaults to false.
databricks_azure_storage_account_name
The Azure Storage account name.
databricks_azure_storage_account_key
The Azure Storage key for accessing the storage account.
databricks_azure_storage_client_id
The Service Principal client ID for accessing the storage account.
databricks_azure_storage_client_secret
The Service Principal client secret for accessing the storage account.
databricks_azure_storage_sas_key
The shared access signature key for accessing the storage account.
databricks_azure_storage_endpoint
Optional. The endpoint for the Azure Blob storage account.
google_service_account
Filesystem path to the Google service account JSON key file.
The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.
STRING
Utf8
BIGINT
Int64
INT
Int32
SMALLINT
Int16
TINYINT
Int8
FLOAT
Float32
Databricks connector (mode: delta_lake) does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the V2Checkpoint feature by executing the following command:
For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features
When using mode: spark_connect, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands. Spark Docs
Memory Considerations
When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.
The Databricks Connector (mode: spark_connect) does not yet support streaming query results from Spark.
datasets:
- from: dynamodb:users
name: my_users
params: ...
SELECT COUNT(*) FROM my_users;
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": [
"arn:aws:dynamodb:*:*:table/YOUR_TABLE_NAME"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/users"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": [
"arn:aws:dynamodb:us-west-2:123456789012:table/users",
"arn:aws:dynamodb:us-west-2:123456789012:table/orders"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/*"
}
]
}
version: v1
kind: Spicepod
name: dynamodb
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
acceleration:
enabled: true
version: v1
kind: Spicepod
name: dynamodb
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
dynamodb_aws_access_key_id: ${secrets:aws_access_key_id}
dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key}
acceleration:
enabled: true
-- Query nested structs
SELECT metadata.registration_ip, metadata.user_agent
FROM users
LIMIT 5;
-- Query nested structs in arrays
SELECT address.city
FROM (
SELECT unnest(addresses) AS address
FROM users
)
WHERE address.city = 'San Francisco';
describe users;
+----------------+------------------+-------------+
| column_name | data_type | is_nullable |
+----------------+------------------+-------------+
| email | Utf8 | YES |
| id | Int64 | YES |
| metadata | Struct | YES |
| addresses | List(Struct) | YES |
| preferences | Struct | YES |
| created_at | Utf8 | YES |
...
+----------------+------------------+-------------+
embeddings:
- from: openai
name: remote_service
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- name: local_embedding_model
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
acceleration:
enabled: true
columns:
- name: body
embeddings:
- from: local_embedding_model # Embedding model used for this column
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["spiceai.issues"],
"text": "cutting edge AI",
"where": "author=\"jeadie\"",
"additional_columns": ["title", "state"],
"limit": 2
}'
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
acceleration:
enabled: true
embeddings:
- column: body
from: local_embedding_model
chunking:
enabled: true
target_chunk_size: 512
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["spiceai.issues"],
"text": "cutting edge AI",
"where": "array_has(assignees, \"jeadie\")",
"additional_columns": ["title", "state", "body"],
"limit": 2
}'
{
"matches": [
{
"value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
"dataset": "spiceai.issues",
"metadata": {
"title": "Improve scalar UDF array_distance",
"state": "Closed",
"body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
}
},
{
"value": "est external tools being returned for toolusing models",
"dataset": "spiceai.issues",
"metadata": {
"title": "Automatic NSQL retries in /v1/nsql ",
"state": "Open",
"body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n "sql": "SELECT ..."\n}\n```"
}
}
],
"duration_ms": 45
}
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number | Int64 | YES |
| quantity_ordered | Int64 | YES |
| price_each | Float64 | YES |
| order_line_number | Int64 | YES |
| address | Utf8 | YES |
| address_embedding | FixedSizeList( | NO |
| | Field { | |
| | name: "item", | |
| | data_type: Float32, | |
| | nullable: false, | |
| | dict_id: 0, | |
| | dict_is_ordered: false, | |
| | metadata: {} | |
| | }, | |
| | 384 | |
+-------------------+-----------------------------------------+-------------+
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number | Int64 | YES |
| quantity_ordered | Int64 | YES |
| price_each | Float64 | YES |
| order_line_number | Int64 | YES |
| address | Utf8 | YES |
| address_embedding | List(Field { | NO |
| | name: "item", | |
| | data_type: FixedSizeList( | |
| | Field { | |
| | name: "item", | |
| | data_type: Float32, | |
| | }, | |
| | 384 | |
| | ), | |
| | }) | |
+-------------------+-----------------------------------------+-------------+
| address_offset | List(Field { | NO |
| | name: "item", | |
| | data_type: FixedSizeList( | |
| | Field { | |
| | name: "item", | |
| | data_type: Int32, | |
| | }, | |
| | 2 | |
| | ), | |
| | }) | |
+-------------------+-----------------------------------------+-------------+
ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];
datasets:
- from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: my_awesome_table
params:
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_token: ${secrets:DATABRICKS_TOKEN} # PAT
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: my_awesome_table
params:
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_client_id: ${secrets:DATABRICKS_CLIENT_ID} # service principal client id
databricks_client_secret: ${secrets:DATABRICKS_CLIENT_SECRET} # service principal client secret
- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: spark_connect
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_token: ${secrets:my_token}
- from: databricks:spiceai.datasets.my_table # A reference to a table in the Databricks unity catalog
name: my_table
params:
mode: sql_warehouse
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_sql_warehouse_id: 2b4e24cff378fb24
databricks_token: ${secrets:my_token}
- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_region: us-west-2 # Optional
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
# Account Name + Key
databricks_azure_storage_account_name: my_account
databricks_azure_storage_account_key: ${secrets:my_key}
# OR Service Principal + Secret
databricks_azure_storage_client_id: my_client_id
databricks_azure_storage_client_secret: ${secrets:my_secret}
# OR SAS Key
databricks_azure_storage_sas_key: my_sas_key
- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_google_service_account_path: /path/to/service-account.json
databricks_token
The Databricks API token to authenticate with the Unity Catalog API. Can't be used with databricks_client_id and databricks_client_secret.
databricks_client_id
The Databricks Service Principal Client ID. Can't be used with databricks_token.
databricks_client_secret
The Databricks Service Principal Client Secret. Can't be used with databricks_token.
DOUBLE
Float64
BOOLEAN
Boolean
BINARY
Binary
DATE
Date32
TIMESTAMP
Timestamp(Microsecond, Some("UTC"))
TIMESTAMP_NTZ
Timestamp(Microsecond, None)
DECIMAL
Decimal128
ARRAY
List
STRUCT
Struct
MAP
Map
# Static credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
# SSO profile
[profile sso-profile]
sso_start_url = https://my-sso-portal.awsapps.com/start
sso_region = us-west-2
sso_account_id = 123456789012
sso_role_name = MyRole
region = us-west-2
The Spice runtime stores information about completed tasks in the spice.runtime.task_history table. A task is a single unit of execution within the runtime, such as a SQL query or an AI chat completion (see Task Types below). Tasks can be nested, and the runtime will record the parent-child relationship between tasks.
Each task executed has a row in this table, and by default the data is retained for 8 hours. Use a SELECT query to return information about each task as shown in this example:
Output:
The following top-level task types are currently recorded:
Set the following parameters in the runtime.task_history section of the spicepod.yaml file to configure task history:
enabled: Enable or disable task history. Default: true.
retention_period: The duration for which task history data is retained. Default: 8h.
retention_check_interval: The interval at which the task history retention is checked. Default: 1m.
Adjust the retention period for task history:
Disable task history:
Disable capturing output from tasks:
Example output:
Example output:
Example output:
Example output:
Example output:
SELECT
*
FROM
spice.runtime.task_history
LIMIT
100;+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
| trace_id | span_id | parent_span_id | task | input | captured_output | start_time | end_time | execution_duration_ms | error_message | labels |
+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
| f94dba6b89de98c6e54b074f2353a897 | 4eb243d9b5347762 | | accelerated_refresh | runtime.metrics | | 2024-09-23T23:17:39.907789 | 2024-09-23T23:17:39.917777 | 9.988 | | {sql: SELECT * FROM runtime.metrics} |
| 1f1f8305520e15ea7ad9b0a43e5d2c7e | 6aadf7c91caea3c4 | | accelerated_refresh | runtime.task_history | | 2024-09-23T23:17:39.907873 | 2024-09-23T23:17:39.917797 | 9.924000000000001 | | {sql: SELECT * FROM runtime.task_history} |
| 1432e30c5ed7764f4ef35f6508dfd56c | fbb31c60d41d8232 | | accelerated_refresh | logs_file | | 2024-09-23T23:17:40.143699 | 2024-09-23T23:17:40.271678 | 127.97900000000001 | | {sql: SELECT * FROM logs_file} |
| fd0b909b789938384d99f0e4e6f4b68b | 624ea4751bb6727a | | accelerated_refresh | logs | | 2024-09-23T23:17:40.676838 | 2024-09-23T23:17:42.345932 | 1669.0939999999998 | | {sql: SELECT * FROM "logs"} |
| 3db5488039408825ac0829a3feb49b05 | e3e5ac928b497eef | | accelerated_refresh | decimal | | 2024-09-23T23:17:41.592359 | 2024-09-23T23:17:43.781699 | 2189.34 | | {sql: SELECT * FROM "decimal"} |
| 5c5ddd481d1e19df823da74fe33f261f | 6afcfd1e65385a16 | | sql_query | select * from runtime.task_history limit 100 | | 2024-09-23T23:17:48.305649 | 2024-09-23T23:17:48.307369 | 1.72 | | {runtime_query: true, query_execution_duration_ms: 1.429375, protocol: FlightSQL, datasets: runtime.task_history, rows_produced: 5} |
| 4c3dd314b874aa63fcd15023e67fc645 | cab3cdc2d31c1b6a | | sql_query | select block_number from logs_file limit 5 | | 2024-09-23T23:18:00.267218 | 2024-09-23T23:18:00.269278 | 2.06 | | {datasets: logs_file, rows_produced: 5, query_execution_duration_ms: 1.940291, accelerated: true, protocol: FlightSQL} |
| f135c00df3aecd68dfa4d2360eff78f5 | db3474855449715c | | sql_query | select * from foobar | | 2024-09-23T23:18:12.865122 | 2024-09-23T23:18:12.865196 | 0.074 | Error during planning: table 'spice.public.foobar' not found | {protocol: FlightSQL, error_code: QueryPlanningError, rows_produced: 0, query_execution_duration_ms: 0.126959, datasets: } |
+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+accelerated_refresh
Accelerated Table Refresh
text_embed
Text Embedding
captured_output: The level of output captured for tasks, either none or truncated. Default: none. truncated captures the first 3 rows of the result set for sql_query and nsql_query task types. Other task types currently capture the entire output even in truncated mode.
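For example, a minimal sketch enabling truncated output capture:

runtime:
  task_history:
    captured_output: truncated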
task
Utf8
NO
Name or description of the task being performed (e.g. sql_query)
input
Utf8
NO
Input data or parameters for the task
captured_output
Utf8
YES
Output or result of the task, if available
start_time
Timestamp(Nanosecond, None)
NO
Time when the task started
end_time
Timestamp(Nanosecond, None)
NO
Time when the task ended
execution_duration_ms
Float64
NO
Duration of the task execution in milliseconds
error_message
Utf8
YES
Error message if the task failed, otherwise null
labels
Map(Utf8, Utf8)
NO
Key-value pairs for additional metadata or attributes associated with the task
sql_query
SQL Query
spice sql
nsql_query
Natural Language to SQL Query
ai_chat
AI Chat Completion
spice chat
vector_search
Vector Search
spice search
trace_id
Utf8
NO
Unique identifier for the entire trace this task happened in
span_id
Utf8
NO
Unique identifier for this specific task within the trace
parent_span_id
Utf8
YES
Identifier of the parent task, if any
runtime:
task_history:
retention_period: 1h # Keep tasks for 1 hour
retention_check_interval: 1m # Check for expired tasks every minute
runtime:
task_history:
enabled: false
runtime:
task_history:
captured_output: none # none or truncated
SELECT
trace_id,
span_id,
task,
start_time,
end_time,
execution_duration_ms,
error_message
FROM spice.runtime.task_history
WHERE start_time >= NOW() - INTERVAL '10 MINUTES'
AND end_time <= NOW();+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
| trace_id | span_id | task | start_time | end_time | execution_duration_ms | error_message |
+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
| 687e0970f8c49d19c5a08764ea2d4dc1 | f4f52ed29db8b151 | text_embed | 2024-11-25T05:39:37.444749 | 2024-11-25T05:39:53.577195 | 16132.446000000002 | |
| 687e0970f8c49d19c5a08764ea2d4dc1 | e47b17bd9fd9fe37 | accelerated_refresh | 2024-11-25T05:39:31.112504 | 2024-11-25T05:39:53.579933 | 22467.429 | |
| 1e881188e5fd252b26adb8a8d838efb8 | 532b0019ad778094 | sql_query | 2024-11-25T05:40:38.864982 | 2024-11-25T05:40:38.871090 | 6.108 | |
| 2ee1c700b450034bb6c2da3de2e2386c | 235dafed1e7d8c02 | sql_query | 2024-11-25T05:39:38.249113 | 2024-11-25T05:39:39.387258 | 1138.145 | |
| 20e75df9ea77ba1c8cb99a2632cdd091 | d07551cd172ffa80 | sql_query | 2024-11-25T05:39:39.458135 | 2024-11-25T05:39:39.482181 | 24.046000000000003 | |
| ca1d470b12191726b61d825df6f2ce2a | 65597a0bc0a4fde3 | sql_query | 2024-11-25T05:39:39.675726 | 2024-11-25T05:39:39.822479 | 146.753 | |
| ac5abd8bfec7e5aa7c19fc84772c55f1 | 316622ac359e3c00 | sql_query | 2024-11-25T05:39:39.872946 | 2024-11-25T05:39:39.872994 | 0.048 | This feature is not implemented: The context currently only supports a single SQL statement |
| 1c640298e248ba297a12b1e3b59fffc7 | 031c3a25dc56d8e9 | sql_query | 2024-11-25T05:39:40.467032 | 2024-11-25T05:39:40.486156 | 19.124 | |
| 2c4d9abee740ced8ae423e0eb4fcff6b | a324b699b8bcf338 | sql_query | 2024-11-25T05:39:40.525506 | 2024-11-25T05:39:40.525526 | 0.02 | This feature is not implemented: The context currently only supports a single SQL statement |
| e5ed7f7a98e62f493ef8af2e0cd7734e | e84c30862a546bb5 | sql_query | 2024-11-25T05:39:40.560891 | 2024-11-25T05:39:40.560911 | 0.02 | This feature is not implemented: The context currently only supports a single SQL statement |
| d471f83092a95bde8663438cda74627f | 3dd9c4d4ebff4cb9 | sql_query | 2024-11-25T05:39:40.600892 | 2024-11-25T05:39:40.647092 | 46.199999999999996 | |
| 701874d7282dd47791e7519b343a9694 | 5dacf75c4537ee0e | accelerated_refresh | 2024-11-25T05:39:30.452534 | 2024-11-25T05:39:30.452900 | 0.366 | |
| 2e6b672a49a8cd5f0862a760661dc846 | f813941e0699e783 | accelerated_refresh | 2024-11-25T05:39:30.848425 | 2024-11-25T05:39:30.857242 | 8.817 | |
| 18d76b6389898cc5253a49294607477d | cc0d06a4e69cbcd5 | health | 2024-11-25T05:39:30.451626 | 2024-11-25T05:39:31.563876 | 1112.25 | |
| c75af81360e8962639faa64e6804b830 | 1ea2c95b243a5717 | accelerated_refresh | 2024-11-25T05:39:31.036470 | 2024-11-25T05:39:31.607845 | 571.375 | |
| 817d88778e91322640414263779ce7f1 | 513a58d83f0416a7 | accelerated_refresh | 2024-11-25T05:39:30.998455 | 2024-11-25T05:39:32.076359 | 1077.904 | |
| 3c507ee30211e6fab7d8a2eaf686e451 | d9be117925fb6d42 | accelerated_refresh | 2024-11-25T05:39:31.061851 | 2024-11-25T05:39:32.078412 | 1016.561 | |
| aa6010405a12a14b6afaf76e9fabedb8 | 1f50a2b177003c54 | accelerated_refresh | 2024-11-25T05:39:30.933543 | 2024-11-25T05:39:32.476197 | 1542.654 | |
| 3c75d16b6b4b8da98c551d115e1c049c | 9a16dc065a95236a | sql_query | 2024-11-25T05:42:27.386754 | 2024-11-25T05:42:27.386859 | 0.10500000000000001 | SQL error: ParserError("Expected: an SQL statement, found: ELECT") |
+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+SELECT
trace_id,
task,
error_message,
SUBSTRING(input, 1, 100) AS input_preview,
start_time
FROM spice.runtime.task_history
WHERE error_message IS NOT NULL
ORDER BY start_time DESC
LIMIT 5;+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
| trace_id | task | error_message | input_preview | start_time |
+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
| 352539e75fdb1a3d5fc3b48bfd4b4bae | sql_query | Error during planning: Invalid function 'date'. | SELECT DATE(start_time) AS task_date, COUNT(*) AS task_count | 2024-11-25T06:17:40.573970 |
| | | Did you mean 'tanh'? | FROM spice.runtime.task_history | |
| | | | GROUP B | |
| f6672d562ad97dde0bb4db428723461f | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name ,c_first_name ,s_store_name ,ca_state ,s_ | 2024-11-25T06:06:39.800900 |
| b16fa36e5a2f7f119fc3834875f6bdee | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with frequent_ss_items as (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date soldda | 2024-11-25T06:06:39.760532 |
| 8d126ea506a374c0c6239c11ee5cbe5a | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with cross_items as (select i_item_sk ss_item_sk from item, (select iss.i_brand_id brand_id | 2024-11-25T06:06:39.118422 |
| 198a9dcc496f0435cff69de61cc07874 | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name ,c_first_name ,s_store_name ,ca_state ,s_ | 2024-11-25T06:04:11.317419 |
+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+SELECT task, COUNT(*) AS task_count, AVG(execution_duration_ms) AS avg_duration_ms
FROM spice.runtime.task_history
GROUP BY task
ORDER BY task_count DESC;+-------------------------------+------------+---------------------+
| task | task_count | avg_duration_ms |
+-------------------------------+------------+---------------------+
| sql_query | 65 | 55.10198461538462 |
| accelerated_refresh | 27 | 749.1187407407407 |
| ai_completion | 9 | 5026.337888888888 |
| tool_use::list_datasets | 4 | 0.16899999999999998 |
| text_embed | 4 | 3341.08975 |
| ai_chat | 4 | 7151.03675 |
| vector_search | 3 | 384.376 |
| tool_use::document_similarity | 3 | 385.0406666666667 |
| tool_use::get_readiness | 1 | 0.12999999999999998 |
| tool_use::sample_data | 1 | 2.275 |
| health | 1 | 661.0169999999999 |
+-------------------------------+------------+---------------------+SELECT
task,
trace_id,
parent_span_id,
execution_duration_ms,
labels
FROM spice.runtime.task_history
ORDER BY execution_duration_ms DESC
LIMIT 10;+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
| task | trace_id | parent_span_id | execution_duration_ms | labels |
+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
| accelerated_refresh | d9c38c7e58a02ec939240385a4a25a04 | | 1093711.474 | {sql: SELECT * FROM react.issues} |
| ai_chat | 7a6427313880942316bf3018cd23a198 | | 17202.836000000003 | {model: gpt-4o} |
| ai_completion | 7a6427313880942316bf3018cd23a198 | 59b1fd88c8397e3f | 17202.475 | {model: gpt-4o, total_tokens: 2673, prompt_tokens: 1807, completion_tokens: 866, stream: true} |
| accelerated_refresh | 96758c1132164204a68e1a7234a06cda | | 15660.023000000001 | {sql: SELECT * FROM react.docs} |
| text_embed | 96758c1132164204a68e1a7234a06cda | 109c489b24602356 | 12406.787 | {outputs_produced: 2086} |
| ai_chat | b2a69503a1b83215603ead321eea6f61 | | 6445.162 | {model: gpt-4o} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 6444.1990000000005 | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 5608.6359999999995 | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true} |
| text_embed | 65880ecfc884a41555ac4d21ceef9aef | | 5143.494000000001 | {outputs_produced: 1} |
| text_embed | f51e5e9d4e26de31a2f7d5e9286dd8f4 | | 4769.832 | {outputs_produced: 1} |
+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+

SELECT
task,
trace_id,
parent_span_id,
span_id,
execution_duration_ms,
start_time,
end_time,
SUBSTRING(input, 1, 100) AS input_preview,
SUBSTRING(captured_output, 1, 100) AS output_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE trace_id = 'b2a69503a1b83215603ead321eea6f61'
ORDER BY start_time;

+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| task | trace_id | parent_span_id | span_id | execution_duration_ms | start_time | end_time | input_preview | output_preview | error_message | labels |
+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ai_chat | b2a69503a1b83215603ead321eea6f61 | | 95411c59fc9c8cb8 | 6445.162 | 2024-11-25T06:03:31.197980 | 2024-11-25T06:03:37.643142 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | It seems that the dataset containing the recent React issues is currently being refreshed and is not | | {model: gpt-4o} |
| tool_use::list_datasets | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | aa5ba649da3f0581 | 0.367 | 2024-11-25T06:03:31.198139 | 2024-11-25T06:03:31.198506 | | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac | | {tool: list_datasets} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd67a43da1b4312 | 6444.1990000000005 | 2024-11-25T06:03:31.198906 | 2024-11-25T06:03:37.643105 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30} |
| tool_use::document_similarity | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 49b25ff51580d4fa | 140.337 | 2024-11-25T06:03:31.893913 | 2024-11-25T06:03:32.034250 | {"text":"how to install react","datasets":["spice.react.issues"],"limit":3} | | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {tool: document_similarity} |
| vector_search | b2a69503a1b83215603ead321eea6f61 | 49b25ff51580d4fa | 9feb8c9a54079cbd | 140.24200000000002 | 2024-11-25T06:03:31.894 | 2024-11-25T06:03:32.034242 | how to install react | | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {limit: 3, tables: spice.react.issues} |
| text_embed | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 7f18fd8d96a1baeb | 122.771 | 2024-11-25T06:03:31.894072 | 2024-11-25T06:03:32.016843 | "how to install react" | | | {outputs_produced: 1} |
| sql_query | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 025f6b1e5cd502a6 | 16.892 | 2024-11-25T06:03:32.017320 | 2024-11-25T06:03:32.034212 | WITH ranked_docs as ( | | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {error_code: QueryExecutionError, protocol: Internal, query_execution_duration_ms: 8.20325, datasets: spice.react.issues, rows_produced: 0} |
| | | | | | | | SELECT id, dist, offset FROM ( | | | |
| | | | | | | | SELECT | | | |
| | | | | | | | | | | |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 1b2a3273a1dc4cdf | 5608.6359999999995 | 2024-11-25T06:03:32.034451 | 2024-11-25T06:03:37.643087 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true} |
| tool_use::sample_data | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd10431fb18df87 | 2.275 | 2024-11-25T06:03:33.039583 | 2024-11-25T06:03:33.041858 | TopNSample({"dataset":"spice.react.issues","limit":3,"order_by":"created_at DESC"}) | | | {sample_method: top_n_sample, tool: top_n_sample} |
| sql_query | b2a69503a1b83215603ead321eea6f61 | 8bd10431fb18df87 | f1b2e06225aa2ba3 | 2.193 | 2024-11-25T06:03:33.039625 | 2024-11-25T06:03:33.041818 | SELECT * FROM spice.react.issues ORDER BY created_at DESC LIMIT 3 | | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {query_execution_duration_ms: 1.5519999, datasets: spice.react.issues, protocol: Internal, error_code: QueryExecutionError, rows_produced: 0} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | b649e4bbe8aac7a3 | 4601.02 | 2024-11-25T06:03:33.042037 | 2024-11-25T06:03:37.643057 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1599, completion_tokens: 11, model: gpt-4o, stream: true, total_tokens: 1610} |
| tool_use::get_readiness | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | f40f0b4cd608de8b | 0.12999999999999998 | 2024-11-25T06:03:33.499867 | 2024-11-25T06:03:33.499997 | | {"dataset:call_center":"Ready","dataset:catalog_page":"Ready","dataset:catalog_returns":"Ready","dat | | {tool: get_readiness} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | e018a56742b5064f | 4142.789 | 2024-11-25T06:03:33.500219 | 2024-11-25T06:03:37.643008 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {stream: true, completion_tokens: 254, prompt_tokens: 1920, model: gpt-4o, total_tokens: 2174} |
+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

SELECT
task,
trace_id,
execution_duration_ms,
start_time,
SUBSTRING(input, 1, 100) AS input_preview,
SUBSTRING(captured_output, 1, 100) AS output_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE trace_id = (
SELECT trace_id
FROM spice.runtime.task_history
WHERE task = 'ai_chat'
ORDER BY start_time DESC
LIMIT 1
)
ORDER BY start_time;

+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
| task | trace_id | execution_duration_ms | start_time | input_preview | output_preview | error_message | labels |
+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
| ai_chat | bb2de94b6f575b6c001f39cfded8bff4 | 5665.4169999999995 | 2024-11-25T07:02:43.240005 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | Here are the three most recent issues related to React, along with their summaries and links: | | {model: gpt-4o} |
| | | | | | | | |
| | | | | | 1. ** | | |
| tool_use::list_datasets | bb2de94b6f575b6c001f39cfded8bff4 | 0.157 | 2024-11-25T07:02:43.240048 | | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac | | {tool: list_datasets} |
| ai_completion | bb2de94b6f575b6c001f39cfded8bff4 | 5664.964 | 2024-11-25T07:02:43.240416 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 2688, total_tokens: 2716, completion_tokens: 28, model: gpt-4o, stream: true} |
| tool_use::document_similarity | bb2de94b6f575b6c001f39cfded8bff4 | 710.9609999999999 | 2024-11-25T07:02:44.344935 | {"text":"recent issues","datasets":["spice.react.issues"],"limit":3} | | | {tool: document_similarity} |
| vector_search | bb2de94b6f575b6c001f39cfded8bff4 | 710.842 | 2024-11-25T07:02:44.345018 | recent issues | {Full { catalog: "spice", schema: "react", table: "issues" }: VectorSearchTableResult { data: [Recor | | {limit: 3, tables: spice.react.issues} |
| text_embed | bb2de94b6f575b6c001f39cfded8bff4 | 562.453 | 2024-11-25T07:02:44.345072 | "recent issues" | | | {outputs_produced: 1} |
| sql_query | bb2de94b6f575b6c001f39cfded8bff4 | 147.672 | 2024-11-25T07:02:44.908148 | WITH ranked_docs as ( | [{"title_chunk":"app:lintVitalReleaseBug issu","id":"I_kwDOAJy2Ks47fqsI","title":"app:lintVitalRelea | | {datasets: spice.react.issues, protocol: Internal, query_execution_duration_ms: 139.69496, rows_produced: 3} |
| | | | | SELECT id, dist, offset FROM ( | | | |
| | | | | SELECT | | | |
| | | | | | | | |
| ai_completion | bb2de94b6f575b6c001f39cfded8bff4 | 3849.3140000000003 | 2024-11-25T07:02:45.055985 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {model: gpt-4o, total_tokens: 3240, stream: true, completion_tokens: 271, prompt_tokens: 2969} |
+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+

SELECT
task,
start_time,
execution_duration_ms,
SUBSTRING(input, 1, 100) AS input_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE 'catalog_sales' = ANY(string_to_array(labels['datasets'], ','))
ORDER BY start_time DESC
LIMIT 5;

+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| task | start_time | execution_duration_ms | input_preview | error_message | labels |
+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| sql_query | 2024-11-25T07:15:31.278018 | 19.424 | with ss as ( select i_manufact_id,sum(ss_ext_sales_price) total_sales from store_sales | | {accelerated: true, protocol: FlightSQL, query_execution_duration_ms: 18.765831, rows_produced: 15, datasets: store_sales,customer_address,item,date_dim,web_sales,catalog_sales} |
| sql_query | 2024-11-25T07:15:31.233147 | 1.661 | select sum(cs_ext_discount_amt) as "excess discount amount" from catalog_sales ,item ,dat | | {query_execution_duration_ms: 1.380667, accelerated: true, datasets: date_dim,catalog_sales,item, rows_produced: 1, protocol: FlightSQL} |
| sql_query | 2024-11-25T07:15:30.889905 | 30.101 | select i_item_id ,i_item_desc ,s_store_id ,s_store_name ,stddev_samp(ss_quantit | | {query_execution_duration_ms: 29.511086, datasets: store_sales,store,item,date_dim,store_returns,catalog_sales, rows_produced: 0, protocol: FlightSQL, accelerated: true} |
| sql_query | 2024-11-25T07:15:30.658339 | 24.942 | select i_item_id, avg(cs_quantity) agg1, avg(cs_list_price) agg2, avg(cs_co | | {query_execution_duration_ms: 24.620039, protocol: FlightSQL, datasets: item,date_dim,catalog_sales,promotion,customer_demographics, accelerated: true, rows_produced: 73} |
| sql_query | 2024-11-25T07:15:30.574257 | 40.908 | select i_item_id ,i_item_desc ,s_store_id ,s_store_name ,min(ss_net_profit) as store_sales_prof | | {protocol: FlightSQL, rows_produced: 0, accelerated: true, datasets: store_returns,date_dim,store,store_sales,item,catalog_sales, query_execution_duration_ms: 40.29496} |
+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
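The same spice.runtime.task_history table supports ad-hoc troubleshooting. A minimal sketch, using the columns shown in the examples above, that surfaces only the tasks that recorded an error:

SELECT
  task,
  start_time,
  SUBSTRING(input, 1, 100) AS input_preview,
  error_message
FROM spice.runtime.task_history
WHERE error_message IS NOT NULL -- keep only failed tasks
ORDER BY start_time DESC
LIMIT 20;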
Creates a model response for the given chat conversation.
Chat completion generated successfully
The specified model was not found
An internal server error occurred while processing the chat completion
Generate and optionally execute a natural-language text-to-SQL (NSQL) query.
This endpoint generates a SQL query using a natural language query (NSQL) and optionally executes it. The SQL query is generated by the specified model and executed if the Accept header is not set to application/sql.
The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.nsql.v1+json', 'application/sql', 'text/csv' or 'text/plain'. 'application/sql' will only return the SQL query generated by the model.
SQL query executed successfully
Invalid request parameters
Internal server error
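For example, to return only the generated SQL without executing it, the Accept header can be set to application/sql. A sketch following the NSQL request format shown later on this page (the query text and dataset name are placeholders):

POST /v1/nsql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: application/sql
Content-Type: application/json

{
  "query": "Get the top 5 customers by total sales",
  "datasets": ["sales_data"]
}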
List all models, both machine learning and language models, available in the runtime.
The format of the response (e.g., json or csv).
If true, includes the status of each model in the response.
List of models in JSON format
Internal server error occurred while processing models
Perform a vector similarity search (VSS) operation on a dataset.
The search operation returns the most relevant matches based on cosine similarity with the input text. The queried datasets must have an embedding column and the appropriate embedding model loaded.
Search completed successfully
Invalid request parameters
Internal server error
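As a minimal sketch of the dataset side of that requirement, assuming the spicepod column-level embeddings syntax (the app_messages dataset, the message column, the postgres source, and the openai_embeddings model name are all placeholders):

embeddings:
  - name: openai_embeddings
    from: openai
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}

datasets:
  - from: postgres:app_messages
    name: app_messages
    columns:
      - name: message
        embeddings:
          - from: openai_embeddings # embed the message column so it is vector-searchable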
Execute a SQL query and return the results.
This endpoint allows users to execute SQL queries directly from an HTTP request. The SQL query is sent as plain text in the request body.
The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.sql.v1+json', 'text/csv' or 'text/plain'.
SpiceAI SQL v1+json result
Bad Request
Internal Server Error
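A sketch of a request to this endpoint, with the SQL statement sent as plain text in the body; host and API key follow the other examples on this page:

POST /v1/sql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: text/plain
Accept: application/json

SELECT task, COUNT(*) AS task_count FROM spice.runtime.task_history GROUP BY task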
{
"results": [
{
"matches": {
"message": "I booked use some tickets"
},
"dataset": "app_messages",
"primary_key": {
"id": "6fd5a215-0881-421d-ace0-b293b83452b5"
},
"data": {
"timestamp": 1724716542
},
"score": 0.914321
},
{
"matches": {
"message": "direct to Narata"
},
"dataset": "app_messages",
"primary_key": {
"id": "8a25595f-99fb-4404-8c82-e1046d8f4c4b"
},
"data": {
"timestamp": 1724715881
},
"score": 0.83221
},
{
"matches": {
"message": "Yes, we're sitting together"
},
"dataset": "app_messages",
"primary_key": {
"id": "8421ed84-b86d-4b10-b4da-7a432e8912c0"
},
"data": {
"timestamp": 1724716123
},
"score": 0.787654321
}
],
"duration_ms": 42
}

POST /v1/search HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: application/json
Accept: */*
Content-Length: 157
{
"datasets": [
"app_messages"
],
"text": "Tokyo plane tickets",
"where": "user=1234321",
"additional_columns": [
"timestamp"
],
"limit": 3,
"keywords": [
"plane",
"tickets"
]
}

{}

POST /v1/sql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: text
Content-Type: application/json
Content-Length: 2
{}

GitHub Data Connector Documentation
The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github as the selector in the from value for the dataset.
The from field takes the form github:github.com/<owner>/<repo>/<content>, where content can be files, issues, pulls, commits, or stargazers. See the dataset sections below for more configuration detail.
name: The dataset name. This will be used as the table name within Spice.
params: Connector parameters. GitHub Apps provide a secure and scalable way to integrate with GitHub's API.
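For example, a minimal dataset definition using this from format and a personal access token (the owner, repo, and dataset name are placeholders):

datasets:
  - from: github:github.com/<owner>/<repo>/issues
    name: my_repo.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}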
Limitations
With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.
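A sketch of a dataset configured with GitHub App Installation auth instead of a personal access token, using the parameters documented below (the secret names and dataset name are placeholders):

datasets:
  - from: github:github.com/<owner>/<repo>/pulls
    name: my_repo.pulls
    params:
      github_client_id: ${secrets:GITHUB_CLIENT_ID}
      github_private_key: ${secrets:GITHUB_PRIVATE_KEY}
      github_installation_id: ${secrets:GITHUB_INSTALLATION_ID}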
GitHub queries support a github_query_mode parameter, which can be set to either auto or search for the following types:
Issues: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.
Pull Requests: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.
Commits: Only supports auto mode. Filter push down is enabled only for the committed_date column. committed_date supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'.
When set to search, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:
author and state; support exact matches or NOT matches. For example, WHERE author = 'peasee' or WHERE author <> 'peasee'.
body and title; support exact matches or LIKE matches. For example, WHERE body LIKE '%duckdb%'.
All other filters are supported when github_query_mode is set to search, but cannot be pushed down to the GitHub API for improved performance.
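For instance, with github_query_mode set to search on an issues dataset (named spiceai.issues as in the example later on this page), the author and title filters in a query like the following can be pushed down to the GitHub Search API:

SELECT title, state
FROM spiceai.issues
WHERE author = 'peasee'      -- exact match, pushed down in search mode
  AND title LIKE '%duckdb%'; -- LIKE match, pushed down in search mode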
Limitations
GitHub has a limitation in the Search API where it may return more stale data than the standard API used in the default query mode.
GitHub has a limitation in the Search API where it only returns a maximum of 1000 results for a query. Use an append-mode acceleration to retrieve more results over time; see the pull requests example below.
Limitations
The content column is fetched only when acceleration is enabled.
Querying GitHub files does not support filter push down, which may result in long query times when acceleration is disabled.
ref - Required. Specifies the GitHub branch or tag to fetch files from.
include - Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE committed_date > '2024-09-24'.
Setting github_query_mode
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE starred_at > '2024-09-24'.
Setting github_query_mode
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "gpt-4o-mini",
"system_fingerprint": "fp_44709d6fcb",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?"
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21,
"completion_tokens_details": {
"reasoning_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
}
}

POST /v1/nsql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: text
Content-Type: application/json
Content-Length: 117
{
"query": "Get the top 5 customers by total sales",
"model": "nql",
"sample_data_enabled": true,
"datasets": [
"sales_data"
]
}

{
"object": "list",
"data": [
{
"id": "gpt-4",
"object": "model",
"owned_by": "openai",
"datasets": null,
"status": "ready"
},
{
"id": "text-embedding-ada-002",
"object": "model",
"owned_by": "openai-internal",
"datasets": [
"text-dataset-1",
"text-dataset-2"
],
"status": "ready"
}
]
}

POST /v1/chat/completions HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: application/json
Accept: */*
Content-Length: 143
{
"model": "gpt-4o",
"messages": [
{
"role": "developer",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": false
}

GET /v1/models HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: */*
updated_at, created_at, merged_at and closed_at; support exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Setting github_query_mode to search is not supported.
Parameters

github_token: Required. GitHub personal access token to use to connect to the GitHub API. Learn more.
github_client_id: Required. Specifies the client ID for GitHub App Installation auth mode.
github_private_key: Required. Specifies the private key for GitHub App Installation auth mode.
github_installation_id: Required. Specifies the installation ID for GitHub App Installation auth mode.
github_query_mode: Optional. Specifies whether the connector should use the GitHub search API for improved filter performance. Defaults to auto; possible values are auto or search.
owner: Required. Specifies the owner of the GitHub repository.
repo: Required. Specifies the name of the GitHub repository.

files columns (column, type, nullable):
name            Utf8          YES
path            Utf8          YES
size            Int64         YES
sha             Utf8          YES
mode            Utf8          YES
url             Utf8          YES
download_url    Utf8          YES
content         Utf8          YES

issues columns (column, type, nullable):
assignees          List(Utf8)      YES
author             Utf8            YES
body               Utf8            YES
closed_at          Timestamp       YES
comments           List(Struct)    YES
created_at         Timestamp       YES
id                 Utf8            YES
labels             List(Utf8)      YES
milestone_id       Utf8            YES
milestone_title    Utf8            YES
comments_count     Int64           YES
number             Int64           YES
state              Utf8            YES
title              Utf8            YES
updated_at         Timestamp       YES
url                Utf8            YES

pulls columns (column, type, nullable):
additions         Int64         YES
assignees         List(Utf8)    YES
author            Utf8          YES
body              Utf8          YES
changed_files     Int64         YES
closed_at         Timestamp     YES
comments_count    Int64         YES
commits_count     Int64         YES
created_at        Timestamp     YES
deletions         Int64         YES
hashes            List(Utf8)    YES
id                Utf8          YES
labels            List(Utf8)    YES
merged_at         Timestamp     YES
number            Int64         YES
reviews_count     Int64         YES
state             Utf8          YES
title             Utf8          YES
url               Utf8          YES

commits columns (column, type, nullable):
additions            Int64        YES
author_email         Utf8         YES
author_name          Utf8         YES
committed_date       Timestamp    YES
deletions            Int64        YES
id                   Utf8         YES
message              Utf8         YES
message_body         Utf8         YES
message_head_line    Utf8         YES
sha                  Utf8         YES

stargazers columns (column, type, nullable):
starred_at    Timestamp    YES
login         Utf8         YES
              Utf8         YES
name          Utf8         YES
company       Utf8         YES
x_username    Utf8         YES
location      Utf8         YES
avatar_url    Utf8         YES
bio           Utf8         YES
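The schemas above can also be inspected at query time; a sketch, assuming DESCRIBE support in the runtime's SQL engine and the spiceai.issues dataset name used in the examples below:

sql> describe spiceai.issues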
[
{
"customer_id": "12345",
"total_sales": 150000
},
{
"customer_id": "67890",
"total_sales": 125000
}
]

datasets:
  - from: github:github.com/<owner>/<repo>/files/<ref>
    name: spiceai.files
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      include: '**/*.json; **/*.yaml'
    acceleration:
      enabled: true

datasets:
  - from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
    name: spiceai.files
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      include: '**/*.txt' # include txt files only
    acceleration:
      enabled: true

sql> select * from spiceai.files
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| name | path | size | sha | mode | url | download_url | content |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| version.txt | version.txt | 12 | ee80f747038c30e776eecb2c2ae155dec9a68187 | 100644 | https://api.github.com/repos/spiceai/spiceai/git/blobs/ee80f747038c30e776eecb2c2ae155dec9a68187 | https://raw.githubusercontent.com/spiceai/spiceai/v0.17.2-beta/version.txt | 0.17.2-beta |
| | | | | | | | |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
Time: 0.005067 seconds. 1 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}

sql> select title, state, labels from spiceai.issues where title like '%duckdb%'
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| title | state | labels |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| Limitation documentation duckdb accelerator about nested struct and decimal256 | CLOSED | [kind/documentation] |
| Inconsistent duckdb connector params: `params.open` and `params.duckdb_file` | CLOSED | [kind/bug] |
| federation across multiple duckdb acceleration tables. | CLOSED | [] |
| Integration tests to cover "On Conflict" behaviors for duckdb accelerator | CLOSED | [kind/task] |
| Permission denied issue while using duckdb data connector with spice using HELM for Kubernetes deployment | CLOSED | [kind/bug] |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
Time: 0.011877542 seconds. 5 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select title, url, state from spiceai.pulls where title like '%GitHub connector%'
+---------------------------------------------------------------------+----------------------------------------------+--------+
| title | url | state |
+---------------------------------------------------------------------+----------------------------------------------+--------+
| GitHub connector: convert `labels` and `hashes` to primitive arrays | https://github.com/spiceai/spiceai/pull/2452 | MERGED |
+---------------------------------------------------------------------+----------------------------------------------+--------+
Time: 0.034996667 seconds. 1 rows.

datasets:
  - from: github:github.com/spiceai/spiceai/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      github_query_mode: search
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 6h # check for new results every 6 hours
      refresh_data_window: 90d # at initial load, load the last 90 days of pulls

datasets:
  - from: github:github.com/<owner>/<repo>/commits
    name: spiceai.commits
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/commits
    name: spiceai.commits
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select sha, message_head_line from spiceai.commits limit 10
+------------------------------------------+------------------------------------------------------------------------+
| sha | message_head_line |
+------------------------------------------+------------------------------------------------------------------------+
| 2a9fab7905737e1af182e17f40aecc5c4b5dd236 | wait 2 seconds for the status to turn ready in refreshing status tes… |
| b9c210a818abeaf14d2493fde5227781f47faed8 | Update README.md - Remove bigquery from tablet of connectors (#1434) |
| d61e1af61ebf826f83703b8dd939f19e8b2ba426 | Add databricks_use_ssl parameter (#1406) |
| f1ec55c5986e3e5d57eff94197182ffebbae1045 | wording and logs change reflected on readme (#1435) |
| bfc74185584d1e048ef66c72ce3572a0b652bfd9 | Update acknowledgements (#1433) |
| 0d870f1791d456e7924b4ecbbda5f3b762db1e32 | Update helm version and use v0.13.0-alpha (#1436) |
| 12f930cbad69833077bd97ea43599a75cff985fc | Enable push-down federation by default (#1429) |
| 6e4521090aaf39664bd61d245581d34398ce77db | Add functional tests for federation push-down (#1428) |
| fa3279b7d9fcaa5e8baaa2425f69b556bb30e309 | Add LRU cache support for http-based sql queries (#1410) |
| a3f93dde9d1312bfbf14f7ae3b75bdc468289212 | Add guides and examples about error handling (#1427) |
+------------------------------------------+------------------------------------------------------------------------+
Time: 0.0065395 seconds. 10 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/stargazers
    name: spiceai.stargazers
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/stargazers
    name: spiceai.stargazers
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select starred_at, login from spiceai.stargazers order by starred_at DESC limit 10
+----------------------+----------------------+
| starred_at | login |
+----------------------+----------------------+
| 2024-09-15T13:22:09Z | cisen |
| 2024-09-14T18:04:22Z | tyan-boot |
| 2024-09-13T10:38:01Z | yofriadi |
| 2024-09-13T10:01:33Z | FourSpaces |
| 2024-09-13T04:02:11Z | d4x1 |
| 2024-09-11T18:10:28Z | stephenakearns-insta |
| 2024-09-09T22:17:42Z | Lrs121 |
| 2024-09-09T19:56:26Z | jonathanfinley |
| 2024-09-09T07:02:10Z | leookun |
| 2024-09-09T03:04:27Z | royswale |
+----------------------+----------------------+
Time: 0.0088075 seconds. 10 rows.