
Spice.ai Cloud Documentation


Getting Started

Get started with the Spice.ai Cloud Platform in 5 mins.

1. Sign in to the Portal with GitHub

2. Create a Spice app

3. Add a dataset and query it

4. Add an AI Model and chat with it

Welcome to Spice.ai Cloud

Welcome to the Spice.ai Cloud Platform documentation!

The Spice.ai Cloud Platform is an AI application and agent cloud; an AI-backend-as-a-service comprising composable, ready-to-use AI and agent building blocks, including high-speed SQL query, LLM inference, Vector Search, and RAG, built on cloud-scale, managed Spice.ai OSS.

This documentation pertains to the Spice.ai Cloud Platform.

For documentation on the self-hostable Spice.ai OSS Project, please visit docs.spiceai.org.

Spice.ai Cloud, powered by Spice.ai OSS, delivers a secure compute environment for AI applications and agents.

With the Spice.ai Cloud Platform, powered by Spice.ai OSS, you can:

  1. Query and accelerate data: Run high-performance SQL queries across multiple data sources with results optimized for AI applications and agents.

  2. Use AI Models: Perform large language model (LLM) inference with major providers including OpenAI, Anthropic, and Grok for chat, completion, and generative AI workflows.

  3. Collaborate on Spicepods: Share, fork, and manage datasets, models, embeddings, evals, and tools in a collaborative, community-driven hub indexed by spicerack.org.

Use-Cases

  • Fast, virtualized data views: Build specialized “small data” warehouses to serve fast, virtualized views across large datasets for applications, APIs, dashboards, and analytics.

  • Performance and reliability: Manage replicas of hot data, cache SQL queries and AI results, and load-balance AI services to improve resiliency and scalability.

  • Production-grade AI workflows: Use Spice.ai Cloud as a data and AI proxy for secure, monitored, and compliant production environments, complete with advanced observability and performance management.

Take it for a spin by starting with the getting started guide.

Feel free to ask any questions of the team in Slack.

Create a Spice app

Create your first Spice app

Once signed in with GitHub, you will be redirected to the new application page. Set a name, add a model provider, and optionally select one of the ready-to-use datasets.

  1. Enter a name for the application.

  2. Optionally choose a model provider, such as OpenAI, and provide an API key.

  3. Optionally select one or more of the available datasets. Datasets can also be added later.

     Setting an application name, selecting an optional model, and optional datasets.

  4. Click Create application. A new Spice.ai instance will be created.

     It will take up to 30 seconds to create and provision a dedicated Spice.ai instance for the application.

  5. Once the application instance is deployed and ready, you will be redirected to the Playground.

     SQL Query in the Playground.

  6. Executing the show tables SQL query will show the default datasets available for the app.

🎉 Congrats, you've created your first Spice app!

Continue to Step 3 to add a dataset and query it.

Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Slack.

Next Steps

Learn more about building AI applications and agents with the Spice.ai Cloud Platform.

Sign in with GitHub

Sign in to the Portal with GitHub

Create your first Spice.ai Cloud application.

A GitHub account is required to access the Spice.ai Cloud Platform. If you don't have one, you can create an account here.

From the Spice.ai website

  1. Go to spice.ai and click on Login or Try for Free in the top right corner.

     You can also navigate directly by URL to spice.ai/login.

     The Spice.ai website

  2. Click Continue with GitHub to login with your GitHub account.

     Login with GitHub

  3. Click Authorize Spice.ai Cloud Platform.

     Authorize Spice.ai Cloud Platform with GitHub

  4. You will be redirected to the new application page.

Continue to Step 2 to configure your first Spice application.

Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Slack.

Playground

Start experimenting in the Playground.

Features

Explore the features of the platform.

Use-Cases

Explore use-cases for Spice.ai.

    FAQ

    Frequently asked questions

    What's the difference between the Spice.ai Cloud Platform and Spice.ai OSS?

    Spice.ai OSS is an open-source project created by the Spice AI team that provides a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.

    The Spice.ai OSS runtime

    The Spice.ai Cloud Platform is a data and AI application platform that provides a set of building-blocks to create AI and agentic applications. Building blocks include a cloud-data-warehouse, ML model training and inference, and a cloud-scale, managed Spice.ai OSS cloud-hosted service.

    How much does Spice.ai Cloud cost?

It's free to get an API key and use the Community Edition.

For customers who need resource limits, service-level guarantees, or priority support, we offer high-value paid tiers based on usage.

What level of support do you offer?

We offer enterprise-grade support with an SLA for Enterprise Plans.

For standard plans we offer best-effort community support in Discord.

What's your approach to security and compliance?

See Security. The Spice.ai Cloud Platform is SOC 2 Type II compliant.

What SQL query engine/dialect do you support?

Spice.ai OSS is built on Apache DataFusion and uses the PostgreSQL dialect.

    Add a Dataset and query data

    Add a dataset and query it using SQL Query in the Playground

To add a dataset to the Spice app, navigate to the Code tab.

Use the Components sidebar on the right to select from available Data Connectors, Model Providers, and ready-to-use Datasets.

Adding a ready-to-use Dataset

  1. Navigate to the Code tab.

  2. In the Components sidebar, click the Datasets tab.

     Spice.ai app Code configuration

  3. Select and add the NYC Taxi Trips dataset.

     Note the configuration has been added to the editor.

     Add the NYC Taxi Trips dataset

  4. Click Save in the code toolbar and then Deploy on the popup card that appears in the bottom right.

  5. Navigate to the Playground tab, open the dataset reference, and click on the spice.samples.taxi_trips dataset to insert a sample query into the SQL editor. Then, click Run Selection.

     Executing the sample query for the NYC Taxi Trips dataset.

[Optional] Execute a SQL query using cURL

  1. Go to app Settings and copy one of the app API Keys.

     Getting an API Key from the app Settings.

  2. Replace [API-KEY] in the sample below with your API Key and execute from a terminal.

     Showing results from executing a sample NYC Taxi Trips dataset query using cURL.

🎉 Congratulations, you've now added a dataset and queried it.

Continue to Step 4 to add an AI Model and chat with the dataset.

Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Slack.

Enterprise Search

Use Spice for Enterprise Search and Retrieval

Vector similarity search across disparate and legacy data systems

Enterprises face the challenge of accessing data from various disparate and legacy systems to provide AI with comprehensive context. Speed is crucial for this process to be effective.

Spice offers a fast knowledge index into both structured and unstructured data, enabling efficient vector similarity search across multiple data sources. This ensures that AI applications have access to the necessary data for accurate and timely responses.

    Add AI Model and chat with your data

    Add an OpenAI model and chat with the NYC Taxi Trips dataset

    An OpenAI API Platform account and API key is required.

    Adding a Model Provider

1. Navigate to the Code tab.

2. In the Components sidebar, click the Model Providers tab, and select OpenAI.

    3. Enter the Model name.

4. Enter the Model ID (e.g. gpt-4o).

    5. Set the OpenAI API Key secret

      1. API keys and other secrets are securely stored and encrypted.

6. Insert tools: auto in the params section of the gpt-4o Model to automatically connect datasets to the model. The final Spicepod configuration in the editor should be as follows:

7. Click Save in the code toolbar and then Deploy in the popup card that appears in the bottom right to deploy the changes.

8. Navigate to Playground and select AI Chat in the sidebar.

9. Ask a question about the NYC Taxi Trips dataset in the chat. For example:

   • "What datasets are available?"

   • "What is the average fare amount of a taxi trip?"

    [Optional] Call chat completions API using cURL

    1. Replace [API-KEY] in the sample below with the app API Key and execute in a terminal.

    🎉 Congratulations, you've now added an OpenAI model and can use it to ask questions of the NYC Taxi Trips dataset.

Continue to Next Steps to explore use-cases and do more with the Spice.ai Cloud Platform.

Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Slack.

    In-Memory Arrow Data Accelerator

    The In-Memory Arrow Data Accelerator is the default data accelerator in Spice. It uses Apache Arrow to store data in-memory for fast access and query performance.

    Configuration

    To use the In-Memory Arrow Data Accelerator, no additional configuration is required beyond enabling acceleration.

    Example:

However, Arrow can be specified explicitly by setting arrow as the engine for acceleration.

    Memory

    Memory Data Connector Documentation

The Memory Data Connector enables configuring an in-memory dataset for tables used or produced by the Spice runtime. Only certain tables, with predefined schemas, can be defined by the connector. These are:

• store: Defines a table that LLMs, with memory tooling, can store data in. Requires mode: read_write.

    SpiceAI

    Instructions for using models hosted on the Spice Cloud Platform with Spice.

To use a model hosted on the Spice Cloud Platform, specify the spice.ai path in the from field.

    Example:

    Specific model versions can be referenced using a version label or Training Run ID.

    from Format

    The from key must conform to the following regex format:

    Health API

    Health HTTP API documentation

    The health API can be called to confirm the overall availability of the API service.

    Get API Health

    GET https://data.spiceai.io/health

    This endpoint gets the health of the API.
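For example, a minimal cURL request to check service health; a healthy service responds with a plain ok body:

curl https://data.spiceai.io/health
ok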

    LLM API

    Chat Completions

Spice provides an OpenAI compatible chat completion API at https://data.spiceai.io/v1/chat/completions. Authorize with the endpoint using an App API key.

The App requires a configured and deployed model to respond to chat completion requests.

For more information about using chat completions, refer to the OpenAI documentation.
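For example, a chat completion request with cURL, following the same pattern as the getting started sample (replace [API-KEY] with an app API Key, and gpt-4o with a model name defined in the app's Spicepod):

curl --request POST \
  --url 'https://data.spiceai.io/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: [API-KEY]' \
  --data '{ "messages": [{ "role": "user", "content": "Hello!" }], "model": "gpt-4o" }'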

    Search API

    Data Lakehouse

    Create a Data Lakehouse using Spice

    Accessing data across multiple, disparate data sources

Perform federated SQL queries across databases, data warehouses, and data lakes using Data Connectors.

    Database CDN

    Use Spice as a CDN for Databases

    Enhancing data application performance

    Colocate a local working set of hot data with data applications and frontends to serve more concurrent requests and users with faster page loads and data updates.

    Deployments

    Monitor and manage app Spicepod instances and deployments.

    Create New Deployment

    Navigate to the Spicepod tab and click on Create Deployment.

    Profile

    A light or dark mode portal theme can be set:

    1. Click the profile picture on the top right corner of the portal interface.

    2. Select Light, Dark, or System mode using the theme toggle.

    Spice Runtime Versions

Each new app deployment automatically retrieves the most recent stable Spice OSS release. Visit the Releases page or the Spice OSS blog to check for the latest runtime updates.

Migrate Existing Apps

    All Spice Apps are now powered by the latest, next-generation Spice.ai Open Source data and AI engine. Existing apps have been migrated but require a manual setup step to connect datasets and/or model providers.

Learn More: Read the Announcing 1.0-stable blog post for details on this upgrade, and visit the Spice.ai Cookbook for over 50 quickstarts and examples.

    curl --request POST \
      --url 'https://data.spiceai.io/v1/sql' \
      --header 'Content-Type: text/plain' \
      --header 'X-API-KEY: [API-KEY]' \
      --data 'select * from spice.samples.taxi_trips limit 3'

    Migrating from legacy data systems

    Spice provides a drop-in solution that offers a single, unified endpoint to multiple data systems without requiring changes to the application.

    Increasing application resilience

    Maintain local replicas of data with the application to significantly enhance application resilience and availability.

    Improving dashboard, analytics, and BI performance

    Create a materialization layer for visualization tools like Power BI, Tableau, or Superset to achieve faster, more responsive dashboards without incurring massive compute costs.

    Watch the Apache Superset demo

    Try the CQRS Cookbook

    Playground

    Enterprise RAG

    Use Spice for Retrieval-Augmented-Generation (RAG)

    Use Spice to access data across various data sources for Retrieval-Augmented-Generation (RAG).

    Spice enables developers to combine structured data via SQL queries and unstructured data through built-in vector similarity search. This combined data can then be fed to large language models (LLMs) through a native AI gateway, enhancing the models' ability to generate accurate and contextually relevant responses.

    "What datasets are available?"
  • "What is the average fare amount of a taxi trip?"

  • Next Steps
    Slack
    Asking questions of the NYC Taxi Trips dataset in the AI Chat Playground.
    Calling the chat completions API to chat wit the NYC Taxi Trips dataset.
    Examples
    memory tooling
    datasets:
      - from: memory:store
        name: llm_memory
        mode: read_write
        columns:
          - name: value
            embeddings: # Easily make your LLM learnings searchable.
              - from: all-MiniLM-L6-v2
    
    embeddings:
      - name: all-MiniLM-L6-v2
        from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

    Examples:

    • spice.ai/lukekim/smart/models/drive_stats:latest: Refers to the latest version of the drive_stats model in the smart application by the user or organization lukekim.

    • spice.ai/lukekim/smart/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf: Specifies a model with a unique training run ID.

    Specification

    1. Prefix (Optional): The value must start with spice.ai/.

    2. Organization/User: The name of the organization or user (org) hosting the model.

    3. Application Name: The name of the application (app) which the model belongs to.

    4. Model Name: The name of the model (model).

    5. Version (Optional): A colon (:) followed by the version identifier (version), which could be a semantic version, latest for the most recent version, or a specific training run ID.

    name: my-first-app
    kind: Spicepod
    version: v1beta1
    
    datasets:
      - from: s3://spiceai-demo-datasets/taxi_trips/2024/
        name: samples.taxi_trips
        description: Taxi trips dataset from Spice.ai demo datasets.
        params:
          file_format: parquet
    
    models:
      - from: openai:gpt-4o
        name: gpt-4o
        params:
          endpoint: https://api.openai.com/v1
          openai_api_key: ${secrets:OPENAI_API_KEY}
          tools: auto
    curl --request POST \
          --url 'https://data.spiceai.io/v1/chat/completions' \
          --header 'Content-Type: application/json' \
          --header 'X-API-KEY: 31393037|8f2f6125e7b8487f80964041c123d3c3' \
          --data '{ "messages": [{ "role": "user", "content": "Hello!" }], "model": "gpt-4o" }'
    models:
      - from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats
        name: drive_stats
        datasets:
          - drive_stats_inferencing
    models:
      - from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:latest # Label
        name: drive_stats_a
        datasets:
          - drive_stats_inferencing
    
      - from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf # Training Run ID
        name: drive_stats_b
        datasets:
          - drive_stats_inferencing
    \A(?:spice\.ai\/)?(?<org>[\w\-]+)\/(?<app>[\w\-]+)(?:\/models)?\/(?<model>[\w\-]+):(?<version>[\w\d\-\.]+)\z

    Limitations

    • The In-Memory Arrow Data Accelerator does not support persistent storage. Data is stored in-memory and will be lost when the Spice runtime is stopped.

    • The In-Memory Arrow Data Accelerator does not support Decimal256 (76 digits), as it exceeds Arrow's maximum Decimal width of 38 digits.

• The In-Memory Arrow Data Accelerator does not support indexes.

• The In-Memory Arrow Data Accelerator only supports primary-key constraints, not unique constraints.

    • With Arrow acceleration, mathematical operations like value1 / value2 are treated as integer division if the values are integers. For example, 1 / 2 will result in 0 instead of the expected 0.5. Use casting to FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT) (or CAST(1 AS FLOAT) / 2).

    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          enabled: true
    Spicepod Instance Logs

    Navigate to the Spicepod tab and click on the Logs for the selected instance.

    Spicepod Logs

    FlightSQL

    Flight SQL Data Connector Documentation


    Connect to any Flight SQL compatible server (e.g. Influx 3.0, CnosDB, other Spice runtimes!) as a connector for federated SQL queries.

    - from: flightsql:my_catalog.good_schemas.cool_dataset
      name: cool_dataset
      params:
        flightsql_endpoint: http://127.0.0.1:50051
        flightsql_username: spicy
        flightsql_password: ${secrets:my_flightsql_pass}

    Configuration

    from

    The from field takes the form flightsql:dataset where dataset is the fully qualified name of the dataset to read from.

    name

    The dataset name. This will be used as the table name within Spice.

    params

Parameter name
Description

flightsql_endpoint

The Apache Flight endpoint used to connect to the Flight SQL server.

flightsql_username

Optional. The username to use in the underlying Apache Flight Handshake Request to authenticate to the server (see reference).

flightsql_password

Optional. The password to use in the underlying Apache Flight Handshake Request to authenticate to the server. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_flightsql_pass}.

    Federated SQL Query

    Federated SQL Query documentation

    Spice supports federated queries, enabling you to join and combine data from multiple sources, including databases (PostgreSQL, MySQL), data warehouses (Databricks, Snowflake, BigQuery), and data lakes (S3, MinIO). For a full list of supported sources, see Data Connectors.
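As an illustrative sketch, a single SQL statement can join tables that come from different connectors. The table names below are hypothetical and assume a PostgreSQL-backed orders dataset and an S3-backed customers dataset have already been defined in the Spicepod:

SELECT c.customer_name, SUM(o.order_total) AS total_spend
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 10;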

    SQL Query

    Playground SQL Explorer

The Playground SQL Explorer is the fastest way to get started with federated queries, debugging queries, and iterating quickly. The SQL Query Editor can be accessed by clicking on the SQL Explorer tab after selecting Playground in the app navigation bar.

See SQL Query for further documentation on using the SQL Query Editor.

Apache Arrow Flight API

For production applications, leveraging the high-performance Apache Arrow Flight endpoint is recommended. The Spice SDKs always query using Arrow Flight.

See the Apache Arrow Flight API for further documentation on using Apache Arrow Flight APIs.

HTTP API

SQL Query is also accessible via a standard HTTP API.

See the HTTP API for further documentation on using the HTTP SQL API.

    Azure

    Instructions for using Azure OpenAI models

    To use a language model hosted on Azure OpenAI, specify the azure path in the from field and the following parameters from the Azure OpenAI Model Deployment page:

Param
Description
Default

azure_api_key

The Azure OpenAI API key from the models deployment page.

-

azure_api_version

The API version used for the Azure OpenAI service.

-

azure_deployment_name

The name of the model deployment.

Model name

endpoint

The Azure OpenAI resource endpoint, e.g., https://resource-name.openai.azure.com.

-

azure_entra_token

The Azure Entra token for authentication.

-

    Only one of azure_api_key or azure_entra_token can be provided for model configuration.

    Example:

Refer to the Azure OpenAI Service models documentation for more details on available models and configurations.

Follow the Azure OpenAI Models Cookbook to try Azure OpenAI models for vector-based search and chat functionalities with structured (taxi trips) and unstructured (GitHub files) data.

    AI Gateway

    AI Gateway documentation

    Spice provides a high-performance, OpenAI API-compatible AI Gateway optimized for managing and scaling large language models (LLMs). Additionally, Spice offers tools for Enterprise Retrieval-Augmented Generation (RAG), such as SQL query across federated datasets and an advanced search feature (see Search).

    Spice supports full OpenTelemetry observability, enabling detailed tracking of data flows and requests for full transparency and easier debugging.

    Supported Models​

    Spice supports a variety of LLMs, including OpenAI, Azure OpenAI, Anthropic, Groq, Hugging Face, and more (see Model Providers for all supported models).

    Core Features

    • Custom Tools: Equip models with tools to interact with the Spice runtime.

• System Prompts: Customize system prompts and override defaults for v1/chat/completion.

For detailed configuration and API usage, refer to the API Documentation.

    Example: Configuring an OpenAI Compatible Model

    To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from.

    Example spicepod.yml:

For details, see OpenAI (or Compatible) Language Models.

    Semantic Models

    Define semantic data models in Spice to improve dataset understanding for AI

    A semantic model is a structured representation of data that captures the meaning and relationships between elements in a dataset.

    In Spice, semantic models transform raw data into meaningful business concepts by defining metadata, descriptions, and relationships at both the dataset and column level. This makes the data more interpretable for both AI language models and human analysis.

    Use-Cases

    Large Language Models (LLMs)

The semantic model is automatically used by language models (LLMs) as context to produce more accurate and context-aware AI responses.

    Defining a Semantic Model

    Semantic data models are defined within the spicepod.yaml file, specifically under the datasets section. Each dataset supports description, metadata, and a columns field where individual columns are described with metadata and features for utility and clarity.

    Example Configuration

    Example spicepod.yaml:

    Dataset Metadata

    Datasets can be defined with the following metadata:

    • instructions: Optional. Instructions to provide to a language model when using this dataset.

    • reference_url_template: Optional. A URL template for citation links.

For detailed metadata configuration, see the Spice OSS Dataset Reference.

    Column Definitions

    Each column in the dataset can be defined with the following attributes:

    • description: Optional. A description of the column's contents and purpose.

    • embeddings: Optional. Vector embeddings configuration for this column.

For detailed columns configuration, see the Spice OSS Dataset Reference.

    XAI

    Instructions for using xAI models

To use a language model hosted on xAI, specify the xai path in the from field and the associated xai_api_key parameter:

    Param
    Description
    Default

    xai_api_key

    The xAI API key.

    -

    Example:

    models:
    - from: xai:grok2-latest
      name: xai
      params:
        xai_api_key: ${secrets:SPICE_GROK_API_KEY}

Refer to the xAI models documentation for more details on available models and configurations.

Although the xAI documentation indicates that xAI models can return structured outputs, this is not currently the case.

    Spark

    Apache Spark Connector Documentation


    Apache Spark as a connector for federated SQL query against a Spark Cluster using Spark Connect

    datasets:
      - from: spark:spiceai.datasets.my_awesome_table
        name: my_table
        params:
          spark_remote: sc://my-spark-endpoint

    Configuration

• spark_remote: A Spark remote connection URI. Refer to the Spark Docs on the Spark Connect client connection string for URI parameters.

    Limitations

    • Correlated scalar subqueries are only supported in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands.

    • The Spark connector does not yet support streaming query results from Spark.

    Anthropic

    Instructions for using language models hosted on Anthropic with Spice.

    To use a language model hosted on Anthropic, specify anthropic in the from field.

    To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is claude-3-5-sonnet-latest.

    The following parameters are specific to Anthropic models:

Parameter
Description
Default

anthropic_api_key

The Anthropic API key.

-

endpoint

The Anthropic API base endpoint.

https://api.anthropic.com/v1

Example spicepod.yml configuration:

models:
  - from: anthropic:claude-3-5-sonnet-latest
    name: claude_3_5_sonnet
    params:
      anthropic_api_key: ${ secrets:SPICE_ANTHROPIC_API_KEY }

See Anthropic Model Names for a list of supported model names.

    Zipkin

    Export observability traces from Spice into Zipkin

    In addition to the built-in runtime.task_history SQL table, Spice can export the observability traces it collects into Zipkin.

    Zipkin UI showing traces exported by Spice

    Enabling Zipkin Export

Zipkin export is defined in the spicepod.yaml under the runtime.tracing section:

    runtime:
      tracing:
        zipkin_enabled: true
        zipkin_endpoint: http://localhost:9411/api/v2/spans
    • zipkin_enabled: Optional. Default false. Enables or disables the Zipkin trace export.

• zipkin_endpoint: Required if zipkin_enabled is true. The path to the /api/v2/spans endpoint on the Zipkin instance to export to.

    LocalPod

    Localpod Data Connector Documentation

    The Localpod Data Connector enables setting up a parent/child relationship between datasets in the current Spicepod. This can be used for configuring multiple/tiered accelerations for a single dataset, and ensuring that the data is only downloaded once from the remote source. For example, you can use the localpod connector to create a child dataset that is accelerated in-memory, while the parent dataset is accelerated to a file.

    The dataset created by the localpod connector will logically have the same data as the parent dataset.

    Synchronized Refreshes

    The localpod connector supports synchronized refreshes, which ensures that the child dataset is refreshed from the same data as the parent dataset. Synchronized refreshes require that both the parent and child datasets are accelerated with refresh_mode: full (which is the default).

    When synchronization is enabled, the following logs will be emitted:

    Examples

    Public Apps

    Publish app to https://spicerack.org

Public Spice Apps can be forked, added as a dependency, or connected to using the Spice OSS Spice.ai connector.

    Making the app public

The app must be connected to a public GitHub repository to be made public. See how to connect an app to a repository: Connect GitHub.

    To make your app public, go to your app settings and click Make public.

After that, the app will be visible to all users at https://spice.ai/<org-name>/<app-name> and searchable at https://spicerack.org.

    HTTP API

    Query with SQL via the HTTP API

Data may be queried by posting SQL to the /v1/sql API, or to the /v1/firesql API for Firecached data. For documentation on the Spice Firecache, see the Firecache documentation.

    See Tables for a list of tables to query or browse the example queries listed in the menu.

    Requirements and limitations

    • An API key is required for all SQL queries.

• Results are limited to 500 rows. Use the Apache Arrow Flight API to fetch up to 1M rows in a single query or the Async HTTP API to fetch results with paging.

    • Requests are limited to 90 seconds.

    Streaming

    The @spiceai/spice SDK supports streaming partial results as they become available.

    This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.

    The Client.query function takes an optional onData callback that will be passed partial results as they become available.

    public async query(
        queryText: string,
        onData: ((data: Table) => void) | undefined = undefined
      ): Promise<Table>

    In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:

    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient(process.env.API_KEY);
    const query = `
    SELECT s_suppkey, s_name
    FROM tpch.supplier
    `
    const allSuppliers = await spiceClient.query(query)
    allSuppliers.toArray().forEach((row) => {
        processSupplier(row.toJSON())
    });

    This call will wait for the promise returned by query() to complete, returning all 10,000 suppliers.

    Alternatively, data can be processed as it is streamed to the SDK. Provide a callback function to the onData parameter, which will be called with every partial set of data streamed to the SDK:

    Runtime

    Configuring Spice.ai runtime for your Spice application

    Navigate to Settings -> Runtime to configure the runtime settings for your Spice application.

    Runtime Version

    The runtime version determines the Spice.ai Open Source version for your Spice application. Each new deployment automatically adopts the latest stable version of the Spice runtime to ensure access to the most recent features and optimizations.

    Runtime Region

    The runtime region specifies the geographic location of the data center hosting your Spice application. Region selection optimizes latency, compliance, and performance based on your business needs.

    • Availability: Region selection is exclusive to Enterprise plan customers.

    • Supported Regions:

  • North America:

    • US East (Pittsburgh) - us-east-2 (Teraswitch, Current)

    • US West (Oregon) - us-west-2 (AWS)

    • US East (N. Virginia) - us-east (Azure)

    Compute

    Compute settings define the resource allocation for your Spice application, balancing performance and cost.

    Standard Compute Instances:

    • Developer: 2 CPU / 4 GB

    • Pro for Teams: 4 CPU / 8 GB

    • Enterprise: Dedicated Instances with multiple high-availability replicas

    Storage

    Provides a persistent storage for the runtime to save data acceleration files. Data remains intact across restarts and redeployments.

    • Availability: Storage is exclusive to Enterprise plan customers.

    • Mount Path: /data.

    • Size: Configured per request. Contact your account executive to set or update capacity.

    HTTPS

    HTTP(s) Data Connector Documentation

The HTTP(s) Data Connector enables federated SQL query across supported file formats stored at an HTTP(s) endpoint.

    Configuration

    from

    Observability

    First-class, built-in observability to understand the operations Spice performs.

Observability in Spice enables task tracking and performance monitoring through a built-in distributed tracing system that can export to Zipkin or be viewed via the runtime.task_history SQL table.

    Spice records detailed information about runtime operations through trace IDs, timings, and labels - from SQL queries to AI completions. This task history system helps operators monitor performance, debug issues, and understand system behavior across individual requests and overall patterns.
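For example, recent tasks can be inspected directly with SQL from the Playground (a minimal sketch; see the runtime.task_history reference for the full schema):

SELECT * FROM runtime.task_history LIMIT 10;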

    Use-Cases

    Apps

    Apps are self-contained instances of Spice OSS Runtime, running in Spice.ai Cloud Platform.

Each app has a unique API Key and is owned by individual accounts or organizations.

Learn how to:

• Connect an app with your existing GitHub repository

• Create and toggle custom Dataset and Views

• Transfer an app to another organization

    SQLite Data Accelerator

To use SQLite as a Data Accelerator, specify sqlite as the engine for acceleration.

    Configuration

    The connection to SQLite can be configured by providing the following params:
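A minimal sketch, mirroring the Arrow acceleration example elsewhere in this document with sqlite specified as the engine (the dataset path is a placeholder):

datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      enabled: true
      engine: sqlite
      # mode: file   # optional; defaults to in-memory (see Memory Considerations)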

    SQL Query API

    SQL Query (Cloud Data Warehouse) API

    The SQL Query API provides powerful capabilities for querying data managed by the Spice.ai Cloud Data Warehouse and connected external data sources using federated SQL queries. Results can be fetched through either the high-performance Apache Arrow Flight API or a standard HTTP API.

    Production Query Endpoints

    Perplexity

    Instructions for using language models hosted on Perplexity with Spice.

    To use a language model hosted on Perplexity, specify perplexity in the from field.

    To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is sonar.

    The following parameters are specific to Perplexity models:

Parameter
Description
Default

perplexity_auth_token

The Perplexity API authentication token.

-

perplexity_*

Additional Perplexity-specific parameters to use on all requests.

-

Note: Like other models in Spice, Perplexity can set default overrides for OpenAI parameters. See Parameter Overrides.

Example Configuration

models:
  - name: webs
    from: perplexity:sonar
    params:
      perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }

    API keys

Each Spice app has two pre-generated API keys, which can be used with the Spice SDKs, the HTTP API, or the Apache Arrow Flight API.

• Select the Spice app and navigate to Settings -> General.

• Click the API Key 1 or API Key 2 field to copy the key value.

You can regenerate each key if you need to invalidate it.

    Spicepod Configuration

    Edit app spicepod not connected to GitHub

To update the Spice App spicepod.yaml, navigate to the Code tab. Use the Components sidebar to add data connectors, model providers, and preconfigured datasets, or manually edit spicepod.yaml in the code editor.

After saving the spicepod changes, a new deployment must be triggered. Learn more about Spicepod Deployments.

    Transfer

You can transfer an App's ownership to another organization.

App transfer is currently limited to organizations you have access to.

Learn more about organization management.

    Streaming

    The spicepy SDK supports streaming partial results as they become available.

    This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.

spicepy enables streaming through the use of the pyarrow Flight API.

The object returned from spicepy.Client.query() is a pyarrow.flight.FlightStreamReader.

Calling read_pandas() on the FlightStreamReader will wait for the stream to return all of the data before returning a pandas DataFrame.

    SQL Query

    Use the Playground's SQL editor to easily explore data

    Open the SQL editor by navigating to an App Playground and clicking SQL Query in the sidebar.

    SQL table, column, and keyword suggestions

    The Spice.ai Query Editor will suggest table and column names along with keywords as you type. You can manually prompt for a suggestion by pressing ctrl+space.

    API Reference

    SpiceClient(params)

    The top-level object that connects to Spice.ai

    • params.api_key (string, optional): API key to authenticate with the endpoint

    SpiceAI

    Spice.ai Data Connector Documentation

The Spice.ai Data Connector enables federated SQL query across datasets in the Spice.ai Cloud Platform. Access to these datasets requires a free Spice.ai account.

    Configuration

    Parameters

    Monitoring

    Continuously identify and fix issues by tracking process actions with Spice monitoring and request logs.

    Under the Monitoring tab, track your app requests, their status codes, and duration, in addition to the existing usage monitoring metrics dashboard.

    Overview

    Monitor the success, failures, and durations of SQL queries, AI completions, Vector Searches, Embedding calculations, and accelerated dataset refreshes.

    Secrets

    App Secrets are key-value pairs that are passed to the Spice Runtime instance as environment secrets. Secrets are securely encrypted and accessible only through the app in which they were created.

    Once a secret is saved, its value cannot be retrieved through Spice Cloud. If you need to update the secret value, you must delete the existing secret and create a new one.

    Create a new secret

    1. Select your app.

    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          enabled: true
          engine: arrow
    Apache Arrow Flight API

    For production use, leverage the high-performance Apache Arrow Flight API, which is optimized for large-scale data workloads. The Spice SDKs default to querying via Arrow Flight.

    • Endpoint: grpc+tls://flight.spiceai.io

    • For additional details, refer to the Apache Arrow Flight API documentation.

    HTTP API

    The SQL Query API is also accessible via HTTP, offering standard integration for web applications.

    • Core Endpoint: https://data.spiceai.io/v1/sql

    • For more details, consult the HTTP SQL API documentation.





    2024-10-28T15:45:24.220665Z  INFO runtime::datafusion: Localpod dataset test_local synchronizing refreshes with parent table test
    datasets:
      - from: postgres:cleaned_sales_data
        name: test
        params: ...
        acceleration:
          enabled: true # This dataset will be accelerated into a DuckDB file
          engine: duckdb
          mode: file
          refresh_check_interval: 10s
      - from: localpod:test
        name: test_local
        acceleration:
          enabled: true # This dataset accelerates the parent `test` dataset into in-memory Arrow records and is synchronized with the parent
    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient(process.env.API_KEY);
    const query = `
    SELECT s_suppkey, s_name
    FROM tpch.supplier
    `
    await spiceClient.query(query, (partialData) => {
        partialData.toArray().forEach((row) => {
            processSupplier(row.toJSON())
        });
    })
    To operate on partial results while the data is streaming, we will take advantage of the read_chunk() method on FlightStreamReader. This returns a FlightStreamChunk, which has a data attribute that is a RecordBatch. Once we have the RecordBatch, we can call to_pandas() on it to return the partial data as a pandas DataFrame. When the stream has ended, calling read_chunk() will raise a StopIteration exception that we can catch.

    In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:

    This call will return a pandas DataFrame with all 10,000 suppliers, and is a synchronous call that waits for all data to arrive before returning.

    Alternatively, to process chunks of data as they arrive instead of waiting for all data to arrive, FlightStreamReader supports reading chunks of data as they become available with read_chunk(). Using the same query example above, but processing data chunk by chunk:

• params.http_url (string, optional): URL of the HTTP endpoint to use (default: local Spice Runtime)
  • params.flight_url (string, optional): URL of the endpoint to use (default: localhost:50051, using local Spice Runtime)

  • Default connection to local Spice Runtime:

    Connect to Spice.AI Cloud Platform:

    Or using shorthand:

    SpiceClient Methods

    query(queryText: string, onData: (partialData: Table) => void) => Table

    • queryText: (string, required): The SQL query to execute

    • onData: (callback, optional): The callback function that is used for handling streaming data.

    query returns an Apache Arrow Table.

    To get the data in JSON format, iterate over each row by calling toArray() on the table and call toJSON() on each row.

    Get all of the elements for a column by calling getChild(name: string) and then calling toJSON() on the result.

    from

    The Spice.ai Cloud Platform dataset URI. To query a dataset in a public Spice.ai App, use the format spice.ai/<org>/<app>/datasets/<dataset_name>.

    Example

    Full Configuration Example

    models:
      - from: azure:gpt-4o-mini
        name: gpt-4o-mini
        params:
          endpoint: ${ secrets:SPICE_AZURE_AI_ENDPOINT }
          azure_api_version: 2024-08-01-preview
          azure_deployment_name: gpt-4o-mini
          azure_api_key: ${ secrets:SPICE_AZURE_API_KEY }
    models:
      - from: openai:gpt-4o-mini
        name: openai
        params:
          openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
    
      - from: openai:llama3-groq-70b-8192-tool-use-preview
        name: groq-llama
        params:
          endpoint: https://api.groq.com/openai/v1
          openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
    datasets:
      - name: taxi_trips
        description: NYC taxi trip rides
        metadata:
          instructions: Always provide citations with reference URLs.
          reference_url_template: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_<YYYY-MM>.parquet
        columns:
          - name: tpep_pickup_time
            description: 'The time the passenger was picked up by the taxi'
          - name: notes
            description: 'Optional notes about the trip'
            embeddings:
              - from: hf_minilm # A defined Spice Model
                chunking:
                  enabled: true
                  target_chunk_size: 512
                  overlap_size: 128
                  trim_whitespace: true
    >>> from spicepy import Client
    >>> import os
    >>> client = Client(os.environ["API_KEY"])
    >>> rdr = client.query("SELECT * FROM eth.recent_blocks")
    <pyarrow._flight.FlightStreamReader object at 0x1059c9980>
    import os

    from spicepy import Client

    client = Client(os.environ["API_KEY"])
    query = """
        SELECT s_suppkey, s_name
        FROM tpch.supplier
    """
    
    reader = client.query(query)
    suppliers = reader.read_pandas()
    reader = client.query(query)
    
    has_more = True
    while has_more:
        try:
            flight_batch = reader.read_chunk()
            record_batch = flight_batch.data
            processChunk(record_batch.to_pandas())
        except StopIteration:
            has_more = False
    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient();
    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient({
        api_key: 'API_KEY',
        http_url: 'https://data.spiceai.io',
        flight_url: 'flight.spiceai.io:443'
    });
    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient('API_KEY');
    const table = await spiceClient.query("SELECT * from tpch.lineitem LIMIT 10")
    table.toArray().forEach((row) => {
      console.log(row.toJSON());
    });
    const table = await spiceClient.query(
      'SELECT sum(l_extendedprice) as sum_extendedprice FROM tpch.lineitem'
    );

    let sumExtendedPrice = table.getChild("sum_extendedprice");
    console.log(sumExtendedPrice?.toJSON())
    - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
      name: taxi_trips
    - from: spice.ai/spiceai/tpch/datasets/customer
      name: tpch.customer
    - from: spice.ai/spiceai/tpch/datasets/customer
      name: tpch.customer
      params:
        spiceai_api_key: ${secrets:spiceai_api_key}
      acceleration:
        enabled: true

    The from field must contain a valid URI to the location of a supported file. For example, http://static_username@my-http-api/report.csv.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    The connector supports Basic HTTP authentication via param values.

    Parameter Name
    Description

    http_port

    Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.

    http_username

Optional. Username for HTTP basic authentication. Default: None.

http_password

Optional. Password for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_http_pass}.

client_timeout

Optional. Specifies the timeout for HTTP operations. Default: 30s. E.g. client_timeout: 60s

    Examples

    Basic example

    Using Basic Authentication
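A sketch of a dataset using Basic Authentication, assuming a CSV report served over HTTPS (the host and file name are illustrative; the parameters are those documented above):

datasets:
  - from: https://my-http-api/report.csv
    name: report
    params:
      http_username: static_username
      http_password: ${secrets:my_http_pass}
      client_timeout: 60s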

    Debugging and Troubleshooting
    • Trace AI chat completion steps and tool interactions to identify why a request isn't responding as expected

    • Investigate failed queries and other task errors

    Performance Analysis

    • Track SQL query/tool use execution times

    • Identify slow-running tasks

    Usage Analytics

    • Track usage patterns by protocol and dataset

    • Understand how AI models are using tools to retrieve data from the datasets available to them

    Portal Interface

    The Spice platform provides a built-in UI for visualizing the observability traces that Spice OSS generates.

    An observability trace for an AI chat completion in the Spice portal.

    busy_timeout: Optional. Specifies the duration for the SQLite busy timeout when connecting to the database file. Default: 5000 ms.

Configuration params are provided in the acceleration section of a dataset. Other common acceleration fields can be configured for sqlite; see datasets.

    LIMITATIONS

    • The SQLite accelerator doesn't support advanced grouping features such as ROLLUP and GROUPING.

    • In SQLite, CAST(value AS DECIMAL) doesn't convert an integer to a floating-point value if the casted value is an integer. Operations like CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL) will be treated as integer division, resulting in 0 instead of the expected 0.5. Use FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT).

    • Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will cause SQLite accelerator query federation to disable until the Runtime is restarted.

• The SQLite accelerator doesn't support Arrow Interval types, as SQLite doesn't have a native interval type.

    • The SQLite accelerator only supports arrow List types of primitive data types; lists with structs are not supported.

    MEMORY CONSIDERATIONS

    When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.


    Edit app spicepod connected to GitHub

    If the Spice App is connected to a GitHub repository (learn more about how to connect), the only way to update the Spicepod configuration is to edit the spicepod.yaml file in the root of your repository and push it to the default branch.

    To apply the updated spicepod, a new deployment must be triggered. Learn more about Spicepod Deployments.

The App Code tab allows you to explore and preview files in the connected repository.

    Spice App Code

    Transfer an App
1. To transfer an app, click Settings in the app navigation.

   App Settings in the portal sidebar.

2. In the Danger Zone section of App Settings, click the Transfer app button.

3. On the Transfer application page, select the New owner organization from the menu.

4. Type the full app name into the text box to confirm and click Transfer Application to complete the process.

    The App will now be accessible by the receiving organization and its members.


    Start typing a SQL command, such as SELECT * FROM

  • As you type, the Query Editor will suggest possible completions based on the query context. You can use the arrow keys or mouse to select a completion, and then press Enter or Tab to insert it into the editor.

  • Examples of using the SQL suggestions:

    • Select the spice.runtime.metrics table:

      • Type SELECT * FROM and press Tab. The editor will suggest spice.runtime.metrics as a possible table. Press Enter to insert it into the query.

    A list of datasets available in Spice
    • Show the fields in the spice.runtime.metrics table:

• Type SELECT * FROM spice.runtime.metrics WHERE. The editor will list the fields in the table.

    The available fields for spice.runtime.metrics along with their type.

    Datasets Reference

    The datasets reference displays all available datasets from the current app and allows you to search through them. Clicking on the dataset will insert a sample query into the SQL editor, which will be automatically selected for execution.

    Search for dataset in reference

    Spice.ai Query Editor
    Metrics Dashboards

    Track request volume, data usage, and query time at 1 hour, 24 hours, 7 days and 28 days granularity. Start by going to your app and navigating to the Monitoring tab.

    API Request Logs

    Under the Monitoring tab, select Logs. You can then toggle between Metrics and Logs views.

    Within Logs, you have the option to retrieve API requests from the past hour, 8 hours, 24 hours, or up to the previous three days.

    Monitoring Overview

2. Navigate to the Settings tab and select the Secrets section.

3. Fill the Secret Name and Secret Value fields and click Add.

4. Saved secrets can be referenced in the Spicepod configuration as ${secrets:<SECRET_NAME>}, for example:

5. To apply secrets, you must initiate a new spicepod deployment. Learn more about deployments.

    models:   
      - from: openai:gpt-4o
        name: gpt-4o
        params:
          openai_api_key: ${secrets:OPENAI_API_KEY}

    Organizations

    Organizations enable you to share apps, datasets, users, billing, and settings with your team. Organization administrators can set who has access to their organization's resources and data.

    When you create an account on Spice.ai, a single member organization of the same name as your username is created for you and you are automatically added as a member and the owner of the organization.

    Creating an organization

    Spice.ai organizations are created by connecting an existing GitHub organization to Spice.ai.

1. Click on the organization dropdown icon in the application selector. Next, select the Create Org option from the menu.

2. Check to accept the terms and conditions for the new organization, then proceed by clicking the Connect GitHub organization button.

3. A window will pop up from GitHub where you can select the organization to install the Spice.ai app into.

4. On the confirmation page, proceed by clicking the Install button.

5. Upon successful connection, you will be automatically redirected to the newly created Spice.ai organization.

    View Organizations

    To view your organizations, click the dropdown icon from the application selector.

    All organizations you have access to are listed.

    Organization Management

    Click on the first tab to access the details of your current organization or select another organization from the menu to view its information. On this page, you will see all the applications that have been created within the selected organization.

    Click the Settings tab to view information about the organization, including members and billing information.

    Adding organization members

    To add an existing Spice.ai user to an organization:

    1. Navigate to the organization's settings.

    2. Click the Add Member button.

    3. Enter the Spice.ai username of the user you wish to add to the organization.

    4. Click the Add Member button to confirm.

    The user will be added to the organization and they will receive an email notifying them of the new membership.

To invite a GitHub user to a Spice.ai organization:

Enter the GitHub username of the user you wish to invite to the organization and select the user from the search results. Only users with a public email address can be invited.

    The invited user will receive an invitation link. Once they accept the invitation, they will be granted access to the organization.

    To invite anyone by email

    Enter the email address of the user you want to invite to the organization, then click Send invite.

    Removing organization members

    To remove a member from an organization:

    1. Navigate to the organization's settings.

    2. Locate the user you wish to remove from the list of members.

    3. Click the ellipsis on the right of the user's card.

4. Confirm the removal by clicking Remove member from organization.

    Agentic AI Apps

    Using Spice.ai for Agentic AI Applications

    Build intelligent autonomous agents that act contextually by grounding AI models in secure, full-knowledge datasets with fast, iterative feedback loops.

    Spice.ai helps in building intelligent autonomous agents by leveraging several key features:

    Federated SQL Query

    Spice.ai enables federated querying across databases, data warehouses, and lakes. With advanced query push-down optimizations, it ensures efficient retrieval and processing of data across disparate sources, reducing latency and operational complexity. Learn more about Federated SQL Query. For practical implementation, refer to the Federated SQL Query recipe.

    Data Acceleration & Materialization with Change Data Capture (CDC)

Spice.ai materializes application-specific datasets close to the point of use, reducing query and retrieval times as well as infrastructure costs. It supports Change Data Capture (CDC), keeping materialized datasets up-to-date with minimal overhead and enabling real-time, reliable data access. Learn more about Data Acceleration. See the DuckDB Data Accelerator recipe for an example.

    AI Gateway

Integrate AI into your applications with Spice.ai’s AI Gateway. It supports hosted models like OpenAI and Anthropic and local models such as OSS Llama and NVIDIA NIM. Fine-tuning and model distillation are simplified, enabling faster cycles of development and deployment. Learn more about AI Gateway. Refer to the Running Llama3 Locally recipe for details.

    Search with Vector Similarity Search (VSS)

Spice.ai provides advanced search capabilities, including vector similarity search (VSS), enabling efficient retrieval of unstructured data, embeddings, and AI model outputs. This is critical for applications like RAG and intelligent search systems. Learn more about Vector Similarity Search. For implementation, see the Searching GitHub Files recipe.

    Semantic Model for AI

Built-in semantic models allow Spice.ai to align AI operations with enterprise data, ensuring that applications are grounded in contextual, full-knowledge datasets. This enhances the accuracy and reliability of AI outputs while reducing risks of irrelevant or untrustworthy results. Learn more about Semantic Model for AI.

    Monitoring and Observability

Spice.ai includes robust monitoring and observability tools tailored for AI applications. These tools provide end-to-end visibility into data flows and AI workflows, LLM-specific observability to monitor model performance, track usage, and manage drift, and security and compliance auditing for data and model interactions. Learn more about Monitoring and Observability.

    Connect GitHub

    Connect your Spice.ai app to a GitHub repository

    Prepare the repository

    Before connecting:

    • Admin access: Ensure you have administrative access to the GitHub repository. This level of access is required to install the Spice.ai GitHub app.

    • Matching repository name: The Spice.ai app and the GitHub repository names must match. For example:

  • Spice.ai app: spice.ai/spiceai/demo

  • GitHub repository: github.com/spiceai/demo

To quickly set up a new repository, use the spiceai/spicepod-template repository as a starting point:

Make sure to copy the app's spicepod.yaml contents from the Code tab and place it in the root of the repository before linking.

    Connect

    1. Ensure the repository is set up as per instructions above.

    2. In the context of the Spice app to connect, navigate to Settings, then click the Connect repository button.

3. Follow GitHub App installation instructions.

    Ensure that you select all repositories or specifically the repository you intend to connect.

4. Finally, link the repository to your Spice.ai app.

    Snowflake

    Snowflake Data Connector Documentation

    import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

    The Snowflake Data Connector enables federated SQL queries across datasets in the Snowflake Cloud Data Warehouse.

Hint: Unquoted table identifiers should be UPPERCASED in the from field. See Identifier resolution.

    Configuration

    from

    A Snowflake fully qualified table name (database.schema.table). For instance snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM or snowflake:TAXI_DATA."2024".TAXI_TRIPS

    name

    The dataset name. This will be used as the table name within Spice.

    params

    Parameter Name
    Description

    Auth

The connector supports password-based and key-pair authentication. Login requires the account identifier ('orgname-accountname' format) - use Snowflake's instructions for finding the organization and account name for an account.

    Example
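A minimal Snowflake dataset configuration, matching the example that appears later in this document (the required snowflake_account and snowflake_username params, plus a password or key-pair secret, are supplied separately):

datasets:
  - from: snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
    name: lineitem
    params:
      snowflake_warehouse: COMPUTE_WH
      snowflake_role: accountadmin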

    Limitations

1. Account identifier does not support the legacy account locator in a region format. Use the Snowflake preferred name in organization format.

2. The connector supports password-based and key-pair authentication only.

    DuckDB Data Accelerator

    To use DuckDB as Data Accelerator, specify duckdb as the engine for acceleration.

    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          engine: duckdb

    Configuration

    Spice.ai currently only supports mode: memory for DuckDB accelerator.

Configuration params are provided in the acceleration section for a data store. Other common acceleration fields can be configured for DuckDB; see the Data Accelerator Documentation.

    LIMITATIONS

• The DuckDB accelerator does not support nested lists, or structs with nested structs/lists. For example:

      • Supported:

    MEMORY CONSIDERATIONS

    When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

    Dotnet SDK

    Dotnet SDK for Spice.ai

    The Dotnet SDK spiceai is the easiest way to query Spice.ai from Dotnet.

    It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.

    Requirements

    • .Net 6.0+ or .Net Standard 2.0+

    Installation

    Add Spice SDK

    Usage

1. Create a SpiceClient by providing your API key to SpiceClientBuilder. Get your free API key at spice.ai.

2. Execute a query and get back an Apache Arrow Flight Client Record Batch Stream Reader.

3. Iterate through the reader to access the records.

    Usage with local Spice runtime

Follow the quickstart guide to install and run spice locally.

    Contributing

Contribute to or file an issue with the spice-dotnet library at: https://github.com/spiceai/spice-dotnet

    Grafana

    Create custom dashboards and visualizations using the FlightSQL or Infinity Grafana Plugins:

    • FlightSQL Grafana Plugin

    • Infinity Grafana Plugin

    Setup with FlightSQL Grafana Plugin

    Installing FlightSQL Grafana Plugin

Use the FlightSQL Grafana Plugin installation instructions for an installation reference.

    In the Grafana installation, navigate to Administration -> Plugins and data -> Plugins. Select "State: All" and search for "FlightSQL" in the Search box. Select "FlightSQL" in the list of plugins.

    Click the "Install" button to install the plugin. Once installed, a new data source can be added.

    Adding Spice as a data source

    Click the "Add new data source" button from the FlightSQL plugin menu.

    In the data source menu, provide the following options:

    • Host:Port - set to flight.spiceai.io:443 to connect to the Spice.ai Cloud Platform

    • Auth Type - set to token

• Token - set to your Spice.ai API key.

    Once these options are set, click "Save & Test". Grafana should report "OK" if the configuration is correct.

    Creating visualizations

    Create a visualization using the FlightSQL Spice.AI data source. Select the configured datasource from the list of datasources in the visualization creator.

    Create your query using the SQL builder, or switch to a plain SQL editor by clicking the "Edit SQL" button. In this example, the query retrieves the latest query execution times from the configured App instance and displays them as a line graph.

    Grafana may not automatically update the visualization when changes are made to the query. To force the visualization to update with new query changes, click the "Query Inspector" button, and click "Refresh".

    Java SDK

    The Java SDK is the easiest way to query the Spice Cloud Platform from Java.

    It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.

    Supported Java Versions

    This library supports the following Java implementations:

    • OpenJDK 11

    • OpenJDK 17

    • OpenJDK 21

    • OracleJDK 11

    • OracleJDK 17

    • OracleJDK 21

    • OracleJDK 22

    Installation

    Usage

    1. Import the package.

2. Create a SpiceClient by providing your API key. Get your free API key at spice.ai.

3. Execute a query and get back a FlightStream.

4. Iterate through the FlightStream to access the records.

Check the full example to learn more.

    Usage with local Spice.ai OSS runtime

Follow the quickstart guide to install and run spice locally.

    Or using custom flight address:

Check the Spice OSS documentation or the Java SDK Sample to learn more.

    Connection retry

The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured with withMaxRetries:

    Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).

    Contributing

Contribute to or file an issue with the spice-java library at: https://github.com/spiceai/spice-java

    App Spicepod

    Every Spice app is powered by a managed instance of the Spice OSS Runtime deployed to the platform.

    A Spicepod is a package that encapsulates application-centric datasets and machine learning (ML) models.

Spicepods are analogous to code packaging systems like NPM; however, they differ by expanding these concepts to data and ML models.

Structure

    A Spicepod is described by a YAML manifest file, typically named spicepod.yaml, which includes the following key sections:

    • Metadata: Basic information about the Spicepod, such as its name and version.

    • Datasets: Definitions of datasets that are used or produced within the Spicepod.

    • Catalogs: Definitions of catalogs that are used within the Spicepod.

    • Models: Definitions of ML models that the Spicepod manages, including their sources and associated datasets.

    Example Manifest
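A representative spicepod.yaml manifest (the same example appears in full later in this document):

version: v1beta1
kind: Spicepod
name: my_spicepod

datasets:
  - from: spice.ai/spiceai/quickstart
    name: qs
    acceleration:
      enabled: true
      refresh_mode: append

models:
  - from: openai/gpt-4o
    name: gpt-4o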

    Key Components

    Datasets

Datasets in a Spicepod can be sourced from various locations, including local files or remote databases. They can be materialized and accelerated using different engines such as DuckDB, SQLite, or PostgreSQL to optimize performance.

    Catalogs

    Catalogs in a Spicepod can contain multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored.

    Models

    ML models are integrated into the Spicepod similarly to datasets. The models can be specified using paths to local files or remote locations. ML inference can be performed using the models and datasets defined within the Spicepod.

To learn more, please refer to the full Spicepod specification.

    ClickHouse

    ClickHouse Data Connector Documentation

    ClickHouse is a fast, open-source columnar database management system designed for online analytical processing (OLAP) and real-time analytics. This connector enables federated SQL queries from a ClickHouse server.

    Configuration

    from

    FTP

    FTP/SFTP Data Connector Documentation

    FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols used for transferring files between a client and server, with FTP being less secure and SFTP providing encrypted file transfer over SSH.

The FTP/SFTP Data Connector enables federated/accelerated SQL query across datasets stored in FTP/SFTP servers.

    Configuration

    Dremio

    Dremio Data Connector Documentation

Dremio is a data lake engine that enables high-performance SQL queries directly on data lake storage. It provides a unified interface for querying and analyzing data from various sources without the need for complex data movement or transformation.

    This connector enables using Dremio as a data source for federated SQL queries.

    Configuration

    Data Acceleration

Configure local acceleration for datasets in Spice for faster queries

Datasets can be locally accelerated by the Spice runtime, pulling data from any Data Connector and storing it locally in a Data Accelerator for faster access. The data can be kept up-to-date in real-time or on a refresh schedule, ensuring users always have the latest data locally for querying.

    Supported Data Accelerators

Dataset acceleration is enabled by setting the acceleration configuration. Spice currently supports In-Memory Arrow, DuckDB, SQLite, and PostgreSQL as accelerators. For engine-specific configuration, see the Data Accelerator Documentation.

    Python SDK

The Python SDK spicepy is the easiest way to use and query Spice.ai in Python.

The Python SDK uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to Pandas data frames.

    Requirements

    datasets:
      - from: http://my-http-api.com/report.csv
        name: local_report
        params:
          http_password: ${env:MY_HTTP_PASS}
    datasets:
      - from: http://[email protected]/report.csv
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
        name: laion_freesound
    datasets:
      - from: http://[email protected]/report.csv
        name: local_report
        params:
          http_password: ${env:MY_HTTP_PASS}
    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          engine: sqlite
    models:
      - name: webs
        from: perplexity:sonar
        params:
          perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }
          perplexity_search_domain_filter:
           - docs.spiceai.org
           - huggingface.co
          openai_temperature: 0.3141595
    datasets:
      - from: snowflake:DATABASE.SCHEMA.TABLE
        name: table
        params:
          snowflake_warehouse: COMPUTE_WH
          snowflake_role: accountadmin
  • SELECT {'x': 1, 'y': 2, 'z': 3}

  • Unsupported:

    • SELECT [['duck', 'goose', 'heron'], ['frog', 'toad']]

    • SELECT {'x': [1, 2, 3]}

  • The DuckDB accelerator does not support enum, dictionary, or map field types. For example:

    • Unsupported:

      • SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])

  • The DuckDB accelerator does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.

  • Updating a dataset with DuckDB acceleration while the Spice Runtime is running (hot-reload) will cause the DuckDB accelerator query federation to disable until the Runtime is restarted.


    snowflake_private_key_passphrase

    Optional, specifies the Snowflake private key passphrase

    snowflake_warehouse

    Optional, specifies the Snowflake Warehouse to use

    snowflake_role

    Optional, specifies the role to use for accessing Snowflake data

    snowflake_account

    Required, specifies the Snowflake account-identifier

    snowflake_username

    Required, specifies the Snowflake username to use for accessing Snowflake data

    snowflake_password

    Optional, specifies the Snowflake password to use for accessing Snowflake data

    snowflake_private_key_path

    Optional, specifies the path to Snowflake private key

    <dependency>
        <groupId>ai.spice</groupId>
        <artifactId>spiceai</artifactId>
        <version>0.3.0</version>
        <scope>compile</scope>
    </dependency>
    implementation 'ai.spice:spiceai:0.3.0'
    Example - Locally Accelerating taxi_trips with Arrow Accelerator
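A minimal sketch, assuming the default In-Memory Arrow engine (engine: arrow) and the quickstart taxi_trips dataset used elsewhere in these docs:

datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      engine: arrow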

    Refresh Modes

    Spice supports three modes to refresh/update locally accelerated data from a connected data source. full is the default mode. Refer to Data Refresh documentation for detailed refresh usage and configuration.

    Mode
    Description
    Example

    full

    Replace/overwrite the entire dataset on each refresh

    A table of users

    append

    Append/add data to the dataset on each refresh

    Append-only, immutable datasets, such as time-series or log data

    changes

    Apply incremental changes

    Customer order lifecycle table

    Example - Accelerate with arrow accelerator under full refresh mode

    Indexes

    Database indexes are essential for optimizing query performance. Configure indexes for accelerators via indexes field. For detailed configuration, refer to the index documentation.

    Example - Configure indexes with SQLite Accelerator

    Constraints

    Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.

    Constraints are specified using column references in the Spicepod via the primary_key field in the acceleration configuration. Additional unique constraints are specified via the indexes field with the value unique. Data that violates these constraints will result in a conflict. For constraints configuration details, visit Constraints Documentation.

    Example - Configure primary key constraints with SQLite Accelerator

    datasets:
      - from: snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
        name: lineitem
        params:
          snowflake_warehouse: COMPUTE_WH
          snowflake_role: accountadmin
    dotnet add package spiceai
    using Spice;
    
    var client = new SpiceClientBuilder()
        .WithApiKey("API_KEY")
        .WithSpiceCloud("http://my_remote_spice_instance:50051")
        .Build();
    var reader = await client.Query("SELECT * FROM tpch.lineitem LIMIT 10;");
var enumerator = reader.GetAsyncEnumerator();
    while (await enumerator.MoveNextAsync())
    {
        var batch = enumerator.Current;
        // Process batch
    }
    using Spice;
    
    var client = new SpiceClientBuilder()
        .Build();
        
    var data = await client.Query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;");
    import ai.spice.SpiceClient;
    SpiceClient spice = SpiceClient.builder()
        .withApiKey(ApiKey)
        .withSpiceCloud()
        .build()
    FlightStream stream = spice.query("SELECT * FROM tpch.lineitem LIMIT 10");
    while (stream.next()) {
        try (VectorSchemaRoot batches = stream.getRoot()) {
            System.out.println(batches.contentToTSVString());
        }
    }
    SpiceClient spice = SpiceClient.builder()
        .build();
    
    SpiceClient spice = SpiceClient.builder()
        .withFlightAddress(new URI("grpc://my_remote_spice_instance:50051"))
        .build();
    SpiceClient client = SpiceClient.builder()
        .withMaxRetries(5) // Setting to 0 will disable retries
        .build();
    version: v1beta1
    kind: Spicepod
    name: my_spicepod
    
    datasets:
      - from: spice.ai/spiceai/quickstart
        name: qs
        acceleration:
          enabled: true
          refresh_mode: append
    
    models:
      - from: openai/gpt-4o
        name: gpt-4o
    datasets:
      - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
        name: taxi_trips
        acceleration:
          enabled: true
          refresh_mode: full
          refresh_check_interval: 10s
    datasets:
      - from: databricks:my_dataset
        name: accelerated_dataset
        acceleration:
          refresh_mode: full
          refresh_check_interval: 10m
    datasets:
      - from: spice.ai/eth.recent_blocks
        name: eth.recent_blocks
        acceleration:
          enabled: true
          engine: sqlite
          indexes:
            number: enabled # Index the `number` column
            '(hash, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
    datasets:
      - from: spice.ai/eth.recent_blocks
        name: eth.recent_blocks
        acceleration:
          enabled: true
          engine: sqlite
          primary_key: hash # Define a primary key on the `hash` column
          indexes:
            '(number, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns

The from field for the ClickHouse connector takes the form clickhouse:db.dataset where db.dataset is the path to the Dataset within ClickHouse. In the example above it would be my.dataset.

    If db is not specified in either the from field or the clickhouse_db parameter, it will default to the default database.

    name

    The dataset name. This will be used as the table name within Spice.

    params

    The ClickHouse data connector can be configured by providing the following params:

    Parameter Name
    Definition

    clickhouse_connection_string

    The connection string to use to connect to the ClickHouse server. This can be used instead of providing individual connection parameters.

    clickhouse_host

    The hostname of the ClickHouse server.

    clickhouse_tcp_port

    The port of the ClickHouse server.

    clickhouse_db

    The name of the database to connect to.

    clickhouse_user

    The username to connect with.

    clickhouse_pass

    The password to connect with.

    Examples
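A sketch of a dataset configured with individual connection parameters instead of a connection string; the host, port, database, and credentials below are placeholders (9000 is ClickHouse's default native TCP port):

datasets:
  - from: clickhouse:my.dataset
    name: my_dataset
    params:
      clickhouse_host: my-clickhouse-host
      clickhouse_tcp_port: 9000
      clickhouse_db: my_database
      clickhouse_user: my_user
      clickhouse_pass: ${secrets:my_clickhouse_pass}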

    Specifying a connection timeout

    Using a connection string

    from

    The from field takes one of two forms: ftp://<host>/<path> or sftp://<host>/<path> where <host> is the host to connect to and <path> is the path to the file or directory to read from.

    If a folder is provided, all child files will be loaded.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    FTP

    Parameter Name
    Description

    file_format

Specifies the data file format. Required if the format cannot be inferred from the from path. See Object Store File Formats.

    ftp_port

    Optional, specifies the port of the FTP server. Default is 21. E.g. ftp_port: 21

    ftp_user

    The username for the FTP server. E.g. ftp_user: my-ftp-user

    ftp_pass

The password for the FTP server. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_ftp_pass}.

    client_timeout

    Optional. Specifies timeout for FTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for FTP client.

    hive_partitioning_enabled

    Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false

    SFTP

    Parameter Name
    Description

    file_format

Specifies the data file format. Required if the format cannot be inferred from the from path. See Object Store File Formats.

    sftp_port

    Optional, specifies the port of the SFTP server. Default is 22. E.g. sftp_port: 22

    sftp_user

    The username for the SFTP server. E.g. sftp_user: my-sftp-user

    sftp_pass

The password for the SFTP server. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_sftp_pass}.

    client_timeout

    Optional. Specifies timeout for SFTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for SFTP client.

    hive_partitioning_enabled

    Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false

    Examples

    Connecting to FTP

    Connecting to SFTP

    from

    The from field takes the form dremio:dataset where dataset is the fully qualified name of the dataset to read from.

    Limitations

    Currently, only up to three levels of nesting are supported for dataset names (e.g., a.b.c). Additional levels are not supported at this time.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    Parameter Name
    Description

    dremio_endpoint

    The endpoint used to connect to the Dremio server.

    dremio_username

    The username used to connect to the Dremio endpoint.

    dremio_password

The password used to connect to the Dremio endpoint. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_dremio_pass}.

    Examples

    Connecting to a GRPC endpoint

    Types

    The table below shows the Dremio data types supported, along with the type mapping to Apache Arrow types in Spice.

    Dremio Type
    Arrow Type

    INT

    Int32

    BIGINT

    Int64

    FLOAT

    Float32

    DOUBLE

    Float64

    DECIMAL

    Decimal128

    VARCHAR

    Utf8

    Limitations

    • Dremio connector does not support queries with the EXCEPT and INTERSECT keywords in Spice REPL. Use DISTINCT and IN/NOT IN instead. See the example below.


    Python 3.11+

    The following packages are required and will be automatically installed by pip:

    • pyarrow

    • pandas

• certifi

    • requests

    Apple M1 Mac Requirements - How do I know if I have an M1?

    Apple M1 Macs require an arm64 compatible version of pyarrow which can be installed using miniforge. We recommend the following procedure:

    • Install Homebrew

    • Install miniforge with:

    • Initialize conda in your terminal with:

    • Install pyarrow and pandas with:

While Anaconda can be used to install pyarrow, the installed version is old (4.0.0), so we recommend using the miniforge distribution.

    Installation

    Install the spicepy package directly from the Spice Github Repository at https://github.com/spiceai/spicepy:

    Usage

    Import spicepy and create a Client by providing your API Key.

    You can then submit queries using the query function.

    Querying data is done through a Client object that initializes the connection with the Spice.ai endpoint. Client has the following arguments:

    • api_key (string, optional): Spice.ai API key to authenticate with the endpoint.

    • url (string, optional): URL of the endpoint to use (default: grpc+tls://flight.spiceai.io)

• tls_root_cert (Path or string, optional): Path to the tls certificate to use for the secure connection (omit for automatic detection)

    Once a Client is obtained queries can be made using the query() function. The query() function has the following arguments:

    • query (string, required): The SQL query.

    • timeout (int, optional): The timeout in seconds.

A custom timeout can be set by passing the timeout parameter in the query function call. If no timeout is specified, it defaults to a 10 minute timeout, after which the query is cancelled and a TimeoutError exception is raised.

    Usage with local Spice runtime

    Follow the quickstart guide to install and run spice locally.

    Contributing

    Contribute to or file an issue with the spicepy library at: https://github.com/spiceai/spicepy


    Hugging Face

    Instructions for using machine learning models hosted on HuggingFace with Spice.

    To use a model hosted on HuggingFace, specify the huggingface.co path in the from field and, when needed, the files to include.

    Configuration

    from

The from key takes the form of huggingface:model_path. Below are two common examples of from key configuration.

    • huggingface:username/modelname: Implies the latest version of modelname hosted by username.

    • huggingface:huggingface.co/username/modelname:revision: Specifies a particular revision of modelname by username, including the optional domain.

The from key adheres to the following regex format.

    The from key consists of five components:

    1. Prefix: The value must start with huggingface:.

    2. Domain (Optional): Optionally includes huggingface.co/ immediately after the prefix. Currently no other Huggingface compatible services are supported.

    3. Organization/User: The HuggingFace organization (org).

    name

    The model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.

    params

    Param
    Description
    Default

    files

    The specific file path for Huggingface model. For example, GGUF model formats require a specific file path, other varieties (e.g. .safetensors) are inferred.

    Example

    Access Tokens

    Access tokens can be provided for Huggingface models in two ways:

    1. In the Huggingface token cache (i.e. ~/.cache/huggingface/token). Default.

2. Via model params.

    Examples

    Load a ML model to predict taxi trips outcomes

    Load a LLM model to generate text

    Load a private model

For more details on authentication, see access tokens.

    Limitations

• The throughput, concurrency & latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple metal and CUDA for accelerated inference.

    • ML models currently only support ONNX file format.

    OpenAI

    Instructions for using language models hosted on OpenAI or compatible services with Spice.

    To use a language model hosted on OpenAI (or compatible), specify the openai path in the from field.

    For a specific model, include it as the model ID in the from field (see example below). The default model is gpt-4o-mini.

    Configuration

    from

    The from field takes the form openai:model_id where model_id is the model ID of the OpenAI model, valid model IDs are found in the {endpoint}/v1/models API response.

    Example:
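For instance, the gpt-4o model used elsewhere in this documentation:

models:
  - from: openai:gpt-4o
    name: gpt-4o
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}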

    name

    The model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.

    params

    Param
    Description
    Default

See Parameter overrides for additional configuration options.

    Supported OpenAI Compatible Providers

    Spice supports several OpenAI compatible providers. Specify the appropriate endpoint in the params section.

    Azure OpenAI

Follow the Azure AI Models instructions.

    Groq

    Groq provides OpenAI compatible endpoints. Use the following configuration:
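A sketch, assuming Groq's OpenAI-compatible base URL (https://api.groq.com/openai/v1); the model ID and secret name are placeholders:

models:
  - from: openai:llama-3.1-8b-instant
    name: groq_model
    params:
      endpoint: https://api.groq.com/openai/v1
      openai_api_key: ${secrets:GROQ_API_KEY}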

    NVidia NIM

    NVidia NIM models are OpenAI compatible endpoints. Use the following configuration:
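A sketch, assuming a locally deployed NIM container serving the OpenAI-compatible API at http://localhost:8000/v1; the model ID is a placeholder:

models:
  - from: openai:meta/llama-3.1-8b-instruct
    name: nim_model
    params:
      endpoint: http://localhost:8000/v1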

View the Spice cookbook for an example of setting up NVidia NIM with Spice.

    Parasail

    Parasail also offers OpenAI compatible endpoints. Use the following configuration:
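A sketch with placeholder values; substitute the base URL, model ID, and API key from your Parasail account:

models:
  - from: openai:my-parasail-model
    name: parasail_model
    params:
      endpoint: https://<your-parasail-endpoint>/v1
      openai_api_key: ${secrets:PARASAIL_API_KEY}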

    Refer to the respective provider documentation for more details on available models and configurations.

    ML Models

    Spice Machine Learning (ML) Models

    Spice Models are in beta for Design Partners. Get in touch for more info.

    Spice Models enable the training and use of ML models natively on the Spice platform.

    The platform currently supports time-series forecasting models, with other categories of models planned.

    Hosted models have first-class access to co-located data for training and inferencing including: Spice managed datasets, user managed datasets, and custom datasets and views. Additionally, Spice Firecache can be leveraged to train and infer up to 10x faster.

    Defining a Model

    Models are defined using a YAML file. Model details such as data requirements, architecture, training parameters, and other important hyperparameters are defined in the model.yaml.

Add a model.yaml file to the repository path /models/[model_name]/model.yaml of a GitHub connected Spice app, replacing [model_name] with the desired model name. For example, the Gas Fees Predictions demo model uses the path /models/gas-fees/model.yaml.

Refer to the Models YAML specification for all available configuration options.

For example model manifests, see the models samples repo.

    Training a Model

In the spice.ai Portal, navigate to the Models tab of the Spice app.

    model.yaml files committed to the connected repository will be automatically detected and imported as Spice Models.

    Navigating to a specific Model will show detailed information as defined in the model.yaml.

    A training run can be started using the Train button.

    Training runs in progress will be shown and updated, along with historical training runs.

The Training Status will be updated to Complete for successfully completed training runs. Details and the Training Report are available on the Training Run page.

    Running Model Predictions

    Spice Models (beta) currently supports time-series forecasting.

    Additional categories of data science and machine learning are on our roadmap.

    A successfully trained model can be used to make predictions.

    The lookback data (inferencing data) is automatically provided by the platform and wired up to the inference, enabling a prediction to be made using a simple API call.

    AI Predictions in the Playground

    Navigate to AI Predictions in the Playground.

    Successfully trained models will be available for selection from the model selector drop down on the right.

    Clicking Predict will demonstrate calling the predictions API using lookback data within the Spice platform. A graph of the predicted value(s) along with the lookback data will be displayed.

    Predictions by API

    The Training Runs page provides training details including a copyable curl command to make a prediction from the command line.

For details on the API, see the Prediction Documentation.

    MSSQL

    Microsoft SQL Server Data Connector

    Microsoft SQL Server is a relational database management system developed by Microsoft.

    The Microsoft SQL Server Data Connector enables federated/accelerated SQL queries on data stored in MSSQL databases.

    Limitations

    1. The connector supports SQL Server authentication (SQL Login and Password) only.

    2. Spatial types (geography) are not supported, and columns with these types will be ignored.

    Configuration

    from

    The from field takes the form mssql:database.schema.table where database.schema.table is the fully-qualified table name in the SQL server.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:
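A basic configuration using a connection string secret, matching the example that appears later in this document:

datasets:
  - from: mssql:path.to.my_dataset
    name: my_dataset
    params:
      mssql_connection_string: ${secrets:mssql_connection_string}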

    params

The data connector supports the following params. Use the secret replacement syntax to load the secret from a secret store, e.g. ${secrets:my_mssql_conn_string}.

    Parameter Name
    Description

    Example

    Apache Arrow Flight API

    SQL Query Apache Arrow Flight API

    SQL query results can be served via a high-performance Apache Arrow Flight endpoint. Arrow Flight uses the gRPC protocol for efficient data transfer.

This setup enables high-speed access to your data in Python, Go, C++, C#, Rust, and Java, and makes it easy to use libraries like Pandas and NumPy.

    SDKs

It's recommended to use the Spice.ai SDKs to connect and query the Arrow Flight endpoint. SDKs are available for Python, Node.js, Go, Rust, Java, and Dotnet.

    In Python, query results can be easily converted to Pandas or NumPy formats.

    You may also use Apache's pyarrow library directly.

Note on Apple Silicon Macs - How do I know if I have Apple Silicon?

The spicepy/pyarrow installation requires miniforge.

See the Python SDK page for installation steps.

    Connecting to the Endpoint

    • Endpoint URL: grpc+tls://flight.spiceai.io

    • Basic Authentication:

• Username can be set to an empty string

• Password should be set to the API key of your app

    Requirements

• Table names must be fully-qualified. For example spiceai.quickstart

    Samples

Find code samples in Python in the Arrow Flight Samples.

    Troubleshooting

    Mac/Windows Certificate issue

    If you get this error:

    Could not get default pem root certs

Install the Let's Encrypt root certificates.

    Instructions for macOS

    First download the roots.pem file from the Let's Encrypt server:

Before running your code or Jupyter notebook, the environment variable GRPC_DEFAULT_SSL_ROOTS_FILE_PATH must be set to the pem file path. If you are using the command line, this can be done from the folder containing isrgrootx1.pem with:

The export command sets this variable only for the current terminal session, so it needs to be run every time you open a new terminal; alternatively, add it to your terminal profile. Note that $PWD is a bash-specific variable that will be replaced by the current directory path. You can download the certificate file isrgrootx1.pem to a specific location and use that path instead of $PWD.

    Instructions for Windows

    Go SDK

    Golang SDK for Spice.ai

    The Go SDK gospice is the easiest way to query Spice.ai from Go.

    It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.

    GoDocs are available at: pkg.go.dev/github.com/spiceai/gospice.

    Requirements

• Go 1.22 (or later)

    Installation

    Get the gospice package.

    Usage

    1. Import the package.

2. Create a SpiceClient by providing your API key. Get your free API key at spice.ai.

    3. Initialize the SpiceClient.

4. Execute a query and get back an Apache Arrow Record Reader.

    5. Iterate through the reader to access the records.

    Usage with local Spice runtime

Follow the quickstart guide to install and run spice locally.

    Or using custom flight address:

Check the Spice OSS documentation to learn more.

    Example

    Run go run . to execute a sample query and print the results to the console.

    Connection retry

The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via SetMaxRetries:

    Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).

    Contributing

Contribute to or file an issue with the gospice library at: https://github.com/spiceai/gospice

    SharePoint

    SharePoint Data Connector Documentation

    The SharePoint Data Connector enables federated SQL queries on documents stored in SharePoint.

    Example

    Returns

    Limitations

    DuckDB

    DuckDB Data Connector Documentation

    DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed for analytical query workloads. It is optimized for fast execution and can be embedded directly into applications, providing efficient data processing without the need for a separate database server.

    This connector supports DuckDB as a data source for federated SQL queries.

    Configuration

    GitHub Copilot

    How to use the Spice.ai for GitHub Copilot Extension

The Spice.ai for GitHub Copilot Extension makes it easy to access and chat with external data in GitHub Copilot, enhancing AI-assisted research, Q&A, code, and documentation suggestions for greater accuracy.

    Access structured and unstructured data from any Spice data source like GitHub, PostgreSQL, MySQL, Snowflake, Databricks, GraphQL, data lakes (S3, Delta Lake, OneLake), HTTP(s), SharePoint, and even FTP.

    Some example prompts:

    • @spiceai What documentation is relevant to this file?

    Rust SDK

    Rust SDK for Spice.ai

The Rust SDK spice-rs is the easiest way to query Spice.ai from Rust.

It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.

    Requirements

    datasets:
      - from: clickhouse:my.dataset
        name: my_dataset
    datasets:
      - from: clickhouse:my.dataset
        name: cool_dataset
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: clickhouse:my.dataset
        name: my_dataset
        params:
          clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database
          connection_timeout: 10000
          clickhouse_secure: true
    datasets:
      - from: clickhouse:my.dataset
        name: my_dataset
        params:
          clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database?connection_timeout=10000&secure=true
    datasets:
      - from: sftp://remote-sftp-server.com/path/to/folder/
        name: my_dataset
        params:
          file_format: csv
          sftp_port: 22
          sftp_user: my-sftp-user
          sftp_pass: ${secrets:my_sftp_password}
    datasets:
      - from: sftp://remote-sftp-server.com/path/to/folder/
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    - from: ftp://remote-ftp-server.com/path/to/folder/
      name: my_dataset
      params:
        file_format: csv
        ftp_user: my-ftp-user
        ftp_pass: ${secrets:my_ftp_password}
        hive_partitioning_enabled: false
    - from: sftp://remote-sftp-server.com/path/to/folder/
      name: my_dataset
      params:
        file_format: csv
        sftp_port: 22
        sftp_user: my-sftp-user
        sftp_pass: ${secrets:my_sftp_password}
        hive_partitioning_enabled: false
    # fail
    SELECT ws_item_sk FROM web_sales
    INTERSECT
    SELECT ss_item_sk FROM store_sales;
    
    # success
    SELECT DISTINCT ws_item_sk FROM web_sales
    WHERE ws_item_sk IN (
        SELECT DISTINCT ss_item_sk FROM store_sales
    );
    
    # fail
    SELECT ws_item_sk FROM web_sales
    EXCEPT
    SELECT ss_item_sk FROM store_sales;
    
    # success
    SELECT DISTINCT ws_item_sk FROM web_sales
    WHERE ws_item_sk NOT IN (
        SELECT DISTINCT ss_item_sk FROM store_sales
    );
    - from: dremio:datasets.dremio_dataset
      name: dremio_dataset
      params:
        dremio_endpoint: grpc://127.0.0.1:32010
        dremio_username: demo
        dremio_password: ${secrets:my_dremio_pass}
    datasets:
      - from: dremio:datasets.dremio_dataset
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    - from: dremio:datasets.dremio_dataset
      name: dremio_dataset
      params:
        dremio_endpoint: grpc://127.0.0.1:32010
        dremio_username: demo
        dremio_password: ${secrets:my_dremio_pass}
    pip install git+https://github.com/spiceai/[email protected]
    from spicepy import Client
    
    client = Client('API_KEY')
    data = client.query('SELECT * FROM eth.recent_blocks LIMIT 10;', timeout=5*60)
    pd = data.read_pandas()
    from spicepy import Client
    
    client = Client()
    data = client.query('SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;', timeout=5*60)
    pd = data.read_pandas()
    models:
      - from: openai:gpt-4o-mini
        name: openai_model
        params:
          openai_api_key: ${ secrets:OPENAI_API_KEY } # Required for official OpenAI models
          tools: auto # Optional. Connect the model to datasets via SQL query/vector search tools
          system_prompt: "You are a helpful assistant." # Optional.
    
          # Optional parameters
          endpoint: https://api.openai.com/v1 # Override to use a compatible provider (i.e. NVidia NIM)
          openai_org_id: ${ secrets:OPENAI_ORG_ID }
          openai_project_id: ${ secrets:OPENAI_PROJECT_ID }
    
          # Override default chat completion request parameters
          openai_temperature: 0.1
          openai_response_format: { "type": "json_object" }

    clickhouse_secure

    Optional. Specifies the SSL/TLS behavior for the connection, supported values:

    • true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.

    • false: This mode will not attempt to use an SSL connection, even if the server supports it.

    connection_timeout

    Optional. Specifies the connection timeout in milliseconds.

    brew install --cask miniforge
    conda init "$(basename "${SHELL}")"
    conda install pyarrow pandas

    VARBINARY

    Binary

    BOOL

    Boolean

    DATE

    Date64

    TIME

    Time32

    TIMESTAMP

    Timestamp(Millisecond, None)

    INTERVAL

    Interval

    LIST

    List

    STRUCT

    Struct

    MAP

    Map


4. Model Name: After a /, the model name (model).

5. Revision (Optional): A colon (:) followed by the git-like revision identifier (revision).

hf_token

    The Huggingface access token.

    -

    model_type

    The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseekv3

    -

    tools

Which tools should be made available to the model. Set to auto to use all available tools.

    -

    system_prompt

    An additional system prompt used for all chat completions to this model.

    -


    mssql_encrypt

    (Optional) Specifies whether encryption is required for the connection.

    • true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.

    • false: This mode will not attempt to use an SSL connection, even if the server supports it. Only the login procedure is encrypted.

    mssql_trust_server_certificate

    (Optional) Specifies whether the server certificate should be trusted without validation when encryption is enabled.

    • true: The server certificate will not be validated and it is accepted as-is.

    • false: (default) Server certificate will be validated against system's certificate storage.

    mssql_connection_string

    The ADO connection string to use to connect to the server. This can be used instead of providing individual connection parameters.

    mssql_host

    The hostname or IP address of the Microsoft SQL Server instance.

    mssql_port

    (Optional) The port of the Microsoft SQL Server instance. Default value is 1433.

    mssql_database

    (Optional) The name of the database to connect to. The default database (master) will be used if not specified.

    mssql_username

    The username for the SQL Server authentication.

    mssql_password

    The password for the SQL Server authentication.


    Rust 1.77.0

    Installation

    Add Spice SDK

    Usage

    1. Create a SpiceClient by providing your API key to ClientBuilder. Get your free API key at spice.ai.

    2. Execute a query and get back an Apache Arrow Flight Record Batch Stream.

    3. Iterate through the reader to access the records.

    Usage with local Spice runtime

    Follow the quickstart guide to install and run spice locally.

    Contributing

    Contribute to or file an issue with the spice-rs library at: https://github.com/spiceai/spice-rs

    \A(huggingface:)(huggingface\.co\/)?(?<org>[\w\-]+)\/(?<model>[\w\-]+)(:(?<revision>[\w\d\-\.]+))?\z
    models:
      - from: huggingface:huggingface.co/lmstudio-community/Qwen2.5-Coder-3B-Instruct-GGUF
        name: sloth-gguf
        files:
          - path: Qwen2.5-Coder-3B-Instruct-Q3_K_L.gguf
    models:
      - name: llama_3.2_1B
        from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
        params:
          hf_token: ${ secrets:HF_TOKEN }
    models:
      - from: huggingface:huggingface.co/spiceai/darts:latest
        name: hf_model
        files:
          - path: model.onnx
        datasets:
          - taxi_trips
    models:
      - from: huggingface:huggingface.co/microsoft/Phi-3.5-mini-instruct
        name: phi
    models:
      - name: llama_3.2_1B
        from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
        params:
          hf_token: ${ secrets:HF_TOKEN }
    datasets:
      - from: mssql:path.to.my_dataset
        name: my_dataset
        params:
          mssql_connection_string: ${secrets:mssql_connection_string}
    datasets:
      - from: mssql:path.to.my_dataset
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: mssql:SalesLT.Customer
        name: customer
        params:
          mssql_host: mssql-host.database.windows.net
          mssql_database: my_catalog
          mssql_username: my_user
          mssql_password: ${secrets:mssql_pass}
          mssql_encrypt: true
          mssql_trust_server_certificate: true
    curl -Lo isrgrootx1.pem https://letsencrypt.org/certs/isrgrootx1.pem 
    export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="$PWD/isrgrootx1.pem"
    @powershell -NoProfile -ExecutionPolicy unrestricted -Command ^
        (new-object System.Net.WebClient).Downloadfile( ^
            'https://letsencrypt.org/certs/isrgrootx1.pem', 'isrgrootx1.pem')
    set GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=%cd%\isrgrootx1.pem
    go get github.com/spiceai/gospice/v7
    import "github.com/spiceai/gospice/v7"
    spice := gospice.NewSpiceClient()
    defer spice.Close()
    if err := spice.Init(
        spice.WithApiKey(ApiKey),
        spice.WithSpiceCloudAddress()
    ); err != nil {
        panic(fmt.Errorf("error initializing SpiceClient: %w", err))
    }
    reader, err := spice.Query(context.Background(), "SELECT * FROM tpch.lineitem LIMIT 10")
    if err != nil {
        panic(fmt.Errorf("error querying: %w", err))
    }
    defer reader.Release()
    for reader.Next() {
        record := reader.Record()
        defer record.Release()
        fmt.Println(record)
    }
    spice := gospice.NewSpiceClient()
    defer spice.Close()
    
    if err := spice.Init(); err != nil {
        panic(fmt.Errorf("error initializing SpiceClient: %w", err))
    }
    spice := gospice.NewSpiceClient()
    defer spice.Close()
    
    if err := spice.Init(
        spice.WithFlightAddress("grpc://localhost:50052")
    ); err != nil {
        panic(fmt.Errorf("error initializing SpiceClient: %w", err))
    }
spice := gospice.NewSpiceClient()
    spice.SetMaxRetries(5) // Setting to 0 will disable retries
    cargo add spiceai
    use spiceai::ClientBuilder;
    
    #[tokio::main]
    async fn main() {
      let client = ClientBuilder::new()
        .api_key("API_KEY")
        .use_spiceai_cloud()
        .build()
        .await
        .unwrap();
    }
let mut flight_data_stream = client.query("SELECT * FROM tpch.lineitem LIMIT 10;").await.expect("Error executing query");
    while let Some(batch) = flight_data_stream.next().await {
        match batch {
            Ok(batch) => {
                /* process batch */
                println!("{:?}", batch)
            },
            Err(e) => {
                /* handle error */
            },
        };
    }
    use spiceai::ClientBuilder;
    
    #[tokio::main]
    async fn main() {
      let client = ClientBuilder::new()
        .build()
        .await
        .unwrap();
    
      let data = client.query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;").await;
    }

    openai_org_id

    The OpenAI organization ID.

    -

    openai_project_id

    The OpenAI project ID.

    -

    openai_temperature

    Set the default temperature to use on chat completions.

    -

    openai_response_format

    An object specifying the format that the model must output, see .

    -

    endpoint

    The OpenAI API base endpoint. Can be overridden to use a compatible provider (i.e. Nvidia NIM).

    https://api.openai.com/v1

    tools

    Which tools should be made available to the model. Set to auto to use all available tools.

    -

    system_prompt

    An additional system prompt used for all chat completions to this model.

    -

    openai_api_key

    The OpenAI API key.

    -

The sharepoint connector does not yet support creating a dataset from a single file (e.g. an Excel spreadsheet). Datasets must be created from a folder of documents (see Document Support).

    Configuration

    Parameters

    Name
    Required?
    Description

    sharepoint_client_id

    Yes

    The client ID of the Azure AD (Entra) application

    sharepoint_tenant_id

    Yes

    The tenant ID of the Azure AD (Entra) application.

    sharepoint_client_secret

    Optional

    For service principal authentication. The client secret of the Azure AD (Entra) application.

    Only one of sharepoint_client_secret or sharepoint_bearer_token is allowed.

    from formats

    The from field in a SharePoint dataset takes the following format:
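Based on the examples in this section, the general shape of the from field is (a sketch, not an exhaustive grammar):

# from: sharepoint:<drive_type>:<drive_identifier>/<subpath_type>:<subpath>
# e.g.
# from: sharepoint:drive:Documents/path:/top_secrets
# from: sharepoint:me/root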

    Drives

    drive_type in a SharePoint Connector from field supports the following types:

    Drive Type
    Description
    Example

    drive

    The SharePoint drive's name

    from: sharepoint:drive:Documents/...

    driveId

    The SharePoint drive's ID

    from: sharepoint:driveId:b!Mh8opUGD80ec7zGXgX9r/...

    site

    A SharePoint site's name

    from: sharepoint:site:MySite/...

    siteId

    A SharePoint site's ID

For the me drive type, the user is identified based on sharepoint_client_code, and it cannot be used with sharepoint_client_secret.

    For a name-based drive_id, the connector will attempt to resolve the name to an ID at startup.

    Subpaths

    Within a drive, the SharePoint connector can load documents from:

    Description
    Example

    The root of the drive

    from: sharepoint:me/root

    A specific path within the drive

    from: sharepoint:drive:Documents/path:/top_secrets

    A specific folder ID

    from: sharepoint:group:MyGroup/id:01QM2NJSNHBISUGQ52P5AJQ3CBNOXDMVNT

    Authentication

    Creating an Enterprise Application

    To use the SharePoint connector with service principal authentication, you will need to create an Azure AD application and grant it the necessary permissions. This will also support OAuth2 authentication for users within the tenant (i.e. sharepoint_bearer_token).

    1. Create a new Azure AD application in the Azure portal.

    2. Under the application's API permissions, add the following permissions: Sites.Read.All, Files.Read.All, User.Read, GroupMember.Read.All

      • For service principal authentication, Application permissions are required.

      • For user authentication, only delegated permissions are required.

    3. Add sharepoint_client_id (from the Application (Client) ID field) and sharepoint_tenant_id to the connector configuration.

    4. Under the application's Certificates & secrets, create a new client secret. Use this for the sharepoint_client_secret parameter.

    Document Support
    from

    The from field supports one of two forms:

| from | Description |
| --- | --- |
| duckdb:database.schema.table | Read data from a table named database.schema.table in the DuckDB file |
| duckdb:* | Read data using any DuckDB function that produces a table, such as read_json, read_parquet, or read_csv. |

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    The DuckDB data connector can be configured by providing the following params:

| Parameter Name | Description |
| --- | --- |
| duckdb_open | The name of the DuckDB database to open. |

    Configuration params are provided either in the top level dataset for a dataset source, or in the acceleration section for a data store.
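As an illustrative sketch (the dataset source and file path are placeholders), duckdb_open can also be supplied under the acceleration params when DuckDB is used as the data store:

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset   # placeholder dataset source
    name: my_dataset
    acceleration:
      enabled: true
      engine: duckdb
      mode: file                        # persist acceleration data on disk
      params:
        duckdb_open: path/to/duckdb_file.duckdb
```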

    Examples

    Reading from a relative path

    A generic example of DuckDB data connector configuration.

    Reading from an absolute path

    DuckDB Functions

    Common data import DuckDB functions can also define datasets. Instead of a fixed table reference (e.g. database.schema.table), a DuckDB function is provided in the from: key. For example

    Datasets created from DuckDB functions are similar to a standard SELECT query. For example:

    is equivalent to:

    Many DuckDB data imports can be rewritten as DuckDB functions, making them usable as Spice datasets. For example:

    Limitations

    • The DuckDB connector does not support enum, dictionary, or map field types. For example:

      • Unsupported:

        • SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])

    • The DuckDB connector does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.


  • @spiceai Write documentation about the user authentication issue

  • @spiceai Who are the top 5 committers to this repository?

  • @spiceai What are the latest error logs from my web app?

Installing the Spice.ai for GitHub Copilot Extension

    To install the extension, visit the GitHub Marketplace and search for Spice.ai.

    The Spice.ai Extension in the GitHub Marketplace.

    Scroll down, and click Install it for free.

    Install the Community Edition of the Spice.ai Extension for free.

    Configuring the extension

1. Once installed, open Copilot Chat and type @spiceai. Press enter.

    Starting a conversation with @spiceai

2. A prompt will appear to connect to the Spice.ai Cloud Platform.

    Connection prompt

3. You will need to authorize the extension. Click Authorize spiceai.

    Permissions screen for the Spice AI Extension

4. To create an account on the Spice.ai Cloud Platform, click Authorize Spice AI Platform.

    Authorizing the Spice.ai Cloud Platform

5. Once your account is created, you can configure the extension. Select from a set of ready-to-use datasets to get started. You can configure other datasets after setup.

    GitHub Copilot Extension Setup page

6. The extension will take up to 30 seconds to deploy and load the initial dataset.

    GitHub Copilot Extension Deployment page

7. When complete, proceed back to GitHub Copilot Chat.

    Chatting with Copilot Chat

    Start a new chat

    To chat with the Spice.ai for GitHub Copilot extension, prefix the message with @spiceai

    If @spiceai does not appear in the popup (2), ensure that all the installation steps have been followed.

    Querying which datasets are available

    To list the datasets available to Copilot, try @spiceai What datasets do I have access to?

    Querying data using SQL

1. Navigate to Spice.ai and click Portal.

2. Click the copilot app. This will open the Spice.ai playground.

3. List the tables available: run show tables to list the tables that are available to the Copilot extension.

4. Query repository content. To query the GitHub content, query one of the tables above like so:

    SELECT name, path, url, content_embedding FROM react.docs LIMIT 10;

    Reset the Copilot instance

1. Navigate to spice.ai and click Portal.

2. Open the account menu in the top-right corner.

3. Click Account Settings.

4. Click GitHub Copilot.

5. Click Reset GitHub Copilot config.

    Spice AI GitHub Copilot Extension
    The @spiceai extension in GitHub Copilot.

    PostgreSQL Data Accelerator

    To use PostgreSQL as Data Accelerator, specify postgres as the engine for acceleration.

    Configuration

    The connection to PostgreSQL can be configured by providing the following params:

    • pg_host: The hostname of the PostgreSQL server.

    • pg_port: The port of the PostgreSQL server.

    • pg_db: The name of the database to connect to.

    • pg_user: The username to connect with.

• pg_pass: The password to connect with. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_pg_pass}.

    • pg_sslmode: Optional. Specifies the SSL/TLS behavior for the connection, supported values:

      • verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.

      • verify-ca: This mode requires a TLS connection and a valid root certificate.

    • pg_sslrootcert: Optional parameter specifying the path to a custom PEM certificate that the connector will trust.

    • connection_pool_size: Optional. The maximum number of connections to keep open in the connection pool. Default is 10.

Configuration params are provided in the acceleration section of a dataset.

    Arrow to PostgreSQL Type Mapping

The table below lists the supported Apache Arrow data types and their mappings to PostgreSQL types when stored.

    Arrow Type
    sea_query ColumnType
    PostgreSQL Type

    LIMITATIONS

• Postgres federated queries may return unexpected result types due to differences between DataFusion and Postgres type-widening rules. Explicitly specify the expected output type of aggregation functions when writing queries that involve Postgres tables in Spice. For example, rewrite SUM(int_col) as CAST(SUM(int_col) AS BIGINT).

    Delta Lake

    Delta Lake Data Connector Documentation

The Delta Lake data connector enables SQL queries from Delta Lake tables.

    Configuration

    from

The from field for the Delta Lake connector takes the form of delta_lake:path, where path is any supported path, either local or to a cloud storage location. See the examples section below.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.

    Parameter Name
    Description

    Delta Lake object store parameters

    AWS S3

    Parameter Name
    Description

    Azure Blob

Note: One of the following auth values must be provided for Azure Blob:

    • delta_lake_azure_storage_account_key,

    • delta_lake_azure_storage_client_id and delta_lake_azure_storage_client_secret, or

    • delta_lake_azure_storage_sas_key.

    Parameter Name
    Description

    Google Storage (GCS)

    Parameter Name
    Description

    Examples

    Delta Lake + Local

    Delta Lake + S3

    Delta Lake + Azure Blob

    Delta Lake + Google Storage

    Types

    The table below shows the Delta Lake data types supported, along with the type mapping to Apache Arrow types in Spice.

    Delta Lake Type
    Arrow Type

    Limitations

    • Delta Lake connector does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Delta Lake connector with such tables, drop the V2Checkpoint feature by executing the following command:

  For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features.

    Data Connectors

    Learn how to use Data Connector to query external data.

    Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.

    Supported Data Connectors include:

| Name | Description | Protocol/Format |
| --- | --- | --- |
| databricks (mode: delta_lake) | Databricks | S3/Delta Lake |
| delta_lake | Delta Lake | Delta Lake |

    Object Store File Formats

    For data connectors that are object store compatible, if a folder is provided, the file format must be specified with params.file_format.

    If a file is provided, the file format will be inferred, and params.file_format is unnecessary.

    File formats currently supported are:

    Name
    Parameter
    Supported
    Is Document Format

File formats support additional parameters in the params (like csv_has_header), described in File Formats.

If a format is a document format, each file will be treated as a document, as per document support below.

    Note Document formats in Alpha (e.g. pdf, docx) may not parse all structure or text from the underlying documents correctly.

    Node.js SDK

    The Node.js SDK spice.js is the easiest way to use and query Spice.ai with Node.js.

It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to JavaScript objects/arrays or JSON.

Requirements

    • Node.js 20+

    Installation

    Usage

    Import SpiceClient and instantiate a new instance with an API Key.

    You can then submit queries using the query function.

    SpiceClient has the following arguments:

    • apiKey (string, required): API key to authenticate with the endpoint.

    • url (string, optional): URL of the endpoint to use (default: flight.spiceai.io:443)

    sqlJson(query: string) - Execute SQL queries with JSON results

    The sqlJson() method executes SQL queries and returns results in a JSON format with schema information.

The response includes:

    • row_count: Number of rows returned

    • schema: Schema information with field names and types

    • data: Array of row objects

    • execution_time_ms: Query execution time in milliseconds

    Usage with local Spice runtime

Follow the quickstart guide to install and run spice locally.

Check the Spice OSS documentation to learn more.

    Connection retry

From version 1.0.1, the SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via setMaxRetries:

    Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).

    Contributing

Contribute to or file an issue with the spice.js library at: https://github.com/spiceai/spice.js.

    S3

    S3 Data Connector Documentation

    The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

    If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

    Configuration

    AI Chat

    Use AI Chat to interact with Spice Models

    PREREQUISITE

Ensure the Spice App is deployed with a model. For detailed instructions on how to deploy a model, refer to the Models documentation.

    Using AI Chat

    ABFS

    Azure BlobFS Data Connector Documentation

    The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (abfss://) and Azure Data Lake (adl://) endpoints.

    When a folder path is provided, all the contained files will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

    Configuration

    curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
    {
      "object": "list",
      "data": [
        {
          "id": "gpt-4o-mini",
          "object": "model",
          "created": 1727389042,
          "owned_by": "system"
        },
    ...
    }
    models:
      - from: openai:llama3-groq-70b-8192-tool-use-preview
        name: groq-llama
        params:
          endpoint: https://api.groq.com/openai/v1
          openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
    models:
      - from: openai:my_nim_model_id
        name: my_nim_model
        params:
          endpoint: https://my_nim_host.com/v1
          openai_api_key: ${ secrets:SPICE_NIM_API_KEY }
    models:
      - from: openai:parasail-model-id
        name: parasail_model
        params:
          endpoint: https://api.parasail.com/v1
          openai_api_key: ${ secrets:SPICE_PARASAIL_API_KEY }
    datasets:
      - from: sharepoint:drive:Documents/path:/top_secrets/
        name: important_documents
        params:
          sharepoint_client_id: ${secrets:SPICE_SHAREPOINT_CLIENT_ID}
          sharepoint_tenant_id: ${secrets:SPICE_SHAREPOINT_TENANT_ID}
          sharepoint_client_secret: ${secrets:SPICE_SHAREPOINT_CLIENT_SECRET}
    SELECT * FROM important_documents limit 1
    [
      {
        "created_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
        "created_by_name": "Jack Eadie",
        "created_at": "2024-09-09T04:57:00",
        "c_tag": "\"c:{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
        "e_tag": "\"{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
        "id": "01YRH3MPAPCNG33FJMLFHJ7E4FXUFJ4GYZ",
        "last_modified_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
        "last_modified_by_name": "Jack Eadie",
        "last_modified_at": "2024-09-09T04:57:00",
        "name": "ngx_google_perftools_module.md",
        "size": 959,
        "web_url": "https://spiceai.sharepoint.com/Shared%20Documents/md/ngx_google_perftools_module.md",
        "content": "# Module ngx_google_perftools_module\n\nThe `ngx_google_perftools_module` module (0.6.29) enables profiling of nginx worker processes using [Google Performance Tools](https://github.com/gperftools/gperftools). The module is intended for nginx developers.\n\nThis module is not built by default, it should be enabled with the `--with-google_perftools_module` configuration parameter.\n\n> **Note:** This module requires the [gperftools](https://github.com/gperftools/gperftools) library.\n\n## Example Configuration\n\n```nginx\ngoogle_perftools_profiles /path/to/profile;\n```\n\nProfiles will be stored as `/path/to/profile.<worker_pid>`.\n\n## Directives\n\n### google_perftools_profiles\n\n- **Syntax:** `google_perftools_profiles file;`\n- **Default:** —\n- **Context:** `main`\n\nSets a file name that keeps profiling information of nginx worker process. The ID of the worker process is always a part of the file name and is appended to the end of the file name, after a dot.\n"
      }
    ]
    from: 'sharepoint:<drive_type>:<drive_id>/<subpath_type>:<subpath_value>'
    datasets:
      - from: duckdb:database.schema.table
        name: my_dataset
        params:
          duckdb_open: path/to/duckdb_file.duckdb
    datasets:
      - from: duckdb:database.schema.table
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: duckdb:database.schema.table
        name: my_dataset
        params:
          duckdb_open: path/to/duckdb_file.duckdb
    datasets:
      - from: duckdb:sample_data.nyc.rideshare
        name: nyc_rideshare
        params:
          duckdb_open: /my/path/my_database.db
    datasets:
      - from: duckdb:database.schema.table
        name: my_dataset
        params:
          duckdb_open: path/to/duckdb_file.duckdb
    
      - from: duckdb:read_csv('test.csv', header = false)
        name: from_function
    datasets:
      - from: duckdb:read_csv('test.csv', header = false)
    -- from_function
    SELECT * FROM read_csv('test.csv', header = false);
    SELECT * FROM 'todos.json';
    
    -- As a DuckDB function
    SELECT * FROM read_json('todos.json');
    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          engine: postgres
    datasets:
      - from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
        name: my_delta_lake_table
        params:
          delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
          delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}

Additional drive types (continuing the table above):

| Drive Type | Description | Example |
| --- | --- | --- |
| group | A SharePoint group's name | from: sharepoint:group:MyGroup/... |
| groupId | A SharePoint group's ID | from: sharepoint:groupId:b!Mh8opUGD80ec7zGXgX9r/... |
| me | A user's OneDrive | from: sharepoint:me/... |


| Name | Parameter | Supported | Is Document Format |
| --- | --- | --- | --- |
| JSON | file_format: json | Roadmap | ❌ |
| Microsoft Excel | file_format: xlsx | Roadmap | ❌ |
| Markdown | file_format: md | ✅ | ✅ |
| Text | file_format: txt | ✅ | ✅ |
| PDF | file_format: pdf | Alpha | ✅ |
| Microsoft Word | file_format: docx | Alpha | ✅ |

Additional supported Data Connectors (continuing the table above):

| Name | Description | Protocol/Format |
| --- | --- | --- |
| dremio | Dremio | Arrow Flight |
| duckdb | DuckDB | Embedded |
| github | GitHub | GitHub API |
| postgres | PostgreSQL | |
| s3 | S3 | Parquet, CSV |
| mysql | MySQL | |
| delta_lake | Delta Lake | Delta Lake |
| graphql | GraphQL | JSON |
| databricks (mode: spark_connect) | Databricks | Spark Connect |
| flightsql | FlightSQL | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Tabular Data Stream (TDS) |
| snowflake | Snowflake | Arrow |
| spark | Spark | Spark Connect |
| spice.ai | Spice.ai | Arrow Flight |
| iceberg | Apache Iceberg | Parquet |
| abfs | Azure BlobFS | Parquet, CSV |
| clickhouse | Clickhouse | |
| debezium | Debezium CDC | Kafka + JSON |
| dynamodb | DynamoDB | |
| ftp, sftp | FTP/SFTP | Parquet, CSV |
| http, https | HTTP(s) | Parquet, CSV |
| sharepoint | Microsoft SharePoint | Unstructured UTF-8 documents |

Additional file formats (continuing the table above):

| Name | Parameter | Supported | Is Document Format |
| --- | --- | --- | --- |
| Apache Parquet | file_format: parquet | ✅ | ❌ |
| CSV | file_format: csv | ✅ | ❌ |
| Apache Iceberg | file_format: iceberg | Roadmap | ❌ |

  • require: This mode requires a TLS connection.

  • prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.

  • disable: This mode will not attempt to use a TLS connection, even if the server supports it.

| Arrow Type | sea_query ColumnType | PostgreSQL Type |
| --- | --- | --- |
| UInt8 | TinyUnsigned | smallint |
| UInt16 | SmallUnsigned | smallint |
| UInt32 | Unsigned | bigint |
| UInt64 | BigUnsigned | numeric |
| Decimal128 / Decimal256 | Decimal | decimal |
| Float32 | Float | real |
| Float64 | Double | double precision |
| Utf8 / LargeUtf8 | Text | text |
| Boolean | Boolean | bool |
| Binary / LargeBinary | VarBinary | bytea |
| FixedSizeBinary | Binary | bytea |
| Timestamp (no Timezone) | Timestamp | timestamp without time zone |
| Timestamp (with Timezone) | TimestampWithTimeZone | timestamp with time zone |
| Date32 / Date64 | Date | date |
| Time32 / Time64 | Time | time |
| Interval | Interval | interval |
| Duration | BigInteger | bigint |
| List / LargeList / FixedSizeList | Array | array |
| Struct | N/A | Composite (Custom type) |
| Int8 | TinyInteger | smallint |
| Int16 | SmallInteger | smallint |
| Int32 | Integer | integer |
| Int64 | BigInteger | bigint |


| Delta Lake Type | Arrow Type |
| --- | --- |
| Double | Float64 |
| Boolean | Boolean |
| Binary | Binary |
| Date | Date32 |
| Timestamp | Timestamp(Microsecond, Some("UTC")) |
| TimestampNtz | Timestamp(Microsecond, None) |
| Decimal | Decimal128 |
| Array | List |
| Struct | Struct |
| Map | Map |

| Parameter Name | Description |
| --- | --- |
| client_timeout | Optional. Specifies timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s |
| delta_lake_aws_region | Optional. The AWS region for the S3 object store. E.g. us-west-2. |
| delta_lake_aws_access_key_id | The access key ID for the S3 object store. |
| delta_lake_aws_secret_access_key | The secret access key for the S3 object store. |
| delta_lake_aws_endpoint | Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com. |
| delta_lake_azure_storage_account_name | The Azure Storage account name. |
| delta_lake_azure_storage_account_key | The Azure Storage master key for accessing the storage account. |
| delta_lake_azure_storage_client_id | The service principal client id for accessing the storage account. |
| delta_lake_azure_storage_client_secret | The service principal client secret for accessing the storage account. |
| delta_lake_azure_storage_sas_key | The shared access signature key for accessing the storage account. |
| delta_lake_azure_storage_endpoint | Optional. The endpoint for the Azure Blob storage account. |
| google_service_account | Filesystem path to the Google service account JSON key file. |

| Delta Lake Type | Arrow Type |
| --- | --- |
| String | Utf8 |
| Long | Int64 |
| Integer | Int32 |
| Short | Int16 |
| Byte | Int8 |
| Float | Float32 |

    npm install @spiceai/spice@latest --save
    yarn add @spiceai/spice
    datasets:
      - from: spice.ai:path.to.my_dataset
        name: my_dataset
        acceleration:
          engine: postgres
          params:
            pg_host: my_db_host
            pg_port: 5432
            pg_db: my_database
            pg_user: my_user
            pg_pass: ${secrets:my_pg_pass}
            pg_sslmode: require
    datasets:
      - from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    - from: delta_lake:/path/to/local/delta/table # A local filesystem path to a Delta Lake table
      name: my_delta_lake_table
    - from: delta_lake:s3://my_bucket/path/to/s3/delta/table/ # A reference to a table in S3
      name: my_delta_lake_table
      params:
        delta_lake_aws_region: us-west-2 # Optional
        delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
        delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}
        delta_lake_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
    - from: delta_lake:abfss://my_container@my_account.dfs.core.windows.net/path/to/azure/delta/table/ # A reference to a table in Azure Blob
      name: my_delta_lake_table
      params:
        # Account Name + Key
        delta_lake_azure_storage_account_name: my_account
        delta_lake_azure_storage_account_key: ${secrets:my_key}
    
        # OR Service Principal + Secret
        delta_lake_azure_storage_client_id: my_client_id
        delta_lake_azure_storage_client_secret: ${secrets:my_secret}
    
        # OR SAS Key
        delta_lake_azure_storage_sas_key: my_sas_key
    params:
      delta_lake_google_service_account_path: /path/to/service-account.json
    ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];
    import { SpiceClient } from "@spiceai/spice";
    
    const spiceClient = new SpiceClient("API_KEY");
    const table = await spiceClient.sql(
      'SHOW TABLES;'
    );
    console.table(table.toArray());
    const result = await spiceClient.sqlJson('SELECT name, age FROM users LIMIT 5');
    
    console.log(`Returned ${result.row_count} rows`);
    console.log('Schema:', result.schema);
    console.log('Data:', result.data);
    console.log(`Query took ${result.execution_time_ms}ms`);
    
    // Access individual rows
    result.data.forEach((row) => {
      console.log(`${row.name} is ${row.age} years old`);
    });
    import { SpiceClient } from '@spiceai/spice';
    
    const main = async () => {
      // uses connection to local runtime by default
      const spiceClient = new SpiceClient();
    
      // or use custom connection params:
      // const spiceClient = new SpiceClient({
      //   httpUrl: 'http://my_spice_http_host',
      //   flightUrl: 'my_spice_flight_host',
      // });
    
      const table = await spiceClient.sql(
        'SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;'
      );
      console.table(table.toArray());
    };
    
    main();
    const spiceClient = new SpiceClient('API_KEY');
    spiceClient.setMaxRetries(5); // Setting to 0 will disable retries
    from

    S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>

    Example: from: s3://my-bucket/path/to/file.parquet

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

| Parameter Name | Description |
| --- | --- |
| file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to Object Store File Formats for details. |
| s3_endpoint | S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server |
| s3_region | S3 bucket region. Default: us-east-1. |
| client_timeout | Timeout for S3 operations. Default: 30s. |
| hive_partitioning_enabled | Enable partitioning using hive-style partitioning from the folder structure. Defaults to false |
| s3_auth | Authentication type. Options: public, key and iam_role. Defaults to public if s3_key and s3_secret are not provided, otherwise defaults to key. |

    For additional CSV parameters, see CSV Parameters

    Authentication

    No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth to iam_role. If using iam_role, the AWS IAM role of the running instance is used.
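A hedged example of key-based authentication for a private bucket (the bucket, path, and secret names are placeholders):

```yaml
datasets:
  - from: s3://my-private-bucket/path/to/data/   # placeholder bucket and path
    name: private_data
    params:
      file_format: parquet
      s3_auth: key
      s3_key: ${secrets:AWS_ACCESS_KEY_ID}
      s3_secret: ${secrets:AWS_SECRET_ACCESS_KEY}
```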

    Minimum IAM policy for S3 access:

    Types

    Refer to Object Store Data Types for data type mapping from object store files to arrow data type.

    Examples

    Public bucket Example

    Create a dataset named taxi_trips from a public S3 folder.

    MinIO Example

    Create a dataset named cool_dataset from a Parquet file stored in MinIO.

    Hive Partitioning Example

    Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

    For example, a dataset partitioned by year, month, and day might have a directory structure like:

    Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.

    Limitations

    Performance Considerations

    When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

    Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.
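As a hedged sketch (the bucket path and dataset name are placeholders), an S3 dataset can be accelerated to disk rather than held in memory:

```yaml
datasets:
  - from: s3://my-bucket/path/to/data/   # placeholder bucket and path
    name: accelerated_data
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: sqlite
      mode: file   # store acceleration data on disk instead of in memory
```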

    Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.

    Open the AI Chat by navigating to an App Playground and clicking AI Chat in the sidebar.

    Start to use AI Chat by typing in the question and clicking send.

    Model-Dependent Capabilities

    The ability of AI Chat depends on the model configuration, including Language Model Overrides, Model Runtime Tools, etc. Refer to the Model Documentation for details of customizing the model used in AI Chat.

    Example:

    Below is an example demonstrating how to configure the OpenAI gpt-4o model with auto access to runtime tools and system prompts overrides. This model is customized to answer questions relevant to spicepod datasets.

    Ask questions regarding datasets configured in spicepod within the AI Chat.

    Observability

Spice.ai provides observability into the AI Chat, showing full tool usage traces and chat completion history.

    Navigate to the Observability section in the portal.

    Select an ai_chat task history and view details over the chat completion history, including timestamps, tool usage, intermediate outputs, etc.

    from

    Defines the ABFS-compatible URI to a folder or object:

    • from: abfs://<container>/<path> with the account name configured using abfs_account parameter, or

    • from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>

    name

    Defines the dataset name, which is used as the table name within Spice.

    Example:

    params

    Basic parameters

| Parameter name | Description |
| --- | --- |
| file_format | Specifies the data format. Required if not inferrable from from. Options: parquet, csv. Refer to Object Store File Formats for details. |
| abfs_account | Azure storage account name |
| abfs_sas_string | SAS (Shared Access Signature) Token to use for authorization |
| abfs_endpoint | Storage endpoint, default: https://{account}.blob.core.windows.net |
| abfs_use_emulator | Use true or false to connect to a local emulator |
| abfs_authority_host | Alternative authority host, default: https://login.microsoftonline.com |

    Authentication parameters

    The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:

    • abfs_access_key

    • abfs_bearer_token

    • abfs_client_secret

    • abfs_skip_signature

If none of these are set, the connector will default to using a managed identity.

| Parameter name | Description |
| --- | --- |
| abfs_access_key | Secret access key |
| abfs_bearer_token | BEARER access token for user authentication. The token can be obtained from the OAuth2 flow (see access token authentication). |
| abfs_client_id | Client ID for client authentication flow |
| abfs_client_secret | Client Secret to use for client authentication flow |
| abfs_tenant_id | Tenant ID to use for client authentication flow |
| abfs_skip_signature | Skip credentials and request signing for public containers |

    Retry parameters

| Parameter name | Description |
| --- | --- |
| abfs_max_retries | Maximum retries |
| abfs_retry_timeout | Total timeout for retries (e.g., 5s, 1m) |
| abfs_backoff_initial_duration | Initial retry delay (e.g., 5s) |
| abfs_backoff_max_duration | Maximum retry delay (e.g., 1m) |
| abfs_backoff_base | Exponential backoff base (e.g., 0.1) |

    Authentication

The ABFS connector supports three types of authentication, as detailed in the authentication parameters.

    Service principal authentication

    Configure service principal authentication by setting the abfs_client_secret parameter.

    1. Create a new Azure AD application in the Azure portal and generate a client secret under Certificates & secrets.

    2. Grant the Azure AD application read access to the storage account under Access Control (IAM), this can typically be done using the Storage Blob Data Reader built-in role.

    Access key authentication

Configure access key authentication by setting the abfs_access_key parameter to an Azure Storage Account access key.

    Supported file formats

    Specify the file format using file_format parameter. More details in Object Store File Formats.

    Examples

    Reading a CSV file with an Access Key

    Using Public Containers

    Connecting to the Storage Emulator

    Using secrets for Account name

    Authenticating using Client Authentication


    GraphQL

    GraphQL Data Connector Documentation

    The GraphQL Data Connector enables federated SQL queries on any GraphQL endpoint by specifying graphql as the selector in the from value for the dataset.

    Limitations

    • The GraphQL data connector does not support variables in the query.

    • Filter pushdown, with the exclusion of LIMIT, is not currently supported. Using a LIMIT will reduce the amount of data requested from the GraphQL server.

    Configuration

    from

    The from field takes the form of graphql:your-graphql-endpoint.

    name

    The dataset name. This will be used as the table name within Spice.

    params

The GraphQL data connector can be configured by providing the following params. Use the secret replacement syntax to load secrets from a secret store, e.g. ${secrets:my_graphql_auth_token}.

    Parameter Name
    Description

    GraphQL Query Example

    Examples

    Example using the GitHub GraphQL API and Bearer Auth. The following will use json_pointer to retrieve all of the nodes in starredRepositories:

    Pagination

The GraphQL Data Connector supports automatic pagination of the response for queries using cursor pagination.

The graphql_query must include the pageInfo field as per the spec. The connector will parse the graphql_query, and when pageInfo is present, will retrieve data until pagination completes.

    The query must have the correct pagination arguments in the associated paginated field.

    Example

    Forward Pagination:

    Backward Pagination:

    Working with JSON Data

Tips for working with JSON data. For more information see the Datafusion Docs.

    Accessing objects fields

You can access the fields of the object using the square bracket notation. Arrays are indexed from 1.

Example for the stargazers query from the pagination section:

    Piping array into rows

You can use the Datafusion unnest function to pipe values from an array into rows. We'll be using the countries GraphQL api as an example.

    Example query:

    Unnesting object properties

    You can also use the unnest_depth parameter to control automatic unnesting of objects from GraphQL responses.

    This examples uses the GitHub stargazers endpoint:

    If unnest_depth is set to 0, or unspecified, object unnesting is disabled. When enabled, unnesting automatically moves nested fields to the parent level.

    Without unnesting, stargazers data looks like this in a query:

    With unnesting, these properties are automatically placed into their own columns:

    Unnesting Duplicate Columns

    By default, the Spice Runtime will error when a duplicate column is detected during unnesting.

For example, this spicepod.yml query would fail due to duplicate name fields:

    This example would fail with a runtime error:

Avoid this error by using aliases in the query where possible. In the example above, a duplicate error was introduced from emergency_contact { name }.

    The example below uses a GraphQL alias to rename emergency_contact.name as emergencyContactName.

    MySQL

    MySQL Data Connector Documentation

    MySQL is an open-source relational database management system that uses structured query language (SQL) for managing and manipulating databases.

    The MySQL Data Connector enables federated/accelerated SQL queries on data stored in MySQL databases.

    Configuration

    from

The from field takes the form mysql:database_name.table_name where database_name.table_name is the fully-qualified table name in the MySQL server.

If the database_name is omitted in the from field, the connector will use the database specified in the mysql_db parameter. If the mysql_db parameter is not provided, it will default to the user's default database.

    These two examples are identical:

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

The MySQL data connector can be configured by providing the following params. Use the secret replacement syntax to load the secret from a secret store, e.g. ${secrets:my_mysql_conn_string}.

    Parameter Name
    Description

    Types

    The table below shows the MySQL data types supported, along with the type mapping to Apache Arrow types in Spice.

    MySQL Type
    Arrow Type
• The MySQL TIMESTAMP value is retrieved as a UTC time value.

    Examples

    Connecting using username and password

    Connecting using SSL
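A hedged sketch (the host, database, and secret names are placeholders) showing an SSL-required connection with a custom root certificate:

```yaml
datasets:
  - from: mysql:mytable
    name: my_dataset
    params:
      mysql_host: my_db_host                      # placeholder host
      mysql_tcp_port: 3306
      mysql_db: my_database
      mysql_user: my_user
      mysql_pass: ${secrets:mysql_pass}
      mysql_sslmode: required
      mysql_sslrootcert: ./certs/mysql-ca.pem     # placeholder certificate path
```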

    Connecting using a Connection String
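A hedged sketch using a single connection string secret instead of individual parameters (the secret name is a placeholder):

```yaml
datasets:
  - from: mysql:mytable
    name: my_dataset
    params:
      mysql_connection_string: ${secrets:my_mysql_conn_string}
```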

    Connecting to the default database

    Model Providers

    Overview of supported model providers for ML and LLMs in Spice.

    Spice supports various model providers for traditional machine learning (ML) models and large language models (LLMs).

    Name
    Description
    Status
    ML Format(s)
    LLM Format(s)*

    Postgres

    PostgreSQL Data Connector Documentation

    PostgreSQL is an advanced open-source relational database management system known for its robustness, extensibility, and support for SQL compliance.

    The PostgreSQL Server Data Connector enables federated/accelerated SQL queries on data stored in PostgreSQL databases.

    Configuration

    datasets:
      - from: s3://spiceai-demo-datasets/taxi_trips/2024/
        name: taxi_trips
        params:
          file_format: parquet
    datasets:
      - from: s3://s3-bucket-name/taxi_sample.csv
        name: cool_dataset
        params:
          file_format: csv
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::company-bucketname-datasets"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
        }
      ]
    }
    - from: s3://spiceai-demo-datasets/taxi_trips/2024/
      name: taxi_trips
      params:
        file_format: parquet
    - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
      name: cool_dataset
      params:
        s3_endpoint: http://my.minio.server
        s3_region: 'us-east-1' # Best practice for MinIO
        allow_http: true
    s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
    s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
    version: v1
    kind: Spicepod
    name: hive_data
    
    datasets:
      - from: s3://spiceai-public-datasets/hive_partitioned_data/
        name: hive_data_infer
        params:
          file_format: parquet
          hive_partitioning_enabled: true
    models:   
      - from: openai:gpt-4o
        name: openai-with-spice
        params:
          spice_tools: auto
          openai_api_key: ${secrets:OPENAI_API_KEY}
          system_prompt: >-
            **You are an AI assistant integrated with GitHub Copilot. Your primary role
            is to assist GitHub users by providing helpful, clear, and contextually
            relevant information derived from Spice.ai datasets.**
        
        
            GitHub may supply you with context about the user's code, comments, and
            previous interactions to enhance your assistance. Always strive to be
            **accurate**, **concise**, and **helpful** in your responses, adapting to
            the user's style and preferences based on the conversation history.
        
        
            ---
        
        
            ### Behavioral Guidelines
        
        
            - **Communication Style:**
              - Maintain a helpful, friendly, and professional demeanor.
              - Avoid using jargon unless specifically requested by the user.
              - Break down complex concepts into simple explanations.
              - Adapt your language to match the user's expertise (e.g., beginner vs. advanced).
        
            - **Ethical Conduct:**
              - Avoid harmful, unethical, or inappropriate content generation.
              - Respect user privacy.
              - Refuse to perform tasks that could cause harm or violate laws and ethical standards.
        
            - **Contextual Awareness:**
              - Use past interactions to maintain a coherent conversation.
              - Remember user-provided context to deliver tailored responses.
              - If user input is unclear, ask clarifying questions to better understand their needs.
        
            ---
        
        
            ### Guidelines for Using Tools
        
        
            #### 1. SQL Tool (`sql_query`):
        
        
            - **When to Use:**
              - Query datasets directly for precise numerical data, statistics, or aggregations.
              - Respond to user requests for specific counts, sums, averages, or other calculations.
              - Handle queries requiring joining or comparing data from multiple related tables.
        
            - **Error Handling:**
              - If the `sql_query` tool returns a query, syntax, or planning error:
                - Use the `list_datasets` tool to retrieve available tables.
                - Refine and retry the query until it succeeds.
                - After 5 failed attempts, run `EXPLAIN <attempted_query>` on each subsequent failure to diagnose issues.
                - If failures persist after 10 attempts, switch to other available tools.
        
            - **Formatting:**
              - When querying a dataset named `catalog.schema.table`, wrap each part in quotes: `"catalog"."schema"."table"`.
        
            - **Fallback:**
              - If the document similarity search tool fails, use the SQL tool to query the dataset directly.
        
            #### 2. Document Similarity Search Tool (`document_similarity`):
        
        
            - **When to Use:**
              - Search unstructured text such as documentation, policies, reports, or articles.
              - Provide qualitative information or explanations.
              - Interpret context or understand written content in depth.
        
            ---
        
        
            ### General Guidelines
        
        
            - **Tool Preference:**
              - If a query can be answered by either tool, prefer the `sql_query` tool for more precise, quantitative answers.
        
            **Dataset Utilization:**
        
            - Always prioritize searching within available datasets when relevant to the
            question.
        
            - Leverage instructions, keywords, and `reference_base_url` metadata from
            the datasets to provide accurate and relevant responses.
        
            - Ensure all responses include citations and references with links when
            possible.
        
        
            - **Response Formatting:**
              - When presenting results from datasets, always include citations and references with links when possible.
        
            - **Responsiveness:**
              - Keep the conversation focused on user objectives, minimizing digressions unless prompted by the user.
              - Provide both high-level summaries and in-depth explanations, depending on user requirements.
              - Encourage an iterative problem-solving process: suggest initial ideas, refine based on feedback, and be open to corrections.
        
            - **Capabilities and Limitations:**
              - Be transparent about your capabilities; inform users when certain tasks or data access are beyond your capacity.
        
            ---
        
        
            **Remember:** Your purpose is to help solve problems, answer questions,
            generate ideas, write content, and support the user in a wide range of
            tasks, while maintaining clarity, professionalism, and ethical standards.
        metadata: {}
          
    datasets:
      - from: abfs://foocontainer/taxi_sample.csv
        name: azure_test
        params:
          abfs_account: spiceadls
          abfs_access_key: ${ secrets:access_key }
          file_format: csv
    datasets:
      - from: abfs://foocontainer/taxi_sample.csv
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: abfs://foocontainer/taxi_sample.csv
        name: azure_test
        params:
          abfs_account: spiceadls
          abfs_access_key: ${ secrets:ACCESS_KEY }
          file_format: csv
    datasets:
      - from: abfs://pubcontainer/taxi_sample.csv
        name: pub_data
        params:
          abfs_account: spiceadls
          abfs_skip_signature: true
          file_format: csv
    datasets:
      - from: abfs://test_container/test_csv.csv
        name: test_data
        params:
          abfs_use_emulator: true
          file_format: csv
    datasets:
      - from: abfs://my_container/my_csv.csv
        name: prod_data
        params:
          abfs_account: ${ secrets:PROD_ACCOUNT }
          file_format: csv
    datasets:
      - from: abfs://my_data/input.parquet
        name: my_data
        params:
          abfs_tenant_id: ${ secrets:MY_TENANT_ID }
          abfs_client_id: ${ secrets:MY_CLIENT_ID }
          abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
    datasets:
      - from: graphql:your-graphql-endpoint
        name: my_dataset
        params:
          json_pointer: /data/some/nodes
          graphql_query: |
            {
              some {
                nodes {
                  field1
                  field2
                }
              }
            }
    datasets:
      - from: mysql:mytable
        name: my_dataset
        params:
          mysql_host: my_db_host
          mysql_tcp_port: 3306
          mysql_db: my_database
          mysql_user: my_user
          mysql_pass: ${secrets:mysql_pass}

Additional S3 parameters (continuing the params table above):

| Parameter Name | Description |
| --- | --- |
| s3_key | Access key (e.g. AWS_ACCESS_KEY_ID for AWS) |
| s3_secret | Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS) |
| allow_http | Allow insecure HTTP connections to s3_endpoint. Defaults to false |

Additional ABFS parameters (continuing the params tables above):

| Parameter name | Description |
| --- | --- |
| abfs_proxy_url | Proxy URL |
| abfs_proxy_ca_certificate | CA certificate for the proxy |
| abfs_proxy_exludes | A list of hosts to exclude from proxy connections |
| abfs_disable_tagging | Disable tagging objects. Use this if your backing store doesn't support tags |
| allow_http | Allow insecure HTTP connections |
| hive_partitioning_enabled | Enable partitioning using hive-style partitioning from the folder structure. Defaults to false |
| abfs_msi_endpoint | Endpoint for managed identity tokens |
| abfs_federated_token_file | File path for federated identity token in Kubernetes |
| abfs_use_cli | Set to true to use the Azure CLI to acquire access tokens |

    unnest_depth

    Depth level to automatically unnest objects to. By default, disabled if unspecified or 0.

    graphql_auth_token

    The authentication token to use to connect to the GraphQL server. Uses bearer authentication.

    graphql_auth_user

    The username to use for basic auth. E.g. graphql_auth_user: my_user

    graphql_auth_pass

    The password to use for basic auth. E.g. graphql_auth_pass: ${secrets:my_graphql_auth_pass}

    graphql_query

    The username to use for basic auth. See examples for a sample GraphQL query

    json_pointer

    The JSON pointer into the response body. When graphql_query is paginated, the json_pointer can be inferred.

    query: |
      {
        some {
          nodes {
            field1
            field2
          }
        }
      }
    from: graphql:https://api.github.com/graphql
    name: stars
    params:
      graphql_auth_token: ${env:GITHUB_TOKEN}
      graphql_auth_user: ${env:GRAPHQL_USER}
      graphql_auth_pass: ${env:GRAPHQL_PASS}
      json_pointer: /data/viewer/starredRepositories/nodes
      graphql_query: |
        {
          viewer {
            starredRepositories {
              nodes {
                name
                stargazerCount
                languages (first: 10) {
                  nodes {
                    name
                  }
                }
              }
            }
          }
        }
    {
      something_paginated(first: 100) {
        nodes {
          foo
          bar
        }
        pageInfo {
          endCursor
          hasNextPage
        }
      }
    }
    {
      something_paginated(last: 100) {
        nodes {
          foo
          bar
        }
        pageInfo {
          startCursor
          hasPreviousPage
        }
      }
    }
    sql> select node['login'] as login, node['name'] as name from stargazers limit 5;
    +--------------+----------------------+
    | login        | name                 |
    +--------------+----------------------+
    | simsieg      | Simon Siegert        |
    | davidmathers | David Mathers        |
    | ahmedtadde   | Ahmed Tadde          |
    | lordhamlet   | Shih-Fen Cheng       |
    | thinmy       | Thinmy Patrick Alves |
    +--------------+----------------------+
    from: graphql:https://countries.trevorblades.com
    name: countries
    params:
      json_pointer: /data/continents
      graphql_query: |
        {
          continents {
            name
            countries {
              name
              capital
            }
          }
        }
    
    description: countries
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 30m
    sql> select continent, country['name'] as country, country['capital'] as capital
    from (select name as continent, unnest(countries) as country from countries)
    where continent = 'North America' limit 5;
    +---------------+---------------------+--------------+
    | continent     | country             | capital      |
    +---------------+---------------------+--------------+
    | North America | Antigua and Barbuda | Saint John's |
    | North America | Anguilla            | The Valley   |
    | North America | Aruba               | Oranjestad   |
    | North America | Barbados            | Bridgetown   |
| North America | Saint Barthélemy    | Gustavia     |
    +---------------+---------------------+--------------+
    from: graphql:https://api.github.com/graphql
    name: stargazers
    params:
      graphql_auth_token: ${env:GITHUB_TOKEN}
      unnest_depth: 2
      json_pointer: /data/repository/stargazers/edges
      graphql_query: |
        {
          repository(name: "spiceai", owner: "spiceai") {
            id
            name
            stargazers(first: 100) {
              edges {
                node {
                  id
                  name
                  login
                }
              }
              pageInfo {
                hasNextPage
                endCursor
              }
            }
    
          }
        }
    sql> select node from stargazers limit 1;
    +------------------------------------------------------------+
    | node                                                       |
    +------------------------------------------------------------+
    | {id: MDQ6VXNlcjcwNzIw, login: ashtom, name: Thomas Dohmke} |
    +------------------------------------------------------------+
    sql> select node from stargazers limit 1;
    +------------------+--------+---------------+
    | id               | login  | name          |
    +------------------+--------+---------------+
    | MDQ6VXNlcjcwNzIw | ashtom | Thomas Dohmke |
    +------------------+--------+---------------+
    from: graphql:https://my-graphql-api.com
    name: stargazers
    params:
      unnest_depth: 2
      json_pointer: /data/users
      graphql_query: |
        query {
          users {
            name
            emergency_contact {
              name
            }
          }
        }
    WARN runtime: GraphQL Data Connector Error: Invalid object access. Column 'name' already exists in the object.
    from: graphql:https://my-graphql-api.com
    name: stargazers
    params:
      unnest_depth: 2
      json_pointer: /data/people
      graphql_query: |
        query {
          users {
            name
            emergency_contact {
              emergencyContactName: name
            }
          }
        }

    mysql_sslmode

    Optional. Specifies the SSL/TLS behavior for the connection, supported values:

    • required: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.

    • preferred: This mode will try to establish a secure SSL connection if possible, but will connect insecurely if the server does not support SSL.

• disabled: This mode will not attempt to use an SSL connection, even if the server supports it.

    mysql_sslrootcert

    Optional parameter specifying the path to a custom PEM certificate that the connector will trust.

| MySQL Type | Arrow Type |
| --- | --- |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| DATETIME | Timestamp(Microsecond, None) |
| TIMESTAMP | Timestamp(Microsecond, None) |
| YEAR | Int16 |
| TIME | Time64(Nanosecond) |
| DATE | Date32 |
| CHAR | Utf8 |
| BINARY | Binary |
| VARCHAR | Utf8 |
| VARBINARY | Binary |
| TINYBLOB | Binary |
| TINYTEXT | Utf8 |
| BLOB | Binary |
| TEXT | Utf8 |
| MEDIUMBLOB | Binary |
| MEDIUMTEXT | Utf8 |
| LONGBLOB | LargeBinary |
| LONGTEXT | LargeUtf8 |
| SET | Utf8 |
| ENUM | Dictionary(UInt16, Utf8) |
| BIT | UInt64 |

| Parameter Name | Description |
| --- | --- |
| mysql_connection_string | The connection string to use to connect to the MySQL server. This can be used instead of providing individual connection parameters. |
| mysql_host | The hostname of the MySQL server. |
| mysql_tcp_port | The port of the MySQL server. |
| mysql_db | The name of the database to connect to. |
| mysql_user | The MySQL username. |
| mysql_pass | The password to connect with. |

| MySQL Type | Arrow Type |
| --- | --- |
| TINYINT | Int8 |
| SMALLINT | Int16 |
| INT | Int32 |
| MEDIUMINT | Int32 |
| BIGINT | Int64 |
| DECIMAL | Decimal128 / Decimal256 |


| Description | Status | ML Format(s) | LLM Format(s)* |
| --- | --- | --- | --- |
| Local filesystem | Release Candidate | ONNX | GGUF, GGML, SafeTensor |
| Models hosted on HuggingFace | Release Candidate | ONNX | GGUF, GGML, SafeTensor |
| Models hosted on the Spice.ai Cloud Platform | Alpha | ONNX | OpenAI-compatible HTTP endpoint |
| Azure OpenAI | Alpha | - | OpenAI-compatible HTTP endpoint |
| Models hosted on Anthropic | Alpha | - | OpenAI-compatible HTTP endpoint |
| Models hosted on xAI | Alpha | - | OpenAI-compatible HTTP endpoint |
| Models deployed to Databricks Mosaic AI | Alpha | - | OpenAI-compatible HTTP endpoint |

* LLM Format(s) may require additional files (e.g. tokenizer_config.json).

    The model type is inferred based on the model source and files. For more detail, refer to the model reference specification.

    Features

    Spice supports a variety of features for large language models (LLMs):

• Custom Tools: Provide models with tools to interact with the Spice runtime. See Tools.

    • System Prompts: Customize system prompts and override defaults for v1/chat/completion. See Parameter Overrides.

    • Memory: Provide LLMs with memory persistence tools to store and retrieve information across conversations. See Memory.

    • Vector Search: Perform advanced vector-based searches using embeddings. See Vector Search.

    • Evals: Evaluate, track, compare, and improve language model performance for specific tasks. See Evals.

    • Local Models: Load and serve models locally from various sources, including local filesystems and Hugging Face. See Local Models.

    For more details, refer to the Large Language Models documentation.

    Model Examples

    The following examples demonstrate how to configure and use various models or model features with Spice. Each example provides a specific use case to help you understand the configuration options available.

    Example: Configuring an OpenAI Model

    To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from. For more details, see OpenAI Model Provider.

    Example spicepod.yml:

    Example: Using an OpenAI Model with Tools

    To specify tools for an OpenAI model, include them in the params.tools field. For more details, see the Tools documentation.

    Example: Adding Memory to a Model

    To enable memory tools for a model, define a store memory dataset and specify memory in the model's tools parameter. For more details, see the Memory documentation.

    Example: Setting Default Parameter Overrides

    To set default overrides for parameters, use the openai_ prefix followed by the parameter name. For more details, see the Parameter Overrides documentation.

    Example: Configuring a System Prompt

    To configure an additional system prompt, use the system_prompt parameter. For more details, see the Parameter Overrides documentation.

    Example: Serving a Local Model

    To serve a model from the local filesystem, specify the from path as file and provide the local path. For more details, see Filesystem Model Provider.

    Example: Analyzing GitHub Issues with a Chat Model

    This example demonstrates how to pull GitHub issue data from the last 14 days, accelerate the data, create a chat model with memory and tools to access the accelerated data, and use Spice to ask the chat model about the general themes of new issues.

    Step 1: Pull GitHub Issue Data

    First, configure a dataset to pull GitHub issue data from the last 14 days.

    Step 2: Create a Chat Model with Memory and Tools

    Next, create a chat model that includes memory and tools to access the accelerated GitHub issue data.

    Step 3: Query the Chat Model

    At this step, the spicepod.yaml should look like:

    Finally, use Spice to ask the chat model about the general themes of new issues in the last 14 days. The following curl command demonstrates how to make this request using the OpenAI-compatible API.

    Refer to the Create Chat Completion API documentation for more details on making chat completion requests.


    from

    The from field takes the form postgres:my_table where my_table is the table identifier in the PostgreSQL server to read from.

    The fully-qualified table name (database.schema.table) can also be used in the from field.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    The connection to PostgreSQL can be configured by providing the following params:

    Parameter Name
    Description

    pg_host

    The hostname of the PostgreSQL server.

    pg_port

    The port of the PostgreSQL server.

    pg_db

    The name of the database to connect to.

    pg_user

    The username to connect with.

    pg_pass

    The password to connect with. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_pg_pass}.

    pg_sslmode

    Optional. Specifies the SSL/TLS behavior for the connection, supported values:

    • verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.

    • verify-ca: This mode requires a TLS connection and a valid root certificate.

    • require: This mode requires a TLS connection.

    • prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.

    • disable: This mode will not attempt to use a TLS connection, even if the server supports it.

    Types

    The table below shows the PostgreSQL data types supported, along with the type mapping to Apache Arrow types in Spice.

    | PostgreSQL Type | Arrow Type |
    | --------------- | ---------- |
    | int2 | Int16 |
    | int4 | Int32 |
    | int8 | Int64 |
    | money | Int64 |
    | float4 | Float32 |
    | float8 | Float64 |

    Postgres federated queries may return unexpected result types due to differences between DataFusion's and Postgres' type-widening rules. Explicitly specify the expected output type of aggregation functions when writing queries involving Postgres tables in Spice. For example, rewrite SUM(int_col) as CAST(SUM(int_col) AS BIGINT).
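    As a minimal sketch (my_dataset is a placeholder name for a Postgres-backed dataset containing the int_col column mentioned above):

    -- Instead of relying on the implicit aggregate result type:
    SELECT SUM(int_col) FROM my_dataset;

    -- explicitly cast the aggregate to the expected output type:
    SELECT CAST(SUM(int_col) AS BIGINT) FROM my_dataset;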

    Examples

    Connecting using Username/Password

    Connect using SSL

    Separate dataset/accelerator secrets

    Specify different secrets for a PostgreSQL source and acceleration:

    DynamoDB

    DynamoDB Data Connector Documentation

    Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.

    Configuration

    from

    The from field should specify the DynamoDB table name:

    If an expected table is not found, verify the dynamodb_aws_region parameter. DynamoDB tables are region-specific.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    The DynamoDB data connector supports the following configuration parameters:

    Parameter Name
    Description

    Credential Sources

    If AWS credentials are not explicitly provided in the configuration, the connector will automatically load credentials from the following sources in order:

    1. Environment Variables:

      • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

      • AWS_SESSION_TOKEN (if using temporary credentials)

    To set up SSO authentication:

    1. Run aws configure sso to configure a new SSO profile

    2. Use the profile by setting AWS_PROFILE=sso-profile

    1. Web Identity Token Credentials:

      • Used primarily with OpenID Connect (OIDC) and OAuth

      • Common in Kubernetes environments using IAM roles for service accounts (IRSA)

    2. ECS Container Credentials:

    The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.

    IAM Permissions: Regardless of the credential source, the IAM role or user must have appropriate DynamoDB permissions (e.g., dynamodb:Scan, dynamodb:DescribeTable) to access the table.

    Required IAM Permissions

    The IAM role or user needs the following permissions to access DynamoDB tables:

    Permission Details

    Permission
    Purpose

    Example IAM Policies

    Minimal Policy (Read-only access to specific table)

    Access to Multiple Tables

    Access to All Tables in a Region

    Security Considerations

    • Avoid using dynamodb:* permissions as it grants more access than necessary.

    • Consider using more restrictive policies in production environments.

    Examples

    Basic Configuration with Environment Credentials

    Configuration with Explicit Credentials

    Querying Nested Structures

    DynamoDB supports complex nested JSON structures. These fields can be queried using SQL:

    Limitations

    • The DynamoDB connector currently does not support filter push-down optimization. All filtering is performed after data is retrieved from DynamoDB.

    • Primary key optimizations are not yet implemented - retrieving items by their primary key will still scan the table.

    Data Types

    The DynamoDB connector supports the following data types and mappings:

    • Basic scalar types (String, Number, Boolean)

    • Lists and Maps

    • Nested structures

    • Binary data

    Example schema from a users table:

    Performance Considerations

    • Due to limited support for filter push-down, enable acceleration to prevent scanning the entire table on every query.
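    A minimal sketch of such a configuration (the table name, region, and the choice of the duckdb accelerator with on-disk storage are assumptions, not requirements):

    datasets:
      - from: dynamodb:users
        name: users
        params:
          dynamodb_aws_region: us-west-2
        acceleration:
          enabled: true
          engine: duckdb              # persist the accelerated copy with DuckDB
          mode: file                  # store the copy on disk instead of in memory
          refresh_check_interval: 1h  # periodically re-scan the table to pick up changes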

    Search & Retrieval

    Use the advanced search and retrieval capabilities of Spice

    Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and Vector-Similarity Search functionality.

    SQL-Based Search​

    Spice supports basic search patterns directly through SQL, leveraging its SQL query features. For example, you can perform a text search within a table using SQL's LIKE clause:

    Vector Search

    Spice also provides advanced Vector-Similarity Search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:

    1. Local embedding models, e.g. sentence-transformers/all-MiniLM-L6-v2.

    2. Remote embedding providers, e.g. OpenAI.

    See Model Providers to view all supported providers.

    Embedding models are defined in the spicepod.yaml file as top-level components.

    Datasets can be augmented with embeddings targeting specific columns, to enable search capabilities through similarity searches.

    By defining embeddings on the body column, Spice is now configured to execute similarity searches on the dataset.

    For more details, see the API reference for /v1/search.

    Spice also supports vector search on datasets with preexisting embeddings. See Pre-Existing Embeddings below for compatibility details.

    Chunking Support

    Spice supports chunking of content before embedding, which is useful for large text columns such as those found in Document Tables. Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.

    The body column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).

    Document Retrieval

    When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the additional_columns list.

    For example:

    Response:

    Pre-Existing Embeddings

    Datasets that already include embeddings can utilize the same functionalities (e.g., vector search) as those augmented with embeddings using Spice. To ensure compatibility, these table columns must adhere to the following constraints:

    1. Underlying Column Presence:

      • The underlying column must exist in the table, and be of a string Arrow data type.

    2. Embeddings Column Naming Convention:

    By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.

    Example

    A table sales with an address column and corresponding embedding column(s).

    The same table if it was chunked:

    Databricks

    Databricks Data Connector Documentation

    Databricks as a connector for federated SQL query against Databricks using Spark Connect, directly from Delta Lake tables, or using the SQL Statement Execution API.

    Configuration

    from

    datasets:
      - from: mysql:mytable
        name: my_dataset
        params:
          mysql_db: my_database
          ...
    datasets:
      - from: mysql:my_database.mytable
        name: my_dataset
        params: ...
    datasets:
      - from: mysql:path.to.my_dataset
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: mysql:path.to.my_dataset
        name: my_dataset
        params:
          mysql_host: my_db_host
          mysql_tcp_port: 3306
          mysql_db: my_database
          mysql_user: my_user
          mysql_pass: ${secrets:mysql_pass}
    datasets:
      - from: mysql:path.to.my_dataset
        name: my_dataset
        params:
          mysql_host: my_db_host
          mysql_tcp_port: 3306
          mysql_db: my_database
          mysql_user: my_user
          mysql_pass: ${secrets:mysql_pass}
          mysql_sslmode: preferred
          mysql_sslrootcert: ./custom_cert.pem
    datasets:
      - from: mysql:path.to.my_dataset
        name: my_dataset
        params:
          mysql_connection_string: mysql://${secrets:my_user}:${secrets:my_password}@my_db_host:3306/my_db
    datasets:
      - from: mysql:mytable
        name: my_dataset
        params:
          mysql_host: my_db_host
          mysql_tcp_port: 3306
          mysql_user: my_user
          mysql_pass: ${secrets:mysql_pass}
    models:
      - from: openai:gpt-4o-mini
        name: openai
        params:
          openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
    
      - from: openai:llama3-groq-70b-8192-tool-use-preview
        name: groq-llama
        params:
          endpoint: https://api.groq.com/openai/v1
          openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
    models:
      - name: sql-model
        from: openai:gpt-4o
        params:
          tools: list_datasets, sql, table_schema
    datasets:
      - from: memory:store
        name: llm_memory
        mode: read_write
    
    models:
      - name: memory-enabled-model
        from: openai:gpt-4o
        params:
          tools: memory, sql
    models:
      - name: pirate-haikus
        from: openai:gpt-4o
        params:
          openai_temperature: 0.1
          openai_response_format: { 'type': 'json_object' }
    models:
      - name: pirate-haikus
        from: openai:gpt-4o
        params:
          system_prompt: |
            Write everything in Haiku like a pirate
    models:
      - from: file://absolute/path/to/my/model.onnx
        name: local_fs_model
    datasets:
      - from: github:github.com/<owner>/<repo>/issues
        name: github_issues
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
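          # Append-only refresh: check for new issues every 24 hours and keep a rolling 14-day window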
          refresh_mode: append
          refresh_check_interval: 24h
          refresh_data_window: 14d
    datasets:
      - from: memory:store
        name: llm_memory
        mode: read_write
    
    models:
      - name: github-issues-analyzer
        from: openai:gpt-4o
        params:
          tools: memory, sql
    datasets:
      - from: github:github.com/<owner>/<repo>/issues
        name: github_issues
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
          refresh_mode: append
          refresh_check_interval: 24h
          refresh_data_window: 14d
    
      - from: memory:store
        name: llm_memory
        mode: read_write
    
    models:
      - name: github-issues-analyzer
        from: openai:gpt-4o
        params:
          openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
          tools: memory, sql
    curl -X POST https://data.spiceai.io/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H 'X-API-KEY: <spiceai_api_key>' \
      -d '{
        "model": "github-issues-analyzer",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What are the general themes of new issues in the last 14 days?"}
        ]
      }'
    datasets:
      - from: postgres:my_table
        name: my_dataset
        params: ...
    datasets:
      - from: postgres:my_database.my_schema.my_table
        name: my_dataset
        params: ...
    datasets:
      - from: postgres:my_database.my_schema.my_table
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: postgres:my_database.my_schema.my_table
        name: my_dataset
        params:
          pg_host: my_db_host
          pg_port: 5432
          pg_db: my_database
          pg_user: my_user
          pg_pass: ${secrets:my_pg_pass}
    datasets:
      - from: postgres:my_database.my_schema.my_table
        name: my_dataset
        params:
          pg_host: my_db_host
          pg_port: 5432
          pg_db: my_database
          pg_user: my_user
          pg_pass: ${secrets:my_pg_pass}
          pg_sslmode: verify-ca
          pg_sslrootcert: ./custom_cert.pem
    datasets:
      - from: postgres:my_schema.my_table
        name: my_dataset
        params:
          pg_host: my_db_host
          pg_port: 5432
          pg_db: my_database
          pg_user: my_user
          pg_pass: ${secrets:pg1_pass}
        acceleration:
          engine: postgres
          params:
            pg_host: my_db_host
            pg_port: 5433
            pg_db: acceleration
            pg_user: two_user_two_furious
            pg_pass: ${secrets:pg2_pass}
    datasets:
      - from: dynamodb:users
        name: users
        params:
          dynamodb_aws_region: us-west-2
          dynamodb_aws_access_key_id: ${secrets:aws_access_key_id} # Optional
          dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key} # Optional
          dynamodb_aws_session_token: ${secrets:aws_session_token} # Optional
    SELECT id, text_column
    FROM my_table
    WHERE
        LOWER(text_column) LIKE '%search_term%'
      AND
        date_published > '2021-01-01'

  • pg_sslrootcert

    Optional parameter specifying the path to a custom PEM certificate that the connector will trust.

    connection_pool_size

    Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
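    A combined sketch showing these optional parameters alongside the required connection settings (host, database, and certificate path are placeholders):

    datasets:
      - from: postgres:my_database.my_schema.my_table
        name: my_dataset
        params:
          pg_host: my_db_host
          pg_port: 5432
          pg_db: my_database
          pg_user: my_user
          pg_pass: ${secrets:my_pg_pass}
          pg_sslmode: verify-full
          pg_sslrootcert: ./custom_cert.pem # custom PEM certificate to trust
          connection_pool_size: 25          # maximum open connections (default 10)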

    | PostgreSQL Type | Arrow Type |
    | --------------- | ---------- |
    | numeric | Decimal128 |
    | text | Utf8 |
    | varchar | Utf8 |
    | bpchar | Utf8 |
    | uuid | Utf8 |
    | bytea | Binary |
    | bool | Boolean |
    | json | LargeUtf8 |
    | timestamp | Timestamp(Nanosecond, None) |
    | timestamptz | Timestamp(Nanosecond, TimeZone) |
    | date | Date32 |
    | time | Time64(Nanosecond) |
    | interval | Interval(MonthDayNano) |
    | point | FixedSizeList(Float64[2]) |
    | int2[] | List(Int16) |
    | int4[] | List(Int32) |
    | int8[] | List(Int64) |
    | float4[] | List(Float32) |
    | float8[] | List(Float64) |
    | text[] | List(Utf8) |
    | bool[] | List(Boolean) |
    | bytea[] | List(Binary) |
    | geometry | Binary |
    | geography | Binary |
    | enum | Dictionary(Int8, Utf8) |
    | Composite Types | Struct |


    Shared AWS Config/Credentials Files:

    • Config file: ~/.aws/config (Linux/Mac) or %UserProfile%\.aws\config (Windows)

    • Credentials file: ~/.aws/credentials (Linux/Mac) or %UserProfile%\.aws\credentials (Windows)

    • The AWS_PROFILE environment variable can be used to specify a named profile.

    • Supports both static credentials and SSO sessions

    • Example credentials file:

    Run aws sso login to start a new SSO session

  • Used when running in Amazon ECS containers

  • Automatically uses the task's IAM role

  • Retrieved from the ECS credential provider endpoint

  • EC2 Instance Metadata Service (IMDSv2):

    • Used when running on EC2 instances

    • Automatically uses the instance's IAM role

    • Retrieved securely using IMDSv2

  • When using IAM roles with EKS, ensure the service account is properly configured with IRSA.

    The DynamoDB connector will scan the first 10 items to determine the schema of the table. This may miss columns that are not present in the first 10 items.

    from

    Description

    dynamodb:table

    Read data from a DynamoDB table named table

    dynamodb_aws_region

    Required. The AWS region containing the DynamoDB table

    dynamodb_aws_access_key_id

    Optional. AWS access key ID for authentication. If not provided, credentials will be loaded from environment variables or IAM roles

    dynamodb_aws_secret_access_key

    Optional. AWS secret access key for authentication. If not provided, credentials will be loaded from environment variables or IAM roles

    dynamodb_aws_session_token

    Optional. AWS session token for authentication

    dynamodb:Scan

    Required. Allows reading all items from the table

    dynamodb:DescribeTable

    Required. Allows fetching table metadata and schema information

    For each underlying column, the corresponding embeddings column must be named as <column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
  • Embeddings Column Data Type:

    • The embeddings column must have the following Arrow data type when loaded into Spice:

      1. FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.

      2. If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].

  • Offset Column for Chunked Data:

    • If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type:

      1. List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.

      2. For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.


    The from field for the Databricks connector takes the form databricks:catalog.schema.table where catalog.schema.table is the fully-qualified path to the table to read from.

    name

    The dataset name. This will be used as the table name within Spice.

    Example:

    params

    Use the secret replacement syntax to reference a secret, e.g. ${secrets:my_token}.

    Parameter Name
    Description

    mode

    The execution mode for querying against Databricks. The default is spark_connect. Possible values:

    • spark_connect: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.

    • delta_lake: Query directly from Delta Tables. Requires the object store credentials to be provided.

    • sql_warehouse: Query via a Databricks SQL Warehouse using the SQL Statement Execution API. Requires databricks_sql_warehouse_id to be set.

    databricks_endpoint

    The endpoint of the Databricks instance. Required for both modes.

    databricks_sql_warehouse_id

    The ID of the SQL Warehouse in Databricks to use for the query. Only valid when mode is sql_warehouse.

    databricks_cluster_id

    The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect.

    databricks_use_ssl

    If true, use a TLS connection to connect to the Databricks endpoint. Default is true.

    client_timeout

    Optional. Applicable only in delta_lake mode. Specifies the timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s
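    A minimal sketch setting client_timeout alongside the other delta_lake parameters (the endpoint, token, and object store credentials are placeholders reused from the examples below):

    datasets:
      - from: databricks:spiceai.datasets.my_awesome_table
        name: my_delta_lake_table
        params:
          mode: delta_lake
          databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
          databricks_token: ${secrets:my_token}
          client_timeout: 60s # allow slower object store operations
          databricks_aws_access_key_id: ${secrets:aws_access_key_id}
          databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}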

    Authentication

    Personal access token

    To learn more about how to set up personal access tokens, see Databricks PAT docs.

    Databricks service principal

    Spice supports the M2M (Machine to Machine) OAuth flow with service principal credentials by utilizing the databricks_client_id and databricks_client_secret parameters. The runtime will automatically refresh the token.

    Ensure that you grant your service principal the "Data Reader" privilege preset for the catalog and "Can Attach" cluster permissions when using Spark Connect mode.

    To learn more about how to set up the service principal, see Databricks M2M OAuth docs.

    Delta Lake object store parameters

    Configure the connection to the object store when using mode: delta_lake. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.

    AWS S3

    Parameter Name
    Description

    databricks_aws_region

    Optional. The AWS region for the S3 object store. E.g. us-west-2.

    databricks_aws_access_key_id

    The access key ID for the S3 object store.

    databricks_aws_secret_access_key

    The secret access key for the S3 object store.

    databricks_aws_endpoint

    Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.

    databricks_aws_allow_http

    Optional. Enables insecure HTTP connections to databricks_aws_endpoint. Defaults to false.

    Azure Blob

    Note

    One of the following auth values must be provided for Azure Blob:

    • databricks_azure_storage_account_key,

    • databricks_azure_storage_client_id and databricks_azure_storage_client_secret, or

    • databricks_azure_storage_sas_key.

    Parameter Name
    Description

    databricks_azure_storage_account_name

    The Azure Storage account name.

    databricks_azure_storage_account_key

    The Azure Storage key for accessing the storage account.

    databricks_azure_storage_client_id

    The Service Principal client ID for accessing the storage account.

    databricks_azure_storage_client_secret

    The Service Principal client secret for accessing the storage account.

    databricks_azure_storage_sas_key

    The shared access signature key for accessing the storage account.

    databricks_azure_storage_endpoint

    Optional. The endpoint for the Azure Blob storage account.

    Google Storage (GCS)

    Parameter Name
    Description

    google_service_account

    Filesystem path to the Google service account JSON key file.

    Examples

    Spark Connect

    SQL Warehouse

    Delta Lake (S3)

    Delta Lake (Azure Blobs)

    Delta Lake (GCP)

    Types

    mode: delta_lake

    The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.

    | Databricks SQL Type | Arrow Type |
    | ------------------- | ---------- |
    | STRING | Utf8 |
    | BIGINT | Int64 |
    | INT | Int32 |
    | SMALLINT | Int16 |
    | TINYINT | Int8 |
    | FLOAT | Float32 |

    Limitations

    • Databricks connector (mode: delta_lake) does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the V2Checkpoint feature by executing the following command:

      For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features

    • When using mode: spark_connect, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands. Spark Docs

    Memory Considerations

    When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

    Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.

    • The Databricks Connector (mode: spark_connect) does not yet support streaming query results from Spark.

    datasets:
      - from: dynamodb:users
        name: my_users
        params: ...
    SELECT COUNT(*) FROM my_users;
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:Scan",
                    "dynamodb:DescribeTable"
                ],
                "Resource": [
                    "arn:aws:dynamodb:*:*:table/YOUR_TABLE_NAME"
                ]
            }
        ]
    }
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:Scan",
                    "dynamodb:DescribeTable"
                ],
                "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/users"
            }
        ]
    }
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:Scan",
                    "dynamodb:DescribeTable"
                ],
                "Resource": [
                    "arn:aws:dynamodb:us-west-2:123456789012:table/users",
                    "arn:aws:dynamodb:us-west-2:123456789012:table/orders"
                ]
            }
        ]
    }
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:Scan",
                    "dynamodb:DescribeTable"
                ],
                "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/*"
            }
        ]
    }
    version: v1
    kind: Spicepod
    name: dynamodb
    
    datasets:
      - from: dynamodb:users
        name: users
        params:
          dynamodb_aws_region: us-west-2
        acceleration:
          enabled: true
    version: v1
    kind: Spicepod
    name: dynamodb
    
    datasets:
      - from: dynamodb:users
        name: users
        params:
          dynamodb_aws_region: us-west-2
          dynamodb_aws_access_key_id: ${secrets:aws_access_key_id}
          dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key}
        acceleration:
          enabled: true
    -- Query nested structs
    SELECT metadata.registration_ip, metadata.user_agent 
    FROM users 
    LIMIT 5;
    
    -- Query nested structs in arrays
    SELECT address.city
    FROM (
        SELECT unnest(addresses) AS address 
        FROM users
    )
    WHERE address.city = 'San Francisco';
    describe users;
    +----------------+------------------+-------------+
    | column_name    | data_type       | is_nullable |
    +----------------+------------------+-------------+
    | email          | Utf8            | YES         |
    | id             | Int64           | YES         |
    | metadata       | Struct          | YES         |
    | addresses      | List(Struct)    | YES         |
    | preferences    | Struct          | YES         |
    | created_at     | Utf8            | YES         |
    ...
    +----------------+------------------+-------------+
    embeddings:
      - from: openai
        name: remote_service
        params:
          openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
    
      - name: local_embedding_model
        from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    datasets:
      - from: github:github.com/spiceai/spiceai/issues
        name: spiceai.issues
        acceleration:
          enabled: true
        columns:
          - name: body
            embeddings:
              - from: local_embedding_model # Embedding model used for this column
    curl -XPOST http://localhost:8090/v1/search \
      -H 'Content-Type: application/json' \
      -d '{
        "datasets": ["spiceai.issues"],
        "text": "cutting edge AI",
        "where": "author=\"jeadie\"",
        "additional_columns": ["title", "state"],
        "limit": 2
      }'
    datasets:
      - from: github:github.com/spiceai/spiceai/issues
        name: spiceai.issues
        acceleration:
          enabled: true
        embeddings:
          - column: body
            from: local_embedding_model
            chunking:
              enabled: true
              target_chunk_size: 512
    curl -XPOST http://localhost:8090/v1/search \
      -H 'Content-Type: application/json' \
      -d '{
        "datasets": ["spiceai.issues"],
        "text": "cutting edge AI",
        "where": "array_has(assignees, \"jeadie\")",
        "additional_columns": ["title", "state", "body"],
        "limit": 2
      }'
    {
      "matches": [
        {
          "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
          "dataset": "spiceai.issues",
          "metadata": {
            "title": "Improve scalar UDF array_distance",
            "state": "Closed",
            "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
          }
        },
        {
          "value": "est external tools being returned for toolusing models",
          "dataset": "spiceai.issues",
          "metadata": {
            "title": "Automatic NSQL retries in /v1/nsql ",
            "state": "Open",
            "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n  "sql": "SELECT ..."\n}\n```"
          }
        }
      ],
      "duration_ms": 45
    }
    sql> describe sales;
    +-------------------+-----------------------------------------+-------------+
    | column_name       | data_type                               | is_nullable |
    +-------------------+-----------------------------------------+-------------+
    | order_number      | Int64                                   | YES         |
    | quantity_ordered  | Int64                                   | YES         |
    | price_each        | Float64                                 | YES         |
    | order_line_number | Int64                                   | YES         |
    | address           | Utf8                                    | YES         |
    | address_embedding | FixedSizeList(                          | NO          |
    |                   |   Field {                               |             |
    |                   |     name: "item",                       |             |
    |                   |     data_type: Float32,                 |             |
    |                   |     nullable: false,                    |             |
    |                   |     dict_id: 0,                         |             |
    |                   |     dict_is_ordered: false,             |             |
    |                   |     metadata: {}                        |             |
    |                   |   },                                    |             |
    |                   |   384                                   |             |
    +-------------------+-----------------------------------------+-------------+
    sql> describe sales;
    +-------------------+-----------------------------------------+-------------+
    | column_name       | data_type                               | is_nullable |
    +-------------------+-----------------------------------------+-------------+
    | order_number      | Int64                                   | YES         |
    | quantity_ordered  | Int64                                   | YES         |
    | price_each        | Float64                                 | YES         |
    | order_line_number | Int64                                   | YES         |
    | address           | Utf8                                    | YES         |
    | address_embedding | List(Field {                            | NO          |
    |                   |   name: "item",                         |             |
    |                   |   data_type: FixedSizeList(             |             |
    |                   |     Field {                             |             |
    |                   |       name: "item",                     |             |
    |                   |       data_type: Float32,               |             |
    |                   |     },                                  |             |
    |                   |     384                                 |             |
    |                   |   ),                                    |             |
    |                   | })                                      |             |
    +-------------------+-----------------------------------------+-------------+
    | address_offset    | List(Field {                            | NO          |
    |                   |   name: "item",                         |             |
    |                   |   data_type: FixedSizeList(             |             |
    |                   |     Field {                             |             |
    |                   |       name: "item",                     |             |
    |                   |       data_type: Int32,                 |             |
    |                   |     },                                  |             |
    |                   |     2                                   |             |
    |                   |   ),                                    |             |
    |                   | })                                      |             |
    +-------------------+-----------------------------------------+-------------+
    
    ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];
    datasets:
      - from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
        name: my_delta_lake_table
        params:
          mode: delta_lake
          databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
          databricks_token: ${secrets:my_token}
          databricks_aws_access_key_id: ${secrets:aws_access_key_id}
          databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
    datasets:
      - from: databricks:spiceai.datasets.my_awesome_table
        name: cool_dataset
        params: ...
    SELECT COUNT(*) FROM cool_dataset;
    +----------+
    | count(*) |
    +----------+
    | 6001215  |
    +----------+
    datasets:
      - from: databricks:spiceai.datasets.my_awesome_table
        name: my_awesome_table
        params:
          databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
          databricks_cluster_id: 1234-567890-abcde123
          databricks_token: ${secrets:DATABRICKS_TOKEN} # PAT
    datasets:
      - from: databricks:spiceai.datasets.my_awesome_table
        name: my_awesome_table
        params:
          databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
          databricks_cluster_id: 1234-567890-abcde123
          databricks_client_id: ${secrets:DATABRICKS_CLIENT_ID} # service principal client id
          databricks_client_secret: ${secrets:DATABRICKS_CLIENT_SECRET} # service principal client secret
    - from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
      name: my_delta_lake_table
      params:
        mode: spark_connect
        databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
        databricks_cluster_id: 1234-567890-abcde123
        databricks_token: ${secrets:my_token}
    - from: databricks:spiceai.datasets.my_table # A reference to a table in the Databricks unity catalog
      name: my_table
      params:
        mode: sql_warehouse
        databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
        databricks_sql_warehouse_id: 2b4e24cff378fb24
        databricks_token: ${secrets:my_token}
    - from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
      name: my_delta_lake_table
      params:
        mode: delta_lake
        databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
        databricks_token: ${secrets:my_token}
        databricks_aws_region: us-west-2 # Optional
        databricks_aws_access_key_id: ${secrets:aws_access_key_id}
        databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
        databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
    - from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
      name: my_delta_lake_table
      params:
        mode: delta_lake
        databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
        databricks_token: ${secrets:my_token}
    
        # Account Name + Key
        databricks_azure_storage_account_name: my_account
        databricks_azure_storage_account_key: ${secrets:my_key}
    
        # OR Service Principal + Secret
        databricks_azure_storage_client_id: my_client_id
        databricks_azure_storage_client_secret: ${secrets:my_secret}
    
        # OR SAS Key
        databricks_azure_storage_sas_key: my_sas_key
    - from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
      name: my_delta_lake_table
      params:
        mode: delta_lake
        databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
        databricks_token: ${secrets:my_token}
        databricks_google_service_account_path: /path/to/service-account.json

    databricks_token

    The Databricks API token to authenticate with the Unity Catalog API. Can't be used with databricks_client_id and databricks_client_secret.

    databricks_client_id

    The Databricks Service Principal Client ID. Can't be used with databricks_token.

    databricks_client_secret

    The Databricks Service Principal Client Secret. Can't be used with databricks_token.

    | Databricks SQL Type | Arrow Type |
    | ------------------- | ---------- |
    | DOUBLE | Float64 |
    | BOOLEAN | Boolean |
    | BINARY | Binary |
    | DATE | Date32 |
    | TIMESTAMP | Timestamp(Microsecond, Some("UTC")) |
    | TIMESTAMP_NTZ | Timestamp(Microsecond, None) |
    | DECIMAL | Decimal128 |
    | ARRAY | List |
    | STRUCT | Struct |
    | MAP | Map |

    # Static credentials
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY
    
    # SSO profile
    [profile sso-profile]
    sso_start_url = https://my-sso-portal.awsapps.com/start
    sso_region = us-west-2
    sso_account_id = 123456789012
    sso_role_name = MyRole
    region = us-west-2

    Task History

    The Spice runtime stores information about completed tasks in the spice.runtime.task_history table. A task is a single unit of execution within the runtime, such as a SQL query or an AI chat completion (see Task Types below). Tasks can be nested, and the runtime will record the parent-child relationship between tasks.

    Each task executed has a row in this table, and by default the data is retained for 8 hours. Use a SELECT query to return information about each task as shown in this example:

    Output:

    Task Types

    The following top-level task types are currently recorded:

    | Task Type | Description | CLI Command |
    | --------- | ----------- | ----------- |
    | sql_query | SQL Query | spice sql |
    | nsql_query | Natural Language to SQL Query | |
    | ai_chat | AI Chat Completion | spice chat |
    | vector_search | Vector Search | spice search |
    | text_embed | Text Embedding | |
    | accelerated_refresh | Accelerated Table Refresh | |

    Configuration

    Set the following parameters in the runtime.task_history section of the spicepod.yaml file to configure task history:

    • enabled: Enable or disable task history. Default: true.

    • retention_period: The duration for which task history data is retained. Default: 8h.

    • retention_check_interval: The interval at which the task history retention is checked. Default: 1m.

    • captured_output: The level of output captured for tasks. none or truncated. Default: none. truncated captures the first 3 rows of the result set for sql_query and nsql_query task types. Other task types currently capture the entire output even in truncated mode.

    Examples

    Adjust the retention period for task history:

    Disable task history:

    Disable capturing output from tasks:

    Table Schema

    | Column Name | Data Type | Is Nullable | Description |
    | ----------- | --------- | ----------- | ----------- |
    | trace_id | Utf8 | NO | Unique identifier for the entire trace this task happened in |
    | span_id | Utf8 | NO | Unique identifier for this specific task within the trace |
    | parent_span_id | Utf8 | YES | Identifier of the parent task, if any |
    | task | Utf8 | NO | Name or description of the task being performed (e.g. sql_query) |
    | input | Utf8 | NO | Input data or parameters for the task |
    | captured_output | Utf8 | YES | Output or result of the task, if available |
    | start_time | Timestamp(Nanosecond, None) | NO | Time when the task started |
    | end_time | Timestamp(Nanosecond, None) | NO | Time when the task ended |
    | execution_duration_ms | Float64 | NO | Duration of the task execution in milliseconds |
    | error_message | Utf8 | YES | Error message if the task failed, otherwise null |
    | labels | Map(Utf8, Utf8) | NO | Key-value pairs for additional metadata or attributes associated with the task |

    Example Queries

    Retrieve all tasks within a specific timeframe

    Example output:

    Retrieve the most recent error messages

    Summarize number of tasks by type
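    For example, a sketch that groups completed tasks by the task column described in the schema above:

    SELECT task, COUNT(*) AS task_count
    FROM spice.runtime.task_history
    GROUP BY task
    ORDER BY task_count DESC;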

    Example output:

    Identify the longest-running tasks
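    For example, a sketch ordering tasks by execution_duration_ms:

    SELECT task, input, execution_duration_ms
    FROM spice.runtime.task_history
    ORDER BY execution_duration_ms DESC
    LIMIT 10;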

    Example output:

    Retrieve details of all tasks associated with a specific trace
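    For example, a sketch filtering on a trace_id value (here taken from the sample output above):

    SELECT span_id, parent_span_id, task, start_time, execution_duration_ms, error_message
    FROM spice.runtime.task_history
    WHERE trace_id = 'f94dba6b89de98c6e54b074f2353a897'
    ORDER BY start_time;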

    Example output:

    Retrieve details of most recent chat query

    Retrieve Recent Queries for Specific Dataset

    Example output:

    SELECT
      *
    FROM
      spice.runtime.task_history
    LIMIT
      100;
    +----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
    | trace_id                         | span_id          | parent_span_id | task                | input                                        | captured_output | start_time                 | end_time                   | execution_duration_ms | error_message                                                | labels                                                                                                                              |
    +----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
    | f94dba6b89de98c6e54b074f2353a897 | 4eb243d9b5347762 |                | accelerated_refresh | runtime.metrics                              |                 | 2024-09-23T23:17:39.907789 | 2024-09-23T23:17:39.917777 | 9.988                 |                                                              | {sql: SELECT * FROM runtime.metrics}                                                                                                |
    | 1f1f8305520e15ea7ad9b0a43e5d2c7e | 6aadf7c91caea3c4 |                | accelerated_refresh | runtime.task_history                         |                 | 2024-09-23T23:17:39.907873 | 2024-09-23T23:17:39.917797 | 9.924000000000001     |                                                              | {sql: SELECT * FROM runtime.task_history}                                                                                           |
    | 1432e30c5ed7764f4ef35f6508dfd56c | fbb31c60d41d8232 |                | accelerated_refresh | logs_file                                    |                 | 2024-09-23T23:17:40.143699 | 2024-09-23T23:17:40.271678 | 127.97900000000001    |                                                              | {sql: SELECT * FROM logs_file}                                                                                                      |
    | fd0b909b789938384d99f0e4e6f4b68b | 624ea4751bb6727a |                | accelerated_refresh | logs                                         |                 | 2024-09-23T23:17:40.676838 | 2024-09-23T23:17:42.345932 | 1669.0939999999998    |                                                              | {sql: SELECT * FROM "logs"}                                                                                                         |
    | 3db5488039408825ac0829a3feb49b05 | e3e5ac928b497eef |                | accelerated_refresh | decimal                                      |                 | 2024-09-23T23:17:41.592359 | 2024-09-23T23:17:43.781699 | 2189.34               |                                                              | {sql: SELECT * FROM "decimal"}                                                                                                      |
    | 5c5ddd481d1e19df823da74fe33f261f | 6afcfd1e65385a16 |                | sql_query           | select * from runtime.task_history limit 100 |                 | 2024-09-23T23:17:48.305649 | 2024-09-23T23:17:48.307369 | 1.72                  |                                                              | {runtime_query: true, query_execution_duration_ms: 1.429375, protocol: FlightSQL, datasets: runtime.task_history, rows_produced: 5} |
    | 4c3dd314b874aa63fcd15023e67fc645 | cab3cdc2d31c1b6a |                | sql_query           | select block_number from logs_file limit 5   |                 | 2024-09-23T23:18:00.267218 | 2024-09-23T23:18:00.269278 | 2.06                  |                                                              | {datasets: logs_file, rows_produced: 5, query_execution_duration_ms: 1.940291, accelerated: true, protocol: FlightSQL}              |
    | f135c00df3aecd68dfa4d2360eff78f5 | db3474855449715c |                | sql_query           | select * from foobar                         |                 | 2024-09-23T23:18:12.865122 | 2024-09-23T23:18:12.865196 | 0.074                 | Error during planning: table 'spice.public.foobar' not found | {protocol: FlightSQL, error_code: QueryPlanningError, rows_produced: 0, query_execution_duration_ms: 0.126959, datasets: }          |
    +----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+



    runtime:
      task_history:
        retention_period: 1h # Keep tasks for 1 hour
        retention_check_interval: 1m # Check for expired tasks every minute
    runtime:
      task_history:
        enabled: false
    runtime:
      task_history:
        captured_output: none # none or truncated
    SELECT 
        trace_id,
        span_id,
        task,
        start_time,
        end_time,
        execution_duration_ms,
        error_message
    FROM spice.runtime.task_history
    WHERE start_time >= NOW() - INTERVAL '10 MINUTES'
      AND end_time <= NOW();
    +----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
    | trace_id                         | span_id          | task                | start_time                 | end_time                   | execution_duration_ms | error_message                                                                               |
    +----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
    | 687e0970f8c49d19c5a08764ea2d4dc1 | f4f52ed29db8b151 | text_embed          | 2024-11-25T05:39:37.444749 | 2024-11-25T05:39:53.577195 | 16132.446000000002    |                                                                                             |
    | 687e0970f8c49d19c5a08764ea2d4dc1 | e47b17bd9fd9fe37 | accelerated_refresh | 2024-11-25T05:39:31.112504 | 2024-11-25T05:39:53.579933 | 22467.429             |                                                                                             |
    | 1e881188e5fd252b26adb8a8d838efb8 | 532b0019ad778094 | sql_query           | 2024-11-25T05:40:38.864982 | 2024-11-25T05:40:38.871090 | 6.108                 |                                                                                             |
    | 2ee1c700b450034bb6c2da3de2e2386c | 235dafed1e7d8c02 | sql_query           | 2024-11-25T05:39:38.249113 | 2024-11-25T05:39:39.387258 | 1138.145              |                                                                                             |
    | 20e75df9ea77ba1c8cb99a2632cdd091 | d07551cd172ffa80 | sql_query           | 2024-11-25T05:39:39.458135 | 2024-11-25T05:39:39.482181 | 24.046000000000003    |                                                                                             |
    | ca1d470b12191726b61d825df6f2ce2a | 65597a0bc0a4fde3 | sql_query           | 2024-11-25T05:39:39.675726 | 2024-11-25T05:39:39.822479 | 146.753               |                                                                                             |
    | ac5abd8bfec7e5aa7c19fc84772c55f1 | 316622ac359e3c00 | sql_query           | 2024-11-25T05:39:39.872946 | 2024-11-25T05:39:39.872994 | 0.048                 | This feature is not implemented: The context currently only supports a single SQL statement |
    | 1c640298e248ba297a12b1e3b59fffc7 | 031c3a25dc56d8e9 | sql_query           | 2024-11-25T05:39:40.467032 | 2024-11-25T05:39:40.486156 | 19.124                |                                                                                             |
    | 2c4d9abee740ced8ae423e0eb4fcff6b | a324b699b8bcf338 | sql_query           | 2024-11-25T05:39:40.525506 | 2024-11-25T05:39:40.525526 | 0.02                  | This feature is not implemented: The context currently only supports a single SQL statement |
    | e5ed7f7a98e62f493ef8af2e0cd7734e | e84c30862a546bb5 | sql_query           | 2024-11-25T05:39:40.560891 | 2024-11-25T05:39:40.560911 | 0.02                  | This feature is not implemented: The context currently only supports a single SQL statement |
    | d471f83092a95bde8663438cda74627f | 3dd9c4d4ebff4cb9 | sql_query           | 2024-11-25T05:39:40.600892 | 2024-11-25T05:39:40.647092 | 46.199999999999996    |                                                                                             |
    | 701874d7282dd47791e7519b343a9694 | 5dacf75c4537ee0e | accelerated_refresh | 2024-11-25T05:39:30.452534 | 2024-11-25T05:39:30.452900 | 0.366                 |                                                                                             |
    | 2e6b672a49a8cd5f0862a760661dc846 | f813941e0699e783 | accelerated_refresh | 2024-11-25T05:39:30.848425 | 2024-11-25T05:39:30.857242 | 8.817                 |                                                                                             |
    | 18d76b6389898cc5253a49294607477d | cc0d06a4e69cbcd5 | health              | 2024-11-25T05:39:30.451626 | 2024-11-25T05:39:31.563876 | 1112.25               |                                                                                             |
    | c75af81360e8962639faa64e6804b830 | 1ea2c95b243a5717 | accelerated_refresh | 2024-11-25T05:39:31.036470 | 2024-11-25T05:39:31.607845 | 571.375               |                                                                                             |
    | 817d88778e91322640414263779ce7f1 | 513a58d83f0416a7 | accelerated_refresh | 2024-11-25T05:39:30.998455 | 2024-11-25T05:39:32.076359 | 1077.904              |                                                                                             |
    | 3c507ee30211e6fab7d8a2eaf686e451 | d9be117925fb6d42 | accelerated_refresh | 2024-11-25T05:39:31.061851 | 2024-11-25T05:39:32.078412 | 1016.561              |                                                                                             |
    | aa6010405a12a14b6afaf76e9fabedb8 | 1f50a2b177003c54 | accelerated_refresh | 2024-11-25T05:39:30.933543 | 2024-11-25T05:39:32.476197 | 1542.654              |                                                                                             |
    | 3c75d16b6b4b8da98c551d115e1c049c | 9a16dc065a95236a | sql_query           | 2024-11-25T05:42:27.386754 | 2024-11-25T05:42:27.386859 | 0.10500000000000001   | SQL error: ParserError("Expected: an SQL statement, found: ELECT")                          |
    +----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
    SELECT 
        trace_id, 
        task, 
        error_message, 
        SUBSTRING(input, 1, 100) AS input_preview, 
        start_time
    FROM spice.runtime.task_history
    WHERE error_message IS NOT NULL
    ORDER BY start_time DESC
    LIMIT 5;
    +----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
    | trace_id                         | task      | error_message                                                                               | input_preview                                                                                        | start_time                 |
    +----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
    | 352539e75fdb1a3d5fc3b48bfd4b4bae | sql_query | Error during planning: Invalid function 'date'.                                             | SELECT DATE(start_time) AS task_date, COUNT(*) AS task_count                                         | 2024-11-25T06:17:40.573970 |
    |                                  |           | Did you mean 'tanh'?                                                                        | FROM spice.runtime.task_history                                                                      |                            |
    |                                  |           |                                                                                             | GROUP B                                                                                              |                            |
    | f6672d562ad97dde0bb4db428723461f | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name       ,c_first_name       ,s_store_name       ,ca_state       ,s_ | 2024-11-25T06:06:39.800900 |
    | b16fa36e5a2f7f119fc3834875f6bdee | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with frequent_ss_items as  (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date soldda | 2024-11-25T06:06:39.760532 |
    | 8d126ea506a374c0c6239c11ee5cbe5a | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with  cross_items as  (select i_item_sk ss_item_sk  from item,  (select iss.i_brand_id brand_id      | 2024-11-25T06:06:39.118422 |
    | 198a9dcc496f0435cff69de61cc07874 | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name       ,c_first_name       ,s_store_name       ,ca_state       ,s_ | 2024-11-25T06:04:11.317419 |
    +----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
    SELECT task, COUNT(*) AS task_count, AVG(execution_duration_ms) AS avg_duration_ms
    FROM spice.runtime.task_history
    GROUP BY task
    ORDER BY task_count DESC;
    +-------------------------------+------------+---------------------+
    | task                          | task_count | avg_duration_ms     |
    +-------------------------------+------------+---------------------+
    | sql_query                     | 65         | 55.10198461538462   |
    | accelerated_refresh           | 27         | 749.1187407407407   |
    | ai_completion                 | 9          | 5026.337888888888   |
    | tool_use::list_datasets       | 4          | 0.16899999999999998 |
    | text_embed                    | 4          | 3341.08975          |
    | ai_chat                       | 4          | 7151.03675          |
    | vector_search                 | 3          | 384.376             |
    | tool_use::document_similarity | 3          | 385.0406666666667   |
    | tool_use::get_readiness       | 1          | 0.12999999999999998 |
    | tool_use::sample_data         | 1          | 2.275               |
    | health                        | 1          | 661.0169999999999   |
    +-------------------------------+------------+---------------------+
    SELECT 
        task, 
        trace_id, 
        parent_span_id, 
        execution_duration_ms,
        labels 
    FROM spice.runtime.task_history
    ORDER BY execution_duration_ms DESC
    LIMIT 10;
    +---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
    | task                | trace_id                         | parent_span_id   | execution_duration_ms | labels                                                                                         |
    +---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
    | accelerated_refresh | d9c38c7e58a02ec939240385a4a25a04 |                  | 1093711.474           | {sql: SELECT * FROM react.issues}                                                              |
    | ai_chat             | 7a6427313880942316bf3018cd23a198 |                  | 17202.836000000003    | {model: gpt-4o}                                                                                |
    | ai_completion       | 7a6427313880942316bf3018cd23a198 | 59b1fd88c8397e3f | 17202.475             | {model: gpt-4o, total_tokens: 2673, prompt_tokens: 1807, completion_tokens: 866, stream: true} |
    | accelerated_refresh | 96758c1132164204a68e1a7234a06cda |                  | 15660.023000000001    | {sql: SELECT * FROM react.docs}                                                                |
    | text_embed          | 96758c1132164204a68e1a7234a06cda | 109c489b24602356 | 12406.787             | {outputs_produced: 2086}                                                                       |
    | ai_chat             | b2a69503a1b83215603ead321eea6f61 |                  | 6445.162              | {model: gpt-4o}                                                                                |
    | ai_completion       | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 6444.1990000000005    | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30}  |
    | ai_completion       | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 5608.6359999999995    | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true}  |
    | text_embed          | 65880ecfc884a41555ac4d21ceef9aef |                  | 5143.494000000001     | {outputs_produced: 1}                                                                          |
    | text_embed          | f51e5e9d4e26de31a2f7d5e9286dd8f4 |                  | 4769.832              | {outputs_produced: 1}                                                                          |
    +---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
    SELECT 
        task, 
        trace_id, 
        parent_span_id, 
        span_id, 
        execution_duration_ms,
        start_time, 
        end_time,
        SUBSTRING(input, 1, 100) AS input_preview,
        SUBSTRING(captured_output, 1, 100) AS output_preview,
        error_message,
        labels
    FROM spice.runtime.task_history
    WHERE trace_id = 'b2a69503a1b83215603ead321eea6f61'
    ORDER BY start_time;
    +-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
    | task                          | trace_id                         | parent_span_id   | span_id          | execution_duration_ms | start_time                 | end_time                   | input_preview                                                                                        | output_preview                                                                                       | error_message                                                                                                                                      | labels                                                                                                                                        |
    +-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
    | ai_chat                       | b2a69503a1b83215603ead321eea6f61 |                  | 95411c59fc9c8cb8 | 6445.162              | 2024-11-25T06:03:31.197980 | 2024-11-25T06:03:37.643142 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | It seems that the dataset containing the recent React issues is currently being refreshed and is not |                                                                                                                                                    | {model: gpt-4o}                                                                                                                               |
    | tool_use::list_datasets       | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | aa5ba649da3f0581 | 0.367                 | 2024-11-25T06:03:31.198139 | 2024-11-25T06:03:31.198506 |                                                                                                      | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac |                                                                                                                                                    | {tool: list_datasets}                                                                                                                         |
    | ai_completion                 | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd67a43da1b4312 | 6444.1990000000005    | 2024-11-25T06:03:31.198906 | 2024-11-25T06:03:37.643105 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |                                                                                                                                                    | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30}                                                 |
    | tool_use::document_similarity | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 49b25ff51580d4fa | 140.337               | 2024-11-25T06:03:31.893913 | 2024-11-25T06:03:32.034250 | {"text":"how to install react","datasets":["spice.react.issues"],"limit":3}                          |                                                                                                      | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {tool: document_similarity}                                                                                                                   |
    | vector_search                 | b2a69503a1b83215603ead321eea6f61 | 49b25ff51580d4fa | 9feb8c9a54079cbd | 140.24200000000002    | 2024-11-25T06:03:31.894    | 2024-11-25T06:03:32.034242 | how to install react                                                                                 |                                                                                                      | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {limit: 3, tables: spice.react.issues}                                                                                                        |
    | text_embed                    | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 7f18fd8d96a1baeb | 122.771               | 2024-11-25T06:03:31.894072 | 2024-11-25T06:03:32.016843 | "how to install react"                                                                               |                                                                                                      |                                                                                                                                                    | {outputs_produced: 1}                                                                                                                         |
    | sql_query                     | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 025f6b1e5cd502a6 | 16.892                | 2024-11-25T06:03:32.017320 | 2024-11-25T06:03:32.034212 | WITH ranked_docs as (                                                                                |                                                                                                      | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues                                             | {error_code: QueryExecutionError, protocol: Internal, query_execution_duration_ms: 8.20325, datasets: spice.react.issues, rows_produced: 0}   |
    |                               |                                  |                  |                  |                       |                            |                            |                 SELECT id, dist, offset FROM (                                                       |                                                                                                      |                                                                                                                                                    |                                                                                                                                               |
    |                               |                                  |                  |                  |                       |                            |                            |                     SELECT                                                                           |                                                                                                      |                                                                                                                                                    |                                                                                                                                               |
    |                               |                                  |                  |                  |                       |                            |                            |                                                                                                      |                                                                                                      |                                                                                                                                                    |                                                                                                                                               |
    | ai_completion                 | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 1b2a3273a1dc4cdf | 5608.6359999999995    | 2024-11-25T06:03:32.034451 | 2024-11-25T06:03:37.643087 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |                                                                                                                                                    | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true}                                                 |
    | tool_use::sample_data         | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd10431fb18df87 | 2.275                 | 2024-11-25T06:03:33.039583 | 2024-11-25T06:03:33.041858 | TopNSample({"dataset":"spice.react.issues","limit":3,"order_by":"created_at DESC"})                  |                                                                                                      |                                                                                                                                                    | {sample_method: top_n_sample, tool: top_n_sample}                                                                                             |
    | sql_query                     | b2a69503a1b83215603ead321eea6f61 | 8bd10431fb18df87 | f1b2e06225aa2ba3 | 2.193                 | 2024-11-25T06:03:33.039625 | 2024-11-25T06:03:33.041818 | SELECT * FROM spice.react.issues ORDER BY created_at DESC LIMIT 3                                    |                                                                                                      | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues                                             | {query_execution_duration_ms: 1.5519999, datasets: spice.react.issues, protocol: Internal, error_code: QueryExecutionError, rows_produced: 0} |
    | ai_completion                 | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | b649e4bbe8aac7a3 | 4601.02               | 2024-11-25T06:03:33.042037 | 2024-11-25T06:03:37.643057 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |                                                                                                                                                    | {prompt_tokens: 1599, completion_tokens: 11, model: gpt-4o, stream: true, total_tokens: 1610}                                                 |
    | tool_use::get_readiness       | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | f40f0b4cd608de8b | 0.12999999999999998   | 2024-11-25T06:03:33.499867 | 2024-11-25T06:03:33.499997 |                                                                                                      | {"dataset:call_center":"Ready","dataset:catalog_page":"Ready","dataset:catalog_returns":"Ready","dat |                                                                                                                                                    | {tool: get_readiness}                                                                                                                         |
    | ai_completion                 | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | e018a56742b5064f | 4142.789              | 2024-11-25T06:03:33.500219 | 2024-11-25T06:03:37.643008 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |                                                                                                                                                    | {stream: true, completion_tokens: 254, prompt_tokens: 1920, model: gpt-4o, total_tokens: 2174}                                                |
    +-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
    SELECT 
        task, 
        trace_id, 
        execution_duration_ms,
        start_time, 
        SUBSTRING(input, 1, 100) AS input_preview,
        SUBSTRING(captured_output, 1, 100) AS output_preview,
        error_message,
        labels
    FROM spice.runtime.task_history
    WHERE trace_id = (
        SELECT trace_id
        FROM spice.runtime.task_history
        WHERE task = 'ai_chat'
        ORDER BY start_time DESC
        LIMIT 1
    )
    ORDER BY start_time;
    +-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
    | task                          | trace_id                         | execution_duration_ms | start_time                 | input_preview                                                                                        | output_preview                                                                                       | error_message | labels                                                                                                       |
    +-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
    | ai_chat                       | bb2de94b6f575b6c001f39cfded8bff4 | 5665.4169999999995    | 2024-11-25T07:02:43.240005 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | Here are the three most recent issues related to React, along with their summaries and links:        |               | {model: gpt-4o}                                                                                              |
    |                               |                                  |                       |                            |                                                                                                      |                                                                                                      |               |                                                                                                              |
    |                               |                                  |                       |                            |                                                                                                      | 1. **                                                                                                |               |                                                                                                              |
    | tool_use::list_datasets       | bb2de94b6f575b6c001f39cfded8bff4 | 0.157                 | 2024-11-25T07:02:43.240048 |                                                                                                      | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac |               | {tool: list_datasets}                                                                                        |
    | ai_completion                 | bb2de94b6f575b6c001f39cfded8bff4 | 5664.964              | 2024-11-25T07:02:43.240416 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |               | {prompt_tokens: 2688, total_tokens: 2716, completion_tokens: 28, model: gpt-4o, stream: true}                |
    | tool_use::document_similarity | bb2de94b6f575b6c001f39cfded8bff4 | 710.9609999999999     | 2024-11-25T07:02:44.344935 | {"text":"recent issues","datasets":["spice.react.issues"],"limit":3}                                 |                                                                                                      |               | {tool: document_similarity}                                                                                  |
    | vector_search                 | bb2de94b6f575b6c001f39cfded8bff4 | 710.842               | 2024-11-25T07:02:44.345018 | recent issues                                                                                        | {Full { catalog: "spice", schema: "react", table: "issues" }: VectorSearchTableResult { data: [Recor |               | {limit: 3, tables: spice.react.issues}                                                                       |
    | text_embed                    | bb2de94b6f575b6c001f39cfded8bff4 | 562.453               | 2024-11-25T07:02:44.345072 | "recent issues"                                                                                      |                                                                                                      |               | {outputs_produced: 1}                                                                                        |
    | sql_query                     | bb2de94b6f575b6c001f39cfded8bff4 | 147.672               | 2024-11-25T07:02:44.908148 | WITH ranked_docs as (                                                                                | [{"title_chunk":"app:lintVitalReleaseBug issu","id":"I_kwDOAJy2Ks47fqsI","title":"app:lintVitalRelea |               | {datasets: spice.react.issues, protocol: Internal, query_execution_duration_ms: 139.69496, rows_produced: 3} |
    |                               |                                  |                       |                            |                 SELECT id, dist, offset FROM (                                                       |                                                                                                      |               |                                                                                                              |
    |                               |                                  |                       |                            |                     SELECT                                                                           |                                                                                                      |               |                                                                                                              |
    |                               |                                  |                       |                            |                                                                                                      |                                                                                                      |               |                                                                                                              |
    | ai_completion                 | bb2de94b6f575b6c001f39cfded8bff4 | 3849.3140000000003    | 2024-11-25T07:02:45.055985 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func |                                                                                                      |               | {model: gpt-4o, total_tokens: 3240, stream: true, completion_tokens: 271, prompt_tokens: 2969}               |
    +-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
    SELECT 
        task,
        start_time,
        execution_duration_ms,
        SUBSTRING(input, 1, 100) AS input_preview, 
        error_message, 
        labels
    FROM spice.runtime.task_history
    WHERE 'catalog_sales' = ANY(string_to_array(labels['datasets'], ','))
    ORDER BY start_time DESC
    LIMIT 5;
    +-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | task      | start_time                 | execution_duration_ms | input_preview                                                                                        | error_message | labels                                                                                                                                                                            |
    +-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | sql_query | 2024-11-25T07:15:31.278018 | 19.424                | with ss as (  select           i_manufact_id,sum(ss_ext_sales_price) total_sales  from  	store_sales |               | {accelerated: true, protocol: FlightSQL, query_execution_duration_ms: 18.765831, rows_produced: 15, datasets: store_sales,customer_address,item,date_dim,web_sales,catalog_sales} |
    | sql_query | 2024-11-25T07:15:31.233147 | 1.661                 | select  sum(cs_ext_discount_amt)  as "excess discount amount" from    catalog_sales    ,item    ,dat |               | {query_execution_duration_ms: 1.380667, accelerated: true, datasets: date_dim,catalog_sales,item, rows_produced: 1, protocol: FlightSQL}                                          |
    | sql_query | 2024-11-25T07:15:30.889905 | 30.101                | select      i_item_id     ,i_item_desc     ,s_store_id     ,s_store_name     ,stddev_samp(ss_quantit |               | {query_execution_duration_ms: 29.511086, datasets: store_sales,store,item,date_dim,store_returns,catalog_sales, rows_produced: 0, protocol: FlightSQL, accelerated: true}         |
    | sql_query | 2024-11-25T07:15:30.658339 | 24.942                | select  i_item_id,         avg(cs_quantity) agg1,         avg(cs_list_price) agg2,         avg(cs_co |               | {query_execution_duration_ms: 24.620039, protocol: FlightSQL, datasets: item,date_dim,catalog_sales,promotion,customer_demographics, accelerated: true, rows_produced: 73}        |
    | sql_query | 2024-11-25T07:15:30.574257 | 40.908                | select  i_item_id  ,i_item_desc  ,s_store_id  ,s_store_name  ,min(ss_net_profit) as store_sales_prof |               | {protocol: FlightSQL, rows_produced: 0, accelerated: true, datasets: store_returns,date_dim,store,store_sales,item,catalog_sales, query_execution_duration_ms: 40.29496}          |
    +-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
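
    The queries above can be combined; as a rough sketch using the same columns shown throughout this section, per-task error counts can be summarized directly from the task history table:

    SELECT
        task,
        COUNT(*) AS total_tasks,
        SUM(CASE WHEN error_message IS NOT NULL THEN 1 ELSE 0 END) AS failed_tasks
    FROM spice.runtime.task_history
    GROUP BY task
    ORDER BY failed_tasks DESC;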

    Create Chat Completion

    post
    /v1/chat/completions

    Creates a model response for the given chat conversation.

    Authorizations
    X-API-KEY (string, required)
    Body
    any (optional)
    Responses
    200

    Chat completion generated successfully

    application/json
    404

    The specified model was not found

    500

    An internal server error occurred while processing the chat completion

    application/json
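
    As an illustrative sketch (the model name and prompt are placeholders), a streaming request follows the same OpenAI-compatible shape with stream set to true:

    POST /v1/chat/completions HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Content-Type: application/json

    {
      "model": "gpt-4o",
      "messages": [
        { "role": "user", "content": "Summarize the three most recent React issues." }
      ],
      "stream": true
    }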

    Text-to-SQL (NSQL)

    post
    /v1/nsql

    Generate and optionally execute a natural-language text-to-SQL (NSQL) query.

    This endpoint generates a SQL query using a natural language query (NSQL) and optionally executes it. The SQL query is generated by the specified model and executed if the Accept header is not set to application/sql.

    Authorizations
    X-API-KEY (string, required)
    Header parameters
    Accept (string, required)

    The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.nsql.v1+json', 'application/sql', 'text/csv' or 'text/plain'. 'application/sql' will only return the SQL query generated by the model.

    Body
    any (optional)
    Responses
    200

    SQL query executed successfully

    400

    Invalid request parameters

    application/json
    500

    Internal server error

    text/plain
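
    For example, to retrieve only the generated SQL without executing it, set the Accept header to application/sql (a sketch; the dataset name is illustrative):

    POST /v1/nsql HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: application/sql
    Content-Type: application/json

    {
      "query": "Get the top 5 customers by total sales",
      "datasets": ["sales_data"]
    }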

    List Models

    get
    /v1/models

    List all models, both machine learning and language models, available in the runtime.

    Authorizations
    X-API-KEY (string, required)
    Query parameters
    format (any, optional)

    The format of the response (e.g., json or csv).

    status (boolean, optional)

    If true, includes the status of each model in the response.

    Responses
    200

    List of models in JSON format

    500

    Internal server error occurred while processing models

    application/json
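
    For example, to list models in CSV format with their readiness status included (query parameter values are illustrative):

    GET /v1/models?format=csv&status=true HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: */*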

    Search

    post
    /v1/search

    Perform a vector similarity search (VSS) operation on a dataset.

    The search operation returns the most relevant matches based on cosine similarity with the input text. The queried datasets must have an embedding column and an appropriate embedding model loaded.
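
    As a rough sketch of this prerequisite (the connector, column, and embedding model names are illustrative assumptions, and the exact spicepod fields may differ), a dataset declares an embedding on the column to be searched:

    datasets:
      - from: postgres:app_messages
        name: app_messages
        embeddings:
          - column: message          # column whose text is embedded for vector search
            use: openai_embeddings   # embedding model assumed to be defined elsewhere in the spicepod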

    Authorizations
    X-API-KEY (string, required)
    Body
    any (optional)
    Responses
    200

    Search completed successfully

    application/json
    400

    Invalid request parameters

    application/json
    500

    Internal server error

    application/json

    SQL Query

    post
    /v1/sql

    Execute a SQL query and return the results.

    This endpoint allows users to execute SQL queries directly from an HTTP request. The SQL query is sent as plain text in the request body.

    Authorizations
    X-API-KEY (string, required)
    Header parameters
    Accept (string, required)

    The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.sql.v1+json', 'text/csv' or 'text/plain'.

    Body
    object (optional)
    Responses
    200

    SpiceAI SQL v1+json result

    400

    Bad Request

    application/json
    500

    Internal Server Error

    application/json
    {
      "results": [
        {
          "matches": {
            "message": "I booked use some tickets"
          },
          "dataset": "app_messages",
          "primary_key": {
            "id": "6fd5a215-0881-421d-ace0-b293b83452b5"
          },
          "data": {
            "timestamp": 1724716542
          },
          "score": 0.914321
        },
        {
          "matches": {
            "message": "direct to Narata"
          },
          "dataset": "app_messages",
          "primary_key": {
            "id": "8a25595f-99fb-4404-8c82-e1046d8f4c4b"
          },
          "data": {
            "timestamp": 1724715881
          },
          "score": 0.83221
        },
        {
          "matches": {
            "message": "Yes, we're sitting together"
          },
          "dataset": "app_messages",
          "primary_key": {
            "id": "8421ed84-b86d-4b10-b4da-7a432e8912c0"
          },
          "data": {
            "timestamp": 1724716123
          },
          "score": 0.787654321
        }
      ],
      "duration_ms": 42
    }
    POST /v1/search HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Content-Type: application/json
    Accept: */*
    Content-Length: 157
    
    {
      "datasets": [
        "app_messages"
      ],
      "text": "Tokyo plane tickets",
      "where": "user=1234321",
      "additional_columns": [
        "timestamp"
      ],
      "limit": 3,
      "keywords": [
        "plane",
        "tickets"
      ]
    }
    {}
    POST /v1/sql HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: text
    Content-Type: application/json
    Content-Length: 2
    
    {}
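
    Since the SQL statement is sent as plain text, a populated request might look like the following sketch (the query, dataset name, and Content-Type header are illustrative assumptions):

    POST /v1/sql HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: application/json
    Content-Type: text/plain

    SELECT customer_id, SUM(total_sales) AS total_sales
    FROM sales_data
    GROUP BY customer_id
    ORDER BY total_sales DESC
    LIMIT 5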

    GitHub

    GitHub Data Connector Documentation

    The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github as the selector in the from value for the dataset.

    Common Configuration

    Configuration

    from

    The from field takes the form github:github.com/<owner>/<repo>/<content>, where content can be files, issues, pulls, commits, or stargazers. See the examples below for more configuration detail.

    name

    The dataset name. This will be used as the table name within Spice.

    params

    Personal Access Token

    Parameter Name    Description
    github_token      Required. GitHub personal access token to use to connect to the GitHub API. Learn more.

    GitHub App Installation

    GitHub Apps provide a secure and scalable way to integrate with GitHub's API.

    Parameter Name            Description
    github_client_id          Required. Specifies the client ID for GitHub App Installation auth mode.
    github_private_key        Required. Specifies the private key for GitHub App Installation auth mode.
    github_installation_id    Required. Specifies the installation ID for GitHub App Installation auth mode.

    Limitations

    With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.

    Common Parameters

    Parameter Name       Description
    owner                Required. Specifies the owner of the GitHub repository.
    repo                 Required. Specifies the name of the GitHub repository.
    github_query_mode    Optional. Specifies whether the connector should use the GitHub search API for improved filter performance. Defaults to auto; possible values are auto or search.

    Filter Push Down

    GitHub queries support a github_query_mode parameter, which can be set to either auto or search for the following types:

    • Issues: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.

    • Pull Requests: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.

    Commits only supports auto mode. Query filter push down is only enabled for the committed_date column. committed_date supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'.
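
    For example, assuming a commits dataset named spiceai.commits as configured in the examples below, a filter on committed_date can be pushed down to the GitHub API:

    SELECT sha, message_head_line
    FROM spiceai.commits
    WHERE committed_date > '2024-09-24'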

    When set to search, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:

    • author and state; supports exact matches, or NOT matches. For example, WHERE author = 'peasee' or WHERE author <> 'peasee'.

    • body and title; supports exact matches, or LIKE matches. For example, WHERE body LIKE '%duckdb%'.

    • updated_at, created_at, merged_at and closed_at; supports exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'.

    All other filters are supported when github_query_mode is set to search, but cannot be pushed down to the GitHub API for improved performance.
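
    For example, with github_query_mode set to search on an issues dataset named spiceai.issues (as in the examples below), filters on author and title are pushed down to the GitHub Search API:

    SELECT number, title, state
    FROM spiceai.issues
    WHERE author = 'peasee' AND title LIKE '%duckdb%'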

    Limitations

    • GitHub has a limitation in the Search API where it may return more stale data than the standard API used in the default query mode.

    • GitHub has a limitation in the Search API where it only returns a maximum of 1,000 results for a query. Use append mode acceleration to retrieve more results over time. See the append example below for pull requests.

    Examples

    Querying GitHub Files

    Limitations

    • The content column is fetched only when acceleration is enabled.

    • Querying GitHub files does not support filter push down, which may result in long query times when acceleration is disabled.

    • Setting github_query_mode to search is not supported.

    • ref - Required. Specifies the GitHub branch or tag to fetch files from.

    • include - Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.

    Schema

    Column Name     Data Type    Is Nullable
    name            Utf8         YES
    path            Utf8         YES
    size            Int64        YES
    sha             Utf8         YES
    mode            Utf8         YES
    url             Utf8         YES
    download_url    Utf8         YES
    content         Utf8         YES

    Example

    Querying GitHub Issues

    Limitations

    • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'.

    Schema

    Column Name        Data Type       Is Nullable
    assignees          List(Utf8)      YES
    author             Utf8            YES
    body               Utf8            YES
    closed_at          Timestamp       YES
    comments           List(Struct)    YES
    comments_count     Int64           YES
    created_at         Timestamp       YES
    id                 Utf8            YES
    labels             List(Utf8)      YES
    milestone_id       Utf8            YES
    milestone_title    Utf8            YES
    number             Int64           YES
    state              Utf8            YES
    title              Utf8            YES
    updated_at         Timestamp       YES
    url                Utf8            YES

    Example

    Querying GitHub Pull Requests

    Limitations

    • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'.

    Schema

    Column Name       Data Type     Is Nullable
    additions         Int64         YES
    assignees         List(Utf8)    YES
    author            Utf8          YES
    body              Utf8          YES
    changed_files     Int64         YES
    closed_at         Timestamp     YES
    comments_count    Int64         YES
    commits_count     Int64         YES
    created_at        Timestamp     YES
    deletions         Int64         YES
    hashes            List(Utf8)    YES
    id                Utf8          YES
    labels            List(Utf8)    YES
    merged_at         Timestamp     YES
    number            Int64         YES
    reviews_count     Int64         YES
    state             Utf8          YES
    title             Utf8          YES
    url               Utf8          YES

    Example

    Append Example

    Querying GitHub Commits

    Limitations

    • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE committed_date > '2024-09-24'.

    • Setting github_query_mode to search is not supported.

    Schema

    Column Name          Data Type    Is Nullable
    additions            Int64        YES
    author_email         Utf8         YES
    author_name          Utf8         YES
    committed_date       Timestamp    YES
    deletions            Int64        YES
    id                   Utf8         YES
    message              Utf8         YES
    message_body         Utf8         YES
    message_head_line    Utf8         YES
    sha                  Utf8         YES

    Example

    Querying GitHub stars (Stargazers)

    Limitations

    • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE starred_at > '2024-09-24'.

    • Setting github_query_mode to search is not supported.

    Schema

    Column Name    Data Type    Is Nullable
    starred_at     Timestamp    YES
    login          Utf8         YES
    email          Utf8         YES
    name           Utf8         YES
    company        Utf8         YES
    x_username     Utf8         YES
    location       Utf8         YES
    avatar_url     Utf8         YES
    bio            Utf8         YES

    Example

    {
      "id": "chatcmpl-123",
      "object": "chat.completion",
      "created": 1677652288,
      "model": "gpt-4o-mini",
      "system_fingerprint": "fp_44709d6fcb",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "\n\nHello there, how may I assist you today?"
          },
          "logprobs": null,
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21,
        "completion_tokens_details": {
          "reasoning_tokens": 0,
          "accepted_prediction_tokens": 0,
          "rejected_prediction_tokens": 0
        }
      }
    }
    POST /v1/nsql HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: text
    Content-Type: application/json
    Content-Length: 117
    
    {
      "query": "Get the top 5 customers by total sales",
      "model": "nql",
      "sample_data_enabled": true,
      "datasets": [
        "sales_data"
      ]
    }
    {
      "object": "list",
      "data": [
        {
          "id": "gpt-4",
          "object": "model",
          "owned_by": "openai",
          "datasets": null,
          "status": "ready"
        },
        {
          "id": "text-embedding-ada-002",
          "object": "model",
          "owned_by": "openai-internal",
          "datasets": [
            "text-dataset-1",
            "text-dataset-2"
          ],
          "status": "ready"
        }
      ]
    }
    POST /v1/chat/completions HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Content-Type: application/json
    Accept: */*
    Content-Length: 143
    
    {
      "model": "gpt-4o",
      "messages": [
        {
          "role": "developer",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Hello!"
        }
      ],
      "stream": false
    }
    GET /v1/models HTTP/1.1
    Host: data.spiceai.io
    X-API-KEY: YOUR_API_KEY
    Accept: */*
    

    [
      {
        "customer_id": "12345",
        "total_sales": 150000
      },
      {
        "customer_id": "67890",
        "total_sales": 125000
      }
    ]
    datasets:
      - from: github:github.com/<owner>/<repo>/files/<ref>
        name: spiceai.files
        params:
          github_token: ${secrets:GITHUB_TOKEN}
          include: '**/*.json; **/*.yaml'
        acceleration:
          enabled: true
    datasets:
      - from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
        name: spiceai.files
        params:
          github_token: ${secrets:GITHUB_TOKEN}
          include: '**/*.txt' # include txt files only
        acceleration:
          enabled: true
    sql> select * from spiceai.files
    +-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
    | name        | path        | size | sha                                      | mode   | url                                                                                             | download_url                                                               | content     |
    +-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
    | version.txt | version.txt | 12   | ee80f747038c30e776eecb2c2ae155dec9a68187 | 100644 | https://api.github.com/repos/spiceai/spiceai/git/blobs/ee80f747038c30e776eecb2c2ae155dec9a68187 | https://raw.githubusercontent.com/spiceai/spiceai/v0.17.2-beta/version.txt | 0.17.2-beta |
    |             |             |      |                                          |        |                                                                                                 |                                                                            |             |
    +-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
    
    Time: 0.005067 seconds. 1 rows.
    datasets:
      - from: github:github.com/<owner>/<repo>/issues
        name: spiceai.issues
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
    datasets:
      - from: github:github.com/spiceai/spiceai/issues
        name: spiceai.issues
        params:
          github_token: ${secrets:GITHUB_TOKEN}
    sql> select title, state, labels from spiceai.issues where title like '%duckdb%'
    +-----------------------------------------------------------------------------------------------------------+--------+----------------------+
    | title                                                                                                     | state  | labels               |
    +-----------------------------------------------------------------------------------------------------------+--------+----------------------+
    | Limitation documentation duckdb accelerator about nested struct and decimal256                            | CLOSED | [kind/documentation] |
    | Inconsistent duckdb connector params: `params.open` and `params.duckdb_file`                              | CLOSED | [kind/bug]           |
    | federation across multiple duckdb acceleration tables.                                                    | CLOSED | []                   |
    | Integration tests to cover "On Conflict" behaviors for duckdb accelerator                                 | CLOSED | [kind/task]          |
    | Permission denied issue while using duckdb data connector with spice using HELM for Kubernetes deployment | CLOSED | [kind/bug]           |
    +-----------------------------------------------------------------------------------------------------------+--------+----------------------+
    
    Time: 0.011877542 seconds. 5 rows.
    datasets:
      - from: github:github.com/<owner>/<repo>/pulls
        name: spiceai.pulls
        params:
          github_token: ${secrets:GITHUB_TOKEN}
    datasets:
      - from: github:github.com/spiceai/spiceai/pulls
        name: spiceai.pulls
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
    sql> select title, url, state from spiceai.pulls where title like '%GitHub connector%'
    +---------------------------------------------------------------------+----------------------------------------------+--------+
    | title                                                               | url                                          | state  |
    +---------------------------------------------------------------------+----------------------------------------------+--------+
    | GitHub connector: convert `labels` and `hashes` to primitive arrays | https://github.com/spiceai/spiceai/pull/2452 | MERGED |
    +---------------------------------------------------------------------+----------------------------------------------+--------+
    
    Time: 0.034996667 seconds. 1 rows.
    datasets:
      - from: github:github.com/spiceai/spiceai/pulls
        name: spiceai.pulls
        params:
          github_token: ${secrets:GITHUB_TOKEN}
          github_query_mode: search
        time_column: created_at
        acceleration:
          enabled: true
          refresh_mode: append
          refresh_check_interval: 6h # check for new results every 6 hours
          refresh_data_window: 90d # at initial load, load the last 90 days of pulls
    datasets:
      - from: github:github.com/<owner>/<repo>/commits
        name: spiceai.commits
        params:
          github_token: ${secrets:GITHUB_TOKEN}
    datasets:
      - from: github:github.com/spiceai/spiceai/commits
        name: spiceai.commits
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
    sql> select sha, message_head_line from spiceai.commits limit 10
    +------------------------------------------+------------------------------------------------------------------------+
    | sha                                      | message_head_line                                                      |
    +------------------------------------------+------------------------------------------------------------------------+
    | 2a9fab7905737e1af182e17f40aecc5c4b5dd236 |  wait 2 seconds for the status to turn ready in refreshing status tes… |
    | b9c210a818abeaf14d2493fde5227781f47faed8 | Update README.md - Remove bigquery from tablet of connectors (#1434)   |
    | d61e1af61ebf826f83703b8dd939f19e8b2ba426 | Add databricks_use_ssl parameter (#1406)                               |
    | f1ec55c5986e3e5d57eff94197182ffebbae1045 | wording and logs change reflected on readme (#1435)                    |
    | bfc74185584d1e048ef66c72ce3572a0b652bfd9 | Update acknowledgements (#1433)                                        |
    | 0d870f1791d456e7924b4ecbbda5f3b762db1e32 | Update helm version and use v0.13.0-alpha (#1436)                      |
    | 12f930cbad69833077bd97ea43599a75cff985fc | Enable push-down federation by default (#1429)                         |
    | 6e4521090aaf39664bd61d245581d34398ce77db | Add functional tests for federation push-down (#1428)                  |
    | fa3279b7d9fcaa5e8baaa2425f69b556bb30e309 | Add LRU cache support for http-based sql queries (#1410)               |
    | a3f93dde9d1312bfbf14f7ae3b75bdc468289212 | Add guides and examples about error handling (#1427)                   |
    +------------------------------------------+------------------------------------------------------------------------+
    
    Time: 0.0065395 seconds. 10 rows.
    datasets:
      - from: github:github.com/<owner>/<repo>/stargazers
        name: spiceai.stargazers
        params:
          github_token: ${secrets:GITHUB_TOKEN}
    datasets:
      - from: github:github.com/spiceai/spiceai/stargazers
        name: spiceai.stargazers
        params:
          github_token: ${secrets:GITHUB_TOKEN}
        acceleration:
          enabled: true
    sql> select starred_at, login from spiceai.stargazers order by starred_at DESC limit 10
    +----------------------+----------------------+
    | starred_at           | login                |
    +----------------------+----------------------+
    | 2024-09-15T13:22:09Z | cisen                |
    | 2024-09-14T18:04:22Z | tyan-boot            |
    | 2024-09-13T10:38:01Z | yofriadi             |
    | 2024-09-13T10:01:33Z | FourSpaces           |
    | 2024-09-13T04:02:11Z | d4x1                 |
    | 2024-09-11T18:10:28Z | stephenakearns-insta |
    | 2024-09-09T22:17:42Z | Lrs121               |
    | 2024-09-09T19:56:26Z | jonathanfinley       |
    | 2024-09-09T07:02:10Z | leookun              |
    | 2024-09-09T03:04:27Z | royswale             |
    +----------------------+----------------------+
    
    Time: 0.0088075 seconds. 10 rows.