Get started with the Spice.ai Cloud Platform in 5 mins.
Welcome to the Spice.ai Cloud Platform documentation!
The Spice.ai Cloud Platform is an AI application and agent cloud; an AI-backend-as-a-service comprising composable, ready-to-use AI and agent building blocks including high-speed SQL query, LLM inference, Vector Search, and RAG, built on cloud-scale, managed Spice.ai OSS.
With the Spice.ai Cloud Platform, powered by Spice.ai OSS, you can:
Query and accelerate data: Run high-performance SQL queries across multiple data sources with results optimized for AI applications and agents.
Use AI Models: Perform large language model (LLM) inference with major providers including OpenAI, Anthropic, and Grok for chat, completion, and generative AI workflows.
Collaborate: Share, fork, and manage datasets, models, embeddings, evals, and tools in a collaborative, community-driven hub.
Use-Cases
Fast, virtualized data views: Build specialized “small data” warehouses to serve fast, virtualized views across large datasets for applications, APIs, dashboards, and analytics.
Performance and reliability: Manage replicas of hot data, cache SQL queries and AI results, and load-balance AI services to improve resiliency and scalability.
Production-grade AI workflows: Use Spice.ai Cloud as a data and AI proxy for secure, monitored, and compliant production environments, complete with advanced observability and performance management.
Take it for a spin by starting with the getting-started steps below.
Create your first Spice app
Once logged in, you will be redirected to the new application page. Set a name, add a model provider, and optionally select one of the ready-to-use datasets.
Enter a name for the application.
Optionally choose a model provider, such as OpenAI, and provide an API key.
Optionally select one or more of the available datasets. Datasets can also be added later.
Learn more about building AI applications and agents with the Spice.ai Cloud Platform.


Click Create application.
It will take up to 30 seconds to create and provision a dedicated Spice.ai instance for the application.
Once the application instance is deployed and ready, you will be redirected to the Playground.
Executing the show tables SQL query will show the default datasets available for the app.
🎉 Congrats, you've created your first Spice app!
Continue to Step 3 to add a dataset and query it.

Go to spice.ai and click on Login or Try for Free in the top right corner.
You can also navigate directly by URL to spice.ai/login
Click Continue with GitHub to login with your GitHub account.
Click Authorize Spice.ai Cloud Platform.
You will be redirected to the new application page.
Continue to Step 2 to configure your first Spice application.
Frequently asked questions
Spice.ai OSS is an open-source project created by the Spice AI team that provides a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
The Spice.ai Cloud Platform is a data and AI application platform that provides a set of building-blocks to create AI and agentic applications. Building blocks include a cloud-data-warehouse, ML model training and inference, and a cloud-scale, managed Spice.ai OSS cloud-hosted service.
It's free to use the Spice.ai Cloud Platform.
For customers who need resource limits, service-level guarantees, or priority support, we offer paid plans based on usage.
We offer enterprise-grade support with an SLA.
For standard plans, we offer community support in Discord.
The Spice.ai Cloud Platform is SOC 2 Type II compliant.
Spice.ai OSS is built on Apache DataFusion and uses the PostgreSQL dialect.








Use Spice for Enterprise Search and Retrieval
Enterprises face the challenge of accessing data from various disparate and legacy systems to provide AI with comprehensive context. Speed is crucial for this process to be effective.
Spice offers a fast knowledge index into both structured and unstructured data, enabling efficient vector similarity search across multiple data sources. This ensures that AI applications have access to the necessary data for accurate and timely responses.
In the Components sidebar, click the Datasets tab.
Select and add the NYC Taxi Trips dataset.
Note that the configuration has been added to the editor.
Click Save in the code toolbar and then Deploy on the popup card that appears in the bottom right.
Navigate to the Playground tab, open the dataset reference, and click on the spice.samples.taxi_trips dataset to insert a sample query into the SQL editor. Then, click Run Selection.
Go to the app Settings and copy one of the app API Keys.
Replace [API-KEY] in the sample below with your API Key and execute from a terminal.
🎉 Congratulations, you've now added a dataset and queried it.
Continue to Step 4 to add an AI Model and chat with the dataset.
Add an OpenAI model and chat with the NYC Taxi Trips dataset
Navigate to the Code tab.
In the Components sidebar, click the Model Providers tab, and select OpenAI.
Enter the Model name.
Enter the Model ID (e.g., gpt-4o).
Set the OpenAI API Key secret.
API keys and other secrets are securely stored and encrypted.
Insert tools: auto in the params section of the gpt-4o Model to automatically connect datasets to the model.
The final Spicepod configuration in the editor should be as follows:
Click Save in the code toolbar and then Deploy in the popup card that appears in the bottom right to deploy the changes.
Navigate to Playground and select AI Chat in the sidebar.
Ask a question about the NYC Taxi Trips dataset in the chat. For example:
Replace [API-KEY] in the sample below with the app API Key and execute in a terminal.
🎉 Congratulations, you've now added an OpenAI model and can use it to ask questions of the NYC Taxi Trips dataset.
Continue on to explore use cases and do more with the Spice.ai Cloud Platform.
The In-Memory Arrow Data Accelerator is the default data accelerator in Spice. It uses Apache Arrow to store data in-memory for fast access and query performance.
To use the In-Memory Arrow Data Accelerator, no additional configuration is required beyond enabling acceleration.
Example:
However, Arrow can be specified explicitly by setting arrow as the engine for acceleration.
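For example, a minimal sketch of an acceleration block (the dataset path and name are placeholders), with the engine stated explicitly even though Arrow is the default:
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      enabled: true
      engine: arrow # optional; arrow is the default accelerator engine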
Memory Data Connector Documentation
Instructions for using models hosted on the Spice Cloud Platform with Spice.
A light or dark mode portal theme can be set:
Click the profile picture on the top right corner of the portal interface.
Select Light, Dark, or System mode using the theme toggle.
Each new app deployment automatically retrieves the most recent stable Spice OSS release. Visit the Spice OSS release notes to check for the latest runtime updates.
All Spice Apps are now powered by the latest, next-generation Spice.ai Open Source data and AI engine. Existing apps have been migrated but require a manual setup step to connect datasets and/or model providers.
Learn More: Read the blog post for details on this upgrade, and visit the Spice.ai Cookbook for over 50 quickstarts and examples.
curl --request POST \
--url 'https://data.spiceai.io/v1/sql' \
--header 'Content-Type: text/plain' \
--header 'X-API-KEY: [API-KEY]' \
--data 'select * from spice.samples.taxi_trips limit 3'
Spice provides a drop-in solution that offers a single, unified endpoint to multiple data systems without requiring changes to the application.
Maintain local replicas of data with the application to significantly enhance application resilience and availability.
Create a materialization layer for visualization tools like Power BI, Tableau, or Superset to achieve faster, more responsive dashboards without incurring massive compute costs.







Use Spice for Retrieval-Augmented-Generation (RAG)
Use Spice to access data across various data sources for Retrieval-Augmented-Generation (RAG).
Spice enables developers to combine structured data via SQL queries and unstructured data through built-in vector similarity search. This combined data can then be fed to large language models (LLMs) through a native AI gateway, enhancing the models' ability to generate accurate and contextually relevant responses.
"What is the average fare amount of a taxi trip?"



datasets:
- from: memory:store
name: llm_memory
mode: read_write
columns:
- name: value
embeddings: # Easily make your LLM learnings searchable.
- from: all-MiniLM-L6-v2
embeddings:
- name: all-MiniLM-L6-v2
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Examples:
spice.ai/lukekim/smart/models/drive_stats:latest: Refers to the latest version of the drive_stats model in the smart application by the user or organization lukekim.
spice.ai/lukekim/smart/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf: Specifies a model with a unique training run ID.
Prefix (Optional): The value must start with spice.ai/.
Organization/User: The name of the organization or user (org) hosting the model.
Application Name: The name of the application (app) which the model belongs to.
Model Name: The name of the model (model).
Version (Optional): A colon (:) followed by the version identifier (version), which could be a semantic version, latest for the most recent version, or a specific training run ID.
name: my-first-app
kind: Spicepod
version: v1beta1
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: samples.taxi_trips
description: Taxi trips dataset from Spice.ai demo datasets.
params:
file_format: parquet
models:
- from: openai:gpt-4o
name: gpt-4o
params:
endpoint: https://api.openai.com/v1
openai_api_key: ${secrets:OPENAI_API_KEY}
tools: auto
curl --request POST \
--url 'https://data.spiceai.io/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: 31393037|8f2f6125e7b8487f80964041c123d3c3' \
--data '{ "messages": [{ "role": "user", "content": "Hello!" }], "model": "gpt-4o" }'models:
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats
name: drive_stats
datasets:
- drive_stats_inferencing
models:
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:latest # Label
name: drive_stats_a
datasets:
- drive_stats_inferencing
- from: spice.ai/taxi_tech_co/taxi_drives/models/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf # Training Run ID
name: drive_stats_b
datasets:
- drive_stats_inferencing
\A(?:spice\.ai\/)?(?<org>[\w\-]+)\/(?<app>[\w\-]+)(?:\/models)?\/(?<model>[\w\-]+):(?<version>[\w\d\-\.]+)\z
Limitations
The In-Memory Arrow Data Accelerator does not support persistent storage. Data is stored in-memory and will be lost when the Spice runtime is stopped.
The In-Memory Arrow Data Accelerator does not support Decimal256 (76 digits), as it exceeds Arrow's maximum Decimal width of 38 digits.
The In-Memory Arrow Data Accelerator does not support .
The In-Memory Arrow Data Accelerator only supports primary-key constraints, not unique constraints.
With Arrow acceleration, mathematical operations like value1 / value2 are treated as integer division if the values are integers. For example, 1 / 2 will result in 0 instead of the expected 0.5. Use casting to FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT) (or CAST(1 AS FLOAT) / 2).
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
enabled: true
Navigate to the Spicepod tab and click on the Logs for the selected instance.


Flight SQL Data Connector Documentation
Connect to any Flight SQL compatible server (e.g. Influx 3.0, CnosDB, other Spice runtimes!) as a connector for federated SQL queries.
- from: flightsql:my_catalog.good_schemas.cool_dataset
name: cool_dataset
params:
flightsql_endpoint: http://127.0.0.1:50051
flightsql_username: spicy
flightsql_password: ${secrets:my_flightsql_pass}
from
The from field takes the form flightsql:dataset where dataset is the fully qualified name of the dataset to read from.
name
The dataset name. This will be used as the table name within Spice.
params
Federated SQL Query documentation
Spice supports federated queries, enabling you to join and combine data from multiple sources, including databases (PostgreSQL, MySQL), data warehouses (Databricks, Snowflake, BigQuery), and data lakes (S3, MinIO). For a full list of supported sources, see Data Connectors.
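As an illustrative sketch only (the dataset names, connection parameters, and join keys below are hypothetical), a Spicepod can register tables from two different sources, and a single SQL query can then join them:
datasets:
  - from: postgres:public.orders        # hypothetical PostgreSQL table
    name: orders
    params:
      pg_host: localhost
      pg_db: shop
      pg_pass: ${secrets:pg_pass}
  - from: s3://my-bucket/customers/     # hypothetical S3 path
    name: customers
    params:
      file_format: parquet
-- federated query joining the PostgreSQL table with the S3 dataset
SELECT c.name, SUM(o.total) AS total_spend
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name;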
The Playground SQL Explorer is the fastest way to get started with federated queries, debugging queries, and iterating quickly. The SQL Query Editor can be accessed by clicking on the SQL Explorer tab after selecting Playground in the app navigation bar.
See for further documentation on using the SQL Query Editor.
For production applications, leveraging the high-performance endpoint is recommended. The Spice SDKs always query using Arrow Flight.
See for further documentation on using Apache Arrow Flight APIs.
SQL Query is also accessible via a standard HTTP API.
See for further documentation on using the HTTP SQL API.
Instructions for using Azure OpenAI models
To use a language model hosted on Azure OpenAI, specify the azure path in the from field and the following parameters from the Azure OpenAI Model Deployment page:
azure_api_key
The Azure OpenAI API key from the models deployment page.
azure_api_version
The API version used for the Azure OpenAI service.
Only one of azure_api_key or azure_entra_token can be provided for model configuration.
Example:
Refer to the Azure OpenAI documentation for more details on available models and configurations.
Follow the quickstart to try Azure OpenAI models for vector-based search and chat functionalities with structured (taxi trips) and unstructured (GitHub files) data.
AI Gateway documentation
Spice provides a high-performance, OpenAI API-compatible AI Gateway optimized for managing and scaling large language models (LLMs). Additionally, Spice offers tools for Enterprise Retrieval-Augmented Generation (RAG), such as SQL query across federated datasets and an advanced search feature (see Search).
Spice supports full OpenTelemetry observability, enabling detailed tracking of data flows and requests for full transparency and easier debugging.
Spice supports a variety of LLMs, including OpenAI, Azure OpenAI, Anthropic, Groq, Hugging Face, and more (see Model Providers for all supported models).
Custom Tools: Equip models with tools to interact with the Spice runtime.
System Prompts: Customize system prompts and override defaults.
For detailed configuration and API usage, refer to the API documentation.
To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from.
Example spicepod.yml:
For details, see .
Define semantic data models in Spice to improve dataset understanding for AI
A semantic model is a structured representation of data that captures the meaning and relationships between elements in a dataset.
In Spice, semantic models transform raw data into meaningful business concepts by defining metadata, descriptions, and relationships at both the dataset and column level. This makes the data more interpretable for both AI language models and human analysis.
The semantic model is automatically used by language models as context to produce more accurate and context-aware AI responses.
Semantic data models are defined within the spicepod.yaml file, specifically under the datasets section. Each dataset supports description, metadata, and a columns field where individual columns are described with metadata and features for utility and clarity.
Example spicepod.yaml:
Datasets can be defined with the following metadata:
instructions: Optional. Instructions to provide to a language model when using this dataset.
reference_url_template: Optional. A URL template for citation links.
For detailed metadata configuration, see the Spice OSS documentation.
Each column in the dataset can be defined with the following attributes:
description: Optional. A description of the column's contents and purpose.
embeddings: Optional. Vector embeddings configuration for this column.
For detailed columns configuration, see the Spice OSS documentation.
Instructions for using xAI models
To use a language model hosted on xAI, specify the xai path in the from field and the associated xai_api_key parameter:
xai_api_key
The xAI API key.
-
Example:
models:
- from: xai:grok-2-latest
name: xai
params:
xai_api_key: ${secrets:SPICE_GROK_API_KEY}
Refer to the xAI documentation for more details on available models and configurations.
Apache Spark Connector Documentation
Apache Spark as a connector for federated SQL query against a Spark Cluster using Spark Connect
datasets:
- from: spark:spiceai.datasets.my_awesome_table
name: my_table
params:
spark_remote: sc://my-spark-endpoint
spark_remote
A Spark Connect remote connection URI. Refer to the Spark Connect client connection string documentation for URI parameters.
Correlated scalar subqueries are only supported in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands.
The Spark connector does not yet support streaming query results from Spark.
Instructions for using language models hosted on Anthropic with Spice.
To use a language model hosted on Anthropic, specify anthropic in the from field.
To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is claude-3-5-sonnet-latest.
The following parameters are specific to Anthropic models:
Example spicepod.yml configuration:
See for a list of supported model names.
Export observability traces from Spice into Zipkin
In addition to the built-in runtime.task_history SQL table, Spice can export the observability traces it collects into Zipkin.
Zipkin export is defined in the spicepod.yaml under the runtime.tracing section:
runtime:
tracing:
zipkin_enabled: true
zipkin_endpoint: http://localhost:9411/api/v2/spans
zipkin_enabled: Optional. Default false. Enables or disables the Zipkin trace export.
zipkin_endpoint: Required if zipkin_enabled is true. The path to the /api/v2/spans endpoint on the Zipkin instance to export to.
Localpod Data Connector Documentation
The Localpod Data Connector enables setting up a parent/child relationship between datasets in the current Spicepod. This can be used for configuring multiple/tiered accelerations for a single dataset, and ensuring that the data is only downloaded once from the remote source. For example, you can use the localpod connector to create a child dataset that is accelerated in-memory, while the parent dataset is accelerated to a file.
The dataset created by the localpod connector will logically have the same data as the parent dataset.
The localpod connector supports synchronized refreshes, which ensures that the child dataset is refreshed from the same data as the parent dataset. Synchronized refreshes require that both the parent and child datasets are accelerated with refresh_mode: full (which is the default).
When synchronization is enabled, the following logs will be emitted:
Publish app to https://spicerack.org
Public Spice Apps can be forked, added as a dependency, or connected to using the Spice.ai data connector in Spice OSS.
To make your app public, go to your app settings and click Make public.
After that, the app will be visible to all users at https://spice.ai/<org-name>/<app-name> and searchable at spicerack.org.
Query with SQL via the HTTP API
Data may be queried by posting SQL to the /v1/sql API, or to the /v1/firesql API for Firecached data. For documentation on the Spice Firecache, see the Firecache documentation.
See Tables for a list of tables to query or browse the example queries listed in the menu.
An API key is required for all SQL queries.
Results are limited to 500 rows. Use the to fetch up to 1M rows in a single query or the to fetch results with paging.
Requests are limited to 90 seconds.
The @spiceai/spice SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
The Client.query function takes an optional onData callback that will be passed partial results as they become available.
public async query(
queryText: string,
onData: ((data: Table) => void) | undefined = undefined
): Promise<Table>
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient(process.env.API_KEY);
const query = `
SELECT s_suppkey, s_name
FROM tpch.supplier
`
const allSuppliers = await spiceClient.query(query)
allSuppliers.toArray().forEach((row) => {
processSupplier(row.toJSON())
});
This call will wait for the promise returned by query() to complete, returning all 10,000 suppliers.
Alternatively, data can be processed as it is streamed to the SDK. Provide a callback function to the onData parameter, which will be called with every partial set of data streamed to the SDK:
Configuring Spice.ai runtime for your Spice application
Navigate to Settings -> Runtime to configure the runtime settings for your Spice application.
The runtime version determines the Spice.ai Open Source version for your Spice application. Each new deployment automatically adopts the latest stable version of the Spice runtime to ensure access to the most recent features and optimizations.
The runtime region specifies the geographic location of the data center hosting your Spice application. Region selection optimizes latency, compliance, and performance based on your business needs.
Availability: Region selection is exclusive to Enterprise plan customers.
Supported Regions:
North America:
Compute settings define the resource allocation for your Spice application, balancing performance and cost.
Standard Compute Instances:
Developer: 2 CPU / 4 GB
Pro for Teams: 4 CPU / 8 GB
Enterprise: Dedicated Instances with multiple high-availability replicas
Provides persistent storage for the runtime to save data acceleration files. Data remains intact across restarts and redeployments.
Availability: Storage is exclusive to Enterprise plan customers.
Mount Path: /data.
Size: Configured per request. Contact your account executive to set or update capacity.
First-class, built-in observability to understand the operations Spice performs.
Observability in Spice enables task tracking and performance monitoring through a built-in distributed tracing system that can be exported (for example, to Zipkin) or viewed via the runtime.task_history SQL table.
Spice records detailed information about runtime operations through trace IDs, timings, and labels - from SQL queries to AI completions. This task history system helps operators monitor performance, debug issues, and understand system behavior across individual requests and overall patterns.
SQL Query (Cloud Data Warehouse) API
The SQL Query API provides powerful capabilities for querying data managed by the Spice.ai Cloud Data Warehouse and connected external data sources using federated SQL queries. Results can be fetched through either the high-performance Apache Arrow Flight API or a standard HTTP API.
Instructions for using language models hosted on Perplexity with Spice.
To use a language model hosted on Perplexity, specify perplexity in the from field.
To use a specific model, include its model ID in the from field (see example below). If not specified, the default model is sonar.
The following parameters are specific to Perplexity models:
Each Spice app has two pre-generated API keys, which can be used with the platform's APIs and SDKs.
Select Spice app and navigate to Settings -> General.
Click the API Key 1 or API Key 2 field to copy the key value.
To update a Spice app's spicepod.yaml, navigate to the Code tab. Use the Components sidebar to add data connectors, model providers, and preconfigured datasets, or manually edit spicepod.yaml in the code editor.
After saving the spicepod changes, a new deployment must be triggered. Learn more about Spicepod Deployments.
The spicepy SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
spicepy enables streaming through the use of the read_chunk() method on the FlightStreamReader.
The object returned from spicepy.Client.query() is a FlightStreamReader.
Calling read_pandas() on the FlightStreamReader will wait for the stream to return all of the data before returning a pandas DataFrame.
Use the Playground's SQL editor to easily explore data
Continuously identify and fix issues by tracking process actions with Spice monitoring and request logs.
Under the Monitoring tab, track your app requests, their status codes, and duration, in addition to the existing usage monitoring metrics dashboard.
Monitor the success, failures, and durations of SQL queries, AI completions, Vector Searches, Embedding calculations, and accelerated dataset refreshes.
App Secrets are key-value pairs that are passed to the Spice Runtime instance as environment secrets. Secrets are securely encrypted and accessible only through the app in which they were created.
Once a secret is saved, its value cannot be retrieved through Spice Cloud. If you need to update the secret value, you must delete the existing secret and create a new one.
Select your app.
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
enabled: true
engine: arrow
For production use, leverage the high-performance Apache Arrow Flight API, which is optimized for large-scale data workloads. The Spice SDKs default to querying via Arrow Flight.
• Endpoint: grpc+tls://flight.spiceai.io
• For additional details, refer to the Apache Arrow Flight API documentation.
The SQL Query API is also accessible via HTTP, offering standard integration for web applications.
• Core Endpoint: https://data.spiceai.io/v1/sql
• For more details, consult the HTTP SQL API documentation.
flightsql_endpoint
The Apache Flight endpoint used to connect to the Flight SQL server.
flightsql_username
Optional. The username to use in the underlying Apache flight Handshake Request to authenticate to the server (see reference).
flightsql_password
Optional. The password to use in the underlying Apache flight Handshake Request to authenticate to the server. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_flightsql_pass}.








US East (Ohio) - us-east-2 (AWS)
US West (Oregon) - us-west-2 (AWS)
US East (N. Virginia) - us-east (Azure)





azure_deployment_name
The name of the model deployment. Defaults to the model name.
endpoint
The Azure OpenAI resource endpoint, e.g., https://resource-name.openai.azure.com.
azure_entra_token
The Azure Entra token for authentication.
anthropic_api_key
The Anthropic API key.
endpoint
The Anthropic API base endpoint. Default: https://api.anthropic.com/v1.
models:
- from: anthropic:claude-3-5-sonnet-latest
name: claude_3_5_sonnet
params:
anthropic_api_key: ${ secrets:SPICE_ANTHROPIC_API_KEY }
2024-10-28T15:45:24.220665Z INFO runtime::datafusion: Localpod dataset test_local synchronizing refreshes with parent table test
datasets:
- from: postgres:cleaned_sales_data
name: test
params: ...
acceleration:
enabled: true # This dataset will be accelerated into a DuckDB file
engine: duckdb
mode: file
refresh_check_interval: 10s
- from: localpod:test
name: test_local
acceleration:
enabled: true # This dataset accelerates the parent `test` dataset into in-memory Arrow records and is synchronized with the parent
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient(process.env.API_KEY);
const query = `
SELECT s_suppkey, s_name
FROM tpch.supplier
`
await spiceClient.query(query, (partialData) => {
partialData.toArray().forEach((row) => {
processSupplier(row.toJSON())
});
})
The read_chunk() method on FlightStreamReader returns a FlightStreamChunk, which has a data attribute that is a RecordBatch. Once we have the RecordBatch, we can call to_pandas() on it to return the partial data as a pandas DataFrame. When the stream has ended, calling read_chunk() will raise a StopIteration exception that we can catch.
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
This call will return a pandas DataFrame with all 10,000 suppliers, and is a synchronous call that waits for all data to arrive before returning.
Alternatively, to process chunks of data as they arrive instead of waiting for all data to arrive, FlightStreamReader supports reading chunks of data as they become available with read_chunk(). Using the same query example above, but processing data chunk by chunk:
params.http_url (string, optional): URL of the HTTP endpoint to use
params.flight_url (string, optional): URL of the Flight endpoint to use (default: localhost:50051, using the local Spice Runtime)
Default connection to local Spice Runtime:
Connect to Spice.AI Cloud Platform:
Or using shorthand:
queryText: (string, required): The SQL query to execute
onData: (callback, optional): The callback function that is used for handling streaming data.
query returns an Apache Arrow Table.
To get the data in JSON format, iterate over each row by calling toArray() on the table and call toJSON() on each row.
Get all of the elements for a column by calling getChild(name: string) and then calling toJSON() on the result.
from
The Spice.ai Cloud Platform dataset URI. To query a dataset in a public Spice.ai App, use the format spice.ai/<org>/<app>/datasets/<dataset_name>.
models:
- from: azure:gpt-4o-mini
name: gpt-4o-mini
params:
endpoint: ${ secrets:SPICE_AZURE_AI_ENDPOINT }
azure_api_version: 2024-08-01-preview
azure_deployment_name: gpt-4o-mini
azure_api_key: ${ secrets:SPICE_AZURE_API_KEY }
models:
- from: openai:gpt-4o-mini
name: openai
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
datasets:
- name: taxi_trips
description: NYC taxi trip rides
metadata:
instructions: Always provide citations with reference URLs.
reference_url_template: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_<YYYY-MM>.parquet
columns:
- name: tpep_pickup_time
description: 'The time the passenger was picked up by the taxi'
- name: notes
description: 'Optional notes about the trip'
embeddings:
- from: hf_minilm # A defined Spice Model
chunking:
enabled: true
target_chunk_size: 512
overlap_size: 128
trim_whitespace: true>>> from spicepy import Client
>>> import os
>>> client = Client(os.environ["API_KEY"])
>>> rdr = client.query("SELECT * FROM eth.recent_blocks")
<pyarrow._flight.FlightStreamReader object at 0x1059c9980>
from spicepy import Client
client = Client(os.environ["API_KEY"])
query = """
SELECT s_suppkey, s_name
FROM tpch.supplier
"""
reader = client.query(query)
suppliers = reader.read_pandas()
reader = client.query(query)
has_more = True
while has_more:
try:
flight_batch = reader.read_chunk()
record_batch = flight_batch.data
processChunk(record_batch.to_pandas())
except StopIteration:
has_more = False
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient();
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient({
api_key: 'API_KEY',
http_url: 'https://data.spiceai.io',
flight_url: 'flight.spiceai.io:443'
});
import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient('API_KEY');
const table = await spiceClient.query("SELECT * from tpch.lineitem LIMIT 10")
table.toArray().forEach((row) => {
console.log(row.toJSON());
});
const table = await client.query(
'SELECT sum(l_extendedprice) as sum_extendedprice FROM tpch.lineitem'
);
let sumExtendedPrice = table.getChild("sum_extendedprice");
console.log(sumExtendedPrice?.toJSON())
- from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
- from: spice.ai/spiceai/tpch/datasets/customer
name: tpch.customer
- from: spice.ai/spiceai/tpch/datasets/customer
name: tpch.customer
params:
spiceai_api_key: ${secrets:spiceai_api_key}
acceleration:
enabled: true
The from field must contain a valid URI to the location of a supported file. For example, http://static_username@my-http-api/report.csv.
The dataset name. This will be used as the table name within Spice.
Example:
The connector supports Basic HTTP authentication via param values.
http_port
Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
http_username
Optional. Username to provide connection for HTTP basic authentication. Default: None.
http_password
Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_http_pass}.
client_timeout
Optional. Specifies timeout for HTTP operations. Default value is 30s E.g. client_timeout: 60s
Trace AI chat completion steps and tool interactions to identify why a request isn't responding as expected
Investigate failed queries and other task errors
Track SQL query/tool use execution times
Identify slow-running tasks
Track usage patterns by protocol and dataset
Understand how AI models are using tools to retrieve data from the datasets available to them
The Spice platform provides a built-in UI for visualizing the observability traces that Spice OSS generates.
busy_timeout: Optional. Specifies the duration for the SQLite busy timeout when connecting to the database file. Default: 5000 ms.
Configuration params are provided in the acceleration section of a dataset. Other common acceleration fields can be configured for sqlite; see datasets.
LIMITATIONS
The SQLite accelerator doesn't support advanced grouping features such as ROLLUP and GROUPING.
In SQLite, CAST(value AS DECIMAL) doesn't convert an integer to a floating-point value if the casted value is an integer. Operations like CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL) will be treated as integer division, resulting in 0 instead of the expected 0.5. Use FLOAT to ensure conversion to a floating-point value: CAST(1 AS FLOAT) / CAST(2 AS FLOAT).
Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will cause SQLite accelerator query federation to disable until the Runtime is restarted.
The SQLite accelerator doesn't support Arrow Interval types, as SQLite doesn't have a native interval type.
The SQLite accelerator only supports Arrow List types of primitive data types; lists with structs are not supported.
MEMORY CONSIDERATIONS
When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
perplexity_auth_token
The Perplexity API authentication token.
perplexity_*
Additional Perplexity-specific parameters to use on all requests.
Note: Like other models in Spice, Perplexity can set default overrides for OpenAI parameters. See Parameter Overrides.
models:
- name: webs
from: perplexity:sonar
params:
perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }
If the Spice App is connected to a GitHub repository (learn more about how to connect), the only way to update the Spicepod configuration is to edit the spicepod.yaml file in the root of your repository and push it to the default branch.
To apply the updated spicepod, a new deployment must be triggered. Learn more about Spicepod Deployments.
The app Code tab allows you to explore and preview files in the connected repository.


To transfer an app, click Settings in the app navigation.
In the Danger Zone section of App Settings, click the Transfer app button.
On the Transfer application page, select the New owner organization from the menu.
Type the full app name into the text box to confirm and click Transfer Application to complete the process.
The App will now be accessible by the receiving organization and its members.
Start typing a SQL command, such as SELECT * FROM
As you type, the Query Editor will suggest possible completions based on the query context. You can use the arrow keys or mouse to select a completion, and then press Enter or Tab to insert it into the editor.
Examples of using the SQL suggestions:
Select the spice.runtime.metrics table:
Type SELECT * FROM and press Tab. The editor will suggest spice.runtime.metrics as a possible table. Press Enter to insert it into the query.
Show the fields in the spice.runtime.metrics table:
Type SELECT * FROM spice.runtime.metrics WHERE. The editor will list the fields in the spice.runtime.metrics table along with their type.
The datasets reference displays all available datasets from the current app and allows you to search through them. Clicking on a dataset will insert a sample query into the SQL editor, which will be automatically selected for execution.

Track request volume, data usage, and query time at 1 hour, 24 hours, 7 days and 28 days granularity. Start by going to your app and navigating to the Monitoring tab.
Under the Monitoring tab, select Logs. You can then toggle between Metrics and Logs views.
Within Logs, you have the option to retrieve API requests from the past hour, 8 hours, 24 hours, or up to the previous three days.

Navigate to Settings tab and select Secrets section.
Fill in the Secret Name and Secret Value fields and click Add.
Saved secrets can be referenced in the Spicepod configuration as
${secrets:<SECRET_NAME>}, for example:
To apply secrets, you must initiate a new spicepod deployment. Learn more about deployments.
models:
- from: openai:gpt-4o
name: gpt-4o
params:
openai_api_key: ${secrets:OPENAI_API_KEY}
Organizations enable you to share apps, datasets, users, billing, and settings with your team. Organization administrators can set who has access to their organization's resources and data.
When you create an account on Spice.ai, a single member organization of the same name as your username is created for you and you are automatically added as a member and the owner of the organization.
Spice.ai organizations are created by connecting an existing GitHub organization to Spice.ai.
Click on the organization dropdown icon in the application selector. Next, select the Create Org option from the menu.
Check to accept the terms and conditions for the new organization, then proceed by clicking the Connect GitHub organization button.
A window will pop up from GitHub where you can select the organization to install the Spice.ai app into.
On the confirmation page proceed by clicking the Install button.
Upon successful connection, you will be automatically redirected to the newly created Spice.ai organization.
To view your organizations, click the dropdown icon from the application selector.
All organizations you have access to are listed.
Click on the first tab to access the details of your current organization or select another organization from the menu to view its information. On this page, you will see all the applications that have been created within the selected organization.
Click the Settings tab to view information about the organization, including members and billing information.
To add an existing Spice.ai user to an organization:
Navigate to the organization's settings.
Click the Add Member button.
Enter the Spice.ai username of the user you wish to add to the organization.
Click the Add Member button to confirm.
The user will be added to the organization and they will receive an email notifying them of the new membership.
To invite GitHub user to a Spice.ai organization:
Enter the GitHub username of the user you wish to invite to the organization and select the user from the search results. Only users with a public email address can be invited.
The invited user will receive an invitation link. Once they accept the invitation, they will be granted access to the organization.
To invite anyone by email
Enter the email address of the user you want to invite to the organization, then click Send invite.
To remove a member from an organization:
Navigate to the organization's settings.
Locate the user you wish to remove from the list of members.
Click the ellipsis on the right of the user's card.
Confirm the removal by clicking Remove member from organization.
Using Spice.ai for Agentic AI Applications
Build intelligent autonomous agents that act contextually by grounding AI models in secure, full-knowledge datasets with fast, iterative feedback loops.
Spice.ai helps in building intelligent autonomous agents by leveraging several key features:
Spice.ai enables federated querying across databases, data warehouses, and lakes. With advanced query push-down optimizations, it ensures efficient retrieval and processing of data across disparate sources, reducing latency and operational complexity. Learn more about Federated SQL Query. For practical implementation, refer to the Federated SQL Query recipe.
Spice.ai materializes application-specific datasets close to the point of use, reducing query and retrieval times as well as infrastructure costs. It supports Change Data Capture (CDC), keeping materialized datasets up-to-date with minimal overhead and enabling real-time, reliable data access.
Integrate AI into your applications with Spice.ai's AI Gateway. It supports hosted models like OpenAI and Anthropic and local models such as OSS Llama and NVIDIA NIM. Fine-tuning and model distillation are simplified, enabling faster cycles of development and deployment.
Spice.ai provides advanced search capabilities, including vector similarity search (VSS), enabling efficient retrieval of unstructured data, embeddings, and AI model outputs. This is critical for applications like RAG and intelligent search systems.
Built-in semantic models allow Spice.ai to align AI operations with enterprise data, ensuring that applications are grounded in contextual, full-knowledge datasets. This enhances the accuracy and reliability of AI outputs while reducing the risk of irrelevant or untrustworthy results.
Spice.ai includes robust monitoring and observability tools tailored for AI applications. These tools provide end-to-end visibility into data flows and AI workflows, LLM-specific observability to monitor model performance, track usage, and manage drift, and security and compliance auditing for data and model interactions.
Connect your Spice.ai app to a GitHub repository
Before connecting:
Admin access: Ensure you have administrative access to the GitHub repository. This level of access is required to install the Spice.ai GitHub app.
Matching repository name: The Spice.ai app and the GitHub repository names must match. For example:
Spice.ai app:
GitHub repository:
To quickly set up a new repository, use the template repository as a starting point:
Make sure to copy app spicepod.yaml contents from the Code tab and place it in the root of the repository before linking.
Ensure the repository is set up as per instructions above.
In the context of the Spice app to connect, navigate to Settings, then click the Connect repository button.
Follow GitHub App installation instructions.
Ensure that you select all repositories or specifically the repository you intend to connect.
Finally, link the repository to your Spice.ai app.
Snowflake Data Connector Documentation
The Snowflake Data Connector enables federated SQL queries across datasets in the Snowflake Cloud Data Warehouse.
from
A Snowflake fully qualified table name (database.schema.table). For instance snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM or snowflake:TAXI_DATA."2024".TAXI_TRIPS
name
The dataset name. This will be used as the table name within Spice.
params
The connector supports password-based and key-pair authentication. Login requires the Snowflake account identifier ('orgname-accountname' format).
Limitations
The account identifier does not support the legacy account locator format. Use the 'orgname-accountname' format.
The connector supports password-based and key-pair authentication.
To use DuckDB as Data Accelerator, specify duckdb as the engine for acceleration.
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: duckdb
Spice.ai currently only supports mode: memory for the DuckDB accelerator.
Configuration params are provided in the acceleration section for a data store. Other common acceleration fields can be configured for DuckDB; see datasets.
LIMITATIONS
The DuckDB accelerator does not support nested lists, or structs with nested structs/lists. For example:
Supported:
MEMORY CONSIDERATIONS
When accelerating a dataset using mode: memory (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Dotnet SDK for Spice.ai
The Dotnet SDK spiceai is the easiest way to query Spice.ai from Dotnet.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
.Net 6.0+ or .Net Standard 2.0+
Add Spice SDK
Create a SpiceClient by providing your API key to SpiceClientBuilder. Get your free API key at spice.ai.
Execute a query and get back an Apache Arrow record batch stream.
Iterate through the reader to access the records.
Follow the quickstart to install and run spice locally.
Contribute to or file an issue with the spice-dotnet library at:
Create custom dashboards and visualizations using the FlightSQL or Infinity Grafana Plugins:
Use the for an installation reference.
In the Grafana installation, navigate to Administration -> Plugins and data -> Plugins. Select "State: All" and search for "FlightSQL" in the Search box. Select "FlightSQL" in the list of plugins.
Click the "Install" button to install the plugin. Once installed, a new data source can be added.
Click the "Add new data source" button from the FlightSQL plugin menu.
In the data source menu, provide the following options:
Host:Port - set to flight.spiceai.io:443 to connect to the Spice.ai Cloud Platform
Auth Type - set to token
Token - set to your Spice.ai app API key.
Once these options are set, click "Save & Test". Grafana should report "OK" if the configuration is correct.
Create a visualization using the FlightSQL Spice.AI data source. Select the configured datasource from the list of datasources in the visualization creator.
Create your query using the SQL builder, or switch to a plain SQL editor by clicking the "Edit SQL" button. In this example, the query retrieves the latest query execution times from the configured App instance and displays them as a line graph.
Grafana may not automatically update the visualization when changes are made to the query. To force the visualization to update with new query changes, click the "Query Inspector" button, and click "Refresh".
The Java SDK is the easiest way to query the Spice Cloud Platform from Java.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
This library supports the following Java implementations:
OpenJDK 11
OpenJDK 17
OpenJDK 21
OracleJDK 11
OracleJDK 17
OracleJDK 21
OracleJDK 22
1. Import the package.
2. Create a SpiceClient by providing your API key. Get your free API key at spice.ai.
3. Execute a query and get back a FlightStream.
5. Iterate through the FlightStream to access the records.
Check to learn more.
Follow the quickstart to install and run spice locally.
Or using custom flight address:
Check or to learn more
The SpiceClient implements connection retry mechanism (3 attempts by default). The number of attempts can be configured with withMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the spice-java library at:
Every Spice app is powered by a managed instance of the Spice OSS Runtime deployed to the platform.
A Spicepod is a package that encapsulates application-centric datasets and machine learning (ML) models.
Spicepods are analogous to code packaging systems, like NPM, however differ by expanding the concepts to data and ML models.
A Spicepod is described by a YAML manifest file, typically named spicepod.yaml, which includes the following key sections:
Metadata: Basic information about the Spicepod, such as its name and version.
Datasets: Definitions of datasets that are used or produced within the Spicepod.
Catalogs: Definitions of catalogs that are used within the Spicepod.
Models: Definitions of ML models that the Spicepod manages, including their sources and associated datasets.
Datasets in a Spicepod can be sourced from various locations, including local files or remote databases. They can be materialized and accelerated using different engines such as DuckDB, SQLite, or PostgreSQL to optimize performance ().
Catalogs in a Spicepod can contain multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored.
ML models are integrated into the Spicepod similarly to datasets. The models can be specified using paths to local files or remote locations. ML inference can be performed using the models and datasets defined within the Spicepod.
To learn more, please refer to the full .
FTP/SFTP Data Connector Documentation
FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols used for transferring files between a client and server, with FTP being less secure and SFTP providing encrypted file transfer over SSH.
The FTP/SFTP Data Connector enables federated/accelerated SQL query across files stored in FTP/SFTP servers.
Dremio Data Connector Documentation
Dremio is a data lake engine that enables high-performance SQL queries directly on data lake storage. It provides a unified interface for querying and analyzing data from various sources without the need for complex data movement or transformation.
This connector enables using Dremio as a data source for federated SQL queries.
Configure local acceleration for datasets in Spice for faster queries
Datasets can be locally accelerated by the Spice runtime, pulling data from any connected data source and storing it locally in a data accelerator for faster access. The data can be kept up-to-date in real-time or on a refresh schedule, ensuring users always have the latest data locally for querying.
Dataset acceleration is enabled by setting the acceleration configuration. Spice currently supports In-Memory Arrow, DuckDB, SQLite, and PostgreSQL as accelerators. For engine-specific configuration, see the accelerator documentation.
datasets:
- from: http://my-http-api.com/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
datasets:
- from: http://[email protected]/report.csv
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
name: laion_freesound
datasets:
- from: http://[email protected]/report.csv
name: local_report
params:
http_password: ${env:MY_HTTP_PASS}
datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: sqlite
models:
- name: webs
from: perplexity:sonar
params:
perplexity_auth_token: ${ secrets:SPICE_PERPLEXITY_AUTH_TOKEN }
perplexity_search_domain_filter:
- docs.spiceai.org
- huggingface.co
openai_temperature: 0.3141595
datasets:
- from: snowflake:DATABASE.SCHEMA.TABLE
name: table
params:
snowflake_warehouse: COMPUTE_WH
snowflake_role: accountadmin
SELECT {'x': 1, 'y': 2, 'z': 3}
Unsupported:
SELECT [['duck', 'goose', 'heron'], ['frog', 'toad']]
SELECT {'x': [1, 2, 3]}
The DuckDB accelerator does not support enum, dictionary, or map field types. For example:
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB accelerator does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
Updating a dataset with DuckDB acceleration while the Spice Runtime is running (hot-reload) will cause the DuckDB accelerator query federation to disable until the Runtime is restarted.
































snowflake_private_key_passphrase
Optional, specifies the Snowflake private key passphrase
snowflake_warehouse
Optional, specifies the Snowflake Warehouse to use
snowflake_role
Optional, specifies the role to use for accessing Snowflake data
snowflake_account
Required, specifies the Snowflake account-identifier
snowflake_username
Required, specifies the Snowflake username to use for accessing Snowflake data
snowflake_password
Optional, specifies the Snowflake password to use for accessing Snowflake data
snowflake_private_key_path
Optional, specifies the path to Snowflake private key
<dependency>
<groupId>ai.spice</groupId>
<artifactId>spiceai</artifactId>
<version>0.3.0</version>
<scope>compile</scope>
</dependency>
implementation 'ai.spice:spiceai:0.3.0'
Spice supports three modes to refresh/update locally accelerated data from a connected data source. full is the default mode. Refer to Data Refresh documentation for detailed refresh usage and configuration.
full
Replace/overwrite the entire dataset on each refresh
A table of users
append
Append/add data to the dataset on each refresh
Append-only, immutable datasets, such as time-series or log data
changes
Apply incremental changes
Customer order lifecycle table
Database indexes are essential for optimizing query performance. Configure indexes for accelerators via indexes field. For detailed configuration, refer to the index documentation.
Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.
Constraints are specified using column references in the Spicepod via the primary_key field in the acceleration configuration. Additional unique constraints are specified via the indexes field with the value unique. Data that violates these constraints will result in a conflict. For constraints configuration details, visit Constraints Documentation.
datasets:
- from: snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
name: lineitem
params:
snowflake_warehouse: COMPUTE_WH
snowflake_role: accountadmin
dotnet add package spiceai
using Spice;
var client = new SpiceClientBuilder()
.WithApiKey("API_KEY")
.WithSpiceCloud("http://my_remote_spice_instance:50051")
.Build();
var reader = await client.Query("SELECT * FROM tpch.lineitem LIMIT 10;");
var enumerator = reader.GetAsyncEnumerator();
while (await enumerator.MoveNextAsync())
{
var batch = enumerator.Current;
// Process batch
}
using Spice;
var client = new SpiceClientBuilder()
.Build();
var data = await client.Query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;");
import ai.spice.SpiceClient;
SpiceClient spice = SpiceClient.builder()
.withApiKey(ApiKey)
.withSpiceCloud()
.build();
FlightStream stream = spice.query("SELECT * FROM tpch.lineitem LIMIT 10");
while (stream.next()) {
try (VectorSchemaRoot batches = stream.getRoot()) {
System.out.println(batches.contentToTSVString());
}
}
SpiceClient spice = SpiceClient.builder()
.build();
SpiceClient spice = SpiceClient.builder()
.withFlightAddress(new URI("grpc://my_remote_spice_instance:50051"))
.build();
SpiceClient client = SpiceClient.builder()
.withMaxRetries(5) // Setting to 0 will disable retries
.build();
version: v1beta1
kind: Spicepod
name: my_spicepod
datasets:
- from: spice.ai/spiceai/quickstart
name: qs
acceleration:
enabled: true
refresh_mode: append
models:
- from: openai:gpt-4o
name: gpt-4o
datasets:
- from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 10s
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
acceleration:
refresh_mode: full
refresh_check_interval: 10m
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
indexes:
number: enabled # Index the `number` column
'(hash, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
datasets:
- from: spice.ai/eth.recent_blocks
name: eth.recent_blocks
acceleration:
enabled: true
engine: sqlite
primary_key: hash # Define a primary key on the `hash` column
indexes:
        '(number, timestamp)': unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns

The from field for the ClickHouse connector takes the form of from:db.dataset where db.dataset is the path to the Dataset within ClickHouse. In the example above it would be my.dataset.
If db is not specified in either the from field or the clickhouse_db parameter, it will default to the default database.
The dataset name. This will be used as the table name within Spice.
The ClickHouse data connector can be configured by providing the following params:
clickhouse_connection_string
The connection string to use to connect to the ClickHouse server. This can be used instead of providing individual connection parameters.
clickhouse_host
The hostname of the ClickHouse server.
clickhouse_tcp_port
The port of the ClickHouse server.
clickhouse_db
The name of the database to connect to.
clickhouse_user
The username to connect with.
clickhouse_pass
The password to connect with.
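For illustration, a sketch using the individual connection parameters listed above instead of a connection string (host, port, and credentials are placeholders):

datasets:
  - from: clickhouse:my.dataset
    name: my_dataset
    params:
      clickhouse_host: clickhouse.example.com
      clickhouse_tcp_port: 9000
      clickhouse_db: my_database
      clickhouse_user: my_user
      clickhouse_pass: ${secrets:my_clickhouse_pass}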
fromThe from field takes one of two forms: ftp://<host>/<path> or sftp://<host>/<path> where <host> is the host to connect to and <path> is the path to the file or directory to read from.
If a folder is provided, all child files will be loaded.
The dataset name. This will be used as the table name within Spice.
Example:
file_format
Specifies the data file format. Required if the format cannot be inferred from the from path. See .
ftp_port
Optional, specifies the port of the FTP server. Default is 21. E.g. ftp_port: 21
ftp_user
The username for the FTP server. E.g. ftp_user: my-ftp-user
ftp_pass
The password for the FTP server. Use the to load the password from a secret store, e.g. ${secrets:my_ftp_pass}.
client_timeout
Optional. Specifies timeout for FTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for FTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
file_format
Specifies the data file format. Required if the format cannot be inferred from the from path. See .
sftp_port
Optional, specifies the port of the SFTP server. Default is 22. E.g. sftp_port: 22
sftp_user
The username for the SFTP server. E.g. sftp_user: my-sftp-user
sftp_pass
The password for the SFTP server. Use the to load the password from a secret store, e.g. ${secrets:my_sftp_pass}.
client_timeout
Optional. Specifies timeout for SFTP connection. E.g. client_timeout: 30s. When not set, no timeout will be configured for SFTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
fromThe from field takes the form dremio:dataset where dataset is the fully qualified name of the dataset to read from.
Limitations
Currently, only up to three levels of nesting are supported for dataset names (e.g., a.b.c). Additional levels are not supported at this time.
The dataset name. This will be used as the table name within Spice.
Example:
dremio_endpoint
The endpoint used to connect to the Dremio server.
dremio_username
The username used to connect to the Dremio endpoint.
dremio_password
The password used to connect to the Dremio endpoint. Use the to load the password from a secret store, e.g. ${secrets:my_dremio_pass}.
The table below shows the Dremio data types supported, along with the type mapping to Apache Arrow types in Spice.
INT
Int32
BIGINT
Int64
FLOAT
Float32
DOUBLE
Float64
DECIMAL
Decimal128
VARCHAR
Utf8
Limitations
Dremio connector does not support queries with the EXCEPT and INTERSECT keywords in Spice REPL. Use DISTINCT and IN/NOT IN instead. See the example below.
Python 3.11+
The following packages are required and will be automatically installed by pip:
pyarrow
pandas
certifi
requests
Install the spicepy package directly from the Spice Github Repository at https://github.com/spiceai/spicepy:
Import spicepy and create a Client by providing your API Key.
You can then submit queries using the query function.
Querying data is done through a Client object that initializes the connection with the Spice.ai endpoint. Client has the following arguments:
api_key (string, optional): Spice.ai API key to authenticate with the endpoint.
url (string, optional): URL of the endpoint to use (default: grpc+tls://flight.spiceai.io)
tls_root_cert (Path or string, optional): Path to the TLS certificate to use for the secure connection (omit for automatic detection)
Once a Client is obtained queries can be made using the query() function. The query() function has the following arguments:
query (string, required): The SQL query.
timeout (int, optional): The timeout in seconds.
A custom timeout can be set by passing the timeout parameter in the query function call. If no timeout is specified, the query defaults to a 10-minute timeout, after which it is cancelled and a TimeoutError exception is raised.
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spicepy library at: https://github.com/spiceai/spicepy
Instructions for using machine learning models hosted on HuggingFace with Spice.
To use a model hosted on HuggingFace, specify the huggingface.co path in the from field and, when needed, the files to include.
fromThe from key takes the form of huggingface:model_path. Two common forms of the from key are shown below.
huggingface:username/modelname: Implies the latest version of modelname hosted by username.
huggingface:huggingface.co/username/modelname:revision: Specifies a particular revision of modelname by username, including the optional domain.
The from key must match the regular expression below and consists of five components:
Prefix: The value must start with huggingface:.
Domain (Optional): Optionally includes huggingface.co/ immediately after the prefix. Currently no other Huggingface compatible services are supported.
Organization/User: The HuggingFace organization (org).
nameThe model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.
paramsfilesThe specific file path for Huggingface model. For example, GGUF model formats require a specific file path, other varieties (e.g. .safetensors) are inferred.
Access tokens can be provided for Huggingface models in two ways:
In the Huggingface token cache (i.e. ~/.cache/huggingface/token). Default.
Via .
For more details on authentication, see .
Limitations
The throughput, concurrency & latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports and for accelerated inference.
ML models currently only support ONNX file format.
Instructions for using language models hosted on OpenAI or compatible services with Spice.
To use a language model hosted on OpenAI (or compatible), specify the openai path in the from field.
For a specific model, include it as the model ID in the from field (see example below). The default model is gpt-4o-mini.
fromThe from field takes the form openai:model_id where model_id is the model ID of the OpenAI model, valid model IDs are found in the {endpoint}/v1/models API response.
Example:
nameThe model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models). This can be set to the same value as the model ID in the from field.
paramsSee for additional configuration options.
Spice supports several OpenAI compatible providers. Specify the appropriate endpoint in the params section.
Follow instructions.
Groq provides OpenAI compatible endpoints. Use the following configuration:
NVidia NIM models are OpenAI compatible endpoints. Use the following configuration:
View the Spice cookbook for an example of setting up NVidia NIM with Spice .
Parasail also offers OpenAI compatible endpoints. Use the following configuration:
Refer to the respective provider documentation for more details on available models and configurations.
Spice Machine Learning (ML) Models
Spice Models enable the training and use of ML models natively on the Spice platform.
The platform currently supports time-series forecasting models, with other categories of models planned.
Hosted models have first-class access to co-located data for training and inferencing including: Spice managed datasets, user managed datasets, and custom datasets and views. Additionally, Spice Firecache can be leveraged to train and infer up to 10x faster.
Models are defined using a YAML file. Model details such as data requirements, architecture, training parameters, and other important hyperparameters are defined in the model.yaml.
Add a model.yaml file to the repository path /models/[model_name]/model.yaml of a , replacing [model_name] with the desired model name. For example, the uses the path /models/gas-fees/model.yaml.
Refer to the for all available configuration options.
For example model manifests, see the .
In the , navigate to the Models tab of the Spice app.
model.yaml files committed to the connected repository will be automatically detected and imported as Spice Models.
Navigating to a specific Model will show detailed information as defined in the model.yaml.
A training run can be started using the Train button.
Training runs in progress will be shown and updated, along with historical training runs.
The Training Status will be updated to Complete for successfully completed training runs. Details and the Training Report are available on the Training Run page.
A successfully trained model can be used to make predictions.
The lookback data (inferencing data) is automatically provided by the platform and wired up to the inference, enabling a prediction to be made using a simple API call.
Navigate to AI Predictions in the Playground.
Successfully trained models will be available for selection from the model selector drop down on the right.
Clicking Predict will demonstrate calling the predictions API using lookback data within the Spice platform. A graph of the predicted value(s) along with the lookback data will be displayed.
The Training Runs page provides training details including a copyable curl command to make a prediction from the command line.
For details on the API, see .
Microsoft SQL Server Data Connector
Microsoft SQL Server is a relational database management system developed by Microsoft.
The Microsoft SQL Server Data Connector enables federated/accelerated SQL queries on data stored in MSSQL databases.
Limitations
The connector supports SQL Server authentication (SQL Login and Password) only.
Spatial types (geography) are not supported, and columns with these types will be ignored.
fromThe from field takes the form mssql:database.schema.table where database.schema.table is the fully-qualified table name in the SQL server.
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsThe data connector supports the following params. Use the to load the secret from a secret store, e.g. ${secrets:my_mssql_conn_string}.
SQL Query Apache Arrow Flight API
SQL query results can be served via a high-performance Apache Arrow Flight endpoint. Arrow Flight uses the gRPC protocol for efficient data transfer.
This setup enables high-speed access to your data in Python, Go, C++, C#, Rust, and Java, and makes it easy to use libraries like Pandas and NumPy.
It's recommended to use the Spice.ai SDKs to connect and query the Arrow Flight endpoint. SDKs are available for , , , , , and .
In Python, query results can be easily converted to Pandas or NumPy formats.
You may also use Apache's pyarrow library directly.
Endpoint URL: grpc+tls://flight.spiceai.io
Basic Authentication:
Username can be set to an empty string
Dataset names must be fully-qualified. For example spiceai.quickstart
Find code samples in Python in .
If you get this error:
Could not get default pem root certs
Install the .
Golang SDK for Spice.ai
The Go SDK gospice is the easiest way to query Spice.ai from Go.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
GoDocs are available at: pkg.go.dev/github.com/spiceai/gospice.
(or later)
Get the gospice package.
1. Import the package.
2. Create a SpiceClient by providing your API key. Get your free API key at .
3. Initialize the SpiceClient.
4. Execute a query and get back an .
5. Iterate through the reader to access the records.
Follow the to install and run spice locally.
Or using custom flight address:
Check to learn more.
Run go run . to execute a sample query and print the results to the console.
The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via SetMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the gospice library at:
DuckDB Data Connector Documentation
DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed for analytical query workloads. It is optimized for fast execution and can be embedded directly into applications, providing efficient data processing without the need for a separate database server.
This connector supports DuckDB as a data source for federated SQL queries.
How to use the Spice.ai for GitHub Copilot Extension
The Spice.ai for GitHub Copilot Extension makes it easy to access and chat with external data in GitHub Copilot, enhancing AI-assisted research, Q&A, code, and documentation suggestions for greater accuracy.
Access structured and unstructured data from any Spice data source like GitHub, PostgreSQL, MySQL, Snowflake, Databricks, GraphQL, data lakes (S3, Delta Lake, OneLake), HTTP(s), SharePoint, and even FTP.
Some example prompts:
@spiceai What documentation is relevant to this file?
datasets:
- from: clickhouse:my.dataset
    name: my_dataset

datasets:
- from: clickhouse:my.dataset
    name: cool_dataset

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: clickhouse:my.dataset
name: my_dataset
params:
clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database
connection_timeout: 10000
      clickhouse_secure: true

datasets:
- from: clickhouse:my.dataset
name: my_dataset
params:
      clickhouse_connection_string: tcp://my_user:${secrets:my_clickhouse_pass}@host/my_database?connection_timeout=10000&secure=true

datasets:
- from: sftp://remote-sftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
sftp_port: 22
sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}

datasets:
- from: sftp://remote-sftp-server.com/path/to/folder/
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: ftp://remote-ftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
ftp_user: my-ftp-user
ftp_pass: ${secrets:my_ftp_password}
    hive_partitioning_enabled: false

- from: sftp://remote-sftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
sftp_port: 22
sftp_user: my-sftp-user
sftp_pass: ${secrets:my_sftp_password}
    hive_partitioning_enabled: false

# fail
SELECT ws_item_sk FROM web_sales
INTERSECT
SELECT ss_item_sk FROM store_sales;
# success
SELECT DISTINCT ws_item_sk FROM web_sales
WHERE ws_item_sk IN (
SELECT DISTINCT ss_item_sk FROM store_sales
);
# fail
SELECT ws_item_sk FROM web_sales
EXCEPT
SELECT ss_item_sk FROM store_sales;
# success
SELECT DISTINCT ws_item_sk FROM web_sales
WHERE ws_item_sk NOT IN (
SELECT DISTINCT ss_item_sk FROM store_sales
);

- from: dremio:datasets.dremio_dataset
name: dremio_dataset
params:
dremio_endpoint: grpc://127.0.0.1:32010
dremio_username: demo
    dremio_password: ${secrets:my_dremio_pass}

datasets:
- from: dremio:datasets.dremio_dataset
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: dremio:datasets.dremio_dataset
name: dremio_dataset
params:
dremio_endpoint: grpc://127.0.0.1:32010
dremio_username: demo
    dremio_password: ${secrets:my_dremio_pass}

pip install git+https://github.com/spiceai/[email protected]

from spicepy import Client
client = Client('API_KEY')
data = client.query('SELECT * FROM eth.recent_blocks LIMIT 10;', timeout=5*60)
pd = data.read_pandas()

from spicepy import Client
client = Client()
data = client.query('SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;', timeout=5*60)
pd = data.read_pandas()

models:
- from: openai:gpt-4o-mini
name: openai_model
params:
openai_api_key: ${ secrets:OPENAI_API_KEY } # Required for official OpenAI models
tools: auto # Optional. Connect the model to datasets via SQL query/vector search tools
system_prompt: "You are a helpful assistant." # Optional.
# Optional parameters
endpoint: https://api.openai.com/v1 # Override to use a compatible provider (i.e. NVidia NIM)
openai_org_id: ${ secrets:OPENAI_ORG_ID }
openai_project_id: ${ secrets:OPENAI_PROJECT_ID }
# Override default chat completion request parameters
openai_temperature: 0.1
    openai_response_format: { "type": "json_object" }

clickhouse_secure
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.
false: This mode will not attempt to use an SSL connection, even if the server supports it.
connection_timeout
Optional. Specifies the connection timeout in milliseconds.
brew install --cask miniforge

conda init "$(basename "${SHELL}")"

conda install pyarrow pandas

VARBINARY
Binary
BOOL
Boolean
DATE
Date64
TIME
Time32
TIMESTAMP
Timestamp(Millisecond, None)
INTERVAL
Interval
LIST
List
STRUCT
Struct
MAP
Map

model.yaml files automatically detected and imported in the Portal.




Model Name: After a /, the model name (model).
Revision (Optional): A colon (:) followed by the git-like revision identifier (revision).
hf_token
The Huggingface access token.
-
model_type
The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseekv3
-
tools
Which [tools] should be made available to the model. Set to auto to use all available tools.
-
system_prompt
An additional system prompt used for all chat completions to this model.
-
mssql_encrypt
(Optional) Specifies whether encryption is required for the connection.
true: (default) This mode requires an SSL connection. If a secure connection cannot be established, server will not connect.
false: This mode will not attempt to use an SSL connection, even if the server supports it. Only the login procedure is encrypted.
mssql_trust_server_certificate
(Optional) Specifies whether the server certificate should be trusted without validation when encryption is enabled.
true: The server certificate will not be validated and it is accepted as-is.
false: (default) Server certificate will be validated against system's certificate storage.
mssql_connection_string
The ADO connection string to use to connect to the server. This can be used instead of providing individual connection parameters.
mssql_host
The hostname or IP address of the Microsoft SQL Server instance.
mssql_port
(Optional) The port of the Microsoft SQL Server instance. Default value is 1433.
mssql_database
(Optional) The name of the database to connect to. The default database (master) will be used if not specified.
mssql_username
The username for the SQL Server authentication.
mssql_password
The password for the SQL Server authentication.
$PWD is a bash-specific variable that will be replaced by the current directory path. You can download the certificate file isrgrootx1.pem to a specific location and provide that path instead of $PWD.

Add Spice SDK
1. Create a SpiceClient by providing your API key to ClientBuilder. Get your free API key at spice.ai.
2. Execute a query and get back an Apache Arrow Flight Record Batch Stream.
3. Iterate through the reader to access the records.
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spice-rs library at: https://github.com/spiceai/spice-rs
\A(huggingface:)(huggingface\.co\/)?(?<org>[\w\-]+)\/(?<model>[\w\-]+)(:(?<revision>[\w\d\-\.]+))?\z

models:
- from: huggingface:huggingface.co/lmstudio-community/Qwen2.5-Coder-3B-Instruct-GGUF
name: sloth-gguf
files:
      - path: Qwen2.5-Coder-3B-Instruct-Q3_K_L.gguf

models:
- name: llama_3.2_1B
from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
params:
      hf_token: ${ secrets:HF_TOKEN }

models:
- from: huggingface:huggingface.co/spiceai/darts:latest
name: hf_model
files:
- path: model.onnx
datasets:
      - taxi_trips

models:
- from: huggingface:huggingface.co/microsoft/Phi-3.5-mini-instruct
    name: phi

models:
- name: llama_3.2_1B
from: huggingface:huggingface.co/meta-llama/Llama-3.2-1B
params:
      hf_token: ${ secrets:HF_TOKEN }

datasets:
- from: mssql:path.to.my_dataset
name: my_dataset
params:
      mssql_connection_string: ${secrets:mssql_connection_string}

datasets:
- from: mssql:path.to.my_dataset
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: mssql:SalesLT.Customer
name: customer
params:
mssql_host: mssql-host.database.windows.net
mssql_database: my_catalog
mssql_username: my_user
mssql_password: ${secrets:mssql_pass}
mssql_encrypt: true
      mssql_trust_server_certificate: true

curl -Lo isrgrootx1.pem https://letsencrypt.org/certs/isrgrootx1.pem
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH="$PWD/isrgrootx1.pem"

@powershell -NoProfile -ExecutionPolicy unrestricted -Command ^
(new-object System.Net.WebClient).Downloadfile( ^
'https://letsencrypt.org/certs/isrgrootx1.pem', 'isrgrootx1.pem')
set GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=%cd%\isrgrootx1.pem

go get github.com/spiceai/gospice/v7

import "github.com/spiceai/gospice/v7"

spice := gospice.NewSpiceClient()
defer spice.Close()

if err := spice.Init(
spice.WithApiKey(ApiKey),
spice.WithSpiceCloudAddress()
); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

reader, err := spice.Query(context.Background(), "SELECT * FROM tpch.lineitem LIMIT 10")
if err != nil {
panic(fmt.Errorf("error querying: %w", err))
}
defer reader.Release()

for reader.Next() {
record := reader.Record()
defer record.Release()
fmt.Println(record)
}

spice := gospice.NewSpiceClient()
defer spice.Close()
if err := spice.Init(); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

spice := gospice.NewSpiceClient()
defer spice.Close()
if err := spice.Init(
spice.WithFlightAddress("grpc://localhost:50052")
); err != nil {
panic(fmt.Errorf("error initializing SpiceClient: %w", err))
}

spice := gospice.NewSpiceClient()
spice.SetMaxRetries(5) // Setting to 0 will disable retries

cargo add spiceai

use spiceai::ClientBuilder;
#[tokio::main]
async fn main() {
let client = ClientBuilder::new()
.api_key("API_KEY")
.use_spiceai_cloud()
.build()
.await
.unwrap();
}

let mut flight_data_stream = client.query("SELECT * FROM tpch.lineitem LIMIT 10;").await.expect("Error executing query");

while let Some(batch) = flight_data_stream.next().await {
match batch {
Ok(batch) => {
/* process batch */
println!("{:?}", batch)
},
Err(e) => {
/* handle error */
},
};
}

use spiceai::ClientBuilder;
#[tokio::main]
async fn main() {
let client = ClientBuilder::new()
.build()
.await
.unwrap();
let data = client.query("SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;").await;
}

openai_org_id
The OpenAI organization ID.
-
openai_project_id
The OpenAI project ID.
-
openai_temperature
Set the default temperature to use on chat completions.
-
openai_response_format
An object specifying the format that the model must output, see .
-
endpoint
The OpenAI API base endpoint. Can be overridden to use a compatible provider (i.e. Nvidia NIM).
https://api.openai.com/v1
tools
Which tools should be made available to the model. Set to auto to use all available tools.
-
system_prompt
An additional system prompt used for all chat completions to this model.
-
openai_api_key
The OpenAI API key.
-
sharepoint_client_id
Yes
The client ID of the Azure AD (Entra) application
sharepoint_tenant_id
Yes
The tenant ID of the Azure AD (Entra) application.
sharepoint_client_secret
Optional
For service principal authentication. The client secret of the Azure AD (Entra) application.
The from field in a SharePoint dataset takes the following format:
drive_type in a SharePoint Connector from field supports the following types:
drive
The SharePoint drive's name
from: sharepoint:drive:Documents/...
driveId
The SharePoint drive's ID
from: sharepoint:driveId:b!Mh8opUGD80ec7zGXgX9r/...
site
A SharePoint site's name
from: sharepoint:site:MySite/...
siteId
A SharePoint site's ID
For a name-based drive_id, the connector will attempt to resolve the name to an ID at startup.
Within a drive, the SharePoint connector can load documents from:
The root of the drive
from: sharepoint:me/root
A specific path within the drive
from: sharepoint:drive:Documents/path:/top_secrets
A specific folder ID
from: sharepoint:group:MyGroup/id:01QM2NJSNHBISUGQ52P5AJQ3CBNOXDMVNT
To use the SharePoint connector with service principal authentication, you will need to create an Azure AD application and grant it the necessary permissions. This will also support OAuth2 authentication for users within the tenant (i.e. sharepoint_bearer_token).
Create a new Azure AD application in the Azure portal.
Under the application's API permissions, add the following permissions: Sites.Read.All, Files.Read.All, User.Read, GroupMember.Read.All
For service principal authentication, Application permissions are required.
For user authentication, only delegated permissions are required.
Add sharepoint_client_id (from the Application (Client) ID field) and sharepoint_tenant_id to the connector configuration.
Under the application's Certificates & secrets, create a new client secret. Use this for the sharepoint_client_secret parameter.
fromThe from field supports one of two forms:
from
Description
duckdb:database.schema.table
Read data from a table named database.schema.table in the DuckDB file
duckdb:*
Read data using any DuckDB function that produces a table. For example one of the functions such as read_json, read_parquet or read_csv.
The dataset name. This will be used as the table name within Spice.
Example:
The DuckDB data connector can be configured by providing the following params:
duckdb_open
The name of the DuckDB database to open.
Configuration params are provided either in the top level dataset for a dataset source, or in the acceleration section for a data store.
A generic example of DuckDB data connector configuration.
Common data import DuckDB functions can also define datasets. Instead of a fixed table reference (e.g. database.schema.table), a DuckDB function is provided in the from: key. For example
Datasets created from DuckDB functions are similar to a standard SELECT query. For example:
is equivalent to:
Many DuckDB data imports can be rewritten as DuckDB functions, making them usable as Spice datasets. For example:
Limitations
The DuckDB connector does not support enum, dictionary, or map field types. For example:
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB connector does not support Decimal256 (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
@spiceai Write documentation about the user authentication issue
@spiceai Who are the top 5 committers to this repository?
@spiceai What are the latest error logs from my web app?
To install the extension, visit the GitHub Marketplace and search for Spice.ai.
Scroll down, and click Install it for free.
Once installed, open Copilot Chat and type @spiceai. Press enter.
A prompt will appear to connect to the Spice.ai Cloud Platform.
You will need to authorize the extension. Click Authorize spiceai.
To create an account on the Spice.ai Cloud Platform, click Authorize Spice AI Platform.
Once your account is created, you can configure the extension. Select from a set of ready-to-use datasets to get started. You can configure other datasets after setup.
The extension will take up to 30 seconds to deploy and load the initial dataset.
When complete, proceed back to GitHub Copilot Chat.
To chat with the Spice.ai for GitHub Copilot extension, prefix the message with @spiceai
To list the datasets available to Copilot, try @spiceai What datasets do I have access to?

To use PostgreSQL as Data Accelerator, specify postgres as the engine for acceleration.
The connection to PostgreSQL can be configured by providing the following params:
pg_host: The hostname of the PostgreSQL server.
pg_port: The port of the PostgreSQL server.
pg_db: The name of the database to connect to.
pg_user: The username to connect with.
pg_pass: The password to connect with. Use the to load the password from a secret store, e.g. ${secrets:my_pg_pass}.
pg_sslmode: Optional. Specifies the SSL/TLS behavior for the connection, supported values:
verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
verify-ca: This mode requires a TLS connection and a valid root certificate.
pg_sslrootcert: Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
connection_pool_size: Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
Configuration params are provided in the acceleration section of a dataset.
The table below lists the supported Apache Arrow data types and their mappings to PostgreSQL types when stored in a PostgreSQL accelerator.
LIMITATIONS
Postgres federated queries may return unexpected result types due to differences between DataFusion and Postgres type-widening rules. Explicitly specify the expected output type of aggregation functions when writing queries that involve Postgres tables in Spice. For example, rewrite SUM(int_col) as CAST(SUM(int_col) AS BIGINT).
Delta Lake Data Connector Documentation
The Delta Lake Data Connector enables SQL queries on Delta Lake tables.
fromThe from field for the Delta Lake connector takes the form of delta_lake:path where path is any supported path, either local or to a cloud storage location. See the section below.
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsUse the to reference a secret, e.g. ${secrets:aws_access_key_id}.
The table below shows the Delta Lake data types supported, along with the type mapping to Apache Arrow types in Spice.
Delta Lake connector does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Delta Lake connector with such tables, drop the V2Checkpoint feature by executing the following command:
For more details on dropping Delta table features, refer to the official documentation:
Learn how to use Data Connector to query external data.
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Supported Data Connectors include:
databricks (mode: delta_lake)
Databricks
S3/Delta Lake
delta_lake
Delta Lake
Delta Lake
For data connectors that are object store compatible, if a folder is provided, the file format must be specified with params.file_format.
If a file is provided, the file format will be inferred, and params.file_format is unnecessary.
File formats currently supported are:
File formats support additional parameters in the params (like csv_has_header) described in
If a format is a document format, each file will be treated as a document, as per below.
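For illustration, a sketch of the folder-versus-file distinction using the S3 connector (bucket and paths are placeholders):

datasets:
  - from: s3://my-bucket/data/                 # folder: file_format must be specified
    name: folder_dataset
    params:
      file_format: parquet
  - from: s3://my-bucket/data/file.parquet     # single file: format inferred from the extension
    name: file_dataset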
The Node.js SDK spice.js is the easiest way to use and query Spice.ai with Node.js.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to JavaScript objects/arrays or JSON.
Import SpiceClient and instantiate a new instance with an API Key.
You can then submit queries using the query function.
SpiceClient has the following arguments:
apiKey (string, required): API key to authenticate with the endpoint.
url (string, optional): URL of the endpoint to use (default: flight.spiceai.io:443)
sqlJson(query: string) - Execute SQL queries with JSON results

The sqlJson() method executes SQL queries and returns results in JSON format with schema information.
The response includes:
row_count: Number of rows returned
schema: Schema information with field names and types
data: Array of row objects
Follow the to install and run spice locally.
Check to learn more.
The SpiceClient implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via setMaxRetries:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429).
Contribute to or file an issue with the spice.js library at: .
S3 Data Connector Documentation
The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
If a folder path is specified as the dataset source, all files within the folder will be loaded.
File formats are specified using the file_format parameter, as described in .
Azure BlobFS Data Connector Documentation
The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (abfss://) and Azure Data Lake (adl://) endpoints.
When a folder path is provided, all the contained files will be loaded.
File formats are specified using the file_format parameter, as described in .
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models{
"object": "list",
"data": [
{
"id": "gpt-4o-mini",
"object": "model",
"created": 1727389042,
"owned_by": "system"
},
...
}models:
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }models:
- from: openai:my_nim_model_id
name: my_nim_model
params:
endpoint: https://my_nim_host.com/v1
openai_api_key: ${ secrets:SPICE_NIM_API_KEY }models:
- from: openai:parasail-model-id
name: parasail_model
params:
endpoint: https://api.parasail.com/v1
openai_api_key: ${ secrets:SPICE_PARASAIL_API_KEY }datasets:
- from: sharepoint:drive:Documents/path:/top_secrets/
name: important_documents
params:
sharepoint_client_id: ${secrets:SPICE_SHAREPOINT_CLIENT_ID}
sharepoint_tenant_id: ${secrets:SPICE_SHAREPOINT_TENANT_ID}
      sharepoint_client_secret: ${secrets:SPICE_SHAREPOINT_CLIENT_SECRET}

SELECT * FROM important_documents limit 1

[
{
"created_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
"created_by_name": "Jack Eadie",
"created_at": "2024-09-09T04:57:00",
"c_tag": "\"c:{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
"e_tag": "\"{BD4D130F-2C95-4E59-9F93-85BD0A9E1B19},1\"",
"id": "01YRH3MPAPCNG33FJMLFHJ7E4FXUFJ4GYZ",
"last_modified_by_id": "cbccd193-f9f1-4603-b01d-ff6f3e6f2108",
"last_modified_by_name": "Jack Eadie",
"last_modified_at": "2024-09-09T04:57:00",
"name": "ngx_google_perftools_module.md",
"size": 959,
"web_url": "https://spiceai.sharepoint.com/Shared%20Documents/md/ngx_google_perftools_module.md",
"content": "# Module ngx_google_perftools_module\n\nThe `ngx_google_perftools_module` module (0.6.29) enables profiling of nginx worker processes using [Google Performance Tools](https://github.com/gperftools/gperftools). The module is intended for nginx developers.\n\nThis module is not built by default, it should be enabled with the `--with-google_perftools_module` configuration parameter.\n\n> **Note:** This module requires the [gperftools](https://github.com/gperftools/gperftools) library.\n\n## Example Configuration\n\n```nginx\ngoogle_perftools_profiles /path/to/profile;\n```\n\nProfiles will be stored as `/path/to/profile.<worker_pid>`.\n\n## Directives\n\n### google_perftools_profiles\n\n- **Syntax:** `google_perftools_profiles file;`\n- **Default:** —\n- **Context:** `main`\n\nSets a file name that keeps profiling information of nginx worker process. The ID of the worker process is always a part of the file name and is appended to the end of the file name, after a dot.\n"
}
]

from: 'sharepoint:<drive_type>:<drive_id>/<subpath_type>:<subpath_value>'

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
      duckdb_open: path/to/duckdb_file.duckdb

datasets:
- from: duckdb:database.schema.table
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
      duckdb_open: path/to/duckdb_file.duckdb

datasets:
- from: duckdb:sample_data.nyc.rideshare
name: nyc_rideshare
params:
      duckdb_open: /my/path/my_database.db

datasets:
- from: duckdb:database.schema.table
name: my_dataset
params:
duckdb_open: path/to/duckdb_file.duckdb
- from: duckdb:read_csv('test.csv', header = false)
    name: from_function

datasets:
  - from: duckdb:read_csv('test.csv', header = false)

-- from_function
SELECT * FROM read_csv('test.csv', header = false);

SELECT * FROM 'todos.json';
-- As a DuckDB function
SELECT * FROM read_json('todos.json');

datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
      engine: postgres

datasets:
- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
name: my_delta_lake_table
params:
delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
      delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}

from: sharepoint:siteId:b!Mh8opUGD80ec7zGXgX9r/...
group
A SharePoint group's name
from: sharepoint:group:MyGroup/...
groupId
A SharePoint group's ID
from: sharepoint:groupId:b!Mh8opUGD80ec7zGXgX9r/...
me
A user's OneDrive
from: sharepoint:me/...

JSON
file_format: json
Roadmap
❌
Microsoft Excel
file_format: xlsx
Roadmap
❌
Markdown
file_format: md
✅
✅
Text
file_format: txt
✅
✅
PDF
file_format: pdf
Alpha
✅
Microsoft Word
file_format: docx
Alpha
✅
dremio
Dremio
Arrow Flight
duckdb
DuckDB
Embedded
github
GitHub
GitHub API
postgres
PostgreSQL
s3
S3
Parquet, CSV
mysql
MySQL
delta_lake
Delta Lake
Delta Lake
graphql
GraphQL
JSON
databricks (mode: spark_connect)
Databricks
Spark Connect
flightsql
FlightSQL
Arrow Flight SQL
mssql
Microsoft SQL Server
Tabular Data Stream (TDS)
snowflake
Snowflake
Arrow
spark
Spark
Spark Connect
spice.ai
Spice.ai
Arrow Flight
iceberg
Apache Iceberg
Parquet
abfs
Azure BlobFS
Parquet, CSV
clickhouse
Clickhouse
debezium
Debezium CDC
Kafka + JSON
dynamodb
DynamoDB
ftp, sftp
FTP/SFTP
Parquet, CSV
http, https
HTTP(s)
Parquet, CSV
sharepoint
Microsoft SharePoint
Unstructured UTF-8 documents
file_format: parquet
✅
❌
file_format: csv
✅
❌
file_format: iceberg
Roadmap
❌
require: This mode requires a TLS connection.
prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
disable: This mode will not attempt to use a TLS connection, even if the server supports it.
UInt8
TinyUnsigned
smallint
UInt16
SmallUnsigned
smallint
UInt32
Unsigned
bigint
UInt64
BigUnsigned
numeric
Decimal128 / Decimal256
Decimal
decimal
Float32
Float
real
Float64
Double
double precision
Utf8 / LargeUtf8
Text
text
Boolean
Boolean
bool
Binary / LargeBinary
VarBinary
bytea
FixedSizeBinary
Binary
bytea
Timestamp (no Timezone)
Timestamp
timestamp without time zone
Timestamp (with Timezone)
TimestampWithTimeZone
timestamp with time zone
Date32 / Date64
Date
date
Time32 / Time64
Time
time
Interval
Interval
interval
Duration
BigInteger
bigint
List / LargeList / FixedSizeList
Array
array
Struct
N/A
Composite (Custom type)
Int8
TinyInteger
smallint
Int16
SmallInteger
smallint
Int32
Integer
integer
Int64
BigInteger
bigint
Double
Float64
Boolean
Boolean
Binary
Binary
Date
Date32
Timestamp
Timestamp(Microsecond, Some("UTC"))
TimestampNtz
Timestamp(Microsecond, None)
Decimal
Decimal128
Array
List
Struct
Struct
Map
Map
client_timeout
Optional. Specifies timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s
delta_lake_aws_region
Optional. The AWS region for the S3 object store. E.g. us-west-2.
delta_lake_aws_access_key_id
The access key ID for the S3 object store.
delta_lake_aws_secret_access_key
The secret access key for the S3 object store.
delta_lake_aws_endpoint
Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.
delta_lake_azure_storage_account_name
The Azure Storage account name.
delta_lake_azure_storage_account_key
The Azure Storage master key for accessing the storage account.
delta_lake_azure_storage_client_id
The service principal client id for accessing the storage account.
delta_lake_azure_storage_client_secret
The service principal client secret for accessing the storage account.
delta_lake_azure_storage_sas_key
The shared access signature key for accessing the storage account.
delta_lake_azure_storage_endpoint
Optional. The endpoint for the Azure Blob storage account.
google_service_account
Filesystem path to the Google service account JSON key file.
String
Utf8
Long
Int64
Integer
Int32
Short
Int16
Byte
Int8
Float
Float32
execution_time_ms: Query execution time in milliseconds

npm install @spiceai/spice@latest --save

yarn add @spiceai/spice

datasets:
- from: spice.ai:path.to.my_dataset
name: my_dataset
acceleration:
engine: postgres
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
        pg_sslmode: require

datasets:
- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/
name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

- from: delta_lake:/path/to/local/delta/table # A local filesystem path to a Delta Lake table
  name: my_delta_lake_table

- from: delta_lake:s3://my_bucket/path/to/s3/delta/table/ # A reference to a table in S3
name: my_delta_lake_table
params:
delta_lake_aws_region: us-west-2 # Optional
delta_lake_aws_access_key_id: ${secrets:aws_access_key_id}
delta_lake_aws_secret_access_key: ${secrets:aws_secret_access_key}
    delta_lake_aws_endpoint: s3.us-west-2.amazonaws.com # Optional

- from: delta_lake:abfss://my_container@my_account.dfs.core.windows.net/path/to/azure/delta/table/ # A reference to a table in Azure Blob
name: my_delta_lake_table
params:
# Account Name + Key
delta_lake_azure_storage_account_name: my_account
delta_lake_azure_storage_account_key: ${secrets:my_key}
# OR Service Principal + Secret
delta_lake_azure_storage_client_id: my_client_id
delta_lake_azure_storage_client_secret: ${secrets:my_secret}
# OR SAS Key
    delta_lake_azure_storage_sas_key: my_sas_key

params:
  delta_lake_google_service_account_path: /path/to/service-account.json

ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];

import { SpiceClient } from "@spiceai/spice";
const spiceClient = new SpiceClient("API_KEY");
const table = await spiceClient.sql(
'SHOW TABLES;'
);
console.table(table.toArray());

const result = await spiceClient.sqlJson('SELECT name, age FROM users LIMIT 5');
console.log(`Returned ${result.row_count} rows`);
console.log('Schema:', result.schema);
console.log('Data:', result.data);
console.log(`Query took ${result.execution_time_ms}ms`);
// Access individual rows
result.data.forEach((row) => {
console.log(`${row.name} is ${row.age} years old`);
});

import { SpiceClient } from '@spiceai/spice';
const main = async () => {
// uses connection to local runtime by default
const spiceClient = new SpiceClient();
// or use custom connection params:
// const spiceClient = new SpiceClient({
// httpUrl: 'http://my_spice_http_host',
// flightUrl: 'my_spice_flight_host',
// });
const table = await spiceClient.sql(
'SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;'
);
console.table(table.toArray());
};
main();

const spiceClient = new SpiceClient('API_KEY');
spiceClient.setMaxRetries(5); // Setting to 0 will disable retries

fromS3-compatible URI to a folder or file, in the format s3://<bucket>/<path>
Example: from: s3://my-bucket/path/to/file.parquet
The dataset name. This will be used as the table name within Spice.
Example:
file_format
Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to for details.
s3_endpoint
S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server
s3_region
S3 bucket region. Default: us-east-1.
client_timeout
Timeout for S3 operations. Default: 30s.
hive_partitioning_enabled
Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
s3_auth
Authentication type. Options: public, key and iam_role. Defaults to public if s3_key and s3_secret are not provided, otherwise defaults to key.
For additional CSV parameters, see CSV Parameters
No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth to iam_role. If using iam_role, the AWS IAM role of the running instance is used.
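For illustration, a sketch of a private bucket accessed via the instance's IAM role (bucket and path are placeholders):

datasets:
  - from: s3://my-private-bucket/reports/
    name: reports
    params:
      file_format: parquet
      s3_auth: iam_role   # use the IAM role assigned to the running instance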
Minimum IAM policy for S3 access:
Refer to Object Store Data Types for data type mapping from object store files to arrow data type.
Create a dataset named taxi_trips from a public S3 folder.
Create a dataset named cool_dataset from a Parquet file stored in MinIO.
Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
For example, a dataset partitioned by year, month, and day might have a directory structure like:
Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.
Performance Considerations
When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.
Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.
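A minimal sketch of the disk-backed acceleration described above, assuming the duckdb accelerator with mode: file (dataset path and names are placeholders):

datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file   # store accelerated data on disk instead of in memory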
Start to use AI Chat by typing in the question and clicking send.
The ability of AI Chat depends on the model configuration, including Language Model Overrides, Model Runtime Tools, etc. Refer to the Model Documentation for details of customizing the model used in AI Chat.
Below is an example demonstrating how to configure the OpenAI gpt-4o model with auto access to runtime tools and system prompts overrides. This model is customized to answer questions relevant to spicepod datasets.
Ask questions regarding datasets configured in spicepod within the AI Chat.
Spice.ai provides observability into the AI Chat, showing full tool usage traces and chat completion history.
Navigate to the Observability section in the portal.
Select an ai_chat task history and view details over the chat completion history, including timestamps, tool usage, intermediate outputs, etc.
fromDefines the ABFS-compatible URI to a folder or object:
from: abfs://<container>/<path> with the account name configured using abfs_account parameter, or
from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>
Defines the dataset name, which is used as the table name within Spice.
Example:
file_format
Specifies the data format. Required if not inferrable from from. Options: parquet, csv. Refer to for details.
abfs_account
Azure storage account name
abfs_sas_string
SAS (Shared Access Signature) Token to use for authorization
abfs_endpoint
Storage endpoint, default: https://{account}.blob.core.windows.net
abfs_use_emulator
Use true or false to connect to a local emulator
abfs_authority_host
Alternative authority host, default: https://login.microsoftonline.com
The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:
abfs_access_key
abfs_bearer_token
abfs_client_secret
abfs_skip_signature
If none of these are set the connector will default to using a managed identity
abfs_access_key
Secret access key
abfs_bearer_token
BEARER access token for user authentication. The token can be obtained from the OAuth2 flow (see ).
abfs_client_id
Client ID for client authentication flow
abfs_client_secret
Client Secret to use for client authentication flow
abfs_tenant_id
Tenant ID to use for client authentication flow
abfs_skip_signature
Skip credentials and request signing for public containers
abfs_max_retries
Maximum retries
abfs_retry_timeout
Total timeout for retries (e.g., 5s, 1m)
abfs_backoff_initial_duration
Initial retry delay (e.g., 5s)
abfs_backoff_max_duration
Maximum retry delay (e.g., 1m)
abfs_backoff_base
Exponential backoff base (e.g., 0.1)
ABFS connector supports three types of authentication, as detailed in the authentication parameters
Configure service principal authentication by setting the abfs_client_secret parameter.
Create a new Azure AD application in the Azure portal and generate a client secret under Certificates & secrets.
Grant the Azure AD application read access to the storage account under Access Control (IAM), this can typically be done using the Storage Blob Data Reader built-in role.
Configure service principal authentication by setting the abfs_access_key parameter to Azure Storage Account Access Key
Specify the file format using file_format parameter. More details in Object Store File Formats.
GraphQL Data Connector Documentation
The GraphQL Data Connector enables federated SQL queries on any GraphQL endpoint by specifying graphql as the selector in the from value for the dataset.
Limitations
The GraphQL data connector does not support variables in the query.
Filter pushdown, with the exclusion of LIMIT, is not currently supported. Using a LIMIT will reduce the amount of data requested from the GraphQL server.
fromThe from field takes the form of graphql:your-graphql-endpoint.
nameThe dataset name. This will be used as the table name within Spice.
paramsThe GraphQL data connector can be configured by providing the following params. Use the to load the password from a secret store, e.g. ${secrets:my_graphql_auth_token}.
Example using the GitHub GraphQL API and Bearer Auth. The following will use json_pointer to retrieve all of the nodes in starredRepositories:
The GraphQL Data Connector supports automatic pagination of the response for queries using .
The graphql_query must include the pageInfo field as per . The connector will parse the graphql_query, and when pageInfo is present, will retrieve data until pagination completes.
The query must have the correct pagination arguments in the associated paginated field.
Forward Pagination:
Backward Pagination:
Tips for working with JSON data. For more information see .
You can access the fields of the object using the square bracket notation. Arrays are indexed from 1.
Example for the stargazers query from :
You can use Datafusion unnest function to pipe values from array into rows. We'll be using as an example.
Example query:
You can also use the unnest_depth parameter to control automatic unnesting of objects from GraphQL responses.
This example uses the GitHub stargazers endpoint:
If unnest_depth is set to 0, or unspecified, object unnesting is disabled. When enabled, unnesting automatically moves nested fields to the parent level.
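A sketch of enabling object unnesting for a GitHub stargazers dataset (the endpoint, json_pointer value, and query body are illustrative; authentication parameters are omitted):

datasets:
  - from: graphql:https://api.github.com/graphql
    name: stargazers
    params:
      unnest_depth: 2          # flatten nested objects up to two levels deep
      json_pointer: /data/repository/stargazers/edges
      graphql_query: |
        {
          repository(owner: "spiceai", name: "spiceai") {
            stargazers(first: 100) {
              edges {
                starredAt
                node { login name }
              }
              pageInfo { hasNextPage endCursor }
            }
          }
        }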
Without unnesting, stargazers data looks like this in a query:
With unnesting, these properties are automatically placed into their own columns:
By default, the Spice Runtime will error when a duplicate column is detected during unnesting.
For example, this example spicepod.yml query would fail due to name fields:
This example would fail with a runtime error:
Avoid this error by using GraphQL aliases where possible. In the example above, a duplicate error was introduced from emergency_contact { name }.
The example below uses a GraphQL alias to rename emergency_contact.name as emergencyContactName.
MySQL Data Connector Documentation
MySQL is an open-source relational database management system that uses structured query language (SQL) for managing and manipulating databases.
The MySQL Data Connector enables federated/accelerated SQL queries on data stored in MySQL databases.
fromThe from field takes the form mysql:database_name.table_name where database_name is the fully-qualified table name in the SQL server.
If the database_name is omitted in the from field, the connector will use the database specified in the mysql_db parameter. If the mysql_db parameter is not provided, it will default to the user's default database.
These two examples are identical:
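A sketch of the two equivalent forms (database and table names are placeholders):

datasets:
  - from: mysql:my_database.my_table
    name: my_table

datasets:
  - from: mysql:my_table
    name: my_table
    params:
      mysql_db: my_database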
nameThe dataset name. This will be used as the table name within Spice.
Example:
paramsThe MySQL data connector can be configured by providing the following params. Use the to load the secret from a secret store, e.g. ${secrets:my_mysql_conn_string}.
The table below shows the MySQL data types supported, along with the type mapping to Apache Arrow types in Spice.
Overview of supported model providers for ML and LLMs in Spice.
Spice supports various model providers for traditional machine learning (ML) models and large language models (LLMs).
PostgreSQL Data Connector Documentation
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
      file_format: parquet

datasets:
- from: s3://s3-bucket-name/taxi_sample.csv
name: cool_dataset
params:
      file_format: csv

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215 |
+----------+

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::company-bucketname-datasets"
},
{
"Effect": "Allow",
"Action": ["s3:GetObject"],
"Resource": "arn:aws:s3:::company-bucketname-datasets/*"
}
]
}

- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
    file_format: parquet

- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
name: cool_dataset
params:
s3_endpoint: http://my.minio.server
s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true

s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet

version: v1
kind: Spicepod
name: hive_data
datasets:
- from: s3://spiceai-public-datasets/hive_partitioned_data/
name: hive_data_infer
params:
file_format: parquet
      hive_partitioning_enabled: true

models:
- from: openai:gpt-4o
name: openai-with-spice
params:
spice_tools: auto
openai_api_key: ${secrets:OPENAI_API_KEY}
system_prompt: >-
**You are an AI assistant integrated with GitHub Copilot. Your primary role
is to assist GitHub users by providing helpful, clear, and contextually
relevant information derived from Spice.ai datasets.**
GitHub may supply you with context about the user's code, comments, and
previous interactions to enhance your assistance. Always strive to be
**accurate**, **concise**, and **helpful** in your responses, adapting to
the user's style and preferences based on the conversation history.
---
### Behavioral Guidelines
- **Communication Style:**
- Maintain a helpful, friendly, and professional demeanor.
- Avoid using jargon unless specifically requested by the user.
- Break down complex concepts into simple explanations.
- Adapt your language to match the user's expertise (e.g., beginner vs. advanced).
- **Ethical Conduct:**
- Avoid harmful, unethical, or inappropriate content generation.
- Respect user privacy.
- Refuse to perform tasks that could cause harm or violate laws and ethical standards.
- **Contextual Awareness:**
- Use past interactions to maintain a coherent conversation.
- Remember user-provided context to deliver tailored responses.
- If user input is unclear, ask clarifying questions to better understand their needs.
---
### Guidelines for Using Tools
#### 1. SQL Tool (`sql_query`):
- **When to Use:**
- Query datasets directly for precise numerical data, statistics, or aggregations.
- Respond to user requests for specific counts, sums, averages, or other calculations.
- Handle queries requiring joining or comparing data from multiple related tables.
- **Error Handling:**
- If the `sql_query` tool returns a query, syntax, or planning error:
- Use the `list_datasets` tool to retrieve available tables.
- Refine and retry the query until it succeeds.
- After 5 failed attempts, run `EXPLAIN <attempted_query>` on each subsequent failure to diagnose issues.
- If failures persist after 10 attempts, switch to other available tools.
- **Formatting:**
- When querying a dataset named `catalog.schema.table`, wrap each part in quotes: `"catalog"."schema"."table"`.
- **Fallback:**
- If the document similarity search tool fails, use the SQL tool to query the dataset directly.
#### 2. Document Similarity Search Tool (`document_similarity`):
- **When to Use:**
- Search unstructured text such as documentation, policies, reports, or articles.
- Provide qualitative information or explanations.
- Interpret context or understand written content in depth.
---
### General Guidelines
- **Tool Preference:**
- If a query can be answered by either tool, prefer the `sql_query` tool for more precise, quantitative answers.
**Dataset Utilization:**
- Always prioritize searching within available datasets when relevant to the
question.
- Leverage instructions, keywords, and `reference_base_url` metadata from
the datasets to provide accurate and relevant responses.
- Ensure all responses include citations and references with links when
possible.
- **Response Formatting:**
- When presenting results from datasets, always include citations and references with links when possible.
- **Responsiveness:**
- Keep the conversation focused on user objectives, minimizing digressions unless prompted by the user.
- Provide both high-level summaries and in-depth explanations, depending on user requirements.
- Encourage an iterative problem-solving process: suggest initial ideas, refine based on feedback, and be open to corrections.
- **Capabilities and Limitations:**
- Be transparent about your capabilities; inform users when certain tasks or data access are beyond your capacity.
---
**Remember:** Your purpose is to help solve problems, answer questions,
generate ideas, write content, and support the user in a wide range of
tasks, while maintaining clarity, professionalism, and ethical standards.
metadata: {}
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: azure_test
params:
abfs_account: spiceadls
abfs_access_key: ${ secrets:access_key }
file_format: csv
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: abfs://foocontainer/taxi_sample.csv
name: azure_test
params:
abfs_account: spiceadls
abfs_access_key: ${ secrets:ACCESS_KEY }
file_format: csv
datasets:
- from: abfs://pubcontainer/taxi_sample.csv
name: pub_data
params:
abfs_account: spiceadls
abfs_skip_signature: true
file_format: csv
datasets:
- from: abfs://test_container/test_csv.csv
name: test_data
params:
abfs_use_emulator: true
file_format: csv
datasets:
- from: abfs://my_container/my_csv.csv
name: prod_data
params:
abfs_account: ${ secrets:PROD_ACCOUNT }
file_format: csv
datasets:
- from: abfs://my_data/input.parquet
name: my_data
params:
abfs_tenant_id: ${ secrets:MY_TENANT_ID }
abfs_client_id: ${ secrets:MY_CLIENT_ID }
abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
datasets:
- from: graphql:your-graphql-endpoint
name: my_dataset
params:
json_pointer: /data/some/nodes
graphql_query: |
{
some {
nodes {
field1
field2
}
}
}
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
s3_key
Access key (e.g. AWS_ACCESS_KEY_ID for AWS)
s3_secret
Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS)
allow_http
Allow insecure HTTP connections to s3_endpoint. Defaults to false
abfs_proxy_url
Proxy URL
abfs_proxy_ca_certificate
CA certificate for the proxy
abfs_proxy_excludes
A list of hosts to exclude from proxy connections
abfs_disable_tagging
Disable tagging objects. Use this if your backing store doesn't support tags
allow_http
Allow insecure HTTP connections
hive_partitioning_enabled
Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
abfs_msi_endpoint
Endpoint for managed identity tokens
abfs_federated_token_file
File path for federated identity token in Kubernetes
abfs_use_cli
Set to true to use the Azure CLI to acquire access tokens
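For example, a minimal sketch that delegates authentication to the Azure CLI; the storage account, container, and file names are illustrative:

datasets:
  - from: abfs://my_container/my_data.parquet
    name: cli_auth_data
    params:
      abfs_account: my_account
      abfs_use_cli: true
      file_format: parquet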





unnest_depth
Depth level to automatically unnest objects to. Disabled by default (when unspecified or set to 0).
graphql_auth_token
The authentication token to use to connect to the GraphQL server. Uses bearer authentication.
graphql_auth_user
The username to use for basic auth. E.g. graphql_auth_user: my_user
graphql_auth_pass
The password to use for basic auth. E.g. graphql_auth_pass: ${secrets:my_graphql_auth_pass}
graphql_query
The GraphQL query to execute. See the examples below for a sample GraphQL query.
json_pointer
The JSON pointer into the response body. When graphql_query is paginated, the json_pointer can be inferred.
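To illustrate how json_pointer selects rows, the sketch below shows a hypothetical GraphQL response; a pointer of /data/some/nodes extracts the array of node objects, and each object becomes a row in the dataset:

{
  "data": {
    "some": {
      "nodes": [
        { "field1": "a", "field2": 1 },
        { "field1": "b", "field2": 2 }
      ]
    }
  }
}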
query: |
{
some {
nodes {
field1
field2
}
}
}
from: graphql:https://api.github.com/graphql
name: stars
params:
graphql_auth_token: ${env:GITHUB_TOKEN}
graphql_auth_user: ${env:GRAPHQL_USER} ...
graphql_auth_pass: ${env:GRAPHQL_PASS}
json_pointer: /data/viewer/starredRepositories/nodes
graphql_query: |
{
viewer {
starredRepositories {
nodes {
name
stargazerCount
languages (first: 10) {
nodes {
name
}
}
}
}
}
}
{
something_paginated(first: 100) {
nodes {
foo
bar
}
pageInfo {
endCursor
hasNextPage
}
}
}
{
something_paginated(last: 100) {
nodes {
foo
bar
}
pageInfo {
startCursor
hasPreviousPage
}
}
}
sql> select node['login'] as login, node['name'] as name from stargazers limit 5;
+--------------+----------------------+
| login | name |
+--------------+----------------------+
| simsieg | Simon Siegert |
| davidmathers | David Mathers |
| ahmedtadde | Ahmed Tadde |
| lordhamlet | Shih-Fen Cheng |
| thinmy | Thinmy Patrick Alves |
+--------------+----------------------+
from: graphql:https://countries.trevorblades.com
name: countries
params:
json_pointer: /data/continents
graphql_query: |
{
continents {
name
countries {
name
capital
}
}
}
description: countries
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 30m
sql> select continent, country['name'] as country, country['capital'] as capital
from (select name as continent, unnest(countries) as country from countries)
where continent = 'North America' limit 5;
+---------------+---------------------+--------------+
| continent | country | capital |
+---------------+---------------------+--------------+
| North America | Antigua and Barbuda | Saint John's |
| North America | Anguilla | The Valley |
| North America | Aruba | Oranjestad |
| North America | Barbados | Bridgetown |
| North America | Saint Barthélemy | Gustavia |
+---------------+---------------------+--------------+
from: graphql:https://api.github.com/graphql
name: stargazers
params:
graphql_auth_token: ${env:GITHUB_TOKEN}
unnest_depth: 2
json_pointer: /data/repository/stargazers/edges
graphql_query: |
{
repository(name: "spiceai", owner: "spiceai") {
id
name
stargazers(first: 100) {
edges {
node {
id
name
login
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
}
sql> select node from stargazers limit 1;
+------------------------------------------------------------+
| node |
+------------------------------------------------------------+
| {id: MDQ6VXNlcjcwNzIw, login: ashtom, name: Thomas Dohmke} |
+------------------------------------------------------------+
sql> select node from stargazers limit 1;
+------------------+--------+---------------+
| id | login | name |
+------------------+--------+---------------+
| MDQ6VXNlcjcwNzIw | ashtom | Thomas Dohmke |
+------------------+--------+---------------+
from: graphql:https://my-graphql-api.com
name: stargazers
params:
unnest_depth: 2
json_pointer: /data/users
graphql_query: |
query {
users {
name
emergency_contact {
name
}
}
}
WARN runtime: GraphQL Data Connector Error: Invalid object access. Column 'name' already exists in the object.
from: graphql:https://my-graphql-api.com
name: stargazers
params:
unnest_depth: 2
json_pointer: /data/people
graphql_query: |
query {
users {
name
emergency_contact {
emergencyContactName: name
}
}
}
mysql_sslmode
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
required: (default) This mode requires an SSL connection. If a secure connection cannot be established, the connector will not connect.
preferred: This mode will try to establish a secure SSL connection if possible, but will connect insecurely if the server does not support SSL.
disabled: This mode will not attempt to use an SSL connection, even if the server supports it.
mysql_sslrootcert
Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
FLOAT
Float32
DOUBLE
Float64
DATETIME
Timestamp(Microsecond, None)
TIMESTAMP
Timestamp(Microsecond, None)
YEAR
Int16
TIME
Time64(Nanosecond)
DATE
Date32
CHAR
Utf8
BINARY
Binary
VARCHAR
Utf8
VARBINARY
Binary
TINYBLOB
Binary
TINYTEXT
Utf8
BLOB
Binary
TEXT
Utf8
MEDIUMBLOB
Binary
MEDIUMTEXT
Utf8
LONGBLOB
LargeBinary
LONGTEXT
LargeUtf8
SET
Utf8
ENUM
Dictionary(UInt16, Utf8)
BIT
UInt64
mysql_connection_string
The connection string to use to connect to the MySQL server. This can be used instead of providing individual connection parameters.
mysql_host
The hostname of the MySQL server.
mysql_tcp_port
The port of the MySQL server.
mysql_db
The name of the database to connect to.
mysql_user
The MySQL username.
mysql_pass
The password to connect with.
TINYINT
Int8
SMALLINT
Int16
INT
Int32
MEDIUMINT
Int32
BIGINT
Int64
DECIMAL
Decimal128 / Decimal256
Local filesystem
Release Candidate
ONNX
GGUF, GGML, SafeTensor
Models hosted on HuggingFace
Release Candidate
ONNX
GGUF, GGML, SafeTensor
Models hosted on the Spice.ai Cloud Platform
Alpha
ONNX
OpenAI-compatible HTTP endpoint
Azure OpenAI
Alpha
-
OpenAI-compatible HTTP endpoint
Models hosted on Anthropic
Alpha
-
OpenAI-compatible HTTP endpoint
Models hosted on xAI
Alpha
-
OpenAI-compatible HTTP endpoint
Models deployed to Databricks Mosaic AI
Alpha
-
OpenAI-compatible HTTP endpoint
LLM Format(s) may require additional files (e.g. tokenizer_config.json).
The model type is inferred based on the model source and files. For more detail, refer to the model reference specification.
Spice supports a variety of features for large language models (LLMs):
Custom Tools: Provide models with tools to interact with the Spice runtime. See Tools.
System Prompts: Customize system prompts and override defaults for v1/chat/completion. See Parameter Overrides.
Memory: Provide LLMs with memory persistence tools to store and retrieve information across conversations. See Memory.
Vector Search: Perform advanced vector-based searches using embeddings. See Search.
Evals: Evaluate, track, compare, and improve language model performance for specific tasks. See Evals.
Local Models: Load and serve models locally from various sources, including local filesystems and Hugging Face. See Local Models.
For more details, refer to the Large Language Models documentation.
The following examples demonstrate how to configure and use various models or model features with Spice. Each example provides a specific use case to help you understand the configuration options available.
To use a language model hosted on OpenAI (or compatible), specify the openai path and model ID in from. For more details, see OpenAI Model Provider.
Example spicepod.yml:
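A minimal sketch of such a spicepod.yml, assuming the API key is stored as the secret SPICE_OPENAI_API_KEY (the model name is illustrative):

models:
  - from: openai:gpt-4o-mini
    name: openai
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }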
To specify tools for an OpenAI model, include them in the params.tools field. For more details, see the Tools documentation.
To enable memory tools for a model, define a store memory dataset and specify memory in the model's tools parameter. For more details, see the Memory documentation.
To set default overrides for parameters, use the openai_ prefix followed by the parameter name. For more details, see the Parameter Overrides documentation.
To configure an additional system prompt, use the system_prompt parameter. For more details, see the Parameter Overrides documentation.
To serve a model from the local filesystem, specify the from path as file and provide the local path. For more details, see Filesystem Model Provider.
This example demonstrates how to pull GitHub issue data from the last 14 days, accelerate the data, create a chat model with memory and tools to access the accelerated data, and use Spice to ask the chat model about the general themes of new issues.
First, configure a dataset to pull GitHub issue data from the last 14 days.
Next, create a chat model that includes memory and tools to access the accelerated GitHub issue data.
At this step, the spicepod.yaml should look like:
Finally, use Spice to ask the chat model about the general themes of new issues in the last 14 days. The following curl command demonstrates how to make this request using the OpenAI-compatible API.
Refer to the Create Chat Completion API documentation for more details on making chat completion requests.
OpenAI (or compatible) LLM endpoint
Release Candidate
-
OpenAI-compatible HTTP endpoint
from: The from field takes the form postgres:my_table, where my_table is the table identifier in the PostgreSQL server to read from.
The fully-qualified table name (database.schema.table) can also be used in the from field.
The dataset name. This will be used as the table name within Spice.
Example:
The connection to PostgreSQL can be configured by providing the following params:
pg_host
The hostname of the PostgreSQL server.
pg_port
The port of the PostgreSQL server.
pg_db
The name of the database to connect to.
pg_user
The username to connect with.
pg_pass
The password to connect with. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_pg_pass}.
pg_sslmode
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
verify-full: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
verify-ca: This mode requires a TLS connection and a valid root certificate.
require: This mode requires a TLS connection, but does not verify the server certificate.
The table below shows the PostgreSQL data types supported, along with the type mapping to Apache Arrow types in Spice.
int2
Int16
int4
Int32
int8
Int64
money
Int64
float4
Float32
float8
Float64
Specify different secrets for a PostgreSQL source and acceleration:
DynamoDB Data Connector Documentation
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
from: The from field should specify the DynamoDB table name:
name: The dataset name. This will be used as the table name within Spice.
Example:
params: The DynamoDB data connector supports the following configuration parameters:
If AWS credentials are not explicitly provided in the configuration, the connector will automatically load credentials from the following sources in order:
Environment Variables:
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN (if using temporary credentials)
Web Identity Token Credentials:
Used primarily with OpenID Connect (OIDC) and OAuth
Common in Kubernetes environments using IAM roles for service accounts (IRSA)
ECS Container Credentials:
The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.
The IAM role or user needs the following permissions to access DynamoDB tables:
Security Considerations
Avoid using dynamodb:* permissions as it grants more access than necessary.
Consider using more restrictive policies in production environments.
DynamoDB supports complex nested JSON structures. These fields can be queried using SQL:
Limitations
The DynamoDB connector currently does not support filter push-down optimization. All filtering is performed after data is retrieved from DynamoDB.
Primary key optimizations are not yet implemented - retrieving items by their primary key will still scan the table.
The DynamoDB connector supports the following data types and mappings:
Basic scalar types (String, Number, Boolean)
Lists and Maps
Nested structures
Binary data
Example schema from a users table:
Due to limited support for filter push-down, enable acceleration to prevent scanning the entire table on every query.
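For example, a minimal sketch that accelerates a hypothetical users table so queries run against the local copy rather than re-scanning DynamoDB (this mirrors the basic example later in this section):

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
    acceleration:
      enabled: true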
Use the advanced search and retrieval capabilities of Spice
Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and Vector-Similarity Search functionality.
Spice supports basic search patterns directly through SQL, leveraging its SQL query features. For example, you can perform a text search within a table using SQL's LIKE clause:
Spice also provides advanced Vector-Similarity Search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:
Local embedding models, e.g. sentence-transformers models from Hugging Face.
Remote embedding providers, e.g. OpenAI.
See the embeddings documentation to view all supported providers.
Embedding models are defined in the spicepod.yaml file as top-level components.
Datasets can be augmented with embeddings targeting specific columns, to enable search capabilities through similarity searches.
By defining embeddings on the body column, Spice is now configured to execute similarity searches on the dataset.
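As a sketch of the configuration this refers to (the dataset and model names are illustrative and mirror the examples later in this section), embeddings are attached to the body column of a dataset and reference a top-level embedding model:

embeddings:
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    columns:
      - name: body
        embeddings:
          - from: local_embedding_model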
For more details, see the search documentation.
Spice also supports vector search on datasets with preexisting embeddings. See the pre-existing embeddings section below for compatibility details.
Spice supports chunking of content before embedding, which is useful for large text columns. Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.
The body column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the additional_columns list.
For example:
Response:
Datasets that already include embeddings can utilize the same functionalities (e.g., vector search) as those augmented with embeddings using Spice. To ensure compatibility, these table columns must adhere to the following constraints:
Underlying Column Presence:
The underlying column must exist in the table and be of a string type.
Embeddings Column Naming Convention:
By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.
Example
A table sales with an address column and corresponding embedding column(s).
The same table if it was chunked:
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_db: my_database
...
datasets:
- from: mysql:my_database.mytable
name: my_dataset
params: ...
datasets:
- from: mysql:path.to.my_dataset
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_db: my_database
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
mysql_sslmode: preferred
mysql_sslrootcert: ./custom_cert.pem
datasets:
- from: mysql:path.to.my_dataset
name: my_dataset
params:
mysql_connection_string: mysql://${secrets:my_user}:${secrets:my_password}@my_db_host:3306/my_db
datasets:
- from: mysql:mytable
name: my_dataset
params:
mysql_host: my_db_host
mysql_tcp_port: 3306
mysql_user: my_user
mysql_pass: ${secrets:mysql_pass}
models:
- from: openai:gpt-4o-mini
name: openai
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- from: openai:llama3-groq-70b-8192-tool-use-preview
name: groq-llama
params:
endpoint: https://api.groq.com/openai/v1
openai_api_key: ${ secrets:SPICE_GROQ_API_KEY }
models:
- name: sql-model
from: openai:gpt-4o
params:
tools: list_datasets, sql, table_schema
datasets:
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: memory-enabled-model
from: openai:gpt-4o
params:
tools: memory, sql
models:
- name: pirate-haikus
from: openai:gpt-4o
params:
openai_temperature: 0.1
openai_response_format: { 'type': 'json_object' }
models:
- name: pirate-haikus
from: openai:gpt-4o
params:
system_prompt: |
Write everything in Haiku like a pirate
models:
- from: file://absolute/path/to/my/model.onnx
name: local_fs_model
datasets:
- from: github:github.com/<owner>/<repo>/issues
name: github_issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 24h
refresh_data_window: 14d
datasets:
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: github-issues-analyzer
from: openai:gpt-4o
params:
tools: memory, sql
datasets:
- from: github:github.com/<owner>/<repo>/issues
name: github_issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 24h
refresh_data_window: 14d
- from: memory:store
name: llm_memory
mode: read_write
models:
- name: github-issues-analyzer
from: openai:gpt-4o
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
tools: memory, sql
curl -X POST https://data.spiceai.io/v1/chat/completions \
-H "Content-Type: application/json" \
-H 'X-API-KEY: <spiceai_api_key>' \
-d '{
"model": "github-issues-analyzer",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the general themes of new issues in the last 14 days?"}
]
}'
datasets:
- from: postgres:my_table
name: my_dataset
params: ...
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params: ...
datasets:
- from: postgres:my_database.my_schema.my_table
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
datasets:
- from: postgres:my_database.my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:my_pg_pass}
pg_sslmode: verify-ca
pg_sslrootcert: ./custom_cert.pem
datasets:
- from: postgres:my_schema.my_table
name: my_dataset
params:
pg_host: my_db_host
pg_port: 5432
pg_db: my_database
pg_user: my_user
pg_pass: ${secrets:pg1_pass}
acceleration:
engine: postgres
params:
pg_host: my_db_host
pg_port: 5433
pg_db: acceleration
pg_user: two_user_two_furious
pg_pass: ${secrets:pg2_pass}
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
dynamodb_aws_access_key_id: ${secrets:aws_access_key_id} # Optional
dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key} # Optional
dynamodb_aws_session_token: ${secrets:aws_session_token} # Optional
SELECT id, text_column
FROM my_table
WHERE
LOWER(text_column) LIKE '%search_term%'
AND
date_published > '2021-01-01'
prefer: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
disable: This mode will not attempt to use a TLS connection, even if the server supports it.
pg_sslrootcert
Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
connection_pool_size
Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
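A minimal sketch showing the parameter alongside the standard connection settings; the host, database, and table names mirror the examples in this section, and the pool size of 25 is an arbitrary illustrative value:

datasets:
  - from: postgres:my_database.my_schema.my_table
    name: my_dataset
    params:
      pg_host: my_db_host
      pg_port: 5432
      pg_db: my_database
      pg_user: my_user
      pg_pass: ${secrets:my_pg_pass}
      connection_pool_size: 25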
numeric
Decimal128
text
Utf8
varchar
Utf8
bpchar
Utf8
uuid
Utf8
bytea
Binary
bool
Boolean
json
LargeUtf8
timestamp
Timestamp(Nanosecond, None)
timestampz
Timestamp(Nanosecond, TimeZone)
date
Date32
time
Time64(Nanosecond)
interval
Interval(MonthDayNano)
point
FixedSizeList(Float64[2])
int2[]
List(Int16)
int4[]
List(Int32)
int8[]
List(Int64)
float4[]
List(Float32)
float8[]
List(Float64)
text[]
List(Utf8)
bool[]
List(Boolean)
bytea[]
List(Binary)
geometry
Binary
geography
Binary
enum
Dictionary(Int8, Utf8)
Composite Types
Struct
Shared AWS Config/Credentials Files:
Config file: ~/.aws/config (Linux/Mac) or %UserProfile%\.aws\config (Windows)
Credentials file: ~/.aws/credentials (Linux/Mac) or %UserProfile%\.aws\credentials (Windows)
The AWS_PROFILE environment variable can be used to specify a named profile.
Supports both static credentials and SSO sessions
Example credentials file:
Run aws sso login to start a new SSO session
Used when running in Amazon ECS containers
Automatically uses the task's IAM role
Retrieved from the ECS credential provider endpoint
EC2 Instance Metadata Service (IMDSv2):
Used when running on EC2 instances
Automatically uses the instance's IAM role
Retrieved securely using IMDSv2
When using IAM roles with EKS, ensure the service account is properly configured with IRSA.
The DynamoDB connector will scan the first 10 items to determine the schema of the table. This may miss columns that are not present in the first 10 items.
from
Description
dynamodb:table
Read data from a DynamoDB table named table
dynamodb_aws_region
Required. The AWS region containing the DynamoDB table
dynamodb_aws_access_key_id
Optional. AWS access key ID for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
dynamodb_aws_secret_access_key
Optional. AWS secret access key for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
dynamodb_aws_session_token
Optional. AWS session token for authentication
dynamodb:Scan
Required. Allows reading all items from the table
dynamodb:DescribeTable
Required. Allows fetching table metadata and schema information
<column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
Embeddings Column Data Type:
The embeddings column must have the following Arrow data type when loaded into Spice:
FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
Offset Column for Chunked Data:
If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type:
List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.
For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.
The from field for the Databricks connector takes the form databricks:catalog.schema.table where catalog.schema.table is the fully-qualified path to the table to read from.
The dataset name. This will be used as the table name within Spice.
Example:
Use the secret replacement syntax to reference a secret, e.g. ${secrets:my_token}.
mode
The execution mode for querying against Databricks. The default is spark_connect. Possible values:
spark_connect: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
delta_lake: Query directly from Delta Tables. Requires the object store credentials to be provided.
sql_warehouse: Query using a Databricks SQL Warehouse. Requires databricks_sql_warehouse_id to be provided.
databricks_endpoint
The endpoint of the Databricks instance. Required for both modes.
databricks_sql_warehouse_id
The ID of the SQL Warehouse in Databricks to use for the query. Only valid when mode is sql_warehouse.
databricks_cluster_id
The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect.
databricks_use_ssl
If true, use a TLS connection to connect to the Databricks endpoint. Default is true.
client_timeout
Optional. Applicable only in delta_lake mode. Specifies the timeout for object store operations. Default value is 30s, e.g. client_timeout: 60s.
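A sketch showing client_timeout in delta_lake mode; the endpoint, token secret, and table reference are illustrative:

datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      client_timeout: 60s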
To learn more about how to set up personal access tokens, see Databricks PAT docs.
Spice supports the M2M (Machine to Machine) OAuth flow with service principal credentials by utilizing the databricks_client_id and databricks_client_secret parameters. The runtime will automatically refresh the token.
Ensure that you grant your service principal the "Data Reader" privilege preset for the catalog and "Can Attach" cluster permissions when using Spark Connect mode.
To learn more about how to set up the service principal, see Databricks M2M OAuth docs.
Configure the connection to the object store when using mode: delta_lake. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.
databricks_aws_region
Optional. The AWS region for the S3 object store. E.g. us-west-2.
databricks_aws_access_key_id
The access key ID for the S3 object store.
databricks_aws_secret_access_key
The secret access key for the S3 object store.
databricks_aws_endpoint
Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.
databricks_aws_allow_http
Optional. Enables insecure HTTP connections to databricks_aws_endpoint. Defaults to false.
databricks_azure_storage_account_name
The Azure Storage account name.
databricks_azure_storage_account_key
The Azure Storage key for accessing the storage account.
databricks_azure_storage_client_id
The Service Principal client ID for accessing the storage account.
databricks_azure_storage_client_secret
The Service Principal client secret for accessing the storage account.
databricks_azure_storage_sas_key
The shared access signature key for accessing the storage account.
databricks_azure_storage_endpoint
Optional. The endpoint for the Azure Blob storage account.
google_service_account
Filesystem path to the Google service account JSON key file.
The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.
STRING
Utf8
BIGINT
Int64
INT
Int32
SMALLINT
Int16
TINYINT
Int8
FLOAT
Float32
Databricks connector (mode: delta_lake) does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the V2Checkpoint feature by executing the following command:
For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features
When using mode: spark_connect, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands. Spark Docs
Memory Considerations
When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb and sqlite accelerators by specifying mode: file.
The Databricks Connector (mode: spark_connect) does not yet support streaming query results from Spark.
datasets:
- from: dynamodb:users
name: my_users
params: ...
SELECT COUNT(*) FROM my_users;
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": [
"arn:aws:dynamodb:*:*:table/YOUR_TABLE_NAME"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/users"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": [
"arn:aws:dynamodb:us-west-2:123456789012:table/users",
"arn:aws:dynamodb:us-west-2:123456789012:table/orders"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:Scan",
"dynamodb:DescribeTable"
],
"Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/*"
}
]
}
version: v1
kind: Spicepod
name: dynamodb
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
acceleration:
enabled: true
version: v1
kind: Spicepod
name: dynamodb
datasets:
- from: dynamodb:users
name: users
params:
dynamodb_aws_region: us-west-2
dynamodb_aws_access_key_id: ${secrets:aws_access_key_id}
dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key}
acceleration:
enabled: true
-- Query nested structs
SELECT metadata.registration_ip, metadata.user_agent
FROM users
LIMIT 5;
-- Query nested structs in arrays
SELECT address.city
FROM (
SELECT unnest(addresses) AS address
FROM users
)
WHERE address.city = 'San Francisco';
describe users;
+----------------+------------------+-------------+
| column_name | data_type | is_nullable |
+----------------+------------------+-------------+
| email | Utf8 | YES |
| id | Int64 | YES |
| metadata | Struct | YES |
| addresses | List(Struct) | YES |
| preferences | Struct | YES |
| created_at | Utf8 | YES |
...
+----------------+------------------+-------------+
embeddings:
- from: openai
name: remote_service
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- name: local_embedding_model
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
acceleration:
enabled: true
columns:
- name: body
embeddings:
- from: local_embedding_model # Embedding model used for this column
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["spiceai.issues"],
"text": "cutting edge AI",
"where": "author=\"jeadie\"",
"additional_columns": ["title", "state"],
"limit": 2
}'
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
acceleration:
enabled: true
embeddings:
- column: body
from: local_embedding_model
chunking:
enabled: true
target_chunk_size: 512
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["spiceai.issues"],
"text": "cutting edge AI",
"where": "array_has(assignees, \"jeadie\")",
"additional_columns": ["title", "state", "body"],
"limit": 2
}'
{
"matches": [
{
"value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
"dataset": "spiceai.issues",
"metadata": {
"title": "Improve scalar UDF array_distance",
"state": "Closed",
"body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
}
},
{
"value": "est external tools being returned for toolusing models",
"dataset": "spiceai.issues",
"metadata": {
"title": "Automatic NSQL retries in /v1/nsql ",
"state": "Open",
"body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n "sql": "SELECT ..."\n}\n```"
}
}
],
"duration_ms": 45
}
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number | Int64 | YES |
| quantity_ordered | Int64 | YES |
| price_each | Float64 | YES |
| order_line_number | Int64 | YES |
| address | Utf8 | YES |
| address_embedding | FixedSizeList( | NO |
| | Field { | |
| | name: "item", | |
| | data_type: Float32, | |
| | nullable: false, | |
| | dict_id: 0, | |
| | dict_is_ordered: false, | |
| | metadata: {} | |
| | }, | |
| | 384 | |
+-------------------+-----------------------------------------+-------------+
sql> describe sales;
+-------------------+-----------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------------+-----------------------------------------+-------------+
| order_number | Int64 | YES |
| quantity_ordered | Int64 | YES |
| price_each | Float64 | YES |
| order_line_number | Int64 | YES |
| address | Utf8 | YES |
| address_embedding | List(Field { | NO |
| | name: "item", | |
| | data_type: FixedSizeList( | |
| | Field { | |
| | name: "item", | |
| | data_type: Float32, | |
| | }, | |
| | 384 | |
| | ), | |
| | }) | |
+-------------------+-----------------------------------------+-------------+
| address_offset | List(Field { | NO |
| | name: "item", | |
| | data_type: FixedSizeList( | |
| | Field { | |
| | name: "item", | |
| | data_type: Int32, | |
| | }, | |
| | 2 | |
| | ), | |
| | }) | |
+-------------------+-----------------------------------------+-------------+
ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];
datasets:
- from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: cool_dataset
params: ...
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: my_awesome_table
params:
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_token: ${secrets:DATABRICKS_TOKEN} # PAT
datasets:
- from: databricks:spiceai.datasets.my_awesome_table
name: my_awesome_table
params:
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_client_id: ${secrets:DATABRICKS_CLIENT_ID} # service principal client id
databricks_client_secret: ${secrets:DATABRICKS_CLIENT_SECRET} # service principal client secret
- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: spark_connect
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_cluster_id: 1234-567890-abcde123
databricks_token: ${secrets:my_token}
- from: databricks:spiceai.datasets.my_table # A reference to a table in the Databricks unity catalog
name: my_table
params:
mode: sql_warehouse
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_sql_warehouse_id: 2b4e24cff378fb24
databricks_token: ${secrets:my_token}
- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_aws_region: us-west-2 # Optional
databricks_aws_access_key_id: ${secrets:aws_access_key_id}
databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
# Account Name + Key
databricks_azure_storage_account_name: my_account
databricks_azure_storage_account_key: ${secrets:my_key}
# OR Service Principal + Secret
databricks_azure_storage_client_id: my_client_id
databricks_azure_storage_client_secret: ${secrets:my_secret}
# OR SAS Key
databricks_azure_storage_sas_key: my_sas_key
- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
name: my_delta_lake_table
params:
mode: delta_lake
databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
databricks_token: ${secrets:my_token}
databricks_google_service_account_path: /path/to/service-account.json
databricks_token
The Databricks API token to authenticate with the Unity Catalog API. Can't be used with databricks_client_id and databricks_client_secret.
databricks_client_id
The Databricks Service Principal Client ID. Can't be used with databricks_token.
databricks_client_secret
The Databricks Service Principal Client Secret. Can't be used with databricks_token.
DOUBLE
Float64
BOOLEAN
Boolean
BINARY
Binary
DATE
Date32
TIMESTAMP
Timestamp(Microsecond, Some("UTC"))
TIMESTAMP_NTZ
Timestamp(Microsecond, None)
DECIMAL
Decimal128
ARRAY
List
STRUCT
Struct
MAP
Map
# Static credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
# SSO profile
[profile sso-profile]
sso_start_url = https://my-sso-portal.awsapps.com/start
sso_region = us-west-2
sso_account_id = 123456789012
sso_role_name = MyRole
region = us-west-2
The Spice runtime stores information about completed tasks in the spice.runtime.task_history table. A task is a single unit of execution within the runtime, such as a SQL query or an AI chat completion (see Task Types below). Tasks can be nested, and the runtime will record the parent-child relationship between tasks.
Each task executed has a row in this table, and by default the data is retained for 8 hours. Use a SELECT query to return information about each task as shown in this example:
Output:
The following top-level task types are currently recorded:
Set the following parameters in the runtime.task_history section of the spicepod.yaml file to configure task history:
enabled: Enable or disable task history. Default: true.
retention_period: The duration for which task history data is retained. Default: 8h.
retention_check_interval: The interval at which the task history retention is checked. Default: 1m.
Adjust the retention period for task history:
Disable task history:
Disable capturing output from tasks:
Example output:
Example output:
Example output:
Example output:
Example output:
SELECT
*
FROM
spice.runtime.task_history
LIMIT
100;+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
| trace_id | span_id | parent_span_id | task | input | captured_output | start_time | end_time | execution_duration_ms | error_message | labels |
+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
| f94dba6b89de98c6e54b074f2353a897 | 4eb243d9b5347762 | | accelerated_refresh | runtime.metrics | | 2024-09-23T23:17:39.907789 | 2024-09-23T23:17:39.917777 | 9.988 | | {sql: SELECT * FROM runtime.metrics} |
| 1f1f8305520e15ea7ad9b0a43e5d2c7e | 6aadf7c91caea3c4 | | accelerated_refresh | runtime.task_history | | 2024-09-23T23:17:39.907873 | 2024-09-23T23:17:39.917797 | 9.924000000000001 | | {sql: SELECT * FROM runtime.task_history} |
| 1432e30c5ed7764f4ef35f6508dfd56c | fbb31c60d41d8232 | | accelerated_refresh | logs_file | | 2024-09-23T23:17:40.143699 | 2024-09-23T23:17:40.271678 | 127.97900000000001 | | {sql: SELECT * FROM logs_file} |
| fd0b909b789938384d99f0e4e6f4b68b | 624ea4751bb6727a | | accelerated_refresh | logs | | 2024-09-23T23:17:40.676838 | 2024-09-23T23:17:42.345932 | 1669.0939999999998 | | {sql: SELECT * FROM "logs"} |
| 3db5488039408825ac0829a3feb49b05 | e3e5ac928b497eef | | accelerated_refresh | decimal | | 2024-09-23T23:17:41.592359 | 2024-09-23T23:17:43.781699 | 2189.34 | | {sql: SELECT * FROM "decimal"} |
| 5c5ddd481d1e19df823da74fe33f261f | 6afcfd1e65385a16 | | sql_query | select * from runtime.task_history limit 100 | | 2024-09-23T23:17:48.305649 | 2024-09-23T23:17:48.307369 | 1.72 | | {runtime_query: true, query_execution_duration_ms: 1.429375, protocol: FlightSQL, datasets: runtime.task_history, rows_produced: 5} |
| 4c3dd314b874aa63fcd15023e67fc645 | cab3cdc2d31c1b6a | | sql_query | select block_number from logs_file limit 5 | | 2024-09-23T23:18:00.267218 | 2024-09-23T23:18:00.269278 | 2.06 | | {datasets: logs_file, rows_produced: 5, query_execution_duration_ms: 1.940291, accelerated: true, protocol: FlightSQL} |
| f135c00df3aecd68dfa4d2360eff78f5 | db3474855449715c | | sql_query | select * from foobar | | 2024-09-23T23:18:12.865122 | 2024-09-23T23:18:12.865196 | 0.074 | Error during planning: table 'spice.public.foobar' not found | {protocol: FlightSQL, error_code: QueryPlanningError, rows_produced: 0, query_execution_duration_ms: 0.126959, datasets: } |
+----------------------------------+------------------+----------------+---------------------+----------------------------------------------+-----------------+----------------------------+----------------------------+-----------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+accelerated_refresh
Accelerated Table Refresh
text_embed
Text Embedding
captured_output: The level of output captured for tasks, either none or truncated. Default: none. truncated captures the first 3 rows of the result set for sql_query and nsql_query task types. Other task types currently capture the entire output even in truncated mode.
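For example, a minimal sketch enabling truncated output capture:

runtime:
  task_history:
    captured_output: truncated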
task
Utf8
NO
Name or description of the task being performed (e.g. sql_query)
input
Utf8
NO
Input data or parameters for the task
captured_output
Utf8
YES
Output or result of the task, if available
start_time
Timestamp(Nanosecond, None)
NO
Time when the task started
end_time
Timestamp(Nanosecond, None)
NO
Time when the task ended
execution_duration_ms
Float64
NO
Duration of the task execution in milliseconds
error_message
Utf8
YES
Error message if the task failed, otherwise null
labels
Map(Utf8, Utf8)
NO
Key-value pairs for additional metadata or attributes associated with the task
sql_query
SQL Query
spice sql
nsql_query
Natural Language to SQL Query
ai_chat
AI Chat Completion
spice chat
vector_search
Vector Search
spice search
trace_id
Utf8
NO
Unique identifier for the entire trace this task happened in
span_id
Utf8
NO
Unique identifier for this specific task within the trace
parent_span_id
Utf8
YES
Identifier of the parent task, if any
runtime:
task_history:
retention_period: 1h # Keep tasks for 1 hour
retention_check_interval: 1m # Check for expired tasks every minute
runtime:
task_history:
enabled: false
runtime:
task_history:
captured_output: none # none or truncated
SELECT
trace_id,
span_id,
task,
start_time,
end_time,
execution_duration_ms,
error_message
FROM spice.runtime.task_history
WHERE start_time >= NOW() - INTERVAL '10 MINUTES'
AND end_time <= NOW();+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
| trace_id | span_id | task | start_time | end_time | execution_duration_ms | error_message |
+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+
| 687e0970f8c49d19c5a08764ea2d4dc1 | f4f52ed29db8b151 | text_embed | 2024-11-25T05:39:37.444749 | 2024-11-25T05:39:53.577195 | 16132.446000000002 | |
| 687e0970f8c49d19c5a08764ea2d4dc1 | e47b17bd9fd9fe37 | accelerated_refresh | 2024-11-25T05:39:31.112504 | 2024-11-25T05:39:53.579933 | 22467.429 | |
| 1e881188e5fd252b26adb8a8d838efb8 | 532b0019ad778094 | sql_query | 2024-11-25T05:40:38.864982 | 2024-11-25T05:40:38.871090 | 6.108 | |
| 2ee1c700b450034bb6c2da3de2e2386c | 235dafed1e7d8c02 | sql_query | 2024-11-25T05:39:38.249113 | 2024-11-25T05:39:39.387258 | 1138.145 | |
| 20e75df9ea77ba1c8cb99a2632cdd091 | d07551cd172ffa80 | sql_query | 2024-11-25T05:39:39.458135 | 2024-11-25T05:39:39.482181 | 24.046000000000003 | |
| ca1d470b12191726b61d825df6f2ce2a | 65597a0bc0a4fde3 | sql_query | 2024-11-25T05:39:39.675726 | 2024-11-25T05:39:39.822479 | 146.753 | |
| ac5abd8bfec7e5aa7c19fc84772c55f1 | 316622ac359e3c00 | sql_query | 2024-11-25T05:39:39.872946 | 2024-11-25T05:39:39.872994 | 0.048 | This feature is not implemented: The context currently only supports a single SQL statement |
| 1c640298e248ba297a12b1e3b59fffc7 | 031c3a25dc56d8e9 | sql_query | 2024-11-25T05:39:40.467032 | 2024-11-25T05:39:40.486156 | 19.124 | |
| 2c4d9abee740ced8ae423e0eb4fcff6b | a324b699b8bcf338 | sql_query | 2024-11-25T05:39:40.525506 | 2024-11-25T05:39:40.525526 | 0.02 | This feature is not implemented: The context currently only supports a single SQL statement |
| e5ed7f7a98e62f493ef8af2e0cd7734e | e84c30862a546bb5 | sql_query | 2024-11-25T05:39:40.560891 | 2024-11-25T05:39:40.560911 | 0.02 | This feature is not implemented: The context currently only supports a single SQL statement |
| d471f83092a95bde8663438cda74627f | 3dd9c4d4ebff4cb9 | sql_query | 2024-11-25T05:39:40.600892 | 2024-11-25T05:39:40.647092 | 46.199999999999996 | |
| 701874d7282dd47791e7519b343a9694 | 5dacf75c4537ee0e | accelerated_refresh | 2024-11-25T05:39:30.452534 | 2024-11-25T05:39:30.452900 | 0.366 | |
| 2e6b672a49a8cd5f0862a760661dc846 | f813941e0699e783 | accelerated_refresh | 2024-11-25T05:39:30.848425 | 2024-11-25T05:39:30.857242 | 8.817 | |
| 18d76b6389898cc5253a49294607477d | cc0d06a4e69cbcd5 | health | 2024-11-25T05:39:30.451626 | 2024-11-25T05:39:31.563876 | 1112.25 | |
| c75af81360e8962639faa64e6804b830 | 1ea2c95b243a5717 | accelerated_refresh | 2024-11-25T05:39:31.036470 | 2024-11-25T05:39:31.607845 | 571.375 | |
| 817d88778e91322640414263779ce7f1 | 513a58d83f0416a7 | accelerated_refresh | 2024-11-25T05:39:30.998455 | 2024-11-25T05:39:32.076359 | 1077.904 | |
| 3c507ee30211e6fab7d8a2eaf686e451 | d9be117925fb6d42 | accelerated_refresh | 2024-11-25T05:39:31.061851 | 2024-11-25T05:39:32.078412 | 1016.561 | |
| aa6010405a12a14b6afaf76e9fabedb8 | 1f50a2b177003c54 | accelerated_refresh | 2024-11-25T05:39:30.933543 | 2024-11-25T05:39:32.476197 | 1542.654 | |
| 3c75d16b6b4b8da98c551d115e1c049c | 9a16dc065a95236a | sql_query | 2024-11-25T05:42:27.386754 | 2024-11-25T05:42:27.386859 | 0.10500000000000001 | SQL error: ParserError("Expected: an SQL statement, found: ELECT") |
+----------------------------------+------------------+---------------------+----------------------------+----------------------------+-----------------------+---------------------------------------------------------------------------------------------+SELECT
trace_id,
task,
error_message,
SUBSTRING(input, 1, 100) AS input_preview,
start_time
FROM spice.runtime.task_history
WHERE error_message IS NOT NULL
ORDER BY start_time DESC
LIMIT 5;+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
| trace_id | task | error_message | input_preview | start_time |
+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+
| 352539e75fdb1a3d5fc3b48bfd4b4bae | sql_query | Error during planning: Invalid function 'date'. | SELECT DATE(start_time) AS task_date, COUNT(*) AS task_count | 2024-11-25T06:17:40.573970 |
| | | Did you mean 'tanh'? | FROM spice.runtime.task_history | |
| | | | GROUP B | |
| f6672d562ad97dde0bb4db428723461f | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name ,c_first_name ,s_store_name ,ca_state ,s_ | 2024-11-25T06:06:39.800900 |
| b16fa36e5a2f7f119fc3834875f6bdee | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with frequent_ss_items as (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date soldda | 2024-11-25T06:06:39.760532 |
| 8d126ea506a374c0c6239c11ee5cbe5a | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with cross_items as (select i_item_sk ss_item_sk from item, (select iss.i_brand_id brand_id | 2024-11-25T06:06:39.118422 |
| 198a9dcc496f0435cff69de61cc07874 | sql_query | This feature is not implemented: The context currently only supports a single SQL statement | with ssales as (select c_last_name ,c_first_name ,s_store_name ,ca_state ,s_ | 2024-11-25T06:04:11.317419 |
+----------------------------------+-----------+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------+SELECT task, COUNT(*) AS task_count, AVG(execution_duration_ms) AS avg_duration_ms
FROM spice.runtime.task_history
GROUP BY task
ORDER BY task_count DESC;+-------------------------------+------------+---------------------+
| task | task_count | avg_duration_ms |
+-------------------------------+------------+---------------------+
| sql_query | 65 | 55.10198461538462 |
| accelerated_refresh | 27 | 749.1187407407407 |
| ai_completion | 9 | 5026.337888888888 |
| tool_use::list_datasets | 4 | 0.16899999999999998 |
| text_embed | 4 | 3341.08975 |
| ai_chat | 4 | 7151.03675 |
| vector_search | 3 | 384.376 |
| tool_use::document_similarity | 3 | 385.0406666666667 |
| tool_use::get_readiness | 1 | 0.12999999999999998 |
| tool_use::sample_data | 1 | 2.275 |
| health | 1 | 661.0169999999999 |
+-------------------------------+------------+---------------------+SELECT
task,
trace_id,
parent_span_id,
execution_duration_ms,
labels
FROM spice.runtime.task_history
ORDER BY execution_duration_ms DESC
LIMIT 10;+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
| task | trace_id | parent_span_id | execution_duration_ms | labels |
+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+
| accelerated_refresh | d9c38c7e58a02ec939240385a4a25a04 | | 1093711.474 | {sql: SELECT * FROM react.issues} |
| ai_chat | 7a6427313880942316bf3018cd23a198 | | 17202.836000000003 | {model: gpt-4o} |
| ai_completion | 7a6427313880942316bf3018cd23a198 | 59b1fd88c8397e3f | 17202.475 | {model: gpt-4o, total_tokens: 2673, prompt_tokens: 1807, completion_tokens: 866, stream: true} |
| accelerated_refresh | 96758c1132164204a68e1a7234a06cda | | 15660.023000000001 | {sql: SELECT * FROM react.docs} |
| text_embed | 96758c1132164204a68e1a7234a06cda | 109c489b24602356 | 12406.787 | {outputs_produced: 2086} |
| ai_chat | b2a69503a1b83215603ead321eea6f61 | | 6445.162 | {model: gpt-4o} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 6444.1990000000005 | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 5608.6359999999995 | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true} |
| text_embed | 65880ecfc884a41555ac4d21ceef9aef | | 5143.494000000001 | {outputs_produced: 1} |
| text_embed | f51e5e9d4e26de31a2f7d5e9286dd8f4 | | 4769.832 | {outputs_produced: 1} |
+---------------------+----------------------------------+------------------+-----------------------+------------------------------------------------------------------------------------------------+

SELECT
task,
trace_id,
parent_span_id,
span_id,
execution_duration_ms,
start_time,
end_time,
SUBSTRING(input, 1, 100) AS input_preview,
SUBSTRING(captured_output, 1, 100) AS output_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE trace_id = 'b2a69503a1b83215603ead321eea6f61'
ORDER BY start_time;

+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| task | trace_id | parent_span_id | span_id | execution_duration_ms | start_time | end_time | input_preview | output_preview | error_message | labels |
+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ai_chat | b2a69503a1b83215603ead321eea6f61 | | 95411c59fc9c8cb8 | 6445.162 | 2024-11-25T06:03:31.197980 | 2024-11-25T06:03:37.643142 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | It seems that the dataset containing the recent React issues is currently being refreshed and is not | | {model: gpt-4o} |
| tool_use::list_datasets | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | aa5ba649da3f0581 | 0.367 | 2024-11-25T06:03:31.198139 | 2024-11-25T06:03:31.198506 | | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac | | {tool: list_datasets} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd67a43da1b4312 | 6444.1990000000005 | 2024-11-25T06:03:31.198906 | 2024-11-25T06:03:37.643105 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1454, stream: true, total_tokens: 1484, model: gpt-4o, completion_tokens: 30} |
| tool_use::document_similarity | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 49b25ff51580d4fa | 140.337 | 2024-11-25T06:03:31.893913 | 2024-11-25T06:03:32.034250 | {"text":"how to install react","datasets":["spice.react.issues"],"limit":3} | | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {tool: document_similarity} |
| vector_search | b2a69503a1b83215603ead321eea6f61 | 49b25ff51580d4fa | 9feb8c9a54079cbd | 140.24200000000002 | 2024-11-25T06:03:31.894 | 2024-11-25T06:03:32.034242 | how to install react | | Error occurred interacting with datafusion: Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {limit: 3, tables: spice.react.issues} |
| text_embed | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 7f18fd8d96a1baeb | 122.771 | 2024-11-25T06:03:31.894072 | 2024-11-25T06:03:32.016843 | "how to install react" | | | {outputs_produced: 1} |
| sql_query | b2a69503a1b83215603ead321eea6f61 | 9feb8c9a54079cbd | 025f6b1e5cd502a6 | 16.892 | 2024-11-25T06:03:32.017320 | 2024-11-25T06:03:32.034212 | WITH ranked_docs as ( | | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {error_code: QueryExecutionError, protocol: Internal, query_execution_duration_ms: 8.20325, datasets: spice.react.issues, rows_produced: 0} |
| | | | | | | | SELECT id, dist, offset FROM ( | | | |
| | | | | | | | SELECT | | | |
| | | | | | | | | | | |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 1b2a3273a1dc4cdf | 5608.6359999999995 | 2024-11-25T06:03:32.034451 | 2024-11-25T06:03:37.643087 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1529, total_tokens: 1559, model: gpt-4o, completion_tokens: 30, stream: true} |
| tool_use::sample_data | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | 8bd10431fb18df87 | 2.275 | 2024-11-25T06:03:33.039583 | 2024-11-25T06:03:33.041858 | TopNSample({"dataset":"spice.react.issues","limit":3,"order_by":"created_at DESC"}) | | | {sample_method: top_n_sample, tool: top_n_sample} |
| sql_query | b2a69503a1b83215603ead321eea6f61 | 8bd10431fb18df87 | f1b2e06225aa2ba3 | 2.193 | 2024-11-25T06:03:33.039625 | 2024-11-25T06:03:33.041818 | SELECT * FROM spice.react.issues ORDER BY created_at DESC LIMIT 3 | | Failed to execute query: External error: Acceleration not ready; loading initial data for react.issues | {query_execution_duration_ms: 1.5519999, datasets: spice.react.issues, protocol: Internal, error_code: QueryExecutionError, rows_produced: 0} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | b649e4bbe8aac7a3 | 4601.02 | 2024-11-25T06:03:33.042037 | 2024-11-25T06:03:37.643057 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 1599, completion_tokens: 11, model: gpt-4o, stream: true, total_tokens: 1610} |
| tool_use::get_readiness | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | f40f0b4cd608de8b | 0.12999999999999998 | 2024-11-25T06:03:33.499867 | 2024-11-25T06:03:33.499997 | | {"dataset:call_center":"Ready","dataset:catalog_page":"Ready","dataset:catalog_returns":"Ready","dat | | {tool: get_readiness} |
| ai_completion | b2a69503a1b83215603ead321eea6f61 | 95411c59fc9c8cb8 | e018a56742b5064f | 4142.789 | 2024-11-25T06:03:33.500219 | 2024-11-25T06:03:37.643008 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {stream: true, completion_tokens: 254, prompt_tokens: 1920, model: gpt-4o, total_tokens: 2174} |
+-------------------------------+----------------------------------+------------------+------------------+-----------------------+----------------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

SELECT
task,
trace_id,
execution_duration_ms,
start_time,
SUBSTRING(input, 1, 100) AS input_preview,
SUBSTRING(captured_output, 1, 100) AS output_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE trace_id = (
SELECT trace_id
FROM spice.runtime.task_history
WHERE task = 'ai_chat'
ORDER BY start_time DESC
LIMIT 1
)
ORDER BY start_time;

+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
| task | trace_id | execution_duration_ms | start_time | input_preview | output_preview | error_message | labels |
+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+
| ai_chat | bb2de94b6f575b6c001f39cfded8bff4 | 5665.4169999999995 | 2024-11-25T07:02:43.240005 | {"messages":[{"role":"user","content":"how to install react"},{"role":"user","content":"top 3 recent | Here are the three most recent issues related to React, along with their summaries and links: | | {model: gpt-4o} |
| | | | | | | | |
| | | | | | 1. ** | | |
| tool_use::list_datasets | bb2de94b6f575b6c001f39cfded8bff4 | 0.157 | 2024-11-25T07:02:43.240048 | | [{"can_search_documents":true,"description":"React.js documentation and reference, from https://reac | | {tool: list_datasets} |
| ai_completion | bb2de94b6f575b6c001f39cfded8bff4 | 5664.964 | 2024-11-25T07:02:43.240416 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {prompt_tokens: 2688, total_tokens: 2716, completion_tokens: 28, model: gpt-4o, stream: true} |
| tool_use::document_similarity | bb2de94b6f575b6c001f39cfded8bff4 | 710.9609999999999 | 2024-11-25T07:02:44.344935 | {"text":"recent issues","datasets":["spice.react.issues"],"limit":3} | | | {tool: document_similarity} |
| vector_search | bb2de94b6f575b6c001f39cfded8bff4 | 710.842 | 2024-11-25T07:02:44.345018 | recent issues | {Full { catalog: "spice", schema: "react", table: "issues" }: VectorSearchTableResult { data: [Recor | | {limit: 3, tables: spice.react.issues} |
| text_embed | bb2de94b6f575b6c001f39cfded8bff4 | 562.453 | 2024-11-25T07:02:44.345072 | "recent issues" | | | {outputs_produced: 1} |
| sql_query | bb2de94b6f575b6c001f39cfded8bff4 | 147.672 | 2024-11-25T07:02:44.908148 | WITH ranked_docs as ( | [{"title_chunk":"app:lintVitalReleaseBug issu","id":"I_kwDOAJy2Ks47fqsI","title":"app:lintVitalRelea | | {datasets: spice.react.issues, protocol: Internal, query_execution_duration_ms: 139.69496, rows_produced: 3} |
| | | | | SELECT id, dist, offset FROM ( | | | |
| | | | | SELECT | | | |
| | | | | | | | |
| ai_completion | bb2de94b6f575b6c001f39cfded8bff4 | 3849.3140000000003 | 2024-11-25T07:02:45.055985 | {"messages":[{"role":"assistant","tool_calls":[{"id":"initial_list_datasets","type":"function","func | | | {model: gpt-4o, total_tokens: 3240, stream: true, completion_tokens: 271, prompt_tokens: 2969} |
+-------------------------------+----------------------------------+-----------------------+----------------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------------------+

SELECT
task,
start_time,
execution_duration_ms,
SUBSTRING(input, 1, 100) AS input_preview,
error_message,
labels
FROM spice.runtime.task_history
WHERE 'catalog_sales' = ANY(string_to_array(labels['datasets'], ','))
ORDER BY start_time DESC
LIMIT 5;

+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| task | start_time | execution_duration_ms | input_preview | error_message | labels |
+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| sql_query | 2024-11-25T07:15:31.278018 | 19.424 | with ss as ( select i_manufact_id,sum(ss_ext_sales_price) total_sales from store_sales | | {accelerated: true, protocol: FlightSQL, query_execution_duration_ms: 18.765831, rows_produced: 15, datasets: store_sales,customer_address,item,date_dim,web_sales,catalog_sales} |
| sql_query | 2024-11-25T07:15:31.233147 | 1.661 | select sum(cs_ext_discount_amt) as "excess discount amount" from catalog_sales ,item ,dat | | {query_execution_duration_ms: 1.380667, accelerated: true, datasets: date_dim,catalog_sales,item, rows_produced: 1, protocol: FlightSQL} |
| sql_query | 2024-11-25T07:15:30.889905 | 30.101 | select i_item_id ,i_item_desc ,s_store_id ,s_store_name ,stddev_samp(ss_quantit | | {query_execution_duration_ms: 29.511086, datasets: store_sales,store,item,date_dim,store_returns,catalog_sales, rows_produced: 0, protocol: FlightSQL, accelerated: true} |
| sql_query | 2024-11-25T07:15:30.658339 | 24.942 | select i_item_id, avg(cs_quantity) agg1, avg(cs_list_price) agg2, avg(cs_co | | {query_execution_duration_ms: 24.620039, protocol: FlightSQL, datasets: item,date_dim,catalog_sales,promotion,customer_demographics, accelerated: true, rows_produced: 73} |
| sql_query | 2024-11-25T07:15:30.574257 | 40.908 | select i_item_id ,i_item_desc ,s_store_id ,s_store_name ,min(ss_net_profit) as store_sales_prof | | {protocol: FlightSQL, rows_produced: 0, accelerated: true, datasets: store_returns,date_dim,store,store_sales,item,catalog_sales, query_execution_duration_ms: 40.29496} |
+-----------+----------------------------+-----------------------+------------------------------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
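The same spice.runtime.task_history table supports ad-hoc troubleshooting. A minimal sketch, using the columns shown in the examples above, that surfaces only the tasks that recorded an error:

SELECT
  task,
  start_time,
  SUBSTRING(input, 1, 100) AS input_preview,
  error_message
FROM spice.runtime.task_history
WHERE error_message IS NOT NULL -- keep only failed tasks
ORDER BY start_time DESC
LIMIT 20;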
Creates a model response for the given chat conversation.
Chat completion generated successfully
The specified model was not found
An internal server error occurred while processing the chat completion
Generate and optionally execute a natural-language text-to-SQL (NSQL) query.
This endpoint generates a SQL query using a natural language query (NSQL) and optionally executes it. The SQL query is generated by the specified model and executed if the Accept header is not set to application/sql.
The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.nsql.v1+json', 'application/sql', 'text/csv' or 'text/plain'. 'application/sql' will only return the SQL query generated by the model.
SQL query executed successfully
Invalid request parameters
Internal server error
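For example, to return only the generated SQL without executing it, the Accept header can be set to application/sql. A sketch following the NSQL request format shown later on this page (the query text and dataset name are placeholders):

POST /v1/nsql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: application/sql
Content-Type: application/json

{
  "query": "Get the top 5 customers by total sales",
  "datasets": ["sales_data"]
}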
List all models, both machine learning and language models, available in the runtime.
The format of the response (e.g., json or csv).
If true, includes the status of each model in the response.
List of models in JSON format
Internal server error occurred while processing models
Perform a vector similarity search (VSS) operation on a dataset.
The search operation returns the most relevant matches based on cosine similarity with the input text. The queried datasets must have an embedding column and the appropriate embedding model loaded.
Search completed successfully
Invalid request parameters
Internal server error
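As a minimal sketch of the dataset side of that requirement, assuming the spicepod column-level embeddings syntax (the app_messages dataset, the message column, the postgres source, and the openai_embeddings model name are all placeholders):

embeddings:
  - name: openai_embeddings
    from: openai
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}

datasets:
  - from: postgres:app_messages
    name: app_messages
    columns:
      - name: message
        embeddings:
          - from: openai_embeddings # embed the message column so it is vector-searchable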
Execute a SQL query and return the results.
This endpoint allows users to execute SQL queries directly from an HTTP request. The SQL query is sent as plain text in the request body.
The format of the response, one of 'application/json' (default), 'application/vnd.spiceai.sql.v1+json', 'text/csv' or 'text/plain'.
SpiceAI SQL v1+json result
Bad Request
Internal Server Error
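A sketch of a request to this endpoint, with the SQL statement sent as plain text in the body; host and API key follow the other examples on this page:

POST /v1/sql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: text/plain
Accept: application/json

SELECT task, COUNT(*) AS task_count FROM spice.runtime.task_history GROUP BY task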
{
"results": [
{
"matches": {
"message": "I booked use some tickets"
},
"dataset": "app_messages",
"primary_key": {
"id": "6fd5a215-0881-421d-ace0-b293b83452b5"
},
"data": {
"timestamp": 1724716542
},
"score": 0.914321
},
{
"matches": {
"message": "direct to Narata"
},
"dataset": "app_messages",
"primary_key": {
"id": "8a25595f-99fb-4404-8c82-e1046d8f4c4b"
},
"data": {
"timestamp": 1724715881
},
"score": 0.83221
},
{
"matches": {
"message": "Yes, we're sitting together"
},
"dataset": "app_messages",
"primary_key": {
"id": "8421ed84-b86d-4b10-b4da-7a432e8912c0"
},
"data": {
"timestamp": 1724716123
},
"score": 0.787654321
}
],
"duration_ms": 42
}

POST /v1/search HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: application/json
Accept: */*
Content-Length: 157
{
"datasets": [
"app_messages"
],
"text": "Tokyo plane tickets",
"where": "user=1234321",
"additional_columns": [
"timestamp"
],
"limit": 3,
"keywords": [
"plane",
"tickets"
]
}

{}

POST /v1/sql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: text
Content-Type: application/json
Content-Length: 2
{}

GitHub Data Connector Documentation
The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github as the selector in the from value for the dataset.
The from field takes the form github:github.com/<owner>/<repo>/<content>, where content can be files, issues, pulls, commits, or stargazers. See the dataset sections below for more configuration detail.
name: The dataset name. This will be used as the table name within Spice.
params: Connector parameters. GitHub Apps provide a secure and scalable way to integrate with GitHub's API.
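For example, a minimal dataset definition using this from format and a personal access token (the owner, repo, and dataset name are placeholders):

datasets:
  - from: github:github.com/<owner>/<repo>/issues
    name: my_repo.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}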
Limitations
With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.
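A sketch of a dataset configured with GitHub App Installation auth instead of a personal access token, using the parameters documented below (the secret names and dataset name are placeholders):

datasets:
  - from: github:github.com/<owner>/<repo>/pulls
    name: my_repo.pulls
    params:
      github_client_id: ${secrets:GITHUB_CLIENT_ID}
      github_private_key: ${secrets:GITHUB_PRIVATE_KEY}
      github_installation_id: ${secrets:GITHUB_INSTALLATION_ID}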
GitHub queries support a github_query_mode parameter, which can be set to either auto or search for the following types:
Issues: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.
Pull Requests: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.
Commits: Only supports auto mode. Filter push down is enabled only for the committed_date column. committed_date supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'.
When set to search, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:
author and state; support exact matches or NOT matches. For example, WHERE author = 'peasee' or WHERE author <> 'peasee'.
body and title; support exact matches or LIKE matches. For example, WHERE body LIKE '%duckdb%'.
All other filters are supported when github_query_mode is set to search, but cannot be pushed down to the GitHub API for improved performance.
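For instance, with github_query_mode set to search on an issues dataset (named spiceai.issues as in the example later on this page), the author and title filters in a query like the following can be pushed down to the GitHub Search API:

SELECT title, state
FROM spiceai.issues
WHERE author = 'peasee'      -- exact match, pushed down in search mode
  AND title LIKE '%duckdb%'; -- LIKE match, pushed down in search mode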
Limitations
GitHub has a limitation in the Search API where it may return more stale data than the standard API used in the default query mode.
GitHub has a limitation in the Search API where it only returns a maximum of 1000 results for a query. Use an append-mode acceleration to retrieve more results over time; see the pull requests example below.
Limitations
The content column is fetched only when acceleration is enabled.
Querying GitHub files does not support filter push down, which may result in long query times when acceleration is disabled.
ref - Required. Specifies the GitHub branch or tag to fetch files from.
include - Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE committed_date > '2024-09-24'.
Setting github_query_mode
Limitations
Querying with filters on date columns requires dates in ISO8601 format. For example, WHERE starred_at > '2024-09-24'.
Setting github_query_mode
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "gpt-4o-mini",
"system_fingerprint": "fp_44709d6fcb",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?"
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21,
"completion_tokens_details": {
"reasoning_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
}
}

POST /v1/nsql HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: text
Content-Type: application/json
Content-Length: 117
{
"query": "Get the top 5 customers by total sales",
"model": "nql",
"sample_data_enabled": true,
"datasets": [
"sales_data"
]
}

{
"object": "list",
"data": [
{
"id": "gpt-4",
"object": "model",
"owned_by": "openai",
"datasets": null,
"status": "ready"
},
{
"id": "text-embedding-ada-002",
"object": "model",
"owned_by": "openai-internal",
"datasets": [
"text-dataset-1",
"text-dataset-2"
],
"status": "ready"
}
]
}

POST /v1/chat/completions HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Content-Type: application/json
Accept: */*
Content-Length: 143
{
"model": "gpt-4o",
"messages": [
{
"role": "developer",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": false
}

GET /v1/models HTTP/1.1
Host: data.spiceai.io
X-API-KEY: YOUR_API_KEY
Accept: */*
updated_at, created_at, merged_at and closed_at; support exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'.
Setting github_query_mode to search is not supported.
Parameters

github_token: Required. GitHub personal access token to use to connect to the GitHub API. Learn more.
github_client_id: Required. Specifies the client ID for GitHub App Installation auth mode.
github_private_key: Required. Specifies the private key for GitHub App Installation auth mode.
github_installation_id: Required. Specifies the installation ID for GitHub App Installation auth mode.
github_query_mode: Optional. Specifies whether the connector should use the GitHub search API for improved filter performance. Defaults to auto; possible values are auto or search.
owner: Required. Specifies the owner of the GitHub repository.
repo: Required. Specifies the name of the GitHub repository.

files columns (column, type, nullable):
name            Utf8          YES
path            Utf8          YES
size            Int64         YES
sha             Utf8          YES
mode            Utf8          YES
url             Utf8          YES
download_url    Utf8          YES
content         Utf8          YES

issues columns (column, type, nullable):
assignees          List(Utf8)      YES
author             Utf8            YES
body               Utf8            YES
closed_at          Timestamp       YES
comments           List(Struct)    YES
created_at         Timestamp       YES
id                 Utf8            YES
labels             List(Utf8)      YES
milestone_id       Utf8            YES
milestone_title    Utf8            YES
comments_count     Int64           YES
number             Int64           YES
state              Utf8            YES
title              Utf8            YES
updated_at         Timestamp       YES
url                Utf8            YES

pulls columns (column, type, nullable):
additions         Int64         YES
assignees         List(Utf8)    YES
author            Utf8          YES
body              Utf8          YES
changed_files     Int64         YES
closed_at         Timestamp     YES
comments_count    Int64         YES
commits_count     Int64         YES
created_at        Timestamp     YES
deletions         Int64         YES
hashes            List(Utf8)    YES
id                Utf8          YES
labels            List(Utf8)    YES
merged_at         Timestamp     YES
number            Int64         YES
reviews_count     Int64         YES
state             Utf8          YES
title             Utf8          YES
url               Utf8          YES

commits columns (column, type, nullable):
additions            Int64        YES
author_email         Utf8         YES
author_name          Utf8         YES
committed_date       Timestamp    YES
deletions            Int64        YES
id                   Utf8         YES
message              Utf8         YES
message_body         Utf8         YES
message_head_line    Utf8         YES
sha                  Utf8         YES

stargazers columns (column, type, nullable):
starred_at    Timestamp    YES
login         Utf8         YES
              Utf8         YES
name          Utf8         YES
company       Utf8         YES
x_username    Utf8         YES
location      Utf8         YES
avatar_url    Utf8         YES
bio           Utf8         YES
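The schemas above can also be inspected at query time; a sketch, assuming DESCRIBE support in the runtime's SQL engine and the spiceai.issues dataset name used in the examples below:

sql> describe spiceai.issues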
[
{
"customer_id": "12345",
"total_sales": 150000
},
{
"customer_id": "67890",
"total_sales": 125000
}
]

datasets:
  - from: github:github.com/<owner>/<repo>/files/<ref>
    name: spiceai.files
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      include: '**/*.json; **/*.yaml'
    acceleration:
      enabled: true

datasets:
  - from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
    name: spiceai.files
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      include: '**/*.txt' # include txt files only
    acceleration:
      enabled: true

sql> select * from spiceai.files
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| name | path | size | sha | mode | url | download_url | content |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| version.txt | version.txt | 12 | ee80f747038c30e776eecb2c2ae155dec9a68187 | 100644 | https://api.github.com/repos/spiceai/spiceai/git/blobs/ee80f747038c30e776eecb2c2ae155dec9a68187 | https://raw.githubusercontent.com/spiceai/spiceai/v0.17.2-beta/version.txt | 0.17.2-beta |
| | | | | | | | |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
Time: 0.005067 seconds. 1 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}

sql> select title, state, labels from spiceai.issues where title like '%duckdb%'
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| title | state | labels |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| Limitation documentation duckdb accelerator about nested struct and decimal256 | CLOSED | [kind/documentation] |
| Inconsistent duckdb connector params: `params.open` and `params.duckdb_file` | CLOSED | [kind/bug] |
| federation across multiple duckdb acceleration tables. | CLOSED | [] |
| Integration tests to cover "On Conflict" behaviors for duckdb accelerator | CLOSED | [kind/task] |
| Permission denied issue while using duckdb data connector with spice using HELM for Kubernetes deployment | CLOSED | [kind/bug] |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
Time: 0.011877542 seconds. 5 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select title, url, state from spiceai.pulls where title like '%GitHub connector%'
+---------------------------------------------------------------------+----------------------------------------------+--------+
| title | url | state |
+---------------------------------------------------------------------+----------------------------------------------+--------+
| GitHub connector: convert `labels` and `hashes` to primitive arrays | https://github.com/spiceai/spiceai/pull/2452 | MERGED |
+---------------------------------------------------------------------+----------------------------------------------+--------+
Time: 0.034996667 seconds. 1 rows.

datasets:
  - from: github:github.com/spiceai/spiceai/pulls
    name: spiceai.pulls
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      github_query_mode: search
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 6h # check for new results every 6 hours
      refresh_data_window: 90d # at initial load, load the last 90 days of pulls

datasets:
  - from: github:github.com/<owner>/<repo>/commits
    name: spiceai.commits
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/commits
    name: spiceai.commits
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select sha, message_head_line from spiceai.commits limit 10
+------------------------------------------+------------------------------------------------------------------------+
| sha | message_head_line |
+------------------------------------------+------------------------------------------------------------------------+
| 2a9fab7905737e1af182e17f40aecc5c4b5dd236 | wait 2 seconds for the status to turn ready in refreshing status tes… |
| b9c210a818abeaf14d2493fde5227781f47faed8 | Update README.md - Remove bigquery from tablet of connectors (#1434) |
| d61e1af61ebf826f83703b8dd939f19e8b2ba426 | Add databricks_use_ssl parameter (#1406) |
| f1ec55c5986e3e5d57eff94197182ffebbae1045 | wording and logs change reflected on readme (#1435) |
| bfc74185584d1e048ef66c72ce3572a0b652bfd9 | Update acknowledgements (#1433) |
| 0d870f1791d456e7924b4ecbbda5f3b762db1e32 | Update helm version and use v0.13.0-alpha (#1436) |
| 12f930cbad69833077bd97ea43599a75cff985fc | Enable push-down federation by default (#1429) |
| 6e4521090aaf39664bd61d245581d34398ce77db | Add functional tests for federation push-down (#1428) |
| fa3279b7d9fcaa5e8baaa2425f69b556bb30e309 | Add LRU cache support for http-based sql queries (#1410) |
| a3f93dde9d1312bfbf14f7ae3b75bdc468289212 | Add guides and examples about error handling (#1427) |
+------------------------------------------+------------------------------------------------------------------------+
Time: 0.0065395 seconds. 10 rows.

datasets:
  - from: github:github.com/<owner>/<repo>/stargazers
    name: spiceai.stargazers
    params:
      github_token: ${secrets:GITHUB_TOKEN}

datasets:
  - from: github:github.com/spiceai/spiceai/stargazers
    name: spiceai.stargazers
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true

sql> select starred_at, login from spiceai.stargazers order by starred_at DESC limit 10
+----------------------+----------------------+
| starred_at | login |
+----------------------+----------------------+
| 2024-09-15T13:22:09Z | cisen |
| 2024-09-14T18:04:22Z | tyan-boot |
| 2024-09-13T10:38:01Z | yofriadi |
| 2024-09-13T10:01:33Z | FourSpaces |
| 2024-09-13T04:02:11Z | d4x1 |
| 2024-09-11T18:10:28Z | stephenakearns-insta |
| 2024-09-09T22:17:42Z | Lrs121 |
| 2024-09-09T19:56:26Z | jonathanfinley |
| 2024-09-09T07:02:10Z | leookun |
| 2024-09-09T03:04:27Z | royswale |
+----------------------+----------------------+
Time: 0.0088075 seconds. 10 rows.