Add an OpenAI model and chat with the NYC Taxi Trips dataset
An OpenAI API Platform account and API key is required.
Navigate to the Code tab.
In the Components sidebar, click the Model Providers tab and select OpenAI.
Enter the Model name.
Enter the Model ID (e.g. `gpt-4o`).
Set the OpenAI API Key secret. API keys and other secrets are securely stored and encrypted.
Insert `tools: auto` in the `params` section of the `gpt-4o` model to automatically connect datasets to the model.
The final Spicepod configuration in the editor should be as follows:
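A minimal sketch of the resulting configuration (the dataset `from` path and the `openai_api_key` parameter and secret names are illustrative assumptions; the exact YAML generated by the Portal may differ):

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips  # illustrative dataset source
    name: taxi_trips

models:
  - from: openai:gpt-4o        # OpenAI model ID, per the step above
    name: gpt-4o
    params:
      tools: auto              # automatically connects datasets to the model
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}  # assumed secret name
```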
Click Save in the code toolbar and then Deploy in the popup card that appears in the bottom right to deploy the changes.
Navigate to Playground and select AI Chat in the sidebar.
Ask a question about the NYC Taxi Trips dataset in the chat. For example:
"What datasets are available?"
"What is the average fare amount of a taxi trip?"
Replace `[API-KEY]` in the sample below with the app API Key and execute in a terminal.
🎉 Congratulations, you've now added an OpenAI model and can use it to ask questions of the NYC Taxi Trips dataset.
Continue to Next Steps to explore use-cases to do more with the Spice.ai Cloud Platform.
Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Discord.
Welcome to the Spice.ai Cloud Platform documentation!
The Spice.ai Cloud Platform is an AI application and agent cloud; an AI-backend-as-a-service comprising composable, ready-to-use AI and agent building blocks including high-speed SQL query, LLM inference, Vector Search, and RAG, built on cloud-scale, managed Spice.ai OSS.
This documentation pertains to the Spice.ai Cloud Platform.
With the Spice.ai Cloud Platform, powered by Spice.ai OSS, you can:
Query and accelerate data: Run high-performance SQL queries across multiple data sources with results optimized for AI applications and agents.
Use AI Models: Perform large language model (LLM) inference with major providers including OpenAI, Anthropic, and Grok for chat, completion, and generative AI workflows.
Collaborate on Spicepods: Share, fork, and manage datasets, models, embeddings, evals, and tools in a collaborative, community-driven hub indexed by spicerack.org.
Use-Cases
Fast, virtualized data views: Build specialized “small data” warehouses to serve fast, virtualized views across large datasets for applications, APIs, dashboards, and analytics.
Performance and reliability: Manage replicas of hot data, cache SQL queries and AI results, and load-balance AI services to improve resiliency and scalability.
Production-grade AI workflows: Use Spice.ai Cloud as a data and AI proxy for secure, monitored, and compliant production environments, complete with advanced observability and performance management.
Take it for a spin by starting with the getting started guide.
Feel free to ask the team any questions in Discord.
Sign in to the Portal with GitHub
A GitHub account is required to access the Spice.ai Cloud Platform. If you don't have one, you can create an account here.
Go to spice.ai and click on Login or Try for Free in the top right corner.
You can also navigate directly by URL to spice.ai/login
Click Continue with GitHub to login with your GitHub account.
Click Authorize Spice.ai Cloud Platform.
You will be redirected to the new application page.
Continue to Step 2 to configure your first Spice application.
Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Discord.
For documentation on the self-hostable Spice.ai OSS Project, please visit docs.spiceai.org.
Learn more about building AI applications and agents with the Spice.ai Cloud Platform.
Frequently asked questions
Spice.ai OSS is an open-source project created by the Spice AI team that provides a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
The Spice.ai Cloud Platform is a data and AI application platform that provides a set of building-blocks to create AI and agentic applications. Building blocks include a cloud-data-warehouse, ML model training and inference, and a cloud-scale, managed Spice.ai OSS cloud-hosted service.
It's free to get an API key to use the Community Edition.
For customers who need resource limits, service-level guarantees, or priority support, we offer high-value paid tiers based on usage.
We offer enterprise-grade support with an SLA for Enterprise Plans.
For standard plans we offer best-effort community support in Discord.
See Security. The Spice.ai Cloud Platform is SOC 2 Type II compliant.
Spice.ai OSS is built on Apache DataFusion and uses the PostgreSQL dialect.
Federated SQL Query documentation
Spice supports federated queries, enabling you to join and combine data from multiple sources, including databases (PostgreSQL, MySQL), data warehouses (Databricks, Snowflake, BigQuery), and data lakes (S3, MinIO). For a full list of supported sources, see Data Connectors.
The Playground SQL Explorer is the fastest way to get started with federated queries, debugging queries, and iterating quickly. The SQL Query Editor can be accessed by clicking on the SQL Explorer tab after selecting Playground in the app navigation bar.
See SQL Query for further documentation on using the SQL Query Editor.
For production applications, leveraging the high-performance Apache Arrow Flight endpoint is recommended. The Spice SDKs always query using Arrow Flight.
See Apache Arrow Flight API for further documentation on using Apache Arrow Flight APIs.
SQL Query is also accessible via a standard HTTP API.
See HTTP API for further documentation on using the HTTP SQL API.
Playground
Start experimenting in the Playground
Features
Explore the features of the platform
Use-Cases
Explore use-cases for Spice.ai.
Configure local acceleration for datasets in Spice for faster queries
Datasets can be locally accelerated by the Spice runtime, pulling data from any Data Connector and storing it locally in a Data Accelerator for faster access. The data can be kept up-to-date in real-time or on a refresh schedule, ensuring users always have the latest data locally for querying.
Dataset acceleration is enabled by setting the `acceleration` configuration. Spice currently supports In-Memory Arrow, DuckDB, SQLite, and PostgreSQL as accelerators. For engine-specific configuration, see the Data Accelerator documentation.
Spice supports three modes to refresh/update locally accelerated data from a connected data source. `full` is the default mode. Refer to the Data Refresh documentation for detailed refresh usage and configuration.
| Refresh mode | Behavior | Example use |
| --- | --- | --- |
| `full` | Replace/overwrite the entire dataset on each refresh | A table of users |
| `append` | Append/add data to the dataset on each refresh | Append-only, immutable datasets, such as time-series or log data |
| `changes` | Apply incremental changes | Customer order lifecycle table |
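A minimal sketch of enabling acceleration with a refresh mode (the dataset source and the refresh interval field name are illustrative assumptions):

```yaml
datasets:
  - from: postgres:public.orders    # illustrative source
    name: orders
    acceleration:
      enabled: true
      refresh_mode: append          # full (default), append, or changes
      refresh_check_interval: 10s   # assumed field name for the refresh schedule
```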
Database indexes are essential for optimizing query performance. Configure indexes for accelerators via the `indexes` field. For detailed configuration, refer to the index documentation.
Constraints enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality and configure behavior for data updates that violate constraints.
Constraints are specified using column references in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the `indexes` field with the value `unique`. Data that violates these constraints will result in a conflict. For constraints configuration details, visit the Constraints documentation.
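A sketch of a primary key and a unique index on an accelerated dataset (the dataset source and column names are illustrative assumptions):

```yaml
datasets:
  - from: postgres:public.orders    # illustrative source
    name: orders
    acceleration:
      enabled: true
      primary_key: order_id         # primary-key constraint
      indexes:
        email: unique               # additional unique constraint via the indexes field
```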
To use DuckDB as a Data Accelerator, specify `duckdb` as the `engine` for acceleration.
Spice.ai currently only supports `mode: memory` for the DuckDB accelerator.
Configuration `params` are provided in the `acceleration` section for a data store. Other common `acceleration` fields can be configured for DuckDB; see Datasets.
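A minimal sketch of a DuckDB-accelerated dataset (the dataset source is an illustrative assumption):

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips  # illustrative source
    name: taxi_trips
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory      # currently the only supported mode for DuckDB on Spice.ai
```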
LIMITATIONS
The DuckDB accelerator does not support nested lists, or structs with nested structs/lists field types. For example:
Supported:
SELECT {'x': 1, 'y': 2, 'z': 3}
Unsupported:
SELECT [['duck', 'goose', 'heron'], ['frog', 'toad']]
SELECT {'x': [1, 2, 3]}
The DuckDB accelerator does not support enum, dictionary, or map field types. For example:
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB accelerator does not support `Decimal256` (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
Updating a dataset with DuckDB acceleration while the Spice Runtime is running (hot-reload) will disable query federation for the DuckDB accelerator until the Runtime is restarted.
MEMORY CONSIDERATIONS
When accelerating a dataset using `mode: memory` (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
To use SQLite as a Data Accelerator, specify `sqlite` as the `engine` for acceleration.
The connection to SQLite can be configured by providing the following `params`:
- `busy_timeout`: Optional. Specifies the duration for the SQLite busy timeout when connecting to the database file. Default: 5000 ms.
Configuration `params` are provided in the `acceleration` section of a dataset. Other common `acceleration` fields can be configured for SQLite; see Datasets.
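A minimal sketch of a SQLite-accelerated dataset using the `busy_timeout` parameter (the dataset source and the duration format are illustrative assumptions):

```yaml
datasets:
  - from: postgres:public.orders   # illustrative source
    name: orders
    acceleration:
      enabled: true
      engine: sqlite
      params:
        busy_timeout: 10s          # assumed duration format; default is 5000 ms
```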
LIMITATIONS
The SQLite accelerator doesn't support advanced grouping features such as `ROLLUP` and `GROUPING`.
In SQLite, `CAST(value AS DECIMAL)` doesn't convert an integer to a floating-point value if the casted value is an integer. Operations like `CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL)` will be treated as integer division, resulting in 0 instead of the expected 0.5. Use `FLOAT` to ensure conversion to a floating-point value: `CAST(1 AS FLOAT) / CAST(2 AS FLOAT)`.
Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will disable query federation for the SQLite accelerator until the Runtime is restarted.
The SQLite accelerator doesn't support Arrow `Interval` types, as SQLite doesn't have a native interval type.
The SQLite accelerator only supports Arrow `List` types of primitive data types; lists with structs are not supported.
MEMORY CONSIDERATIONS
When accelerating a dataset using `mode: memory` (the default), some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Add a dataset and query it using SQL Query in the Playground
To add a dataset to the Spice app, navigate to the Code tab.
Use the Components sidebar on the right to select from available Data Connectors, Model Providers, and ready-to-use Datasets.
Navigate to the Code tab.
In the Components sidebar, click the Datasets tab.
Select and add the NYC Taxi Trips dataset.
Note the configuration has been added to the editor.
Click Save in the code toolbar and then Deploy on the popup card that appears in the bottom right.
Navigate to the Playground tab, open the dataset reference, and click on the `spice.samples.taxi_trips` dataset to insert a sample query into the SQL editor. Then, click Run Selection.
Go to app Settings and copy one of the app API Keys.
Replace `[API-KEY]` in the sample below with your API Key and execute from a terminal.
🎉 Congratulations, you've now added a dataset and queried it.
Continue to Step 4 to add an AI Model and chat with the dataset.
Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Discord.
Define semantic data models in Spice to improve dataset understanding for AI
A semantic model is a structured representation of data that captures the meaning and relationships between elements in a dataset.
In Spice, semantic models transform raw data into meaningful business concepts by defining metadata, descriptions, and relationships at both the dataset and column level. This makes the data more interpretable for both AI language models and human analysis.
The semantic model is automatically used by Spice Models as context to produce more accurate and context-aware AI responses.
Semantic data models are defined within the `spicepod.yaml` file, specifically under the `datasets` section. Each dataset supports `description`, `metadata`, and a `columns` field where individual columns are described with metadata and features for utility and clarity.
Example `spicepod.yaml`:
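A sketch of dataset- and column-level semantic metadata (the dataset source, column names, and URL template are illustrative assumptions):

```yaml
datasets:
  - from: postgres:public.orders           # illustrative source
    name: orders
    description: Customer orders, one row per order
    metadata:
      instructions: Use this dataset to answer questions about customer orders.
      reference_url_template: https://example.com/orders/{order_id}   # illustrative template
    columns:
      - name: order_total
        description: Total order value in USD
```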
Datasets can be defined with the following metadata:
- `instructions`: Optional. Instructions to provide to a language model when using this dataset.
- `reference_url_template`: Optional. A URL template for citation links.
For detailed `metadata` configuration, see the Spice OSS Dataset Reference.
Each column in the dataset can be defined with the following attributes:
- `description`: Optional. A description of the column's contents and purpose.
- `embeddings`: Optional. Vector embeddings configuration for this column.
For detailed `columns` configuration, see the Spice OSS Dataset Reference.
Use the advanced search and retrieval capabilities of Spice
Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and Vector-Similarity Search functionality.
Spice supports basic search patterns directly through SQL, leveraging its SQL query features. For example, you can perform a text search within a table using SQL's `LIKE` clause:
Spice also provides advanced Vector-Similarity Search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:
Local embedding models, e.g. sentence-transformers/all-MiniLM-L6-v2.
Remote embedding providers, e.g. OpenAI.
See Model Providers to view all supported providers
Embedding models are defined in the `spicepod.yaml` file as top-level components.
Datasets can be augmented with embeddings targeting specific columns, to enable search capabilities through similarity searches.
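A sketch of an embedding model and a dataset column configured for embeddings (the dataset source, component names, embedding model ID, secret name, and the exact field names inside the per-column `embeddings` entry are assumptions):

```yaml
embeddings:
  - from: openai:text-embedding-3-small   # remote provider; model ID is illustrative
    name: openai_embeddings
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}   # assumed secret name

datasets:
  - from: github:github.com/spiceai/spiceai/issues      # illustrative source
    name: issues
    columns:
      - name: body
        embeddings:
          - from: openai_embeddings       # reference to the embedding component above
```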
By defining embeddings on the `body` column, Spice is now configured to execute similarity searches on the dataset.
For more details, see the API reference for /v1/search.
Spice also supports vector search on datasets with preexisting embeddings. See below for compatibility details.
Spice supports chunking of content before embedding, which is useful for large text columns such as those found in Document Tables. Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.
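A sketch of chunking configured within a column's embedding configuration (the `chunking` field and parameter names are assumptions; consult the Spice OSS reference for the exact names):

```yaml
columns:
  - name: body
    embeddings:
      - from: openai_embeddings
        chunking:
          enabled: true
          target_chunk_size: 512   # approximate chunk size in tokens (assumed parameter name)
```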
The `body` column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the `additional_columns` list.
For example:
Response:
Datasets that already include embeddings can utilize the same functionalities (e.g., vector search) as those augmented with embeddings using Spice. To ensure compatibility, these table columns must adhere to the following constraints:
Underlying Column Presence: The underlying column must exist in the table, and be of the `string` Arrow data type.
Embeddings Column Naming Convention: For each underlying column, the corresponding embeddings column must be named `<column_name>_embedding`. For example, a `customer_reviews` table with a `review` column must have a `review_embedding` column.
Embeddings Column Data Type: The embeddings column must have the following Arrow data type when loaded into Spice: `FixedSizeList[Float32 or Float64, N]`, where `N` is the dimension (size) of the embedding vector. `FixedSizeList` is used for efficient storage and processing of fixed-size vectors. If the column is chunked, use `List[FixedSizeList[Float32 or Float64, N]]`.
Offset Column for Chunked Data: If the underlying column is chunked, there must be an additional offset column named `<column_name>_offsets` with the following Arrow data type: `List[FixedSizeList[Int32, 2]]`, where each element is a pair of integers `[start, end]` representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column. For instance, `[[0, 100], [101, 200]]` indicates two chunks covering indices 0–100 and 101–200, respectively.
By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.
Example
A table `sales` with an `address` column and corresponding embedding column(s).
The same table if it was chunked:
The In-Memory Arrow Data Accelerator is the default data accelerator in Spice. It uses Apache Arrow to store data in-memory for fast access and query performance.
To use the In-Memory Arrow Data Accelerator, no additional configuration is required beyond enabling acceleration. Arrow can also be specified explicitly by setting `arrow` as the `engine` for acceleration, as shown in the sketch below.
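A minimal sketch (the dataset source is an illustrative assumption):

```yaml
datasets:
  - from: postgres:public.orders   # illustrative source
    name: orders
    acceleration:
      enabled: true        # In-Memory Arrow is the default engine
      # engine: arrow      # optional: specify the engine explicitly
```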
Limitations
The In-Memory Arrow Data Accelerator does not support persistent storage. Data is stored in-memory and will be lost when the Spice runtime is stopped.
The In-Memory Arrow Data Accelerator does not support `Decimal256` (76 digits), as it exceeds Arrow's maximum Decimal width of 38 digits.
The In-Memory Arrow Data Accelerator does not support indexes.
The In-Memory Arrow Data Accelerator only supports primary-key constraints, not `unique` constraints.
With Arrow acceleration, mathematical operations like `value1 / value2` are treated as integer division if the values are integers. For example, `1 / 2` will result in 0 instead of the expected 0.5. Cast to FLOAT to ensure conversion to a floating-point value: `CAST(1 AS FLOAT) / CAST(2 AS FLOAT)` (or `CAST(1 AS FLOAT) / 2`).
AI Gateway documentation
Spice provides a high-performance, OpenAI API-compatible AI Gateway optimized for managing and scaling large language models (LLMs). Additionally, Spice offers tools for Enterprise Retrieval-Augmented Generation (RAG), such as SQL query across federated datasets and an advanced search feature (see Search).
Spice supports full OpenTelemetry observability, enabling detailed tracking of data flows and requests for full transparency and easier debugging.
Spice supports a variety of LLMs, including OpenAI, Azure OpenAI, Anthropic, Groq, Hugging Face, and more (see Model Providers for all supported models).
Custom Tools: Equip models with tools to interact with the Spice runtime.
System Prompts: Customize system prompts and override defaults for `v1/chat/completion`.
For detailed configuration and API usage, refer to the API Documentation.
To use a language model hosted on OpenAI (or compatible), specify the `openai` path and model ID in `from`.
Example `spicepod.yml`:
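A minimal sketch (the model name and the API key parameter and secret names are assumptions):

```yaml
models:
  - from: openai:gpt-4o    # openai path with the model ID
    name: gpt-4o
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}  # assumed parameter and secret names
```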
For details, see OpenAI (or Compatible) Language Models.
First-class, built-in observability to understand the operations Spice performs.
Observability in Spice enables task tracking and performance monitoring through a built-in distributed tracing system that can export to Zipkin or be viewed via the `runtime.task_history` SQL table.
Spice records detailed information about runtime operations through trace IDs, timings, and labels - from SQL queries to AI completions. This task history system helps operators monitor performance, debug issues, and understand system behavior across individual requests and overall patterns.
Trace AI chat completion steps and tool interactions to identify why a request isn't responding as expected
Investigate failed queries and other task errors
Track SQL query/tool use execution times
Identify slow-running tasks
Track usage patterns by protocol and dataset
Understand how AI models are using tools to retrieve data from the datasets available to them
The Spice platform provides a built-in UI for visualizing the observability traces that Spice OSS generates.
Create your first Spice app
Once signed in with GitHub, you will be redirected to the new application page. Set a name, add a model provider, and optionally select one of the ready-to-use datasets.
Enter a name for the application.
Choose a model provider and provide an API key.
Optionally select one or more of the available datasets. Datasets can also be added later.
Click Create application.
It will take up to 30 seconds to create and provision a dedicated Spice.ai OSS instance for the application.
Once the application instance is loaded, you will be redirected to the Playground.
Executing the `show tables` query will show the default datasets available for the app.
🎉 Congrats, you've created your first Spice app!
Continue to Step 3 to add a dataset and query it.
Need help? Ask a question, raise issues, and provide feedback to the Spice AI team on Discord.
Export observability traces from Spice into Zipkin
In addition to the built-in `runtime.task_history` SQL table, Spice can export the observability traces it collects to Zipkin.
Zipkin export is defined in the `spicepod.yaml` under the `runtime.tracing` section:
- `zipkin_enabled`: Optional. Default `false`. Enables or disables the Zipkin trace export.
- `zipkin_endpoint`: Required if `zipkin_enabled` is true. The path to the `/api/v2/spans` endpoint on the Zipkin instance to export to.
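A minimal sketch (the Zipkin host shown is an illustrative assumption):

```yaml
runtime:
  tracing:
    zipkin_enabled: true
    zipkin_endpoint: http://localhost:9411/api/v2/spans   # assumed local Zipkin instance
```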
To use PostgreSQL as a Data Accelerator, specify `postgres` as the `engine` for acceleration.
The connection to PostgreSQL can be configured by providing the following `params`:
- `pg_host`: The hostname of the PostgreSQL server.
- `pg_port`: The port of the PostgreSQL server.
- `pg_db`: The name of the database to connect to.
- `pg_user`: The username to connect with.
- `pg_sslmode`: Optional. Specifies the SSL/TLS behavior for the connection. Supported values:
  - `verify-full`: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
  - `verify-ca`: This mode requires a TLS connection and a valid root certificate.
  - `require`: This mode requires a TLS connection.
  - `prefer`: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
  - `disable`: This mode will not attempt to use a TLS connection, even if the server supports it.
- `pg_sslrootcert`: Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
- `connection_pool_size`: Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
Configuration `params` are provided in the `acceleration` section of a dataset.
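A sketch of a PostgreSQL-accelerated dataset (the dataset source and connection values are illustrative assumptions):

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips  # illustrative source
    name: taxi_trips
    acceleration:
      enabled: true
      engine: postgres
      params:
        pg_host: localhost
        pg_port: "5432"
        pg_db: spice_acceleration
        pg_user: postgres
        pg_pass: ${secrets:my_pg_pass}
```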
LIMITATIONS
Postgres federated queries may produce unexpected result types due to differences between DataFusion and Postgres type-widening rules. Explicitly specify the expected output type of aggregation functions when writing queries involving Postgres tables in Spice. For example, rewrite `SUM(int_col)` as `CAST(SUM(int_col) AS BIGINT)`.
Spice Machine Learning (ML) Models
Spice Models are in beta for Design Partners. Get in touch for more info.
Spice Models enable the training and use of ML models natively on the Spice platform.
The platform currently supports time-series forecasting models, with other categories of models planned.
`model.yaml` files committed to the connected repository will be automatically detected and imported as Spice Models.
Navigating to a specific Model will show detailed information as defined in the `model.yaml`.
A training run can be started using the Train button.
Training runs in progress will be shown and updated, along with historical training runs.
The Training Status will be updated to `Complete` for successfully completed training runs. Details and the Training Report are available on the Training Run page.
Spice Models (beta) currently supports time-series forecasting.
Additional categories of data science and machine learning are on our roadmap.
A successfully trained model can be used to make predictions.
The lookback data (inferencing data) is automatically provided by the platform and wired up to the inference, enabling a prediction to be made using a simple API call.
Navigate to AI Predictions in the Playground.
Successfully trained models will be available for selection from the model selector drop down on the right.
Clicking Predict will demonstrate calling the predictions API using lookback data within the Spice platform. A graph of the predicted value(s) along with the lookback data will be displayed.
The Training Runs page provides training details, including a copyable `curl` command to make a prediction from the command line.
ClickHouse Data Connector Documentation
ClickHouse is a fast, open-source columnar database management system designed for online analytical processing (OLAP) and real-time analytics. This connector enables federated SQL queries from a ClickHouse server.
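A sketch of a ClickHouse dataset definition (connection values are illustrative assumptions):

```yaml
datasets:
  - from: clickhouse:my.dataset
    name: my_dataset
    params:
      clickhouse_host: localhost
      clickhouse_tcp_port: "9000"        # ClickHouse native TCP port
      clickhouse_db: my
      clickhouse_user: my_user
      clickhouse_pass: ${secrets:my_clickhouse_pass}
```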
from
The `from` field for the ClickHouse connector takes the form `clickhouse:db.dataset` where `db.dataset` is the path to the dataset within ClickHouse. In the example above it would be `my.dataset`.
If `db` is not specified in either the `from` field or the `clickhouse_db` parameter, it will default to the `default` database.
name
The dataset name. This will be used as the table name within Spice.
params
The ClickHouse data connector can be configured by providing the following `params`:
Learn how to use Data Connectors to query external data.
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Supported Data Connectors include:
For data connectors that are object store compatible, if a folder is provided, the file format must be specified with `params.file_format`.
If a file is provided, the file format will be inferred, and `params.file_format` is unnecessary.
File formats currently supported are:
Note Document formats in Alpha (e.g. pdf, docx) may not parse all structure or text from the underlying documents correctly.
`pg_pass`: The password to connect with. Use the secret replacement syntax to load the password from a secret store, e.g. `${secrets:my_pg_pass}`.
The table below lists the supported Arrow data types and their mappings when stored in PostgreSQL.
Hosted models have first-class access to co-located data for training and inferencing including: , , and . Additionally, can be leveraged to train and infer up to 10x faster.
Models are defined using a YAML file. Model details such as data requirements, architecture, training parameters, and other important hyperparameters are defined in the model.yaml.
Add a `model.yaml` file to the repository path `/models/[model_name]/model.yaml` of a connected repository, replacing `[model_name]` with the desired model name. For example, a model named `gas-fees` uses the path `/models/gas-fees/model.yaml`.
Refer to the for all available configuration options.
For example model manifests, see the .
In the Portal, navigate to the Models tab of the Spice app.
For details on the API, see .
File formats support additional parameters in the `params` (like `csv_has_header`).
If a format is a document format, each file will be treated as a document, as per below.
| Arrow Data Type | Column Type | PostgreSQL Data Type |
| --- | --- | --- |
| `Int8` | `TinyInteger` | `smallint` |
| `Int16` | `SmallInteger` | `smallint` |
| `Int32` | `Integer` | `integer` |
| `Int64` | `BigInteger` | `bigint` |
| `UInt8` | `TinyUnsigned` | `smallint` |
| `UInt16` | `SmallUnsigned` | `smallint` |
| `UInt32` | `Unsigned` | `bigint` |
| `UInt64` | `BigUnsigned` | `numeric` |
| `Decimal128` / `Decimal256` | `Decimal` | `decimal` |
| `Float32` | `Float` | `real` |
| `Float64` | `Double` | `double precision` |
| `Utf8` / `LargeUtf8` | `Text` | `text` |
| `Boolean` | `Boolean` | `bool` |
| `Binary` / `LargeBinary` | `VarBinary` | `bytea` |
| `FixedSizeBinary` | `Binary` | `bytea` |
| `Timestamp` (no Timezone) | `Timestamp` | `timestamp without time zone` |
| `Timestamp` (with Timezone) | `TimestampWithTimeZone` | `timestamp with time zone` |
| `Date32` / `Date64` | `Date` | `date` |
| `Time32` / `Time64` | `Time` | `time` |
| `Interval` | `Interval` | `interval` |
| `Duration` | `BigInteger` | `bigint` |
| `List` / `LargeList` / `FixedSizeList` | `Array` | `array` |
| `Struct` | N/A | `Composite` (Custom type) |
- `clickhouse_connection_string`: The connection string to use to connect to the ClickHouse server. This can be used instead of providing individual connection parameters.
- `clickhouse_host`: The hostname of the ClickHouse server.
- `clickhouse_tcp_port`: The port of the ClickHouse server.
- `clickhouse_db`: The name of the database to connect to.
- `clickhouse_user`: The username to connect with.
- `clickhouse_pass`: The password to connect with.
- `clickhouse_secure`: Optional. Specifies the SSL/TLS behavior for the connection, supported values:
  - `true`: (default) This mode requires an SSL connection. If a secure connection cannot be established, the server will not connect.
  - `false`: This mode will not attempt to use an SSL connection, even if the server supports it.
- `connection_timeout`: Optional. Specifies the connection timeout in milliseconds.
| Connector | Data Source | Protocol/Formats |
| --- | --- | --- |
| `databricks (mode: delta_lake)` | Databricks | S3/Delta Lake |
| `delta_lake` | Delta Lake | Delta Lake |
| `dremio` | Dremio | Arrow Flight |
| `duckdb` | DuckDB | Embedded |
| `github` | GitHub | GitHub API |
| `postgres` | PostgreSQL | |
| `s3` | S3 | Parquet, CSV |
| `mysql` | MySQL | |
| `graphql` | GraphQL | JSON |
| `databricks (mode: spark_connect)` | Databricks | Spark Connect |
| `flightsql` | FlightSQL | Arrow Flight SQL |
| `mssql` | Microsoft SQL Server | Tabular Data Stream (TDS) |
| `snowflake` | Snowflake | Arrow |
| `spark` | Spark | Spark Connect |
| `spice.ai` | Spice.ai | Arrow Flight |
| `iceberg` | Apache Iceberg | Parquet |
| `abfs` | Azure BlobFS | Parquet, CSV |
| `clickhouse` | Clickhouse | |
| `debezium` | Debezium CDC | Kafka + JSON |
| `dynamodb` | DynamoDB | |
| `ftp`, `sftp` | FTP/SFTP | Parquet, CSV |
| `http`, `https` | HTTP(s) | Parquet, CSV |
| `sharepoint` | Microsoft SharePoint | Unstructured UTF-8 documents |
| Format | Parameter | Supported | Is Document Format |
| --- | --- | --- | --- |
| Apache Parquet | `file_format: parquet` | ✅ | ❌ |
| CSV | `file_format: csv` | ✅ | ❌ |
| Apache Iceberg | `file_format: iceberg` | Roadmap | ❌ |
| JSON | `file_format: json` | Roadmap | ❌ |
| Microsoft Excel | `file_format: xlsx` | Roadmap | ❌ |
| Markdown | `file_format: md` | ✅ | ✅ |
| Text | `file_format: txt` | ✅ | ✅ |
| PDF | `file_format: pdf` | Alpha | ✅ |
| Microsoft Word | `file_format: docx` | Alpha | ✅ |
`model.yaml` files are automatically detected and imported in the Portal.
The Spice runtime stores information about completed tasks in the `spice.runtime.task_history` table. A task is a single unit of execution within the runtime, such as a SQL query or an AI chat completion (see Task Types below). Tasks can be nested, and the runtime will record the parent-child relationship between tasks.
Each task executed has a row in this table, and by default the data is retained for 8 hours. Use a `SELECT` query to return information about each task as shown in this example:
Output:
The following top-level task types are currently recorded:
| Task Type | Description | CLI Command |
| --- | --- | --- |
| `sql_query` | SQL Query | `spice sql` |
| `nsql_query` | Natural Language to SQL Query | |
| `ai_chat` | AI Chat Completion | `spice chat` |
| `vector_search` | Vector Search | `spice search` |
| `accelerated_refresh` | Accelerated Table Refresh | |
| `text_embed` | Text Embedding | |
Set the following parameters in the `runtime.task_history` section of the `spicepod.yaml` file to configure task history:
- `enabled`: Enable or disable task history. Default: `true`.
- `retention_period`: The duration for which task history data is retained. Default: `8h`.
- `retention_check_interval`: The interval at which the task history retention is checked. Default: `1m`.
- `captured_output`: The level of output captured for tasks. `none` or `truncated`. Default: `none`. `truncated` captures the first 3 rows of the result set for `sql_query` and `nsql_query` task types. Other task types currently capture the entire output even in truncated mode.
For example, adjust the retention period, disable task history entirely, or disable captured output:
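A combined sketch of these settings (adjust or remove fields as needed):

```yaml
runtime:
  task_history:
    enabled: true              # set to false to disable task history entirely
    retention_period: 24h      # retain task history for 24 hours instead of the default 8h
    captured_output: none      # do not capture task output
```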
| Column | Data Type | Nullable | Description |
| --- | --- | --- | --- |
| `trace_id` | `Utf8` | NO | Unique identifier for the entire trace this task happened in |
| `span_id` | `Utf8` | NO | Unique identifier for this specific task within the trace |
| `parent_span_id` | `Utf8` | YES | Identifier of the parent task, if any |
| `task` | `Utf8` | NO | Name or description of the task being performed (e.g. `sql_query`) |
| `input` | `Utf8` | NO | Input data or parameters for the task |
| `captured_output` | `Utf8` | YES | Output or result of the task, if available |
| `start_time` | `Timestamp(Nanosecond, None)` | NO | Time when the task started |
| `end_time` | `Timestamp(Nanosecond, None)` | NO | Time when the task ended |
| `execution_duration_ms` | `Float64` | NO | Duration of the task execution in milliseconds |
| `error_message` | `Utf8` | YES | Error message if the task failed, otherwise null |
| `labels` | `Map(Utf8, Utf8)` | NO | Key-value pairs for additional metadata or attributes associated with the task |
Dremio Data Connector Documentation
Dremio is a data lake engine that enables high-performance SQL queries directly on data lake storage. It provides a unified interface for querying and analyzing data from various sources without the need for complex data movement or transformation.
This connector enables using Dremio as a data source for federated SQL queries.
from
The `from` field takes the form `dremio:dataset` where `dataset` is the fully qualified name of the dataset to read from.
Limitations
Currently, only up to three levels of nesting are supported for dataset names (e.g., a.b.c). Additional levels are not supported at this time.
name
The dataset name. This will be used as the table name within Spice.
Example:
params
- `dremio_endpoint`: The endpoint used to connect to the Dremio server.
- `dremio_username`: The username used to connect to the Dremio endpoint.
- `dremio_password`: The password used to connect to the Dremio endpoint.
The table below shows the Dremio data types supported, along with the type mapping to Apache Arrow types in Spice.
| Dremio Type | Arrow Type |
| --- | --- |
| `INT` | `Int32` |
| `BIGINT` | `Int64` |
| `FLOAT` | `Float32` |
| `DOUBLE` | `Float64` |
| `DECIMAL` | `Decimal128` |
| `VARCHAR` | `Utf8` |
| `VARBINARY` | `Binary` |
| `BOOL` | `Boolean` |
| `DATE` | `Date64` |
| `TIME` | `Time32` |
| `TIMESTAMP` | `Timestamp(Millisecond, None)` |
| `INTERVAL` | `Interval` |
| `LIST` | `List` |
| `STRUCT` | `Struct` |
| `MAP` | `Map` |
Limitations
The Dremio connector does not support queries with the EXCEPT and INTERSECT keywords in the Spice REPL. Use DISTINCT and IN/NOT IN instead. See the example below.
Azure BlobFS Data Connector Documentation
The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.
When a folder path is provided, all the contained files will be loaded.
File formats are specified using the `file_format` parameter, as described in Object Store File Formats.
from
Defines the ABFS-compatible URI to a folder or object:
- `from: abfs://<container>/<path>` with the account name configured using the `abfs_account` parameter, or
- `from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>`
name
Defines the dataset name, which is used as the table name within Spice.
Example:
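A sketch of an ABFS dataset (container, path, and credentials are illustrative assumptions):

```yaml
datasets:
  - from: abfs://my-container/reports/   # folder path, so file_format is required
    name: reports
    params:
      abfs_account: my_storage_account
      abfs_access_key: ${secrets:abfs_access_key}
      file_format: csv
```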
params
- `file_format`: Specifies the data format. See Object Store File Formats.
- `abfs_account`: Azure storage account name
- `abfs_sas_string`: SAS (Shared Access Signature) Token to use for authorization
- `abfs_endpoint`: Storage endpoint, default: `https://{account}.blob.core.windows.net`
- `abfs_use_emulator`: Use `true` or `false` to connect to a local emulator
- `abfs_authority_host`: Alternative authority host, default: `https://login.microsoftonline.com`
- `abfs_proxy_url`: Proxy URL
- `abfs_proxy_ca_certificate`: CA certificate for the proxy
- `abfs_proxy_exludes`: A list of hosts to exclude from proxy connections
- `abfs_disable_tagging`: Disable tagging objects. Use this if your backing store doesn't support tags
- `allow_http`: Allow insecure HTTP connections
- `hive_partitioning_enabled`: Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`
The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:
- `abfs_access_key`
- `abfs_bearer_token`
- `abfs_client_secret`
- `abfs_skip_signature`
If none of these are set, the connector will default to using a managed identity.
- `abfs_access_key`: Secret access key
- `abfs_bearer_token`: BEARER access token for user authentication
- `abfs_client_id`: Client ID for client authentication flow
- `abfs_client_secret`: Client Secret to use for client authentication flow
- `abfs_tenant_id`: Tenant ID to use for client authentication flow
- `abfs_skip_signature`: Skip credentials and request signing for public containers
- `abfs_msi_endpoint`: Endpoint for managed identity tokens
- `abfs_federated_token_file`: File path for federated identity token in Kubernetes
- `abfs_use_cli`: Set to `true` to use the Azure CLI to acquire access tokens
- `abfs_max_retries`: Maximum retries
- `abfs_retry_timeout`: Total timeout for retries (e.g., `5s`, `1m`)
- `abfs_backoff_initial_duration`: Initial retry delay (e.g., `5s`)
- `abfs_backoff_max_duration`: Maximum retry delay (e.g., `1m`)
- `abfs_backoff_base`: Exponential backoff base (e.g., `0.1`)
The ABFS connector supports three types of authentication, as detailed in the authentication parameters.
Configure service principal authentication by setting the `abfs_client_secret` parameter:
Create a new Azure AD application in the Azure portal and generate a client secret under Certificates & secrets.
Grant the Azure AD application read access to the storage account under Access Control (IAM); this can typically be done using the Storage Blob Data Reader built-in role.
Configure access key authentication by setting the `abfs_access_key` parameter to the Azure Storage Account Access Key.
Specify the file format using the `file_format` parameter. More details in Object Store File Formats.
DynamoDB Data Connector Documentation
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
from
The `from` field should specify the DynamoDB table name:
If an expected table is not found, verify the `dynamodb_aws_region` parameter. DynamoDB tables are region-specific.
name
The dataset name. This will be used as the table name within Spice.
Example:
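A sketch of a DynamoDB dataset (the table name is an illustrative assumption; credentials fall back to the environment if omitted):

```yaml
datasets:
  - from: dynamodb:users           # DynamoDB table named "users"
    name: users
    params:
      dynamodb_aws_region: us-east-1
    acceleration:
      enabled: true                # recommended, since filter push-down is not supported
```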
params
The DynamoDB data connector supports the following configuration parameters:
If AWS credentials are not explicitly provided in the configuration, the connector will automatically load credentials from the following sources in order:
Environment Variables:
- `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN` (if using temporary credentials)
Shared AWS Config/Credentials Files:
- Config file: `~/.aws/config` (Linux/Mac) or `%UserProfile%\.aws\config` (Windows)
- Credentials file: `~/.aws/credentials` (Linux/Mac) or `%UserProfile%\.aws\credentials` (Windows)
- The `AWS_PROFILE` environment variable can be used to specify a named profile.
- Supports both static credentials and SSO sessions
Example credentials file:
To set up SSO authentication:
- Run `aws configure sso` to configure a new SSO profile
- Use the profile by setting `AWS_PROFILE=sso-profile`
- Run `aws sso login` to start a new SSO session
Web Identity Token Credentials:
Used primarily with OpenID Connect (OIDC) and OAuth
Common in Kubernetes environments using IAM roles for service accounts (IRSA)
ECS Container Credentials:
Used when running in Amazon ECS containers
Automatically uses the task's IAM role
Retrieved from the ECS credential provider endpoint
EC2 Instance Metadata Service (IMDSv2):
Used when running on EC2 instances
Automatically uses the instance's IAM role
Retrieved securely using IMDSv2
The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.
IAM Permissions: Regardless of the credential source, the IAM role or user must have appropriate DynamoDB permissions (e.g., `dynamodb:Scan`, `dynamodb:DescribeTable`) to access the table.
The IAM role or user needs the following permissions to access DynamoDB tables:
Security Considerations
Avoid using `dynamodb:*` permissions, as this grants more access than necessary.
permissions as it grants more access than necessary.
Consider using more restrictive policies in production environments.
DynamoDB supports complex nested JSON structures. These fields can be queried using SQL:
Limitations
The DynamoDB connector currently does not support filter push-down optimization. All filtering is performed after data is retrieved from DynamoDB.
Primary key optimizations are not yet implemented - retrieving items by their primary key will still scan the table.
The DynamoDB connector will scan the first 10 items to determine the schema of the table. This may miss columns that are not present in the first 10 items.
The DynamoDB connector supports the following data types and mappings:
Basic scalar types (String, Number, Boolean)
Lists and Maps
Nested structures
Binary data
Example schema from a users table:
Due to limited support for filter push-down, enable acceleration to prevent scanning the entire table on every query.
Flight SQL Data Connector Documentation
Connect to any Flight SQL compatible server (e.g. Influx 3.0, CnosDB, other Spice runtimes!) as a connector for federated SQL queries.
from
The `from` field takes the form `flightsql:dataset` where `dataset` is the fully qualified name of the dataset to read from.
name
The dataset name. This will be used as the table name within Spice.
params
DuckDB Data Connector Documentation
DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed for analytical query workloads. It is optimized for fast execution and can be embedded directly into applications, providing efficient data processing without the need for a separate database server.
from
The `from` field supports one of two forms:
name
The dataset name. This will be used as the table name within Spice.
Example:
params
The DuckDB data connector can be configured by providing the following `params`:
Configuration `params` are provided either in the top level `dataset` for a dataset source, or in the `acceleration` section for a data store.
A generic example of DuckDB data connector configuration.
Datasets created from DuckDB functions are similar to a standard `SELECT` query. For example:
is equivalent to:
Many DuckDB data imports can be rewritten as DuckDB functions, making them usable as Spice datasets. For example:
Limitations
Unsupported:
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30])
The DuckDB connector does not support `Decimal256` (76 digits), as it exceeds DuckDB's maximum Decimal width of 38 digits.
Delta Lake Data Connector Documentation
from
name
The dataset name. This will be used as the table name within Spice.
Example:
params
Note: One of the following auth values must be provided for Azure Blob:
- `delta_lake_azure_storage_account_key`,
- `delta_lake_azure_storage_client_id` and `azure_storage_client_secret`, or
- `delta_lake_azure_storage_sas_key`.
The table below shows the Delta Lake data types supported, along with the type mapping to Apache Arrow types in Spice.
The Delta Lake connector does not support reading Delta tables with the `V2Checkpoint` feature enabled. To use the Delta Lake connector with such tables, drop the `V2Checkpoint` feature by executing the following command:
Databricks Data Connector Documentation
from
The `from` field for the Databricks connector takes the form `databricks:catalog.schema.table` where `catalog.schema.table` is the fully-qualified path to the table to read from.
name
The dataset name. This will be used as the table name within Spice.
Example:
params
Note: One of the following auth values must be provided for Azure Blob:
- `databricks_azure_storage_account_key`,
- `databricks_azure_storage_client_id` and `azure_storage_client_secret`, or
- `databricks_azure_storage_sas_key`.
The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.
The Databricks connector (mode: delta_lake) does not support reading Delta tables with the `V2Checkpoint` feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the `V2Checkpoint` feature by executing the following command:
Memory Considerations
When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
The Databricks Connector (`mode: spark_connect`) does not yet support streaming query results from Spark.
The password used to connect to the Dremio endpoint. Use the secret replacement syntax to load the password from a secret store, e.g. `${secrets:my_dremio_pass}`.
Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. Refer to Object Store File Formats for details.
BEARER access token for user authentication. The token can be obtained from the OAuth2 flow.
When using IAM roles with EKS, ensure the .
The HTTP(s) Data Connector enables federated SQL query across files stored at an HTTP(s) endpoint.
The `from` field must contain a valid URI to the location of a file. For example, `http://static_username@my-http-api/report.csv`.
This connector supports DuckDB as a data source for federated SQL queries.
Common DuckDB functions can also define datasets. Instead of a fixed table reference (e.g. `database.schema.table`), a DuckDB function is provided in the `from:` key. For example:
The DuckDB connector does not support enum, dictionary, or map field types. For example:
The Delta Lake data connector enables SQL queries from Delta Lake tables.
The `from` field for the Delta Lake connector takes the form `delta_lake:path` where `path` is any supported path, either local or to a cloud storage location. See the section below.
Use the secret replacement syntax to reference a secret, e.g. `${secrets:aws_access_key_id}`.
For more details on dropping Delta table features, refer to the official documentation:
Databricks as a connector for federated SQL query against Databricks, using Spark Connect or directly from Delta Lake tables.
Use the secret replacement syntax to reference a secret, e.g. `${secrets:my_token}`.
Configure the connection to the object store when using `mode: delta_lake`. Use the secret replacement syntax to reference a secret, e.g. `${secrets:aws_access_key_id}`.
For more details on dropping Delta table features, refer to the official documentation:
When using `mode: spark_connect`, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by the DuckDB and SQLite accelerators by specifying `mode: file`.
| from | Description |
| --- | --- |
| `dynamodb:table` | Read data from a DynamoDB table named `table` |
- `dynamodb_aws_region`: Required. The AWS region containing the DynamoDB table
- `dynamodb_aws_access_key_id`: Optional. AWS access key ID for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
- `dynamodb_aws_secret_access_key`: Optional. AWS secret access key for authentication. If not provided, credentials will be loaded from environment variables or IAM roles
- `dynamodb_aws_session_token`: Optional. AWS session token for authentication
- `dynamodb:Scan`: Required. Allows reading all items from the table
- `dynamodb:DescribeTable`: Required. Allows fetching table metadata and schema information
- `flightsql_endpoint`: The Apache Flight endpoint used to connect to the Flight SQL server.
- `flightsql_username`: Optional. The username to use in the underlying Apache Flight Handshake Request to authenticate to the server (see reference).
- `flightsql_password`: Optional. The password to use in the underlying Apache Flight Handshake Request to authenticate to the server. Use the secret replacement syntax to load the password from a secret store, e.g. `${secrets:my_flightsql_pass}`.
- `http_port`: Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
- `http_username`: Optional. Username to provide connection for HTTP basic authentication. Default: None.
- `http_password`: Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. `${secrets:my_http_pass}`.
- `client_timeout`: Optional. Specifies timeout for HTTP operations. Default value is `30s`. E.g. `client_timeout: 60s`
| from | Description |
| --- | --- |
| `duckdb:database.schema.table` | Read data from a table named `database.schema.table` in the DuckDB file |
| `duckdb:*` | Read data using any DuckDB function that produces a table, for example one of the data import functions such as `read_json`, `read_parquet`, or `read_csv` |
- `duckdb_open`: The name of the DuckDB database to open.
- `client_timeout`: Optional. Specifies timeout for object store operations. Default value is `30s`. E.g. `client_timeout: 60s`
- `delta_lake_aws_region`: Optional. The AWS region for the S3 object store. E.g. `us-west-2`.
- `delta_lake_aws_access_key_id`: The access key ID for the S3 object store.
- `delta_lake_aws_secret_access_key`: The secret access key for the S3 object store.
- `delta_lake_aws_endpoint`: Optional. The endpoint for the S3 object store. E.g. `s3.us-west-2.amazonaws.com`.
- `delta_lake_azure_storage_account_name`: The Azure Storage account name.
- `delta_lake_azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `delta_lake_azure_storage_client_id`: The service principal client id for accessing the storage account.
- `delta_lake_azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `delta_lake_azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `delta_lake_azure_storage_endpoint`: Optional. The endpoint for the Azure Blob storage account.
- `google_service_account`: Filesystem path to the Google service account JSON key file.
| Delta Lake Type | Arrow Type |
| --- | --- |
| `String` | `Utf8` |
| `Long` | `Int64` |
| `Integer` | `Int32` |
| `Short` | `Int16` |
| `Byte` | `Int8` |
| `Float` | `Float32` |
| `Double` | `Float64` |
| `Boolean` | `Boolean` |
| `Binary` | `Binary` |
| `Date` | `Date32` |
| `Timestamp` | `Timestamp(Microsecond, Some("UTC"))` |
| `TimestampNtz` | `Timestamp(Microsecond, None)` |
| `Decimal` | `Decimal128` |
| `Array` | `List` |
| `Struct` | `Struct` |
| `Map` | `Map` |
- `mode`: The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:
  - `spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
  - `delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided.
- `databricks_endpoint`: The endpoint of the Databricks instance. Required for both modes.
- `databricks_cluster_id`: The ID of the compute cluster in Databricks to use for the query. Only valid when `mode` is `spark_connect`.
- `databricks_use_ssl`: If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`.
- `client_timeout`: Optional. Applicable only in `delta_lake` mode. Specifies timeout for object store operations. Default value is `30s`. E.g. `client_timeout: 60s`
- `databricks_aws_region`: Optional. The AWS region for the S3 object store. E.g. `us-west-2`.
- `databricks_aws_access_key_id`: The access key ID for the S3 object store.
- `databricks_aws_secret_access_key`: The secret access key for the S3 object store.
- `databricks_aws_endpoint`: Optional. The endpoint for the S3 object store. E.g. `s3.us-west-2.amazonaws.com`.
- `databricks_azure_storage_account_name`: The Azure Storage account name.
- `databricks_azure_storage_account_key`: The Azure Storage key for accessing the storage account.
- `databricks_azure_storage_client_id`: The Service Principal client ID for accessing the storage account.
- `databricks_azure_storage_client_secret`: The Service Principal client secret for accessing the storage account.
- `databricks_azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `databricks_azure_storage_endpoint`: Optional. The endpoint for the Azure Blob storage account.
- `google_service_account`: Filesystem path to the Google service account JSON key file.
| Databricks Type | Arrow Type |
| --- | --- |
| `STRING` | `Utf8` |
| `BIGINT` | `Int64` |
| `INT` | `Int32` |
| `SMALLINT` | `Int16` |
| `TINYINT` | `Int8` |
| `FLOAT` | `Float32` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `BINARY` | `Binary` |
| `DATE` | `Date32` |
| `TIMESTAMP` | `Timestamp(Microsecond, Some("UTC"))` |
| `TIMESTAMP_NTZ` | `Timestamp(Microsecond, None)` |
| `DECIMAL` | `Decimal128` |
| `ARRAY` | `List` |
| `STRUCT` | `Struct` |
| `MAP` | `Map` |
Microsoft SQL Server Data Connector
Microsoft SQL Server is a relational database management system developed by Microsoft.
The Microsoft SQL Server Data Connector enables federated/accelerated SQL queries on data stored in MSSQL databases.
Limitations
The connector supports SQL Server authentication (SQL Login and Password) only.
Spatial types (`geography`) are not supported, and columns with these types will be ignored.
from
The `from` field takes the form `mssql:database.schema.table` where `database.schema.table` is the fully-qualified table name in the SQL server.
name
The dataset name. This will be used as the table name within Spice.
Example:
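A sketch of a Microsoft SQL Server dataset (the table name and connection values are illustrative assumptions):

```yaml
datasets:
  - from: mssql:my_database.dbo.my_table
    name: my_table
    params:
      mssql_host: localhost
      mssql_database: my_database
      mssql_username: sa
      mssql_password: ${secrets:my_mssql_password}
```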
params
The data connector supports the following `params`. Use the secret replacement syntax to load the secret from a secret store, e.g. `${secrets:my_mssql_conn_string}`.
- `mssql_connection_string`: The ADO connection string to use to connect to the server. This can be used instead of providing individual connection parameters.
- `mssql_host`: The hostname or IP address of the Microsoft SQL Server instance.
- `mssql_port`: Optional. The port of the Microsoft SQL Server instance. Default value is 1433.
- `mssql_database`: Optional. The name of the database to connect to. The default database (`master`) will be used if not specified.
- `mssql_username`: The username for the SQL Server authentication.
- `mssql_password`: The password for the SQL Server authentication.
- `mssql_encrypt`: Optional. Specifies whether encryption is required for the connection.
  - `true`: (default) This mode requires an SSL connection. If a secure connection cannot be established, the server will not connect.
  - `false`: This mode will not attempt to use an SSL connection, even if the server supports it. Only the login procedure is encrypted.
- `mssql_trust_server_certificate`: Optional. Specifies whether the server certificate should be trusted without validation when encryption is enabled.
  - `true`: The server certificate will not be validated and is accepted as-is.
  - `false`: (default) The server certificate will be validated against the system's certificate store.
MySQL Data Connector Documentation
MySQL is an open-source relational database management system that uses structured query language (SQL) for managing and manipulating databases.
The MySQL Data Connector enables federated/accelerated SQL queries on data stored in MySQL databases.
from
The `from` field takes the form `mysql:database_name.table_name` where `database_name.table_name` is the fully-qualified table name in the MySQL server.
If the `database_name` is omitted in the `from` field, the connector will use the database specified in the `mysql_db` parameter. If the `mysql_db` parameter is not provided, it will default to the user's default database.
These two examples are identical:
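A sketch of the two equivalent forms (database, table, and connection values are illustrative assumptions):

```yaml
datasets:
  # Database specified in the from field
  - from: mysql:my_database.my_table
    name: my_table
    params:
      mysql_host: localhost
      mysql_user: my_user
      mysql_pass: ${secrets:mysql_pass}

  # Database specified via the mysql_db parameter
  - from: mysql:my_table
    name: my_table
    params:
      mysql_host: localhost
      mysql_db: my_database
      mysql_user: my_user
      mysql_pass: ${secrets:mysql_pass}
```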
name
The dataset name. This will be used as the table name within Spice.
Example:
params
The MySQL data connector can be configured by providing the following `params`. Use the secret replacement syntax to load the secret from a secret store, e.g. `${secrets:my_mysql_conn_string}`.
- `mysql_connection_string`: The connection string to use to connect to the MySQL server. This can be used instead of providing individual connection parameters.
- `mysql_host`: The hostname of the MySQL server.
- `mysql_tcp_port`: The port of the MySQL server.
- `mysql_db`: The name of the database to connect to.
- `mysql_user`: The MySQL username.
- `mysql_pass`: The password to connect with.
- `mysql_sslmode`: Optional. Specifies the SSL/TLS behavior for the connection, supported values:
  - `required`: (default) This mode requires an SSL connection. If a secure connection cannot be established, the server will not connect.
  - `preferred`: This mode will try to establish a secure SSL connection if possible, but will connect insecurely if the server does not support SSL.
  - `disabled`: This mode will not attempt to use an SSL connection, even if the server supports it.
- `mysql_sslrootcert`: Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
The table below shows the MySQL data types supported, along with the type mapping to Apache Arrow types in Spice.
| MySQL Type | Arrow Type |
| --- | --- |
| `TINYINT` | `Int8` |
| `SMALLINT` | `Int16` |
| `INT` | `Int32` |
| `MEDIUMINT` | `Int32` |
| `BIGINT` | `Int64` |
| `DECIMAL` | `Decimal128` / `Decimal256` |
| `FLOAT` | `Float32` |
| `DOUBLE` | `Float64` |
| `DATETIME` | `Timestamp(Microsecond, None)` |
| `TIMESTAMP` | `Timestamp(Microsecond, None)` |
| `YEAR` | `Int16` |
| `TIME` | `Time64(Nanosecond)` |
| `DATE` | `Date32` |
| `CHAR` | `Utf8` |
| `BINARY` | `Binary` |
| `VARCHAR` | `Utf8` |
| `VARBINARY` | `Binary` |
| `TINYBLOB` | `Binary` |
| `TINYTEXT` | `Utf8` |
| `BLOB` | `Binary` |
| `TEXT` | `Utf8` |
| `MEDIUMBLOB` | `Binary` |
| `MEDIUMTEXT` | `Utf8` |
| `LONGBLOB` | `LargeBinary` |
| `LONGTEXT` | `LargeUtf8` |
| `SET` | `Utf8` |
| `ENUM` | `Dictionary(UInt16, Utf8)` |
| `BIT` | `UInt64` |
The MySQL `TIMESTAMP` value is retrieved as a UTC time value.
Memory Data Connector Documentation
The Memory Data Connector enables configuring an in-memory dataset for tables used, or produced by the Spice runtime. Only certain tables, with predefined schemas, can be defined by the connector. These are:
store
: Defines a table that LLMs, with memory tooling, can store data in. Requires mode: read_write
.
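A minimal sketch of a store memory dataset (the dataset name is illustrative):

```yaml
datasets:
  - from: memory:store
    name: llm_memory
    mode: read_write   # required so models with memory tooling can write to it
```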
Localpod Data Connector Documentation
The Localpod Data Connector enables setting up a parent/child relationship between datasets in the current Spicepod. This can be used for configuring multiple/tiered accelerations for a single dataset, and ensuring that the data is only downloaded once from the remote source. For example, you can use the localpod
connector to create a child dataset that is accelerated in-memory, while the parent dataset is accelerated to a file.
The dataset created by the localpod
connector will logically have the same data as the parent dataset.
The localpod
connector supports synchronized refreshes, which ensures that the child dataset is refreshed from the same data as the parent dataset. Synchronized refreshes require that both the parent and child datasets are accelerated with refresh_mode: full
(which is the default).
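A sketch of a parent/child configuration under these assumptions (the S3 source, names, and acceleration settings are illustrative): the parent dataset is accelerated to a file, and the localpod child is accelerated in-memory from the parent, so the remote data is only downloaded once.

```yaml
datasets:
  - from: s3://my-bucket/path/to/data/
    name: parent_dataset
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb
      mode: file          # parent accelerated to a file
  - from: localpod:parent_dataset
    name: child_dataset
    acceleration:
      enabled: true       # child accelerated in-memory (refresh_mode: full by default)
```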
When synchronization is enabled, the following logs will be emitted:
GraphQL Data Connector Documentation
The GraphQL Data Connector enables federated SQL queries on any GraphQL endpoint by specifying graphql
as the selector in the from
value for the dataset.
Limitations
The GraphQL data connector does not support variables in the query.
Filter pushdown, with the exclusion of LIMIT
, is not currently supported. Using a LIMIT
will reduce the amount of data requested from the GraphQL server.
from
The from
field takes the form of graphql:your-graphql-endpoint
.
name
The dataset name. This will be used as the table name within Spice.
params
The GraphQL data connector can be configured by providing the following params
. Use the secret replacement syntax to load the password from a secret store, e.g. ${secrets:my_graphql_auth_token}
.
unnest_depth
Depth level to automatically unnest objects to. By default, disabled if unspecified or 0
.
graphql_auth_token
The authentication token to use to connect to the GraphQL server. Uses bearer authentication.
graphql_auth_user
The username to use for basic auth. E.g. graphql_auth_user: my_user
graphql_auth_pass
The password to use for basic auth. E.g. graphql_auth_pass: ${secrets:my_graphql_auth_pass}
graphql_query
json_pointer
Example using the GitHub GraphQL API and Bearer Auth. The following will use json_pointer
to retrieve all of the nodes in starredRepositories:
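A minimal sketch of what such a configuration might look like. The endpoint, query shape, and json_pointer path are illustrative and should be adapted to the actual GitHub GraphQL schema and response.

```yaml
datasets:
  - from: graphql:https://api.github.com/graphql
    name: starred_repositories
    params:
      graphql_auth_token: ${secrets:github_token}   # bearer authentication
      json_pointer: /data/viewer/starredRepositories/nodes
      graphql_query: |
        {
          viewer {
            starredRepositories(first: 100) {
              nodes {
                name
                url
              }
              pageInfo {
                hasNextPage
                endCursor
              }
            }
          }
        }
```

Because pageInfo is present in the query, the connector can also paginate through all pages automatically, as described in the pagination section below.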
The GraphQL Data Connector supports automatic pagination of the response for queries using cursor pagination.
The graphql_query
must include the pageInfo
field as per spec. The connector will parse the graphql_query
, and when pageInfo
is present, will retrieve data until pagination completes.
The query must have the correct pagination arguments in the associated paginated field.
Forward Pagination:
Backward Pagination:
Tips for working with JSON data. For more information see Datafusion Docs.
You can access the fields of the object using the square bracket notation. Arrays are indexed from 1.
Example for the stargazers query from pagination section:
You can use the DataFusion unnest
function to pipe values from an array into rows. We'll use the countries GraphQL API as an example.
Example query:
You can also use the unnest_depth
parameter to control automatic unnesting of objects from GraphQL responses.
This example uses the GitHub stargazers endpoint:
If unnest_depth
is set to 0, or unspecified, object unnesting is disabled. When enabled, unnesting automatically moves nested fields to the parent level.
Without unnesting, stargazers data looks like this in a query:
With unnesting, these properties are automatically placed into their own columns:
By default, the Spice Runtime will error when a duplicate column is detected during unnesting.
For example, this spicepod.yml
query would fail due to duplicate name
fields:
This example would fail with a runtime error:
Avoid this error by using aliases in the query where possible. In the example above, a duplicate error was introduced from emergency_contact { name }
.
The example below uses a GraphQL alias to rename emergency_contact.name
as emergencyContactName
.
PostgreSQL Data Connector Documentation
PostgreSQL is an advanced open-source relational database management system known for its robustness, extensibility, and support for SQL compliance.
The PostgreSQL Server Data Connector enables federated/accelerated SQL queries on data stored in PostgreSQL databases.
from
The from
field takes the form postgres:my_table
where my_table
is the table identifier in the PostgreSQL server to read from.
The fully-qualified table name (database.schema.table
) can also be used in the from
field.
name
The dataset name. This will be used as the table name within Spice.
Example:
params
The connection to PostgreSQL can be configured by providing the following params
:
pg_host
The hostname of the PostgreSQL server.
pg_port
The port of the PostgreSQL server.
pg_db
The name of the database to connect to.
pg_user
The username to connect with.
pg_pass
pg_sslmode
Optional. Specifies the SSL/TLS behavior for the connection, supported values:
verify-full
: (default) This mode requires an SSL connection, a valid root certificate, and the server host name to match the one specified in the certificate.
verify-ca
: This mode requires a TLS connection and a valid root certificate.
require
: This mode requires a TLS connection.
prefer
: This mode will try to establish a secure TLS connection if possible, but will connect insecurely if the server does not support TLS.
disable
: This mode will not attempt to use a TLS connection, even if the server supports it.
pg_sslrootcert
Optional parameter specifying the path to a custom PEM certificate that the connector will trust.
connection_pool_size
Optional. The maximum number of connections to keep open in the connection pool. Default is 10.
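Combining the parameters above, a minimal spicepod.yaml sketch for a PostgreSQL dataset (host, database, and secret names are illustrative):

```yaml
datasets:
  - from: postgres:my_table
    name: my_dataset
    params:
      pg_host: localhost
      pg_port: "5432"
      pg_db: my_database
      pg_user: my_user
      pg_pass: ${secrets:my_pg_pass}   # loaded from a secret store
      pg_sslmode: require
```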
The table below shows the PostgreSQL data types supported, along with the type mapping to Apache Arrow types in Spice.
PostgreSQL Type | Arrow Type
int2 | Int16
int4 | Int32
int8 | Int64
money | Int64
float4 | Float32
float8 | Float64
numeric | Decimal128
text | Utf8
varchar | Utf8
bpchar | Utf8
uuid | Utf8
bytea | Binary
bool | Boolean
json | LargeUtf8
timestamp | Timestamp(Nanosecond, None)
timestamptz | Timestamp(Nanosecond, TimeZone)
date | Date32
time | Time64(Nanosecond)
interval | Interval(MonthDayNano)
point | FixedSizeList(Float64[2])
int2[] | List(Int16)
int4[] | List(Int32)
int8[] | List(Int64)
float4[] | List(Float32)
float8[] | List(Float64)
text[] | List(Utf8)
bool[] | List(Boolean)
bytea[] | List(Binary)
geometry | Binary
geography | Binary
enum | Dictionary(Int8, Utf8)
Composite Types | Struct
Postgres federated queries may return unexpected result types due to differences between DataFusion and Postgres type coercion rules. Explicitly specify the expected output type of aggregation functions when writing queries involving Postgres tables in Spice. For example, rewrite SUM(int_col)
into CAST(SUM(int_col) AS BIGINT)
.
Specify different secrets for a PostgreSQL source and acceleration:
GitHub Data Connector Documentation
The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github
as the selector in the from
value for the dataset.
from
The from
field takes the form of github:github.com/<owner>/<repo>/<content>
where content
could be files
, issues
, pulls
, commits
, stargazers
. See examples for more configuration detail.
name
The dataset name. This will be used as the table name within Spice.
params
github_token
GitHub Apps provide a secure and scalable way to integrate with GitHub's API. Learn more.
github_client_id
Required. Specifies the client ID for GitHub App Installation auth mode.
github_private_key
Required. Specifies the private key for GitHub App Installation auth mode.
github_installation_id
Required. Specifies the installation ID for GitHub App Installation auth mode.
Limitations
With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.
github_query_mode
owner
Required. Specifies the owner of the GitHub repository.
repo
Required. Specifies the name of the GitHub repository.
GitHub queries support a github_query_mode
parameter, which can be set to either auto
or search
for the following types:
Issues: Defaults to auto
. Query filters are only pushed down to the GitHub API in search
mode.
Pull Requests: Defaults to auto
. Query filters are only pushed down to the GitHub API in search
mode.
Commits only supports auto
mode. Query with filter push down is only enabled for the committed_date
column. committed_date
supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'
.
When set to search
, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:
author
and state
; supports exact matches, or NOT matches. For example, WHERE author = 'peasee'
or WHERE author <> 'peasee'
.
body
and title
; supports exact matches, or LIKE matches. For example, WHERE body LIKE '%duckdb%'
.
updated_at
, created_at
, merged_at
and closed_at
; supports exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'
.
All other filters are supported when github_query_mode
is set to search
, but cannot be pushed down to the GitHub API for improved performance.
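As a sketch, a dataset that queries issues with github_query_mode: search might be configured as follows (the repository and secret name are illustrative):

```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:github_token}
      github_query_mode: search   # push filters on author, state, title, body, and date columns to the GitHub Search API
```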
Limitations
GitHub has a limitation in the Search API where it may return more stale data than the standard API used in the default query mode.
GitHub has a limitation in the Search API where it only returns a maximum of 1000 results for a query. Use append mode acceleration to retrieve more results over time. See the append example for pull requests.
Limitations
content
column is fetched only when acceleration is enabled.
Querying GitHub files does not support filter push down, which may result in long query times when acceleration is disabled.
Setting github_query_mode
to search
is not supported.
ref
- Required. Specifies the GitHub branch or tag to fetch files from.
include
- Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.
Column | Type | Nullable
name | Utf8 | YES
path | Utf8 | YES
size | Int64 | YES
sha | Utf8 | YES
mode | Utf8 | YES
url | Utf8 | YES
download_url | Utf8 | YES
content | Utf8 | YES
Limitations
Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'
.
Column | Type | Nullable
assignees | List(Utf8) | YES
author | Utf8 | YES
body | Utf8 | YES
closed_at | Timestamp | YES
comments | List(Struct) | YES
created_at | Timestamp | YES
id | Utf8 | YES
labels | List(Utf8) | YES
milestone_id | Utf8 | YES
milestone_title | Utf8 | YES
comments_count | Int64 | YES
number | Int64 | YES
state | Utf8 | YES
title | Utf8 | YES
updated_at | Timestamp | YES
url | Utf8 | YES
Limitations
Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'
.
Column | Type | Nullable
additions | Int64 | YES
assignees | List(Utf8) | YES
author | Utf8 | YES
body | Utf8 | YES
changed_files | Int64 | YES
closed_at | Timestamp | YES
comments_count | Int64 | YES
commits_count | Int64 | YES
created_at | Timestamp | YES
deletions | Int64 | YES
hashes | List(Utf8) | YES
id | Utf8 | YES
labels | List(Utf8) | YES
merged_at | Timestamp | YES
number | Int64 | YES
reviews_count | Int64 | YES
state | Utf8 | YES
title | Utf8 | YES
url | Utf8 | YES
Limitations
Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE committed_date > '2024-09-24'
.
Setting github_query_mode
to search
is not supported.
Column | Type | Nullable
additions | Int64 | YES
author_email | Utf8 | YES
author_name | Utf8 | YES
committed_date | Timestamp | YES
deletions | Int64 | YES
id | Utf8 | YES
message | Utf8 | YES
message_body | Utf8 | YES
message_head_line | Utf8 | YES
sha | Utf8 | YES
Limitations
Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE starred_at > '2024-09-24'
.
Setting github_query_mode
to search
is not supported.
Column | Type | Nullable
starred_at | Timestamp | YES
login | Utf8 | YES
 | Utf8 | YES
name | Utf8 | YES
company | Utf8 | YES
x_username | Utf8 | YES
location | Utf8 | YES
avatar_url | Utf8 | YES
bio | Utf8 | YES
S3 Data Connector Documentation
The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
If a folder path is specified as the dataset source, all files within the folder will be loaded.
File formats are specified using the file_format
parameter, as described in Object Store File Formats.
from
S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>
Example: from: s3://my-bucket/path/to/file.parquet
name
The dataset name. This will be used as the table name within Spice.
Example:
params
file_format
s3_endpoint
S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server
s3_region
S3 bucket region. Default: us-east-1
.
client_timeout
Timeout for S3 operations. Default: 30s
.
hive_partitioning_enabled
Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
s3_auth
Authentication type. Options: public
, key
and iam_role
. Defaults to public
if s3_key
and s3_secret
are not provided, otherwise defaults to key
.
s3_key
Access key (e.g. AWS_ACCESS_KEY_ID
for AWS)
s3_secret
Secret key (e.g. AWS_SECRET_ACCESS_KEY
for AWS)
allow_http
Allow insecure HTTP connections to s3_endpoint
. Defaults to false
For additional CSV parameters, see CSV Parameters
No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth
to iam_role
. If using iam_role, the AWS IAM role of the running instance is used.
Minimum IAM policy for S3 access:
Refer to Object Store Data Types for data type mapping from object store files to arrow data type.
Create a dataset named taxi_trips
from a public S3 folder.
Create a dataset named cool_dataset
from a Parquet file stored in MinIO.
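Sketches of both datasets (bucket names, paths, and secret names are illustrative):

```yaml
# Public S3 folder (no authentication)
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
---
# Parquet file in MinIO, using key-based authentication
datasets:
  - from: s3://my-minio-bucket/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_endpoint: https://my.minio.server
      s3_auth: key
      s3_key: ${secrets:minio_access_key}
      s3_secret: ${secrets:minio_secret_key}
```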
Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
For example, a dataset partitioned by year, month, and day might have a directory structure like:
Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled
is set to true
.
Performance Considerations
When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by duckdb
and sqlite
accelerators by specifying mode: file
.
Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.
FTP/SFTP Data Connector Documentation
FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols used for transferring files between a client and server, with FTP being less secure and SFTP providing encrypted file transfer over SSH.
The FTP/SFTP Data Connector enables federated/accelerated SQL query across supported file formats stored in FTP/SFTP servers.
from
The from
field takes one of two forms: ftp://<host>/<path>
or sftp://<host>/<path>
where <host>
is the host to connect to and <path>
is the path to the file or directory to read from.
If a folder is provided, all child files will be loaded.
name
The dataset name. This will be used as the table name within Spice.
Example:
params
file_format
ftp_port
Optional, specifies the port of the FTP server. Default is 21. E.g. ftp_port: 21
ftp_user
The username for the FTP server. E.g. ftp_user: my-ftp-user
ftp_pass
client_timeout
Optional. Specifies timeout for FTP connection. E.g. client_timeout: 30s
. When not set, no timeout will be configured for the FTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
file_format
sftp_port
Optional, specifies the port of the SFTP server. Default is 22. E.g. sftp_port: 22
sftp_user
The username for the SFTP server. E.g. sftp_user: my-sftp-user
sftp_pass
client_timeout
Optional. Specifies timeout for SFTP connection. E.g. client_timeout: 30s
. When not set, no timeout will be configured for the SFTP client.
hive_partitioning_enabled
Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to false
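A minimal sketch for an SFTP-backed dataset (host, path, and secret name are illustrative); an FTP dataset is configured the same way with the ftp:// scheme and the ftp_-prefixed parameters:

```yaml
datasets:
  - from: sftp://remote-server.example.com/reports/
    name: sftp_reports
    params:
      file_format: csv
      sftp_port: "22"
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_pass}   # loaded from a secret store
      client_timeout: 30s
```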
Snowflake Data Connector Documentation
from
A Snowflake fully qualified table name (database.schema.table). For instance snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
or snowflake:TAXI_DATA."2024".TAXI_TRIPS
name
The dataset name. This will be used as the table name within Spice.
params
Limitations
Instructions for using language models hosted on Anthropic with Spice.
To use a language model hosted on Anthropic, specify anthropic
in the from
field.
To use a specific model, include its model ID in the from
field (see example below). If not specified, the default model is claude-3-5-sonnet-latest
.
The following parameters are specific to Anthropic models:
Example spicepod.yml
configuration:
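A minimal sketch, assuming the claude-3-5-sonnet-latest model ID and a secret named SPICE_ANTHROPIC_API_KEY:

```yaml
models:
  - from: anthropic:claude-3-5-sonnet-latest
    name: claude
    params:
      anthropic_api_key: ${secrets:SPICE_ANTHROPIC_API_KEY}
```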
Instructions for using language models hosted on OpenAI or compatible services with Spice.
To use a language model hosted on OpenAI (or compatible), specify the openai
path in the from
field.
For a specific model, include it as the model ID in the from
field (see example below). The default model is gpt-4o-mini
.
from
The from
field takes the form openai:model_id
where model_id
is the model ID of the OpenAI model, valid model IDs are found in the {endpoint}/v1/models
API response.
Example:
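A minimal sketch, assuming the gpt-4o-mini model ID and a secret named SPICE_OPENAI_API_KEY:

```yaml
models:
  - from: openai:gpt-4o-mini
    name: openai_chat
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}
```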
name
The model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models
). This can be set to the same value as the model ID in the from
field.
params
Spice supports several OpenAI compatible providers. Specify the appropriate endpoint in the params section.
Groq provides OpenAI compatible endpoints. Use the following configuration:
NVIDIA NIM provides OpenAI-compatible endpoints. Use the following configuration:
Parasail also offers OpenAI compatible endpoints. Use the following configuration:
Refer to the respective provider documentation for more details on available models and configurations.
Instructions for using Azure OpenAI models
Only one of azure_api_key
or azure_entra_token
can be provided for model configuration.
Example:
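A minimal sketch using API-key authentication (the deployment name, endpoint, and API version are illustrative; see the Azure parameter table later in this document):

```yaml
models:
  - from: azure:gpt-4o
    name: azure_gpt
    params:
      endpoint: https://resource-name.openai.azure.com
      azure_deployment_name: gpt-4o
      azure_api_version: "2024-08-01-preview"   # illustrative version string
      azure_api_key: ${secrets:SPICE_AZURE_API_KEY}
```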
Instructions for using language models hosted on Perplexity with Spice.
SharePoint Data Connector Documentation
The SharePoint Data Connector enables federated SQL queries on documents stored in SharePoint.
Only one of sharepoint_client_secret
or sharepoint_bearer_token
is allowed.
from formats
The from field in a SharePoint dataset takes the following format:
drive_type
in a SharePoint Connector from
field supports the following types:
For the me
drive type, the user is identified based on sharepoint_client_code
and cannot be used with sharepoint_client_secret
For a name-based drive_id
, the connector will attempt to resolve the name to an ID at startup.
Within a drive, the SharePoint connector can load documents from:
To use the SharePoint connector with service principal authentication, you will need to create an Azure AD application and grant it the necessary permissions. This will also support OAuth2 authentication for users within the tenant (i.e. sharepoint_bearer_token
).
Under the application's API permissions
, add the following permissions: Sites.Read.All
, Files.Read.All
, User.Read
, GroupMember.Read.All
For service principal authentication, Application permissions are required.
For user authentication, only delegated permissions are required.
Add sharepoint_client_id
(from the Application (Client) ID
field) and sharepoint_tenant_id
to the connector configuration.
Under the application's Certificates & secrets
, create a new client secret. Use this for the sharepoint_client_secret
parameter.
Overview of supported model providers for ML and LLMs in Spice.
Spice supports various model providers for traditional machine learning (ML) models and large language models (LLMs).
LLM Format(s) may require additional files (e.g. tokenizer_config.json
).
Spice supports a variety of features for large language models (LLMs):
The following examples demonstrate how to configure and use various models or model features with Spice. Each example provides a specific use case to help you understand the configuration options available.
Example spicepod.yml
:
This example demonstrates how to pull GitHub issue data from the last 14 days, accelerate the data, create a chat model with memory and tools to access the accelerated data, and use Spice to ask the chat model about the general themes of new issues.
First, configure a dataset to pull GitHub issue data from the last 14 days.
Next, create a chat model that includes memory and tools to access the accelerated GitHub issue data.
At this step, the spicepod.yaml
should look like:
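A sketch of what the combined spicepod.yaml might look like at this point. The repository, names, secrets, and the refresh filter are illustrative assumptions; tools: auto assumes the memory and other runtime tools become available to the model once the store dataset exists.

```yaml
datasets:
  # GitHub issues from the last 14 days, accelerated locally
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:github_token}
    acceleration:
      enabled: true
      refresh_sql: SELECT * FROM spiceai.issues WHERE created_at > now() - INTERVAL '14 days'

  # Memory store the chat model can read from and write to
  - from: memory:store
    name: llm_memory
    mode: read_write

models:
  - from: openai:gpt-4o
    name: gpt-4o
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}
      tools: auto   # make all available tools, including memory, available to the model
```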
Finally, use Spice to ask the chat model about the general themes of new issues in the last 14 days. The following curl
command demonstrates how to make this request using the OpenAI-compatible API.
The username to use for basic auth. See for a sample GraphQL query
The JSON pointer into the response body. When graphql_query
is , the json_pointer
can be inferred.
The password to connect with. Use the to load the password from a secret store, e.g. ${secrets:my_pg_pass}
.
Required. GitHub personal access token to use to connect to the GitHub API. .
Optional. Specifies whether the connector should use the GitHub for improved filter performance. Defaults to auto
, possible values of auto
or search
.
Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet
, csv
, json
. Refer to for details.
Specifies the data file format. Required if the format cannot be inferred from the from
path. See .
The password for the FTP server. Use the to load the password from a secret store, e.g. ${secrets:my_ftp_pass}
.
Specifies the data file format. Required if the format cannot be inferred from the from
path. See .
The password for the SFTP server. Use the to load the password from a secret store, e.g. ${secrets:my_sftp_pass}
.
The Snowflake Data Connector enables federated SQL queries across datasets in the .
Hint: Unquoted table identifiers should be UPPERCASED in the from
field. See .
The connector supports password-based and key-pair authentication. Login requires the account identifier ('orgname-accountname' format) - use instructions.
Account identifier does not support the . Use .
The connector supports password-based and key-pair authentication.
The Data Connector enables federated SQL query across datasets in the . Access to these datasets requires a free .
See for a list of supported model names.
See for additional configuration options.
Follow instructions.
View the Spice cookbook for an example of setting up NVidia NIM with Spice .
Apache Spark as a connector for federated SQL query against a Spark Cluster using
spark_remote
: A connection URI. Refer to for parameters in URI.
Correlated scalar subqueries are only supported in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands.
To use a language model hosted on Azure OpenAI, specify the azure
path in the from
field and the following parameters from the page:
Refer to the for more details on available models and configurations.
Follow the to try Azure OpenAI models for vector-based search and chat functionalities with structured (taxi trips) and unstructured (GitHub files) data.
Note: Like other models in Spice, Perplexity can set default overrides for OpenAI parameters. See .
Limitations The sharepoint connector does not yet support creating a dataset from a single file (e.g. an Excel spreadsheet). Datasets must be created from a folder of documents (see ).
Create a new Azure AD application in the .
The model type is inferred based on the model source and files. For more detail, refer to the model
.
Custom Tools: Provide models with tools to interact with the Spice runtime. See .
System Prompts: Customize system prompts and override defaults for . See .
Memory: Provide LLMs with memory persistence tools to store and retrieve information across conversations. See .
Vector Search: Perform advanced vector-based searches using embeddings. See .
Evals: Evaluate, track, compare, and improve language model performance for specific tasks. See .
Local Models: Load and serve models locally from various sources, including local filesystems and Hugging Face. See .
For more details, refer to the .
To use a language model hosted on OpenAI (or compatible), specify the openai
path and model ID in from
. For more details, see .
To specify tools for an OpenAI model, include them in the params.tools
field. For more details, see the .
To enable memory tools for a model, define a store
memory dataset and specify memory
in the model's tools
parameter. For more details, see the .
To set default overrides for parameters, use the openai_
prefix followed by the parameter name. For more details, see the .
To configure an additional system prompt, use the system_prompt
parameter. For more details, see the .
To serve a model from the local filesystem, specify the from
path as file
and provide the local path. For more details, see .
Refer to the for more details on making chat completion requests.
snowflake_warehouse
Optional, specifies the Snowflake Warehouse to use
snowflake_role
Optional, specifies the role to use for accessing Snowflake data
snowflake_account
Required, specifies the Snowflake account-identifier
snowflake_username
Required, specifies the Snowflake username to use for accessing Snowflake data
snowflake_password
Optional, specifies the Snowflake password to use for accessing Snowflake data
snowflake_private_key_path
Optional, specifies the path to Snowflake private key
snowflake_private_key_passphrase
Optional, specifies the Snowflake private key passphrase
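Putting the Snowflake parameters above together, a minimal sketch using password authentication (the table, warehouse, and secret names are illustrative):

```yaml
datasets:
  - from: snowflake:SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
    name: lineitem
    params:
      snowflake_account: ${secrets:snowflake_account}      # 'orgname-accountname' format
      snowflake_username: ${secrets:snowflake_username}
      snowflake_password: ${secrets:snowflake_password}
      snowflake_warehouse: COMPUTE_WH
```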
Parameter | Description | Default
anthropic_api_key | The Anthropic API key. | -
endpoint | The Anthropic API base endpoint. | https://api.anthropic.com/v1
endpoint | The OpenAI API base endpoint. Can be overridden to use a compatible provider (i.e. Nvidia NIM). | https://api.openai.com/v1
tools | Which tools should be made available to the model. Set to auto to use all available tools. | -
system_prompt | An additional system prompt used for all chat completions to this model. | -
openai_api_key | The OpenAI API key. | -
openai_org_id | The OpenAI organization ID. | -
openai_project_id | The OpenAI project ID. | -
openai_temperature | Set the default temperature to use on chat completions. | -
openai_response_format | An object specifying the format that the model must output, see structured outputs. | -
azure_api_key | The Azure OpenAI API key from the models deployment page. | -
azure_api_version | The API version used for the Azure OpenAI service. | -
azure_deployment_name | The name of the model deployment. | Model name
endpoint | The Azure OpenAI resource endpoint, e.g., https://resource-name.openai.azure.com. | -
azure_entra_token | The Azure Entra token for authentication. | -
perplexity_auth_token | The Perplexity API authentication token. | -
perplexity_* | Additional Perplexity-specific parameters to use on all requests. See the Perplexity API Reference. | -
Parameter | Required | Description
sharepoint_client_id | Yes | The client ID of the Azure AD (Entra) application
sharepoint_tenant_id | Yes | The tenant ID of the Azure AD (Entra) application.
sharepoint_client_secret | Optional | For service principal authentication. The client secret of the Azure AD (Entra) application.
drive_type | Description | Example
drive | The SharePoint drive's name | from: sharepoint:drive:Documents/...
driveId | The SharePoint drive's ID | from: sharepoint:driveId:b!Mh8opUGD80ec7zGXgX9r/...
site | A SharePoint site's name | from: sharepoint:site:MySite/...
siteId | A SharePoint site's ID | from: sharepoint:siteId:b!Mh8opUGD80ec7zGXgX9r/...
group | A SharePoint group's name | from: sharepoint:group:MyGroup/...
groupId | A SharePoint group's ID | from: sharepoint:groupId:b!Mh8opUGD80ec7zGXgX9r/...
me | A user's OneDrive | from: sharepoint:me/...

Location | Example
The root of the drive | from: sharepoint:me/root
A specific path within the drive | from: sharepoint:drive:Documents/path:/top_secrets
A specific folder ID | from: sharepoint:group:MyGroup/id:01QM2NJSNHBISUGQ52P5AJQ3CBNOXDMVNT
Description | ML Format(s) | LLM Format(s)
OpenAI (or compatible) LLM endpoint | - | OpenAI-compatible HTTP endpoint
Models hosted on HuggingFace | ONNX | GGUF, GGML, SafeTensor
Models hosted on the Spice.ai Cloud Platform | ONNX | OpenAI-compatible HTTP endpoint
Azure OpenAI | - | OpenAI-compatible HTTP endpoint
Models hosted on Anthropic | - | OpenAI-compatible HTTP endpoint
Models hosted on xAI | - | OpenAI-compatible HTTP endpoint
Query web3 data with SQL via the Apache Arrow Flight API
SQL query results are now available as Apache Arrow data frames via a high-performance Apache Arrow Flight endpoint.
Arrow Flight is a data protocol built on the high-performance, open-source gRPC protocol.
This enables high-speed access to your data in Python, Go, C++, C#, and Rust, and makes it easy to use libraries like Pandas and NumPy.
We recommend using our SDKs to connect and query this endpoint. SDKs are available for Python, Node.js, and Go with more coming soon. In Python, the query results from the SDK can be easily converted to Pandas or NumPy format.
You may also use Apache's pyarrow
library directly.
Note on Apple M1 Macs - How do I know if I have an M1?
The spicepy/pyarrow
installation requires miniforge.
See the Python SDK page for installation steps.
Use the gRPC + TLS URL: grpc+tls://flight.spiceai.io
For Firecache use the gRPC + TLS URL: grpc+tls://firecache.spiceai.io
For documentation on the Spice Firecache see
Use basic authentication
Username can be set to an empty string
Password should be set to the API key of your app
Table names must be fully-qualified. For example eth.blocks
Find code samples in Python in Arrow Flight Samples.
If you get this error:
Could not get default pem root certs
Install the Let's Encrypt root certificates.
Instructions for using models hosted on the Spice Cloud Platform with Spice.
To use a model hosted on the Spice Cloud Platform, specify the spice.ai
path in the from
field.
Example:
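A minimal sketch using the drive_stats example path described below:

```yaml
models:
  - from: spice.ai/lukekim/smart/models/drive_stats:latest
    name: drive_stats
```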
Specific model versions can be referenced using a version label or Training Run ID.
from Format
The from key must conform to the following regex format:
Examples:
spice.ai/lukekim/smart/models/drive_stats:latest
: Refers to the latest version of the drive_stats model in the smart application by the user or organization lukekim.
spice.ai/lukekim/smart/drive_stats:60cb80a2-d59b-45c4-9b68-0946303bdcaf
: Specifies a model with a unique training run ID.
Prefix (Optional): The value must start with spice.ai/
.
Organization/User: The name of the organization or user (org
) hosting the model.
Application Name: The name of the application (app
) which the model belongs to.
Model Name: The name of the model (model
).
Version (Optional): A colon (:
) followed by the version identifier (version
), which could be a semantic version, latest
for the most recent version, or a specific training run ID.
SQL Query (Cloud Data Warehouse) API
The SQL Query API provides powerful capabilities for querying data managed by the Spice.ai Cloud Data Warehouse and connected external data sources using federated SQL queries. Results can be fetched through either the high-performance Apache Arrow Flight API or a standard HTTP API.
For production use, leverage the high-performance Apache Arrow Flight API, which is optimized for large-scale data workloads. The Spice SDKs default to querying via Arrow Flight.
• Endpoint: grpc+tls://flight.spiceai.io
• For additional details, refer to the Apache Arrow Flight API documentation.
The SQL Query API is also accessible via HTTP, offering standard integration for web applications.
• Core Endpoint: https://data.spiceai.io/v1/sql
• For more details, consult the HTTP SQL API documentation.
Query web3 data with SQL via the HTTP API
Blockchain and contract data may be queried by posting SQL to the /v1/sql
API and /v1/firesql
API for Firecached data. For documentation on the Spice Firecache see .
See Tables for a list of tables to query or browse the example queries listed in the menu.
An API key is required for all SQL queries.
Results are limited to 500 rows. Use the Apache Arrow Flight API to fetch up to 1M rows in a single query or the Async HTTP API to fetch results with paging.
Requests are limited to 90 seconds.
POST
https://data.spiceai.io/v1/sql
The SQL query should be sent in the body of the request as plain text
Name | Type | Description
api_key | String | The API Key for your Spice app
Content-Type* | String | text/plain
X-API-KEY | String | The API Key for your Spice app
POST
https://data.spiceai.io/v1/firesql
The SQL query should be sent in the body of the request as plain text
Name | Type | Description
api_key | String | The API Key for your Spice app
Content-Type* | String | text/plain
X-API-KEY | String | The API Key for your Spice app
Instructions for using xAI models
To use a language model hosted on xAI, specify xai
path in the from
field and the associated xai_api_key
parameter:
Parameter | Description | Default
xai_api_key | The xAI API key. | -
Example:
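A minimal sketch (the grok-2-latest model ID and secret name are illustrative):

```yaml
models:
  - from: xai:grok-2-latest
    name: grok
    params:
      xai_api_key: ${secrets:SPICE_XAI_API_KEY}
```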
Refer to the xAI models documentation for more details on available models and configurations.
Although the xAI documentation indicates that xAI models can return structured outputs, structured outputs are not currently supported.
Instructions for using machine learning models hosted on HuggingFace with Spice.
To use a model hosted on HuggingFace, specify the huggingface.co
path in the from
field and, when needed, the files to include.
from
The from
key takes the form of huggingface:model_path
. Below are two common examples of from
key configuration.
huggingface:username/modelname
: Implies the latest version of modelname
hosted by username
.
huggingface:huggingface.co/username/modelname:revision
: Specifies a particular revision
of modelname
by username
, including the optional domain.
The from
key follows the following regex format.
The from
key consists of five components:
Prefix: The value must start with huggingface:
.
Domain (Optional): Optionally includes huggingface.co/
immediately after the prefix. Currently no other Huggingface compatible services are supported.
Organization/User: The HuggingFace organization (org
).
Model Name: After a /
, the model name (model
).
Revision (Optional): A colon (:
) followed by the git-like revision identifier (revision
).
name
The model name. This will be used as the model ID within Spice and Spice's endpoints (i.e. https://data.spiceai.io/v1/models
). This can be set to the same value as the model ID in the from
field.
params
Parameter | Description | Default
hf_token | The Huggingface access token. | -
model_type | The architecture to load the model as. Supported values: mistral, gemma, mixtral, llama, phi2, phi3, qwen2, gemma2, starcoder2, phi3.5moe, deepseekv2, deepseekv3 | -
tools | Which tools should be made available to the model. Set to auto to use all available tools. | -
system_prompt | An additional system prompt used for all chat completions to this model. | -
files | The specific file path for the Huggingface model. For example, GGUF model formats require a specific file path; other varieties (e.g. .safetensors) are inferred. |
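A minimal sketch for a HuggingFace-hosted LLM (the model path and secret name are illustrative):

```yaml
models:
  - from: huggingface:huggingface.co/microsoft/Phi-3-mini-4k-instruct
    name: phi3
    params:
      hf_token: ${secrets:hf_token}
```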
Access tokens can be provided for Huggingface models in two ways:
In the Huggingface token cache (i.e. ~/.cache/huggingface/token
). Default.
Via model params.
For more details on authentication, see access tokens.
Limitations
The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.
ML models currently only support ONNX file format.
Apps are self-contained instances of Spice OSS Runtime, running in Spice.ai Cloud Platform.
Each app has a unique API Key and is owned by an individual account or organization.
Use the Playground's SQL editor to easily explore data
Open the SQL editor by navigating to an App Playground and clicking SQL Query in the sidebar.
The Spice.ai Query Editor will suggest table and column names along with keywords as you type. You can manually prompt for a suggestion by pressing ctrl+space.
Start typing a SQL command, such as SELECT * FROM
As you type, the Query Editor will suggest possible completions based on the query context. You can use the arrow keys or mouse to select a completion, and then press Enter or Tab to insert it into the editor.
Examples of using the SQL suggestions:
Select the spice.runtime.metrics
table:
Type SELECT * FROM
and press Tab. The editor will suggest spice.runtime.metrics
as a possible table. Press Enter to insert it into the query.
Show the fields in the spice.runtime.metrics
table:
Type SELECT * FROM spice.runtime.metrics WHERE "
. The editor will list the fields in the table.
The datasets reference displays all available datasets from the current app and allows you to search through them. Clicking on the dataset will insert a sample query into the SQL editor, which will be automatically selected for execution.
Each Spice app has two pre-generated API keys, which can be used with the Spice SDKs, the HTTP API, or the Apache Arrow Flight API.
Select Spice app and navigate to Settings -> General.
Click the API Key 1 or API Key 2 field to copy the key value.
You can regenerate each key if you need to invalidate it.
Organizations enable you to share apps, datasets, users, billing, and settings with your team. Organization administrators can set who has access to their organization's resources and data.
When you create an account on Spice.ai, a single member organization of the same name as your username is created for you and you are automatically added as a member and the owner of the organization.
Spice.ai organizations are created by connecting an existing GitHub organization to Spice.ai.
Click on the organization dropdown icon in the application selector. Next, select the Create Org option from the menu.
Check to accept the terms and conditions for the new organization, then proceed by clicking the Connect GitHub organization button.
A window will pop up from GitHub where you can select the organization to install the Spice.ai app into.
On the confirmation page proceed by clicking the Install button.
Upon successful connection, you will be automatically redirected to the newly created Spice.ai organization.
To view your organizations, click the dropdown icon from the application selector.
All organizations you have access to are listed.
Click on the first tab to access the details of your current organization or select another organization from the menu to view its information. On this page, you will see all the applications that have been created within the selected organization.
Click the Settings tab to view information about the organization, including members and billing information.
To add an existing Spice.ai user to an organization:
Navigate to the organization's settings.
Click the Add Member button.
Enter the Spice.ai username of the user you wish to add to the organization.
Click the Add Member button to confirm.
The user will be added to the organization and they will receive an email notifying them of the new membership.
To invite GitHub user to a Spice.ai organization:
Enter the GitHub username of the user you wish to invite to the organization and select the user from the search results. Only users with a public email address can be invited.
The invited user will receive an invitation link. Once they accept the invitation, they will be granted access to the organization.
To invite anyone by email
Enter the email address of the user you want to invite to the organization, then click Send invite.
To remove a member from an organization:
Navigate to the organization's settings.
Locate the user you wish to remove from the list of members.
Click the ellipsis on the right of the user's card.
Confirm the removal by clicking the Remove member from organization button in the confirmation popup.
To update a Spice App's spicepod.yaml, navigate to the Code tab. Use the Components sidebar to add data connectors, model providers, and preconfigured datasets, or manually edit spicepod.yaml in the code editor.
The App Code tab also allows you to explore and preview files in the connected repository.
App transfer is currently limited to organizations you have access to.
To transfer an app, click Settings in the app navigation.
In the Danger Zone section of App Settings, click the Transfer app button.
On the Transfer application page, select the New owner organization from the menu.
Type the full app name into the text box to confirm, then click Transfer Application to complete the process.
The App will now be accessible by the receiving organization and its members.
A Spicepod is a package that encapsulates application-centric datasets and machine learning (ML) models.
Spicepods are analogous to code packages in systems like NPM; however, they differ by expanding the concept to data and ML models.
A Spicepod is described by a YAML manifest file, typically named spicepod.yaml
, which includes the following key sections:
Metadata: Basic information about the Spicepod, such as its name and version.
Datasets: Definitions of datasets that are used or produced within the Spicepod.
Catalogs: Definitions of catalogs that are used within the Spicepod.
Models: Definitions of ML models that the Spicepod manages, including their sources and associated datasets.
Catalogs in a Spicepod can contain multiple schemas. Each schema, in turn, contains multiple tables where the actual data is stored.
ML models are integrated into the Spicepod similarly to datasets. The models can be specified using paths to local files or remote locations. ML inference can be performed using the models and datasets defined within the Spicepod.
Connect your Spice.ai app to a GitHub repository
Before connecting:
Matching repository name: The Spice.ai app and the GitHub repository names must match. For example:
Make sure to copy app spicepod.yaml contents from the Code tab and place it in the root of the repository before linking.
Ensure the repository is set up as per instructions above.
In the context of the Spice app to connect, navigate to Settings, then click the Connect repository button.
Follow GitHub App installation instructions.
Ensure that you select all repositories or specifically the repository you intend to connect.
Finally, link the repository to your Spice.ai app.
App Secrets are key-value pairs that are passed to the Spice Runtime instance as environment secrets. Secrets are securely encrypted and accessible only through the app in which they were created.
Once a secret is saved, its value cannot be retrieved through Spice Cloud. If you need to update the secret value, you must delete the existing secret and create a new one.
Select your app.
Navigate to Settings tab and select Secrets section.
Fill Secret Name and Secret Value fields and click Add.
Saved secrets can be referenced in the Spicepod configuration as
${secrets::<SECRET_NAME>}
, for example:
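For example, a secret named OPENAI_API_KEY could be referenced from a model definition as follows (the model and secret name are illustrative):

```yaml
models:
  - from: openai:gpt-4o
    name: gpt-4o
    params:
      openai_api_key: ${secrets::OPENAI_API_KEY}
```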
Use AI Chat to interact with Spice Models
PREREQUISITE
Open the AI Chat by navigating to an App Playground and clicking AI Chat in the sidebar.
Start to use AI Chat by typing in the question and clicking send.
Below is an example demonstrating how to configure the OpenAI gpt-4o
model with auto
access to runtime tools and system prompts overrides. This model is customized to answer questions relevant to spicepod
datasets.
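A sketch of such a configuration (the system prompt wording and secret name are illustrative):

```yaml
models:
  - from: openai:gpt-4o
    name: gpt-4o
    params:
      openai_api_key: ${secrets:SPICE_OPENAI_API_KEY}
      tools: auto   # give the model access to all available runtime tools
      system_prompt: |
        You answer questions about the datasets defined in this spicepod.
        Prefer querying the datasets over relying on general knowledge.
```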
Ask questions regarding datasets configured in spicepod within the AI Chat.
Spice.ai provides observability into the AI Chat, showing full tool usage traces and chat completion history.
Navigate to the Observability section in the portal.
Select an ai_chat
task history and view details over the chat completion history, including timestamps, tool usage, intermediate outputs, etc.
The editor will list the fields in the spice.runtime.metrics
table along with their type. After saving the spicepod changes, a new deployment must be triggered. Learn more about .
If the Spice App is connected to a GitHub repository (learn more about how to connect), the only way to update the Spicepod configuration is to edit the spicepod.yaml file in the root of your repository and push it to the default branch.
To apply the updated spicepod, a new deployment must be triggered. Learn more about .
The app must be connected to a public GitHub repository to be made public. Check out how to connect app to the repository - .
After that, the app will be visible to all users at https://spice.ai/<org-name>/<app-name>
and searchable at .
You can transfer an App's ownership to another .
Learn more about .
Every Spice app is powered by a managed instance of the Spice OSS Runtime deployed to the platform.
Datasets in a Spicepod can be sourced from various locations, including local files or remote databases. They can be materialized and accelerated using different engines such as DuckDB, SQLite, or PostgreSQL to optimize performance.
To learn more, please refer to the full .
Admin access: Ensure you have administrative access to the GitHub repository. This level of access is required to .
Spice.ai app:
GitHub repository:
To quickly set up a new repository, use the as a starting point:
Spice provides an OpenAI compatible chat completion AI at . Authorize with the endpoint using an .
For more information about using chat completions, refer to the .
Example completion response:
This example requires the openai
Python package.
Create and run the example Python script to run a completion:
Running this example outputs a model response:
To apply secrets, you must initiate a new spicepod deployment.
Ensure the Spice App is deployed with a model. For detailed instructions on how to deploy a model, refer to the .
The ability of AI Chat depends on the model configuration, including , , etc. Refer to the for details of customizing the model used in AI Chat.
Each new app deployment automatically retrieves the most recent stable Spice OSS release. Visit the Releases page or Spice OSS blog to check for the latest runtime updates.
All Spice Apps are now powered by the latest, next-generation Spice.ai Open Source data and AI engine. Existing apps have been migrated but require a manual setup step to connect datasets and/or model providers.
Learn More: Read the Announcing 1.0-stable blog post for details on this upgrade, and visit the Spice.ai Cookbook for over 50 quickstarts and examples.
Using Spice.ai for Agentic AI Applications
Build intelligent autonomous agents that act contextually by grounding AI models in secure, full-knowledge datasets with fast, iterative feedback loops.
Spice.ai helps in building intelligent autonomous agents by leveraging several key features:
Spice.ai enables federated querying across databases, data warehouses, and lakes. With advanced query push-down optimizations, it ensures efficient retrieval and processing of data across disparate sources, reducing latency and operational complexity. Learn more about Federated SQL Query. For practical implementation, refer to the Federated SQL Query recipe.
Spice.ai materializes application-specific datasets close to the point of use, reducing query and thus retrieval times, and infrastructure costs. It supports Change Data Capture (CDC), keeping materialized data sets up-to-date with minimal overhead and enabling real-time, reliable data access. Learn more about Data Acceleration. See the DuckDB Data Accelerator recipe for an example.
Integrate AI into your applications with Spice.ai’s AI Gateway. It supports hosted models like OpenAI and Anthropic and local models such as OSS Llama and NVIDIA NIM. Fine-tuning and model distillation are simplified, helping faster cycles of development and deployment. Learn more about AI Gateway. Refer to the Running Llama3 Locally recipe for details.
Spice.ai provides advanced search capabilities, including vector similarity search (VSS), enabling efficient retrieval of unstructured data, embeddings, and AI model outputs. This is critical for applications like RAG and intelligent search systems. Learn more about Vector Similarity Search. For implementation, see the Searching GitHub Files recipe.
Built-in semantic models allow Spice.ai to align AI operations with enterprise data, ensuring that applications are grounded in contextual, full-knowledge datasets. This enhances the accuracy and reliability of AI outputs while reducing risks of irrelevant or untrustworthy results. Learn more about Semantic Model for AI.
Spice.ai includes robust monitoring and observability tools tailored for AI applications. These tools provide end-to-end visibility into data flows and AI workflows, LLM-specific observability to monitor model performance, track usage, and manage drift, and security and compliance auditing for data and model interactions. Learn more about Monitoring and Observability.
Use Spice as a CDN for Databases
Colocate a local working set of hot data with data applications and frontends to serve more concurrent requests and users with faster page loads and data updates.
Maintain local replicas of data with the application to significantly enhance application resilience and availability.
Create a materialization layer for visualization tools like Power BI, Tableau, or Superset to achieve faster, more responsive dashboards without incurring massive compute costs.
Use Spice for Retrieval-Augmented-Generation (RAG)
Use Spice to access data across various data sources for Retrieval-Augmented-Generation (RAG).
Spice enables developers to combine structured data via SQL queries and unstructured data through built-in vector similarity search. This combined data can then be fed to large language models (LLMs) through a native AI gateway, enhancing the models' ability to generate accurate and contextually relevant responses.
Personal Access Tokens (PATs) provide a secure and straightforward way to authenticate and manage access to your Spice AI account via the API and CLI.
Personal Access Tokens can be created in Account settings.
Click the profile picture on the top right corner of the portal interface.
From the dropdown menu, find and select the Account settings option.
Select Tokens in the sidebar
Enter descriptive token name
Specify the scope of the token to limit access only for the selected organization, or use All
Select expiration date
After creating a token, copy and securely store the value, as it cannot be retrieved later.
A light or dark mode portal theme can be set:
Click the profile picture on the top right corner of the portal interface.
Select Light, Dark, or System mode using the theme toggle.
Continuously identify and fix issues by tracking process actions with Spice monitoring and request logs.
Under the Monitoring tab, track your app requests, their status codes, and duration, in addition to the existing usage monitoring metrics dashboard.
Monitor the success, failures, and durations of SQL queries, AI completions, Vector Searches, Embedding calculations, and accelerated dataset refreshes.
Track request volume, data usage, and query time at 1-hour, 24-hour, 7-day, and 28-day granularities. Start by going to your app and navigating to the Monitoring tab.
Under the Monitoring tab, select Logs. You can then toggle between Metrics and Logs views.
Within Logs, you have the option to retrieve API requests from the past hour, 8 hours, 24 hours, or up to the previous three days.
The spicepy
SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
Calling to_pandas()
on the FlightStreamReader
will wait for the stream to return all of the data before returning a pandas DataFrame.
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
Alternatively, to process chunks of data as they arrive instead of waiting for all data to arrive, FlightStreamReader
supports reading chunks of data as they become available with read_chunk()
. Using the same query example above, but processing data chunk by chunk:
spicepy
enables streaming through the use of the .
The object returned from spicepy.Client.query()
is a FlightStreamReader.
To operate on partial results while the data is streaming, we will take advantage of the read_chunk() method on FlightStreamReader
. This returns a FlightStreamChunk
, which has a data
attribute that is a RecordBatch. Once we have the RecordBatch, we can call to_pandas()
on it to return the partial data as a pandas DataFrame. When the stream has ended, calling read_chunk()
will raise a StopIteration
exception that we can catch.
This call will return a pandas DataFrame with all 10,000 suppliers, and is a synchronous call that waits for all data to arrive before returning.
The Python SDK spicepy
is the easiest way to use and query Spice.ai in Python.
The Python SDK uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to Pandas data frames.
Python 3.11+
The following packages are required and will be automatically installed by pip:
pyarrow
pandas
certifi
requests
Install the spicepy
package directly from the Spice Github Repository at https://github.com/spiceai/spicepy:
Import spicepy
and create a Client
by providing your API Key.
You can then submit queries using the query function.
Querying data is done through a Client
object that initializes the connection with the Spice.ai endpoint. Client
has the following arguments:
api_key (string, optional): Spice.ai API key to authenticate with the endpoint.
url (string, optional): URL of the endpoint to use (default: grpc+tls://flight.spiceai.io)
tls_root_cert (Path or string, optional): Path to the TLS certificate to use for the secure connection (omit for automatic detection)
Once a Client
is obtained queries can be made using the query()
function. The query()
function has the following arguments:
query (string, required): The SQL query.
timeout (int, optional): The timeout in seconds.
A custom timeout can be set by passing the timeout
parameter in the query
function call. If no timeout is specified, it defaults to 10 minutes, after which the query is cancelled and a TimeoutError exception is raised.
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spicepy
library at: https://github.com/spiceai/spicepy
The Node.js SDK spice.js is the easiest way to use and query Spice.ai with Node.js.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames, which are then easily converted to JavaScript objects/arrays or JSON.
Import SpiceClient
and instantiate a new instance with an API Key.
You can then submit queries using the query
function.
SpiceClient
has the following arguments:
apiKey
(string, required): API key to authenticate with the endpoint.
url
(string, optional): URL of the endpoint to use (default: flight.spiceai.io:443)
Follow the quickstart guide to install and run spice locally.
Check Spice OSS documentation to learn more.
From version 1.0.1 the SpiceClient
implements connection retry mechanism (3 attempts by default). The number of attempts can be configured via setMaxRetries
:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429)
.
Contribute to or file an issue with the spice.js
library at: https://github.com/spiceai/spice.js.
Use Spice for Enterprise Search and Retrieval
Enterprises face the challenge of accessing data from various disparate and legacy systems to provide AI with comprehensive context. Speed is crucial for this process to be effective.
Spice offers a fast knowledge index into both structured and unstructured data, enabling efficient vector similarity search across multiple data sources. This ensures that AI applications have access to the necessary data for accurate and timely responses.
Perform across databases, data warehouses, and data lakes using .
The @spiceai/spice
SDK supports streaming partial results as they become available.
This can be used to enable more efficient pipelining scenarios where processing each row of the result set can happen independently.
The Client.query
function takes an optional onData
callback that will be passed partial results as they become available.
In this example, we retrieve all 10,000 suppliers from the TPCH Suppliers table. This query retrieves all suppliers in a single call:
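A sketch of that single call (the tpch.supplier table name is an assumption for the TPCH Suppliers dataset; run inside an async function or an ES module with top-level await):

```typescript
import { SpiceClient } from "@spiceai/spice";

const client = new SpiceClient("API_KEY");

// Fetch the entire result set in one call; the promise resolves only
// after every row has been streamed to the client.
const table = await client.query("SELECT * FROM tpch.supplier");
console.log(`Received ${table.numRows} suppliers`);
```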
This call will wait for the promise returned by query()
to complete, returning all 10,000 suppliers.
Alternatively, data can be processed as it is streamed to the SDK. Provide a callback function to the onData
parameter, which will be called with every partial set of data streamed to the SDK:
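A sketch of the streaming variant, using the same assumed table name (the partial results arrive as Apache Arrow Tables from the apache-arrow package):

```typescript
import { SpiceClient } from "@spiceai/spice";
import { Table } from "apache-arrow";

const client = new SpiceClient("API_KEY");

let received = 0;

// The onData callback is invoked with each partial Arrow Table as it
// streams in, so rows can be processed before the query completes.
await client.query("SELECT * FROM tpch.supplier", (partial: Table) => {
  received += partial.numRows;
  console.log(`Received ${received} suppliers so far`);
});
```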
SpiceClient is the top-level object that connects to Spice.ai.
params.api_key
(string, optional): API key to authenticate with the endpoint
params.http_url
(string, optional): URL of the HTTP endpoint to use
params.flight_url
(string, optional): URL of the endpoint to use (default: localhost:50051
, using local Spice Runtime)
Default connection to local Spice Runtime:
Connect to the Spice.ai Cloud Platform:
Or using shorthand:
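A combined sketch of the three connection forms above. The constructor shapes follow the parameter names listed here and are assumptions; check the spice.js README for the exact overloads.

```typescript
import { SpiceClient } from "@spiceai/spice";

// Default: connect to a local Spice runtime (flight_url localhost:50051).
const localClient = new SpiceClient();

// Connect to the Spice.ai Cloud Platform with explicit parameters.
const cloudClient = new SpiceClient({
  api_key: "API_KEY",
  flight_url: "flight.spiceai.io:443",
});

// Shorthand: passing only the API key string targets the Cloud Platform.
const client = new SpiceClient("API_KEY");
```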
query(queryText: string, onData?: (partialData: Table) => void) => Promise<Table>
queryText
: (string, required): The SQL query to execute
onData
: (callback, optional): The callback function that is used for handling streaming data.
query
returns an Apache Arrow Table.
To get the data in JSON format, iterate over each row by calling toArray()
on the table and call toJSON()
on each row.
Get all of the elements for a column by calling getChild(name: string)
and then calling toJSON()
on the result.
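A short sketch of consuming the result (assumes a client created as above and the TPCH supplier column names s_name and s_acctbal):

```typescript
// `client` is a SpiceClient created as shown above.
const table = await client.query(
  "SELECT s_name, s_acctbal FROM tpch.supplier LIMIT 5"
);

// Convert each Arrow row proxy to a plain JSON object.
for (const row of table.toArray()) {
  console.log(row.toJSON());
}

// Or extract a single column and convert it to a JavaScript array.
const names = table.getChild("s_name")?.toJSON();
console.log(names);
```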
How to use the Spice.ai for GitHub Copilot Extension
Access structured and unstructured data from any Spice data source like GitHub, PostgreSQL, MySQL, Snowflake, Databricks, GraphQL, data lakes (S3, Delta Lake, OneLake), HTTP(s), SharePoint, and even FTP.
Some example prompts:
@spiceai
What documentation is relevant to this file?
@spiceai
Write documentation about the user authentication issue
@spiceai
Who are the top 5 committers to this repository?
@spiceai
What are the latest error logs from my web app?
Scroll down, and click Install it for free.
Once installed, open Copilot Chat and type @spiceai
. Press enter.
A prompt will appear to connect to the Spice.ai Cloud Platform.
You will need to authorize the extension. Click Authorize spiceai.
To create an account on the Spice.ai Cloud Platform, click Authorize Spice AI Platform.
Once your account is created, you can configure the extension. Select from a set of ready-to-use datasets to get started. You can configure other datasets after setup.
The extension will take up to 30 seconds to deploy and load the initial dataset.
When complete, proceed back to GitHub Copilot Chat.
To chat with the Spice.ai for GitHub Copilot extension, prefix the message with @spiceai
To list the datasets available to Copilot, try @spiceai What datasets do I have access to?
This library supports the following Java implementations:
OpenJDK 11
OpenJDK 17
OpenJDK 21
OracleJDK 11
OracleJDK 17
OracleJDK 21
OracleJDK 22
1. Import the package.
5. Iterate through the FlightStream
to access the records.
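End to end, the flow looks roughly like the sketch below. The package and builder method names (ai.spice, withApiKey, withSpiceCloud) are assumptions drawn from the pattern in these docs; consult the Java SDK README for the exact identifiers.

```java
import ai.spice.SpiceClient;                 // package name is an assumption
import org.apache.arrow.flight.FlightStream;

public class Example {
    public static void main(String[] args) throws Exception {
        // Build a client for the Spice.ai Cloud Platform.
        SpiceClient client = SpiceClient.builder()
                .withApiKey(System.getenv("SPICE_API_KEY"))
                .withSpiceCloud()
                .build();

        // Execute a SQL query and iterate the resulting FlightStream.
        try (FlightStream stream = client.query("SELECT * FROM tpch.supplier LIMIT 10")) {
            while (stream.next()) {
                // Each batch is exposed as a VectorSchemaRoot of Arrow vectors.
                System.out.println(stream.getRoot().contentToTSVString());
            }
        }
    }
}
```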
Or using custom flight address:
The SpiceClient
implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured with withMaxRetries
:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429)
.
Golang SDK for Spice.ai
Get the gospice package.
1. Import the package.
3. Initialize the SpiceClient
.
5. Iterate through the reader to access the records.
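Put together, the steps look roughly like this sketch. The option names (WithApiKey, WithSpiceCloudAddress) and the /v7 module suffix are assumptions; check the gospice GoDocs for the exact signatures.

```go
package main

import (
	"context"
	"fmt"

	gospice "github.com/spiceai/gospice/v7" // major-version suffix is an assumption
)

func main() {
	// Create and initialize the client.
	spice := gospice.NewSpiceClient()
	defer spice.Close()

	if err := spice.Init(
		gospice.WithApiKey("API_KEY"),
		gospice.WithSpiceCloudAddress(),
	); err != nil {
		panic(fmt.Errorf("error initializing SpiceClient: %w", err))
	}

	// Execute a SQL query; the result is an Apache Arrow record reader.
	reader, err := spice.Query(context.Background(), "SELECT * FROM tpch.supplier LIMIT 10")
	if err != nil {
		panic(fmt.Errorf("error querying: %w", err))
	}
	defer reader.Release()

	// Iterate through the reader to access the records.
	for reader.Next() {
		fmt.Println(reader.Record())
	}
}
```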
Or using custom flight address:
Run go run .
to execute a sample query and print the results to the console.
The SpiceClient
implements a connection retry mechanism (3 attempts by default). The number of attempts can be configured via SetMaxRetries
:
Retries are performed for connection and system internal errors. It is the SDK user's responsibility to properly handle other errors, for example RESOURCE_EXHAUSTED (HTTP 429)
.
Create custom dashboards and visualizations using the FlightSQL or Infinity Grafana Plugins:
In the Grafana installation, navigate to Administration
-> Plugins and data
-> Plugins
. Select "State: All" and search for "FlightSQL" in the Search box. Select "FlightSQL" in the list of plugins.
Click the "Install" button to install the plugin. Once installed, a new data source can be added.
Click the "Add new data source" button from the FlightSQL plugin menu.
In the data source menu, provide the following options:
Host:Port
- set to flight.spiceai.io:443
to connect to the Spice.ai Cloud Platform
Auth Type
- set to token
Once these options are set, click "Save & Test". Grafana should report "OK" if the configuration is correct.
Create a visualization using the FlightSQL Spice.ai data source. Select the configured data source from the list of data sources in the visualization creator.
Create your query using the SQL builder, or switch to a plain SQL editor by clicking the "Edit SQL" button. In this example, the query retrieves the latest query execution times from the configured App instance and displays them as a line graph.
Grafana may not automatically update the visualization when changes are made to the query. To force the visualization to update with new query changes, click the "Query Inspector" button, and click "Refresh".
The Spice.ai for GitHub Copilot Extension makes it easy to access and chat with external data in GitHub Copilot, enhancing AI-assisted research, Q&A, code, and documentation suggestions for greater accuracy.
To install the extension, visit the GitHub Marketplace and search for Spice.ai.
If @spiceai
does not appear in the popup (2), ensure that all the steps have been followed.
The Java SDK is the easiest way to query the Spice.ai Cloud Platform from Java.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
2. Create a SpiceClient
by providing your API key. Get your free API key at .
3. Execute a query and get back a FlightStream.
Check to learn more.
Follow the quickstart guide to install and run spice locally.
Check or to learn more
Contribute to or file an issue with the spice-rs
library at:
The gospice
SDK is the easiest way to query Spice.ai from Go.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
GoDocs are available at:
(or later)
2. Create a SpiceClient
by providing your API key. Get your free API key at .
4. Execute a query and get back an Apache Arrow record reader.
Follow the quickstart guide to install and run spice locally.
Check to learn more.
Contribute to or file an issue with the gospice
library at:
The spiceai
SDK is the easiest way to query Spice.ai from .NET.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
Create a SpiceClient
by providing your API key to SpiceClientBuilder
. Get your free API key at .
Execute a query and get back Apache Arrow records.
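A rough sketch of those two steps. The namespace, builder methods, and the shape of the query result are assumptions based on the description above; check the spice-dotnet README for the actual API.

```csharp
using System;
using Spice; // namespace is an assumption

// Build a client with the SpiceClientBuilder described above.
var client = new SpiceClientBuilder()
    .WithApiKey(Environment.GetEnvironmentVariable("SPICE_API_KEY"))
    .Build();

// Execute a SQL query; the results are assumed to arrive as Apache Arrow
// record batches that can be enumerated.
var batches = await client.Query("SELECT * FROM tpch.supplier LIMIT 10");
foreach (var batch in batches)
{
    Console.WriteLine($"Rows in batch: {batch.Length}");
}
```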
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spice-dotnet
library at:
Use the for an installation reference.
Token
- set to your Spice.ai API key.
The spice-rs
SDK is the easiest way to query Spice.ai from Rust.
It uses Apache Arrow Flight to efficiently stream data to the client and Apache Arrow Records as data frames.
1. Create a SpiceClient
by providing your API key to ClientBuilder
. Get your free API key at .
2. Execute a query and get back Apache Arrow records.
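A rough sketch of those two steps. The crate name (spiceai), the builder methods, and the streaming result type are assumptions; check the spice-rs README for the actual API.

```rust
// Cargo deps (assumed): spiceai, tokio (rt-multi-thread, macros), futures
use futures::StreamExt;
use spiceai::ClientBuilder; // crate name is an assumption

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Build a client with your API key.
    let mut client = ClientBuilder::new()
        .api_key("API_KEY")
        .build()
        .await?;

    // 2. Execute a SQL query; the result is assumed to be a stream of
    //    Apache Arrow record batches.
    let mut batches = client.query("SELECT * FROM tpch.supplier LIMIT 10").await?;
    while let Some(batch) = batches.next().await {
        println!("{:?}", batch?);
    }

    Ok(())
}
```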
Follow the quickstart guide to install and run spice locally.
Contribute to or file an issue with the spice-rs
library at: