User-Defined Functions

User-defined scalar functions (UDFs) and table functions (UDTFs) declared in a spicepod's `functions:` section — SQL, Remote HTTP, and WebAssembly tiers.

Spice.ai supports user-defined functions declared declaratively in a spicepod's functions: section. Each function can be a scalar UDF (one value per input row) or a table UDTF (a relation returned in a SQL FROM clause), and is dispatched through one of three execution tiers:

  • SQL (from: sql) — in-process function whose body is a DataFusion SQL expression (scalar) or query (table).

  • Remote (from: http://… or from: https://…) — async function that invokes a remote HTTP endpoint with a JSON request and receives a JSON response.

  • WebAssembly (from: wasm) — function backed by a sandboxed WASM module invoked with Arrow IPC batches.

Functions are automatically registered into the SQL session, exposed via the list_udfs() UDTF and the GET /v1/functions HTTP endpoint, and (when scalar) surfaced to LLM tool-calling. They hot-reload when the spicepod changes on disk.

Distribution availability. Inline SQL functions (from: sql) are available in all Spice distributions, including open source. Remote HTTP (from: http://…) and WebAssembly (from: wasm) tiers ship by default in Spice.ai Enterprise and Spice Cloud Platform distributions; open source users can enable them by building locally with the http-functions and wasm-functions cargo features. See Distributions.

Set runtime.functions.enabled: true in spicepod.yaml to register user-defined functions. Without the flag, functions: entries log an error and are skipped at startup, and tools: entries with as_sql: true log a warning and remain LLM-only.

Quickstart

Declare a function in spicepod.yaml:

version: v2
kind: Spicepod
name: my_app

runtime:
  functions:
    enabled: true

functions:
  - name: haversine_km
    from: sql
    description: Haversine distance in kilometres.
    volatility: immutable
    signature:
      args:
        - { name: lat1, type: float64 }
        - { name: lon1, type: float64 }
        - { name: lat2, type: float64 }
        - { name: lon2, type: float64 }
      returns: float64
    body: |
      6371 * acos(
        cos(radians(lat1)) * cos(radians(lat2)) *
        cos(radians(lon2) - radians(lon1)) +
        sin(radians(lat1)) * sin(radians(lat2))
      )

Use it in SQL:

Schema Reference

Each entry in functions: is a Function object. Fields are strictly validated (deny_unknown_fields is enforced).

Field
Type
Required
Description

name

string

yes

Identifier the function is registered under. Referenced by that name in SQL.

from

string

yes

Source URI selecting the execution tier. sql, http://…, https://…, wasm.

enabled

bool

no

Defaults to true. Set to false to keep the declaration in the spicepod without registering it for SQL, tool exposure, list_udfs(), or /v1/functions.

description

string

no

Free-form description surfaced in list_udfs() and GET /v1/functions.

kind

enum

no

scalar (default) or table. Both are wired today. aggregate and window are reserved and rejected at registration with a clear error.

volatility

enum

no

immutable, stable, volatile (default). See Volatility.

signature

object

yes

Typed signature. See below.

body

string

tier-dep

Inline SQL expression (scalar) or SELECT query (table). Required for from: sql unless body_ref is set. Optional for from: wasm to supply a table input from SQL. Forbidden for from: http*. Mutually exclusive with body_ref.

body_ref

string

tier-dep

Path to a file whose contents are the function body. Resolved relative to the runtime's CWD. Same tier rules as body. Mutually exclusive with body.

metadata

map

no

Free-form metadata surfaced alongside the declaration.

params

map

no

Tier-specific knobs. Supports ${ secrets:KEY } / ${ env:KEY } interpolation. See Remote params and WebAssembly params.

dependsOn

string[]

no

Names of spicepod components that must load before this function. Inferred from SQL bodies and params.input_table when omitted.

metrics

object

no

Per-function metrics configuration.

as_tool

bool

no

Expose the function as an LLM tool. Defaults to true for scalar functions; table functions are always SQL-only. See LLM Tool Exposure.

signature

Field
Description

tables

Optional list of declared table inputs. When present, table inputs are passed before scalar arguments at call sites. Used by table functions and scalar functions that consume a relation.

tables[].name

Logical input name exposed to the backend.

tables[].columns

Arrow schema declared for that input.

args

Positional scalar argument list. Empty for niladic functions.

args[].name

Argument name. Referenced by name in SQL bodies.

args[].type

Arrow logical-type string (e.g. float64, utf8, list<int64>, decimal(38, 10), timestamp(us, utc)). See Supported types.

returns

For kind: scalar, a single Arrow type string. For kind: table, a list of named output columns (each with name and type).

Execution Tiers

SQL (from: sql)

The body is parsed into a DataFusion logical expression against a schema derived from the argument list, then lowered to a physical expression at build time. The standard DataFusion scalar functions (math, string, datetime), Spark built-ins, and datafusion-functions-json are all available in bodies.

The body runs entirely in-process with no sandbox. Prefer SQL-tier UDFs for anything that can be expressed in a SQL expression — they're fastest, type-checked at startup, and don't leave the runtime.

SQL functions support the full set of Arrow logical types accepted by DataFusion, including primitives, list<…>, large_list<…>, struct<…>, decimal(p, s), decimal256(p, s), and timestamp(unit[, timezone]).

body_ref for longer bodies

Keep non-trivial SQL in its own file with proper editor support:

body and body_ref are mutually exclusive; exactly one must be set when from: sql.

Remote (from: http://… | https://…)

The remote tier requires the http-functions cargo feature. It is enabled by default in Spice.ai Enterprise and Spice Cloud Platform distributions. Open source users must build locally with --features http-functions; the prebuilt OSS images and install script do not ship this tier. See Distributions.

The remote tier invokes an external HTTP endpoint, batching rows for scalar invocations and issuing a single request for table inputs. Powered by DataFusion's AsyncScalarUDFImpl, so the query scheduler overlaps remote calls with other work.

Wire contract

Scalar functions without table arguments issue one HTTP request per batch:

  • RequestPOST <endpoint>, Content-Type: application/json, body:

  • Response — HTTP 200 with body:

  • values.length must equal rows.length. A mismatch is treated as an execution error.

Scalar functions with dynamic table arguments and table functions (kind: table) issue a single request per invocation with both scalar arguments and table inputs:

  • Request body:

    The tables field is omitted when no table inputs are declared. Each row in a table input is a JSON object whose keys match the declared tables[].columns.

  • Response body{"rows": [{"<output_col>": <value>, ...}, ...]} for table functions, {"values": [<result>]} for scalar.

Non-2xx responses surface as query errors, with the response body snippet (up to 256 chars, one line) included in the error for debugging. Argument and return values are encoded through Arrow's JSON reader/writer, so any Arrow type with a JSON representation (including list<…>, struct<…>, and timestamps) is permitted.

Remote params

Key
Type
Default
Description

timeout

integer seconds or duration string (2s, 500ms)

30s

Per-call HTTP timeout.

batch_size

integer, 1..=100000

1024

Rows per HTTP request for scalar functions. Large inputs are chunked.

batch_concurrency

integer, 1..=64

4

Maximum in-flight HTTP batches per invocation. Results are appended in input order.

max_response_bytes

integer bytes, up to 1 GiB

10MiB

Maximum decoded response body size per HTTP call. Larger responses fail the query.

max_rows

integer, up to 1000000

100000

Cap on rows returned by a table-function call when the query has no LIMIT. Ignored for scalar batches.

auth_bearer

string (supports ${secrets:…})

none

Sets Authorization: Bearer <value> on every request.

WebAssembly (from: wasm)

The WASM tier requires the wasm-functions cargo feature (compiling Rust source additionally requires wasm-functions-compile). It is enabled by default in Spice.ai Enterprise and Spice Cloud Platform distributions. Open source users must build locally to enable it. See Distributions.

The WebAssembly tier executes user code inside a sandboxed wasmtime module, exchanging Arrow IPC streams with the host. Both scalar and table kinds are supported; a single guest export serves both shapes.

WebAssembly params

Key
Type
Default
Description

module

string

Path to a precompiled .wasm artifact. Mutually exclusive with source. One of the two is required.

source

string

Path to source code to compile to WASM at registration. Requires wasm-functions-compile.

language

string

rust

Source language when source is set. Only rust is supported today.

entrypoint

string

spice_transform

Exported guest function name.

input_table

string

Name of an existing table to use as the table input. Mutually exclusive with body / body_ref.

fuel

integer

100000000

Per-invocation wasmtime fuel budget. Exhaustion fails the call with an interrupt error.

max_memory_bytes

integer

67108864 (64MiB)

Hard cap on guest linear-memory growth.

max_table_elements

integer

1000000

Hard cap on guest WASM-table growth.

The guest module must export memory, spice_alloc, spice_dealloc, and the configured entrypoint. The entrypoint receives (input_ptr, input_len, args_ptr, args_len) and returns a packed (output_ptr << 32) | output_len pointer to an Arrow IPC stream matching the declared return schema.

Table Functions

Set kind: table to register a user-defined table function (UDTF). Table functions return a relation instead of a single value and are invoked in a SQL FROM clause:

The signature.returns field becomes a list of output columns. The function may take any combination of scalar arguments and declared signature.tables inputs; table inputs always precede scalar arguments at call sites.

SQL table functions

The body is a single SELECT query. Scalar arguments are visible through the reserved args table; declared signature.tables inputs are visible by their declared names.

To pass a relation into a table function, declare a signature.tables entry and reference the input table inside the body. The argument at the call site can be a table name or an inline subquery:

Remote and WebAssembly table functions

Remote (from: http://…) and WebAssembly (from: wasm) table functions follow the same kind: table shape. Inputs and outputs are exchanged over the tier's native wire format — JSON for remote, Arrow IPC for WASM — as described in Execution Tiers.

Table functions are always SQL-only. They are not exposed to LLM tool-calling regardless of the as_tool setting.

Supported Types

All tiers share a single Arrow type parser. Type names are case-insensitive and accept shorthand aliases (stringutf8, boolboolean, intint32, doublefloat64).

Arrow type
Notes

int8, int16, int32, int64

uint8, uint16, uint32, uint64

float32, float64

utf8 / string, large_utf8

boolean / bool

binary, large_binary

date32, date64

timestamp(s), timestamp(ms), timestamp(us), timestamp(ns)

Optional second argument for a timezone, e.g. timestamp(us, utc).

decimal(p, s), decimal128(p, s), decimal256(p, s)

list<T>, large_list<T>

struct<name:T, name2:T2, …>

Field names may be unquoted or quoted ("name":T / 'name':T).

Return types are coerced implicitly where DataFusion knows how (e.g. Int32 → Int64, Float32 → Float64, Utf8 → LargeUtf8) — a literal like 6371 widening to Float64 is fine.

Tier notes:

  • SQL tier lowers the body to a DataFusion physical expression, so any DataFusion-recognised type round-trips natively.

  • Remote tier serializes arguments and results through Arrow's JSON reader/writer, so any Arrow type with a JSON representation is acceptable. Plan for wire size when passing nested structures or large lists.

  • WebAssembly tier uses Arrow IPC streams as the host/guest ABI; the guest module must understand the declared input and output schemas.

Volatility

Volatility governs caching, constant-folding, and plan-time evaluation. Defaults to volatile — the safest choice.

Value
Meaning
Safe to

immutable

Same inputs always yield the same output (e.g. abs, upper).

Constant-fold, cache forever.

stable

Stable within a single query, may change across queries (e.g. now()).

Cache per query.

volatile

Unpredictable on every call (e.g. random()).

Never cache.

Remote tier auto-caps at stable. A user-declared immutable or stable remote function is lowered to DataFusion's Stable volatility — constant-folding is disabled because the runtime cannot prove the remote service is deterministic or stable across process lifetimes. Queries within a single execution still benefit from common-subexpression elimination.

Federation Deny-List

Every user function is automatically added to Spice's federation deny-list. This prevents the query planner from attempting to push a user function down into a federated data source (e.g. Postgres, MySQL, Databricks) — the remote source doesn't know about it, and a pushdown would either fail or, worse, silently call a different function with the same name.

Built-in Spice UDFs (ai, embed, bucket, identity SQL functions, etc.) are likewise denied. Functions are added on registration and removed on deregistration / hot-reload removal. No manual configuration is required.

Hot-Reload

When the spicepod file changes on disk, Spice diffs the functions: section against the live set:

  • Removed functions are deregistered from DataFusion, removed from the deny-list, removed from USER_FUNCTION_INFO, and dropped from the tool registry.

  • Changed functions are deregistered and re-registered (equality is structural — any field change triggers a re-register).

  • Added functions are built and registered.

Failures to build a new or changed function are logged at ERROR and do not halt the diff — other functions still apply.

Introspection

SQL: list_udfs()

A built-in UDTF returns every function registered in the current session — Spice built-ins, DataFusion standard library, and user-declared functions together.

Column
Type
Description

name

utf8, NOT NULL

UDF identifier.

source

utf8, NOT NULL

"builtin" or "user".

kind

utf8, nullable

"scalar" | "aggregate" | "window" | "table". NULL for built-ins.

volatility

utf8, nullable

"immutable" | "stable" | "volatile". NULL for built-ins.

from

utf8, nullable

Source URI for user functions. NULL for built-ins.

description

utf8, nullable

Free-form description.

HTTP: GET /v1/functions

Returns only the user-declared functions (not built-ins). Intended for operational tooling.

The OpenAPI spec exposes this under operation_id: list_functions, tag Functions.

LLM Tool Exposure

When as_tool: true (the default), every declared scalar function is also registered as an LLM tool. The tool invokes the function through a SELECT fn_name(arg0, …) AS result query over the runtime's DataFusion session, so all execution tiers dispatch transparently. Table functions (kind: table) are always SQL-only and are not surfaced to the LLM.

  • Argument values are interpolated as typed SQL literals by the tool bridge — integers, floats, strings, booleans. No free-form SQL from the model is ever concatenated.

  • The tool's JSON Schema parameters is derived from the function's Arrow signature. Types outside the JSON-encodable primitive set (int64, float64, utf8, boolean) are not eligible for tool exposure — the function still registers for SQL use, but tool registration is skipped with a WARN log.

  • Name collisions with a previously-registered tool (built-in or tools: entry) are skipped with a WARN log. Rename one, or set as_tool: false on the function.

Exposing Tools as SQL UDFs

The inverse of LLM Tool Exposure: set as_sql: true on a tools: entry to register the tool as a SQL scalar UDF. The UDF dispatches to the tool per row, packing each row's arguments into a JSON object that matches the supplied signature.

Tool fields for SQL exposure

Field
Type
Description

as_sql

bool

Defaults to false. Set to true to register a scalar UDF alongside the LLM tool. Requires runtime.functions.enabled: true and a signature: block.

signature

object

Typed signature for SQL invocation. Required when as_sql: true; ignored otherwise.

Behaviour

  • A signature: block is required when as_sql: true — a tool's free-form JSON Schema does not uniquely map to a SQL signature, so argument and return Arrow types must be pinned explicitly.

  • Supported Arrow types are the same primitive set as the Remote tier: int64, float64, utf8, boolean. Other types are rejected at build time with an Unsupported*Type error naming the tool.

  • The derived UDF is always volatile. Constant-folding is disabled and per-row results are not memoized because the underlying tool is non-deterministic RPC.

  • Per-query dispatch is capped at 16 concurrent in-flight calls, with row order preserved on output.

  • Tool-backed UDFs are automatically added to the federation deny-list and are never pushed down to remote data sources.

  • If runtime.functions.enabled is false, or signature: is missing, SQL registration is skipped with a WARN log and the tool remains callable as an LLM tool only.

Distributed Execution

In a Spice.ai Enterprise distributed cluster, UDFs declared on the scheduler are automatically propagated to executors as part of the GetAppDefinition bootstrap RPC — executors load the full Spicepod definition (datasets, catalogs, views, and UDFs) from the scheduler at startup, so query planning and execution use a consistent view of the function set across all nodes.

User functions are deny-listed from federation pushdown on every node, so there's no risk of a stray executor attempting to push a user function into a downstream source.

Limitations

Current (ALPHA) scope:

  • kind: scalar and kind: table are implemented. aggregate and window are reserved kinds — they parse but are rejected at registration so forward-compatible spicepods still load.

  • Supported from: schemes are sql, http://, https://, and wasm. The HTTP and WASM tiers are gated behind the http-functions and wasm-functions cargo features respectively (default-on in Enterprise and Cloud, opt-in for OSS builds). Other schemes (e.g. grpc://, flight://) parse but are rejected.

  • LLM tool exposure (as_tool: true) is limited to scalar functions with primitive Arrow types in their signature (int64, float64, utf8, boolean). Functions with richer signatures remain callable from SQL.

  • Remote tier issues batches in parallel with a default batch_concurrency of 4 (max 64). Retries, circuit-breakers, per-function rate limits, and http-arrow / Flight protocols are not yet implemented.

  • No aggregate/window semantics, UDAF pushdown, or execution-engine delegation beyond what DataFusion's scalar UDF and table function paths provide.

Last updated

Was this helpful?