Search & Retrieval
Use the advanced search and retrieval capabilities of Spice
Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and Vector-Similarity Search functionality.
Spice supports basic search patterns directly through SQL. For example, a text search within a table can use SQL's LIKE clause.
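As a rough illustration of the LIKE pattern described above, the sketch below runs a substring search from Python. It uses the standard-library sqlite3 module purely as a stand-in SQL engine, not Spice itself; the documents table, its columns, and the like_search helper are hypothetical names chosen for the example. LIKE case sensitivity varies between SQL engines, so the sample data is kept lowercase.

```python
import sqlite3


def like_search(conn, keyword):
    """Return (id, body) rows whose body contains `keyword` as a substring.

    Note: '%' and '_' inside `keyword` act as LIKE wildcards and are not escaped here.
    """
    pattern = f"%{keyword}%"  # '%' matches any run of characters on either side
    cur = conn.execute(
        "SELECT id, body FROM documents WHERE body LIKE ? ORDER BY id",
        (pattern,),
    )
    return cur.fetchall()


# Hypothetical in-memory table, used only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (id, body) VALUES (?, ?)",
    [
        (1, "quarterly sales report"),
        (2, "onboarding guide"),
        (3, "sales playbook for new hires"),
    ],
)

matches = like_search(conn, "sales")
for row_id, body in matches:
    print(row_id, body)
```

The same WHERE ... LIKE clause is what one would expect to issue against a SQL-capable runtime; only the connection setup here is specific to SQLite.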
Spice also provides advanced Vector-Similarity Search capabilities, enabling more nuanced and intelligent searches. The runtime supports both:
Local embedding models.
Remote embedding providers.
See the Spice documentation for all supported providers.
Embedding models are defined in the spicepod.yaml file as top-level components.
Datasets can be augmented with embeddings targeting specific columns, to enable search capabilities through similarity searches.
For example, defining embeddings on a dataset's body column configures Spice to execute similarity searches on that dataset.
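To give a feel for what a similarity search over an embedded column does conceptually, here is a minimal pure-Python sketch. It assumes cosine similarity as the ranking metric, which the text above does not specify, and the cosine_similarity and top_k helpers and the toy two-dimensional vectors are hypothetical. It is not how Spice implements search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=3):
    """Rank (row_id, embedding) pairs by similarity to query_vec, best first."""
    scored = [(row_id, cosine_similarity(query_vec, emb)) for row_id, emb in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for the vectors computed from a text column.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
results = top_k([1.0, 0.0], rows, k=2)
print(results)
```

In practice the query text would first be embedded with the same model as the column, so that both vectors live in the same space.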
With chunking configured, the body column is divided into chunks of approximately 512 tokens while maintaining structural and semantic integrity (e.g. not splitting sentences).
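The idea of chunking to a target size without splitting sentences can be sketched as below. This is a simplified illustration, not Spice's chunker: it assumes whitespace-separated words as a stand-in for model tokens and a naive regular expression for sentence boundaries, and chunk_text is a hypothetical helper name.

```python
import re


def chunk_text(text, target_tokens=512):
    """Greedily pack whole sentences into chunks of at most target_tokens words.

    A single sentence longer than target_tokens is kept whole rather than split.
    """
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # word count as a rough token estimate
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight."
print(chunk_text(sample, target_tokens=6))
```

Because sentences are never split, chunk sizes only approximate the target, which matches the "approximately 512 tokens" wording above.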
When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To retrieve the full content of a column, include the embedding column in the additional_columns list.
For example:
Response:
Datasets that already include embeddings can utilize the same functionalities (e.g., vector search) as those augmented with embeddings using Spice. To ensure compatibility, these table columns must adhere to the following constraints:
Underlying Column Presence:
Embeddings Column Naming Convention:
For each underlying column, the corresponding embeddings column must be named <column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
Embeddings Column Data Type:
FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
Offset Column for Chunked Data:
If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the following Arrow data type:
List[FixedSizeList[Int32, 2]], where each element is a pair of integers [start, end] representing the start and end indices of the chunk in the underlying text column. This offset column maps each chunk in the embeddings back to the corresponding segment in the underlying text column.
For instance, [[0, 100], [101, 200]] indicates two chunks covering indices 0–100 and 101–200, respectively.
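The mapping from offset pairs back to text segments can be sketched as follows. The text above does not state whether the end index is inclusive, so the hypothetical segments_from_offsets helper takes that as a parameter. It defaults to inclusive because the example ranges 0–100 and 101–200 leave no gap under that reading, but this is an assumption.

```python
def segments_from_offsets(text, offsets, inclusive_end=True):
    """Recover the text segment for each [start, end] offset pair.

    inclusive_end=True treats `end` as the index of the last character in the
    chunk; inclusive_end=False treats it as one past the last character.
    """
    segments = []
    for start, end in offsets:
        stop = end + 1 if inclusive_end else end
        segments.append(text[start:stop])
    return segments


text = "hello world"
print(segments_from_offsets(text, [[0, 4], [6, 10]]))
print(segments_from_offsets(text, [[0, 5], [6, 11]], inclusive_end=False))
```

Whichever convention the data actually uses, each row's offsets list should have one pair per chunk embedding in that row.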
By following these guidelines, you can ensure that your dataset with pre-existing embeddings is fully compatible with the vector search and other embedding functionalities provided by Spice.
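As a quick sanity check on the naming rules above, the sketch below reports which required columns are missing for a given underlying column. It checks names only, not Arrow data types, and missing_embedding_columns is a hypothetical helper, not part of Spice.

```python
def missing_embedding_columns(columns, underlying, chunked=False):
    """Return the required column names that are absent from `columns`.

    Required: the underlying column, <underlying>_embedding, and, when the
    column is chunked, <underlying>_offsets.
    """
    required = [underlying, f"{underlying}_embedding"]
    if chunked:
        required.append(f"{underlying}_offsets")
    present = set(columns)
    return [name for name in required if name not in present]


# customer_reviews example from the text: a review column and its embedding.
print(missing_embedding_columns(["id", "review", "review_embedding"], "review"))
print(missing_embedding_columns(["id", "review", "review_embedding"], "review", chunked=True))
```

An empty list means the naming requirements are met; the data types still need to be verified separately.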
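For example, a table sales with an address column would carry a corresponding address_embedding column.
If the same column were chunked, address_embedding would hold a list of vectors and the table would also need an address_offsets column.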
For more details, see the Spice documentation.
Spice also supports vector search on datasets with preexisting embeddings. See the column constraints above for compatibility details.
Spice supports chunking of content before embedding, which is useful for large text columns. Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.
The underlying column must exist in the table and be of the string Arrow data type.
The embeddings column must have the following Arrow data type when loaded into Spice: FixedSizeList[Float32 or Float64, N]. If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
Example