Databricks

Databricks Data Connector Documentation

Last updated 3 months ago

Use Databricks as a connector for federated SQL query against Databricks, either through Spark Connect or by reading directly from Delta Lake tables.

datasets:
  - from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      databricks_aws_access_key_id: ${secrets:aws_access_key_id}
      databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}

Configuration

from

The from field for the Databricks connector takes the form databricks:catalog.schema.table where catalog.schema.table is the fully-qualified path to the table to read from.

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

params

| Parameter Name | Description |
| --- | --- |
| mode | The execution mode for querying against Databricks. The default is spark_connect. Possible values: spark_connect (use Spark Connect to query against Databricks; requires a Spark cluster to be available) or delta_lake (query directly from Delta Tables; requires the object store credentials to be provided). |
| databricks_endpoint | The endpoint of the Databricks instance. Required for both modes. |
| databricks_cluster_id | The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect. |
| databricks_use_ssl | If true, use a TLS connection to connect to the Databricks endpoint. Default is true. |
| client_timeout | Optional. Applicable only in delta_lake mode. Specifies the timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s |
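
As a minimal sketch of how the parameters above combine, the following dataset uses delta_lake mode and raises the object store timeout. The table, dataset, and secret names are reused from the examples on this page and are placeholders, not required values.

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      databricks_aws_access_key_id: ${secrets:aws_access_key_id}
      databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
      client_timeout: 60s # Only applies in delta_lake mode; the default is 30s
```

databricks_cluster_id is left out here because it is only valid when mode is spark_connect.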

Delta Lake object store parameters

AWS S3

| Parameter Name | Description |
| --- | --- |
| databricks_aws_region | Optional. The AWS region for the S3 object store. E.g. us-west-2. |
| databricks_aws_access_key_id | The access key ID for the S3 object store. |
| databricks_aws_secret_access_key | The secret access key for the S3 object store. |
| databricks_aws_endpoint | Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com. |

Azure Blob

Note

One of the following auth values must be provided for Azure Blob:

  • databricks_azure_storage_account_key,

  • databricks_azure_storage_client_id and databricks_azure_storage_client_secret, or

  • databricks_azure_storage_sas_key.

| Parameter Name | Description |
| --- | --- |
| databricks_azure_storage_account_name | The Azure Storage account name. |
| databricks_azure_storage_account_key | The Azure Storage key for accessing the storage account. |
| databricks_azure_storage_client_id | The Service Principal client ID for accessing the storage account. |
| databricks_azure_storage_client_secret | The Service Principal client secret for accessing the storage account. |
| databricks_azure_storage_sas_key | The shared access signature key for accessing the storage account. |
| databricks_azure_storage_endpoint | Optional. The endpoint for the Azure Blob storage account. |
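
Because only one of the auth options is needed, a configuration can carry a single credential. The sketch below uses only the SAS key and pulls it from a secret. The secret name my_sas_key is a hypothetical placeholder, and the other values are reused from the examples on this page.

```yaml
- from: databricks:spiceai.datasets.my_adls_table
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_azure_storage_account_name: my_account
    # Single auth option: SAS key, referenced via a hypothetical secret name
    databricks_azure_storage_sas_key: ${secrets:my_sas_key}
```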

Google Storage (GCS)

| Parameter Name | Description |
| --- | --- |
| google_service_account | Filesystem path to the Google service account JSON key file. |

Examples

Spark Connect

- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: spark_connect
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_cluster_id: 1234-567890-abcde123
    databricks_token: ${secrets:my_token}

Delta Lake (S3)

- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_aws_region: us-west-2 # Optional
    databricks_aws_access_key_id: ${secrets:aws_access_key_id}
    databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
    databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional

Delta Lake (Azure Blobs)

- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}

    # Account Name + Key
    databricks_azure_storage_account_name: my_account
    databricks_azure_storage_account_key: ${secrets:my_key}

    # OR Service Principal + Secret
    databricks_azure_storage_client_id: my_client_id
    databricks_azure_storage_client_secret: ${secrets:my_secret}

    # OR SAS Key
    databricks_azure_storage_sas_key: my_sas_key

Delta Lake (GCP)

- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_google_service_account_path: /path/to/service-account.json

Types

mode: delta_lake

The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.

| Databricks SQL Type | Arrow Type |
| --- | --- |
| STRING | Utf8 |
| BIGINT | Int64 |
| INT | Int32 |
| SMALLINT | Int16 |
| TINYINT | Int8 |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| BOOLEAN | Boolean |
| BINARY | Binary |
| DATE | Date32 |
| TIMESTAMP | Timestamp(Microsecond, Some("UTC")) |
| TIMESTAMP_NTZ | Timestamp(Microsecond, None) |
| DECIMAL | Decimal128 |
| ARRAY | List |
| STRUCT | Struct |
| MAP | Map |

Limitations

  • Databricks connector (mode: delta_lake) does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the V2Checkpoint feature by executing the following command:

    ALTER TABLE <table-name> DROP FEATURE v2Checkpoint [TRUNCATE HISTORY];

  • The Databricks Connector (mode: spark_connect) does not yet support streaming query results from Spark.

Memory Considerations

When using the Databricks (mode: delta_lake) Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

Use the secret replacement syntax to reference a secret for databricks_token, e.g. ${secrets:my_token}.

Configure the connection to the object store when using mode: delta_lake. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.

For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features.

When using mode: spark_connect, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands (see the Spark Docs).

Memory limitations can be mitigated by storing acceleration data on disk, which the duckdb and sqlite accelerators support by specifying mode: file.
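
A minimal sketch of the disk-backed acceleration mentioned above might look like the following. The acceleration block (enabled, engine, mode) is assumed from Spice's data acceleration configuration and is not defined on this page, so check it against the Data Acceleration documentation before relying on it.

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      databricks_aws_access_key_id: ${secrets:aws_access_key_id}
      databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
    acceleration:
      enabled: true  # Assumed key from the data acceleration configuration
      engine: duckdb # sqlite is the other accelerator named above
      mode: file     # Store accelerated data on disk instead of in memory
```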