Databricks
Databricks Data Connector Documentation
Databricks as a connector for federated SQL queries against Databricks using Spark Connect, directly from Delta Lake tables, or using the SQL Statement Execution API.
Configuration
from
The from field for the Databricks connector takes the form databricks:catalog.schema.table, where catalog.schema.table is the fully-qualified path to the table to read from.
name
The dataset name. This will be used as the table name within Spice.
Example:
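A minimal sketch of a dataset entry, assuming an illustrative Unity Catalog table and placeholder endpoint and secret names:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table # catalog.schema.table in Unity Catalog
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
```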
params
Use the secret replacement syntax to reference a secret, e.g. ${secrets:my_token}.
mode
The execution mode for querying against Databricks. The default is spark_connect. Possible values:
spark_connect: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
delta_lake: Query directly from Delta Tables. Requires the object store credentials to be provided.
sql_warehouse: Query using the Databricks SQL Statement Execution API. Requires a SQL Warehouse ID (databricks_sql_warehouse_id).
databricks_endpoint
The endpoint of the Databricks instance. Required for all modes.
databricks_sql_warehouse_id
The ID of the SQL Warehouse in Databricks to use for the query. Only valid when mode is sql_warehouse.
databricks_cluster_id
The ID of the compute cluster in Databricks to use for the query. Only valid when mode is spark_connect.
databricks_use_ssl
If true, use a TLS connection to connect to the Databricks endpoint. Default is true.
client_timeout
Optional. Applicable only in delta_lake mode. Specifies the timeout for object store operations. Default value is 30s. E.g. client_timeout: 60s
databricks_token
The Databricks API token to authenticate with the Unity Catalog API. Can't be used with databricks_client_id and databricks_client_secret.
databricks_client_id
The Databricks Service Principal Client ID. Can't be used with databricks_token.
databricks_client_secret
The Databricks Service Principal Client Secret. Can't be used with databricks_token.
Authentication
Personal access token
To learn more about how to set up personal access tokens, see Databricks PAT docs.
Databricks service principal
Spice supports the M2M (Machine to Machine) OAuth flow with service principal credentials by utilizing the databricks_client_id and databricks_client_secret parameters. The runtime will automatically refresh the token.
Ensure that you grant your service principal the "Data Reader" privilege preset for the catalog and "Can Attach" cluster permissions when using Spark Connect mode.
To learn more about how to set up the service principal, see Databricks M2M OAuth docs.
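A minimal sketch of passing the service principal credentials through the secret replacement syntax; the secret names are placeholders:

```yaml
params:
  databricks_client_id: ${secrets:databricks_client_id}         # service principal client ID
  databricks_client_secret: ${secrets:databricks_client_secret} # service principal client secret
```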
Delta Lake object store parameters
Configure the connection to the object store when using mode: delta_lake. Use the secret replacement syntax to reference a secret, e.g. ${secrets:aws_access_key_id}.
AWS S3
databricks_aws_region
Optional. The AWS region for the S3 object store. E.g. us-west-2.
databricks_aws_access_key_id
The access key ID for the S3 object store.
databricks_aws_secret_access_key
The secret access key for the S3 object store.
databricks_aws_endpoint
Optional. The endpoint for the S3 object store. E.g. s3.us-west-2.amazonaws.com.
databricks_aws_allow_http
Optional. Enables insecure HTTP connections to databricks_aws_endpoint. Defaults to false.
Azure Blob
databricks_azure_storage_account_name
The Azure Storage account name.
databricks_azure_storage_account_key
The Azure Storage key for accessing the storage account.
databricks_azure_storage_client_id
The Service Principal client ID for accessing the storage account.
databricks_azure_storage_client_secret
The Service Principal client secret for accessing the storage account.
databricks_azure_storage_sas_key
The shared access signature key for accessing the storage account.
databricks_azure_storage_endpoint
Optional. The endpoint for the Azure Blob storage account.
Google Storage (GCS)
google_service_account
Filesystem path to the Google service account JSON key file.
Examples
Spark Connect
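A sketch of a Spark Connect configuration; the table, endpoint, cluster ID, and secret name are placeholders:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_spark_connect_table
    params:
      mode: spark_connect
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_cluster_id: 1234-567890-abcde123
      databricks_token: ${secrets:my_token}
```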
SQL Warehouse
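A sketch of a SQL Warehouse configuration; the warehouse ID and other values are placeholders:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_sql_warehouse_table
    params:
      mode: sql_warehouse
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_sql_warehouse_id: 1a23b4567c8d9e0f
      databricks_token: ${secrets:my_token}
```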
Delta Lake (S3)
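A sketch of a delta_lake configuration backed by S3; the region, endpoint, and secret names are placeholders:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      databricks_aws_region: us-west-2
      databricks_aws_endpoint: s3.us-west-2.amazonaws.com
      databricks_aws_access_key_id: ${secrets:aws_access_key_id}
      databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
```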
Delta Lake (Azure Blobs)
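A sketch of a delta_lake configuration backed by Azure Blob storage, shown here with an account key; the account name and secret name are placeholders (the client ID/secret or SAS key parameters can be used instead):

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      databricks_azure_storage_account_name: myaccount
      databricks_azure_storage_account_key: ${secrets:azure_storage_account_key}
```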
Delta Lake (GCP)
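A sketch of a delta_lake configuration backed by Google Cloud Storage; the key file path is a placeholder:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
      google_service_account: /path/to/service-account.json
```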
Types
mode: delta_lake
The table below shows the Databricks (mode: delta_lake) data types supported, along with the type mapping to Apache Arrow types in Spice.
| Databricks (delta_lake mode) type | Arrow type |
| --- | --- |
| STRING | Utf8 |
| BIGINT | Int64 |
| INT | Int32 |
| SMALLINT | Int16 |
| TINYINT | Int8 |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| BOOLEAN | Boolean |
| BINARY | Binary |
| DATE | Date32 |
| TIMESTAMP | Timestamp(Microsecond, Some("UTC")) |
| TIMESTAMP_NTZ | Timestamp(Microsecond, None) |
| DECIMAL | Decimal128 |
| ARRAY | List |
| STRUCT | Struct |
| MAP | Map |
Limitations
The Databricks connector (mode: delta_lake) does not support reading Delta tables with the V2Checkpoint feature enabled. To use the Databricks connector (mode: delta_lake) with such tables, drop the V2Checkpoint feature by executing the command shown below. For more details on dropping Delta table features, refer to the official documentation: Drop Delta table features
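A sketch of the drop command, assuming the Databricks SQL ALTER TABLE ... DROP FEATURE syntax; the fully-qualified table name is a placeholder:

```sql
-- Run in Databricks SQL against the affected table (placeholder name shown)
ALTER TABLE my_catalog.my_schema.my_table DROP FEATURE v2Checkpoint;
```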
When using mode: spark_connect, correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commands (Spark Docs).
Memory Considerations
When using the Databricks Data Connector (mode: delta_lake) without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
Memory limitations can be mitigated by storing acceleration data on disk, which is supported by the duckdb and sqlite accelerators by specifying mode: file, as in the sketch below.
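A minimal sketch of a file-backed acceleration, assuming the duckdb accelerator; the dataset values are placeholders:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
      databricks_token: ${secrets:my_token}
    acceleration:
      enabled: true
      engine: duckdb  # or sqlite
      mode: file      # persist acceleration data to disk instead of memory
```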
The Databricks Connector (mode: spark_connect) does not yet support streaming query results from Spark.