S3 Data Connector Documentation

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet

Configuration

from

S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>

Example: from: s3://my-bucket/path/to/file.parquet

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv

params

file_format

Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to Object Store File Formats for details.

s3_endpoint

S3 endpoint URL, used for S3-compatible services such as MinIO (e.g. s3_endpoint: https://my.minio.server). Defaults to the AWS regional endpoint.

s3_region

S3 bucket region. Default: us-east-1.

client_timeout

Timeout for S3 operations. Default: 30s.

hive_partitioning_enabled

Enable Hive-style partitioning, inferring partition columns from the folder structure. Defaults to false.

s3_auth

Authentication type. Options: public, key, and iam_role. Defaults to public if s3_key and s3_secret are not provided; otherwise defaults to key.

s3_key

Access key (e.g. AWS_ACCESS_KEY_ID for AWS)

s3_secret

Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS)

allow_http

Allow insecure HTTP connections to s3_endpoint. Defaults to false.

For additional CSV parameters, see CSV Parameters.

Authentication

No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth to iam_role. If using iam_role, the AWS IAM role of the running instance is used.
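For example, a key-authenticated dataset might be configured as follows (the bucket path and credential values are placeholders; in practice, load credentials from a secrets store rather than hard-coding them):

datasets:
  - from: s3://my-private-bucket/reports/
    name: reports
    params:
      file_format: parquet
      s3_auth: key
      s3_key: <access-key-id>        # placeholder access key
      s3_secret: <secret-access-key> # placeholder secret key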

Minimum IAM policy for S3 access:
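A minimal read-only policy typically grants s3:ListBucket on the bucket and s3:GetObject on its objects. The following is a sketch (replace my-bucket with your bucket name and verify it against your own security requirements):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-bucket"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-bucket/*"]
    }
  ]
}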

Types

Refer to Object Store Data Types for the mapping from object store file types to Arrow data types.

Examples

Public Bucket Example

Create a dataset named taxi_trips from a public S3 folder.
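This uses the same public demo bucket shown at the top of this page; no credentials are required:

datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet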

MinIO Example

Create a dataset named cool_dataset from a Parquet file stored in MinIO.
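A sketch of such a dataset, assuming a self-hosted MinIO server; the endpoint, bucket, path, and credential values below are placeholders:

datasets:
  - from: s3://my-minio-bucket/taxi_sample.parquet
    name: cool_dataset
    params:
      file_format: parquet
      s3_endpoint: http://my.minio.server:9000  # placeholder MinIO endpoint
      s3_auth: key
      s3_key: <minio-access-key>                # placeholder
      s3_secret: <minio-secret-key>             # placeholder
      allow_http: true  # needed because the placeholder endpoint uses plain HTTP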

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:
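The layout below is illustrative (bucket, path, and file names are placeholders):

s3://my-bucket/sales/year=2024/month=03/day=15/data_0.parquet
s3://my-bucket/sales/year=2024/month=03/day=16/data_0.parquet
s3://my-bucket/sales/year=2024/month=04/day=01/data_0.parquet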

Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.
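The corresponding dataset definition only needs hive_partitioning_enabled set to true (the bucket path below is a placeholder matching the layout above):

datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      file_format: parquet
      hive_partitioning_enabled: true

With this enabled, year, month, and day are exposed as ordinary columns, so queries that filter on them can skip folders that don't match the filter.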

Limitations
