S3 Data Connector

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet

Configuration

from

S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>

Example: from: s3://my-bucket/path/to/file.parquet

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv

The dataset can then be queried by its name:

SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

params

| Parameter Name | Description |
| --- | --- |
| file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to Object Store File Formats for details. |
| s3_endpoint | S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server |
| s3_region | S3 bucket region. Default: us-east-1. |
| client_timeout | Timeout for S3 operations. Default: 30s. |
| hive_partitioning_enabled | Enable hive-style partitioning inferred from the folder structure. Default: false. |
| s3_auth | Authentication type. Options: public, key, and iam_role. Defaults to public if s3_key and s3_secret are not provided; otherwise defaults to key. |
| s3_key | Access key (e.g. AWS_ACCESS_KEY_ID for AWS). |
| s3_secret | Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS). |
| allow_http | Allow insecure HTTP connections to s3_endpoint. Default: false. |

For additional CSV parameters, see CSV Parameters
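
As a sketch of how these parameters combine, the following dataset overrides the region and client timeout. The bucket and path are placeholders, and the 60s value assumes the same duration format as the 30s default.

datasets:
  - from: s3://my-bucket/data/
    name: my_dataset
    params:
      file_format: csv
      s3_region: us-west-2    # override the default us-east-1
      client_timeout: 60s     # extend the default 30s timeout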

Authentication

No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role: with key, provide credentials via the s3_key and s3_secret parameters; with iam_role, the AWS IAM role of the running instance is used, which also covers Kubernetes Service Accounts with assigned IAM roles.
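
For example, a private bucket accessed with explicit credentials might be configured as below. The bucket name, path, and credential values are placeholders; avoid committing real keys to version control.

datasets:
  - from: s3://my-private-bucket/data/
    name: private_data
    params:
      file_format: parquet
      s3_auth: key
      s3_key: <AWS_ACCESS_KEY_ID>        # placeholder: your access key
      s3_secret: <AWS_SECRET_ACCESS_KEY> # placeholder: your secret key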

Minimum IAM policy for S3 access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
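
With a policy like this attached to the instance's role, a dataset can then authenticate without explicit keys. The path and dataset name below are placeholders:

datasets:
  - from: s3://company-bucketname-datasets/data/
    name: company_datasets
    params:
      file_format: parquet
      s3_auth: iam_role   # use the running instance's IAM role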

Types

Refer to Object Store Data Types for the mapping from object store file types to Arrow data types.

Examples

Public Bucket Example

Create a dataset named taxi_trips from a public S3 folder.

- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet

MinIO Example

Create a dataset named cool_dataset from a Parquet file stored in MinIO.

- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: http://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet

Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.

version: v1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
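
Once loaded, the inferred partition columns (year, month, and day in the layout above) can be filtered like ordinary columns, letting the engine skip non-matching partitions. A minimal sketch, assuming the inferred partition values are read as strings:

SELECT COUNT(*)
FROM hive_data_infer
WHERE year = '2024' AND month = '03';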
