# S3

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](https://docs.spice.ai/building-blocks/data-connectors/..#object-store-file-formats).

```yaml
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```

## Configuration

### `from`

S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`

Example: `from: s3://my-bucket/path/to/file.parquet`

### `name`

The dataset name. This will be used as the table name within Spice.

Example:

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
```

### `params`

| Parameter Name              | Description                                                                                                                                                                                                                                                |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format`               | Specifies the data format. Required if it cannot be inferred from the object URI. Options: `parquet`, `csv`, `json`. Refer to [Object Store File Formats](https://docs.spice.ai/building-blocks/data-connectors/..#object-store-file-formats) for details. |
| `s3_endpoint`               | S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. `s3_endpoint: https://my.minio.server`                                                                                                                                             |
| `s3_region`                 | S3 bucket region. Default: `us-east-1`.                                                                                                                                                                                                                    |
| `client_timeout`            | Timeout for S3 operations. Default: `30s`.                                                                                                                                                                                                                 |
| `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`                                                                                                                                                           |
| `s3_auth`                   | Authentication type. Options: `public`, `key` and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`.                                                                                              |
| `s3_key`                    | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS)                                                                                                                                                                                                              |
| `s3_secret`                 | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS)                                                                                                                                                                                                          |
| `allow_http`                | Allow insecure HTTP connections to `s3_endpoint`. Defaults to `false`                                                                                                                                                                                      |

For CSV-specific parameters, see [CSV Parameters](https://github.com/spicehq/docs/blob/trunk/reference/file-format.md#csv).

## Authentication

No authentication is required for public endpoints. For private buckets, set s3\_auth to key or iam\_role. For environments with assigned IAM roles, set `s3_auth` to `iam_role`. If using iam\_role, the [AWS IAM role](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html) of the running instance is used.

Minimum IAM policy for S3 access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
```

## Types

Refer to [Object Store Data Types](https://github.com/spicehq/docs/blob/trunk/reference/datatypes/object_store.md) for data type mapping from object store files to arrow data type.

## Examples

### Public bucket Example

Create a dataset named `taxi_trips` from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```

### MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: http://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true
```

### Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

```plaintext
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
```

Spice can automatically infer these partition columns from the directory structure when `hive_partitioning_enabled` is set to `true`.

```yaml
version: v1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

## Limitations

{% hint style="warning" %}
**Performance Considerations**

When using the S3 Data connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

Memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](https://github.com/spicehq/docs/blob/trunk/building-blocks/data-accelerators/duckdb.md) and [`sqlite`](https://github.com/spicehq/docs/blob/trunk/building-blocks/data-accelerators/sqlite.md) accelerators by specifying `mode: file`.

Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.
{% endhint %}
