
S3 Data Lake

caution

This connector is in early access and still evolving. Future updates may introduce breaking changes.

We're interested in hearing about your experience! See GitHub for more information on joining the beta.

This page explains how to set up the S3 Data Lake destination connector.

This connector writes data in the Iceberg table format to S3 or an S3-compatible storage backend. It currently supports the REST, AWS Glue, and Nessie catalogs.

Setup Guide

S3 Data Lake requires configuring two components: S3 storage and your Iceberg catalog.

S3 Setup

The connector needs the following permissions to write Iceberg-format files to S3:

  • s3:ListAllMyBuckets
  • s3:GetObject*
  • s3:PutObject
  • s3:PutObjectAcl
  • s3:DeleteObject
  • s3:ListBucket*
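
The permissions above could be collected into an IAM policy document along these lines. This is a minimal sketch, not an official policy: the function name `build_s3_policy` and the bucket name `example-bucket` are hypothetical, and scoping the statements to a single bucket is an assumption rather than something this page prescribes. `s3:ListAllMyBuckets` is placed in its own statement because it is not scoped to an individual bucket.

```python
import json

# Actions listed in the S3 Setup section above.
S3_ACTIONS = [
    "s3:ListAllMyBuckets",
    "s3:GetObject*",
    "s3:PutObject",
    "s3:PutObjectAcl",
    "s3:DeleteObject",
    "s3:ListBucket*",
]


def build_s3_policy(bucket: str) -> dict:
    """Build an IAM policy dict granting the listed S3 actions (hypothetical helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    bucket_scoped = [a for a in S3_ACTIONS if a != "s3:ListAllMyBuckets"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing all buckets is an account-level action, so it uses "*".
            {"Effect": "Allow", "Action": ["s3:ListAllMyBuckets"], "Resource": "*"},
            # Assumption: everything else is scoped to one bucket and its objects.
            {
                "Effect": "Allow",
                "Action": bucket_scoped,
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder; substitute your own bucket name.
    print(json.dumps(build_s3_policy("example-bucket"), indent=2))
```

The printed JSON can be reviewed and adapted before attaching it to the role or user the connector runs as.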

Iceberg Catalog Setup

Different catalogs have different setup requirements.

AWS Glue

In addition to the S3 permissions, you should also grant these Glue permissions:

  • glue:TagResource
  • glue:UnTagResource
  • glue:BatchCreatePartition
  • glue:BatchDeletePartition
  • glue:BatchDeleteTable
  • glue:BatchGetPartition
  • glue:CreateDatabase
  • glue:CreateTable
  • glue:CreatePartition
  • glue:DeletePartition
  • glue:DeleteTable
  • glue:GetDatabase
  • glue:GetDatabases
  • glue:GetPartition
  • glue:GetPartitions
  • glue:GetTable
  • glue:GetTables
  • glue:UpdateDatabase
  • glue:UpdatePartition
  • glue:UpdateTable
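
Before running a sync, it can help to check whether an existing policy already covers the Glue actions listed above. The sketch below is an illustration only: `missing_actions` is a hypothetical helper, and it approximates IAM wildcard matching with Python's `fnmatch`, compared case-insensitively. It does not evaluate Deny statements, conditions, or resource scoping.

```python
from fnmatch import fnmatchcase

# Glue actions listed in the AWS Glue section above.
REQUIRED_GLUE_ACTIONS = [
    "glue:TagResource", "glue:UnTagResource",
    "glue:BatchCreatePartition", "glue:BatchDeletePartition",
    "glue:BatchDeleteTable", "glue:BatchGetPartition",
    "glue:CreateDatabase", "glue:CreateTable", "glue:CreatePartition",
    "glue:DeletePartition", "glue:DeleteTable",
    "glue:GetDatabase", "glue:GetDatabases",
    "glue:GetPartition", "glue:GetPartitions",
    "glue:GetTable", "glue:GetTables",
    "glue:UpdateDatabase", "glue:UpdatePartition", "glue:UpdateTable",
]


def missing_actions(granted_patterns, required=REQUIRED_GLUE_ACTIONS):
    """Return the required actions not matched by any granted pattern.

    Patterns may use "*" and "?" wildcards; matching ignores case.
    """
    granted = [p.lower() for p in granted_patterns]
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in granted)
    ]


if __name__ == "__main__":
    # A policy that only allows read and batch actions leaves gaps.
    for action in missing_actions(["glue:Get*", "glue:Batch*"]):
        print("missing:", action)
```

An empty result means every listed action is matched by at least one granted pattern.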

Set the "warehouse location" option to s3://<bucket name>/path/within/bucket.

The "Role ARN" option is only available in Airbyte Cloud.

REST catalog

You will need the URI of your REST catalog.

Nessie

You will need the URI of your Nessie catalog, and an access token to authenticate to that catalog.

Set the "warehouse location" option to s3://<bucket name>/path/within/bucket.
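
The warehouse location format used for Glue and Nessie above can be assembled from a bucket name and a path. This is a small sketch with a hypothetical helper name, `warehouse_location`; it normalizes stray slashes and does not check that the bucket exists.

```python
def warehouse_location(bucket: str, path: str) -> str:
    """Build an s3://<bucket name>/path/within/bucket string (hypothetical helper)."""
    bucket = bucket.strip("/")
    path = path.strip("/")
    if not bucket or not path:
        raise ValueError("both a bucket name and a path within the bucket are expected")
    return f"s3://{bucket}/{path}"


if __name__ == "__main__":
    # "example-bucket" is a placeholder bucket name.
    print(warehouse_location("example-bucket", "/path/within/bucket/"))
```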

Iceberg schema generation

The top-level fields of the stream will be mapped to Iceberg fields. Nested fields (objects, arrays, and unions) will be mapped to STRING columns, and written as serialized JSON. This is the full mapping between Airbyte types and Iceberg types:

| Airbyte type | Iceberg type |
|---|---|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone | Time |
| Time without timezone | Time |
| Timestamp with timezone | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |

Note that for the time/timestamp with timezone types, the value is first adjusted to UTC, and then written into the Iceberg file.
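
The mapping and the UTC adjustment described above can be illustrated as follows. This is a sketch of the documented behavior, not the connector's code: the lowercase type keys, the `AIRBYTE_TO_ICEBERG` table, and the `to_iceberg_value` function are all hypothetical names, and the exact serialization the connector produces may differ in detail.

```python
import json
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical keys standing in for the Airbyte types in the table above.
AIRBYTE_TO_ICEBERG = {
    "boolean": "boolean",
    "date": "date",
    "integer": "long",
    "number": "double",
    "string": "string",
    "time_with_timezone": "time",
    "time_without_timezone": "time",
    "timestamp_with_timezone": "timestamp with timezone",
    "timestamp_without_timezone": "timestamp without timezone",
    "object": "string",
    "array": "string",
    "union": "string",
}

NESTED_TYPES = {"object", "array", "union"}


def to_iceberg_value(airbyte_type: str, value):
    """Convert one top-level field value the way the section above describes."""
    if airbyte_type not in AIRBYTE_TO_ICEBERG:
        raise KeyError(f"unknown Airbyte type: {airbyte_type}")
    if airbyte_type in NESTED_TYPES:
        # Nested values become a single STRING column holding serialized JSON.
        return json.dumps(value)
    if airbyte_type == "timestamp_with_timezone":
        # Adjust to UTC before writing.
        return value.astimezone(timezone.utc)
    if airbyte_type == "time_with_timezone":
        # Attach an arbitrary date so the offset can be applied, then drop it.
        anchored = datetime.combine(date(1970, 1, 1), value)
        return anchored.astimezone(timezone.utc).time()
    return value


if __name__ == "__main__":
    plus_two = timezone(timedelta(hours=2))
    print(to_iceberg_value("object", {"city": "Paris", "tags": ["a", "b"]}))
    print(to_iceberg_value("timestamp_with_timezone",
                           datetime(2025, 2, 5, 12, 0, tzinfo=plus_two)))
    print(to_iceberg_value("time_with_timezone", time(12, 0, tzinfo=plus_two)))
```

Because nested values are stored as JSON strings, downstream queries need to parse that column to reach inner fields.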

Reference

Config fields reference

| Type | Property name |
|---|---|
| string | s3_bucket_name |
| string | s3_bucket_region |
| string | warehouse_location |
| string | main_branch_name |
| object | catalog_type |
| string | access_key_id |
| string | secret_access_key |
| string | s3_endpoint |
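
As a quick sanity check, a configuration can be compared against the property names and types listed above. The sketch below is illustrative: `config_problems` is a hypothetical helper, every value in the example is a placeholder, and the contents of `catalog_type` are left empty because its sub-fields are not listed on this page. Which properties are required is also not stated here, so the check only covers names and types.

```python
# Property names and types from the config fields reference above.
CONFIG_FIELD_TYPES = {
    "s3_bucket_name": str,
    "s3_bucket_region": str,
    "warehouse_location": str,
    "main_branch_name": str,
    "catalog_type": dict,  # "object" in the reference table
    "access_key_id": str,
    "secret_access_key": str,
    "s3_endpoint": str,
}


def config_problems(config: dict) -> list:
    """Return human-readable problems: unknown properties or wrong types."""
    problems = []
    for name, value in config.items():
        expected = CONFIG_FIELD_TYPES.get(name)
        if expected is None:
            problems.append(f"unknown property: {name}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    # All values below are placeholders, not recommended settings.
    example = {
        "s3_bucket_name": "example-bucket",
        "s3_bucket_region": "us-east-1",
        "warehouse_location": "s3://example-bucket/path/within/bucket",
        "catalog_type": {},  # sub-fields depend on the chosen catalog
    }
    print(config_problems(example) or "no problems found")
```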

Changelog

| Version | Date | Pull Request | Subject |
|---|---|---|---|
| 0.3.9 | 2025-02-10 | #53165 | Very basic usability improvements and documentation |
| 0.3.8 | 2025-02-10 | #52666 | Change the chunk size to 1.5Gb |
| 0.3.7 | 2025-02-07 | #53141 | Adding integration tests around the Rest catalog |
| 0.3.6 | 2025-02-06 | #53172 | Internal refactor |
| 0.3.5 | 2025-02-06 | #53164 | Improve error message on null primary key in dedup mode |
| 0.3.4 | 2025-02-05 | #53173 | Tweak spec wording |
| 0.3.3 | 2025-02-05 | #53176 | Fix time_with_timezone handling (values are now adjusted to UTC) |
| 0.3.2 | 2025-02-04 | #52690 | Handle special characters in stream name/namespace when using AWS Glue |
| 0.3.1 | 2025-02-03 | #52633 | Fix dedup |
| 0.3.0 | 2025-01-31 | #52639 | Make the database/namespace a required field |
| 0.2.23 | 2025-01-27 | #51600 | Internal refactor |
| 0.2.22 | 2025-01-22 | #52081 | Implement support for REST catalog |
| 0.2.21 | 2025-01-27 | #52564 | Fix crash on stream with 0 records |
| 0.2.20 | 2025-01-23 | #52068 | Add support for default namespace (/database name) |
| 0.2.19 | 2025-01-16 | #51595 | Clarifications in connector config options |
| 0.2.18 | 2025-01-15 | #51042 | Write structs as JSON strings instead of Iceberg structs. |
| 0.2.17 | 2025-01-14 | #51542 | New identifier fields should be marked as required. |
| 0.2.16 | 2025-01-14 | #51538 | Update identifier fields if incoming fields are different than existing ones |
| 0.2.15 | 2025-01-14 | #51530 | Set AWS region for S3 bucket for nessie catalog |
| 0.2.14 | 2025-01-14 | #50413 | Update existing table schema based on the incoming schema |
| 0.2.13 | 2025-01-14 | #50412 | Implement logic to determine super types between iceberg types |
| 0.2.12 | 2025-01-10 | #50876 | Add support for AWS instance profile auth |
| 0.2.11 | 2025-01-10 | #50971 | Internal refactor in AWS auth flow |
| 0.2.10 | 2025-01-09 | #50400 | Add S3DataLakeTypesComparator |
| 0.2.9 | 2025-01-09 | #51022 | Rename all classes and files from Iceberg V2 |
| 0.2.8 | 2025-01-09 | #51012 | Rename/Cleanup package from Iceberg V2 |
| 0.2.7 | 2025-01-09 | #50957 | Add support for GLUE RBAC (Assume role) |
| 0.2.6 | 2025-01-08 | #50991 | Initial public release. |