Iceberg Catalog
Doris supports accessing Iceberg table data through various metadata services. In addition to reading data, Doris also supports writing to Iceberg tables.
For a quick start, see Quick start with Apache Doris and Apache Iceberg.
Iceberg tables whose metadata is stored in Hive Metastore can also be accessed through the Hive Catalog, but it is recommended to use the Iceberg Catalog directly to avoid compatibility issues.
Applicable Scenarios
| Scenario | Description |
|---|---|
| Query Acceleration | Use Doris's distributed computing engine to directly access Iceberg data for faster queries. |
| Data Integration | Read Iceberg data and write it to internal Doris tables, or perform ZeroETL operations using the Doris computing engine. |
| Data Write-back | Process data from any Doris-supported data source and write it back to Iceberg table storage. |
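For example, a data integration job that reads Iceberg data into an internal Doris table can be a plain INSERT ... SELECT. The following is a minimal sketch; the catalog name iceberg, the source table iceberg_db.iceberg_tbl, and the target table db1.doris_tbl are placeholders:
-- Read from an Iceberg table and append into an existing internal Doris table
INSERT INTO internal.db1.doris_tbl
SELECT * FROM iceberg.iceberg_db.iceberg_tbl;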
Configuring Catalog
Syntax
CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = '<iceberg_catalog_type>',
'warehouse' = '<warehouse>', -- optional
{MetaStoreProperties},
{StorageProperties},
{IcebergProperties},
{CommonProperties}
);
- `<iceberg_catalog_type>`: The type of Iceberg Catalog, supporting the following options:
  - `hms`: Uses Hive Metastore as the metadata service.
  - `rest`: A metadata service compatible with the Iceberg Rest Catalog interface.
  - `hadoop`: Directly accesses metadata stored on the file system.
  - `glue`: Uses AWS Glue as the metadata service.
  - `dlf`: Uses Alibaba Cloud DLF as the metadata service.
  - `s3tables`: Uses the AWS S3 Tables Catalog to access S3 Table Buckets.
- `<warehouse>`: The warehouse path for Iceberg. This parameter must be specified when `<iceberg_catalog_type>` is `hadoop`. The `warehouse` path must point to the level above the `Database` path. For example, if your table path is `s3://bucket/path/to/db1/table1`, then the `warehouse` should be `s3://bucket/path/to/`.
- `{MetaStoreProperties}`: Connection and authentication information for the Metastore metadata service. Refer to the section on [Supported Metadata Services].
- `{StorageProperties}`: Connection and authentication information for the storage system. Refer to the section on [Supported Storage Systems].
- `{IcebergProperties}`: Parameters specific to the Iceberg Catalog.
  - `list-all-tables`: Only applies to Iceberg Catalogs that use Hive Metastore as the metadata service. Default is `true`: the `SHOW TABLES` operation lists all types of tables in the current Database (Hive Metastore may store non-Iceberg tables), which gives the best performance. If set to `false`, Doris checks the type of each table one by one and returns only Iceberg tables; this mode performs poorly when there are many tables. See the example after this list.
- `{CommonProperties}`: General properties. See the Catalog Overview for details on common properties.
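For reference, here is a minimal sketch of an HMS-based Iceberg Catalog that sets `list-all-tables` to `false`; the catalog name, Metastore address, and warehouse path are placeholders:
CREATE CATALOG iceberg_hms_example PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
    'warehouse' = 'hdfs://127.0.0.1:8020/user/iceberg/warehouse',
    -- SHOW TABLES checks each table's type and returns only Iceberg tables
    'list-all-tables' = 'false'
);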
Supported Iceberg Versions
Doris currently depends on Iceberg version 1.6.1 and is compatible with higher Iceberg versions.
Supported Iceberg Formats
- Supports Iceberg V1/V2 formats.
- Supports Position Delete and Equality Delete.
Supported Metadata Services
Note: The service types and parameters supported by different Doris versions are slightly different. Please refer to the [Examples] section.
Supported Storage Systems
Note: The service types and parameters supported by different Doris versions are slightly different. Please refer to the [Examples] section.
Supported Data Formats
Column Type Mapping
| Iceberg Type | Doris Type | Comment |
|---|---|---|
| boolean | boolean | |
| integer | int | |
| long | bigint | |
| float | float | |
| double | double | |
| decimal(P, S) | decimal(P, S) | |
| date | date | |
| timestamp (Timestamp without timezone) | datetime(6) | Mapped to datetime with precision 6 |
| timestamptz (Timestamp with timezone) | datetime(6) | Mapped to datetime with precision 6 |
| fixed(N) | char(N) | |
| string | string | |
| binary | string | |
| uuid | string | |
| struct | struct (supported from version 2.1.3) | |
| map | map (supported from version 2.1.3) | |
| list | array | |
| other | UNSUPPORTED | |
Note:
Doris currently does not support `Timestamp` types with a time zone. Both `timestamp` and `timestamptz` are uniformly mapped to the `datetime(N)` type. However, during reading and writing, Doris correctly handles time zones based on the actual source type. For example, after specifying a time zone with `SET time_zone = <tz>`, the reading and writing results of `timestamptz` columns are affected accordingly. You can check whether the source type carries time zone information in the Extra column of the `DESCRIBE table_name` output: if it shows `WITH_TIMEZONE`, the source type is a time-zone-aware type. (Supported since 3.1.0.)
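As an illustration, here is a sketch assuming a hypothetical table tz_tbl with a timestamptz column ts:
-- Check the Extra column for WITH_TIMEZONE to see whether ts is a time-zone-aware type (3.1.0+)
DESCRIBE tz_tbl;
-- The session time zone affects the results read from and written to the timestamptz column
SET time_zone = 'Asia/Shanghai';
SELECT ts FROM tz_tbl LIMIT 10;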
Namespace Mapping
Iceberg's metadata hierarchy is Catalog -> Namespace -> Table. Namespace can have multiple levels (Nested Namespace).
┌─────────┐
│ Catalog │
└────┬────┘
│
┌─────┴─────┐
┌──▼──┐ ┌──▼──┐
│ NS1 │ │ NS2 │
└──┬──┘ └──┬──┘
│ │
┌────▼───┐ ┌──▼──┐
│ Table1 │ │ NS3 │
└────────┘ └──┬──┘
│
┌──────┴───────┐
┌────▼───┐ ┌────▼───┐
│ Table2 │ │ Table3 │
└────────┘ └────────┘
Starting from version 3.1.2, for Iceberg Rest Catalog, Doris supports mapping of Nested Namespace.
In the above example, tables will be mapped to Doris metadata according to the following logic:
| Catalog | Database | Table |
|---|---|---|
| Catalog | NS1 | Table1 |
| Catalog | NS2.NS3 | Table2 |
| Catalog | NS2.NS3 | Table3 |
Support for Nested Namespace needs to be explicitly enabled. For details, please refer to Iceberg Rest Catalog.
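As a sketch of how the mapping is used (assuming a Rest Catalog named iceberg_rest with Nested Namespace enabled), the nested namespace NS2.NS3 is addressed as a single database name, quoted with backticks because it contains a dot:
SELECT * FROM iceberg_rest.`NS2.NS3`.Table2 LIMIT 10;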
Examples
Hive Metastore
3.1+ Version
- HDFS
- S3
- OSS
- COS
- OBS
- GCS
- Azure Blob
- MinIO
Access HMS and HDFS services without Kerberos authentication
CREATE CATALOG iceberg_hms_on_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'hdfs://127.0.0.1:8320/user/iceberg/warehouse/iceberg-hms-hdfs-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'fs.defaultFS' = 'hdfs://127.0.0.1:8320',
'hadoop.username' = 'doris'
);
Access HMS and HDFS services with Kerberos authentication enabled
CREATE CATALOG iceberg_hms_on_hdfs_kerberos_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'hdfs://127.0.0.1:8520/iceberg-hms-hdfs-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal'='hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab'
);
Access HMS without Kerberos authentication
CREATE CATALOG iceberg_hms_on_s3_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 's3://selectdb-qa-datalake-test-hk/refactor-test/iceberg-hms-s3-warehouse',
's3.region' = 'ap-east-1',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Obtaining S3 access credentials using an IAM Assumed Role (3.1.2+)
CREATE CATALOG iceberg_hms_on_s3_iamrole PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 's3://bucket/warehouse',
's3.region' = 'us-east-1',
's3.role_arn' = 'arn:aws:iam::543815668950:role/role'
);
Access HMS without Kerberos authentication
CREATE CATALOG iceberg_hms_on_oss_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'oss://bucket/iceberg/iceberg-hms-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG iceberg_hms_on_oss_kerberos_new_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'oss://bucket/iceberg/iceberg-hms-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.client.principal' = 'presto-server/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/presto-server.keytab',
'hive.metastore.authentication.type' = 'kerberos',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
Access HMS without Kerberos authentication
CREATE CATALOG `iceberg_hms_on_cos_catalog` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'cos://bucket/iceberg-hms-cos-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.secret_key' = '<sk>',
'cos.access_key' = '<ak>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG `iceberg_hms_on_cos_kerberos_new_catalog` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'cos://bucket/iceberg-hms-cos-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hadoop.security.auth_to_local' = 'RULE:[2:$1@$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:$1@$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:$1@$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.secret_key' = '<sk>',
'cos.access_key' = '<ak>'
);
Access HMS without Kerberos authentication
CREATE CATALOG test_iceberg_on_hms_obs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'obs://bucket/regression/iceberg',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG obs PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'obs://bucket/regression/iceberg1',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
CREATE CATALOG `iceberg_hms_on_gcs_kerberos_new_catalog` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 'gs://bucket/iceberg_warehouse',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>',
'fs.gcs.support'='true'
);
Supported since 3.1.3
CREATE CATALOG iceberg_fs_on_azure_blob_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'abfs://<container>@<account_name>.dfs.core.windows.net/iceberg-warehouse',
'azure.account_name' = '<account_name>',
'azure.account_key' = '<account_key>',
'azure.endpoint' = 'https://<account_name>.blob.core.windows.net',
'fs.azure.support'='true'
);
CREATE CATALOG test_iceberg_on_hms_minio_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 's3://warehouse/wh',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'fs.minio.support' = 'true',
'minio.use_path_style' = 'true',
'minio.endpoint' = 'http://127.0.0.1:19001',
'minio.access_key' = '<ak>',
'minio.secret_key' = '<sk>'
);
2.1 & 3.0 Version
- HDFS
- S3
- OSS
- COS
- OBS
- MinIO
Access HMS and HDFS services without Kerberos authentication
CREATE CATALOG iceberg_hms_on_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 'hdfs://127.0.0.1:8320/user/iceberg/warehouse/iceberg-hms-hdfs-warehouse',
'hadoop.username' = 'doris',
'fs.defaultFS' = 'hdfs://127.0.0.1:8320'
);
Access HMS and HDFS services with Kerberos authentication enabled
CREATE CATALOG iceberg_hms_on_hdfs_kerberos_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'hdfs://127.0.0.1:8520/iceberg-hms-hdfs-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.kerberos.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal'='hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab'
);
CREATE CATALOG iceberg_hms_on_s3_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 's3://selectdb-qa-datalake-test-hk/refactor-test/iceberg-hms-s3-warehouse',
's3.region' = 'ap-east-1',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_hms_on_oss_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
'warehouse' = 'oss://bucket/iceberg/iceberg-hms-warehouse',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
CREATE CATALOG `iceberg_hms_on_cos_catalog` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 'cos://bucket/iceberg-hms-cos-warehouse',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.secret_key' = '<sk>',
'cos.access_key' = '<ak>'
);
CREATE CATALOG test_iceberg_on_hms_obs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 'obs://bucket/regression/iceberg',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
CREATE CATALOG test_iceberg_on_hms_minio_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hms',
'warehouse' = 's3://warehouse/iceberg',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
's3.endpoint' = 'http://127.0.0.1:19001',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
AWS Glue
3.1+ Version
- S3
AWS Glue and S3 storage services share the same authentication credentials.
CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'glue',
'warehouse' = 's3://bucket/iceberg-glue-s3-warehouse',
'glue.region' = 'ap-northeast-1',
'glue.endpoint' = 'https://glue.ap-northeast-1.amazonaws.com',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>'
);
When Glue service authentication credentials differ from S3 authentication credentials, you can specify S3 authentication credentials separately using the following method.
CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'glue',
'warehouse' = 's3://selectdb-qa-datalake-test/iceberg-glue-s3-warehouse',
'glue.region' = 'ap-northeast-1',
'glue.endpoint' = 'https://glue.ap-northeast-1.amazonaws.com',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>',
's3.endpoint' = 's3.ap-northeast-1.amazonaws.com',
's3.region' = 'ap-northeast-1',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Obtaining S3 access credentials using an IAM Assumed Role (3.1.2+)
CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'glue',
'warehouse' = 's3://bucket/warehouse',
'glue.region' = 'us-east-1',
'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com',
'glue.role_arn' = '<role_arn>'
);
2.1 & 3.0 Version
- S3
AWS Glue and S3 storage services share the same authentication credentials.
In non-EC2 environments, you need to run `aws configure` to set up credentials, which generates a credentials file in the `~/.aws` directory.
CREATE CATALOG glue PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'glue',
'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>'
);
Aliyun DLF
3.1+ Version
- DLF 1.0
CREATE CATALOG iceberg_dlf_catalog_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type'='dlf',
'warehouse' = 'oss://bucket/iceberg-dlf-oss-warehouse',
'dlf.uid' = '203225413946383283',
'dlf.catalog_id' = 'p2_regression_case',
'dlf.region' = 'cn-beijing',
'dlf.endpoint' = 'datalake.cn-beijing.aliyuncs.com',
'dlf.access_key' = '<ak>',
'dlf.secret_key' = '<sk>'
);
2.1 & 3.0 Version
- DLF 1.0
CREATE CATALOG iceberg_dlf_catalog_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type'='dlf',
'warehouse' = 'oss://bucket/iceberg-dlf-oss-warehouse',
'dlf.uid' = '203225413946383283',
'dlf.catalog.id' = 'catalog_id',
'dlf.region' = 'cn-beijing',
'dlf.access_key' = '<ak>',
'dlf.secret_key' = '<sk>'
);
Iceberg Rest Catalog
3.1+ Version
- S3
- OSS
- COS
- OBS
- GCS
- HDFS
- No Authentication
- OAuth2 Authentication
- Vended Credentials
CREATE CATALOG iceberg_static_s3 PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
's3.endpoint' = 'https://s3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'ap-east-1'
);
CREATE CATALOG polaris_oauth2_credential_static_s3 PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_test',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'root:secret123',
'iceberg.rest.oauth2.server-uri' = 'http://127.0.0.1:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:ALL',
's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'us-west-2'
);
CREATE CATALOG gravitino_vended_s3 PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
'iceberg.rest.vended-credentials-enabled' = 'true'
);
- No Authentication
- OAuth2 Authentication
CREATE CATALOG iceberg_static_oss PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
'oss.endpoint' = 'https://oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>',
'oss.region' = 'cn-beijing'
);
CREATE CATALOG polaris_oauth2_credential_static_oss PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_test',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'root:secret123',
'iceberg.rest.oauth2.server-uri' = 'http://127.0.0.1:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:ALL',
'oss.endpoint' = 'https://oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>',
'oss.region' = 'cn-beijing'
);
- No Authentication
- OAuth2 Authentication
CREATE CATALOG iceberg_static_cos PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
'cos.endpoint' = 'https://cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>',
'cos.region' = 'ap-beijing'
);
CREATE CATALOG polaris_oauth2_credential_static_cos PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_test',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'root:secret123',
'iceberg.rest.oauth2.server-uri' = 'http://127.0.0.1:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:ALL',
'cos.endpoint' = 'https://cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>',
'cos.region' = 'ap-beijing'
);
- No Authentication
- OAuth2 Authentication
CREATE CATALOG iceberg_static_obs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>',
'obs.region' = 'cn-north-4'
);
CREATE CATALOG polaris_oauth2_credential_static_obs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_test',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'root:secret123',
'iceberg.rest.oauth2.server-uri' = 'http://127.0.0.1:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:ALL',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>',
'obs.region' = 'cn-north-4'
);
- No Authentication
- OAuth2 Authentication
CREATE CATALOG iceberg_static_gcs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'gs://warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
'gs.endpoint' = 'https://storage.googleapis.com',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>'
);
CREATE CATALOG polaris_oauth2_credential_static_gcs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_test',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'root:secret123',
'iceberg.rest.oauth2.server-uri' = 'http://127.0.0.1:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:ALL',
'gs.endpoint' = 'https://storage.googleapis.com',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_static_hdfs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'hdfs://127.0.0.1:8020/user/iceberg/warehouse',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://127.0.0.1:8181',
'fs.defaultFS' = 'hdfs://127.0.0.1:8020'
);
2.1 & 3.0 Version
- S3
- OSS
- COS
- OBS
- HDFS
CREATE CATALOG iceberg_static_s3 PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'uri' = 'http://127.0.0.1:8181',
's3.endpoint' = 'https://s3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'ap-east-1'
);
CREATE CATALOG iceberg_static_oss PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'uri' = 'http://127.0.0.1:8181',
's3.endpoint' = 'https://oss-cn-beijing.aliyuncs.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'cn-beijing'
);
CREATE CATALOG iceberg_static_cos PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'uri' = 'http://127.0.0.1:8181',
's3.endpoint' = 'https://cos.ap-beijing.myqcloud.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'ap-beijing'
);
CREATE CATALOG iceberg_static_obs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 's3://warehouse',
'iceberg.catalog.type' = 'rest',
'uri' = 'http://127.0.0.1:8181',
's3.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.region' = 'cn-north-4'
);
CREATE CATALOG iceberg_static_hdfs PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'hdfs://127.0.0.1:8020/user/iceberg/warehouse',
'iceberg.catalog.type' = 'rest',
'uri' = 'http://127.0.0.1:8181',
'fs.defaultFS' = 'hdfs://127.0.0.1:8020'
);
FileSystem
3.1+ Version
- HDFS
- S3
- COS
- OSS
- OBS
- GCS
- MinIO
Access HDFS service without Kerberos authentication
CREATE CATALOG iceberg_fs_on_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://127.0.0.1:8320/user/iceberg/warehouse/iceberg-fs-hdfs-warehouse',
'hadoop.username' = 'doris',
'fs.defaultFS' = 'hdfs://127.0.0.1:8320'
);
Access HDFS service with Kerberos authentication enabled
CREATE CATALOG iceberg_fs_on_hdfs_kerberos_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://127.0.0.1:8520/iceberg-fs-hdfs-warehouse',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal'='hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab'
);
CREATE CATALOG iceberg_fs_on_glue_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://selectdb-qa-datalake-test-hk/refactor-test/iceberg-fs-s3-warehouse',
's3.region' = 'ap-east-1',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Obtaining S3 access credentials using an IAM Assumed Role (3.1.2+)
CREATE CATALOG iceberg_fs_on_s3_iamrole PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://bucket/warehouse',
's3.region' = 'us-east-1',
's3.role_arn' = 'arn:aws:iam::543815668950:role/role'
);
CREATE CATALOG iceberg_fs_on_cos_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'cos://doris-build-1308700295/iceberg-fs-cos-warehouse',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_fs_on_oss_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'oss://doris-regression-bj/refactor-test/iceberg-fs-oss-warehouse',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
CREATE CATALOG test_fs_on_obs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'obs://bucket/regression/iceberg',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_fs_on_gcs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'gs://bucket/iceberg_warehouse',
'fs.gcs.support'='true',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>'
);
CREATE CATALOG test_iceberg_fs_on_minio PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type'='hadoop',
'warehouse' = 's3://warehouse/wh',
'fs.minio.support'='true',
'minio.endpoint' = 'http://127.0.0.1:19001',
'minio.access_key' = 'admin',
'minio.secret_key' = 'password',
'minio.use_path_style' = 'true'
);
2.1 & 3.0 Version
- HDFS
- S3
- COS
- OSS
- GCS
- MinIO
Access HDFS service without Kerberos authentication
CREATE CATALOG iceberg_fs_on_hdfs_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://127.0.0.1:8320/user/iceberg/warehouse/iceberg-fs-hdfs-warehouse',
'hadoop.username' = 'doris',
'fs.defaultFS' = 'hdfs://127.0.0.1:8320'
);
Access HDFS service with Kerberos authentication enabled
CREATE CATALOG iceberg_fs_on_hdfs_kerberos_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://127.0.0.1:8520/iceberg-fs-hdfs-warehouse',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal'='hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab'
);
CREATE CATALOG iceberg_fs_on_s3_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://selectdb-qa-datalake-test-hk/refactor-test/iceberg-fs-s3-warehouse',
's3.region' = 'ap-east-1',
's3.endpoint' = 's3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_fs_on_cos_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'cos://doris-build-1308700295/iceberg-fs-cos-warehouse',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_fs_on_oss_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'oss://bucket/refactor-test/iceberg-fs-oss-warehouse',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
CREATE CATALOG iceberg_fs_on_gcs_catalog PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='hadoop',
'warehouse' = 's3://bucket/iceberg_warehouse',
'gs.endpoint'='storage.googleapis.com',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>'
);
CREATE CATALOG test_iceberg_fs_on_minio PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type'='hadoop',
'warehouse' = 's3://warehouse/wh',
's3.region' = 'ap-east-1',
's3.endpoint' = 'http://minio:9000',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
AWS S3 Tables
3.1+ Version
- FileSystem Catalog
- AWS Glue Rest Catalog
Please refer to the S3 Tables Integration documentation.
CREATE CATALOG test_s3tables_write_insert PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 's3tables',
'warehouse' = 'arn:aws:s3tables:ap-east-1:12345:bucket/doris-s3-table-bucket',
's3.region' = 'ap-east-1',
's3.endpoint' = 'https://s3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Please refer to the S3 Tables Integration documentation.
CREATE CATALOG glue_s3 PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'https://glue.ap-east-1.amazonaws.com/iceberg',
'warehouse' = '<account_id>:s3tablescatalog/<s3_table_bucket_name>',
'iceberg.rest.sigv4-enabled' = 'true',
'iceberg.rest.signing-name' = 'glue',
'iceberg.rest.access-key-id' = '<ak>',
'iceberg.rest.secret-access-key' = '<sk>',
'iceberg.rest.signing-region' = 'ap-east-1'
);
3.0.6+ Version
Please refer to the S3 Tables Integration documentation.
CREATE CATALOG test_s3tables_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 's3tables',
'warehouse' = 'arn:aws:s3tables:ap-east-1:9527:bucket/doris-s3-table-bucket',
's3.region' = 'ap-east-1',
's3.endpoint' = 'https://s3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Query Operations
Basic Query
-- 1. switch to catalog, use database and query
SWITCH iceberg;
USE iceberg_db;
SELECT * FROM iceberg_tbl LIMIT 10;
-- 2. use iceberg database directly
USE iceberg.iceberg_db;
SELECT * FROM iceberg_tbl LIMIT 10;
-- 3. use full qualified name to query
SELECT * FROM iceberg.iceberg_db.iceberg_tbl LIMIT 10;
Time Travel
You can read a specific snapshot of an Iceberg table.
By default, read requests will access the latest snapshot version.
You can query a specific snapshot of an Iceberg table using the iceberg_meta() table function:
SELECT * FROM iceberg_meta(
'table' = 'iceberg_ctl.iceberg_db.iceberg_tbl',
'query_type' = 'snapshots'
)\G
*************************** 1. row ***************************
committed_at: 2024-11-28 11:07:29
snapshot_id: 8903826400153112036
parent_id: -1
operation: append
manifest_list: oss://path/to/metadata/snap-8903826400153112036-1-3835e66d-9a18-4cb0-b9b0-9ec80527ad8d.avro
summary: {"added-data-files":"2","added-records":"3","added-files-size":"2742","changed-partition-count":"2","total-records":"3","total-files-size":"2742","total-data-files":"2","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}
*************************** 2. row ***************************
committed_at: 2024-11-28 11:10:11
snapshot_id: 6099853805930794326
parent_id: 8903826400153112036
operation: append
manifest_list: oss://path/to/metadata/snap-6099853805930794326-1-dd46a1bd-219b-4fb0-bb46-ac441d8b3105.avro
summary: {"added-data-files":"1","added-records":"1","added-files-size":"1367","changed-partition-count":"1","total-records":"4","total-files-size":"4109","total-data-files":"3","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}
You can use the FOR TIME AS OF and FOR VERSION AS OF clauses to read historical data based on snapshot ID or the time the snapshot was created. Here are some examples:
-- Read data as of a specific timestamp
SELECT * FROM iceberg_table FOR TIME AS OF '2023-01-01 00:00:00';
-- Read data as of a specific snapshot ID
SELECT * FROM iceberg_table FOR VERSION AS OF 123456789;
Branch and Tag
Since 3.1.0
For creating, dropping, and managing branches and tags, please refer to [Managing Branch & Tag].
Reading specific branches and tags of Iceberg tables is supported.
Multiple syntax forms are supported for compatibility with systems such as Spark and Trino.
-- BRANCH
SELECT * FROM iceberg_tbl@branch(branch1);
SELECT * FROM iceberg_tbl@branch("name" = "branch1");
SELECT * FROM iceberg_tbl FOR VERSION AS OF 'branch1';
-- TAG
SELECT * FROM iceberg_tbl@tag(tag1);
SELECT * FROM iceberg_tbl@tag("name" = "tag1");
SELECT * FROM iceberg_tbl FOR VERSION AS OF 'tag1';
For the FOR VERSION AS OF syntax, Doris will automatically determine whether the parameter is a timestamp or a Branch/Tag name.
View
Since version 3.1.0
Supports querying Iceberg views. View queries work the same way as regular table queries. Please note the following:
- Only `hms` type Iceberg Catalogs are supported.
- The view definition SQL must be compatible with the Doris SQL dialect; otherwise, parsing errors will occur. (Dialect conversion functionality will be provided in future versions.)
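For example, assuming a view named iceberg_view exists in an hms-type catalog, it is queried just like a regular table:
SELECT * FROM iceberg.iceberg_db.iceberg_view LIMIT 10;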
System Tables
Since 3.1.0
Doris supports querying Iceberg system tables to retrieve metadata information about tables. You can use system tables to view snapshot history, manifest files, data files, partitions, and other metadata.
To access metadata of an Iceberg table, append a $ symbol followed by the system table name to the table name:
SELECT * FROM iceberg_table$system_table_name;
For example, to view the table's history, you can execute:
SELECT * FROM iceberg_table$history;
Currently, the `all_manifests` and `position_deletes` system tables are not yet supported; support is planned for future versions.
entries
Shows all manifest entries for the current snapshot of the table:
all_entries and entries are similar, with the difference that all_entries contains entries from all snapshots, while entries only contains entries from the current snapshot.
SELECT * FROM iceberg_table$entries;
Result:
+--------+---------------------+-----------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| status | snapshot_id | sequence_number | file_sequence_number | data_file | readable_metrics |
+--------+---------------------+-----------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 2 | 4890031351138056789 | 1 | 1 | {"content":0, "file_path":"s3://.../iceberg_table/data/id=1/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00001.parquet", "file_format":"PARQUET", "spec_id":0, "partition":{"id":1}, "record_count":1, "file_size_in_bytes":625, "column_sizes":{1:36, 2:41}, "value_counts":{1:1, 2:1}, "null_value_counts":{1:0, 2:0}, "nan_value_counts":{}, "lower_bounds":{1:" ", 2:"Alice"}, "upper_bounds":{1:" ", 2:"Alice"}, "key_metadata":null, "split_offsets":[4], "equality_ids":null, "sort_order_id":0, "first_row_id":null, "referenced_data_file":null, "content_offset":null, "content_size_in_bytes":null} | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":1, "upper_bound":1}, "name":{"column_size":41, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Alice", "upper_bound":"Alice"}} |
| 0 | 1851184769713369003 | 1 | 1 | {"content":0, "file_path":"s3://.../iceberg_table/data/id=2/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00002.parquet", "file_format":"PARQUET", "spec_id":0, "partition":{"id":2}, "record_count":1, "file_size_in_bytes":611, "column_sizes":{1:36, 2:39}, "value_counts":{1:1, 2:1}, "null_value_counts":{1:0, 2:0}, "nan_value_counts":{}, "lower_bounds":{1:" ", 2:"Bob"}, "upper_bounds":{1:" ", 2:"Bob"}, "key_metadata":null, "split_offsets":[4], "equality_ids":null, "sort_order_id":0, "first_row_id":null, "referenced_data_file":null, "content_offset":null, "content_size_in_bytes":null} | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":2, "upper_bound":2}, "name":{"column_size":39, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Bob", "upper_bound":"Bob"}} |
+--------+---------------------+-----------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
files
Shows file list for the current snapshot of the table:
SELECT * FROM iceberg_table$files;
Result:
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+-----------+--------------+--------------------+--------------+--------------+-------------------+------------------+-----------------------+-----------------------+--------------+---------------+--------------+---------------+--------------+----------------------+----------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | first_row_id | referenced_data_file | content_offset | content_size_in_bytes | readable_metrics |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+-----------+--------------+--------------------+--------------+--------------+-------------------+------------------+-----------------------+-----------------------+--------------+---------------+--------------+---------------+--------------+----------------------+----------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0 | s3://.../iceberg_table/data/id=2/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00002.parquet | PARQUET | 0 | {"id":2} | 1 | 611 | {1:36, 2:39} | {1:1, 2:1} | {1:0, 2:0} | {} | {1:" ", 2:"Bob"} | {1:" ", 2:"Bob"} | NULL | [4] | NULL | 0 | NULL | NULL | NULL | NULL | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":2, "upper_bound":2}, "name":{"column_size":39, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Bob", "upper_bound":"Bob"}} |
| 0 | s3://.../iceberg_table/data/id=4/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00004.parquet | PARQUET | 0 | {"id":4} | 1 | 618 | {1:36, 2:40} | {1:1, 2:1} | {1:0, 2:0} | {} | {1:" ", 2:"Dave"} | {1:" ", 2:"Dave"} | NULL | [4] | NULL | 0 | NULL | NULL | NULL | NULL | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":4, "upper_bound":4}, "name":{"column_size":40, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Dave", "upper_bound":"Dave"}} |
| 0 | s3://.../iceberg_table/data/id=6/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00006.parquet | PARQUET | 0 | {"id":6} | 1 | 625 | {1:36, 2:41} | {1:1, 2:1} | {1:0, 2:0} | {} | {1:" ", 2:"Frank"} | {1:" ", 2:"Frank"} | NULL | [4] | NULL | 0 | NULL | NULL | NULL | NULL | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":6, "upper_bound":6}, "name":{"column_size":41, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Frank", "upper_bound":"Frank"}} |
| 0 | s3://.../iceberg_table/data/id=8/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00008.parquet | PARQUET | 0 | {"id":8} | 1 | 625 | {1:36, 2:41} | {1:1, 2:1} | {1:0, 2:0} | {} | {1:" ", 2:"Heidi"} | {1:" ", 2:"Heidi"} | NULL | [4] | NULL | 0 | NULL | NULL | NULL | NULL | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":8, "upper_bound":8}, "name":{"column_size":41, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Heidi", "upper_bound":"Heidi"}} |
| 0 | s3://.../iceberg_table/data/id=10/00000-16-79ef2fd7-9997-47eb-a91a-9f7af8201315-0-00010.parquet | PARQUET | 0 | {"id":10} | 1 | 618 | {1:36, 2:40} | {1:1, 2:1} | {1:0, 2:0} | {} | {1:" ", 2:"Judy"} | {1:" ", 2:"Judy"} | NULL | [4] | NULL | 0 | NULL | NULL | NULL | NULL | {"id":{"column_size":36, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":10, "upper_bound":10}, "name":{"column_size":40, "value_count":1, "null_value_count":0, "nan_value_count":null, "lower_bound":"Judy", "upper_bound":"Judy"}} |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+-----------+--------------+--------------------+--------------+--------------+-------------------+------------------+-----------------------+-----------------------+--------------+---------------+--------------+---------------+--------------+----------------------+----------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Similar to the files system table, you can also query specific file type information through the following system tables:
-- Query data files for the current snapshot
SELECT * FROM iceberg_table$data_files;
-- Query delete files for the current snapshot
SELECT * FROM iceberg_table$delete_files;
-- Query all files (including data and delete files) from all snapshots
SELECT * FROM iceberg_table$all_files;
-- Query data files from all snapshots
SELECT * FROM iceberg_table$all_data_files;
-- Query delete files from all snapshots
SELECT * FROM iceberg_table$all_delete_files;
The result format of these tables is similar to the files system table, but each focuses specifically on data files or delete files. System tables with the all_ prefix contain files from all snapshots, not just files from the current snapshot.
Note: When specific types of files do not exist in the table (for example, querying delete_files when there are no delete files in the table), the query result may be empty.
history
Shows all history of the table:
SELECT * FROM iceberg_table$history;
Result:
+----------------------------+---------------------+---------------------+---------------------+
| made_current_at | snapshot_id | parent_id | is_current_ancestor |
+----------------------------+---------------------+---------------------+---------------------+
| 2025-06-12 22:29:16.357000 | 1851184769713369003 | NULL | 1 |
| 2025-06-12 22:29:39.922000 | 4890031351138056789 | 1851184769713369003 | 1 |
+----------------------------+---------------------+---------------------+---------------------+
manifests
Shows manifest file info of the table:
SELECT * FROM iceberg_table$manifests;
Result:
+---------+------------------------------------------------------------------------------------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------+-----------------------------+----------------------------+--------------------------------------------------------------------------------+
| content | path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | added_delete_files_count | existing_delete_files_count | deleted_delete_files_count | partition_summaries |
+---------+------------------------------------------------------------------------------------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------+-----------------------------+----------------------------+--------------------------------------------------------------------------------+
| 0 | s3://.../iceberg_table/metadata/3194eb8b-5ea4-4cbe-95ba-073229458e7b-m0.avro | 7138 | 0 | 4890031351138056789 | 0 | 5 | 5 | 0 | 0 | 0 | [{"contains_null":0, "contains_nan":0, "lower_bound":"1", "upper_bound":"10"}] |
+---------+------------------------------------------------------------------------------------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------+-----------------------------+----------------------------+--------------------------------------------------------------------------------+
metadata_log_entries
Shows meta logs of the table:
SELECT * FROM iceberg_table$metadata_log_entries;
Result:
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------------------------+
| timestamp | file | latest_snapshot_id | latest_schema_id | latest_sequence_number |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------------------------+
| 2025-06-12 22:29:06.948000 | s3://.../iceberg_table/metadata/00000-e373aa16-15f1-4e69-ae7d-5ed64199cf9a.metadata.json | NULL | NULL | NULL |
| 2025-06-12 22:29:16.357000 | s3://.../iceberg_table/metadata/00001-bbc8e244-e41c-4958-92f4-63b8c3ee1196.metadata.json | 1851184769713369003 | 0 | 1 |
| 2025-06-12 22:29:39.922000 | s3://.../iceberg_table/metadata/00002-7dc00d6a-6269-4200-9d28-5f8c1c6b9f99.metadata.json | 4890031351138056789 | 0 | 2 |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------------------------+
partitions
Shows partitions of the table:
SELECT * FROM iceberg_table$partitions;
Result:
+-----------+---------+--------------+------------+-------------------------------+------------------------------+----------------------------+------------------------------+----------------------------+----------------------------+--------------------------+
| partition | spec_id | record_count | file_count | total_data_file_size_in_bytes | position_delete_record_count | position_delete_file_count | equality_delete_record_count | equality_delete_file_count | last_updated_at | last_updated_snapshot_id |
+-----------+---------+--------------+------------+-------------------------------+------------------------------+----------------------------+------------------------------+----------------------------+----------------------------+--------------------------+
| {"id":8} | 0 | 1 | 1 | 625 | 0 | 0 | 0 | 0 | 2025-06-12 22:29:16.357000 | 1851184769713369003 |
| {"id":6} | 0 | 1 | 1 | 625 | 0 | 0 | 0 | 0 | 2025-06-12 22:29:16.357000 | 1851184769713369003 |
| {"id":10} | 0 | 1 | 1 | 618 | 0 | 0 | 0 | 0 | 2025-06-12 22:29:16.357000 | 1851184769713369003 |
| {"id":4} | 0 | 1 | 1 | 618 | 0 | 0 | 0 | 0 | 2025-06-12 22:29:16.357000 | 1851184769713369003 |
| {"id":2} | 0 | 1 | 1 | 611 | 0 | 0 | 0 | 0 | 2025-06-12 22:29:16.357000 | 1851184769713369003 |
+-----------+---------+--------------+------------+-------------------------------+------------------------------+----------------------------+------------------------------+----------------------------+----------------------------+--------------------------+
Note:
- For non-partitioned tables, the `partitions` table will not contain the `partition` and `spec_id` fields.
- The `partitions` table shows partitions that contain data files or delete files in the current snapshot. However, delete files are not applied, so in some cases a partition may still be displayed even if all of its data rows have been marked as deleted by delete files.
refs
Shows all known snapshot references (branches and tags) for the table:
SELECT * FROM iceberg_table$refs;
Result:
+------+--------+---------------------+-------------------------+-----------------------+------------------------+
| name | type | snapshot_id | max_reference_age_in_ms | min_snapshots_to_keep | max_snapshot_age_in_ms |
+------+--------+---------------------+-------------------------+-----------------------+------------------------+
| main | BRANCH | 4890031351138056789 | NULL | NULL | NULL |
+------+--------+---------------------+-------------------------+-----------------------+------------------------+
snapshots
Shows all snapshots of the table:
SELECT * FROM iceberg_table$snapshots;
Result:
+----------------------------+---------------------+---------------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| committed_at | snapshot_id | parent_id | operation | manifest_list | summary |
+----------------------------+---------------------+---------------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 2025-06-12 22:29:16.357000 | 1851184769713369003 | NULL | append | s3://.../iceberg_table/metadata/snap-1851184769713369003-1-82059f57-821a-4983-b083-002cc2cde313.avro | {"spark.app.id":"application_1738810850199_0472", "added-data-files":"10", "added-records":"10", "added-files-size":"6200", "changed-partition-count":"10", "total-records":"10", "total-files-size":"6200", "total-data-files":"10", "total-delete-files":"0", "total-position-deletes":"0", "total-equality-deletes":"0"} |
| 2025-06-12 22:29:39.922000 | 4890031351138056789 | 1851184769713369003 | overwrite | s3://.../iceberg_table/metadata/snap-4890031351138056789-1-3194eb8b-5ea4-4cbe-95ba-073229458e7b.avro | {"spark.app.id":"application_1738810850199_0472", "deleted-data-files":"5", "deleted-records":"5", "removed-files-size":"3103", "changed-partition-count":"5", "total-records":"5", "total-files-size":"3097", "total-data-files":"5", "total-delete-files":"0", "total-position-deletes":"0", "total-equality-deletes":"0"} |
+----------------------------+---------------------+---------------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Write Operations
INSERT INTO
The INSERT operation appends data to the target table.
For example:
INSERT INTO iceberg_tbl VALUES (val1, val2, val3, val4);
INSERT INTO iceberg.iceberg_db.iceberg_tbl SELECT col1, col2 FROM internal.db1.tbl1;
INSERT INTO iceberg_tbl(col1, col2) VALUES (val1, val2);
INSERT INTO iceberg_tbl(col1, col2, partition_col1, partition_col2) VALUES (1, 2, 'beijing', '2023-12-12');
Since version 3.1.0, writing data to a specified branch is supported:
INSERT INTO iceberg_tbl@branch(b1) values (val1, val2, val3, val4);
INSERT INTO iceberg_tbl@branch(b1) (col3, col4) values (val3, val4);
INSERT OVERWRITE
The INSERT OVERWRITE operation completely replaces the existing data in the table with new data.
INSERT OVERWRITE TABLE iceberg_tbl VALUES (val1, val2, val3, val4);
INSERT OVERWRITE TABLE iceberg.iceberg_db.iceberg_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
Since version 3.1.0, writing data to a specified branch is supported:
INSERT OVERWRITE TABLE iceberg_tbl@branch(b1) values (val1, val2, val3, val4);
INSERT OVERWRITE TABLE iceberg_tbl@branch(b1) (col3, col4) values (val3, val4);
CTAS
You can create an Iceberg table and write data using the CTAS (Create Table As Select) statement:
CREATE TABLE iceberg_ctas AS SELECT * FROM other_table;
CTAS supports specifying file formats, partitioning, and other properties:
CREATE TABLE iceberg_ctas
PARTITION BY LIST (pt1, pt2) ()
AS SELECT col1, pt1, pt2 FROM part_ctas_src WHERE col1 > 0;
CREATE TABLE iceberg.iceberg_db.iceberg_ctas (col1, col2, pt1)
PARTITION BY LIST (pt1) ()
PROPERTIES (
'write-format'='parquet',
'compression-codec'='zstd'
)
AS SELECT col1, pt1 AS col2, pt2 AS pt1 FROM test_ctas.part_ctas_src WHERE col1 > 0;
INSERT INTO BRANCH
Since 3.1.0
INSERT INTO iceberg_table@branch(b1) SELECT * FROM other_table;
INSERT OVERWRITE TABLE iceberg_table@branch(b1) SELECT * FROM other_table;
Related Parameters
-
BE (Backend)
| Parameter Name | Default Value | Description |
|---|---|---|
| iceberg_sink_max_file_size | 1GB | Maximum data file size. When the written data exceeds this size, the current file is closed and a new file is created to continue writing. |
| table_sink_partition_write_max_partition_nums_per_writer | 128 | Maximum number of partitions each instance can write to on a BE node. |
| table_sink_non_partition_write_scaling_data_processed_threshold | 25MB | Data threshold for starting scaling-write in non-partitioned tables. A new writer (instance) is used for every additional table_sink_non_partition_write_scaling_data_processed_threshold of data. This mechanism adjusts the number of writers based on data volume to enhance throughput while conserving resources and minimizing file numbers for smaller data volumes. |
| table_sink_partition_write_min_data_processed_rebalance_threshold | 25MB | Minimum data volume threshold to trigger rebalancing for partitioned tables. Rebalancing starts if current accumulated data volume - data volume since last rebalancing >= table_sink_partition_write_min_data_processed_rebalance_threshold. Lowering this threshold can improve balance if file size differences are significant, but may increase rebalancing costs and impact performance. |
| table_sink_partition_write_min_partition_data_processed_rebalance_threshold | | Minimum partition data volume threshold to trigger rebalancing. Rebalancing starts if current partition data volume >= threshold * number of tasks already allocated to the partition. Lowering this threshold can improve balance if file size differences are significant, but may increase rebalancing costs and impact performance. |
Database and Table Management
Creating and Dropping Databases
You can switch to the desired catalog using the SWITCH statement and execute the CREATE DATABASE command:
SWITCH iceberg;
CREATE DATABASE [IF NOT EXISTS] iceberg_db;
You can also create a database using a fully qualified name or specify a location (currently, only HMS-type catalogs support specifying a location), such as:
CREATE DATABASE [IF NOT EXISTS] iceberg.iceberg_db;
CREATE DATABASE [IF NOT EXISTS] iceberg.iceberg_db
PROPERTIES ('location'='hdfs://172.21.16.47:4007/path/to/db/');
You can view the database's location information using the SHOW CREATE DATABASE command:
mysql> SHOW CREATE DATABASE iceberg_db;
+-------------+-------------------------------------------------------------------------------------------------+
| Database | Create Database |
+-------------+-------------------------------------------------------------------------------------------------+
| iceberg_db | CREATE DATABASE iceberg_db LOCATION 'hdfs://172.21.16.47:4007/usr/hive/warehouse/iceberg_db.db' |
+-------------+-------------------------------------------------------------------------------------------------+
To drop a database:
DROP DATABASE [IF EXISTS] iceberg.iceberg_db;
For an Iceberg Database, you must first drop all tables under the database before you can drop the database itself; otherwise, an error will occur.
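For example, a typical sequence looks like the following sketch (iceberg_tbl1 and iceberg_tbl2 are illustrative table names):
DROP TABLE IF EXISTS iceberg.iceberg_db.iceberg_tbl1;
DROP TABLE IF EXISTS iceberg.iceberg_db.iceberg_tbl2;
DROP DATABASE IF EXISTS iceberg.iceberg_db;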
Creating and Dropping Tables
-
Creating Tables
Doris supports creating both partitioned and non-partitioned tables in Iceberg.
For example:
-- Create unpartitioned iceberg table
CREATE TABLE unpartitioned_table (
`col1` BOOLEAN COMMENT 'col1',
`col2` INT COMMENT 'col2',
`col3` BIGINT COMMENT 'col3',
`col4` FLOAT COMMENT 'col4',
`col5` DOUBLE COMMENT 'col5',
`col6` DECIMAL(9,4) COMMENT 'col6',
`col7` STRING COMMENT 'col7',
`col8` DATE COMMENT 'col8',
`col9` DATETIME COMMENT 'col9'
)
PROPERTIES (
'write-format'='parquet'
);
-- Create partitioned iceberg table
-- The partition columns must be in the table's column definition list
CREATE TABLE partition_table (
`ts` DATETIME COMMENT 'ts',
`col1` BOOLEAN COMMENT 'col1',
`col2` INT COMMENT 'col2',
`col3` BIGINT COMMENT 'col3',
`col4` FLOAT COMMENT 'col4',
`col5` DOUBLE COMMENT 'col5',
`col6` DECIMAL(9,4) COMMENT 'col6',
`col7` STRING COMMENT 'col7',
`col8` DATE COMMENT 'col8',
`col9` DATETIME COMMENT 'col9',
`pt1` STRING COMMENT 'pt1',
`pt2` STRING COMMENT 'pt2'
)
PARTITION BY LIST (day(ts), pt1, pt2) ()
PROPERTIES (
'write-format'='orc',
'compression-codec'='zlib'
);
After creation, you can use the SHOW CREATE TABLE command to view the Iceberg table creation statement. For details about partition functions, see the Partitioning section.
-
Dropping Tables
You can drop an Iceberg table using the DROP TABLE statement. Dropping a table will also remove its data, including partition data.
For example:
DROP TABLE [IF EXISTS] iceberg_tbl;
-
Column Type Mapping
Refer to the Column Type Mapping section.
-
Partitioning
Partition types in Iceberg correspond to List partitions in Doris. Therefore, when creating an Iceberg partitioned table in Doris, you should use the List partitioning syntax, but you don't need to explicitly enumerate each partition. Doris will automatically create the corresponding Iceberg partitions based on the data values during data insertion.
-
Supports creating single-column or multi-column partitioned tables.
-
Supports partition transformation functions to enable Iceberg implicit partitioning and partition evolution; an example table definition using these transforms is shown after this list. For specific Iceberg partition transformation functions, see Iceberg partition transforms:
-
year(ts) or years(ts)
month(ts) or months(ts)
day(ts) or days(ts) or date(ts)
hour(ts) or hours(ts) or date_hour(ts)
bucket(N, col) -
truncate(L, col)
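For example, the following sketch (table and column names are illustrative) combines several transforms in one definition: daily partitions on ts, 16 hash buckets on id, and partitions on the first 10 characters of category.
CREATE TABLE partition_transform_demo (
  `ts` DATETIME COMMENT 'event time',
  `id` BIGINT COMMENT 'id',
  `category` STRING COMMENT 'category'
)
PARTITION BY LIST (day(ts), bucket(16, id), truncate(10, category)) ()
PROPERTIES (
  'write-format'='parquet'
);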
-
-
-
File Formats
-
Parquet (default)
Note that for Iceberg tables created by Doris, the Datetime type corresponds to the timestamp_ntz type. In versions after 3.1.0, when the Datetime type is written to Parquet files, the physical type used is INT64 instead of INT96.
If the Iceberg table was created by another system, both the timestamp and timestamp_ntz types are mapped to the Doris Datetime type; when writing, Doris determines whether time zone handling is needed based on the actual underlying type.
-
ORC
-
-
Compression Formats
-
Parquet: snappy, zstd (default), plain (no compression).
-
ORC: snappy, zlib (default), zstd, plain (no compression).
-
-
Storage Medium
-
HDFS
-
Object storage
-
Schema Change
Starting from 3.1.0, Doris supports schema changes for Iceberg tables via the ALTER TABLE statement.
Supported schema change operations include:
-
Rename Column
Use the RENAME COLUMN clause to rename columns. Renaming columns within nested types is not supported.
ALTER TABLE iceberg_table RENAME COLUMN old_col_name TO new_col_name;
-
Add a Column
Use ADD COLUMN to add a new column. Adding new columns to nested types is not supported.
When adding a new column, you can specify nullable attributes, default values, comments, and column position.
ALTER TABLE iceberg_table ADD COLUMN col_name col_type [NULL|NOT NULL, [DEFAULT default_value, [COMMENT 'comment', [FIRST|AFTER col_name]]]];
Example:
ALTER TABLE iceberg_table ADD COLUMN new_col STRING NOT NULL DEFAULT 'default_value' COMMENT 'This is a new col' AFTER old_col;
-
Add Columns
You can also use ADD COLUMN to add multiple columns. The new columns will be added to the end of the table. Column positioning is not supported for multiple columns. Adding new columns to nested types is not supported.
The syntax for each column is the same as adding a single column.
ALTER TABLE iceberg_table ADD COLUMN (col_name1 col_type1 [NULL|NOT NULL, [DEFAULT default_value, [COMMENT 'comment']]], col_name2 col_type2 [NULL|NOT NULL, [DEFAULT default_value, [COMMENT 'comment']]] ...);
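For example, a sketch that adds two columns in one statement (the new column names are illustrative):
ALTER TABLE iceberg_table ADD COLUMN (new_col1 INT NULL COMMENT 'new col1', new_col2 STRING NOT NULL DEFAULT 'none' COMMENT 'new col2');
-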
Drop Column
Use DROP COLUMN to drop columns. Dropping columns within nested types is not supported.
ALTER TABLE iceberg_table DROP COLUMN col_name;
-
Modify Column
Use the MODIFY COLUMN statement to modify column attributes, including type, nullable, default value, comment, and column position.
Note: When modifying column attributes, all attributes that are not being modified should also be explicitly specified with their original values.
ALTER TABLE iceberg_table MODIFY COLUMN col_name col_type [NULL|NOT NULL, [DEFAULT default_value, [COMMENT 'comment', [FIRST|AFTER col_name]]]];
Example:
CREATE TABLE iceberg_table (
id INT,
name STRING
);
-- Modify the id column type to BIGINT, set as NOT NULL, default value to 0, and add comment
ALTER TABLE iceberg_table MODIFY COLUMN id BIGINT NOT NULL DEFAULT 0 COMMENT 'This is a modified id column';
-
Reorder Columns
Use ORDER BY to reorder columns by specifying the new column order.
ALTER TABLE iceberg_table ORDER BY (col_name1, col_name2, ...);
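Example (a sketch using the iceberg_table defined above, which has columns id and name):
ALTER TABLE iceberg_table ORDER BY (name, id);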
Managing Branch & Tag
Since 3.1.0
-
Create Branch
Syntax:
ALTER TABLE [catalog.][database.]table_name
CREATE [OR REPLACE] BRANCH [IF NOT EXISTS] <branch_name>
[AS OF VERSION <snapshot_id>]
[RETAIN <num> { DAYS | HOURS | MINUTES }]
[WITH SNAPSHOT RETENTION { snapshotKeep | timeKeep }]
snapshotKeep:
<num> SNAPSHOTS [<num> { DAYS | HOURS | MINUTES }]
timeKeep:
<num> { DAYS | HOURS | MINUTES }
Examples:
-- Create branch "b1".
ALTER TABLE tbl CREATE BRANCH b1;
ALTER TABLE tbl CREATE BRANCH IF NOT EXISTS b1;
-- Create or replace branch "b1".
ALTER TABLE tbl CREATE OR REPLACE BRANCH b1;
-- Create or replace branch "b1" based on snapshot "123456".
ALTER TABLE tbl CREATE OR REPLACE BRANCH b1 AS OF VERSION 123456;
-- Create or replace branch "b1" based on snapshot "123456", branch retained for 1 day.
ALTER TABLE tbl CREATE OR REPLACE BRANCH b1 AS OF VERSION 123456 RETAIN 1 DAYS;
-- Create branch "b1" based on snapshot "123456", branch retained for 30 days. Keep the latest 3 snapshots in the branch.
ALTER TABLE tbl CREATE BRANCH b1 AS OF VERSION 123456 RETAIN 30 DAYS WITH SNAPSHOT RETENTION 3 SNAPSHOTS;
-- Create branch "b1" based on snapshot "123456", branch retained for 30 days. Snapshots in the branch are retained for at most 2 days.
ALTER TABLE tbl CREATE BRANCH b1 AS OF VERSION 123456 RETAIN 30 DAYS WITH SNAPSHOT RETENTION 2 DAYS;
-- Create branch "b1" based on snapshot "123456", branch retained for 30 days. Keep the latest 3 snapshots in the branch, and snapshots in the branch are retained for at most 2 days.
ALTER TABLE tbl CREATE BRANCH b1 AS OF VERSION 123456 RETAIN 30 DAYS WITH SNAPSHOT RETENTION 3 SNAPSHOTS 2 DAYS;
-
Drop Branch
Syntax:
ALTER TABLE [catalog.][database.]table_name
DROP BRANCH [IF EXISTS] <branch_name>;
Example:
ALTER TABLE tbl DROP BRANCH b1;
-
Create Tag
Syntax:
ALTER TABLE [catalog.][database.]table_name
CREATE [OR REPLACE] TAG [IF NOT EXISTS] <tag_name>
[AS OF VERSION <snapshot_id>]
[RETAIN <num> { DAYS | HOURS | MINUTES }]
Examples:
-- Create tag "t1".
ALTER TABLE tbl CREATE TAG t1;
ALTER TABLE tbl CREATE TAG IF NOT EXISTS t1;
-- Create or replace tag "t1".
ALTER TABLE tbl CREATE OR REPLACE TAG t1;
-- Create or replace tag "t1" based on snapshot "123456".
ALTER TABLE tbl CREATE OR REPLACE TAG t1 AS OF VERSION 123456;
-- Create or replace tag "t1" based on snapshot "123456", tag retained for 1 day.
ALTER TABLE tbl CREATE OR REPLACE TAG t1 AS OF VERSION 123456 RETAIN 1 DAYS;
-
Drop Tag
Syntax:
ALTER TABLE [catalog.][database.]table_name
DROP TAG [IF EXISTS] <tag_name>;
Example:
ALTER TABLE tbl DROP TAG t1;
Iceberg Table Actions
This feature is supported since 4.0.2
This is an experimental feature.
Doris supports executing specific actions on Iceberg tables through the ALTER TABLE EXECUTE syntax. This syntax provides a unified interface for executing various actions such as table optimization and snapshot management.
ALTER TABLE [catalog.][database.]table_name
EXECUTE <action_name> ("key1" = "value1", "key2" = "value2", ...)
[WHERE <condition>]
- action_name: The name of the action to execute
- ("key1" = "value1", ...): Action-related parameters provided as key-value pairs. Parameter names must be wrapped in double quotes
- [WHERE <condition>]: Optional WHERE condition to specify the partition or data range for the operation (only supported by some operations)
rewrite_data_files
The rewrite_data_files operation rewrites data files in an Iceberg table to implement small file compaction, optimizing query performance and storage efficiency. This is a synchronous operation.
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE rewrite_data_files ("key1" = "value1", "key2" = "value2", ...) [WHERE <condition>]
Parameters:
| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
target-file-size-bytes | Long | 536870912 (512MB) | Target file size in bytes for output files |
min-file-size-bytes | Long | target-file-size-bytes * 0.75 | Minimum file size in bytes. Files smaller than this will be rewritten. If not specified, defaults to 75% of target file size |
max-file-size-bytes | Long | target-file-size-bytes * 1.8 | Maximum file size in bytes. Files larger than this will be split. If not specified, defaults to 180% of target file size |
min-input-files | Integer | 5 | Minimum number of input files to trigger rewrite. Rewrite only executes when the number of files to rewrite reaches this threshold |
rewrite-all | Boolean | false | Whether to rewrite all files regardless of file size |
max-file-group-size-bytes | Long | 107374182400 (100GB) | Maximum size in bytes for a file group. Used to limit the amount of data processed by a single rewrite task |
delete-file-threshold | Integer | Integer.MAX_VALUE | Minimum number of delete files required to trigger rewrite |
delete-ratio-threshold | Double | 0.3 | Minimum ratio of delete records to total records required to trigger rewrite (delete records / total records). Range: 0.0 - 1.0 |
output-spec-id | Long | 2 | Partition specification ID for output files |
Return Value:
Executing rewrite_data_files returns a result set with the following 4 columns:
| Column Name | Type | Description |
|---|---|---|
rewritten_data_files_count | INT | Number of data files that were rewritten |
added_data_files_count | INT | Number of new data files that were written |
rewritten_bytes_count | INT | Number of bytes that were rewritten |
removed_delete_files_count | BIGINT | Number of delete files that were removed |
Examples:
-- Rewrite data files with default parameters
ALTER TABLE iceberg_db.iceberg_table EXECUTE rewrite_data_files();
-- Specify target file size and minimum input files
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rewrite_data_files (
"target-file-size-bytes" = "134217728",
"min-input-files" = "10"
);
-- Use WHERE condition to rewrite specific partition data
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rewrite_data_files (
"target-file-size-bytes" = "268435456"
)
WHERE date_col = '2024-01-01';
-- Rewrite all files
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rewrite_data_files (
"rewrite-all" = "true",
"target-file-size-bytes" = "536870912"
);
Notes:
- The rewrite_data_files operation reads and rewrites data files, which will generate additional I/O and computational overhead. Please allocate cluster resources reasonably
- Before execution, you can use the SQL in the View Data File Distribution section to evaluate whether a rewrite operation is needed
- The WHERE condition can be used to limit the partition or data range for rewriting. This condition will filter out files that do not contain data matching the WHERE condition, thereby reducing the number of files and amount of data to rewrite
- Before execution, you can use the SQL in the rewrite_data_files File Selection Strategy section in the Appendix to estimate which files will be rewritten
cherrypick_snapshot
The cherrypick_snapshot operation merges changes from a specified snapshot into the current table state, creating a new snapshot without modifying or deleting the original snapshot.
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE cherrypick_snapshot ("snapshot_id" = "<snapshot_id>")
Parameters:
| Parameter Name | Type | Required | Description |
|---|---|---|---|
snapshot_id | Long | Yes | The snapshot ID to merge |
Return Value:
Executing cherrypick_snapshot returns a result set with the following 2 columns:
| Column Name | Type | Description |
|---|---|---|
source_snapshot_id | BIGINT | The snapshot ID whose changes were merged |
current_snapshot_id | BIGINT | The snapshot ID of the new snapshot created and set as current after the merge operation |
Example:
-- Merge changes from snapshot 123456789 into current table state
ALTER TABLE iceberg_db.iceberg_table
EXECUTE cherrypick_snapshot ("snapshot_id" = "123456789");
Notes:
- This operation does not support WHERE conditions
- If the specified snapshot does not exist, the operation will fail
- The merge operation creates a new snapshot and does not delete the original snapshot
fast_forward
The fast_forward operation fast-forwards the current snapshot of one branch to the latest snapshot of another branch.
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE fast_forward ("branch" = "<branch_name>", "to" = "<target_branch>")
Parameters:
| Parameter Name | Type | Required | Description |
|---|---|---|---|
branch | String | Yes | Name of the branch to fast-forward |
to | String | Yes | Target branch name. The branch will be fast-forwarded to the latest snapshot of this branch |
Return Value:
Executing fast_forward returns a result set with the following 3 columns:
| Column Name | Type | Description |
|---|---|---|
branch_updated | STRING | Name of the branch that was fast-forwarded |
previous_ref | BIGINT | Snapshot ID the branch was pointing to before the fast-forward operation (can be NULL) |
updated_ref | BIGINT | Snapshot ID the branch is pointing to after the fast-forward operation |
Example:
-- Fast-forward feature branch to the latest snapshot of main branch
ALTER TABLE iceberg_db.iceberg_table
EXECUTE fast_forward ("branch" = "feature", "to" = "main");
Notes:
- This operation does not support WHERE conditions
- If the specified branch does not exist, the operation will fail
- Only branches can be fast-forwarded, not tags
rollback_to_snapshot
The rollback_to_snapshot operation rolls back an Iceberg table to a specified snapshot.
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE rollback_to_snapshot ("snapshot_id" = "<snapshot_id>")
Parameters:
| Parameter Name | Type | Required | Description |
|---|---|---|---|
snapshot_id | Long | Yes | The snapshot ID to roll back to |
Return Value:
Executing rollback_to_snapshot returns a result set with the following 2 columns:
| Column Name | Type | Description |
|---|---|---|
previous_snapshot_id | BIGINT | The snapshot ID that was current before the rollback operation |
current_snapshot_id | BIGINT | The snapshot ID that is set as current after rolling back to the specified snapshot |
Example:
-- Rollback to snapshot 987654321
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rollback_to_snapshot ("snapshot_id" = "987654321");
Notes:
- This operation does not support WHERE conditions
- If the specified snapshot does not exist, the operation will fail
- If the current snapshot is already the target snapshot, the operation returns directly without creating a new snapshot
rollback_to_timestamp
The rollback_to_timestamp operation rolls back an Iceberg table to the snapshot at a specified timestamp.
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE rollback_to_timestamp ("timestamp" = "<timestamp>")
Parameters:
| Parameter Name | Type | Required | Description |
|---|---|---|---|
timestamp | String | Yes | The timestamp to rollback to. Supports two formats: 1. ISO datetime format: yyyy-MM-dd HH:mm:ss.SSS (e.g., 2024-01-01 10:30:00.000)2. Milliseconds timestamp (e.g., 1704067200000) |
Return Value:
Executing rollback_to_timestamp returns a result set with the following 2 columns:
| Column Name | Type | Description |
|---|---|---|
previous_snapshot_id | BIGINT | The snapshot ID that was current before the rollback operation |
current_snapshot_id | BIGINT | The snapshot ID that was current at the specified timestamp, now set as current |
Examples:
-- Rollback to specified timestamp (ISO format)
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rollback_to_timestamp ("timestamp" = "2024-01-01 10:30:00.000");
-- Rollback to specified timestamp (timestamp format)
ALTER TABLE iceberg_db.iceberg_table
EXECUTE rollback_to_timestamp ("timestamp" = "1704067200000");
Notes:
- This operation does not support WHERE conditions
- The timestamp parameter supports ISO datetime format (yyyy-MM-dd HH:mm:ss.SSS) or milliseconds timestamp format
- If there is no snapshot corresponding to the specified timestamp, the operation will fail
set_current_snapshot
The set_current_snapshot operation sets the current snapshot of an Iceberg table to a specified snapshot ID or reference (branch or tag).
Syntax:
ALTER TABLE [catalog.][database.]table_name
EXECUTE set_current_snapshot ("snapshot_id" = "<snapshot_id>" | "ref" = "<ref_name>")
Parameters:
| Parameter Name | Type | Required | Description |
|---|---|---|---|
snapshot_id | Long | Either one | The snapshot ID to set as current |
ref | String | Either one | The reference name (branch or tag) to set as current |
Return Value:
Executing set_current_snapshot returns a result set with the following 2 columns:
| Column Name | Type | Description |
|---|---|---|
previous_snapshot_id | BIGINT | The snapshot ID that was current before setting the new current snapshot |
current_snapshot_id | BIGINT | The snapshot ID that is now set as current (from snapshot_id parameter or resolved from ref parameter) |
Examples:
-- Set current snapshot by snapshot ID
ALTER TABLE iceberg_db.iceberg_table
EXECUTE set_current_snapshot ("snapshot_id" = "123456789");
-- Set current snapshot by branch name
ALTER TABLE iceberg_db.iceberg_table
EXECUTE set_current_snapshot ("ref" = "feature_branch");
-- Set current snapshot by tag name
ALTER TABLE iceberg_db.iceberg_table
EXECUTE set_current_snapshot ("ref" = "v1.0");
Notes:
- This operation does not support WHERE conditions
- The snapshot_id and ref parameters are mutually exclusive; only one can be specified
- If the current snapshot is already the target snapshot, the operation returns directly without creating a new snapshot
Iceberg Table Optimization
View Data File Distribution
You can use the following SQL to analyze the data distribution and delete file count of Iceberg tables, helping determine whether optimization like rewrite_data_files is needed.
-
Display the data file size distribution. This can help identify whether there are too many small files that could be compacted:
SELECT
CASE
WHEN file_size_in_bytes BETWEEN 0 AND 8 * 1024 * 1024 THEN '0-8M'
WHEN file_size_in_bytes BETWEEN 8 * 1024 * 1024 + 1 AND 32 * 1024 * 1024 THEN '8-32M'
WHEN file_size_in_bytes BETWEEN 32 * 1024 * 1024 + 1 AND 128 * 1024 * 1024 THEN '32-128M'
WHEN file_size_in_bytes BETWEEN 128 * 1024 * 1024 + 1 AND 512 * 1024 * 1024 THEN '128-512M'
WHEN file_size_in_bytes > 512 * 1024 * 1024 THEN '> 512M'
ELSE 'Unknown'
END AS SizeRange,
COUNT(*) AS FileNum
FROM store_sales$data_files
GROUP BY
SizeRange;
+-----------+---------+
| SizeRange | FileNum |
+-----------+---------+
| 0-8M | 8 |
| 8-32M | 6 |
+-----------+---------+
-
Display the number of data files and delete files
SELECT
CASE
WHEN content = 0 THEN 'DataFile'
WHEN content = 1 THEN 'PositionDeleteFile'
WHEN content = 2 THEN 'EqualityDeleteFile'
ELSE 'Unknown'
END AS ContentType,
COUNT(*) AS FileNum,
SUM(file_size_in_bytes) AS SizeInBytes,
SUM(record_count) AS Records
FROM
iceberg_table$files
GROUP BY
ContentType;
+--------------------+---------+-------------+---------+
| ContentType | FileNum | SizeInBytes | Records |
+--------------------+---------+-------------+---------+
| EqualityDeleteFile | 2787 | 1432518 | 27870 |
| DataFile | 2787 | 4062416 | 38760 |
| PositionDeleteFile | 11 | 36608 | 10890 |
+--------------------+---------+-------------+---------+
Appendix
rewrite_data_files File Selection Strategy
You can check this section to understand the logic for selecting files to be rewritten. You can also use the following SQL queries to analyze which files will be rewritten based on system table information. This helps you debug and understand rewrite behavior before executing actual rewrite operations.
The rewrite operation uses a two-stage filtering and grouping strategy:
-
Stage 1: File-level filtering
First, files are filtered based on file-level conditions. Only files that meet at least one of the following conditions will be considered for rewriting:
- File size out of range: File is smaller than min-file-size-bytes or larger than max-file-size-bytes
- Too many delete files: Number of associated delete files >= delete-file-threshold
- High delete ratio: Delete record ratio >= delete-ratio-threshold
Note about delete file conditions: Conditions related to delete files (delete-file-threshold and delete-ratio-threshold) affect file rewrite selection, but cannot be accurately calculated through system table SQL queries. This is because current system table information is insufficient to accurately determine which data files are affected by delete files. Delete files may only affect specific data files (especially Position Delete files), and the association between delete files and data files requires more detailed metadata that is not available in current system tables. Actual rewrite operations will evaluate these conditions based on more detailed metadata not exposed in system tables.
-
Stage 2: BinPack grouping and group-level filtering
After file-level filtering, files are grouped by partition, then within each partition, files are grouped using the BinPack algorithm based on file size and max-file-group-size-bytes. The BinPack algorithm intelligently packs files into groups to optimize rewrite operations.
After grouping, file groups are filtered based on group-level conditions. File groups will be rewritten when they meet at least one of the following conditions:
- Sufficient input files: Number of files in group > 1 and file count >= min-input-files
- Sufficient content: Number of files in group > 1 and total size > target-file-size-bytes
- Too much content: Group total size > max-file-group-size-bytes
- Contains files with delete file issues: At least one file in the group has too many delete files or high delete ratio
Note about BinPack grouping: The BinPack algorithm used to group files within each partition is a complex optimization algorithm that cannot be accurately simulated through system table SQL queries. This algorithm considers file size and max-file-group-size-bytes to intelligently pack files into groups, which may result in different grouping results than simple partition-based grouping. Therefore, the SQL queries provided below can only identify files that meet file-level conditions, but cannot accurately calculate the final file groups that will be rewritten.
Here are some example query strategies:
-
Query 1: Identify files meeting file-level conditions
This query identifies files that meet file-level conditions (file size) and will be considered for rewriting:
-- Set parameters (adjust according to your rewrite parameters)
SET @min_file_size_bytes = 16 * 1024 * 1024;  -- 16 MB (example value; the default is target-file-size-bytes * 0.75)
SET @max_file_size_bytes = 768 * 1024 * 1024; -- 768 MB (example value; the default is target-file-size-bytes * 1.8)
SELECT
file_path,
`partition`,
file_size_in_bytes / 1024.0 / 1024.0 AS file_size_mb,
record_count,
CASE
WHEN file_size_in_bytes < @min_file_size_bytes THEN 'Too small'
WHEN file_size_in_bytes > @max_file_size_bytes THEN 'Too large'
END AS size_issue
FROM iceberg_table$data_files
WHERE file_size_in_bytes < @min_file_size_bytes
OR file_size_in_bytes > @max_file_size_bytes
ORDER BY `partition`, file_size_in_bytes DESC;
Sample output:
+----------------------------------------------------------------------------------------+---------------------------+-------------+--------------+------------+
| file_path | partition | file_size_mb | record_count | size_issue |
+----------------------------------------------------------------------------------------+---------------------------+-------------+--------------+------------+
| s3a://bucket/path/to/table/data/date_col=2024-01-01/00000-0-00000-00000-0.parquet | {"date_col":"2024-01-01"} | 0.00215 | 5 | Too small |
| s3a://bucket/path/to/table/data/date_col=2024-01-01/00001-0-00001-00001-0.parquet | {"date_col":"2024-01-01"} | 0.00198 | 5 | Too small |
| s3a://bucket/path/to/table/data/date_col=2024-01-02/00002-0-00002-00002-0.parquet | {"date_col":"2024-01-02"} | 0.00231 | 5 | Too small |
| s3a://bucket/path/to/table/data/date_col=2024-01-02/00003-0-00003-00003-0.parquet | {"date_col":"2024-01-02"} | 850.23456 | 1250000 | Too large |
| s3a://bucket/path/to/table/data/date_col=2024-01-03/00004-0-00004-00004-0.parquet | {"date_col":"2024-01-03"} | 0.00245 | 5 | Too small |
+----------------------------------------------------------------------------------------+---------------------------+-------------+--------------+------------+
5 rows in set
Note: This query can only identify files that meet file-level conditions (file size). The actual file groups to be rewritten depend on the BinPack grouping algorithm, which cannot be accurately simulated through SQL queries. Files identified by this query may be grouped differently by the BinPack algorithm, and some groups may be filtered out based on group-level conditions.
-
Query 2: Overall statistics summary
This query provides overall statistics about files that meet file-level conditions:
-- Set parameters (adjust according to your rewrite parameters)
SET @min_file_size_bytes = 16 * 1024 * 1024;
SET @max_file_size_bytes = 768 * 1024 * 1024;
WITH file_analysis AS (
SELECT
`partition`,
file_path,
file_size_in_bytes,
record_count,
(file_size_in_bytes < @min_file_size_bytes OR file_size_in_bytes > @max_file_size_bytes) AS meets_file_level_conditions
FROM iceberg_table$data_files
)
SELECT
'Total files' AS metric,
COUNT(*) AS value
FROM file_analysis
UNION ALL
SELECT
'Files meeting file-level conditions',
SUM(CASE WHEN meets_file_level_conditions THEN 1 ELSE 0 END)
FROM file_analysis
UNION ALL
SELECT
'Total size (GB)',
ROUND(SUM(file_size_in_bytes) / 1024.0 / 1024.0 / 1024.0, 2)
FROM file_analysis
UNION ALL
SELECT
'Size meeting file-level conditions (GB)',
ROUND(SUM(CASE WHEN meets_file_level_conditions THEN file_size_in_bytes ELSE 0 END) / 1024.0 / 1024.0 / 1024.0, 2)
FROM file_analysis
UNION ALL
SELECT
'Percentage meeting file-level conditions (%)',
ROUND(SUM(CASE WHEN meets_file_level_conditions THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2)
FROM file_analysis;
Sample output:
+----------------------------------------------+--------+
| metric | value |
+----------------------------------------------+--------+
| Total files | 15.00 |
| Total size (GB) | 2.45 |
| Files meeting file-level conditions | 12.00 |
| Size meeting file-level conditions (GB) | 1.85 |
| Percentage meeting file-level conditions (%) | 80.00 |
+----------------------------------------------+--------+
5 rows in set
Notes:
- Replace iceberg_table in the example with the actual table name
- Adjust parameter values (@min_file_size_bytes, @max_file_size_bytes) according to your actual rewrite parameters
- The values shown above are examples; by default, min-file-size-bytes = target-file-size-bytes * 0.75 and max-file-size-bytes = target-file-size-bytes * 1.8
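In addition to the two queries above, the "sufficient input files" group-level condition can be roughly approximated per partition with a query like the one below. This is only a sketch: it ignores BinPack grouping and delete-file conditions, and assumes the default min-input-files of 5. Replace iceberg_table and the thresholds with your own values.
-- Count small files per partition as a rough indicator of compaction candidates
SET @min_file_size_bytes = 16 * 1024 * 1024;
SELECT
    `partition`,
    COUNT(*) AS small_file_num,
    ROUND(SUM(file_size_in_bytes) / 1024.0 / 1024.0, 2) AS small_file_total_mb
FROM iceberg_table$data_files
WHERE file_size_in_bytes < @min_file_size_bytes
GROUP BY `partition`
HAVING COUNT(*) >= 5 -- default min-input-files
ORDER BY small_file_num DESC;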
-
View snapshot and branch relationships
SELECT
refs_data.snapshot_id,
snapshots.committed_at,
snapshots.operation,
ARRAY_SORT(refs_data.refs)
FROM (
SELECT
snapshot_id,
ARRAY_AGG(CONCAT(type, ':', name)) AS refs
FROM
iceberg_table$refs
GROUP BY
snapshot_id
) AS refs_data
JOIN (
SELECT
snapshot_id,
committed_at,
operation
FROM
iceberg_table$snapshots
) AS snapshots
ON refs_data.snapshot_id = snapshots.snapshot_id
ORDER BY
snapshots.committed_at;
+---------------------+----------------------------+-----------+-------------------------------------+
| snapshot_id | committed_at | operation | ARRAY_SORT(refs_data.refs) |
+---------------------+----------------------------+-----------+-------------------------------------+
| 8272911997874079853 | 2025-07-10 15:27:07.177000 | append | ["BRANCH:b1", "TAG:t1"] |
| 1325777059626757917 | 2025-07-10 15:27:07.530000 | append | ["BRANCH:b2", "TAG:t2"] |
| 76492482642020578 | 2025-07-10 15:27:07.865000 | append | ["BRANCH:b3", "TAG:t3"] |
| 1788715857849070138 | 2025-07-12 04:15:19.626000 | append | ["BRANCH:main", "TAG:t4", "TAG:t5"] |
+---------------------+----------------------------+-----------+-------------------------------------+