S3-Compatible Storage

Apache Doris supports loading data from various object storage systems that are compatible with the S3 protocol, such as MinIO, Ceph, Huawei Cloud OBS, and Tencent Cloud COS. This document describes the applicable scenarios and complete steps for two loading methods.

Choosing a Method

Choose the appropriate loading method based on data size and processing mode:

Loading Method	Mode	Applicable Scenario	Reference
S3 Load	Asynchronous	Large-batch data loading, requires job management and retry mechanisms	Broker Load Manual
TVF (Table Value Function)	Synchronous	Ad-hoc queries, small-batch data loading, consuming data directly in SQL	-

Note

The S3 SDK accesses object storage in virtual-hosted style by default. If the target storage system does not enable or support this style, add the parameter "use_path_style" = "true" to force the use of path style.

Loading with S3 Load (Asynchronous)

S3 Load is suitable for large-scale batch loading. Loading jobs are submitted asynchronously and scheduled by Doris in the background. For the detailed mechanism, refer to the Broker Load Manual.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv on S3-compatible storage with the following content:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create the Target Table in Doris

CREATE TABLE test_s3load(
    user_id            BIGINT       NOT NULL COMMENT "user id",
    name               VARCHAR(20)           COMMENT "name",
    age                INT                   COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Submit the S3 Load Job

LOAD LABEL s3_load_2022_04_01
(
    DATA INFILE("s3://your_bucket_name/s3load_example.csv")
    INTO TABLE test_s3load
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (user_id, name, age)
)
WITH S3
(
    "provider" = "S3",
    "s3.endpoint" = "play.min.io:9000",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your-ak>",
    "s3.secret_key" = "<your-sk>",
    "use_path_style" = "true"
)
PROPERTIES
(
    "timeout" = "3600"
);

Key parameters:

Parameter	Description
`provider`	Object storage provider type. For S3-compatible storage, always use `S3`.
`s3.endpoint`	The access endpoint of the S3-compatible storage (without protocol prefix).
`s3.region`	The region where the bucket is located.
`s3.access_key`	The access key (AK).
`s3.secret_key`	The secret key (SK).
`use_path_style`	Whether to use path style access. Storage systems such as MinIO usually require this to be set to `true`.
`timeout`	The timeout of the loading job, in seconds.

Step 4: Verify the Loading Result

SELECT * FROM test_s3load;

Expected output:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name      | age  |
+---------+-----------+------+
|       5 | Ava       |   17 |
|      10 | Liam      |   64 |
|       7 | Sophia    |   32 |
|       9 | Emma      |   37 |
|       1 | Emily     |   25 |
|       4 | Alexander |   60 |
|       2 | Benjamin  |   35 |
|       3 | Olivia    |   28 |
|       6 | William   |   69 |
|       8 | James     |   64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Loading with TVF (Synchronous)

TVF treats files in S3-compatible storage as a table for querying. Combined with INSERT INTO ... SELECT, it enables synchronous loading and is suitable for ad-hoc analysis and small-batch data loading.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv on S3-compatible storage with the following content:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create the Target Table in Doris

CREATE TABLE test_s3load(
    user_id            BIGINT       NOT NULL COMMENT "user id",
    name               VARCHAR(20)           COMMENT "name",
    age                INT                   COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Load Synchronously via TVF

INSERT INTO test_s3load
SELECT * FROM S3
(
    "uri" = "s3://your_bucket_name/s3load_example.csv",
    "format" = "csv",
    "provider" = "S3",
    "s3.endpoint" = "play.min.io:9000",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your-ak>",
    "s3.secret_key" = "<your-sk>",
    "column_separator" = ",",
    "csv_schema" = "user_id:int;name:string;age:int",
    "use_path_style" = "true"
);

Key parameters:

Parameter	Description
`uri`	The full path of the file in S3-compatible storage.
`format`	The file format, such as `csv`, `parquet`, or `orc`.
`provider`	Object storage provider type. For S3-compatible storage, always use `S3`.
`s3.endpoint`	The access endpoint of the S3-compatible storage.
`s3.region`	The region where the bucket is located.
`s3.access_key`	The access key (AK).
`s3.secret_key`	The secret key (SK).
`column_separator`	The column separator.
`csv_schema`	The column definition of the CSV file, in the format `field_name:type`.
`use_path_style`	Whether to use path style access.

Step 4: Verify the Loading Result

SELECT * FROM test_s3load;

Expected output:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name      | age  |
+---------+-----------+------+
|       5 | Ava       |   17 |
|      10 | Liam      |   64 |
|       7 | Sophia    |   32 |
|       9 | Emma      |   37 |
|       1 | Emily     |   25 |
|       4 | Alexander |   60 |
|       2 | Benjamin  |   35 |
|       3 | Olivia    |   28 |
|       6 | William   |   69 |
|       8 | James     |   64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

FAQ

1. Why does accessing storage such as MinIO report that the bucket does not exist or cannot be resolved?

The S3 SDK builds the domain name in virtual-hosted style by default (such as bucket.endpoint), while MinIO, Ceph, and similar systems usually use path style (such as endpoint/bucket). Add "use_path_style" = "true" to the loading parameters to resolve this.

2. Does `s3.endpoint` need to include the `http://` or `https://` prefix?

No. s3.endpoint only takes the host name and port (such as play.min.io:9000).

3. How do I choose between S3 Load and TVF?

For large data volumes where you want asynchronous execution and a preserved job history, use S3 Load.
When you need to read S3 files in a single SQL statement and write the result synchronously with INSERT INTO ... SELECT, use TVF.

4. What should I do if the loading job times out?

Increase the timeout property of S3 Load (in seconds), or split a large file into multiple smaller files to improve concurrency.

Choosing a Method​

Loading with S3 Load (Asynchronous)​

Step 1: Prepare the Data​

Step 2: Create the Target Table in Doris​

Step 3: Submit the S3 Load Job​

Step 4: Verify the Loading Result​

Loading with TVF (Synchronous)​

Step 1: Prepare the Data​

Step 2: Create the Target Table in Doris​

Step 3: Load Synchronously via TVF​

Step 4: Verify the Loading Result​

FAQ​

1. Why does accessing storage such as MinIO report that the bucket does not exist or cannot be resolved?​

2. Does s3.endpoint need to include the http:// or https:// prefix?​

3. How do I choose between S3 Load and TVF?​

4. What should I do if the loading job times out?​

Choosing a Method

Loading with S3 Load (Asynchronous)

Step 1: Prepare the Data

Step 2: Create the Target Table in Doris

Step 3: Submit the S3 Load Job

Step 4: Verify the Loading Result

Loading with TVF (Synchronous)

Step 1: Prepare the Data

Step 2: Create the Target Table in Doris

Step 3: Load Synchronously via TVF

Step 4: Verify the Loading Result

FAQ

1. Why does accessing storage such as MinIO report that the bucket does not exist or cannot be resolved?

2. Does `s3.endpoint` need to include the `http://` or `https://` prefix?

3. How do I choose between S3 Load and TVF?

4. What should I do if the loading job times out?