Importing Data from Google Cloud Storage (GCS)

Apache Doris supports importing files from Google Cloud Storage (GCS) by accessing GCS buckets through the S3-compatible protocol. This document describes two import methods and their complete operating procedures.

Choosing an Approach

Doris provides two methods for importing files from Google Cloud Storage. Choose one based on data volume and timeliness requirements:

Import Method	Execution Mode	Applicable Scenario	Reference
S3 Load	Asynchronous	Large-batch data imports, long-running jobs	Broker Load Manual
S3 TVF (table function)	Synchronous	Small-batch data imports, ad-hoc queries, quick validation	This document

Recommendations:

For larger data volumes or jobs that need to run in the background, use S3 Load.
For immediate results or use with INSERT INTO ... SELECT, use S3 TVF.

Prerequisites

Prepare the following information before importing:

Google Cloud Storage bucket name (your_bucket_name).
Access credentials: Access Key and Secret Key.
GCS Endpoint and Region (for example, storage.us-west2.rep.googleapis.com and US-WEST2).
A deployed and accessible Apache Doris cluster.

Method 1: Import with S3 Load (Asynchronous)

S3 Load is suitable for asynchronous imports of large data volumes. For detailed parameters and advanced usage, see the Broker Load Manual.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a Table in Doris

CREATE TABLE test_s3load(
    user_id            BIGINT       NOT NULL COMMENT "user id",
    name               VARCHAR(20)           COMMENT "name",
    age                INT                   COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import the Data with S3 Load

LOAD LABEL s3_load_2022_04_01
(
    DATA INFILE("s3://your_bucket_name/s3load_example.csv")
    INTO TABLE test_s3load
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
    (user_id, name, age)
)
WITH S3
(
    "provider" = "GCP",
    "s3.endpoint" = "storage.us-west2.rep.googleapis.com",
    "s3.region" = "US-WEST2",
    "s3.access_key" = "<your-ak>",
    "s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
    "timeout" = "3600"
);

Key parameters:

Parameter	Description
`provider`	Object storage provider. For GCS, this is fixed at `GCP`
`s3.endpoint`	GCS S3-compatible access endpoint
`s3.region`	Region where the GCS bucket is located
`s3.access_key`	GCS access key ID
`s3.secret_key`	GCS secret access key
`timeout`	Import timeout, in seconds

Step 4: Verify the Imported Data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name      | age  |
+---------+-----------+------+
|       5 | Ava       |   17 |
|      10 | Liam      |   64 |
|       7 | Sophia    |   32 |
|       9 | Emma      |   37 |
|       1 | Emily     |   25 |
|       4 | Alexander |   60 |
|       2 | Benjamin  |   35 |
|       3 | Olivia    |   28 |
|       6 | William   |   69 |
|       8 | James     |   64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Method 2: Import with TVF (Synchronous)

The S3 table function (TVF) is suitable for synchronous imports and ad-hoc queries, and works directly with INSERT INTO ... SELECT.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a Table in Doris

CREATE TABLE test_s3load(
    user_id            BIGINT       NOT NULL COMMENT "user id",
    name               VARCHAR(20)           COMMENT "name",
    age                INT                   COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import the Data with TVF

INSERT INTO test_s3load
SELECT * FROM S3
(
    "uri" = "s3://your_bucket_name/s3load_example.csv",
    "format" = "csv",
    "provider" = "GCP",
    "s3.endpoint" = "storage.us-west2.rep.googleapis.com",
    "s3.region" = "US-WEST2",
    "s3.access_key" = "<your-ak>",
    "s3.secret_key" = "<your-sk>",
    "column_separator" = ",",
    "csv_schema" = "user_id:int;name:string;age:int"
);

Key parameters:

Parameter	Description
`uri`	S3 URI of the file in object storage
`format`	File format, such as `csv`, `parquet`, or `orc`
`provider`	Object storage provider. For GCS, this is fixed at `GCP`
`s3.endpoint`	GCS S3-compatible access endpoint
`s3.region`	Region where the GCS bucket is located
`s3.access_key`	GCS access key ID
`s3.secret_key`	GCS secret access key
`column_separator`	Column separator
`csv_schema`	CSV column definitions, in the format `column_name:type;column_name:type`

Step 4: Verify the Imported Data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name      | age  |
+---------+-----------+------+
|       5 | Ava       |   17 |
|      10 | Liam      |   64 |
|       7 | Sophia    |   32 |
|       9 | Emma      |   37 |
|       1 | Emily     |   25 |
|       4 | Alexander |   60 |
|       2 | Benjamin  |   35 |
|       3 | Olivia    |   28 |
|       6 | William   |   69 |
|       8 | James     |   64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

FAQ

Q1: What value should the provider parameter use?

When importing from Google Cloud Storage, provider must be set to GCP so that Doris accesses GCS through its S3-compatible protocol.

Q2: How do I choose between S3 Load and S3 TVF?

For large data volumes that need to run asynchronously in the background, choose S3 Load.
For small data volumes that need immediate results or that work alongside SQL queries, choose S3 TVF.

Q3: How do I obtain the Access Key and Secret Key for GCS?

You can create and manage HMAC keys (Access Key / Secret Key) for S3-compatible access on the Cloud Storage > Settings > Interoperability page in the Google Cloud Console.

Q4: How do I determine the Endpoint and Region?

The Endpoint and Region depend on the region where the bucket is located. For example, the US-WEST2 region corresponds to the endpoint storage.us-west2.rep.googleapis.com. Replace these values according to the actual location of your bucket.

Q5: What should I do if the import fails with a timeout?

Increase the timeout parameter in PROPERTIES (in seconds). The default is 3600. For very large files or slow networks, extend it as needed.

Choosing an Approach​

Prerequisites​

Method 1: Import with S3 Load (Asynchronous)​

Step 1: Prepare the Data​

Step 2: Create a Table in Doris​

Step 3: Import the Data with S3 Load​

Step 4: Verify the Imported Data​

Method 2: Import with TVF (Synchronous)​

Step 1: Prepare the Data​

Step 2: Create a Table in Doris​

Step 3: Import the Data with TVF​

Step 4: Verify the Imported Data​

FAQ​

Choosing an Approach

Prerequisites

Method 1: Import with S3 Load (Asynchronous)

Step 1: Prepare the Data

Step 2: Create a Table in Doris

Step 3: Import the Data with S3 Load

Step 4: Verify the Imported Data

Method 2: Import with TVF (Synchronous)

Step 1: Prepare the Data

Step 2: Create a Table in Doris

Step 3: Import the Data with TVF

Step 4: Verify the Imported Data

FAQ