Skip to main content

Importing Data from Google Cloud Storage (GCS)

Apache Doris supports importing files from Google Cloud Storage (GCS) by accessing GCS buckets through the S3-compatible protocol. This document describes two import methods and their complete operating procedures.

Choosing an Approach

Doris provides two methods for importing files from Google Cloud Storage. Choose one based on data volume and timeliness requirements:

Import MethodExecution ModeApplicable ScenarioReference
S3 LoadAsynchronousLarge-batch data imports, long-running jobsBroker Load Manual
S3 TVF (table function)SynchronousSmall-batch data imports, ad-hoc queries, quick validationThis document

Recommendations:

  • For larger data volumes or jobs that need to run in the background, use S3 Load.
  • For immediate results or use with INSERT INTO ... SELECT, use S3 TVF.

Prerequisites

Prepare the following information before importing:

  • Google Cloud Storage bucket name (your_bucket_name).
  • Access credentials: Access Key and Secret Key.
  • GCS Endpoint and Region (for example, storage.us-west2.rep.googleapis.com and US-WEST2).
  • A deployed and accessible Apache Doris cluster.

Method 1: Import with S3 Load (Asynchronous)

S3 Load is suitable for asynchronous imports of large data volumes. For detailed parameters and advanced usage, see the Broker Load Manual.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a Table in Doris

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import the Data with S3 Load

LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
"timeout" = "3600"
);

Key parameters:

ParameterDescription
providerObject storage provider. For GCS, this is fixed at GCP
s3.endpointGCS S3-compatible access endpoint
s3.regionRegion where the GCS bucket is located
s3.access_keyGCS access key ID
s3.secret_keyGCS secret access key
timeoutImport timeout, in seconds

Step 4: Verify the Imported Data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Method 2: Import with TVF (Synchronous)

The S3 table function (TVF) is suitable for synchronous imports and ad-hoc queries, and works directly with INSERT INTO ... SELECT.

Step 1: Prepare the Data

Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a Table in Doris

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import the Data with TVF

INSERT INTO test_s3load
SELECT * FROM S3
(
"uri" = "s3://your_bucket_name/s3load_example.csv",
"format" = "csv",
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>",
"column_separator" = ",",
"csv_schema" = "user_id:int;name:string;age:int"
);

Key parameters:

ParameterDescription
uriS3 URI of the file in object storage
formatFile format, such as csv, parquet, or orc
providerObject storage provider. For GCS, this is fixed at GCP
s3.endpointGCS S3-compatible access endpoint
s3.regionRegion where the GCS bucket is located
s3.access_keyGCS access key ID
s3.secret_keyGCS secret access key
column_separatorColumn separator
csv_schemaCSV column definitions, in the format column_name:type;column_name:type

Step 4: Verify the Imported Data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

FAQ

Q1: What value should the provider parameter use?

When importing from Google Cloud Storage, provider must be set to GCP so that Doris accesses GCS through its S3-compatible protocol.

Q2: How do I choose between S3 Load and S3 TVF?

  • For large data volumes that need to run asynchronously in the background, choose S3 Load.
  • For small data volumes that need immediate results or that work alongside SQL queries, choose S3 TVF.

Q3: How do I obtain the Access Key and Secret Key for GCS?

You can create and manage HMAC keys (Access Key / Secret Key) for S3-compatible access on the Cloud Storage > Settings > Interoperability page in the Google Cloud Console.

Q4: How do I determine the Endpoint and Region?

The Endpoint and Region depend on the region where the bucket is located. For example, the US-WEST2 region corresponds to the endpoint storage.us-west2.rep.googleapis.com. Replace these values according to the actual location of your bucket.

Q5: What should I do if the import fails with a timeout?

Increase the timeout parameter in PROPERTIES (in seconds). The default is 3600. For very large files or slow networks, extend it as needed.