Importing Data from Google Cloud Storage (GCS)
Apache Doris supports importing files from Google Cloud Storage (GCS) by accessing GCS buckets through the S3-compatible protocol. This document describes two import methods and their complete operating procedures.
Choosing an Approach
Doris provides two methods for importing files from Google Cloud Storage. Choose one based on data volume and timeliness requirements:
| Import Method | Execution Mode | Applicable Scenario | Reference |
|---|---|---|---|
| S3 Load | Asynchronous | Large-batch data imports, long-running jobs | Broker Load Manual |
| S3 TVF (table function) | Synchronous | Small-batch data imports, ad-hoc queries, quick validation | This document |
Recommendations:
- For larger data volumes or jobs that need to run in the background, use S3 Load.
- For immediate results or use with
INSERT INTO ... SELECT, use S3 TVF.
Prerequisites
Prepare the following information before importing:
- Google Cloud Storage bucket name (
your_bucket_name). - Access credentials: Access Key and Secret Key.
- GCS Endpoint and Region (for example,
storage.us-west2.rep.googleapis.comandUS-WEST2). - A deployed and accessible Apache Doris cluster.
Method 1: Import with S3 Load (Asynchronous)
S3 Load is suitable for asynchronous imports of large data volumes. For detailed parameters and advanced usage, see the Broker Load Manual.
Step 1: Prepare the Data
Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:
1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64
Step 2: Create a Table in Doris
CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
Step 3: Import the Data with S3 Load
LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
"timeout" = "3600"
);
Key parameters:
| Parameter | Description |
|---|---|
provider | Object storage provider. For GCS, this is fixed at GCP |
s3.endpoint | GCS S3-compatible access endpoint |
s3.region | Region where the GCS bucket is located |
s3.access_key | GCS access key ID |
s3.secret_key | GCS secret access key |
timeout | Import timeout, in seconds |
Step 4: Verify the Imported Data
SELECT * FROM test_s3load;
Expected result:
mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)
Method 2: Import with TVF (Synchronous)
The S3 table function (TVF) is suitable for synchronous imports and ad-hoc queries, and works directly with INSERT INTO ... SELECT.
Step 1: Prepare the Data
Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:
1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64
Step 2: Create a Table in Doris
CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
Step 3: Import the Data with TVF
INSERT INTO test_s3load
SELECT * FROM S3
(
"uri" = "s3://your_bucket_name/s3load_example.csv",
"format" = "csv",
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>",
"column_separator" = ",",
"csv_schema" = "user_id:int;name:string;age:int"
);
Key parameters:
| Parameter | Description |
|---|---|
uri | S3 URI of the file in object storage |
format | File format, such as csv, parquet, or orc |
provider | Object storage provider. For GCS, this is fixed at GCP |
s3.endpoint | GCS S3-compatible access endpoint |
s3.region | Region where the GCS bucket is located |
s3.access_key | GCS access key ID |
s3.secret_key | GCS secret access key |
column_separator | Column separator |
csv_schema | CSV column definitions, in the format column_name:type;column_name:type |
Step 4: Verify the Imported Data
SELECT * FROM test_s3load;
Expected result:
mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)
FAQ
Q1: What value should the provider parameter use?
When importing from Google Cloud Storage, provider must be set to GCP so that Doris accesses GCS through its S3-compatible protocol.
Q2: How do I choose between S3 Load and S3 TVF?
- For large data volumes that need to run asynchronously in the background, choose S3 Load.
- For small data volumes that need immediate results or that work alongside SQL queries, choose S3 TVF.
Q3: How do I obtain the Access Key and Secret Key for GCS?
You can create and manage HMAC keys (Access Key / Secret Key) for S3-compatible access on the Cloud Storage > Settings > Interoperability page in the Google Cloud Console.
Q4: How do I determine the Endpoint and Region?
The Endpoint and Region depend on the region where the bucket is located. For example, the US-WEST2 region corresponds to the endpoint storage.us-west2.rep.googleapis.com. Replace these values according to the actual location of your bucket.
Q5: What should I do if the import fails with a timeout?
Increase the timeout parameter in PROPERTIES (in seconds). The default is 3600. For very large files or slow networks, extend it as needed.