Skip to main content

Azure Storage

Apache Doris supports importing files from Azure Storage (Azure Blob Storage). This document describes two typical import methods. You can choose the appropriate option based on data volume and real-time requirements.

Method selection

The following table compares the core differences between the two import methods, helping you choose quickly:

Import methodExecution modeApplicable scenarioReference document
S3 LoadAsynchronousLarge-scale data import, background scheduling requiredBroker Load Manual
TVF (Table-Valued Function)SynchronousAd hoc queries, small-batch data import, quick verification

Prerequisites

Before importing data from Azure Storage using either method, confirm the following configurations:

  • HTTPS transport: Azure Storage requires HTTPS transport by default (corresponding to the storage account configuration Secure transfer required: Enabled). You must set s3_client_http_scheme = https in Doris be.conf, otherwise it cannot be accessed properly.
  • Region can be omitted: In the properties for Azure-compatible S3 protocol, the s3.region parameter can be omitted.
  • Access credentials: Prepare the Access Key (AK) and Secret Key (SK) of the Azure Storage account.
  • Endpoint format: Use an Endpoint address in the form of <StorageAccount>.blob.core.windows.net.
Caution

If HTTPS is not enabled in the BE configuration, the import will fail due to a protocol mismatch. Complete the above configuration before performing import operations.

Method 1: Import using S3 Load (asynchronous)

S3 Load is an asynchronous batch import method, suitable for importing large-scale data from Azure Storage into Doris. The complete steps are as follows.

Step 1: Prepare data

Create a CSV file s3load_example.csv on Azure Storage with the following content:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import data using S3 Load

Run the following SQL to submit an S3 Load task:

LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "AZURE",
"s3.endpoint" = "StorageAccountA.blob.core.windows.net",
"s3.region" = "westus3",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
"timeout" = "3600"
);

Key parameter descriptions:

ParameterDescription
providerMust be set to AZURE to identify the object storage provider
s3.endpointAzure Blob service address, in the format <StorageAccount>.blob.core.windows.net
s3.regionCan be omitted in Azure scenarios
s3.access_keyAccess Key of the Azure Storage account
s3.secret_keySecret Key of the Azure Storage account
timeoutImport task timeout (in seconds)

Step 4: Verify the imported data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Method 2: Import using TVF (synchronous)

TVF (Table-Valued Function) is a synchronous import method that can read and write data within a single SQL statement, suitable for small-batch data or ad hoc scenarios.

Step 1: Prepare data

Create a CSV file s3load_example.csv on Azure Storage with the following content:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris

CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Import data using TVF

Import data directly via INSERT INTO ... SELECT FROM S3(...):

INSERT INTO test_s3load
SELECT * FROM S3
(
"uri" = "s3://your_bucket_name/s3load_example.csv",
"format" = "csv",
"provider" = "AZURE",
"s3.endpoint" = "StorageAccountA.blob.core.windows.net",
"s3.region" = "westus3",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>",
"column_separator" = ",",
"csv_schema" = "user_id:int;name:string;age:int"
);

Key parameter descriptions:

ParameterDescription
uriObject path on Azure Storage, in the format s3://<bucket>/<object>
formatFile format, such as csv, parquet, orc, etc.
providerMust be set to AZURE
s3.endpointAzure Blob service address
s3.regionCan be omitted in Azure scenarios
s3.access_key / s3.secret_keyAzure Storage access credentials
column_separatorColumn separator (applicable to CSV)
csv_schemaCSV column definition, in the format column_name:type;column_name:type

Step 4: Verify the imported data

SELECT * FROM test_s3load;

Expected result:

mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

FAQ

Q1: How should you choose between S3 Load and TVF?

  • S3 Load: Executes asynchronously, suitable for large-batch data, scenarios that require tracking import status by Label, or scheduling as periodic tasks.
  • TVF: Executes synchronously, suitable for small-batch data, ad hoc queries, quick verification of file content, or scenarios that require transforming data in SELECT before writing.

Q2: Why does an error or connection failure occur when accessing Azure Storage?

Confirm the following:

  1. s3_client_http_scheme = https is set in be.conf on the BE node, and the BE has been restarted to take effect.
  2. When the Azure Storage account has Secure transfer required enabled, HTTPS must be used.
  3. The format of s3.endpoint is correct: <StorageAccount>.blob.core.windows.net, without a protocol prefix.
  4. The AK/SK is valid and has read permission for the target Bucket.

Q3: Is s3.region required?

No. In Azure scenarios, s3.region can be omitted. If specified, it does not affect the import process.

Q4: What format should the URI use?

Whether for DATA INFILE in S3 Load or uri in TVF, use the form s3://<bucket_name>/<object_path> to access objects on Azure Storage.

Troubleshooting

SymptomPossible causeSolution
Import task reports a connection errorHTTPS is not enabled on BESet s3_client_http_scheme = https in be.conf and restart BE
Permission or authentication failure reportedIncorrect AK/SK or insufficient permissionCheck whether the Access Key and Secret Key are correct, and confirm read permission for the target container
Endpoint cannot be resolvedEndpoint contains a protocol prefix or is misspelledUse <StorageAccount>.blob.core.windows.net without a prefix such as https://
Import task times outLarge data volume or slow networkIncrease the timeout value in PROPERTIES