Skip to main content

HDFS

Doris provides two ways to load files from HDFS:

  • Use HDFS Load to load HDFS files into Doris, which is an asynchronous load method.
  • Use TVF to load HDFS files into Doris, which is a synchronous load method.

load with HDFS Load​

Use HDFS Load to import files on HDFS. For detailed steps, please refer to the Broker Load Manual

Step 1: Prepare the data​

Create a CSV file hdfsload_example.csv The file is stored on HDFS and its content is as follows:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris​

CREATE TABLE test_hdfsload(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Load data using HDFS Load​

LOAD LABEL hdfs_load_2022_04_01
(
DATA INFILE("hdfs://127.0.0.1:8020/tmp/hdfsload_example.csv")
INTO TABLE test_hdfsload
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
with HDFS
(
"fs.defaultFS" = "hdfs://127.0.0.1:8020",
"hadoop.username" = "user"
)
PROPERTIES
(
"timeout" = "3600"
);

Step 4: Check the imported data​

SELECT * FROM test_hdfsload;

Results:

mysql> select * from test_hdfsload;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)

Load with TVF​

Step 1: Prepare the data​

Create a CSV file hdfsload_example.csv The file is stored on HDFS and its content is as follows:

1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64

Step 2: Create a table in Doris​

CREATE TABLE test_hdfsload(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

Step 3: Load data using TVF​

INSERT INTO test_hdfsload
SELECT * FROM hdfs (
"uri" = "hdfs://127.0.0.1:8020/tmp/hdfsload_example.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8020",
"hadoop.username" = "doris",
"format" = "csv",
"csv_schema" = "user_id:int;name:string;age:int"
);

Step 4: Check the imported data​

SELECT * FROM test_hdfsload;

Results:

mysql> select * from test_hdfsload;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)