S3
Description
S3 table-valued-function (tvf) allows users to read and access file contents on S3-compatible object storage just like accessing relational table-formatted data. Currently supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc
file format.
Syntax
S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)
Required parameters
Parameter | Description |
---|---|
uri | The URI for accessing S3. The function will decide whether to use Path Style or Virtual-hosted Style based on the use_path_style parameter. |
s3.access_key | The access key for accessing S3. |
s3.secret_key | The secret key for accessing S3. |
s3.region | The region where the S3 storage is located. |
s3.endpoint | The endpoint address of the S3 storage. |
format | The file format, supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc . |
Optional parameters
Parameter | Description | Remarks |
---|---|---|
s3.session_token | S3 session token | |
use_path_style | Default is false . The S3 SDK uses Virtual-hosted Style by default. However, some object storage systems may not support Virtual-hosted Style. In such cases, the use_path_style parameter can be added to force the use of Path Style. For example, MinIO only allows Path Style by default, so you need to add use_path_style=true when accessing MinIO. | |
force_parsing_by_standard_uri | Default is false . Forces the parsing of non-standard URIs into standard URIs. | |
column_separator | Column separator, default is \t . | |
line_delimiter | Line separator, default is \n . | |
compress_type | Currently UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK is supported. The default value is UNKNOWN, the type will be automatically inferred based on the suffix of uri | |
read_json_by_line | Default is "true" , used for importing JSON format. | JSON Load |
strip_outer_array | Default is "false" , used for importing JSON format. | JSON Load |
json_root | Default is empty, used for importing JSON format. | JSON Load |
jsonpaths | Default is empty, used for importing JSON format. | JSON Load |
num_as_string | Default is false , used for importing JSON format. | JSON Load |
fuzzy_parse | Default is false , used for importing JSON format. | |
trim_double_quotes | Default is false , used for importing CSV format, trims the outer double quotes of each field. | |
skip_lines | Default is 0, indicating how many lines to skip at the beginning of the CSV file. This is invalid for csv_with_names or csv_with_names_and_types formats. | |
path_partition_keys | Specifies partition column names in the file path, e.g., /path/to/city=beijing/date="2023-07-09" , then fill path_partition_keys="city,date" , which will automatically read the corresponding column names and values from the path for import. | |
resource | Specifies the Resource name. The S3 TVF can directly access S3 using an existing S3 Resource. To create an S3 Resource, refer to CREATE-RESOURCE. This feature is supported starting from version 2.1.4. |
Notes
- For AWS S3, the standard URI styles are as follows:
- AWS Client Style (Hadoop S3 Style):
s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
.- Virtual Host Style:
https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
.- Path Style:
https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
.In addition to supporting the above three common standard URI styles, other URI styles (perhaps uncommon, but possible) are also supported:
- Virtual Host AWS Client (Hadoop S3) Mixed Style: >>
s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
- Path AWS Client (Hadoop S3) Mixed Style: > >
s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
Detailed usage examples can be found in the following examples.
- To directly query a TVF or create a View based on the TVF, you need to have the USAGE permission for that Resource. To query a View created based on the TVF, only the SELECT permission for that View is required.
Examples
-
Read and access CSV format files on S3-compatible object storage
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1; -
Can be used with
desc function
Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true"); -
Usage of different uri schemas
Example of http:// 、https://
-- Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
-- Because of "use_path_style"="true", s3 will be accessed in 'path style'.
select * from s3(
"URI" = "https://endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="true");
-- Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
-- Because of "use_path_style"="false", s3 will be accessed in 'virtual-hosted style'.
select * from s3(
"URI" = "https://bucket.endpoint/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="false");
-- The OSS on Alibaba Cloud and The COS on Tencent Cloud will use 'virtual-hosted style' to access s3.
-- OSS
select * from s3(
"URI" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
"s3.region" = "oss-cn-beijing",
"format" = "parquet",
"use_path_style" = "false");
-- COS
select * from s3(
"URI" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
"s3.region" = "ap-hongkong",
"format" = "parquet",
"use_path_style" = "false");
-- MinIO
select * from s3(
"uri" = "s3://bucket/file.csv",
"s3.endpoint" = "http://172.21.0.101:9000",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "us-east-1",
"format" = "csv"
);
-- The BOS on Baidu Cloud will use 'virtual-hosted style' compatible with the S3 protocol to access s3.
-- BOS
select * from s3(
"uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "bj",
"s3.endpoint" = "http://bj.bcebos.com",
"format" = "parquet",
"use_path_style" = "false"); -
Example of s3://:
-- Note how to write your bucket of URI, no need to set 'use_path_style'.
-- s3 will be accessed in 'virtual-hosted style'.
select * from s3(
"URI" = "s3://bucket/file/student.csv",
"s3.endpoint"= "endpont",
"s3.region" = "region",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv"); -
Example of other uri styles:
-- Virtual Host AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = false` and `force_parsing_by_standard_uri = true`.
select * from s3(
"URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="false",
"force_parsing_by_standard_uri"="true");
-- Path AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = true` and `force_parsing_by_standard_uri = true`.
select * from s3(
"URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="true",
"force_parsing_by_standard_uri"="true"); -
csv format
csv
format: Read the file on S3 and process it as a csv file, read the first line in the file to parse out the table schema. The number of columns in the first line of the filen
will be used as the number of columns in the table schema, and the column names of the table schema will be automatically namedc1, c2, ..., cn
, and the column type is set toString
, for example:The file content of student1.csv:
1,ftw,12
2,zs,18
3,ww,20 -
use S3 tvf
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1;+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+ -
use
desc function S3()
to view the table schemaDesc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true");+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1 | TEXT | Yes | false | NULL | NONE |
| c2 | TEXT | Yes | false | NULL | NONE |
| c3 | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+ -
csv_with_names format
csv_with_names
format: The first line of the file is used as the number and name of the columns of the table schema, and the column type is set toString
, for example:The file content of student_with_names.csv:
id,name,age
1,ftw,12
2,zs,18
3,ww,20 -
use S3 tvf
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true") order by id;+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+ -
You can also use
desc function S3()
to view the Table Schema.Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true");+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+ -
csv_with_names_and_types format
csv_with_names_and_types
format: Currently, it does not support parsing the column type from a csv file. When using this format, S3 tvf will parse the first line of the file as the number and name of the columns of the tableThe file content of
student_with_names_and_types.csv
:id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20 -
use S3 tvf
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true") order by id;+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+ -
You can also use
desc function S3()
to view the Table Schema.Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true");+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+ -
json format
The json format involves many optional parameters, and the meaning of each parameter can be referred to: Json Load. When S3 tvf queries the json form
data.json 文件
[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}] -
use S3 tvf:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style"="true");
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 2 | xxx | 17 |
| 3 | yyy | 19 |
+------+------+------+
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\", \"$.age\"]",
"use_path_style"="true");
+------+------+
| id | age |
+------+------+
| 1 | 18 |
| 2 | 17 |
| 3 | 19 |
+------+------+ -
parquet format
S3 tvf supports parsing the column names and column types of the table schema from the parquet file. Example:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true") limit 5;+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+MySQL [(none)]> desc function s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true");+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+ -
orc format
Same as
parquet
format, setformat
parameter to orc.select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "orc",
"use_path_style"="true") limit 5;+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ -
avro format
S3 tvf supports parsing the column names and column types of the table schema from the avro file. Example:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/person.avro",
"ACCESS_KEY" = "ak",
"SECRET_KEY" = "sk",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"FORMAT" = "avro");+--------+--------------+-------------+-----------------+
| name | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------------+
| Alyssa | 1 | 10.0012 | 100000000221133 |
| Ben | 0 | 5555.999 | 4009990000 |
| lisi | 0 | 5992225.999 | 9099933330 |
+--------+--------------+-------------+-----------------+ -
uri contains wildcards
uri can use wildcards to read multiple files. Note: If wildcards are used, the format of each file must be consistent (especially csv/csv_with_names/csv_with_names_and_types count as different formats), S3 tvf uses the first file to parse out the table schema. For example:
The following two csv files:
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19
// file2.csv
5,cyx,19
6,ftw,21 -
You can use wildcards on the uri to query.
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/file*.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style"="true");+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | aaa | 18 |
| 2 | qqq | 20 |
| 3 | qwe | 19 |
| 5 | cyx | 19 |
| 6 | ftw | 21 |
+------+------+------+ -
Using
S3
tvf withinsert into
andcast
-- Create doris internal table
CREATE TABLE IF NOT EXISTS ${testTable}
(
id int,
name varchar(50),
age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");
-- Insert data using S3
insert into ${testTable} (id,name,age)
select cast (id as INT) as id, name, cast (age as INT) as age
from s3(
"uri" = "${uri}",
"s3.access_key"= "${ak}",
"s3.secret_key" = "${sk}",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");