S3
S3
Name
S3
description
S3 table-valued-function(tvf), allows users to read and access file contents on S3-compatible object storage, just like accessing relational table. Currently supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc
file format.
syntax
s3(
"uri" = "..",
"s3.access_key" = "...",
"s3.secret_key" = "...",
"s3.region" = "...",
"format" = "csv",
"keyn" = "valuen",
...
);
parameter description
Each parameter in S3 tvf is a pair of "key"="value"
.
Related parameters for accessing S3:
uri
: (required) The S3 tvf will decide whether to use the path style access method according to theuse_path_style
parameter, and the default access method is the virtual-hosted style method.s3.access_key
: (required)s3.secret_key
: (required)s3.region
: (optional). Mandatory if the Minio has set another region. Otherwise,us-east-1
is used by default.s3.session_token
: (optional)use_path_style
: (optional) defaultfalse
. The S3 SDK uses the virtual-hosted style by default. However, some object storage systems may not be enabled or support virtual-hosted style access. At this time, we can add theuse_path_style
parameter to force the use of path style access method.force_parsing_by_standard_uri
: (optional) defaultfalse
. We can addforce_parsing_by_standard_uri
parameter to force parsing unstandard uri as standard uri.
Note: For AWS S3, standard uri styles should be:
- AWS Client Style(Hadoop S3 Style):
s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
- Virtual Host Style:
https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
- Path Style:
https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
In addition to supporting the common uri styles of the above three standards, it also supports some other uri styles (maybe not common, but there may be):
- Virtual Host AWS Client (Hadoop S3) Mixed Style:
s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
- Path AWS Client (Hadoop S3) Mixed Style:
s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
For detailed use cases, you can refer to Best Practice at the bottom.
file format parameter:
format
: (required) Currently supportcsv/csv_with_names/csv_with_names_and_types/json/parquet/orc
column_separator
: (optional) default\t
.line_delimiter
: (optional) default\n
.compress_type
: (optional) Currently supportUNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK
. Default value isUNKNOWN
, it will automatically infer the type based on the suffix ofuri
.
The following 6 parameters are used for loading in json format. For specific usage methods, please refer to: Json Load
read_json_by_line
: (optional) default"true"
strip_outer_array
: (optional) default"false"
json_root
: (optional) default""
jsonpaths
: (optional) default""
num_as_string
: (optional) default"false"
fuzzy_parse
: (optional) default"false"
The following 2 parameters are used for loading in csv format
trim_double_quotes
: Boolean type (optional), the default value isfalse
. True means that the outermost double quotes of each field in the csv file are trimmed.skip_lines
: Integer type (optional), the default value is 0. It will skip some lines in the head of csv file. It will be disabled when the format iscsv_with_names
orcsv_with_names_and_types
.
other parameter:
path_partition_keys
: (optional) Specifies the column names carried in the file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", you should fill inpath_partition_keys="city,date"
. It will automatically read the corresponding column names and values from the path during load process.resource
:(optional)Specify the resource name. S3 tvf can use the existing S3 resource to directly access S3. You can refer to the method for creating an S3 resource: CREATE-RESOURCE. This property is supported starting from version 2.1.4 .
To directly query a TVF or create a VIEW based on that TVF, you need to have usage permission for that resource. To query a VIEW created based on TVF, you only need select permission for that VIEW.
Example
Read and access csv format files on S3-compatible object storage.
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style" = "true") order by c1;
Can be used with desc function
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style" = "true");
Keywords
s3, table-valued-function, tvf
Best Practice
Since the S3 table-valued-function does not know the table schema in advance, it will read the file first to parse out the table schema.
Usage of different uri schemas Example of http:// 、https://
// Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
// Because of "use_path_style"="true", s3 will be accessed in 'path style'.
select * from s3(
"URI" = "https://endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true");
// Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
// Because of "use_path_style"="false", s3 will be accessed in 'virtual-hosted style'.
select * from s3(
"URI" = "https://bucket.endpoint/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="false");
// The OSS on Alibaba Cloud and The COS on Tencent Cloud will use 'virtual-hosted style' to access s3.
// OSS
select * from s3(
"URI" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"region" = "oss-cn-beijing",
"format" = "parquet",
"use_path_style" = "false");
// COS
select * from s3(
"URI" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"region" = "ap-hongkong",
"format" = "parquet",
"use_path_style" = "false");
// The BOS on Baidu Cloud will use 'virtual-hosted style' compatible with the S3 protocol to access s3.
// BOS
select * from s3(
"uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "bj",
"format" = "parquet",
"use_path_style" = "false");
// MinIO select * from s3( "uri" = "s3://bucket/file.csv", "s3.endpoint" = "http://172.21.0.101:9000", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.region" = "us-east-1", "format" = "csv" );
Example of s3://:
// Note how to write your bucket of URI, no need to set 'use_path_style'.
// s3 will be accessed in 'virtual-hosted style'.
select * from s3(
"URI" = "s3://bucket/file/student.csv",
"s3.endpoint"= "endpont",
"s3.region" = "region",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv");
Example of other uri styles:
// Virtual Host AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = false` and `force_parsing_by_standard_uri = true`.
select * from s3(
"URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="false",
"force_parsing_by_standard_uri"="true");
// Path AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = true` and `force_parsing_by_standard_uri = true`.
select * from s3(
"URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true",
"force_parsing_by_standard_uri"="true");
csv format
csv
format: Read the file on S3 and process it as a csv file, read the first line in the file to parse out the table schema. The number of columns in the first line of the file n
will be used as the number of columns in the table schema, and the column names of the table schema will be automatically named c1, c2, ..., cn
, and the column type is set to String
, for example:
The file content of student1.csv:
1,ftw,12
2,zs,18
3,ww,20
use S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv",
-> "use_path_style" = "true") order by c1;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
use desc function S3()
to view the table schema
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1 | TEXT | Yes | false | NULL | NONE |
| c2 | TEXT | Yes | false | NULL | NONE |
| c3 | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
csv_with_names format
csv_with_names
format: The first line of the file is used as the number and name of the columns of the table schema, and the column type is set to String
, for example:
The file content of student_with_names.csv:
id,name,age
1,ftw,12
2,zs,18
3,ww,20
use S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names",
-> "use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
csv_with_names_and_types format
csv_with_names_and_types
format: Currently, it does not support parsing the column type from a csv file. When using this format, S3 tvf will parse the first line of the file as the number and name of the columns of the table schema, and set the column type to String. Meanwhile, the second line of the file is ignored.
The file content of student_with_names_and_types.csv:
id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20
use S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names_and_types",
-> "use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names_and_types",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
json format
json
format: The json format involves many optional parameters, and the meaning of each parameter can be referred to: Json Load. When S3 tvf queries the json format file, it locates a json object according to the json_root
and jsonpaths
parameters, and uses the key
in the object as the column name of the table schema, and sets the column type to String. For example:
The file content of data.json:
[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}]
use S3 tvf:
MySQL [(none)]> select * from s3(
"URI" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "json",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style"="true");
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 2 | xxx | 17 |
| 3 | yyy | 19 |
+------+------+------+
MySQL [(none)]> select * from s3(
"URI" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\", \"$.age\"]",
"use_path_style"="true");
+------+------+
| id | age |
+------+------+
| 1 | 18 |
| 2 | 17 |
| 3 | 19 |
+------+------+
parquet format
parquet
format: S3 tvf supports parsing the column names and column types of the table schema from the parquet file. Example:
MySQL [(none)]> select * from s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "parquet",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
MySQL [(none)]> desc function s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
orc format
orc
format: Same as parquet
format, set format
parameter to orc.
MySQL [(none)]> select * from s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.orc",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "orc",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
avro format
avro
format: S3 tvf supports parsing the column names and column types of the table schema from the avro file. Example:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/person.avro",
"ACCESS_KEY" = "ak",
"SECRET_KEY" = "sk",
"FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------------+
| Alyssa | 1 | 10.0012 | 100000000221133 |
| Ben | 0 | 5555.999 | 4009990000 |
| lisi | 0 | 5992225.999 | 9099933330 |
+--------+--------------+-------------+-----------------+
uri contains wildcards
uri can use wildcards to read multiple files. Note: If wildcards are used, the format of each file must be consistent (especially csv/csv_with_names/csv_with_names_and_types count as different formats), S3 tvf uses the first file to parse out the table schema. For example:
The following two csv files:
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19
// file2.csv
5,cyx,19
6,ftw,21
You can use wildcards on the uri to query.
MySQL [(none)]> select * from s3(
"URI" = "http://127.0.0.1:9312/test2/file*.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style"="true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | aaa | 18 |
| 2 | qqq | 20 |
| 3 | qwe | 19 |
| 5 | cyx | 19 |
| 6 | ftw | 21 |
+------+------+------+
Using S3
tvf with insert into
and cast
// Create doris internal table
CREATE TABLE IF NOT EXISTS ${testTable}
(
id int,
name varchar(50),
age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");
// Insert data using S3
insert into ${testTable} (id,name,age)
select cast (id as INT) as id, name, cast (age as INT) as age
from s3(
"uri" = "${uri}",
"s3.access_key"= "${ak}",
"s3.secret_key" = "${sk}",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");