S3
Description
The S3 table-valued function (TVF) lets users read and access files stored on S3-compatible object storage as if they were relational tables. It currently supports the csv, csv_with_names, csv_with_names_and_types, json, parquet, and orc file formats.
Syntax
S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)
Required Parameters
| Parameter | Description |
|---|---|
| uri | URI for accessing S3. The function uses either Path Style or Virtual-hosted Style, depending on the use_path_style parameter |
| s3.access_key | Access key for S3 |
| s3.secret_key | Secret key for S3 |
| s3.region | S3 region |
| s3.endpoint | S3 endpoint address |
| format | File format; supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc |
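A minimal invocation using only the required parameters might look like the sketch below; the bucket name, endpoint, and credentials are placeholders rather than values from this document (full, concrete examples follow in the Examples section):

-- Minimal sketch with placeholder bucket, endpoint, and credentials.
-- The URI is in Virtual-hosted Style, which matches the default use_path_style=false.
select * from s3(
    "uri" = "https://my-bucket.s3.us-east-1.amazonaws.com/data/file.parquet",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "us-east-1",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "format" = "parquet");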
Optional Parameters
| Parameter | Default | Description | Notes |
|---|---|---|---|
| s3.session_token | - | S3 session token | |
| use_path_style | false | The S3 SDK uses Virtual-hosted Style by default. Some object storage systems may not support Virtual-hosted Style; set this parameter to force Path Style access. For example, MinIO only allows Path Style by default, so set use_path_style=true when accessing MinIO | |
| force_parsing_by_standard_uri | false | Forces a non-standard URI to be parsed as a standard URI | |
| column_separator | \t | Column separator | |
| line_delimiter | \n | Line delimiter | |
| compress_type | UNKNOWN | Compression type. Currently supports UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK; auto-detected from the URI suffix | |
| read_json_by_line | "true" | For JSON format import | JSON Load |
| strip_outer_array | "false" | For JSON format import | JSON Load |
| json_root | Empty | For JSON format import | JSON Load |
| jsonpaths | Empty | For JSON format import | JSON Load |
| num_as_string | false | For JSON format import | JSON Load |
| fuzzy_parse | false | For JSON format import | |
| trim_double_quotes | false | For CSV format import: removes the outermost double quotes from each field | |
| skip_lines | 0 | Skips the first N lines of a CSV file. This parameter is ignored for the csv_with_names and csv_with_names_and_types formats | |
| path_partition_keys | - | Specifies partition column names contained in the file path. For example, for the path /path/to/city=beijing/date="2023-07-09", set path_partition_keys="city,date" to read the column names and values from the path automatically | |
| resource | - | Specifies the Resource name. The S3 TVF can use an existing S3 Resource to access S3 directly. For creating an S3 Resource, refer to CREATE-RESOURCE | Supported from version 2.1.4 |
| enable_mapping_varbinary | false | By default, BYTE_ARRAY columns in Parquet/ORC files are mapped to STRING. When enabled, they are mapped to VARBINARY instead | Supported from version 4.0.3 |
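The optional parameters can be combined in a single call. The sketch below assumes a hypothetical partitioned layout /path/to/city=beijing/date=2023-07-09/ under a placeholder bucket and endpoint; adapt the URI, endpoint, and credentials to your environment:

-- Hypothetical layout: my-bucket/path/to/city=beijing/date=2023-07-09/data.csv
-- path_partition_keys exposes "city" and "date" as extra columns parsed from the path;
-- skip_lines and trim_double_quotes only affect CSV parsing.
select * from s3(
    "uri" = "http://127.0.0.1:9312/my-bucket/path/to/*/*/*.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "skip_lines" = "1",
    "trim_double_quotes" = "true",
    "path_partition_keys" = "city,date",
    "use_path_style" = "true");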
Notes
- For AWS S3, the standard URI styles are:

  - AWS Client Style (Hadoop S3 Style): s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
  - Virtual-hosted Style: https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
  - Path Style: https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

  Besides these three standard URI styles, other (less common) URI styles are also supported:

  - Virtual Host AWS Client (Hadoop S3) Mixed Style: s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
  - Path AWS Client (Hadoop S3) Mixed Style: s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

  For detailed usage, see the examples below.

- To query the TVF directly or to create a View based on it, USAGE permission on the Resource is required. To query a View created on top of the TVF, only SELECT permission on the View is needed.
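As a sketch of how Resource-based access and these permissions fit together: the resource name s3_resource, database example_db, view v_student, and user example_user below are illustrative placeholders, and the statements assume standard Doris privilege syntax:

-- Allow a user to query the TVF through an existing S3 Resource (USAGE_PRIV on the Resource).
GRANT USAGE_PRIV ON RESOURCE 's3_resource' TO 'example_user'@'%';

-- Wrap the TVF in a View so that consumers only need SELECT on the View.
CREATE VIEW example_db.v_student AS
select * from s3(
    "uri" = "s3://bucket/file/student.csv",
    "resource" = "s3_resource",
    "format" = "csv");

-- Querying the View only requires SELECT_PRIV on the View itself.
GRANT SELECT_PRIV ON example_db.v_student TO 'example_user'@'%';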
Examples
- Read and access CSV files on S3-compatible object storage:
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1; -
Can be used with
desc functiondesc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true"); -
Different URL Schema Usage
http://,https://usage examples:-- Note URI Bucket notation and use_path_style parameter setting, same for HTTP.
-- Since "use_path_style"="true" is set, Path Style will be used to access S3.
select * from s3(
"uri" = "https://endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="true");
-- Note the URI bucket notation and the use_path_style parameter setting. The same applies to HTTP.
-- Since "use_path_style"="false" is set, Virtual-hosted Style will be used to access S3.
select * from s3(
"uri" = "https://bucket.endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="false");
-- Alibaba Cloud OSS and Tencent Cloud COS use Virtual-hosted Style to access S3.
-- OSS
select * from s3(
"uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
"s3.region" = "oss-cn-beijing",
"format" = "parquet",
"use_path_style" = "false");
-- COS
select * from s3(
"uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
"s3.region" = "ap-hongkong",
"format" = "parquet",
"use_path_style" = "false");
-- MinIO
select * from s3(
"uri" = "s3://bucket/file.csv",
"s3.endpoint" = "http://172.21.0.101:9000",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "us-east-1",
"format" = "csv"
);
-- Baidu Cloud BOS uses S3-compatible Virtual-hosted Style to access S3.
-- BOS
select * from s3(
"uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "bj",
"s3.endpoint" = "http://bj.bcebos.com",
"format" = "parquet",
"use_path_style" = "false"); -
s3://usage example:-- Note URI Bucket notation, no need to set use_path_style parameter.
-- Virtual-hosted Style will be used to access S3.
select * from s3(
"uri" = "s3://bucket/file/student.csv",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv"); -
Other supported URI style examples:
-- Virtual Host AWS Client (Hadoop S3) Mixed Style. Use by setting use_path_style=false and force_parsing_by_standard_uri=true.
select * from s3(
"URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="false",
"force_parsing_by_standard_uri"="true");
-- Path AWS Client (Hadoop S3) Mixed Style: set use_path_style=true and force_parsing_by_standard_uri=true.
select * from s3(
"URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="true",
"force_parsing_by_standard_uri"="true"); -
CSV Format
Since S3 table-valued-function doesn't know the Table Schema in advance, it will read the file once to parse the Table Schema.
csvformat: S3 table-valued-function reads files on S3 and processes them as CSV files, using the first line to parse Table Schema. The number of columnsnin the first line will be used as the number of columns in Table Schema, and column names will be automatically named asc1, c2, ..., cn, with all column types set toString. Example:student1.csvfile content:1,ftw,12
2,zs,18
3,ww,20Using S3 TVF:
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+

You can use desc function S3() to view the Table Schema:

desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1 | TEXT | Yes | false | NULL | NONE |
| c2 | TEXT | Yes | false | NULL | NONE |
| c3 | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+

- csv_with_names Format

csv_with_names format: the first line of the file is parsed as the number and names of the columns in the Table Schema, and all column types are set to String. For example, student_with_names.csv has the following content:

id,name,age
1,ftw,12
2,zs,18
3,ww,20

Using the S3 TVF:
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+

Similarly, you can use desc function S3() to view the Table Schema:

desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+

- csv_with_names_and_types Format

csv_with_names_and_types format: parsing column types from the CSV file is not currently supported. With this format, the S3 TVF parses the first line as the number and names of the columns in the Table Schema, sets all column types to String, and ignores the second line of the file. For example, student_with_names_and_types.csv has the following content:

id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20

Using the S3 TVF:
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+

Similarly, you can use desc function S3() to view the Table Schema:

desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+

- JSON Format

json format: the JSON format involves many optional parameters; for the meaning of each parameter, refer to JSON Load. When the S3 TVF queries a JSON file, it locates a JSON object based on the json_root and jsonpaths parameters, uses the keys of that object as the column names in the Table Schema, and sets all column types to String. For example, data.json has the following content:

[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}]

Querying with the S3 TVF:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style"="true");
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 2 | xxx | 17 |
| 3 | yyy | 19 |
+------+------+------+
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\", \"$.age\"]",
"use_path_style"="true");
+------+------+
| id | age |
+------+------+
| 1 | 18 |
| 2 | 17 |
| 3 | 19 |
+------+------+

- Parquet Format

parquet format: the S3 TVF can parse the column names and column types of the Table Schema from the Parquet file. Example:

select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

MySQL [(none)]> desc function s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+

- ORC Format

orc format: usage is the same as the parquet format; just set the format parameter to orc.

select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "orc",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

- URI with Wildcards

The URI can contain wildcards to read multiple files. Note: when using wildcards, make sure all files share the same format (in particular, csv, csv_with_names, and csv_with_names_and_types are different formats). The S3 TVF uses the first file to parse the Table Schema. Given the following two CSV files:
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19
// file2.csv
5,cyx,19
6,ftw,21

You can use wildcards in the URI to query them:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/file*.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style"="true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | aaa | 18 |
| 2 | qqq | 20 |
| 3 | qwe | 19 |
| 5 | cyx | 19 |
| 6 | ftw | 21 |
+------+------+------+

- Using the S3 TVF with insert into and cast

-- Create a Doris internal table
CREATE TABLE IF NOT EXISTS ${testTable}
(
id int,
name varchar(50),
age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");
-- Insert data using S3
insert into ${testTable} (id,name,age)
select cast (id as INT) as id, name, cast (age as INT) as age
from s3(
"uri" = "${uri}",
"s3.access_key"= "${ak}",
"s3.secret_key" = "${sk}",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");