Skip to main content

S3

Description

S3 table-valued-function (tvf) allows users to read and access file contents on S3-compatible object storage just like accessing relational table-formatted data. Currently supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc file format.

Syntax

S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)

Required parameters

ParameterDescription
uriThe URI for accessing S3. The function will decide whether to use Path Style or Virtual-hosted Style based on the use_path_style parameter.
s3.access_keyThe access key for accessing S3.
s3.secret_keyThe secret key for accessing S3.
s3.regionThe region where the S3 storage is located.
s3.endpointThe endpoint address of the S3 storage.
formatThe file format, supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc.

Optional parameters

ParameterDescriptionRemarks
s3.session_tokenS3 session token
use_path_styleDefault is false. The S3 SDK uses Virtual-hosted Style by default. However, some object storage systems may not support Virtual-hosted Style. In such cases, the use_path_style parameter can be added to force the use of Path Style. For example, MinIO only allows Path Style by default, so you need to add use_path_style=true when accessing MinIO.
force_parsing_by_standard_uriDefault is false. Forces the parsing of non-standard URIs into standard URIs.
column_separatorColumn separator, default is \t.
line_delimiterLine separator, default is \n.
compress_typeCurrently UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK is supported. The default value is UNKNOWN, the type will be automatically inferred based on the suffix of uri
read_json_by_lineDefault is "true", used for importing JSON format.JSON Load
strip_outer_arrayDefault is "false", used for importing JSON format.JSON Load
json_rootDefault is empty, used for importing JSON format.JSON Load
jsonpathsDefault is empty, used for importing JSON format.JSON Load
num_as_stringDefault is false, used for importing JSON format.JSON Load
fuzzy_parseDefault is false, used for importing JSON format.
trim_double_quotesDefault is false, used for importing CSV format, trims the outer double quotes of each field.
skip_linesDefault is 0, indicating how many lines to skip at the beginning of the CSV file. This is invalid for csv_with_names or csv_with_names_and_types formats.
path_partition_keysSpecifies partition column names in the file path, e.g., /path/to/city=beijing/date="2023-07-09", then fill path_partition_keys="city,date", which will automatically read the corresponding column names and values from the path for import.
resourceSpecifies the Resource name. The S3 TVF can directly access S3 using an existing S3 Resource. To create an S3 Resource, refer to CREATE-RESOURCE. This feature is supported starting from version 2.1.4.

Notes

  1. For AWS S3, the standard URI styles are as follows:
  1. AWS Client Style (Hadoop S3 Style): s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88.
  2. Virtual Host Style: https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88.
  3. Path Style: https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88.

In addition to supporting the above three common standard URI styles, other URI styles (perhaps uncommon, but possible) are also supported:

  1. Virtual Host AWS Client (Hadoop S3) Mixed Style: >> s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
  2. Path AWS Client (Hadoop S3) Mixed Style: > > s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

Detailed usage examples can be found in the following examples.

  1. To directly query a TVF or create a View based on the TVF, you need to have the USAGE permission for that Resource. To query a View created based on the TVF, only the SELECT permission for that View is required.

Examples

  • Read and access CSV format files on S3-compatible object storage

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;
  • Can be used with desc function

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");
  • Usage of different uri schemas

    Example of http:// 、https://

    -- Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
    -- Because of "use_path_style"="true", s3 will be accessed in 'path style'.
    select * from s3(
    "URI" = "https://endpoint/bucket/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="true");

    -- Note how to write your bucket of URI and set the 'use_path_style' parameter, as well as http://.
    -- Because of "use_path_style"="false", s3 will be accessed in 'virtual-hosted style'.
    select * from s3(
    "URI" = "https://bucket.endpoint/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="false");

    -- The OSS on Alibaba Cloud and The COS on Tencent Cloud will use 'virtual-hosted style' to access s3.
    -- OSS
    select * from s3(
    "URI" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
    "s3.region" = "oss-cn-beijing",
    "format" = "parquet",
    "use_path_style" = "false");
    -- COS
    select * from s3(
    "URI" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
    "s3.region" = "ap-hongkong",
    "format" = "parquet",
    "use_path_style" = "false");

    -- MinIO
    select * from s3(
    "uri" = "s3://bucket/file.csv",
    "s3.endpoint" = "http://172.21.0.101:9000",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "us-east-1",
    "format" = "csv"
    );

    -- The BOS on Baidu Cloud will use 'virtual-hosted style' compatible with the S3 protocol to access s3.
    -- BOS
    select * from s3(
    "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "bj",
    "s3.endpoint" = "http://bj.bcebos.com",
    "format" = "parquet",
    "use_path_style" = "false");
  • Example of s3://:

    -- Note how to write your bucket of URI, no need to set 'use_path_style'.
    -- s3 will be accessed in 'virtual-hosted style'.
    select * from s3(
    "URI" = "s3://bucket/file/student.csv",
    "s3.endpoint"= "endpont",
    "s3.region" = "region",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv");
  • Example of other uri styles:

    -- Virtual Host AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = false` and `force_parsing_by_standard_uri = true`.
    select * from s3(
    "URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="false",
    "force_parsing_by_standard_uri"="true");

    -- Path AWS Client (Hadoop S3) Mixed Style. Used by setting `use_path_style = true` and `force_parsing_by_standard_uri = true`.
    select * from s3(
    "URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="true",
    "force_parsing_by_standard_uri"="true");
  • csv format

    csv format: Read the file on S3 and process it as a csv file, read the first line in the file to parse out the table schema. The number of columns in the first line of the file n will be used as the number of columns in the table schema, and the column names of the table schema will be automatically named c1, c2, ..., cn, and the column type is set to String , for example:

    The file content of student1.csv:

    1,ftw,12
    2,zs,18
    3,ww,20
  • use S3 tvf

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;
      +------+------+------+
    | c1 | c2 | c3 |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+
  • use desc function S3() to view the table schema

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");
      +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | c1 | TEXT | Yes | false | NULL | NONE |
    | c2 | TEXT | Yes | false | NULL | NONE |
    | c3 | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • csv_with_names format

    csv_with_names format: The first line of the file is used as the number and name of the columns of the table schema, and the column type is set to String, for example:

    The file content of student_with_names.csv:

    id,name,age
    1,ftw,12
    2,zs,18
    3,ww,20
  • use S3 tvf

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true") order by id;
      +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+
  • You can also use desc function S3() to view the Table Schema.

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true");
      +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id | TEXT | Yes | false | NULL | NONE |
    | name | TEXT | Yes | false | NULL | NONE |
    | age | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • csv_with_names_and_types format

    csv_with_names_and_types format: Currently, it does not support parsing the column type from a csv file. When using this format, S3 tvf will parse the first line of the file as the number and name of the columns of the table

    The file content of student_with_names_and_types.csv:

      id,name,age
    INT,STRING,INT
    1,ftw,12
    2,zs,18
    3,ww,20
  • use S3 tvf

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true") order by id;
    +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 12 |
    | 2 | zs | 18 |
    | 3 | ww | 20 |
    +------+------+------+
  • You can also use desc function S3() to view the Table Schema.

    Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true");
      +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id | TEXT | Yes | false | NULL | NONE |
    | name | TEXT | Yes | false | NULL | NONE |
    | age | TEXT | Yes | false | NULL | NONE |
    +-------+------+------+-------+---------+-------+
  • json format

    The json format involves many optional parameters, and the meaning of each parameter can be referred to: Json Load. When S3 tvf queries the json form

    data.json 文件

    [{"id":1, "name":"ftw", "age":18}]
    [{"id":2, "name":"xxx", "age":17}]
    [{"id":3, "name":"yyy", "age":19}]
  • use S3 tvf:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style"="true");
    +------+------+------+
    | id | name | age |
    +------+------+------+
    | 1 | ftw | 18 |
    | 2 | xxx | 17 |
    | 3 | yyy | 19 |
    +------+------+------+

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "jsonpaths" = "[\"$.id\", \"$.age\"]",
    "use_path_style"="true");
    +------+------+
    | id | age |
    +------+------+
    | 1 | 18 |
    | 2 | 17 |
    | 3 | 19 |
    +------+------+
  • parquet format

    S3 tvf supports parsing the column names and column types of the table schema from the parquet file. Example:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true") limit 5;
         +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
    | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
    | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
    | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
    | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    MySQL [(none)]> desc function s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true");
          +---------------+--------------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +---------------+--------------+------+-------+---------+-------+
    | p_partkey | INT | Yes | false | NULL | NONE |
    | p_name | TEXT | Yes | false | NULL | NONE |
    | p_mfgr | TEXT | Yes | false | NULL | NONE |
    | p_brand | TEXT | Yes | false | NULL | NONE |
    | p_type | TEXT | Yes | false | NULL | NONE |
    | p_size | INT | Yes | false | NULL | NONE |
    | p_container | TEXT | Yes | false | NULL | NONE |
    | p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
    | p_comment | TEXT | Yes | false | NULL | NONE |
    +---------------+--------------+------+-------+---------+-------+
  • orc format

    Same as parquet format, set format parameter to orc.

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "orc",
    "use_path_style"="true") limit 5;
      +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
    | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
    | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
    | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
    | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  • avro format

    S3 tvf supports parsing the column names and column types of the table schema from the avro file. Example:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/person.avro",
    "ACCESS_KEY" = "ak",
    "SECRET_KEY" = "sk",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "FORMAT" = "avro");
    +--------+--------------+-------------+-----------------+
    | name | boolean_type | double_type | long_type |
    +--------+--------------+-------------+-----------------+
    | Alyssa | 1 | 10.0012 | 100000000221133 |
    | Ben | 0 | 5555.999 | 4009990000 |
    | lisi | 0 | 5992225.999 | 9099933330 |
    +--------+--------------+-------------+-----------------+
  • uri contains wildcards

    uri can use wildcards to read multiple files. Note: If wildcards are used, the format of each file must be consistent (especially csv/csv_with_names/csv_with_names_and_types count as different formats), S3 tvf uses the first file to parse out the table schema. For example:

    The following two csv files:

    // file1.csv
    1,aaa,18
    2,qqq,20
    3,qwe,19

    // file2.csv
    5,cyx,19
    6,ftw,21
  • You can use wildcards on the uri to query.

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/file*.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style"="true");
     +------+------+------+
    | c1 | c2 | c3 |
    +------+------+------+
    | 1 | aaa | 18 |
    | 2 | qqq | 20 |
    | 3 | qwe | 19 |
    | 5 | cyx | 19 |
    | 6 | ftw | 21 |
    +------+------+------+
  • Using S3 tvf with insert into and cast

    -- Create doris internal table
    CREATE TABLE IF NOT EXISTS ${testTable}
    (
    id int,
    name varchar(50),
    age int
    )
    COMMENT "my first table"
    DISTRIBUTED BY HASH(id) BUCKETS 32
    PROPERTIES("replication_num" = "1");

    -- Insert data using S3
    insert into ${testTable} (id,name,age)
    select cast (id as INT) as id, name, cast (age as INT) as age
    from s3(
    "uri" = "${uri}",
    "s3.access_key"= "${ak}",
    "s3.secret_key" = "${sk}",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "${format}",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style" = "true");