S3

Description

The S3 table-valued function (TVF) lets you read files stored on S3-compatible object storage and query them as if they were relational tables. It currently supports the csv, csv_with_names, csv_with_names_and_types, json, parquet, and orc file formats.

Syntax

S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)

Required Parameters

| Parameter | Description |
| --- | --- |
| uri | URI for accessing S3. The function uses either Path Style or Virtual-hosted Style, depending on the use_path_style parameter. |
| s3.access_key | Access key for S3 |
| s3.secret_key | Secret key for S3 |
| s3.region | S3 region |
| s3.endpoint | S3 endpoint address |
| format | File format; supports csv, csv_with_names, csv_with_names_and_types, json, parquet, and orc |

Optional Parameters

| Parameter | Default | Description | Notes |
| --- | --- | --- | --- |
| s3.session_token | - | S3 session token | |
| use_path_style | false | The S3 SDK uses Virtual-hosted Style by default, but some object storage systems may not support it. Set this parameter to force Path Style access. For example, MinIO only allows Path Style by default, so set use_path_style=true when accessing MinIO. | |
| force_parsing_by_standard_uri | false | Forces a non-standard URI to be parsed as a standard URI | |
| column_separator | \t | Column separator | |
| line_delimiter | \n | Line delimiter | |
| compress_type | UNKNOWN | Compression type. Currently supports UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK; auto-detected from the URI suffix. | |
| read_json_by_line | true | For JSON format import | See JSON Load |
| strip_outer_array | false | For JSON format import | See JSON Load |
| json_root | Empty | For JSON format import | See JSON Load |
| jsonpaths | Empty | For JSON format import | See JSON Load |
| num_as_string | false | For JSON format import | See JSON Load |
| fuzzy_parse | false | For JSON format import | |
| trim_double_quotes | false | For CSV format import; removes the outermost double quotes from each field | |
| skip_lines | 0 | Skips the first N lines of the CSV file. Ineffective when the format is csv_with_names or csv_with_names_and_types. | |
| path_partition_keys | - | Specifies the partition column names contained in the file path. For example, for the path /path/to/city=beijing/date=2023-07-09, set path_partition_keys="city,date" to read the column names and values from the path automatically (see the sketch after this table). | |
| resource | - | Specifies a Resource name. The S3 TVF can use an existing S3 Resource to access S3 directly. For creating an S3 Resource, refer to CREATE-RESOURCE. | Supported since version 2.1.4 |
| enable_mapping_varbinary | false | By default, the BYTE_ARRAY type in Parquet/ORC files is read as STRING; when enabled, it is mapped to VARBINARY instead. | Supported since version 4.0.3 |
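
As a hedged illustration of path_partition_keys (not taken from the examples below), the following sketch reads a hypothetical CSV file stored under a Hive-style partitioned path; the bucket, endpoint, and credentials are placeholders:

-- Sketch only: "city" and "date" are exposed as extra columns whose values are parsed from the path
select * from s3(
"uri" = "https://my-bucket.s3.us-west-1.amazonaws.com/path/to/city=beijing/date=2023-07-09/data.csv",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "s3.us-west-1.amazonaws.com",
"s3.region" = "us-west-1",
"format" = "csv",
"path_partition_keys" = "city,date");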

Notes

  1. For AWS S3, standard URI styles include:

    • AWS Client Style (Hadoop S3 Style)

      s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
    • Virtual Host Style

      https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
    • Path Style

      https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

    Besides these three standard URI styles, other URI styles are also supported (though less common):

    • Virtual Host AWS Client (Hadoop S3) Mixed Style

      s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
    • Path AWS Client (Hadoop S3) Mixed Style

      s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88

    For detailed usage examples, see the examples below.

  2. Querying the TVF directly, or creating a View based on it, requires USAGE permission on the Resource. Querying a View created from the TVF requires only SELECT permission on that View.
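
    As an illustration only, here is a minimal, hedged sketch of granting that permission and then querying through an existing S3 Resource. The resource name my_s3_resource, the user user1, and the URI are placeholders, and the connection properties (endpoint, region, keys) are assumed to come from the Resource:

      -- grant USAGE on the Resource to the user who will query the TVF
      GRANT USAGE_PRIV ON RESOURCE 'my_s3_resource' TO 'user1'@'%';

      -- query the TVF through the Resource ("resource" is supported since version 2.1.4)
      select * from s3(
      "uri" = "https://my-bucket/path/to/file.csv",
      "format" = "csv",
      "resource" = "my_s3_resource");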

Examples

  • Read and access CSV format files on S3-compatible object storage

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;
  • Can be used with desc function

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");
  • Different URL Schema Usage

    http://, https:// usage examples:

    -- Note the bucket notation in the URI and the use_path_style setting; the same applies to http://.
    -- Since "use_path_style"="true" is set, Path Style will be used to access S3.
    select * from s3(
    "uri" = "https://endpoint/bucket/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="true");

    -- Note the bucket notation in the URI and the use_path_style setting; the same applies to http://.
    -- Since "use_path_style"="false" is set, Virtual-hosted Style will be used to access S3.
    select * from s3(
    "uri" = "https://bucket.endpoint/bucket/file/student.csv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "endpoint",
    "s3.region" = "region",
    "format" = "csv",
    "use_path_style"="false");

    -- Alibaba Cloud OSS and Tencent Cloud COS use Virtual-hosted Style to access S3.
    -- OSS
    select * from s3(
    "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
    "s3.region" = "oss-cn-beijing",
    "format" = "parquet",
    "use_path_style" = "false");
    -- COS
    select * from s3(
    "uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
    "s3.region" = "ap-hongkong",
    "format" = "parquet",
    "use_path_style" = "false");

    -- MinIO
    select * from s3(
    "uri" = "s3://bucket/file.csv",
    "s3.endpoint" = "http://172.21.0.101:9000",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "us-east-1",
    "format" = "csv"
    );

    -- Baidu Cloud BOS uses S3-compatible Virtual-hosted Style to access S3.
    -- BOS
    select * from s3(
    "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.region" = "bj",
    "s3.endpoint" = "http://bj.bcebos.com",
    "format" = "parquet",
    "use_path_style" = "false");
  • s3:// usage example:

    -- Note URI Bucket notation, no need to set use_path_style parameter.
    -- Virtual-hosted Style will be used to access S3.
    select * from s3(
    "uri" = "s3://bucket/file/student.csv",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv");
  • Other supported URI style examples:

    -- Virtual Host AWS Client (Hadoop S3) Mixed Style: set use_path_style=false and force_parsing_by_standard_uri=true.
    select * from s3(
    "URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="false",
    "force_parsing_by_standard_uri"="true");

    -- Path AWS Client (Hadoop S3) Mixed Style: set use_path_style=true and force_parsing_by_standard_uri=true.
    select * from s3(
    "URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint"= "endpont",
    "s3.region"= "region",
    "format" = "csv",
    "use_path_style"="true",
    "force_parsing_by_standard_uri"="true");
  • CSV Format

    Because the S3 table-valued function does not know the Table Schema in advance, it reads the file once to infer it.

    csv format: the S3 TVF reads the file on S3 as a CSV file and uses the first line to infer the Table Schema. The number of columns n in the first line becomes the number of columns in the schema, the columns are automatically named c1, c2, ..., cn, and all column types are set to String. Example:

    student1.csv file content:

    1,ftw,12
    2,zs,18
    3,ww,20

    Using S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true") order by c1;

    +------+------+------+
    | c1   | c2   | c3   |
    +------+------+------+
    | 1    | ftw  | 12   |
    | 2    | zs   | 18   |
    | 3    | ww   | 20   |
    +------+------+------+

    You can use desc function s3() to view the Table Schema:

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key   | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | c1    | TEXT | Yes  | false | NULL    | NONE  |
    | c2    | TEXT | Yes  | false | NULL    | NONE  |
    | c3    | TEXT | Yes  | false | NULL    | NONE  |
    +-------+------+------+-------+---------+-------+
  • csv_with_names Format

    csv_with_names format: Parses the first line of the file as the number and names of columns in Table Schema, with all column types set to String. Example:

    student_with_names.csv file content:

    id,name,age
    1,ftw,12
    2,zs,18
    3,ww,20

    Using S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true") order by id;

    +------+------+------+
    | id   | name | age  |
    +------+------+------+
    | 1    | ftw  | 12   |
    | 2    | zs   | 18   |
    | 3    | ww   | 20   |
    +------+------+------+
  • Similarly, use desc function S3() to view Table Schema

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key   | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id    | TEXT | Yes  | false | NULL    | NONE  |
    | name  | TEXT | Yes  | false | NULL    | NONE  |
    | age   | TEXT | Yes  | false | NULL    | NONE  |
    +-------+------+------+-------+---------+-------+
  • csv_with_names_and_types Format

    csv_with_names_and_types format: parsing column types from the CSV file is currently not supported. With this format, the S3 TVF parses the first line as the number and names of the columns in the Table Schema, sets all column types to String, and ignores the second line of the file.

    student_with_names_and_types.csv file content:

    id,name,age
    INT,STRING,INT
    1,ftw,12
    2,zs,18
    3,ww,20

    Using S3 TVF:

    select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true") order by id;

    +------+------+------+
    | id   | name | age  |
    +------+------+------+
    | 1    | ftw  | 12   |
    | 2    | zs   | 18   |
    | 3    | ww   | 20   |
    +------+------+------+

    Similarly, use desc function S3() to view Table Schema:

    desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv_with_names_and_types",
    "use_path_style" = "true");

    +-------+------+------+-------+---------+-------+
    | Field | Type | Null | Key   | Default | Extra |
    +-------+------+------+-------+---------+-------+
    | id    | TEXT | Yes  | false | NULL    | NONE  |
    | name  | TEXT | Yes  | false | NULL    | NONE  |
    | age   | TEXT | Yes  | false | NULL    | NONE  |
    +-------+------+------+-------+---------+-------+
  • JSON Format

    json format: the JSON format involves many optional parameters; refer to JSON Load for their meanings. When querying a JSON file, the S3 TVF locates a JSON object according to the json_root and jsonpaths parameters, uses that object's keys as the column names of the Table Schema, and sets all column types to String. Examples (a json_root sketch follows the jsonpaths example below):

    data.json file content:

    [{"id":1, "name":"ftw", "age":18}]
    [{"id":2, "name":"xxx", "age":17}]
    [{"id":3, "name":"yyy", "age":19}]

    Using S3 TVF to query:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style"="true");
    +------+------+------+
    | id   | name | age  |
    +------+------+------+
    | 1    | ftw  | 18   |
    | 2    | xxx  | 17   |
    | 3    | yyy  | 19   |
    +------+------+------+

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/data.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "strip_outer_array" = "true",
    "jsonpaths" = "[\"$.id\", \"$.age\"]",
    "use_path_style"="true");

    +------+------+
    | id   | age  |
    +------+------+
    | 1    | 18   |
    | 2    | 17   |
    | 3    | 19   |
    +------+------+
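
    The json_root parameter is used in the same way. As a hedged sketch (the file nested.json and its contents are hypothetical), assume each line wraps the record in a "data" object, e.g. {"data": {"id": 1, "name": "ftw", "age": 18}}; setting json_root to "$.data" makes the TVF read the columns from the nested object:

    -- sketch only: reads columns id, name, age from the object under "data"
    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/nested.json",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "json",
    "read_json_by_line" = "true",
    "json_root" = "$.data",
    "use_path_style"="true");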
  • Parquet Format

    parquet format: S3 TVF supports parsing column names and types from Parquet files for Table Schema. Example:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true") limit 5;

    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    |         1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  |      7 | JUMBO PKG   |           901 | ly. slyly ironi     |
    |         2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     |      1 | LG CASE     |           902 | lar accounts amo    |
    |         3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS |     21 | WRAP CASE   |           903 | egular deposits hag |
    |         4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      |     14 | MED DRUM    |           904 | p furiously r       |
    |         5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   |     15 | SM PKG      |           905 | wake carefully      |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

    desc function s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "parquet",
    "use_path_style"="true");

    +---------------+--------------+------+-------+---------+-------+
    | Field         | Type         | Null | Key   | Default | Extra |
    +---------------+--------------+------+-------+---------+-------+
    | p_partkey     | INT          | Yes  | false | NULL    | NONE  |
    | p_name        | TEXT         | Yes  | false | NULL    | NONE  |
    | p_mfgr        | TEXT         | Yes  | false | NULL    | NONE  |
    | p_brand       | TEXT         | Yes  | false | NULL    | NONE  |
    | p_type        | TEXT         | Yes  | false | NULL    | NONE  |
    | p_size        | INT          | Yes  | false | NULL    | NONE  |
    | p_container   | TEXT         | Yes  | false | NULL    | NONE  |
    | p_retailprice | DECIMAL(9,0) | Yes  | false | NULL    | NONE  |
    | p_comment     | TEXT         | Yes  | false | NULL    | NONE  |
    +---------------+--------------+------+-------+---------+-------+
  • ORC Format

    orc format: Usage is the same as parquet format, just set the format parameter to orc.

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "orc",
    "use_path_style"="true") limit 5;

    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    | p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
    |         1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  |      7 | JUMBO PKG   |           901 | ly. slyly ironi     |
    |         2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     |      1 | LG CASE     |           902 | lar accounts amo    |
    |         3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS |     21 | WRAP CASE   |           903 | egular deposits hag |
    |         4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      |     14 | MED DRUM    |           904 | p furiously r       |
    |         5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   |     15 | SM PKG      |           905 | wake carefully      |
    +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
  • URI with Wildcards

    The URI can contain wildcards to read multiple files. Note: when using wildcards, make sure all files share the same format (in particular, csv, csv_with_names, and csv_with_names_and_types count as different formats). The S3 TVF uses the first file to infer the Table Schema.

    With the following two CSV files:

    // file1.csv
    1,aaa,18
    2,qqq,20
    3,qwe,19

    // file2.csv
    5,cyx,19
    6,ftw,21

    You can use wildcards in URI to query:

    select * from s3(
    "uri" = "http://127.0.0.1:9312/test2/file*.csv",
    "s3.access_key"= "minioadmin",
    "s3.secret_key" = "minioadmin",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "csv",
    "use_path_style"="true");

    +------+------+------+
    | c1   | c2   | c3   |
    +------+------+------+
    | 1    | aaa  | 18   |
    | 2    | qqq  | 20   |
    | 3    | qwe  | 19   |
    | 5    | cyx  | 19   |
    | 6    | ftw  | 21   |
    +------+------+------+
  • Using S3 TVF with insert into and cast

    -- Create Doris internal table
    CREATE TABLE IF NOT EXISTS ${testTable}
    (
    id int,
    name varchar(50),
    age int
    )
    COMMENT "my first table"
    DISTRIBUTED BY HASH(id) BUCKETS 32
    PROPERTIES("replication_num" = "1");

    -- Insert data using S3
    insert into ${testTable} (id,name,age)
    select cast (id as INT) as id, name, cast (age as INT) as age
    from s3(
    "uri" = "${uri}",
    "s3.access_key"= "${ak}",
    "s3.secret_key" = "${sk}",
    "s3.endpoint" = "http://127.0.0.1:9312",
    "s3.region" = "us-east-1",
    "format" = "${format}",
    "strip_outer_array" = "true",
    "read_json_by_line" = "true",
    "use_path_style" = "true");