TOKENIZE
Description
The TOKENIZE function tokenizes a string using a specified parser and returns the tokenization results as a string array. This function is particularly useful for testing and understanding how text will be analyzed when using inverted indexes with full-text search capabilities.
Syntax
ARRAY<VARCHAR> TOKENIZE(VARCHAR str, VARCHAR properties)
Parameters
- `str`: The input string to be tokenized. Type: VARCHAR
- `properties`: A property string specifying the parser configuration. Type: VARCHAR
The `properties` parameter supports the following key-value pairs, written either as `'key1'='value1', 'key2'='value2'` or as `"key1"="value1", "key2"="value2"`:
Supported Properties
| Property | Description | Example Values |
|---|---|---|
| `parser` | Built-in parser type | "chinese", "english", "unicode" |
| `parser_mode` | Parser mode for the Chinese parser | "fine_grained", "coarse_grained" |
| `char_filter_type` | Character filter type | "char_replace" |
| `char_filter_pattern` | Characters to be replaced (used with `char_filter_type`) | "._=:," |
| `char_filter_replacement` | Replacement character (used with `char_filter_type`) | " " (space) |
| `stopwords` | Stop words configuration | "none" |
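As a quick illustration of the property-string format, multiple properties are combined with commas and either quoting style is accepted. A minimal sketch based on the examples further below (the second statement simply combines the `unicode` parser with the `stopwords` option):

```sql
-- Single-quoted keys and values inside a double-quoted property string
SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 'parser_mode'='fine_grained'");

-- Double-quoted keys and values inside a single-quoted property string
SELECT TOKENIZE('Apache Doris数据库', '"parser"="unicode","stopwords"="none"');
```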
Return Value
Returns an ARRAY<VARCHAR> containing the tokenized strings as individual array elements.
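Because the result is a regular `ARRAY<VARCHAR>`, it can be passed to array functions. A minimal sketch, assuming standard Doris array functions such as `array_size` and `element_at` are available:

```sql
-- Count the tokens and pick out the first one
-- (array_size/element_at are assumed to be available; they are not part of TOKENIZE itself)
SELECT
    array_size(TOKENIZE('Apache Doris is fast', "'parser'='english'")) AS token_count,
    element_at(TOKENIZE('Apache Doris is fast', "'parser'='english'"), 1) AS first_token;
```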
Examples
Example 1: Using the Chinese parser
SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese'");
["我", "来到", "北京", "清华大学"]
Example 2: Chinese parser with fine-grained mode
SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 'parser_mode'='fine_grained'");
["我", "来到", "北京", "清华", "清华大学", "华大", "大学"]
Example 3: Using the Unicode parser
SELECT TOKENIZE('Apache Doris数据库', "'parser'='unicode'");
["apache", "doris", "数", "据", "库"]
Example 4: Using character filters
SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
'"parser"="unicode","char_filter_type" = "char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " "');
["get", "images", "hm", "bg", "jpg", "http", "1", "0", "test", "abc", "bcd"]
Example 5: Stopwords configuration
SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');
["华", "夏", "智", "胜", "新", "税", "股", "票"]
SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');
["华", "夏", "智", "胜", "新", "税", "股", "票", "a"]
Notes
- Parser Configuration: The `properties` parameter must be a valid property string. Only built-in parsers are supported in this version.
- Supported Parsers: Version 2.1 supports the following built-in parsers:
  - `chinese`: Chinese text parser with optional `parser_mode` (`fine_grained` or `coarse_grained`)
  - `english`: English language parser with stemming
  - `unicode`: Unicode-based parser for multilingual text
- Parser Mode: The `parser_mode` property is primarily used with the `chinese` parser:
  - `fine_grained`: produces more detailed tokens with overlapping segments
  - `coarse_grained`: the default mode, with standard segmentation
- Character Filters: Use `char_filter_type`, `char_filter_pattern`, and `char_filter_replacement` together to replace specific characters before tokenization.
- Performance: The `TOKENIZE` function is primarily intended for testing and debugging parser configurations. For production full-text search, use inverted indexes with the `MATCH` predicate.
- Compatibility with Inverted Indexes: The same parser configuration used in `TOKENIZE` can be applied to inverted indexes when creating tables, for example:

  CREATE TABLE example (
      content TEXT,
      INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="chinese")
  )

- Testing Parser Behavior: Use `TOKENIZE` to preview how text will be tokenized before creating inverted indexes, which helps in choosing the most appropriate parser for your data; see the combined sketch after this list.
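Putting the last two notes together, the sketch below previews tokenization, creates a table whose inverted index reuses the same parser configuration, and then queries it with a full-text predicate. The table name, column names, and distribution settings are hypothetical, and `MATCH_ANY` is assumed to be available as an inverted-index predicate:

```sql
-- 1. Preview how the parser will split the text before committing to an index
SELECT TOKENIZE('Apache Doris 数据库', "'parser'='unicode'");

-- 2. Hypothetical table: the inverted index reuses the same parser configuration
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED PROPERTIES("parser" = "unicode")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- 3. Full-text query that can use the inverted index (MATCH_ANY assumed to be supported)
SELECT id, content
FROM docs
WHERE content MATCH_ANY 'doris';
```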
Keywords
TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, PARSER