TOKENIZE
Description
The TOKENIZE function tokenizes a string using a specified analyzer and returns the result as a string containing a JSON array of tokens. It is particularly useful for checking how text will be analyzed by an inverted index used for full-text search.
Syntax
VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
Parameters
- str: The input string to be tokenized. Type: VARCHAR
- properties: A property string specifying the analyzer configuration. Type: VARCHAR
The properties parameter supports the following key-value pairs (format: "key1"="value1", "key2"="value2"):
Common Properties
| Property | Description | Example Values |
|---|---|---|
| built_in_analyzer | Built-in analyzer type | "english", "chinese", "unicode", "icu", "basic", "ik", "standard", "none" |
| analyzer | Custom analyzer name (created via CREATE INVERTED INDEX ANALYZER) | "my_custom_analyzer" |
| parser_mode | Parser mode (for chinese analyzers) | "fine_grained", "coarse_grained" |
| support_phrase | Enable phrase support (stores position information) | "true", "false" |
| lower_case | Convert tokens to lowercase | "true", "false" |
| char_filter_type | Character filter type | Varies by filter |
| stop_words | Stop words configuration | Varies by implementation |
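As a rough sketch of the key-value format described above, the queries below pass more than one property in a single string. They use only property names and values listed in the table. Which combinations are accepted, and what they return, may depend on the deployed version, so no output is shown here.

```sql
-- Two comma-separated key-value pairs inside one properties string.
-- parser_mode is documented above as applying to chinese analyzers.
SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese", "parser_mode"="fine_grained"');

-- Illustrative combination (an assumption, not taken from the examples in
-- this page): ask the english analyzer to keep the original letter case.
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="english", "lower_case"="false"');
```

The outer single quotes delimit the SQL string literal, so the double quotes around each key and value can be written without escaping.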
Return Value
Returns a VARCHAR containing a JSON array of tokenization results. Each element in the array is an object with the following structure:
- token: The tokenized term
- position: (Optional) The position index of the token when support_phrase is enabled
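Because the return value is a JSON array held in a VARCHAR, individual fields can in principle be read with JSON functions. The sketch below assumes that JSON_EXTRACT and JSON_LENGTH are available in your deployment and accept the VARCHAR returned by TOKENIZE. The alias names (t, tokens, token_count, first_token) are made up for illustration.

```sql
-- Wrap the TOKENIZE call in a derived table so the result can be reused.
SELECT
    JSON_LENGTH(t.tokens)                AS token_count,  -- number of array elements
    JSON_EXTRACT(t.tokens, '$[0].token') AS first_token   -- "token" field of the first element
FROM (
    SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"') AS tokens
) t;
```

If the implicit conversion from VARCHAR to JSON is not accepted, an explicit cast of t.tokens may be needed.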
Examples
Example 1: Using built-in analyzers
-- Using the standard analyzer
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
[{ "token": "hello" }, { "token": "world" }]
-- Using the english analyzer
SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
[{ "token": "run" }, { "token": "quick" }]
-- Using the unicode analyzer with Chinese text
SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" }, { "token": "库" }]
-- Using the chinese analyzer
SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
-- Using the icu analyzer for multilingual text
SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
[{ "token": "hello" }, { "token": "world" }, { "token": "世界" }]
-- Using the basic analyzer
SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="basic"');
[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" }, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }]
-- Using the ik analyzer for Chinese text
SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
[{ "token": "中华人民共和国" }, { "token": "国歌" }]
Example 2: Using custom analyzers
First, create a custom analyzer:
CREATE INVERTED INDEX ANALYZER lowercase_delimited
PROPERTIES (
"tokenizer" = "standard",
"token_filter" = "asciifolding, lowercase"
);
Then use it with TOKENIZE:
SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
[{ "token": "foo" }, { "token": "bar" }]
Example 3: With phrase support (position information)
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", "support_phrase"="true"');
[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
Notes
- Analyzer Configuration: The properties parameter must be a valid property string. If using a custom analyzer, it must be created beforehand using CREATE INVERTED INDEX ANALYZER.
- Supported Analyzers: Currently supported built-in analyzers include:
  - standard: Standard analyzer for general text
  - english: English language analyzer with stemming
  - chinese: Chinese text analyzer
  - unicode: Unicode-based analyzer for multilingual text
  - icu: ICU-based analyzer for advanced Unicode processing
  - basic: Basic tokenization
  - ik: IK analyzer for Chinese text
  - none: No tokenization (returns original string as single token)
- Performance: The TOKENIZE function is primarily intended for testing and debugging analyzer configurations. For production full-text search, use inverted indexes with the MATCH or SEARCH operators.
- JSON Output: The output is a formatted JSON string that can be further processed using JSON functions if needed.
- Compatibility with Inverted Indexes: The same analyzer configuration used in TOKENIZE can be applied to inverted indexes when creating tables:
  CREATE TABLE example (
      content TEXT,
      INDEX idx_content(content) USING INVERTED PROPERTIES("analyzer"="my_analyzer")
  )
- Testing Analyzer Behavior: Use TOKENIZE to preview how text will be tokenized before creating inverted indexes, helping to choose the most appropriate analyzer for your data.
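One possible way to preview analyzer behavior, as suggested in the notes above, is to run the same sample text through several analyzers in a single query and compare the columns. This is a minimal sketch: the column aliases are hypothetical, and the sample text should be replaced with text representative of your own data.

```sql
-- Compare how three built-in analyzers split the same input.
SELECT
    TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="standard"') AS standard_tokens,
    TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="basic"')    AS basic_tokens,
    TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="unicode"')  AS unicode_tokens;
```

Once one analyzer produces the tokens you expect, the same setting can be carried over to the index definition.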
Keywords
TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, ANALYZER