Custom Analyzer
Overview
Custom analyzers let you go beyond the limits of the built-in tokenizers by combining character filters, tokenizers, and token filters to suit specific needs. Because the analyzer controls how text is split into searchable terms, it directly affects search relevance and the accuracy of data analysis.

Using Custom Analyzers
Creating Components
1. Creating a char_filter
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS x_char_filter
PROPERTIES (
"type" = "char_replace"
-- configure pattern/replacement parameters as needed
);
char_replace replaces specified characters before tokenization.
- Parameters
  - char_filter_pattern: characters to replace
  - char_filter_replacement: replacement characters (default: space)
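The two parameters above can be set in the same PROPERTIES clause. The following is a minimal sketch, not a definitive configuration: the filter name dot_to_space_char_filter is hypothetical, and it assumes the pattern is read as a set of individual characters, each replaced by the replacement string.

```sql
-- Hypothetical char_filter: replace '.' and '_' with a space before tokenization.
-- Assumption: char_filter_pattern lists the individual characters to replace.
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS dot_to_space_char_filter
PROPERTIES (
    "type" = "char_replace",
    "char_filter_pattern" = "._",
    "char_filter_replacement" = " "
);
```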
2. Creating a tokenizer
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS x_tokenizer
PROPERTIES (
"type" = "standard"
);
Available tokenizers:
- standard: Grammar-based tokenization following Unicode text segmentation
- ngram: Generates N-grams of specified length
- edge_ngram: Generates N-grams anchored at word start
- keyword: No-op tokenizer that outputs entire input as single term
- char_group: Tokenizes on specified characters
- basic: Simple English, numbers, Chinese, Unicode tokenizer
- icu: International text segmentation supporting all languages
3. Creating a token_filter
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS x_token_filter
PROPERTIES (
"type" = "word_delimiter"
);
Available token filters:
- word_delimiter: Splits tokens at non-alphanumeric characters
- asciifolding: Converts non-ASCII characters to ASCII equivalents
- lowercase: Converts tokens to lowercase
4. Creating an analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS x_analyzer
PROPERTIES (
"tokenizer" = "x_tokenizer", -- single tokenizer
"token_filter" = "x_filter1, x_filter2" -- one or more token_filters, in order
);
Viewing Components
SHOW INVERTED INDEX TOKENIZER;
SHOW INVERTED INDEX TOKEN_FILTER;
SHOW INVERTED INDEX ANALYZER;
Deleting Components
DROP INVERTED INDEX TOKENIZER IF EXISTS x_tokenizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS x_token_filter;
DROP INVERTED INDEX ANALYZER IF EXISTS x_analyzer;
Using Custom Analyzers in Table Creation
Custom analyzers are specified using the analyzer parameter in index properties:
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("analyzer" = "x_analyzer", "support_phrase" = "true")
)
table_properties;
Usage Limitations
- The type and parameters of a tokenizer or token_filter must come from the supported list; otherwise table creation will fail
- An analyzer can only be deleted when no tables are using it
- Tokenizers and token_filters can only be deleted when no analyzers are using them
- After a custom analyzer is created, it takes 10 seconds to sync to the BE; data loading works normally only after the sync completes
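The deletion rules above imply a teardown order: analyzer first, then the components it references. A minimal sketch, assuming x_analyzer was built from x_tokenizer and x_token_filter and that no table still uses it:

```sql
-- 1. Drop the analyzer first; this succeeds only if no table references it.
DROP INVERTED INDEX ANALYZER IF EXISTS x_analyzer;

-- 2. With no analyzer depending on them, the components can be dropped.
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS x_token_filter;
DROP INVERTED INDEX TOKENIZER IF EXISTS x_tokenizer;
```

Running the statements in the reverse order would be expected to fail on the tokenizer and token_filter while the analyzer still exists.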
Notes
- Nesting multiple components in a custom analyzer may degrade tokenization performance
- The tokenize function supports custom analyzers
- Predefined tokenization uses built_in_analyzer and custom tokenization uses analyzer; only one of the two can be set
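One way to check what an analyzer produces before attaching it to an index is to call tokenize directly. This is a sketch under an assumption: that tokenize accepts the analyzer through the same quoted key/value properties string it uses for other index properties. The input text is illustrative, and x_analyzer is the placeholder analyzer created earlier.

```sql
-- Inspect the terms a custom analyzer emits for a sample string.
-- Assumption: the second argument is a properties string of the form '"key"="value"'.
SELECT tokenize('Hello-World 123', '"analyzer"="x_analyzer"');
```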
Complete Examples
Example 1: Phone Number Tokenization
Using edge_ngram for phone number tokenization:
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_phone_number_tokenizer
PROPERTIES
(
"type" = "edge_ngram",
"min_gram" = "3",
"max_gram" = "10",
"token_chars" = "digit"
);
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS edge_ngram_phone_number
PROPERTIES
(
"tokenizer" = "edge_ngram_phone_number_tokenizer"
);
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "edge_ngram_phone_number")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
Example 2: Fine-grained Tokenization
Using standard + word_delimiter for detailed tokenization:
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS word_splitter
PROPERTIES
(
"type" = "word_delimiter",
"split_on_numerics" = "false",
"split_on_case_change" = "false"
);
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS lowercase_delimited
PROPERTIES
(
"tokenizer" = "standard",
"token_filter" = "asciifolding, word_splitter, lowercase"
);
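To put the lowercase_delimited analyzer to use, it can be attached to an index in the same way as in Example 1. The sketch below mirrors that table definition; the table name tbl_delimited and the search term are hypothetical, and the query assumes the MATCH_ANY operator is applied to the indexed column.

```sql
-- Hypothetical table using the analyzer defined above (layout copied from Example 1).
CREATE TABLE tbl_delimited (
    `a` bigint NOT NULL AUTO_INCREMENT(1),
    `ch` text NULL,
    INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "lowercase_delimited")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);

-- Search the indexed column; the query text is expected to go through the same analyzer.
SELECT * FROM tbl_delimited WHERE `ch` MATCH_ANY 'wifi';
```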
Example 3: Keyword with Multiple Token Filters
Using keyword to preserve original terms with multiple token filters:
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES
(
"tokenizer" = "keyword",
"token_filter" = "asciifolding, lowercase"
);