
Custom Analyzer

A Custom Analyzer combines a character filter (char_filter), a tokenizer, and a token filter (token_filter) so that you can precisely define how text is split into searchable terms based on business needs. It overcomes the limitations of built-in analyzers and directly affects search relevance and the accuracy of data analysis.

Custom analyzer diagram

Applicable Scenarios

Scenario | Recommended Combination
Multilingual mixed-text search | icu tokenizer + icu_normalizer filter
Phone number / code prefix matching | edge_ngram tokenizer
Splitting complex English text (camel case, hyphens, etc.) | standard tokenizer + word_delimiter filter
Auto-completion / input suggestions | edge_ngram tokenizer + lowercase filter
Exact match (preserve original term) | keyword tokenizer + lowercase / asciifolding filter

Core Components

Custom analysis involves four kinds of objects. The first three are applied to the original text in order, and the analyzer assembles them into a single pipeline:

Original text ──► [char_filter] ──► [tokenizer] ──► [token_filter] ──► Terms

Component | Purpose | Count
char_filter | Pre-processes characters before tokenization (such as replacement and normalization) | 0 ~ N
tokenizer | Splits text into terms | 1
token_filter | Processes the resulting terms (such as lowercasing or ASCII folding) | 0 ~ N
analyzer | Assembles the components above into a complete tokenization pipeline | 1

Creating a Custom Analyzer

1. Character Filter (char_filter)

CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS x_char_filter
PROPERTIES (
"type" = "char_replace"
-- See below for other parameters
);

Supported character filter types:

  • char_replace: replaces specified characters with target characters before tokenization.

    • char_filter_pattern: list of characters to replace
    • char_filter_replacement: replacement character (default: space)
  • icu_normalizer: pre-processes text using ICU normalization.

    • name: normalization form (default nfkc_cf). Options: nfc, nfkc, nfkc_cf, nfd, nfkd
    • mode: normalization mode (default compose). Options: compose, decompose
    • unicode_set_filter: specifies the character set to normalize (such as [a-z])
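
As an illustration, the following sketch defines a char_replace filter that turns dots and underscores into spaces before tokenization. The filter name is hypothetical, and writing the character list as one plain string ("._") is an assumption; only the property keys come from the parameter list above.

```sql
-- Hypothetical filter name: replaces '.' and '_' with a space before tokenization.
-- Assumption: char_filter_pattern takes the characters to replace as one plain string.
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS dot_underscore_to_space
PROPERTIES (
"type" = "char_replace",
"char_filter_pattern" = "._",
"char_filter_replacement" = " "
);
```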

2. Tokenizer

CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS x_tokenizer
PROPERTIES (
"type" = "standard"
);

Supported tokenizer types:

Type | Description | Main Parameters
standard | Standard tokenization (follows Unicode text segmentation), suitable for most languages | None
ngram | Splits by N-grams | min_gram, max_gram, token_chars
edge_ngram | Generates N-grams starting from the beginning of the word | min_gram, max_gram, token_chars
keyword | Outputs the entire text as a single term, often combined with token_filter | None
char_group | Splits by the given characters | tokenize_on_chars
basic | Simple English / digit / Chinese / Unicode tokenization | extra_chars
icu | ICU internationalized tokenization, supports complex scripts in multiple languages | None

Parameter descriptions:

  • min_gram: minimum N-gram length (default 1)
  • max_gram: maximum N-gram length (default 2)
  • token_chars: character categories to keep (default: keep all). Options: letter, digit, whitespace, punctuation, symbol
  • tokenize_on_chars: a character list or category. Categories support whitespace, letter, digit, punctuation, symbol, cjk
  • extra_chars: additional ASCII characters to split on (such as []().)
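
The sketch below shows how these parameters might be combined for an ngram tokenizer that emits 2- and 3-character fragments from runs of letters. The object names are hypothetical, and the property keys follow the spelling used in the complete examples of this document (min_gram, max_gram).

```sql
-- Hypothetical names. Emits fragments of length 2 to 3 taken from runs of letters.
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS bigram_trigram_tokenizer
PROPERTIES (
"type" = "ngram",
"min_gram" = "2",
"max_gram" = "3",
"token_chars" = "letter"
);

-- An analyzer is needed so that the tokenizer can be exercised through tokenize().
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS bigram_trigram_analyzer
PROPERTIES ("tokenizer" = "bigram_trigram_tokenizer");

-- Inspect the generated terms for a short input.
SELECT tokenize('doris', '"analyzer"="bigram_trigram_analyzer"');
```

Keeping max_gram close to min_gram limits the number of terms produced per word, which matters because every extra term is stored in the index.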

3. Token Filter

CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS x_token_filter
PROPERTIES (
"type" = "word_delimiter"
);

Supported token filter types:

Type | Purpose
word_delimiter | Splits at non-alphanumeric characters and can perform normalization
ascii_folding | Maps non-ASCII characters to their ASCII equivalents
lowercase | Converts token text to lowercase
icu_normalizer | Processes tokens using ICU normalization

word_delimiter Details

Default behavior:

  1. Uses non-alphanumeric characters as delimiters (for example: Super-Duper → Super, Duper)
  2. Removes leading and trailing delimiters from a token (for example: XL---42+'Autocoder' → XL, 42, Autocoder)
  3. Splits at case transitions (for example: PowerShot → Power, Shot)
  4. Splits at the boundary between letters and digits (for example: XL500 → XL, 500)
  5. Removes the English possessive 's (for example: Neil's → Neil)
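
One way to observe these defaults is to wrap an unconfigured word_delimiter filter in an analyzer and run it through tokenize. This is a minimal sketch with hypothetical object names; it relies only on statements that appear elsewhere on this page.

```sql
-- Hypothetical names. A word_delimiter filter that keeps every default.
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS default_word_delimiter
PROPERTIES ("type" = "word_delimiter");

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS word_delimiter_check
PROPERTIES (
"tokenizer" = "standard",
"token_filter" = "default_word_delimiter"
);

-- PowerShot exercises rule 3 (case transition);
-- XL500 exercises rule 4 (letter/digit boundary).
SELECT tokenize('PowerShot XL500', '"analyzer"="word_delimiter_check"');
```

Re-running the same query after setting split_on_case_change or split_on_numerics to "false" on a second filter makes the effect of each option easy to compare.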

Optional parameters:

Parameter | Default | Description
generate_number_parts | true | Whether to output the numeric parts
generate_word_parts | true | Whether to output the word parts
protected_words | None | Protected words that are not split
split_on_case_change | true | Whether to split at case changes
split_on_numerics | true | Whether to split at the boundary between letters and digits
stem_english_possessive | true | Whether to remove the English possessive 's
type_table | None | Custom character type mapping table

type_table supports the following mapping types:

  • ALPHA (letter)
  • ALPHANUM (alphanumeric)
  • DIGIT (digit)
  • LOWER (lowercase letter)
  • SUBWORD_DELIM (non-alphanumeric delimiter)
  • UPPER (uppercase letter)

Example: ["+ => ALPHA", "- => ALPHA"] treats + and - as letters so that they are not used as split points.

icu_normalizer Parameters

  • name: normalization form (default nfkc_cf). Options: nfc, nfkc, nfkc_cf, nfd, nfkd
  • unicode_set_filter: specifies the character set to normalize

4. Analyzer

Assembles the components above into a complete tokenization pipeline:

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS x_analyzer
PROPERTIES (
"tokenizer" = "x_tokenizer", -- A single tokenizer
"token_filter" = "x_filter1, x_filter2" -- One or more token_filters, executed in order
);

Using a Custom Analyzer in a Table

Reference a created custom analyzer through the analyzer field in the index PROPERTIES:

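CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("analyzer" = "x_analyzer", "support_phrase" = "true")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);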

Notes:

  1. A custom analyzer is set in the index PROPERTIES through analyzer.
  2. The only property that can be used together with analyzer is support_phrase.
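
Because the index sets support_phrase to true, phrase queries on ch can be served by it. The following is a minimal sketch, assuming the table tbl from the statement above has been created. The sample rows are made up, and MATCH_PHRASE is Doris's phrase-match operator, which this page does not otherwise cover.

```sql
-- Illustrative rows; column a is filled in by AUTO_INCREMENT.
INSERT INTO tbl (ch) VALUES
('quick brown fox'),
('brown quick fox');

-- Phrase query: the terms must appear next to each other in this order.
SELECT a, ch FROM tbl WHERE ch MATCH_PHRASE 'quick brown';
```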

Multiple Analyzer Indexes on a Single Column

Doris supports creating multiple inverted indexes that use different analyzers on the same column, so that the same data can be retrieved with different tokenization strategies.

Use Cases

  • Multilingual support: use tokenizers for different languages on the same text column.
  • Balancing search precision and recall: use a keyword tokenizer for exact matching and a standard tokenizer for fuzzy search.
  • Auto-completion: use an edge_ngram tokenizer for prefix matching and a standard tokenizer for regular search.

Creating Multiple Indexes

-- 1. Create analyzers with different tokenization strategies
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");

CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
PROPERTIES (
"type" = "edge_ngram",
"min_gram" = "1",
"max_gram" = "20",
"token_chars" = "letter"
);

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");

-- 2. Create multiple indexes on the same column
CREATE TABLE articles (
id INT,
content TEXT,
-- Standard tokenizer for tokenized search
INDEX idx_content_std (content) USING INVERTED
PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
-- Keyword tokenizer for exact matching
INDEX idx_content_kw (content) USING INVERTED
PROPERTIES("analyzer" = "kw_analyzer"),
-- edge n-gram tokenizer for auto-completion
INDEX idx_content_ngram (content) USING INVERTED
PROPERTIES("analyzer" = "ngram_analyzer")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

Specifying an Analyzer at Query Time

Use the USING ANALYZER clause to specify which index to use:

-- Insert test data
INSERT INTO articles VALUES
(1, 'hello world'),
(2, 'hello'),
(3, 'world'),
(4, 'hello world test');

-- Tokenized search: matches rows containing the term 'hello'
-- Returns: 1, 2, 4
SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;

-- Exact match: matches only the exact 'hello' string
-- Returns: 2
SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;

-- Use edge n-gram for prefix matching
-- Returns: 1, 2, 4 (all rows starting with 'hel')
SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;

You can also use built-in analyzers directly:

SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
SELECT * FROM articles WHERE content MATCH 'Hello' USING ANALYZER chinese;
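
To see why the same query text returns different rows under different analyzers, it can help to compare how each analyzer splits one input. This small sketch uses the tokenize function shown in the complete examples, together with the three analyzers created earlier in this section. No output is listed here; run the statements and compare the term lists.

```sql
-- Same input, three analyzers defined in "Creating Multiple Indexes".
-- std_analyzer uses the standard tokenizer, kw_analyzer keeps the whole text
-- as one term, and ngram_analyzer emits prefixes of each run of letters.
SELECT tokenize('hello world', '"analyzer"="std_analyzer"');
SELECT tokenize('hello world', '"analyzer"="kw_analyzer"');
SELECT tokenize('hello world', '"analyzer"="ngram_analyzer"');
```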

Adding an Index to an Existing Table

-- Add a new index that uses a different tokenizer
ALTER TABLE articles ADD INDEX idx_content_chinese (content)
USING INVERTED PROPERTIES("parser" = "chinese");

-- Wait for the schema change to complete
SHOW ALTER TABLE COLUMN WHERE TableName='articles';

Building an Index

After adding an index, you must build the index for existing data:

-- Build a specific index (non-cloud mode)
BUILD INDEX idx_content_chinese ON articles;

-- Build all indexes (cloud mode)
BUILD INDEX ON articles;

-- Check the build progress
SHOW BUILD INDEX WHERE TableName='articles';

Key Notes

  1. Analyzer identity: two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes on the same column that share the same analyzer identity.
  2. Index selection behavior:
    • When USING ANALYZER is specified, the index for the specified analyzer is used if it exists and has been built.
    • If the index is not built, the query falls back to the non-index path (results are correct, but the query is slower).
    • When USING ANALYZER is not specified, any available index may be used.
  3. Performance considerations:
    • Each additional index increases storage space and write overhead.
    • Choose analyzers based on your actual query patterns.
    • If your query patterns are predictable, consider using fewer indexes.

Management and Maintenance

View

SHOW INVERTED INDEX TOKENIZER;
SHOW INVERTED INDEX TOKEN_FILTER;
SHOW INVERTED INDEX ANALYZER;

Delete

DROP INVERTED INDEX TOKENIZER IF EXISTS x_tokenizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS x_token_filter;
DROP INVERTED INDEX ANALYZER IF EXISTS x_analyzer;

Complete Examples

Example 1: Phone Number Prefix Matching

Use edge_ngram to generate all prefix fragments of a phone number, enabling search-as-you-type.

CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_phone_number_tokenizer
PROPERTIES
(
"type" = "edge_ngram",
"min_gram" = "3",
"max_gram" = "10",
"token_chars" = "digit"
);

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS edge_ngram_phone_number
PROPERTIES
(
"tokenizer" = "edge_ngram_phone_number_tokenizer"
);

CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "edge_ngram_phone_number")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

SELECT tokenize('13891972631', '"analyzer"="edge_ngram_phone_number"');

Result:

[
{"token":"138"},
{"token":"1389"},
{"token":"13891"},
{"token":"138919"},
{"token":"1389197"},
{"token":"13891972"},
{"token":"138919726"},
{"token":"1389197263"}
]
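
A prefix lookup against this table might then look like the sketch below. The sample rows are made up. The sketch assumes that the query string is analyzed with the same analyzer as the index, so '1389' is expected to become the terms 138 and 1389; MATCH_ALL (a Doris operator not otherwise covered on this page) is used so that a row must contain every query term rather than any one of them.

```sql
-- Illustrative rows; column a is filled in by AUTO_INCREMENT.
INSERT INTO tbl (ch) VALUES
('13891972631'),
('13812345678');

-- Assumption: the query text is split into the prefixes 138 and 1389.
-- MATCH_ALL requires both, which narrows the match to numbers starting with 1389.
SELECT a, ch FROM tbl WHERE ch MATCH_ALL '1389';
```

Note that min_gram is 3 in this example, so a query shorter than three digits would produce no terms to match.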

Example 2: Fine-Grained Tokenization for Complex English Text

Use the standard tokenizer together with word_delimiter for finer-grained tokenization, plus case normalization and ASCII folding.

CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS word_splitter
PROPERTIES
(
"type" = "word_delimiter",
"split_on_numerics" = "false",
"split_on_case_change" = "false"
);

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS lowercase_delimited
PROPERTIES
(
"tokenizer" = "standard",
"token_filter" = "asciifolding, word_splitter, lowercase"
);

CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "lowercase_delimited")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

SELECT tokenize('The server at IP 192.168.1.15 sent a confirmation to user_123@example.com, requiring a quickResponse before the deadline.', '"analyzer"="lowercase_delimited"');

Result:

[
{"token":"the"},
{"token":"server"},
{"token":"at"},
{"token":"ip"},
{"token":"192"},
{"token":"168"},
{"token":"1"},
{"token":"15"},
{"token":"sent"},
{"token":"a"},
{"token":"confirmation"},
{"token":"to"},
{"token":"user"},
{"token":"123"},
{"token":"example"},
{"token":"com"},
{"token":"requiring"},
{"token":"a"},
{"token":"quickresponse"},
{"token":"before"},
{"token":"the"},
{"token":"deadline"}
]

Example 3: Exact Match with Original Term Preserved

Use the keyword tokenizer to keep the entire text intact, then apply lowercase and asciifolding for normalization. This is commonly used for case-insensitive exact matching of strings.

CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES
(
"tokenizer" = "keyword",
"token_filter" = "asciifolding, lowercase"
);

CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "keyword_lowercase")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

SELECT tokenize('hÉllo World', '"analyzer"="keyword_lowercase"');

Result:

[
{"token":"hello world"}
]

Limitations

  1. The type and parameters of a tokenizer or token_filter must be among those currently supported; otherwise, table creation fails.
  2. An analyzer can be dropped only when no table is using it.
  3. A tokenizer or token_filter can be dropped only when no analyzer is using it.
  4. Custom analyzer DDL takes about 10 seconds to synchronize to the BE after execution. Wait for synchronization to finish before importing data so that imports do not produce errors.
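
Limitations 2 and 3 imply a teardown order: table (or index) first, then the analyzer, then its components. This sketch uses the objects from Example 1 and assumes that no other table or analyzer references them.

```sql
-- 1. Remove the table that uses the analyzer. This deletes its data as well.
DROP TABLE IF EXISTS tbl;

-- 2. No table references the analyzer any more, so it can be dropped.
DROP INVERTED INDEX ANALYZER IF EXISTS edge_ngram_phone_number;

-- 3. No analyzer references the tokenizer any more, so it can be dropped.
DROP INVERTED INDEX TOKENIZER IF EXISTS edge_ngram_phone_number_tokenizer;
```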

Notes

  1. Nesting multiple components in a custom analyzer may degrade tokenization performance.
  2. The tokenize function (SELECT tokenize(...)) supports custom analyzers and can be used to debug tokenization results.
  3. An index can use either a predefined built_in_analyzer or a custom analyzer, but not both.