Inverted Index (Text Search)

Text search retrieves documents that contain specific terms or phrases from a dataset and ranks the results by relevance.

Retrieval method	Strength	Applicable scenarios
Text search	"Find precisely": controllable, explainable exact matching that ensures deterministic keyword hits and filter conditions	Keyword search, phrase matching, boolean filtering
Vector search	"Find broadly": uses semantic similarity to expand the recall range	Semantic search, approximate matching

In generative AI applications, especially in retrieval-augmented generation (RAG) scenarios, text search and vector search complement each other:

Balance semantic breadth with lexical precision
Improve recall while ensuring accuracy and explainability of results
Together build a reliable retrieval foundation that provides more accurate and relevant context for large language models

Evolution of text search in Doris

Starting from version 2.0.0, Doris introduced the inverted index to support high-performance full-text search. As retrieval scenarios diversified and query complexity increased, Doris has continued to expand its text search capabilities in subsequent releases.

Stage	Version	Key capabilities
Foundation stage	2.0+	Introduced column-level inverted indexes; provided basic full-text search operators (`MATCH_ANY`, `MATCH_ALL`) and multi-language tokenizers, supporting efficient keyword search on large-scale datasets
Feature expansion	2.x to 3.x	Enhanced the operator system, added advanced operators such as phrase matching (`MATCH_PHRASE`), prefix search (`MATCH_PHRASE_PREFIX`), and regex matching (`MATCH_REGEXP`); version 3.1 introduced custom tokenization
Capability enhancement	4.0+	Introduced BM25 relevance scoring and the unified query entry point `SEARCH` function, supporting text relevance ranking and hybrid ranking

The core enhancements in 4.0+ include:

BM25 relevance scoring: The score() function ranks results by text relevance and can be combined with vector similarity scores to enable hybrid ranking.
SEARCH function: Provides a unified query DSL that supports cross-column queries and boolean logic combinations, simplifying the construction of complex queries while further improving query performance.

Core text search features in Doris

1. Rich text operators

Doris provides a set of full-text search operators that cover multiple retrieval patterns, satisfying needs ranging from basic keyword matching to complex phrase queries.

Main operators supported in the current version:

Operator	Description	Typical scenarios
`MATCH_ANY` / `MATCH_ALL`	Any-term match (OR) and all-term match (AND)	General keyword search
`MATCH_PHRASE`	Exact phrase match, with support for custom slop and order control	Proximity word queries
`MATCH_PHRASE_PREFIX`	Phrase prefix match	Auto-completion, incremental search
`MATCH_REGEXP`	Matching based on regular expressions	Pattern-based text retrieval

These operators can be used independently or combined through the SEARCH() function to build complex logical queries. For example (the docs table referenced here is the one created in the Quick start section below):

-- Exact phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'inverted index';

-- Prefix search
SELECT * FROM docs WHERE content MATCH_PHRASE_PREFIX 'data ware';

View all operators

2. Custom tokenization (3.1+)

In text search, the tokenization method directly determines retrieval precision and recall. Starting from version 3.1, Doris supports custom analyzers, allowing you to flexibly define the tokenization pipeline based on business needs.

Custom tokenization achieves fine-grained text control by combining the following three types of components:

Character filter (char_filter): Replaces, removes, or normalizes symbols before tokenization
Tokenizer (tokenizer): Selects the tokenization algorithm. Supports types such as standard, ngram, edge_ngram, keyword, and icu for processing text in different languages and structures
Token filter (token_filter): For example, lowercase, word_delimiter, and ascii_folding, used to normalize and refine tokenization results

-- Example: Define a custom analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES (
    "tokenizer" = "keyword",
    "token_filter" = "asciifolding, lowercase"
);

-- Use the custom analyzer when creating a table
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED PROPERTIES(
        "analyzer" = "keyword_lowercase",
        "support_phrase" = "true"
    )
);

Learn about custom tokenization

3. BM25 relevance scoring (4.0+)

Doris implements the BM25 (Best Matching 25) algorithm for text relevance computation, providing ranking and scoring capabilities for full-text search.

Core characteristics of BM25:

A probabilistic model based on term frequency (TF), inverse document frequency (IDF), and document length
Robust for both short and long texts
Weighting strategy can be tuned through the k1 and b parameters

SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time OLAP analytics'
ORDER BY relevance DESC
LIMIT 10;

Learn more about the scoring mechanism

4. SEARCH function: unified query entry point (4.0+)

The SEARCH() function provides a unified syntax entry point for text retrieval, supporting multi-column search and boolean logic combinations, which makes complex queries more concise to express:

SELECT id, title, score() AS relevance
FROM docs
WHERE SEARCH('title:Machine AND tags:ANY(database sql)')
ORDER BY relevance DESC
LIMIT 20;

Complete SEARCH function guide

Quick start

Step 1: Create a table with inverted indexes

CREATE TABLE docs (
    id BIGINT,
    title STRING,
    content STRING,
    category STRING,
    tags ARRAY<STRING>,
    created_at DATETIME,
    -- Text search indexes
    INDEX idx_title(title) USING INVERTED PROPERTIES ("parser" = "chinese"),
    INDEX idx_content(content) USING INVERTED PROPERTIES ("parser" = "chinese", "support_phrase" = "true"),
    INDEX idx_category(category) USING INVERTED,
    INDEX idx_tags(tags) USING INVERTED
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

Step 2: Run text queries

-- Simple keyword search
SELECT * FROM docs WHERE content MATCH_ANY 'apache doris';

-- Phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'full-text search';

-- Boolean query with SEARCH
SELECT * FROM docs
WHERE SEARCH('title:apache AND (category:database OR tags:ANY(sql nosql))');

-- Relevance-based ranking
SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time analytics OLAP'
ORDER BY relevance DESC
LIMIT 10;

Hybrid search: text + vector

In RAG applications, combining text search with vector similarity enables more comprehensive retrieval:

-- Hybrid retrieval: semantic similarity + keyword filtering
SELECT id, title, score() AS text_relevance
FROM docs
WHERE
    -- Vector filter for semantic similarity
    cosine_distance(embedding, [0.1, 0.2, ...]) < 0.3
    -- Text filter for keyword constraints
    AND SEARCH('title:search AND content:engine AND category:technology')
ORDER BY text_relevance DESC
LIMIT 10;

Managing inverted indexes

Create an index

-- Create at table creation time
CREATE TABLE t (
    content STRING,
    INDEX idx(content) USING INVERTED PROPERTIES ("parser" = "chinese")
);

-- Create on an existing table
CREATE INDEX idx_content ON docs(content) USING INVERTED PROPERTIES ("parser" = "chinese");

-- Build the index for existing data
BUILD INDEX idx_content ON docs;

Drop an index

DROP INDEX idx_content ON docs;

View indexes

SHOW CREATE TABLE docs;
SHOW INDEX FROM docs;

Inverted Index (Text Search)

Evolution of text search in Doris

Core text search features in Doris

1. Rich text operators

2. Custom tokenization (3.1+)

3. BM25 relevance scoring (4.0+)

4. SEARCH function: unified query entry point (4.0+)

Quick start

Step 1: Create a table with inverted indexes

Step 2: Run text queries

Hybrid search: text + vector

Managing inverted indexes

Create an index

Drop an index

View indexes

Further reading

Core documentation

Advanced topics

Evolution of text search in Doris​

Core text search features in Doris​

1. Rich text operators​

2. Custom tokenization (3.1+)​

3. BM25 relevance scoring (4.0+)​

4. SEARCH function: unified query entry point (4.0+)​

Quick start​

Step 1: Create a table with inverted indexes​

Step 2: Run text queries​

Hybrid search: text + vector​

Managing inverted indexes​

Create an index​

Drop an index​

View indexes​

Further reading​

Core documentation​

Advanced topics​

Evolution of text search in Doris

Core text search features in Doris

1. Rich text operators

2. Custom tokenization (3.1+)

3. BM25 relevance scoring (4.0+)

4. SEARCH function: unified query entry point (4.0+)

Quick start

Step 1: Create a table with inverted indexes

Step 2: Run text queries

Hybrid search: text + vector

Managing inverted indexes

Create an index

Drop an index

View indexes

Further reading

Core documentation

Advanced topics