Inverted Index (Text Search)

Text search retrieves documents that contain specific terms or phrases from a dataset and ranks the results by relevance.

| Retrieval method | Strength | Applicable scenarios |
| --- | --- | --- |
| Text search | "Find precisely": controllable, explainable exact matching that ensures deterministic keyword hits and filter conditions | Keyword search, phrase matching, boolean filtering |
| Vector search | "Find broadly": uses semantic similarity to expand the recall range | Semantic search, approximate matching |

In generative AI applications, especially in retrieval-augmented generation (RAG) scenarios, text search and vector search complement each other:

  • Balance semantic breadth with lexical precision
  • Improve recall while ensuring accuracy and explainability of results
  • Together build a reliable retrieval foundation that provides more accurate and relevant context for large language models

Evolution of text search in Doris

Doris introduced the inverted index in version 2.0.0 to support high-performance full-text search. As retrieval scenarios have diversified and queries have grown more complex, subsequent releases have continued to expand its text search capabilities.

| Stage | Version | Key capabilities |
| --- | --- | --- |
| Foundation stage | 2.0+ | Introduced column-level inverted indexes; provided basic full-text search operators (MATCH_ANY, MATCH_ALL) and multi-language tokenizers for efficient keyword search on large-scale datasets |
| Feature expansion | 2.x to 3.x | Extended the operator system with advanced operators such as phrase matching (MATCH_PHRASE), prefix search (MATCH_PHRASE_PREFIX), and regex matching (MATCH_REGEXP); version 3.1 introduced custom tokenization |
| Capability enhancement | 4.0+ | Introduced BM25 relevance scoring and the SEARCH function as a unified query entry point, supporting text relevance ranking and hybrid ranking |

The core enhancements in 4.0+ include:

  • BM25 relevance scoring: The score() function ranks results by text relevance and can be combined with vector similarity scores to enable hybrid ranking.
  • SEARCH function: Provides a unified query DSL that supports cross-column queries and boolean logic combinations, simplifying the construction of complex queries while further improving query performance.

Core text search features in Doris

1. Rich text operators

Doris provides a set of full-text search operators that cover multiple retrieval patterns, satisfying needs ranging from basic keyword matching to complex phrase queries.

Main operators supported in the current version:

| Operator | Description | Typical scenarios |
| --- | --- | --- |
| MATCH_ANY / MATCH_ALL | Any-term match (OR) and all-term match (AND) | General keyword search |
| MATCH_PHRASE | Exact phrase match, with support for custom slop and order control | Proximity word queries |
| MATCH_PHRASE_PREFIX | Phrase prefix match | Auto-completion, incremental search |
| MATCH_REGEXP | Matching based on regular expressions | Pattern-based text retrieval |

These operators can be used independently or combined through the SEARCH() function to build complex logical queries. For example:

-- Exact phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'inverted index';

-- Prefix search
SELECT * FROM docs WHERE content MATCH_PHRASE_PREFIX 'data ware';
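
To make the operator semantics concrete, here is a minimal Python sketch of a positional inverted index with toy versions of any-term, all-term, and exact phrase matching. It illustrates the general technique under simplifying assumptions (whitespace tokenization, no slop, no scoring). It is not how Doris implements these operators, and every function name here is hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_any(index, query):
    """Documents containing at least one query term (OR semantics)."""
    hits = set()
    for term in tokenize(query):
        hits |= set(index.get(term, {}))
    return hits


def match_all(index, query):
    """Documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(set(index.get(t, {})) for t in terms))


def match_phrase(index, query):
    """Documents where the query terms appear consecutively and in order."""
    terms = tokenize(query)
    hits = set()
    for doc_id in match_all(index, query):
        starts = index[terms[0]][doc_id]
        if any(
            all(start + i in index[t][doc_id] for i, t in enumerate(terms))
            for start in starts
        ):
            hits.add(doc_id)
    return hits


docs = {
    1: "Doris builds an inverted index",
    2: "index structures can be inverted",
    3: "vector search",
}
index = build_index(docs)
```

With this sample data, `match_any` and `match_all` for `'inverted index'` both select documents 1 and 2, while `match_phrase` selects only document 1, because only there do the two terms appear next to each other in order.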

2. Custom tokenization (3.1+)

In text search, the tokenization method directly determines retrieval precision and recall. Starting from version 3.1, Doris supports custom analyzers, allowing you to flexibly define the tokenization pipeline based on business needs.

Custom tokenization achieves fine-grained text control by combining the following three types of components:

  • Character filter (char_filter): Replaces, removes, or normalizes symbols before tokenization
  • Tokenizer (tokenizer): Selects the tokenization algorithm. Supports types such as standard, ngram, edge_ngram, keyword, and icu for processing text in different languages and structures
  • Token filter (token_filter): Normalizes and refines the tokenizer output, for example with lowercase, word_delimiter, and asciifolding

-- Example: Define a custom analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES (
    "tokenizer" = "keyword",
    "token_filter" = "asciifolding, lowercase"
);

-- Use the custom analyzer when creating a table
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED PROPERTIES(
        "analyzer" = "keyword_lowercase",
        "support_phrase" = "true"
    )
);
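
The three-stage pipeline (character filters, then a tokenizer, then token filters) can be sketched in Python as follows. This is only an approximation of the idea: the function names are hypothetical, and the accent-stripping step handles only characters that decompose under Unicode NFKD, which is narrower than a full ASCII-folding filter.

```python
import unicodedata


def ascii_fold(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keyword_tokenizer(text):
    # Emit the whole input as a single token.
    return [text]


def whitespace_tokenizer(text):
    # Split on runs of whitespace.
    return text.split()


def analyze(text, char_filters, tokenizer, token_filters):
    """Apply char filters to the raw text, tokenize, then filter each token."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = [token_filter(t) for t in tokens]
    return tokens


# Rough analogue of the keyword_lowercase analyzer defined above.
keyword_lowercase = analyze(
    "Café Münster", [], keyword_tokenizer, [ascii_fold, str.lower]
)

# A pipeline with a character filter that turns underscores into spaces.
split_terms = analyze(
    "Real_Time Café",
    [lambda s: s.replace("_", " ")],
    whitespace_tokenizer,
    [ascii_fold, str.lower],
)
```

Here `keyword_lowercase` is the single token `cafe munster`, while `split_terms` is the list `real`, `time`, `cafe`. The same input text can therefore match very different queries depending on the pipeline chosen.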

3. BM25 relevance scoring (4.0+)

Doris implements the BM25 (Best Matching 25) algorithm for text relevance computation, providing ranking and scoring capabilities for full-text search.

Core characteristics of BM25:

  • A probabilistic model based on term frequency (TF), inverse document frequency (IDF), and document length
  • Robust for both short and long texts
  • The weighting strategy can be tuned through the k1 and b parameters

The following query returns the ten most relevant matches, ranked by BM25 score:

SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time OLAP analytics'
ORDER BY relevance DESC
LIMIT 10;
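
To show how term frequency, inverse document frequency, and document length interact, here is a small Python sketch of one common BM25 formulation. The source does not state which IDF variant or default parameters Doris uses, so the `log(1 + ...)` IDF form and the defaults `k1=1.2`, `b=0.75` are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score each document in a non-empty corpus against the query terms."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization; k1 controls TF saturation.
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "real-time olap analytics",
    "olap",
    "batch processing of logs",
]
scores = bm25_scores(corpus, ["olap", "analytics"])
```

In this sample, the first document matches both query terms and ranks highest, the second matches only the more common term, and the third matches nothing and scores zero. Setting `b=0` turns off length normalization, and a larger `k1` lets repeated occurrences of a term keep adding to the score for longer.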

4. SEARCH function: unified query entry point (4.0+)

The SEARCH() function provides a unified syntax entry point for text retrieval, supporting multi-column search and boolean logic combinations, which makes complex queries more concise to express:

SELECT id, title, score() AS relevance
FROM docs
WHERE SEARCH('title:Machine AND tags:ANY(database sql)')
ORDER BY relevance DESC
LIMIT 20;

Quick start

Step 1: Create a table with inverted indexes

CREATE TABLE docs (
    id BIGINT,
    title STRING,
    content STRING,
    category STRING,
    tags ARRAY<STRING>,
    created_at DATETIME,
    -- Text search indexes
    INDEX idx_title(title) USING INVERTED PROPERTIES ("parser" = "chinese"),
    INDEX idx_content(content) USING INVERTED PROPERTIES ("parser" = "chinese", "support_phrase" = "true"),
    INDEX idx_category(category) USING INVERTED,
    INDEX idx_tags(tags) USING INVERTED
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

Step 2: Run text queries

-- Simple keyword search
SELECT * FROM docs WHERE content MATCH_ANY 'apache doris';

-- Phrase search
SELECT * FROM docs WHERE content MATCH_PHRASE 'full-text search';

-- Boolean query with SEARCH
SELECT * FROM docs
WHERE SEARCH('title:apache AND (category:database OR tags:ANY(sql nosql))');

-- Relevance-based ranking
SELECT id, title, score() AS relevance
FROM docs
WHERE content MATCH_ANY 'real-time analytics OLAP'
ORDER BY relevance DESC
LIMIT 10;

Hybrid search: text + vector

In RAG applications, combining text search with vector similarity enables more comprehensive retrieval:

-- Hybrid retrieval: semantic similarity + keyword filtering
-- Assumes docs also has a vector column named embedding,
-- which the quick start table above does not define
SELECT id, title, score() AS text_relevance
FROM docs
WHERE
    -- Vector filter for semantic similarity (query vector elided)
    cosine_distance(embedding, [0.1, 0.2, ...]) < 0.3
    -- Text filter for keyword constraints
    AND SEARCH('title:search AND content:engine AND category:technology')
ORDER BY text_relevance DESC
LIMIT 10;
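
The query above uses the vector condition only as a filter and orders results by the text score. When both signals should influence the final order, one widely used option is reciprocal rank fusion (RRF), which merges ranked lists without having to put their scores on the same scale. RRF is not mentioned in the source and is shown here only as an illustrative, application-side sketch. The function name and the constant `k=60` (a conventional choice) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical result lists: IDs ordered by text score and by vector distance.
text_ranking = [3, 1, 2]
vector_ranking = [1, 4, 3]
merged = reciprocal_rank_fusion([text_ranking, vector_ranking])
```

Documents 1 and 3 appear in both lists and therefore rise to the top of `merged`, ahead of documents that only one retriever returned.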

Managing inverted indexes

Create an index

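-- Create at table creation time
-- (an id column is added because a STRING column cannot be the first, key column)
CREATE TABLE t (
    id BIGINT,
    content STRING,
    INDEX idx(content) USING INVERTED PROPERTIES ("parser" = "chinese")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;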

-- Create on an existing table
CREATE INDEX idx_content ON docs(content) USING INVERTED PROPERTIES ("parser" = "chinese");

-- Build the index for existing data
BUILD INDEX idx_content ON docs;

Drop an index

DROP INDEX idx_content ON docs;

View indexes

SHOW CREATE TABLE docs;
SHOW INDEX FROM docs;
