Vector Index Practical Guide
This document is intended for users who need to deploy vector retrieval (ANN) in Apache Doris. It provides a complete operational path from table design to query tuning and troubleshooting. If you are evaluating how to migrate semantic search, RAG, or recommendation recall to Doris, you can follow the steps in this document directly.
Quick Navigation
| What you want to do | Section |
|---|---|
| Confirm whether the Doris version and table model meet the requirements | Prerequisites and Limitations |
| Choose between HNSW and IVF index | Applicable Scenarios and Index Selection |
| Run the full table creation -> ingestion -> query workflow | End-to-End Operational Workflow |
| Sort by cosine similarity | Using Cosine Similarity |
| Increase recall / reduce latency | Query and Build Tuning |
| Troubleshoot index not taking effect / low recall / ingestion failures | Common Troubleshooting |
Applicable Scenarios and Index Selection
Starting from Apache Doris 4.x, ANN (Approximate Nearest Neighbor) vector indexes are supported. Common deployment scenarios include:
- Semantic search
- RAG retrieval augmentation
- Recommendation system recall
- Image or multimodal retrieval
- Anomaly detection
Index Type Comparison
| Index type | Recall | Online query performance | Build speed | Memory usage | Applicable scenario |
|---|---|---|---|---|---|
hnsw | High | Good | Slow | Higher | Online low-latency retrieval |
ivf | Medium | Better | Fast | More efficient | Large-scale datasets |
ivf_on_disk | Medium | Medium | Fast | Most efficient | Ultra-large scale, memory-constrained |
Supported Distance Functions
| Function | Sort direction | Description |
|---|---|---|
l2_distance_approximate | ORDER BY ... ASC | Euclidean distance, smaller distance means more similar |
inner_product_approximate | ORDER BY ... DESC | Inner product, larger value means more similar |
Cosine similarity cannot be configured directly via
metric_type="cosine". It must be implemented by normalizing the vectors and using inner product. For details, see Using Cosine Similarity.
Prerequisites and Limitations
Before using ANN indexes, confirm the following conditions:
| Check item | Requirement |
|---|---|
| Doris version | >= 4.0.0 |
| Table model | Only DUPLICATE KEY is supported |
| Vector column type | ARRAY<FLOAT> NOT NULL |
| Dimension consistency | The dimension of ingested vectors must match the index dim |
Minimal table creation example:
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
End-to-End Operational Workflow
The complete workflow consists of 4 steps: create table -> configure index -> ingest data -> build and monitor index.
Step 1: Create the Vector Table
There are two ways to create the table. Choose based on the data scale and ingestion mode:
| Method | Pros | Cons | Recommended scenario |
|---|---|---|---|
| Define the ANN index directly when creating the table | Queryable as soon as data is written | Slower ingestion | Small scale, streaming ingestion |
Create the table and ingest data first, then CREATE INDEX + BUILD INDEX | Faster ingestion, controllable build timing | Requires an extra build step | Large-scale batch ingestion |
Example of defining an ANN index directly when creating the table:
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
title VARCHAR(500),
content TEXT,
category VARCHAR(100),
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_embedding (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768"
)
)
ENGINE = OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
Step 2: Configure Vector Index Parameters
Common parameters:
| Parameter | Values | Description |
|---|---|---|
index_type | hnsw / ivf / ivf_on_disk | Index type |
metric_type | l2_distance / inner_product | Distance metric |
dim | Integer | Vector dimension |
quantizer | flat / sq8 / sq4 / pq | Quantization method (optional) |
HNSW-specific parameters:
| Parameter | Default | Description |
|---|---|---|
max_degree | 32 | Maximum number of neighbors per node |
ef_construction | 40 | Search width during build |
IVF-specific parameters (shared by ivf and ivf_on_disk):
| Parameter | Default | Description |
|---|---|---|
nlist | 1024 | Number of cluster centroids |
Example of creating the index after the table:
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768",
"max_degree" = "64",
"ef_construction" = "128"
);
Step 3: Ingest Data
Recommended order for batch ingestion:
- Create the table, without building the index for now
- Batch-write the data (Stream Load / S3 TVF / SDK)
- Build the index uniformly after the data ingestion is complete
In production environments, this batch mode is preferred. It can significantly reduce ingestion time.
Step 4: Build the Index and Monitor
If the post-ingestion index creation method is used, you need to trigger it manually:
BUILD INDEX idx_embedding ON document_vectors;
SHOW BUILD INDEX WHERE TableName = "document_vectors";
Build states include: PENDING, RUNNING, FINISHED, CANCELLED.
Query Patterns
TopN Nearest Neighbor Search
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;
Range Search
SELECT id, title
FROM document_vectors
WHERE l2_distance_approximate(embedding, [0.1, 0.2, ...]) < 0.5;
Hybrid Search with Filter Conditions
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
WHERE category = 'AI'
ORDER BY dist
LIMIT 10;
In hybrid filtering scenarios, Doris uses a pre-filtering strategy, which balances both performance and recall.
Using Cosine Similarity
ANN indexes do not support configuring metric_type="cosine" directly. If your business needs to sort by cosine similarity, use the following pattern:
- Apply L2 normalization to vectors before ingestion (convert them to unit vectors)
- Use
metric_type="inner_product"when creating the ANN index - Use
inner_product_approximate(...)in queries, and sort byORDER BY ... DESC
Principle:
cos(x, y) = (x · y) / (||x|| · ||y||)- After normalization,
||x|| = ||y|| = 1, socos(x, y) = x · y
In a unit-vector space, cosine sorting is equivalent to inner product sorting.
Query and Build Tuning
Query Parameters
| Index type | Tuning parameter | Effect |
|---|---|---|
| HNSW | hnsw_ef_search | Larger value yields higher recall and higher latency |
| IVF | nprobe or ivf_nprobe (depending on version) | Larger value yields higher recall |
SET hnsw_ef_search = 100;
SET nprobe = 128;
SET optimize_index_scan_parallelism = true;
Build Recommendations
- For large-scale data, run compaction first, then trigger the final index build
- Control the segment scale to avoid impacting recall when segments are too large
- Run A/B benchmarks on multiple parameter sets against the same dataset
Capacity Estimation
- Rough vector memory formula:
dim * 4 bytes * row_count - Add the overhead of the ANN index structure on top of this
- Reserve a memory budget for non-vector columns and execution operators
For 10M / 100M scale capacity reference on single-node and distributed deployments, see Large-Scale Performance Test.
Index Management
Common management SQL:
-- View the index list
SHOW INDEX FROM document_vectors;
-- View data scale
SHOW DATA ALL FROM document_vectors;
-- Drop the index
ALTER TABLE document_vectors DROP INDEX idx_embedding;
To adjust index parameters, the recommended approach is to drop the old index and rebuild it.
Common Troubleshooting
Index Not Taking Effect
Investigate in this order:
- Whether the index exists: run
SHOW INDEX - Whether the index has finished building: run
SHOW BUILD INDEX - Whether the query uses a distance function with the
_approximatesuffix
Low Recall
| Investigation direction | Recommendation |
|---|---|
| HNSW parameters | Increase max_degree, ef_construction, hnsw_ef_search |
| IVF probe parameters | Increase nprobe / ivf_nprobe |
| Segment scale | Rebuild the index after compaction |
High Query Latency
| Investigation direction | Recommendation |
|---|---|
| Cold query vs. hot query | Index loading time differs. You can warm up after service startup |
hnsw_ef_search too large | Reduce it appropriately to lower latency |
| Parallel scan not enabled | Set optimize_index_scan_parallelism = true |
| BE memory pressure | Check BE memory levels and GC behavior |
Ingestion Failure
| Common cause | Recommendation |
|---|---|
| Dimension mismatch | Check that the ingested vector dimension matches the index dim |
| NULL appears in the vector column | Fill or filter out NULL on the business side |
| Invalid vector array format | Validate the JSON / Stream Load payload format |
FAQ
Q1: Can ANN indexes be used on UNIQUE KEY or AGGREGATE KEY tables?
No. ANN indexes only support the DUPLICATE KEY model.
Q2: Can ANN indexes and inverted indexes be created at the same time?
Yes. You can create both an ANN index and an inverted index on the same table. Combining text filtering with vector sorting enables the hybrid retrieval pattern that is common in online RAG.
Q3: What if I need to use cosine similarity?
ANN does not support metric_type="cosine". Normalize the vectors and use inner_product, and the effect is equivalent. For details, see Using Cosine Similarity.
Q4: What if BUILD INDEX is stuck in RUNNING?
Check the progress with SHOW BUILD INDEX. Building a large table itself takes a long time, so first confirm whether it is still building normally. If there is no progress for a long time, check the BE memory and disk status.
Q5: How do I adjust ANN index parameters?
ANN index parameters do not support in-place modification. The recommendation is to DROP INDEX first, then CREATE INDEX with the new parameters, and finally BUILD INDEX.