ANN Resource Estimation Guide

Vector search (ANN) workloads are usually constrained by memory and CPU first, not by disk capacity. This article provides a practical resource estimation method to help you plan the specifications of an Apache Doris vector search cluster before going live.

Estimation Overview

The general estimation order is:

  1. Estimate index memory: derive the resident index memory from data scale, index type, and quantization mode.
  2. Estimate CPU cores: start from an empirical core-to-memory ratio, then adjust the core count for the target QPS and latency.
  3. Reserve a safety margin: leave headroom for query execution, non-vector column access, and Compaction.

ANN Resource Characteristics

Compared with regular OLAP indexes, ANN has the following resource usage characteristics:

| Resource Dimension | Resource Characteristics |
| --- | --- |
| Build-stage CPU | High utilization. Heavy CPU pressure during ingestion. |
| Build-stage memory | When a Segment is too large, building a single index may fail due to insufficient memory. |
| Query-stage memory | High-performance queries usually require the index to stay resident in memory as much as possible. |
| Query-stage CPU | High-QPS scenarios place a clear demand on the number of CPU cores. |

Doris supports three quantization modes (sq8, sq4, and pq) to reduce memory usage. The usual trade-offs are:

  • Slower ingestion: extra encoding overhead.
  • Possibly slower queries: extra decoding or reconstruction overhead.
  • Possible recall drop: lossy encoding introduces error.

Estimation Input Checklist

Before starting the estimation, prepare the following inputs:

| Input | Description |
| --- | --- |
| Vector dimension D | The float dimension of a single vector, for example 768. |
| Total rows N | The total number of vectors to be indexed. |
| Index type | hnsw / ivf / ivf_on_disk |
| Quantization mode | flat / sq8 / sq4 / pq |
| max_degree | HNSW only. Controls the number of graph neighbors. Default 32. |
| Target QPS and latency | Used for CPU core estimation. |

HNSW Memory Estimation

Empirical Formula Under Default Parameters

With the default max_degree=32:

HNSW_FLAT_Bytes ~= 1.3 * D * 4 * N

Where:

  • D * 4 * N is the raw float32 vector memory.
  • The factor 1.3 adds the HNSW graph structure on top of the raw vectors: about 0.3 times the raw vector memory.

Adjustment When Tuning max_degree

The graph structure overhead grows with max_degree. Scale the 0.3 overhead term linearly relative to the default of 32:

HNSW_factor   ~= 1 + 0.3 * (max_degree / 32)
HNSW_FLAT_Bytes ~= HNSW_factor * D * 4 * N
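
The two formulas above can be sketched as a small helper. This is only an illustration of the empirical estimate: the function name is hypothetical, and reporting in decimal gigabytes (1 GB = 1e9 bytes) is an assumption chosen to match the rounded figures in this guide.

```python
def hnsw_flat_bytes(dim: int, rows: int, max_degree: int = 32) -> float:
    """Estimate resident memory (bytes) of an unquantized HNSW index."""
    raw_bytes = dim * 4 * rows             # raw float32 vectors
    factor = 1 + 0.3 * (max_degree / 32)   # graph overhead scaled by max_degree
    return factor * raw_bytes


GB = 1e9  # decimal gigabytes (assumption)

for degree in (32, 64):
    est = hnsw_flat_bytes(dim=768, rows=1_000_000, max_degree=degree)
    print(f"max_degree={degree}: {est / GB:.2f} GB")
```

For D=768 and 1M rows this gives about 3.99 GB at max_degree=32 and about 4.92 GB at max_degree=64.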

Approximate Memory Reduction From Quantization

| Quantization Mode | Memory Ratio (Relative to FLAT) |
| --- | --- |
| sq8 | About 1/4 |
| sq4 | About 1/8 |
| pq | Usually close to sq4 (for example, pq_m=D/2, pq_nbits=8) |
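
The ratios above follow from the encoded size of each vector. The sketch below is a back-of-the-envelope check, not a description of how Doris lays out its codes: it assumes 8 and 4 bits per dimension for sq8 and sq4, one pq_nbits-wide code per PQ sub-vector, and it ignores codebooks and metadata. The function names are hypothetical.

```python
FLOAT32_BYTES = 4


def code_bytes_per_vector(mode, dim, pq_m=None, pq_nbits=8):
    """Approximate encoded size of one vector, ignoring codebooks and metadata."""
    if mode == "flat":
        return dim * FLOAT32_BYTES
    if mode == "sq8":
        return dim * 1.0              # assumed 8 bits per dimension
    if mode == "sq4":
        return dim * 0.5              # assumed 4 bits per dimension
    if mode == "pq":
        m = pq_m if pq_m is not None else dim // 2
        return m * pq_nbits / 8       # one pq_nbits-wide code per sub-vector
    raise ValueError(f"unknown quantization mode: {mode}")


def memory_ratio(mode, dim, pq_m=None, pq_nbits=8):
    """Encoded size relative to raw float32 storage."""
    return code_bytes_per_vector(mode, dim, pq_m, pq_nbits) / (dim * FLOAT32_BYTES)


for mode in ("flat", "sq8", "sq4", "pq"):
    print(mode, memory_ratio(mode, dim=768))
```

With pq_m=D/2 and pq_nbits=8, each vector takes D/2 bytes, which is why PQ lands at the same 1/8 ratio as sq4 under these assumptions.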

Notes on ivf_on_disk

ivf_on_disk reuses the IVF training and query parameter model (nlist / ivf_nprobe), but stores the inverted list bodies on disk and serves queries through a cache. For capacity planning, first treat the IVF estimate below as the upper bound (the index fully resident in memory), then size ann_index_ivf_list_cache_limit separately based on the amount of hot data you expect to keep resident.
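
As a rough planning aid, the sketch below computes the in-memory upper bound and a cache target from it. The function and the hot_fraction input are hypothetical and not Doris parameters; how the resulting size is expressed in ann_index_ivf_list_cache_limit (units and format) should be checked against the Doris configuration reference.

```python
def ivf_on_disk_plan(dim: int, rows: int, hot_fraction: float, quant_ratio: float = 1.0):
    """Return (upper_bound_bytes, cache_bytes) for an ivf_on_disk index.

    hot_fraction is a hypothetical planning input: the share of the index
    expected to be queried often enough to stay cached.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    upper_bound = dim * 4 * rows * quant_ratio   # IVF fully resident in memory
    return upper_bound, upper_bound * hot_fraction


upper, cache = ivf_on_disk_plan(dim=768, rows=100_000_000, hot_fraction=0.2)
print(f"upper bound: {upper / 1e9:.1f} GB, cache target: {cache / 1e9:.1f} GB")
```

The hot fraction is workload-specific, so it is best derived from observed query patterns rather than guessed.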

Quick Reference (D=768, max_degree=32)

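| Rows | FLAT | SQ8 | SQ4 | PQ (m=384, nbits=8) |
| --- | --- | --- | --- | --- |
| 1M | 4 GB | 1 GB | 0.5 GB | 0.5 GB |
| 10M | 40 GB | 10 GB | 5 GB | 5 GB |
| 100M | 400 GB | 100 GB | 50 GB | 50 GB |
| 1B | 4000 GB | 1000 GB | 500 GB | 500 GB |
| 10B | 40000 GB | 10000 GB | 5000 GB | 5000 GB |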

IVF Memory Estimation

IVF has lower structural overhead than HNSW and can be approximated as:

IVF_FLAT_Bytes ~= D * 4 * N

The memory reduction ratio for IVF under quantization is the same as for HNSW:

| Quantization Mode | Memory Ratio (Relative to FLAT) |
| --- | --- |
| sq8 | About 1/4 |
| sq4 | About 1/8 |
| pq | Usually close to sq4 |
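
The IVF estimate and the ratios above combine into a single lookup. This is a minimal sketch: the names are hypothetical, the pq entry assumes pq_m=D/2 and pq_nbits=8, and decimal gigabytes are used for display.

```python
# Approximate memory ratios relative to FLAT, as listed in the table above.
QUANT_RATIO = {"flat": 1.0, "sq8": 1 / 4, "sq4": 1 / 8, "pq": 1 / 8}


def ivf_bytes(dim: int, rows: int, mode: str = "flat") -> float:
    """Estimate resident IVF index memory in bytes."""
    return dim * 4 * rows * QUANT_RATIO[mode]


GB = 1e9  # decimal gigabytes (assumption)

for rows in (1_000_000, 10_000_000, 100_000_000):
    cells = [f"{ivf_bytes(768, rows, mode) / GB:8.2f}" for mode in QUANT_RATIO]
    print(f"{rows:>11,}", *cells)
```

Each printed row lists the estimate in GB for flat, sq8, sq4, and pq, in that order.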

Quick Reference (D=768)

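| Rows | FLAT | SQ8 | SQ4 | PQ (m=384, nbits=8) |
| --- | --- | --- | --- | --- |
| 1M | 3 GB | 0.75 GB | 0.375 GB | 0.375 GB |
| 10M | 30 GB | 7.5 GB | 3.75 GB | 3.75 GB |
| 100M | 300 GB | 75 GB | 37.5 GB | 37.5 GB |
| 1B | 3000 GB | 750 GB | 375 GB | 375 GB |
| 10B | 30000 GB | 7500 GB | 3750 GB | 3750 GB |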

CPU Core Estimation

For high-QPS scenarios, you can start with the following empirical ratio:

16 cores : 64 GB   (about 1 core : 4 GB)

Note: even when quantization is enabled, CPU demand does not necessarily decrease at the same rate as index memory. In practice:

  1. First estimate CPU based on the FLAT-equivalent workload.
  2. Then gradually scale down to a reasonable level based on actual stress testing.
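
The starting point described above can be sketched as follows. The function name is hypothetical, and the result is only an initial core count under the 1 core : 4 GB ratio; stress testing decides the final value.

```python
import math


def starting_cores(flat_equivalent_bytes: float, gb_per_core: float = 4.0) -> int:
    """Initial core count from the FLAT-equivalent index size (1 core : 4 GB)."""
    flat_gb = flat_equivalent_bytes / 1e9   # decimal gigabytes (assumption)
    return math.ceil(flat_gb / gb_per_core)


# HNSW, D=768, 10M rows: about 40 GB as FLAT, even if the deployed index uses sq8.
flat_equivalent = 1.3 * 768 * 4 * 10_000_000
print(starting_cores(flat_equivalent))
```

The input is the FLAT-equivalent size rather than the quantized size because quantization shrinks memory without shrinking CPU demand at the same rate.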

Production Safety Margin (Do Not Design Against 100% Memory)

The formulas above only cover the ANN index itself, not the full SQL execution overhead. For example:

SELECT id, text, l2_distance_approximate(embedding, [...]) AS dist
FROM tbl
ORDER BY dist
LIMIT N;

Even with TopN late materialization, the execution layer still needs additional memory to handle non-vector columns and operator state. In production, the recommendations are:

  • Keep ANN index memory within about 70% of total machine memory.
  • Use the remaining memory for query execution, Compaction, and other data access.
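
The 70% guideline translates into a minimum memory budget, or into a node count for a fixed node size. The sketch below is illustrative only: the function names are hypothetical, and it assumes the index is spread evenly across nodes and does not account for replicas.

```python
import math


def min_total_memory_bytes(index_bytes: float, index_share: float = 0.70) -> float:
    """Total machine memory needed so the index uses at most index_share of it."""
    return index_bytes / index_share


def nodes_needed(index_bytes: float, node_memory_bytes: float, index_share: float = 0.70) -> int:
    """Nodes required if each node gives index_share of its memory to the index."""
    return math.ceil(index_bytes / (node_memory_bytes * index_share))


index_bytes = 100e9   # for example, a 100 GB index
print(f"{min_total_memory_bytes(index_bytes) / 1e9:.0f} GB total memory")
print(nodes_needed(index_bytes, node_memory_bytes=64e9), "nodes of 64 GB")
```

For a 100 GB index this suggests about 143 GB of total memory, or 3 nodes of 64 GB each.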

Scenario-based Recommendations

| Scenario | Recommended Plan | Description |
| --- | --- | --- |
| Performance-first with sufficient memory budget | HNSW + FLAT | Best recall and latency. |
| Memory-constrained | HNSW/IVF + PQ | Usually more balanced than SQ8/SQ4. |
| Initial PQ parameter | pq_m = D / 2 | Fine-tune later based on recall and latency stress tests. |
| Low query performance requirement | Lower CPU configuration first | You can also adopt a "high CPU during ingestion, downsized after stabilization" strategy. |

FAQ

Q1: How much memory can be reduced after enabling quantization?

A: sq8 is about 1/4 of FLAT; sq4 and pq (for example, pq_m=D/2, pq_nbits=8) are about 1/8. These ratios are approximate, and for HNSW the exact value also depends on the graph structure overhead.

Q2: Can CPU be scaled down at the same ratio as the quantized memory?

A: Not recommended. Quantization mainly reduces memory usage, and CPU demand does not decrease proportionally. First estimate CPU based on the FLAT-equivalent workload, then scale down based on stress tests.

Q3: How does memory change when max_degree is increased?

A: The overall HNSW memory factor is about 1 + 0.3 * (max_degree / 32), so only the graph overhead term (0.3 by default) scales with max_degree. For example, when max_degree=64, the factor is about 1.6.

Q4: How much memory should be planned for ivf_on_disk?

A: The upper bound is "IVF fully resident in memory". The actual resident size is determined by ann_index_ivf_list_cache_limit and can be evaluated separately based on the size of hot data.

Q5: Why should the design not target 100% memory?

A: In addition to the ANN index, the SQL execution layer (non-vector columns, operator state), Compaction, and other data access also consume memory. Reserve about 30% headroom and keep index memory within 70% of total memory.