Skip to main content

Large-Scale Performance Benchmarks

This document presents real-world load and query performance benchmarks for the Doris ANN Index on medium-large and hundred-million-scale datasets. It helps you evaluate query performance at different data scales and understand how to scale smoothly from a single-node deployment to a distributed deployment.

Quick Navigation

After reading this document, you can answer the following questions:

  • For tens of millions of vectors, can a single BE node handle online retrieval? What QPS and latency can you expect?
  • For hundreds of millions of vectors, when a single node runs out of memory, how can you continue to serve vector queries through a multi-BE deployment?
  • How does performance differ across vector dimensions (768/1536) and distance metrics (inner_product/l2_distance)?
  • How can you reproduce these test results?

Test Environment and Datasets

Deployment Topology

Deployment ModeBE NodesPer-Node SpecSuitable Data Scale
Single node116C64GBTens of millions
Distributed316C64GBHundreds of millions

FE and BE are deployed separately. All tests use the VectorDBBench tool.

Dataset Overview

DatasetData VolumeVector DimensionDistance MetricDeployment Topology
Performance768D10M10M768inner_productSingle node
Performance1536D5M5M1536inner_productSingle node
Performance768D100M100M768l2_distanceDistributed

Single-Node Benchmark (16C64GB)

The single-node results provide a baseline for ANN query performance on medium-large datasets.

Load Performance

The load metrics for the two datasets are as follows:

ItemPerformance768D10MPerformance1536D5M
Vector dimension7681536
metric_typeinner_productinner_product
Data volume10M rows5M rows
Load batch paramsNUM_PER_BATCH=500000
--stream-load-rows-per-batch 500000
NUM_PER_BATCH=250000
--stream-load-rows-per-batch 250000
Load duration76m41s41m
show data all56.498 GB (25.354 GB + 31.145 GB)55.223 GB (25.346 GB + 29.878 GB)

CPU usage:

  • During the Performance768D10M load, CPU utilization was relatively stable overall.

    Performance768D10M import CPU

  • Performance1536D5M has a smaller data volume and a smaller batch size, so CPU utilization fluctuates more frequently during the load phase.

    Performance1536D5M import CPU

Query Performance

While maintaining a high recall rate, the single-node deployment can reach hundreds of QPS and keep query latency low.

Summary Metrics

DatasetBestQPSRecall@100
Performance768D10M481.93560.9207
Performance1536D5M414.73420.9677

Performance768D10M Details (inner_product, 10M rows)

ConcurrencyQPSP95 LatencyP99 LatencyAverage Latency
10116.20000.09320.09330.0861
40455.94850.11020.12250.0877
80481.93560.23310.26740.1658

Performance1536D5M Details (inner_product, 5M rows)

ConcurrencyQPSP95 LatencyP99 LatencyAverage Latency
10144.32210.07640.08000.0693
40401.97320.12710.14040.0994
80414.73420.27720.32220.1925

CPU Monitoring

During the cold query phase, the index needs to be loaded into memory, so CPU utilization is relatively low and the system mainly waits on IO. After entering the hot query phase, CPU utilization rises significantly and approaches 100%.

Performance768D10M query CPU

Distributed Benchmark (3 x 16C64GB)

When the data scale exceeds the reasonable memory capacity of a single 16C64GB node, you can scale out horizontally with a multi-BE deployment. This section uses the Performance768D100M dataset (100M rows, 768 dimensions) to show that Doris can still provide online vector query capability at the 100M scale.

Tip

This test does not constitute a one-to-one absolute numerical comparison with the small-scale single-node tests. It is more suitable for observing how performance scales with data size.

Load and Index Build

Because the per-node memory cap is 64GB, this test uses vector quantization compression to reduce memory overhead.

ItemValue
DatasetPerformance768D100M
Data volume100M rows
Vector dimension768
Batch paramsNUM_PER_BATCH=500000
--stream-load-rows-per-batch 500000
Index params"dim"="768", "index_type"="hnsw", "metric_type"="l2_distance", "pq_m"="384", "pq_nbits"="8", "quantizer"="pq"
Build index time4h5min
show data all198.809 GB (137.259 GB + 61.550 GB)

Data distribution after index build:

  • 3 buckets in total
  • Each bucket contains 34 rowsets, each rowset is about 1.99 GB
  • Each rowset contains 6 segments

CPU usage: During the index build, CPU utilization stayed stable at around 50% overall. The CPU was not maxed out for an extended period, leaving some resource headroom.

Performance768D100M import CPU

Query Performance

Summary Metrics

MetricValue
BestQPS77.6247
Recall@1000.9294

Details (l2_distance, 100M rows)

ConcurrencyQPSP95 LatencyP99 LatencyAverage Latency
1046.58360.26280.27910.2145
2075.35790.32510.35410.2651
3077.62470.52220.57660.3860
4076.63130.70890.78540.5212

CPU Monitoring

During the query phase, CPU utilization on each node stays at a high level, indicating that the query workload makes good use of the distributed compute resources.

Performance768D100M query CPU

Key Conclusions

  • Tens of millions on a single node: At the tens-of-millions vector data scale, a Doris single-node deployment can deliver hundreds of QPS for ANN queries while maintaining a high recall rate (>=0.92).
  • Hundreds of millions, distributed: On the 100M vector dataset, a multi-BE deployment combined with vector quantization compression can continue to provide online vector query capability (BestQPS approximately 77, Recall@100 approximately 0.93).
  • Horizontal scalability: When the data scale exceeds the memory capacity of a single node, distributed deployment is a viable path to sustain online retrieval capability.

Test Notes

When reading these results, keep the following in mind:

  • Different distance metrics: The single-node tests use inner_product, while the distributed test uses l2_distance. Direct side-by-side comparison of absolute numbers is not recommended.
  • Different data scales and index parameters: The data scales and index parameters (such as whether quantization is enabled) differ across test groups. The results are more suitable for observing how performance scales with data size.
  • Cold-query correction: For the single-node Performance768D10M test, the result at concurrency 10 has been corrected after removing the impact of cold queries.

How to Reproduce

The tests are run with the VectorDBBench tool.

Single-Node Reproduction

# Performance768D10M
export NUM_PER_BATCH=500000
vectordbbench doris ... --case-type Performance768D10M --stream-load-rows-per-batch 500000

# Performance1536D5M
export NUM_PER_BATCH=250000
vectordbbench doris ... --case-type Performance1536D5M --stream-load-rows-per-batch 250000

Distributed 3BE Reproduction

export NUM_PER_BATCH=500000
vectordbbench doris ... --case-type Performance768D100M --stream-load-rows-per-batch 500000

FAQ

Q1: How many vectors can a single 16C64GB node hold at most?

This depends on the vector dimension, whether quantization is enabled, and the index parameters. In this document, 768-dimensional 10M rows (about 56 GB) runs stably on a single 16C64GB node. If the data scale grows further or the dimension is higher, enable vector quantization or use a multi-BE distributed deployment.

Q2: Why is the QPS of the distributed test lower than that of the single-node tests?

The two groups of tests differ in distance metric, data scale, and index parameters (the distributed test enables PQ quantization), so absolute numbers cannot be compared directly. The goal of the distributed test is to validate scalability at large data sizes, not to maximize QPS.

Q3: Why is quantization required on the 100M dataset?

The per-node memory cap is 64GB, and the raw size of 100M 768-dimensional vectors already exceeds this capacity. PQ quantization (pq_m=384, pq_nbits=8) significantly reduces memory consumption, making large-scale online retrieval feasible.

Q4: Does the load duration include index build time?

In the single-node tests, "load duration" is the overall write time. The index is built either during writing or in the background compaction phase. In the distributed test, build index time is listed separately so that the index build cost can be evaluated for large-scale scenarios.