
Large-scale Performance Benchmark

This page summarizes large-scale benchmark results in both single-node and distributed deployments. The purpose of these tests is to show query behavior at different data scales, and to illustrate how Doris extends vector query capacity from single-node workloads to larger distributed deployments.

Test Matrix

  • Single node: FE and BE deployed separately, with the BE on one 16-core, 64 GB machine (16C64GB).
  • Distributed: 3 BE nodes, each 16C64GB.
  • Workloads:
    • Performance768D10M
    • Performance1536D5M
    • Performance768D100M

Single-node Benchmark (16C64GB)

The single-node results provide a baseline for ANN query performance on medium-to-large datasets.

Import Performance

| Item | Performance768D10M | Performance1536D5M |
| --- | --- | --- |
| Dimension | 768 | 1536 |
| metric_type | inner_product | inner_product |
| Rows | 10M | 5M |
| Batch config | NUM_PER_BATCH=500000, --stream-load-rows-per-batch 500000 | NUM_PER_BATCH=250000, --stream-load-rows-per-batch 250000 |
| Import time | 76m41s | 41m |
| show data all | 56.498 GB (25.354 GB + 31.145 GB) | 55.223 GB (25.346 GB + 29.878 GB) |
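The import times above can be converted into approximate ingestion rates. The following is a minimal sketch for illustration only: the helper names are hypothetical, and the raw-size estimate assumes each vector component is stored as a 32-bit float, which the benchmark does not state.

```python
import re

# Figures copied from the single-node import table.
IMPORTS = {
    "Performance768D10M": {"rows": 10_000_000, "dim": 768, "time": "76m41s"},
    "Performance1536D5M": {"rows": 5_000_000, "dim": 1536, "time": "41m"},
}

BYTES_PER_FLOAT32 = 4  # assumption: vector components are float32


def duration_to_seconds(text: str) -> int:
    """Parse durations such as '76m41s' or '41m' into seconds."""
    unit_seconds = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * unit_seconds[u] for n, u in re.findall(r"(\d+)([hms])", text))


def rows_per_second(rows: int, time_text: str) -> float:
    """Average ingestion rate over the whole import."""
    return rows / duration_to_seconds(time_text)


def raw_vector_gb(rows: int, dim: int) -> float:
    """Uncompressed vector payload in decimal GB, excluding index and other columns."""
    return rows * dim * BYTES_PER_FLOAT32 / 1e9


for name, cfg in IMPORTS.items():
    rate = rows_per_second(cfg["rows"], cfg["time"])
    size = raw_vector_gb(cfg["rows"], cfg["dim"])
    print(f"{name}: {rate:,.0f} rows/s, ~{size:.2f} GB raw vectors")
```

Under the float32 assumption, both datasets carry the same raw vector payload (about 30.72 GB), which is consistent with their similar `show data all` sizes.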

CPU utilization during import for Performance768D10M is shown below. The chart indicates that CPU usage remains relatively stable throughout ingestion.

Performance768D10M import CPU

For Performance1536D5M, the dataset is smaller and the batch size is also smaller, so CPU utilization fluctuates more frequently during ingestion.

Performance1536D5M import CPU

Query Performance

For the two single-node workloads, Doris peaks at more than 400 QPS with Recall@100 above 0.92. Latency at each concurrency level is listed in the per-dataset tables.

Summary

| Dataset | Best QPS | Recall@100 |
| --- | --- | --- |
| Performance768D10M | 481.9356 | 0.9207 |
| Performance1536D5M | 414.7342 | 0.9677 |

Performance768D10M (inner_product, 10M rows)

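| Concurrency | QPS | P95 Latency (s) | P99 Latency (s) | Avg Latency (s) |
| --- | --- | --- | --- | --- |
| 10 | 116.2000 | 0.0932 | 0.0933 | 0.0861 |
| 40 | 455.9485 | 0.1102 | 0.1225 | 0.0877 |
| 80 | 481.9356 | 0.2331 | 0.2674 | 0.1658 |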

Performance1536D5M (inner_product, 5M rows)

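| Concurrency | QPS | P95 Latency (s) | P99 Latency (s) | Avg Latency (s) |
| --- | --- | --- | --- | --- |
| 10 | 144.3221 | 0.0764 | 0.0800 | 0.0693 |
| 40 | 401.9732 | 0.1271 | 0.1404 | 0.0994 |
| 80 | 414.7342 | 0.2772 | 0.3222 | 0.1925 |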
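One way to sanity-check the single-node tables is Little's law: in a closed-loop test, the number of in-flight requests is approximately QPS multiplied by average latency. The sketch below applies that check to the reported numbers. It is an illustration, not part of the benchmark tooling; the function name is hypothetical, and it assumes the latency columns are in seconds, which the arithmetic supports.

```python
# (concurrency, QPS, average latency) copied from the two single-node tables.
RESULTS = {
    "Performance768D10M": [(10, 116.2000, 0.0861), (40, 455.9485, 0.0877), (80, 481.9356, 0.1658)],
    "Performance1536D5M": [(10, 144.3221, 0.0693), (40, 401.9732, 0.0994), (80, 414.7342, 0.1925)],
}


def implied_concurrency(qps: float, avg_latency_s: float) -> float:
    """Little's law: requests in flight = throughput * average time in system."""
    return qps * avg_latency_s


for dataset, rows in RESULTS.items():
    for concurrency, qps, avg_latency in rows:
        implied = implied_concurrency(qps, avg_latency)
        drift = abs(implied - concurrency) / concurrency
        print(f"{dataset} c={concurrency}: implied={implied:.2f} ({drift:.2%} off)")
```

Every row lands within 1% of the configured concurrency. Going from 40 to 80 clients therefore adds little throughput and mostly adds queueing time, which is why average latency roughly doubles between those two rows.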

In the single-node query test, the cold-query phase needs to load the full index into memory, so CPU utilization is relatively low while the system waits for IO. During the warm-query phase, CPU utilization increases significantly and approaches 100%.

Performance768D10M query CPU

Distributed Benchmark (3 x 16C64GB)

The distributed test focuses on a larger dataset that exceeds the practical memory envelope of a single 16C64GB node.

The 3-BE test uses Performance768D100M. Because each node has only 64 GB of memory, vector quantization (PQ, per the index properties below) is enabled to reduce memory usage. This test is intended to show how Doris sustains vector query capability at 100M scale through multi-BE deployment, rather than to provide a direct one-to-one comparison with the smaller single-node cases.

Import Performance

| Item | Value |
| --- | --- |
| Dataset | Performance768D100M |
| Rows | 100M |
| Dimension | 768 |
| Batch config | NUM_PER_BATCH=500000, --stream-load-rows-per-batch 500000 |
| Index properties | "dim"="768", "index_type"="hnsw", "metric_type"="l2_distance", "pq_m"="384", "pq_nbits"="8", "quantizer"="pq" |
| Build index time | 4h5min |
| show data all | 198.809 GB (137.259 GB + 61.550 GB) |
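To see why quantization matters at this scale, it helps to estimate the vector memory footprint with and without PQ. The sketch below uses the standard product-quantization code size (`pq_m` sub-quantizers, `pq_nbits` bits each) and the index properties listed above. It is a rough estimate under stated assumptions: raw vectors are taken to be float32, and the HNSW graph, codebooks, and any Doris-specific overhead are ignored. The function names are hypothetical.

```python
ROWS = 100_000_000
DIM = 768
PQ_M = 384        # "pq_m" from the index properties
PQ_NBITS = 8      # "pq_nbits" from the index properties
BE_NODES = 3
BYTES_PER_FLOAT32 = 4  # assumption: raw vectors are float32


def raw_bytes_per_vector(dim: int) -> int:
    """Size of one uncompressed float32 vector."""
    return dim * BYTES_PER_FLOAT32


def pq_bytes_per_vector(m: int, nbits: int) -> int:
    """Size of one PQ code: m sub-quantizer indices of nbits each, rounded up to bytes."""
    return (m * nbits + 7) // 8


raw_total_gb = ROWS * raw_bytes_per_vector(DIM) / 1e9
pq_total_gb = ROWS * pq_bytes_per_vector(PQ_M, PQ_NBITS) / 1e9

print(f"raw vectors: {raw_total_gb:.1f} GB total, {raw_total_gb / BE_NODES:.1f} GB per BE")
print(f"PQ codes:    {pq_total_gb:.1f} GB total, {pq_total_gb / BE_NODES:.1f} GB per BE")
print(f"compression: {raw_bytes_per_vector(DIM) / pq_bytes_per_vector(PQ_M, PQ_NBITS):.0f}x")
```

Under these assumptions, uncompressed vectors would need about 102 GB per BE if spread evenly, which exceeds 64 GB, while the PQ codes need about 13 GB per BE before graph overhead.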

Post-build distribution:

  • 3 buckets
  • 34 rowsets per bucket, each rowset about 1.99 GB
  • 6 segments per rowset
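The layout figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only and assumes rows are spread evenly across segments, which the benchmark does not state.

```python
BUCKETS = 3
ROWSETS_PER_BUCKET = 34
SEGMENTS_PER_ROWSET = 6
ROWSET_SIZE_GB = 1.99      # "about 1.99 GB" per rowset
TOTAL_ROWS = 100_000_000

total_rowsets = BUCKETS * ROWSETS_PER_BUCKET
total_segments = total_rowsets * SEGMENTS_PER_ROWSET
approx_total_gb = total_rowsets * ROWSET_SIZE_GB
rows_per_segment = TOTAL_ROWS / total_segments  # assumes an even spread

print(f"rowsets: {total_rowsets}, segments: {total_segments}")
print(f"approximate size from rowsets: {approx_total_gb:.2f} GB")
print(f"average rows per segment: {rows_per_segment:,.0f}")
```

The rowset-based estimate of roughly 203 GB is close to the 198.809 GB reported by `show data all`. The small gap may come from rounding in the per-rowset size or from unit conventions.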

Query Performance

Summary

| Metric | Value |
| --- | --- |
| Best QPS | 77.6247 |
| Recall@100 | 0.9294 |

Detailed results (l2_distance, 100M rows)

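| Concurrency | QPS | P95 Latency (s) | P99 Latency (s) | Avg Latency (s) |
| --- | --- | --- | --- | --- |
| 10 | 46.5836 | 0.2628 | 0.2791 | 0.2145 |
| 20 | 75.3579 | 0.3251 | 0.3541 | 0.2651 |
| 30 | 77.6247 | 0.5222 | 0.5766 | 0.3860 |
| 40 | 76.6313 | 0.7089 | 0.7854 | 0.5212 |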
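The detailed results show throughput flattening between 20 and 40 concurrent clients. A short sketch like the one below makes the saturation point explicit by computing the relative QPS change between consecutive concurrency levels. The helper name is hypothetical and the numbers are copied from the table above.

```python
# (concurrency, QPS) copied from the distributed results table.
DISTRIBUTED = [(10, 46.5836), (20, 75.3579), (30, 77.6247), (40, 76.6313)]


def qps_gains(rows):
    """Return (from_concurrency, to_concurrency, relative QPS change) per step."""
    return [
        (c0, c1, (q1 - q0) / q0)
        for (c0, q0), (c1, q1) in zip(rows, rows[1:])
    ]


best_concurrency, best_qps = max(DISTRIBUTED, key=lambda row: row[1])
gains = qps_gains(DISTRIBUTED)

for c0, c1, gain in gains:
    print(f"{c0} -> {c1} clients: {gain:+.1%} QPS")
print(f"best QPS {best_qps} at concurrency {best_concurrency}")
```

Most of the gain comes from the step from 10 to 20 clients. Beyond 20 clients the change is within a few percent in either direction while latency keeps rising, so the cluster is effectively saturated at that point for this configuration.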

During index build, CPU utilization stays around 50%, indicating that the build process does not saturate CPU resources for an extended period.

Performance768D100M import CPU

The chart below shows CPU utilization during the query phase. CPU usage stays at a relatively high level across the nodes, indicating that the distributed query workload makes good use of available compute resources.

Performance768D100M query CPU

Summary

  • On datasets of 5M to 10M vectors, Doris provides strong ANN query performance on a single node, with peak QPS above 400 and Recall@100 above 0.92.
  • On a 100M-vector dataset, Doris continues to provide online vector query capability through multi-BE deployment.
  • Because the test groups use different dataset sizes, distance metrics, and index settings, the results should be read as scale benchmarks rather than direct one-to-one performance comparisons.

Notes

  • Metric types differ between the two test groups (inner_product vs l2_distance), so absolute values should not be compared directly.
  • The single-node Performance768D10M result at concurrency = 10 has been adjusted to exclude cold-query impact.

Reproduction

Single-node:

export NUM_PER_BATCH=500000
vectordbbench doris ... --case-type Performance768D10M --stream-load-rows-per-batch 500000

export NUM_PER_BATCH=250000
vectordbbench doris ... --case-type Performance1536D5M --stream-load-rows-per-batch 250000

Distributed 3BE:

export NUM_PER_BATCH=500000
vectordbbench doris ... --case-type Performance768D100M --stream-load-rows-per-batch 500000