Parallelism Tuning: parallel_pipeline_task_num Configuration Guide

Doris is an MPP execution framework. Every query runs in parallel across multiple BEs, and within a single BE, multi-threaded parallelism further accelerates execution. All statements (Query, DML, DDL) support parallel execution.

One-sentence definition: parallel_pipeline_task_num controls the number of worker tasks used to execute a single Fragment within a single BE.

Quick Diagnosis

Before you start tuning, confirm the following:

Have you used the PROFILE tool to confirm that the query is CPU-bound?
How many CPU cores does the current BE have?
What type of query is it: point query, JOIN/aggregation, stress test, or complex query?
Are you using the Duplicate or Unique Key Merge-On-Write table model?
Do you plan to adjust at the SQL, session, or global level?

The default value is 0, which is equivalent to half the number of CPU cores on the BE. This default value already balances single-query and concurrent resource utilization, and typically does not require user intervention.

Parallelism Tuning Principles

The purpose of parallel_pipeline_task_num is to fully utilize multi-core resources and reduce query latency. However, multi-core parallelism introduces data Shuffle operators and multi-thread synchronization logic, so excessive parallelism can lead to wasted resources.

Scenario and Recommended Parallelism Reference Table

The table below uses BE CPU cores = 16 as an example.

Query scenario	Typical characteristics	Recommended parallelism	Rationale
Single-table simple operation	Single-table point query, `WHERE` scanning a small amount of data, `LIMIT` returning a small amount of data, materialized view hit	1	Only one Fragment exists; the bottleneck is the data scan thread (adaptive parallelism), not the query execution thread
Two-table JOIN / aggregation	Large data volume, CPU-intensive, CPU not saturated	16	Fully leverages the parallel capability of the Pipeline execution engine; do not increase without limit (for example, 48 only adds scheduling overhead)
High-concurrency stress test	Multiple queries themselves can already saturate the CPU	1	Excessive parallelism only adds thread scheduling and framework scheduling overhead
Complex query	Bottleneck is hard to identify in one shot	Default value	Adjust flexibly based on Profile and machine load; try the 4-2-1 step-down approach

Doris continues to refine its adaptive strategies. Adjustments are typically recommended only at the specific scenario or SQL level.

Query Parallelism Tuning Methods

Doris supports manually specifying query parallelism at three granularities: SQL, session, and global.

Method 1: SQL Level (Recommended)

Purpose: Affects only a single SQL statement, providing flexible and precise control.
Command: Use a SQL HINT.
Description: Suitable for fine-grained tuning of specific SQL statements without affecting other queries.

SELECT /*+SET_VAR(parallel_pipeline_task_num=8)*/ *
FROM nation, lineitem
WHERE lineitem.l_suppkey = nation.n_nationkey;

SELECT /*+SET_VAR(parallel_pipeline_task_num=8,runtime_filter_mode=global)*/ *
FROM nation, lineitem
WHERE lineitem.l_suppkey = nation.n_nationkey;

Method 2: Session Level

Purpose: Affects all queries in the current session.
Command: Set via session variables.
Description: All SQL statements within the session use this parallelism, including single-row queries, which may degrade performance for some small queries.

SET parallel_pipeline_task_num = 8;

Method 3: Global Level

Purpose: Affects the default behavior of the entire cluster.
Command: Use SET GLOBAL.
Description: Typically used for global CPU utilization tuning. After a global setting is applied, it takes effect on the current connection and newly created connections, but does not affect other existing connections. To make it take effect immediately for everything, restart the FE.

SET GLOBAL parallel_pipeline_task_num = 8;

Comparison of the Three Tuning Methods

Tuning method	Scope	When it takes effect	Recommended scenario
SQL HINT	Single SQL statement	Immediate	Fine-grained tuning of a single SQL (safest)
Session	Current session	Immediate	Tuning a group of related queries
Global	Entire cluster	Takes effect on new connections	Cluster-level CPU utilization optimization

Data Sharding and Parallelism

Starting from version 2.1, Doris supports decoupling parallelism from the number of data shards.

Version Comparison

Version	Behavior	Limitation
Before 2.1	Parallelism ≤ number of shards involved in the query	5 shards allow at most 5-way concurrency; large shards cannot be read concurrently
2.1 and later	Supports concurrent reads within a shard (enabled automatically)	Only the Duplicate and Unique Key Merge-On-Write table models are supported

Note: The Aggregate model and the Unique Key Merge-On-Read model are not applicable; query parallelism is still limited by the number of shards.

Best Practice Cases

Case 1: Excessive CPU Usage — Lower the Parallelism

Issue: CPU usage on the production cluster is too high, affecting the performance of low-latency queries.

Root cause analysis: By default, Doris prioritizes using more resources to obtain query results as quickly as possible. In production scenarios where resources are tight, this can affect overall stability.

Solution: Lower the parallelism from the default 0 (half of the CPU core count) to 4.

SET GLOBAL parallel_pipeline_task_num = 4;

Result: CPU usage dropped to 60% of the original peak, reducing the impact on low-latency queries.

A GLOBAL setting takes effect on the current connection and newly created connections; existing connections are not affected. To make it take effect immediately for everything, restart the FE.

Case 2: Insufficient CPU Utilization — Increase the Parallelism

Issue: A compute-intensive query takes 28 seconds to execute, with CPU utilization at only 60%.

Sample SQL (left table 2 billion rows, right table 5 million rows):

SELECT
    sum(if(t2.value IS NULL, 0, 1)) AS exist_value,
    sum(if(t2.value IS NULL, 1, 0)) AS no_exist_value
FROM t1
LEFT JOIN t2 ON t1.key = t2.key;

Key Profile metrics:

HASH_JOIN_OPERATOR (id=3 , nereids_id=448):
  - PlanInfo
      - join op: LEFT OUTER JOIN(BROADCAST)[]
      - equal join conjunct: (value = value)
      - cardinality=2,462,330,332
      - vec output tuple id: 5
      - output tuple id: 5
      - vIntermediate tuple ids: 4
      - hash output slot ids: 16
      - projections: value
      - project output tuple id: 5
  - BlocksProduced: sum 360.099K (360099), avg 45.012K (45012), max 45.014K (45014), min 45.011K (45011)
  - CloseTime: avg 8.44us, max 13.327us, min 5.574us
  - ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms
  - InitTime: avg 7.122us, max 13.395us, min 4.541us
  - MemoryUsage: sum , avg , max , min
    - PeakMemoryUsage: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB
    - ProbeKeyArena: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB
  - OpenTime: avg 2.967us, max 4.120us, min 1.562us
  - ProbeRows: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658)
  - ProjectionTime: avg 165.392ms, max 169.762ms, min 161.727ms
  - RowsProduced: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658)

Root cause analysis:

The dominant time cost (ExecTime: avg 26sec153ms) is concentrated in the Join operator.
The total volume of data processed (ProbeRows: 1.466 billion) is huge, which is a typical CPU-intensive workload.
Monitoring shows CPU utilization at only 60%, indicating room for acceleration.

Solution: Increase the parallelism.

SET parallel_pipeline_task_num = 16;

Result comparison:

Metric	Before	After
Query duration	28 seconds	19 seconds
CPU utilization	60%	90%

FAQ

Q1: What is the default parallelism? The default value is 0, which at runtime is equivalent to half the number of CPU cores on the BE.

Q2: Is a higher parallelism always better? No. Excessive parallelism brings thread scheduling and framework scheduling overhead, which can actually reduce performance. For example, setting it to 48 on a 16-core BE provides no benefit.

Q3: Why does SET GLOBAL not take effect on existing connections? GLOBAL only takes effect on the current connection and newly created connections; other existing connections are not affected. To make it take effect immediately for everything, restart the FE.

Q4: Do all table models support concurrent reads within a shard? Only the Duplicate and Unique Key Merge-On-Write models support it. For the Aggregate model and the Unique Key Merge-On-Read model, query parallelism is still limited by the number of shards.

Q5: How do I tell whether a query is CPU-bound? Use PROFILE to observe which operator concentrates the ExecTime, and observe the machine's CPU utilization at the same time. If the CPU is not saturated, consider increasing the parallelism.

Troubleshooting

Symptom	Possible cause	Investigation suggestion
Query becomes slower after raising parallelism	Thread scheduling overhead exceeds parallelism gains	Check operator time costs via Profile; step back using the 4-2-1 approach
CPU utilization is saturated but the query becomes slower	Parallelism is too high, causing context switches	Lower the parallelism; for high-concurrency scenarios, set it to 1
Global adjustment does not take effect	Existing connections have not applied the new configuration	Restart the FE or reconnect the client
Slow read speed for large shards	Version is below 2.1, limited by the number of shards	Upgrade to 2.1 or later, and confirm the table model is Duplicate or MoW

Summary

Typically, you do not need to intervene in query parallelism. If adjustment is needed, follow these principles:

Start from CPU utilization: Use the PROFILE tool to confirm whether the query is CPU-bound, and then decide whether to adjust.
Prefer SQL-level adjustment: A single-SQL HINT adjustment is the safest. Avoid aggressive global changes.
Scenario-based decisions: Refer to the Scenario and Recommended Parallelism Reference Table to choose an appropriate value.
Step-down trials: For complex queries, adjust gradually using the 4-2-1 step-down approach, observing query performance and machine load.

Quick Diagnosis​

Parallelism Tuning Principles​

Scenario and Recommended Parallelism Reference Table​

Query Parallelism Tuning Methods​

Method 1: SQL Level (Recommended)​

Method 2: Session Level​

Method 3: Global Level​

Comparison of the Three Tuning Methods​

Data Sharding and Parallelism​

Version Comparison​

Best Practice Cases​

Case 1: Excessive CPU Usage — Lower the Parallelism​

Case 2: Insufficient CPU Utilization — Increase the Parallelism​

FAQ​

Troubleshooting​

Summary​