Parallelism Tuning

Parallelism (parallel_pipeline_task_num) controls the number of worker tasks used by a single Fragment when it runs inside a BE. It is a key parameter for fully utilizing multi-core CPUs and reducing query latency.

Pre-Tuning Self-Check Checklist

You have confirmed the number of CPU cores on the BE (which determines the upper bound of parallelism).
You have used the Profile or monitoring metrics to determine the bottleneck type (CPU-intensive, scan-intensive, or scheduling overhead).
You understand the current query scenario (point query, JOIN/aggregation, stress test, or complex query).
Performance is genuinely below expectations under the default value (0, which means half the number of CPU cores).

Core Concepts

MPP parallelism: Each query runs in parallel across multiple BEs.
Intra-BE parallelism: Within a single BE, multi-threading accelerates Fragment execution.
parallel_pipeline_task_num: The number of worker tasks used by a single Fragment during execution. The default 0 means half the number of CPU cores on the BE.
Applicable statements: All Query, DML, and DDL statements support parallel execution.

Tuning Principles

A higher parallelism is not always better:

Increasing parallelism makes full use of multi-core resources and reduces single-query latency.
However, it introduces additional data Shuffle operators and multi-thread synchronization logic, which wastes resources.
The default value already balances single-query and concurrent scenarios, so manual intervention is usually unnecessary.
Doris continues to refine its adaptive strategies. Prefer adjustments at the SQL or scenario level rather than globally.

Scenario-Based Tuning Recommendations

The following examples assume that the BE has 16 CPU cores.

Scenario Reference Table

Query scenario	Recommended parallelism	Reason
Single-table point query / `WHERE` on a small amount of data / `LIMIT` / hitting a materialized view	1	Only one Fragment exists. The bottleneck is the scan thread (which is independent from the query thread and parallelizes adaptively), so the query thread does not need extra concurrency.
Two-table `JOIN` over a large dataset / aggregation query (CPU-intensive)	16	Compute-intensive. When the CPU is not saturated, increasing parallelism can fully utilize the cores. However, do not increase it without limit (for example, setting it to 48 only adds scheduling overhead).
Stress testing scenario	1	Concurrent queries already saturate the CPU on their own, and excessive parallelism only adds thread and framework scheduling overhead.
Complex query	Default value	Adjust flexibly based on the Profile and machine load. Try a stepwise decrease of 4 -> 2 -> 1 to observe the effect.

Scenario 1: Simple Single-Table Operations

Characteristics: A single Fragment, with the bottleneck on the scan thread (which already parallelizes adaptively).
Recommendation: parallel_pipeline_task_num = 1.
Reason: The scan thread and the query execution thread are independent. Adding more concurrent query threads cannot accelerate a scan-bound bottleneck.

Scenario 2: JOIN/Aggregation Over Large Datasets

Characteristics: Compute-intensive workload where the CPU is observed to be unsaturated.
Recommendation: Increase the parallelism above the default value, for example parallel_pipeline_task_num = 16.
Reason: This leverages the parallel capability of the Pipeline execution engine to fully utilize the CPU. However, do not increase it without limit, or the thread and framework scheduling overhead will outweigh the gains.

Scenario 3: Stress Testing

Characteristics: There are enough concurrent queries that the CPU is already saturated by multiple queries.
Recommendation: parallel_pipeline_task_num = 1.
Reason: This avoids the scheduling overhead introduced by stacking single-query parallelism on top of high concurrency.

Scenario 4: Complex Queries

Characteristics: The execution plan is complex, and the bottleneck is hard to identify at a glance.
Recommendation: Start with the default value, then adjust stepwise as 4 -> 2 -> 1 based on the Profile and machine load to observe the effect.

Tuning Methods

Doris supports parallelism settings at three granularities: SQL level, session level, and global level. The priority decreases in this order.

Method Comparison

Method	Scope	Applicable scenarios	Risk
SQL HINT	A single SQL statement	Tuning a specific slow SQL	No side effects, most recommended
Session variable	The current session	A group of queries within the same session	Single-row queries also follow this setting, which may degrade performance
Global variable	All new connections across the cluster	Cluster-wide CPU utilization adjustment	The largest scope of impact, use with caution

SQL-Level Adjustment

Purpose: Use a HINT to precisely control the parallelism of a single SQL statement, providing the best flexibility.

Command:

SELECT /*+SET_VAR(parallel_pipeline_task_num=8)*/ *
FROM nation, lineitem
WHERE lineitem.l_suppkey = nation.n_nationkey;

SELECT /*+SET_VAR(parallel_pipeline_task_num=8,runtime_filter_mode=global)*/ *
FROM nation, lineitem
WHERE lineitem.l_suppkey = nation.n_nationkey;

Notes: You can stack other session variables (such as runtime_filter_mode) within the same HINT.

Session-Level Adjustment

Purpose: Make all SQL statements in the current session follow the same parallelism setting.
Command:
```
SET parallel_pipeline_task_num = 8;
```
Notes: Within the session, even single-row point queries run with this parallelism, which may slow down lightweight queries.

Global Adjustment

Purpose: Take effect on all new connections, affecting the cluster-wide default behavior.

Command:

SET GLOBAL parallel_pipeline_task_num = 8;

Notes: Use this only when you genuinely need to adjust cluster-wide CPU utilization. Otherwise, keep the default value.

FAQ / Troubleshooting

Q1: Why is the query slower after I increased the parallelism?

An overly large parallelism introduces more Shuffle and thread synchronization overhead, and the scheduling framework itself has its own overhead. For example, setting it to 48 on a 16-core machine produces almost no benefit and only adds scheduling cost. Start from the default value and adjust stepwise as 4 -> 2 -> 1.

Q2: Why is the recommended parallelism for a single-table point query 1?

A single-table point query has only one Fragment, and the bottleneck is the data scan. The scan thread and the query execution thread are separate, and the scan thread parallelizes adaptively. Therefore, increasing parallel_pipeline_task_num does not accelerate a scan-bound bottleneck.

Q3: Why is parallelism 1 also recommended for stress testing?

Stress testing itself involves a large number of concurrent queries that already saturate the CPU. Stacking single-query parallelism on top of that only adds thread scheduling and framework scheduling overhead.

Q4: What does the default value `0` mean?

parallel_pipeline_task_num = 0 means the BE automatically uses half the number of CPU cores. This default value balances single-query performance and concurrent throughput, and most scenarios do not need to change it.

Q5: Should I set the global parallelism or use a SQL HINT?

Prefer using a SQL HINT to tune individual slow SQL statements. Use SET GLOBAL only when you have confirmed that the cluster-wide CPU utilization needs an overall adjustment.

Parameter	Purpose	Default value
`parallel_pipeline_task_num`	The number of worker tasks for a single Fragment inside a BE	`0` (BE CPU cores / 2)
`runtime_filter_mode`	The working mode of Runtime Filter, often tuned together with parallelism	See the Runtime Filter documentation

Pre-Tuning Self-Check Checklist​

Core Concepts​

Tuning Principles​

Scenario-Based Tuning Recommendations​

Scenario Reference Table​

Scenario 1: Simple Single-Table Operations​

Scenario 2: JOIN/Aggregation Over Large Datasets​

Scenario 3: Stress Testing​

Scenario 4: Complex Queries​

Tuning Methods​

Method Comparison​

SQL-Level Adjustment​

Session-Level Adjustment​

Global Adjustment​

FAQ / Troubleshooting​

Q1: Why is the query slower after I increased the parallelism?​

Q2: Why is the recommended parallelism for a single-table point query 1?​

Q3: Why is parallelism 1 also recommended for stress testing?​

Q4: What does the default value 0 mean?​

Q5: Should I set the global parallelism or use a SQL HINT?​

Related Parameters​