Adjusting Join Shuffle Mode with Hint
Overviewβ
Doris supports the use of hints to adjust the type of data shuffle in Join operations, thereby optimizing query performance. This section provides detailed instructions on how to specify the Join Shuffle type in Doris using hints.
Currently, Doris has good out-of-the-box capabilities. This means that in most scenarios, Doris will adaptively optimize performance in various scenarios, and users do not need to manually control hints for performance tuning. The content introduced in this chapter is mainly for professional tuners, and business personnel only need a simple understanding.
Currently, Doris supports two independent Distribute Hint, [shuffle]
and [broadcast]
, to specify the Distribute Type for the right table in a Join. The Distribute Type should be placed before the right table in the Join, enclosed in square brackets []
. Additionally, Doris can specify the shuffle mode by using the Leading Hint in conjunction with the Distribute Hint (for more details, refer to Reordering Join With Leading Hint).
Examples are as follows:
SELECT COUNT(*) FROM t2 JOIN [broadcast] t1 ON t1.c1 = t2.c2;
SELECT COUNT(*) FROM t2 JOIN [shuffle] t1 ON t1.c1 = t2.c2;
Caseβ
Next, we will demonstrate the usage of Distribute Hints through an example:
EXPLAIN SHAPE PLAN SELECT COUNT(*) FROM t1 JOIN t2 ON t1.c1 = t2.c2;
The plan for the original SQL is as follows, showing that the join between t1 and t2 uses the hash distribute method, indicated by DistributionSpecHash
.
+----------------------------------------------------------------------------------+
| Explain String (Nereids Planner) |
+----------------------------------------------------------------------------------+
| PhysicalResultSink |
| --hashAgg [GLOBAL] |
| ----PhysicalDistribute [DistributionSpecGather] |
| ------hashAgg [LOCAL] |
| --------PhysicalProject |
| ----------hashJoin [INNER_JOIN] hashCondition=((t1.c1 = t2.c2)) otherCondition=()|
| ------------PhysicalProject |
| --------------PhysicalOlapScan [t1] |
| ------------PhysicalDistribute [DistributionSpecHash] |
| --------------PhysicalProject |
| ----------------PhysicalOlapScan [t2] |
+----------------------------------------------------------------------------------+
After adding the [broadcast] hint:
EXPLAIN SHAPE PLAN SELECT COUNT(*) FROM t1 JOIN [broadcast] t2 ON t1.c1 = t2.c2;
It can be seen that the distribution method for the join between t1 and t2 has been changed to the broadcast method, indicated by DistributionSpecReplicated
.
+----------------------------------------------------------------------------------+
| Explain String (Nereids Planner) |
+----------------------------------------------------------------------------------+
| PhysicalResultSink |
| --hashAgg [GLOBAL] |
| ----PhysicalDistribute [DistributionSpecGather] |
| ------hashAgg [LOCAL] |
| --------PhysicalProject |
| ----------hashJoin [INNER_JOIN] hashCondition=((t1.c1 = t2.c2)) otherCondition=()|
| ------------PhysicalProject |
| --------------PhysicalOlapScan [t1] |
| ------------PhysicalDistribute [DistributionSpecReplicated] |
| --------------PhysicalProject |
| ----------------PhysicalOlapScan [t2] |
+----------------------------------------------------------------------------------+
Summaryβ
By appropriately using Distribute Hints, you can optimize the shuffle mode for Join operations and improve query performance. In practice, it is recommended to first analyze the query execution plan using EXPLAIN and then specify the appropriate shuffle type based on the actual situation.