Spill Disk

Overview

Doris's computing layer adopts an MPP (Massively Parallel Processing) architecture, in which all computation is performed in the memory of the BEs (Backends) and data exchange between BEs also goes through memory. Memory management is therefore crucial to query stability; according to online query statistics, a significant portion of query errors are memory-related.

As more users migrate workloads such as ETL data processing, multi-table materialized view processing, and complex ad-hoc queries to Doris, intermediate results must be offloaded to disk so that queries whose memory needs exceed the per-query or per-node limit can still execute. Specifically, when processing large datasets or executing complex queries, memory consumption can grow rapidly and exceed the memory limit of a single node or of the query as a whole. Doris relieves this pressure by writing intermediate results (such as intermediate aggregation states or temporary sort data) to disk instead of keeping them entirely in memory. This approach offers several benefits:

  • Scalability: Allows Doris to handle datasets much larger than the memory limit of a single node.
  • Stability: Reduces the risk of query failures or system crashes due to insufficient memory.
  • Flexibility: Enables users to execute more complex queries without increasing hardware resources.

To avoid triggering OOM (Out of Memory) when requesting memory, Doris has introduced a reserve memory mechanism. The workflow of this mechanism is as follows:

  • During execution, Doris estimates the memory size required to process each block and then requests it from a unified memory manager.
  • The global memory allocator determines whether the current memory request exceeds the memory limit of the query or the entire process. If it does, the request fails.
  • When Doris receives a failure message, it suspends the current query, selects the operator with the largest memory usage to spill to disk, and resumes query execution once spilling completes.

Currently, the operators that support spilling include:

  • Hash Join operator
  • Aggregation operator
  • Sort operator
  • CTE

When a query triggers spilling, additional disk read/write operations may significantly increase query time. It is recommended to increase the FE Session variable query_timeout. Additionally, spilling can generate significant disk I/O, so it is advisable to configure a separate disk directory or use SSD disks to reduce the impact of query spilling on normal data ingestion or queries. The query spilling feature is currently disabled by default.
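For example, to enable spilling and give a long-running query more time in the current session (the timeout value below is illustrative):

set enable_spill = true;
set query_timeout = 7200;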

Memory Management Mechanism

BE Process Memory Configuration

The memory of the entire BE process is controlled by the mem_limit parameter in be.conf. Once Doris's memory usage exceeds this threshold, Doris cancels the query that is currently requesting memory, and a background task asynchronously kills some queries to release memory or cache. Doris's internal management operations (such as spilling to disk and flushing memtables) therefore need to run as usage approaches this threshold, to avoid ever reaching it; once the threshold is reached, Doris takes drastic self-protection measures to prevent the whole process from going OOM.

When a Doris BE is collocated with other processes (such as Doris FE, Kafka, or HDFS), the memory actually available to the BE may be significantly less than the user-configured mem_limit. The internal memory-release mechanism can then fail to keep usage within the real limit, and the Doris process may be killed by the operating system's OOM killer. When the Doris process is deployed in Kubernetes or managed by cgroups, Doris automatically detects the container's memory configuration.
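For example, when a BE shares its host with other processes, the process limit can be lowered in be.conf to leave headroom for them (the value below is illustrative):

mem_limit = 80%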

Workload Group Memory Configuration

  • MAX_MEMORY_PERCENT is the upper bound for a group: the memory used by queries running in the group will never exceed this percentage of total memory. Once it is exceeded, queries either trigger disk spilling or are killed.
  • MIN_MEMORY_PERCENT is the minimum memory guaranteed to a group. When resources are idle, a group may use more than MIN_MEMORY_PERCENT; when memory is insufficient, the system reclaims memory down to MIN_MEMORY_PERCENT, possibly killing some queries, so that other Workload Groups have sufficient memory available.
  • The sum of MIN_MEMORY_PERCENT across all Workload Groups must not exceed 100%, and MIN_MEMORY_PERCENT cannot be greater than MAX_MEMORY_PERCENT.
  • low watermark: Default is 75%.
  • high watermark: Default is 90%.
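For example, both limits can be set on a group with ALTER WORKLOAD GROUP, following the property-name style used later in this document; the group name and values below are illustrative, and min_memory_percent is assumed to be the property name corresponding to MIN_MEMORY_PERCENT:

alter workload group normal properties (
    'max_memory_percent' = '90%',
    'min_memory_percent' = '20%'
);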

Query Memory Management

Static Memory Allocation

The memory used by a query is controlled by the following parameter:

  • exec_mem_limit, representing the maximum memory that a query can use, with a default value of 2GB.
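For example, to raise the per-query limit for the current session (the value is illustrative):

set exec_mem_limit = 8g;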

Slot-Based Memory Allocation

In practice, we found that with static memory allocation, users often do not know how much memory to allocate to a query, so exec_mem_limit is frequently set to half of the entire BE process memory. The limit then acts only as a circuit breaker against runaway queries rather than a realistic per-query budget, and it is far too large to drive finer-grained, memory-based policies such as spilling to disk. We therefore implemented a new slot-based memory limit on top of workload groups. The strategy works as follows:

  • Each workload group is configured by the user with two parameters: memory_limit and max_concurrency. Conceptually, the workload group's memory is divided into max_concurrency slots, with each slot representing memory_limit / max_concurrency of memory.
  • By default, each query occupies one slot during execution. If the user wants a query to use more memory, they need to modify the query_slot_count value.
  • Since the total number of slots in a workload group is fixed, increasing query_slot_count means each query occupies more slots, dynamically reducing the number of queries that can run concurrently in the workload group, causing new queries to automatically queue.

The slot_memory_policy for workload groups can have three optional values:

  • disabled, the default value, indicating that it is not enabled and the static memory allocation method is used.
  • fixed, where the memory that each query can use is calculated as workload group's mem_limit * query_slot_count / max_concurrency.
  • dynamic, where the memory that each query can use is calculated as workload group's mem_limit * query_slot_count / sum(slots of all running queries). This mainly addresses the problem of slots left unused under the fixed mode.

Both fixed and dynamic impose hard limits on queries: once a query exceeds its limit, it spills to disk or is killed, and these limits override the static memory allocation parameters set by the user. When setting slot_memory_policy, it is therefore essential to configure the workload group's max_concurrency properly to avoid memory insufficiency.
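As a worked example, suppose a workload group has memory_limit = 10G and max_concurrency = 10, so each slot represents 1 GB. A query submitted with query_slot_count = 4 can then use up to 10G * 4 / 10 = 4 GB under the fixed policy; under the dynamic policy, if all running queries together hold 5 slots, the same query can use up to 10G * 4 / 5 = 8 GB. The statements below are illustrative, assuming query_slot_count is set as a session variable as described above:

alter workload group normal properties ( 'slot_memory_policy' = 'fixed' );
set query_slot_count = 4;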

Spilling

Enabling Query Intermediate Result Spilling

BE Configuration Items

spill_storage_root_path=/mnt/disk1/spilltest/doris/be/storage;/mnt/disk2/doris-spill;/mnt/disk3/doris-spill
spill_storage_limit=100%
  • spill_storage_root_path: The storage path for query intermediate-result spill files. Defaults to the same value as storage_root_path.
  • spill_storage_limit: The disk space limit for spill files. It can be set to an absolute size (e.g., 100GB, 1TB) or a percentage; the default is 20%. If spill_storage_root_path is a dedicated disk, it can be set to 100%. This parameter mainly prevents spilling from occupying so much disk space that normal data storage is impeded.

After modifying these configuration items, restart the BE for them to take effect.

FE Session Variable

set enable_spill=true;
set exec_mem_limit = 10g;
  • enable_spill: whether spilling is enabled for the query.
  • exec_mem_limit: the maximum amount of memory the query may use.

Workload Group

The default max_memory_percent for workload groups is 100%, which can be adjusted based on the actual number of workload groups. If there is only one workload group, it can be adjusted to 90%.

alter workload group normal properties ( 'max_memory_percent'='90%' );

Monitoring Spilling

Audit Logs

The FE audit log has added the SpillWriteBytesToLocalStorage and SpillReadBytesFromLocalStorage fields, representing the total amount of data written to and read from local storage during spilling, respectively.

SpillWriteBytesToLocalStorage=503412182|SpillReadBytesFromLocalStorage=503412182

Profile

If spilling is triggered during a query, some Spill-prefixed counters are added to the Query Profile to mark and count spilling-related activities. Taking HashJoin's Build Hash Table as an example, you can see the following counters:

PARTITIONED_HASH_JOIN_SINK_OPERATOR (id=4, nereids_id=179): (ExecTime: 6sec351ms)
- Spilled: true
- CloseTime: 528ns
- ExecTime: 6sec351ms
- InitTime: 5.751us
- InputRows: 6.001215M (6001215)
- MemoryUsage: 0.00
- MemoryUsagePeak: 554.42 MB
- MemoryUsageReserved: 1024.00 KB
- OpenTime: 2.267ms
- PendingFinishDependency: 0ns
- SpillBuildTime: 2sec437ms
- SpillInMemRow: 0
- SpillMaxRowsOfPartition: 68.569K (68569)
- SpillMinRowsOfPartition: 67.455K (67455)
- SpillPartitionShuffleTime: 836.302ms
- SpillPartitionTime: 131.839ms
- SpillTotalTime: 5sec563ms
- SpillWriteBlockBytes: 714.13 MB
- SpillWriteBlockCount: 1.344K (1344)
- SpillWriteFileBytes: 244.40 MB
- SpillWriteFileTime: 350.754ms
- SpillWriteFileTotalCount: 32
- SpillWriteRows: 6.001215M (6001215)
- SpillWriteSerializeBlockTime: 4sec378ms
- SpillWriteTaskCount: 417
- SpillWriteTaskWaitInQueueCount: 0
- SpillWriteTaskWaitInQueueTime: 8.731ms
- SpillWriteTime: 5sec549ms

System Tables

backend_active_tasks

The SPILL_WRITE_BYTES_TO_LOCAL_STORAGE and SPILL_READ_BYTES_FROM_LOCAL_STORAGE fields have been added, representing the cumulative amounts of intermediate-result data that a query has written to and read from local storage so far.

mysql [information_schema]>select * from backend_active_tasks;
+-------+------------+-------------------+-----------------------------------+--------------+------------------+-----------+------------+----------------------+---------------------------+--------------------+-------------------+------------+------------------------------------+-------------------------------------+
| BE_ID | FE_HOST | WORKLOAD_GROUP_ID | QUERY_ID | TASK_TIME_MS | TASK_CPU_TIME_MS | SCAN_ROWS | SCAN_BYTES | BE_PEAK_MEMORY_BYTES | CURRENT_USED_MEMORY_BYTES | SHUFFLE_SEND_BYTES | SHUFFLE_SEND_ROWS | QUERY_TYPE | SPILL_WRITE_BYTES_TO_LOCAL_STORAGE | SPILL_READ_BYTES_FROM_LOCAL_STORAGE |
+-------+------------+-------------------+-----------------------------------+--------------+------------------+-----------+------------+----------------------+---------------------------+--------------------+-------------------+------------+------------------------------------+-------------------------------------+
| 10009 | 10.16.10.8 | 1 | 6f08c74afbd44fff-9af951270933842d | 13612 | 11025 | 12002430 | 1960955904 | 733243057 | 70113260 | 0 | 0 | SELECT | 508110119 | 26383070 |
| 10009 | 10.16.10.8 | 1 | 871d643b87bf447b-865eb799403bec96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | SELECT | 0 | 0 |
+-------+------------+-------------------+-----------------------------------+--------------+------------------+-----------+------------+----------------------+---------------------------+--------------------+-------------------+------------+------------------------------------+-------------------------------------+
2 rows in set (0.00 sec)
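When many tasks are running, selecting only the spill-related columns is often easier to read, for example:

select QUERY_ID, SPILL_WRITE_BYTES_TO_LOCAL_STORAGE, SPILL_READ_BYTES_FROM_LOCAL_STORAGE
from information_schema.backend_active_tasks;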

Testing

Test Environment

Machine Configuration

The test used Alibaba Cloud servers with the following specific configurations:

1 FE:

16 cores (vCPU), 32 GiB, 200 Mbps, ecs.c6.4xlarge

3 BEs:

16 cores (vCPU), 64 GiB, 0 Mbps, ecs.g6.4xlarge

Dataset

The test used the TPC-DS 10TB dataset as input, sourced from Alibaba Cloud DLF and mounted in Doris via the catalog mechanism. The SQL statement is as follows:

CREATE CATALOG dlf PROPERTIES (
    "type" = "hms",
    "hive.metastore.type" = "dlf",
    "dlf.proxy.mode" = "DLF_ONLY",
    "dlf.endpoint" = "dlf-vpc.cn-beijing.aliyuncs.com",
    "dlf.region" = "cn-beijing",
    "dlf.uid" = "217316283625971977",
    "dlf.catalog.id" = "emr_dev"
);

Reference website: https://doris.apache.org/zh-CN/docs/dev/benchmark/tpcds

Test Results

The dataset size was 10TB, giving a memory-to-dataset ratio of 1:52. The overall runtime was 32,000 seconds, and all 99 queries executed successfully. In the future, we will add spilling support for more operators (such as window functions and Intersect) and continue to optimize performance under spilling, reduce disk consumption, and improve query stability.

The per-query execution times (in milliseconds) are listed below.

Query     Doris (ms)
query1    29092
query2    130003
query3    96119
query4    1199097
query5    212719
query6    62259
query7    209154
query8    62433
query9    579371
query10   54260
query11   560169
query12   26084
query13   228756
query14   1137097
query15   27509
query16   84806
query17   288164
query18   94770
query19   124955
query20   30970
query21   4333
query22   9890
query23   1757755
query24   399553
query25   291474
query26   79832
query27   175894
query28   647497
query29   1299597
query30   11434
query31   106665
query32   33481
query33   146101
query34   84055
query35   69885
query36   148662
query37   21598
query38   164746
query39   5874
query40   51602
query41   563
query42   93005
query43   67769
query44   79527
query45   26575
query46   134991
query47   161873
query48   153657
query49   259387
query50   141421
query51   158056
query52   91392
query53   89497
query54   124118
query55   82584
query56   152110
query57   83417
query58   259580
query59   177125
query60   161729
query61   258058
query62   39619
query63   91258
query64   234882
query65   278610
query66   90246
query67   3939554
query68   183648
query69   11031
query70   137901
query71   166454
query72   2859001
query73   92015
query74   336694
query75   838989
query76   174235
query77   174525
query78   1956786
query79   162259
query80   602088
query81   16184
query82   56292
query83   26211
query84   11906
query85   57739
query86   34350
query87   173631
query88   449003
query89   113799
query90   30825
query91   12239
query92   26695
query93   275828
query94   56464
query95   64932
query96   48102
query97   597371
query98   112399
query99   64472
Sum       28102386