Delete Overview
In Apache Doris, the delete operation is a key feature for managing and cleaning data to meet the flexibility needs of users in large-scale data analysis scenarios. Doris's deletion mechanism supports efficient logical deletion and multi-version data management, achieving a good balance between performance and flexibility.
Implementation Mechanism of Deletion
Doris's delete operation uses logical deletion rather than directly physically deleting data. The core implementation mechanisms are as follows:
- Logical Deletion. The delete operation does not directly remove data from storage but adds a delete marker to the target data. There are two main ways to implement logical deletion: delete predicate and delete sign.
  - Delete predicate is used for the Duplicate and Aggregate models. Each deletion records a conditional predicate on the corresponding dataset, and the predicate filters out the deleted data during queries.
  - Delete sign is used for the Unique Key model. Each deletion writes a new batch of data that overwrites the data to be deleted, and the hidden column __DORIS_DELETE_SIGN__ of the new data is set to 1, indicating that the data has been deleted.
  - Performance comparison: A delete predicate is very fast. Whether it deletes 1 row or 100 million rows, the cost is almost the same, because it only writes a conditional predicate to the dataset. The write cost of a delete sign is proportional to the amount of data being deleted.
- Multi-Version Data Management. Doris supports multi-version data (MVCC, Multi-Version Concurrency Control), allowing concurrent operations on the same dataset without affecting query results. The delete operation creates a new version containing the delete marker, while the old version data is still retained.
- Physical Deletion (Compaction). The periodically executed compaction process cleans up data marked for deletion, thereby freeing up storage space. This process is automatically completed by the system without user intervention. Note that only Base Compaction will physically delete data, while Cumulative Compaction only merges and reorders data, reducing the number of rowsets and segments.
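To make the delete sign mechanism more concrete, the following sketch shows one way to check whether a Unique Key table carries the hidden delete sign column. The table name example_db.user_profile is hypothetical, and the exact output depends on the Doris version in use.

```sql
-- Hypothetical table name, for illustration only.
-- Make hidden columns visible in the current session.
SET show_hidden_columns = true;

-- For a Unique Key table, the column list is expected to include
-- the hidden column __DORIS_DELETE_SIGN__.
DESC example_db.user_profile;

-- Restore the default behavior for the session.
SET show_hidden_columns = false;
```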
Use Cases for Delete Operations
Doris provides various deletion methods to meet different needs:
Conditional Deletion
Users can delete rows that meet specified conditions. For example:
DELETE FROM table_name WHERE condition;
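As a concrete illustration of the template above, the sketch below assumes a hypothetical table example_db.orders with columns order_date and order_status and a partition named p202401. None of these names come from Doris itself.

```sql
-- Hypothetical table, partition, and column names, for illustration only.
-- Delete cancelled orders older than a cutoff date from a single partition.
DELETE FROM example_db.orders PARTITION p202401
WHERE order_date < '2024-01-15' AND order_status = 'cancelled';

-- List the delete operations that have been executed in the database.
SHOW DELETE FROM example_db;
```

Restricting the statement to one partition is optional. It is shown here because it limits the scope of the delete to the data that is actually affected.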
Batch Deletion via Data Loading
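During data loading, logical deletion can be achieved by loading rows that overwrite the existing rows with the same keys and mark them as deleted. This method is suitable for deleting a large number of keys in one batch, or for replaying deletions from a transactional (TP) database during CDC binlog synchronization.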
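One possible shape of such a load is a Broker Load job in MERGE mode, sketched below. The label, file path, broker name, credentials, table, and the op_flag source column are all hypothetical, and the sketch assumes the target is a Unique Key table with batch delete support. Consult the load documentation for the Doris version in use before relying on it.

```sql
-- Hypothetical label, path, broker, and column names, for illustration only.
-- Rows whose source column op_flag equals 1 are applied as deletes;
-- all other rows are applied as normal upserts.
LOAD LABEL example_db.orders_cdc_batch_001
(
    MERGE DATA INFILE("hdfs://namenode_host:8020/path/to/orders_changes.csv")
    INTO TABLE orders
    COLUMNS TERMINATED BY ","
    (order_id, order_date, order_status, op_flag)
    DELETE ON op_flag = 1
)
WITH BROKER "broker_name"
(
    "username" = "hdfs_user",
    "password" = "hdfs_password"
);
```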
Deleting All Data
In some cases, data can be deleted by directly truncating the table or partition. For example:
TRUNCATE TABLE table_name;
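Truncation can also be limited to specific partitions. The sketch below reuses hypothetical table and partition names.

```sql
-- Hypothetical table and partition names, for illustration only.
-- Remove all data from two partitions and leave the rest of the table intact.
TRUNCATE TABLE example_db.orders PARTITION (p202401, p202402);
```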
Atomic Overwrite Using Temporary Partitions
In some cases, users may want to rewrite the data of a partition. If the data is deleted and then imported, there will be a period when the data is unavailable. In this case, users can create a corresponding temporary partition, import the new data into the temporary partition, and then replace the original partition atomically to achieve the goal.
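The three steps described above can be sketched as follows. The example assumes a hypothetical range-partitioned table example_db.orders whose partition p202401 covers January 2024, and a hypothetical staging table example_db.orders_staging that holds the rewritten data.

```sql
-- Hypothetical table, partition, and column names, for illustration only.

-- 1. Create a temporary partition that covers the same range as p202401.
ALTER TABLE example_db.orders
ADD TEMPORARY PARTITION tp202401 VALUES [("2024-01-01"), ("2024-02-01"));

-- 2. Load the rewritten data into the temporary partition.
--    Queries keep reading the original partition during this step.
INSERT INTO example_db.orders TEMPORARY PARTITION (tp202401)
SELECT * FROM example_db.orders_staging
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01';

-- 3. Atomically replace the original partition with the temporary one.
ALTER TABLE example_db.orders
REPLACE PARTITION (p202401) WITH TEMPORARY PARTITION (tp202401);
```

Because the swap in step 3 is a single metadata operation, readers see either the old data or the new data, never an empty partition.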
Notes
- The delete operation generates new data versions, so frequent deletions may increase the number of versions, affecting query performance.
- Compaction is a key step in freeing up storage space. Users are advised to adjust the compaction strategy based on system load.
- Deleted data will still occupy storage until compaction is completed, so the delete operation itself will not immediately reduce storage usage.