跳到主要内容

Load Transactions

TL;DR Apache Doris wraps every data load in a transaction. You can group several inserts under one BEGIN ... COMMIT, dedupe retries with a load Label, and drive a two-phase commit from headers in a Stream Load HTTP request. That last part is what lets Flink, Spark, and CDC pipelines write to Doris exactly once without keeping their own bookkeeping table.

Apache Doris Load Transactions: Run multi-statement loads as one atomic unit, with label-based dedup and 2PC for exactly-once streaming sinks.

Why use load transactions in Apache Doris?

Apache Doris load transactions push duplicate-row and partial-write problems off the application and into the warehouse, so a Flink replay, a half-completed multi-table write, or a re-uploaded file all behave correctly. Loading data into a warehouse goes wrong in predictable ways. A Flink job restarts mid-checkpoint and replays the last batch, doubling rows. A CDC pipeline updates a fact table but its companion dimension write fails, leaving the two out of sync until the next backfill. A retry script re-uploads the same file because it has no way to know the previous attempt actually landed. Each of these forces ugly bookkeeping in the application.

  • Replays from a stalled streaming job duplicate rows.
  • Multi-table writes that should move together can stop halfway.
  • Resubmissions of the same file land twice unless you track them yourself.

Apache Doris solves all three on the server, so the load client can stay thin.

What are Apache Doris load transactions?

Apache Doris load transactions are the atomic write unit underneath every ingestion path. Every write into Doris runs inside a transaction. That covers INSERT INTO, Stream Load, Routine Load, and Broker Load. Each one gets a unique Label and walks through a tracked set of lifecycle states managed by the FE. You can let Apache Doris open and commit that transaction implicitly per load, take control with explicit BEGIN/COMMIT/ROLLBACK, or split the commit into two phases over HTTP for streaming sinks.

Key terms

  • Label: a per-database string that identifies a transaction. Resubmitting the same Label deduplicates the load.
  • Implicit transaction: every single load is its own transaction; auto-commits on success.
  • Explicit transaction: BEGIN ... COMMIT; groups several writes into one atomic unit on the same database.
  • Stream Load 2PC: a Stream Load that stops at PRECOMMITTED until a follow-up HTTP call commits or aborts it.
  • Lifecycle states: PREPAREPRECOMMITTEDCOMMITTEDVISIBLE (or ABORTED). Rows are queryable only after VISIBLE.

How do Apache Doris load transactions work?

Apache Doris load transactions walk every write through five FE-managed states (PREPAREPRECOMMITTEDCOMMITTEDVISIBLE, or ABORTED on failure), with a Label that deduplicates retries and a final replica-level publish that makes rows queryable at one instant.

  1. Open. A load opens a transaction in PREPARE. The FE assigns a TxnId and ties it to the user-supplied or generated Label. A second submission with the same Label short-circuits to the original transaction's result.
  2. Write. BEs receive the data, build segments, and report success. The transaction moves to PRECOMMITTED once all expected loads acknowledge.
  3. Commit. A COMMIT (or the second-phase txn_operation: commit HTTP call) moves the transaction to COMMITTED. The data is on disk but not yet readable.
  4. Publish. A background daemon ships a PublishVersion task to every replica. When all replicas of every affected tablet acknowledge, the transaction reaches VISIBLE.
  5. Read. Queries running after VISIBLE see the new rows. Earlier queries see the prior version. Tablets in the same transaction never produce a torn read.

If anything fails along the way, or the client calls ROLLBACK (or txn_operation: abort), the transaction goes to ABORTED and the data is discarded.

Quick start

CREATE TABLE orders (
id INT, amount DECIMAL(10,2)
) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 4;

CREATE TABLE order_audit (
id INT, action STRING
) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 4;

BEGIN;
INSERT INTO orders VALUES (1, 99.50), (2, 149.00);
INSERT INTO order_audit VALUES (1, 'created'), (2, 'created');
COMMIT;

Expected result

Query OK, 2 rows affected
{'label':'txn_insert_b55db21aad7451b','status':'VISIBLE','txnId':'10013'}

Both tables become readable at the same instant. If the second INSERT had failed or the connection had dropped before COMMIT, neither table would carry the new rows.

When should you use Apache Doris load transactions?

Reach for Apache Doris load transactions when several inserts must move together, when a streaming sink needs exactly-once semantics, or when retried loads must dedupe on a Label rather than in the application.

Good fit

  • Flink and Spark sinks that rely on Stream Load 2PC for exactly-once semantics across checkpoints.
  • CDC pipelines that need fact and dimension tables to move together.
  • Backfills that swap a batch of related rows in or out as a unit.
  • Workloads where idempotent retries matter and you want the warehouse to enforce dedup via Label rather than the client.

Not a good fit

  • High-frequency, single-row inserts. Each transaction adds a commit round-trip and a tablet version, and small loads bottleneck on the commit stage. Use Group Commit instead.
  • Atomicity across two databases. All tables in one transaction must live in the same database.
  • Stream Load 2PC against Merge-on-Write Unique Key tables in cloud / storage-compute-separated deployments. The BE rejects this combination with Status::NotSupported; on-prem (shared-nothing) clusters allow it.
  • Long-horizon deduplication. Labels are evicted after roughly 3 days or 2,000 entries per database, whichever hits first; reusing one after that does not dedupe. Track long-lived idempotency keys in the application.
  • Reading your own writes inside the same transaction. Each statement runs against the snapshot taken at its start; a previous INSERT is not visible until after COMMIT.

Further reading