Skip to main content

Continuous Load Overview

Doris supports continuously loading data from multiple data sources into Doris tables through Streaming Job. After a Job is submitted, Doris keeps the load job running, reading data from the source in real time and writing it into the Doris table.

tip

This feature is supported starting from version 4.1.0.

This document helps you answer the following questions:

  • Which data sources and sync modes does continuous load support?
  • How do you choose between table-level sync and database-level sync?
  • How do job states transition, and how does automatic recovery work?
  • How do you view, pause, resume, and delete load jobs in daily operations?
  • What are the common FE and Job configuration parameters?

Supported Data Sources and Sync Modes

Continuous load supports the following data sources and sync modes:

Data SourceSupported VersionsTable-Level SyncDatabase-Level SyncConfiguration Guide
MySQL5.6, 5.7, 8.0.xMySQL Table-Level SyncMySQL Database-Level SyncAmazon RDS MySQL · Amazon Aurora MySQL
PostgreSQL14, 15, 16, 17PostgreSQL Table-Level SyncPostgreSQL Database-Level SyncAmazon RDS PostgreSQL · Amazon Aurora PostgreSQL
S3-S3 Continuous Load--

How to Choose a Sync Method

Table-level sync and database-level sync are two continuous load methods with completely different underlying mechanisms, not a difference in "number of tables." Database-level sync also supports syncing only a single table through include_tables, so the choice should be based on capability requirements.

Capability Comparison

CapabilityTable-Level SyncDatabase-Level Sync
Underlying mechanismJob + TVF (INSERT INTO tbl SELECT * FROM tvf())Job + native whole-database DDL (FROM src TO DATABASE db)
Target levelAn existing Doris tableA Doris database container
Sync scopeA single tableOne to multiple tables to the entire database (controlled by include_tables)
Automatic table creationTables must be pre-createdPrimary key tables are created automatically on first sync
SQL flexibilitySupports column mapping, filtering, and transformation (SELECT clause)Copies as-is, does not support ETL
Semantic guaranteeexactly-onceat-least-once
Required privilegesLoadLoad + Create (when creating tables automatically)
Typical scenariosReal-time sync that requires column pruning, field renaming, type conversion, or conditional filteringMirror replication of an entire database or a group of tables, where downstream table schemas should automatically follow the upstream

Selection Recommendations

  • You need to apply SQL processing to the data, or you have strict requirements for exactly-once semantics -> choose table-level sync
  • You want Doris to create tables automatically and sync a group of tables with one configuration -> choose database-level sync
  • The data source is S3 object storage -> only table-level sync is supported (using the S3 TVF)

Job State Transitions

A Streaming Job transitions between the following states during execution. Table-level sync and database-level sync follow the same state machine:

job-state-flow

State Descriptions

StateMeaning
PENDINGThe job has been created but has not yet scheduled any subtasks; it is waiting for the next scheduling round to create a StreamingTask.
RUNNINGA subtask has been spawned and is executing, reading incremental data from the source and writing it into Doris.
FINISHEDThe source has been fully consumed and the job is terminated. An S3 TVF job enters this state after all files have been loaded.
PAUSEDA subtask failed, the job is automatically paused, and failReason is recorded. You can check the cause through the ErrorMsg field of select * from jobs(...).

Automatic Recovery (autoResume)

After a job enters PAUSED, the scheduler periodically tries to recover it using an exponential backoff strategy. On recovery, the job returns to PENDING and continues to create subtasks. No manual intervention is required: transient failures (network jitter, brief upstream unavailability, and so on) are absorbed automatically.

Use the appropriate command for each scenario:

  • Resume immediately, or start manually after troubleshooting: use RESUME JOB
  • Stop completely and no longer schedule: use PAUSE JOB (a manually paused job is not woken up by autoResume) or DROP JOB

Common Operations

View Load Status

Query all Streaming-type Insert Jobs:

select * from jobs("type"="insert") where ExecuteType = "STREAMING";

Result columns:

ColumnDescription
IDJob ID
NAMEJob name
DefinerJob definer
ExecuteTypeJob scheduling type: ONE_TIME/RECURRING/STREAMING/MANUAL
RecurringStrategyRecurring strategy. Used by regular Insert. Empty when ExecuteType=Streaming
StatusJob status
ExecuteSqlThe Insert SQL statement of the Job
CreateTimeJob creation time
SucceedTaskCountNumber of successful tasks
FailedTaskCountNumber of failed tasks
CanceledTaskCountNumber of canceled tasks
CommentJob comment
PropertiesJob properties
CurrentOffsetThe offset that the Job has finished processing. Only set when ExecuteType=Streaming
EndOffsetThe maximum EndOffset retrieved from the data source. Only set when ExecuteType=Streaming
LoadStatisticJob statistics
ErrorMsgError message of the Job
JobRuntimeMsgRuntime hints of the Job

View Task Status

Query all subtasks under a Job by Job ID:

select * from tasks("type"="insert") where jobId='<job_id>';

Result columns:

ColumnDescription
TaskIdTask ID
JobIDJobID
JobNameJob name
LabelThe label used by the Task for loading
StatusTask status
ErrorMsgTask failure message
CreateTimeTask creation time
StartTimeTask start time
FinishTimeTask finish time
LoadStatisticTask statistics
UserTask executor
RunningOffsetOffset information currently being synced by the task. Only set when Job.ExecuteType=Streaming

Pause a Load Job

Manually pause the specified job (a paused job is not woken up by autoResume):

PAUSE JOB WHERE jobname = <job_name>;

Resume a Load Job

Resume a job that is in the PAUSED state:

RESUME JOB WHERE jobName = <job_name>;

Delete a Load Job

Permanently delete the specified job. After deletion, the job is no longer scheduled:

DROP JOB WHERE jobName = <job_name>;

Common Parameters

FE Configuration Parameters

ParameterDefaultDescription
max_streaming_job_num1024Maximum number of Streaming jobs
job_streaming_task_exec_thread_num10Number of threads used to execute StreamingTask
max_streaming_task_show_count100Maximum number of StreamingTask execution records kept in memory

Job Common Load Configuration Parameters

ParameterDefaultDescription
max_interval10sIdle scheduling interval when the upstream has no new data.

FAQ

MySQL Connection Error: Public Key Retrieval is not allowed

Cause: The configured MySQL user uses SHA256 password authentication, which requires the password to be transmitted over a protocol such as TLS.

Solution 1: Add the allowPublicKeyRetrieval=true parameter to the JDBC URL:

jdbc:mysql://127.0.0.1:3306?allowPublicKeyRetrieval=true

Solution 2: Change the MySQL user's authentication method to mysql_native_password:

ALTER USER 'username'@'%' IDENTIFIED WITH mysql_native_password BY 'password';
FLUSH PRIVILEGES;