Continuous Load Overview

Overview

Doris supports continuously loading data from multiple data sources into Doris tables via a Streaming Job. After a job is submitted, Doris runs it continuously, reading data from the source in real time and writing it into the target tables.

Continuous Load supports the following data sources and import modes:

| Data Source | Supported Versions | Table-level Sync | Database-level Sync | Setup Guide |
|---|---|---|---|---|
| MySQL | 5.6, 5.7, 8.0.x | MySQL Table-level Sync | MySQL Database-level Sync | Amazon RDS MySQL · Amazon Aurora MySQL |
| PostgreSQL | 14, 15, 16, 17 | PostgreSQL Table-level Sync | PostgreSQL Database-level Sync | Amazon RDS PostgreSQL · Amazon Aurora PostgreSQL |
| S3 | - | S3 Continuous Load | - | - |

How to Choose

Table-level Sync and Database-level Sync are two fundamentally different mechanisms, not a distinction by "number of tables". Database-level Sync can also sync just one table via include_tables, so the choice should be driven by capability requirements:

| Capability | Table-level Sync | Database-level Sync |
|---|---|---|
| Underlying mechanism | Job + TVF (`INSERT INTO tbl SELECT * FROM tvf()`) | Job + native database DDL (`FROM src TO DATABASE db`) |
| Target granularity | One existing Doris table | A Doris database container |
| Sync scope | A single table | One, many, or all tables (controlled by `include_tables`) |
| Auto-create tables | ❌ Requires pre-creation | ✅ Auto-creates primary-key tables on first sync |
| SQL expressiveness | ✅ Column mapping, filtering, transformation (via SELECT) | ❌ Direct replication, no ETL |
| Delivery semantics | exactly-once | at-least-once |
| Required privileges | Load | Load + Create (when auto-creating tables) |
| Typical scenarios | Real-time sync needing column pruning, renaming, type conversion, or conditional filtering | Mirror replication of a database or group of tables, where the downstream schema should track the upstream automatically |
  • Need SQL transformations or strict exactly-once semantics → Choose Table-level Sync
  • Want Doris to auto-create tables and sync a group of tables with one config → Choose Database-level Sync
  • Source is S3 object storage → Only Table-level Sync is supported (via S3 TVF)
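The two mechanisms named above can be sketched roughly as follows. This is an illustrative sketch only: the table names, TVF arguments, and source name are hypothetical placeholders, and the exact Streaming Job syntax is covered in the linked setup guides.

```sql
-- Table-level Sync (sketch): the job wraps an INSERT ... SELECT over a TVF,
-- so columns can be pruned, renamed, filtered, or converted on the way in.
-- doris_orders, order_id, and the TVF arguments are hypothetical.
INSERT INTO doris_orders (id, amount)
SELECT order_id, total_amount
FROM source_tvf(...)          -- source TVF; see the per-source setup guide
WHERE total_amount > 0;

-- Database-level Sync (sketch): direct replication into a Doris database
-- container, with tables auto-created on first sync. "FROM src TO DATABASE db"
-- is the shape named in the table above, not verbatim syntax.
-- FROM mysql_source TO DATABASE doris_db
```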

Job Lifecycle

A Streaming Job transitions between the following states during its lifecycle. Both Table-level Sync and Database-level Sync follow the same state machine:

```mermaid
stateDiagram-v2
    [*] --> PENDING: create job
    PENDING --> RUNNING: createStreamingTask()
    RUNNING --> FINISHED: source consumed
    RUNNING --> PAUSED: task failed (with failReason)
    PAUSED --> PENDING: autoResume after exponential backoff
    FINISHED --> [*]
```
| State | Description |
|---|---|
| PENDING | The job has been created but no StreamingTask has been dispatched yet; awaiting the next scheduling round |
| RUNNING | A child task has been dispatched and is running, reading incremental data from the source and writing it into Doris |
| FINISHED | The source has been fully consumed and the job has terminated. S3 TVF jobs enter this state once all files have been imported |
| PAUSED | A child task failed; the job is automatically paused and a failReason is recorded. Check the ErrorMsg column in `select * from jobs(...)` for details |

Auto-resume: After entering PAUSED, the scheduler periodically retries with an exponential backoff strategy and transitions the job back to PENDING to dispatch a new task. Transient failures (network jitter, brief upstream unavailability, etc.) are absorbed automatically without manual intervention. To resume immediately after diagnosing a failure, use RESUME JOB; to stop scheduling entirely, use PAUSE JOB (manually paused jobs are NOT woken up by auto-resume) or DROP JOB.
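A quick way to spot auto-paused jobs and their recorded failure reason is the `jobs()` TVF described in Common Operations below. This assumes the `Status` column values match the state names above:

```sql
-- List streaming jobs currently paused by a task failure, with the
-- recorded failure reason.
select Id, Name, Status, ErrorMsg
from jobs("type"="insert")
where ExecuteType = "STREAMING" and Status = "PAUSED";

-- After fixing the cause, resume immediately instead of waiting for
-- the auto-resume backoff to fire.
RESUME JOB WHERE jobName = '<job_name>';
```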

Common Operations

Check Import Status

```sql
select * from jobs("type"="insert") where ExecuteType = "STREAMING";
```
| Column | Description |
|---|---|
| ID | Job ID |
| NAME | Job name |
| Definer | Job definer |
| ExecuteType | Job type: ONE_TIME/RECURRING/STREAMING/MANUAL |
| RecurringStrategy | Recurring strategy, empty for Streaming |
| Status | Job status |
| ExecuteSql | Job's Insert SQL statement |
| CreateTime | Job creation time |
| SucceedTaskCount | Number of successful tasks |
| FailedTaskCount | Number of failed tasks |
| CanceledTaskCount | Number of canceled tasks |
| Comment | Job comment |
| Properties | Job properties |
| CurrentOffset | Current offset, only for Streaming jobs |
| EndOffset | Max end offset from source, only for Streaming jobs |
| LoadStatistic | Job statistics |
| ErrorMsg | Job error message |
| JobRuntimeMsg | Job runtime info |
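Since `CurrentOffset` (consumed) and `EndOffset` (available at the source) are both exposed, comparing them gives a rough sense of how far a job lags behind. Whether the two values are numerically comparable depends on the source (e.g. binlog position vs. file offset), so this sketch only surfaces them side by side:

```sql
-- Show consumed vs. available offsets for one streaming job.
-- <job_name> is a placeholder for your job's name.
select NAME, CurrentOffset, EndOffset, LoadStatistic
from jobs("type"="insert")
where ExecuteType = "STREAMING" and NAME = '<job_name>';
```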

Check Task Status

```sql
select * from tasks("type"="insert") where jobId='<job_id>';
```
| Column | Description |
|---|---|
| TaskId | Task ID |
| JobID | Job ID |
| JobName | Job name |
| Label | Task label |
| Status | Task status |
| ErrorMsg | Task error message |
| CreateTime | Task creation time |
| StartTime | Task start time |
| FinishTime | Task finish time |
| LoadStatistic | Task statistics |
| User | Task executor |
| RunningOffset | Current offset, only for Streaming jobs |
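When a job is paused, the job-level ErrorMsg usually points at a failed child task; drilling into `tasks()` shows which task failed and when. Sorting newest-first keeps the relevant failure at the top:

```sql
-- Inspect recent tasks of one job to read the failing task's ErrorMsg.
-- <job_id> is a placeholder; task Status values may vary by Doris version.
select TaskId, Status, ErrorMsg, CreateTime, FinishTime
from tasks("type"="insert")
where jobId = '<job_id>'
order by CreateTime desc;
```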

Pause Import Job

```sql
PAUSE JOB WHERE jobName = <job_name>;
```

Resume Import Job

```sql
RESUME JOB WHERE jobName = <job_name>;
```

Delete Import Job

```sql
DROP JOB WHERE jobName = <job_name>;
```

Common Parameters

FE Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| max_streaming_job_num | 1024 | Maximum number of Streaming jobs |
| job_streaming_task_exec_thread_num | 10 | Number of threads for executing StreamingTasks |
| max_streaming_task_show_count | 100 | Maximum number of StreamingTask records kept in memory |
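These are FE configuration items, so they can be inspected (and, for mutable items, adjusted) with the standard frontend-config statements. The value below is illustrative, and whether a given item is mutable at runtime depends on the Doris version:

```sql
-- Inspect the current value of an FE config item.
ADMIN SHOW FRONTEND CONFIG LIKE "max_streaming_job_num";

-- Raise the streaming job cap at runtime (illustrative value; persist it
-- in fe.conf if it must survive an FE restart).
ADMIN SET FRONTEND CONFIG ("max_streaming_job_num" = "2048");
```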

General Job Import Parameters

| Parameter | Default | Description |
|---|---|---|
| max_interval | 10s | Idle scheduling interval when no new data is available |

FAQ

MySQL connection error: Public Key Retrieval is not allowed

Cause: The MySQL user uses SHA-256-based password authentication (e.g. caching_sha2_password), which requires either a TLS connection or retrieval of the server's RSA public key to transmit the password; the JDBC driver refuses public key retrieval by default.

Solution 1: Add allowPublicKeyRetrieval=true to the JDBC URL:

```
jdbc:mysql://127.0.0.1:3306?allowPublicKeyRetrieval=true
```

Solution 2: Change the MySQL user's authentication method to mysql_native_password:

```sql
ALTER USER 'username'@'%' IDENTIFIED WITH mysql_native_password BY 'password';
FLUSH PRIVILEGES;
```