Loading Overview

Introduction to Import Solutions

This section provides an overview of import solutions to help users choose the most suitable one based on data source, file format, and data volume.

Doris supports various import methods, including Stream Load, Broker Load, Insert Into, Routine Load, and MySQL Load. In addition to these native import methods, Doris provides a range of ecosystem tools to assist with data import, including the Spark Doris Connector, Flink Doris Connector, Doris Kafka Connector, DataX Doriswriter, and Doris Streamloader.

For high-frequency small import scenarios, Doris also provides the Group Commit feature. Group Commit is not a new import method but an extension of INSERT INTO VALUES, Stream Load, and Http Stream that batches small imports on the server side.
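For example, for SQL writes group commit is typically enabled per session. The sketch below is a minimal illustration, assuming a Doris version that supports the group_commit session variable; the orders table and its values are placeholders:

```sql
-- Minimal sketch; the table and values are hypothetical.
CREATE TABLE IF NOT EXISTS orders (
    id     BIGINT,
    amount DECIMAL(10, 2)
)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Assumption: enabling group commit for this session causes small
-- writes to be batched on the server side before being committed,
-- trading a little visibility latency for much higher throughput.
SET group_commit = async_mode;

-- Each small INSERT is merged into a larger internal write.
INSERT INTO orders VALUES (1, 9.99);
INSERT INTO orders VALUES (2, 19.99);
```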

Each import method and ecosystem tool has different use cases and supports different data sources and file formats.

Import Methods

| Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
| --- | --- | --- | --- | --- |
| Stream Load | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
| Broker Load | Import from object storage, HDFS, etc. | csv, json, parquet, orc | Tens of GB to hundreds of GB | Asynchronous |
| INSERT INTO VALUES | Import single or small-batch data; import via JDBC, etc. | SQL | Simple testing | Synchronous |
| INSERT INTO SELECT | Import data between Doris internal tables; import from external tables | SQL | Depending on memory size | Synchronous |
| Routine Load | Real-time import from Kafka | csv, json | Micro-batch import, MB to GB | Asynchronous |
| MySQL Load | Import from local data | csv | Less than 10GB | Synchronous |
| Group Commit | High-frequency small-batch import | Depending on the import method used | Micro-batch import, KB | - |
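To make the synchronous/asynchronous distinction concrete, the sketch below submits a Broker Load job that reads CSV files from S3-compatible object storage. It is a minimal illustration, not a full reference: the label, table, path, and connection properties are hypothetical placeholders.

```sql
-- Minimal Broker Load sketch (asynchronous). All identifiers, paths,
-- and connection properties are hypothetical placeholders.
LOAD LABEL example_db.label_orders_20240101
(
    DATA INFILE("s3://example-bucket/orders/*.csv")
    INTO TABLE orders
    COLUMNS TERMINATED BY ","
    FORMAT AS "CSV"
)
WITH S3
(
    "s3.endpoint"   = "s3.us-east-1.amazonaws.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "<your_access_key>",
    "s3.secret_key" = "<your_secret_key>"
)
PROPERTIES
(
    "timeout" = "3600"
);
```

Because Broker Load is asynchronous, this statement only submits the job; its actual outcome is checked later with SHOW LOAD (see Import Mode below).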

Ecosystem Tools

| Ecosystem Tool | Use Case |
| --- | --- |
| Spark Doris Connector | Batch import of data from Spark |
| Flink Doris Connector | Real-time import of data from Flink |
| Doris Kafka Connector | Real-time import of data from Kafka |
| DataX Doriswriter | Synchronize data from MySQL, Oracle, SQL Server, PostgreSQL, Hive, ADS, etc. |
| Doris Streamloader | Implements concurrent import for Stream Load, allowing multiple files and directories to be imported at once |
| X2Doris | Migrate data from other AP databases to Doris |
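As a sketch of how one of these tools is wired up, the Flink SQL below declares a Doris sink through the Flink Doris Connector and streams generated rows into it. The option names follow the connector's usual style, but the FE address, credentials, and table names are placeholder assumptions:

```sql
-- Flink SQL sketch: a generated stream written into Doris through the
-- Flink Doris Connector. Endpoints, credentials, and names are placeholders.
CREATE TABLE upstream_orders (
    id     BIGINT,
    amount DECIMAL(10, 2)
) WITH (
    'connector' = 'datagen',       -- built-in test source
    'rows-per-second' = '10'
);

CREATE TABLE doris_sink (
    id     BIGINT,
    amount DECIMAL(10, 2)
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe_host:8030',                -- Doris FE HTTP address
    'table.identifier' = 'example_db.orders',
    'username' = 'root',
    'password' = '',
    'sink.label-prefix' = 'flink_orders'       -- labels make retries idempotent
);

INSERT INTO doris_sink SELECT id, amount FROM upstream_orders;
```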

File Formats

| File Format | Supported Import Methods | Supported Compression Formats |
| --- | --- | --- |
| csv | Stream Load, Broker Load, MySQL Load | gz, lzo, bz2, lz4, LZ4FRAME, lzop, deflate |
| json | Stream Load, Broker Load | Not supported |
| parquet | Stream Load, Broker Load | Not supported |
| orc | Stream Load, Broker Load | Not supported |

Data Sources

| Data Source | Supported Import Methods |
| --- | --- |
| Local data | Stream Load, Doris Streamloader, MySQL Load |
| Object storage | Broker Load, INSERT INTO SELECT FROM S3 TVF |
| HDFS | Broker Load, INSERT INTO SELECT FROM HDFS TVF |
| Kafka | Routine Load, Doris Kafka Connector |
| Flink | Flink Doris Connector |
| Spark | Spark Doris Connector |
| MySQL, PostgreSQL, Oracle, SQL Server, and other TP databases | Import via external tables, Flink Doris Connector |
| Other AP databases | X2Doris, import via external tables, Spark/Flink Doris Connector |
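The table above lists INSERT INTO SELECT FROM S3 TVF as an option for object storage; a minimal sketch of that pattern follows, with placeholder bucket, credentials, and table names (parameter names may vary slightly across Doris versions):

```sql
-- Load a CSV file from object storage via the S3 table-valued function.
-- Bucket, credentials, and table names are placeholders.
INSERT INTO example_db.orders
SELECT *
FROM S3(
    "uri"           = "s3://example-bucket/orders/2024-01-01.csv",
    "format"        = "csv",
    "s3.endpoint"   = "s3.us-east-1.amazonaws.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "<your_access_key>",
    "s3.secret_key" = "<your_secret_key>"
);
```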

Concept Introduction

This section introduces concepts related to import to help users make better use of the data import features.

Atomicity

All import tasks in Doris are atomic: an import job either succeeds completely or fails completely, and partially imported data will never be visible within the same import task. Atomicity and consistency between a table and its associated materialized views are also guaranteed, without any additional configuration by the user.

Label Mechanism

Import jobs in Doris can be assigned a label, usually a user-defined string with business meaning. If the user does not specify one, the system generates one automatically. The label uniquely identifies an import job and guarantees that the same label is imported successfully at most once.

If a label that has already been imported successfully is used again, the job is rejected with the error message Label already used. With this mechanism, Doris provides At-Most-Once semantics on the Doris side; combined with At-Least-Once semantics in the upstream system, this achieves Exactly-Once semantics for data import.
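For SQL-based imports the label can be attached directly to the statement; a minimal sketch with placeholder names:

```sql
-- Attach a user-defined label so the job is idempotent: re-running the
-- same statement with the same label is rejected with "Label already
-- used" instead of importing the data a second time.
INSERT INTO example_db.orders WITH LABEL label_orders_20240101
VALUES (1, 9.99), (2, 19.99);

-- Look up the outcome of the labeled job later:
SHOW LOAD FROM example_db WHERE LABEL = "label_orders_20240101";
```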

Import Mode

An import can be synchronous or asynchronous. For synchronous import methods, the returned result indicates whether the import succeeded. For asynchronous import methods, a successful return only means the job was submitted successfully, not that the data was imported successfully; users need to check the running status of the import job with the corresponding command.
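For asynchronous methods such as Broker Load, a common pattern is to poll the job state until it reaches a terminal state; a sketch with a placeholder database and label:

```sql
-- An asynchronous LOAD statement returns as soon as the job is
-- submitted, so the real outcome is observed by polling its state
-- until it is FINISHED (success) or CANCELLED (failure).
SHOW LOAD FROM example_db WHERE LABEL = "label_orders_20240101";

-- A pending or running job can also be stopped explicitly:
CANCEL LOAD FROM example_db WHERE LABEL = "label_orders_20240101";
```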

Data Transformation

The data in the target table does not always match the source file exactly, so transformation during import is sometimes required. Doris supports performing certain transformations on the source data during the import process: mapping, conversion, pre-filtering, and post-filtering.
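A sketch of all four mechanisms inside a single Broker Load statement; the column names, expressions, and predicates are hypothetical:

```sql
-- Hypothetical Broker Load showing mapping, conversion, pre-filtering,
-- and post-filtering in one job.
LOAD LABEL example_db.label_transform_demo
(
    DATA INFILE("s3://example-bucket/raw/*.csv")
    INTO TABLE orders
    COLUMNS TERMINATED BY ","
    -- Mapping: name the source-file columns, including a temporary one.
    (id, tmp_amount, currency)
    -- Conversion: derive the target column from a source column.
    SET (amount = tmp_amount / 100)
    -- Pre-filtering: drop raw rows before conversion is applied.
    PRECEDING FILTER tmp_amount > 0
    -- Post-filtering: drop rows after mapping and conversion.
    WHERE currency = "USD"
)
WITH S3
(
    "s3.endpoint"   = "s3.us-east-1.amazonaws.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "<your_access_key>",
    "s3.secret_key" = "<your_secret_key>"
);
```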

Error Data Handling

The data types of the source columns and the target columns may not match exactly, in which case the source values are converted during the import. Conversion can fail, for example due to a type mismatch or a field exceeding the target length. Strict mode controls whether rows that fail conversion are filtered out during the import.
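A minimal sketch of strict mode for SQL writes, assuming the enable_insert_strict session variable controls INSERT behavior (for Stream Load and Broker Load, an equivalent strict_mode job property or header is typically used instead):

```sql
-- Assumption: enable_insert_strict toggles strict mode for INSERT.
SET enable_insert_strict = true;

-- With strict mode on, a value that cannot be converted to the target
-- type (here, a string into a DECIMAL column) fails the statement;
-- with it off, failed conversions may become NULL or the rows may be
-- filtered out, subject to the allowed error ratio.
INSERT INTO example_db.orders VALUES (1, "not-a-number");
```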

Minimum Write Replica Number

By default, an import succeeds only if the data is successfully written to at least a majority of the table's replicas. This is not always flexible enough and can be inconvenient in certain scenarios, so Doris allows users to set a minimum write replica number (Min Load Replica Num). An import task is considered successful when the number of replicas written successfully is greater than or equal to this minimum.
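Assuming the property is exposed as min_load_replica_num, as in recent Doris releases, it can be set per table; a sketch with a placeholder table:

```sql
-- Hypothetical 3-replica table: by default a write must succeed on a
-- majority (2 of 3) of replicas. Lowering the minimum to 1 lets an
-- import succeed even when only a single replica is writable.
ALTER TABLE example_db.orders SET ("min_load_replica_num" = "1");
```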