# Handling Messy Data
During data ingestion, discrepancies may arise between source and target column data types. Although the load process attempts to convert inconsistent types, issues such as type mismatches, field lengths exceeding limits, or precision mismatches may occur, resulting in conversion failures.
To address these exceptional cases, Doris provides two essential control parameters:

- Strict Mode (`strict_mode`): regulates whether rows with conversion failures are filtered out
- Maximum Filter Ratio (`max_filter_ratio`): defines the maximum allowable ratio of filtered rows to total rows during a load
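
The two parameters are typically used together: strict mode determines which rows are filtered, while the maximum filter ratio determines how many filtered rows a load job can tolerate. As a hedged sketch, combining both in a single Broker Load job (placeholder label, bucket, and table names, mirroring the examples later on this page):

```sql
LOAD LABEL example_db.label_0
(
    DATA INFILE("s3://bucket/data.txt")
    INTO TABLE test_table
)
WITH S3 (...)
PROPERTIES
(
    "strict_mode" = "true",       -- filter rows whose type conversion fails
    "max_filter_ratio" = "0.1"    -- fail the job if more than 10% of rows are filtered
);
```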
## Strict Mode
Strict mode serves two primary purposes:
- Filtering out data rows where column type conversion fails during load
- Restricting updates to existing columns only in partial column update scenarios
### Filtering Strategy for Column Type Conversion Failures
The system employs different strategies based on the strict mode setting:
- When strict mode is OFF: fields that fail conversion are set to NULL, and rows containing these NULL values are loaded along with the correct data rows.
- When strict mode is ON: rows with conversion failures are filtered out, and only correct data rows are loaded. Here, "conversion failure" specifically refers to cases where the original data is non-NULL but becomes NULL after column type conversion; NULL values produced by function calculations are not treated as conversion failures.
- Handling NULL values: both correct and abnormal data rows may contain NULL values. If the target column is defined as NOT NULL, all rows containing NULL values will be filtered out.
1. Example with TinyInt column type:
| Original Data Type | Original Data Example | Value After TinyInt Conversion | Strict Mode | Result |
|---|---|---|---|---|
| NULL | \N | NULL | ON/OFF | NULL |
| Non-NULL | "abc" or 2000 | NULL | ON | Invalid (Filtered) |
| Non-NULL | "abc" | NULL | OFF | NULL |
| Non-NULL | 1 | 1 | ON/OFF | Loaded Successfully |
- The column in the table allows NULL values.
- Both `abc` and `2000` become NULL after conversion to TinyInt due to type or precision issues. When strict mode is ON, such data will be filtered out; when OFF, NULL will be loaded.
2. Example with Decimal(1,0) type:
| Original Data Type | Original Data Example | Value After Decimal Conversion | Strict Mode | Result |
|---|---|---|---|---|
| NULL | \N | NULL | ON/OFF | NULL |
| Non-NULL | aaa | NULL | ON | Invalid (Filtered) |
| Non-NULL | aaa | NULL | OFF | NULL |
| Non-NULL | 1 or 10 | 1 or 10 | ON/OFF | Loaded Successfully |
- The column in the table allows NULL values.
- `aaa` becomes NULL after conversion to Decimal due to type issues. When strict mode is ON, such data will be filtered out; when OFF, NULL will be loaded.
- Although `10` exceeds the range of Decimal(1,0), its type meets the decimal requirements, so strict mode does not affect it; it is filtered in other stages of the load process, not by strict mode.
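
The behavior in the two tables above can be reproduced with a quick experiment. The sketch below is illustrative rather than authoritative: it assumes a hypothetical table `tinyint_demo` with a nullable TINYINT column, a single-replica test setup (`replication_num = 1`), and uses `enable_insert_strict` to toggle strict mode for INSERT (see Enable Strict Mode below).

```sql
-- Hypothetical demo table; v is nullable, so conversion failures can load as NULL.
CREATE TABLE tinyint_demo
(
    id INT,
    v  TINYINT
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Strict mode OFF: 2000 overflows TINYINT and converts to NULL,
-- so the row is loaded as (1, NULL).
SET enable_insert_strict = false;
INSERT INTO tinyint_demo VALUES (1, '2000');

-- Strict mode ON: the same conversion failure causes the row to be
-- filtered, and the insert reports an error.
SET enable_insert_strict = true;
INSERT INTO tinyint_demo VALUES (2, '2000');
```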
### Restricting Partial Column Updates to Existing Columns Only
In strict mode, each row in a partial column update must have a Key that already exists in the table. In non-strict mode, a partial column update can both update existing rows (where the Key exists) and insert new rows (where the Key does not exist).
For example, given a table structure as follows:
```sql
CREATE TABLE user_profile
(
    id INT,
    name VARCHAR(10),
    age INT,
    city VARCHAR(10),
    balance DECIMAL(9, 0),
    last_access_time DATETIME
) ENGINE=OLAP
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);
```
The table contains one record as follows:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id   | name  | age  | city     | balance | last_access_time    |
+------+-------+------+----------+---------+---------------------+
|    1 | kevin |   18 | shenzhen |     400 | 2023-07-01 12:00:00 |
+------+-------+------+----------+---------+---------------------+
```
When using Insert Into in strict mode to perform partial column updates, the insertion will fail because the keys of the second and third rows, `(3)` and `(18)`, do not exist in the original table:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = true;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
    (1, 500, '2023-07-03 12:00:01'),
    (3, 23, '2023-07-03 12:00:02'),
    (18, 9999999, '2023-07-03 12:00:03');
ERROR 1105 (HY000): errCode = 2, detailMessage = Insert has filtered data in strict mode
```
When using Insert Into with non-strict mode to perform partial column updates:
```sql
SET enable_unique_key_partial_update=true;
SET enable_insert_strict = false;
INSERT INTO user_profile (id, balance, last_access_time) VALUES
    (1, 500, '2023-07-03 12:00:01'),
    (3, 23, '2023-07-03 12:00:02'),
    (18, 9999999, '2023-07-03 12:00:03');
```
The existing record will be updated, and two new records will be inserted. For columns not specified in the insert statement, if a default value is defined, it will be used; if the column allows NULL values, NULL will be used; otherwise, the insertion will fail.
The query result is as follows:
```sql
mysql> select * from user_profile;
+------+-------+------+----------+---------+---------------------+
| id   | name  | age  | city     | balance | last_access_time    |
+------+-------+------+----------+---------+---------------------+
|    1 | kevin |   18 | shenzhen |     500 | 2023-07-03 12:00:01 |
|    3 | NULL  | NULL | NULL     |      23 | 2023-07-03 12:00:02 |
|   18 | NULL  | NULL | NULL     | 9999999 | 2023-07-03 12:00:03 |
+------+-------+------+----------+---------+---------------------+
```
### Enable Strict Mode
Strict mode (`strict_mode`) defaults to false. Here's how to set it for different load methods:
**Stream Load**

```shell
curl --location-trusted -u user:passwd \
    -H "strict_mode: true" \
    -T data.txt \
    http://host:port/api/example_db/test_table/_stream_load
```
**Broker Load**

```sql
LOAD LABEL example_db.label_1
(
    DATA INFILE("s3://bucket/data.txt")
    INTO TABLE test_table
)
WITH S3 (...)
PROPERTIES
(
    "strict_mode" = "true"
);
```
**Routine Load**

```sql
CREATE ROUTINE LOAD example_db.job1 ON test_table
PROPERTIES
(
    "strict_mode" = "true"
)
FROM KAFKA (...);
```
**MySQL Load**

```sql
LOAD DATA LOCAL INFILE 'data.txt'
INTO TABLE test_table
PROPERTIES
(
    "strict_mode" = "true"
);
```
**Insert Into**

```sql
SET enable_insert_strict = true;
INSERT INTO test_table ...;
```
## Maximum Filter Ratio
Maximum Filter Ratio (`max_filter_ratio`) is a crucial load control parameter that defines the maximum allowable ratio of filtered data to total data during load. If the actual filter ratio is below the set maximum, the load task will continue and filtered data will be ignored; if it exceeds this ratio, the load task will fail.
### Filter Ratio Calculation Method
- Filtered Rows: rows filtered out due to data quality issues, including type errors, precision errors, string lengths exceeding limits, file column count mismatches, and rows filtered out because no matching partition exists.
- Unselected Rows: rows filtered out by pre-filtering or post-filtering conditions.
- Loaded Rows: rows loaded successfully.

The filter ratio is calculated as:

```text
#Filtered Rows / (#Filtered Rows + #Loaded Rows)
```

Note that `Unselected Rows` are not included in the filter ratio calculation.
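
For example, suppose a load job reads 1,000 rows: 50 are filtered for quality issues, 100 are excluded by a pre-filter condition (Unselected Rows), and 850 are loaded. The filter ratio is 50 / (50 + 850) ≈ 0.056, so the job fails with the default `max_filter_ratio` of 0 but succeeds with a setting of 0.1.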
### Maximum Filter Ratio Configuration
The maximum filter ratio (`max_filter_ratio`) defaults to 0, meaning no filtered data is allowed. Here's how to set it for different load methods:
**Stream Load**

```shell
curl --location-trusted -u user:passwd \
    -H "max_filter_ratio: 0.1" \
    -T data.txt \
    http://host:port/api/example_db/test_table/_stream_load
```
**Broker Load**

```sql
LOAD LABEL example_db.label_1
(
    DATA INFILE("s3://bucket/data.txt")
    INTO TABLE test_table
)
WITH S3 (...)
PROPERTIES
(
    "max_filter_ratio" = "0.1"
);
```
**Routine Load**

```sql
CREATE ROUTINE LOAD example_db.job1 ON test_table
PROPERTIES
(
    "max_filter_ratio" = "0.1"
)
FROM KAFKA (...);
```
**MySQL Load**

```sql
LOAD DATA LOCAL INFILE 'data.txt'
INTO TABLE test_table
PROPERTIES
(
    "max_filter_ratio" = "0.1"
);
```
**Insert Into**

```sql
SET insert_max_filter_ratio = 0.1;
INSERT INTO test_table FROM S3/HDFS/LOCAL();
```
:::tip
For Insert Into statements, `insert_max_filter_ratio` only takes effect when `enable_insert_strict = false`, and only applies to `INSERT INTO FROM S3/HDFS/LOCAL()` syntax. The default value is 1.0, which means that all abnormal data are allowed to be filtered.
:::
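
For completeness, here is a minimal sketch of that TVF form, with a hypothetical S3 file and placeholder credentials (the `uri`, `format`, and `s3.*` parameter names follow the S3 table-valued function convention; adjust them to your environment and version):

```sql
SET enable_insert_strict = false;     -- insert_max_filter_ratio only applies when strict mode is off
SET insert_max_filter_ratio = 0.1;    -- tolerate up to 10% filtered rows

INSERT INTO test_table
SELECT * FROM S3(
    "uri" = "s3://bucket/data.csv",   -- hypothetical source file
    "format" = "csv",
    "s3.access_key" = "<your_ak>",
    "s3.secret_key" = "<your_sk>",
    "s3.region" = "<your_region>"
);
```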