Skip to main content

VARIANT

VARIANT

Overview

The VARIANT type stores semi-structured JSON data. It can contain different primitive types (integers, strings, booleans, etc.), one-dimensional arrays, and nested objects. On write, Doris infers the structure and type of sub-paths based on JSON paths and materializes frequent paths as independent subcolumns, leveraging columnar storage and vectorized execution for both flexibility and performance.

Using VARIANT

Create table syntax

Declare a VARIANT column when creating a table:

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT
)
PROPERTIES("replication_num" = "1");

Constrain certain paths with a Schema Template (see “Extended types”):

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT <
'id' : INT, -- restrict path id to INT
'message*' : STRING, -- restrict message* prefix to STRING
'tags*' : ARRAY<TEXT> -- restrict tags* prefix to ARRAY<TEXT>
>
)
PROPERTIES("replication_num" = "1");

Query syntax

-- Access nested fields (returns VARIANT; explicit or implicit CAST is required for aggregation/comparison)
SELECT v['properties']['title'] FROM ${table_name};

-- CAST to a concrete type before aggregation
SELECT CAST(v['properties']['title'] AS STRING) AS title
FROM ${table_name}
GROUP BY title;

-- Query arrays
SELECT *
FROM ${table_name}
WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');

Primitive types

VARIANT infers subcolumn types automatically. Supported types include:

Supported types
TinyInt
NULL (equivalent to JSON null)
BigInt (64 bit)
Double
String (Text)
Jsonb
Variant (nested object)
Array<T> (one-dimensional only)

Simple INSERT example:

INSERT INTO vartab VALUES
(1, 'null'),
(2, NULL),
(3, 'true'),
(4, '-17'),
(5, '123.12'),
(6, '1.912'),
(7, '"A quote"'),
(8, '[-1, 12, false]'),
(9, '{ "x": "abc", "y": false, "z": 10 }'),
(10, '"2021-01-01"');

Tip: Non-standard JSON types such as date/time will be stored as strings unless a Schema Template is provided. For better computation efficiency, consider extracting them to static columns or declaring their types via a Schema Template.

Extended types (Schema Template)

Besides primitive types, VARIANT supports the following extended types via Schema Template:

  • Number (extended)
    • Decimal: Decimal32 / Decimal64 / Decimal128 / Decimal256
    • LargeInt
  • Datetime
  • Timestamptz
  • Date
  • IPV4 / IPV6
  • Boolean
  • ARRAY<T> (T can be any of the above, one-dimensional only)

Note: Predefined Schema can only be specified at table creation. ALTER is currently not supported (future versions may support adding new subcolumn definitions, but changing an existing subcolumn type is not supported).

Example:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'large_int_val': LARGEINT,
'string_val': STRING,
'decimal_val': DECIMAL(38, 9),
'datetime_val': DATETIME,
'tz_val': TIMESTAMPTZ,
'ip_val': IPV4
> NULL
)
PROPERTIES ("replication_num" = "1");

INSERT INTO test_var_schema VALUES (1, '{
"large_int_val" : "123222222222222222222222",
"string_val" : "Hello World",
"decimal_val" : 1.11111111,
"datetime_val" : "2025-05-16 11:11:11",
"tz_val" : "2025-05-16 11:11:11+08:00",
"ip_val" : "127.0.0.1"
}');

SELECT variant_type(v1) FROM test_var_schema;

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| variant_type(v1) |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| {"datetime_val":"datetimev2","decimal_val":"decimal128i","ip_val":"ipv4","large_int_val":"largeint","string_val":"string","tz_val":"timestamptz"} |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

{"date": 2020-01-01} and {"ip": 127.0.0.1} are invalid JSON texts; the correct format is {"date": "2020-01-01"} and {"ip": "127.0.0.1"}.

Once a Schema Template is specified, if a JSON value conflicts with the declared type and cannot be converted, it will be stored as NULL. For example:

INSERT INTO test_var_schema VALUES (1, '{
"decimal_val" : "1.11111111",
"ip_val" : "127.xxxxxx.xxxx",
"large_int_val" : "aaabbccc"
}');

-- Only decimal_val remains
SELECT * FROM test_var_schema;

+------+-----------------------------+
| id | v1 |
+------+-----------------------------+
| 1 | {"decimal_val":1.111111110} |
+------+-----------------------------+

Schema only guides the persisted storage type. During query execution, the effective type depends on actual data at runtime:

-- At runtime v['a'] may still be STRING
SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);

Wildcard matching and order:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'enumString*' : STRING,
'enum*' : ARRAY<TEXT>,
'ip*' : IPV6
> NULL
)
PROPERTIES ("replication_num" = "1");

-- If enumString1 matches both patterns, the first matching pattern in definition order (STRING) is used

If a column name contains * and you want to match it by its literal name (not as a prefix wildcard), use:

v1 VARIANT<
MATCH_NAME 'enumString*' : STRING
> NULL

Matched subpaths are materialized as columns by default. If too many paths match and generate excessive columns, consider enabling variant_enable_typed_paths_to_sparse (see “Configuration”).

Type conflicts and promotion rules

When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss:

{"a" : 12345678}
{"a" : "HelloWorld"}
-- a will be promoted to JSONB

Promotion rules:

Source typeCurrent typeFinal type
TinyIntBigIntBigInt
TinyIntDoubleDouble
TinyIntStringJSONB
TinyIntArrayJSONB
BigIntDoubleJSONB
BigIntStringJSONB
BigIntArrayJSONB
DoubleStringJSONB
DoubleArrayJSONB
Array<Double>Array<String>Array<Jsonb>

If you need strict types (for stable indexing and storage), declare them via Schema Template.

Variant indexes

Choosing indexes

VARIANT supports BloomFilter and Inverted Index on subpaths.

  • High-cardinality equality/IN filters: prefer BloomFilter (sparser index, better write performance).
  • Tokenization/phrase/range search: use Inverted Index and set proper parser/analyzer properties.
...  
PROPERTIES("replication_num" = "1", "bloom_filter_columns" = "v");

-- Use BloomFilter for equality/IN filters
SELECT * FROM tbl WHERE v['id'] = 12345678;
SELECT * FROM tbl WHERE v['id'] IN (1, 2, 3);

Once an inverted index is created on a VARIANT column, all subpaths inherit the same index properties (e.g., parser):

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT,
INDEX idx_v(v) USING INVERTED PROPERTIES("parser" = "english")
);

-- All subpaths inherit the english parser
SELECT * FROM tbl WHERE v['id_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['id_2'] MATCH 'Apache';

Index by subpath

In 3.1.x/4.0 and later, you can specify index properties for certain VARIANT subpaths, and even configure both tokenized and non-tokenized inverted indexes for the same path. Path-specific indexes require the path type to be declared via Schema Template.

-- Common properties: field_pattern (target path), analyzer, parser, support_phrase, etc.
CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'content' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "content"),
INDEX idx_v(v) USING INVERTED PROPERTIES("field_pattern" = "content")
);

-- v.content has both tokenized and non-tokenized inverted indexes
SELECT * FROM tbl WHERE v['content'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['content'] = 'Doris';

Wildcard path indexing:

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'pattern_*' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "pattern_*"),
INDEX idx_v(v) USING INVERTED -- global non-tokenized inverted index
);

SELECT * FROM tbl WHERE v['pattern_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['pattern_1'] = 'Doris';

Note: 2.1.7+ supports only InvertedIndex V2 properties (fewer files, lower write IOPS; suitable for disaggregated storage/compute). 2.1.8+ removes offline Build Index.

When indexes don’t work

  1. Type changes cause index loss: if a subpath changes to an incompatible type (e.g., INT → JSONB), the index is lost. Fix by pinning types and indexes via Schema Template.
  2. Query type mismatch:
    -- v['id'] is actually STRING; using INT equality causes index not to be used
    SELECT * FROM tbl WHERE v['id'] = 123456;
  3. Misconfigured index: indexes apply to subpaths, not the entire VARIANT column.
    -- VARIANT itself cannot be indexed as a whole
    SELECT * FROM tbl WHERE v MATCH 'Doris';

    -- If whole-JSON search is needed, store a duplicate STRING column and index it
    CREATE TABLE IF NOT EXISTS tbl (
    k BIGINT,
    v VARIANT,
    v_str STRING,
    INDEX idx_v_str(v_str) USING INVERTED PROPERTIES("parser" = "english")
    );
    SELECT * FROM tbl WHERE v_str MATCH 'Doris';

INSERT and load

INSERT INTO VALUES

CREATE TABLE IF NOT EXISTS variant_tbl (
k BIGINT,
v VARIANT
) PROPERTIES("replication_num" = "1");

INSERT INTO variant_tbl VALUES (1, '{"a" : 123}');

select * from variant_tbl;
+------+-----------+
| k | v |
+------+-----------+
| 1 | {"a":123} |
+------+-----------+

-- v['a'] is a VARIANT
select v['a'] from variant_tbl;
+--------+
| v['a'] |
+--------+
| 123 |
+--------+

-- Accessing a non-existent key returns NULL
select v['a']['no_such_key'] from variant_tbl;;
+-----------------------+
| v['a']['no_such_key'] |
+-----------------------+
| NULL |
+-----------------------+

Load (Stream Load)

# Line-delimited JSON (one JSON record per line)
curl --location-trusted -u root: -T gh_2022-11-07-3.json \
-H "read_json_by_line:true" -H "format:json" \
http://127.0.0.1:8030/api/test_variant/github_events/_stream_load

See also: https://doris.apache.org/docs/dev/data-operate/import/complex-types/variant

After loading, verify with SELECT count(*) or sample with SELECT * ... LIMIT 1. For high-throughput ingestion, prefer RANDOM bucketing and enable Group Commit.

Supported operations and CAST rules

  • VARIANT cannot be compared/operated directly with other types; comparisons between two VARIANTs are not supported either.
  • For comparison, filtering, aggregation, and ordering, CAST subpaths to concrete types (explicitly or implicitly).
-- Explicit CAST
SELECT CAST(v['arr'] AS ARRAY<TEXT>) FROM tbl;
SELECT * FROM tbl WHERE CAST(v['decimal'] AS DECIMAL(27, 9)) = 1.111111111;
SELECT * FROM tbl WHERE CAST(v['date'] AS DATE) = '2021-01-02';

-- Implicit CAST
SELECT * FROM tbl WHERE v['bool'];
SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
  • VARIANT itself cannot be used directly in ORDER BY, GROUP BY, as a JOIN KEY, or as an aggregate argument; CAST subpaths instead.
  • Strings can be implicitly converted to VARIANT.
VARIANTCastableCoercible
ARRAY
BOOLEAN
DATE/DATETIME
FLOAT
IPV4/IPV6
DECIMAL
MAP
TIMESTAMP
VARCHAR
JSON

Wide columns

When ingested data contains many distinct JSON keys, VARIANT materialized subcolumns can grow rapidly; at scale this may cause metadata bloat, higher write/merge cost, and query slowdowns. To address “wide columns” (too many subcolumns), VARIANT provides two mechanisms: Sparse columns and DOC encoding.

Note: these two mechanisms are mutually exclusive—enabling DOC encoding disables sparse columns, and vice versa.

Sparse columns

How it works

  • The system ranks paths by non-null ratio / sparsity: high-frequency (less-sparse) paths are materialized as independent subcolumns; remaining low-frequency (sparse) paths are merged and stored in sparse columns. The maximum number of materialized subcolumns is controlled by variant_max_subcolumns_count.
  • If a path is declared in a Schema Template, by default it will not be moved into sparse columns; set variant_enable_typed_paths_to_sparse to allow typed paths to be moved into sparse columns.
  • Sparse columns support sharding: distribute sparse subpaths across multiple sparse columns to reduce per-column read overhead and improve read efficiency. Use variant_sparse_hash_shard_count to specify how many sparse columns are physically stored.

When to use

  • Many JSON keys overall, but queries mostly target a small subset of high-frequency fields (hot fields).
  • Highly skewed key distribution (a few keys appear frequently; many keys appear occasionally): you want good hot-path performance while keeping long-tail keys queryable (typically slower).

Limitations and configuration notes

  • If most keys have similar non-null ratios (little sparsity contrast), it’s hard to identify truly sparse paths and the benefit of sparse columns is reduced.
  • variant_max_subcolumns_count (materialized subcolumns) is recommended to stay ≤ 10000.
  • If you have high query requirements on typed paths (declared via Schema Template), prefer variant_enable_typed_paths_to_sparse = false.
  • variant_sparse_hash_shard_count can be roughly estimated as “number of sparse paths / 128”. Example: total JSON keys ≈ 10,000, variant_max_subcolumns_count = 2000, then sparse paths ≈ 8000, so variant_sparse_hash_shard_count can start around 8000/128.

DOC encoding (DOC mode)

How it works

  • Paths can still be materialized as independent subcolumns for path-based queries, and the original JSON is additionally stored as a stored field to return the full JSON document efficiently.
  • DOC encoding supports sharding: the original JSON is split into multiple columns for storage and reassembled when querying the full JSON. Use variant_doc_hash_shard_count to specify the number of DOC shards.
  • For small-batch writes, subcolumns can be skipped and materialized later during merges. This is controlled by variant_doc_materialization_min_rows. For example, if variant_doc_materialization_min_rows = 10000, writes below 10,000 rows will only store the original JSON and won’t materialize subcolumns for that batch.

When to use

  • Many distinct keys overall, but each row contains only a small subset of keys (e.g. per-row keys < 5% of total keys): typical sparse wide-column workloads.
  • Workloads frequently need the full JSON document (e.g. SELECT * / full-row return) and you want to avoid assembling JSON from a large number of subcolumns.
  • You want to reduce subcolumn materialization overhead for small batches and defer it to later merges.
  • You can accept additional storage cost (because the original JSON is stored as a stored field).

Limitations and configuration notes

  • DOC mode requires variant_enable_doc_mode = true.
  • In DOC mode, typed paths declared via Schema Template are limited to numeric, string, and array types.
  • variant_doc_hash_shard_count can be roughly estimated as “total JSON keys / 128”.

See the “Configuration” section below for the full property list.

Limitations

  • JSON key length ≤ 255.
  • Cannot be a primary key or sort key.
  • Cannot be nested within other types (e.g., Array<Variant>, Struct<Variant>).
  • Reading the entire VARIANT column scans all subpaths. If a column has many subpaths, consider storing the original JSON string in an extra STRING/JSONB column for whole-object searches like LIKE:
CREATE TABLE example_table (
id INT,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_variant LIKE '%doris%';

-- Better: keep the original JSON string for whole-object matching
CREATE TABLE example_table (
id INT,
data_string STRING,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_string LIKE '%doris%';

Configuration

Starting from 3.1+, VARIANT supports type-level properties on columns:

CREATE TABLE example_table (
id INT,
data_variant VARIANT<
'path_1' : INT,
'path_2' : STRING,
properties(
'variant_max_subcolumns_count' = '2048',
'variant_enable_typed_paths_to_sparse' = 'true',
'variant_sparse_hash_shard_count' = '64'
)
>
);
Property
Description
`variant_max_subcolumns_count`
Max number of materialized paths. Above the threshold, new paths may be stored in a shared data structure. Default 0 (unlimited). Recommended 2048; do not exceed 10000.
`variant_enable_typed_paths_to_sparse`
By default, typed paths are always materialized (and do not count against `variant_max_subcolumns_count`). When set to `true`, typed paths also count toward the threshold and may be moved to the shared structure.
`variant_sparse_hash_shard_count`
Shard count for sparse columns. Distributes sparse subpaths across multiple sparse columns to improve read performance. Default 1; tune based on the number of sparse subpaths.
CREATE TABLE example_table (
id INT,
data_variant VARIANT<
'path_1' : INT,
'path_2' : STRING,
properties(
'variant_enable_doc_mode' = 'true',
'variant_doc_materialization_min_rows' = '10000',
'variant_doc_hash_shard_count' = '64'
)
>
);
Property
Description
`variant_enable_doc_mode`
Enable DOC encoding mode. When `true`, the original JSON is stored as a stored field to quickly return the whole JSON document. DOC mode is mutually exclusive with sparse columns. Default `false`.
`variant_doc_materialization_min_rows`
Minimum row threshold to materialize subcolumns in DOC mode. When rows are below this value, only the original JSON is stored; after compaction merges files to reach the threshold, subcolumns are materialized. Helps reduce overhead for small-batch writes.
`variant_doc_hash_shard_count`
Shard count for DOC encoding. The original JSON is split into the specified number of columns for storage and reassembled when querying the whole JSON. Default 64; tune based on JSON size and concurrency.

Behavior at limits and tuning suggestions:

  1. After exceeding the threshold, new paths are written into the shared structure; Rowset merges may also recycle some paths into the shared structure.
  2. The system prefers to keep paths with higher non-null ratios and higher access frequencies materialized.
  3. Close to 10,000 materialized paths requires strong hardware (≥128G RAM, ≥32C per node recommended).
  4. Ingestion tuning: increase client batch_size appropriately, or use Group Commit (increase group_commit_interval_ms/group_commit_data_bytes as needed).
  5. If partition pruning is not needed, consider RANDOM bucketing and enabling single-tablet loading to reduce compaction write amplification.
  6. BE tuning knobs: max_cumu_compaction_threads (≥8), vertical_compaction_num_columns_per_group=500 (improves vertical compaction but increases memory), segment_cache_memory_percentage=20 (improves metadata cache efficiency).
  7. Watch Compaction Score; if it keeps rising, compaction is lagging—reduce ingestion pressure.
  8. Avoid large SELECT * on VARIANT; prefer specific projections like SELECT v['path'].

Note: If you see Stream Load error [DATA_QUALITY_ERROR]Reached max column size limit 2048 (only on 2.1.x and 3.0.x), it means the merged tablet schema reached its column limit. You may increase variant_max_merged_tablet_schema_size (not recommended beyond 4096; requires strong hardware).

Inspect number of columns and types

Approach 1: use variant_type to inspect per-row schema (more precise, higher cost):

SELECT variant_type(v) FROM variant_tbl;

Approach 2: extended DESC to show materialized subpaths (only those extracted):

SET describe_extend_variant_column = true;
DESC variant_tbl;
DESCRIBE ${table_name} PARTITION ($partition_name);

Use both: Approach 1 is precise; Approach 2 is efficient.

Compared with JSON type

  • Storage: JSON is stored as JSONB (row-oriented). VARIANT is inferred and materialized into columns on write (higher compression, smaller size).
  • Query: JSON requires parsing. VARIANT scans columns directly and is usually much faster.

ClickBench (43 queries):

  • Storage: VARIANT saves ~65% vs JSON.
  • Query: VARIANT is 8x+ faster than JSON, close to predefined static columns.

Storage space

TypeSize
Predefined columns12.618 GB
VARIANT12.718 GB
JSON35.711 GB

~65% space savings

RunPredefinedVARIANTJSON
First (cold)233.79s248.66sMost timed out
Second (hot)86.02s94.82s789.24s
Third (hot)83.03s92.29s743.69s

FAQ

  1. Are null in VARIANT and SQL NULL different?
    • No. They are equivalent.
  2. Why doesn’t my query/index work?
    • Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths.