Before You Start the POC
This document highlights common issues that new users may encounter, with the goal of accelerating the POC process.
Table Design
Creating a table in Doris involves four decisions that affect load and query performance.
Data Model
| If your data is... | Use | Why |
|---|---|---|
| Append-only (logs, events, facts) | Duplicate Key (default) | Keeps all rows. Best query performance. |
| Updated by primary key (CDC, upsert) | Unique Key | New rows replace old rows with the same key. |
| Pre-aggregated metrics (PV, UV, sums) | Aggregate Key | Rows are merged with SUM/MAX/MIN at write time. |
Duplicate Key works for most scenarios. See Data Model Overview.
Sort Key
Put the column you filter on most frequently first, with fixed-size types (INT, BIGINT, DATE) before VARCHAR. Doris builds a prefix index on the first 36 bytes of key columns but stops at the first VARCHAR. Add inverted indexes for other columns that need fast filtering.
Partitioning
If you have a time column, use AUTO PARTITION BY RANGE(date_trunc(time_col, 'day')) to enable partition pruning. Doris skips irrelevant partitions automatically.
Bucketing
Default is Random bucketing (recommended for Duplicate Key tables). Use DISTRIBUTED BY HASH(col) if you frequently filter or join on a specific column. See Data Bucketing.
How to choose bucket count:
- Multiple of BE count to ensure even data distribution. When BEs are added later, queries typically scan multiple partitions, so performance holds up.
- As low as possible to avoid small files.
- Compressed data per bucket ≤ 20 GB (≤ 10 GB for Unique Key). Check with
SHOW TABLETS FROM your_table. - No more than 128 per partition. Consider partitioning first if you need more.
Example Templates
Log / Event Analytics
CREATE TABLE app_logs
(
log_time DATETIME NOT NULL,
log_level VARCHAR(10),
service_name VARCHAR(50),
trace_id VARCHAR(64),
message STRING,
INDEX idx_message (message) USING INVERTED PROPERTIES("parser" = "unicode")
)
AUTO PARTITION BY RANGE(date_trunc(`log_time`, 'day'))
()
DISTRIBUTED BY RANDOM BUCKETS 10;
Real-Time Dashboard with Upsert (CDC)
CREATE TABLE user_profiles
(
user_id BIGINT NOT NULL,
username VARCHAR(50),
email VARCHAR(100),
status TINYINT,
updated_at DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
Metrics Aggregation
CREATE TABLE site_metrics
(
dt DATE NOT NULL,
site_id INT NOT NULL,
pv BIGINT SUM DEFAULT '0',
uv BIGINT MAX DEFAULT '0'
)
AGGREGATE KEY(dt, site_id)
AUTO PARTITION BY RANGE(date_trunc(`dt`, 'day'))
()
DISTRIBUTED BY HASH(site_id) BUCKETS 10;
Performance Pitfalls
Load
- Don't use
INSERT INTO VALUESfor bulk data. Use Stream Load or Broker Load instead. See Loading Overview. - Batch writes on the client side. High-frequency small imports cause version accumulation. If not feasible, use Group Commit.
- Break large imports into smaller batches. A failed long-running import must restart from scratch. Use INSERT INTO SELECT with S3 TVF for incremental import.
- Enable
load_to_single_tabletfor Duplicate Key tables with Random bucketing to reduce write amplification.
See Load Best Practices.
Query
- Data skew. Check tablet sizes with
SHOW TABLETS. Switch to Random bucketing or a higher-cardinality bucket column if sizes vary significantly. - Wrong sort key order. See Sort Key.
See Query Profile to diagnose slow queries.