Apache Doris Table Design Guide
In Apache Doris, table design directly determines write performance, query efficiency, and storage cost. This chapter is organized around typical design scenarios to help you design a table from scratch or optimize the structure of an existing table.
Design Workflow
Designing a Doris table typically involves the following steps:
- Choose a table model: decide how data is stored, deduplicated, and aggregated.
- Define columns and data types: choose appropriate types for your business fields.
- Plan partitioning and bucketing: distribute data evenly and enable pruning.
- Add indexes: accelerate point queries, range scans, or full-text search.
- Optimize storage: reduce cost through compression, hybrid row-column storage, or hot/cold tiering.
- Evolve the schema: adjust the schema or use auto-increment columns as the business changes.
Creating a Table
Use the CREATE TABLE statement to create a table in Doris. To derive a new table from an existing one, you can use:
- CREATE TABLE ... LIKE: reuse the structure of an existing table.
- CREATE TABLE ... AS SELECT (CTAS): create a table based on the result of a query.
Table Name Rules
Table names in Doris are case-sensitive by default and have a maximum length of 64 bytes. You can adjust this with the following configurations:
| Configuration | Default | Description |
|---|---|---|
| lower_case_table_names | Case-sensitive | Configured only when the cluster is first initialized; cannot be changed afterward |
| table_name_length_limit | 64 bytes | Setting it too large is not recommended |
Key Table Properties
When creating a table, you can specify common properties in PROPERTIES. See CREATE TABLE for details:
| Property | Purpose |
|---|---|
buckets | Determines the granularity of data distribution within the table |
storage_medium | Controls the storage medium, such as HDD, SSD, or remote shared storage |
replication_num | Controls the number of data replicas for redundancy and reliability |
storage_policy | Configures the hot/cold data tiering migration policy |
Table properties apply at the partition level. Modifying table properties only takes effect for partitions created in the future; existing partitions are not affected. See ALTER TABLE PROPERTY and ALTER TABLE DISTRIBUTION for details. Dynamic partitioning can set these properties separately.
Choosing a Table Model
Table Model Overview
Compare the Duplicate, Aggregate, and Unique models, and choose the right data storage and deduplication approach for your scenario
Duplicate Model
Preserve raw detail data, suitable for logs, events, and other scenarios that do not require deduplication or aggregation
Unique Model
Deduplicate by primary key and support high-concurrency real-time updates, suitable for CDC, orders, and other scenarios where data is updated by primary key
Aggregate Model
Pre-aggregate data by dimensions to improve performance for reporting and multi-dimensional analysis queries
Model Selection Tips
Recommendations for table models and Key column design based on typical business scenarios
Defining Columns and Data Types
Planning Partitioning and Bucketing
Data Distribution Concepts
Understand how partitioning and bucketing map data to tablets to fully leverage the storage and compute capabilities of multiple nodes
Basic Concepts
Explanations of basic terms related to data distribution, including partition, bucket, tablet, and replica
Manual Partitioning
Explicitly define partition ranges at table creation, suitable for scenarios with fixed partitions and a need for fine-grained control
Dynamic Partitioning
Automatically create and reclaim partitions along the time dimension to simplify the operations of time-series tables
Auto Partitioning
Automatically create partitions based on incoming data, eliminating the need to define partition ranges in advance
Data Bucketing
Choose appropriate bucketing columns and the number of buckets so that data is evenly distributed within partitions
Common Issues
Common mistakes and troubleshooting methods in partitioning and bucketing design
Accelerating Queries: Indexes
Index Overview
Compare the various index types supported by Doris and their applicable query scenarios
Prefix Index
A built-in index based on the sort key that accelerates equality and range queries
BloomFilter Index
Accelerate equality queries on high-cardinality columns and reduce unnecessary data scans
NGram BloomFilter Index
Accelerate LIKE fuzzy match queries without introducing a full inverted index
Inverted Index (Full-Text Search)
Support tokenized full-text search and keyword and phrase matching, suitable for log and text analysis
Vector Index
Accelerate similarity search on high-dimensional vectors based on algorithms such as HNSW and IVF
Optimizing Storage
Storage Format
Learn about the design and advantages of the Doris columnar storage format V3
Data Compression
Choose an appropriate compression algorithm to balance storage cost and query performance
Hybrid Row-Column Storage
Add row storage on top of column storage to accelerate point queries on wide tables
Hot/Cold Data Tiering
Move cold data down to SSD/HDD or remote storage such as S3 and HDFS to reduce storage cost
Evolving the Schema
Schema Change
Add, drop, or modify columns and types with ALTER TABLE, and learn the difference between lightweight and heavyweight changes
Auto-Increment Column
Use an Auto Increment Column to generate unique numeric IDs automatically and simplify data writes
Special Tables and Best Practices
Temporary Tables
Break complex SQL into multiple steps and stage intermediate results for easier debugging and reuse
Table Design Best Practices
A summary of key design recommendations for table models, Key columns, partitioning and bucketing, indexes, and more
Design Considerations
| Design Item | Description | Impact |
|---|---|---|
| Data model | Cannot be modified after the table is created; choose the right data model in advance | Determines subsequent write and query patterns |
| Number of buckets | The number of buckets in already-created partitions cannot be modified, but can be adjusted via partition replacement; the number of buckets for future partitions in dynamic partitioning can be modified | Affects data balance and query parallelism |
| Column changes | Adding or dropping VALUE columns is a lightweight operation (completes in seconds); adding or dropping KEY columns or modifying data types is a heavyweight operation | Avoid KEY column changes on large datasets when possible |
| Storage policy | Use hot/cold tiering to migrate cold data to HDD, S3, or HDFS | Significantly reduces storage cost |