Async Materialized View Best Practices

Async materialized views accelerate queries by precomputing and storing query results, but each refresh incurs some compute and IO overhead. This article walks through scenario assessment, usage principles, refresh strategy selection, implementation practices, and operational considerations in order, helping DBAs and developers build efficient async materialized views.

For the refresh principles of materialized views, see Refresh Principles.

Quick Decision Checklist

Before creating an async materialized view, evaluate against the following checklist:

Does the query include multi-table JOINs, complex aggregations, or window functions?
Is the base table data update frequency relatively low (avoid multiple updates per minute)?
Can the business tolerate minute-level or longer data latency (real-time data within 1 to 5 minutes is not required)?
Is the base table data large enough (much larger than a few hundred rows)?
Can common query SQL patterns be grouped, with no overlap between groups?
Is the base table a partitioned table, and can a partitioned materialized view be built?
Are there enough resources for periodic refresh?
Can you periodically check the usage status of materialized views and clean up unused ones in time?

If most of the answers above are yes, then async materialized views are a good fit.

1. Scenario Assessment

The table below summarizes typical scenarios where async materialized views are recommended or not recommended, for quick reference.

Scenario Quick Reference

Category	Scenario	Key Characteristics	Recommended
Query complexity	Complex aggregation queries	Multi-table JOIN, SUM/AVG/COUNT, window functions	Yes
Reports	Consistent snapshot reports	Generated at fixed time points (such as daily midnight)	Yes
Compute-intensive	Compute-intensive analysis	Complex math, data transformation, prediction models	Yes
Data warehouse modeling	Star / snowflake schema	Fact table + multiple dimension tables JOIN	Yes
Lakehouse	Lakehouse acceleration	Data lake queries limited by network and object storage throughput	Yes
Data warehouse layering	ETL layered processing	Base table is raw data and needs multi-layer processing	Yes
Data updates	Frequently updated base table	Multiple updates per minute	No
Query complexity	Simple queries	Single-table scan or simple filter	No
Timeliness	Near real-time (within 1 to 5 minutes) data	Business requires data to always be the latest	No
Data scale	Very small source table	Only a few hundred rows	No

Recommended Scenarios

Complex Aggregation Queries

Description: Queries with multi-table joins, complex aggregation functions (such as SUM, AVG, COUNT), or window functions.
Benefit: Avoids recomputing complex logic on each execution.

Reports

Description: Reports that need a consistent snapshot generated at a fixed time point (such as daily midnight).
Benefit: Ensures all users see data at the same point in time.

Compute-Intensive Analysis

Description: Analytical queries with complex math or data transformations, such as customer lifetime value calculations or predictive analytics models.
Benefit: Precomputes results, reducing runtime resource consumption.

Star / Snowflake Schema in Data Warehouses

Description: Scenarios where a fact table joins multiple dimension tables, such as a sales fact table joining product, time, and region dimensions.
Benefit: Pre-materializes join results to accelerate analytical queries.

Lakehouse Acceleration

Description: Queries against a data lake can be slow due to network latency and object storage throughput limits.
Benefit: Leverages Doris local storage acceleration to speed up data lake analytics.

Data Warehouse Layering

Description: The base table contains a large amount of raw data, and queries require complex ETL operations.
Benefit: Build multi-layer async materialized views over the data to implement data warehouse layering.

Scenarios Not Recommended

Frequently Updated Base Tables

Description: Source table data changes very frequently (such as multiple updates per minute).
Issue: Async materialized views are hard to keep in sync, and refresh costs are too high. Consider periodic refresh instead.

Simple Queries

Description: Queries that involve only a single-table scan or simple filtering.
Issue: The benefit of an async materialized view does not offset the refresh cost.

Scenarios Requiring Real-Time Data (Within 1 to 5 Minutes)

Description: The business requires data to always be the latest version.
Issue: Async materialized views have data latency.

Very Small Source Tables

Description: The base table has only a small number of records (such as a few hundred rows).
Issue: The optimization benefit of an async materialized view is not significant.

2. Usage Principles

2.1 When to Use Async Materialized Views

Dimension	Description
Timeliness	Suitable for scenarios where data timeliness requirements are not high (such as T+1 data). For high timeliness requirements, use synchronous materialized views.
Acceleration and consistency	Group common query SQL patterns with as little overlap between groups as possible. The clearer the grouping, the higher the build quality.
Reusability	One query can use multiple materialized views, and one materialized view can be used by multiple queries.
Trade-offs	Consider together the response time when hitting a materialized view (acceleration), build cost, and data consistency requirements.

2.2 Trade-off Between Materialized View Definition and Build Cost

Definition close to the original query: Strong acceleration, but poor generality and reusability, with high build cost.
More general definition (such as without WHERE conditions or with more aggregation dimensions): Lower acceleration, but better generality and reusability, with lower build cost.

Note

Control the number of materialized views: More materialized views are not always better. Building and refreshing them consumes resources, and the CBO also takes time to choose the optimal materialized view during transparent rewrite. In theory, the more materialized views, the longer the transparent rewrite time.
Periodically review usage status: Unused materialized views should be deleted in a timely manner.
Base table update frequency: Frequent updates to the base table cause materialized views to be invalidated frequently and unable to be used for transparent rewrite (direct queries are still possible). To use transparent rewrite in this scenario, you must allow some latency in queried data, which can be configured via grace_period. See the grace_period description for details.

3. Refresh Strategy Selection

Async materialized views provide three main refresh strategies: manual refresh, scheduled refresh, and trigger-based refresh. Choosing an appropriate refresh strategy is critical for balancing data freshness and system performance.

3.1 Prefer Partitioned Materialized Views

When all of the following conditions are met, building a partitioned materialized view is recommended:

The base table of the materialized view has a large amount of data and is a partitioned table.
Non-partitioned tables referenced by the materialized view do not change frequently.
The materialized view definition SQL and partition fields meet partition derivation requirements (that is, they meet the partition incremental update requirements). For detailed requirements, see CREATE-ASYNC-MATERIALIZED-VIEW.
The materialized view does not have many partitions. Too many partitions cause excessively long build times.

When some partitions of a materialized view are invalidated, transparent rewrite can still use the valid partitions UNION ALL with the base table to return data.

If a partitioned materialized view cannot be built, consider using a materialized view with full refresh.

3.2 Comparison of the Three Refresh Strategies

Refresh Strategy	Trigger Method	Data Freshness	Automation Level	Main Risk
Manual refresh	Explicit user command or external scheduling	Low, depends on scheduling	Low	Scheduling must be self-managed
Scheduled refresh	At fixed time intervals (minimum minute level)	Medium, deterministic latency	Medium	High frequency continuously occupies resources
Trigger-based refresh	Automatically triggered when base table data changes	High	High	May cause refresh storms

3.3 Detailed Refresh Strategies

Manual Refresh

How it works: Triggered by users via explicit commands or external system scheduling.
Applicable scenarios:
- Reporting systems with low real-time requirements
- Historical data analysis in data warehouses
- Scenarios that need to refresh in sync with specific business processes
- Large-scale data refreshes that need to coordinate system resources
Pros: Full control over refresh timing, can avoid business peak hours.
Cons: Requires extra refresh scheduling management and good fault tolerance to avoid external loops continuously triggering refreshes.

Scheduled Refresh

How it works:
- Automatically refreshes at fixed time intervals
- Minimum time unit is at the minute level
- The start time of the first task run can be specified
Applicable scenarios:
- Periodic business metric monitoring
- Tiered data pipelines
- Reporting systems with tiered time sensitivity
- Source data with regular fluctuations
Pros: Scheduled data processing with deterministic data latency.
Cons: Limited data freshness, and the refresh sequence of related views must be manually coordinated.
Configuration constraints: Setting all materialized views to high-frequency scheduled refresh to approach real-time is not recommended, because it causes:
- Continuous occupation of system resources
- Refresh jobs competing with each other for resources
- Frequent addition and removal of partitions / tablets, which puts heavy pressure on BE

Trigger-Based Refresh

How it works: Automatically triggers a refresh when base table data changes.
Applicable scenarios:
- Upper-layer views in a multi-layer materialized view architecture
- Scenarios where base table change frequency is low
Pros: High data freshness, high automation.
Cons: May cause refresh storms, and system load is hard to predict.
Configuration constraints: Trigger-based refresh on base-layer materialized views is not recommended unless:
- You can confirm the base table refresh frequency is low (for example, changes every few tens of minutes)

3.4 Recommendations for Combining Refresh Strategies

By Data Warehouse Layer

View Layer	Recommended Refresh Strategy
Base layer	Scheduled refresh (such as hourly)
Middle layer	Scheduled refresh or trigger-based refresh
Presentation layer	Trigger-based refresh or manual refresh

By Business Criticality

Business Level	Recommended Strategy
Critical real-time business data	Async materialized views are not recommended
Regular analytical data	Scheduled refresh (daily / hourly)
Historical / archived data	Manual refresh

By Data Change Frequency

Change Frequency	Recommended Strategy
High-frequency changes	Scheduled refresh (longer interval) or manual refresh
Low-frequency changes	Trigger-based refresh or short-interval scheduled refresh
Batch changes	Manual refresh after changes

3.5 Refresh Frequency Recommendations

The following are general recommendations. The actual choice should also be evaluated based on system resources, the number of async materialized views, and other business resource usage.

Actual Refresh Duration	Recommended Refresh Frequency
Less than 15 seconds	Greater than or equal to 5 minutes
Less than 10 minutes	Greater than or equal to 1 hour
Less than 1 hour	Greater than or equal to 1 day

4. Partitioned Materialized View Practice

4.1 Partition Mapping Relationship

The partitions of a materialized view are created by mapping from base table partitions, generally with a 1:1 or 1:n relationship to the base table partitions. For detailed partition derivation requirements, see CREATE-ASYNC-MATERIALIZED-VIEW and Async Materialized View FAQ Q12.

4.2 Partition Invalidation and Refresh Behavior

Trigger	Effect	Response
Base table partition data changes (insert, delete, etc.)	The corresponding materialized view partition is invalidated. Invalidated partitions cannot be used for transparent rewrite but can still be queried directly. During transparent rewrite, the invalidated partition responds to the query together with the base table	Use `SHOW PARTITIONS FROM mv_name` to view partition status
Referenced non-partitioned table data changes	All partitions of the materialized view are invalidated, and it cannot be used for transparent rewrite	Run `REFRESH MATERIALIZED VIEW mv1 AUTO;` to refresh all partitions where data has changed
Referenced non-partitioned tables only insert and never modify data	By default, all partitions are invalidated	At creation, specify `excluded_trigger_tables = 'non_partitioned_table_name1,non_partitioned_table_name2'`. The next refresh will only refresh the invalidated partitions corresponding to the partitioned table

Design recommendation: Place tables with frequently changing data in the partitioned table referenced by the partitioned materialized view, and place dimension tables that change infrequently in non-referenced partitioned positions.

4.3 Partition-Granularity Transparent Rewrite

Transparent rewrite for a partitioned materialized view operates at partition granularity:

Even if some partitions of the materialized view are invalidated, it can still be used for transparent rewrite.
However, if a query targets only one partition and that partition is invalidated, the materialized view cannot be used for this transparent rewrite.

4.4 Complete Example

Goal: Build a daily-granularity partitioned materialized view to accelerate queries that aggregate by day.

Step 1: Create the daily-partitioned base table lineitem and prepare the dimension table partsupp.

CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey INTEGER NOT NULL, 
    l_partkey INTEGER NOT NULL, 
    l_suppkey INTEGER NOT NULL, 
    l_linenumber INTEGER NOT NULL, 
    l_ordertime DATETIME NOT NULL, 
    l_quantity DECIMALV3(15, 2) NOT NULL, 
    l_extendedprice DECIMALV3(15, 2) NOT NULL, 
    l_discount DECIMALV3(15, 2) NOT NULL, 
    l_tax DECIMALV3(15, 2) NOT NULL, 
    l_returnflag CHAR(1) NOT NULL, 
    l_linestatus CHAR(1) NOT NULL, 
    l_shipdate DATE NOT NULL, 
    l_commitdate DATE NOT NULL, 
    l_receiptdate DATE NOT NULL, 
    l_shipinstruct CHAR(25) NOT NULL, 
    l_shipmode CHAR(10) NOT NULL, 
    l_comment VARCHAR(44) NOT NULL
) DUPLICATE KEY(
    l_orderkey, l_partkey, l_suppkey, 
    l_linenumber
) PARTITION BY RANGE(l_ordertime) (
    FROM 
      ('2024-05-01') TO ('2024-06-30') INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3;

INSERT INTO lineitem VALUES      
(1, 2, 3, 4, '2024-05-01 01:45:05', 5.5, 6.5, 0.1, 8.5, 'o', 'k', '2024-05-01', '2024-05-01', '2024-05-01', 'a', 'b', 'yyyyyyyyy'),    
(1, 2, 3, 4, '2024-05-15 02:35:05', 5.5, 6.5, 0.15, 8.5, 'o', 'k', '2024-05-15', '2024-05-15', '2024-05-15', 'a', 'b', 'yyyyyyyyy'),     
(2, 2, 3, 5, '2024-05-25 08:30:06', 5.5, 6.5, 0.2, 8.5, 'o', 'k', '2024-05-25', '2024-05-25', '2024-05-25', 'a', 'b', 'yyyyyyyyy'),     
(3, 4, 3, 6, '2024-06-02 09:25:07', 5.5, 6.5, 0.3, 8.5, 'o', 'k', '2024-06-02', '2024-06-02', '2024-06-02', 'a', 'b', 'yyyyyyyyy');

CREATE TABLE IF NOT EXISTS partsupp (
    ps_partkey INTEGER NOT NULL, 
    ps_suppkey INTEGER NOT NULL, 
    ps_availqty INTEGER NOT NULL, 
    ps_supplycost DECIMALV3(15, 2) NOT NULL, 
    ps_comment VARCHAR(199) NOT NULL
)
DUPLICATE KEY(ps_partkey, ps_suppkey)
DISTRIBUTED BY HASH(ps_partkey) BUCKETS 3;

INSERT INTO partsupp VALUES     
(2, 3, 9, 10.01, 'supply1'),     
(4, 3, 9, 10.01, 'supply2'),     
(5, 6, 9, 10.01, 'supply3'),     
(6, 5, 10, 11.01, 'supply4');

Step 2: A typical query that aggregates by day.

SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  ps_partkey 
FROM 
  lineitem 
  LEFT JOIN partsupp ON l_partkey = ps_partkey 
  AND l_suppkey = ps_suppkey 
WHERE 
  date_trunc(l_ordertime, 'day') <= DATE '2024-05-25' 
  AND date_trunc(l_ordertime, 'day') >= DATE '2024-05-05' 
GROUP BY 
  l_linestatus, 
  ps_partkey;

Step 3: Build a daily-partitioned materialized view with the same partition granularity as the base table, aggregating data by day.

CREATE MATERIALIZED VIEW rollup_partition_mv 
BUILD IMMEDIATE REFRESH AUTO ON MANUAL 
partition by(order_date) 
DISTRIBUTED BY RANDOM BUCKETS 2 
AS 
SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  ps_partkey, 
  date_trunc(l_ordertime, 'day') AS order_date 
FROM 
  lineitem 
  LEFT JOIN partsupp ON l_partkey = ps_partkey 
  AND l_suppkey = ps_suppkey 
GROUP BY 
  l_linestatus, 
  ps_partkey, 
  date_trunc(l_ordertime, 'day');

4.5 Retain Only the Most Recent Partition Data

Tip

This feature is supported starting from Apache Doris 2.1.1.

A materialized view can retain data only for the most recent few partitions and automatically delete expired partition data on each refresh. Configure this with the following properties:

Property	Description
`partition_sync_limit`	When the base table partition field is a time type, configure the partition range to sync from the base table (used together with `partition_sync_time_unit`). For example, setting it to `3` with unit `DAY` means only the last 3 days of partitions and data are synced from the base table
`partition_sync_time_unit`	The time unit for partition refresh. Supports `DAY` / `MONTH` / `YEAR`, defaults to `DAY`
`partition_date_format`	When the base table partition field is a string, the date format required to use the `partition_sync_limit` capability

The materialized view below retains only the most recent 3 days of data. If there is no data in the last 3 days, querying the materialized view directly returns no data.

CREATE MATERIALIZED VIEW latest_partition_mv 
BUILD IMMEDIATE REFRESH AUTO ON MANUAL 
partition by(order_date) 
DISTRIBUTED BY RANDOM BUCKETS 2 
PROPERTIES (
  "partition_sync_limit" = "3", 
  "partition_sync_time_unit" = "DAY", 
  "partition_date_format" = "yyyy-MM-dd"
)       
AS 
SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  ps_partkey, 
  date_trunc(l_ordertime, 'day') AS order_date 
FROM 
  lineitem 
  LEFT JOIN partsupp ON l_partkey = ps_partkey 
  AND l_suppkey = ps_suppkey 
GROUP BY 
  l_linestatus, 
  ps_partkey, 
  date_trunc(l_ordertime, 'day');

5. How to Use Materialized Views to Accelerate Queries

5.1 General Approach

To use a materialized view to accelerate queries, follow these steps:

View the profile file and find the most time-consuming operation in the query. The bottleneck is typically in: Join, Aggregate, Filter, or Calculated Expressions.
Build a corresponding materialized view targeting the bottleneck operator. For example, if Join consumes a lot of compute resources while Aggregate consumes relatively little, build a materialized view targeting Join.

5.2 Build Recommendations for the Four Operation Types

5.2.1 For Join

Extract common table join patterns used in queries to build a materialized view. Hits save the Join computation.
Remove filters from the query to obtain a more general Join materialized view.

5.2.2 For Aggregate

Use low-cardinality fields as dimensions for the materialized view to minimize data size after aggregation.
The aggregation granularity of the materialized view should be finer than the query (that is, the materialized view aggregation dimensions include the query's aggregation dimensions), and the aggregation functions in the materialized view should also include those in the query.

Cardinality assessment example:

Table t1 has 1,000,000 rows, and the query includes GROUP BY a, b, c:
- If the cardinalities of a, b, and c are 100, 50, and 15 respectively, the aggregated result is about 75,000 rows. The materialized view is effective.
- If a, b, and c are correlated, the post-aggregation data size shrinks further.
- If c has a cardinality of 3,500, the aggregated result is about 17,000,000 rows, larger than the original table. A materialized view is not suitable.

5.2.3 For Filter

If queries frequently filter on the same fields, add corresponding filters to the materialized view to reduce its data size.
The materialized view's filter should be less restrictive than the query's, and the query's filter should include the materialized view's filter.

For example, if the query is a > 10 AND b > 5:

The materialized view can have no filter at all,
Or have a broader-range filter such as a > 5 AND b > 5 or a > 5.

5.2.4 For Calculated Expressions

Precomputing high-cost expressions such as CASE WHEN or string processing can significantly improve query performance.
The number of columns in a single materialized view should not be too large. Group by query SQL pattern and build separate materialized views for each group.

Complete example for accelerating aggregation queries:

Query 1:

SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
WHERE 
  o_orderdate <= DATE '2024-06-30' 
  AND o_orderdate >= DATE '2024-05-01' 
GROUP BY 
  l_linestatus, 
  o_shippriority,
  l_partkey;

Query 2:

SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
WHERE 
  o_orderdate <= DATE '2024-06-30' 
  AND o_orderdate >= DATE '2024-05-01' 
GROUP BY 
  l_linestatus, 
  o_shippriority,
  l_suppkey;

For the queries above, build a more general aggregation materialized view: include both l_partkey and l_suppkey as aggregation dimensions, and use o_orderdate as a filter condition. Note: o_orderdate is used not only in materialized view condition compensation but must also be included in the aggregation dimensions. This way both Query 1 and Query 2 can hit the materialized view:

CREATE MATERIALIZED VIEW common_agg_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 2
AS 
SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  o_shippriority,
  l_suppkey,
  l_partkey,
  o_orderdate
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
GROUP BY 
  l_linestatus, 
  o_shippriority,
  l_suppkey,
  l_partkey,
  o_orderdate;

6. Typical Use Cases

6.1 Scenario 1: Query Acceleration

Applicable scenarios: BI reporting or other scenarios sensitive to query response time, requiring results in seconds. Multi-table Joins followed by aggregation consume significant compute resources, making timeliness hard to guarantee. Async materialized views support both direct queries and transparent rewrite. The optimizer automatically selects the optimal materialized view based on the rewrite algorithm and cost model.

Use Case 1: Multi-Table Join Aggregation Query Acceleration

Build a more general materialized view to accelerate multi-table join aggregation queries.

Goal: Build a single materialized view that satisfies all three of the queries below.

Query 1:

SELECT 
  l_linestatus, 
  l_extendedprice * (1 - l_discount),
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
WHERE 
  o_orderdate <= DATE '2024-06-30' 
  AND o_orderdate >= DATE '2024-05-01';

Query 2:

SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  o_orderdate, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
WHERE 
  o_orderdate <= DATE '2024-06-30' 
  AND o_orderdate >= DATE '2024-05-01' 
GROUP BY 
  l_linestatus, 
  o_orderdate, 
  o_shippriority;

Query 3:

SELECT 
  l_linestatus, 
  l_extendedprice * (1 - l_discount),
  o_orderdate, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey;

Build option 1: A general Join materialized view. Remove the filter conditions of Query 1 and Query 2, and precompute l_extendedprice * (1 - l_discount):

CREATE MATERIALIZED VIEW common_join_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 2
AS 
SELECT 
  l_linestatus, 
  l_extendedprice * (1 - l_discount),
  o_orderdate, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey;

Build option 2: If the materialized view above does not meet the acceleration performance requirement of Query 2, build an additional aggregation materialized view. Remove the filter on o_orderdate to keep it general:

CREATE MATERIALIZED VIEW target_agg_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 2
AS 
SELECT 
  l_linestatus, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) AS revenue, 
  o_orderdate, 
  o_shippriority 
FROM 
  orders 
  LEFT JOIN lineitem ON l_orderkey = o_orderkey 
GROUP BY 
  l_linestatus, 
  o_orderdate, 
  o_shippriority;

Use Case 2: Log Query Acceleration

Applicable scenarios: The base table is typically partitioned by hour, and queries are single-table aggregations with filters mostly on time and identifier flags. When response speed is not satisfactory, async and synchronous materialized views can be used together.

Step 1: Base table definition.

CREATE TABLE IF NOT EXISTS test (
`app_name` VARCHAR(64) NULL COMMENT 'identifier', 
`event_id` VARCHAR(128) NULL COMMENT 'identifier', 
`decision` VARCHAR(32) NULL COMMENT 'enum value', 
`time` DATETIME NULL COMMENT 'query time', 
`id` VARCHAR(35) NOT NULL COMMENT 'id', 
`code` VARCHAR(64) NULL COMMENT 'identifier', 
`event_type` VARCHAR(32) NULL COMMENT 'event type' 
)
DUPLICATE KEY(app_name, event_id)
PARTITION BY RANGE(time)                                    
(                                                                                                                                      
    FROM ("2024-07-01 00:00:00") TO ("2024-07-15 00:00:00") INTERVAL 1 HOUR                                                                     
)     
DISTRIBUTED BY HASH(event_id)
BUCKETS 3;

Step 2: Build a materialized view aggregated by minute to achieve a certain level of aggregation.

CREATE MATERIALIZED VIEW sync_mv
AS
SELECT 
  decision,
  code, 
  app_name, 
  event_id, 
  event_type, 
  date_trunc(time, 'minute'), 
  DATE_FORMAT(
    `time`, '%Y-%m-%d'
  ), 
  cast(FLOOR(MINUTE(time) / 15) AS decimal(9, 0)),
  count(id) AS cnt
FROM 
  test 
GROUP BY 
  code, 
  app_name, 
  event_id, 
  event_type, 
  date_trunc(time, 'minute'), 
  decision, 
  DATE_FORMAT(time, '%Y-%m-%d'), 
  cast(FLOOR(MINUTE(`time`) / 15) AS decimal(9, 0));

Step 3: A typical query.

SELECT 
    decision, 
    CONCAT(
        CONCAT(
          DATE_FORMAT(
            `time`, '%Y-%m-%d'
          ), 
          '', 
          LPAD(
            cast(FLOOR(MINUTE(`time`) / 15) AS decimal(9, 0)) * 15, 
            5, 
            '00'
          ), 
          ':00'
        )
      ) AS time, 
      count(id) AS cnt 
FROM 
  test 
WHERE 
  date_trunc(time, 'minute') BETWEEN '2024-07-02 18:00:00' 
  AND '2024-07-03 20:00:00' 
GROUP BY 
  decision, 
  DATE_FORMAT(
    `time`, "%Y-%m-%d"
  ), 
  cast(FLOOR(MINUTE(`time`) / 15) AS decimal(9, 0));

6.2 Scenario 2: Data Modeling (ETL)

Applicable scenarios: Data analysis often requires joining and aggregating multiple tables, with complex and repeated queries leading to high latency and heavy resource consumption. Use async materialized views to build a layered data model. You can build higher-level materialized views on top of existing materialized views (supported from 2.1.3).

Choosing trigger methods for different layers:

First-layer scheduled refresh + second-layer trigger refresh: When the first layer finishes refreshing, the second layer is automatically triggered.
All layers use scheduled refresh: When the second layer refreshes, it does not consider whether the first layer is in sync with the base table, and only processes and syncs first-layer data to the second layer.

The following example uses the TPC-H dataset to analyze the order count and profit per region and country per month.

Original query (without materialized views):

SELECT
n_name,
date_trunc(o.o_orderdate, 'month') AS month,
count(distinct o.o_orderkey) AS order_count,
sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM orders o
JOIN lineitem l ON o.o_orderkey = l.l_orderkey
JOIN customer c ON o.o_custkey = c.c_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
GROUP BY n_name, month;

Step 1: Build the DWD layer (detail data) - the order detail wide table.

CREATE MATERIALIZED VIEW dwd_order_detail
BUILD IMMEDIATE REFRESH AUTO ON COMMIT
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT
o.o_orderkey,
o.o_custkey,
o.o_orderstatus,
o.o_totalprice,
o.o_orderdate,
c.c_name,
c.c_nationkey,
n.n_name AS nation_name,
r.r_name AS region_name,
l.l_partkey,
l.l_quantity,
l.l_extendedprice,
l.l_discount,
l.l_tax
FROM orders o
JOIN customer c ON o.o_custkey = c.c_custkey
JOIN nation n ON c.c_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
JOIN lineitem l ON o.o_orderkey = l.l_orderkey;

Step 2: Build the DWS layer (summary data) - daily order summary.

CREATE MATERIALIZED VIEW dws_daily_sales
BUILD IMMEDIATE REFRESH AUTO ON COMMIT
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT
date_trunc(o_orderdate, 'month') AS month,
nation_name,
region_name,
bitmap_union(to_bitmap(o_orderkey)) AS order_count,
sum(l_extendedprice * (1 - l_discount)) AS net_revenue
FROM dwd_order_detail
GROUP BY
date_trunc(o_orderdate, 'month'),
nation_name,
region_name;

Step 3: Use the materialized view to optimize the query.

SELECT
nation_name,
month,
bitmap_union_count(order_count),
sum(net_revenue) AS revenue
FROM dws_daily_sales
GROUP BY nation_name, month;

6.3 Scenario 3: Lakehouse Federated Data Query

Applicable scenarios: Modern data architectures often adopt a lakehouse design to balance storage cost and query performance. This architecture has two main challenges:

Limited query performance: Frequent queries against the data lake are affected by network latency and third-party services, leading to query latency.
Complex data layered modeling: Moving and transforming data from the data lake to a real-time data warehouse usually requires complex ETL with high maintenance cost.

How Doris async materialized views address these issues:

Transparent rewrite to accelerate queries: Materialize commonly used data lake query results into Doris internal storage, and use transparent rewrite to improve query performance.
Simplified layered modeling: Support creating materialized views on top of tables in the data lake, making it easy to convert from a data lake to a real-time data warehouse.

The example below uses Hive.

Step 1: Create a Catalog based on Hive (using the TPC-H dataset).

CREATE CATALOG hive_catalog PROPERTIES (
'type'='hms', -- hive meta store address
'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
);

Step 2: Create a materialized view based on the Hive Catalog.

-- Materialized views can only be created on the internal catalog. Switch to the internal catalog
SWITCH internal;
CREATE DATABASE hive_mv_db;
USE hive_mv_db;

CREATE MATERIALIZED VIEW external_hive_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 12
AS
SELECT
n_name,
o_orderdate,
sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'ASIA'
GROUP BY
n_name,
o_orderdate;

Step 3: Run the query and accelerate it automatically via transparent rewrite using the materialized view.

SELECT
n_name,
sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'ASIA'
AND o_orderdate >= DATE '1994-01-01'
AND o_orderdate < DATE '1994-01-01' + INTERVAL '1' YEAR
GROUP BY
n_name
ORDER BY
revenue DESC;

Tip

Doris cannot currently detect data changes in external tables other than Hive. When external table data is inconsistent, using a materialized view may produce inconsistent data.

External-table transparent rewrite switch (default false): Whether materialized views participating in transparent rewrite are allowed to contain external tables. If you can accept data inconsistency or can ensure consistency through scheduled refresh, enable it:

SET materialized_view_rewrite_enable_contain_external_table = true;

Troubleshooting when a rewrite is not chosen: If the materialized view is in MaterializedViewRewriteSuccessButNotChose status, the rewrite succeeded but the plan was not chosen by the CBO. This may be due to incomplete external table statistics.

Enable getting row counts from files:

SET enable_get_row_count_from_file_list = true;

View external table statistics to confirm whether they have been collected completely:

SHOW TABLE STATS external_table_name;

6.4 Scenario 4: Improve Write Efficiency and Reduce Resource Contention

Applicable scenarios: High-throughput data write scenarios that need stable system performance and efficient data processing. Through the flexible refresh strategies of async materialized views, you can reduce write pressure and avoid resource contention.

When base table data changes, the materialized view refresh is not triggered immediately. Delayed refresh helps reduce resource pressure and avoid contention with write operations.

Example: A scheduled refresh strategy that refreshes every 2 hours. When data is loaded into orders and lineitem, the materialized view refresh is not triggered immediately.

CREATE MATERIALIZED VIEW common_schedule_join_mv
BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 2 HOUR
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT
l_linestatus,
l_extendedprice * (1 - l_discount),
o_orderdate,
o_shippriority
FROM
orders
LEFT JOIN lineitem ON l_orderkey = o_orderkey;

Transparent Rewrite Improves Load Efficiency

Transparent rewrite not only accelerates queries but can also rewrite load SQL, thereby improving load efficiency. Starting from version 2.1.6, when a materialized view is strongly consistent with the base table, DML operations (such as INSERT INTO or INSERT OVERWRITE) can be transparently rewritten, providing significant performance gains in data load scenarios.

Step 1: Create the target table for the INSERT INTO data.

CREATE TABLE IF NOT EXISTS target_table  (
orderdate      DATE NOT NULL,
shippriority   INTEGER NOT NULL,
linestatus     CHAR(1) NOT NULL,
sale           DECIMALV3(15,2) NOT NULL
)
DUPLICATE KEY(orderdate, shippriority)
DISTRIBUTED BY HASH(shippriority) BUCKETS 3;

Step 2: Create the common_schedule_join_mv materialized view.

CREATE MATERIALIZED VIEW common_schedule_join_mv
BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 2 HOUR
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT
l_linestatus,
l_extendedprice * (1 - l_discount),
o_orderdate,
o_shippriority
FROM
orders
LEFT JOIN lineitem ON l_orderkey = o_orderkey;

Step 3: The load statement before rewrite.

INSERT INTO target_table
SELECT
o_orderdate,
o_shippriority,
l_linestatus,
l_extendedprice * (1 - l_discount)
FROM
orders
LEFT JOIN lineitem ON l_orderkey = o_orderkey;

Step 4: The equivalent statement after transparent rewrite.

INSERT INTO target_table
SELECT *
FROM common_schedule_join_mv;

Note

If the DML operates on an external table whose data changes cannot be detected, transparent rewrite may cause the latest data in the base table to not be loaded into the target table in real time. If you can accept data inconsistency or can ensure consistency yourself, you can enable the following switch.

For DML, when the materialized view contains an external table whose data changes cannot be detected in real time, whether to enable structure-based transparent rewrite of the materialized view (disabled by default):

SET enable_dml_materialized_view_rewrite_when_base_table_unawareness = true;

7. Operational Considerations

Async materialized views are essentially enhanced ETL computations and require ongoing maintenance. The following three points are key to daily operations.

Monitoring: After a materialized view starts running, monitor system status via metrics in a timely manner. Async materialized views will expose more monitoring metrics in the future. Currently, you can use tasks to view information such as the number of tasks, execution status, and task duration.
Planning: Plan the number of materialized views, refresh frequency, and the maximum cluster compute capacity. Do not "just build materialized views without maintaining them." A materialized view is essentially an enhanced ETL computation and requires maintenance just like traditional ETL.
Resource isolation: A materialized view is a data computation task, so apply resource isolation as needed.

FAQ

Q1: Can async materialized views completely replace real-time queries?

No. Async materialized views have data latency (depending on the refresh strategy) and are not suitable for scenarios that require data freshness within 1 to 5 minutes. For scenarios with high timeliness requirements, consider synchronous materialized views.

Q2: Can I set all materialized views to high-frequency scheduled refresh to approach real-time?

Not recommended. Doing so causes continuous occupation of system resources, refresh jobs competing with each other, and frequent addition and removal of partitions / tablets, which puts heavy pressure on BE.

Q3: How do I choose a refresh strategy?

Refer to Comparison of the Three Refresh Strategies and Recommendations for Combining Refresh Strategies, and match by data warehouse layer, business criticality, or data change frequency. First evaluate whether you can build a partitioned materialized view.

Q4: Do materialized views still need maintenance after they are built?

Yes. A materialized view is essentially an enhanced ETL computation and requires monitoring, planning, and resource isolation. See Operational Considerations for details.

Q5: Can I still use transparent rewrite when the base table is updated frequently?

Frequent updates to the base table cause the materialized view to be invalidated frequently and unable to be used for transparent rewrite (direct queries are still possible). To use transparent rewrite in this scenario, you must allow some latency in queried data, which can be configured via grace_period.

Quick Decision Checklist​

1. Scenario Assessment​

Scenario Quick Reference​

Recommended Scenarios​

Complex Aggregation Queries​

Reports​

Compute-Intensive Analysis​

Star / Snowflake Schema in Data Warehouses​

Lakehouse Acceleration​

Data Warehouse Layering​

Scenarios Not Recommended​

Frequently Updated Base Tables​

Simple Queries​

Scenarios Requiring Real-Time Data (Within 1 to 5 Minutes)​

Very Small Source Tables​

2. Usage Principles​

2.1 When to Use Async Materialized Views​

2.2 Trade-off Between Materialized View Definition and Build Cost​

3. Refresh Strategy Selection​

3.1 Prefer Partitioned Materialized Views​

3.2 Comparison of the Three Refresh Strategies​

3.3 Detailed Refresh Strategies​

Manual Refresh​

Scheduled Refresh​

Trigger-Based Refresh​

3.4 Recommendations for Combining Refresh Strategies​

By Data Warehouse Layer​

By Business Criticality​

By Data Change Frequency​

3.5 Refresh Frequency Recommendations​

4. Partitioned Materialized View Practice​

4.1 Partition Mapping Relationship​

4.2 Partition Invalidation and Refresh Behavior​

4.3 Partition-Granularity Transparent Rewrite​

4.4 Complete Example​

4.5 Retain Only the Most Recent Partition Data​

5. How to Use Materialized Views to Accelerate Queries​

5.1 General Approach​

5.2 Build Recommendations for the Four Operation Types​

5.2.1 For Join​

5.2.2 For Aggregate​

5.2.3 For Filter​

5.2.4 For Calculated Expressions​

6. Typical Use Cases​

6.1 Scenario 1: Query Acceleration​

Use Case 1: Multi-Table Join Aggregation Query Acceleration​

Use Case 2: Log Query Acceleration​

6.2 Scenario 2: Data Modeling (ETL)​

6.3 Scenario 3: Lakehouse Federated Data Query​

6.4 Scenario 4: Improve Write Efficiency and Reduce Resource Contention​

Transparent Rewrite Improves Load Efficiency​

7. Operational Considerations​

FAQ​