Multi-Dimensional Aggregation Analysis

When producing sales reports, operational analyses, or market surveys, you often need a single query that returns both the "subtotals at the detail dimension" and the "totals across different dimension combinations" along with the "grand total." Stitching together multiple GROUP BY queries with UNION ALL not only makes the SQL verbose but also scans the base table multiple times, which is inefficient.

Doris provides three syntactic extensions to the GROUP BY clause for multi-dimensional aggregation. They produce summary results at multiple levels in a single SQL statement, and are semantically equivalent to using UNION ALL to join multiple aggregate queries:

Syntax	Applicable scenario	Output summary combinations
ROLLUP	Aggregate level by level along dimensions with a natural hierarchy, such as time, geography, or category	Aggregate progressively along the specified column order, from the finest granularity up to the grand total
CUBE	Multiple independent dimensions that need full cross-dimensional analysis	All combinations of all dimension subsets
GROUPING SETS	Only a few specific dimension combinations are of interest, avoiding the cost of a full CUBE	The grouping sets that the user explicitly specifies

This article introduces these three syntaxes in the order of "scenario -> syntax -> example," and explains how the companion GROUPING and GROUPING_ID functions identify subtotal rows and distinguish between two kinds of NULL values.

ROLLUP: Hierarchical Level-By-Level Summary

Applicable Scenario

ROLLUP is suitable for scenarios that aggregate hierarchical dimensions level by level. It aggregates along the specified column order, summarizing data progressively from the finest granularity up to the highest level. For example:

Time dimension: ROLLUP(year, month, day)
Geographic dimension: ROLLUP(country, province, city)

For sales data, you can use ROLLUP to aggregate by region and time, producing the monthly sales for each region, the total sales for each region, and the overall total sales.

Syntax

SELECT … GROUP BY ROLLUP(grouping_column_reference_list)

Example

The following query analyzes sales by year and month:

SELECT
        YEAR(d_date),
        MONTH(d_date),
        SUM(ss_net_paid) AS total_sum
FROM
        store_sales,
        date_dim d1
WHERE
        d1.d_date_sk = ss_sold_date_sk
        AND YEAR(d_date) IN (2001, 2002)
        AND MONTH(d_date) IN (1, 2, 3)
GROUP BY
        ROLLUP(YEAR(d_date), MONTH(d_date))
ORDER BY
        YEAR(d_date), MONTH(d_date);

The query aggregates by time level by level, computing the monthly sales subtotal for each year, the annual sales subtotal for each year, and the overall sales grand total. The query result is as follows:

+--------------+---------------+-------------+
| YEAR(d_date) | MONTH(d_date) | total_sum   |
+--------------+---------------+-------------+
|         NULL |          NULL | 54262669.17 |
|         2001 |          NULL | 26640320.46 |
|         2001 |             1 |  9982165.83 |
|         2001 |             2 |  8454915.34 |
|         2001 |             3 |  8203239.29 |
|         2002 |          NULL | 27622348.71 |
|         2002 |             1 | 11260654.35 |
|         2002 |             2 |  7722750.61 |
|         2002 |             3 |  8638943.75 |
+--------------+---------------+-------------+
9 rows in set (0.08 sec)

CUBE: Full Cross-Dimensional Summary

Applicable Scenario

CUBE is best suited for queries that involve multiple independent dimension columns rather than columns that represent different levels of a single dimension. A common use case is summarizing all combinations of month, region, and product. These are three mutually independent dimensions, and analyzing every possible subtotal combination is very common.

By contrast, cross-tabulating year, month, and day produces many unnecessary values, because the time dimension itself has a natural hierarchy. In most analyses, subtotals such as "profit by month-day" are not needed, and relatively few users ask "what is the total sales on the 16th of every month across the year." Hierarchical dimensions like this are better handled with ROLLUP.

Syntax

SELECT … GROUP BY CUBE(grouping_column_reference_list)

Example

SELECT
        YEAR(d_date),
        i_category,
        ca_state,
        SUM(ss_net_paid) AS total_sum
FROM
        store_sales,
        date_dim d1,
        item,
        customer_address ca
WHERE
        d1.d_date_sk = ss_sold_date_sk
        AND i_item_sk = ss_item_sk
        AND ss_addr_sk = ca_address_sk
        AND i_category IN ("Books", "Electronics")
        AND YEAR(d_date) IN (1998, 1999)
        AND ca_state IN ("LA", "AK")
GROUP BY CUBE(YEAR(d_date), i_category, ca_state)
ORDER BY YEAR(d_date), i_category, ca_state;

The query result computes:

The grand total of sales.
The sales subtotal for each year, the sales subtotal of items per category, and the sales subtotal per state.
The sales subtotal per category for each year, the sales subtotal per product for each state, the sales subtotal per state for each year, and the sales subtotal of products per category in each state for each year.

+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category  | ca_state | total_sum  |
+--------------+-------------+----------+------------+
|         NULL | NULL        | NULL     | 8690374.60 |
|         NULL | NULL        | AK       | 2675198.33 |
|         NULL | NULL        | LA       | 6015176.27 |
|         NULL | Books       | NULL     | 4238177.69 |
|         NULL | Books       | AK       | 1310791.36 |
|         NULL | Books       | LA       | 2927386.33 |
|         NULL | Electronics | NULL     | 4452196.91 |
|         NULL | Electronics | AK       | 1364406.97 |
|         NULL | Electronics | LA       | 3087789.94 |
|         1998 | NULL        | NULL     | 4369656.14 |
|         1998 | NULL        | AK       | 1402539.19 |
|         1998 | NULL        | LA       | 2967116.95 |
|         1998 | Books       | NULL     | 2213703.82 |
|         1998 | Books       | AK       |  719911.29 |
|         1998 | Books       | LA       | 1493792.53 |
|         1998 | Electronics | NULL     | 2155952.32 |
|         1998 | Electronics | AK       |  682627.90 |
|         1998 | Electronics | LA       | 1473324.42 |
|         1999 | NULL        | NULL     | 4320718.46 |
|         1999 | NULL        | AK       | 1272659.14 |
|         1999 | NULL        | LA       | 3048059.32 |
|         1999 | Books       | NULL     | 2024473.87 |
|         1999 | Books       | AK       |  590880.07 |
|         1999 | Books       | LA       | 1433593.80 |
|         1999 | Electronics | NULL     | 2296244.59 |
|         1999 | Electronics | AK       |  681779.07 |
|         1999 | Electronics | LA       | 1614465.52 |
+--------------+-------------+----------+------------+
27 rows in set (0.21 sec)

GROUPING Function: Identifying Subtotal Rows

When using ROLLUP and CUBE, the result set raises two issues that need to be addressed:

How to identify subtotal rows: programmatically detect which result rows represent subtotals and pinpoint the aggregation level a given subtotal corresponds to. This is very common in scenarios such as computing "percentage of grand total."
How to distinguish two kinds of NULL: when the query result contains both NULL values actually stored in the table and NULL values produced by ROLLUP or CUBE operations, you need a way to tell them apart.

The GROUPING and GROUPING_ID functions (paired with GROUPING SETS) solve these problems effectively.

GROUPING

Principle

GROUPING takes a single column as its argument:

Returns 1 when the value is a NULL produced by a ROLLUP or CUBE operation (that is, the row is a subtotal row).
Returns 0 for any other value, including NULL values that already exist in the table.

Example:

SELECT
        year(d_date),
        month(d_date),
        sum(ss_net_paid) AS total_sum,
        grouping(year(d_date)),
        grouping(month(d_date))
FROM
        store_sales,
        date_dim d1
WHERE
        d1.d_date_sk = ss_sold_date_sk
        AND year(d_date) IN (2001, 2002)
        AND month(d_date) IN (1, 2, 3)
GROUP BY
        ROLLUP(year(d_date), month(d_date))
ORDER BY
        year(d_date), month(d_date);

The values of the GROUPING function at different aggregation levels:

(year(d_date), month(d_date)) group: result is (0, 0), the detail rows aggregated by year and month.
(year(d_date)) group: result is (0, 1), the subtotal rows aggregated by year.
() group: result is (1, 1), the grand total row.

The query result is as follows:

+--------------+---------------+-------------+------------------------+-------------------------+
| year(d_date) | month(d_date) | total_sum   | Grouping(year(d_date)) | Grouping(month(d_date)) |
+--------------+---------------+-------------+------------------------+-------------------------+
|         NULL |          NULL | 54262669.17 |                      1 |                       1 |
|         2001 |          NULL | 26640320.46 |                      0 |                       1 |
|         2001 |             1 |  9982165.83 |                      0 |                       0 |
|         2001 |             2 |  8454915.34 |                      0 |                       0 |
|         2001 |             3 |  8203239.29 |                      0 |                       0 |
|         2002 |          NULL | 27622348.71 |                      0 |                       1 |
|         2002 |             1 | 11260654.35 |                      0 |                       0 |
|         2002 |             2 |  7722750.61 |                      0 |                       0 |
|         2002 |             3 |  8638943.75 |                      0 |                       0 |
+--------------+---------------+-------------+------------------------+-------------------------+
9 rows in set (0.06 sec)

Usage 1: Filtering Aggregation Levels in HAVING

The GROUPING function can filter results to a specific aggregation level. The following example keeps only "the grand total of sales," "sales aggregated by year," and "sales aggregated by state":

SELECT
        year(d_date),
        i_category,
        ca_state,
        sum(ss_net_paid) AS total_sum
FROM
        store_sales,
        date_dim d1,
        item,
        customer_address ca
WHERE
        d1.d_date_sk = ss_sold_date_sk
        AND i_item_sk = ss_item_sk
        AND ss_addr_sk = ca_address_sk
        AND i_category IN ("Books", "Electronics")
        AND year(d_date) IN (1998, 1999)
        AND ca_state IN ("LA", "AK")
GROUP BY CUBE(year(d_date), i_category, ca_state)
HAVING grouping(year(d_date)) = 1 AND grouping(i_category) = 1 AND grouping(ca_state) = 1
    OR grouping(year(d_date)) = 0 AND grouping(i_category) = 1 AND grouping(ca_state) = 1
    OR grouping(year(d_date)) = 1 AND grouping(i_category) = 1 AND grouping(ca_state) = 0
ORDER BY year(d_date), i_category, ca_state;

The query result is as follows:

+---------------------+------------+----------+------------+
| year(`d1`.`d_date`) | i_category | ca_state | total_sum  |
+---------------------+------------+----------+------------+
|                NULL | NULL       | NULL     | 8690374.60 |
|                NULL | NULL       | AK       | 2675198.33 |
|                NULL | NULL       | LA       | 6015176.27 |
|                1998 | NULL       | NULL     | 4369656.14 |
|                1999 | NULL       | NULL     | 4320718.46 |
+---------------------+------------+----------+------------+
5 rows in set (0.13 sec)

Usage 2: Combining With IF to Improve Readability

Replacing the NULL values in subtotal rows with more intuitive strings makes the result easier to read:

SELECT
        IF(grouping(year(d_date)) = 1, "Multi-year sum", year(d_date)) AS year,
        IF(grouping(i_category) = 1, "Multi-category sum", i_category) AS category,
        sum(ss_net_paid) AS total_sum
FROM
        store_sales,
        date_dim d1,
        item,
        customer_address ca
WHERE
        d1.d_date_sk = ss_sold_date_sk
        AND i_item_sk = ss_item_sk
        AND ss_addr_sk = ca_address_sk
        AND i_category IN ("Books", "Electronics")
        AND year(d_date) IN (1998, 1999)
        AND ca_state IN ("LA", "AK")
GROUP BY CUBE(year(d_date), i_category)

The query result is as follows:

+----------------+--------------------+------------+
| year           | category           | total_sum  |
+----------------+--------------------+------------+
| 1998           | Books              | 2213703.82 |
| 1998           | Electronics        | 2155952.32 |
| 1999           | Electronics        | 2296244.59 |
| 1999           | Books              | 2024473.87 |
| 1998           | Multi-category sum | 4369656.14 |
| 1999           | Multi-category sum | 4320718.46 |
| Multi-year sum | Books              | 4238177.69 |
| Multi-year sum | Electronics        | 4452196.91 |
| Multi-year sum | Multi-category sum | 8690374.60 |
+----------------+--------------------+------------+
9 rows in set (0.09 sec)

GROUPING_ID

Applicable Scenario

GROUPING_ID and GROUPING are both used to assist with multi-dimensional aggregation queries (such as ROLLUP and CUBE), helping you distinguish between aggregation results at different levels.

To identify the aggregation level a row belongs to with GROUPING, you must compute the function on every GROUP BY column individually (a single-column result is not enough to distinguish levels), which makes the SQL verbose. GROUPING_ID is more powerful than GROUPING: it accepts multiple columns as arguments and returns an integer whose binary bits represent the aggregation states of all those columns at once.

When the computed results are stored in a table or materialized view, using the GROUPING function to represent different aggregation levels takes up more storage. GROUPING_ID is more appropriate in this scenario.

Taking CUBE(a, b) as an example, the correspondence between GROUPING_ID and GROUPING is:

Aggregation level	Bit Vector	GROUPING_ID	GROUPING(a)	GROUPING(b)
a, b	0 0	0	0	0
a	0 1	1	0	1
b	1 0	2	1	0
Grand Total	1 1	3	1	1

Syntax and Example

SELECT
    year(d_date),
    i_category,
    SUM(ss_net_paid) AS total_sum,
    GROUPING(year(d_date)),
    GROUPING(i_category),
    GROUPING_ID(year(d_date), i_category)
FROM
    store_sales,
    date_dim d1,
    item,
    customer_address ca
WHERE
    d1.d_date_sk = ss_sold_date_sk
    AND i_item_sk = ss_item_sk
    AND ss_addr_sk = ca_address_sk
    AND i_category IN ('Books', 'Electronics')
    AND year(d_date) IN (1998, 1999)
    AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(year(d_date), i_category);

The query result is as follows:

+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| year(d_date) | i_category  | total_sum  | GROUPING(year(d_date)) | GROUPING(i_category) | GROUPING_ID(year(d_date), i_category) |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| 1998         | Electronics | 2155952.32 | 0                      | 0                    | 0                                     |
| 1998         | Books       | 2213703.82 | 0                      | 0                    | 0                                     |
| 1999         | Electronics | 2296244.59 | 0                      | 0                    | 0                                     |
| 1999         | Books       | 2024473.87 | 0                      | 0                    | 0                                     |
| 1998         | NULL        | 4369656.14 | 0                      | 1                    | 1                                     |
| 1999         | NULL        | 4320718.46 | 0                      | 1                    | 1                                     |
| NULL         | Electronics | 4452196.91 | 1                      | 0                    | 2                                     |
| NULL         | Books       | 4238177.69 | 1                      | 0                    | 2                                     |
| NULL         | NULL        | 8690374.60 | 1                      | 1                    | 3                                     |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
9 rows in set (0.12 sec)

GROUPING SETS: Specifying Grouping Combinations Precisely

Applicable Scenario

When you need to aggregate only over a few specified grouping combinations rather than compute a full CUBE, use GROUPING SETS in the GROUP BY clause. It lets you specify exactly which combinations across multiple dimensions to compute, avoiding unnecessary overhead.

Because CUBE queries usually consume more resources, GROUPING SETS improves execution efficiency when only a few dimensions are of interest.

Syntax

SELECT … GROUP BY GROUPING SETS(grouping_column_reference_list)

Example

Suppose you need:

The sales subtotal of each product category for each year
The sales subtotal in each state for each year
The sales subtotal of each product in each state for each year

You can use GROUPING SETS to specify these dimension combinations explicitly:

SELECT
    YEAR(d_date),
    i_category,
    ca_state,
    SUM(ss_net_paid) AS total_sum
FROM
    store_sales,
    date_dim d1,
    item,
    customer_address ca
WHERE
    d1.d_date_sk = ss_sold_date_sk
    AND i_item_sk = ss_item_sk
    AND ss_addr_sk = ca_address_sk
    AND i_category IN ('Books', 'Electronics')
    AND YEAR(d_date) IN (1998, 1999)
    AND ca_state IN ('LA', 'AK')
GROUP BY GROUPING SETS(
    (YEAR(d_date), i_category),
    (YEAR(d_date), ca_state),
    (YEAR(d_date), ca_state, i_category)
)
ORDER BY YEAR(d_date), i_category, ca_state;

The query result is as follows:

+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category  | ca_state | total_sum  |
+--------------+-------------+----------+------------+
| 1998         | NULL        | AK       | 1402539.19 |
| 1998         | NULL        | LA       | 2967116.95 |
| 1998         | Books       | NULL     | 2213703.82 |
| 1998         | Books       | AK       |  719911.29 |
| 1998         | Books       | LA       | 1493792.53 |
| 1998         | Electronics | NULL     | 2155952.32 |
| 1998         | Electronics | AK       |  682627.90 |
| 1998         | Electronics | LA       | 1473324.42 |
| 1999         | NULL        | AK       | 1272659.14 |
| 1999         | NULL        | LA       | 3048059.32 |
| 1999         | Books       | NULL     | 2024473.87 |
| 1999         | Books       | AK       |  590880.07 |
| 1999         | Books       | LA       | 1433593.80 |
| 1999         | Electronics | NULL     | 2296244.59 |
| 1999         | Electronics | AK       |  681779.07 |
| 1999         | Electronics | LA       | 1614465.52 |
+--------------+-------------+----------+------------+
16 rows in set (0.11 sec)

The query above is equivalent to running CUBE and then using grouping_id to keep only the specified aggregation combinations, which avoids unnecessary computation:

SELECT
    SUM(ss_net_paid) AS total_sum,
    YEAR(d_date),
    i_category,
    ca_state
FROM
    store_sales,
    date_dim d1,
    item,
    customer_address ca
WHERE
    d1.d_date_sk = ss_sold_date_sk
    AND i_item_sk = ss_item_sk
    AND ss_addr_sk = ca_address_sk
    AND i_category IN ('Books', 'Electronics')
    AND YEAR(d_date) IN (1998, 1999)
    AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(YEAR(d_date), ca_state, i_category)
HAVING grouping_id(YEAR(d_date), ca_state, i_category) = 0
    OR grouping_id(YEAR(d_date), ca_state, i_category) = 2
    OR grouping_id(YEAR(d_date), ca_state, i_category) = 1;

Note

CUBE computes every possible aggregation level (eight in this example), but in practice you may only be interested in a few of them.

Semantic Equivalences

GROUPING SETS is the more fundamental form of multi-dimensional aggregation. Both ROLLUP and CUBE can be expanded into GROUPING SETS.

GROUPING SETS and GROUP BY UNION ALL

The following GROUPING SETS statement:

SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());

Is equivalent to multiple GROUP BY queries joined by UNION ALL:

SELECT k1, k2, SUM(k3) FROM t GROUP BY k1, k2
UNION ALL
SELECT k1, NULL, SUM(k3) FROM t GROUP BY k1
UNION ALL
SELECT NULL, k2, SUM(k3) FROM t GROUP BY k2
UNION ALL
SELECT NULL, NULL, SUM(k3) FROM t;

The query joined by UNION ALL is longer and scans the base table multiple times, so it is less efficient both to write and to execute.

GROUPING SETS and ROLLUP

ROLLUP is an extension of GROUPING SETS. For example:

SELECT a, b, c, SUM(d) FROM tab1 GROUP BY ROLLUP(a, b, c);

Is equivalent to the following GROUPING SETS:

GROUPING SETS (
    (a, b, c),
    (a, b),
    (a),
    ()
);

GROUPING SETS and CUBE

CUBE(a, b, c) is equivalent to the following GROUPING SETS:

GROUPING SETS (
    (a, b, c),
    (a, b),
    (a, c),
    (a),
    (b, c),
    (b),
    (c),
    ()
);

Appendix

For the table creation statements and data files, see the appendix of Analytic Functions (Window Functions).

ROLLUP: Hierarchical Level-By-Level Summary​

Applicable Scenario​

Syntax​

Example​

CUBE: Full Cross-Dimensional Summary​

Applicable Scenario​

Syntax​

Example​

GROUPING Function: Identifying Subtotal Rows​

GROUPING​

Principle​

Usage 1: Filtering Aggregation Levels in HAVING​

Usage 2: Combining With IF to Improve Readability​

GROUPING_ID​

Applicable Scenario​

Syntax and Example​

GROUPING SETS: Specifying Grouping Combinations Precisely​

Applicable Scenario​

Syntax​

Example​

Semantic Equivalences​

GROUPING SETS and GROUP BY UNION ALL​

GROUPING SETS and ROLLUP​

GROUPING SETS and CUBE​

Appendix​

ROLLUP: Hierarchical Level-By-Level Summary

Applicable Scenario

Syntax

Example

CUBE: Full Cross-Dimensional Summary

Applicable Scenario

Syntax

Example

GROUPING Function: Identifying Subtotal Rows

GROUPING

Principle

Usage 1: Filtering Aggregation Levels in HAVING

Usage 2: Combining With IF to Improve Readability

GROUPING_ID

Applicable Scenario

Syntax and Example

GROUPING SETS: Specifying Grouping Combinations Precisely

Applicable Scenario

Syntax

Example

Semantic Equivalences

GROUPING SETS and GROUP BY UNION ALL

GROUPING SETS and ROLLUP

GROUPING SETS and CUBE

Appendix