Multi-Dimensional Analytics
In a database, ROLLUP, CUBE, and GROUPING SETS are advanced SQL statements used for multidimensional data aggregation. These features significantly enhance the capabilities of the GROUP BY clause, enabling users to obtain multiple levels of summary results in a single query, which is semantically equivalent to using UNION ALL to connect multiple aggregation statements.
-
ROLLUP: ROLLUP is an operation used to generate hierarchical summaries. It aggregates data according to the specified column order, gradually summarizing from the finest granularity to the highest level. For example, in sales data, ROLLUP can be used to summarize by region and time, providing sales for each region per month, total sales for each region, and overall total sales. ROLLUP is suitable for scenarios requiring step-by-step summaries.
-
CUBE: CUBE is a more powerful aggregation operation that generates all possible summary combinations. Unlike ROLLUP, CUBE calculates subsets for all dimensions. For instance, for sales data aggregated by product and region, CUBE will compute sales for each product in each region, total sales for each product, total sales for each region, and overall total sales. CUBE is applicable to scenarios requiring comprehensive multidimensional analysis, such as business analysis and market research.
-
GROUPING SETS: GROUPING SETS offer flexibility in aggregating specific grouping sets. It allows users to specify a set of column combinations for independent aggregation, rather than generating all possible combinations as in ROLLUP and CUBE. For example, one can define summaries for specific combinations of region and time without needing all combinations of each dimension. GROUPING SETS are suitable for scenarios requiring customized summaries, providing flexible aggregation control.
ROLLUP, CUBE, and GROUPING SETS provide powerful multidimensional data summary functions, catering to various data analysis and reporting needs, and making complex aggregation calculations simpler and more efficient. The following sections will detail the usage scenarios, syntax, and examples of these features.
ROLLUPβ
Use Caseβ
ROLLUP is particularly useful for summarizing data along hierarchical dimensions such as time, geography, and category. For instance, queries can specify ROLLUP(year, month, day)
or (country, Province, city)
.
Syntax and Exampleβ
The syntax for ROLLUP is as follows:
SELECT β¦ GROUP BY ROLLUP(grouping_column_reference_list)
Here's an example query that analyzes sales sums by year and month:
SELECT
YEAR(d_date),
MONTH(d_date),
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1
WHERE
d1.d_date_sk = ss_sold_date_sk
AND YEAR(d_date) IN (2001, 2002)
AND MONTH(d_date) IN (1, 2, 3)
GROUP BY
ROLLUP(YEAR(d_date), MONTH(d_date))
ORDER BY
YEAR(d_date), MONTH(d_date);
This query summarizes data by time, calculating subtotals for sales by year, sales by month within each year, and the grand total of sales. The query result is as follows:
+--------------+---------------+-------------+
| YEAR(d_date) | MONTH(d_date) | total_sum |
+--------------+---------------+-------------+
| NULL | NULL | 54262669.17 |
| 2001 | NULL | 26640320.46 |
| 2001 | 1 | 9982165.83 |
| 2001 | 2 | 8454915.34 |
| 2001 | 3 | 8203239.29 |
| 2002 | NULL | 27622348.71 |
| 2002 | 1 | 11260654.35 |
| 2002 | 2 | 7722750.61 |
| 2002 | 3 | 8638943.75 |
+--------------+---------------+-------------+
9 rows in set (0.08 sec)
CUBEβ
Use Caseβ
CUBE is best suited for queries involving columns from multiple independent dimensions, rather than columns representing different levels of a single dimension. For example, a common usage scenario is summarizing all combinations of month, region, and product. These are three independent dimensions, and it is common to analyze all possible subtotal combinations. By contrast, cross-tabulating all possible combinations of year, month, and day would include several unnecessary values due to the natural hierarchy in the time dimension. In most analyses, subtotals like profits calculated by month and day are unnecessary. Relatively few users need to ask, "What is the total sales for the 16th of each month throughout the year?"
Syntax and Exampleβ
The syntax for CUBE is as follows:
SELECT β¦ GROUP BY CUBE(grouping_column_reference_list)
Example usage:
SELECT
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ("Books", "Electronics")
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ("LA", "AK")
GROUP BY CUBE(YEAR(d_date), i_category, ca_state)
ORDER BY YEAR(d_date), i_category, ca_state;
The query result is as follows, calculating:
-
The total sales sum;
-
Subtotals for sales by year, by product category, and by state;
-
Subtotals for sales by product category within each year, by state for each product, by state within each year, and by product category within each state and year.
+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| NULL | Books | NULL | 4238177.69 |
| NULL | Books | AK | 1310791.36 |
| NULL | Books | LA | 2927386.33 |
| NULL | Electronics | NULL | 4452196.91 |
| NULL | Electronics | AK | 1364406.97 |
| NULL | Electronics | LA | 3087789.94 |
| 1998 | NULL | NULL | 4369656.14 |
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | NULL | 4320718.46 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
27 rows in set (0.21 sec)
GROUPING FUNCTIONβ
This section introduces how to address two challenges when using ROLLUP and CUBE:
-
How to programmatically identify which rows in the result set represent subtotals and accurately determine the aggregation level corresponding to a given subtotal. Since subtotals are often needed for calculations such as percentage of totals, we require a convenient method to identify these subtotal rows.
-
When the query results contain both actual stored NULL values and "NULL" values generated by ROLLUP or CUBE operations, another problem arises: how to distinguish between these two types of NULL values?
GROUPING, GROUPING_ID, and GROUPING SETS can effectively solve the aforementioned challenges.
GROUPINGβ
1. Principles
GROUPING uses a single column as a parameter and returns 1 when encountering a NULL value created by ROLLUP or CUBE operations, indicating that the row is a subtotal. Any other type of value (including NULLs inherently present in the table data) returns 0.
Example:
select
year(d_date),
month(d_date),
sum(ss_net_paid) as total_sum,
grouping(year(d_date)),
grouping(month(d_date))
from
store_sales,
date_dim d1
where
d1.d_date_sk = ss_sold_date_sk
and year(d_date) in (2001, 2002)
and month(d_date) in (1, 2, 3)
group by
rollup(year(d_date), month(d_date))
order by
year(d_date), month(d_date);
-
The GROUPING function result for the (YEAR(d_date), MONTH(d_date)) group is (0,0) for aggregation by year and month.
-
The GROUPING function result for the (YEAR(d_date)) group is (0,1) for aggregation by year.
-
The GROUPING function result for the () group is (1,1) for the total aggregation.
Query result:
+--------------+---------------+-------------+------------------------+-------------------------+
| year(d_date) | month(d_date) | total_sum | Grouping(year(d_date)) | Grouping(month(d_date)) |
+--------------+---------------+-------------+------------------------+-------------------------+
| NULL | NULL | 54262669.17 | 1 | 1 |
| 2001 | NULL | 26640320.46 | 0 | 1 |
| 2001 | 1 | 9982165.83 | 0 | 0 |
| 2001 | 2 | 8454915.34 | 0 | 0 |
| 2001 | 3 | 8203239.29 | 0 | 0 |
| 2002 | NULL | 27622348.71 | 0 | 1 |
| 2002 | 1 | 11260654.35 | 0 | 0 |
| 2002 | 2 | 7722750.61 | 0 | 0 |
| 2002 | 3 | 8638943.75 | 0 | 0 |
+--------------+---------------+-------------+------------------------+-------------------------+
9 rows in set (0.06 sec)
2. Usage Scenarios, Syntax, and Examples
The GROUPING function can be used to filter results. Example:
select
year(d_date),
i_category,
ca_state,
sum(ss_net_paid) as total_sum
from
store_sales,
date_dim d1,
item,
customer_address ca
where
d1.d_date_sk = ss_sold_date_sk
and i_item_sk = ss_item_sk
and ss_addr_sk=ca_address_sk
and i_category in ("Books", "Electronics")
and year(d_date) in(1998, 1999)
and ca_state in ("LA", "AK")
group by cube(year(d_date), i_category, ca_state)
having grouping(year(d_date))=1 and grouping(i_category)=1 and grouping(ca_state)=1
or grouping(year(d_date))=0 and grouping(i_category)=1 and grouping(ca_state)=1
or grouping(year(d_date))=1 and grouping(i_category)=1 and grouping(ca_state)=0
order by year(d_date), i_category, ca_state;
Using the GROUPING function in the HAVING clause retains only the total sales, sales summarized by year, and sales summarized by region. Query result:
+---------------------+------------+----------+------------+
| year(`d1`.`d_date`) | i_category | ca_state | total_sum |
+---------------------+------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| 1998 | NULL | NULL | 4369656.14 |
| 1999 | NULL | NULL | 4320718.46 |
+---------------------+------------+----------+------------+
5 rows in set (0.13 sec)
You can also use the GROUPING function with the IF function to enhance query readability. Example:
select
if(grouping(year(d_date)) = 1, "Multi-year sum", year(d_date)) as year,
if(grouping(i_category) = 1, "Multi-category sum", i_category) as category,
sum(ss_net_paid) as total_sum
from
store_sales,
date_dim d1,
item,
customer_address ca
where
d1.d_date_sk = ss_sold_date_sk
and i_item_sk = ss_item_sk
and ss_addr_sk = ca_address_sk
and i_category in ("Books", "Electronics")
and year(d_date) in (1998, 1999)
and ca_state in ("LA", "AK")
group by cube(year(d_date), i_category)
Query result:
+----------------+--------------------+------------+
| year | category | total_sum |
+----------------+--------------------+------------+
| 1998 | Books | 2213703.82 |
| 1998 | Electronics | 2155952.32 |
| 1999 | Electronics | 2296244.59 |
| 1999 | Books | 2024473.87 |
| 1998 | Multi-category sum | 4369656.14 |
| 1999 | Multi-category sum | 4320718.46 |
| Multi-year sum | Books | 4238177.69 |
| Multi-year sum | Electronics | 4452196.91 |
| Multi-year sum | Multi-category sum | 8690374.60 |
+----------------+--------------------+------------+
9 rows in set (0.09 sec)
GROUPING_IDβ
1. Usage Scenarios
In a database, both GROUPING_ID and GROUPING functions serve as auxiliary functions for handling multidimensional data aggregation queries, such as ROLLUP and CUBE, aiding users in distinguishing between different levels of aggregation results. If you wish to determine the aggregation level of a particular row, you need to use the GROUPING function to compute all GROUP BY columns, as the computation result of a single column alone is insufficient.
The GROUPING_ID function is more powerful than GROUPING because it can detect multiple columns simultaneously. The GROUPING_ID function accepts multiple columns as parameters and returns an integer that represents the aggregation status of these columns through binary bits. When using tables or materialized views to store computation results, using GROUPING to represent different levels of aggregation can consume considerable storage space. In such scenarios, GROUPING_ID is more appropriate.
Taking CUBE(a, b) as an example, its GROUPING_ID can be represented as follows:
Aggregation Level | Bit Vector | GROUPING_ID | GROUPING(a) | GROUPING(b) |
---|---|---|---|---|
a,b | 0 0 | 0 | 0 | 0 |
a | 0 1 | 1 | 0 | 1 |
b | 1 0 | 2 | 1 | 0 |
Grand Total | 1 1 | 3 | 1 | 1 |
2. Syntax and Example
Here is an example SQL query:
SELECT
year(d_date),
i_category,
SUM(ss_net_paid) AS total_sum,
GROUPING(year(d_date)),
GROUPING(i_category),
GROUPING_ID(year(d_date), i_category)
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND year(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(year(d_date), i_category);
The query results are as follows:
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| year(d_date) | i_category | total_sum | GROUPING(year(d_date)) | GROUPING(i_category) | GROUPING_ID(year(d_date), i_category) |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| 1998 | Electronics | 2155952.32 | 0 | 0 | 0 |
| 1998 | Books | 2213703.82 | 0 | 0 | 0 |
| 1999 | Electronics | 2296244.59 | 0 | 0 | 0 |
| 1999 | Books | 2024473.87 | 0 | 0 | 0 |
| 1998 | NULL | 4369656.14 | 0 | 1 | 1 |
| 1999 | NULL | 4320718.46 | 0 | 1 | 1 |
| NULL | Electronics | 4452196.91 | 1 | 0 | 2 |
| NULL | Books | 4238177.69 | 1 | 0 | 2 |
| NULL | NULL | 8690374.60 | 1 | 1 | 3 |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
9 rows in set (0.12 sec)
GROUPING SETSβ
1. Usage Scenarios
When there is a need to selectively specify the group sets to create, the GROUPING SETS
expression can be used in the GROUP BY
clause. This method allows users to precisely specify across multiple dimensions without computing the entire CUBE.
Since CUBE queries typically consume significant resources, using GROUPING SETS
can enhance query execution efficiency when only a few dimensions are of interest.
2. Syntax and Examples
The syntax for GROUPING SETS
is as follows:
SELECT β¦ GROUP BY GROUPING SETS(grouping_column_reference_list)
If you need:
-
Subtotals of sales for each product category per year
-
Subtotals of sales for each state per year
-
Subtotals of sales for each product in each state per year
You can use GROUPING SETS
to specify these dimensions and perform the aggregation. Here is an example:
SELECT
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY GROUPING SETS(
(YEAR(d_date), i_category),
(YEAR(d_date), ca_state),
(YEAR(d_date), ca_state, i_category)
)
ORDER BY YEAR(d_date), i_category, ca_state;
Query Result:
+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
16 rows in set (0.11 sec)
The above approach is equivalent to using CUBE but specifies concrete grouping_id
, thereby reducing unnecessary calculations:
SELECT
SUM(ss_net_paid) AS total_sum,
YEAR(d_date),
i_category,
ca_state
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(YEAR(d_date), ca_state, i_category)
HAVING grouping_id(YEAR(d_date), ca_state, i_category) = 0
OR grouping_id(YEAR(d_date), ca_state, i_category) = 2
OR grouping_id(YEAR(d_date), ca_state, i_category) = 1;
Using CUBE
computes all possible aggregation levels (eight in this case), but in practice, you may only be interested in a few of them.
3. Semantic Equivalents
-
GROUPING SETS vs. GROUP BY UNION ALL
The
GROUPING SETS
statement:SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());
Is equivalent in query results to multiple
GROUP BY
queries connected withUNION ALL
:SELECT k1, k2, SUM(k3) FROM t GROUP BY k1, k2
UNION ALL
SELECT k1, NULL, SUM(k3) FROM t GROUP BY k1
UNION ALL
SELECT NULL, k2, SUM(k3) FROM t GROUP BY k2
UNION ALL
SELECT NULL, NULL, SUM(k3) FROM t;Using
UNION ALL
results in a longer query and requires multiple scans of the base table, making it less efficient in both writing and execution. -
GROUPING SETS vs. ROLLUP
ROLLUP
is an extension ofGROUPING SETS
. For example:SELECT a, b, c, SUM(d) FROM tab1 GROUP BY ROLLUP(a, b, c);
This
ROLLUP
is equivalent to the followingGROUPING SETS
:GROUPING SETS (
(a, b, c),
(a, b),
(a),
()
); -
GROUPING SETS vs. CUBE
CUBE(a, b, c)
is equivalent to the followingGROUPING SETS
:GROUPING SETS (
(a, b, c),
(a, b),
(a, c),
(a),
(b, c),
(b),
(c),
()
);
APPENDIXβ
For table creation statements and data files, see the Window Function appendix.