APPROX_COUNT_DISTINCT
Description
Returns the number of distinct non-NULL elements. This function is implemented based on the HyperLogLog algorithm, which uses a fixed size of memory to estimate the column base. The algorithm is based on the assumption of a null distribution in the tails, and the accuracy depends on the data distribution. Based on the fixed bucket size used by Doris, the relative standard error of the algorithm is 0.8125%. For a more detailed and specific analysis, see related paper
Syntax
APPROX_COUNT_DISTINCT(<expr>)
NDV(<expr>)
Parameters
Parameters | Description |
---|---|
<expr> | The expression to get the value. Supported types are String, Date, DateTime, IPv4, IPv6, TinyInt, Bool, SmallInt, Integer, BigInt, LargeInt, Float, Double, Decimal. |
Return Value
Returns a value of type BIGINT.
Example
-- setup
create table t1(
k1 int,
k_string varchar(100),
k_tinyint tinyint
) distributed by hash (k1) buckets 1
properties ("replication_num"="1");
insert into t1 values
(1, 'apple', 10),
(1, 'banana', 20),
(1, 'apple', 10),
(2, 'orange', 30),
(2, 'orange', 40),
(2, 'grape', 50),
(3, null, null);
select approx_count_distinct(k_string) from t1;
String type: Calculate the approximate distinct count of all k_string values, NULL values are not included in the calculation.
+---------------------------------+
| approx_count_distinct(k_string) |
+---------------------------------+
| 4 |
+---------------------------------+
select approx_count_distinct(k_tinyint) from t1;
TinyInt type: Calculate the approximate distinct count of all k_tinyint values.
+----------------------------------+
| approx_count_distinct(k_tinyint) |
+----------------------------------+
| 5 |
+----------------------------------+
select approx_count_distinct(k1) from t1;
Integer type: Calculate the approximate distinct count of all k1 values.
+---------------------------+
| approx_count_distinct(k1) |
+---------------------------+
| 3 |
+---------------------------+
select k1, approx_count_distinct(k_string) from t1 group by k1;
Group by k1 and calculate the approximate distinct count of k_string in each group. When all records in the group are NULL, returns 0.
+------+---------------------------------+
| k1 | approx_count_distinct(k_string) |
+------+---------------------------------+
| 1 | 2 |
| 2 | 2 |
| 3 | 0 |
+------+---------------------------------+
select ndv(k_string) from t1;
Using alias NDV has the same effect as APPROX_COUNT_DISTINCT.
+---------------+
| ndv(k_string) |
+---------------+
| 4 |
+---------------+
select approx_count_distinct(k_string) from t1 where k1 = 999;
When the query result is empty, returns 0.
+---------------------------------+
| approx_count_distinct(k_string) |
+---------------------------------+
| 0 |
+---------------------------------+