HISTOGRAM
描述
HISTOGRAM(直方图)函数用于描述数据分布情况,它使用“等高”的分桶策略,并按照数据的值大小进行分桶,并用一些简单的数据来描述每个桶,比如落在桶里的值的个数。仅统计非 NULL 的数据。
别名
HIST
语法
HISTOGRAM(<expr>[, <num_buckets>])
HIST(<expr>[, <num_buckets>])
参数
参数 | 说明 |
---|---|
expr | 需要获取第一个值的表达式,支持的类型为 TinyInt,SmallInt,Integer,BigInt,LargeInt,Float,Double,Decimal ,String。 |
num_buckets | 可选。用于限制直方图桶(bucket)的数量,默认值 128,支持的类型为 Integer。 |
返回值
返回直方图估算后的 JSON 格式的值,类型为 String 。 组内没有有效数据时,返回 num_buckets 为0的结果。
举例
-- setup
CREATE TABLE histogram_test (
c_int INT,
c_float FLOAT,
c_string VARCHAR(20)
) DISTRIBUTED BY HASH(c_int) BUCKETS 1
PROPERTIES ("replication_num"="1");
INSERT INTO histogram_test VALUES
(1, 0.1, 'str1'),
(2, 0.2, 'str2'),
(3, 0.8, 'str3'),
(4, 0.9, 'str4'),
(5, 1.0, 'str5'),
(6, 1.0, 'str6'),
(NULL, NULL, 'str7');
SELECT histogram(c_float) FROM histogram_test;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| histogram(c_float) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"num_buckets":5,"buckets":[{"lower":"0.1","upper":"0.1","ndv":1,"count":1,"pre_sum":0},{"lower":"0.2","upper":"0.2","ndv":1,"count":1,"pre_sum":1},{"lower":"0.8","upper":"0.8","ndv":1,"count":1,"pre_sum":2},{"lower":"0.9","upper":"0.9","ndv":1,"count":1,"pre_sum":3},{"lower":"1","upper":"1","ndv":1,"count":2,"pre_sum":4}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT histogram(c_string, 2) FROM histogram_test;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| histogram(c_string, 2) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"num_buckets":2,"buckets":[{"lower":"str1","upper":"str4","ndv":4,"count":4,"pre_sum":0},{"lower":"str5","upper":"str7","ndv":3,"count":3,"pre_sum":4}]} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
-- NULL 处理相关 case
SELECT histogram(c_float) FROM histogram_test WHERE c_float IS NULL;
+--------------------------------+
| histogram(c_float) |
+--------------------------------+
| {"num_buckets":0,"buckets":[]} |
+--------------------------------+
查询结果说明:
{
"num_buckets": 3,
"buckets": [
{
"lower": "0.1",
"upper": "0.2",
"count": 2,
"pre_sum": 0,
"ndv": 2
},
{
"lower": "0.8",
"upper": "0.9",
"count": 2,
"pre_sum": 2,
"ndv": 2
},
{
"lower": "1.0",
"upper": "1.0",
"count": 2,
"pre_sum": 4,
"ndv": 1
}
]
}
字段说明:
- num_buckets:桶的数量
- buckets:直方图所包含的桶
- lower:桶的上界
- upper:桶的下界
- count:桶内包含的元素数量
- pre_sum:前面桶的元素总量
- ndv:桶内不同值的个数
> 直方图总的元素数量 = 最后一个桶的元素数量(count)+ 前面桶的元素总量(pre_sum)。