Skip to main content
Skip to main content

WINDOW_FUNCTION_WINDOW_FUNNEL

WINDOW FUNCTION WINDOW_FUNNEL

description

Searches for event chains in a sliding time window and calculates the maximum number of events that occurred from the chain.

  • window is the length of time window in seconds.
  • mode can be one of the followings:
    • "default": Defualt mode.
    • "deduplication": If the same event holds for the sequence of events, then such repeating event interrupts further processing. E.g. the array parameter is [event1='A', event2='B', event3='C', event4='D'], and the original event chain is "A-B-C-B-D". Since event B repeats, the filtered event chain can only be "A-B-C" and the max event level is 3.
    • "fixed": Don't allow interventions of other events. E.g. the array parameter is [event1='A', event2='B', event3='C', event4='D'], and the original event chain is A->B->D->C, it stops finding A->B->C at the D and the max event level is 2.
    • "increase": Apply conditions only to events with strictly increasing timestamps.
  • timestamp_column specifies column of DATETIME type, sliding time window works on it.
  • evnetN is boolean expression like eventID = 1004.

The function works according to the algorithm:

  • The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.
  • If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter is not incremented.
  • If the data has multiple event chains at varying points of completion, the function will only output the size of the longest chain.
window_funnel(window, mode, timestamp_column, event1, event2, ... , eventN)

example

example1: default mode

Using the default mode, find out the maximum number of consecutive events corresponding to different user_id, with a time window of 5 minutes:

CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");

INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:04:00', 'HONOR', 3),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);

SELECT
user_id,
window_funnel(
300,
"default",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
order BY
user_id;

+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+

For uesr_id=100123, because the time when the payment event occurred exceeds the time window, the matched event chain is login-visit-order.

example2: deduplication mode

Use the deduplication mode to find out the maximum number of consecutive events corresponding to different user_ids, with a time window of 1 hour:

CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");

INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'login', '2022-05-14 10:03:00', 'HONOR', 3),
(100123, 'order', '2022-05-14 10:04:00', "HONOR", 4),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);

SELECT
user_id,
window_funnel(
3600,
"deduplication",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
order BY
user_id;

+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 2 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+

For uesr_id=100123, after matching the visit event, the login event appears repeatedly, so the matched event chain is login-visit.

example3: fixed mode

Use the fixed mode to find out the maximum number of consecutive events corresponding to different user_id, with a time window of 1 hour:

CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");

INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:03:00', "HONOR", 4),
(100123, 'login2', '2022-05-14 10:04:00', 'HONOR', 3),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);

SELECT
user_id,
window_funnel(
3600,
"fixed",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
order BY
user_id;

+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+

For uesr_id=100123, after matching the order event, the event chain is interrupted by the login2 event, so the matched event chain is login-visit-order.

example4: increase mode

Use the increase mode to find out the maximum number of consecutive events corresponding to different user_id, with a time window of 1 hour:

CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");

INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:04:00', "HONOR", 4),
(100123, 'payment', '2022-05-14 10:04:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);

SELECT
user_id,
window_funnel(
3600,
"increase",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
order BY
user_id;

+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+

For uesr_id=100123, the timestamp of the payment event and the timestamp of the order event occur in the same second and are not incremented, so the matched event chain is login-visit-order.

keywords

WINDOW,FUNCTION,WINDOW_FUNNEL