Author: vivo Internet Big Data Team - Wu Yonggang
In " User Behavior Analysis Model Practice (1) - Path Analysis Model ", it describes the requirements of short query time and visualization in platform-based queries, and combined with existing storage computing resources and specific requirements, we are implementing After enumerating the path data, it is divided into two and merged.
This article is the second in a series of articles. This article introduces the concept and basic principles of the funnel model in detail, and explains its specific implementation within the platform. Aiming at the problems of the actual use process, explore the practical solution based on the ClickHouse funnel model.
1. Background requirements
Funnel analysis is an important tool for measuring conversion effects and conducting conversion analysis, and it is a common process data analysis method. It can help you clearly understand the conversion situation, analyze and compare from multiple angles, locate the cause of loss, and improve conversion performance. He is mainly based on three demand scenarios:
- Identify specific reasons for user churn.
- Detect the effect of a thematic activity.
- Comparison of conversion rates for different versions.
2. Overview
2.1 Concept introduction
The funnel model is primarily used to analyze conversion and churn at each step in a multi-step process. There are a few concepts to understand:
There are two types of funnel models: unordered funnels and ordered funnels.
Defined as follows:
Unordered funnel : Within the cycle of the funnel, the sequence of events between the steps of the funnel is not defined.
[Calculation rule]: Assuming that a funnel contains three steps A, B, and C, the time of step A can be before or after step B. A combination of user behaviors A, B, and C is considered a successful funnel conversion. Even if the funnel steps are interspersed with some other event steps, the user is still considered to have completed a successful funnel conversion.
Ordered funnel : Within the cycle of the funnel, the order in which each step of the funnel occurs is strictly defined.
[Calculation rules]: Assuming that a funnel contains 3 steps A, B, and C, the time of step A must occur before step B, and the user's behavior sequence must be A->B->C.
Like the unordered funnel, the funnel steps are interspersed with some other event steps, and the user is still considered to have completed a successful funnel conversion.
3. Data analysis with funnels
Knowing the basic concepts of the funnel model above, let's take a look at how to create a funnel.
3.1 Choose a funnel type
The types of funnel models are generally divided into ordered funnels and unordered funnels, and their concepts have been introduced in detail in 2.1. Here we take an unordered funnel as an example to create a funnel model.
3.2 Adding a funnel step
The funnel step is the core part of the funnel analysis, and the comparison of the statistical data between the steps is the key indicator for us to analyze the conversion and loss of the data between the steps.
For example, let's take an activity of "downloading an app to receive a red envelope" as an example. The default user behavior path is as follows: the user first enters the [ Red Packet Home ], finds the latest red packet activity "Download the application, receive the red packet", click to enter the [ Red Packet Activity Page ], and jump to the [ Application Download Page ] according to the prompts, select Download the apps you are interested in, and after completion, go to the [ Withdrawal Page ] to receive the event rewards. From the scenarios described above, we can extract the following key four steps.
Figure 3.1 "Download the application to receive red envelopes" activity steps
3.3 Determine the time interval and period of the funnel
There is an additional concept of time interval here, which is easily confused with the cycle introduced above. Generally speaking, the data warehouse table of such data is partitioned by time. Therefore, when choosing a time interval, the essence is to choose the data range to be calculated.
Cycle refers to the time limit for a funnel to flow from the first step to the last step, that is, it is used to define what a complete funnel is. In this example, we process according to the period of days, and select the time interval as "2021-05-27", "2021-05-28", "2021-05-29".
3.4 Display of funnel data
According to the funnel model we designed (the specific model design will be mentioned below), the data in the following table can be calculated:
Table 3.1 Step-by-step data of the activity “Download Apps and Get Red Packets”
Taking the data from 2021-05-27 in Table 3.1 as an example, the number of users who reached the first step "Red Packet Homepage" is 150,000, and the first step "Red Packet Homepage" and the second step "Red Packet Activity Page" are triggered simultaneously within the same day. ' was 11,700. The meaning of other data is similar.
Add up the data in Table 3.1 by date at each step to get a funnel chart for the data from 2021-05-27 to 2021-05-29 (Figure 3.2).
From this, it can intuitively reflect the number and conversion rate of users in each of the four steps of "red envelope homepage", "red envelope activity page", "application download page", and "withdrawal page".
For example, the number of people who reach the "Red Packet Homepage" page is 400,000, and the number of people who reach the "Red Packet Activity Page" page after passing through the "Red Packet Homepage" is 30,000. Then the conversion rate of these two stages: 30,000÷400,000=7.5%.
By comparing the number of people and conversion rates at each stage, we can more intuitively find the link where users are losing in our "download the app to receive red envelopes" activity, and then investigate the reasons and optimize each link.
Figure 3.2 “Download apps to receive red envelopes” activity funnel diagram
4. Overall function design and realization of funnel analysis model
4.1 Functional overall architecture design
Figure 4.1 Overall architecture design of funnel analysis
The overall project is mainly divided into three stages: configuration, calculation and storage .
(1) Configuration
This stage is mainly the implementation of background services on the engineering side. Users can set the funnel type, funnel steps, filter conditions, time interval and cycle configuration on the front-end according to their own needs. After receiving the configuration request, the background service selects different task assemblers to assemble tasks according to the funnel type.
Among them, the funnel type is the Hive SQL task assembler used by the unordered funnel, and the more complex ordered funnel can use the Spark task assembler. The tasks generated after assembly contain the calculation logic of the funnel model, such as Hive SQL or Spark tasks.
(2) Calculation
The platform selects the Hive or Spark engine for analysis and calculation according to the type of tasks received. The calculation results are synchronized to the MySQL or ClickHouse cluster.
(3) Storage
The result set is persisted to the database and can be displayed to the user through the background service.
4.2 Unordered funnel implementation logic
The disordered funnel does not limit the order of occurrence among its multiple steps, as long as it is completed within a defined period.
In the design of the model, the idea adopted is:
In a cycle, the number of people in each step of the funnel is calculated in the order of steps, and the calculated population range of the next layer should be equal to the population range completed in the previous calculation. The desired index can be calculated through the population range of each step. For example, the number of people per step (uv) or the number of visits (pv).
As shown in Figure 4.2. Among them, the circled group is the number of people reached at each step, and the calculated result set is based on this group to obtain the calculation result. The circled population in step 1 will be used as a screening condition for the funnel calculation in step 2 and participate in the subsequent calculation. And so on to complete the calculation of each step of the funnel. Finally, the calculation result set of each step is aggregated to form the result data similar to Table 3.1.
Figure 4.2 Unordered funnel calculation logic
4.3 Ordered funnel implementation logic
An ordered funnel, as the name suggests, will strictly follow the order between each step of the funnel. The entire implementation logic can be divided into the following steps:
(1) Obtain a data set within a specified time interval.
For the convenience of explanation, the sample data is shown in the following figure, where day is the time of data reporting, userId is the user's unique identifier, event is the event, and event_time is the time when the event occurred.
(2) Calculate the funnel step where each row of data is located according to the funnel step.
Suppose the funnel steps that require statistical analysis are: "Start" -> "Home" -> "Details". ' "Startup" is marked as 1, "Home" is marked as 2, and "Details" is marked as 3, recorded on the event_step field.
(3) Process the above data to obtain an orderly list of events reported by each user on the day.
Group the above data according to day and userId, and in the order of event\_time, obtain the ordered set of event\_step and event\_time respectively, and obtain the maximum depth of the funnel according to event\_step, which is recorded as level, as follows:
(4) Calculate the number of people in each step of the funnel.
The number of people in each step of the funnel is calculated by day and level grouping, which is also the uv of each level.
It should be noted that because the number of people in each step of the funnel is calculated, there is no intersection between the number of steps and the number of steps, but in fact, according to the calculation logic of the ordered funnel, the steps behind the funnel will be reached in a while. It's a funnel step ahead.
Therefore, the number of people in the previous step must be added to the number of all subsequent steps, which is the real number of people in this step. As the above example, for the data of 2021-05-01, the uv of level=1 is 1, the uv of level=2 is 0, and the uv of level=3 is 1, so the actual total number of people at level=1 is one of the three steps. and, that is, 2. And so on, from this, you can get the real total number of all steps.
4.4 Existing problems and the direction of next optimization
Problem: At this stage, users generate corresponding Spark or Hive tasks to calculate the results of the model and generate reports through custom configurations, and then display them to users. Such a process not only provides users with flexible configuration and personalized query, but also saves storage resources. The fly in the ointment is that the report generation process still takes a certain amount of time and cost. Especially, the orderly funnel adopts Spark calculation, which will also consume a large amount of queue resources. This is especially evident when users create a large number of analytical reports in a short period of time.
Optimization direction: Synchronize the relevant data warehouse data within a certain period to ClickHouse, relying on ClickHouse's powerful real-time calculation and analysis capabilities, to provide users with an experience that is what they check. Users can choose real-time query or offline report according to their own business needs. For example, if a large number of combinations of various conditions are required for comparative analysis, the real-time module can be selected. For reports that require long-term observation, offline routine reports can be selected. This achieves a balance between storage and query efficiency.
Next, let's explore the application of the funnel model on ClickHouse.
5. Funnel analysis model based on ClickHouse
5.1 Introduction to main functions
(1) windowFunnel(window, [mode, [mode, ... ]])(timestamp, cond1, cond2, ..., condN)
Definition: In the defined sliding window, the chain of events is retrieved in turn. The maximum number of events the function touches on this event chain.
Replenish:
① The function retrieves the first event in the window, and sets the event counter to 1, which is the start time of the sliding window.
② If the events from the chain occur sequentially within the window, the counter is incremented, and if the event sequence is terminated, the counter is not incremented.
③ If the data has multiple event chains at different completion points, the function will only output the size of the longest chain.
parameter:
①【timestamp】: The column representing the time in the table. The function will be sorted by this time
② [cond]: Constraints of the event chain
③【window】: The length of the sliding window, indicating the gap between the first and last event conditions. The unit depends on the timestamp parameter. That is: timestamp of cond1 <= timestamp of cond2 <= ... <= timestamp of condN <= timestamp of cond1 + window
④ [mode]: some optional configurations:
[strict]: In the event chain, if there are events that are not unique, duplicate events will be excluded, and the function will stop calculating.
[strict_orde]: The order of events in the event chain must be strictly guaranteed.
[strict_increase]: For the middle event of the event chain, the event stamp should be kept fully incremented.
(2) arrayWithConstant(length,param)
Definition: Generates an array of a specified length
parameter:
① length: the length of the array
② param: fill field
example:
SQL:
select arrayWithConstant(3,1);
Result:
arrayWithConstant(3, 1)
[1,1,1]
(3) arrayEnumerate(arr)
Definition: Returns the subscript of an array
Parameters: arr: array
example:
SQL:
select arrayEnumerate([11,22,33]);
Result:
arrayEnumerate([11, 22, 33])
[1,2,3]
(4) groupArray(x)
Definition: create an array
example:
SQL:
select groupArray(1);
Result:
groupArray(1)
[1]
(5) arrayCount([func,] arr1)
Definition: Returns the number of elements in the array that match the function func
parameter:
① func: lambda expression
② arr1: array
example:
SQL:
select arrayCount(x-> x!=1,[11,22,33]);
Result:
arrayCount(lambda(tuple(x), notEquals(x, 1)), [11, 22, 33])
3
(6) hasAll(set, subset)
Definition: Checks whether an array is a subset of another array, and returns 1 if so
parameter:
① set: any type of array with a set of elements.
② subset: Any type of array whose elements should be tested as a subset of the set.
example:
SQL:
select hasAll([11,22,33], [11]);
Result:
hasAll([11, 22, 33], [11])
1
5.2 Model building process
5.2.1 Data Preparation
In order to explain the whole process more clearly, we give an example to demonstrate the whole process.
First build a ClickHouse table funnel_test, which contains the user's unique identifier userId, the event name event, and the event occurrence date day.
The table creation statement is as follows:
create table funnel_test
(
userId String,
event String,
day DateTime
)
engine = MergeTree PARTITION BY userId
ORDER BY xxHash32(userId);
Insert test data:
insert into funnel_test values(1,'启动','2021-05-01 11:00:00');
insert into funnel_test values(1,'首页','2021-05-01 11:10:00');
insert into funnel_test values(1,'详情','2021-05-01 11:20:00');
insert into funnel_test values(1,'浏览','2021-05-01 11:30:00');
insert into funnel_test values(1,'下载','2021-05-01 11:40:00');
insert into funnel_test values(2,'启动','2021-05-02 11:00:00');
insert into funnel_test values(2,'首页','2021-05-02 11:10:00');
insert into funnel_test values(2,'浏览','2021-05-02 11:20:00');
insert into funnel_test values(2,'下载','2021-05-02 11:30:00');
insert into funnel_test values(3,'启动','2021-05-01 11:00:00');
insert into funnel_test values(3,'首页','2021-05-02 11:00:00');
insert into funnel_test values(3,'详情','2021-05-03 11:00:00');
insert into funnel_test values(3,'下载','2021-05-04 11:00:00');
insert into funnel_test values(4,'启动','2021-05-03 11:00:00');
insert into funnel_test values(4,'首页','2021-05-03 11:01:00');
insert into funnel_test values(4,'首页','2021-05-03 11:02:00');
insert into funnel_test values(4,'详情','2021-05-03 11:03:00');
insert into funnel_test values(4,'详情','2021-05-03 11:04:00');
insert into funnel_test values(4,'下载','2021-05-03 11:05:00');
If the datasheet is as follows:
Table 5.1 Funnel model test data
5.2.2 Ordered Funnel Calculation
Suppose, the steps of the funnel are: Start->Home->Details->Download
(1) Use ClickHouse's funnel building function windowFunnel() to query
SELECT userId,
windowFunnel(86400)(
day,
event = '启动',
event = '首页',
event = '详情',
event = '下载'
) AS level
FROM (
SELECT day, event, userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06'
)
GROUP BY userId
order by userId;
From the above SQL, the funnel period is set to 86400 seconds (1 day), and the unit of this period is determined by timestamp. The entire funnel is divided into 4 steps: launch, homepage, details, and download. The time interval is between "2021-05-01" and "2021-05-06". After execution, the following results are obtained:
From the results, it can be seen that each userId has reached the largest funnel level within the specified period, that is, the funnel steps have been executed a few steps. For example, userId=1, within one day, the four steps of Start->Home->Details->Download are accessed in sequence, and the maximum level is 4. Of course, we can also configure the funnel function as "strict\_order" mode, it will strictly guarantee the order, or if the userId is 1, on the day of "2021-05-01", there will be more between "details" and "downloads" A "browse" action is taken, so at this moment, the level that userId=1 can reach is 3, because, under "strict\_order", "details" blocks the entire event chain.
(2) Obtain detailed data of each user at each level
Through the previous step, we calculated the maximum level reached by each user within the set period. Next, we need to calculate the detailed data of each user at each level. The calculation logic is as follows:
SELECT userId,
arrayWithConstant(level, 1) levels,
arrayJoin(arrayEnumerate(levels)) level_index
FROM (
SELECT userId,
windowFunnel(86400)(
day,
event = '启动',
event = '首页',
event = '详情',
event = '下载'
) AS level
FROM (
SELECT day, event, userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06'
)
GROUP BY userId
);
Convert this largest level into an array of the corresponding size, get a set of array subscripts from it, and then expand the subscripted set into multiple lines according to its elements. This gives detailed data for each user at each level.
For example, the maximum level of userId=1 is 4. The array [1,1,1,1] is generated by the arryWithConstant function, and then the subscript of this array is taken to obtain a new array [1,2,3,4]. These subscripts actually correspond to It has four levels of "Start", "Home", "Details" and "Download" of the funnel.
Expand the subscript array through the arrayJoin function to get the detailed data of each layer with userId=1:
The execution results of all userIds are as follows:
(3) Calculate the number of users at each level of the funnel
The detailed data obtained in the above steps are grouped and aggregated according to the funnel level, and the number of users at each level is obtained. The overall logic is as follows:
SELECT transform(level_index,[1,2,3,4],['启动','首页','详情','下载'],'其他') as event,
count(1)
FROM (
SELECT userId,
arrayWithConstant(level, 1) levels,
arrayJoin(arrayEnumerate(levels)) level_index
FROM (
SELECT userId,
windowFunnel(86400)(
day,
event = '启动',
event = '首页',
event = '详情',
event = '下载'
) AS level
FROM (
SELECT day, event, userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06'
)
GROUP BY userId
)
)
group by level_index
ORDER BY level_index;
The result is:
5.2.3 Unordered funnel calculation
Suppose, the steps of the funnel are: Start -> Home
(1) Determine the data range for calculation
SELECT toDate(day),
event,
userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06';
The result is as follows:
(2) Calculate the number of visits (pv) and the number of visiting users (uv) for each userId.
First, group by time and userId, and use the groupArray function to obtain a collection of events.
pv calculation:
[Funnel Level 1]: Directly query the total number of events in the first step of the funnel in the event collection.
【Funnel Level 2】: In the case of the existence of the first level event, query the quantity of the second level. And so on for subsequent layers.
uv calculation:
[Funnel Level 1]: If the event set contains the first step event, it will be marked as 1, indicating that it exists.
[Funnel Level 2]: In the event set, which contains both the first and second level events, it is recorded as 1. Subsequent tiers and so on.
select day,
userId,
groupArray(event) as events,
arrayCount(x-> x = '启动', events) as level1_pv,
if(has(events, '启动'), arrayCount(x-> x = '首页', events), 0) as level2_pv,
hasAll(events, ['启动']) as level1_uv,
hasAll(events, ['启动','首页']) as level2_uv
from (
SELECT toDate(day) as day,
event,
userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06')
group by day, userId;
got the answer:
(3) Statistics by day
Statistics by day, calculate the number of users per day and the pv, uv of each level.
SELECT day AS day,
sum(level1_pv) AS sum_level1_pv,
sum(level2_pv) AS sum_level2_pv,
sum(level1_uv) as sum_level1_uv,
sum(level2_uv) as sum_level2_uv
from (
select day,
userId,
groupArray(event) as events,
arrayCount(x-> x = '启动', events) as level1_pv,
if(has(events, '启动'), arrayCount(x-> x = '首页', events), 0) as level2_pv,
hasAll(events, ['启动']) as level1_uv,
hasAll(events, ['启动','首页']) as level2_uv
from (
SELECT toDate(day) as day,
event,
userId
FROM funnel_test
WHERE toDate(day) >= '2021-05-01'
and toDate(day) <= '2021-05-06')
group by day, userId
)
group by day
order by day;
The calculation results are as follows:
6. Write at the end
Funnel analysis is an important analysis method in data analysis. It obtains data such as traffic volume, conversion rate, and churn rate of each link to evaluate the rationality of business processes, improve user experience, and strengthen user retention rates. played an important role.
This paper briefly describes the implementation logic of the existing Hive/Spark-based funnel model, which saves storage resources while allowing users to highly customize queries. But it will consume a certain time cost and queue resources.
In order to optimize such problems, this paper discusses the implementation of the funnel model based on ClickHouse, which achieves a relatively ideal effect in the calculation rate of the model. Although ClickHouse has a wide variety of functions to support calculation and analysis, it lacks convenient custom functions and is not very suitable for business in some sub-scenarios. This is also a direction that can be strengthened and broken in the future.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。