This article is compiled from the talk "Flink's Practice in iQiyi Advertising Business", given by Han Honggen, a technical manager at iQiyi, at the Flink Meetup in Beijing on May 22. It covers:
- Business scenarios
- Business practice
- Problems and solutions when using Flink
- Future planning
GitHub address
https://github.com/apache/flink
You are welcome to like Flink and give it a star.
1. Business scenario
Real-time data is used in the advertising business in four main ways:
- Large data screens: display core indicators such as exposures, clicks, and revenue, along with monitoring indicators such as failure rate;
- Anomaly monitoring: the advertising delivery link is relatively long, so any fluctuation along it affects overall delivery. The anomaly monitoring system shows whether a team's release is affecting overall delivery while it rolls out. It also shows whether business indicators are trending reasonably, for example whether exposure fluctuates abnormally while inventory is normal, so that problems can be found in real time;
- Data analysis: data is mainly used to empower business development. We can analyze anomalies in the delivery process in real time, or study how to optimize based on current delivery results;
- Feature engineering: the advertising algorithm team trains models to support online delivery. Most features were initially computed offline. As real-time technology matured, some feature projects began moving to real time.
2. Business practice
Business practice falls into two categories: the real-time data warehouse and feature engineering.
1. Real-time data warehouse
1.1 Real-time data warehouse-goal
The goals of the real-time data warehouse are data integrity, service stability, and query capability.
- Data integrity: in the advertising business, real-time data is mainly used to guide decisions. For example, advertisers adjust future bids or budgets based on current real-time data. Failure-rate monitoring likewise requires the data itself to be stable. If the data keeps fluctuating, it offers little or no guidance. Integrity is therefore a trade-off between timeliness and completeness;
- Service stability: the production chain includes data access, (multi-layer) computation, data writing, the progress service, and the query service. Data quality matters as well, including the accuracy of the data and whether its trend matches expectations;
- Query capability: the advertising business has many usage scenarios, and different scenarios may use different OLAP engines, so query methods and performance requirements differ. In addition, data analysis queries not only the latest stable real-time data but also real-time and offline data together, which adds requirements for cross-source queries and query performance.
1.2 Real-time data warehouse-challenge
- Data progress service: requires a trade-off between timeliness and completeness.
- Data stability: the production link is relatively long and may pass through various functional components, so end-to-end service stability has a critical impact on overall data accuracy.
- Query performance: mainly OLAP analysis capability. In practice the tables include both offline and real-time data, a single table has hundreds of columns, and the row count is also very large.
1.3 Advertising data platform architecture
The figure above shows the infrastructure of the advertising data platform, viewed from the bottom up:
- At the bottom is the data collection layer, which is much the same as at most companies. The business database mainly holds advertisers' order data and delivery strategies; the tracking logs and billing logs are generated along the ad delivery link;
In the middle is data production. Its bottom layer is the big data infrastructure, provided by the company's cloud platform team, which includes the Spark/Flink computing engines and the Babel unified management platform. Talos is the real-time data warehouse service, and RAP and OLAP correspond to different real-time analysis and OLAP storage and query services.
The middle layer of data production consists of services owned by the advertising team, such as the typical offline and real-time computation in production.
- Offline computation follows a fairly common layered model, and the scheduling system manages and schedules the offline tasks.
- Real-time computing also uses several engines. Our real-time work started in 2016 with Spark Streaming. Later, as big data technology evolved and business needs produced new scenarios, we introduced Flink.
- The underlying scheduling of real-time computing relies on the cloud platform's Babel system. Alongside computation there is data governance, including progress management. For real-time computing, progress means the point in time up to which a report's data has stabilized; for offline, it corresponds to which partitions of a table are available.
- Lineage management covers two aspects. Offline lineage includes table-level and field-level lineage; real-time lineage is mainly at the task level.
- As for life cycle management, computation in an offline data warehouse iterates continuously, and retaining data for a very long time puts growing pressure on the underlying storage.
- Data life cycle management is therefore mainly a trade-off between business needs and storage costs.
- Quality management covers two aspects. One is at the data access layer, checking whether the incoming data is reasonable; the other is at the data export layer, that is, the result indicator layer. Because many other teams use our data, we must ensure the calculations are correct at the export layer.
The upper layer is the unified query service, which wraps many query interfaces.
- Because the data includes offline and real-time data and spans clusters, intelligent routing handles core functions such as cluster selection, table selection, and splitting complex queries.
- The query service manages historical queries in one place. This further supports life cycle management and also shows which data matters most to the business.
- Beyond life cycle management, it can also guide our scheduling system, for example by identifying which reports are more critical so that their tasks are prioritized when resources are tight.
- Further up are the data applications, including reporting systems, ad hoc queries, data visualization, anomaly monitoring, and downstream teams.
1.4 Real-time data warehouse-production link
The data production links are laid out by time granularity. We started with the offline data warehouse link, shown in the bottom row. As real-time demand grew, a real-time link was added. Taken together, this is a typical Lambda architecture.
In addition, for some core indicators such as billing indicators, stability is critical to downstream consumers, so we run multiple active links over different routes. That is, after the source log is generated, the computing layer and the downstream storage layer are fully redundant, and the results are handled uniformly in subsequent queries.
1.5 Real-time Data Warehouse-Progress Service
As introduced above, the real-time indicators we provide must be stable and unchanging. The progress service is built on two things: the trend of an indicator within its time window, and the status of the real-time computing tasks themselves, because many indicators in the real-time data warehouse are aggregations over time windows.
For example, suppose a real-time indicator is output at 3-minute granularity: the indicator at 4:00 covers data from 4:00 to 4:03, and the one at 4:03 covers 4:03 to 4:06. Progress refers to when the data of a time window becomes visible externally. Because data keeps arriving in real-time computing, the indicator for the 4:00 window starts being produced at 4:00, keeps rising over time, and finally stabilizes. We judge whether it has stabilized from the rate of change of the window's indicator.
But judging on this alone still has drawbacks.
The result table's computation chain depends on many computing tasks. If any task on the link has a problem, the indicator may show a normal trend yet end up incomplete. We therefore also take the real-time computing task status into account: when an indicator stabilizes, we check whether the computing tasks on the production link are healthy. If they are, the indicator for that time point is stable and can be served externally.
If a computation is stuck, backlogged, or restarting abnormally, we keep waiting and re-evaluate.
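The two conditions above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual progress service: the function names, the 0.1% change-rate threshold, the three-sample lookback, and the "healthy" task state are all assumptions made for the example.

```python
from typing import Mapping, Sequence


def window_is_stable(samples: Sequence[float],
                     max_change_rate: float = 0.001,
                     min_samples: int = 3) -> bool:
    """Return True when the latest observations of one window's indicator barely change.

    `samples` are successive readings of the same time window's indicator.
    The threshold and lookback are hypothetical values.
    """
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    for prev, cur in zip(recent, recent[1:]):
        if prev == 0:
            if cur != 0:
                return False  # the indicator has only just started to grow
            continue
        if abs(cur - prev) / abs(prev) > max_change_rate:
            return False
    return True


def can_publish(samples: Sequence[float],
                task_states: Mapping[str, str],
                healthy_state: str = "healthy") -> bool:
    """A window is served externally only if it is stable AND every upstream task is healthy."""
    tasks_ok = all(state == healthy_state for state in task_states.values())
    return tasks_ok and window_is_stable(samples)
```

With these assumptions, a window whose readings have flattened out is still held back while any upstream task reports a state other than healthy, which mirrors the wait-and-recheck behavior described above.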
1.6 Real-time Data Warehouse-Query Service
The figure above is the query service architecture diagram.
At the bottom is the data. Real-time data sits in storage engines such as Druid. Offline data lives in Hive but is synchronized to OLAP engines for querying, and two engines are used here. To support union queries with Kudu, the data is synchronized to the OLAP engine and then queried uniformly through Impala. For usage scenarios with relatively fixed query patterns, the data can be exported to Kylin and analyzed there.
On top of this data sit multiple query nodes, and above them the intelligent routing layer. At the top is the query gateway. When a query request comes in, the gateway first determines whether it is a complex scenario. For example, if a query's time range spans both offline and real-time data, both offline and real-time tables are used.
The offline tables also involve more complicated table selection logic, such as hour-level versus day-level tables. After the complex scenario is analyzed, the final tables are determined. Intelligent routing also consults some of the basic services on the left of the figure, such as metadata management and the current progress of each table.
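As a rough illustration of how a query spanning offline and real-time data might be split using table progress, here is a minimal Python sketch. The function name, the two progress parameters, and the "offline"/"realtime" labels are hypothetical; the real routing layer also handles cluster selection and hour-level versus day-level tables, which are left out.

```python
from datetime import datetime
from typing import List, Tuple

Segment = Tuple[str, datetime, datetime]  # (table kind, start inclusive, end exclusive)


def split_query(start: datetime,
                end: datetime,
                offline_ready_until: datetime,
                realtime_stable_until: datetime) -> List[Segment]:
    """Split [start, end) into an offline part and a real-time part.

    offline_ready_until: end of the latest offline partition that is ready.
    realtime_stable_until: point up to which the real-time data has stabilized.
    """
    end = min(end, realtime_stable_until)  # never serve data that is not yet stable
    segments: List[Segment] = []
    if start >= end:
        return segments
    offline_end = min(end, offline_ready_until)
    if start < offline_end:
        segments.append(("offline", start, offline_end))
    realtime_start = max(start, offline_ready_until)
    if realtime_start < end:
        segments.append(("realtime", realtime_start, end))
    return segments
```

The results of the two segments would then be combined by the query layer, for instance with a union.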
For query performance, the amount of data scanned at the bottom layer has a very large impact on the final performance. We therefore build dimension-reduced report tables and analyze them against historical queries, for example which dimensions a dimension-reduced table should contain and what percentage of queries it can cover.
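The coverage figure mentioned here can be computed along these lines. This is a sketch under the assumption that each historical query is recorded as the list of dimensions it uses; the function name and the sample dimensions are made up for illustration.

```python
from typing import Iterable, Sequence


def rollup_coverage(rollup_dims: Iterable[str],
                    query_log: Sequence[Iterable[str]]) -> float:
    """Fraction of historical queries answerable from a dimension-reduced table.

    A query is covered when every dimension it uses exists in the reduced table.
    """
    rollup = set(rollup_dims)
    if not query_log:
        return 0.0
    covered = sum(1 for dims in query_log if set(dims) <= rollup)
    return covered / len(query_log)
```

Comparing this ratio across candidate dimension sets is one way to decide which reduced tables are worth building.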
1.7 Data Production-Planning
As mentioned above, real-time report production is mainly implemented with APIs. One problem with the Lambda architecture is that real-time and offline each have their own computing team. The same requirement must be developed by two teams at once, which causes several problems.
- On the one hand, the two implementations may differ in logic, which eventually leads to inconsistent results;
- On the other hand, there is the labor cost of two teams developing the same thing.
Our goal is therefore stream-batch integration: expressing a business requirement with a single piece of logic in the computing layer, which can then run on either a stream or a batch engine to produce the result.
In this link, the raw data is ingested through Kafka, passes through unified ETL logic, and is then written to the data lake. Because the data lake supports both streaming and batch reads and writes and can be consumed in real time, both real-time and offline computation can run on it and write their results back to the data lake.
As mentioned earlier, queries combine offline and real-time data, so writing both to the same table in the data lake saves a lot of work at the storage level and also saves storage space.
1.8 Data Production-SQLization
SQLization is a capability provided by the Talos real-time data warehouse platform.
On the page, project management is on the left, and Source, Transform, and Sink are on the right.
- Some business teams are very familiar with the compute engine's operators, so they can develop code directly;
- Many business teams, however, do not know the engine well or have little desire to learn it, and they can assemble a job visually instead.
For example, you can drag in a Kafka data source, drag in a Filter operator to implement the filtering logic, then add Project and Union computations, and finally output the result to some destination.
Users with more experience can do higher-level computation. The same real-time data warehouse goals can be achieved here by creating data sources, expressing the logic in SQL, and writing the output to some storage.
The above covers the development level. At the system level, the platform also provides other functions, such as rule validation and unified management of development, testing, and launch. There is also monitoring: real-time tasks running online expose many real-time indicators, which show whether a task is currently healthy.
2. Feature Engineering
Feature engineering has two requirements:
The first requirement is real-time processing, because the value of data decreases over time. For example, when a user's viewing behavior shows a liking for children's content, the platform recommends child-related advertisements. Users also give positive or negative feedback while watching advertisements. Iterating this data into features in real time can effectively improve subsequent conversion.
Another key point of real-time processing is accuracy. Many feature projects used to be offline, and in production the data used at computation time deviated from the features at delivery time, so the basic feature data was not very accurate. We therefore need the data to be both more real-time and more accurate.
The second requirement of feature engineering is service stability.
- The first is job fault tolerance, that is, whether a job can recover normally after an exception;
- The other is data quality: pursuing end-to-end exactly-once accuracy in real-time data.
2.1 Click-through rate estimation
Below is our practice with real-time features, starting with click-through rate estimation.
The background of the click-through rate estimation case is shown above. On the delivery link, when a user's viewing behavior occurs on the ad front end, the front end requests ads from the ad engine, which obtains user features and ad features during recall and coarse/fine ranking. After the ad is returned to the front end, subsequent user behavior may produce events such as exposures and clicks. Click-through rate estimation needs to associate the features from the request stage with the exposures and clicks in the subsequent user behavior stream to form session data. That is our data requirement.
In practice this involves two parts:
- One is associating exposure and click events within the Tracking stream;
- The other is associating the feature stream with the user behavior stream.
What are the challenges in practice?
- The first challenge is the amount of data;
- The second challenge is that real-time data arrives out of order and delayed;
- The third challenge is high accuracy requirements.
In terms of ordering, a feature must arrive earlier than its Tracking events. But to associate the two streams successfully more than 99% of the time, how long must the features be retained? In the advertising business, a user can download content for offline viewing, and the ad request and response are completed at download time. If the user then watches without a network connection, the events are not reported immediately. The subsequent exposure and click events are only reported once the connection is restored.
So the time gap between the feature stream and the Tracking stream can be very long. Offline data analysis showed that reaching an association rate above 99% requires retaining feature data for a relatively long period. It is currently retained for 7 days, which is still a relatively large volume.
The figure above shows the overall architecture of click-through rate estimation. As mentioned, the association has two parts:
- The first part is the association between exposure and click events in the user behavior stream, which is implemented here with CEP.
- The second part is the association of the two streams. As described above, the features must be retained for 7 days, so the state is relatively large, already hundreds of terabytes. Managing state of this size in memory would significantly affect stability, so we put the feature data in external storage (HBase) and look up the features in HBase in real time to achieve the association.
However, the two streams may be out of step: when the exposure and click arrive, the feature may not have arrived yet and so cannot be found. We therefore built a multi-level retry queue to ensure the two streams are eventually fully associated.
2.2 Click Rate Estimation-In-Stream Event Correlation
The right side of the figure above explains in more detail why we chose CEP for in-stream event association. The business requirement is to associate the exposure and clicks of the same ad from the same ad request in the user behavior stream. For example, clicks within 5 minutes after the exposure count as positive samples, and clicks after 5 minutes are discarded.
Consider what scheme could achieve this. Handling multiple events in one stream can be done with windows, but windows have a problem:
- If the event sequence falls within a single window, the data is fine;
- But when the event sequence spans windows, the events cannot be associated correctly.
After considerable technical research at the time, we found that CEP in Flink can achieve this. It uses pattern matching to describe the conditions the sequence must satisfy, and it can specify a time window, such as 15 minutes between exposure and click.
The left side of the figure above describes the matching rule. begin defines an exposure, to be followed by a click within 5 minutes of the exposure. The next part describes a click that can occur multiple times, and within specifies how long the association window is.
In production this scheme associates events in most cases, but data comparison showed that some exposures and clicks were not associated.
Analysis showed that exposure and click timestamps both have millisecond precision. When the two carry the same millisecond timestamp, the events cannot be matched. We therefore add one millisecond to the click event's timestamp, deliberately offsetting it so that the exposure and the click can be associated.
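To make the matching rule and the one-millisecond offset concrete, here is a small simulation. It is plain Python, not Flink CEP, and it processes a finished list of events rather than a stream. It assumes one exposure per request and models the same-millisecond miss by requiring a click to be strictly later than its exposure. That is a simplification chosen for illustration, not a description of CEP internals. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_MS = 5 * 60 * 1000  # clicks count as positive samples within 5 minutes

Event = Tuple[str, str, int]  # (request_id, "exposure" | "click", timestamp in ms)


def match_clicks(events: Iterable[Event],
                 shift_click_ms: int = 0) -> List[Tuple[str, int, List[int]]]:
    """Pair each exposure with the clicks of the same request inside the window.

    shift_click_ms models the workaround of nudging click timestamps forward.
    Returns (request_id, exposure_ts, matched click timestamps) per exposure.
    """
    exposures: Dict[str, int] = {}
    clicks: Dict[str, List[int]] = defaultdict(list)
    for request_id, kind, ts in events:
        if kind == "exposure":
            exposures[request_id] = ts  # assumption: one exposure per request
        elif kind == "click":
            clicks[request_id].append(ts + shift_click_ms)
    sessions = []
    for request_id, exposure_ts in sorted(exposures.items()):
        matched = sorted(
            t for t in clicks.get(request_id, [])
            if exposure_ts < t <= exposure_ts + WINDOW_MS  # strictly after the exposure
        )
        sessions.append((request_id, exposure_ts, matched))
    return sessions
```

In this model, a click sharing the exposure's millisecond is missed with shift_click_ms=0 and matched with shift_click_ms=1, while a click outside the 5-minute window is dropped either way.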
2.3 Click Rate Estimation-Two-Stream Association
As mentioned above, feature data must be retained for 7 days, so the state reaches hundreds of terabytes and must be placed in external storage. This imposes certain requirements when selecting the storage:
- First, it must support relatively high read and write concurrency;
- In addition, its latency needs to be very low;
- Finally, because the data is retained for 7 days, life cycle management capability is desirable.
Based on the above points, HBase was finally selected to form the solution shown in the figure above.
The upper line shows the exposure-click sequences associated through CEP. The bottom line writes the feature stream to HBase through Flink as external state storage, and the core module in the middle associates the two streams. After an exposure-click association is obtained, we look up the HBase data. If the feature is found, the record is output to the normal result stream. Records that cannot be associated go into a multi-level retry queue, dropping to a lower-level queue on each retry. To reduce the scan pressure on HBase, the retry interval increases level by level.
There is also an exit mechanism, because retries are not unlimited. There are two main reasons for it:
- First, feature data is retained for 7 days. If the corresponding feature is older than 7 days, it can no longer be associated.
- Second, in the advertising business some externally generated exposure or click events have no real ad request behind them, so no corresponding feature exists in this scenario.
The exit mechanism therefore means that a record expires after a number of retries and is then placed with the expired-retry data.
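The retry-and-exit flow can be sketched as a small simulation. This is plain Python with simulated time. A dict of arrival times stands in for the HBase lookup, and the three retry delays are made-up values, since the actual levels and intervals are not given here.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def join_with_retries(sessions: Sequence[Tuple[str, int]],
                      feature_arrival: Dict[str, int],
                      retry_delays: Sequence[int] = (10, 60, 600)):
    """Look up features for associated exposure-click sessions, retrying on a miss.

    sessions: (request_id, time the session is ready), times in seconds.
    feature_arrival: request_id -> time its feature becomes readable
                     (a stand-in for the external store).
    retry_delays: hypothetical per-level delays; each level waits longer.
    Returns (joined, expired); joined holds (request_id, join time, retries used).
    """
    joined: List[Tuple[str, int, int]] = []
    expired: List[str] = []
    # Heap entries: (due time, sequence number for stable ordering, request_id, level)
    queue = [(ready_at, seq, rid, 0) for seq, (rid, ready_at) in enumerate(sessions)]
    heapq.heapify(queue)
    while queue:
        now, seq, rid, level = heapq.heappop(queue)
        arrival = feature_arrival.get(rid)
        if arrival is not None and arrival <= now:
            joined.append((rid, now, level))
        elif level < len(retry_delays):
            # Miss: demote to the next level, which waits longer before retrying.
            heapq.heappush(queue, (now + retry_delays[level], seq, rid, level + 1))
        else:
            expired.append(rid)  # exit mechanism: retries are not unlimited
    return joined, expired
```

A session whose feature arrives late is joined on a later level, while one with no feature at all, such as an event with no real ad request, ends up in the expired list.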
2.4 Effective click
The effective click scenario is also an association of two streams, but the technical choices in the two scenarios are completely different.
First, the project background. In the online movie scenario, the film itself is the advertisement. After clicking, the user enters a playback page and can watch for 6 minutes for free. To keep watching after 6 minutes, the user must be a member or purchase the film. The data to be counted here is effective clicks, defined as clicks followed by more than 6 minutes of viewing.
Technically, this scenario is an association of two streams: the click stream and the playback heartbeat stream.
- The click stream is straightforward: it contains user behaviors such as exposures and clicks, from which click events can be filtered.
- The playback behavior stream: while the user is watching, heartbeat information is reported periodically, for example one heartbeat every three seconds, indicating that the user is still watching. Since an effective click is defined by a duration over 6 minutes, the state must accumulate viewing time to check the 6-minute condition.
In this scenario the gap between the two streams is relatively small. A film generally runs more than two hours, so the behavior after a click is basically complete within three hours. The state here is therefore relatively small, and Flink's state management is sufficient.
Next we look at a specific plan.
From the stream point of view, the green part is the click stream, and the blue part is the heartbeat stream.
- In the state on the left, when a click event arrives, a state record is created for the click and a timer is registered for cleanup. The timer is set to three hours. Because most films run less than three hours, if no matching playback events have brought the state to the target by then, the click event can be expired.
In the playback heartbeat stream on the right, the state accumulates viewing duration. Since it is a heartbeat stream, for example one heartbeat every three seconds, we compute whether the cumulative playing time has reached 6 minutes and whether the current record has reached 6 minutes. In Flink, the two streams are joined with the Connect operator, and a CoProcessFunction is then defined with two core methods.
- The first defines the processing to perform when an event arrives from the first stream;
- The second defines what to do when an event arrives from the second stream.
These methods give users a lot of flexibility to implement their own control logic. Compared with many built-in joins, users have more room to customize.
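The state handling described above can be sketched as follows. This is a plain Python stand-in, not Flink code: the three methods loosely play the roles of the two per-stream callbacks and the timer callback, and a dict replaces keyed state. The class name and the decision to ignore heartbeats that arrive without a prior click are assumptions made for the example.

```python
from typing import Dict, List, Optional


class EffectiveClickJoiner:
    """Toy model of joining a click stream with a playback heartbeat stream."""

    def __init__(self,
                 required_ms: int = 6 * 60 * 1000,
                 heartbeat_ms: int = 3 * 1000,
                 click_ttl_ms: int = 3 * 60 * 60 * 1000) -> None:
        self.required_ms = required_ms    # viewing time that makes a click effective
        self.heartbeat_ms = heartbeat_ms  # viewing time represented by one heartbeat
        self.click_ttl_ms = click_ttl_ms  # how long a click waits for playback
        # key -> [click timestamp, accumulated viewing time]
        self.state: Dict[str, List[int]] = {}

    def on_click(self, key: str, ts_ms: int) -> None:
        """First stream: record the click. A real job would also register a cleanup timer."""
        self.state[key] = [ts_ms, 0]

    def on_heartbeat(self, key: str) -> Optional[str]:
        """Second stream: accumulate viewing time and emit the key once it reaches the target."""
        entry = self.state.get(key)
        if entry is None:
            return None  # assumption: heartbeats without a prior click are ignored
        entry[1] += self.heartbeat_ms
        if entry[1] >= self.required_ms:
            del self.state[key]  # emit once, then clear the state
            return key
        return None

    def on_timer(self, now_ms: int) -> List[str]:
        """Timer: expire clicks that never reached the target within the TTL."""
        expired = [k for k, (click_ts, _) in self.state.items()
                   if now_ms - click_ts >= self.click_ttl_ms]
        for k in expired:
            del self.state[k]
        return expired
```

With 3-second heartbeats, the 120th heartbeat after a click brings the total to 6 minutes and produces the effective click. A click with no playback is dropped when the three-hour timer fires.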
2.5 Feature Engineering-Summary
To summarize the cases above: two-stream association is now very common, and there are many options, such as Window join, Interval join, and the Connect + CoProcessFunction approach we use. There are also user-defined solutions.
When choosing, we recommend starting from the business. First think through the event relationship between the streams, then determine the scale of the state. This already rules out the infeasible options among those above.
3. Problems and solutions during the use of Flink
1. Fault tolerance
In Flink, fault tolerance mainly relies on checkpoints. Checkpoints provide task-level fault tolerance within a job, but when a job is restarted, whether deliberately or after a failure, its state is not restored from the historical state.
We therefore made a small improvement: when a job starts, it fetches the last successful checkpoint and initializes from it, so that state is restored.
2. Data Quality
To achieve end-to-end exactly-once in Flink, first enable checkpointing and specify exactly-once semantics for it. In addition, if the downstream, such as the sink, supports transactions, two-phase commit can be combined with checkpoints and downstream transactions to achieve end-to-end exactly-once.
The right side of the figure above describes this process. In the pre-commit phase, when the checkpoint coordinator triggers a checkpoint, it injects barriers at the sources. When a source receives the barrier, it stores its state and reports completion to the coordinator. Each subsequent operator does the same when it receives the barrier.
When the barrier reaches the sink, the sink writes a pre-commit mark to Kafka, which is guaranteed by Kafka's own transaction mechanism. After all operators have completed the checkpoint, the coordinator sends an acknowledgement to all of them, and the sink can then commit.
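The pre-commit and commit phases on the sink side can be illustrated with a toy model. This is plain Python and deliberately simplified: it is neither Flink's sink implementation nor Kafka's transaction API, and the class and method names are invented for the example. In particular, abort here models only a failure that happens before the checkpoint completes.

```python
from typing import Dict, List


class TwoPhaseCommitSink:
    """Toy transactional sink: records become visible only after the checkpoint completes."""

    def __init__(self) -> None:
        self.current: List[str] = []             # records in the open transaction
        self.pending: Dict[int, List[str]] = {}  # checkpoint id -> pre-committed records
        self.committed: List[str] = []           # records visible to downstream readers

    def write(self, record: str) -> None:
        self.current.append(record)

    def pre_commit(self, checkpoint_id: int) -> None:
        """Barrier reached the sink: seal the open transaction, then start a new one."""
        self.pending[checkpoint_id] = self.current
        self.current = []

    def commit(self, checkpoint_id: int) -> None:
        """Coordinator confirmed the checkpoint: make sealed transactions visible."""
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            self.committed.extend(self.pending.pop(cid))

    def abort(self) -> None:
        """Failure before the checkpoint completed: discard everything uncommitted."""
        self.pending.clear()
        self.current = []
```

Nothing written between two checkpoints is visible downstream until the matching commit, which is what prevents duplicates when a job fails and replays from the last checkpoint.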
3. Sink Kafka
In earlier practice, we found that when the number of partitions of the downstream Kafka topic was increased, no data was written to the new partitions.
The reason is that FlinkKafkaProducer uses FlinkFixedPartitioner by default, so each task writes only to one corresponding downstream partition. If the downstream Kafka topic has more partitions than the task's parallelism, this problem occurs.
There are two solutions:
- The first is for the user to implement a custom FlinkKafkaPartitioner;
- The other is to configure no partitioner, in which case records are written to the partitions in round-robin fashion by default.
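The effect can be reproduced with a few lines of arithmetic. The sketch below is plain Python and assumes the fixed partitioner maps each sink subtask to the partition at its subtask index modulo the partition count. The round-robin function is a simplified model of the second solution rather than Kafka's actual partitioner, and the function names are hypothetical.

```python
from typing import List, Set


def fixed_partition(subtask_index: int, num_partitions: int) -> int:
    """Assumed fixed mapping: every record of a subtask goes to one partition."""
    return subtask_index % num_partitions


def partitions_written_fixed(parallelism: int, num_partitions: int) -> List[int]:
    """Partitions that receive any data under the fixed mapping."""
    return sorted({fixed_partition(i, num_partitions) for i in range(parallelism)})


def partitions_written_round_robin(parallelism: int,
                                   num_partitions: int,
                                   records_per_subtask: int) -> List[int]:
    """Partitions that receive data when each subtask cycles through all partitions."""
    written: Set[int] = set()
    for _subtask in range(parallelism):
        for n in range(records_per_subtask):
            written.add(n % num_partitions)
    return sorted(written)
```

With a parallelism of 4 and a topic grown to 6 partitions, the fixed mapping only ever touches partitions 0 to 3, which matches the symptom above, while the round-robin model reaches all six.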
4. Enhanced monitoring
For a running Flink job, we need to check its status. In the Flink UI, however, many indicators are at task granularity, with no overall view.
The platform further aggregates these indicators and displays them on one page.
As the figure above shows, the displayed information includes back pressure, latency, and the CPU/memory utilization of the JobManager and TaskManagers at runtime. There is also checkpoint monitoring, such as whether checkpoints have timed out or failed recently. We also set up alarm notifications on these monitoring indicators, as described below.
5. Monitoring and Alarm
When a real-time task runs abnormally, the user needs to know promptly. As shown in the figure above, an alarm configuration includes the alarm subscribers, the alarm level, and a set of indicators with thresholds. When the alarm policy rules are met, alarms are pushed to the subscribers by email, telephone, or internal communication tools.
This way, users learn of task anomalies promptly and can intervene manually.
6. Real-time data production
Finally, here are the key milestones of iQiyi's advertising business in real-time link production.
- Our real-time work started in 2016. The main goal at that time was to make some indicators real-time, using Spark Streaming;
- In 2018, real-time click-through rate features went live;
- In 2019, Flink end-to-end exactly-once and enhanced monitoring went live;
- In 2020, real-time effective click features went live;
- In October of the same year, we began gradually improving the real-time data warehouse and moving API-based production to SQL;
- In April 2021, we began exploring stream-batch integration, which is currently implemented in ETL.
Previously, real-time and offline ETL were done separately. Offline ETL ran as batch processing and wrote to Hive tables, on which the offline data warehouse was built. Real-time ETL wrote its output to Kafka, on which the real-time data warehouse was built.
The first benefit of starting stream-batch integration with ETL is better timeliness for the offline data warehouse. The data must go through anti-cheating, so when we provide basic features to the advertising algorithm, the timeliness of post-anti-cheating data greatly improves the overall effect. Making ETL uniformly real-time is therefore of great significance downstream.
After ETL achieves stream-batch integration, we put the data in the data lake, and both the offline and real-time data warehouses can be built on it. Stream-batch integration can be divided into two stages, the first being to unify ETL. The report side can also be placed in the data lake, so that our query service can reach a new order of magnitude. Previously, the offline and real-time tables required a Union at query time; in the data lake, this is achieved by having offline and real-time jobs write to the same table.
4. Future planning
About future planning:
The first is stream-batch integration, which has two aspects:
- The first is ETL integration, which is now basically online.
- The second is combining real-time report SQLization with the data lake.
- In addition, anti-cheating is currently implemented mainly offline. Some anti-cheating models may later be moved online to run in real time, to minimize risk.