This article summarizes our practice-based methods for guaranteeing the stability of large-scale reporting.
1. Project Background
As data-driven thinking takes deeper root, more and more people, both internal users across NetEase Group and external commercial customers, are using BI at scale. Taking Yanxuan as an example, there are 50,000+ reports with daily visits, covering almost all business areas: users, commodities, channels, traffic, marketing, warehousing, suppliers, and finance. Some reports are embedded in the management app, some are used in weekly business meetings or review meetings, and some are embedded in business systems to assist decision-making; all of them play an important role in daily work. The daily chart query volume during peak periods exceeds 100,000, which poses great challenges to report stability assurance.
Report stability means not only keeping the platform stable, but also guaranteeing the availability and performance of report chart queries. Unlike an ordinary business service, however, the number of reports is large, the query time and resource consumption of different charts vary enormously, and the underlying resources are always limited, so uniformly guaranteeing high availability for business-critical reports is very difficult.
2. Exploration and Planning
There are mature methods for keeping the platform itself stable, so that is not the focus of this article; guaranteeing the stability of report queries is where most of our actual effort goes. In practice we borrowed the idea of tiered service guarantees: different reports naturally have different importance to the business, so we mark the key reports and prioritize them. This distinction between key and non-key is made from the top-down perspective of the business.
Once the guarantee targets are clear, we also need to identify the components and tasks that key reports depend on along the data production link and the data query link, such as the ETL tasks that produce their tables and the Impala engine that serves their queries. These also need priority guarantees: we allocate independent resources to the components and tasks on the table production link and the data query link of key reports.
On the data query link, since OLAP engines do not isolate workloads very well, it is best to use independent cluster resources. In practice, resources can be further subdivided by the application scenario of the key reports, such as dashboard versus ad-hoc analysis scenarios; isolating these from each other reduces mutual interference.
With independent resources in place, we also need to define core indicators for key reports to quantify stability. We focus on three: the first-visit cache hit rate of charts, the chart query error rate, and the ratio of slow chart queries.
First-visit cache hit rate of key-report charts. This measures the effect of cache preloading: it ensures that when a user opens a report for the first time, the query hits the pre-built cache and the report opens in seconds.
Chart query error rate of key reports. This measures chart availability and is equivalent to the error rate of the chart query interface. At present we mainly track the overall error rate; in practice, different error-rate requirements can be set for different reports. The error rate here refers to query errors reported while the user is browsing.
Slow query ratio of key-report charts. This measures chart performance. For this indicator, a performance baseline must be established for charts, for example treating any query slower than 5s as a slow query. In practice, different performance baselines can be set for different charts.
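To make the three indicators concrete, here is a minimal Python sketch that computes them from a batch of chart-query log records; the record fields and the 5s baseline are illustrative assumptions, not the product's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ChartQuery:
    """One chart-query log record; field names are illustrative."""
    chart_id: str
    is_first_visit: bool  # first time the user opens this report
    cache_hit: bool       # served from the pre-built cache
    failed: bool          # errored while the user was browsing
    duration_s: float     # end-to-end latency in seconds

def core_indicators(queries: list[ChartQuery], slow_baseline_s: float = 5.0) -> dict:
    """Compute the three stability indicators over a batch of queries."""
    if not queries:
        return {}
    first = [q for q in queries if q.is_first_visit]
    return {
        "first_visit_cache_hit_rate":
            sum(q.cache_hit for q in first) / len(first) if first else 1.0,
        "query_error_rate": sum(q.failed for q in queries) / len(queries),
        "slow_query_ratio":
            sum(q.duration_s > slow_baseline_s for q in queries) / len(queries),
    }
```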
3. Practice Plan
Once the core indicators are clear, we need corresponding system support to monitor, diagnose, and optimize them, and to continuously improve all three. Our assurance and optimization work has three parts: pre-event (report release review, report stress testing), in-event (monitoring, diagnosis, intervention), and post-event (governance of the first-visit cache hit rate, the query error rate, and slow queries). The discussion below is mainly combined with our practice on NetEase's high-performance query engine Impala.
3.1 Report Release Review
Report development is, in essence, a kind of software development; to deliver with high quality, report releases also need a review process, especially for key reports. The review covers two aspects. On one hand, check whether the tables and models the report depends on, and the report itself, conform to the specifications: whether the table's storage format is reasonable, whether the number of small files is reasonable, whether the model filters by partition fields, whether a single report page contains too many charts, and so on. The BI product provides a "Data Doctor - Performance Diagnosis" feature that can run these checks automatically. On the other hand, stress-test the report against the estimated concurrency to see whether its performance meets requirements and whether its resource usage poses a risk.
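As an illustration of what such an automated review might check, here is a minimal rule-based sketch; the report metadata fields and thresholds are assumptions, not the actual "Data Doctor" implementation.

```python
def review_report(report: dict) -> list[str]:
    """Rule-based pre-release checks mirroring the review items above."""
    issues = []
    for table in report["tables"]:
        if table["format"] not in ("PARQUET", "ORC"):
            issues.append(f"{table['name']}: prefer a columnar storage format")
        if table["file_count"] > 10_000:  # illustrative small-file threshold
            issues.append(f"{table['name']}: too many small files")
    for model in report["models"]:
        if model["partitioned"] and not model["partition_filter_enforced"]:
            issues.append(f"{model['name']}: enable forced partition filtering")
    if report["charts_per_page"] > 20:  # illustrative per-page chart limit
        issues.append("too many charts on a single page")
    return issues
```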
3.2 Report Stress Testing
Both report releases and performance optimizations need to be verified through stress testing. There are two types. One is single-report stress testing, such as pre-launch testing and verification after an optimization. The other is scenario-based stress testing, such as testing against peak working-hours traffic, where user access traffic is simulated from user access logs, as in the sketch below.
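A minimal sketch of log-replay stress testing, assuming each line of a hypothetical access log holds one recorded chart-query payload and that charts are queried through an HTTP endpoint (both the log format and the URL are placeholders):

```python
import concurrent.futures
import json
import time
import urllib.request

ACCESS_LOG = "access_log.jsonl"  # placeholder: one JSON record per historical query
QUERY_URL = "http://bi.example.com/api/chart/query"  # placeholder endpoint

def fire(record: dict) -> float:
    """Replay one recorded chart query and return its latency in seconds."""
    start = time.monotonic()
    req = urllib.request.Request(
        QUERY_URL,
        data=json.dumps(record["payload"]).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=30).read()
    return time.monotonic() - start

with open(ACCESS_LOG) as f:
    records = [json.loads(line) for line in f]

# Replay at production-like concurrency and report a tail-latency figure.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(fire, records))

print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.2f}s")
```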
3.3 Monitoring and Diagnostics
In addition to the platform's regular infrastructure and application monitoring, we add business monitoring for the core indicators of key reports: cache preload volume monitoring, key-report query error monitoring, key-report extraction task error monitoring, key-report slow query monitoring, and so on. With core-indicator monitoring in place, we can find problems and deal with them in time.
For certain specific errors, diagnostic capabilities can be provided. For example, when "chart query peak" errors occur continuously, we can diagnose which reports are causing the impact, and in an emergency, reports can be temporarily disabled as needed to protect overall stability. A threshold-based alerting sketch follows.
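A minimal sketch of alerting on the three core indicators; the thresholds below mirror the targets mentioned in the summary, and the notify hook is an assumption:

```python
# (comparison, limit) per indicator; values are illustrative targets.
THRESHOLDS = {
    "first_visit_cache_hit_rate": ("<", 0.90),
    "query_error_rate": (">", 0.005),
    "slow_query_ratio": (">", 0.05),
}

def check_indicators(metrics: dict, notify=print) -> None:
    """Raise an alert for every core indicator that breaches its limit."""
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics[name]
        breached = value < limit if op == "<" else value > limit
        if breached:
            notify(f"key-report indicator {name}={value:.3f} breached limit {limit}")
```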
3.4 Report Governance
To continuously improve the stability and performance of key reports, regular governance and optimization are essential, because report traffic, table data volume, table structure, and table production time all change unpredictably. Report governance has three parts: first-visit cache hit rate governance, query error rate governance, and slow query governance.
To improve the first-visit cache hit rate of key reports, the core is to raise the completion ratio of key-report cache preloading, which can be optimized from the following three aspects (a prioritization sketch follows the list):
(1) Optimize the table production time of key reports: the earlier the tables a key report depends on are produced, the more buffer time is left for cache preloading. This requires data developers and analysts to work together to optimize the data production link.
(2) Raise the cache preload priority of key reports relative to ordinary reports, thereby improving the preload completion rate of key reports. Key reports can be further subdivided into priorities based on metrics such as recent visit counts.
(3) For reports whose cache preloading frequently times out or errors, the priority can be lowered.
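A minimal sketch of such a preload priority queue, with illustrative scoring weights (key reports first, recent visits as a tiebreaker, repeated preload failures demoted):

```python
import heapq

def preload_priority(report: dict) -> float:
    """Lower value = preloaded earlier; weights are illustrative assumptions."""
    score = 1000.0 if report["is_key"] else 0.0   # key reports run first
    score += report["recent_visits"]               # busier reports rank higher
    score -= 500.0 * report["recent_preload_failures"]  # demote repeat offenders
    return -score  # heapq is a min-heap, so negate for highest-first

def build_preload_queue(reports: list[dict]) -> list[tuple[float, str]]:
    heap = [(preload_priority(r), r["report_id"]) for r in reports]
    heapq.heapify(heap)
    return heap
```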
To reduce the query error rate of key reports, chart query errors need to be classified and governed accordingly (a routing sketch follows the list):
(1) Charts with query timeouts should be handled through slow query optimization (see the chart slow-query section below).
(2) Chart query peak errors need to be addressed by diagnosing and optimizing the suspicious reports/charts.
(3) System errors should be solved through system optimization. For example, metadata errors can be mitigated by adding metadata refresh retries, and service restart errors by adding query retries.
(4) Business errors, such as the source table being deleted, fields missing after a source-table change, or an unreachable data source, should be pushed to the report authors for remediation.
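A minimal sketch of routing errors to these four buckets; the error-code names are hypothetical, not the product's actual codes:

```python
# Map hypothetical error codes to the governance buckets described above.
ROUTES = {
    "QUERY_TIMEOUT": "slow-query optimization",
    "QUERY_PEAK": "diagnose suspicious reports/charts",
    "METADATA_ERROR": "system optimization (metadata refresh retry)",
    "SERVICE_RESTART": "system optimization (query retry)",
    "TABLE_DROPPED": "notify report author",
    "FIELD_MISSING": "notify report author",
    "DATASOURCE_UNREACHABLE": "notify report author",
}

def route_error(error_code: str) -> str:
    return ROUTES.get(error_code, "manual triage")
```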
For slow chart queries, unified governance falls into the following categories:
(1) Governance of time-consuming and resource-consuming charts: the charts that consume the most time and resources often seriously affect the overall performance and stability of the cluster, and when several slow charts are queried concurrently, query peaks are more likely to occur, so this part of governance is the top priority. This should also be weighed against the chart's visit count: a heavily visited chart has a larger impact.
(2) Small file governance: too many small files inflate metadata, increase metadata synchronization pressure, and also hurt HDFS performance (see the compaction sketch after this list).
(3) Scheduled refresh governance: refreshing time-consuming, resource-consuming charts too frequently significantly increases cluster load; the refresh frequency can be reduced or the scheduled refresh turned off.
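One common way to compact small files in Impala is to rewrite a partition with INSERT OVERWRITE so its files are coalesced; a sketch using the impyla client, where the host, table, and column names are placeholders:

```python
from impala.dbapi import connect  # impyla client

conn = connect(host="impala.example.com", port=21050)  # placeholder host
cur = conn.cursor()
cur.execute("SET NUM_NODES=1")  # single writer -> fewer, larger output files
cur.execute("""
    INSERT OVERWRITE dw.fact_sales PARTITION (dt = '2023-01-01')
    SELECT order_id, sku_id, amount
    FROM dw.fact_sales
    WHERE dt = '2023-01-01'
""")
```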
For an individual slow chart, common performance optimization ideas are (SQL sketches follow the list):
(1) Forced partition filtering on the model: full scans of large tables hurt performance badly. For tables above roughly a million rows, partitioned tables are recommended, and forced partition filtering should be set on the model to narrow the scan range and rule out full table scans at the source.
(2) Extraction to MPP: if custom SQL filters or aggregates to reduce the result set, it can be extracted into MPP and queried there, avoiding repeated real-time computation of complex SQL. The product will subsequently also support extracting wide-table models to MPP, which will bring a large performance improvement on the ClickHouse (CK) engine.
(3) Model materialization: too many joined tables in a model lead to poor performance; the model can be pre-computed with a data task or materialized with NetEase Impala's materialized views.
(4) Use an independent dimension table for list filters: the options of a list filter otherwise have to be recomputed from the corresponding columns of the model's wide detail table, which is slow when the data volume is large. If the filter's members are relatively fixed, the list filter can read from an independent dimension table and filter the chart through a cross-model association.
(5) Refresh table statistics: Impala optimizes execution plans based on a cost model, and missing table statistics significantly degrade plan quality; table statistics can be refreshed in advance.
(6) Time/date conversion: when a field stored as a string would be converted to a date/datetime type, comparing with the original string type instead avoids per-row type conversion in SQL and improves query performance.
(7) Table storage format governance: the text storage format filters data poorly; the high-performance columnar format Parquet is recommended wherever possible.
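A few of these ideas expressed as Impala SQL, with illustrative table and column names:

```python
# (5) Refresh table statistics ahead of time so the planner has costs to work with.
COMPUTE_STATS = "COMPUTE STATS dw.fact_orders"

# (1) Always filter on the partition column so only the needed partitions are scanned.
PARTITION_FILTERED = """
SELECT dt, SUM(amount) AS gmv
FROM dw.fact_orders
WHERE dt BETWEEN '2023-01-01' AND '2023-01-07'
GROUP BY dt
"""

# (6) Compare a string-typed date column as a string; no CAST, so no per-row conversion.
STRING_COMPARE = """
SELECT COUNT(*)
FROM dw.fact_orders
WHERE dt = '2023-01-01'
"""
```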
4. Summary
Report stability and performance assurance is one of the most important parts of the BI user experience, and the methods need continuous practice and refinement. The product already supports key-report features, and more stability-assurance and governance features will follow.
Within the group, the governance of NetEase Cloud Music's reports has already been quite effective. On the core indicators, the first-visit cache hit rate is above 90%, the daily query error rate of key reports is below 0.5%, and the proportion of key-report chart queries slower than 5s is below 5%. Together with Cloud Music, we have set SLA targets for key-report query error rates, and governance of the Yanxuan environment is also in progress.
About the Author
Xueliang, NetEase technical expert and a technical leader for NetEase's BI product, has been responsible for R&D of Yanxuan's data middle platform, data products, and services, served as a JavaScript lecturer for the "Becoming a Front-end Development Engineer" and "Front-end Micro-Major" courses, and has more than ten years of experience in Internet product R&D and management.