
Preface


For the front end, the most important thing is the user experience, and within that experience the most important thing is performance. Indicators such as the one-second open rate (the share of page loads that open within one second) and fluency directly affect how the product feels to users.

Therefore, an accurate, timely and effective front-end performance monitoring system can not only quantify the performance of the current page, but also provide data to support the evaluation of optimization plans. It can also provide an alerting service that notifies developers when page performance declines, so that they can improve it.

Selection of monitoring indicators

After reviewing the practical results of our predecessors, we evaluated the computation cost, applicability and practical value of a series of performance monitoring indicators, and concluded that the following indicators and information are the most practical and cost-effective:

The first is fcp (First Contentful Paint, as shown in the figure below). It is currently the mainstream indicator for measuring the one-second open rate. It is not as close to the real user experience as indicators such as fmp (First Meaningful Paint), lcp (Largest Contentful Paint) and speedIndex, but it has the advantage that it can be obtained by calling the Performance API on Android and estimated with raf (requestAnimationFrame) on iOS, so the implementation is simple.
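
As a rough illustration of the Performance API route, the sketch below picks the fcp value out of a list of paint entries. The `PaintEntryLike` interface and the `readFcp` helper are hypothetical names introduced here, not part of the SDK described in this article, and the raf-based estimate for iOS is not shown.

```typescript
// Minimal shape of a paint entry; browser paint timing entries satisfy it.
interface PaintEntryLike {
  name: string;
  startTime: number; // milliseconds since the start of navigation
}

// Returns the fcp time in milliseconds, or undefined when no such entry
// exists (for example in a webview that does not expose paint timing).
function readFcp(entries: readonly PaintEntryLike[]): number | undefined {
  for (const entry of entries) {
    if (entry.name === "first-contentful-paint") {
      return entry.startTime;
    }
  }
  return undefined;
}

// Illustrative values, not real measurements.
const paintSample: PaintEntryLike[] = [
  { name: "first-paint", startTime: 412.5 },
  { name: "first-contentful-paint", startTime: 530.2 },
];
const fcpMs = readFcp(paintSample);
```

In a browser that supports paint timing, the input would come from `performance.getEntriesByType("paint")`.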

The second is tts (time to server), an indicator that did not appear in the articles I had read before.

It describes how long the user takes to connect to the server, and is obtained by subtracting fetchStart from requestStart, both provided by the Performance API. Front-end techniques cannot optimize this indicator, but it reflects the upper limit of the one-second open rate under the network conditions of the current user group.
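
The subtraction can be sketched as follows. `NavTimingLike` and `computeTts` are hypothetical names used only for this example, and the numbers are made up.

```typescript
// Only the two fields needed here; browser navigation timing entries
// expose fields with the same names.
interface NavTimingLike {
  fetchStart: number;
  requestStart: number;
}

// tts = requestStart - fetchStart, in milliseconds.
function computeTts(timing: NavTimingLike): number {
  return timing.requestStart - timing.fetchStart;
}

// Illustrative values, not real measurements.
const ttsMs = computeTts({ fetchStart: 12, requestStart: 347 });
```

In a browser, the entry returned by `performance.getEntriesByType("navigation")[0]` carries both fields.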

For example, if the monitoring data shows that 15% of users need at least 1 second just to connect to our server (which can be an SSR server or a CDN), then these users cannot open the page within one second no matter what we do, and the upper limit of the page's one-second open rate is 85%.

If the page's one-second open rate has already reached 75% or higher in this situation, the marginal benefit of further optimization is very low, and the current level should be good enough.
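
The 85% figure is simply the share of users whose tts is under one second. A minimal sketch of that arithmetic, with a hypothetical helper name and invented sample values:

```typescript
// Share of samples whose tts is below the threshold: the best open rate the
// page could reach even if everything after the connection took no time.
function openRateCeiling(ttsSamplesMs: readonly number[], thresholdMs = 1000): number {
  if (ttsSamplesMs.length === 0) {
    return 0; // no data: report 0 rather than NaN
  }
  let reachable = 0;
  for (const tts of ttsSamplesMs) {
    if (tts < thresholdMs) {
      reachable += 1;
    }
  }
  return reachable / ttsSamplesMs.length;
}

// 20 invented samples; 3 of them (15%) need at least 1 second to connect.
const ttsSamples = [
  120, 180, 200, 240, 260, 300, 310, 350, 400, 420,
  450, 500, 560, 620, 700, 800, 900, 1000, 1300, 2100,
];
const ceiling = openRateCeiling(ttsSamples); // 17 of 20 samples are under 1000 ms
```

Comparing the measured one-second open rate with this ceiling shows how much room is left for front-end optimization.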

The third is tsp (time for server processing), which also did not appear in the articles I had read before.

It measures the time the server spends processing a page request in SSR scenarios, and is obtained by subtracting requestStart from responseStart, both provided by the Performance API.

Poor performance at this stage also becomes a bottleneck that drags down the one-second open rate, so it must be monitored. A tsp that is too long squeezes the performance budget of the other stages, while pushing it too low increases the cost of server operation and maintenance.
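
To make the budget argument concrete, the sketch below computes tsp and the time left for the remaining stages. The names are hypothetical, and the 1000 ms default budget is an assumption derived from the one-second target, not a value stated by the system.

```typescript
interface ServerTimingLike {
  fetchStart: number;
  requestStart: number;
  responseStart: number;
}

// tsp = responseStart - requestStart, in milliseconds.
function computeTsp(timing: ServerTimingLike): number {
  return timing.responseStart - timing.requestStart;
}

// Time left for the other stages once connecting (tts) and server
// processing (tsp) are done. budgetMs = 1000 is an assumed target.
function remainingBudget(timing: ServerTimingLike, budgetMs = 1000): number {
  return budgetMs - (timing.responseStart - timing.fetchStart);
}

// Illustrative values, not real measurements.
const navSample: ServerTimingLike = { fetchStart: 10, requestStart: 310, responseStart: 560 };
const tspMs = computeTsp(navSample);       // 560 - 310
const leftMs = remainingBudget(navSample); // 1000 - (560 - 10)
```

Because the value is measured on the client, it also includes the network round trip of the request, so it is an upper bound on the pure server processing time.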

The fourth is the size of resources such as CSS files and images, and the duration of xhr requests. If the first two are not kept under control, the page will not become usable quickly even if it opens within one second, as is common on feed pages.

xhr needs to be discussed case by case. On an SSR page it has little impact: as long as tsp stays low, it will basically not drag down the one-second open rate. On an SPA, however, the main functional area of the page depends on back-end data, for example to check permissions or display feed content, so the response speed of xhr is critical and also needs to be monitored. When the indicator declines, the back-end team is notified to optimize it.

Finally, there is some environment information, such as the brand and model of the user's phone, and whether the H5 page is opened in WeChat, in a browser, or in a particular version of our Dewu App.

When a page has a performance problem, this auxiliary information helps us reproduce the user's actual scenario as accurately as possible and solve the problem efficiently.
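
One simple way to tell these environments apart is to inspect the user agent string. The sketch below is only an assumption about how this could be done: WeChat's built-in browser includes `MicroMessenger` in its user agent, while `appToken` is a hypothetical marker for the app's webview, since the real marker is not given in this article. The user agent strings are shortened, invented examples.

```typescript
type HostContainer = "wechat" | "app" | "browser";

// appToken is a hypothetical marker that the app's webview would add to
// its user agent; pass whatever the real app uses.
function detectContainer(userAgent: string, appToken: string): HostContainer {
  if (/MicroMessenger/i.test(userAgent)) {
    return "wechat";
  }
  if (appToken !== "" && userAgent.indexOf(appToken) !== -1) {
    return "app";
  }
  return "browser";
}

// Shortened, invented user agent strings.
const wechatUa = "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 MicroMessenger/8.0.0";
const appUa = "Mozilla/5.0 (iPhone) AppleWebKit/605.1.15 ExampleApp/4.5.0";
const plainUa = "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 Chrome/90.0 Mobile";

const detected = [wechatUa, appUa, plainUa].map((ua) => detectContainer(ua, "ExampleApp"));
```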

System structure

The whole system consists of the following modules:

SDK: Responsible for collecting page performance data and basic user information, and for sending the performance data to SLS according to a sending strategy. Once embedded in a page, it collects performance data on its own, without interacting with the page code.

SLS: Alibaba Cloud Log Service, which receives the data sent by the SDK and adds extra information such as the receive time and ip to the performance data.

Backend: The performance data back-end service. This module has two functions. One is to periodically pull the raw performance data from SLS, deduplicate and process it to obtain performance indicator data and user information data, and then save these data by category in the corresponding tables for querying. The other is to provide interface data for data visualization.

DB: The database that persists the performance log data and the processed performance indicator data.

Report: Performance data reports. By viewing and operating the reports, we can observe the performance indicators of a specified page under a specific project and version.

The relationship of each module is shown in the figure below:

Key technical decisions

Before development, several key points of the system need to be thought through and decided. They are roughly as follows:

  1. On mobile, the unload event is not always triggered, so the SDK must be able to send data to SLS intermittently. To control the sending frequency and reduce duplicate data, we adopted a strategy of gradually lengthening the sending interval: the longer the current page has been open, the less often data is sent. After the page has been open for a certain length of time, the SDK stops working completely.
  2. Because the data contains duplicates, a fingerprint must be computed on the user's side (browser, WeChat, or the webview in the app). We chose fingerPrintJS2 for this. In the calculation we remove the browser features that make the fingerprint unstable, so that a user always gets the same fingerprint when opening the page. Relying on the fingerprint alone is not enough, however, because phones of the same model are likely to produce the same fingerprint. When deduplicating performance data we therefore combine the user's fingerprint, the client timestamp of the log, and the user's device ip. Two different users would be confused only if they were on the same wifi, used the same device model, and opened the page in the same millisecond. Judging by the current visit volume and time distribution of our front-end pages, this deduplication scheme works very well.
  3. Part of the SDK is synchronous code that needs to run as early as possible after the page loads. This means that if the SDK throws an error, the page may stop working properly, which is very dangerous. We therefore wrap the SDK's synchronous code in try/catch so that SDK exceptions cannot bring the page down.
  4. Some pages load many images and send many requests, and reporting every record is unrealistic. When monitoring this content we therefore record details only for the 10 largest images by file size and the 10 requests with the longest loading time, and then count how many images the page loaded and how many requests it sent. In this way we know the scale of the page's resource loading and can find the most time-consuming resources.
  5. The raw logs are sent to SLS mainly because the concurrency of this data is very high and the cost of running our own log server would be too great. Using SLS is the more cost-effective choice.
  6. The back-end service uses Python + Django. Python is a scripting language; although its performance is somewhat lower, it is easier for front-end developers to pick up, and for a small-scale back-end service it keeps development efficient. Running the code on PyPy can improve the speed of the Python code, and combining multiple processes with multiple threads can further speed up data processing.
  7. Because our purpose is statistical analysis, it is not necessary to process all the performance data. We therefore use equal-step sampling: for one day's data, we take only the first 1000 records from every 5% of the log data. Since users report data at random times, this scheme basically guarantees the randomness of the sample. The amount of computation drops significantly, so we can complete data processing on a low-end machine, and in the event of a failure we also have time to catch up on data that was not processed earlier.
  8. For the database we chose MySQL. Our IO requirements are not demanding, so a conventional relational database meets the need.
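
The gradually lengthening send interval from decision 1 could look like the sketch below. The article does not give the real intervals or the cut-off time, so `SendPolicy`, `sendSchedule` and every number here are assumptions chosen for illustration.

```typescript
// All values are illustrative assumptions, not the SDK's real settings.
interface SendPolicy {
  firstDelayMs: number; // delay before the first report
  factor: number;       // each later delay is the previous one times this factor
  stopAfterMs: number;  // no report is scheduled past this page age
}

// Returns the page ages (ms since load) at which a report would be sent.
function sendSchedule(policy: SendPolicy): number[] {
  if (policy.firstDelayMs <= 0 || policy.factor < 1) {
    throw new RangeError("firstDelayMs must be positive and factor at least 1");
  }
  const times: number[] = [];
  let delay = policy.firstDelayMs;
  let at = delay;
  while (at <= policy.stopAfterMs) {
    times.push(at);
    delay *= policy.factor;
    at += delay;
  }
  return times;
}

const schedule = sendSchedule({ firstDelayMs: 1000, factor: 2, stopAfterMs: 60000 });
// schedule is [1000, 3000, 7000, 15000, 31000]
```

With a doubling factor, a page that stays open for a minute produces five reports instead of sixty, which is the kind of reduction in repeated data that the strategy aims for.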
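
The deduplication rule from decision 2 combines three fields into one key. The real back end is written in Python; the sketch below uses TypeScript only to stay consistent with the other examples, and the `RawLog` field names are hypothetical.

```typescript
// Hypothetical field names for one raw log record.
interface RawLog {
  fingerprint: string; // fingerprint computed on the user's side
  clientTs: number;    // client timestamp carried by the log, in milliseconds
  ip: string;          // device ip
}

// Keeps the first record for each (fingerprint, clientTs, ip) combination.
function dedupeLogs<T extends RawLog>(logs: readonly T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const log of logs) {
    const key = JSON.stringify([log.fingerprint, log.clientTs, log.ip]);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(log);
    }
  }
  return unique;
}

// Invented records: the second one is a repeated send of the first.
const rawLogs: RawLog[] = [
  { fingerprint: "f1", clientTs: 1000, ip: "10.0.0.1" },
  { fingerprint: "f1", clientTs: 1000, ip: "10.0.0.1" },
  { fingerprint: "f1", clientTs: 1000, ip: "10.0.0.2" }, // same model, other network
  { fingerprint: "f1", clientTs: 1005, ip: "10.0.0.1" }, // same device, later visit
];
const uniqueLogs = dedupeLogs(rawLogs);
```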
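
Decision 4 amounts to a top-N selection plus two counters. A minimal sketch, with hypothetical type and function names and invented data:

```typescript
// Hypothetical record for one loaded image or one request.
interface ResourceRecord {
  url: string;
  bytes: number;      // file size in bytes
  durationMs: number; // loading time in milliseconds
}

interface LoadSummary {
  imageCount: number;
  requestCount: number;
  largestImages: ResourceRecord[];
  slowestRequests: ResourceRecord[];
}

// Returns the n items with the highest score, highest first.
function topN<T>(items: readonly T[], n: number, score: (item: T) => number): T[] {
  return items.slice().sort((a, b) => score(b) - score(a)).slice(0, n);
}

function summarizeLoads(
  images: readonly ResourceRecord[],
  requests: readonly ResourceRecord[],
  n = 10,
): LoadSummary {
  return {
    imageCount: images.length,
    requestCount: requests.length,
    largestImages: topN(images, n, (r) => r.bytes),
    slowestRequests: topN(requests, n, (r) => r.durationMs),
  };
}

// Invented data; n = 2 keeps the example short.
const loadSummary = summarizeLoads(
  [
    { url: "a.png", bytes: 500, durationMs: 30 },
    { url: "b.png", bytes: 90000, durationMs: 120 },
    { url: "c.png", bytes: 42000, durationMs: 80 },
  ],
  [
    { url: "/api/feed", bytes: 2100, durationMs: 640 },
    { url: "/api/user", bytes: 300, durationMs: 95 },
  ],
  2,
);
```

In a browser, records of this kind could be built from `performance.getEntriesByType("resource")`, whose entries expose `name`, `transferSize` and `duration`.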
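
The equal-step sampling from decision 7 can be read as: split the day's records into 20 equal segments (5% each) and keep the first 1000 records of every segment. The sketch below follows that reading; it is written in TypeScript for consistency with the other examples even though the real back end is Python, and the function name is hypothetical.

```typescript
// Splits the records into round(1 / stepFraction) equal segments and keeps
// at most perSegment records from the start of each segment.
function sampleEqualStep<T>(
  records: readonly T[],
  stepFraction = 0.05,
  perSegment = 1000,
): T[] {
  if (stepFraction <= 0 || stepFraction > 1) {
    throw new RangeError("stepFraction must be in (0, 1]");
  }
  const segments = Math.round(1 / stepFraction);
  const sampled: T[] = [];
  for (let s = 0; s < segments; s++) {
    // Integer boundaries avoid rounding drift from repeated float additions.
    const start = Math.floor((s * records.length) / segments);
    const segmentEnd = Math.floor(((s + 1) * records.length) / segments);
    const end = Math.min(start + perSegment, segmentEnd);
    for (const record of records.slice(start, end)) {
      sampled.push(record);
    }
  }
  return sampled;
}

// 100 invented records numbered 0..99, 4 segments, 2 records per segment.
const dayRecords: number[] = [];
for (let i = 0; i < 100; i++) {
  dayRecords.push(i);
}
const sampledRecords = sampleEqualStep(dayRecords, 0.25, 2);
// sampledRecords is [0, 1, 25, 26, 50, 51, 75, 76]
```

With the default arguments, at most 20 × 1000 = 20,000 records are processed per day, regardless of how many were reported.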

Future development plan

The current front-end performance monitoring system meets daily monitoring needs, but it can go further:

  1. Frame rate statistics: The SDK can already collect frame rate data, but because the volume of raw frame rate data is too large, the statistical method needs to change later, for example by reporting only the time intervals in which the page freezes and the frame rate distribution within those intervals.
  2. Use more scientific indicators such as lcp and fmp to replace fcp.
  3. Also include pageName in the deduplication key to further improve the deduplication result.
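
As one possible reading of the frame rate plan in item 1, the sketch below scans a list of frame timestamps and reports only the stretches where the gap between frames is too long. Everything here is an assumption: the names are hypothetical, the 50 ms threshold is arbitrary, and the average frame rate stands in for the fuller distribution mentioned above.

```typescript
interface StallInterval {
  startMs: number;    // timestamp of the last frame before the stall began
  endMs: number;      // timestamp of the frame that ended the last slow gap
  frames: number;     // frames rendered between startMs and endMs
  averageFps: number; // frames per second inside the interval
}

// frameTimes: ascending frame timestamps in ms, for example collected in
// requestAnimationFrame callbacks. Consecutive gaps longer than maxGapMs
// are merged into one interval.
function findStalls(frameTimes: readonly number[], maxGapMs = 50): StallInterval[] {
  const stalls: StallInterval[] = [];
  let prev: number | undefined;
  let stallStart: number | undefined;
  let stallFrames = 0;

  const pushStall = (startMs: number, endMs: number, frames: number): void => {
    stalls.push({ startMs, endMs, frames, averageFps: (frames * 1000) / (endMs - startMs) });
  };

  for (const t of frameTimes) {
    if (prev !== undefined) {
      if (t - prev > maxGapMs) {
        if (stallStart === undefined) {
          stallStart = prev;
          stallFrames = 0;
        }
        stallFrames += 1;
      } else if (stallStart !== undefined) {
        pushStall(stallStart, prev, stallFrames);
        stallStart = undefined;
      }
    }
    prev = t;
  }
  if (stallStart !== undefined && prev !== undefined) {
    pushStall(stallStart, prev, stallFrames);
  }
  return stalls;
}

// Invented timestamps: smooth frames, two 100 ms gaps, then smooth again.
const frameStalls = findStalls([0, 16, 32, 132, 232, 248, 264]);
```

Reporting one small record per stall instead of every frame timestamp keeps the data volume low while still showing when and how badly the page froze.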

Text | Old Wolf


