Front-end white screen monitoring exploration

To borrow the slogan of the group's monitoring team: those who care about business stability will not have bad luck~

Background

I don't know when the front-end white screen became such a common topic. "White screen" has even become synonymous with front-end bugs: _Hello, your page is white._ Moreover, a 'white' page hits users especially hard, much like the 'blue screen' shown when a Windows system crashes.

The two are very similar, which probably explains how the term "white screen" caught on. A phenomenon this visible is bound to affect users badly, so detecting it early and eliminating its impact quickly becomes very important.

Why monitor the white screen alone

It is not really about the white screen alone; the white screen is just a symptom. What we need is fine-grained exception monitoring. Every company has its own exception monitoring system, and the group is no exception; its system is mature. But a general-purpose scheme always has shortcomings: if every exception triggers the same alarm, it is impossible to tell how severe each one is and respond accordingly. Customized, fine-grained exception monitoring on top of the general system is therefore very necessary. That is why this article focuses on the white screen scenario: I draw the boundary of the scenario around the "white screen" phenomenon.

Solution research

There are roughly two possible causes of a white screen:

- Errors during JS execution
- Resource errors

The two point in different directions. Resource errors affect a wider area and have to be handled case by case, so they are outside the scope of the solutions below. With that in mind, I looked at some practices described online, did some research of my own, and summarized the following options:

`1. onerror + DOM detection`

The principle is very simple. With today's mainstream SPA frameworks, the DOM is generally mounted under a root node (such as <div id="root"></div>). When a white screen occurs, the usual symptom is that all DOM under the root node has been unmounted. This solution listens for the global onerror event and, when an exception occurs, checks whether any DOM is mounted under the root node. If not, the screen is considered white.
I think it is a simple, blunt and effective solution. But it has a shortcoming: it only holds under the premise that **white screen === DOM under the root node is unmounted**, which is not always the case, for example with some micro front-end frameworks. As I will explain later, this solution also naturally conflicts with my final solution.
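
The check described above could look roughly like the sketch below. It is only an illustration under assumptions: `isRootEmpty`, `installWhiteScreenCheck` and the `report` callback are hypothetical names, and the window and document are passed in as parameters so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: true when the root container is missing or has no child elements.
function isRootEmpty(doc, rootId) {
  const root = doc.getElementById(rootId);
  return root === null || root.childElementCount === 0;
}

// Hypothetical installer: on every global error event, check the root node
// and call `report` if nothing is mounted under it.
function installWhiteScreenCheck(win, doc, rootId, report) {
  win.addEventListener('error', (event) => {
    if (isRootEmpty(doc, rootId)) {
      report({ type: 'white-screen', message: event.message });
    }
  });
}
```

In a browser this would be called as `installWhiteScreenCheck(window, document, 'root', send)`, where `send` is whatever reporting function the project uses. A real page might need to defer the check slightly so the framework has finished unmounting before the root is inspected.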

`2. MutationObserver API`

If you are not familiar with it, see the MutationObserver documentation. In essence it watches for DOM changes and tells you whether each change added or removed nodes. I considered several ways to use it:

- Use it together with onerror, similar to the first solution. I rejected this quickly: although it shows the trend of DOM changes well, it cannot be linked to a specific error. Both are event listeners, and there is no necessary connection between them.
- Use it alone to determine whether a large amount of DOM has been unmounted. Disadvantages: a white screen does not necessarily mean DOM was unmounted (the page may never have rendered at all), and a large amount of DOM may also be unmounted under normal circumstances. This does not work at all.
- Use it only as the monitoring trigger, combined with DOM detection. This has the same disadvantages as solution 1, and I think it is worse, because it cannot be linked to a specific error, so the problem cannot be located. When I discussed this with teammates they suggested another direction: tracking user behavior data to locate the problem, which I think is also a viable approach.

At first I thought this was the final answer, but after a long internal struggle I rejected it. It does, however, offer a better choice of monitoring timing.
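
To make "whether each change added or removed nodes" concrete, here is a minimal sketch. `summarizeMutations` and `observeRoot` are hypothetical helper names; only `MutationObserver`, `observe`, `addedNodes` and `removedNodes` come from the browser API.

```javascript
// Count how many nodes a batch of MutationRecord-like objects added and removed.
function summarizeMutations(records) {
  let added = 0;
  let removed = 0;
  for (const record of records) {
    added += record.addedNodes.length;
    removed += record.removedNodes.length;
  }
  return { added, removed };
}

// Browser-only wiring, guarded so the sketch also loads where
// MutationObserver is not defined (for example in Node).
function observeRoot(root, onSummary) {
  if (typeof MutationObserver === 'undefined') return null;
  const observer = new MutationObserver((records) => {
    onSummary(summarizeMutations(records));
  });
  observer.observe(root, { childList: true, subtree: true });
  return observer;
}
```

As the text notes, a summary like this shows the trend of DOM changes but carries no link to the error that caused them.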

`3. Ele.me Emonitor white screen monitoring solution`

Ele.me's white screen monitoring solution records the change in HTML length between page open and 4 seconds later, and uploads the data to Ele.me's self-developed time series database. If a page is stable, the distribution of page length changes should follow a "power-law distribution" curve, and the p10 and p20 lines (the values ranked at the first 10% and 20% of the samples) should be stable, fluctuating within a certain range. If the page is abnormal, the curve will drop to the bottom.
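
The p10 and p20 lines can be illustrated with a small percentile calculation. This is only a sketch of the idea, not Emonitor's implementation: the nearest-rank method, the function names and the sample field names (`lengthAtOpen`, `lengthAfter4s`) are all assumptions.

```javascript
// Nearest-rank percentile of a list of numbers (p between 0 and 100).
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical sample shape: HTML length when the page opens and 4 seconds later.
function lengthGrowth(sample) {
  return sample.lengthAfter4s - sample.lengthAtOpen;
}
```

Computing `percentile(samples.map(lengthGrowth), 10)` per time window would give one point on a p10 line. On a healthy page most samples grow by a similar amount, so the line stays within a band, while a page that stops rendering pushes the low percentiles toward zero.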

`Other`

The remaining solutions are much the same. After a round of research, I found that they all come down to two points.

Monitoring timing: from my survey, there are three common choices:
- onerror
- MutationObserver API
- Polling
DOM detection: there are many options here. Besides the ones above, you can also use:
- elementsFromPoint API sampling
- Image recognition
- Various algorithms that analyze data derived from the DOM
- ...
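
As an illustration of the elementsFromPoint sampling option, the sketch below samples points along the two center lines of the viewport and measures how many of them hit only a container element. The function name, the sampling pattern and the idea of a "container selector" list are assumptions made for this example; `elementsFromPoint` and `matches` are the standard DOM methods.

```javascript
// Sample points on the horizontal and vertical center lines of the viewport
// and return the fraction whose topmost element is a bare container
// (or where no element is found at all).
function emptyPointRatio(doc, width, height, containerSelectors, samples = 9) {
  let empty = 0;
  let total = 0;
  for (let i = 1; i <= samples; i++) {
    const points = [
      [(width * i) / (samples + 1), height / 2],
      [width / 2, (height * i) / (samples + 1)],
    ];
    for (const [x, y] of points) {
      const top = doc.elementsFromPoint(x, y)[0];
      total += 1;
      if (!top || containerSelectors.some((sel) => top.matches(sel))) {
        empty += 1;
      }
    }
  }
  return empty / total;
}
```

In a browser this might be called as `emptyPointRatio(document, window.innerWidth, window.innerHeight, ['html', 'body', '#root'])`, with a ratio close to 1 treated as a suspected white screen. The threshold is a judgment call for each page.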

`Changing direction`

After several attempts, I still had not found what I wanted. The main problem was accuracy: none of these solutions can guarantee that what I detect is really a white screen, and theoretical derivation alone does not hold up. They all have one thing in common: what they monitor is the 'white screen' phenomenon. Deriving the cause from the phenomenon can work, but it is not accurate enough. What I really want to monitor is the underlying cause of the white screen.

So, back to the beginning: what is a white screen? What causes it? Is the browser unable to render because of an error? No. The white screens that are so common with SPA frameworks are actually caused by the framework. In essence, an error leaves the framework not knowing how to render, so it simply renders nothing. Since most of our team uses the React technology stack, let's look at a passage from the React official website: React believes that keeping a broken UI is worse than removing it completely. Whether or not this view is correct, at least we know the cause of the white screen: an exception occurred during rendering, and we did not catch and handle it.

With today's mainstream frameworks we delegate DOM operations to the framework, so each framework handles rendering exceptions differently. This is probably why white screen monitoring is hard to unify and turn into a product. But the general direction is the same.

So I think the white screen can be defined as follows: a rendering failure caused by an exception.

The white screen monitoring solution then becomes: monitor rendering exceptions. For React, the answer is: Error Boundaries.

`Error Boundaries`

We can call it an error boundary. What is an error boundary? It is a class component that implements specific lifecycle methods to catch errors thrown while rendering its child components, and it can return a degraded UI to render instead:

class ErrorBoundary extends React.Component {
  constructor(props) {
    super(props);
    this.state = { hasError: false };
  }

  static getDerivedStateFromError(error) {
    // Update state so that the next render shows the degraded UI
    return { hasError: true };
  }

  componentDidCatch(error, errorInfo) {
    // We can report the error log to the server
    logErrorToMyService(error, errorInfo);
  }

  render() {
    if (this.state.hasError) {
      // We can render any custom degraded UI
      return <h1>Something went wrong.</h1>;
    }

    return this.props.children; 
  }
}

A responsible developer will not let errors go unhandled. An error boundary can be wrapped around any part of the tree and provide a degraded UI, so once the developer is 'responsible', the page will never be completely white. This is what I meant earlier when I said that solution 1 naturally conflicts with this one, and why the other solutions are unreliable. So if we report the exception at this point, the exception reported here is exactly what causes the white screen as we defined it, and this derivation is 100% correct.
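
As a rough illustration of what could be reported from componentDidCatch, the sketch below builds a plain report object from the two arguments React passes to that method. `buildRenderErrorReport`, the `pageUrl` parameter and the field names are hypothetical; only `error` and `errorInfo.componentStack` come from React.

```javascript
// Hypothetical payload builder for a render exception caught by an error boundary.
function buildRenderErrorReport(error, errorInfo, pageUrl) {
  return {
    type: 'render-error',
    page: pageUrl,
    message: error && error.message ? error.message : String(error),
    stack: error && error.stack ? error.stack : null,
    componentStack:
      errorInfo && errorInfo.componentStack ? errorInfo.componentStack : null,
  };
}
```

Inside the boundary, the reporting call would then receive `buildRenderErrorReport(error, errorInfo, location.href)` instead of the raw arguments, which keeps the reported shape stable across pages.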

Claiming 100% may sound irresponsible, so let's look at why I say this derivation is 100% accurate:

`React rendering process`

Let's briefly review what React does between the code and the displayed page. I roughly divide it into several stages: render => task scheduling => task loop => commit => display. Let's use a simple example to walk through the whole process (task scheduling is outside the scope of this discussion, so it is not shown):

const App = ({ children }) => (
  <>
    <p>hello</p>
    { children }
  </>
);
const Child = () => <p>I'm child</p>

const a = ReactDOM.render(
  <App><Child/></App>,
  document.getElementById('root')
);

`Prepare`

First of all, the browser does not understand our JSX syntax, so after compiling with Babel we get roughly the following code:

var App = function App(_ref2) {
  var children = _ref2.children;
  return React.createElement(React.Fragment, null, React.createElement("p", null, "hello"), children);
};

var Child = function Child() {
  return React.createElement("p", null, "I'm child");
};

ReactDOM.render(React.createElement(App, null, React.createElement(Child, null)), document.getElementById('root'));

The Babel plugin converts all JSX into createElement calls. Executing them produces a ReactElement description object like this:

{
  $$typeof: Symbol(react.element),
  key: null,
  props: {}, // second argument of createElement; note that children also live here, and children are themselves a ReactElement or an array
  type: 'h1' // first argument of createElement; either a native tag name string or a component (Function, Class...)
}
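
To show how such a description object comes about, here is a deliberately simplified stand-in for createElement. It is not React's real implementation: `createElementSketch` is a hypothetical name, and the use of `Symbol.for('react.element')` as the marker is an assumption based on the object shown above.

```javascript
// Assumed marker, matching the $$typeof shown in the article's object.
const REACT_ELEMENT_TYPE = Symbol.for('react.element');

// Simplified stand-in for createElement: type, config (props plus key/ref), children.
function createElementSketch(type, config, ...children) {
  const { key = null, ref = null, ...rest } = config || {};
  const props = { ...rest };
  // A single child is stored as-is, several children as an array.
  if (children.length === 1) {
    props.children = children[0];
  } else if (children.length > 1) {
    props.children = children;
  }
  return { $$typeof: REACT_ELEMENT_TYPE, type, key, ref, props };
}
```

Calling `createElementSketch('p', null, 'hello')` yields an object whose `type` is `'p'` and whose `props.children` is `'hello'`, which mirrors the shape described above.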

Every node, including native ones such as <a></a> and <p></p>, gets a FiberNode, whose structure looks roughly like this:

FiberNode = {
  elementType: null, // first argument passed to createElement
  key: null,
  type: HostRoot, // node type (root node, function component, class component, etc.)
  return: null, // parent FiberNode
  child: null, // first child FiberNode
  sibling: null, // next sibling FiberNode
  flags: null, // status flags
}

You can think of it as a Virtual DOM node with a lot of scheduling data attached. Initially we create a FiberNodeRoot for the root node. If there is one and only one ReactDOM.render call, it is the only root, and there is one and only one FiberNode tree.

I only keep the fields that matter for the rendering process. There are many other fields used for scheduling and bookkeeping that I do not list here; interested readers can explore them on their own.

`render`

Now we start rendering the page. In our example, that means executing ReactDOM.render. There is a global workInProgress object that marks the FiberNode currently being processed.

1. First, we initialize a FiberNodeRoot for the root node, with the structure shown above, and set workInProgress = FiberNodeRoot.
2. Next we process the first argument of ReactDOM.render, a ReactElement:

ReactElement = {
  $$typeof: Symbol(react.element),
  key: null,
  props: {
    children: {
      $$typeof: Symbol(react.element),
      key: null,
      props: {},
      ref: null,
      type: ƒ Child(),
    }
  },
  ref: null,
  type: ƒ App()
}

This structure describes <App><Child /></App>.

3. We generate a FiberNode for the ReactElement and point its return at the parent FiberNode (initially our root), then set workInProgress = FiberNode:

{
  elementType: ƒ App(), // the type is the App function
  key: null,
  type: FunctionComponent, // function component type
  return: FiberNodeRoot, // our root node
  child: null,
  sibling: null,
  flags: null
}

4. As long as workInProgress exists, we process the FiberNode it points to. There are many types of nodes and each is processed differently, but the overall flow is the same. Taking the current function component as an example, we directly execute App(props). There are two cases:
- The component returns a single node, that is, a ReactElement object. Repeat steps 3-4 for it, point the child of the current node at the child node (CurrentFiberNode.child = ChildFiberNode), and point the return of the child node at the current node (ChildFiberNode.return = CurrentFiberNode).
- The component returns multiple nodes (an array or a Fragment). We get an array of ChildFiberNode and loop over it, performing steps 3-4 for each node. The child of the current node points at the first child node (CurrentFiberNode.child = ChildFiberNodeList[0]), the sibling of each child node points at its next child node, if any (ChildFiberNode[i].sibling = ChildFiberNode[i + 1]), and the return of each child node points at the current node (ChildFiberNode[i].return = CurrentFiberNode).

5. If there are no exceptions, each node is marked as pending placement: FiberNode.flags = Placement.

6. Repeat these steps until every node has been processed and workInProgress is empty.

In the end we can roughly get such a FiberNode tree:

FiberNodeRoot = {
  elementType: null,
  type: HostRoot,
  return: null,
  child: FiberNode<App>,
  sibling: null,
  flags: Placement, // pending placement
}

FiberNode<App> {
  elementType: f App(),
  type: FunctionComponent,
  return: FiberNodeRoot,
  child: FiberNode<p>,
  sibling: null,
  flags: Placement // pending placement
}

FiberNode<p> {
  elementType: 'p',
  type: HostComponent,
  return: FiberNode<App>,
  sibling: FiberNode<Child>,
  child: null,
  flags: Placement // pending placement
}

FiberNode<Child> {
  elementType: f Child(),
  type: FunctionComponent,
  return: FiberNode<App>,
  child: null,
  flags: Placement // pending placement
}
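
The child, sibling and return wiring that produces a tree like this can be sketched in a few lines. This is a simplified illustration using plain objects, not React's reconciler; `linkChildren` is a hypothetical helper name.

```javascript
// Wire a parent fiber to its child fibers:
// parent.child -> first child, child.sibling -> next child, child.return -> parent.
function linkChildren(parent, children) {
  parent.child = children.length > 0 ? children[0] : null;
  children.forEach((child, i) => {
    child.return = parent;
    child.sibling = i + 1 < children.length ? children[i + 1] : null;
  });
  return parent;
}
```

For the example above, `linkChildren(appFiber, [pFiber, childFiber])` would give `appFiber.child === pFiber` and `pFiber.sibling === childFiber`, with both children pointing back to `appFiber` through `return`.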

`Commit phase`

Put simply, the commit phase takes this tree, traverses it depth-first (child => sibling), places the DOM nodes and calls the lifecycle methods.
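
The child => sibling traversal order can be sketched as follows. This only illustrates the visiting order on plain objects; React's actual commit phase does considerably more work, and `walkFiberTree` is a hypothetical name.

```javascript
// Visit fibers depth-first: a node, then its child subtree, then its siblings.
// Relies on the child / sibling / return pointers described earlier.
function walkFiberTree(root, visit) {
  let node = root;
  while (node !== null) {
    visit(node);
    if (node.child) {
      node = node.child;
      continue;
    }
    // No child: climb until a sibling is found or we are back at the root.
    while (node !== root && !node.sibling) {
      node = node.return;
    }
    if (node === root) return;
    node = node.sibling;
  }
}
```

Passing a `visit` callback that places a DOM node for each host fiber would reproduce, in a very reduced form, the placement order described above.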

That is, in simplified form, the entire normal rendering process. Next, let's look at exception handling.

`Error boundary process`

We have just walked through the normal process. Now let's introduce an error and catch it:

const App = ({ children }) => (
  <>
  <p>hello</p>
  { children }
  </>
);
const Child = () => <p>I'm child {a.a}</p>

const a = ReactDOM.render(
  <App>
    <ErrorBoundary><Child/></ErrorBoundary>
  </App>,
  document.getElementById('root')
);

The function body that executes step 4 is wrapped in try...catch. If an exception is caught, React follows the exception flow:

do {
  try {
    workLoopSync(); // step 4 above
    break;
  } catch (thrownValue) {
    handleError(root, thrownValue);
  }
} while (true);

When performing step 4 we call the Child function. Because it contains the expression {a.a}, which cannot be evaluated, an exception is thrown and we enter the handleError flow. At this point the node being processed is FiberNode<Child>. Let's look at handleError:

function handleError(root, thrownValue): void {
  let erroredWork = workInProgress; // the FiberNode being processed, i.e. the node that threw
  throwException(
    root, // our root FiberNode
    erroredWork.return, // parent node
    erroredWork,
    thrownValue, // the thrown value
  );
  completeUnitOfWork(erroredWork);
}

function throwException(
  root: FiberRoot,
  returnFiber: Fiber,
  sourceFiber: Fiber,
  value: mixed,
) {
  // The source fiber did not complete.
  sourceFiber.flags |= Incomplete;

  let workInProgress = returnFiber;
  do {
    switch (workInProgress.tag) {
      case HostRoot: {
        workInProgress.flags |= ShouldCapture;
        return;
      }
      case ClassComponent:
        // Capture and retry
        const ctor = workInProgress.type;
        const instance = workInProgress.stateNode;
        if (
          (workInProgress.flags & DidCapture) === NoFlags &&
          (typeof ctor.getDerivedStateFromError === 'function' ||
            (instance !== null &&
              typeof instance.componentDidCatch === 'function' &&
              !isAlreadyFailedLegacyErrorBoundary(instance)))
        ) {
          workInProgress.flags |= ShouldCapture;
          return;
        }
        break;
      default:
        break;
    }
    workInProgress = workInProgress.return;
  } while (workInProgress !== null);
}

The code is too long, so only part of it is excerpted. Look at the throwException method first. It does two core things:

- Mark the node that failed as incomplete: FiberNode.flags |= Incomplete.
- Bubble up from the parent node, looking for a node that is able to handle exceptions (a ClassComponent) and actually does handle them (it declares getDerivedStateFromError or componentDidCatch). If there is one, mark that node as to-be-captured with workInProgress.flags |= ShouldCapture. If there is none, the root node is marked.

The completeUnitOfWork method is similar. It bubbles up from the parent node looking for ShouldCapture. If it finds one, it marks that node as captured (DidCapture). If none is found, all nodes up to the root node are marked Incomplete. workInProgress is then set to the captured node.
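
The bubbling search described above can be reduced to a small standalone sketch. It mirrors the idea of the excerpt but is not React's code: `findErrorHandler` is a hypothetical name, and the string tags stand in for React's internal numeric constants.

```javascript
// Assumed stand-ins for React's internal tag constants.
const HostRoot = 'HostRoot';
const ClassComponent = 'ClassComponent';

// Walk up the `return` pointers from the failed fiber's parent and return the
// first fiber that can handle the error: a class component declaring one of
// the two error hooks, or else the root.
function findErrorHandler(failedFiber) {
  let fiber = failedFiber.return;
  while (fiber !== null) {
    if (fiber.tag === HostRoot) return fiber;
    if (fiber.tag === ClassComponent) {
      const ctor = fiber.type;
      const instance = fiber.stateNode;
      const handles =
        typeof ctor.getDerivedStateFromError === 'function' ||
        (instance !== null && typeof instance.componentDidCatch === 'function');
      if (handles) return fiber;
    }
    fiber = fiber.return;
  }
  return null;
}
```

When the returned fiber is the root rather than a boundary, no degraded UI is available, which corresponds to the white screen case.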

After that, the process restarts from the captured node (or from the root node if nothing captured the error). Because of its state, React renders only the degraded UI for that node. If it has sibling nodes, they continue through the normal process below. Let's look at the FiberNode tree finally produced by the example above:

FiberNodeRoot = {
  elementType: null,
  type: HostRoot,
  return: null,
  child: FiberNode<App>,
  sibling: null,
  flags: Placement, // pending placement
}

FiberNode<App> {
  elementType: f App(),
  type: FunctionComponent,
  return: FiberNodeRoot,
  child: FiberNode<p>,
  sibling: null,
  flags: Placement // pending placement
}

FiberNode<p> {
  elementType: 'p',
  type: HostComponent,
  return: FiberNode<App>,
  sibling: FiberNode<ErrorBoundary>,
  child: null,
  flags: Placement // pending placement
}

FiberNode<ErrorBoundary> {
  elementType: f ErrorBoundary(),
  type: ClassComponent,
  return: FiberNode<App>,
  child: FiberNode<h1>,
  flags: DidCapture // captured
}

FiberNode<h1> {
  elementType: 'h1',
  type: HostComponent,
  return: FiberNode<ErrorBoundary>,
  child: null,
  flags: Placement // pending placement
}

If no error boundary is configured, nothing remains under the root node, and naturally nothing can be rendered.

OK, by now the error boundary flow should be clear, and you should understand why I said earlier that the ErrorBoundary derivation is 100% correct. Of course, this 100% means that an exception caught by an ErrorBoundary would essentially always have caused a white screen; it does not mean that it catches every exception that causes a white screen. The following scenarios are not caught by it:

- Event handlers
- Asynchronous code
- SSR
- Errors thrown in the error boundary itself

React SSR is designed around streaming: while the server is sending already processed elements, it is still generating HTML for the rest. This means a parent element cannot catch an error in a child component and hide the failed component. In this case it seems the only option is to wrap every render function in try...catch. Of course, we can use Babel or TypeScript to apply this automatically. The end result is similar to an ErrorBoundary.
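
Wrapping a render function in try...catch, as suggested for SSR, could be done by hand with a small higher-order function. This is a sketch under assumptions: `withRenderGuard`, `fallback` and `report` are hypothetical names, and a build-time transform would generate something equivalent rather than use this helper directly.

```javascript
// Wrap a render function so that a thrown error is reported and
// replaced by a fallback value instead of propagating.
function withRenderGuard(render, fallback, report) {
  return function guardedRender(props) {
    try {
      return render(props);
    } catch (error) {
      report(error);
      return fallback;
    }
  };
}
```

The design mirrors what an error boundary does on the client: the failure is reported at the point where rendering broke, and a degraded output takes the place of the failed component.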

Events and asynchronous code work out conveniently. Although an ErrorBoundary cannot catch exceptions thrown there, those exceptions do not directly cause a white screen (and if they set a bad state that indirectly causes one, the resulting render error is caught after all). They fall outside the responsibility of white screen monitoring and need other fine-grained monitoring capabilities.

`Summary`

To summarize the conclusions of this article: my definition of a white screen is a rendering failure caused by an exception. The corresponding solution is: resource monitoring + rendering process monitoring.

With today's SPA frameworks, white screen monitoring has to be tailored to the scenario. Using React as the example here, white screen information can be obtained by monitoring exceptions in the rendering process, which also raises developers' attention to exception handling. Other frameworks have corresponding mechanisms for handling this phenomenon.

Of course, this solution has weaknesses too. Because it starts from the cause rather than the phenomenon, it cannot cover every white screen scenario; for example, white screens caused by resource errors still need resource monitoring. No solution is perfect. I am only offering one line of thinking here, and everyone is welcome to discuss it.

Author: ES2049 / Takeshi Kaneshiro

The article can be reprinted at will, but please keep this link to the original text. You are very welcome to join ES2049 Studio if you are passionate. Please send your resume to caijun.hcj@alibaba-inc.com .
