Anyone who works in security will know the well-known PDR network security model, proposed by the US company Internet Security Systems (ISS). Its core assertion is that network security is a matter of time, expressed by the formula
Et = Dt + Rt - Pt
where:
- Et (Exposure): exposure time, the time during which the system is exposed to the attack;
- Pt (Prevent): defense time, how long the system can withstand an external attack, i.e. the total time an attacker needs to penetrate it successfully;
- Dt (Detect): detection time, the time the security detection system takes to discover the attack;
- Rt (Response): response time, the total time from discovering the attack to cutting off the attack path and stopping the attack.
The PDR model is intuitive and easy to understand, and it provides a practical guiding framework for security work. Security incidents are only one of the many kinds of failure a system can suffer. Since the PDR model can guide the handling of security problems, can it also guide the handling of other failures? I think so.
1 What is the PDR failure model?
By analogy with the PDR model, let's first look at the life cycle of a failure.
The figure above makes it clear that, to shorten the failure time (Failure Time), we need to shorten the detection time (Detect Time) and the response time (Response Time) as much as possible, while extending the defense time (Prevent Time). Shortening the detection time means improving monitoring and alerting; shortening the response time means improving fault repair and CI/CD capabilities; extending the defense time means improving the fault tolerance, or robustness, of the system. The most interesting of the three is the defense time. If we can extend it far enough (beyond the sum of the detection time and the response time), the failure never gets the chance to cause any actual impact, which is equivalent to "killing" the failure.
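The relationship described above can be sketched in a few lines of Python. This is only an illustration: it assumes the failure time follows the same shape as the PDR formula (detection plus response minus defense) and clamps it at zero. The function name `failure_time` and all the minute values are made up for the example.

```python
def failure_time(detect: float, respond: float, prevent: float) -> float:
    """Minutes during which a failure actually has an impact.

    Assumption: the failure analogue of Et = Dt + Rt - Pt, clamped at
    zero. If the system holds out longer than detection plus response,
    the failure never becomes visible.
    """
    return max(0.0, float(detect + respond - prevent))


# Hypothetical baseline: 10 min to detect, 30 min to repair,
# and the system tolerates the fault for 5 min.
baseline = failure_time(detect=10, respond=30, prevent=5)

# Each lever from the text applied on its own (numbers are invented).
better_monitoring = failure_time(detect=2, respond=30, prevent=5)
faster_repair = failure_time(detect=10, respond=10, prevent=5)
more_tolerance = failure_time(detect=10, respond=30, prevent=45)

print(baseline, better_monitoring, faster_repair, more_tolerance)
```

In the last case the defense time exceeds the sum of the other two, so the computed failure time drops to zero, which is the "killed" failure described above.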
2 Fire prevention is better than fire extinguishing
Marquis Wen of Wei asked: 'Of you three brothers, who is the best doctor?' Bian Que said: 'My eldest brother is the best, my second brother comes next, and I, Bian Que, am the worst.'
-- Heguanzi, Volume 2, Chapter 16 (Shixian)
Among the three brothers, Bian Que is strong at Rt, the second brother is strong at Dt, and the eldest brother is strong at Pt, but this Pt is not the same as the Pt above. In the PDR model above, Pt is the ability to hold out after a failure has occurred, whereas the eldest brother's Pt is the ability to defend before the failure (the disease) occurs, that is, the ability to prevent it.
For various reasons, the defense time of a failure is in most cases shorter than the sum of the detection time and the response time, so once the system fails there will inevitably be some real impact. Is there a way to avoid that impact? Yes: learn from Bian Que's eldest brother, because fire prevention is better than fire fighting. How do we prevent trouble before it happens? As the saying goes, with history as a mirror we can understand rise and fall; for failures, that mirror is the failure review (postmortem).
Failure review is extremely important, so important that most people underestimate it. Individuals and large companies alike improve most by drawing lessons from failures large and small, learning things that no book teaches; doing so is also a hallmark of a growing team. There is plenty of material on failure reviews online, so here I only want to stress three points.
First, the sooner the review is held, the better (after the failure has been properly handled, of course). While handling the failure, preserve the failure scene as far as possible and back up the process data for the later review.
Second, throughout the review, focus on the issue, not the person: start from the facts and argue with facts, so that the real root cause is found and effective improvement measures follow from it.
Third, every improvement measure should have a single owner and a priority. High-priority measures should also have a deadline for closing them out (for example, one month).
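The third point can be checked mechanically. The sketch below is one possible way to do it, not a prescribed format: the `ActionItem` fields, the three-level priority scale, and the 30-day default are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: str                  # should name exactly one person
    priority: str               # "high", "medium" or "low" (hypothetical scale)
    due: Optional[date] = None  # deadline for closing the item out


def validate(items: List[ActionItem], opened: date, max_days: int = 30) -> List[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    for item in items:
        # A blank owner or a comma-separated list is not a single owner.
        if not item.owner.strip() or "," in item.owner:
            problems.append(f"{item.title}: needs exactly one owner")
        if item.priority == "high":
            if item.due is None:
                problems.append(f"{item.title}: high priority but no deadline")
            elif item.due > opened + timedelta(days=max_days):
                problems.append(f"{item.title}: deadline exceeds {max_days} days")
    return problems


# Invented example data from an imaginary review held on 2024-01-01.
opened = date(2024, 1, 1)
items = [
    ActionItem("Add disk-usage alert", "alice", "high", date(2024, 1, 20)),
    ActionItem("Document failover steps", "bob, carol", "medium"),
    ActionItem("Add retry to upload client", "dave", "high"),
]
problems = validate(items, opened)
for problem in problems:
    print(problem)
```

Here the second item is flagged for having two owners and the third for being high priority without a deadline; the first item passes.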
Finally, I recommend Mr. Chen Hao's column (see the references at the end of the article). I have also included a failure review report template in the appendix, which I hope you can use in your daily work.
3 Summary
In this article I first proposed a PDR model for failure handling, and then offered some principles for failure reviews; I hope they help. You are welcome to leave a message on my message board and exchange ideas with everyone.
Appendix
Failure review report template
References
- Listening to the Wind in the Left Ear - Troubleshooting Best Practices: Responding to Failures
- Listening to the Wind in the Left Ear - Troubleshooting Best Practices: Troubleshooting Improvements
- Yu Sheng thinks - Talking about "Black Box Thinking"