This is the February 23, 2018 episode of Network Collective, licensed under CC BY-NC-ND, reposted and translated with the permission of the copyright holder.
SHORT TAKE – BREAKING THINGS
Should you be breaking things intentionally on your network? The idea might sound preposterous to some, but in this short take Russ White explains how intentional failures can make your network more resilient.
This is Russ White from Network Collective, and this is a Network Collective short take. In this short take, I'll be talking about the importance of breaking things.
I often hear "enterprise" engineers say things like, "We can't risk breaking things on purpose. We can't risk downtime at all. It's really scary. Do you really want to go to a hospital where the network might break in the middle of your operation? Do you really want to keep your money in a bank where the network might break and lose your account?" There are actually two different ways to look at this problem.
But before we get into this topic more deeply, I must repeat an old mantra of mine: "There are no enterprises, there are no service providers, there are only problems and solutions." Part of the reason we get into these thinking ruts about breaking things, and about how things work and don't work, is that we want to divide the world into us vs. them. We want "them" to be different, or us to be different, in some way.
The first way of thinking about this problem is to try to build systems that do not fail. You think through every possible failure mode, you think about what each failure might look like, and you plan around this kind of failure and that kind of failure so it won't cause a network outage. This is actually the default mode of operation at most network operators.
There is a second way of thinking about this problem, however: we should break things on purpose. Hyperscalers break things on purpose all the time, in fact. It's fine if you're just running a social network though, right? Who cares if you have to upload the pictures of your dinner twice?
"We can't tolerate downtime, our data is too important, it's dangerous to cause downtime intentionally!" In fact, however, I would rather keep my money at a bank that broke its network on a regular basis than at one that tries to plan for every contingency, and I'd rather go to a hospital that broke its network on a regular basis. Why? Here's a quote from Philippe Kruchten that might be helpful: "...airline transportation did not become safer by accident but by accidents..."
- Philippe Kruchten
There are many articles on the importance of this in software engineering: "You can't think of everything in system design, you can only think of what you can think of, the rest must be discovered." This is why we do so many layers of software testing: smoke tests, then system tests, then component tests, and so on. Every one of those is an attempt to break the software before it's deployed in the real world. You can't know every possible reaction to a failure. This, of course, is a subset of the law of unintended consequences. You simply cannot think through every possible side effect; the rest must be discovered.
Finally, you cannot learn to troubleshoot your system without seeing it in a failure state. You can guess how or why you might troubleshoot something, or how you should troubleshoot it, and you can guess what information will be useful. Even so, the rest, again, must be discovered.
This is not a call for Chaos Monkey kinds of testing in every network or every software system. It is, however, a call to stop thinking of failure as being a bad thing. Breakages are important opportunities to learn, and sometimes it's a good thing to induce learning opportunities in a scaled, controlled way.
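As a rough illustration of inducing failures in a controlled way, here is a minimal Python sketch that fails one link at a time in a toy, emulated topology and checks whether every node is still reachable. The topology, the node names, and the helpers `reachable` and `single_link_failures` are all hypothetical, invented for this example; nothing here comes from the talk, and a real lab would use an actual emulation environment rather than a list of tuples.

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from start over undirected links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_link_failures(nodes, links):
    """Fail each link in turn; return the links whose loss partitions the network."""
    critical = []
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(remaining, nodes[0]) != set(nodes):
            critical.append(failed)
    return critical


# Hypothetical lab topology: a ring of four routers plus one single-homed edge node.
NODES = ["r1", "r2", "r3", "r4", "edge1"]
LINKS = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1"), ("r4", "edge1")]

critical_links = single_link_failures(NODES, LINKS)
print(critical_links)  # [('r4', 'edge1')]
```

In this toy case the ring survives any single link failure, while the single-homed edge node does not. A model like this only finds what you already thought to model; the point of breaking a real or emulated network is to discover the rest.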
Find ways to seal off systems or parts of a network, find ways to emulate large versions of the network in the most realistic way possible, then break things and learn. Failure, then, is not an option in any network: it's required to learn, and it's required to build better systems.
That's it for this time. We'll see you next time at the Network Collective.