On the Importance of (Purposefully) Breaking the Network

Updated 2018-02-25, about 12 minutes

This is a Network Collective episode from February 23, 2018, licensed under CC BY-NC-ND, reposted and translated with the copyright holder's permission.

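The original exists only as video. I transcribed the spoken text by ear and then translated it; a few words were hard to make out, so corrections from readers are welcome.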

SHORT TAKE – BREAKING THINGS

Should you be breaking things intentionally on your network? The idea might sound preposterous to some, but in this short take Russ White explains how intentional failures can make your network more resilient.

This is Russ White from Network Collective, and this is a Network Collective Short Take. In this short take, I'll be talking about "the importance of breaking things."

I often hear "enterprise" engineers say things like "We can't risk breaking things on purpose, we can't risk downtime at all. It's really scary." Do you really want to go to a hospital where the network might break in the middle of your operation? Do you really want to keep your money in a bank where the network might break and lose your account? There are actually two different ways to look at this problem.

But before we get into this topic more deeply, I must repeat an old mantra of mine: "There are no enterprises, there are no service providers, there are only problems and solutions." Part of the reason we get into these thinking ruts about breaking things, and about how things work and don't work, is that we want to divide the world into us vs. them. We want "them" to be different, or us to be different, in some way.

The first way of thinking about this problem is to try to build systems that do not fail. You think through every possible failure mode, you think about what each failure might look like and what it results in, and you plan around this kind of failure and that kind of failure so it won't cause a network outage. This is actually the default mode of operation at most network operators.

There is a second way of thinking about this problem, however: we should break things on purpose. Hyperscalers break things on purpose all the time, in fact. It's fine if you're just running a social network though, right? Who cares if you have to upload the pictures of your dinner twice?

"We can't tolerate downtime, our data is too important, it's dangerous to have downtime intentionally!" In fact, however, I would rather keep my money at a bank that broke their network on a regular basis than at one that tries to plan for every contingency, and I'd rather go to a hospital that broke their network on a regular basis. Why? Here's a quote from Philippe Kruchten that might be helpful:

...airline transportation did not become safer by accident but by accidents...
- Philippe Kruchten

There are so many articles on the importance of software engineering: "You can't think of everything in system design, you can only think of what you can think of, the rest must be discovered." This is why we do so many layers of software testing. You'll find software testing goes through a smoke test, then a system test, then a component test, and all these things. Every one of those is an attempt to break the software before it's deployed in the real world. You can't know every possible reaction to a failure. This, of course, is a subset of the law of unintended consequences. You simply cannot think through every possible side effect; the rest must be discovered.

Finally, you cannot learn to troubleshoot your system without seeing it in a failure state. You can guess how or why you might troubleshoot something, or how you should troubleshoot it. You can guess what information will be useful. Even so, the rest, again, must be discovered.

This is not a call for Chaos Monkey kinds of testing in every network or every software system. It is, however, a call to stop thinking of failure as being a bad thing. Breakages are important opportunities to learn, and sometimes it's a good thing to induce learning opportunities in a scaled, controlled way.

Find ways to seal off systems or parts of a network, and find ways to emulate large versions of the network as realistically as possible. Then break things and learn. Failure, then, is not an option in any network: it's required to learn, and it's required to build better systems.
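As a toy illustration of "emulate the network, then break things and learn," here is a minimal sketch in Python. It is not from the talk: the topology, the node names, and the helper functions `reachable` and `single_link_failures` are all hypothetical, and a real lab would use an actual emulator rather than a graph in memory. The sketch fails each link in turn and records which nodes lose reachability from a chosen root.

```python
from collections import deque

# Hypothetical lab topology (an assumption for illustration):
# each pair is one bidirectional link.
LINKS = [
    ("core1", "core2"),
    ("core1", "edge1"),
    ("core2", "edge1"),
    ("core2", "edge2"),
]

def reachable(links, start):
    """Return the set of nodes reachable from start using breadth-first search."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def single_link_failures(links):
    """Fail each link in turn and map it to the nodes cut off from the root.

    The root is simply the alphabetically first node (an arbitrary choice).
    """
    nodes = sorted({n for link in links for n in link})
    root = nodes[0]
    report = {}
    for failed in links:
        survivors = [link for link in links if link != failed]
        lost = set(nodes) - reachable(survivors, root)
        report[failed] = sorted(lost)
    return report

if __name__ == "__main__":
    for link, lost in single_link_failures(LINKS).items():
        print(link, "->", lost or "no loss of reachability")
```

In this made-up topology only the `core2`-`edge2` link is a single point of failure. That is the kind of finding a deliberate break is meant to surface before a real outage does.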

That's it for this time. We'll see you next time at the Network Collective.
