Getting Real About Distributed System Reliability

http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability

A very good article on how hard it is to build a reliable distributed system and how much manpower and money it takes. And it is not just building them: using these distributed systems also brings plenty of problems and consumes a lot of development and operations effort. So for small and medium-sized companies, sticking with the most boring, battle-tested technology, or using a dependable hosted service such as AWS, may well be the better choice.

It’s true that, if properly designed, these systems can be run with no planned downtime or maintenance intervals in a way that traditional storage systems make harder. It’s also true that software that is explicitly designed to deal with machine failures is a very different thing from traditional infrastructure. … But one thing you often hear is that this kind of software is more reliable than the traditional alternatives it replaces, and this just isn’t true. It is time people talked honestly about this. (Distributed systems are not necessarily more reliable than the traditional systems they replace.)

We usually assume that failures of the individual components in a distributed system are independent of one another, but in practice they are not. The production incident rates Google reports for Bigtable/GFS are far higher than a P^N-style independence calculation would predict. It is these empirically observed failure and reliability numbers that reflect reality.

P^N is an upper bound on reliability but one that you could never, never approach in practice. For example Google has a fantastic paper that gives empirical numbers on system failures in Bigtable and GFS and reports empirical data on groups of failures that show rates several orders of magnitude higher than the independence assumption would predict. This is what one of the best system and operations teams in the world can get: your numbers may be far worse.
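
To make the gap concrete, here is a back-of-the-envelope illustration (my own, with invented numbers rather than figures from the Google paper): under the independence assumption the chance of all replicas being down at once is vanishingly small, but even modest correlation wipes out most of that margin.

    # Illustrative only -- hypothetical numbers, not data from the paper or the post.
    p = 0.01   # assumed probability that any one node is down at a given moment
    N = 3      # assumed replication factor

    # Independence assumption: all N replicas are down together with probability p**N.
    independent = p ** N                      # 1e-06

    # In practice failures cluster (shared racks, switches, power, bad deploys),
    # so the observed rate of correlated failures can be orders of magnitude higher.
    correlation_factor = 1000                 # hypothetical, purely for illustration
    observed = independent * correlation_factor

    print(f"independence assumption: {independent:.0e}")   # 1e-06
    print(f"with correlated failures: {observed:.0e}")     # 1e-03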

The real reliability of a system depends largely on how many bugs the implementation has, how good the monitoring is, and how quickly problems can be isolated when they appear. Real reliability takes long-term practice and investment.

The actual reliability of your system depends largely on how bug free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn’t any different from traditional systems, except that the new software is far less mature. I don’t mean this disparagingly, I work in this area, it is just a fact. Maturity comes with time and usage and effort. This software hasn’t been around for as long as MySQL or Oracle, and worse, the expertise to run it reliably is much less common. MySQL and Oracle administrators are plentiful, but folks with, say, serious production ZooKeeper operations experience are much rarer.

Kernel filesystem engineers say it takes about a decade for a new filesystem to go from concept to maturity. I am not sure these systems will be mature much faster—they are not easier systems to build and the fundamental design space is much less well explored. This doesn’t mean they won’t be useful sooner, especially in domains where they solve a pressing need and are approached with an appropriate amount of caution, but they are not yet by any means a mature technology.

Compared with a distributed system, single-server code is much easier to test and debug. If you ever get the chance, look at how Oracle approaches testing its software; by that standard, distributed systems deserve even more rigorous testing. Distributed systems also involve a large number of machines, so operating all of those machines takes serious effort as well.

Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. Code that deals with failure cases and is “cluster aware” is extremely tricky to get right. The root of the problem is that dealing with failures effectively explodes the possible state space that needs testing and validation.

For example it doesn’t even make sense to expect a single-node database to be fast if its disk system suddenly gets really slow (how could it), but a distributed system does need to carry on in the presence of a single degraded machine because it has so many machines that one is sure to be degraded. These kinds of “semi-failures” are common and very hard to deal with. Correctly testing these kinds of issues in a realistic setting is brutally hard and the newer generation of software doesn’t have anything like the QA processes its more mature predecessors had. (If you get a chance get someone who has worked at Oracle to describe to you what kind of testing they do to a line of code that goes into their database before it gets shipped to customers). As a result there are a lot of bugs. And of course these bugs are on all the machines, so they absolutely happen together.
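
As a rough sketch of why such code is tricky (my own illustration, not code from the post or any particular system), consider a cluster-aware read that has to cope with a replica that is not down, just very slow; every timeout and error branch is another state that has to be tested.

    import concurrent.futures
    import random
    import time

    REPLICAS = ["node-a", "node-b", "node-c"]       # hypothetical replica names

    def read_from(replica, key):
        # Stand-in for a real RPC: node-c is "semi-failed" -- not down, just slow.
        time.sleep(2.0 if replica == "node-c" else random.uniform(0.01, 0.05))
        return f"{key}@{replica}"

    def cluster_read(key, timeout=0.5):
        # Ask every replica in parallel and take the first usable answer, so a
        # single degraded machine does not stall the request. Each branch here
        # (slow replica, failed replica, overall timeout) multiplies the states
        # that need testing -- the state-space explosion in miniature.
        pool = concurrent.futures.ThreadPoolExecutor(len(REPLICAS))
        futures = [pool.submit(read_from, r, key) for r in REPLICAS]
        try:
            for done in concurrent.futures.as_completed(futures, timeout=timeout):
                try:
                    return done.result()
                except Exception:
                    continue                  # one replica errored; try the next
            raise TimeoutError(f"no replica answered for {key!r}")
        finally:
            pool.shutdown(wait=False)         # do not wait for the slow straggler

    print(cluster_read("user:42"))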

Likewise distributed systems typically require more configuration and more complex configuration because they need to be cluster aware, deal with timeouts, etc. This configuration is, of course, shared; and this creates yet another opportunity to bring everything to its knees.
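
A hypothetical illustration of that point (the setting names below are invented, not taken from any particular system): every node reads the same cluster-wide settings, so a single bad value rolled out everywhere degrades all nodes at the same time.

    # Hypothetical shared, cluster-wide settings that every node reads. One bad
    # value pushed everywhere -- say an overly aggressive request timeout -- makes
    # every node start failing at once, which is exactly the correlated failure
    # mode that shared configuration creates.
    SHARED_CLUSTER_CONFIG = {
        "seed_nodes": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        "replication_factor": 3,
        "request_timeout_ms": 200,            # too low: every node times out together
        "failure_detector_threshold_s": 10,
        "max_in_flight_requests": 1024,
    }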

And finally these systems usually want lots of machines. And no matter how good you are, some part of operational difficulty always scales with the number of machines.

So I now see clearly that the core difficulty with these distributed systems is operations, not architecture or design. Good operations can make flawed software run well, but the reverse is not true. What I hear from many NoSQL hypesters, by contrast, is how reliable and self-healing their software is.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important, but good operations can often work around the limitations of bad (or incomplete) software, whereas good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.

These difficulties are one of the core barriers to adoption for distributed data infrastructure. LinkedIn and other companies that have a deep interest in doing creative things with data have taken on the burden of building this kind of expertise in-house—we employ committers on Hadoop and other open source projects on both our engineering and operations teams, and have done a lot of from-scratch development in this space where there were gaps. This makes it feasible to take full advantage of an admittedly valuable but immature set of technologies, and lets us build products we couldn’t otherwise—but this kind of investment only makes sense at a certain size and scale. It may be too high a cost for small startups or companies outside the web space trying to bootstrap this kind of knowledge inside a more traditional IT organization.

This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.

The real, empirically measured reliability of these distributed systems needs to be disclosed.

I would love to see claims in academic publications around practicality or reliability justified in the same way we justify performance claims: by doing it. I would be a lot more likely to believe an academic distributed system was practically feasible if it had been run continuously under load for a year, successfully, and if information were reported on failures and outages. Maybe that isn’t feasible for an academic project, but few other allegedly scientific disciplines get away with making claims about reality without evidence.

Likewise if you use a “NoSQL vendor” I think it is reasonable to ask them to provide hard information on customer outages. They don’t need to tell you who the customer is, but they should let you know the observed real-life distribution of MTTF and MTTR they are able to achieve, not just highlight one or two happy cases. Make sure you understand how they measure this: do they run an automated test load continuously, or do they just wait for people to complain? This is a reasonable thing for people paying for a service to ask for. To a certain extent, if they provide this kind of empirical data it isn’t clear why you should even care what their architecture is, beyond intellectual curiosity.
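
As a sketch of what that empirical data could look like (my own illustration with made-up outage times, not anything a real vendor has published), observed MTTR and MTTF are straightforward to compute once you have an honest outage log for a fixed observation window:

    from datetime import datetime, timedelta

    # Hypothetical outage log for one production cluster over one year.
    outages = [
        (datetime(2012, 1, 14, 3, 10), datetime(2012, 1, 14, 4, 5)),
        (datetime(2012, 4, 2, 22, 40), datetime(2012, 4, 3, 1, 15)),
        (datetime(2012, 9, 30, 12, 0), datetime(2012, 9, 30, 12, 20)),
    ]
    window = timedelta(days=365)

    downtime = sum((end - start for start, end in outages), timedelta())
    mttr = downtime / len(outages)                # mean time to repair
    mttf = (window - downtime) / len(outages)     # mean time to failure (approximate)
    availability = (window - downtime) / window

    print(f"MTTR: {mttr}  MTTF: {mttf}  availability: {availability:.4%}")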