Possible Hadoop Trajectories

http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext

Computation

In summary, we see the following steps in Hadoop adoption in the computation arena. # On the computation side, adopting Hadoop/MapReduce goes through the following four steps.

Step 1: Adopt Hadoop for pilot projects.
Step 2: Scale Hadoop to production use.
Step 3: Hit the wall, as the above problems become big issues.
Step 4: Morph to something that deals with our issues. # On the computation side, users will migrate to other models.

At Lincoln Labs we have projects at all four steps. Survival of Hadoop in our environment will require major surgery to the parallel computation model, complementing the current Hadoop work on the task scheduler. Our expectation is that solving these issues will make current Hadoop unrecognizable in future systems. It is possible that other shops have a job mix that is more aligned with the current MapReduce framework. However, our expectation is that we are more the norm than the exception. The evolution of Google away from MapReduce to other models lends credence to this supposition. Hence, we fully expect a dramatic future evolution of the Hadoop computation framework. # In the article, Lincoln Labs mainly uses Hadoop for iterative computation. Hadoop needs to improve its parallel computation model, and Google has in fact already started migrating from MapReduce to other computation models.

Data Management

Some of us wrote a paper in 2009 comparing parallel DBMS technology with Hadoop. In round numbers DBMSs are faster by 1-2 orders of magnitude. This performance advantage comes from indexing the data, making sure that queries are always sent to the nodes where data resides and not the other way around, superior compression, and superior protocols between worker nodes. As near as we can tell, the situation in 2012 is about the same as 2009; Hadoop is still 1-2 orders of magnitude off the mark. Anecdotal evidence abounds. For example, one large Web property has a 5 Pbyte Hadoop cluster deployed on 2700 nodes; a second has a 5 Pbyte instance supported by a commercial DBMS. It uses 200 nodes, a factor of 13 less. In summary, MapReduce is an internal interface in a parallel DBMS, and one that is not well suited to the needs of a DBMS. # The author argues that MapReduce is merely one implementation of a DBMS's internal engine, and not a particularly good one: it lacks indexing, data locality, good compression, and efficient communication protocols.
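
The node-count comparison in the anecdote above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the 5 Pbyte, 2700-node, and 200-node figures come from the quoted text, while the decimal conversion (1 Pbyte = 1000 Tbyte) and the helper name `tb_per_node` are assumptions made for illustration. It says nothing about replication, compression, or hardware differences between the two clusters.

```python
# Back-of-the-envelope check of the cluster anecdote quoted above.
# Assumption: decimal units, 1 Pbyte = 1000 Tbyte.
PB_IN_TB = 1000


def tb_per_node(total_pb: float, nodes: int) -> float:
    """Average data volume per node in Tbyte (hypothetical helper)."""
    return total_pb * PB_IN_TB / nodes


data_pb = 5          # both deployments hold about 5 Pbyte
hadoop_nodes = 2700  # Hadoop cluster size from the text
dbms_nodes = 200     # commercial parallel DBMS size from the text

node_ratio = hadoop_nodes / dbms_nodes
hadoop_density = tb_per_node(data_pb, hadoop_nodes)
dbms_density = tb_per_node(data_pb, dbms_nodes)

print(f"node ratio: {node_ratio:.1f}x")
print(f"Hadoop: {hadoop_density:.2f} Tbyte per node")
print(f"DBMS:   {dbms_density:.2f} Tbyte per node")
```

The exact ratio is 13.5, which the article rounds to "a factor of 13". Put the other way around, each DBMS node in this anecdote serves about 25 Tbyte against roughly 1.85 Tbyte per Hadoop node.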

Therefore, we see the following trajectory in Hadoop data management:

Step 1: Adopt Hadoop for pilot projects.
Step 2: Scale Hadoop to production use.
Step 3: Observe an unacceptable performance penalty.
Step 4: Morph to a real parallel DBMS. # On the data management side, users will migrate to a parallel DBMS (PDBMS).