MapReduce and Parallel DBMSs: Friends or Foes?(MapReduce和并行数据库,朋友还是敌人?)

Table of Contents

http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext

1. Parallel Database Systems

2. Mapping Parallel DBMSs onto MapReduce

3. Possible Applications

作者认为MapReduce更加适合的应用场景

  • ETL and "read once" data sets.
  • Complex analytics.
  • Semi-structured data.
  • Quick-and-dirty analyses.
  • Limited-budget operations. (并行计算在MR出来之前迟迟地停留在PDBMS领域的原因)
    • Another strength of MR systems is that most are open source projects available for free. DBMSs, and in particular parallel DBMSs, are expensive
    • though there are good single-node open source solutions, to the best of our knowledge, there are no robust, community-supported parallel DBMSs.
  • Powerful tools. 当做工具来使用

4. DBMs "Sweet Spot"

  • DBMSs ought to be good at analytical queries involving complex join operations (see the table). The DBMSs are a factor of 36 and 21 respectively faster than Hadoop. In general, query times for a typical user task fall somewhere in between these extremes. In the next section, we explore the reasons for these results. 结论是DBMS更加适合做复杂查询

5. Architectural Differences

  • Repetitive record parsing.
  • Compression
  • Pipelining push/pull
  • Scheduling
    • In a parallel DBMS, each node knows exactly what it must do and when it must do it according to the distributed query plan. Because the operations are known in advance, the system is able to optimize the execution plan to minimize data transmission between nodes. 对于DBMS来说因为查询方案是以及数据分布都是已知的,所以能够静态地分析以及优化查询计划,调度也是静态的。
    • In contrast, each task in an MR system is scheduled on processing nodes one storage block at a time. 而MapReduce则只是简单地分配,需要动态地进行调度。
    • Such runtime work scheduling at a granularity of storage blocks is much more expensive than the DBMS compile-time scheduling. 动态调度的开销相对大一些
    • The former has the advantage, as some have argued, of allowing the MR scheduler to adapt to workload skew and performance differences between nodes. 但是动态调度也可以解决数据倾斜不平衡的问题。
  • Column-oriented storage.

We generally expect ETL and complex analytics to be amenable to MR systems and query-intensive workloads to be run by DBMSs. MR更加适合做ETL和复杂分析工作,而DBMS更加适合复杂查询

6. Learning from Each Other

7. Conclusion