MapReduce: A Flexible Data Processing Tool

In response to the analyses of MapReduce in the following articles:

the authors clarify the following issues:

and put forward the following points:

1. Heterogeneous Systems

  • Many production environments contain a mix of storage systems.
  • A single MapReduce operation easily processes and combines data from a variety of storage systems.
  • Now consider a system in which a parallel DBMS is used to perform all data analysis. This is much less convenient.
    • The input to such analysis must first be copied into the parallel DBMS. This loading phase is inconvenient. It may also be unacceptably slow, especially if the data will be analyzed only once or twice after being loaded.
    • Even if the cost of loading the input into a parallel DBMS is acceptable, we still need an appropriate loading tool. Here is another place MapReduce can be used; instead of writing a custom loader with its own ad hoc parallelization and fault-tolerance support, a simple MapReduce program can be written to load the data into the parallel DBMS.
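
The point about combining data from several storage systems in one operation can be sketched as follows. This is a minimal single-process illustration, not Google's implementation: the two readers, the field names (`user`/`bytes`, `uid`/`size`), and the `map_reduce` helper are all hypothetical, and CSV and JSON-lines strings merely stand in for two different storage systems.

```python
import csv
import io
import json
from collections import defaultdict


def read_csv_source(text):
    """Reader for a hypothetical CSV-backed store: yields (user, bytes) pairs."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row["user"], int(row["bytes"])


def read_jsonl_source(text):
    """Reader for a hypothetical JSON-lines store that uses different field names."""
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            yield rec["uid"], rec["size"]


def map_reduce(readers, map_fn, reduce_fn):
    """Single-process stand-in for one MapReduce run over several input readers."""
    groups = defaultdict(list)
    for reader in readers:
        for record in reader:
            for key, value in map_fn(record):
                groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


CSV_DATA = "user,bytes\nalice,10\nbob,5\n"
JSONL_DATA = '{"uid": "alice", "size": 7}\n{"uid": "carol", "size": 3}\n'

# Both sources feed the same map and reduce functions; no prior load step is needed.
totals = map_reduce(
    [read_csv_source(CSV_DATA), read_jsonl_source(JSONL_DATA)],
    map_fn=lambda record: [record],
    reduce_fn=lambda key, values: sum(values),
)
print(totals)  # {'alice': 17, 'bob': 5, 'carol': 3}
```

Each reader only has to turn its own storage format into records, so adding another system means adding another reader rather than copying the data into a single store first.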

2. Indices

In fact, MapReduce can make use of indices.
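
One way an index could help is by narrowing the input before any map task runs. The sketch below assumes, purely for illustration, that records are sorted by a date key so that a binary search can act as the index; the data, `select_range`, and `total` are hypothetical names, not part of any MapReduce API.

```python
import bisect

# Hypothetical input: records sorted by date key, as an index or sorted file would provide.
RECORDS = [
    ("2010-01-01", 3),
    ("2010-01-02", 5),
    ("2010-01-03", 2),
    ("2010-01-04", 7),
    ("2010-01-05", 1),
]
KEYS = [key for key, _ in RECORDS]


def select_range(start, end):
    """Binary-search the sorted keys and return only the records in [start, end]."""
    lo = bisect.bisect_left(KEYS, start)
    hi = bisect.bisect_right(KEYS, end)
    return RECORDS[lo:hi]


def total(records):
    """Trivial map + reduce: emit each value and sum them."""
    return sum(value for _, value in records)


# Only the selected slice is handed to the job; the other records are never read.
selected = select_range("2010-01-02", "2010-01-04")
print(len(selected), total(selected))  # 3 14
```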

3. Complex Functions

Complex procedures are hard to express in SQL.

4. Structured Data and Schemas

Protocol Buffers provide structured data and schemas.

5. Fault Tolerance

This section directly answers why MapReduce uses a pull model rather than a push model.

  • The MapReduce implementation uses a pull model for moving data between mappers and reducers, as opposed to a push model where mappers write directly to reducers.
    • Pavlo et al. correctly pointed out that the pull model can result in the creation of many small files and many disk seeks to move data between mappers and reducers.
    • Implementation tricks like batching, sorting, and grouping of intermediate data and smart scheduling of reads are used by Google's MapReduce implementation to mitigate these costs.
  • MapReduce implementations tend not to use a push model due to the fault-tolerance properties required by Google's developers.
    • Most MapReduce executions over large data sets encounter at least a few failures, starting with hardware and software problems.
    • Apart from those, Google's cluster scheduling system can preempt MapReduce tasks by killing them to make room for higher-priority tasks.
    • In a push model, failure of a reducer would force re-execution of all Map tasks.
  • We suspect that as data sets grow larger, analyses will require more computation, and fault tolerance will become more important. For large-scale systems, fault tolerance is arguably the most important factor.
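
The fault-tolerance argument can be made concrete with a toy simulation. This is a simplified sketch under stated assumptions, not the real system: `Job`, `run_pull`, and `run_push` are hypothetical names, "disk" is just a Python list, and exactly one reducer is assumed to fail once. It counts how many map task executions each model needs for the same word-count job.

```python
from collections import defaultdict


class Job:
    """Word-count job that counts how many map task executions it performs."""

    def __init__(self, splits, num_reducers):
        self.splits = splits
        self.num_reducers = num_reducers
        self.map_runs = 0

    def run_map(self, split):
        """Run one map task; return its output partitioned by reducer index."""
        self.map_runs += 1
        parts = defaultdict(list)
        for word in split.split():
            # Deterministic partitioner (the built-in hash() is salted per process).
            parts[sum(word.encode()) % self.num_reducers].append((word, 1))
        return dict(parts)


def reduce_partition(pairs):
    """Sum the counts for each word in one reducer's partition."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_pull(job, failing_reducer):
    """Pull model: map output is stored once; a restarted reducer re-fetches it."""
    stored = [job.run_map(s) for s in job.splits]  # stands in for mapper-local files
    results = {}
    for r in range(job.num_reducers):
        attempts = 2 if r == failing_reducer else 1  # first attempt is killed
        for _ in range(attempts):
            pairs = [p for out in stored for p in out.get(r, [])]
            results[r] = reduce_partition(pairs)
    return results


def run_push(job, failing_reducer):
    """Push model: map output lives only in reducer memory, so a failure loses it."""

    def push_all():
        inbox = defaultdict(list)
        for s in job.splits:
            for r, pairs in job.run_map(s).items():
                inbox[r].extend(pairs)
        return inbox

    inbox = push_all()
    inbox.pop(failing_reducer, None)  # the failed reducer's input is gone
    inbox[failing_reducer] = push_all()[failing_reducer]  # every map task reruns
    return {r: reduce_partition(inbox[r]) for r in range(job.num_reducers)}


SPLITS = ["a b a", "b c", "c c a"]
pull_job, push_job = Job(SPLITS, 2), Job(SPLITS, 2)
pull_out = run_pull(pull_job, failing_reducer=0)
push_out = run_push(push_job, failing_reducer=0)
print(pull_job.map_runs, push_job.map_runs)  # 3 6
```

Both runs produce the same counts, but the push variant executes every map task twice. With thousands of map tasks and routine preemption, that re-execution cost is what the pull model avoids.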

6. Performance

  • Engineering considerations
    • Startup overhead and sequential scanning speed are indicators of maturity of implementation and engineering tradeoffs, not fundamental differences in programming models.
    • Startup overhead can be addressed with daemons.
    • Sequential scanning speed can be addressed with Protocol Buffers.
  • Reading unnecessary data. This can be addressed with indices.
  • Merging results. Merging the results is entirely unnecessary.
  • Data loading.

7. Conclusion