MapReduce: A Flexible Data Processing Tool
References
- http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- http://duanple.blog.163.com/blog/static/7097176720119711038980/
Regarding the analysis of MapReduce in the following articles:
- MapReduce: A major step backwards
- MapReduce: A major step backwards-ii
- A Comparison of Approaches to Large-Scale Data Analysis
the authors clarify the following misconceptions:
- MapReduce cannot use indices and implies a full scan of all input data;
- MapReduce inputs and outputs are always simple files in a file system; and
- MapReduce requires the use of inefficient textual data formats.
and make the following points:
- MapReduce is storage-system independent and can process data without first requiring it to be loaded into a database. In many cases, it is possible to run 50 or more separate MapReduce analyses in complete passes over the data before it is possible to load the data into a database and complete a single analysis;
- Complicated transformations are often easier to express in MapReduce than in SQL; and
- Many conclusions in the comparison paper were based on implementation and evaluation shortcomings not fundamental to the MapReduce model; these shortcomings are discussed later in the article, and each has a corresponding remedy.
1. Heterogeneous Systems
- Many production environments contain a mix of storage systems.
- A single MapReduce operation easily processes and combines data from a variety of storage systems.
- Now consider a system in which a parallel DBMS is used to perform all data analysis; things are far less convenient there.
- The input to such analysis must first be copied into the parallel DBMS. This loading phase is inconvenient. It may also be unacceptably slow, especially if the data will be analyzed only once or twice after being loaded.
- Even if the cost of loading the input into a parallel DBMS is acceptable, we still need an appropriate loading tool. Here is another place MapReduce can be used; instead of writing a custom loader with its own ad hoc parallelization and fault-tolerance support, a simple MapReduce program can be written to load the data into the parallel DBMS. A minimal sketch of this idea follows the list.
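A minimal sketch of the loading idea, with toy data and sqlite3 standing in for the parallel DBMS (the file contents, table schema, and `map_phase`/`reduce_phase` helpers are illustrative assumptions, not from the paper): the map side parses raw records where they already live, and each reduce partition is bulk-inserted into the database.

```python
import sqlite3

# Toy stand-in for "MapReduce as a parallel loader": the map phase parses raw
# text records in place, and the reduce phase batch-inserts each partition
# into the DBMS (sqlite3 here; a real job would give each reducer its own
# connection to one node of the parallel DBMS).

RAW_LOGS = [
    "2009-03-01 /index.html 200",
    "2009-03-01 /about.html 404",
    "2009-03-02 /index.html 200",
]

def map_phase(lines):
    """Parse a raw line into (partition_key, row) pairs -- no pre-loading step."""
    for line in lines:
        day, url, status = line.split()
        yield day, (day, url, int(status))

def reduce_phase(rows, conn):
    """Each reducer bulk-loads its partition into the target database."""
    conn.executemany("INSERT INTO hits(day, url, status) VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits(day TEXT, url TEXT, status INTEGER)")

# Group map output by partition key, then run the reducers.
partitions = {}
for key, row in map_phase(RAW_LOGS):
    partitions.setdefault(key, []).append(row)
for rows in partitions.values():
    reduce_phase(rows, conn)

print(conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0])  # -> 3
```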
2. Indices
In fact, MapReduce can make use of indices: the input to a job need not be a full scan, but can itself be the result of an index lookup, for example selecting only the input files whose names fall in the requested range (see the sketch below).
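A hedged illustration of one way an index can be exploited (the date-in-filename convention is an assumption for this example, not from the paper): when input files encode a date, a date-range job only reads the matching files instead of scanning everything.

```python
import re

# Treat date-stamped file names as a coarse index: only files inside the
# requested range are handed to the MapReduce job at all.

ALL_INPUT_FILES = [
    "logs/requests-20090301.rec",
    "logs/requests-20090302.rec",
    "logs/requests-20090415.rec",
]

def select_inputs(files, start, end):
    """Keep only files whose embedded YYYYMMDD date falls inside [start, end]."""
    selected = []
    for name in files:
        m = re.search(r"(\d{8})", name)
        if m and start <= m.group(1) <= end:
            selected.append(name)
    return selected

# Only the March files would be given to the job.
print(select_inputs(ALL_INPUT_FILES, "20090301", "20090331"))
```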
3. Complex Functions
Complex processing logic that is hard to write in SQL is straightforward to express as ordinary map and reduce functions in a general-purpose language; a small example follows.
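A small sketch of the kind of transformation that is awkward in plain SQL but trivial as map/reduce code (the link-extraction task and helper names are illustrative assumptions): parse HTML documents, emit every outgoing link, and count inbound links per target.

```python
import re
from collections import defaultdict

DOCS = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
}

def map_links(doc_name, html):
    """Map: emit (target_url, 1) for every outgoing link in one document."""
    for target in re.findall(r'href="([^"]+)"', html):
        yield target, 1

def reduce_counts(pairs):
    """Reduce: sum the counts per target URL."""
    counts = defaultdict(int)
    for target, n in pairs:
        counts[target] += n
    return dict(counts)

pairs = [kv for name, html in DOCS.items() for kv in map_links(name, html)]
print(reduce_counts(pairs))  # -> {'b.html': 1, 'c.html': 2}
```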
4. Structured Data and Schemas
Protocol Buffers provide an efficient, schema-described binary format for structured data, so MapReduce is not restricted to inefficient textual formats.
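A hedged sketch of the underlying point, using Python's struct module as a stand-in for Protocol Buffers (which is what the paper actually describes): the same record encoded as text versus as a schema-described binary layout that is smaller and avoids per-field string parsing.

```python
import struct

# The same log record as a textual line vs. a fixed binary schema:
# 4-byte date (YYYYMMDD), 2-byte status, 4-byte byte count, then the URL bytes.
text_record = "20090301\t/index.html\t200\t1234\n"

url = b"/index.html"
binary_record = struct.pack("<IHI", 20090301, 200, 1234) + url

print(len(text_record), len(binary_record))  # binary is smaller and fixed-layout

day, status, nbytes = struct.unpack_from("<IHI", binary_record)
print(day, status, nbytes, binary_record[10:].decode())
```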
5. Fault Tolerance
This section directly answers why MapReduce uses a pull model rather than a push model for moving data between mappers and reducers.
- The MapReduce implementation uses a pull model for moving data between mappers and reducers, as opposed to a push model where mappers write directly to reducers.
- Pavlo et al. correctly pointed out that the pull model can result in the creation of many small files and many disk seeks to move data between mappers and reducers.
- Implementation tricks like batching, sorting, and grouping of intermediate data and smart scheduling of reads are used by Google's MapReduce implementation to mitigate these costs.
- MapReduce implementations tend not to use a push model due to the fault-tolerance properties required by Google's developers.
- Most MapReduce executions over large data sets encounter at least a few failures; apart from hardware and software problems, Google's cluster scheduling system can preempt MapReduce tasks by killing them to make room for higher-priority tasks.
- In a push model, failure of a reducer would force re-execution of all Map tasks.
- We suspect that as data sets grow larger, analyses will require more computation, and fault tolerance will become more important; for large-scale systems it may well be the decisive factor. A toy sketch of the pull model's recovery advantage follows this list.
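A toy, single-process sketch (an analogy, not Google's implementation) of why the pull model tolerates reducer failure cheaply: each mapper materializes its output into one file per reduce partition, so a crashed reducer simply re-pulls those files; in a push model the same data would exist only in the failed reducer's memory, forcing every map task to run again.

```python
import os, tempfile

R = 2  # number of reducers / partitions
workdir = tempfile.mkdtemp()

def run_mapper(map_id, records):
    """Write this mapper's output into one local file per reduce partition."""
    for r in range(R):
        path = os.path.join(workdir, f"map{map_id}-part{r}")
        with open(path, "w") as f:
            for key, value in records:
                if hash(key) % R == r:
                    f.write(f"{key}\t{value}\n")

def fetch_partition(r, num_maps):
    """A reducer pulls its partition from every mapper's saved output."""
    rows = []
    for m in range(num_maps):
        with open(os.path.join(workdir, f"map{m}-part{r}")) as f:
            rows.extend(line.split("\t") for line in f.read().splitlines())
    return rows

run_mapper(0, [("cat", "1"), ("dog", "1")])
run_mapper(1, [("cat", "1")])

first_try = fetch_partition(0, num_maps=2)  # reducer 0 runs...
retry = fetch_partition(0, num_maps=2)      # ...crashes, and re-pulls the same data
assert first_try == retry                   # no map task had to be re-executed
```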
6. Performance
- Engineering considerations
- Startup overhead and sequential scanning speed are indicators of maturity of implementation and engineering tradeoffs, not fundamental differences in programming models.
- Startup overhead can be reduced by keeping worker processes resident as daemons, so a new MapReduce job reuses already-running workers (see the worker-pool sketch after this list).
- Slow sequential scanning is a consequence of inefficient textual formats and is addressed by Protocol Buffers (Section 4).
- Reading unnecessary data can be avoided by exploiting indices (Section 2).
- Merging results into a single output is often simply unnecessary, since the output of one MapReduce is frequently consumed directly by another MapReduce.
- Data loading: MapReduce requires no separate loading phase and can analyze the data in place, whereas the DBMS must first pay the load cost discussed in Section 1.
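A toy analogy for the worker-daemon point (not Google's code): a process pool created once and reused across several jobs pays the process-startup cost only once, which is the essence of keeping MapReduce workers resident between jobs.

```python
from multiprocessing import Pool
import time

def analyze(chunk):
    """Stand-in for a map task over one input chunk."""
    return sum(chunk)

if __name__ == "__main__":
    chunks = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]

    with Pool(processes=4) as pool:      # workers started once, like resident daemons
        for job in range(3):             # several "jobs" reuse the same workers
            t0 = time.time()
            results = pool.map(analyze, chunks)
            print(f"job {job}: {len(results)} chunks in {time.time() - t0:.3f}s")
```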