MapReduce: A Minor Step Forward
http://perspectives.mvdirona.com/2008/01/18/MapReduceAMinorStepForward.aspx
An execution engine that runs on multi-thousand node clusters really is an important step forward. The separation of execution engine and storage engine into extensible parts isn’t innovative but it is a very flexible approach that current generation commercial RDBMS could profit from. MapReduce最突出的地方就是这是一个通用的计算引擎,并且非常具有扩展性。将存储引擎和执行引擎分离,值得RDBMS学习和借鉴
1. MapReduce is a step backwards in database access
- In this section, the authors argue that schema is good, separation of schema and application are good, and high level language access is good. On the first two points, I agree schema is good and there is no question that application/schema separation has long ago proven to be a good thing. (schema,application/schema分离,以及high-level language access是好东西)
- The thing to keep in mind is that MapReduce is only an execution framework. 但是MapReduce本事只是一个执行引擎
- I argue that a separation of execution framework from store and indexing technology is a good thing in that MapReduce can be run over many stores. 将执行引擎以及store/indexing分离使得MapReduce可以在很多store上面运行
- The point here is that Dewitt and Stonebraker would like to see schema enforcement as part of the store and, generally, I agree that this would be useful. However, MapReduce is not a store. 原作者的观点应该是希望schema在store存储方面得到加强,这点是毋庸置疑的,但是MapReduce不是一个存储系统
- They also argue that high level languages are good. I agree and any language can be used with MapReduce systems so this isn’t a problem and is supported today. 支持更多的high-level language基于MapReduce系统
2. MapReduce is a poor implementation
- The argument here is that any reasonable structured store will support indexes. I agree for many workloads you absolutely must have indexes. However, for many data mining and analysis algorithms, all the data in a data set is accessed. Indexes, in these cases, don’t help. 原作者观点认为所有合理的结构化存储必须支持索引,虽然对于大部分工作方式需要索引, 但是许多数据挖掘和分析算法来说,是需要访问所有数据,而这点索引没有任何帮助
- This is one of the reason why many data mining algorithms run poorly over RDBMS – if all they are going to do is repeatedly scan the same data, a flat file is faster. It depends upon application access pattern and the amount of data that is accessed. 这就是为什么许多数据挖掘算法在RDBMS上面运行很差的原因,因为需要扫描所有的数据,因此对于平坦的数据结构来说访问更快。主要还是依赖访问模式。
3. MapReduce is not novel
- This is clearly true. These ideas have been fully and deeply investigated by the database community in the distant past. MapReduce确实不是一个新颖东西
- What is innovative is scale. MapReduce最新颖的东西是可扩展性
- I’ve seen MapReduce clusters of 3,000 nodes and I strongly suspect that clusters of 5,000+ servers can be found if you look in the right places. I’ve been around parallel database management systems for many years but have never seen multi-thousand node clusters of Oracle RAC or IBM DB2 Parallel Edition. 曾经见过3000节点运行MapReduce,但是很少见到PDBMS能够支持上千节点。
4. MapReduce is missing features
- All of the missing features (bulk loader, indexing, updates, transactions, RI, views) are features that could be implemented in a store used by MapReduce. 这些特性都可以在MapReduce所用的存储系统完成 #todo: RI???
5. MapReduce is incompatible with the DBMS tools
- I 100% agree. Tools are useful and today many of these tools target RDBMS. 大部分的工具确实是针对RDBMS
- It’s not mentioned by the authors but another useful characteristic of RDBMS is developers understand them and many people know how to write SQL. It’s an data access and manipulation language that is broadly understood. 但是这些工具广泛使用的特点是,开发人员了解RDBMS,许多人知道如何使用SQL
- The thing to keep in mind is that MapReduce is part of a componentized system. It’s just the execution framework. I could easily write a SQL compiler that emitted MapReduce jobs 而MapReduce仅仅是组件化系统的一个部分,只是一个执行引擎,而在上层可以很容易地通过SQL compiler转换成为MR jobs.
- MapReduce can be run over simple stores as it mostly is today or over stores with near database level functionality if needed. MapReduce可以很容易地跑在各个store上面。