Hadoop at a Crossroads?


The first announcement was Cloudera releasing a new DBMS, Impala, that runs on HDFS. Put simply, Impala is architected exactly like all of the shared-nothing parallel SQL DBMSs, serving the data warehouse market. Specifically, notice clearly that the MapReduce layer has been removed, and for good reason. As some of us have been pointing out for years, MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS. Impala was architected by savvy DBMS developers, who know the above pragma. In fact, development activity similar to Impala is being done by both HortonWorks and FaceBook. This, of course, presents the Hadoop vendors with a dilemma. Historically, "Hadoop" referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack?

The answer is simple: redefine "Hadoop", and that is exactly what the Hadoop vendors have done. The word "Hadoop" is now used to mean the entire stack. In other words, HDFS is at the bottom, on top of which run Impala, MapReduce and other systems. On top of these systems run higher-level software such as Mahout. The word "Hadoop" is used to refer to the entire collection.

#note: MapReduce架构并不适合用来实现SQL DBMS. 相反MPP架构更合适。

The second recent announcement comes from Google, who announced that MapReduce is yesterday’s news and they have moved on, building their software offerings on top of better systems such as Dremmel, Big Table, and F1/Spanner. In fact, Google must be "laughing in their beer" about now. They invented MapReduce to support the web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with BigTable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting that they see little-to-no future need for MapReduce.

It is indeed ironic that Hadoop is picking up support in the general community about five years after Google moved on to better things. Hence, the rest of the world followed Google into Hadoop with a delay of most of a decade. Google has long since abandoned it. I wonder how long it will take the rest of the world to follow Google’s direction and do likewise…

#note: 当社区开始造开源版本的MapReduce之时,G已经舍弃了MapReduce架构而改用MPP架构。挖坑不浅。

Notice that the Hadoop vendors are now on a collision course with the data warehouse vendors. They are now implementing (or have implemented) the same architecture supported by the data warehouse folks. Once they have a few years to solidify their implementations, they will probably offer competitive performance. Meanwhile most of the data warehouse vendors support HDFS, and many offer features to support semi-structured data. Hence, the data warehouse market and the Hadoop market will quickly converge. May the best systems win in the resulting head-to-head donneybrook!

#note: Hadoop会更加着力在MPP架构修改上,而data warehouse vendors则会支持HDFS以及半结构化数据,最终合并到相同的方向上。

Now I turn to HDFS, which is the only common building block left in the Hadoop stack. Notice clearly that HDFS is a file system, capable of storing bytes of data, a feature we have all come to expect on any computing platform. There are two possible world views of where HDFS might go in the future. If you take a file system view of the world, then users want a common distributed file system and HDFS is a perfectly reasonable alternative.

On the other hand, from the point of view of a parallel SQL/Hive DBMS, HDFS is a "fate worse than death". A DBMS ALWAYS wants to send the query (small kilobytes) to the data (lots of gigabytes) and never the other way around. Hence, hiding the location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature. All parallel DBMSs, either from the warehouse vendors or from the Hadoop vendors, will turn off location transparency, making HDFS look like a collection of Linux file systems, one per node. Likewise no DBMS wants file system replicas. See for an extensive discussion of this point. In short, load-balancing, query optimization and transaction considerations favor a DBMS-supplied replication system.

If it turns out that the DBMS point of view prevails in the marketplace over time, then HDFS will atrophy as the DBMS vendors abandon its features. In such a world, there is a local file system at each node, a parallel DBMS supporting a high-level query language, and various tools on top or extensions defined via user-defined functions. In this scenario, Hadoop will morph into a standard shared-nothing DBMS with a collection of DBMS vendors competing for your software dollar.

#note: 如果将HDFS看作是分布式文件系统,那么它是一个比较好的选择。但是如果作为MPP架构底层的文件系统那么命运堪忧。DBMS总是想将操作发送到数据节点上执行而不是相反方向,所以对DBMS隐藏数据位置信息会非常致命。DBMS更希望自己来管理单纯的文件系统。所以从DBMS角度上来看HDFS最终会被弃用(或者说现有实现会被弃用)

On the other hand, if the file system point of view prevails, then HDFS will probably survive largely intact with a potpourri of tools running on top of it. Features that users take for granted in a DBMS environment, such as load-balancing, auditing, resource governors, data independence, data integrity, high availability, concurrency control, and quality-of-service will be slow to come to file system users. There will be no higher-level standard interfaces in this scenario. In other words, a DBMS view of the world offers a bunch of useful services, and users would be well advised to consider carefully if they want to run lower-level interfaces.

In either scenario, the only common piece of software is a file system, and the Hadoop vendors will be selling file-system based tools, either DBMS ones or other stuff (or maybe both). In effect, they will join the ranks of the system software vendors selling software or services. May the best products win!

#note: 对现在HDFS的实现感到深深的堪忧!

comments powered by Disqus