Dremel A Decade of Interactive SQL Analysis at Web Scale

回顾过去10年Dremel对实时分析数据库的影响,主要有下面这些

[!NOTE]

使用SQL分析层次结构数据比较麻烦,需要扩展ANSI SQL, 但是大家各自扩展的方式又各不相同,导致最后面google自己搞了一个zeta SQL标准

[!NOTE]

Hierarchical schemas were a big departure from typical SQL schema design. Textbook normal forms would use many tables, and query-time joins. Avoiding joins was a key enabler for Dremel’s scalable and fast execution. (Dremel initially had no join sup- port, and was successful for years with only limited join support.) Denormalizing related data into one nested record was common in Google’s datasets; hierarchical schemas made it unnecessary to flatten or duplicate any data, which would have increased storage and processing cost.

All SQL systems at Google, including BigQuery, Cloud Spanner, and Cloud DataFlow, have now adopted this common dialect and framework. Users benefit from having a single, standard-compliant and complete dialect they can use across many systems. These common SQL libraries are now available in open source as ZetaSQL

存储和计算分离,以及shuffle和通用计算分离,这种模式有许多价值

[!NOTE]

Disaggregation proved to be a major trend in data management, as it decouples provisioning of different types of resources and en- ables better cost-performance and elasticity. Several aspects of dis- aggregation stand out:

数据湖(原地分析)相比数据仓库的优点 a. 更多数据源 b. 避免导入 c. 更多的query engines. 但是原地分析的缺点是,用户没有办法安全地管理好数据,以及没有办法做潜在的优化(原地分析)

[!NOTE]

Indeed, the data management community finds itself today in the middle of a transition from classical data warehouses to a data- lake-oriented architecture for analytics [1]. Three ingredients were called out as central to this transition: (a) consuming data from a variety of data sources, (b) eliminating traditional ETL-based data ingestion from an OLTP system to a data warehouse, and (c) en- abling a variety of compute engines to operate on the data. We have observed each part of this transition in Dremel’s decade-long history.

There were, however, important drawbacks to Dremel’s in situ ap- proach. First, users do not always want to or have the capability to manage their own data safely and securely. While this extra com- plexity in data governance was acceptable to some degree inside Google, it was not tolerable for many external customers. Second, in situ analysis means there is no opportunity to either optimize storage layout or compute statistics in the general case. In fact, a large percentage of Dremel queries are run over data seen for the first time. This makes many standard optimizations impossible. It is also impractical to run DML updates and deletes or DDL schema changes on standalone files.