Dremel A Decade of Interactive SQL Analysis at Web Scale
回顾过去10年Dremel对实时分析数据库的影响,主要有下面这些
[!NOTE]
- SQL: [1] reports that all data platforms have embraced SQL- style APIs as the predominant way to query and retrieve data. Dremel’s initial SQL-style dialect got generalized as ANSI- compliant SQL backed by an open-source library and shared with other Google products, notably Cloud Spanner.2
- Disaggregated compute and storage: The industry has con- verged on an architecture that uses elastic compute services to analyze data in cloud storage. This architecture decouples compute from storage, so each can scale independently.
- In situ analysis: Data lake repositories have become popu- lar, in which a variety of compute engines can operate on the data, to curate it or execute complex SQL queries, and store the results back in the data lake or send results to other oper- ational systems. Dremel’s use of a distributed file system and shared data access utilities allowed MapReduce and other data processing systems at Google to interoperate seamlessly with SQL-based analysis.
- Serverless computing: As an alternative to provisioned re- sources, the industry now offers on-demand resources that provide extreme elasticity. Dremel was built as a fully- managed internal service with no upfront provisioning and pay-per-use economics. This concept was successfully ported to BigQuery.
- Columnar storage: While use of columnar storage in com- mercial data analytic platforms predates the Dremel paper, Dremel introduced a novel encoding for nested data that gen- eralized the applicability of column stores to nested rela- tional and semistructured data.
使用SQL分析层次结构数据比较麻烦,需要扩展ANSI SQL, 但是大家各自扩展的方式又各不相同,导致最后面google自己搞了一个zeta SQL标准
[!NOTE]
Hierarchical schemas were a big departure from typical SQL schema design. Textbook normal forms would use many tables, and query-time joins. Avoiding joins was a key enabler for Dremel’s scalable and fast execution. (Dremel initially had no join sup- port, and was successful for years with only limited join support.) Denormalizing related data into one nested record was common in Google’s datasets; hierarchical schemas made it unnecessary to flatten or duplicate any data, which would have increased storage and processing cost.
All SQL systems at Google, including BigQuery, Cloud Spanner, and Cloud DataFlow, have now adopted this common dialect and framework. Users benefit from having a single, standard-compliant and complete dialect they can use across many systems. These common SQL libraries are now available in open source as ZetaSQL
存储和计算分离,以及shuffle和通用计算分离,这种模式有许多价值
[!NOTE]
Disaggregation proved to be a major trend in data management, as it decouples provisioning of different types of resources and en- ables better cost-performance and elasticity. Several aspects of dis- aggregation stand out:
- Economies of scale: The path of storage disaggregation went from RAID, SAN, distributed file systems to warehouse- scale computing [9].
- Universality: Storage disaggregation has been embraced by analytical and transactional systems alike, including Span- ner [17], AWS Aurora [44], Snowflake [18], and Azure SQL Hyperscale [7]. Disaggregated flash is discussed in [10, 28].
- Higher-level APIs: Disaggregated resources are accessed via APIs at ever higher levels of abstraction. Storage access is far removed from the early block I/O APIs and includes access control, encryption at rest,10 customer-managed en- cryption keys, load balancing, and metadata services (e.g., see [39]). Some data access APIs have built-in support for filtering and aggregation (see Section 6.1), which may be pushed all the way down into hardware (e.g., as done in Or- acle SPARC M7 [5]).
- Value-added repackaging: Raw resources are packaged as services providing enhanced capabilities. Even if it is not practical to disaggregate a raw resource such as RAM for general-purpose use in a data management system, it may be cost-effective to factor it out as a value-added service, as exemplified by Dremel’s shuffle.
数据湖(原地分析)相比数据仓库的优点 a. 更多数据源 b. 避免导入 c. 更多的query engines. 但是原地分析的缺点是,用户没有办法安全地管理好数据,以及没有办法做潜在的优化(原地分析)
[!NOTE]
Indeed, the data management community finds itself today in the middle of a transition from classical data warehouses to a data- lake-oriented architecture for analytics [1]. Three ingredients were called out as central to this transition: (a) consuming data from a variety of data sources, (b) eliminating traditional ETL-based data ingestion from an OLTP system to a data warehouse, and (c) en- abling a variety of compute engines to operate on the data. We have observed each part of this transition in Dremel’s decade-long history.
There were, however, important drawbacks to Dremel’s in situ ap- proach. First, users do not always want to or have the capability to manage their own data safely and securely. While this extra com- plexity in data governance was acceptable to some degree inside Google, it was not tolerable for many external customers. Second, in situ analysis means there is no opportunity to either optimize storage layout or compute statistics in the general case. In fact, a large percentage of Dremel queries are run over data seen for the first time. This makes many standard optimizations impossible. It is also impractical to run DML updates and deletes or DDL schema changes on standalone files.