Integrating Deep Learning Frameworks into Main-Memory Databases

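This paper mainly describes how to embed PyTorch into the Umbra database system. One passage summarizes the two current approaches to combining ML and databases: (a) exporting the data from the database, where the emphasis is on moving data conveniently between systems; (b) calling ML operators directly inside the DBMS, where the emphasis is on tight integration with ML training and prediction.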

Connecting Databases to Regular Data Processing.

The deep learning framework TensorFlow provides a way of retrieving data from SQLite database engines [26]. Furthermore, it includes an experimental feature, IODataset, which can connect to a Postgres endpoint [25].

Katsipoulakis et al. [7] propose a general way of integrating machine learning into SQL and streaming data between storage and processing systems.

Khan et al. present Fireworks [8], which includes data retrieval from databases using SQLAlchemy [24] as a part of their machine learning pipelines.

All of these methods require transferring all input data from the database to a processing environment. Nested queries in particular, which require multiple environment switches, are negatively affected by this.
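The export-style workflow can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module; the table `samples`, its contents, and the helper `fit_slope` are hypothetical and do not come from the paper.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
)

# Switch 1: the complete input leaves the database and is
# materialized in the client process.
rows = conn.execute("SELECT x, y FROM samples").fetchall()


def fit_slope(pairs):
    """Least-squares slope of a line through the origin (runs outside the DBMS)."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


slope = fit_slope(rows)

# Switch 2: using the result in a follow-up query means going back
# into the database with the value computed outside of it.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM samples WHERE y > ? * x - 0.5", (slope,)
).fetchone()
```

Every additional level of nesting between query and model adds another round trip of this kind, which is the cost the paragraph above refers to.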

Integrated Machine Learning in Database Systems.

Microsoft SQL Server’s stored procedures use Python and R scripts to execute virtually arbitrary processing, including deep learning [4]. Furthermore, Microsoft SQL Server provides access to many data mining methods, including clustering, decision trees, and basic neural networks [2]. Microsoft’s ML system Raven adds ONNX models to SQL Server and Spark and shows interesting optimizations for queries using tree-based models [6] [16]. Raven improves the ONNX Runtime performance of deep learning models by leveraging SQL Server’s execution parallelization and model caching.

Google’s cloud-based data warehouse BigQuery supports the use of various classical machine learning methods, as well as deep learning models, through extended SQL [3]. Furthermore, it allows deploying trained TensorFlow models. Amazon’s SageMaker can be accessed from the database systems Redshift and Athena [11]. In contrast to our work, these systems are divided into distinct services and therefore inherently have larger communication costs.

Oracle provides a set of machine learning algorithms including simple neural networks to use directly within SQL queries [15]. Oracle’s implementation performs all computations in the database kernel, which avoids the need to move the data for processing.
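To make the idea of calling a model from within a SQL query concrete, here is a minimal sketch using Python's built-in `sqlite3` module and its `create_function` hook. It is only a loose analogy: the UDF still runs in the host Python process rather than in a database kernel, and the function `predict` with its fixed linear "model" is a hypothetical stand-in, not Oracle's or any other vendor's API.

```python
import sqlite3


def predict(x):
    # Hypothetical stand-in for a trained model: y = 2x + 1.
    return 2.0 * x + 1.0


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("predict", 1, predict)

conn.execute("CREATE TABLE inputs (id INTEGER, x REAL)")
conn.executemany(
    "INSERT INTO inputs VALUES (?, ?)",
    [(1, 0.0), (2, 1.5), (3, 4.0)],
)

# The model call is part of the query itself, so filtering on the
# prediction needs no separate export step.
results = conn.execute(
    "SELECT id, predict(x) FROM inputs WHERE predict(x) > 2.0 ORDER BY id"
).fetchall()
```

The point of the sketch is the query shape: the prediction appears as an ordinary expression in the `SELECT` and `WHERE` clauses, so the query engine decides when and on which rows the model is evaluated.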

Sandha et al. present the integration of linear regression and basic neural networks into the Teradata SQL Engine using a Python runtime [19]. Their distributed approach is especially interesting for expensive training algorithms.

With TensorDB, Kim et al. present a database system that adds tensor data types to the relational model and supports many tensor operations. However, its lack of automatic differentiation and GPU support renders it impractical for neural networks.

Luo et al. [12] extend the parallel database system SimSQL to support linear algebra, which can be used for a number of machine learning tasks.

With MADlib, Hellerstein et al. create a comprehensive open-source machine learning framework based on SQL and user-defined functions for Postgres and Greenplum [5]. While the concepts of MADlib are transferable between database systems, it currently only supports Postgres and Greenplum Database.

Schüle et al. introduce lambda-expression-based tensor processing with automatic differentiation to the main-memory database HyPer [20]. Due to its lack of syntactic abstractions and GPU support, it is of only limited use for sophisticated deep learning models. In their follow-up work, Schüle et al. add this lambda-expression-based tensor processing to the Umbra database system, along with just-in-time GPU code generation [21]. Just-in-time compilation of GPU code allows query-specific optimizations.
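As a rough intuition for what "differentiating a lambda expression" means, the following sketch differentiates a scalar lambda with forward-mode dual numbers. It is a generic textbook technique chosen for brevity and is not the method used in HyPer or Umbra; the names `Dual`, `derivative`, and `loss` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one input."""

    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def _lift(v):
    """Treat plain numbers as constants (derivative 0)."""
    return v if isinstance(v, Dual) else Dual(float(v))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


# The user only writes the expression; the derivative falls out of
# evaluating the same lambda on dual numbers.
loss = lambda w: (3.0 * w + 1.0) * (3.0 * w + 1.0)
grad = derivative(loss, 2.0)  # d/dw (3w + 1)^2 = 6 * (3w + 1)
```

Writing the model as a lambda keeps the user-facing code to a single expression, but as the paragraph above notes, larger deep learning models benefit from higher-level abstractions than this.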