What Modern NVMe Storage Can Do, And How To Exploit It

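VLDB 2023. The paper discusses how to design an OLTP system that fully exploits an array of NVMe SSDs. In the configuration used in the paper, a single machine hosts 8 NVMe SSDs, each capable of more than 1M IOPS, for a total of about 12.5M IOPS (roughly 1.5M per drive). It asks what it takes to saturate the I/O capacity of such an array and how the system should be designed, and it answers the following questions: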

[!NOTE] Our high-level goal of closing this performance gap can be broken down into the following research questions:

The paper starts with several sets of benchmarks; the key numbers are roughly these:

Pasted-Image-20241008204544.png
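To get a feel for what 8 SSDs at about 12.5M IOPS mean for the CPU side, here is a back-of-the-envelope calculation. Only the drive count and the total IOPS come from the notes; the core count, clock rate, and page size are assumed values for illustration and are not taken from the notes.

```python
def per_drive_iops(total_iops: float, drives: int) -> float:
    """Average IOPS each drive must sustain to reach the array total."""
    return total_iops / drives


def cycles_per_io(cores: int, clock_hz: float, total_iops: float) -> float:
    """CPU cycles available per I/O if all cores share the load evenly."""
    return cores * clock_hz / total_iops


def bandwidth_bytes_per_s(total_iops: float, page_bytes: int) -> float:
    """Throughput implied by the IOPS figure at a given page size."""
    return total_iops * page_bytes


if __name__ == "__main__":
    TOTAL_IOPS = 12.5e6   # from the notes
    DRIVES = 8            # from the notes
    CORES = 64            # assumption, illustration only
    CLOCK_HZ = 2.5e9      # assumption, illustration only
    PAGE_BYTES = 4096     # assumption, illustration only

    print("IOPS per drive:", per_drive_iops(TOTAL_IOPS, DRIVES))
    print("cycles per I/O:", cycles_per_io(CORES, CLOCK_HZ, TOTAL_IOPS))
    print("bytes per second:", bandwidth_bytes_per_s(TOTAL_IOPS, PAGE_BYTES))
```

Under these assumed values each I/O gets a budget on the order of ten thousand cycles, so a kernel context switch costing several thousand cycles per request would consume a large share of it.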

So the final design still starts a fixed number of worker threads, and each worker thread schedules its work cooperatively in user space. I/O requests go through libaio, io_uring, or SPDK.
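The structure can be sketched as follows: a fixed pool of worker threads, each running its own round-robin scheduler over lightweight tasks. This is only a structural sketch. Python generators stand in for user-space contexts, the names `make_task`, `run_worker`, and `start_workers` are hypothetical, and Python threads do not give real parallelism for CPU-bound work.

```python
import os
import threading
from collections import deque


def make_task(name, steps, log):
    """A lightweight task: do one unit of work, then yield to the scheduler."""
    for step in range(steps):
        log.append((name, step))
        yield


def run_worker(tasks):
    """Per-worker scheduler: resume tasks round-robin until all have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until its next yield point
        except StopIteration:
            continue            # task finished, drop it
        ready.append(task)      # still alive, requeue behind the others


def start_workers(num_workers=None, tasks_per_worker=3, steps=2):
    """Start a fixed number of workers (default: one per core) and wait for them."""
    num_workers = num_workers or os.cpu_count() or 1
    logs = [[] for _ in range(num_workers)]   # one private log per worker
    threads = []
    for w in range(num_workers):
        tasks = [make_task(f"w{w}-t{t}", steps, logs[w])
                 for t in range(tasks_per_worker)]
        thread = threading.Thread(target=run_worker, args=(tasks,))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return logs


if __name__ == "__main__":
    for worker_log in start_workers(num_workers=2):
        print(worker_log)
```

Because the worker count is fixed, the kernel never has more runnable threads than the program asked for, and all interleaving between tasks happens inside `run_worker`.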

[!NOTE]

Lightweight tasks. To avoid oversubscription, we use lightweight cooperative threads that are managed by the database system in user-space. This reduces the context switching overhead and allows the system to be fully in control of scheduling without kernel interference. In this design, which is illustrated in Figure 8, the system starts as many worker threads as there are hardware cores available in the system. Each of these workers runs a DBMS-internal scheduler that executes these lightweight threads, which we call tasks. To implement user-space task switching, we use the Boost context library [22], specifically fcontext. Thereby, a task switch costs only around ~20 CPU cycles, instead of several thousand for a kernel context switch. This enables cheap and frequent context switches deep in the call stack, and makes it fairly easy to port existing code bases to this new design.

Cooperative multitasking. Conceptually, Boost contexts are non-preemptive user-space threads. Tasks therefore need to yield control periodically back to the scheduler. In our cooperative multitasking design, this happens whenever a user query encounters a page fault, runs out of free pages, or when the user task is completed. Further, to prevent a worker from being stalled due to latching, we modified all latches to eventually yield to the scheduler as well.
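The yield-on-page-fault behaviour described in the quote can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's implementation. Python generators replace Boost fcontext, `SimulatedSSD` is a hypothetical stand-in for an asynchronous backend such as io_uring or SPDK, and the buffer pool is a plain dict.

```python
from collections import deque


class SimulatedSSD:
    """Hypothetical async backend: a read completes `latency` polls after submission."""

    def __init__(self, latency=2):
        self.latency = latency
        self.inflight = []                      # entries: [polls_left, page_id, task]

    def submit_read(self, page_id, task):
        self.inflight.append([self.latency, page_id, task])

    def poll(self):
        """Advance simulated time one step; return completed (page_id, task) pairs."""
        done, pending = [], []
        for entry in self.inflight:
            entry[0] -= 1
            if entry[0] <= 0:
                done.append((entry[1], entry[2]))
            else:
                pending.append(entry)
        self.inflight = pending
        return done


def lookup(pool, page_id):
    """Task body: yield the missing page id on a page fault, then return the page."""
    while page_id not in pool:
        yield page_id                           # page fault: give up the worker
    return pool[page_id]


def run(tasks, pool, ssd):
    """Scheduler loop for one worker; `tasks` is a list of (name, generator) pairs."""
    ready, results, trace = deque(tasks), {}, []
    while ready or ssd.inflight:
        for page_id, task in ssd.poll():        # reap completions, wake parked tasks
            pool[page_id] = f"data-{page_id}"
            ready.append(task)
        if not ready:
            continue                            # everything is waiting on I/O
        name, gen = ready.popleft()
        try:
            missing = next(gen)                 # run until the task yields
            ssd.submit_read(missing, (name, gen))
            trace.append(("fault", name, missing))
        except StopIteration as stop:
            results[name] = stop.value
            trace.append(("done", name))
    return results, trace


if __name__ == "__main__":
    pool = {1: "data-1"}
    tasks = [("a", lookup(pool, 7)), ("b", lookup(pool, 1))]
    print(run(tasks, pool, SimulatedSSD()))
```

While task `a` waits for page 7, the worker keeps running task `b` instead of blocking, which is the point of yielding at page faults. A real system would also deduplicate concurrent faults on the same page, which this sketch does not do.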

Pasted-Image-20241008205051.png

Pasted-Image-20241008205113.png

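There are three ways to organize I/O access: a) dedicated threads (separate I/O threads that perform non-blocking I/O), b) SSD assignment, and c) all-to-all. The figure below shows how they differ. The problem with dedicated threads is that they rely on kernel thread switches, which cost many CPU cycles. SSD assignment and all-to-all perform about the same, but all-to-all looks much simpler because it needs no synchronization or message passing.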
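The difference between SSD assignment and all-to-all can be made concrete with a small routing sketch. The striping rule (`page_id % num_ssds`) and the ownership rule (`ssd % num_workers`) are assumptions made for illustration, as are all function names. The notes only state that all-to-all avoids synchronization and message passing.

```python
def ssd_for_page(page_id, num_ssds):
    """Assumed placement: pages are striped round-robin over the SSDs."""
    return page_id % num_ssds


def owner_of_ssd(ssd, num_workers):
    """Assumed ownership rule for the SSD-assignment variant."""
    return ssd % num_workers


def route_all_to_all(worker, page_id, num_ssds):
    """Each worker has a private queue to every SSD, so it always submits itself."""
    ssd = ssd_for_page(page_id, num_ssds)
    return {"submitter": worker, "queue": (worker, ssd), "forwarded": False}


def route_ssd_assignment(worker, page_id, num_ssds, num_workers):
    """Only the owning worker may submit, so other workers must hand the request over."""
    ssd = ssd_for_page(page_id, num_ssds)
    owner = owner_of_ssd(ssd, num_workers)
    return {"submitter": owner, "queue": (owner, ssd), "forwarded": owner != worker}


def count_forwarded(routes):
    """Number of requests that needed cross-worker message passing."""
    return sum(1 for route in routes if route["forwarded"])


if __name__ == "__main__":
    NUM_SSDS, NUM_WORKERS = 8, 4        # illustrative sizes
    pages = range(8)                    # worker 0 touches pages 0..7
    a2a = [route_all_to_all(0, p, NUM_SSDS) for p in pages]
    assigned = [route_ssd_assignment(0, p, NUM_SSDS, NUM_WORKERS) for p in pages]
    print("all-to-all forwarded:", count_forwarded(a2a))
    print("ssd-assignment forwarded:", count_forwarded(assigned))
    print("queues needed by all-to-all:", NUM_WORKERS * NUM_SSDS)
```

The trade-off in this sketch is that all-to-all needs one queue per (worker, SSD) pair but never crosses worker boundaries, while SSD assignment needs fewer queues but forwards most requests.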

Pasted-Image-20241008205602.png

The answers to the questions raised at the beginning are as follows.

[!NOTE]