Doris Hash Index 分析

https://github.com/apache/incubator-doris/commit/73999974332362a3874bedb7d64cbd3f177718ec

Doris Hash Index 使用方式和实现都非常意思:


先从bucket数据结构开始分析,结构名称叫做 `HashChunk`. 它有下面这些特征:

struct alignas(64) HashChunk {
    static const uint32_t CAPACITY = 12;
    uint8_t tags[12];
    std::atomic<uint32_t> size;
    uint32_t values[12];
};

上面 `HashChunk` 的这些设计都是为了可以快速根据key查找candidates. 下面是相关代码片段,这里省去了重新散列的过程,只看如何搜索一个HashChunk. 里面涉及到的SSE指令,可以在这里查询到含义以及伪代码 https://www.laruence.com/sse/

uint64_t HashIndex::find(uint64_t key_hash, std::vector<uint32_t>* entries) const {
    uint64_t tag = std::max((uint64_t)1, key_hash & 0xff);
    uint64_t pos = (key_hash >> 8) & _chunk_mask;
    uint64_t orig_pos = pos;
#ifdef __SSE2__
    auto tests = _mm_set1_epi8(static_cast<uint8_t>(tag));
    while (true) {
        // get corresponding chunk
        HashChunk& chunk = _chunks[pos];
        uint32_t sz = chunk.size;
        // load tags
        auto tags = _mm_load_si128(reinterpret_cast<__m128i*>(chunk.tags));
        auto eqs = _mm_cmpeq_epi8(tags, tests);
        // check tag equality and store equal tag positions into masks
        uint32_t mask = _mm_movemask_epi8(eqs) & 0xfff;
        // iterator over mask and put candidates into entries
        while (mask != 0) {
            uint32_t i = __builtin_ctz(mask);
            mask &= (mask - 1);
            entries->emplace_back(chunk.values[i]);
        }
        // ... rehash
    }
}

说明如下:


接着根据mask就知道,那些位置上的tags是相等的,那么我们就选择这些位置上的values.

while (mask != 0) {
   uint32_t i = __builtin_ctz(mask);
   mask &= (mask - 1);
   // use ith.
}

说明如下: