LSM-Tree - LevelDb Bloom Filter - Technical Reading Notes

Introduction

A Bloom filter resembles a hash table but is far more space-efficient, because it records the presence of a key with a handful of bits instead of storing the key itself. The price of this efficiency is that a positive answer is not a guarantee: the filter can report a key as present when it is not, so it only suits systems that can tolerate a certain error rate.

In one sentence: a Bloom filter is a probabilistic data structure that can only tell us that an element is definitely not in the set or may be in the set.

Much of the theory and practice below is referenced or directly extracted from online materials, combined with my own understanding. Please correct me if I am wrong.

Theory

Articles on the theoretical basis are all broadly similar. This section is summarized from a well-regarded write-up, Bloom Filter Concepts and Principles, as well as papers published abroad.

At the following URL you can watch a Bloom filter at work through interactive JS code:
Bloom Filters by Example (llimllib.github.io)

Note that the example uses two simple hash functions, Fnv and Murmur.

A Bloom filter is usually defined in terms of:

n keys.
A space v of m bits, all initialized to 0.

Bloom filter theory uses k independent hash functions to represent a set S={x1, x2,…,xn} of n elements: for every element x, the bit at position hi(x) is set to 1 for each of the k hash functions (1≤i≤k).
If a position is set to 1 more than once, only the first write has any effect. In the figure below, for example, the second "1" from the left is a position that more than one hash result maps to.

To decide whether an element y is in the set, apply the k hash functions to y. If the bit at hi(y) is 1 for every i (1≤i≤k), y is considered possibly in the set; otherwise it is definitely not in the set.
In the figure, y1 is judged absent because one of its hash positions holds 0, while all hash positions of y2 hold 1, so y2 may be in the set.
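The insert and query rules above can be sketched in a few lines of C++. This is a toy illustration, not LevelDB's code: the class name ToyBloomFilter and the way the i-th hash is derived (std::hash over the key with the index appended) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter with m bits and k probes per key (illustrative only).
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {
    assert(m > 0 && k > 0);
  }

  // Set the bits at h_1(key) .. h_k(key) to 1.
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < k_; ++i) bits_[Position(key, i)] = true;
  }

  // false: definitely not in the set; true: possibly in the set.
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < k_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // Hypothetical i-th hash function: hash the key with the index appended.
  std::size_t Position(const std::string& key, std::size_t i) const {
    return std::hash<std::string>{}(key + '#' + std::to_string(i)) %
           bits_.size();
  }

  std::vector<bool> bits_;
  std::size_t k_;
};
```

An inserted key always reports true, because all of its bits were set; a key that was never inserted usually reports false, but may report true if other keys happened to set all of its positions.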

The theoretical content of Bloom filter is relatively simple, and the key part is the choice of hash function and the balance of error rate.

Error rate calculation

First of all, the Bloom filter needs to pay attention to the bit length, that is, the length of the array. Usually a large bloom filter will have a smaller error rate than a small bloom filter.

The formula for calculating the false positive rate is: (1-e^(-kn/m))^k .

The derivation is given in the Mathematical derivation section at the end. It is not essential; understanding the final formula is enough. In the formula:

n is the number of keys, m is the number of bits (that is, the size of the array), and k is the number of hash functions.

The formula shows that you first need to estimate the number n of keys that may be inserted, and then adjust k and m to configure the filter for your application. For a fixed n, the larger m is, the lower the false positive rate.

Let p be the probability that a given bit is still 0. The false positive rate is lowest when half of the m bits are 1 and half are 0. This claim is explained in detail in the final derivation section.
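As a quick way to get a feel for the formula, the sketch below evaluates (1-e^(-kn/m))^k directly. The function name FalsePositiveRate is chosen for this example only.

```cpp
#include <cassert>
#include <cmath>

// Approximate false positive rate (1 - e^(-kn/m))^k for
// n keys, m bits and k hash functions.
double FalsePositiveRate(double n, double m, double k) {
  assert(n > 0 && m > 0 && k > 0);
  return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

With 10 bits per key and k = 7 (for instance n = 1000, m = 10000) the result is a little under 1%, and doubling m for the same n and k lowers it further.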

How many hash functions?

The error-rate formula raises another question: how many hash functions should be used? Too many hash functions cost computation and hurt performance, while too few raise the false positive rate.

Happily, this formula has also been derived:

The optimal number of hash functions k is ln2 * (m/n) .

The size of a Bloom filter can be determined with the following steps:

1. Determine the range of n.
2. Select a value for m.
3. Calculate the optimal value of k.
4. Calculate the error rate for the given n, m, and k. If it is unacceptable, go back to step 2; otherwise stop.
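The four steps can be sketched as a small search loop. Because the error rate depends only on the ratio m/n, the sketch below searches over bits per key (m/n) instead of m itself; the struct BloomParams, the function ChooseParams, and the rounding of k to the nearest integer are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

struct BloomParams {
  int bits_per_key;  // m / n
  int k;             // number of hash functions
  double rate;       // estimated false positive rate
};

// Increase m/n step by step until the estimated rate meets the target.
BloomParams ChooseParams(double target_rate, int max_bits_per_key = 64) {
  assert(target_rate > 0.0 && target_rate < 1.0);
  BloomParams p{0, 0, 1.0};
  for (int bpk = 1; bpk <= max_bits_per_key; ++bpk) {
    // Step 3: optimal k = ln2 * (m/n), rounded to an integer.
    int k = static_cast<int>(std::round(std::log(2.0) * bpk));
    if (k < 1) k = 1;
    // Step 4: error rate for this (m/n, k).
    double rate = std::pow(1.0 - std::exp(-static_cast<double>(k) / bpk), k);
    p = BloomParams{bpk, k, rate};
    if (rate <= target_rate) break;  // acceptable: stop
  }
  return p;
}
```

For a 1% target this search settles on 10 bits per key, which matches the value LevelDB's documentation uses in its example.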

What is the time and space complexity of Bloom filter?

Insertion and lookup both take O(k) time, because inserting or querying an element only requires evaluating the k hash functions on it.

The space complexity is harder to pin down, because the required size depends on the error rate you can accept. If the number of elements is hard to estimate in advance, a hash table or a scalable Bloom filter is the better choice.

Note ⚠️: LevelDB's filter is never smaller than a 64-bit array.
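To make the 64-bit minimum concrete, here is a small sketch of how the encoded size of one filter could be computed. It assumes the layout used in bloom.cc shown later in this article (a bit array rounded up to whole bytes, followed by one byte that stores k); the function name EncodedFilterBytes is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes used by one encoded filter for n keys.
std::size_t EncodedFilterBytes(std::size_t n, std::size_t bits_per_key) {
  assert(bits_per_key > 0);
  std::size_t bits = n * bits_per_key;
  if (bits < 64) bits = 64;            // enforce the minimum length
  std::size_t bytes = (bits + 7) / 8;  // round up to whole bytes
  return bytes + 1;                    // one trailing byte stores k
}
```

With 10 bits per key, a single key still occupies 9 bytes (8 bytes of bits plus 1 byte for k), while 100 keys occupy 126 bytes.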

How many of the m bits should be 1?

Just remember the conclusion: the false positive rate is lowest when half of the m bits are 1 and half are 0.

LevelDB implementation

The core of LevelDB's Bloom filter is its hash function: a single hash computation stands in for k hash functions, while the false positive rate stays within bounds.

For the full implementation, read bloom.cc: leveldb/bloom_test.cc at main · google/leveldb · GitHub

Introduction to index.md

Without going into the source code here, let's see how the author explained it in index.md :
leveldb/index.md at main · google/leveldb · GitHub

Because of the way LevelDB organizes data on disk, a single Get() call may involve multiple reads from disk. The optional FilterPolicy mechanism can substantially reduce the number of disk reads; in practice this means using a Bloom filter.

leveldb::Options options;
options.filter_policy = NewBloomFilterPolicy(10);
leveldb::DB* db;
leveldb::DB::Open(options, "/tmp/testdb", &db);
// ... use the database ...
delete db;
// Note: after closing the database, the memory held by the filter policy
// must be released manually.
delete options.filter_policy;

The code above associates a Bloom-filter-based filtering policy with the database.

Bloom-filter-based filtering relies on keeping some number of bits of data in memory per key (10 bits per key in this example, since that is the argument passed to NewBloomFilterPolicy).

This filter reduces the number of unnecessary disk reads needed by Get() calls by a factor of about 100. Increasing the bits per key yields a larger reduction at the cost of more memory. The documentation recommends that applications whose working set does not fit in memory, and that do a lot of random reads, set a filter policy.

FilterPolicy

The filter is exposed through the external interface FilterPolicy, whose purpose is to cut the disk reads performed by DB::Get(). In practice the built-in Bloom filter implementation is normally used.

Here is the code for the interface definition:

#ifndef STORAGE_LEVELDB_INCLUDE_FILTER_POLICY_H_
#define STORAGE_LEVELDB_INCLUDE_FILTER_POLICY_H_

#include <string>

#include "leveldb/export.h"

namespace leveldb {

class Slice;

class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();

  // Return the name of this policy.  Note that if the filter encoding
  // changes in an incompatible way, the name returned by this method
  // must be changed.  Otherwise, old incompatible filters may be
  // passed to methods of this type.
  virtual const char* Name() const = 0;

  // keys[0,n-1] contains a list of keys (potentially with duplicates)
  // that are ordered according to the user supplied comparator.
  // Append a filter that summarizes keys[0,n-1] to *dst.
  //
  // Warning: do not change the initial contents of *dst.  Instead,
  // append the newly constructed filter to *dst.
  virtual void CreateFilter(const Slice* keys, int n,
                            std::string* dst) const = 0;

  // "filter" contains the data appended by a preceding call to
  // CreateFilter() on this class.  This method must return true if
  // the key was in the list of keys passed to CreateFilter().
  // This method may return true or false if the key was not on the
  // list, but it should aim to return false with a high probability.
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};

// This function carries a long comment; it is covered later in this article.
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb

#endif  // STORAGE_LEVELDB_INCLUDE_FILTER_POLICY_H_

bloom.cc

Detailed explanations are given in the code comments. The parts most worth studying are the filter construction and the hash function. This subsection covers the source of the filter itself; the hash function is covered in a later subsection.

// Copyright (c) 2012 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#include "leveldb/filter_policy.h"
#include "leveldb/slice.h"
#include "util/hash.h"

namespace leveldb {

namespace {

// Hash function used by the filter. Note the fixed seed 0xbc9f1d34.
static uint32_t BloomHash(const Slice& key) {
  return Hash(key.data(), key.size(), 0xbc9f1d34);
}

class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  const char* Name() const override { return "leveldb.BuiltinBloomFilter2"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    // Tip: this is the point made earlier, that too few bits raise the
    // false positive rate.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;  // at least 64 bits

    const size_t init_size = dst->size();
    // Grow the string by `bytes` zero-filled bytes.
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      // Tip: the paper is LNCS 4168 - Less Hashing, Same Performance:
      // Building a Better Bloom Filter (harvard.edu).
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  // Key function: reports whether the key may be in the filter.
  bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
    const size_t len = bloom_filter.size();
    // A filter shorter than 2 bytes cannot hold both a bit array and k.
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // k exceeds the maximum we ever write, so return true and do not
      // filter out this SSTable.
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    // Key step: the hash function.
    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

// Creates the Bloom filter policy (see the notes below).
const FilterPolicy* NewBloomFilterPolicy(int bits_per_key) {
  return new BloomFilterPolicy(bits_per_key);
}

}  // namespace leveldb
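The double-hashing step is easier to follow in isolation. The sketch below reproduces only the probe sequence from the listing above (one hash value h, a delta obtained by rotating h right by 17 bits, then k positions); the function name ProbePositions is made up for this example and the hash value is passed in directly instead of being computed by BloomHash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the k bit positions probed for a key whose hash is h.
std::vector<uint32_t> ProbePositions(uint32_t h, std::size_t k, uint32_t bits) {
  assert(bits > 0);
  const uint32_t delta = (h >> 17) | (h << 15);  // rotate right 17 bits
  std::vector<uint32_t> positions;
  for (std::size_t j = 0; j < k; ++j) {
    positions.push_back(h % bits);
    h += delta;  // unsigned arithmetic wraps modulo 2^32
  }
  return positions;
}
```

Both CreateFilter and KeyMayMatch walk exactly the same sequence, which is why a key that was added can never be reported as absent.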

NewBloomFilterPolicy function

As for why it is called a new Bloom filter policy, see the comment written by the author:

Returns a new filter policy that uses a Bloom filter with approximately the specified number of bits per key. A good value for bits_per_key is 10, which yields a filter with a roughly 1% false positive rate.

Note that the object's memory must be released manually after use:

The caller must delete the result after any database that uses it has been closed.

If you use a custom comparator that ignores some parts of the keys being compared, you must not use NewBloomFilterPolicy(); you must instead provide your own FilterPolicy that ignores the corresponding parts of the keys as well.

For example, if the comparator ignores trailing spaces, it would be incorrect to use a FilterPolicy (such as NewBloomFilterPolicy) that does not ignore trailing spaces in keys.
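The trailing-space case can be illustrated without LevelDB at all. In the sketch below, keys are normalized before they are hashed, so two keys that such a comparator treats as equal also produce the same filter input. The helper names RemoveTrailingSpaces and FilterHash are hypothetical, and std::hash merely stands in for the filter's real hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Strip trailing spaces so that keys the comparator considers equal
// become byte-for-byte identical.
std::string RemoveTrailingSpaces(const std::string& key) {
  const std::size_t end = key.find_last_not_of(' ');
  std::string result =
      (end == std::string::npos) ? std::string() : key.substr(0, end + 1);
  assert(result.empty() || result.back() != ' ');
  return result;
}

// Stand-in for the filter's hash: always hash the normalized key.
std::size_t FilterHash(const std::string& key) {
  return std::hash<std::string>{}(RemoveTrailingSpaces(key));
}
```

A custom FilterPolicy would apply the same normalization in both CreateFilter and KeyMayMatch before delegating to the underlying filter.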

hash.cc

I said earlier that one of the key pieces of code is a good hash function. Here is the relevant code from hash.cc:

Note that the seed passed to the hash function here is 0xbc9f1d34, which is 3164544308 in decimal.

It can also be seen that LevelDB uses one good hash function of its own to stand in for k hash functions. This is an adjustment to the theory; internally LevelDB also keeps the filter length at no less than 64 bits.

/*
data: pointer to the bytes to hash
n:    number of bytes in data
seed: seed value; the Bloom filter always passes 0xbc9f1d34
*/
uint32_t Hash(const char* data, size_t n, uint32_t seed) {
  // Similar to murmur hash
  const uint32_t m = 0xc6a4a793;
  const uint32_t r = 24;
  // limit points one past the last byte, like an iterator's end().
  const char* limit = data + n;
  uint32_t h = seed ^ (n * m);

  // Pick up four bytes at a time
  while (data + 4 <= limit) {
    // Decode 4 bytes per iteration until fewer than 4 bytes remain.
    // DecodeFixed32 is the lower-level version of Get...: it reads directly
    // from the character buffer without any bounds checking. Recent clang
    // and gcc optimize it to a single mov / ldr instruction.
    uint32_t w = DecodeFixed32(data);
    data += 4;
    h += w;
    h *= m;
    h ^= (h >> 16);
  }

  // Pick up remaining bytes
  switch (limit - data) {
    // Fold the remaining bytes into the uint32_t.
    case 3:
      // static_cast performs an explicit conversion (here char to uint8_t).
      h += static_cast<uint8_t>(data[2]) << 16;
      // The FALLTHROUGH_INTENDED macro annotates implicit fall-through
      // between switch labels. The real definition should be provided
      // externally; this is the fallback for compilers without one:
      /*
      #ifndef FALLTHROUGH_INTENDED
      #define FALLTHROUGH_INTENDED \
        do {                       \
        } while (0)
      #endif
      */
      FALLTHROUGH_INTENDED;
    case 2:
      h += static_cast<uint8_t>(data[1]) << 8;
      FALLTHROUGH_INTENDED;
    case 1:
      h += static_cast<uint8_t>(data[0]);
      h *= m;
      h ^= (h >> r);
      break;
  }
  return h;
}
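The only helper the hash relies on is DecodeFixed32, which reads four bytes as a little-endian 32-bit integer. The following is a portable sketch of that behavior, not LevelDB's actual implementation; the name DecodeFixed32Sketch is chosen to make that clear.

```cpp
#include <cassert>
#include <cstdint>

// Read 4 bytes starting at ptr as a little-endian 32-bit value.
uint32_t DecodeFixed32Sketch(const char* ptr) {
  assert(ptr != nullptr);
  const uint8_t* p = reinterpret_cast<const uint8_t*>(ptr);
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}
```

For example, the byte sequence 0x34 0x1d 0x9f 0xbc decodes to 0xbc9f1d34, because the first byte is the least significant one.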

Slice.h

It can be seen as a simple string design similar to Redis's sds, except that it is written in C++.

For related explanations, you can read the documentation:

leveldb/index.md at main · google/leveldb · GitHub

A Slice is a simple data structure that contains a pointer into some external storage and a size. The user of a Slice must ensure that the Slice is not used after the corresponding external storage has been deallocated (the Slice does not own the memory it points to).

Multiple threads can call const methods on a Slice without external synchronization, but if any thread may call a non-const method, all threads accessing the same Slice must use external synchronization.

A C++ string or a null-terminated C-style string can be easily converted to a Slice:

 leveldb::Slice s1 = "hello";

std::string str("world");
leveldb::Slice s2 = str;

A Slice can just as easily be converted back to a C++ string:

std::string str = s1.ToString();
assert(str == std::string("hello"));

Be careful when using a Slice: it is up to the caller to ensure that the external byte array the Slice points to remains valid while the Slice is in use. For example, the following code is buggy, because the Slice ends up pointing at storage that no longer exists:

leveldb::Slice slice;
if (...) {
  std::string str = ...;
  slice = str;
}
Use(slice);

When the if statement goes out of scope, str is destroyed and the backing storage of slice disappears.
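One way to avoid the dangling pointer is to declare the owning string in a scope that outlives every use of the slice. The sketch below shows the pattern with a hypothetical MiniSlice struct standing in for leveldb::Slice, so that it compiles without the LevelDB headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for leveldb::Slice: a non-owning pointer and a size.
struct MiniSlice {
  const char* data;
  std::size_t size;
  std::string ToString() const { return std::string(data, size); }
};

MiniSlice MakeSlice(const std::string& s) {
  return MiniSlice{s.data(), s.size()};
}

// Safe pattern: str is declared outside the if block, so it is still
// alive when the slice is read.
std::string SafeUse(bool condition) {
  std::string str;  // owner of the bytes, lives until the function returns
  MiniSlice slice{nullptr, 0};
  if (condition) {
    str = "hello";
    slice = MakeSlice(str);
  }
  assert(slice.size == 0 || slice.data == str.data());
  return slice.size == 0 ? std::string() : slice.ToString();
}
```

The only change from the buggy version is where str is declared; the slice itself is used in the same way.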

Other content

Unit test

The unit tests written by the LevelDB authors show the filter's behavior more directly. The path is: /leveldb-main/util/bloom_test.cc.

// Copyright (c) 2012 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#include "gtest/gtest.h"
#include "leveldb/filter_policy.h"
#include "util/coding.h"
#include "util/logging.h"
#include "util/testutil.h"

namespace leveldb {

static const int kVerbose = 1;

static Slice Key(int i, char* buffer) {
  EncodeFixed32(buffer, i);
  return Slice(buffer, sizeof(uint32_t));
}

class BloomTest : public testing::Test {
 public:
  BloomTest() : policy_(NewBloomFilterPolicy(10)) {}

  ~BloomTest() { delete policy_; }

  void Reset() {
    keys_.clear();
    filter_.clear();
  }

  void Add(const Slice& s) { keys_.push_back(s.ToString()); }

  // Build the filter from the keys added so far.
  void Build() {
    std::vector<Slice> key_slices;
    for (size_t i = 0; i < keys_.size(); i++) {
      key_slices.push_back(Slice(keys_[i]));
    }
    filter_.clear();
    policy_->CreateFilter(&key_slices[0], static_cast<int>(key_slices.size()),
                          &filter_);
    keys_.clear();
    if (kVerbose >= 2) DumpFilter();
  }

  size_t FilterSize() const { return filter_.size(); }

  // Print the filter bits.
  void DumpFilter() {
    std::fprintf(stderr, "F(");
    for (size_t i = 0; i + 1 < filter_.size(); i++) {
      const unsigned int c = static_cast<unsigned int>(filter_[i]);
      for (int j = 0; j < 8; j++) {
        std::fprintf(stderr, "%c", (c & (1 << j)) ? '1' : '.');
      }
    }
    std::fprintf(stderr, ")\n");
  }

  // Check whether the key matches, building the filter first if needed.
  bool Matches(const Slice& s) {
    if (!keys_.empty()) {
      Build();
    }
    return policy_->KeyMayMatch(s, filter_);
  }

  double FalsePositiveRate() {
    char buffer[sizeof(int)];
    int result = 0;
    for (int i = 0; i < 10000; i++) {
      if (Matches(Key(i + 1000000000, buffer))) {
        result++;
      }
    }
    return result / 10000.0;
  }

 private:
  const FilterPolicy* policy_;
  std::string filter_;
  std::vector<std::string> keys_;
};

TEST_F(BloomTest, EmptyFilter) {
  ASSERT_TRUE(!Matches("hello"));
  ASSERT_TRUE(!Matches("world"));
}

TEST_F(BloomTest, Small) {
  Add("hello");
  Add("world");
  ASSERT_TRUE(Matches("hello"));
  ASSERT_TRUE(Matches("world"));
  ASSERT_TRUE(!Matches("x"));
  ASSERT_TRUE(!Matches("foo"));
}

static int NextLength(int length) {
  if (length < 10) {
    length += 1;
  } else if (length < 100) {
    length += 10;
  } else if (length < 1000) {
    length += 100;
  } else {
    length += 1000;
  }
  return length;
}

TEST_F(BloomTest, VaryingLengths) {
  char buffer[sizeof(int)];

  // Count number of filters that significantly exceed the false positive rate
  int mediocre_filters = 0;
  int good_filters = 0;

  for (int length = 1; length <= 10000; length = NextLength(length)) {
    Reset();
    for (int i = 0; i < length; i++) {
      Add(Key(i, buffer));
    }
    Build();

    ASSERT_LE(FilterSize(), static_cast<size_t>((length * 10 / 8) + 40))
        << length;

    // All added keys must match
    for (int i = 0; i < length; i++) {
      ASSERT_TRUE(Matches(Key(i, buffer)))
          << "Length " << length << "; key " << i;
    }

    // Check false positive rate
    double rate = FalsePositiveRate();
    if (kVerbose >= 1) {
      std::fprintf(stderr,
                   "False positives: %5.2f%% @ length = %6d ; bytes = %6d\n",
                   rate * 100.0, length, static_cast<int>(FilterSize()));
    }
    ASSERT_LE(rate, 0.02);  // Must not be over 2%
    if (rate > 0.0125)
      mediocre_filters++;  // Allowed, but not too often
    else
      good_filters++;
  }
  if (kVerbose >= 1) {
    std::fprintf(stderr, "Filters: %d good, %d mediocre\n", good_filters,
                 mediocre_filters);
  }
  ASSERT_LE(mediocre_filters, good_filters / 5);
}

// Different bits-per-byte

}  // namespace leveldb

C++ syntax

Supplement:
I have never studied C++ myself, so this part collects some keywords and syntax whose meaning I had to look up.

explicit

The C++ reference manual explains it as follows:

explicit constructors cannot be called implicitly.
Implicit conversions between class objects are prohibited.

In this article we focus on the first point: once a constructor is marked explicit, it can no longer be called implicitly.

Here is a related example found online:

 #include<cstring>
#include<string>
#include<iostream>

class Explicit
{
    private:

    public:
        Explicit(int size)
        {
            std::cout << " the size is " << size << std::endl;
        }
        Explicit(const char* str)
        {
            std::string _str = str;
            std::cout << " the str is " << _str << std::endl;
        }

        Explicit(const Explicit& ins)
        {
            std::cout << " The Explicit is ins" << std::endl;
        }

        Explicit(int a,int b)
        {
            std::cout << " the a is " << a  << " the b is " << b << std::endl;
        }
};

int main()
{
    Explicit test0(15);
    Explicit test1 = 10; // implicitly calls Explicit(int size)

    Explicit test2("RIGHTRIGHT");
    Explicit test3 = "BUGBUGBUG"; // implicitly calls Explicit(const char* str)

    Explicit test4(1, 10);
    Explicit test5 = test1;
}

Although the program compiles and runs, initializing an Explicit object directly from an int or a const char* is questionable style, and mistakes caused by it are hard to track down. Marking a constructor explicit prevents it from being called implicitly. The effect of adding the keyword is not demonstrated in the listing above; once the constructors are marked explicit, implicit initializations such as test1 and test3 no longer compile.

Rant: a rather crazy thing. Implicit calls were originally allowed to make coding faster, but the pitfall turned out to be so large that the language had to fill it in itself.
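The difference can be checked at compile time. The sketch below defines two small classes invented for this example, ImplicitSize and ExplicitSize, and uses std::is_convertible to show that only the first one accepts an implicit conversion from int.

```cpp
#include <cassert>
#include <type_traits>

class ImplicitSize {
 public:
  ImplicitSize(int size) : size_(size) {}  // allows: ImplicitSize a = 10;
  int size() const { return size_; }

 private:
  int size_;
};

class ExplicitSize {
 public:
  explicit ExplicitSize(int size) : size_(size) { assert(size >= 0); }
  int size() const { return size_; }

 private:
  int size_;
};

// ExplicitSize b = 10;  // would not compile: needs an implicit conversion
static_assert(std::is_convertible<int, ImplicitSize>::value,
              "int converts implicitly to ImplicitSize");
static_assert(!std::is_convertible<int, ExplicitSize>::value,
              "explicit blocks the implicit conversion");
```

This is the reason BloomFilterPolicy's constructor is marked explicit: an int can only become a policy object through a visible constructor call such as ExplicitSize b(10).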

resize function

It is easiest to understand from the following example:

myvector.resize(5);
Shrinks a vector that held 10 numbers to a length of 5. The excess elements are destroyed, although the allocated capacity is not necessarily released. 5 < 10: the length is reduced.
myvector.resize(8,100);
Grows the vector from 5 elements to 8 and fills the new slots with 100, that is, three 100s are appended. 8 > 5: the length grows with an explicit fill value.
myvector.resize(12);
Grows the vector from 8 elements to 12 and fills the new slots with the default value 0, that is, four 0s are appended. 12 > 8: the length grows with no fill value given.
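The three calls can be run as one small program fragment. The function name ResizeDemo and the initial contents 1 to 10 are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

std::vector<int> ResizeDemo() {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  v.resize(5);       // shrink: keeps 1..5
  v.resize(8, 100);  // grow: appends three copies of 100
  v.resize(12);      // grow: appends four value-initialized ints (0)
  assert(v.size() == 12);
  return v;
}
```

bloom.cc uses the same idea on a std::string: dst->resize(init_size + bytes, 0) appends zero-filled bytes that then serve as the bit array.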

Mathematical derivation

The derivation is for readers who want a deeper understanding. You can simply remember the conclusions above; it does not matter if the derivation is unclear.

Much of the following is translated from the paper LNCS 4168 - Less Hashing, Same Performance: Building a Better Bloom Filter (harvard.edu).

false positive: the false positive rate rises as more bits are set to 1 by the hash functions.

Given how a Bloom filter is built, when one hash function is applied to one key, the probabilities that a particular bit is set to 1 or left at 0 are:

$$P(1) = \frac{1}{m}, \qquad P(0) = 1 - \frac{1}{m}$$

After k hash functions, the probability that this bit is still 0 is:

$$P'(0) = P(0)^k = \left(1 - \frac{1}{m}\right)^k$$

After n keys, the probability that the bit is still 0 is:

$$P''(0) = P'(0)^n = \left(1 - \frac{1}{m}\right)^{kn}$$

According to the limit that defines the natural constant e:

$$\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$$

we can approximate the previous P''(0) for large m:

$$P''(0) = \left[\left(1 - \frac{1}{m}\right)^{-m}\right]^{-kn/m} \approx e^{-kn/m}$$

When probing a key that was never inserted, a false positive occurs when the corresponding k bits are all set to 1.

The probability is:

$$f = \left(1 - P''(0)\right)^k \approx \left(1 - e^{-kn/m}\right)^k$$

The question is how to minimize the false positive rate f.

To simplify the notation, first define p (that is, P''(0), the probability that a bit is still 0):

$$p = e^{-kn/m}$$

The formula then becomes:

$$f = (1 - p)^k = e^{k \ln(1 - p)}$$

The base e is a constant, so minimizing f is equivalent to minimizing the exponent:

$$g = k \ln(1 - p)$$

From the definition of p we have k = -(m/n) ln p, which gives:

$$g = -\frac{m}{n} \ln(p) \ln(1 - p)$$

By symmetry, g, and therefore f, takes its minimum value at p = 1/2.

At that point, k and the minimum of f are:

$$k = \frac{m}{n} \ln 2, \qquad f = \left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2} \approx 0.6185^{m/n}$$

The final conclusion:

Since p is the probability that a bit is 0, the false positive rate is lowest when half of the m bits are 1 and half are 0.
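The conclusion k = ln2 * (m/n) can be sanity-checked numerically. The sketch below tries every integer k up to a limit and returns the one with the lowest estimated false positive rate; the function name BestK is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Brute-force the integer k in [1, max_k] that minimizes
// (1 - e^(-k / bits_per_key))^k, where bits_per_key = m / n.
int BestK(double bits_per_key, int max_k) {
  assert(bits_per_key > 0 && max_k >= 1);
  int best_k = 1;
  double best_rate = 1.0;
  for (int k = 1; k <= max_k; ++k) {
    const double rate = std::pow(1.0 - std::exp(-k / bits_per_key), k);
    if (rate < best_rate) {
      best_rate = rate;
      best_k = k;
    }
  }
  return best_k;
}
```

For m/n = 10 the search lands on k = 7, the integer closest to 10 * ln2 ≈ 6.93. LevelDB rounds down instead and uses k = 6 for 10 bits per key, trading a slightly higher error rate for one fewer probe.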

The relationship between the false positive rate, m/n, and k is tabulated in the following screenshot:

Summary

Bloom filters are often used to quickly determine whether an element is in a set. In essence they trade a tolerable error rate for efficiency in space and time.

Significance for LevelDB: compared with a hash table, the collision-handling part is omitted, and LevelDB applies one optimization in its implementation: a single hash function approximates the effect of k hash functions. This keeps concurrent writes efficient without sacrificing too much performance.

Apart from the hash function and the optimization for concurrent writes, LevelDB's implementation stays very close to the theory of the Bloom filter, which makes it an excellent case study and a good reference template for a production-grade C++ filter.

In the end, when you have a problem, you can't go wrong asking Bloom.
