
Bloom filter

A Bloom filter consists of a bit array and multiple hash functions. A membership query returns one of two results: the element may exist, or it definitely does not exist.

Whether an element is in the Bloom filter is determined jointly by several state values (bits). The bit array stores the state values, and the hash functions compute their positions.

According to its algorithm structure, it has the following characteristics:

  • A finite bit array can represent more elements than its length, because a single bit can mark multiple elements at the same time.
  • Elements cannot be deleted, because a single bit may mark multiple elements at the same time, so clearing it could affect other elements.
  • Adding an element never fails. However, as more elements are added, the false positive rate increases.
  • If the filter reports that an element does not exist, it definitely does not exist.

For example, in the following figure, whether each of X, Y, and Z exists is determined by three state values, and the positions of those state values are computed by three hash functions.

[Figure: Bloom filter in which X, Y, and Z are each mapped to three bits of the bit array]
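The structure described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name SimpleBloomFilter, the FNV-1a-style hash, and the double-hashing scheme used to derive k positions are choices made for this example and do not come from Guava, Redisson, or the article.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: a bit array plus k bit positions per element. */
final class SimpleBloomFilter {
    private final BitSet bits; // the bit array holding the state values
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Sets the k bits selected for this element. Never fails. */
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    /** Returns false if the element was definitely never added, true if it may have been. */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Assumption: the i-th position is derived from two base hashes (double hashing).
    private int index(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0x811C9DC5);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, m);
    }

    // FNV-1a-style hash with a configurable starting value (illustrative only).
    private static int hash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("X");
        filter.add("Y");
        filter.add("Z");
        System.out.println("X may exist: " + filter.mightContain("X"));
        System.out.println("W may exist: " + filter.mightContain("W"));
    }
}
```

There is deliberately no remove method: clearing a bit could also erase evidence of other elements that share it.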

Mathematical relationship

False positive probability

Because each bit may mark multiple elements at the same time, a Bloom filter has a certain probability of false positives. If every bit in the array is set, a membership check always returns true, so for elements that do not exist the false positive rate is 100%.

So what does the false positive probability depend on? The number of added elements, the length of the Bloom filter (the size of the bit array), and the number of hash functions.

According to the derivation on Wikipedia, the false positive probability \( P_{fp} \) satisfies the following relationship:

$$ { P_{fp} =\left(1-\left[1-{\frac {1}{m}}\right]^{kn}\right)^{k}\approx \left(1-e^{{-\frac {kn}{m}}}\right)^{k}} $$

  • $m$ is the size of the bit array;
  • $n$ is the number of elements that have been added;
  • $k$ is the number of hash functions;
  • $e$ is a mathematical constant, approximately equal to 2.718281828.

It follows that when no elements have been added ($n = 0$), the false positive rate is 0; when every bit in the array is 1, the false positive rate is 100%.
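To make the formula concrete, here is a small Java sketch that evaluates both the exact expression and its exponential approximation. The class name FalsePositiveRate and the sample values of m, n, and k are assumptions chosen for illustration.

```java
/** Evaluates the false positive probability formula for given m, n, and k. */
final class FalsePositiveRate {

    /** Exact form: (1 - (1 - 1/m)^(k*n))^k. */
    static double exact(long m, long n, int k) {
        double bitStillZero = Math.pow(1.0 - 1.0 / m, (double) k * n);
        return Math.pow(1.0 - bitStillZero, k);
    }

    /** Approximation: (1 - e^(-k*n/m))^k. */
    static double approx(long m, long n, int k) {
        double bitStillZero = Math.exp(-((double) k * n) / m);
        return Math.pow(1.0 - bitStillZero, k);
    }

    public static void main(String[] args) {
        long m = 10_000_000L; // bits in the array (sample value)
        int k = 7;            // hash functions (sample value)
        for (long n : new long[] {0L, 500_000L, 1_000_000L, 2_000_000L}) {
            System.out.printf("n=%d exact=%.6f approx=%.6f%n",
                    n, exact(m, n, k), approx(m, n, k));
        }
    }
}
```

With n = 0 both methods return 0, and the value grows as n increases, which matches the boundary behavior described above.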

With different numbers of hash functions, the relationship between $P_{fp}$ and $n$ is as follows:

[Figure: false positive probability $P_{fp}$ as a function of $n$ for different numbers of hash functions]

The false positive probability formula can be used to:

  • Estimate the optimal Bloom filter length.
  • Estimate the optimal number of hash functions.

Optimal Bloom filter length

When the number of added elements \( n \) and the false positive probability \( P_{fp} \) are determined, \( m \) is equal to:

$$ {\displaystyle m=-{\frac {n\ln P_{fp}}{(\ln 2)^{2}}} \approx -1.44\cdot n\log _{2}P_{fp}} $$

Optimal number of hash functions

When \( n \) and \(P_{fp} \) are determined, \( k \) is equal to:

$$ {\displaystyle k=-{\frac {\ln P_{fp} }{\ln 2}}=-\log _{2}P_{fp} } $$

When \( n \) and \( m \) are determined, \( k \) is equal to:

$$ {\displaystyle k={\frac {m}{n}}\ln 2} $$
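As a worked check of the three formulas, take the sample values \( n = 10^{6} \) and \( P_{fp} = 10^{-3} \) (assumed here purely for illustration):

$$ m = -\frac{10^{6}\cdot \ln 10^{-3}}{(\ln 2)^{2}} \approx \frac{10^{6}\cdot 6.9078}{0.4805} \approx 1.44\times 10^{7}\ \text{bits} $$

$$ k = -\log _{2} 10^{-3} \approx 9.97, \qquad k = \frac{m}{n}\ln 2 \approx 14.38 \cdot 0.6931 \approx 9.97 $$

Both expressions for \( k \) give the same value, which rounds to 10 hash functions, at a cost of roughly 14.4 bits per element.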

Implement Bloom Filter

When using a Bloom filter, we generally evaluate two factors:

  • The maximum number of elements expected to be added.
  • The business's tolerance for errors. For example, if 1 error in 1000 is acceptable, the false positive probability should be within one thousandth.

Most Bloom filter libraries expose the expected number of insertions and the false positive probability as configuration parameters, and compute the optimal bit array length and number of hash functions from them.

There are some good Bloom filter libraries in Java:

  • BloomFilter in Guava.
  • RedissonBloomFilter in Redisson, which can be used with Redis.

BloomFilter in Guava calculates the number of bits and the number of hash functions before creating the filter.

 static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    /**
     * expectedInsertions: expected number of insertions
     * fpp: false positive probability
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

The optimal bit array length is calculated with the formula for the optimal Bloom filter length.


static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

The number of hash functions is calculated with the formula for the optimal number of hash functions, given \( n \) and \( m \).

static int optimalNumOfHashFunctions(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

RedissonBloomFilter in Redisson uses the same calculations.

    private int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    private long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

Memory footprint

Consider a mobile phone number deduplication scenario in which each phone number occupies 22 bytes. The estimated logical memory usage is as follows.

Expected insertions   HashSet     fpp=0.0001   fpp=0.0000001
1000000               18.28MB     2.29MB       4MB
10000000              182.82MB    22.85MB      40MB
100000000             1.78GB      228.53MB     400MB
Note: The actual physical memory usage is greater than the logical memory.
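The Bloom filter figures above can be estimated from the optimal length formula. The sketch below assumes that 1 MB means 1024 × 1024 bytes and that only the bit array is counted; the class name BloomFilterMemory and the helper megabytes are hypothetical names used for this example.

```java
/** Estimates the logical size of a Bloom filter bit array. */
final class BloomFilterMemory {

    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Size of the bit array in MB, assuming 1 MB = 1024 * 1024 bytes. */
    static double megabytes(long n, double p) {
        return optimalNumOfBits(n, p) / 8.0 / (1024 * 1024);
    }

    public static void main(String[] args) {
        long[] expectedInsertions = {1_000_000L, 10_000_000L, 100_000_000L};
        double[] fpps = {0.0001, 0.0000001};
        for (long n : expectedInsertions) {
            for (double p : fpps) {
                System.out.printf("n=%d fpp=%s -> %.2f MB%n", n, p, megabytes(n, p));
            }
        }
    }
}
```

Under these assumptions the computed sizes agree with the two fpp columns to within rounding. The size grows linearly with the number of elements and only logarithmically as the false positive probability shrinks.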

The relationship between the false positive probability \( p \), the number of added elements \( n \), the bit array length \( m \), and the number of hash functions \( k \) is as follows:

[Figure: relationship diagram]

Application scenario

  1. Weak password detection;
  2. Spam address filtering;
  3. The browser detects phishing websites;
  4. Cache penetration.

Weak password detection

Maintain a list of hashed weak passwords. When a user registers or changes a password, the new password is checked against the Bloom filter, and the user is warned if it is detected.

Spam address filtering

Maintain a list of hashed spam addresses. When a user receives mail, the sender is checked against the Bloom filter, and detected mail is marked as spam.

Browser detects phishing websites

Use a Bloom filter to check whether a website URL exists in the phishing website database.

Cache penetration

Cache penetration means querying data that exists in neither the cache layer nor the database, so neither of them hits. When the cache misses, the database is queried:

  1. If the database also misses, the empty result is not written back to the cache, and an empty result is returned.
  2. If the database hits, the query result is written back to the cache and returned.

In a typical attack, a large number of requests query data that does not exist. All of them fall through to the database, which can cause the database to crash.

One solution is to put the keys of the existing data into a Bloom filter and check each request against it before querying the cache and the database.

[Figure: cache]
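The filtering step can be sketched as follows. This is an illustration only: GuardedCache, its fields, and the counter databaseQueries are hypothetical names, the cache and database are plain in-memory maps, and the Bloom filter is represented by a Predicate so that any implementation offering a "might contain" test can be plugged in.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

/** Cache lookup that consults a Bloom filter before touching the cache or database. */
final class GuardedCache {
    private final Predicate<String> mightContain;    // Bloom filter membership test
    private final Function<String, String> database; // returns null when the key is absent
    private final Map<String, String> cache = new HashMap<>();
    int databaseQueries = 0;                         // counts how often the database is hit

    GuardedCache(Predicate<String> mightContain, Function<String, String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    String get(String key) {
        if (!mightContain.test(key)) {
            return null; // definitely absent: skip both cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        databaseQueries++;
        String value = database.apply(key);
        if (value != null) {
            cache.put(key, value); // write the hit back to the cache
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("user:1", "Alice");
        // Stand-in for a real Bloom filter: an exact set, which has no false positives.
        Set<String> knownKeys = new HashSet<>(db.keySet());
        GuardedCache guarded = new GuardedCache(knownKeys::contains, db::get);

        System.out.println(guarded.get("user:404")); // rejected by the filter
        System.out.println(guarded.get("user:1"));   // loaded from the database
        System.out.println(guarded.get("user:1"));   // served from the cache
        System.out.println("database queries: " + guarded.databaseQueries);
    }
}
```

With a real Bloom filter, a small fraction of nonexistent keys would still pass the check as false positives and reach the database, so the filter reduces the load rather than eliminating it.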

Summary

For massive data sets, Bloom filters have clear advantages. In addition, the key is to make a reasonable estimate of the expected number of insertions and the acceptable false positive probability based on the business scenario.

References

https://en.wikipedia.org/wiki/Bloom_filter

https://hur.st/bloomfilter


Author: 编程码农