Hash function

In computing, a function is a black box that takes an input and produces an output, and a hash function is one kind of function. We usually come across two types of hash functions.

  • Hash functions used for hash tables, for example the hash functions in a Bloom filter or in HashMap.
  • Hash functions used for encryption and signatures, for example MD5 and SHA-256.

Hash functions generally have the following characteristics.

  • Fixed length. Every input produces an output of the same length.
  • Deterministic. The same input always produces the same output.
  • One-way. The output is computed from the input, but the input cannot practically be recovered from the output.
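The first two properties are easy to observe with SHA-256 from the JDK's `MessageDigest`. The following is a minimal sketch for illustration only; the class name `HashProperties` and the helper `sha256Hex` are names made up here, and `String.repeat` assumes Java 11 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashProperties {
    // Returns the SHA-256 digest of the text as a lowercase hex string.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String shortInput = "a";
        String longInput = "a".repeat(10_000); // assumption: Java 11+

        // Fixed length: both digests are 32 bytes, i.e. 64 hex characters.
        System.out.println(sha256Hex(shortInput).length()); // 64
        System.out.println(sha256Hex(longInput).length());  // 64

        // Deterministic: hashing the same input twice gives the same output.
        System.out.println(sha256Hex(shortInput).equals(sha256Hex(shortInput))); // true
    }
}
```

The one-way property cannot be demonstrated by a short program; it is a claim about how hard the reverse computation is.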

Hash function quality

The role of a hash function is to map a large body of data to a short value that represents the whole, much as an ID number represents a person.

The quality of a hash function is mainly judged on the following aspects:

  • Whether the hash values are evenly distributed and look random. This improves the space utilization of a hash table and makes the hash harder to break;
  • Whether the probability of a hash collision is low. The collision probability should stay within an acceptable range;
  • Whether it is fast to compute. The shorter the computation time, the more efficient the hash function.
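To get a rough feel for the first criterion, one can count how many keys land in each bucket of a small table. The following is a minimal sketch, not a rigorous statistical test: the class name `BucketSpread`, the synthetic keys `"key-0"`, `"key-1"`, ... and the 16-bucket table are all assumptions made for illustration.

```java
class BucketSpread {
    // Counts how many of the strings "key-0" .. "key-(keys-1)" fall into each bucket
    // when the bucket index is hashCode() reduced modulo the bucket count.
    static int[] bucketCounts(int keys, int buckets) {
        int[] counts = new int[buckets];
        for (int i = 0; i < keys; i++) {
            int h = ("key-" + i).hashCode();
            // floorMod keeps the index non-negative even when h is negative.
            counts[Math.floorMod(h, buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int keys = 100_000;
        int buckets = 16;
        int[] counts = bucketCounts(keys, buckets);

        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        System.out.println("ideal per bucket: " + keys / buckets);
        System.out.println("min: " + min + ", max: " + max);
    }
}
```

The closer the minimum and maximum are to the ideal value, the more evenly this particular key set is spread; a different key set may behave differently.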

Collision probability

What is a collision?

A collision occurs when different data map to the same hash value.

Collisions are inevitable, because there are more possible inputs than possible hash values. The collision probability can only be reduced as far as possible, and it is determined by the hash length and the algorithm.
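A well-known concrete case in Java is the pair of strings "Aa" and "BB", which are different but share the same `String.hashCode()`. The sketch below only illustrates the definition; the class name `KnownCollision` and the helper `sameHash` are made up here.

```java
class KnownCollision {
    // True when two strings have the same hashCode(), whether or not they are equal.
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112
        System.out.println("Aa".hashCode()); // 2112
        // 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println("Aa".equals("BB"));    // false
        System.out.println(sameHash("Aa", "BB")); // true
    }
}
```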

How do we estimate the probability of a collision? A classic model is the birthday problem. The math shows that among 23 people the probability that two of them share a birthday is greater than 50%, and among 100 people it is more than 99%. This is counter-intuitive, so it is also called the birthday paradox.
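These two figures can be checked by multiplying out the probability that all birthdays are distinct. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the class and method names are made up for illustration.

```java
class BirthdayProblem {
    // Exact probability that at least two of `people` share a birthday,
    // assuming `days` equally likely birthdays.
    static double sharedBirthdayProbability(int people, int days) {
        double allDistinct = 1.0;
        for (int i = 0; i < people; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDistinct *= (double) (days - i) / days;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("23 people:  %.4f%n", sharedBirthdayProbability(23, 365));
        System.out.printf("100 people: %.7f%n", sharedBirthdayProbability(100, 365));
    }
}
```

With 23 people the result is just above 0.5, while with 22 people it is still below 0.5, which is why 23 is the number usually quoted.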

[Figure: the birthday problem]

The birthday problem is the theoretical basis for estimating collision probability. In cryptography, this theory implies that an attacker needs only about \( {\textstyle {\sqrt {2^{n}}}=2^{n/2}} \) attempts to find a collision in a hash function with an \( n \)-bit output.
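As a rough sketch of where the square root comes from, take the standard birthday approximation and ask when the collision probability reaches one half. The symbols here are chosen only for this sketch: \( b \) is the output length in bits, \( d = 2^{b} \) is the number of possible hash values, and \( k \) is the number of hashes generated.

$$ 1 - e^{-\frac{k^{2}}{2d}} = \frac{1}{2} \quad\Longrightarrow\quad k = \sqrt{2d\ln 2} \approx 1.1774\sqrt{d} = 1.1774\cdot 2^{b/2} $$

For a 32-bit hash this gives \( k \approx 1.1774 \cdot 2^{16} \approx 7.7 \times 10^{4} \), and for a 64-bit hash \( k \approx 1.1774 \cdot 2^{32} \approx 5.1 \times 10^{9} \). So the effort grows with \( 2^{b/2} \), not \( 2^{b} \).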

The following is a reference table for collisions at different hash bit lengths:
[Figure: collision table]

In addition, following the derivation on Wikipedia, we can obtain the following formulas.

Given the number of existing hash values \( n \) and the size of the hash range \( d \), estimate the collision probability \( p(n) \):

$$ p(n)\approx 1- e^{-\frac{n(n-1)}{2d}} $$

Given the collision probability \( p \) and the size of the hash range \( d \), estimate the number of hashes \( n \) needed to reach that collision probability:

$$ n (p)\approx \sqrt{2\cdot d\ln\left({1 \over 1-p}\right)}+{1 \over 2} $$

Given the number of hashes \( n \) and the size of the hash range \( d \), estimate the expected number of collisions \( rn \):

$$ {\displaystyle rn=n-d+d\left({\frac {d-1}{d}}\right)^{n}} $$

// Estimate the theoretical collision probability for n hashes in a range of size d
public static double collisionProb(double n, double d) {
    return 1 - Math.exp(-0.5 * (n * (n - 1)) / d);
}
// Estimate the number of hashes needed to reach collision probability p
public static long collisionN(double p, double d) {
    return Math.round(Math.sqrt(2 * d * Math.log(1 / (1 - p))) + 0.5);
}
// Estimate the expected number of colliding hashes among n hashes
public static double collisionRN(double n, double d) {
    return n - d + d * Math.pow((d - 1) / d, n);
}
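As a quick sanity check, the first two estimates should roughly undo each other: the number of hashes returned for a target probability should give back about that probability. The sketch below repeats the two methods in a self-contained class so it can be run on its own; the class name `CollisionEstimates` is made up for illustration.

```java
class CollisionEstimates {
    // Approximate collision probability for n hashes in a range of size d.
    static double collisionProb(double n, double d) {
        return 1 - Math.exp(-0.5 * (n * (n - 1)) / d);
    }

    // Approximate number of hashes needed to reach collision probability p.
    static long collisionN(double p, double d) {
        return Math.round(Math.sqrt(2 * d * Math.log(1 / (1 - p))) + 0.5);
    }

    public static void main(String[] args) {
        // Birthday setting: 23 people, 365 days.
        System.out.println(collisionProb(23, 365));

        // 32-bit hash range: how many hashes until a collision is as likely as not?
        double d = Math.pow(2, 32);
        long n = collisionN(0.5, d);
        System.out.println(n);
        // Feeding n back in should give a value close to 0.5.
        System.out.println(collisionProb(n, d));
    }
}
```

For the 32-bit range the count comes out on the order of 77,000 hashes, which matches the \( 2^{n/2} \) rule of thumb for a 32-bit output.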

Using the formulas above, we can evaluate String.hashCode(). In Java, hashCode() returns an int, so the hash range is \( 2^{32} \). Let's see how String.hashCode() performs on 10 million UUIDs.

For 10 million UUIDs, the theoretical number of collisions is 11632.50:

collisionRN(10000000, Math.pow(2, 32)) // 11632.50

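Use the following code to test:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

// Groups the strings by hashCode(); a bucket with more than one string is a collision.
private static Map<Integer, Set<String>> collisions(Set<String> values) {
    Map<Integer, Set<String>> result = new HashMap<>();
    for (String value : values) {
        Integer hashCode = value.hashCode();
        Set<String> bucket = result.computeIfAbsent(hashCode, k -> new TreeSet<>());
        bucket.add(value);
    }
    return result;
}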

public static void main(String[] args) {
    // Generate 10 million distinct random UUID strings.
    Set<String> uuids = new HashSet<>();
    for (int i = 0; i < 10000000; i++) {
        uuids.add(UUID.randomUUID().toString());
    }
    Map<Integer, Set<String>> values = collisions(uuids);

    // Find the hash code shared by the largest number of strings.
    int maxhc = 0, maxsize = 0;
    for (Map.Entry<Integer, Set<String>> e : values.entrySet()) {
        Integer hashCode = e.getKey();
        Set<String> bucket = e.getValue();
        if (bucket.size() > maxsize) {
            maxhc = hashCode;
            maxsize = bucket.size();
        }
    }

    System.out.println("Total UUIDs: " + uuids.size());
    System.out.println("Distinct hash values: " + values.size());
    System.out.println("Total collisions: " + (uuids.size() - values.size()));
    System.out.println("Collision rate: " + String.format("%.8f", 1.0 * (uuids.size() - values.size()) / uuids.size()));
    // Only report a bucket if at least two strings actually collided.
    if (maxsize > 1) {
        System.out.println("Largest colliding bucket: " + maxsize + " " + values.get(maxhc));
    }
}

The total number of collisions, 11713, is very close to the theoretical value.

Total UUIDs: 10000000
Distinct hash values: 9988287
Total collisions: 11713
Collision rate: 0.00117130

Note that the test above is not enough to draw a general conclusion about the performance of String.hashCode(). Strings come in many forms, and a single test cannot cover them all.

The quality of the hashCode algorithm in the JDK determines how keys are distributed in a hash table. By comparing theoretical estimates with measured values, an algorithm can be tuned step by step.

For some well-known hash algorithms, such as FNV-1 and Murmur2, there is a post online that compares their collision probability and distribution:

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
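For readers who have not seen FNV-1, here is a minimal sketch of its 32-bit variant, using the published FNV offset basis (0x811C9DC5) and prime (0x01000193). It is an illustration only, not the code used in the linked comparison, and the class name `Fnv1` is made up here.

```java
import java.nio.charset.StandardCharsets;

class Fnv1 {
    private static final int OFFSET_BASIS = 0x811C9DC5; // 32-bit FNV offset basis
    private static final int PRIME = 0x01000193;        // 32-bit FNV prime

    // 32-bit FNV-1: for each byte, multiply by the prime, then XOR in the byte.
    static int hash32(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash *= PRIME;      // int overflow wraps, which is arithmetic modulo 2^32
            hash ^= (b & 0xFF); // treat the byte as unsigned
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] input = "a".getBytes(StandardCharsets.UTF_8);
        System.out.println(String.format("%08x", hash32(input))); // 050c5d7e
    }
}
```

FNV-1a, the more commonly recommended variant, swaps the two steps inside the loop: XOR first, then multiply.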

Summary

A hash function maps long data to a short, fixed-length value. To judge the quality of a hash function, look at its collision probability and the distribution of its hash values.

https://en.wikipedia.org/wiki/Birthday_problem

编程码农