Similarity String Comparison in Java

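I want to compare several strings with each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example: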

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I think I need a method such as:

 double similarityIndex(String s1, String s2)

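Is there anything like this out there?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automatic way to find which MS Project entries are similar to entries in the system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.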

Originally posted by Mario Ortegón; translation licensed under CC BY-SA 4.0.


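A common way of calculating the similarity between two strings on a 0%-100% scale, as used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: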

/**
 * Calculates the similarity (a number between 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
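
Written as a formula, the calculation above looks as follows. The symbols are introduced here only for illustration: $d(s_1, s_2)$ stands for whatever `editDistance()` returns, and $|s|$ for the length of a string.

```latex
\[
\mathrm{similarity}(s_1, s_2) =
\begin{cases}
1 & \text{if } \max(|s_1|, |s_2|) = 0,\\[4pt]
\dfrac{\max(|s_1|, |s_2|) - d(s_1, s_2)}{\max(|s_1|, |s_2|)} & \text{otherwise.}
\end{cases}
\]

% Worked example with the Levenshtein distance as d:
% |"kitten"| = 6, |"sitting"| = 7, d = 3
\[
\mathrm{similarity}(\text{kitten}, \text{sitting}) = \frac{7 - 3}{7} \approx 0.571
\]
```

Since the Levenshtein distance never exceeds the length of the longer string, the result stays within the range 0 to 1.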

Computing `editDistance()`:

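The `editDistance()` function above is expected to compute the edit distance between the two strings. There are several implementations of this step, and each may suit a particular scenario better. The most common is the _Levenshtein distance algorithm_, which we'll use in the example below (for very large strings, other algorithms are likely to perform better).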
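To make the idea of an edit distance concrete, here is a minimal sketch of the textbook full-table Levenshtein computation. The class name `LevenshteinTable` and the method name `distance` are chosen for this illustration only and do not come from any library. It trades memory for readability, and unlike the example implementation in this answer it compares characters case-sensitively.

```java
/** Illustrative sketch only: textbook full-table Levenshtein distance. */
final class LevenshteinTable {

  /**
   * Returns the minimum number of single-character insertions, deletions,
   * or substitutions needed to turn a into b (case-sensitive).
   */
  static int distance(String a, String b) {
    // d[i][j] = edit distance between the first i chars of a and the first j chars of b
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all i characters
    for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all j characters

    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        int substitution = d[i - 1][j - 1] + cost; // keep or replace a character
        int deletion = d[i - 1][j] + 1;            // drop a character of a
        int insertion = d[i][j - 1] + 1;           // add a character of b
        d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("kitten", "sitting")); // prints 3
    System.out.println(distance("47/2010", "472010")); // prints 1
  }
}
```

The table needs memory proportional to the product of the two lengths, which is why compact implementations usually keep only one or two rows at a time.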

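Here are two options for calculating the edit distance:

- Use Apache Commons Text's implementation of the Levenshtein distance: `LevenshteinDistance.apply(CharSequence left, CharSequence right)`
- Implement it yourself. An example implementation is shown below.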

Working example:


public class StringSimilarity {

  /**
   * Calculates the similarity (a number between 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
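
For the use case in the question (finding which of several candidate entries is closest to a given description), a similarity score like the one above can be used to pick the best match. The following is a rough sketch under stated assumptions: `BestMatch` and `mostSimilar` are hypothetical names invented for this illustration, the class carries its own compact copy of the similarity calculation so that it runs standalone, and `List.of` assumes Java 9 or later.

```java
import java.util.List;

/** Illustrative sketch only: picks the candidate most similar to a query string. */
final class BestMatch {

  /** Similarity between 0 and 1, computed the same way as in the answer above. */
  static double similarity(String s1, String s2) {
    int longerLength = Math.max(s1.length(), s2.length());
    if (longerLength == 0) return 1.0; // both strings are empty
    return (longerLength - editDistance(s1, s2)) / (double) longerLength;
  }

  /** Case-insensitive Levenshtein distance that keeps only two rows of the table. */
  static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] prev = new int[s2.length() + 1];
    int[] curr = new int[s2.length() + 1];
    for (int j = 0; j <= s2.length(); j++) prev[j] = j;
    for (int i = 1; i <= s1.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= s2.length(); j++) {
        int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost, Math.min(prev[j], curr[j - 1]) + 1);
      }
      int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
    }
    return prev[s2.length()];
  }

  /** Returns the candidate with the highest similarity, or null if the list is empty. */
  static String mostSimilar(String query, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(query, candidate);
      if (score > bestScore) { // the first candidate wins on ties
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> candidates = List.of("The fox", "The fox jumped");
    System.out.println(mostSimilar("The quick fox jumped", candidates)); // prints The fox jumped
  }
}
```

Because abbreviated descriptions can score close to one another, the top match is best treated as a suggestion to review by hand rather than as a final answer.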

Originally posted by acdcjunior; translation licensed under CC BY-SA 3.0.
