With algorithms such as Levenshtein distance (the Levenshtein package or difflib), it is easy to find approximate matches, e.g.:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

Fuzzy matches can then be detected by choosing a threshold that suits the use case.
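As a minimal sketch of that threshold check: the helper name `is_fuzzy_match` and the default threshold of 0.8 are assumptions made for illustration, not part of difflib.

```python
import difflib

def is_fuzzy_match(a, b, threshold=0.8):
    """Return True if the difflib similarity ratio of a and b reaches threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(is_fuzzy_match("amazing", "amaging"))  # ratio is about 0.857
print(is_fuzzy_match("amazing", "boring"))
```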

Current requirement: find fuzzy substrings inside a larger string, based on a threshold.

For example:

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
# result = "manhatan", "manhattin" and their indexes in large_string

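One brute-force solution is to generate all substrings of length N-1 to N+1 (or some other range of matching lengths), where N is the length of query_string, run Levenshtein on each of them one by one, and compare the result against the threshold.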
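The brute-force idea above could be sketched roughly as follows. It uses difflib's ratio in place of a Levenshtein score so that it stays within the standard library; the function name `brute_force_fuzzy_find`, the `slack` parameter and the 0.85 threshold are assumptions chosen for illustration.

```python
import difflib

def brute_force_fuzzy_find(query, text, threshold=0.85, slack=1):
    """Yield (start, substring, ratio) for every window scoring >= threshold.

    Windows of length len(query) - slack .. len(query) + slack are tried
    at every start position, so the cost grows with len(text) * slack.
    """
    n = len(query)
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(0, len(text) - length + 1):
            candidate = text[start:start + length]
            ratio = difflib.SequenceMatcher(None, query, candidate).ratio()
            if ratio >= threshold:
                yield start, candidate, ratio

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
hits = list(brute_force_fuzzy_find(query_string, large_string))
for start, candidate, ratio in hits:
    print(start, candidate, round(ratio, 3))
```

Overlapping windows can report the same region more than once, so the hits may need de-duplication, for example by keeping the best-scoring window among those that overlap.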

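Is there a better solution in Python, preferably a module included in Python 2.7, or an externally available one?

------ Update and solution ------

The Python regex module works quite well, although for fuzzy substring cases it is slightly slower than the built-in re module, which is the expected cost of the extra work. The output is what was wanted, and the degree of fuzziness is easy to control.

>>> import regex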
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>

Originally posted by DhruvPathak; translation licensed under CC BY-SA 4.0

python python-2.7 fuzzy-search

2 Answers

Community wiki

Posted on 2023-01-09

✓ Accepted

The regex library, which was proposed as an eventual replacement for re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

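The fuzzy matching syntax looks fairly expressive; the example below gives you a match with at most one insertion, substitution or deletion.

import regex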
regex.match('(amazing){e<=1}', 'amaging')
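To make "at most one error" concrete, here is a small pure-Python sketch of the Levenshtein edit distance, which counts insertions, deletions and substitutions. It only illustrates the error count and is not how the regex module is implemented; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between a and b via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(levenshtein("amazing", "amaging"))  # 1: a single substitution (z -> g)
```

With a distance of 1, "amaging" falls within the `{e<=1}` budget used in the pattern above.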

Originally posted by mgbelisle; translation licensed under CC BY-SA 3.0

Community wiki

Posted on 2023-01-09

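I use fuzzywuzzy to fuzzy-match against a threshold and fuzzysearch to fuzzy-extract the words from the match.

process.extractBests takes a query, a list of words and a cutoff score, and returns a list of (match, score) tuples for the entries that score at or above the cutoff.

find_near_matches takes the result of process.extractBests and returns the start and end indices of the matching words. I use those indices to build the words, then use each built word to find its index in the large string. The max_l_dist argument of find_near_matches is the Levenshtein distance, which has to be adjusted to suit your needs.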

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''Fuzzy-match 'qs' in 'ls' and yield (match, index) tuples.'''
    # extractBests keeps only the choices scoring at least `threshold`;
    # here the single choice is the whole large string.
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        # max_l_dist is the maximum Levenshtein distance allowed per match.
        for near in find_near_matches(qs, word, max_l_dist=1):
            match = word[near.start:near.end]
            # str.find returns the first occurrence of the matched text.
            index = ls.find(match)
            yield (match, index)

To test:

query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

Output:

query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49

query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58

query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29
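For readers who want to see what a maximum Levenshtein distance means for substring search without installing anything, the following is a standard-library sketch based on the classic dynamic-programming approach to approximate string matching (often attributed to Sellers). It is not how fuzzysearch is implemented, and the function name `near_match_ends` is an assumption for this example. It reports only where a near match ends and its distance.

```python
def near_match_ends(query, text, max_dist):
    """Yield (end, dist) where some substring of text ending at index
    `end` (exclusive) is within dist <= max_dist edits of query."""
    m = len(query)
    # column[i]: smallest edit distance between query[:i] and any
    # substring of text that ends at the current position.
    column = list(range(m + 1))
    for pos, ch in enumerate(text, start=1):
        new_column = [0]  # a match may start anywhere, so the empty prefix costs 0
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == ch else 1
            new_column.append(min(column[i] + 1,           # query char unmatched
                                  new_column[i - 1] + 1,   # text char unmatched
                                  column[i - 1] + cost))   # substitute or keep
        column = new_column
        if column[m] <= max_dist:
            yield pos, column[m]

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(near_match_ends("manhattan", large_string, 1)))
# [(16, 1), (58, 1)]
```

The end positions 16 and 58 line up with "manhatan" starting at index 8 and "manhattin" starting at index 49 in the output above. Recovering the start index would need extra bookkeeping, which is left out to keep the sketch short.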

Originally posted by Nizam Mohamed; translation licensed under CC BY-SA 4.0
