
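I have a machine learning task involving a large amount of text data. I want to identify and extract the noun phrases in the training text so that I can use them for feature construction later in the pipeline. I have already extracted the type of noun phrases I want, but I am fairly new to NLTK, so I approached the problem in a way that breaks each step down into its own list comprehension, as shown below.

But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I am not seeing?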

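import nltk
import pandas as pd

# Raw string so the backslashes in the path are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']

# Defining a grammar & Parser (raw string keeps \w intact in the tag pattern)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)

# Tokenize each message
tokens = [nltk.word_tokenize(i) for i in texts]

# Part-of-speech tag each token list
tag_list = [nltk.pos_tag(w) for w in tokens]

# Chunk each tagged sentence with the grammar
phrases = [chunkr.parse(sublist) for sublist in tag_list]

# Collect the leaves of every NP subtree; label is a method in NLTK 3, so it must be called
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP')] for tree in phrases]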

Flatten the list of lists of lists of tuples that we end up with into a list of lists of tuples:

leaves = [tupls for sublists in leaves for tupls in sublists]

Join the extracted terms into one bigram:

nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
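The last line above indexes only the first two tuples of each chunk, so a chunk of three or more words would be cut short. A minimal sketch of joining every word in each chunk instead; the leaves list here is a hand-written stand-in for the flattened list of (word, tag) tuples, not real tagger output:

```python
# Hand-written stand-in for the flattened `leaves` list: one list of
# (word, tag) tuples per extracted chunk. The values are illustrative only.
leaves = [
    [("bar", "NN"), ("sentence", "NN")],
    [("New", "NNP"), ("York", "NNP"), ("city", "NN")],
]

# Join every word of each chunk, whatever its length, and drop the tags.
nounphrases = [" ".join(word for word, tag in chunk) for chunk in leaves]

print(nounphrases)  # ['bar sentence', 'New York city']
```

This keeps the result a flat list of strings, which is usually the easiest shape to feed into a later feature-building step.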

Original post by Silent-J; translation licensed under CC BY-SA 4.0.

python-3.x pandas nlp nltk text-chunking


2 Answers


Community wiki

Posted on 2022-11-17

✓ Accepted

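Take a look at "Why is my NLTK function slow when processing the DataFrame?". If you do not need the intermediate steps, there is no need to iterate over all the rows multiple times.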
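The point about iterating once can be sketched as a single function that runs every step for one text, so the corpus is traversed once and no corpus-sized intermediate lists are built. The function name extract_phrases and the toy tokenizer, tagger and chunker below are assumptions made so the sketch runs without NLTK data; in practice the NLTK tokenizer, tagger and a chunker's parse method would be passed in instead:

```python
def extract_phrases(text, tokenize, tag, chunk):
    """Run tokenize -> tag -> chunk for ONE text, keeping no intermediate lists."""
    return chunk(tag(tokenize(text)))

# Toy stand-ins (assumptions, not NLTK): capitalised words are tagged NNP,
# everything else O, and the "chunker" simply keeps the NNP words.
def toy_tag(tokens):
    return [(tok, "NNP" if tok[:1].isupper() else "O") for tok in tokens]

def toy_chunk(tagged):
    return [word for word, tag in tagged if tag == "NNP"]

texts = ["we flew to New York with Bruce Wayne", "nothing here"]

# A single traversal of the corpus: each text goes through all steps at once.
results = [extract_phrases(t, str.split, toy_tag, toy_chunk) for t in texts]

print(results)  # [['New', 'York', 'Bruce', 'Wayne'], []]
```

The design choice is only about where the loop sits: one loop over texts with the steps composed inside, rather than one loop over the whole corpus per step.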

Using ne_chunk:

[Code]:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, tag and chunk the text in one pass
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if isinstance(subtree, Tree):
            # Adjacent chunk subtrees are merged into a single phrase
            current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            # A token outside any chunk ends the current phrase
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the phrase was a duplicate
            current_chunk = []

    # Flush a phrase that runs to the very end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks(sent))

[Out]:

0                   [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object

To use a custom RegexpParser:

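from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser (raw string keeps \w intact in the tag pattern)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, tag and chunk the text in one pass
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if isinstance(subtree, Tree):
            # Adjacent chunk subtrees are merged into a single phrase
            current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            # A token outside any chunk ends the current phrase
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the phrase was a duplicate
            current_chunk = []

    # Flush a phrase that runs to the very end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

# Pass the custom chunker's parse method in place of ne_chunk
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))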

[Out]:

0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
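One detail of this grammar is easy to miss. NLTK's documentation for tag patterns states that "." is equivalent to [^{}<>], so the ".*" in the middle of the pattern cannot span whole tags and in effect matches nothing. That is consistent with the output above, where "foo" is not joined to "bar sentence" across the comma. The sketch below is a simplified model using Python's re module on a string of tags, not NLTK's actual implementation, and the tags are written by hand as an assumption about what a tagger would return for the first sentence:

```python
import re

# Tags for "This is a foo, bar sentence with New York city." written by hand
# (an assumption; a real tagger may label some tokens differently).
tags = ["DT", "VBZ", "DT", "NN", ",", "NN", "NN", "IN", "NNP", "NNP", "NN", "."]
tag_string = "".join(f"<{t}>" for t in tags)

# The grammar's tag pattern with "." rewritten as [^{}<>], following the
# rule NLTK documents for tag patterns. This only models the matching.
pattern = re.compile(r"(?:<V\w+>|<NN\w?>)+[^{}<>]*<NN\w?>")

matches = [m.group(0) for m in pattern.finditer(tag_string)]

print(matches)  # ['<NN><NN>', '<NNP><NNP><NN>']
```

The two matches line up with "bar sentence" and "New York city". Under this reading the pattern needs at least two adjacent tokens, which is why a lone noun such as "foo" is never returned.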

Original post by alvas; translation licensed under CC BY-SA 3.0.

Community wiki

Posted on 2022-11-17

The approach above did not give me the results I needed. Here is the function I would suggest instead:

from nltk import word_tokenize, pos_tag

def get_noun_phrases(text):
    # Group runs of consecutive noun tokens (tags starting with NN) into phrases
    phrases = []
    current = []
    for word, tag in pos_tag(word_tokenize(text)):
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            # A non-noun token ends the current run
            phrases.append(" ".join(current))
            current = []
    # Flush a run that reaches the end of the text
    if current:
        phrases.append(" ".join(current))
    return phrases
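The same idea, grouping runs of consecutive noun tags, can also be written with itertools.groupby. This is a minimal sketch that takes already-tagged tokens so it runs without NLTK data; the helper name noun_runs and the hand-written tags are assumptions for illustration:

```python
from itertools import groupby

def noun_runs(tagged):
    """Join each run of consecutive (word, tag) pairs whose tag starts with NN."""
    return [
        " ".join(word for word, _ in run)
        for is_noun, run in groupby(tagged, key=lambda pair: pair[1].startswith("NN"))
        if is_noun
    ]

# Hand-written tags (an assumption, not real tagger output).
tagged = [
    ("Another", "DT"), ("bar", "NN"), ("foo", "NN"), ("Washington", "NNP"),
    ("DC", "NNP"), ("thingy", "NN"), ("with", "IN"), ("Bruce", "NNP"),
    ("Wayne", "NNP"), (".", "."),
]

phrases = noun_runs(tagged)
print(phrases)  # ['bar foo Washington DC thingy', 'Bruce Wayne']
```

Unlike the accepted answer's grammar, this keeps single nouns as phrases too, since a run of length one still counts.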

Original post by Saurabh Yadav; translation licensed under CC BY-SA 4.0.
