How do I do text preprocessing with spaCy?

I'm new here, so please bear with me.

How do I perform preprocessing steps such as stop-word removal, punctuation removal, stemming, and lemmatization with spaCy in Python?

I have text data (paragraphs and sentences) in a CSV file, and I want to clean it up.

Please show an example that loads the CSV into a pandas DataFrame.

Originally posted by RVK; translation licensed under CC BY-SA 4.0.

2 Answers

This might help:

import spacy
from nltk.corpus import stopwords  # requires nltk.download('stopwords') the first time

# In spaCy v3+, load the model by its full name, e.g. spacy.load("en_core_web_sm", ...)
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
stops = stopwords.words("english")

def normalize(comment, lowercase, remove_stopwords):
    if lowercase:
        comment = comment.lower()
    comment = nlp(comment)
    lemmatized = list()
    for word in comment:
        lemma = word.lemma_.strip()
        if lemma:
            if not remove_stopwords or (remove_stopwords and lemma not in stops):
                lemmatized.append(lemma)
    return " ".join(lemmatized)

Data['Text_After_Clean'] = Data['Text'].apply(normalize, lowercase=True, remove_stopwords=True)
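
The snippet above assumes a DataFrame named Data with a Text column already exists. Since the question asks for the CSV to be loaded into pandas, that step might look like the sketch below (the file name and column name are placeholders, not from the original post):

import pandas as pd

# Hypothetical file/column names; adjust them to match the actual CSV.
Data = pd.read_csv("my_text_data.csv")
Data['Text'] = Data['Text'].astype(str)  # make sure the column is string-typed before cleaning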

Originally posted by RVK; translation licensed under CC BY-SA 4.0.

The best pipeline I have come across so far is from Maksym Balatsko's Medium article "Text preprocessing steps and universal reusable pipeline". The best part is that it can be used as part of a scikit-learn transformer Pipeline and supports multiprocessing.

I modified Maksym's code, kept the dependencies to a minimum, and used generators instead of lists to avoid loading all the data into memory:

import numpy as np
import pandas as pd  # needed for pd.concat when the partitions are recombined
import multiprocessing as mp

import string
import spacy
from sklearn.base import TransformerMixin, BaseEstimator

nlp = spacy.load("en_core_web_sm")

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self,
                 nlp = nlp,
                 n_jobs=1):
        """
        Text preprocessing transformer includes steps:
            1. Punctuation removal
            2. Stop words removal
            3. Lemmatization

        nlp  - spacy model
        n_jobs - parallel jobs to run
        """
        self.nlp = nlp
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()

        partitions = 1
        cores = mp.cpu_count()
        if self.n_jobs <= -1:
            # use every available core
            partitions = cores
        elif self.n_jobs <= 0:
            # n_jobs == 0: no multiprocessing, just a plain pandas apply
            return X_copy.apply(self._preprocess_text)
        else:
            partitions = min(self.n_jobs, cores)

        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        data = pd.concat(pool.map(self._preprocess_part, data_split))
        pool.close()
        pool.join()

        return data

    def _preprocess_part(self, part):
        return part.apply(self._preprocess_text)

    def _preprocess_text(self, text):
        doc = self.nlp(text)
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        return self._lemmatize(removed_stop_words)

    def _remove_punct(self, doc):
        return (t for t in doc if t.text not in string.punctuation)

    def _remove_stop_words(self, doc):
        return (t for t in doc if not t.is_stop)

    def _lemmatize(self, doc):
        return ' '.join(t.lemma_ for t in doc)
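
Before wiring the transformer into a pipeline, a quick standalone check shows what it produces. This is a minimal sketch with a made-up sample sentence, and the exact lemmas depend on the spaCy model version; note that n_jobs=0 takes the plain .apply branch in transform(), so no worker processes are spawned for a tiny sample:

import pandas as pd

sample = pd.Series(["The striped bats were hanging on their feet."])
print(TextPreprocessor(n_jobs=0).transform(sample).tolist())
# roughly: ['striped bat hang foot'] (output varies with the model version)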

You can then use it as part of a full pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline

# ... assuming data split X_train, X_test ...

clf = Pipeline(steps=[
        ('normalize', TextPreprocessor(n_jobs=-1)),
        ('features', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ('classifier', LogisticRegressionCV(cv=5, solver='saga', scoring='accuracy', n_jobs=-1, verbose=1))
    ])

clf.fit(X_train, y_train)
clf.predict(X_test)

X_train is the data that goes through TextPreprocessor; features are then extracted from the cleaned text and passed to the classifier.
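
To see the cleaned text that the vectorizer actually receives, the steps of the fitted pipeline can be run individually. A minimal sketch, assuming X_train is a pandas Series and clf has been fitted as above:

# Run only the text-cleaning step on a couple of training examples.
cleaned = clf.named_steps['normalize'].transform(X_train[:2])
print(cleaned)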

Originally posted by Prayson W. Daniel; translation licensed under CC BY-SA 4.0.
