Industrial-strength Natural Language Processing (NLP) in Python
Industrial-strength NLP

Preface

  • Official documentation: https://spacy.io/usage/spacy-101
  • spaCy on GitHub: https://github.com/explosion/spaCy
  • Environment used in this article

    Windows 10
    Python 3.8.10
    spaCy 3.4.2
  • spaCy ships with many pipeline components; when you don't need all of them, you can exclude some
  • Install a model

    # small (46 MB)
    python -m spacy download zh_core_web_sm
    # medium (74 MB)
    python -m spacy download zh_core_web_md
    # large (574 MB)
    python -m spacy download zh_core_web_lg
  • Models can also be downloaded directly from GitHub: https://github.com/explosion/spacy-models/releases/
  • Add an offline whl file with poetry

    poetry add ./zh_core_web_md-3.7.0-py3-none-any.whl
    # the resulting entry in pyproject.toml looks like
    zh-core-web-md = {path = "zh_core_web_md-3.7.0-py3-none-any.whl"}
    # when poetry exports to requirements.txt, the entry looks like
    zh-core-web-md @ file:///mnt/d/.../esapi/zh_core_web_md-3.7.0-py3-none-any.whl ; python_full_version >= "3.11.2" and python_full_version < "3.12.0"
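
spaCy models are installed as ordinary Python packages, which is why the wheel above can be added with poetry like any other dependency. That also means you can check for an installed model with the standard library alone. The following is a minimal sketch; the helper name `model_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec only locates the package; it does not import or load the model.
    return importlib.util.find_spec(package_name) is not None


for name in ("zh_core_web_sm", "zh_core_web_md", "zh_core_web_lg"):
    status = "installed" if model_installed(name) else "not installed"
    print(f"{name}: {status}")
```

This only confirms that the package is present in the current environment; it does not verify that the model version is compatible with the installed spaCy version.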

Methods

  • List of spaCy's built-in components: https://spacy.io/usage/processing-pipelines#built-in
  • View the default components

    >>> import spacy
    >>> spaNLP = spacy.load("zh_core_web_sm")
    >>> spaNLP.pipe_names
    ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner']
  • For example, if you only need part-of-speech tagging, you can exclude the other components

    >>> spaNLP = spacy.load("zh_core_web_md", exclude=['parser', 'ner'])
    >>> spaNLP.pipe_names
    ['tok2vec', 'tagger', 'attribute_ruler']
    >>> doc = spaNLP('调查显示:PDA功能华而不实')
    >>> for token in doc:
    ...     print(token.pos_, token.text)
    ...
    NOUN 调查
    VERB 显示
    PUNCT :
    NOUN PDA
    NOUN 功能
    VERB 华而不实
  • Instead of excluding components, you can list the ones to include

    >>> spaNLP = spacy.load("zh_core_web_sm", config={'nlp.pipeline': ['tok2vec', 'tagger', 'attribute_ruler']})
    >>> spaNLP.pipe_names
    ['tok2vec', 'tagger', 'attribute_ruler']
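
The two approaches above can be connected: given the full pipeline and the components you want to keep, the `exclude` list follows mechanically. The sketch below is pure Python and does not call spaCy. The helper `pipes_to_exclude` and the constant `DEFAULT_PIPES` are hypothetical names introduced for illustration, and the pipeline list is the one printed for zh_core_web_sm earlier in this article.

```python
from typing import List, Sequence


def pipes_to_exclude(all_pipes: Sequence[str], keep: Sequence[str]) -> List[str]:
    """Return the components in all_pipes that are not in keep, preserving order."""
    unknown = [name for name in keep if name not in all_pipes]
    if unknown:
        # Fail early on typos instead of silently keeping nothing.
        raise ValueError(f"unknown components: {unknown}")
    return [name for name in all_pipes if name not in keep]


# Default pipeline of zh_core_web_sm, as shown by pipe_names above.
DEFAULT_PIPES = ["tok2vec", "tagger", "parser", "attribute_ruler", "ner"]

exclude = pipes_to_exclude(DEFAULT_PIPES, keep=["tok2vec", "tagger", "attribute_ruler"])
print(exclude)  # ['parser', 'ner']

# With spaCy and the model installed, the result could then be passed on:
# nlp = spacy.load("zh_core_web_sm", exclude=exclude)
```

Computing the list this way keeps the "include" intent readable in your own code while still using the `exclude` argument shown earlier.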
This article is from qbit snap
