"Web Scraping with Python": finding tags in a Word document with BeautifulSoup

I'm working through Chapter 6 of "Web Scraping with Python" and tried running the chapter's code for reading the tags in a Word document:

from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
from bs4 import BeautifulSoup

wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')

wordObj = BeautifulSoup(xml_content.decode('utf-8'), "html.parser")
textStrings = wordObj.findAll("w:t")
for textElem in textStrings:
    print(textElem.text)

The code runs without errors, but it prints nothing.
I printed wordObj.contents:

['xml version="1.0" encoding="UTF-8" standalone="yes"?', '\n', <w:document mc:ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"><w:body><w:p w:rsidp="00764658" w:rsidr="00764658" w:rsidrdefault="00764658"><w:ppr><w:pstyle w:val="Title"></w:pstyle></w:ppr><w:r><w:t>A Word Document on a Website</w:t></w:r><w:bookmarkstart w:id="0" w:name="_GoBack"></w:bookmarkstart><w:bookmarkend w:id="0"></w:bookmarkend></w:p><w:p w:rsidp="00764658" w:rsidr="00764658" w:rsidrdefault="00764658"></w:p><w:p w:rsidp="00764658" w:rsidr="00764658" w:rsidrdefault="00764658" w:rsidrpr="00764658"><w:r><w:t>This is a Word document, full of content that you want very much. 
Unfortunately, it’s difficult to access because I’m putting it on my website as a .</w:t></w:r><w:prooferr w:type="spellStart"></w:prooferr><w:r><w:t>docx</w:t></w:r><w:prooferr w:type="spellEnd"></w:prooferr><w:r><w:t xml:space="preserve"> file, rather than just publishing it as HTML</w:t></w:r></w:p><w:sectpr w:rsidr="00764658" w:rsidrpr="00764658"><w:pgsz w:h="15840" w:w="12240"></w:pgsz><w:pgmar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440"></w:pgmar><w:cols w:space="720"></w:cols><w:docgrid w:linepitch="360"></w:docgrid></w:sectpr></w:body></w:document>]

The <w:t> tags are clearly there, but when I check the list

len(textStrings) == 0

it turns out to be empty. How should I look up the tags in a Word document?

3 answers

Are you on Python 2? In Python 2 the import should be

from urllib import urlopen

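If the author really parses OpenXML as if it were HTML, they must have lost their mind…

As it happens, I've recently been handling Word files by parsing them as XML: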

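import requests
from lxml import etree  # a bare "import lxml" does not load the etree submodule
from zipfile import ZipFile
from io import BytesIO

# Download the .docx file; it is a ZIP archive of XML parts.
word_file = requests.get(
    url="http://pythonscraping.com/pages/AWordDocument.docx").content
document = ZipFile(BytesIO(word_file))
xml = document.read('word/document.xml')
tree = etree.XML(xml)
# Reuse the prefix-to-URI map declared on the root element so "w:" resolves in XPath.
nsmap = tree.nsmap
body = tree.xpath("//w:t", namespaces=nsmap)
for i in body:
    print(i.text)  # text of each run; i.tag would print only the qualified tag name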
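The dump in the question shows a single sentence split across several <w:t> runs, so printing one run per line fragments the text. Below is a minimal sketch of joining the runs of each <w:p> paragraph. It swaps lxml for the standard library's xml.etree.ElementTree so it needs no third-party packages, and it works on a tiny in-memory sample instead of the real download; the names paragraphs_from_docx and make_sample_docx are made up for illustration.

```python
from io import BytesIO
from zipfile import ZipFile
import xml.etree.ElementTree as ET

# Namespace URI bound to the "w" prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx(docx_bytes):
    """Return the text of each non-empty <w:p>, with its <w:t> runs joined."""
    with ZipFile(BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # ElementTree addresses namespaced tags as "{uri}localname".
        text = "".join(run.text or "" for run in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Hypothetical helper: build a minimal .docx-like ZIP in memory."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:r><w:t>A Word Document on a Website</w:t></w:r></w:p>'
        '<w:p></w:p>'
        '<w:p><w:r><w:t>I put it on my website as a .</w:t></w:r>'
        '<w:r><w:t>docx</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> file</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    for line in paragraphs_from_docx(make_sample_docx()):
        print(line)
```

For the real file, pass the downloaded bytes (word_file above) to paragraphs_from_docx. The sketch assumes paragraphs are not nested, so text inside tables or text boxes may need extra handling.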

I ran into the same problem as you, and I've worked out where it goes wrong.

It's a parser problem.

wordObj = BeautifulSoup(xml_content.decode('utf-8'), "html.parser")

Change html.parser in that line to xml (BeautifulSoup's xml parser requires lxml to be installed).
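A minimal sketch of that change, assuming bs4 and lxml are installed. To stay runnable offline it parses a small in-memory stand-in for the document rather than the real download; text_runs and make_sample_docx are hypothetical names used only here.

```python
from io import BytesIO
from zipfile import ZipFile

from bs4 import BeautifulSoup


def text_runs(docx_bytes):
    """Return the text of every <w:t> element in word/document.xml."""
    with ZipFile(BytesIO(docx_bytes)) as archive:
        xml_content = archive.read("word/document.xml")
    # "xml" selects BeautifulSoup's XML parser, which is backed by lxml.
    soup = BeautifulSoup(xml_content, "xml")
    return [elem.get_text() for elem in soup.find_all("w:t")]


def make_sample_docx():
    """Hypothetical helper: build a minimal .docx-like ZIP in memory."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>A Word Document on a Website</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>docx</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    for text in text_runs(make_sample_docx()):
        print(text)
```

With the question's code, the equivalent is to keep everything else unchanged and pass "xml" as the second argument to BeautifulSoup.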
