How do I read a PDF with Python?

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = requests.get("http://pythonscraping.com/pages/warandpeace/chapter1.pdf").content
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()




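I found this PDF-reading code online, but running it under the debugger raises an error that looks like an encoding error. I have tried changing it several times without success. How should I modify it so that it reads the PDF content correctly? Thanks in advance.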

2 answers
# -*- coding: utf-8 -*-
# Python 2 code: it relies on cStringIO and the print statement.
import requests
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Download the PDF and wrap the raw bytes in a file-like object,
    # because pdfminer reads from a file object, not from a byte string.
    f = requests.get(url).content
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # named text to avoid shadowing the built-in str
    retstr.close()
    return text

print pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
'''
$ python readpdf.py
CHAPTER I

"Well, Prince, so Genoa and Lucca are now just family estates of
the Buonapartes. But I warn you, if you don't tell me that this
means war, if you still try to defend the infamies and horrors
perpetrated by that Antichrist- I really believe he is Antichrist- I will
have nothing more to do with you and you are no longer my friend,
no longer my 'faithful slave,' as you call yourself! But how do you
do? I see I have frightened you- sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known
Anna Pavlovna Scherer, maid of honor and favorite of the
Empress Marya Fedorovna. With these words she greeted Prince
Vasili Kuragin, a man of high rank and importance, who was the
first to arrive at her reception. Anna Pavlovna had had a cough for
some days. She was, as she said, suffering from la grippe; grippe
being then a new word in St. Petersburg, used only by the elite.

All her invitations without exception, written in French,
and delivered by a scarlet-liveried footman that morning, ran as
'''   
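
On Python 3, a rough equivalent might look like the sketch below. It assumes the pdfminer.six fork is installed and uses its high-level extract_text function; the helper names to_file_object and pdf_bytes_to_text are made up for illustration. The key point is the same as in the code above: the bytes returned by requests have to be wrapped in a file-like object (io.BytesIO here) before pdfminer can read them. A bytes object also has no close() method, so only the wrapper gets closed.

```python
from io import BytesIO


def to_file_object(data):
    """Wrap raw PDF bytes in a seekable, file-like object."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    return BytesIO(data)


def pdf_bytes_to_text(data):
    """Extract text from PDF bytes (assumption: pdfminer.six is installed)."""
    # Imported lazily so to_file_object() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_text

    # The with-block closes the BytesIO wrapper once extraction is done.
    with to_file_object(data) as fp:
        return extract_text(fp)


# Usage (needs network access and the third-party requests package):
#   import requests
#   url = "http://pythonscraping.com/pages/warandpeace/chapter1.pdf"
#   print(pdf_bytes_to_text(requests.get(url).content))
```
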
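In Python 2, chr() takes an integer in range(256) (that is, 0 to 255) as its argument and returns the corresponding character.
For example:
>>> chr(65)
'A'

So the error means that the x in your chr(x) call is not an integer in that range.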
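
The 0 to 255 limit is Python 2 behaviour. In Python 3, chr() accepts any Unicode code point from 0 to 0x10FFFF and raises ValueError outside that range. A small sketch of how to see which values chr() rejects on your interpreter; the helper name describe_code_point is hypothetical:

```python
def describe_code_point(x):
    """Return the character for x, or a message explaining why chr() rejected it."""
    try:
        return chr(x)
    except (ValueError, TypeError) as exc:
        # ValueError: integer outside the accepted range.
        # TypeError: the argument is not an integer at all.
        return "chr(%r) failed: %s" % (x, exc)


print(describe_code_point(65))        # a valid code point
print(describe_code_point(-1))        # negative values are always rejected
print(describe_code_point(0x110000))  # one past the largest code point in Python 3
```

Printing the offending value just before the failing chr() call usually shows quickly whether it is negative, too large, or not an integer.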
