How can I search the contents of PDF files?

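A client wants site-wide keyword search, and the contents of PDF documents must be searchable too. My current solution is to convert each PDF to text, write the text to the database, and then search that database field. If a PDF's content is not text, it cannot be converted and so cannot be searched. Is there a better solution?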
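For reference, the current approach looks roughly like the sketch below. It uses an in-memory SQLite database to stay self-contained; the table name `pdf_text`, the helper names, and the sample rows are all made up for illustration, and the real site presumably uses a different database and schema.

```python
import sqlite3


def create_index(conn):
    """Create the table that holds one row of extracted text per PDF."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_text ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def add_document(conn, url, body):
    """Insert or refresh the extracted text for one PDF."""
    conn.execute(
        "INSERT OR REPLACE INTO pdf_text (url, body) VALUES (?, ?)",
        (url, body),
    )


def search(conn, keyword):
    """Return URLs of PDFs whose text contains the keyword as a substring."""
    # Escape LIKE wildcards so that user input is matched literally.
    escaped = (
        keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT url FROM pdf_text WHERE body LIKE ? ESCAPE '\\' ORDER BY url",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


# Hypothetical sample data standing in for text extracted from two PDFs.
conn = sqlite3.connect(":memory:")
create_index(conn)
add_document(conn, "/docs/a.pdf", "Annual report: revenue grew 100% this year")
add_document(conn, "/docs/b.pdf", "Installation guide for the new printer")
print(search(conn, "printer"))
```

A plain `LIKE` scan is the simplest form of "search the database field"; on a larger document set, a full-text index provided by the database would likely be a better fit.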

3 answers

Hmm, what about using tags instead? I'm surprised anyone needs site-wide search over PDFs. Following this question.

# Python: convert a PDF to text with pdfminer
# Written for Python 3 with the pdfminer.six package installed.
# -*- coding: utf-8 -*-
from io import BytesIO, StringIO

import requests
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_txt(url):
    """Download the PDF at `url` and return its extracted text."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # The downloaded PDF is binary data, so wrap it in BytesIO.
    fp = BytesIO(requests.get(url).content)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
print(txt)
# If the PDF contains Chinese and the console output is garbled,
# write the text to a file instead:
# open('pdf.txt', 'w', encoding='utf-8').write(txt)
'''
CHAPTER I
"Well, Prince, so Genoa and Lucca are now just family estates of
theBuonapartes. But I warn you, if you don't tell me that this
means war,if you still try to defend the infamies and horrors
perpetrated bythat Antichrist- I really believe he is Antichrist- I will
havenothing more to do with you and you are no longer my friend,
no longermy 'faithful slave,' as you call yourself! But how do you
do? I seeI have frightened you- sit down and tell me all the news."
It was in July, 1805, and the speaker was the well-known
AnnaPavlovna Scherer, maid of honor and favorite of the
Empress MaryaFedorovna. With these words she greeted Prince
Vasili Kuragin, a manof high rank and importance, who was the
first to arrive at herreception. Anna Pavlovna had had a cough for
some days. She was, asshe said, suffering from la grippe; grippe
being then a new word inSt. Petersburg, used only by the elite.
All her invitations without exception, written in French,
anddelivered by a scarlet-liveried footman that morning, ran as
''' 

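优看PDF can search PDF files directly, and it also has a "super search" mode that supports Boolean (logical) queries. For your requirement, though, converting the PDFs to text and searching the text is the most efficient option, because searching inside the PDF files directly is slow. If a PDF consists of images, you have to run OCR on it before it can be searched.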
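One way to decide which files need the OCR step is to look at what the text conversion returned: an image-only PDF usually yields nothing but whitespace. The sketch below is only a heuristic under that assumption; the function names and the `min_chars` threshold of 20 are invented for illustration, and the OCR step itself is not shown.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a PDF is image-only from the text a converter returned."""
    # Count only non-whitespace characters; scanned PDFs typically yield
    # just blanks, newlines, and form feeds.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def split_by_searchability(texts_by_url, min_chars=20):
    """Separate PDFs with usable text from those to send through OCR."""
    searchable = {}
    ocr_queue = []
    for url, text in texts_by_url.items():
        if needs_ocr(text, min_chars):
            ocr_queue.append(url)
        else:
            searchable[url] = text
    return searchable, ocr_queue


# Hypothetical conversion results: one text PDF and one scanned PDF.
results = {
    "/docs/chapter1.pdf": "CHAPTER I Well, Prince, so Genoa and Lucca are now",
    "/docs/scan.pdf": " \n\x0c ",
}
searchable, ocr_queue = split_by_searchability(results)
print(sorted(searchable), ocr_queue)
```

The threshold needs tuning against real files, since a PDF with only a short caption on top of a scanned page would slip past a low value.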
