
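I want to use PDFMiner to extract all text boxes, together with their coordinates, from a PDF file.

Many other Stack Overflow posts cover how to extract all the text in an ordered fashion, but how do I do the intermediate step of getting the text and the text locations?

Given a PDF file, the output should look something like this: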

489, 41,  "Signature"
500, 52,  "b"
630, 202, "a_g_i_r"

Original post by pnj; translation licensed under CC BY-SA 4.0.

python pdf pdfminer


2 answers


Community wiki

Posted on
2023-01-04

✓ Accepted

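Full disclosure: I am one of the maintainers of pdfminer.six, the community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs for extracting text and information from a PDF. For extracting information programmatically, I would advise using extract_pages(). It lets you inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a pythonic way of showing all the elements in the hierarchy. It uses simple1.pdf from the samples directory of pdfminer.six.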

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages

def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)

def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__

def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''

def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''

path = Path('~/Downloads/simple1.pdf').expanduser()

pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy, the bounding box of each, and the text that each element contains.

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator
  LTPage                       0   0   612 792
    LTTextBoxHorizontal        100 695 161 719  Hello
      LTTextLineHorizontal     100 695 161 719  Hello
        LTChar                 100 695 117 719  H
        LTChar                 117 695 131 719  e
        LTChar                 131 695 136 719  l
        LTChar                 136 695 141 719  l
        LTChar                 141 695 155 719  o
        LTChar                 155 695 161 719
        LTAnno
    LTTextBoxHorizontal        261 695 324 719  World
      LTTextLineHorizontal     261 695 324 719  World
        LTChar                 261 695 284 719  W
        LTChar                 284 695 297 719  o
        LTChar                 297 695 305 719  r
        LTChar                 305 695 311 719  l
        LTChar                 311 695 324 719  d
        LTAnno
    LTTextBoxHorizontal        100 595 161 619  Hello
      LTTextLineHorizontal     100 595 161 619  Hello
        LTChar                 100 595 117 619  H
        LTChar                 117 595 131 619  e
        LTChar                 131 595 136 619  l
        LTChar                 136 595 141 619  l
        LTChar                 141 595 155 619  o
        LTChar                 155 595 161 619
        LTAnno
    LTTextBoxHorizontal        261 595 324 619  World
      LTTextLineHorizontal     261 595 324 619  World
        LTChar                 261 595 284 619  W
        LTChar                 284 595 297 619  o
        LTChar                 297 595 305 619  r
        LTChar                 305 595 311 619  l
        LTChar                 311 595 324 619  d
        LTAnno
    LTTextBoxHorizontal        100 495 211 519  H e l l o
      LTTextLineHorizontal     100 495 211 519  H e l l o
        LTChar                 100 495 117 519  H
        LTAnno
        LTChar                 127 495 141 519  e
        LTAnno
        LTChar                 151 495 156 519  l
        LTAnno
        LTChar                 166 495 171 519  l
        LTAnno
        LTChar                 181 495 195 519  o
        LTAnno
        LTChar                 205 495 211 519
        LTAnno
    LTTextBoxHorizontal        321 495 424 519  W o r l d
      LTTextLineHorizontal     321 495 424 519  W o r l d
        LTChar                 321 495 344 519  W
        LTAnno
        LTChar                 354 495 367 519  o
        LTAnno
        LTChar                 377 495 385 519  r
        LTAnno
        LTChar                 395 495 401 519  l
        LTAnno
        LTChar                 411 495 424 519  d
        LTAnno
    LTTextBoxHorizontal        100 395 211 419  H e l l o
      LTTextLineHorizontal     100 395 211 419  H e l l o
        LTChar                 100 395 117 419  H
        LTAnno
        LTChar                 127 395 141 419  e
        LTAnno
        LTChar                 151 395 156 419  l
        LTAnno
        LTChar                 166 395 171 419  l
        LTAnno
        LTChar                 181 395 195 419  o
        LTAnno
        LTChar                 205 395 211 419
        LTAnno
    LTTextBoxHorizontal        321 395 424 419  W o r l d
      LTTextLineHorizontal     321 395 424 419  W o r l d
        LTChar                 321 395 344 419  W
        LTAnno
        LTChar                 354 395 367 419  o
        LTAnno
        LTChar                 377 395 385 419  r
        LTAnno
        LTChar                 395 395 401 419  l
        LTAnno
        LTChar                 410 395 424 419  d
        LTAnno

(Similar answers here, here, and here; I will try to keep them in sync.)
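
To get closer to the "x, y, text" output asked for in the question, the same hierarchy can be flattened to one line per top-level text element. This is only a sketch: iter_text_positions and print_text_positions are hypothetical helper names, and the hasattr() test mirrors get_optional_text() above rather than checking for a specific pdfminer.six class.

```python
from typing import Any, Iterable, Iterator, Tuple


def iter_text_positions(pages: Iterable[Any]) -> Iterator[Tuple[float, float, str]]:
    """Yield (x0, y0, text) for each top-level text element on each page."""
    for page in pages:
        for element in page:
            # Same duck-typing test as get_optional_text() above
            if hasattr(element, 'get_text') and hasattr(element, 'bbox'):
                x0, y0, _x1, _y1 = element.bbox
                yield x0, y0, element.get_text().strip()


def print_text_positions(pdf_path: str) -> None:
    """Print one 'x, y, text' line per top-level text element in the PDF."""
    # Imported here so the helper above can be exercised without pdfminer.six
    from pdfminer.high_level import extract_pages

    for x, y, text in iter_text_positions(extract_pages(pdf_path)):
        print(f'{x:.0f}, {y:.0f}, {text!r}')
```

Calling print_text_positions() with the path to simple1.pdf should list each text box once. Note that the coordinates are the bottom-left corner of each box, measured from the bottom of the page.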

Original post by Pieter; translation licensed under CC BY-SA 4.0.

Community wiki

Posted on
2023-01-04

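Here is a copy-and-paste-ready example that lists the top-left corner of every block of text in a PDF. I think it should work for any PDF that does not include "Form XObjects" with text in them: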

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
laparams = LAParams()  # default layout-analysis parameters
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Keep the file open while the pages are read from it
with open('yourpdf.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        print('Processing next page...')
        interpreter.process_page(page)
        layout = device.get_result()  # the LTPage for this page
        for lobj in layout:
            if isinstance(lobj, LTTextBox):
                # bbox is (x0, y0, x1, y1); x0 and y1 give the top-left corner
                x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                print('At %r is text: %s' % ((x, y), text))

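The code above is based on the Performing Layout Analysis example in the PDFMiner docs, plus the examples by pnj ( https://stackoverflow.com/a/22898159/1709587 ) and Matt Swain ( https://stackoverflow.com/a/25262470/1709587 ). I have made a couple of changes from those earlier examples:

I use PDFPage.get_pages(), which is shorthand for creating a document, checking that it is_extractable, and passing it to PDFPage.create_pages().
I don't bother handling LTFigures, since PDFMiner is currently incapable of cleanly handling text inside them anyway.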

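LAParams lets you set some parameters that control how individual characters in the PDF get magically grouped into lines and text boxes by PDFMiner. If you are surprised that such grouping needs to happen at all, the pdf2txt docs justify it:

In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks.

The parameters of LAParams are, like most of PDFMiner, undocumented, but you can see them in the source code or by calling help(LAParams) in your Python shell. The meaning of some of the parameters is given at https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py because they can also be passed as arguments to pdf2txt on the command line.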
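
As a rough illustration of tuning the grouping, the sketch below builds an LAParams from keyword overrides. char_margin and line_margin are real LAParams keyword arguments, but the values are arbitrary and only for illustration, and LAYOUT_OVERRIDES and build_laparams are hypothetical names, not part of PDFMiner.

```python
from typing import Any, Dict

# Illustrative values only; they are not recommendations from this answer.
LAYOUT_OVERRIDES: Dict[str, Any] = {
    'char_margin': 1.0,  # characters closer together than this join the same line
    'line_margin': 0.3,  # lines closer together than this join the same text box
}


def build_laparams(overrides: Dict[str, Any]):
    """Return an LAParams with the given keyword overrides applied."""
    # Imported lazily so this module loads even where pdfminer is not installed
    from pdfminer.layout import LAParams

    return LAParams(**overrides)
```

The result can be passed wherever the example above passes laparams, for instance PDFPageAggregator(rsrcmgr, laparams=build_laparams(LAYOUT_OVERRIDES)). Smaller margins generally split the text into more, smaller boxes.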

The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types...

LTTextBox
LTFigure
LTImage
LTLine
LTRect

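...or a subclass of one of them. (In particular, your text boxes will probably all be LTTextBoxHorizontals.)

This image from the docs shows more detail of the structure of an LTPage:

[Image: tree diagram of the structure of an LTPage. Of relevance to this answer: it shows that an LTPage contains the 5 types listed above, that an LTTextBox contains LTTextLines plus unspecified other things, and that an LTTextLine contains LTChars, LTAnnos, LTTexts, and unspecified other things.]

Each of the types above has a .bbox property holding an (x0, y0, x1, y1) tuple with the coordinates of the left, bottom, right, and top of the object, respectively. The y-coordinates are given as the distance from the bottom of the page. If it is more convenient to work with a y-axis that runs from top to bottom, you can subtract them from the height of the page's .mediabox: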

x0, y0_orig, x1, y1_orig = some_lobj.bbox
y0 = page.mediabox[3] - y1_orig
y1 = page.mediabox[3] - y0_orig
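
The same flip can be wrapped in a small helper. This is a sketch under one assumption: the media box starts at y = 0, so that page.mediabox[3] equals the page height. flip_bbox is a hypothetical name, not a PDFMiner function.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def flip_bbox(bbox: BBox, page_height: float) -> BBox:
    """Convert a bottom-origin (x0, y0, x1, y1) box to top-origin coordinates."""
    x0, y0_orig, x1, y1_orig = bbox
    # The old top edge becomes the new (smaller) y0; the old bottom becomes y1
    return (x0, page_height - y1_orig, x1, page_height - y0_orig)
```

For the first "Hello" box in simple1.pdf, with bbox (100, 695, 161, 719) on a page 792 units high, flip_bbox returns (100, 73, 161, 97).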

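In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is, through its LTTextLines, a collection of LTChars (characters explicitly drawn by the PDF, with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content because the characters are drawn a long way apart; these have no bbox).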
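
To make the LTChar/LTAnno distinction concrete, the following sketch walks one text box and keeps only the items that carry a bbox. It assumes the box iterates over lines and each line iterates over characters, as in the tree described above; iter_char_boxes is a hypothetical helper name.

```python
from typing import Any, Iterator, Tuple


def iter_char_boxes(text_box: Any) -> Iterator[Tuple[str, tuple]]:
    """Yield (character, bbox) for every drawn character in a text box."""
    for line in text_box:      # LTTextLine objects
        for item in line:      # LTChar or LTAnno objects
            # LTAnno items are inserted by layout analysis and have no bbox
            if hasattr(item, 'bbox'):
                yield item.get_text(), item.bbox
```

Joining the yielded characters gives the text as drawn, without the spaces and newlines that PDFMiner inserted.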

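The code example at the start of this answer combines these two properties to show the coordinates of each block of text.

Finally, it is worth noting that, unlike the other Stack Overflow answers cited above, I don't bother recursing into LTFigures. Although LTFigures can contain text, PDFMiner does not seem able to group that text into LTTextBoxes (you can try it yourself on the example PDF from https://stackoverflow.com/a/27104504/1709587 ) and instead produces an LTFigure that directly contains LTChar objects. In principle you could figure out how to piece these together into a string, but PDFMiner (as of version 20181108) cannot do it for you.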
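
If you do need that text, one crude option is to collect the LTChar objects yourself. The sketch below recurses into any iterable layout object and concatenates the characters in traversal order. It is not layout analysis: it adds no spaces or line breaks, and the traversal order may not match reading order. iter_leaf_chars and naive_figure_text are hypothetical names.

```python
from collections.abc import Iterable
from typing import Any, Iterator


def iter_leaf_chars(obj: Any) -> Iterator[Any]:
    """Yield every non-container descendant that has both text and a bbox."""
    if isinstance(obj, Iterable):
        # Containers such as LTFigure are iterable; descend into them
        for child in obj:
            yield from iter_leaf_chars(child)
    elif hasattr(obj, 'get_text') and hasattr(obj, 'bbox'):
        yield obj


def naive_figure_text(figure: Any) -> str:
    """Concatenate the characters found inside a figure, in traversal order."""
    return ''.join(ch.get_text() for ch in iter_leaf_chars(figure))
```

A more careful version would sort or cluster the characters by their bbox before joining them.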

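Hopefully, though, the PDFs you need to parse don't use Form XObjects with text in them, and so this caveat won't apply to you.

Original post by Mark Amery; translation licensed under CC BY-SA 4.0.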
