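Problem: I originally tried to OCR the text from a PDF of 《伤寒论医案集》, but the page images in that PDF are low-resolution and blurry, so the error rate was high even with Baidu's high-accuracy OCR. I later found the latest edition of the book and bought the e-book on JD (京东). My plan was to copy the case records out one by one and organize them, but after working at it for a long time it was clearly too slow: the book contains more than 500 cases.
Solution 1:
To extract text from an application running on Windows, you can use the uiautomation module. It was written by a Chinese developer in his spare time for his own use and then published on GitHub.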
https://github.com/yinkaishen...
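uiautomation wraps Microsoft's UIAutomation API and supports automating Win32, MFC, WPF, Modern UI (Metro UI), Qt, IE, Firefox (version <= 56 or >= 60; Firefox 57 was the first release built with Rust, and the first few Rust-based releases were found in testing not to be supported), Chrome, and Electron-based applications (Chrome and Electron apps must be started with the --force-renderer-accessibility flag to support UIAutomation).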
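As a small illustration of the startup flag mentioned above, here is a minimal sketch of launching a Chrome or Electron application with accessibility forced on. The helper names `build_launch_command` and `launch` are made up for this example, and the executable path is whatever applies on your machine.

```python
import subprocess


def build_launch_command(exe_path, *extra_args):
    """Return the argument list that starts the app with accessibility forced on."""
    return [exe_path, "--force-renderer-accessibility", *extra_args]


def launch(exe_path, *extra_args):
    """Start the application and return immediately without waiting for it to exit."""
    return subprocess.Popen(build_launch_command(exe_path, *extra_args))
```

Calling `launch(r"C:\path\to\chrome.exe", "https://example.com")` (the path is a placeholder) would start the browser with the flag set, so that its page content should become visible to UIAutomation.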
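However, after a lot of fiddling I still could not get at the content inside the JD reading app (京东读书), possibly because it is not built with any of the technologies listed above. This is the script I ran, followed by its output: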
import sys
import time

import uiautomation as auto


def usage():
    auto.Logger.ColorfullyWrite("""usage
<Color=Cyan>-h</Color>      show command <Color=Cyan>help</Color>
<Color=Cyan>-t</Color>      delay <Color=Cyan>time</Color>, default 3 seconds, begin to enumerate after Value seconds, this must be an integer
        you can delay a few seconds and make a window active so automation can enumerate the active window
<Color=Cyan>-d</Color>      enumerate tree <Color=Cyan>depth</Color>, this must be an integer, if it is null, enumerate the whole tree
<Color=Cyan>-r</Color>      enumerate from <Color=Cyan>root</Color>:Desktop window, if it is null, enumerate from foreground window
<Color=Cyan>-f</Color>      enumerate from <Color=Cyan>focused</Color> control, if it is null, enumerate from foreground window
<Color=Cyan>-c</Color>      enumerate the control under <Color=Cyan>cursor</Color>, if depth is < 0, enumerate from its ancestor up to depth
<Color=Cyan>-a</Color>      show <Color=Cyan>ancestors</Color> of the control under cursor
<Color=Cyan>-n</Color>      show control full <Color=Cyan>name</Color>, if it is null, show first 30 characters of control's name in console,
        always show full name in log file @AutomationLog.txt
<Color=Cyan>-p</Color>      show <Color=Cyan>process id</Color> of controls

if <Color=Red>UnicodeError</Color> or <Color=Red>LookupError</Color> occurred when printing,
try to change the active code page of console window by using <Color=Cyan>chcp</Color> or see the log file <Color=Cyan>@AutomationLog.txt</Color>
chcp, get current active code page
chcp 936, set active code page to gbk
chcp 65001, set active code page to utf-8

examples:
automation.py -t3
automation.py -t3 -r -d1 -n
automation.py -c -t3

""", writeToFile=False)


def main():
    import getopt
    auto.Logger.Write('UIAutomation {} (Python {}.{}.{}, {} bit)\n'.format(
        auto.VERSION, sys.version_info.major, sys.version_info.minor, sys.version_info.micro,
        64 if sys.maxsize > 0xFFFFFFFF else 32))
    options, args = getopt.getopt(sys.argv[1:], 'hrfcanpd:t:',
                                  ['help', 'root', 'focus', 'cursor', 'ancestor', 'showAllName', 'depth=',
                                   'time='])
    root = False
    focus = False
    cursor = False
    ancestor = False
    foreground = False   # never set to True in this version, so the extra header line below is skipped
    showAllName = True   # print full control names
    depth = 4            # how many levels of the control tree to enumerate
    seconds = 3          # delay before enumerating, so the target window can be brought to the front
    showPid = False
    for (o, v) in options:
        # getopt reports long options with two leading dashes, e.g. '--help'
        if o in ('-h', '--help'):
            usage()
            sys.exit(0)
        elif o in ('-r', '--root'):
            root = True
            foreground = False
        elif o in ('-f', '--focus'):
            focus = True
            foreground = False
        elif o in ('-c', '--cursor'):
            cursor = True
            foreground = False
        elif o in ('-a', '--ancestor'):
            ancestor = True
            foreground = False
        elif o in ('-n', '--showAllName'):
            showAllName = True
        elif o in ('-p', ):
            showPid = True
        elif o in ('-d', '--depth'):
            depth = int(v)
        elif o in ('-t', '--time'):
            seconds = int(v)
    if seconds > 0:
        auto.Logger.Write('please wait for {0} seconds\n\n'.format(seconds), writeToFile=False)
        time.sleep(seconds)
    auto.Logger.ColorfullyLog('Starts, Current Cursor Position: <Color=Cyan>{}</Color>'.format(auto.GetCursorPos()))
    control = None
    if root:
        control = auto.GetRootControl()
    if focus:
        control = auto.GetFocusedControl()
    if cursor:
        control = auto.ControlFromCursor()
        if depth < 0:
            # a negative depth means: climb that many ancestors, then enumerate everything below
            while depth < 0 and control:
                control = control.GetParentControl()
                depth += 1
            depth = 0xFFFFFFFF
    if ancestor:
        control = auto.ControlFromCursor()
        if control:
            auto.EnumAndLogControlAncestors(control, showAllName, showPid)
        else:
            auto.Logger.Write('IUIAutomation returns null element under cursor\n', auto.ConsoleColor.Yellow)
    else:
        indent = 0
        if not control:
            # default: climb from the focused control up to the desktop root
            control = auto.GetFocusedControl()
            controlList = []
            while control:
                controlList.insert(0, control)
                control = control.GetParentControl()
            if len(controlList) == 1:
                control = controlList[0]
            else:
                # controlList[0] is the desktop, controlList[1] is the top-level window
                control = controlList[1]
                if foreground:
                    indent = 1
                    auto.LogControl(controlList[0], 0, showAllName, showPid)
        auto.EnumAndLogControl(control, depth, showAllName, showPid, startDepth=indent)
    auto.Logger.Log('Ends\n')


if __name__ == '__main__':
    main()
UIAutomation 2.0.16 (Python 3.6.13, 64 bit)
please wait for 3 seconds
2022-01-28 11:54:57.432 test2.py[77] main -> Starts, Current Cursor Position: (683, 479)
ControlType: WindowControl ClassName: CAnswerWnd AutomationId: Rect: (0,0,1366,728)[1366x728] Name: '京东读书' Handle: 0x51326(332582) Depth: 0 SupportedPattern: LegacyIAccessiblePattern TransformPattern WindowPattern
ControlType: WindowControl ClassName: CReadWnd AutomationId: Rect: (2,92,1364,726)[1362x634] Name: '《伤寒论》方医案集' Handle: 0x40F3C(266044) Depth: 1 SupportedPattern: LegacyIAccessiblePattern WindowPattern
2022-01-28 11:54:57.725 test2.py[112] main -> Ends
Process finished with exit code 0
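With the project's sample code I can read every control of Notepad (notepad.exe) together with its value, but on the JD reading app only the window titles are recognized and the content inside cannot be retrieved. If anyone has managed to extract it with this approach, please leave a comment and let me know.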
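For anyone who wants to try this on another application, here is a minimal sketch of walking a window's control tree and collecting every non-empty name. It only relies on the `Name` and `ControlTypeName` properties and the `GetChildren()` method of uiautomation controls; the function names `collect_names` and `dump_foreground_window` are my own for this example. On the JD reading app I would expect it to return only the two window titles shown in the output above.

```python
def collect_names(control, max_depth=10, depth=0):
    """Depth-first walk returning (depth, control type, name) for every named control."""
    rows = []
    name = control.Name
    if name:
        rows.append((depth, control.ControlTypeName, name))
    if depth < max_depth:
        for child in control.GetChildren():
            rows.extend(collect_names(child, max_depth, depth + 1))
    return rows


def dump_foreground_window(max_depth=10):
    """Print every named control of the foreground window (Windows only)."""
    # Imported lazily because the package only works on Windows.
    import uiautomation as auto
    for depth, type_name, name in collect_names(auto.GetForegroundControl(), max_depth):
        print("  " * depth + f"{type_name}: {name}")
```

Bring the target window to the front and call `dump_foreground_window()`. If the text of the page does not appear in the result, the application most likely does not expose its content through UIAutomation at all.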
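Solution 2:
I did not want to copy everything by hand, so the only option was to try something else.
Reference: https://jingyan.baidu.com/art...
This method converts the e-book into an image-only PDF. An earlier article of mine already covers turning an image PDF into text, so this looked workable.
The idea is to use a screen-capture tool to take a screenshot of the current page and save it as an image, then keep repeating this page after page until the whole book has been captured.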
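The capture loop itself can also be written as a script. The sketch below is an alternative to the Snagit and TinyTask setup, not what I actually used. The function name `capture_book` and the file naming scheme are invented for this example, and the screenshot and page-turn actions are passed in as functions so the loop does not depend on any particular tool.

```python
import time
from pathlib import Path


def capture_book(page_count, grab, turn_page, out_dir, delay=1.0):
    """Save one screenshot per page and return the list of files written.

    grab() must return an object with a save(path) method;
    turn_page() must advance the reader by one page.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in range(1, page_count + 1):
        target = out / f"page_{page:04d}.png"
        grab().save(str(target))
        saved.append(target)
        if page < page_count:
            turn_page()
            time.sleep(delay)  # give the reader time to render the next page
    return saved
```

On Windows, `grab` could be `PIL.ImageGrab.grab` and `turn_page` could be `lambda: pyautogui.press("right")`. Both are assumptions: they require the Pillow and pyautogui packages, and I have not checked that the right-arrow key turns the page in the JD reading app.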
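1. There are many screen-capture tools. I went with SnagIt, the one used in the reference. A portable version is available at: http://www.downxia.com/downin...
After extracting the archive you get the following:
Open it and the interface looks as follows. Select Capture -> Share -> File.
Then choose Properties and configure it as shown below.
Next, set the three options marked by arrows in the screenshot below to match the screenshot. This ensures each capture is saved straight to a file; otherwise it is opened in Snagit Editor instead of being saved.
With these settings in place, take a test capture yourself to check that an image file is saved. Below is the file format the software saves automatically.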
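2. Repeat step 1
Doing this by hand is too slow, so use TinyTask to repeat it. The free portable TinyTask can be downloaded from: https://www.greenxf.com/soft/...
Once opened, the interface is very simple:
Click Rec to start recording what you do on the computer. The button turns red and the elapsed recording time is shown above it.
Then take the screenshot with Snagit and turn the page. When you are done, stop the recording with the default hotkey Ctrl+Shift+Alt+R.
Finally, click Prefs to set the number of repetitions, and then just wait for it to finish.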
3. Recognize the text in the images
This is where my earlier article comes in:
https://segmentfault.com/a/11...
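One detail to watch when feeding the screenshots to OCR is page order: a plain alphabetical sort puts page_10 before page_2. Here is a minimal sketch of processing the images in natural order and joining the recognized text. The names `natural_key` and `images_to_text` are invented for this example, and `ocr` stands for whatever recognition function you use (for instance a wrapper around the Baidu OCR call from the earlier article); it is assumed to take the image bytes and return a string.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'page_2.png' before 'page_10.png'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", Path(path).name)]


def images_to_text(image_dir, ocr, pattern="*.png"):
    """Run ocr(image_bytes) -> str on every image in page order and join the results."""
    pages = sorted(Path(image_dir).glob(pattern), key=natural_key)
    return "\n".join(ocr(p.read_bytes()) for p in pages)
```

The returned string can then be written to a text file and split into individual cases by hand or with a follow-up script.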