The Story of Python and OCR

Python OCR

There are two or three options.

PyTesser

http://code.google.com/p/pytesser/

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.
Dependencies

PIL is required to work with images in memory. PyTesser has been tested with Python 2.4 in Windows XP.
Usage Example

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

(more examples in README)
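The description above says PyTesser calls the Tesseract executable as an external program. The following is a minimal Python 3 sketch of that idea, not PyTesser's actual code. It assumes a `tesseract` executable on PATH that accepts `stdout` as the output base (recent Tesseract releases do); the function names `build_tesseract_command` and `image_file_to_text` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", executable="tesseract"):
    """Build the argument list for one Tesseract command-line run."""
    # Assumption: passing "stdout" as the output base makes the CLI print
    # the recognized text instead of writing a .txt file.
    return [executable, image_path, "stdout", "-l", lang]


def image_file_to_text(image_path, lang="eng"):
    """Run the tesseract executable on an image file and return its text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Only the command is built here, so this line runs without Tesseract installed
print(build_tesseract_command("fnord.tif"))
```

Calling `image_file_to_text('fnord.tif')` would then play the role of `image_file_to_string` in the usage example, provided Tesseract is installed.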

Python-tesseract

http://code.google.com/p/python-tesseract/
Python Wrapper Class for Tesseract
(Linux & Mac OS X & Windows)

Python-tesseract is a wrapper class for Tesseract OCR that allows any conventional image file (JPG, GIF, PNG, TIFF, etc.) to be read and decoded into readable text. No temporary file is created during OCR processing.

A Windows version compiled with VS2008 is now available. Remember to:
1. Add Python to PATH, e.g. PATH=%PATH%;C:\PYTHON27
2. Set c:\python27\python.exe to run in Windows 7 compatibility mode, even if you are already using Windows 7. Otherwise the program might crash at runtime.
3. Download and install all of these: python-tesseract-win32, python-opencv, numpy
4. Unzip the sample code and keep your fingers crossed.
5. Run: python -u test.py
It is always safer to run Python in unbuffered mode, especially on Windows XP.
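Before running test.py, it may help to confirm that the three packages from step 3 can actually be imported. The following Python 3 sketch does only that. The helper name `missing_modules` is hypothetical, and the module names `tesseract`, `cv2` and `numpy` are taken from the import statements in the examples that follow.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names as imported by the python-tesseract examples
required = ["tesseract", "cv2", "numpy"]
missing = missing_modules(required)
if missing:
    print("Not installed: %s" % ", ".join(missing))
else:
    print("All required modules are importable")
```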

Example 1:

import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)

mImgFile = "eurotext.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result

Example 2:

import cv2.cv as cv
import tesseract

api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

image=cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image,api)
text=api.GetUTF8Text()
conf=api.MeanTextConf()
print text

Example 3:

import tesseract
import cv2
import cv2.cv as cv

image0=cv2.imread("p.bmp")
#### you may need to thicken the border in order to make tesseract feel happy to ocr your image #####
offset=20
height,width,channel = image0.shape
image1=cv2.copyMakeBorder(image0,offset,offset,offset,offset,cv2.BORDER_CONSTANT,value=(255,255,255)) 
#cv2.namedWindow("Test")
#cv2.imshow("Test", image1)
#cv2.waitKey(0)
#cv2.destroyWindow("Test")
#####################################################################################################
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
height1,width1,channel1=image1.shape
print image1.shape
print image1.dtype.itemsize
width_step = width1*channel1*image1.dtype.itemsize  # bytes per row of the padded image
print width_step
#method 1 
iplimage = cv.CreateImageHeader((width1,height1), cv.IPL_DEPTH_8U, channel1)
cv.SetData(iplimage, image1.tostring(),image1.dtype.itemsize * channel1 * (width1))
tesseract.SetCvImage(iplimage,api)

text=api.GetUTF8Text()
conf=api.MeanTextConf()
image=None
print "..............."
print "Ocred Text: %s"%text
print "Confidence Level: %d %%"%conf

#method 2
cvmat_image=cv.fromarray(image1)
iplimage =cv.GetImage(cvmat_image)
print iplimage

tesseract.SetCvImage(iplimage,api)
#api.SetImage(m_any,width,height,channel1)
text=api.GetUTF8Text()
conf=api.MeanTextConf()
image=None
print "..............."
print "Ocred Text: %s"%text
print "Confidence Level: %d %%"%conf
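Example 3 pads the image with a white border of `offset` pixels on every side, then passes the number of bytes per row to cv.SetData. The arithmetic behind both steps can be sketched in plain Python 3 as follows. The helper names `padded_shape` and `row_stride` are hypothetical, and the sketch assumes tightly packed rows with no alignment padding.

```python
def padded_shape(height, width, offset):
    """Height and width after adding `offset` pixels to all four sides."""
    return (height + 2 * offset, width + 2 * offset)


def row_stride(width, channels, itemsize=1):
    """Bytes per image row, assuming rows are packed without padding."""
    return width * channels * itemsize


# A 100x200 BGR image of 8-bit pixels, padded by 20 pixels per side
height1, width1 = padded_shape(100, 200, 20)
print(height1, width1)            # 140 240
print(row_stride(width1, 3, 1))   # 720
```

The stride has to be computed from the padded width (`width1` in Example 3), not from the width of the original image.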


pyocr

https://github.com/jflesch/pyocr
Pyocr is an optical character recognition (OCR) tool wrapper for Python. That is, it helps you use OCR tools from a Python program.

It has been tested only on GNU/Linux systems. It should also work on similar systems (*BSD, etc.). It doesn't work on Windows, Mac OS X, etc.

Pyocr can be used as a wrapper for Google's Tesseract-OCR or Cuneiform. It can read all image types supported by Pillow, including JPEG, PNG, GIF, BMP, TIFF, and others. It also supports bounding box data.
In short, Pyocr is a simple Python wrapper for OCR engines such as Tesseract and Cuneiform. It supports Python 2.7 and 3.x and requires Pillow.
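from PIL import Image
import sys

import pyocr
import pyocr.builders

# Use the first OCR tool that pyocr finds on this system
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
print("Using '%s'" % (tools[0].get_name()))
# TextBuilder makes image_to_string return the recognized text as a string
text = tools[0].image_to_string(Image.open('test.png'), lang='fra',
                                builder=pyocr.builders.TextBuilder())
print(text)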
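The description above mentions bounding box data. A hedged sketch of how that might be used follows. It assumes pyocr's `pyocr.builders.WordBoxBuilder`, whose result is a list of boxes with `content` and `position` attributes. The helper names `format_word_boxes` and `ocr_word_boxes` and the `FakeBox` stand-in are hypothetical. pyocr is imported inside the function so that the formatting helper can be tried without an OCR tool installed.

```python
from collections import namedtuple


def format_word_boxes(boxes):
    """Turn box objects into 'word: (x1, y1)-(x2, y2)' strings."""
    lines = []
    for box in boxes:
        (x1, y1), (x2, y2) = box.position
        lines.append("%s: (%d, %d)-(%d, %d)" % (box.content, x1, y1, x2, y2))
    return lines


def ocr_word_boxes(image_path, lang="eng"):
    """Run the first available pyocr tool and return formatted word boxes."""
    # Imported lazily so format_word_boxes works without pyocr installed
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found")
    boxes = tools[0].image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.WordBoxBuilder(),
    )
    return format_word_boxes(boxes)


# Stand-in for a pyocr box, used only to demonstrate the formatting helper
FakeBox = namedtuple("FakeBox", ["content", "position"])
print(format_word_boxes([FakeBox("fnord", ((10, 20), (110, 45)))]))
```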

Installing pyocr

Shell history from the successful installation:
wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
ls
cd pyocr-master/
;s    # typo for ls
ls
tar zxvf setuptools-0.6c11.tar.gz
cd setuptools-0.6c11
python setup.py build
python setup.py install
ls
wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
tar zxvf setuptools-0.6c11.tar.gz
cd setuptools-0.6c11
python setup.py build
python setup.py install
sudo python setup.py install
cd ..
ls
rem *.gz    # typo for rm
rm *.gz
ls
cd ..
ls
rm *.gx    # typo for *.gz
rm *.gz
rm *.1
ls
pip install Pillow
sudo apt-get install python-pip
sudo apt-get install python-dev python-setuptools
sudo apt-get install libtiff4-dev libjpeg8-dev zlib1g-dev libfreetype6-dev liblcms2-dev libwebp-dev tcl8.5-dev tk8.5-dev
ls
sudo python ./setup.py install


jhfnetboy
