Coverage for /Users/virantha/dev/ocr/pypdfocr/pypdfocr_tesseract : 91%

#!/usr/bin/env python2.7
# Copyright 2013 Virantha Ekanayake All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
    Run Tesseract to generate hocr file
"""
# Ugly hack to pass in object method to the multiprocessing library
# From http://www.rueckstiess.net/research/snippets/show/ca1d7d90
# Basically gets passed in a pair of (self, arg), and calls the method
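# A minimal sketch of the workaround described above, not this file's actual code.
# On Python 2.7, which this file targets, a bound method cannot be pickled, so it
# cannot be handed to a multiprocessing pool directly. A module-level function that
# receives a (self, arg) pair and calls the method can be pickled by name instead.
# The names unwrap_self, Worker, process_one and make_jobs are hypothetical.

```python
def unwrap_self(arg):
    """Module-level shim: unpack a (self, arg) pair and call the method."""
    instance, filename = arg
    return instance.process_one(filename)


class Worker(object):
    """Hypothetical stand-in for the class that wraps the Tesseract calls."""

    def process_one(self, filename):
        # Placeholder for the real per-file work.
        return filename.upper()

    def make_jobs(self, filenames):
        # Each job is a (self, arg) pair that the shim knows how to unpack.
        return [(self, fn) for fn in filenames]
```

# With a multiprocessing.Pool named pool, the jobs would then be dispatched as
# pool.map(unwrap_self, worker.make_jobs(filenames)).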
"""Class to wrap all the tesseract calls"""
""" Detect windows tesseract location. """
binary = '"%s"' % binary
binary = binary.replace("\\", "\\\\")
else:
# Explicit str here to get around some MagicMock stuff for testing that I don't quite understand
else:
'TS_MISSING': """
    Could not execute %s
    Please make sure you have Tesseract installed correctly
    """ % self.binary,
'TS_VERSION': 'Tesseract version is too old',
'TS_img_MISSING': 'Cannot find specified tiff file',
'TS_FAILED': 'Tesseract-OCR execution failed!',
}
""" Make sure the version is current """
# Could not run tesseract
ver_str = ver_str[:-3]
# Iterate through the version dots
# Aargh, in windows 3.02.02 is reported as version 3.02
# SFKM
# This minor version number is not present in tesseract, so it must be
# lower than required.  (3.02 < 3.02.01)
# 3.02.02 == 3.02.02
# 4.0 > 3.02.02
# 3.03.02 > 3.02.02
# 3.01.02 < 3.02.02
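# The comparison cases listed above can be reproduced with a short sketch.
# This is an illustration under assumptions, not the file's own loop: it compares
# integer tuples, which orders a shorter version below a longer one that shares its
# prefix (3.02 < 3.02.01). The names parse_version and is_current are hypothetical,
# the default of 3.02.02 is taken from the examples in the comments, and the
# Windows quirk (3.02.02 reported as 3.02) is deliberately not handled here.

```python
def parse_version(ver_str):
    """Turn a purely numeric dotted string such as '3.02.02' into (3, 2, 2)."""
    return tuple(int(part) for part in ver_str.split("."))


def is_current(ver_str, required="3.02.02"):
    """Return True when ver_str is at least the required version."""
    # Tuples compare element by element; a missing trailing component
    # makes the shorter tuple the smaller one.
    return parse_version(ver_str) >= parse_version(required)
```

# This sketch assumes the version string has already been stripped of any
# non-numeric suffix before it is parsed.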
def _warn(self, msg): # pragma: no cover
    print("WARNING: %s" % msg)
# Glob it
#fns = glob.glob(img_filename)
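# "KeyboardInterrupt or Exception" evaluates to KeyboardInterrupt alone;
# a tuple is needed to catch both.
except (KeyboardInterrupt, Exception):
    print("Caught interrupt or error... terminating")
    pool.terminate()
    raise
finally: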
# Could not run tesseract
# Output format is html for old versions of tesseract
logging.info("Created %s.html" % basename)
return hocr_filename
else:
# Try changing extension to .hocr for tesseract 3.03 and higher
else:
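# The output lookup described in the comments above (an .html file from old
# Tesseract versions, an .hocr file from 3.03 and higher) could be sketched as
# follows. This is a hedged illustration, not the file's actual logic: the name
# find_hocr_output is hypothetical, and returning None when neither file exists
# is an assumption about how a failure might be signalled.

```python
import os


def find_hocr_output(basename):
    """Return the first existing output file for basename, or None."""
    # Old Tesseract versions write basename.html; 3.03 and higher write basename.hocr.
    for ext in (".html", ".hocr"):
        candidate = basename + ext
        if os.path.isfile(candidate):
            return candidate
    return None
```

# A caller would pass the same basename it gave Tesseract and treat a None
# result as a failed run.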