PyPDFOCR - Tesseract-OCR based PDF filing

This program will help manage your scanned PDFs by doing the following:

  • Take a scanned PDF file and run OCR on it (using the Tesseract OCR software from Google), generating a searchable PDF
  • Optionally, watch a folder for incoming scanned PDFs and automatically run OCR on them
  • Optionally, file the scanned PDFs into directories based on simple keyword matching that you specify
  • Evernote auto-upload and filing based on keyword search
  • Email status when it files your PDF

Usage:

Single conversion:

pypdfocr filename.pdf

--> filename_ocr.pdf will be generated

If you have a language pack installed, then you can specify it with the -l option:

pypdfocr -l spa filename.pdf
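If you drive pypdfocr from another script, the two invocations above can be assembled programmatically. The following is only a sketch: build_command and expected_output are hypothetical helper names, not part of pypdfocr, and the _ocr suffix rule is inferred from the single example above.

```python
import os


def build_command(pdf_path, lang=None):
    """Build the argument list for a single-file conversion (hypothetical helper)."""
    cmd = ["pypdfocr"]
    if lang:
        # -l selects an installed Tesseract language pack, e.g. "spa"
        cmd += ["-l", lang]
    cmd.append(pdf_path)
    return cmd


def expected_output(pdf_path):
    """Guess the output name, assuming "_ocr" is inserted before the extension."""
    root, ext = os.path.splitext(pdf_path)
    return root + "_ocr" + ext
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the standard library.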

Folder monitoring:

pypdfocr -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
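To illustrate the idea behind folder monitoring, here is a minimal polling sketch that uses only the standard library. It is not how pypdfocr itself watches a directory (the project depends on the watchdog library for that), and new_pdfs is a hypothetical name chosen for this example.

```python
import os


def new_pdfs(directory, seen):
    """Return PDF paths in `directory` that are not yet in `seen`.

    Newly found paths are added to `seen`, so each file is reported once.
    """
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.lower().endswith(".pdf") and path not in seen:
            seen.add(path)
            found.append(path)
    return found
```

Calling new_pdfs repeatedly with the same seen set, with a pause between calls, gives a crude equivalent of watching the folder.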

Automatic filing:

To automatically move the OCR’ed pdf to a directory based on a keyword, use the -f option and specify a configuration file (described below):

pypdfocr filename.pdf -f -c config.yaml

You can also do this in folder monitoring mode:

pypdfocr -w watch_directory -f -c config.yaml

Filing based on filename match:

If no keywords match the contents of the PDF, you can optionally allow it to fall back to looking for keyword matches in the PDF filename using the -n option. For example, your scanner may always name receipts like receipt_2013_12_2.pdf, and you want to move these to a folder called ‘receipts’. Assuming you have a keyword receipt mapped to the folder receipts in your configuration file as described below, you can run the following and have the file filed even if the content of the PDF does not contain the text ‘receipt’:

pypdfocr filename.pdf -f -c config.yaml -n
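The order of checks described above (PDF text first, filename second, and only when -n is given) can be sketched as follows. This is an illustration under assumptions, not pypdfocr's actual code: match_folder and file_with_fallback are hypothetical names, and the folders argument is a plain dict of folder name to keyword list.

```python
def match_folder(text, folders):
    """Return the first folder with a keyword found in `text`, else None."""
    lowered = text.lower()
    for folder, keywords in folders.items():
        if any(str(keyword).lower() in lowered for keyword in keywords):
            return folder
    return None


def file_with_fallback(pdf_text, filename, folders, use_filename=False):
    """Match on the PDF text; fall back to the filename only if requested."""
    folder = match_folder(pdf_text, folders)
    if folder is None and use_filename:  # corresponds to passing -n
        folder = match_folder(filename, folders)
    return folder
```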

Configuration file for automatic PDF filing

The config.yaml file above is a simple folder to keyword matching text file. It determines where your OCR’ed PDFs (and optionally, the original scanned PDF) are placed after processing. An example is given below:

target_folder: "docs/filed"
default_folder: "docs/filed/manual_sort"
original_move_folder: "docs/originals"

folders:
    finances:
        - american express
        - chase card
        - internal revenue service
    travel:
        - boarding pass
        - airlines
        - expedia
        - orbitz
    receipts:
        - receipt

The target_folder is the root of your filing cabinet. All filed PDFs are placed in sub-directories under this directory.

The folders section defines your filing directories and the keywords associated with them. In this example, we have three filing directories (finances, travel, receipts), and some associated keywords for each filing directory. For example, if your OCR’ed PDF contains the phrase “american express” (in any upper/lower case), it will be filed into docs/filed/finances.

The default_folder is where the OCR’ed PDF is moved to if there is no keyword match.
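As a rough model of how these settings interact, the sketch below resolves a destination directory from OCR'ed text. It assumes the YAML has already been loaded into a Python dict; destination_dir is a hypothetical name, and the first-match-wins rule is an assumption, since this page does not say how pypdfocr handles text that matches several folders.

```python
import os

# The example configuration above, written as a Python dict.
CONFIG = {
    "target_folder": "docs/filed",
    "default_folder": "docs/filed/manual_sort",
    "folders": {
        "finances": ["american express", "chase card", "internal revenue service"],
        "travel": ["boarding pass", "airlines", "expedia", "orbitz"],
        "receipts": ["receipt"],
    },
}


def destination_dir(text, config):
    """Pick the filing directory for `text` using case-insensitive keywords."""
    lowered = text.lower()
    for folder, keywords in config["folders"].items():
        for keyword in keywords:
            # str() lets numeric keywords from the YAML file be matched too
            if str(keyword).lower() in lowered:
                return os.path.join(config["target_folder"], folder)
    # No keyword matched: use the manual-sort location
    return config["default_folder"]
```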

The original_move_folder is optional (you can comment it out with # in front of that line), but if specified, the original scanned PDF is moved into this directory after OCR is done. Otherwise, if this field is not present or commented out, your original PDF will stay where it was found.

If there is any naming conflict during filing, the program will add an underscore followed by a number to each filename, in order to avoid overwriting files that may already be present.
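One way to picture this conflict handling is the sketch below. The exact placement of the underscore and number is an assumption (here, just before the extension), and unique_path is a hypothetical name rather than a pypdfocr function.

```python
import os


def unique_path(path):
    """Return `path`, or a variant ending in _1, _2, ... if it already exists."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    n = 1
    # Keep counting until an unused name is found
    while os.path.exists(f"{root}_{n}{ext}"):
        n += 1
    return f"{root}_{n}{ext}"
```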

Evernote upload:

Evernote authentication token

To enable Evernote support, you will need to get a developer token for your Evernote account. Note that this script never deletes or modifies existing notes in your account; it limits itself to creating new Notebooks and Notes. Once you have the token, copy and paste it into your configuration file as shown below.

Evernote filing usage

To automatically upload the OCR’ed pdf to a folder based on a keyword, use the -e option instead of the -f auto filing option.

pypdfocr filename.pdf -e -c config.yaml

Similarly, you can also do this in folder monitoring mode:

pypdfocr -w watch_directory -e -c config.yaml

Evernote filing configuration file

The config file shown above only needs to change slightly. The folders section is completely unchanged, but note that target_folder is the name of your “Notebook stack” in Evernote, and the default_folder should just be the default Evernote upload notebook name.

target_folder: "evernote_stack"
default_folder: "default"
original_move_folder: "docs/originals"
evernote_developer_token: "YOUR_TOKEN"

folders:
    finances:
        - american express
        - chase card
        - internal revenue service
    travel:
        - boarding pass
        - airlines
        - expedia
        - orbitz
    receipts:
        - receipt

Auto email

You can have PyPDFOCR email you every time it converts and files a document. First add the following lines to the configuration file, then use the -m option when invoking pypdfocr:

mail_smtp_server: "smtp.gmail.com:587"
mail_smtp_login: "virantha@gmail.com"
mail_smtp_password: "PASSWORD"
mail_from_addr: "virantha@gmail.com"
mail_to_list:
    - "virantha@gmail.com"
    - "person2@gmail.com"
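To show how these settings map onto a status email, here is a sketch built on the standard library. It is not pypdfocr's own mail code: split_server and build_status_message are hypothetical names, the subject and body text are made up for the example, and the fallback to port 25 when no port is given is an assumption.

```python
from email.message import EmailMessage


def split_server(server):
    """Split "host:port" into (host, port); assumes port 25 if none is given."""
    host, _, port = server.partition(":")
    return host, int(port) if port else 25


def build_status_message(config, filed_path):
    """Build a plain-text status message from the mail_* settings."""
    msg = EmailMessage()
    msg["Subject"] = "PyPDFOCR filed a document"  # illustrative wording
    msg["From"] = config["mail_from_addr"]
    msg["To"] = ", ".join(config["mail_to_list"])
    msg.set_content("Filed: " + filed_path)
    return msg
```

To send the message, smtplib.SMTP(host, port) followed by starttls(), login() with the mail_smtp_login and mail_smtp_password values, and send_message(msg) would be the usual sequence for a port 587 server.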

Advanced options

Fine-tuning Tesseract/Ghostscript/others

You can specify the Tesseract and Ghostscript executable locations manually, as well as the number of concurrent processes allowed during preprocessing and Tesseract execution. Use the following in your configuration file:

tesseract:
    binary: "/usr/bin/tesseract"
    threads: 8

ghostscript:
    binary: "/usr/local/bin/gs"

preprocess:
    threads: 8

Handling disk time-outs

If you need to increase the time interval (default 3 seconds) between new document scans when pypdfocr is watching a directory, you can specify the following option in the configuration file:

watch:
    scan_interval: 6

Installation

Using pip

PyPDFOCR is available in PyPI, so you can just run:

pip install pypdfocr

Please note that some of the third-party libraries required by PyPDFOCR need build tools that are missing on a default Ubuntu system. If you run into any issues using pip install, you may want to install the following packages on Ubuntu and try again:

  • gcc
  • libjpeg-dev
  • zlib-bin
  • zlib1g-dev
  • python-dev

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable called pypdfocr.exe.

You still need to install Tesseract, GhostScript, etc. as detailed below in the external dependencies list.

Manual install

Clone the source directly from github (you need to have git installed):

git clone https://github.com/virantha/pypdfocr.git

Then install the following third-party Python libraries, all of which are available via pip:

pip install Pillow
pip install reportlab
pip install watchdog
pip install pypdf2

You will also need to install the external dependencies listed below.

External Dependencies

PyPDFOCR relies on the following (free) programs being installed and in the path:

  • Tesseract
  • Ghostscript
  • ImageMagick
  • Poppler

Poppler is only required if you want pypdfocr to figure out the original PDF resolution automatically; just make sure you have pdfimages in your path. Note that the xpdf provided pdfimages does not work for this, because it does not support the -list option to list the table of images in a PDF file.

On Mac OS X, you can install these using homebrew:

brew install tesseract
brew install ghostscript
brew install poppler
brew install imagemagick

On Windows, please use the installers provided on their download pages.
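Before running pypdfocr, a quick check that the external programs are on the path can save some debugging. The sketch below uses shutil.which from the standard library. The executable names are assumptions: tesseract, gs, and pdfimages appear elsewhere on this page, convert is assumed for ImageMagick, and Windows installs may use different names.

```python
import shutil

# Assumed executable names; adjust for your platform.
REQUIRED_TOOLS = ["tesseract", "gs", "pdfimages", "convert"]


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

Calling missing_tools(REQUIRED_TOOLS) returns an empty list when everything is installed and visible.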

** Important ** Tesseract version 3.02.02 or newer is required (apparently 3.02.01-6 and possibly others do not work due to an hOCR output format change that I’m not planning to address). On Ubuntu, you may need to compile and install it manually.
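The minimum-version rule can be checked with a simple tuple comparison. The sketch below parses a version string such as the one Tesseract prints when asked for its version. It is not pypdfocr's own check, and parse_version and meets_minimum are hypothetical names.

```python
import re

MINIMUM = (3, 2, 2)  # 3.02.02, the minimum stated above


def parse_version(output):
    """Extract (major, minor, patch) from text such as "tesseract 3.02.02"."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("no version number found")
    # A missing patch component (e.g. "3.03") is treated as 0
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(output, minimum=MINIMUM):
    """Return True if the version found in `output` is at least `minimum`."""
    return parse_version(output) >= minimum
```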

Also note that if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) then you need to find your tessdata directory and do the following:

cd /usr/local/share/tessdata
cp eng.traineddata osd.traineddata

osd stands for Orientation and Script Detection, so you need to copy the .traineddata file for whatever language you want to scan in to osd.traineddata. If you skip this step, any landscape document will produce garbage output.

Disclaimer

While test coverage is at 84% right now, Sphinx docs generation is at an early stage. The software is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Changelog

Version Date Changes
v0.9.1 10/11/16 Fixes (#43, #41)
v0.9.0 2/29/16 Fixed rotated page text, Mac OS X invisible fonts, and pdf merge slowdown
v0.8.5 2/21/16 Better ctrl-c and cleanup behavior
v0.8.4 2/18/16 Maintenance release
v0.8.3 2/18/16 Bug fix for multiprocessing on windows, ctrl-c interrupt, and integer keywords
v0.8.2 12/8/14 Fixed imagemagick invocation on windows. Parallelized preprocessing and tesseract execution
v0.8.1 12/5/14 Added --skip-preprocess option, scan_interval option, and fixed too many open files bug during page overlay
v0.8.0 10/27/14 Added preprocessing to clean up prior to tesseract, bug fixes on file names with spaces/dots
v0.7.6 9/10/14 Fixed issue 17 rotation bug
v0.7.5 8/18/14 Update for Tesseract 3.03 .hocr filename change
v0.7.4 3/28/14 Bug fix on pdf assembly
v0.7.3 3/27/14 Modified internals to use single image per page (instead of multipage tiff). Also enabled orientation detection
v0.7.2 3/26/14 Switched from Pil to Pillow. Now uses original images from PDF in output pdf (no dpi/color/quality changes!)
v0.7.1 3/25/14 OCR Language is now an option
v0.7.0 3/25/14 Now honors original pdf resolution
v0.6.1 2/16/14 Bug fix for pdfs with only numbers in the filename
v0.6.0 1/16/14 Added filing based on filename match as fallback, added tesseract version check
v0.5.4 1/12/14 Fixed bug with reordering of text pages on certain platforms(glob)
v0.5.3 12/12/13 Fix to evernote server specification
v0.5.2 12/08/13 Fix to lowercase keywords
v0.5.1 11/02/13 Fixed a bunch of windows critical path handling issues
v0.5.0 10/30/13 Email status added, 90% test coverage
v0.4.1 10/28/13 Made HOCR parsing more robust
v0.4.0 10/28/13 Added early Evernote upload support
v0.3.1 10/24/13 Path fix on windows
v0.3.0 10/23/13 Added filing of converted pdfs using a configuration file to specify target directories based on keyword matches in the pdf text
v0.2.2 10/22/13 Added a console script to put the pypdfocr script into your bin
v0.2.1 10/22/13 Fix to initial packaging problem.
v0.2.0 10/21/13 Initial release.

Todo list

  • #43 version check for tesseract
  • On windows, search for pdfimages and imagemagick instead of relying on path
  • Split up into flow steps
  • Run more robustness tests for watching networked shares
  • Add more docstrings
  • Add more option specifiers to tesseract and ghostscript
