pypdfocr package¶

Submodules¶

pypdfocr.pypdfocr module¶

class pypdfocr.pypdfocr.PyPDFOCR[source]¶

Bases: object

The main class. Performs the following functions:

Parses command line options
Optionally just watches a directory for new PDFs to OCR; once a file appears, it does the next step
Runs a single file conversion:
- Runs ghostscript to get tiff/jpg
- Runs Tesseract-OCR to do the actual OCR
- Takes the HOCR from Tesseract and creates a new PDF with the text overlay
Files the OCR’ed file in the proper place if specified
Files the original file if specified

_clean_up_files(files)[source]¶

Helper function to delete files

Parameters:	files (list) – List of files to delete
Returns:	None

_convert_and_file_email(pdf_filename)[source]¶: Helper function to run the conversion, then do the optional filing, and optional emailing.

_get_config_file(config_file)[source]¶

Read in the yaml config file

Parameters:	config_file (file) – Configuration file (YAML format)
Returns:	dict of yaml file
Return type:	dict

_send_email(infilename, outfilename, filing)[source]¶: Send email using smtp

_setup_external_tools()[source]¶: Instantiate the external tool wrappers with their config dicts

_setup_filing()[source]¶

Instantiate the proper PyFiler object (either pypdfocr.pypdfocr_filer_dirs.PyFilerDirs or pypdfocr.pypdfocr_filer_evernote.PyFilerEvernote)

TODO: Make this more generic to allow third-party plugin filing objects

Variables:	filer – `pypdfocr.pypdfocr_filer.PyFiler` PyFiler subclass object that is instantiated
	pdf_filer – `pypdfocr.pypdfocr_pdffiler.PyPdfFiler` object to help with PDF reading
Returns:	Nothing

file_converted_file(ocr_pdffilename, original_pdffilename)[source]¶

Move the converted file to its destination directory. Optionally also moves the original PDF.

Parameters:	ocr_pdffilename (filename string) – Converted PDF file
	original_pdffilename (filename string) – Original scanned PDF file
Returns:	Target folder name
Return type:	string

get_options(argv)[source]¶

Parse the command-line options and set the following object properties:

Parameters:	argv – usually just sys.argv[1:]
Returns:	Nothing
Variables:	debug – Enable logging debug statements
	verbose – Enable verbose logging
	enable_filing – Whether to enable post-OCR filing of PDFs
	pdf_filename – Filename for single conversion mode
	watch_dir – Directory to watch for files to convert
	config – Dict of the config file
	watch – Whether folder watching mode is turned on
	enable_evernote – Enable filing to Evernote

go(argv)[source]¶

The main entry point into PyPDFOCR

Parses options
If filing is enabled, call _setup_filing()
If watch is enabled, start the watcher
Call run_conversion()
If filing is enabled, call file_converted_file()

run_conversion(pdf_filename)[source]¶

Does the following:

Convert the PDF using GhostScript to TIFF and JPG
Run Tesseract on the TIFF to extract the text into HOCR (html)
Use PDF generator to overlay the text on the JPG and output a new PDF
Clean up temporary image files

Parameters:	pdf_filename (string) – Scanned PDF
Returns:	OCR’ed PDF
Return type:	filename string
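
The first two steps above amount to shelling out to Ghostscript and Tesseract. A minimal sketch of how such command lines could be assembled, assuming the standard `gs` and `tesseract` executables are on the PATH. The helper names, the `tiffg4` device, and the 300 dpi default are illustrative assumptions, not necessarily the options PyPDFOCR passes:

```python
import subprocess


def ghostscript_cmd(pdf_filename, output_pattern, dpi=300, device="tiffg4"):
    """Build a Ghostscript command that rasterizes a PDF (hypothetical helper)."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=%s" % device,              # output image format
        "-r%d" % dpi,                        # rasterization resolution
        "-sOutputFile=%s" % output_pattern,  # e.g. "scan_%d.tiff", one file per page
        pdf_filename,
    ]


def tesseract_cmd(img_filename, output_base, lang="eng"):
    """Build a Tesseract command that writes hOCR output (hypothetical helper)."""
    return ["tesseract", img_filename, output_base, "-l", lang, "hocr"]


def run(cmd):
    """Run one external tool; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(cmd)
```

With both tools installed, `run(ghostscript_cmd("scan.pdf", "scan_%d.tiff"))` followed by `run(tesseract_cmd("scan_1.tiff", "scan_1"))` would cover the rasterize and OCR steps for the first page.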

pypdfocr.pypdfocr.error(text)[source]¶

pypdfocr.pypdfocr.main()[source]¶

pypdfocr.pypdfocr.open_file_with_timeout(*args, **kwargs)[source]¶

pypdfocr.pypdfocr.retry(count=5, exc_type=<type 'exceptions.Exception'>)[source]¶
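
The signature suggests a decorator factory that re-invokes a function when `exc_type` is raised, up to `count` attempts. A minimal sketch under that assumption; the real implementation is not documented here and may differ (for example, it may wait between attempts):

```python
import functools


def retry(count=5, exc_type=Exception):
    """Retry the wrapped function up to `count` times on `exc_type` (assumed semantics)."""
    if count < 1:
        raise ValueError("count must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(count):
                try:
                    return func(*args, **kwargs)
                except exc_type as err:
                    last_error = err  # remember the failure and try again
            raise last_error  # every attempt failed
        return wrapper
    return decorator
```

Exceptions that are not instances of `exc_type` propagate immediately in this sketch, which keeps unrelated bugs from being retried.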

pypdfocr.pypdfocr_gs module¶

Wrap ghostscript calls. Yes, this is ugly.

class pypdfocr.pypdfocr_gs.PyGs(config)[source]¶

Bases: object

Class to wrap all the ghostscript calls

_find_windows_gs()[source]¶

Searches through the Windows program files directories to find Ghostscript. If it finds multiple versions, it does a naive sort for now to find the most recent.

Returns:	The ghostscript binary location
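
The "naive sort" caveat matters because a plain string sort can rank version directories incorrectly. A small sketch of the difference, using hypothetical directory names; the numeric key shown is one possible alternative, not what the library does:

```python
import re


def version_key(dirname):
    """Turn a name like 'gs9.10' into a tuple of ints, here (9, 10)."""
    return tuple(int(part) for part in re.findall(r"\d+", dirname))


installed = ["gs9.06", "gs9.10", "gs9.9"]  # hypothetical install directories

naive_latest = sorted(installed)[-1]               # plain string comparison
numeric_latest = max(installed, key=version_key)   # compares (9, 6), (9, 10), (9, 9)
```

Here the string sort picks `gs9.9` because the character `9` sorts after `1`, while the numeric key picks `gs9.10`.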

_get_dpi(pdf_filename)[source]¶

_run_gs(options, output_filename, pdf_filename)[source]¶

_warn(msg)[source]¶

make_img_from_pdf(pdf_filename)[source]¶

pypdfocr.pypdfocr_gs.error(text)[source]¶

pypdfocr.pypdfocr_pdf module¶

Wrap pdf generation and text addition code

class pypdfocr.pypdfocr_pdf.PyPdf(gs)[source]¶

Bases: object

Class to create pdfs from images

_atoi(text)[source]¶

_get_font_spec(tag)[source]¶

_get_img_dims(img_filename)[source]¶

Returns:	(width, height, dpi)

_get_merged_single_page(original_page, ocr_text_page)[source]¶: Take two page objects, rotate the text page if necessary, and return the merged page

add_text_layer(pdf, hocrfile, page_num, height, dpi)[source]¶

Draw an invisible text layer for OCR data.

This function really needs to get cleaned up

get_transform(rotation, tx, ty)[source]¶

iter_pdf_page(f)[source]¶

mergeRotateAroundPointPage(page, page2, rotation, tx, ty)[source]¶

natural_keys(text)[source]¶

Sort key for human (natural) ordering: alist.sort(key=natural_keys) sorts in human order. See http://nedbatchelder.com/blog/200712/human_sorting.html (Toothy’s implementation in the comments).
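
A minimal sketch of this kind of sort key, written as free functions rather than methods and following the widely used split-on-digits pattern; the file names are made up for illustration:

```python
import re


def atoi(text):
    """Convert a run of digits to int; leave other chunks as strings."""
    return int(text) if text.isdigit() else text


def natural_keys(text):
    """Split 'page10.tiff' into ['page', 10, '.tiff'] so numbers compare numerically."""
    return [atoi(chunk) for chunk in re.split(r"(\d+)", text)]


pages = ["page10.tiff", "page2.tiff", "page1.tiff"]  # hypothetical page images
pages.sort(key=natural_keys)
# pages is now ['page1.tiff', 'page2.tiff', 'page10.tiff']
```

A plain `pages.sort()` would put `page10.tiff` before `page2.tiff`, which would scramble page order in a multi-page document.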

overlay_hocr_page(dpi, hocr_filename, img_filename)[source]¶

overlay_hocr_pages(dpi, hocr_filenames, orig_pdf_filename)[source]¶

polyval(poly, x)[source]¶

regex_baseline = <_sre.SRE_Pattern object at 0x107d1c030>¶

regex_bbox = <_sre.SRE_Pattern object at 0x105ae7b70>¶

regex_fontspec = <_sre.SRE_Pattern object at 0x107d21220>¶

regex_textangle = <_sre.SRE_Pattern object at 0x105aea468>¶

class pypdfocr.pypdfocr_pdf.RotatedPara(text, style, angle)[source]¶

Bases: reportlab.platypus.paragraph.Paragraph

Used for rotating text, since the low-level rotate method in textobjects doesn’t seem to do anything

beginText(x, y)[source]¶

draw()[source]¶

pypdfocr.pypdfocr_pdffiler module¶

Provides capability to search PDFs and file to a specific folder based on keywords

class pypdfocr.pypdfocr_pdffiler.PyPdfFiler(filer)[source]¶

Bases: object

_get_matching_folder(pdfText)[source]¶

file_original(original_filename)[source]¶

iter_pdf_page_text(filename)[source]¶

move_to_matching_folder(filename)[source]¶

pypdfocr.pypdfocr_tesseract module¶

Run Tesseract to generate hocr file

class pypdfocr.pypdfocr_tesseract.PyTesseract(config)[source]¶

Bases: object

Class to wrap all the tesseract calls

_is_version_uptodate()[source]¶: Make sure the version is current

_warn(msg)[source]¶

make_hocr_from_pnm(img_filename)[source]¶

make_hocr_from_pnms(fns)[source]¶

pypdfocr.pypdfocr_tesseract.error(text)[source]¶

pypdfocr.pypdfocr_tesseract.unwrap_self(arg, **kwarg)[source]¶

pypdfocr.pypdfocr_watcher module¶

Something

class pypdfocr.pypdfocr_watcher.PyPdfWatcher(monitor_dir, config)[source]¶

Bases: watchdog.events.FileSystemEventHandler

Watch a folder for new pdf files.

On a new-file event, add the file to the queue with a timestamp. On a file-modified event, update its timestamp in the queue. Every few seconds, pop entries off the queue: if an entry’s timestamp is older than 3 seconds, process the file; otherwise, push it back onto the queue.

check_for_new_pdf(ev_path)[source]¶

Called by the file watching API on any file creations/modifications. For any file ending with “.pdf”, but not “_ocr.pdf”, it adds new files to the event queue with the current timestamp, or it updates existing files in the queue with the current timestamp. This queue is used to track files and keep track of their last “touched” time, so we can start processing a file if check_queue() finds a file that hasn’t been touched in a while.

If the file does not exist in the events dict:

Add it with the current time

Otherwise:

If the file time is marked as -1, delete it from the dict

Else, update the time in the dict to the current time

check_queue()[source]¶

This function is called at regular intervals by start().

Iterate through the events, and if any has a timestamp older than scan_interval, return it and set its timestamp to -1 for purging later.

Returns:	Filename if available to process, otherwise None.
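
The interplay between check_for_new_pdf() and check_queue() can be sketched as a small standalone class. This is an illustration of the described logic only: the class name `EventQueue` and the `now` parameter (added so the behavior can be exercised without sleeping) are assumptions, and the locking done via `events_lock` is omitted.

```python
import time


class EventQueue:
    """Hypothetical stand-in for the watcher's events dict handling."""

    def __init__(self, scan_interval=3):
        self.scan_interval = scan_interval
        self.events = {}  # path -> last-touched time, or -1 once handed out

    def check_for_new_pdf(self, path, now=None):
        now = time.time() if now is None else now
        # Only track PDFs that are not already OCR output.
        if not path.endswith(".pdf") or path.endswith("_ocr.pdf"):
            return
        if path not in self.events:
            self.events[path] = now      # first sighting
        elif self.events[path] == -1:
            del self.events[path]        # already processed: purge
        else:
            self.events[path] = now      # still being written: refresh

    def check_queue(self, now=None):
        now = time.time() if now is None else now
        for path, stamp in self.events.items():
            if stamp != -1 and now - stamp > self.scan_interval:
                self.events[path] = -1   # mark for purging later
                return path
        return None
```

Waiting until a file has been quiet for `scan_interval` seconds avoids starting OCR on a PDF that a scanner is still writing.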

events = {}¶

events_lock = <thread.lock object at 0x1056e5a70>¶

on_created(event)[source]¶

on_modified(event)[source]¶

on_moved(event)[source]¶

rename_file_with_spaces(pdf_filename)[source]¶

Replace spaces in the basename of a filename with underscores and rename the file accordingly. Does not affect spaces in the directory path.

Parameters:	pdf_filename (string) – Filename to remove spaces from
Returns:	Modified filename
Return type:	string
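
A minimal sketch of this behavior as a free function, assuming the rename happens on disk via `os.rename`; the actual method may handle errors or logging differently:

```python
import os


def rename_file_with_spaces(pdf_filename):
    """Replace spaces in the basename with underscores; leave the directory untouched."""
    dirname, basename = os.path.split(pdf_filename)
    if " " not in basename:
        return pdf_filename  # nothing to rename
    new_filename = os.path.join(dirname, basename.replace(" ", "_"))
    os.rename(pdf_filename, new_filename)
    return new_filename
```

Splitting the path first is what keeps a directory such as `My Scans/` intact while only the file name itself is changed.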

start()[source]¶

stop()[source]¶

pypdfocr.pypdfocr_preprocess module¶

Wrap ImageMagick calls. Yes, this is ugly.

class pypdfocr.pypdfocr_preprocess.PyPreprocess(config)[source]¶

Bases: object

Class to wrap all the ImageMagick convert calls

_run_preprocess(in_filename)[source]¶

_warn(msg)[source]¶

cmd(cmd_list)[source]¶

preprocess(in_filenames)[source]¶

pypdfocr.pypdfocr_preprocess.unwrap_self(arg, **kwarg)[source]¶

pypdfocr.pypdfocr_filer module¶

class pypdfocr.pypdfocr_filer.PyFiler[source]¶

Bases: object

Abstract base class for defining filing objects, whether you want to save to a file-system/directory structure or to something like Evernote

_abc_cache = <_weakrefset.WeakSet object at 0x107d5e890>¶

_abc_negative_cache = <_weakrefset.WeakSet object at 0x107d5e910>¶

_abc_negative_cache_version = 25¶

_abc_registry = <_weakrefset.WeakSet object at 0x107d5e7d0>¶

_get_unique_filename_by_appending_version_integer(tgtfilename)[source]¶

_split_filename_dir_filename_ext(filename)[source]¶

add_folder_target(folder, keywords)[source]¶: Add a target folder for a list of keywords

default_folder None¶

file_original(original_filename)[source]¶

Move the original file given by filename to the proper location. You will need to use original_move_folder

Parameters:	original_filename (string) – File to move
Returns:	Full path+filename of destination (original_filename if not moved)
Return type:	string

folder_targets None¶: Data structure for mapping a keyword to a folder target. Usually just a dict, and new mappings are added from add_folder_target()

get_default_folder()[source]¶

get_folder_targets()[source]¶

get_original_move_folder()[source]¶

get_target_folder()[source]¶

move_to_matching_folder(filename)[source]¶

Move the file given by filename to the proper location. You will need to use target_folder and folder_targets to figure out what the proper destination is. If there is no matching location, then use default_folder

Parameters:	filename (string) – File to move
Returns:	Full path+filename of destination
Return type:	string
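
The keyword-to-folder lookup described above can be sketched as follows. The class `KeywordFiler` and the method `choose_folder` are hypothetical names used for illustration; the sketch only picks a destination folder from already extracted text and does not move any file:

```python
import os


class KeywordFiler:
    """Hypothetical illustration of keyword-based filing; not the PyFiler API."""

    def __init__(self, target_folder, default_folder):
        self.target_folder = target_folder    # root under which folders live
        self.default_folder = default_folder  # used when no keyword matches
        self.folder_targets = {}              # keyword -> folder name

    def add_folder_target(self, folder, keywords):
        for keyword in keywords:
            self.folder_targets[keyword.lower()] = folder

    def choose_folder(self, pdf_text):
        text = pdf_text.lower()
        for keyword, folder in self.folder_targets.items():
            if keyword in text:
                return os.path.join(self.target_folder, folder)
        return os.path.join(self.target_folder, self.default_folder)
```

For example, after `filer.add_folder_target("bills", ["invoice"])`, any text containing "invoice" in any letter case resolves to the `bills` folder under `target_folder`, and everything else falls back to `default_folder`.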

original_move_folder None¶

set_default_folder(default_folder)[source]¶

set_folder_targets(folder_targets)[source]¶

set_original_move_folder(original_move_folder)[source]¶

set_target_folder(target_folder)[source]¶

target_folder None¶

pypdfocr.pypdfocr_filer_dirs module¶

class pypdfocr.pypdfocr_filer_dirs.PyFilerDirs[source]¶

Bases: pypdfocr.pypdfocr_filer.PyFiler

_abc_cache = <_weakrefset.WeakSet object at 0x107d5eb90>¶

_abc_negative_cache = <_weakrefset.WeakSet object at 0x107d5ec10>¶

_abc_negative_cache_version = 25¶

_abc_registry = <_weakrefset.WeakSet object at 0x107d5ead0>¶

add_folder_target(folder, keywords)[source]¶

file_original(original_filename)[source]¶

move_to_matching_folder(filename, foldername)[source]¶

pypdfocr.pypdfocr_filer_evernote module¶

class pypdfocr.pypdfocr_filer_evernote.PyFilerEvernote(dev_token)[source]¶

Bases: pypdfocr.pypdfocr_filer.PyFiler

_abc_cache = <_weakrefset.WeakSet object at 0x104999f10>¶

_abc_negative_cache = <_weakrefset.WeakSet object at 0x1049a3290>¶

_abc_negative_cache_version = 25¶

_abc_registry = <_weakrefset.WeakSet object at 0x104999d50>¶

classmethod _check_and_make_notebook(*args, **kwargs)[source]¶

_connect_to_evernote(dictUserInfo)[source]¶

Establish a connection to evernote and authenticate.

Parameters:	dictUserInfo – Dict of user info like user/password. For now, just the dev token
Returns:	Whether the connection succeeded
Return type:	bool

classmethod _create_evernote_note(*args, **kwargs)[source]¶

classmethod _create_notebook(*args, **kwargs)[source]¶

classmethod _get_notebooks(*args, **kwargs)[source]¶

_update_notebook(notebook)[source]¶

add_folder_target(folder, keywords)[source]¶

default_folder None¶: Override this to make sure we only have the basename

file_original(original_filename)[source]¶: Just file it to the local file system (don’t upload to evernote)

get_default_folder()[source]¶: Override this to make sure we only have the basename

get_target_folder()[source]¶

move_to_matching_folder(filename, foldername)[source]¶

Use the evernote API to create a new note:

Make the notebook if it doesn’t exist (_check_and_make_notebook())
Create the note (_create_evernote_note())
Upload note using API

set_default_folder(default_folder)[source]¶: Override this to make sure we only have the basename

set_target_folder(target_folder)[source]¶: Override this to make sure we only have the basename

target_folder None¶

class pypdfocr.pypdfocr_filer_evernote.en_handle(f)[source]¶

Bases: object

Generic exception handler for Evernote actions
