OCR Extractor

From WandoraWiki

The OCR extractor, found in File > Extract > Simple files > OCR extractor, extracts plain text from image data using the Tesseract OCR engine. The extractor relies on an independent Tesseract installation for its functionality. Install Tesseract and then set the appropriate environment variables in 'SetTesseract.bat/.sh', located in the 'bin' subdirectory. Only installations made with the provided installer on Windows and with Apt on Ubuntu have been tested, but source distributions should work as well. Refer to the appropriate Tesseract documentation regarding available language data and supported file formats. The language used for the extraction is set in 'SetTesseract.bat/.sh' and also determines the language under which the extracted text is stored in the resulting occurrence field.
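To make the dependency on an external Tesseract installation concrete, here is a minimal Python sketch of a Tesseract command-line run. It is not Wandora's implementation. It assumes the `tesseract` executable is on the path and uses the standard command form `tesseract <image> <outputbase> -l <language>`. `TESSDATA_PREFIX` is Tesseract's own variable for locating language data; whether 'SetTesseract.bat/.sh' sets it is an assumption, and the function names are hypothetical.

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the argument list for one Tesseract command-line run."""
    # Standard CLI form: tesseract <image> <outputbase> -l <language>
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng", tessdata_prefix=None):
    """Run Tesseract and return the recognized plain text.

    Requires a working Tesseract installation; raises CalledProcessError
    if the tesseract process exits with an error.
    """
    env = dict(os.environ)
    if tessdata_prefix is not None:
        # Assumption: the language data location is passed via TESSDATA_PREFIX.
        env["TESSDATA_PREFIX"] = tessdata_prefix
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True, env=env)
    # Tesseract writes its plain text result to <outputbase>.txt.
    with open(output_base + ".txt", encoding="utf-8") as result:
        return result.read()
```

For example, `run_ocr("scan.png", "scan_out", language="fin")` would need the Finnish language data to be installed for Tesseract.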

Input to the extractor is specified either as file references to local resources or as URLs to external resources. External resources are temporarily saved on the client file system, so make sure you have write access and enough free space available when extracting external images.
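The handling of local and external input can be sketched as follows. This is an illustrative Python sketch, not Wandora's code: the function names and the 10 MB free-space threshold are hypothetical, and only `http` and `https` URLs are treated as external.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_external(resource):
    """Return True if the resource is an http(s) URL rather than a local file."""
    return urlparse(resource).scheme in ("http", "https")


def localize_resource(resource, min_free_bytes=10 * 1024 * 1024):
    """Return a local file path for the resource, downloading it if needed."""
    if not is_external(resource):
        return resource  # Local file references are used as they are.

    tmp_dir = tempfile.gettempdir()
    # External images are saved temporarily, so check write access and space.
    if not os.access(tmp_dir, os.W_OK):
        raise PermissionError("No write access to " + tmp_dir)
    if shutil.disk_usage(tmp_dir).free < min_free_bytes:
        raise OSError("Not enough free space in " + tmp_dir)

    # Keep the file extension so the image format stays recognizable.
    suffix = os.path.splitext(urlparse(resource).path)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix, dir=tmp_dir)
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(resource) as response:
        shutil.copyfileobj(response, out)
    return tmp_path
```

The caller is responsible for deleting the temporary file once the extraction has finished.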

In Wandora, each extracted resource is represented as a single topic of type 'OCR processed document'. The document topics are populated with occurrences containing the extracted plain text as well as rudimentary metadata (extraction date, resource location, file size, etc.).
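The shape of such a document topic can be illustrated with a plain Python dictionary. This is only a sketch of the structure described above and does not use Wandora's topic map API. The occurrence type names are hypothetical; only the topic type 'OCR processed document' comes from the extractor itself.

```python
import os
from datetime import datetime, timezone


def build_document_topic(resource, text, language="en"):
    """Build a dictionary that mirrors an 'OCR processed document' topic."""
    topic = {
        "types": ["OCR processed document"],
        "occurrences": {
            # The extracted text is stored under the extraction language.
            "extracted text": {language: text},
            "extraction date": datetime.now(timezone.utc).isoformat(),
            "resource location": resource,
        },
    }
    # The file size is only known when the resource is a local file.
    if os.path.isfile(resource):
        topic["occurrences"]["file size"] = os.path.getsize(resource)
    return topic
```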

Further, a simple occurrence population method using the subject locator as a resource reference is implemented; thus the extractor can be used to add OCR-converted text data to previously gathered image location data in a topic map.
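The idea of populating existing topics through their subject locators can be sketched in Python as below. The dictionary-based topic model, the 'extracted text' occurrence name and the `ocr` callback are hypothetical stand-ins for illustration, not Wandora's API.

```python
def add_ocr_occurrences(topics, ocr, language="en"):
    """Add OCR text to every topic that has a subject locator.

    `topics` is a list of dictionaries and `ocr` is any function that maps
    a resource reference to extracted plain text. Returns the number of
    topics that were updated.
    """
    updated = 0
    for topic in topics:
        locator = topic.get("subject_locator")
        if not locator:
            continue  # Nothing to extract from without a resource reference.
        text = ocr(locator)
        occurrences = topic.setdefault("occurrences", {})
        occurrences.setdefault("extracted text", {})[language] = text
        updated += 1
    return updated
```

Passing the OCR step in as a function keeps the sketch independent of how Tesseract is actually invoked.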

Keep in mind that OCR yields sufficient results only with proper training files and quality samples. Tesseract should always be provided with ample training data with matching fonts and scanning methods to ensure good quality for the conversion process. For example, simple noise caused by poor image sensor quality may be trivial for a human brain to filter out, but it may still be enough to throw off the recognition algorithm if the provided training data is noiseless. The example below uses a scanned A4 document, which should be considered a good quality sample.


Let's consider the image below for OCR extraction.

OCR sample.gif

Ocr 1.png

The extractor is found in File > Extract > Simple files > OCR extractor.

Ocr 2.png Ocr 8.png

Images are specified with either file references or URLs.

Ocr 3.png

A typical extraction result with plain text data and rudimentary metadata.

Ocr 4.png

Well-formed samples yield the best results.

Ocr 5.png

Ocr 6.png

Ocr 7.png

Existing topics may be complemented with OCR data using the topic's subject locator.

Ocr 9.png

The Wandora Firefox plugin may be used to perform extractions for images opened in the browser.
