OCR Extractor

The OCR extractor, found in File > Extract > Simple files > OCR extractor, extracts plain text from image data using the Tesseract OCR engine. The extractor relies on an independent Tesseract installation for its functionality. Install Tesseract and then set the appropriate environment variables in 'SetTesseract.bat/.sh', located in the 'bin' subdirectory. Only installations made with the provided installer on Windows and with Apt on Ubuntu have been tested, but source distributions should work as well. Refer to the Tesseract documentation for available language data and supported file formats. The language used for the extraction is set in 'SetTesseract.bat/.sh' and is also used to localize the extracted text in the resulting occurrence field.
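To check a Tesseract installation independently of Wandora, a small script along the following lines may help. It is only a sketch: it assumes the executable is named 'tesseract' and is reachable through PATH, and it reads TESSDATA_PREFIX, Tesseract's standard variable for the language data location. The variables actually set in 'SetTesseract.bat/.sh' are not listed on this page and may differ. The function names are hypothetical and not part of Wandora.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", executable="tesseract"):
    """Build the argument list for a plain-text OCR run that prints to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to <outputbase>.txt.
    return [executable, str(image_path), "stdout", "-l", lang]


def check_installation(executable="tesseract"):
    """Report where the executable resolves to and what TESSDATA_PREFIX holds."""
    return {
        # None if the executable cannot be found on PATH.
        "executable": shutil.which(executable),
        # None if the variable is not set in the current environment.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    info = check_installation()
    if info["executable"] is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, info["executable"]),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

Calling check_installation() first shows whether the environment is set up as expected. ocr_image('scan.png', 'eng') then returns the recognized text, provided the language data for 'eng' is installed.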

Input to the extractor is specified either as file references to local resources or as URLs to external resources. External resources are temporarily saved on the client file system, so make sure you have write access and enough free space when extracting external images.
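The handling of local and external input could be sketched roughly as follows. This is an illustration only, not Wandora's actual implementation: the function name, the 50 MB free-space threshold and the use of the system temporary directory are all assumptions.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def resolve_resource(resource, min_free_bytes=50 * 1024 * 1024,
                     opener=urllib.request.urlopen):
    """Return (local_path, is_temporary) for a file reference or a URL."""
    # Anything that is not an http(s) URL is treated as a local file reference.
    if urlparse(resource).scheme not in ("http", "https"):
        return resource, False

    # External resources need a writable location with enough free space.
    tmp_dir = tempfile.gettempdir()
    if not os.access(tmp_dir, os.W_OK):
        raise PermissionError(f"no write access to {tmp_dir}")
    if shutil.disk_usage(tmp_dir).free < min_free_bytes:
        raise OSError(f"less than {min_free_bytes} bytes free in {tmp_dir}")

    # Keep the original extension so the image format stays recognizable.
    suffix = os.path.splitext(urlparse(resource).path)[1]
    with opener(resource) as response, tempfile.NamedTemporaryFile(
        delete=False, suffix=suffix, dir=tmp_dir
    ) as tmp:
        shutil.copyfileobj(response, tmp)
        return tmp.name, True
```

The caller is expected to delete the returned file once extraction is done whenever is_temporary is true.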

In Wandora, each extracted resource is represented as a single topic of type 'OCR processed document'. Each document topic is populated with occurrences holding the extracted plain text and rudimentary metadata (extraction date, resource location, file size, etc.).

In addition, the extractor implements a simple occurrence population method that uses the subject locator as the resource reference. The extractor can therefore be used to add OCR-converted text to image location data previously gathered in a topic map.
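The document topics and the occurrence population step can be pictured with a small model such as the one below. It is plain Python rather than the Wandora API: the DocumentTopic class, the occurrence keys and the populate_ocr_occurrences function are hypothetical names chosen for illustration, and the OCR step is passed in as a callable so that any engine can be plugged in.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DocumentTopic:
    """Simplified stand-in for a topic of type 'OCR processed document'."""
    subject_locator: str
    topic_type: str = "OCR processed document"
    occurrences: dict = field(default_factory=dict)


def populate_ocr_occurrences(topics, ocr, lang="en"):
    """Add extracted text to topics that have a locator but no text yet.

    `ocr` is any callable that takes a resource location and returns text.
    Returns the number of topics that were updated.
    """
    updated = 0
    for topic in topics:
        if "text" in topic.occurrences:
            continue  # already processed, leave existing data untouched
        text = ocr(topic.subject_locator)
        # The text occurrence is keyed by language, mirroring localization.
        topic.occurrences["text"] = {lang: text}
        topic.occurrences["extraction date"] = datetime.now(timezone.utc).isoformat()
        topic.occurrences["resource location"] = topic.subject_locator
        updated += 1
    return updated
```

With this shape, a topic map that already holds image locations only needs one pass over its topics to gain the converted text, and topics that were processed earlier are skipped.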

Keep in mind that OCR yields adequate results only with proper training files and good-quality samples. Tesseract should always be provided with ample training data that matches the fonts and scanning methods of the input to ensure a good-quality conversion. For example, simple noise caused by a poor image sensor may be trivial for a human to filter out but can still throw off the recognition algorithm if the training data is noiseless. A scanned A4 document is used in the example below and should be considered a good-quality sample.