OCR Extractor

From WandoraWiki
Revision as of 12:48, 10 June 2013 by Eero (Talk | contribs)


The OCR extractor, found in File > Extract > Simple files > OCR extractor, extracts plain text from image data using the Tesseract OCR engine. The extractor relies on an independent Tesseract installation for its functionality. Install Tesseract first, then set the appropriate environment variables in 'SetTesseract.bat' or 'SetTesseract.sh', located in the 'bin' subdirectory. Only installations made with the provided installer for Windows and with Apt on Ubuntu have been tested, but source distributions should work as well. Refer to the Tesseract documentation for the available language data and supported file formats. The language used for the extraction is set in 'SetTesseract.bat' or 'SetTesseract.sh' and is also used to localize the extracted text in the resulting occurrence field.
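
As a rough illustration of what such an extraction involves, the sketch below calls the Tesseract command-line tool from Python. It is not Wandora's implementation. It assumes the 'tesseract' executable is on the PATH, and the environment variable name WANDORA_OCR_LANG is purely hypothetical (the actual variables defined in the SetTesseract scripts are not listed here). The command form 'tesseract image outputbase -l lang', which writes the text to 'outputbase.txt', is the standard Tesseract command-line usage.

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Return the argument list for a Tesseract command-line call."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language data, for example "eng" or "fin".
        command += ["-l", lang]
    return command


def extract_text(image_path, output_base, lang=None):
    """Run Tesseract on an image and return the recognized plain text."""
    # WANDORA_OCR_LANG is a hypothetical variable name used for illustration.
    lang = lang or os.environ.get("WANDORA_OCR_LANG", "eng")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling extract_text("scan.png", "scan_out", "fin") would, under these assumptions, run Tesseract with the Finnish language data and return the contents of 'scan_out.txt'.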

Input to the extractor is specified either as file references to local resources or as URLs to external resources. External resources are temporarily saved on the client file system, so make sure you have write access and enough free space when extracting external images.
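
The handling of local and external input can be sketched as follows. This is a minimal illustration, not the extractor's actual code: the function names, the set of URL schemes treated as external, and the 10 MB free-space threshold are all assumptions made for the example.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_external(reference):
    """Treat references with a network URL scheme as external resources."""
    return urlparse(reference).scheme in ("http", "https", "ftp")


def localize(reference, min_free_bytes=10 * 1024 * 1024):
    """Return (local_path, is_temporary) for a file reference or URL."""
    if not is_external(reference):
        return reference, False
    temp_dir = tempfile.gettempdir()
    # External images are saved temporarily, so check write access and space.
    if not os.access(temp_dir, os.W_OK):
        raise PermissionError("no write access to " + temp_dir)
    if shutil.disk_usage(temp_dir).free < min_free_bytes:
        raise OSError("not enough free space in " + temp_dir)
    suffix = os.path.splitext(urlparse(reference).path)[1]
    with urllib.request.urlopen(reference) as response, \
            tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as target:
        shutil.copyfileobj(response, target)
        return target.name, True
```

The caller is expected to delete the temporary file once the text has been extracted.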

In Wandora, each extracted resource is represented as a single topic of type 'OCR processed document'. Each document topic is populated with occurrences containing the extracted plain text and rudimentary metadata (extraction date, resource location, file size, etc.).
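
To make the shape of such a document topic concrete, here is a small sketch that models it as a plain Python dictionary. It does not use the Wandora API. The dictionary layout and the occurrence names are assumptions chosen to mirror the description above, and the file size is only recorded when the location is a local file.

```python
import os
from datetime import datetime, timezone


def build_document_topic(location, text, lang="en"):
    """Model one extracted resource as a topic-like dictionary."""
    topic = {
        "type": "OCR processed document",
        "occurrences": {
            # The extracted text is stored per language.
            "text": {lang: text},
            "extraction date": datetime.now(timezone.utc).isoformat(),
            "resource location": location,
        },
    }
    if os.path.isfile(location):
        topic["occurrences"]["file size"] = os.path.getsize(location)
    return topic
```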

In addition, a simple occurrence population method is implemented that uses a topic's subject locator as the resource reference. The extractor can therefore be used to add OCR-converted text to image location data previously gathered in a topic map.
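
The idea of populating occurrences from subject locators can be sketched as below. Topics are again modelled as plain dictionaries rather than Wandora objects, the key names are hypothetical, and the OCR step is passed in as a function so that the sketch stays independent of any particular Tesseract setup.

```python
def populate_ocr_occurrences(topics, ocr, lang="en"):
    """Add OCR text to every topic that has a subject locator.

    `ocr` is any callable that maps a resource reference to plain text.
    Returns the number of topics that were updated.
    """
    updated = 0
    for topic in topics:
        locator = topic.get("subject_locator")
        if not locator:
            # Topics without a subject locator have nothing to extract from.
            continue
        occurrences = topic.setdefault("occurrences", {})
        occurrences.setdefault("text", {})[lang] = ocr(locator)
        updated += 1
    return updated
```

Topics without a subject locator are left unchanged, so the function can be run over a whole topic map that mixes image topics with other topics.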

Note: The extractor has not yet been released in a public Wandora distribution.
