Simple Document Extractor


Wandora's Simple document extractor creates a topic from a given document and attaches the document's content to that topic as an occurrence. It can convert several documents at once. To start it, choose the menu option File > Extract > Simple files > Simple document extractor.... You can also use it as a drag'n'drop extractor or as a browser extractor.

Simple document extractor extracts text from PDF, Office, and HTML (including XML) documents. You may also run the extractor on binary documents, but the resulting document-content occurrences then contain binary data and are probably unusable, because Wandora doesn't really support binary occurrences.

Simple document extractor example

In this example, a Wandora user has downloaded the WikiLeaks document collection known as CableGate and wants to build a topic map out of its documents. The collection is available in her file system, with all CableGate documents in a folder called cable. She chooses the menu option File > Extract > Simple files > Simple document extractor.... A dialog opens. She selects the Files tab and presses the Browse button, then selects the folder containing the CableGate documents and starts the extractor by pressing the Extract button. Wandora reads the given folder, its subfolders, and all enclosed files, and creates a topic for each file. When the extraction ends, the user can find all extracted documents in the topic tree, below the Document topic. Each document topic contains three occurrences:

  • Extraction-time
  • File-name
  • Document-content

The occurrence typed as document-content contains the content of the document. Notice that the documents were HTML files and Wandora has stripped away all HTML tags. As a next step, the user might want to filter and refine the extracted occurrences.
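
The extraction described above can be illustrated with a minimal Python sketch. This is not Wandora's own code (Wandora is a Java application) and it does not use any Wandora API. The function names extract_folder and strip_html, the dictionary layout, and the timestamp format are assumptions made for illustration only. The sketch walks a folder and its subfolders, builds one topic-like record per file, and stores the three occurrences listed above, with HTML tags stripped from the content.

```python
import os
import time
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (also for script/style bodies).
        self.parts.append(data)


def strip_html(markup):
    """Hypothetical helper: return the text of an HTML string without tags."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


def extract_folder(root):
    """Hypothetical helper: one topic-like dict per file under root."""
    topics = []
    # os.walk visits root and every subfolder below it.
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            # Assumes text files; undecodable bytes are replaced, not rejected.
            with open(path, encoding="utf-8", errors="replace") as handle:
                raw = handle.read()
            topics.append({
                "extraction-time": time.strftime("%Y-%m-%d %H:%M:%S"),
                "file-name": name,
                "document-content": strip_html(raw),
            })
    return topics
```

Calling extract_folder("cable") on the folder from the example would return one record per file. Reading a binary file this way yields mostly unusable content, which matches the caveat about binary documents above.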

Screenshots: Simple document extractor 01.gif to Simple document extractor 06.gif