Any23 extractor


Any23 (Anything To Triples) is a library for extracting structured data, such as microformats, from web documents such as HTML pages. Wandora's Any23 extractor uses the Any23 library and enables the Wandora user to extract RDF from web resources and convert it to topic map structures. In particular, the Any23 extractor gives Wandora a full-featured microformat extractor. Start the Any23 extractor with Wandora's menu option File > Extract > Microformats > Any23 extractor.... The extractor opens a dialog for input files and URLs. Extraction starts when the user presses the Extract button.

The Any23 library generates RDF triples. Wandora is based on Topic Maps technology and converts RDF triples to Topic Maps associations. The conversion schema is very simple: each RDF triple is converted to a binary association where the RDF predicate becomes the association type and the RDF subject and object become the association players. Association roles are predefined topics. The conversion schema is described in detail on the wiki page Importing RDF. In addition to this simple schema, the source of an RDF triple plays an important role. The source is the web resource or file the triple originates from. Wandora's Any23 extractor creates a topic for this source and adds it as a third player in every association the extractor generates. This addition matters when similar triples are extracted from different sources, because it lets the user track and verify where each piece of information came from. Note the similarity between this association structure and RDF N-Quads.
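
The conversion described above can be illustrated with a minimal Python sketch. This is not Wandora's implementation (Wandora is a Java application), and the role names `subject-role`, `object-role` and `source-role`, the `Association` class and both helper functions are hypothetical names chosen for illustration. The actual predefined role topics are documented on the Importing RDF page. The example.org URIs are made-up sample data, and the sketch assumes that subject and object are both IRIs rather than literals.

```python
from dataclasses import dataclass

# Hypothetical role names; Wandora's real predefined role topics differ.
SUBJECT_ROLE = "subject-role"
OBJECT_ROLE = "object-role"
SOURCE_ROLE = "source-role"


@dataclass(frozen=True)
class Association:
    """A topic map association: a type plus (role, player) pairs."""
    type: str
    players: tuple


def triple_to_association(subject, predicate, obj, source):
    """Map one RDF triple and its source to a three-player association.

    The predicate becomes the association type; subject, object and
    source become players in their respective roles.
    """
    return Association(
        type=predicate,
        players=(
            (SUBJECT_ROLE, subject),
            (OBJECT_ROLE, obj),
            (SOURCE_ROLE, source),
        ),
    )


def as_nquad(assoc):
    """Render the association as an N-Quads-style line (IRIs only)."""
    roles = dict(assoc.players)
    return "<{}> <{}> <{}> <{}> .".format(
        roles[SUBJECT_ROLE], assoc.type, roles[OBJECT_ROLE], roles[SOURCE_ROLE]
    )


# Made-up sample triple extracted from two different sources.
s = "http://example.org/page"
p = "http://example.org/vocab/relatedTo"
o = "http://example.org/other"

assoc_a = triple_to_association(s, p, o, "http://news.stanford.edu/")
assoc_b = triple_to_association(s, p, o, "http://example.org/mirror")

# The same triple from different sources yields distinct associations.
print(assoc_a != assoc_b)
print(as_nquad(assoc_a))
```

Because the source is a player in the association, two otherwise identical triples remain distinguishable, which is what makes the structure resemble an N-Quad with the source in the graph position.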

Example

In this example, a Wandora user applies the Any23 extractor to the web page http://news.stanford.edu/. The example was created on 2011-06-14.


Any23 example 01.gif


Any23 example 02.gif


Any23 example 03.gif


Any23 example 04.gif


Any23 example 05.gif


Any23 example 06.gif


Any23 example 07.gif