Refining occurrences
In the world of Topic Maps, an occurrence is a resource attached to a topic. The Topic Maps standard defines an occurrence as an instance of the subject that the topic represents. Usually, however, an occurrence is thought of more generally as a property of the topic. An occurrence resource can be almost anything. Usually it is either a URL addressing a networked resource or a literal, which is a resource in itself. With a URL resource, the resource data is stored outside the topic map, while a literal resource is stored in the topic map. Wandora supports only literal resources, although Wandora may handle a literal as a URL.
It is a valid observation that an occurrence, as a text fragment, contains information that somehow relates to the topic carrying the occurrence. On the other hand, topics are connected with each other. These connections, called associations in the Topic Maps world, also represent information related to the topics they connect. It is therefore quite natural to ask how one could distill associations out of occurrence resources and, moreover, whether one could pack the information in an association into an occurrence. This page discusses the first option: transforming occurrences to associations (and topics, of course).
First we'll look at the refining options of the occurrence editor. After that, we'll investigate Wandora's batch extractors used to refine occurrences. Finally, we'll look at Wandora's general occurrence transformation options.
Occurrence editor
Wandora's occurrence editor is a simple text editor window used to modify occurrence text. Wandora opens the occurrence editor window when the user clicks an occurrence text cell in the occurrence table of the topic panel. For example, the user has the topic Wandora application open in the topic panel (see image below). This topic has one occurrence of type description. The occurrence's scope is English. The first words of the occurrence data are shown in the table cell.
When the user clicks the cell in the occurrence table, an editor window opens (see image below). This window is the occurrence editor. It has a menu called Refine, which contains a few refining options for the occurrence text. At the moment these refining options are
- Make topics with selection. This option transforms the selected text fragment to a topic. The topic's base name will be the text selection. Wandora creates a random subject identifier for the topic. The created topic is not associated with the occurrence carrier topic.
- Make topics with selection and associate. This option creates a topic and sets the selected text fragment as the topic's base name. Furthermore, the option gives the topic a random subject identifier. The created topic is associated with the occurrence carrier topic. The association type is Occurrence distilled association and the roles are Occurrence carrier and Occurrence distilled topic.
- Find topics in occurrences.... This option is useful when you have an existing vocabulary and want to spot vocabulary terms in a given occurrence. The option loops over all topics in Wandora and tries to find each topic's base name in the occurrence text. If the occurrence text contains the topic's base name, the topic is associated with the occurrence carrier. The association type is Occurrence association and the roles are Occurrence container and Topic in occurrence.
- Find topics with similar occurrence... option searches for topics containing similar occurrence text. Occurrence texts are compared using the Levenshtein distance with a threshold value of 0.75. To change the distance metric and the threshold value, keep the CTRL key pressed while starting the menu option. Wandora uses the Simmetrics library to measure string similarity.
- Classify with OpenCalais option sends the occurrence text, or the selected fragment of it, to the OpenCalais web service for classification. The service returns keywords, which are transformed to topics and associated with the occurrence carrier topic. The OpenCalais classifier is also available as a general extractor in Wandora.
- Classify with SemanticHacker option sends the occurrence text to the SemanticHacker web service. The service returns a set of weighted terms. Wandora transforms these terms to topics and associates the topics with the occurrence carrier. The SemanticHacker classifier is also available as a general extractor in Wandora. A personal API key is required for SemanticHacker classification.
- Classify with Bing search engine... option takes the selected occurrence text fragment, sends it to the Bing search engine, receives web addresses, and transforms these web addresses to topics. Finally, the option associates these web resource topics with the occurrence carrier. A personal Bing API key is required for Bing classification.
- Alchemy entity extractor option sends the occurrence text to the AlchemyAPI web service. The service returns named entities and Wandora transforms these entities to topics. Finally, the option associates the created entity topics with the occurrence carrier. AlchemyAPI extractors also have several other uses in Wandora. A personal API key is required for AlchemyAPI extractions.
- Alchemy keyword extractor option is similar to the Alchemy entity extractor but handles keywords instead of named entities.
- Alchemy category extractor option sends the occurrence text to the AlchemyAPI web service, transforms the received category or categories to topics, and associates the category topics with the occurrence carrier.
- Alchemy language extractor option sends the occurrence text to the AlchemyAPI web service, transforms the received language to a topic, and associates the language topic with the occurrence carrier.
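The two local matching options in the list above (spotting base names in an occurrence, and comparing occurrence texts by Levenshtein distance) can be illustrated with a minimal Python sketch. It is not Wandora's or Simmetrics' code. The function names are hypothetical, and several details are assumptions: that base names are matched as case-sensitive substrings, that the distance is normalized to a 0..1 similarity by dividing by the longer string's length, and that a score equal to the threshold counts as similar.

```python
def find_topics_in_text(text, base_names):
    """Return the base names that occur in the text (assumed: plain substring match)."""
    return [name for name in base_names if name and name in text]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similar_topics(text, occurrences, threshold=0.75):
    """Return names of topics whose occurrence text reaches the threshold."""
    return [name for name, other in occurrences.items()
            if similarity(text, other) >= threshold]
```

With this normalization, a single-character difference between two twenty-character texts still scores well above 0.75, while unrelated texts score far below it.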
Occurrence editor example
Let's continue with an example. In this example the Wandora user first creates a topic out of a text selection and then applies the OpenCalais classifier to the whole occurrence text. To create a topic out of selected text, the user first selects the text graphical user interface.
Then the user selects menu option Refine > Make topics > Make topic with selection and associate....
Now the user closes the occurrence editor. In the topic panel there is a new association between the topics graphical user interface and Wandora application. The association type is Occurrence distilled association.
The user clicks the occurrence editor open again. Now there is no text selection.
The user selects menu option Refine > Classify > Classify with OpenCalais....
The OpenCalais classifier reports that it found a total of 4 tags in the occurrence text.
The user closes the occurrence editor and finds three new associations in the topic panel. Two associations have the type OpenCalais tag classification and one association has the type OpenCalais topic classification. But where is the fourth tag?
The user opens the topic Java and finds out that it has two subject identifiers and two types. It appears that Wandora has merged two tag topics named Java.
In this chapter you have learned how a Wandora user can extract topics out of a single occurrence using the refine menu options of the occurrence editor. The next chapters discuss options used to refine multiple occurrences at once.
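The merge seen in the example can be pictured with a small Python sketch. It only illustrates the idea that two topics sharing a base name collapse into one topic carrying the union of their subject identifiers and types. The `Topic` class, the example URIs, and the type names are hypothetical, and Wandora's actual merging rules cover more than base names.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    base_name: str
    subject_identifiers: set = field(default_factory=set)
    types: set = field(default_factory=set)


def merge_by_base_name(topics):
    """Collapse topics with equal base names into a single topic each."""
    merged = {}
    for topic in topics:
        existing = merged.get(topic.base_name)
        if existing is None:
            # Copy the sets so the input topics are left untouched.
            merged[topic.base_name] = Topic(topic.base_name,
                                            set(topic.subject_identifiers),
                                            set(topic.types))
        else:
            existing.subject_identifiers |= topic.subject_identifiers
            existing.types |= topic.types
    return list(merged.values())


# Two tag topics named Java (identifiers and types are made up for illustration).
tags = [
    Topic("Java", {"http://example.org/tag/java-1"}, {"type-A"}),
    Topic("Java", {"http://example.org/tag/java-2"}, {"type-B"}),
]
result = merge_by_base_name(tags)
```

After the merge, `result` holds a single Java topic with two subject identifiers and two types, which matches what the user sees in the topic panel.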
Batch refining occurrences
Wandora features several options to refine multiple occurrences at once. These options are located in the context menus that appear when you right-click a topic selection or a complete topic map layer. The actual context menu path is Topics > Occurrences > Refine. The refine menu options are
- With OpenCalais classifier... option refines the found occurrences with the OpenCalais classifier. The option extracts keywords out of the occurrence text and associates these keywords with the occurrence carrier topic.
- With Alchemy entity extractor... option refines the found occurrences with the entity API of the AlchemyAPI extractors. The option recognizes named entities in the occurrence text and associates the entity topics with the occurrence carrier. All AlchemyAPI-based refiners require a personal API key for AlchemyAPI.
- With Alchemy keyword extractor... option refines the found occurrences with the keyword API of the AlchemyAPI extractors. The option recognizes keywords in the occurrence text and associates the keyword topics with the occurrence carrier.
- With Alchemy category extractor... option refines the found occurrences with the category API of the AlchemyAPI extractors. The option recognizes a category for the occurrence text. The category topic is associated with the occurrence carrier.
- With Yahoo! YQL term extractor... option refines the found occurrences with the Yahoo! YQL term extractor. The option recognizes keyword terms in the occurrence text. The keyword topics are associated with the occurrence carrier.
- With GeoNames near by extractor... option refines the found occurrences with the GeoNames extractors. The option finds all nearby locations for the geographic location represented by the occurrence text. The nearby locations are associated with the occurrence carrier. The option assumes that the occurrence is a pair of GPS coordinates separated by a comma. For example, a valid GPS coordinate occurrence is "60.170833, 24.9375", where the first number is the latitude and the second number is the longitude. The lookup distance for nearby locations is 100 m.
- Find associations in occurrences...
- Find associations in occurrences using pattern...
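The coordinate format expected by the GeoNames option can be checked before a batch run. The following Python sketch is a hypothetical helper, not part of Wandora. It parses a "latitude, longitude" occurrence such as the one quoted above and rejects values outside the usual ranges. The range checks are an added assumption, since the text only specifies the comma-separated pair.

```python
def parse_coordinates(text):
    """Parse 'latitude, longitude' into a pair of floats, or raise ValueError."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError("expected two comma-separated numbers: %r" % text)
    # float() itself raises ValueError for non-numeric input.
    latitude, longitude = (float(part.strip()) for part in parts)
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude out of range: %r" % latitude)
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude out of range: %r" % longitude)
    return latitude, longitude


def is_valid_coordinate_occurrence(text):
    """Convenience wrapper that reports validity instead of raising."""
    try:
        parse_coordinates(text)
        return True
    except ValueError:
        return False
```

Running such a check over the selected occurrences first shows which topics would be skipped or would fail in the extractor.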
Selecting any of the refine options described above starts with a dialog where the Wandora user is asked to specify the occurrence type and scope. The occurrence scope usually refers to a language topic in Wandora. Once the user has specified the occurrence type and scope topics, Wandora loops over the selected topics and looks for occurrences of the given kind. If a suitable occurrence is found, it is refined. The effect of refining depends on the selected option, but generally each topic with a suitable occurrence should get linked to topics that describe the occurrence and thus the topic itself. For example, refining an occurrence with the option With Alchemy category extractor... results in a category topic linked with the occurrence carrier.
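The batch loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Wandora's implementation: topics are modeled as plain dictionaries, occurrences are keyed by a (type, scope) pair, and `refine` is a hypothetical callback standing in for whichever extractor was chosen.

```python
def batch_refine(topics, occurrence_type, scope, refine):
    """Refine the matching occurrence of every topic; return how many were refined."""
    refined = 0
    for topic in topics:
        text = topic["occurrences"].get((occurrence_type, scope))
        if text is None:
            continue  # this topic has no occurrence of the requested kind
        # Link the carrier topic to whatever the extractor returned.
        topic.setdefault("associated", set()).update(refine(text))
        refined += 1
    return refined


# A stand-in extractor that treats upper-case words as tags.
def upper_case_words(text):
    return [word for word in text.split() if word.isupper()]


reports = [
    {"base_name": "report 1",
     "occurrences": {("Summary", "Lang.indep."): "TOYOTA COROLLA seen near checkpoint"}},
    {"base_name": "report 2",
     "occurrences": {("Title", "English"): "Untitled"}},
]
count = batch_refine(reports, "Summary", "Lang.indep.", upper_case_words)
```

Only the first report has a Summary occurrence in the requested scope, so only that topic is refined and linked.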
Batch refining example
In this example the Wandora user has converted the Afghanistan War Diary documents published by WikiLeaks to a topic map. Each document in the Afghanistan War Diary is a war report with a summary text describing the war incident. The number of war reports in the topic map is 1987. Obviously the user can't refine the reports one by one, as it would take too long. Instead, the user batch refines the war reports using the OpenCalais web service. First, the user locates the war report topics. It looks like all report topics are instances of the topic Report. Opening that topic (see image below) lists all report topics.
The user opens up one of the report topics. It has several occurrences as shown below.
The user clicks the occurrence typed as Summary. Notice that Lang.indep. is the occurrence scope. An occurrence editor opens up with the actual occurrence text.
Next, the user closes the occurrence editor and opens up the Report topic listing all war reports. The user selects all cells in the instance list by pressing CTRL+A. Then she presses the right mouse button over the selection and chooses context menu option Topics > Occurrences > Refine > With OpenCalais classifier....
Wandora opens up a dialog window used to specify the occurrence type and scope.
The user clicks the <No topic> button beside the label Type of occurrences, and yet another dialog opens for selecting the actual type topic. The user selects the Finder tab and searches for summary. Recall that Summary was the name of the occurrence type we want to refine. Wandora locates several topics containing the text summary. The user picks the occurrence type topic and clicks the OK button.
Then the user clicks the <No topic> button beside the label Scope of occurrences and locates the Language independent topic below the Wandora language topic.
Now the user has specified both the occurrence type and the occurrence scope, and she may continue by pressing the OK button.
Wandora starts refining the occurrences. As the number of occurrences is high, the operation takes a while to finish. On my computer it took about 45 minutes. Note that OpenCalais classification requires a good network connection. When refining finishes, you can open any of the report topics and see whether OpenCalais found any tags in the report text of that specific topic. As the original report text is available, the user can always check whether the tags are relevant.
For example, checking the report topic reviewed above shows that OpenCalais found four tags and one classification topic for it.
Now the user can click a tag topic, TOYOTA COROLLA for example, and see which other reports have been classified with the tag. It appears that our topic map has three different war reports that contain the tag TOYOTA COROLLA.
The user can explore the tags and topics found by refining by opening the OpenCalais topic in the topic tree (see image below).
This example is based on our experiments with the Afghanistan War Diary topic map. The Wandora user should be extremely careful when drawing conclusions based on extracted tags, as the information extraction is performed by an external web service beyond the user's control. In this particular example the refined documents are especially tricky for a general-purpose classifier such as OpenCalais, as they contain military-specific expressions, vocabulary, and acronyms.
Occurrence editing options
This chapter reviews some occurrence editing options in Wandora. In addition to the occurrence editor, Wandora features options to
- Replace text in occurrences
- Copy and move occurrences to different scope
- Remove occurrences of specific type and scope
- Convert occurrences to associations
- Copy occurrence to topic's base name or variant name
- Make occurrence with topic's name
- Construct topic's subject with occurrence
Replace text in occurrences
To replace text in occurrences, the Wandora user must first select a set of topics. The replacement is performed only for occurrences in the selected topics. Then the user should right-click the topic selection and choose context menu option Topics > Occurrences > Modify occurrences with a regex... or Topics > Occurrences > Modify all occurrences with a regex.... The first option asks the user for an occurrence type and scope, while the second option applies the given regular expression pattern to all occurrences. Wandora opens up the Regular Expressions Replacer, a dialog window used to build and test regular expressions (see image below). The Regular Expressions Replacer accepts Java Pattern type regular expressions. When the user clicks the Apply button, Wandora loops over the topic selection and tries to match the given regular expression against each occurrence. If the regular expression matches the occurrence (or part of it), Wandora replaces the match with the given text.
Copy and move occurrences to different scope
Scope usually represents the language of the occurrence text. Sometimes the Wandora user may want to copy or move occurrences to another scope. First, the user should make a topic selection. Occurrence copy or move applies only to the selected topics. Then the user should right-click the selected topics and choose Topics > Occurrences > Copy occurrences to other scope... or Topics > Occurrences > Move occurrences to other scope.... Both options open up a dialog window (see image below) used to specify the source and target occurrences. The dialog window also has checkboxes to specify whether Wandora should override existing occurrences and whether Wandora should also copy occurrences that are null, i.e. cases where no source occurrence is available.
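The difference between the two regex options can be sketched in Python. This is only an illustration with a hypothetical dictionary data model, and it uses Python's `re` module rather than Java's `java.util.regex.Pattern`. The two syntaxes overlap heavily, but group references in the replacement text differ (`$1` in Java, `\1` or `\g<1>` in Python).

```python
import re


def replace_in_occurrences(topics, pattern, replacement,
                           occurrence_type=None, scope=None):
    """Replace regex matches in occurrences; return the number of changed occurrences.

    With occurrence_type left as None every occurrence is processed, otherwise
    only the occurrence with the given (type, scope) key.
    """
    regex = re.compile(pattern)
    changed = 0
    for topic in topics:
        for key, text in list(topic["occurrences"].items()):
            if occurrence_type is not None and key != (occurrence_type, scope):
                continue
            new_text = regex.sub(replacement, text)
            if new_text != text:
                topic["occurrences"][key] = new_text
                changed += 1
    return changed


topics = [{"occurrences": {("description", "English"): "Wandora is a tool",
                           ("description", "Finnish"): "Wandora on sovellus"}}]
# Restricted to one type and scope, then applied to all occurrences.
first = replace_in_occurrences(topics, r"Wandora", "WANDORA", "description", "English")
second = replace_in_occurrences(topics, r"Wandora", "WANDORA")
```

The first call touches only the English description. The second call then changes the remaining Finnish occurrence, because the English text no longer matches the case-sensitive pattern.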
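The copy and move operations, together with the override checkbox, can be sketched in Python as below. The data model and function name are hypothetical, and the handling of null occurrences mentioned above is deliberately left out because the text does not describe its exact effect.

```python
def copy_to_scope(topics, occurrence_type, source_scope, target_scope,
                  override=False, move=False):
    """Copy (or move) occurrences between scopes; return the number written."""
    if source_scope == target_scope:
        return 0  # nothing to do, and a move would otherwise delete the data
    written = 0
    for topic in topics:
        occurrences = topic["occurrences"]
        source = occurrences.get((occurrence_type, source_scope))
        if source is None:
            continue  # no source occurrence for this topic
        if (occurrence_type, target_scope) in occurrences and not override:
            continue  # keep the existing target occurrence
        occurrences[(occurrence_type, target_scope)] = source
        if move:
            del occurrences[(occurrence_type, source_scope)]
        written += 1
    return written


topics = [
    {"occurrences": {("description", "English"): "A tool",
                     ("description", "Lang.indep."): "Old text"}},
    {"occurrences": {("description", "English"): "Another tool"}},
]
copied = copy_to_scope(topics, "description", "English", "Lang.indep.")
```

Without the override flag only the second topic receives a copy, because the first topic already has an occurrence in the target scope.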
Remove occurrences of specific type and scope
To remove occurrences in Wandora, the user should first select topics, that is, a set of occurrence carriers. Then the user should right-click the selection and choose context menu option Topics > Occurrences > Delete occurrences with type... or Topics > Occurrences > Delete all occurrences. The first option removes only occurrences of the given type, while the second option removes all occurrences in the selected topics.
Convert occurrences to associations
Sometimes the information stored in an occurrence could be a topic itself. For example, a topic representing a person may contain an occurrence representing the person's birth year. Modeling birth years as occurrences is the ontology designer's choice. Some other ontology might model years as topics and birth years as associations between person and year topics. Thus, it is a natural operation to convert occurrences to topics and then associate the created topics with the occurrence carrier topics. Wandora features such an operation. First the Wandora user must select the topics whose occurrences she wants to convert to associations. Right-clicking the topic selection and choosing context menu option Topics > Associations > Make associations with occurrences... starts the operation. Before Wandora performs the conversion, the user should select the occurrence type, the occurrence scope, and the association role topic, and give a subject identifier pattern.
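The birth year example can be sketched in Python. This is an illustration only: the dictionary data model, the `{value}` placeholder in the subject identifier pattern, the use of two role names, and the reuse of the occurrence type as the association type are all assumptions made for the sketch, not Wandora's documented behavior.

```python
from urllib.parse import quote


def occurrences_to_associations(topics, occurrence_type, scope,
                                carrier_role, value_role, si_pattern):
    """Turn occurrence values into topics and link them to their carriers."""
    value_topics = {}   # subject identifier -> created topic
    associations = []
    for topic in topics:
        value = topic["occurrences"].get((occurrence_type, scope))
        if value is None:
            continue
        # Build the subject identifier from the pattern (placeholder is hypothetical).
        si = si_pattern.format(value=quote(value, safe=""))
        # Equal values share one topic, so carriers become connected through it.
        value_topics.setdefault(si, {"base_name": value, "subject_identifier": si})
        associations.append({
            "type": occurrence_type,  # assumption: occurrence type reused as association type
            "players": {carrier_role: topic["base_name"], value_role: si},
        })
    return value_topics, associations


persons = [
    {"base_name": "Person A", "occurrences": {("birth year", "Lang.indep."): "1970"}},
    {"base_name": "Person B", "occurrences": {("birth year", "Lang.indep."): "1970"}},
    {"base_name": "Person C", "occurrences": {}},
]
years, links = occurrences_to_associations(
    persons, "birth year", "Lang.indep.", "person", "year",
    "http://example.org/year/{value}")
```

Both persons end up associated with the same year topic, which is the point of the conversion: a shared value becomes a shared node in the graph.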
Copy occurrence to topic's base name or variant name
Sometimes an occurrence may contain a name for the topic. This is especially true if you have imported RDF resources to Wandora. Wandora has a feature that copies occurrence text to the topic's base name. The feature starts with menu option Topics > Base names > Make base name with an occurrence... in the context of a topic selection. Before Wandora can continue, the user must specify the occurrence type and scope. Note that the feature doesn't override existing base names, i.e. if a topic already has a base name, the feature doesn't change it. To set the base names regardless, first remove all base names with context menu option Topics > Base names > Remove base names.... Note also that Wandora merges topics with equal base names.
To transform an occurrence to a topic's variant name, use context menu option Topics > Variant names > Make display variants with occurrences....
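The rule that existing base names are never overridden can be sketched in Python. The dictionary data model and function name are hypothetical and the sketch covers only the base name case, not variant names or the merging that equal base names would trigger.

```python
def make_base_names(topics, occurrence_type, scope):
    """Set missing base names from an occurrence; return how many were set."""
    changed = 0
    for topic in topics:
        if topic.get("base_name"):
            continue  # an existing base name is left untouched
        text = topic["occurrences"].get((occurrence_type, scope))
        if text:
            topic["base_name"] = text
            changed += 1
    return changed


topics = [
    {"base_name": None, "occurrences": {("label", "English"): "Helsinki"}},
    {"base_name": "Existing name", "occurrences": {("label", "English"): "Espoo"}},
    {"base_name": None, "occurrences": {}},
]
changed = make_base_names(topics, "label", "English")
```

Only the first topic receives a new base name. The second keeps its existing name, and the third has no suitable occurrence.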
Make occurrence with topic's name
The previous chapter reviewed a feature used to transform an occurrence to a topic name. Wandora also features the reverse operation, which makes occurrences out of topics' names. To use it, select context menu option Topics > Occurrences > Make occurrence with base names..., Topics > Occurrences > Make occurrence with variant names..., or Topics > Occurrences > Make occurrences with all variant names....
Construct topic's subject with occurrence
An occurrence may contain an identifier that works well as the basis of the topic's subject. To construct an additional subject identifier for a topic based on an occurrence, select context menu option Topics > Subject identifiers > Make subject identifier with an occurrence.... The Wandora user must specify the occurrence type, the occurrence scope, and a template pattern for subject identifiers.
Similarly, choosing menu option Topics > Subject locators > Make subject locator with an occurrence... gives the topic a subject locator based on the occurrence text.
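Building a subject identifier from a template can be sketched in Python as follows. The `{occurrence}` placeholder, the URL encoding of the occurrence text, and the dictionary data model are assumptions for illustration. Wandora's actual template syntax is not described on this page.

```python
from urllib.parse import quote


def add_subject_identifiers(topics, occurrence_type, scope, template):
    """Add one subject identifier per topic, built from its occurrence text."""
    added = 0
    for topic in topics:
        text = topic["occurrences"].get((occurrence_type, scope))
        if not text:
            continue
        # Encode the text so that spaces and slashes cannot break the URI.
        identifier = template.replace("{occurrence}", quote(text.strip(), safe=""))
        if identifier not in topic["subject_identifiers"]:
            # The new identifier is added alongside the existing ones.
            topic["subject_identifiers"].add(identifier)
            added += 1
    return added


topics = [{
    "subject_identifiers": {"http://example.org/si/original"},
    "occurrences": {("identifier", "Lang.indep."): "ISBN 123"},
}]
added = add_subject_identifiers(topics, "identifier", "Lang.indep.",
                                "http://example.org/id/{occurrence}")
```

The topic keeps its original subject identifier and gains a second one derived from the occurrence.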
Conclusions
This tutorial has documented some advanced methods used to transform textual information to graph nodes and edges in the context of the Wandora application and Topic Maps. Wandora features many information extractors that distill topics and associations out of occurrence text. The Wandora user can process one or many occurrences at a time. The reader should note that topic maps can store both unstructured text, as occurrences, and graph nodes and edges, as topics and associations. In other words, Topic Maps are capable of storing both well-structured and unstructured information. This duality makes topic maps well suited to data-mining applications where, for example, the user has to classify large collections of text documents.