Export similarity matrix


Wandora's Export similarity matrix feature builds a matrix from a topic map or a collection of topics and saves it to a file chosen by the user. Each row and each column of the matrix represents one topic. The cell at the intersection of a row and a column holds a number that represents the similarity between the row topic and the column topic. If a cell holds the number 0 (zero), the row and column topics are very different. If a cell holds the number 1 (one), the row and column topics are very similar, possibly even identical.

The Export similarity matrix feature can be started in three ways: with the menu option File > Export > Export similarity matrix..., in the context of topics with the menu option Topics > Export > Export similarity matrix..., or in the context of a topic map layer with the menu option Export layer > Export similarity matrix.... The first option exports all topics in the Wandora application, the second exports only the selected topics, and the third exports only the selected topic map layer.

By default, the Export similarity matrix feature first opens a configuration dialog. The dialog is used to select the similarity algorithm, the output format, the output filtering options, and a few other options. Below is a screen capture of the Export similarity matrix configuration dialog.

[Screen capture: Similarity matrix export.gif]

As of Wandora release 2011-09-19, the available similarity algorithms are:

  • Subject locator similarity calculates the Levenshtein distance between the subject locators of the two topics and returns it as the topic similarity.
  • Highest subject identifier similarity creates two identifier sets, one for each topic, and calculates the Levenshtein distance for every pair of identifiers drawn from the two sets. The algorithm returns the highest similarity value found.
  • Highest occurrence similarity creates two string pools, one for each topic, and adds each topic's occurrences to its pool. The algorithm then calculates the Levenshtein distance for every pair of strings drawn from the two pools and returns the highest similarity value found.
  • Highest variant name similarity works like Highest occurrence similarity, but the pools contain variant names.
  • Basename similarity calculates the Levenshtein distance between the basenames of the two topics and returns it as the topic similarity. Note that Wandora supports only one basename per topic. If either topic has no basename, the similarity is 0 (zero).
  • Topic type (class) similarity is an ad hoc similarity measure that compares topic types. If both topics have identical types, the algorithm returns 1 (one). If the types are not identical, the algorithm returns a value between 0 and 1. It is very likely that this measure will be replaced with a better algorithm in upcoming Wandora releases.
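
The Levenshtein-based measures above can be illustrated with a minimal Java sketch. It is not Wandora's own code. The class and method names are hypothetical, and the way an edit distance is turned into a similarity (one minus the distance divided by the length of the longer string) is an assumption, since this page does not state the normalization Wandora uses.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and normalization are assumptions, not Wandora's API.
public class LevenshteinSimilaritySketch {

    // Classic dynamic-programming edit distance, keeping only two rows in memory.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the latest row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 for identical strings, 0 when every character differs.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    // "Highest ..." measures: best score over every pair drawn from the two pools.
    static double highestSimilarity(List<String> poolA, List<String> poolB) {
        double best = 0.0;
        for (String a : poolA) {
            for (String b : poolB) {
                best = Math.max(best, similarity(a, b));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("http://example.org/cat", "http://example.org/feline");
        List<String> second = Arrays.asList("http://example.org/cats");
        System.out.println(highestSimilarity(first, second));
    }
}
```

In this sketch an empty pool yields 0, which mirrors the basename rule above (no basename means zero similarity). Whether Wandora handles empty identifier, occurrence, or variant name pools the same way is not stated here.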

Export supports both tab-separated plain text and HTML tables as output formats. A similarity value is usually a real number between 0 and 1. The Number of decimals option sets the precision of the written values. The Filter values below option takes a real number; every similarity value smaller than it is turned to zero. The Maximize values above option takes a real number; every similarity value bigger than it is turned to one. If the Add labels option is selected, Wandora adds column and row names to the matrix. If the Write zeros option is selected, Wandora also writes zero similarity values. If it is not selected, Wandora writes an empty cell when the similarity is zero.
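
How the output options interact can be sketched in Java as follows. This is an illustration of the behaviour described above, not Wandora's implementation. The class, method, and parameter names are hypothetical, and the order in which the two thresholds are applied is an assumption.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Illustrative sketch only; names and threshold order are assumptions, not Wandora's API.
public class SimilarityCellFormatter {

    // Applies the filter/maximize thresholds and renders one matrix cell.
    static String formatCell(double similarity, double filterBelow, double maximizeAbove,
                             int decimals, boolean writeZeros) {
        double value = similarity;
        if (value < filterBelow) {
            value = 0.0; // "Filter values below": smaller values become zero
        }
        if (value > maximizeAbove) {
            value = 1.0; // "Maximize values above": bigger values become one
        }
        if (value == 0.0 && !writeZeros) {
            return ""; // "Write zeros" not selected: leave the cell empty
        }
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT, "%." + decimals + "f", value);
    }

    // Renders one tab-separated row; pass a null label to omit the row name.
    static String formatRow(String label, double[] values, double filterBelow,
                            double maximizeAbove, int decimals, boolean writeZeros) {
        StringJoiner row = new StringJoiner("\t");
        if (label != null) {
            row.add(label); // "Add labels": row name goes into the first column
        }
        for (double v : values) {
            row.add(formatCell(v, filterBelow, maximizeAbove, decimals, writeZeros));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        double[] similarities = {1.0, 0.1, 0.5};
        System.out.println(formatRow("Topic A", similarities, 0.2, 0.9, 2, false));
    }
}
```

With a filter threshold of 0.2 and Write zeros not selected, the value 0.1 in the usage example ends up as an empty cell between two tab characters.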

Other similarity algorithms

It is clear that the similarity algorithms listed above do not cover the topic similarity issue very well. If you have a topic similarity measure in mind that you would like to see in Wandora, please contact us and describe your idea. It is very likely that we will also try to develop new topic similarity measures.

If you wish to proceed on your own, adding your own topic similarity measures to Wandora is easy. Write and compile a Java class that implements the interface org.wandora.topicmap.similarity.TopicSimilarity, place it in the package org.wandora.topicmap.similarity, and restart the Wandora application. Wandora scans the package for available topic similarity measures when the export tool is started for the first time.
