Thomas Efer of Topic Maps Labs [1] has released a topic map named "Afghanistan War Diary - 2004" based on WikiLeaks documents. The AWD topic map contains information about almost 2000 war reports. Most of the information in the topic map is stored as occurrences, i.e. unstructured text. The topic map is available at Maiana [2]. Lutz Maicher announced the release of the AWD topic map on the TopicMapMail email list [3]. The announcement raised an interesting discussion about the quality and usefulness of the AWD topic map.
The discussion inspired me to try enhancing the AWD topic map using Wandora's extractors. Wandora [4] has several extractors that distill entities and keywords out of unstructured text. I selected the following extractors:
* Alchemy Entity extractor [5]
* Alchemy Keywords extractor [5]
* OpenCalais extractor [6]
* Yahoo! term extractor [7]
I then applied each extractor, one by one, to all summary occurrences of the report topics. A summary occurrence contains a textual description of the military event the report covers. Summary occurrences have the type
http://psi.topicmapslab.de/wardiary/schema/summary
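The selection step described above can be sketched roughly as follows. This is only an illustration, not Wandora's API: the `Topic` class, `toy_extractor`, `extract_from_summaries`, and the two sample reports are all made up for the example, and the toy extractor merely picks capitalised words where a real service would return entities or keywords.

```python
from dataclasses import dataclass, field

# Occurrence type used for report summaries in the AWD topic map.
SUMMARY_TYPE = "http://psi.topicmapslab.de/wardiary/schema/summary"


@dataclass
class Topic:
    """Hypothetical, minimal stand-in for a topic with typed occurrences."""
    name: str
    occurrences: dict = field(default_factory=dict)  # type URI -> text


def toy_extractor(text):
    """Stand-in for a real extractor: returns capitalised words as terms."""
    words = (w.strip(".,") for w in text.split())
    return sorted({w for w in words if w[:1].isupper()})


def extract_from_summaries(topics, extractor):
    """Run the extractor on each summary occurrence only.

    Returns (report name, term) pairs; each pair corresponds to one
    association between a report topic and an extracted topic.
    """
    pairs = []
    for topic in topics:
        summary = topic.occurrences.get(SUMMARY_TYPE)
        if summary is None:
            continue  # topic has no summary occurrence, skip it
        for term in extractor(summary):
            pairs.append((topic.name, term))
    return pairs


# Invented sample data, not taken from the AWD topic map.
reports = [
    Topic("report-1", {SUMMARY_TYPE: "Patrol found a cache near Kandahar."}),
    Topic("report-2", {"http://example.org/other": "Not a summary."}),
]
pairs = extract_from_summaries(reports, toy_extractor)
```

Only `report-1` contributes pairs here, because `report-2` carries no occurrence of the summary type.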
Applying an extractor to the occurrences generated more topics and associations. Each extractor's output was stored in a separate topic map, resulting in four information packages, one per extractor. All four topic maps were packed into a zip file, which is available at [8]. In practice, anyone can merge any of the generated topic maps with the original AWD topic map and see the extracted topics interleaved with the original report topics. Because the original summary occurrence remains available, anyone can evaluate whether the extracted topics and associations really describe the report.
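Why merging works can be shown with a simplified sketch. It assumes topics are merged when they share a subject identifier, which is one of the standard Topic Maps merging criteria; real engines, Wandora included, apply further rules. The dict layout, `merge_topic_maps`, and the example.org identifiers are hypothetical.

```python
def merge_topic_maps(base, extra):
    """Merge two toy topic maps keyed by subject identifier.

    Each map is {subject_identifier: {"names": set, "assocs": set}}.
    Topics with the same subject identifier become one topic whose
    names and associations are the union of both sides.
    """
    merged = {
        si: {"names": set(t["names"]), "assocs": set(t["assocs"])}
        for si, t in base.items()
    }
    for si, t in extra.items():
        target = merged.setdefault(si, {"names": set(), "assocs": set()})
        target["names"] |= set(t["names"])
        target["assocs"] |= set(t["assocs"])
    return merged


# Invented identifiers for illustration only.
REPORT = "http://example.org/report/1"
TERM = "http://example.org/term/kandahar"

original = {REPORT: {"names": {"Report 1"}, "assocs": set()}}
extracted = {
    REPORT: {"names": set(), "assocs": {("mentions", TERM)}},
    TERM: {"names": {"Kandahar"}, "assocs": set()},
}
merged = merge_topic_maps(original, extracted)
```

After the merge, the report topic keeps its original name and gains the extracted association, while the term topic appears as a new topic next to it.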
To summarize the results:
* Yahoo! term extractor found 3563 term topics (15092 associations).
* OpenCalais found 1749 tags (3455 associations) in 18 tag classes and 25 topic categories (1317 associations).
* Alchemy entity extractor found 424 entities (1285 associations).
* Alchemy keyword extractor found 9054 keywords (14684 associations).
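One crude way to compare the figures above is the average number of associations per extracted topic. The sketch below only does that arithmetic on the reported counts; for OpenCalais it uses the tag figures and leaves the category figures out. The ratio says nothing about whether the extracted topics are correct.

```python
# (extracted topics, associations) as reported in the list above.
results = {
    "Yahoo! terms": (3563, 15092),
    "OpenCalais tags": (1749, 3455),
    "Alchemy entities": (424, 1285),
    "Alchemy keywords": (9054, 14684),
}

# Average number of associations per extracted topic, rounded to 2 decimals.
ratios = {
    name: round(assocs / topics, 2)
    for name, (topics, assocs) in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio} associations per topic")
```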
Although the technical implementation of the experiment was easy, I am not really sure about the quality of the automatic classifications provided by Calais [9], Alchemy [10] and Yahoo [11]. The source material, reports of military actions, is clearly very challenging because of its military-specific expressions, terms, and acronyms, and it looks like all the classifiers have made false interpretations. It would be very interesting if someone evaluated the overall quality of the generated topic maps and pointed out typical error classes.
In any case, I hope this demonstration clearly shows that a simple topic map storing merely occurrence data is not necessarily a dead end but a good start.
[1] http://www.topicmapslab.de
[2] http://maiana.topicmapslab.de/u/efi/tm/wd2004
[3] http://www.infoloom.com/pipermail/topic ... 08539.html
[4] http://www.wandora.org
[5] http://www.wandora.org/wandora/wiki/ind ... extractors
[6] http://www.wandora.org/wandora/wiki/ind ... classifier
[7] This extractor is not yet part of an official Wandora release.
[8] http://www.wandora.org/wandora/download ... riment.zip
[9] http://www.opencalais.com/
[10] http://www.alchemyapi.com
[11] http://developer.yahoo.com/search/conte ... ction.html