Enriched Afghanistan War Diary Topic Map
This page duplicates a blog post on the Wandora Forum.
Thomas Efer of Topic Maps Labs has released a topic map named Afghanistan War Diary - 2004. The topic map is based on documents published by WikiLeaks and contains 2000 war reports from the war in Afghanistan. The reports are stored in the topic map as unstructured text. The topic map is available at Maiana, the topic map repository of Topic Maps Labs. Lutz Maicher announced the release of the AWD topic map on the TopicMapMail email list. The announcement sparked an interesting discussion about the quality of the topic map. That discussion inspired me to try enhancing the AWD topic map with Wandora's extractors. Wandora has several extractors that distill entities and keywords out of unstructured text. I selected four extractors:
- Yahoo term extractor
- OpenCalais
- Alchemy entity extractor
- Alchemy keyword extractor
I then applied each extractor, one by one, to all summary occurrences of the report topics. A summary occurrence contains a textual description of the military event the report covers.
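The loop over summary occurrences can be sketched roughly as follows. This is a minimal Python illustration, not Wandora's own code: `extract_terms` is a hypothetical stand-in for a real extractor (it only picks out upper-case acronym-like tokens), and the report identifiers and sample texts are invented for the example.

```python
import re

def extract_terms(text):
    """Hypothetical stand-in extractor: returns upper-case tokens of two or more letters."""
    return sorted(set(re.findall(r"\b[A-Z]{2,}\b", text)))

def apply_extractor(reports, extractor):
    """Run one extractor over every report summary.

    Returns the set of extracted term topics and a list of
    (report_id, term) pairs standing in for associations.
    """
    topics = set()
    associations = []
    for report_id, summary in reports.items():
        for term in extractor(summary):
            topics.add(term)
            associations.append((report_id, term))
    return topics, associations

# Invented sample summaries (assumption, not taken from the AWD topic map).
reports = {
    "report-1": "TF patrol found an IED near the FOB.",
    "report-2": "ANA unit reported an IED strike.",
}
topics, associations = apply_extractor(reports, extract_terms)
print(len(topics), "topics,", len(associations), "associations")
```

Keeping the result of each extractor in its own structure mirrors the choice to store every extractor's output in a separate topic map.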
Applying an extractor to the occurrences generated new topics and associations. The output of each extractor was stored in a separate topic map, so I ended up with four topic maps, one for each extractor. All resulting topic maps and the original topic map were zipped, and this zip package is available here. In practice, anyone can merge any of the generated topic maps with the original Afghanistan War Diary - 2004 topic map and see the extracted topics interleaved with the original topics. As the original summary occurrence is still available, anyone can evaluate whether the extracted topics and associations really describe the report. To summarize the results:
- Yahoo term extractor found 3563 term topics (15092 associations).
- OpenCalais found 1749 tags (3455 associations) in 18 tag classes and 25 topic categories (1317 associations).
- Alchemy entity extractor found 424 entities (1285 associations).
- Alchemy keyword extractor found 9054 keywords (14684 associations).
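Merging one of the generated topic maps with the original can be pictured with a toy model. The sketch below assumes a simplified representation (a dictionary of subject identifiers mapped to names, plus a set of association pairs) and joins topics that share a subject identifier. The `http://example.org/...` identifiers are made up, and real topic map merging, as done by Wandora, covers more cases than this.

```python
def merge_topic_maps(base, extracted):
    """Merge two toy topic maps, joining topics that share a subject identifier."""
    merged = {
        # Copy the base map so the inputs are left unmodified.
        "topics": {si: set(names) for si, names in base["topics"].items()},
        "associations": set(base["associations"]),
    }
    for si, names in extracted["topics"].items():
        merged["topics"].setdefault(si, set()).update(names)
    merged["associations"] |= set(extracted["associations"])
    return merged

# Hypothetical data: one report topic in the original map,
# the same report plus one extracted term in the generated map.
original = {
    "topics": {"http://example.org/report/1": {"Report 1"}},
    "associations": set(),
}
extracted = {
    "topics": {
        "http://example.org/report/1": {"Report 1"},
        "http://example.org/term/ied": {"IED"},
    },
    "associations": {
        ("http://example.org/report/1", "http://example.org/term/ied"),
    },
}
merged = merge_topic_maps(original, extracted)
print(len(merged["topics"]), "topics,", len(merged["associations"]), "associations")
```

The report topic appears only once in the merged map because both inputs identify it with the same subject identifier, which is why the extracted topics end up attached to the original reports.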
Although the technical implementation of the experiment was easy, I am not sure about the quality of the automatic classifications provided by Calais, Alchemy and Yahoo. The source material, reports of military actions, is clearly very challenging because of its military-specific expressions, terms, and acronyms, and it looks like all of the classifiers have made false interpretations. It would be very interesting if someone evaluated the overall quality of the generated topic maps and pointed out the typical error classes.
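One possible way to start such an evaluation is to hand-label a sample of extracted topics and tally the results. The following Python sketch assumes that each sampled term is judged either "correct" or assigned to an error class. The error class names and the sample judgments are purely hypothetical and are not findings from the experiment.

```python
from collections import Counter

def summarize_judgments(judgments):
    """Summarize hand-made judgments of extracted terms.

    judgments: list of (term, label) pairs, where label is "correct"
    or the name of an error class (hypothetical names in this sketch).
    Returns (precision, error_counts).
    """
    counts = Counter(label for _, label in judgments)
    precision = counts["correct"] / len(judgments) if judgments else 0.0
    errors = {label: n for label, n in counts.items() if label != "correct"}
    return precision, errors

# Invented sample judgments (assumption, for illustration only).
sample = [
    ("IED", "correct"),
    ("FOB", "correct"),
    ("TF", "acronym misread"),
    ("Salerno", "wrong entity type"),
]
precision, errors = summarize_judgments(sample)
print(precision, errors)
```

Running the same tally separately for each extractor's topic map would make the extractors comparable on the same sample of reports.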
In any case, I hope this demonstration shows that a simple topic map storing only occurrence data is not necessarily a dead end but can be a good start.