Topic map conversion of WordNet

From WandoraWiki
Revision as of 15:17, 26 July 2012 by Akivela (Talk | contribs)

WordNet is a large lexical database of English developed at the Cognitive Science Laboratory of Princeton University. The topic map conversion is based on W3C's work on an RDF version of WordNet 2.0.

Download WordNet topic map

Two versions of the WordNet topic map are available:

  • Wandora project file (6.9 MB). This version is targeted at Wandora users. The project file stores synsets and associations in separate layers, which lets the user dynamically filter out undesired WordNet associations. Download the Wandora application here.
  • Single XTM dump (zipped 5.4 MB, unzipped 144 MB). This version is a general topic map file usable in any topic map application that supports the XTM file format.

The WordNet glossary is available as a separate topic map:

  • Glossary XTM dump (zipped 6.0 MB, unzipped 52 MB). The glossary contains only topic stubs, each with a single subject identifier and glossary occurrences. To use the glossary, merge it into the WordNet topic map.

History

  • 2008-03-04 / v25 / Changed SI base to http://www.wandora.org.
  • 2007-07-31 / v24 / Fixed instanceOf elements in XTM dumps.
  • 2007-07-07 / v23 / Initial release.

Metrics of WordNet topic map

The WordNet topic map contains

  • 115486 topics
  • 137383 associations

Word topics:

  • 115424 word topics
  • 79689 noun topics
  • 13508 verb topics
  • 7482 adjective topics
  • 3664 adverb topics
  • 11081 adjective satellite topics

Associations:

  • 648 attribute associations
  • 218 causes associations
  • 1280 classified-by-region associations
  • 6166 classified-by-topic associations
  • 983 classified-by-usage associations
  • 409 entails associations
  • 94842 hyponym-of associations
  • 12205 member-meronym-of associations
  • 8636 part-meronym-of associations
  • 874 same-verb-group associations
  • 11098 similar-to associations
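
The per-category counts can be cross-checked with simple arithmetic. The sketch below copies the numbers from the lists above and sums them. The five word categories add up to the 115424 word topics. The eleven listed association types add up to 137359, which is 24 fewer than the stated total of 137383; this page does not say what the remaining associations are.

```python
# Counts copied from the metrics lists above.
word_topics = {
    "noun": 79689,
    "verb": 13508,
    "adjective": 7482,
    "adverb": 3664,
    "adjective satellite": 11081,
}
associations = {
    "attribute": 648,
    "causes": 218,
    "classified-by-region": 1280,
    "classified-by-topic": 6166,
    "classified-by-usage": 983,
    "entails": 409,
    "hyponym-of": 94842,
    "member-meronym-of": 12205,
    "part-meronym-of": 8636,
    "same-verb-group": 874,
    "similar-to": 11098,
}

word_total = sum(word_topics.values())
association_total = sum(associations.values())

print(word_total)                   # 115424, matches the word topic count
print(association_total)            # 137359
print(137383 - association_total)   # 24 associations not covered by the listed types
```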

Moreover:

  • Each word topic has a unique base name and an English display variant
  • Each word topic has a synsetId occurrence that holds the original word ID assigned at Princeton
  • The clustering coefficient of the WordNet topic map is 0.0265
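
The page does not say which definition of the clustering coefficient was used or how associations were turned into graph edges. As a minimal sketch, assuming the average local clustering coefficient and treating each binary association as an undirected edge between two topics, the value could be computed as follows. The function name and the toy graph are illustrative only.

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    `edges` is an iterable of (node, node) pairs.
    """
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    total = 0.0
    for adjacent in neighbours.values():
        k = len(adjacent)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count how many pairs of neighbours are themselves connected.
        links = sum(1 for u, v in combinations(adjacent, 2) if v in neighbours[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbours) if neighbours else 0.0


# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(average_clustering(toy_edges))
```

A low value such as 0.0265 would be consistent with a graph dominated by tree-like relations, although that reading depends on the assumed definition.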


Wordnet association count.gif

Using WordNet topic map in Wandora

The topic map version of WordNet contains over 100 000 topics and associations, and requires at least 2 GB of memory to be used properly in Wandora. To give Wandora that much memory, start the application with bin/Wandora-huge.bat or adjust Java's memory settings in bin/Wandora.bat.

Below is a screenshot of Wandora with WordNet's meeting topic open. Note the layer structure of the Wandora project file: it includes a separate layer for synsets and for each association type. The wordnet-synset layer contains all word topics but no associations between words. The wordnet-similarity layer contains only word topic stubs (one subject identifier but no base name or variant names) and the SimilarTo associations between the stubs. To concentrate on SimilarTo associations, the user may hide all other associations by clicking the eye icons of the hideable layers. Exporting the topic map with the File > Export options generates a topic map file that contains only the visible layers.

Note that the XTM version of WordNet does not contain the described layer structure. Importing the XTM version of WordNet into Wandora generates a single merged layer that contains all WordNet topics and associations.

Summary of Wandora project file layers:

  • Base layer contains Wandora's base ontology.
  • Layer wordnet-basics builds a simple instance-of hierarchy that includes all association types and classes, and attaches this hierarchy under Wandora's base ontology. The motivation for this layer is purely navigational: WordNet is easier to access when its topics and associations appear in Wandora's topic tree.
  • Layer wordnet-synset contains all words in WordNet divided into separate synset categories. Each word topic has a single subject identifier, a base name, and an English variant name. This layer does not contain associations between word topics.
  • Layer wordnet-similarity contains all SimilarTo (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the full word topic in the wordnet-synset layer.
  • Layer wordnet-part-meronym contains all PartMeronymOf (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the full word topic in the wordnet-synset layer.
  • Layer wordnet-member-meronym contains all MemberMeronymOf (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the complete word topic in the wordnet-synset layer.
  • Layer wordnet-hyponym contains all HyponymOf (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the complete word topic in the wordnet-synset layer.
  • Layer wordnet-classified-by contains all ClassifiedByRegion (wordnet), ClassifiedByTopic (wordnet), and ClassifiedByUsage (wordnet) associations between word topics, together with the association types and role topics. Word topics that play a role in the listed associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the complete word topic in the wordnet-synset layer.
  • Layer wordnet-attribute contains all Attribute (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the complete word topic in the wordnet-synset layer.
  • Layer wordnet-cases contains all Causes (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which merges the stub with the complete word topic in the wordnet-synset layer.
  • Layer wordnet-entailment contains all Entails (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which is used to merge the stub with the complete word topic.
  • Layer wordnet-same-verb-group-as contains all SameVerbGroup (wordnet) associations between word topics, together with the association type and role topics. Word topics that play a role in these associations are included as topic stubs. Each stub has only one subject identifier, which is used to merge the stub with the complete word topic.
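
The stub mechanism described in the layer summary can be illustrated with a small model. This is only a sketch of merging by shared subject identifier, not Wandora's actual implementation: the `Topic` class, the `merge_visible_layers` function, and the example.org identifier are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Topic:
    subject_identifiers: set
    base_name: Optional[str] = None
    variant_names: dict = field(default_factory=dict)


def merge_visible_layers(layers, visible):
    """Merge the topics of all visible layers by shared subject identifier."""
    by_identifier = {}  # subject identifier -> merged Topic
    for layer_name, topics in layers.items():
        if layer_name not in visible:
            continue  # hidden layers contribute nothing
        for topic in topics:
            # Simplification: a stub has one identifier, so at most one match
            # is expected; a full implementation would also merge transitively.
            target = next(
                (by_identifier[si] for si in topic.subject_identifiers
                 if si in by_identifier),
                None,
            )
            if target is None:
                target = Topic(set())
            target.subject_identifiers |= topic.subject_identifiers
            target.base_name = target.base_name or topic.base_name
            target.variant_names.update(topic.variant_names)
            for si in target.subject_identifiers:
                by_identifier[si] = target
    # A merged topic may be indexed under several identifiers; return each once.
    return list({id(t): t for t in by_identifier.values()}.values())


MEETING = "http://example.org/wordnet/meeting"  # hypothetical subject identifier
layers = {
    "wordnet-synset": [Topic({MEETING}, "meeting", {"en": "meeting"})],
    "wordnet-similarity": [Topic({MEETING})],  # stub: identifier only
}

merged = merge_visible_layers(layers, visible={"wordnet-synset", "wordnet-similarity"})
stubs_only = merge_visible_layers(layers, visible={"wordnet-similarity"})
print(len(merged), merged[0].base_name)
print(len(stubs_only), stubs_only[0].base_name)
```

With both layers visible the stub and the full topic collapse into one named topic. With the synset layer hidden only the nameless stub remains, which is why stubs carry exactly the identifier needed for merging.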


Wordnet example.gif

Conversion details

The topic map conversion of WordNet is based on W3C's RDF version of WordNet. Slightly simplified, the conversion consisted of the following steps:

  • Importing each RDF file of WordNet into Wandora as a separate layer. For each imported layer
    • RDF triples were manually converted to topic map associations. Generally this required mapping the RDF subject and object to association roles.
    • Certain subject identifiers of the imported topics were fixed.
  • Constructing a base name and a variant name for every word. Base names were constructed from the URIs of RDF subjects. Variant names were constructed from the base names. Simple regular expressions were used in name construction.
  • Creating a lightweight topic hierarchy to connect the WordNet topics to Wandora's base ontology.
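
The name construction step can be sketched with regular expressions. The page does not give the actual expressions or the resulting name format, so everything below is an assumption: the URI shape (`.../instances/synset-<lemma>-<category>-<sense>`), the base name layout, and the function name `names_from_uri` are illustrative only.

```python
import re


def names_from_uri(uri):
    """Derive a (base name, variant name) pair from an RDF subject URI."""
    local = uri.rstrip("/").rsplit("/", 1)[-1]
    match = re.fullmatch(r"synset-(.+)-([a-z]+)-(\d+)", local)
    if match is None:
        return local, local  # unknown shape: fall back to the local name
    lemma, category, sense = match.groups()
    # Assumed base name layout: the category and sense number keep names unique.
    base_name = f"{lemma} ({category} {sense})"
    # The variant name is derived from the base name, as the steps above describe.
    variant_name = re.sub(r"\s*\(.*\)$", "", base_name).replace("_", " ")
    return base_name, variant_name


example = names_from_uri(
    "http://www.w3.org/2006/03/wn/wn20/instances/synset-meeting-noun-1"
)
print(example)
```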

The whole conversion took about two working days. The most demanding step was deciding which roles to use in the associations. The following sections describe the most important base names and subject identifiers of the topic map conversion.
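
Mapping an RDF triple to an association amounts to choosing a role for the subject and a role for the object. The sketch below shows the idea with a lookup table. The role names come from this page, but the pairing of roles with predicates and the dictionary-based association model are assumptions made for illustration.

```python
W3C_SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"

# predicate -> (role of the RDF subject, role of the RDF object); assumed pairing
ROLE_MAP = {
    W3C_SCHEMA + "hyponymOf": ("hyponym", "hypernym"),
    W3C_SCHEMA + "partMeronymOf": ("part-meronym", "part-holonym"),
    W3C_SCHEMA + "memberMeronymOf": ("member-meronym", "member-holonym"),
    W3C_SCHEMA + "similarTo": ("word", "similar-word"),
    W3C_SCHEMA + "sameVerbGroupAs": ("verb-1", "verb-2"),
}


def triple_to_association(subject, predicate, obj):
    """Convert one RDF triple to an association with two role players."""
    subject_role, object_role = ROLE_MAP[predicate]
    return {
        "type": predicate,
        "players": {subject_role: subject, object_role: obj},
    }


association = triple_to_association("A", W3C_SCHEMA + "hyponymOf", "B")
print(association["players"])
```

A table-driven mapping keeps the role decisions in one place, which matters when choosing the roles is the hardest part of the work.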

Synsets

Synsets are classes that collect all words under word categories. Categories comply with W3C's and WordNet's categories. Single words are instances of these class topics. All synsets are instances of Synsets (wordnet) topic.

Base name Subject identifiers
AdjectiveSatelliteSynset (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/AdjectiveSatelliteSynset
AdjectiveSynset (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/AdjectiveSynset
AdverbSynset (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/AdverbSynset
FullSynset (wordnet) http://www.wandora.org/wordnet/synset
NounSynset (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/NounSynset
VerbSynset (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/VerbSynset

Association types

Association types define the distinct relations between word topics. The association types comply with W3C's WordNet schema. Each association type has been given an additional subject identifier that connects the topic to Wandora. All association types are instances of the Association-Types (wordnet) topic.

Base name Subject identifiers
Attribute (wordnet) http://www.wandora.org/wordnet/type/attribute
http://www.w3.org/2006/03/wn/wn20/schema/attribute
Causes (wordnet) http://www.wandora.org/wordnet/type/causes
http://www.w3.org/2006/03/wn/wn20/schema/causes
ClassifiedByRegion (wordnet) http://www.wandora.org/wordnet/type/classifiedByRegion
http://www.w3.org/2006/03/wn/wn20/schema/classifiedByRegion
ClassifiedByTopic (wordnet) http://www.wandora.org/wordnet/type/classifiedByTopic
http://www.w3.org/2006/03/wn/wn20/schema/classifiedByTopic
ClassifiedByUsage (wordnet) http://www.wandora.org/wordnet/type/classifiedByUsage
http://www.w3.org/2006/03/wn/wn20/schema/classifiedByUsage
Entails (wordnet) http://www.wandora.org/wordnet/type/entails
http://www.w3.org/2006/03/wn/wn20/schema/entails
HyponymOf (wordnet) http://www.wandora.org/wordnet/type/hyponymOf
http://www.w3.org/2006/03/wn/wn20/schema/hyponymOf
MemberMeronymOf (wordnet) http://www.wandora.org/wordnet/type/memberMeronymOf
http://www.w3.org/2006/03/wn/wn20/schema/memberMeronymOf
PartMeronymOf (wordnet) http://www.wandora.org/wordnet/type/partMeronymOf
http://www.w3.org/2006/03/wn/wn20/schema/partMeronymOf
SameVerbGroup (wordnet) http://www.wandora.org/wordnet/type/sameVerbGroupAs
http://www.w3.org/2006/03/wn/wn20/schema/sameVerbGroupAs
SimilarTo (wordnet) http://www.wandora.org/wordnet/type/similarTo
http://www.w3.org/2006/03/wn/wn20/schema/similarTo

Association roles

W3C's WordNet does not contain association roles, as RDF has no similar structure. For this reason the role topics have no corresponding subject identifier in W3C's RDF schema. Although it might be reasonable to classify each word topic with the corresponding role topic, word topics have no class that refers to the roles they play in the topic map. Also note that an association in Wandora may not contain two players with the same role. As a consequence there are role pairs such as verb-1 and verb-2, or word and similar-word, that should be considered identical. All association role topics are instances of the Association-Role (wordnet) topic.

Base name Subject identifiers
action (wordnet) http://www.wandora.org/wordnet/role/action
adjective (wordnet) http://www.wandora.org/wordnet/role/adjective
attribute (wordnet) http://www.wandora.org/wordnet/role/attribute
cause (wordnet) http://www.wandora.org/wordnet/role/cause
consequence (wordnet) http://www.wandora.org/wordnet/role/consequence
hypernym (wordnet) http://www.wandora.org/wordnet/role/hypernym
hyponym (wordnet) http://www.wandora.org/wordnet/role/hyponym
member-holonym (wordnet) http://www.wandora.org/wordnet/role/member-holonym
member-meronym (wordnet) http://www.wandora.org/wordnet/role/member-meronym
part-holonym (wordnet) http://www.wandora.org/wordnet/role/part-holonym
part-meronym (wordnet) http://www.wandora.org/wordnet/role/part-meronym
region (wordnet) http://www.wandora.org/wordnet/role/region
similar-word (wordnet) http://www.wandora.org/wordnet/role/similar-word
topic (wordnet) http://www.wandora.org/wordnet/role/topic
usage (wordnet) http://www.wandora.org/wordnet/role/usage
verb-1 (wordnet) http://www.wandora.org/wordnet/verb-1
verb-2 (wordnet) http://www.wandora.org/wordnet/verb-2
word (wordnet) http://www.wandora.org/wordnet/role/word
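
The one-player-per-role restriction explains the paired role names. As a rough sketch, assuming an association is modelled as a mapping from role to player, a symmetric relation needs two distinct role names, and a comparison that ignores which of the two roles a player holds treats the pair as identical. The function names and the model are hypothetical, not part of Wandora.

```python
# Role pairs that this page says should be considered identical.
SYMMETRIC_ROLE_PAIRS = [{"verb-1", "verb-2"}, {"word", "similar-word"}]


def make_association(assoc_type, players):
    """Build an association from (role, player) pairs, one player per role."""
    by_role = {}
    for role, player in players:
        if role in by_role:
            raise ValueError(f"role {role!r} already has a player")
        by_role[role] = player
    return {"type": assoc_type, "players": by_role}


def comparison_key(association):
    """Key that ignores role order for the symmetric role pairs."""
    players = association["players"]
    if set(players) in SYMMETRIC_ROLE_PAIRS:
        return (association["type"], frozenset(players.values()))
    return (association["type"], tuple(sorted(players.items())))


first = make_association("SameVerbGroup", [("verb-1", "walk"), ("verb-2", "stroll")])
second = make_association("SameVerbGroup", [("verb-1", "stroll"), ("verb-2", "walk")])
print(comparison_key(first) == comparison_key(second))
```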

Occurrence types

All occurrence type topics are instances of Occurrence-Type (wordnet) topic.

synsetId (wordnet) http://www.w3.org/2006/03/wn/wn20/schema/synsetId

Limitations of the topic map WordNet

To limit the size of the resulting topic map, some RDF files of WordNet have been left out of the conversion. For example, the current WordNet topic map does not contain the glossary. However, it is easy to extend the current version by importing the required RDF files into Wandora.

You should also note the extensive use of W3C's name space instead of Princeton's original name space. Although topics may carry multiple subject identifiers, subject identifiers complying with Princeton's name space were not added to word topics, as they would have increased the size of the topic map considerably. If you would like to use Princeton's name space in subject identifiers instead of W3C's, use Wandora's Regular expression replace tool to replace the domain of the subject identifier URIs. If you would like to use subject identifiers from both Princeton's and W3C's name spaces, construct for every topic another subject identifier with Princeton's domain, using the topic's base name.
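
Both options amount to simple string operations on subject identifiers. The sketch below shows them outside Wandora as plain Python. The W3C instance prefix is assumed from the schema URIs listed above, and `PRINCETON_BASE` is a placeholder, since this page does not give the Princeton name space.

```python
import re
from urllib.parse import quote

W3C_INSTANCES = "http://www.w3.org/2006/03/wn/wn20/instances/"  # assumed prefix
PRINCETON_BASE = "http://princeton.example/wordnet/"  # placeholder, not the real name space


def replace_namespace(subject_identifier):
    """Option 1: swap the W3C prefix for the Princeton one."""
    pattern = "^" + re.escape(W3C_INSTANCES)
    return re.sub(pattern, lambda _match: PRINCETON_BASE, subject_identifier)


def add_princeton_identifier(subject_identifiers, base_name):
    """Option 2: keep the W3C identifiers and add one built from the base name."""
    extra = PRINCETON_BASE + quote(base_name, safe="")
    return set(subject_identifiers) | {extra}


w3c_si = W3C_INSTANCES + "synset-meeting-noun-1"
print(replace_namespace(w3c_si))
print(sorted(add_princeton_identifier({w3c_si}, "meeting (noun 1)")))
```

The second option doubles the number of subject identifiers on word topics, which is the size increase the paragraph above warns about.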

WordNet license

WordNet was originally created at the Cognitive Science Laboratory of Princeton University. The license of WordNet is here:

Topic map conversion is based on W3C's work on RDF version of WordNet. Read more here:
