MARCXML extractor

From WandoraWiki
Revision as of 09:49, 15 June 2011 by Akivela

MARCXML is a file format used to transfer and store bibliographic information. The MARCXML standard is developed by the Library of Congress' Network Development and MARC Standards Office. Wandora's MARCXML extractor converts MARCXML files to topic maps. The extractor is started with the menu option File > Extract > Bibliographical > MARCXML extractor.... The user can configure the conversion as described below.

Wandora also features a batch conversion for MARCXML files. The batch conversion tool is started with the menu option File > Extract > Bibliographical > MARCXML batch extractor.... The batch extractor stores all converted topic maps in XTM format in a wandora_export folder beside the MARCXML files.

Use example

The user selects File > Extract > Bibliographical > MARCXML extractor... to start Wandora's MARCXML extractor.


Marcxml example 01.gif


Wandora opens a dialog. The user selects the Urls tab, enters the URL http://www.loc.gov/standards/marcxml/xml/collection.xml in the text area, and clicks the Extract button.


Marcxml example 02.gif


After the extraction, a log window remains open.


Marcxml example 03.gif


The user closes the log window and opens the newly created MARC topic in the topic tree on the left. Below MARC, the user opens the topic Record (MARC) and, below that, the first record topic. Record topics represent library entities expressed in MARCXML.


Marcxml example 04.gif


By default Wandora resolves neither a base name nor a meaningful subject identifier for the record topic; the subject identifier it adds is random. One could say that a record is defined entirely by its associations.


Marcxml example 05.gif


Marcxml example 06.gif


The user opens the topic Field (MARC), and a list of MARCXML fields opens. MARCXML fields are used as association types in the topic map conversion.


Marcxml example 07.gif


The user opens the field topic 245 - Title Statement (NR), and Wandora shows all associations of type 245 - Title Statement (NR).


Marcxml example 08.gif


The association type's subject identifier is a URI that resolves to the page of the Library of Congress' MARC documentation describing the field type.


Marcxml example 09.gif

Configuring the extraction

Holding the CTRL key down while starting the MARCXML extractor with the menu option File > Extract > Bibliographical > MARCXML extractor... opens a configuration dialog where the user can adjust the details of the MARCXML to Topic Maps conversion. The available configuration options are:


Marcxml configure.gif


Record SI patterns is a comma-separated list of URI patterns. For example, the pattern list

 http://www.wandora.org/marcxml/loc/___a@010___ , http://mydomain.org/marcxml/scn/___a@035___

describes two possible subject identifiers for records. The first appends the value of MARCXML field 010, subfield a, to the base http://www.wandora.org/marcxml/loc/. The second appends the value of MARCXML field 035, subfield a, to the base http://mydomain.org/marcxml/scn/. Each tag in a pattern is surrounded by three underscore characters on both sides. A pattern may contain multiple tags referring to different fields and subfields. If all tags in a pattern can be filled, the resulting URI is added to the record topic as a subject identifier. If one or more tags refer to a nonexistent field or subfield, the URI is rejected. By default Wandora adds a random subject identifier to a record topic.
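
The tag substitution described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Wandora's own code: the function names, the dictionary layout of the record and the sample value 92005291 are assumptions made for the example.

```python
import re

# Tag syntax taken from the text: ___<subfield>@<field>___, e.g. ___a@010___.
TAG = re.compile(r"___([0-9a-z])@(\d{3})___")

def expand_pattern(pattern, record):
    """Fill every tag in `pattern`; return None if any tag cannot be filled.

    `record` is a hypothetical mapping {(field_code, subfield_code): value}.
    """
    missing = False

    def fill(match):
        nonlocal missing
        subfield, field = match.group(1), match.group(2)
        value = record.get((field, subfield))
        if value is None:
            missing = True
            return ""
        return value

    uri = TAG.sub(fill, pattern)
    return None if missing else uri

def subject_identifiers(pattern_list, record):
    """Expand a comma-separated pattern list, keeping only fully filled URIs."""
    patterns = [p.strip() for p in pattern_list.split(",")]
    expanded = (expand_pattern(p, record) for p in patterns if p)
    return [uri for uri in expanded if uri is not None]

# Illustrative record that has field 010 subfield a, but no field 035.
record = {("010", "a"): "92005291"}
patterns = ("http://www.wandora.org/marcxml/loc/___a@010___ , "
            "http://mydomain.org/marcxml/scn/___a@035___")
print(subject_identifiers(patterns, record))
# ['http://www.wandora.org/marcxml/loc/92005291']
```

The second pattern is rejected because the record has no field 035. The sketch does not escape characters that are invalid in a URI; whether Wandora does so is not stated here.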

Record base name patterns is a comma-separated list of base name patterns. The patterns may contain tags similar to the SI pattern tags described above. By default Wandora doesn't add a base name to record topics.

The Trim datas option controls whether Wandora trims special characters out of all field data. The special characters are

,;+:// 
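
As a rough illustration, trimming could look like the Python sketch below. It assumes that trimming removes the listed characters only from the beginning and end of a value and that the trailing space in the list is significant; neither detail is confirmed here, and the function name is hypothetical.

```python
# Characters from the list above; the slash appears twice there and a
# trailing space is assumed to belong to the set.
SPECIAL_CHARACTERS = ",;+:/ "

def trim_field_data(value):
    """Strip the special characters from both ends of a field value."""
    return value.strip(SPECIAL_CHARACTERS)

print(trim_field_data("Arithmetic /"))  # prints: Arithmetic
```

Trailing punctuation of this kind is common in MARC data, for example the slash that separates a title from the statement of responsibility.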

The Solve field names option controls the names of association type topics. If ticked, Wandora tries to resolve a name for each field topic. Wandora uses an internal mapping of field codes to names, which may not contain a name for every possible field code. By default Wandora resolves names for fields.

The Solve subfield names option, if ticked, makes Wandora resolve names for subfield codes. Wandora uses an internal mapping of subfield codes to names, which may not contain a name for every possible subfield code. By default Wandora resolves names for subfields.

The Convert MARC indicators option controls whether Wandora also converts the first and second field indicators. If converted, the indicators are added to the field association as a separate member. By default Wandora converts MARC indicators. MARC indicators should not be confused with subject indicators in topic maps.

Include fields is a comma-separated list of the only field codes Wandora should convert. If specified, Wandora leaves all other fields out.

Exclude fields is a comma-separated list of field codes Wandora should leave out of the conversion. By default Wandora converts all fields.

Include subfields is a comma-separated list of the only subfield codes Wandora should convert. If specified, Wandora leaves all other subfields out. Note that Wandora doesn't allow included subfields to be specified per field, i.e. the setting affects all fields.

Excluded subfields is a comma-separated list of subfield codes Wandora should leave out of the conversion. Note that Wandora doesn't allow excluded subfields to be specified per field, i.e. the setting affects all fields. By default Wandora converts all subfields.
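
The include and exclude lists behave like a simple filter over codes. The Python sketch below shows one possible reading; it assumes that an exclude list is applied on top of an include list when both are given, which is not specified here, and the function names are hypothetical.

```python
def parse_code_list(text):
    """Turn a comma-separated list such as "100, 245" into a set of codes."""
    return {code.strip() for code in text.split(",") if code.strip()}

def is_selected(code, include, exclude):
    """Decide whether a field or subfield code takes part in the conversion."""
    # An empty include set means "convert everything" (the default).
    if include and code not in include:
        return False
    return code not in exclude

include_fields = parse_code_list("100, 245, 650")
exclude_fields = parse_code_list("650")
fields = ["100", "245", "260", "650"]
print([f for f in fields if is_selected(f, include_fields, exclude_fields)])
# ['100', '245']
```

The same check applies to subfield codes, and because those lists are global, a subfield code such as a is included or excluded in every field at once.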

MARCXML encoding is the character encoding of the extracted MARCXML file. By default the encoding is UTF-8.

Default language is a two-letter code referring to the language topic of field data occurrences. By default the language code is en, referring to English.

Conversion details

The MARCXML to Topic Maps conversion creates a topic for each record entity found in the MARCXML file. One association is created for each field in the record. The association type is based on the field code. The record topic is added to the association as the first player, with the static role Record (MARC). For each subfield in the field, a topic is created for both the subfield code and the subfield value. The subfield value is the actual text content of the subfield entity. The subfield code topic and the subfield value topic are added to the association as a single member, i.e. a player-role pair, where the subfield code topic is the role and the subfield value topic is the player.

As an exception, MARC fields 001-009 and their content are added to the record topic as occurrences, where the occurrence type reflects the field's code and the occurrence data is the content of the field entity.

The subject identifier of the record topic is random if the user has not specified a matching SI pattern. If a matching SI pattern is available, the resulting string is added to the record topic as a subject identifier.

By default the record topic has no base name. If the user has specified a matching base name pattern, the resulting string is added to the record topic as a base name.

Record topics are made instances of the topic Record (MARC).
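
To make the mapping concrete, the following Python sketch walks a small MARCXML document and builds plain data structures that mirror the description above. It is a simplified model, not Wandora's implementation: the dictionary layout, the record-N placeholder identifiers and the sample record are assumptions for the example, indicators and name resolution are left out, and the records are assumed to sit directly under a collection element.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}  # MARCXML namespace
RECORD_ROLE = "Record (MARC)"

# Made-up sample record, used only for illustration.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">12345</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Arithmetic /</subfield>
      <subfield code="c">Carl Sandburg.</subfield>
    </datafield>
  </record>
</collection>"""

def convert(xml_text):
    """Return one dictionary per record: topic, occurrences, associations."""
    records = []
    root = ET.fromstring(xml_text)
    for number, record in enumerate(root.iterfind("marc:record", NS)):
        topic = f"record-{number}"  # placeholder standing in for a random SI
        # Fields 001-009 become occurrences: (occurrence type, data).
        occurrences = [(cf.get("tag"), cf.text or "")
                       for cf in record.iterfind("marc:controlfield", NS)]
        associations = []
        for df in record.iterfind("marc:datafield", NS):
            # The record topic is always the first player.
            members = [(RECORD_ROLE, topic)]
            # Each subfield adds one member: code as role, value as player.
            for sf in df.iterfind("marc:subfield", NS):
                members.append((sf.get("code"), sf.text or ""))
            associations.append({"type": df.get("tag"), "members": members})
        records.append({"topic": topic,
                        "occurrences": occurrences,
                        "associations": associations})
    return records

for rec in convert(SAMPLE):
    print(rec["occurrences"])
    for association in rec["associations"]:
        print(association["type"], association["members"])
```

Members are kept as a list of role-player pairs rather than a dictionary, because a MARC field may repeat the same subfield code.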

Additional notes

  • The conversion schema respects the MARCXML format. The conversion does not try to hide the original file format but uses its fields and subfields in the topic map.
  • The conversion tries to be non-lossy. It should be possible to convert the extracted topic map back to MARCXML format.
  • Field contents include a lot of semi-structured text. The general principle is "If you don't know what it means, leave it alone", so the conversion leaves the text intact.