Monday, December 13, 2010

RE: Dublin Core as a replacement for MARC21

Posting to Autocat

My own opinion on this has gone back and forth. This is my current thinking:
  1. We should remember that MARC is above all a *communications format* for transferring information. I doubt that anybody's catalog actually stores records in native MARC format anymore (to see what native MARC looks like, open http://www.loc.gov/standards/marcxml/Sandburg/sandburg.mrc in Notepad); the records are probably stored in a relational database. What catalogers see when cataloging is a display that is created especially for them. When you export a record, the catalog converts the relational database record into ISO 2709 format for Z39.50 transfer. Likewise, when you import a record, the catalog takes the ISO 2709 record and converts it into the local database structures (which may be different in every catalog);
  2. Dublin Core can never be a replacement for MARC21, but it is not meant to be;
  3. Through crosswalks, it is quite possible to take almost any XML format and convert it to our own, although this will almost always mean the loss of some information somewhere. For example, going from UNIMARC to MARC21, you would lose the distinct 200$d (parallel title) and 200$g (subsequent statement of responsibility), since UNIMARC is more granular than MARC21 in these areas. Therefore, 200$d would convert to 245$b (maybe) and 200$g would convert to a ";" in the 245$c (probably). Converting from MARC21 to UNIMARC would also involve loss of information in a similar fashion. Converting to Dublin Core would involve far too much loss for libraries to switch to it.
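To make the loss in point 3 concrete, here is a minimal sketch of a crosswalk for the title field only. The function name, the sample values, and the punctuation handling are my own assumptions for illustration; a real crosswalk covers far more fields and follows ISBD punctuation much more carefully.

```python
# Hypothetical, simplified crosswalk for one field: UNIMARC 200 -> MARC21 245.
# UNIMARC 200: $a title proper, $d parallel title, $e other title information,
#              $f first statement of responsibility, $g subsequent statement
# MARC21 245:  $a title, $b remainder of title, $c statement of responsibility

def unimarc_200_to_marc21_245(subfields):
    """Convert a list of (code, value) pairs from a UNIMARC 200 field."""
    collected = {"a": [], "b": [], "c": []}
    for code, value in subfields:
        if code == "a":
            collected["a"].append(value)
        elif code in ("d", "e"):
            # Two distinct UNIMARC subfields collapse into one MARC21 subfield.
            collected["b"].append(value)
        elif code in ("f", "g"):
            # The first/subsequent distinction survives only as a ";".
            collected["c"].append(value)
    return {
        "a": " ".join(collected["a"]),
        "b": " ".join(collected["b"]),
        "c": " ; ".join(collected["c"]),
    }

# Made-up placeholder values, not taken from any real record.
field_200 = [
    ("a", "Title proper"),
    ("d", "Parallel title"),
    ("f", "First author"),
    ("g", "Second contributor"),
]
marc245 = unimarc_200_to_marc21_245(field_200)
print(marc245)
```

After the conversion, nothing in the 245 says that its $b was a parallel title rather than other title information, so a converter going back to UNIMARC can only guess.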
I have always understood the rather elementary arguments above, but what follows from them I have found difficult to accept. For a long time, I thought MARCXML was all we needed, but I came to realize that it is not enough. The problem is: we need to *enter into* the larger metadata world in various ways. This means transferring our records *not* through Z39.50 but through other protocols. It also means entering a world where libraries and their formats are not in complete control, and where our information must be usable in other databases and projects. We need to work with others, and while these others do not necessarily feel compelled to work with us, they are normally willing.


MARCXML, even though it is in XML and can therefore be manipulated (the fixed fields are still based on position though!), is still completely incomprehensible to a non-cataloger, since all of the coding is still in MARC tags. Therefore, anybody who wants to work with MARCXML must have access to a trained cataloger, which most do not. In any case, most systems developers don't feel that they should have to pay a trained cataloger to interpret the format for them; they feel it is the format that is at fault. And what's the result? They ignore our information, preferring to get it from other sources that are easier to work with. Also, library bibliographic data is hard to get.
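As a rough illustration of what a developer faces, the sketch below reads a tiny MARCXML record with Python's standard library. The record is invented for this example, and the function name is hypothetical. The point is that the XML parser gets you only as far as "controlfield 008" and "datafield 245"; knowing that the language code sits at positions 35-37 of the 008, or that 245$a is the title, has to come from the MARC21 documentation.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented 008 value, assembled piece by piece so the positions are visible:
# 00-05 date entered, 06 type of date, 07-10 date 1, 11-14 date 2,
# 15-17 place of publication, 18-34 (blank here), 35-37 language, 38-39 other codes.
FIELD_008 = "101213" + "s" + "2010" + " " * 4 + "nyu" + " " * 17 + "eng" + " " + "d"

SAMPLE = f"""<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="008">{FIELD_008}</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An invented title</subfield>
  </datafield>
</record>"""

def describe(record_xml):
    """Pull a few human-readable values out of one MARCXML record."""
    record = ET.fromstring(record_xml)
    fixed = record.find("marc:controlfield[@tag='008']", NS).text
    title = record.find(
        "marc:datafield[@tag='245']/marc:subfield[@code='a']", NS
    ).text
    return {
        "title": title,
        "date1": fixed[7:11],      # still sliced by position, XML or not
        "place": fixed[15:18],
        "language": fixed[35:38],
    }

info = describe(SAMPLE)
print(info)
```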

As an example, let's say that some developer wanted to create word clouds using the subjects from the records at the Library of Congress. That could have some really interesting results. The developer has heard that there are some data sets in the Internet Archive, and downloads the hideous ISO2709 format. Then let's suppose the developer doesn't give up and finds the tools to parse it, and maybe even finds an XML converter, but still sees a bunch of tags and codes. The developer may have heard of Dublin Core, so will look for a MARC-to-Dublin Core converter and work with the Dublin Core output, but there may be so much loss that the final product is not useful. In any case, all of this represents a struggle right from the beginning, because developers have to massage the data so much before they can even begin to do the neat stuff. This is something they don't have to do when working with other formats, and they may give up before the finish.
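For a sense of how much groundwork that involves, here is a minimal sketch of the first step: reading raw ISO 2709 records and counting the 650$a subject headings that would feed a word cloud. It uses only the standard library and makes several assumptions: the records are MARC21, the text is UTF-8 (many older records use MARC-8 instead), and the input is well formed. All function names are my own.

```python
from collections import Counter

FIELD_END, SUBFIELD = b"\x1e", b"\x1f"

def iter_records(data):
    """Yield each raw record from a byte string of concatenated records."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])      # leader 00-04: record length
        yield data[pos:pos + length]
        pos += length

def iter_fields(record):
    """Yield (tag, field_bytes) pairs using the record's directory."""
    base = int(record[12:17])                # leader 12-16: base address of data
    directory = record[24:base - 1]          # skip the directory's terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Drop the trailing field terminator from each field.
        yield tag, record[base + start:base + start + length - 1]

def subfield_values(field, code):
    """Return the values of one subfield code within a data field."""
    parts = field.split(SUBFIELD)[1:]        # the first part holds the indicators
    wanted = code.encode("ascii")
    return [p[1:].decode("utf-8", errors="replace")
            for p in parts if p[:1] == wanted]

def count_subjects(data, tag="650", code="a"):
    """Count subject headings across all records in the byte string."""
    counts = Counter()
    for record in iter_records(data):
        for field_tag, field in iter_fields(record):
            if field_tag == tag:
                counts.update(subfield_values(field, code))
    return counts
```

In practice `data` would come from reading a downloaded file in binary mode. Even then, the developer still has to learn somewhere that 650 means "topical subject" before any of this makes sense.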

In any case, the trend is to work with and manipulate this information on the fly, and this is essentially what APIs and web services do. For various reasons it becomes clear (to me at least!) that MARCXML is not a real solution.

I am not saying that I know what the solution is though!

1 comment:

  1. Mikael Nilsson argues that RDF is (the basis for) a solution -- see particularly chapter 6 on Horizontal harmonization:
    http://kmr.nada.kth.se/papers/SemanticWeb/FromInteropToHarm-MikaelsThesis.pdf
