Friday, December 31, 2010

RE: Amazon's ONIX to MARC converter

Posting to Autocat

Brian Briscoe wrote:
<snip>
I see the expected pattern for the future to be one where we (catalogers) accept machine-generated publisher metadata with its inherent shortcomings, we then improve it by human intervention to provide depth and detail (that is our strength) and we then return it to them as well as use it ourselves.

That means that we incur the costliest part of the bibliographical information creation process.

The publishers/booksellers gain the most because not only are they responsible for the least expensive part of the process but, as for-profit entities, they have the most to gain.

If we are the ones who do the most changing and are the ones who take on the largest amount of work without commensurate compensation to do that work, is it true cooperation?
</snip>
Brian brings up some very good points that I am sure others share. Since I am more or less isolated where I am currently, I take many things for granted that others may not. If this were the scenario, then I would agree completely: it would be outright exploitation of the library cataloging community, and I would be completely against it.

The idea of machine-generated publisher metadata is not what I envision. I do not think that most of this metadata from publishers, if any of it at all, is actually generated by machines. Most of this type of metadata that I have seen is actually a byproduct of a publisher's own internal management processes. My main knowledge of this is from my work at FAO of the United Nations, where they have what is called a "document repository," but many other publishers do something similar.

The guts of this document repository is a "content management system" (CMS) that allows the editors, administrators, and so on to manage the production of the documents: what the title of a document is, who the authors and editors are, what publication it will appear in, what languages it will be translated into and who is responsible for them, and so on. Consequently, the actual workflow for metadata creation begins from the moment the resource has been assigned to someone, *before the resource exists*. Then there is a gradual accretion as it goes through the process of creation: additional authors, changes in editors or in the title, series numbering, conference names, and so on. For all kinds of reasons, not all of this internal information is kept up to date, which explains why, e.g., the title may not match the item you see: the cataloger sees the final title, of course, while the content management system still holds the original working title that was never updated and is carried over into the ONIX data.
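
To make that accretion concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the field names, the record, and the deliberately simplified, ONIX-like output are invented for illustration and do not describe any publisher's actual system. The point is only to show how a working title that nobody updated can end up in the exported feed.

```python
# A deliberately simplified sketch of metadata accretion in a publisher's CMS.
# All field names and the export format are hypothetical, for illustration only.

from xml.sax.saxutils import escape

# 1. The record is created before the resource exists, with a working title.
record = {
    "working_title": "Draft report on fishery statistics",
    "assigned_editor": "J. Smith",
}

# 2. Metadata accretes as production proceeds.
record["authors"] = ["A. Rossi", "B. Okoye"]
record["series"] = "Technical Papers"
record["series_number"] = "512"
record["languages"] = ["en", "fr", "es"]

# 3. The final printed title changes, but nobody goes back to update the CMS...
printed_title = "Global Fishery Statistics 2010"  # what the cataloger sees on the item

# 4. ...so the export (ONIX-like, greatly simplified) still carries the working title.
def export_onix_like(rec: dict) -> str:
    authors = "".join(f"  <Contributor>{escape(a)}</Contributor>\n" for a in rec["authors"])
    return (
        "<Product>\n"
        f"  <TitleText>{escape(rec['working_title'])}</TitleText>\n"
        f"{authors}"
        f"  <Collection>{escape(rec['series'])} {rec['series_number']}</Collection>\n"
        "</Product>"
    )

print(export_onix_like(record))
# The cataloger then has to reconcile "Draft report on fishery statistics"
# with the title actually printed on the piece.
```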

Also, the authors and, mainly, the secretaries (who come and go with some regularity) are often responsible for inputting and updating the metadata. These people have only so much training and, in any case, have less interest in the metadata than in the book or journal article itself. Therefore, for these people, working on the metadata is a rather distasteful chore.

It is the same situation in open archives: the metadata you see is created by humans--normally the authors, or their graduate students, if the profs find it beneath them.

There's more to it, but I hope this makes it clear that very little metadata of this sort is actually computer generated--perhaps a contents note, but very little else, and perhaps not even that. This is the information that is exported for others to use, that we see in the ONIX records, and that ends up converted into MARC.
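
For readers who have not looked inside such a converter, here is a rough sketch of the kind of mapping an ONIX-to-MARC conversion performs. This is not Amazon's converter (which I have not seen), just an illustration, using the well-known 100/245/260 fields, of why the output can only be as good as the publisher metadata that goes in.

```python
# A minimal sketch of the kind of mapping an ONIX-to-MARC converter performs.
# This is not any particular vendor's converter; it only illustrates that the
# conversion copies whatever the publisher supplied, errors and all.

def onix_to_marc(rec: dict) -> list[str]:
    """Turn a simplified ONIX-like record into a few MARC-style field strings."""
    fields = []
    if rec.get("authors"):
        fields.append(f"100 1_ $a {rec['authors'][0]}")
    fields.append(f"245 10 $a {rec['title']}")
    if rec.get("publisher"):
        fields.append(
            f"260 __ $a {rec.get('place', '')} : $b {rec['publisher']}, $c {rec.get('date', '')}"
        )
    return fields

sample = {
    "title": "Draft report on fishery statistics",  # the stale working title again
    "authors": ["Rossi, A.", "Okoye, B."],
    "place": "Rome", "publisher": "FAO", "date": "2010",
}

for field in onix_to_marc(sample):
    print(field)
# Garbage in, garbage out: a converter cannot correct a title that was never
# updated in the publisher's own system.
```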

Still, I think it is very important to keep in mind that far more metadata *could* be generated automatically if the actual books and other resources were in XML. In that case, the title could come from *the item itself*, if the title on the chief source were coded correctly, as could the statement of responsibility, the publication information, and the other parts relevant to ISBD (those transcribed from the item), and accuracy would actually increase. (This is why I think the ISBD principle of transcription is so important today.) If these matters were seen as important, I think publishers would be much more interested in XML-based publications.
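
As a rough illustration of what that could look like, here is a minimal sketch in Python. The element names are invented for the example (a real publication would use an established scheme such as TEI or JATS, or a publisher's own DTD); the point is only that the transcribed parts of the description could be read directly from the coded chief source.

```python
# A minimal sketch: if the resource itself were XML, with the chief source coded,
# the transcribed elements of the description could be read directly from the item.
# The markup below is invented for illustration; real publications would use
# an established scheme such as TEI or JATS.

import xml.etree.ElementTree as ET

title_page_xml = """
<titlePage>
  <title>Global Fishery Statistics 2010</title>
  <responsibility>edited by A. Rossi and B. Okoye</responsibility>
  <imprint>
    <place>Rome</place>
    <publisher>FAO</publisher>
    <date>2010</date>
  </imprint>
</titlePage>
"""

root = ET.fromstring(title_page_xml)

# Transcription "from the item itself": no re-keying, so no transcription errors.
description = {
    "title_proper": root.findtext("title"),
    "statement_of_responsibility": root.findtext("responsibility"),
    "place_of_publication": root.findtext("imprint/place"),
    "publisher": root.findtext("imprint/publisher"),
    "date_of_publication": root.findtext("imprint/date"),
}

print(description)
```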

In short, I see the metadata world as human beings inputting information, and consequently, *if* we could get these human beings to just agree on some things, we could actually cooperate: where to take the title from and how to input it; how to count the stupid pages(!); how to deal with the dates of publication--AND--most importantly, to take these tasks seriously. It doesn't seem like such an impossible task! But maybe it is for now.

It is possible to reach such an agreement, of this I have no doubt. And it will be done eventually. Food standards and building standards exist now; metadata standards are no more difficult and cannot be impossible. I just don't know how soon it will be.
