Sunday, December 26, 2010

RE: ONIX data

Posting to NGC4LIB

Cory Rockliff wrote:
OK, but the key word in my statement was "iterative." To clarify, I'm not talking exclusively or even primarily about correcting systemic errors with global changes. I'm questioning the "do it once, and do it right" premise. To follow your analogy, yes--in our current ecosystem (OCLC, essentially), if one wanted to make a change to a record or record set that would then propagate to all participating libraries, it would be very much like doing a product recall (but possibly more painful). I don't think it needs to be this way, though. Standards aside, as Karen observed, bad data is bad data. But if the data's open and there are enough eyeballs on it, errors stand a better chance of being caught, and substandard data of being upgraded. Unfortunately, our current systems aren't designed for this approach.
I find it very difficult to envision how this suggestion can work in practice. Here is an actual, real-life example: I just discovered to my great joy that a scan of the Report of the Royal Commission about Panizzi's catalog has recently been put online, so now I have my very own copy! [Tremendous thanks to the Bavarian State Library!] Nice scan, too. Let's compare the cataloging with what is in Google Books and what is in the LC Catalog.

LC Catalog record:
Corporate Name: Great Britain. Commissioners appointed to inquire into the constitution and government of the British museum. [from old catalog] 
Main Title: Report of the Commissioners appointed to inquire into the constitution and government of the British museum; Published/Created: London, Printed by W. Clowes and sons, for H. M. Stationery off., 1850. 
Related Names: Ellesmere, Francis Egerton, Earl of, 1800-1857.
Description: iv, 823, [1] p. 34 cm. 
Subjects: British museum. [from old catalog] 
LC Classification: Z792.B863 G3

This illustrates older cataloging practices (the LCCN dates from 1902, but I am sure this also represents a conversion from earlier practices). We can see this from the non-ISBD punctuation, but primarily from the abbreviated title, which omits "with minutes of evidence" (I do not know, but I suspect that ending the title with a semicolon implied a continuation; this is only a guess); from the older method of recording the paging, "[1] p.", which reflects the colophon with no page number in the original; and above all from the abbreviated subject, which doesn't use even the free-floater "Management" now valid under corporate bodies. I would expect a cataloger today to add several additional subjects, but things were different back then, with far fewer materials to deal with, and less differentiation needed in the catalog.

But when we compare this to the Google metadata (found at the bottom of the information page), we find:
Title Report of the Commissioners appointed to inquire into the Constitution and government of the British Museum; with Minutes of Evidence: (Presented to both Houses of Parlament by Command of Her Majesty.)
Publisher Will. Clowes, 1850
Original from the Bavarian State Library
Digitized Jun 28, 2010
Subjects Travel / Museums, Tours, Points of Interest

While the title is fuller, even including the "presented" note, the publication information is very abbreviated, omitting the place and the important Stationery Office; it omits the Earl of Ellesmere as Chair of the Committee; the committee itself is not there as a corporate body; no physical description; and finally the subject itself is totally bogus, one of those "silly" ones, that--I hope!--was automatically generated; either that or it is similar to some of the BISAC terms that are too general to be of any real use.
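
To make the comparison concrete, here is a minimal Python sketch that lists which access points present in the LC record have no counterpart in the Google record. The field names (corporate_name, related_names, and so on) and the missing_fields helper are hypothetical labels chosen for this illustration, not part of MARC, ONIX, or any Google schema; the values are copied from the two records above.

```python
# Hypothetical field names; the values are copied from the records quoted above.
lc_record = {
    "corporate_name": "Great Britain. Commissioners appointed to inquire into "
                      "the constitution and government of the British museum.",
    "title": "Report of the Commissioners appointed to inquire into the "
             "constitution and government of the British museum;",
    "published": "London, Printed by W. Clowes and sons, "
                 "for H. M. Stationery off., 1850.",
    "related_names": "Ellesmere, Francis Egerton, Earl of, 1800-1857.",
    "description": "iv, 823, [1] p. 34 cm.",
    "subjects": "British museum.",
}

google_record = {
    "title": "Report of the Commissioners appointed to inquire into the "
             "Constitution and government of the British Museum; with Minutes "
             "of Evidence: (Presented to both Houses of Parlament by Command "
             "of Her Majesty.)",
    "published": "Will. Clowes, 1850",
    "subjects": "Travel / Museums, Tours, Points of Interest",
}


def missing_fields(reference, other):
    """Return the fields that `reference` has but `other` lacks or leaves empty."""
    return sorted(field for field in reference if not other.get(field))


print(missing_fields(lc_record, google_record))
# ['corporate_name', 'description', 'related_names']
```

A check like this only shows that a field is absent. It says nothing about whether a field that is present, such as the Google subject, is any good.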

In this new system you suggest, let us for a moment assume that the LC record does not exist. All we have is the Google Books record. In this case, how will this record be fixed, and who is supposed to do it? Let us further assume, for the sake of argument, that the title is quite different (as I and others have mentioned happens a lot of the time for all kinds of reasons) and/or there are a few typos (in the Google record, the only typo is in the presented note, which misspells Parliament, but let's imagine some more serious typos in the record, including the title).

In doing this, I mean to set up a scenario: if the record is so bad because of bogus subjects, lack of access points, and serious typographical errors (as we are positing here) how can such a record *even be found in the first place*, so that it can be brought up to some kind of standards? I realize that the idea of crowdsourcing, using a thousand eyeballs looking at every record (although the analogy reminds me of the descriptions of some of the monsters from the Book of Revelation), may be able to find these kinds of lousy records occasionally, but it all still seems to rest on some kind of faith that everything will work itself out. What is the basis of this faith? I believe such faith is the unspoken assumption of a *minimum level of quality* of some sort, so that the record can be found at least somewhat reliably--i.e. so that the item has a chance to get those thousand eyeballs focused on it, or even two eyeballs!--because only then can everybody begin to work on it.
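
The idea of a minimum level of quality can be illustrated with a rough title-similarity score. The sketch below uses Python's standard difflib module; the normalize and similarity helpers, and the two invented titles (typo_title and different_title), are assumptions made up for this example and do not come from any real catalog. It is not meant to describe how any particular catalog or search engine actually matches records.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized titles."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


# Title as given in the LC record above.
lc_title = ("Report of the Commissioners appointed to inquire into the "
            "constitution and government of the British museum;")

# Invented for illustration: the same title with scattered typos.
typo_title = ("Reprot of the Comissioners apointed to enquire into the "
              "Constitvtion and goverment of the Brittish Mvseum")

# Invented for illustration: a quite different transcription of the title.
different_title = "Evidence taken before the Museum inquiry of 1850"

for label, candidate in [("typos", typo_title), ("different", different_title)]:
    print(label, round(similarity(lc_title, candidate), 2))
```

In this sketch the title with scattered typos still scores fairly close to the LC wording, while the differently transcribed title scores much lower. A record of the second kind is the one that risks never being found by anyone who could fix it.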

[I did some additional work on this, since this is how I learn what is available on the web, and I just recorded everything here. Those who are less "hard-core" may want to ignore the remainder.]

As a test, I wanted to find this item in an early LC catalog to see how our predecessors cataloged it (the 1861 catalog and its 1868 supplement), but I have so far been unsuccessful. I still think it's in there, though.

I did find it in an early New York catalog, the Catalogue of the New York state library: 1855. Law library, using a brute force search, where for some reason it is labelled as no. 49 from a series "House of Commons Papers". I haven't yet found anything like this in the item I downloaded. (This item is probably in other catalogs as well, but this brute force search assumes no problems with OCR.)

Matters get even stranger with the British Library record. Here is the record:

Author - corporate Great Britain. Commissioners appointed to inquire into the Constitution and Government of the British Museum.
Title Report of the Commissioners appointed to inquire into the Constitution and Government of the British Museum; with minutes of evidence. (Index to report and minutes of evidence.).
Publisher/year London, 1850.
Physical descr. 2 pt. fol. 34 cm.
Series ( (Parliamentary papers. House of Commons. Session 1850. vol. 24. no. 1170.))

This record has yet another version of the title, ignores the printer and publisher completely, and gives a quite different physical description. We also see a series statement of "Parliamentary papers. House of Commons. Session 1850. vol. 24. no. 1170", which is different from the no. 49 one above!
The series in the British record probably came from the listing available at "Parliamentary Papers, 1850".

I haven't found anything relating to this in the actual report, but it could be that the statement "Presented to both Houses of Parliament by Command of Her Majesty" is actually a codeword that means (for those who know!) that it is part of the series of Parliamentary Papers, and you have to look it up in the separate index, which "you" are supposed to know about.

The other "series" above, where it says it is no. 49 in the NY State Law Library, is probably some kind of local numbering practice. (These last two points are strictly suppositions that may be wrong.)

Now that I have this additional information about the series, I could look again in the old LC catalogs to see if it's there, but I'm getting sick of it all.

I don't think this is an especially difficult item, although a novel published only a single time written by an author with a unique name would certainly be an easier example. Other resources are far more complicated than this one. So, I think this is a realistic example of some of the difficulties we will face in trying to blend/merge, or simply to find one another's metadata before these other records can even begin to be of help.

But on the positive side, I also realize that it is amazing I could do this without ever moving from my couch in my apartment in Rome, Italy. Also, there is the possibility today, bordering on magic, of what I described as a "brute force" search. Twenty years ago and using only printed materials, doing exactly the same searching would have taken weeks of work, trying to consult multiple expert librarians and others, running all over the place, and I would have been left completely exhausted!

Finally, I notice how much I myself have changed, comparing what I have done sitting at home, deciding that this is now too much and unwilling to exert myself to complete the final part of the task. I don't think this is a symptom only of age, but I feel I am relating to information in a fundamentally different way than before, and have entirely different expectations.
