Friday, December 31, 2010

RE: Amazon's ONIX to MARC converter

Posting to Autocat

Brian Briscoe wrote:
While author and title are common search terms, the secretary input of such in any manner of uncontrolled syntax and a lack of subject and other bibliographical analysis shows that catalogers will be left with the lion's share of the work. Indeed, we already get the bare basics from LC via CIP. I suppose publishers might replace that division at LC, but it seems a relatively small thing when compared with the work that catalogers do.
This is correct, although it is *conceivable* that if our name authority files were fully accessible and "easy" to use, there might be possibilities to get a lot of the name headings too. The tool even works with the VIAF, which could be monumentally important and is certainly headed in the right direction. I also agree 100% that this is "a relatively small thing compared with the work that catalogers do".

Where we may be in disagreement is that in my opinion, the very fact that it is such a "small thing" is a huge point in our favor because it allows us to demonstrate our own importance in the entire matter, and may be the key to our own--and I will even go out on a limb and say our patrons'--salvation. Taking us out of the equation would be a definite loss for everyone. The key is to get the "decision makers" to agree to the importance of library-type cataloging and understand what we can add--as efficiently, and as ethically as possible (as I have pointed out in other posts).

How can we demonstrate this convincingly to non-specialists? I think this is one of the most important points that the cataloging community needs to make, and more importantly, to prove, even if it may make some people angry. After all, that is no less than what Panizzi did all those years ago.

In ending, I confess I feel more and more like Cato the Elder, who spoke so long ago in the Roman Senate (not that far away from where I live), and who always ended his speeches with, "Carthago delenda est" (Carthage must be destroyed). I repeat once again that RDA does not address the real problems we face.


RE: Amazon's ONIX to MARC converter

Posting to Autocat

Brian Briscoe wrote:
I see the expected pattern for the future to be one where we (catalogers) accept machine-generated publisher metadata with its inherent shortcomings, we then improve it by human intervention to provide depth and detail (that is our strength) and we then return it to them as well as use it ourselves.

That means that we incur the costliest part of the bibliographical information creation process.

The publishers/booksellers gain the most because not only are they responsible for the least expensive part of the process but, as for-profit entities, they have the most to gain.

If we are the ones who do the most changing and are the ones who take on the largest amount of work without commensurate compensation to do that work, is it true cooperation?
Brian brings up some very good points that I am sure others share. Since I am more or less isolated where I am currently, I take many things for granted that others may not. If this were the scenario then I would agree completely: it would be outright exploitation of the library cataloging community and I would be completely against it.

The idea of machine-generated publisher metadata is not what I envision. I do not think that most of this metadata from publishers, if any of it at all, is actually generated by machines. Most of this type of metadata that I have seen is actually a byproduct of a publisher's own internal management processes. My main knowledge of this is from my work at FAO of the United Nations, where they have what is called a "document repository," but many other publishers do something similar.

The guts of this document repository is a "content management system" (CMS) that allows the editors, administrators, and so on, to manage the production of the documents: what the title of a document is, who the authors and editors are, what publication it will appear in, what languages it will be translated into and who is responsible for them, and so on. Consequently, the actual workflow for metadata creation begins from the moment the resource has been assigned to someone, *before the resource exists*. Then there is a gradual accretion as it goes through the process of creation: any additional authors, changes in editors, title changes, series numbering, conference names, and so on. For all kinds of reasons, not all of this internal information is kept up to date, which explains why, for example, the title may not match the item you see: the cataloger sees the final title, of course, while the content management system still holds the original working title that was never updated and is carried into the ONIX data.
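The failure mode described above can be sketched in a few lines of Python. This is only an illustration: every field name here is invented for the example, and no real CMS or ONIX schema is implied.

```python
# Illustrative sketch: CMS metadata accretes over a document's life cycle,
# and fields nobody revisits go stale. All field names are invented.

def export_onix_title(cms_record):
    """Export whatever title the CMS currently holds. The export cannot know
    whether the field was ever updated to match the final published title."""
    return cms_record["title"]

# Stage 1: record created before the resource exists, under a working title.
record = {"title": "Draft report on fisheries", "authors": ["A. Rossi"]}

# Stage 2: accretion during production -- a co-author is added, an editor
# assigned -- but nobody thinks to revisit the title field.
record["authors"].append("B. Haddad")
record["editor"] = "C. Nguyen"

# The title actually printed on the finished item:
final_title_on_item = "World review of fisheries and aquaculture"

exported = export_onix_title(record)
print(exported == final_title_on_item)  # False: the stale working title ships
```

The point of the sketch is simply that the exported metadata is a byproduct of an internal workflow, not a description of the finished item.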

Also, the authors and, mainly, secretaries (who come and go with some regularity) are often responsible for inputting and updating the metadata. These people have only so much training and, in any case, have less interest in the metadata than in the book or journal article itself. For these people, working on the metadata is a rather distasteful chore.

It is the same situation in open archives: the metadata you see is created by humans--normally the authors, or their graduate students, if the profs find it beneath them.

There's more to it, but I think this makes it clear that very little metadata of this sort is actually computer generated. Perhaps a contents note, but very little else, and perhaps not even that. This is the information that has been exported in MARC format for others to use and what we see in the ONIX records.

Still, I think that it is very important to keep in mind that far more metadata *could* be generated automatically if the actual books and other resources were in XML. In that case, the title could come from *the item itself* if the title on the chief source were coded correctly, as could the statement of responsibility, the publication information, and the other parts relevant to ISBD (transcribed from the item), and in this case, accuracy would actually increase. (This is why I think that the ISBD principle of transcription is so important today.) If these matters were seen as important, I think that publishers would be much more interested in XML-based publications.
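As a rough illustration of the idea, here is a minimal Python sketch assuming a hypothetical XML encoding of the item itself. The element names (titlePage, title, subtitle, statementOfResponsibility) are invented for the example and do not belong to any real publishing schema.

```python
# Sketch: if the chief source were explicitly coded in the item's own XML,
# the transcribed ISBD elements could be pulled straight from the item
# instead of being re-keyed by hand. Hypothetical element names throughout.
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <titlePage>
    <title>Central and Eastern Europe after transition</title>
    <subtitle>towards a new socio-legal semantics</subtitle>
    <statementOfResponsibility>edited by Alberto Febbrajo and Wojciech Sadurski</statementOfResponsibility>
  </titlePage>
</book>
"""

def transcribe_245(book_xml):
    """Build an ISBD-style title / statement of responsibility by
    transcription from the coded chief source."""
    title_page = ET.fromstring(book_xml).find("titlePage")
    title = title_page.findtext("title")
    subtitle = title_page.findtext("subtitle")
    sor = title_page.findtext("statementOfResponsibility")
    return f"{title} : {subtitle} / {sor}."

print(transcribe_245(BOOK_XML))
# Central and Eastern Europe after transition : towards a new socio-legal semantics / edited by Alberto Febbrajo and Wojciech Sadurski.
```

Under such an arrangement the transcription could not drift from the item, which is exactly the accuracy gain argued for above.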

In short, I see the metadata world as human beings inputting information, and consequently, *if* we could get these human beings to just agree on some things, we could actually cooperate: where to take the title and how to input it; how to count the stupid pages(!); how to deal with the dates of publication--AND--most importantly, how to take these tasks seriously. It doesn't seem like such an impossible task! But maybe it is for now.

It is possible to reach such an agreement, of this I have no doubt. And it will be done eventually. Food standards and building standards exist now; metadata standards are no more difficult and cannot be impossible. I just don't know how soon it will be.

Thursday, December 30, 2010

RE: Amazon's ONIX to MARC converter

Posting to Autocat

For those who are interested in this discussion, I suggest you look at the related postings on NGC4LIB under the topic "ONIX data". A couple of postings I made compared actual records for some books (the postings are easiest to read on my blog), but the entire discussion has been interesting as well.

I would like to emphasize that ONIX data represents only one type of metadata with which we need to interoperate. There are all kinds of metadata out there from journal articles to open archives to statistical information, and on and on. Somehow, all of this metadata *will* be searchable in a much easier way than it is now--of this I am sure. It will happen either with the cooperation and coordination of the various metadata creators, or without it through "metadata mashups" that will be more or less crude.

"Cooperation" does not mean that everybody else does what I say, but it means that *everybody* will have to change to cooperate. For example, if we are to cooperate with the ONIX metadata community, we will have to change, but of course, so will the ONIX community. So, I don't think that focusing the argument on "good" and "bad" metadata, or "better" and "worse" is a productive direction. Everybody can point fingers in this way, since each metadata community has its own purposes, and the purposes are not necessarily the same in all communities, while the metadata standards and practices of each community reflect that.

A more productive direction, I think, is to focus on the advantages of cooperation that will result in greater efficiencies and increasingly easy access for our users (and their potential customers). This is how we might be able to get the communities to want to change in the first place. If we cannot show advantages, there will be no cooperation, and everything we do will just be mashed up without us in a process according to what the computer technicians and administrators want.

Can we demonstrate any advantages for the ONIX community to want to change? I think so, but we must keep in mind that they will not provide AACR2 (or RDA) records, because that would be doing what we say. We must concentrate on cooperation.

I think we can show such advantages to the ONIX community, but we should concentrate on providing reliable ISBD information, which is based on transcribing the information from the item itself, especially the 245-title/statement of responsibility information. For example, I cannot imagine anyone arguing that the information in the <title> field should *not* reflect what is on the resource.

We might all have to change some of our rules for publication information and physical description because publishers will often have more information than we do. For example, publishers, who have access to the files, could easily provide word counts.

Would this work? I don't know because it's unclear whether the metadata communities are willing to accept these types of changes, but we would have to be flexible, just as they would. Creating shared standards is very difficult work. Then, something new and advantageous to everyone may emerge. But as I said before, I have no doubt that if nothing changes, everybody's records will be mashed up, as is happening now in the Google Book metadata.

Wednesday, December 29, 2010

RE: Author added entries under RDA

Posting to Autocat

Arakawa, Steven wrote:
In this case, I think you're allowed to criticize the cataloger; limiting access to one author is inhibiting discovery & is contrary to FRBR principles. RDA clearly emphasizes more personal responsibility in decision making, rather than mastery of increasingly complex rules & rule interpretations for choice of entry (that are probably ignored in a lot of day to day cataloging anyway). Whether you approve or not may depend on whether you believe most catalogers are capable of making these decisions and are dedicated enough to apply them. I think RDA represents a turn away from the increasing proliferation of rules for every contingency, which, from a practical standpoint, have become too complicated to apply consistently, have no significant benefit for discovery, and can slow down cataloging output to a crawl. I think there is also an assumption that cataloging today occurs in a collaborative environment, and that any given catalog record does not need to be restricted to the cataloging decisions of a single individual. More individual responsibility but more community responsibility as well, in other words, with an emphasis on ultimate benefit to the user rather than matching the contingency to the closest rule approximation.
This would be nice if the world actually worked that way, but human nature--just as with the behavior of organizations--tends to do the least amount required. Some very few may do more, if amply rewarded; some very, very few may do more even without the rewards. This is a completely normal reaction and we see it in our lives all the time. After we accept this as a fact, we can understand the basic premises of standards: to define and mandate a minimum level below which anything is unacceptable. This is the only way to ensure reliability. Defining the minimum is absolutely critical, since this is what others can rely upon. A standard that states, "Do what you think is best" assumes far too much and cannot work for an entire variety of reasons: each person will have different workloads, different local mandates from supervisors, and of course, different motivations. In that case, everything is OK (since the rules state to do what I think is best) and as a result, you can only fault people for some kind of moral inadequacies ("Well, you were just being lazy ... again!") which is an abyss I don't think we want to get into.

Once this reality is accepted for what it is, we see that a rule such as:
"Only the first author is required to have an entry. Added entries are only required for translators and illustrators of children's books. All others are at the discretion of the cataloger."
is essentially a "voluntary standard" and will be taken very literally by many, many people (including me, by the way). In the example of tracing only the single author of an item with three authors, I cannot fault the cataloger because *the cataloger followed the rule*. That is why I fault the rule. According to normal standards that define minimums, the rule is: trace the first author, translators and illustrators of children's books. I personally find this rule rather senseless and maybe even silly. Why is a translator more important than author #2? Still, OK: it's the rule and that's it, period. If you don't do any more work, then *by definition*, it's OK. You are still doing your job competently and adequately since the record will follow RDA. So, the new rule is to trace the first author, translators and illustrators.

All we can conclude is that this is an incredibly huge *backward* change from what we do now. I don't know how many hundreds of years this goes back in cataloging practice, but I have seen added entries for multiple authors from 'way back. Others who do not want to take this huge step backward and want to maintain the three authors (which is completely reasonable, as we have had this practice for a *long, long time*) will have to do lots of extra work. I don't know how long the rule of three will survive until it is de facto replaced by the RDA minimum, while we will see random records with the number of AEs in the stratosphere.

Why would a cataloger do more? For some kind of "professional" or "ethical" or "moral" reasons? This, while they watch the people around them being praised and getting raises because their statistics are higher, while the more "ethical" one is pushed to do more and more, especially with budget cuts and fewer staff to do even more work? This is an unsustainable scenario.

No amount of explanation can avoid these facts and if RDA is accepted and the public discovers how access is going down in one of the areas they need, while the reply is that "abbreviations are being spelled out, and just look at these new 336 to 338 fields!", they will only see it as one more bit of evidence why library cataloging does not provide what they need. Placing the responsibility onto the shoulders of individual catalogers by trying to shame them into doing *more* than what the standards mandate is completely unfair, in my opinion. Standards should be exactly what they say they are and should be clear.

Does this mean that our current (AACR2) standards should not change? Of course they should change; they have to. But they need to be reconsidered. I currently think that AACR2, in a way, is not really a standard but almost a template that tells everyone to produce precisely the same item with *very few* options. This is not how other standards work. Many libraries and librarians cannot achieve this level; I think this is more than obvious. Change is necessary. If the RDA rule were written something like, "always trace at least the first three authors of a resource," such a rule is clear, defines a minimum, and still allows people and organizations to do more if they want. This is how food standards work, for example: defining a minimum level that everyone can rely upon, while allowing more for any organization that wants it.

Tuesday, December 28, 2010

RE: ONIX data

Posting to NGC4LIB

Karen Coyle wrote:
Jim, it seems that when you say "standards" you mean "library/ISBD standards." There are many, many different standards, and only libraries will produce bibliographic data using library standards. If you must have library data you will have to stick with library data.
But it doesn't make sense to criticize the publishers for producing *publisher* data, not *library* data. It's like criticizing your proverbial grocery store for not selling hardware. (BTW, grocery stores do not have to "re-check" because they use barcodes, not text, as data flows from the manufacturer through the wholesaler to the retailer. And they work together to agree on these standards, they don't each make up their own.)
What you are saying is that the only bibliographic data that you find acceptable is that data created by libraries (and only the more competent libraries at that).
I guess I am not being clear: this is precisely what I am *not* saying. What I am trying to say is that there are very many bibliographic standards out there, and I have a bit of experience with many of them. I talk about this in pt. 4 of my podcast on my personal journey with FRBR. Each of these standards has a history of changes. Records that were made 30 years ago reflect the standards of that time, and those standards may be quite different from what they are today. There is nothing strange or weird about this: it works exactly the same in many fields.

When library systems remained separate from the rest of the world of information, we could safely ignore those other standards and concentrate on our own. This is what every bibliographic agency did--each one worked in isolation from the others and each concentrated on its own little world, but those times have changed drastically. Everyone is having to deal with this new, wider universe--not only libraries, but every bibliographic agency. So, in practical terms, the metadata examples I gave in my previous message will actually be translated into a series of decisions on the part of the "metadata creator": the library cataloger, when looking at the ONIX record, will have to decide what part(s) to use, what parts to change, or whether to ignore the record altogether. The ONIX person will have to do something similar; non-ISBD catalogers will decide the same things; journal database indexers, open archive managers, document repositories, and so on and so on, will all have to decide whether to cooperate or just continue to do the same things as they have always done before, thereby remaining in the pre-Internet world.

This is the tremendous change we are facing--all of us. Many systems people I have met only want to concentrate on the computer coding, and figure the rest will solve itself, but any library cataloger knows that the MARC format is actually the *easiest* part of the record, what's hard is the data itself. This is the same in other bibliographic agencies I have come into contact with: the coding is the easy part. For the "metadata creators", the idea of coordinating the data in, e.g. MARC records with ONIX records with AGRIS records with everybody else, is so complicated and overwhelmingly daunting that they don't want to think about it, or sometimes as a "solution" they just decide: here's my data, do what you want with it. I've tried to show that this is *not* at all a solution.

Concerning what is "acceptable" data: is it what the competent libraries produce? I link it more to individual catalogers than to libraries, since the quality can vary tremendously within one library. But the deeper point is that it doesn't matter which metadata I consider "acceptable"; I just want the data we make to fulfill its function. I have had enough experience to realize that there is no *correct* way to do it. There are more accurate ways and less accurate ways to do something, and the highly accurate ways are more complex and take more time and training; yet there are some ways that are definitely wrong.

Counting the pages is an interesting case in point, and in the forthcoming book "Conversations with catalogers in the 21st century" I go into a lot more detail on this. In any case, counting the pages can be done in a whole variety of ways, with no "best" way. We see this in the examples I have given here. What is the solution? I don't know, and as you'll see from the book, I don't really even care(!), but I want the information in the paging area to *mean something*. If everybody counts the pages in their own way, the final product is meaningless, and we can see this meaninglessness in the examples I gave. The only way to know how many pages one of the items has is to look at the item! Then what use is the paging information in the metadata record? Very little, I submit, and that kind of thinking becomes a very slippery slope, because you can apply the same reasoning to every part of the record. And this is because of the lack of standards: everybody is doing their own thing. Nevertheless, I disagree with lots of AACR2, and it is no secret what I think about RDA. But I also realize it doesn't matter what I think.

These are some of the considerations why I created that Conceptual Outline in the Cooperative Cataloging Rules, just to try to begin to get a handle on this complexity. If there are solutions out there, I think we are going to have to begin reconsidering metadata from its most fundamental points, starting with: what is its purpose? I think there is little agreement even on this, especially today.

I just figure (and fear) that if the community of metadata creators does nothing, a solution will be imposed on us through Google Books-type mashups. Who knows what those metadata records will look like? I shudder to think of it!

RE: ONIX data

Posting to NGC4LIB

Charles Ledvina wrote:
The Amazon To Marc Converter at takes Amazon's ONIX data and creates a Marc where you can verify names via the VIAF API and add call numbers and subject headings using OCLC's Classify API.
While this could potentially become a very useful tool, it also exemplifies the problems I mentioned about standards. Here is a record taken at random from this tool. I'll assume everyone on this list knows basic MARC:
245 10 Central and eastern europe after transition / |c Alberto Febbrajo.
260 [S.l.] : |b Ashgate, |c 2010.
300 374 p. ; |c 24 cm.
490 1 Studies in modern law and policy.
[Sadurski exists as an added entry. Febbrajo is main entry.--JW]

Here is the LC ISBD information:
Central and Eastern Europe after transition : towards a new socio-legal semantics / edited by Alberto Febbrajo and Wojciech Sadurski.
Farnham, Surrey, England ; Burlington, VT : Ashgate Pub., c2010.
xi, 362 p. ; 25 cm.
[Both names as added entries]
It is interesting to note that when you search by the subtitle in Amazon ("towards a new socio-legal semantics"), you do not get this item.

Here is another one,
245 00 Picasso : |b the mediterranean years 1945-1962.
260 [S.l.] : |b Rizzoli, |c 2010.
300 390 p. ; |c 32 cm.
[Richardson, Cowling, Arnaud as 700s. Nothing for the gallery]

this copy from Columbia:
245 10 |a Picasso : |b the Mediterranean years 1945-1962 / |c curated by John Richardson ; [with contributions by Elizabeth Cowling, Claude Arnaud].
260__ |a London : |b Gagosian Gallery ; |a New York : |b Distributed by Rizzoli International Publications, |c c2010.
300__ |a 386 p. (some folded) : |b ill. (some col.), maps, ports. ; |c 31 cm.
[Main entry under the gallery--JW]

These two examples (with zillions more very easy to find) illustrate the problem of standards I keep pointing out: almost every single field differs, *even in the ISBD areas*. So, while I agree that there is a type of "copy" here, its existence is essentially useless: it winds up saving the cataloger no time at all, since every field must be redone. When faced with such a situation in the aggregate, each field of each record must be checked, even if no editing is done, because it is obvious that nothing can be taken for granted. This is the only way to ensure that a certain level of quality is achieved; otherwise all we can do is give up and accept everything. The cataloger dealing with the records here might as well start from scratch, and experience shows that when confronting these kinds of records, it is often best to ignore them completely.
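The checking burden can be made concrete with a small sketch: a field-by-field comparison of the two Febbrajo records above, paraphrased into plain Python dictionaries (the 245/260/300 tags and the field values are taken from the examples already given; the comparison logic is mine).

```python
# Sketch: compare an ONIX-derived record and a library record for the same
# item, field by field. Every field differs, so every field needs review.
onix_derived = {
    "245": "Central and eastern europe after transition / Alberto Febbrajo.",
    "260": "[S.l.] : Ashgate, 2010.",
    "300": "374 p. ; 24 cm.",
}
lc_record = {
    "245": ("Central and Eastern Europe after transition : towards a new "
            "socio-legal semantics / edited by Alberto Febbrajo and "
            "Wojciech Sadurski."),
    "260": "Farnham, Surrey, England ; Burlington, VT : Ashgate Pub., c2010.",
    "300": "xi, 362 p. ; 25 cm.",
}

# Flag every field whose two versions do not match exactly.
fields_needing_review = [tag for tag in onix_derived
                         if onix_derived[tag] != lc_record[tag]]
print(fields_needing_review)  # ['245', '260', '300'] -- every field must be redone
```

When the review list is the whole record, the "copy" has saved nothing, which is the aggregate situation described above.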

This is why I say that if a grocery store had to do this kind of work with every loaf of bread or any other item it sold, or if each business in the world that sells anything had to recheck every single item it sold, our society would completely fall apart. Such a situation would be called insanely inefficient. That's why standards exist and why they absolutely must be enforced, by law if necessary. Achieving these types of standards can be done in almost every industry, except in bibliographic information. This has always struck me as very strange.

Any standards must not only ensure a certain minimal level of quality; they must be readily achievable. I, and many other catalogers, have noticed that record quality from other libraries has gone down significantly. We must honestly ask whether our current standards are set too high, since apparently so few can achieve them, and now, with the idea of including other standards such as what we see here, I think we must completely reconsider what the purpose of our standards is and how they can *really* be achieved so that some level of assurance can be gained by all. Otherwise, all we will get will be hash, or the insane inefficiency I mentioned.

Unfortunately, RDA heads in exactly the opposite direction.

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB
Alexander Johannesen wrote:
The slogan *should* be; "Mapping knowledge for all humanity."
When the whole world will use and as persistent identifiers within a mergable identifier framework, there will be bliss, singing and dancing in the streets, and both librarians and normal folks looking for a semantic crutch in a world so complex will hug and drink mulled wine together, and all shall be well.*
* Oh, and extend (or refine) FRBR to take serious identification (like a canonical set of rules for merging and culling) into account while you're at it, and you've got exactly what the world needs.
I completely agree that if we could only get the persistent identifiers out there, people would want to use our work, and I think the mapping--or much of it--would then be done almost by default. This is because the authority and bibliographic records already contain lots of references, but the problem is that they represent what is, in essence, a closed system: authority records link only to other authority records, and bibliographic records link only to other bibliographic records, to the items themselves, or to other pages created specifically by the publishers, such as author biographical information or summary notes.

As an example, see the Mark Twain heading (click the first heading and keep clicking), and you will see lots of information, but not much of use to a patron, except that he or she can find the form of the name used in the catalog, plus discover some weird things about how they have to search under other names to get everything by Mark Twain.

In my own opinion, this "system" desperately needs to be broadened, which could be done primarily through automated means. There may be some relatively quick and easy way to include the records in the dbpedia project, e.g. somehow to include the relevant page. People would love to have the links available from there, including the "influences on" and "influenced by" (although I may or may not agree with all of them); or through something that I have attempted crudely with my Extend Search, which tries to help people find other related resources, but my attempt could be improved tremendously, especially if there were some level of cooperation.
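As a purely hypothetical illustration of that kind of automated broadening, one could derive a candidate dbpedia URI from an authority heading by a naive string transform. Real matching would of course need proper reconciliation (through VIAF, for instance), and this sketch would misfire on many headings; it is meant only to show the shape of the idea.

```python
# Naive sketch: turn an inverted authority heading into a candidate DBpedia
# resource URI. Not a real matching algorithm -- it ignores qualifiers,
# non-personal names, and ambiguity, and would need reconciliation in practice.
def candidate_dbpedia_uri(heading):
    # "Twain, Mark, 1835-1910" -> parts ["Twain", "Mark", "1835-1910"]
    parts = [p.strip() for p in heading.split(",")]
    # Re-invert surname/forename and drop the dates.
    name = f"{parts[1]} {parts[0]}"
    return "http://dbpedia.org/resource/" + name.replace(" ", "_")

print(candidate_dbpedia_uri("Twain, Mark, 1835-1910"))
# http://dbpedia.org/resource/Mark_Twain
```

Even a crude link of this sort, checked by a human or a reconciliation service, would start to open the closed system of authority-to-authority links described above.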

In many ways, I think that what is needed is a change in the "World View" of the cataloger community, so that we see our primary task as making records that will help people discover the world of information that is *really* available to them--not only to related library records. This change in "World View" has already happened long ago with our patrons.

Monday, December 27, 2010

RE: Author added entries under RDA

Posting to Autocat

James Bowman quoted LC:
>"Only the first author is required to have an entry.
Added entries are only required for translators and illustrators of children's books. All others are at the discretion of the cataloger."
I really hope this is some sort of miscommunication. One of the things that has always separated library cataloging from simple citations is controlled author headings. To single out translators (apparently not only of belles lettres) and illustrators only of children's books, leaving *everything else* up to the vagaries of "cataloger's discretion"--which can all too often mean how industrious I happen to feel at the moment, or how late I stayed up last night--represents a tremendous step backward in the history of cataloging. It is hard even to relate it to the concept of standards, much less "high-quality" standards.

Examining the record shows me that the abbreviations are spelled out ("IV izdanje" instead of "4. izd.") and that the 336-338 fields are present, none of which any user will care about at all; but they will care about author access. If the argument is that there is "keyword access" to the additional authors, this is a treacherous argument, since the same can be said for the main entry as well.

How can anyone possibly consider this a step forward--that this record represents what our users want more than our current records? I am definitely *not* criticizing the cataloger here, but the *rule*, which I do not think would be accepted by the public were it known to them.

Sunday, December 26, 2010

RE: ONIX data

Posting to NGC4LIB

Cory Rockliff wrote:
OK, but the key word in my statement was "iterative." To clarify, I'm not talking exclusively or even primarily about correcting systemic errors with global changes. I'm questioning the "do it once, and do it right" premise. To follow your analogy, yes--in our current ecosystem (OCLC, essentially), if one wanted to make a change to a record or record set that would then propagate to all participating libraries, it would be very much like doing a product recall (but possibly more painful). I don't think it needs to be this way, though. Standards aside, as Karen observed, bad data is bad data. But if the data's open and there are enough eyeballs on it, errors stand a better chance of being caught, and substandard data of being upgraded. Unfortunately, our current systems aren't designed for this approach.
This is one of those suggestions that I find very difficult to envision working in practice. Here is an actual, real-life example: I just discovered to my great joy that a scan of the Report of the Royal Commission about Panizzi's catalog has recently been put online, so now I have my very own copy! [Tremendous thanks to the Bavarian State Library!] Nice scan, too. Let's compare the cataloging with what is in Google Books and what is in the LC Catalog.

LC Catalog record:
Corporate Name: Great Britain. Commissioners appointed to inquire into the constitution and government of the British museum. [from old catalog] 
Main Title: Report of the Commissioners appointed to inquire into the constitution and government of the British museum; Published/Created: London, Printed by W. Clowes and sons, for H. M. Stationery off., 1850. 
Related Names: Ellesmere, Francis Egerton, Earl of, 1800-1857.
Description: iv, 823, [1] p. 34 cm. 
Subjects: British museum. [from old catalog] 
LC Classification: Z792.B863 G3

This illustrates older cataloging practices (the LCCN dates from 1902, but I am sure this also represents a conversion from earlier practices), as we can see from the non-ISBD punctuation, but primarily from the abbreviated title, which omits "with minutes of evidence" (I do not know, but I suspect that the title ending with a semicolon implied a continuation; this is only hazarding a guess); the older method of recording the paging, [1] p., which reflects the unnumbered colophon in the original; but above all, the abbreviated subject, which doesn't use even the free-floating subdivision "Management" now valid under corporate bodies. I would expect a cataloger today to add several additional subjects, but things were different back then, with far fewer materials to deal with and less differentiation needed in the catalog.

But when we compare this to the Google metadata (found at the bottom of the information page), we find:
Title Report of the Commissioners appointed to inquire into the Constitution and government of the British Museum; with Minutes of Evidence: (Presented to both Houses of Parlament by Command of Her Majesty.)
Publisher Will. Clowes, 1850
Original from the Bavarian State Library
Digitized Jun 28, 2010
Subjects Travel / Museums, Tours, Points of Interest

While the title is fuller, even including the "presented" note, the publication information is very abbreviated, omitting the place and the important Stationery Office; the Earl of Ellesmere as Chair of the Committee is missing; the committee itself is not there as a corporate body; there is no physical description; and finally, the subject itself is totally bogus, one of those "silly" ones that--I hope!--was automatically generated; either that, or it is similar to some of the BISAC terms that are too general to be of any real use.

In this new system you suggest, let us for a moment assume that the LC record does not exist. All we have is the Google Books record. In this case, how will this record be fixed, and who is supposed to do it? Let us further assume, for the sake of argument, that the title is quite different (as I and others have mentioned happens a lot of the time for all kinds of reasons) and/or there are a few typos (in the Google record, the only typo is in the presented note, which misspells Parliament, but let's imagine some more serious typos in the record, including the title).

In doing this, I mean to set up a scenario: if the record is so bad because of bogus subjects, lack of access points, and serious typographical errors (as we are positing here) how can such a record *even be found in the first place*, so that it can be brought up to some kind of standards? I realize that the idea of crowdsourcing, using a thousand eyeballs looking at every record (although the analogy reminds me of the descriptions of some of the monsters from the Book of Revelation), may be able to find these kinds of lousy records occasionally, but it all still seems to rest on some kind of faith that everything will work itself out. What is the basis of this faith? I believe such faith is the unspoken assumption of a *minimum level of quality* of some sort, so that the record can be found at least somewhat reliably--i.e. so that the item has a chance to get those thousand eyeballs focused on it, or even two eyeballs!--because only then can everybody begin to work on it.

[I did some additional work on this, since this is how I learn what is available on the web, and I just recorded everything here. Those who are less "hard-core" may want to ignore the remainder.]

As a test, I wanted to find this item in an early LC catalog to see how our predecessors cataloged it (the 1861 catalog and its 1868 supplement), but I have so far been unsuccessful. I still think it's in there, though.

I did find it in an early New York catalog, the Catalogue of the New York State Library: 1855. Law library, using a brute-force search, where for some reason it is labelled as no. 49 in a series "House of Commons Papers". I haven't yet found anything like this in the item I downloaded. (This item is probably in other catalogs as well, but the brute-force search assumes no problems with the OCR.)

Matters get even stranger with the British Library record. Here it is:

Author - corporate Great Britain. Commissioners appointed to inquire into the Constitution and Government of the British Museum.
Title Report of the Commissioners appointed to inquire into the Constitution and Government of the British Museum; with minutes of evidence. (Index to report and minutes of evidence.).
Publisher/year London, 1850.
Physical descr. 2 pt. fol. 34 cm.
Series ( (Parliamentary papers. House of Commons. Session 1850. vol. 24. no. 1170.))

This record has yet another version of the title information, ignores the printer and publisher completely, gives a quite different physical description, and shows a series statement of "Parliamentary papers. House of Commons. Session 1850. vol. 24. no. 1170", which is different from the no. 49 one above!
The series in the British record probably came from the listing "Parliamentary Papers, 1850".

I haven't found anything relating to this in the actual report, but it could be that the statement "Presented to both Houses of Parliament by Command of Her Majesty" is actually a codeword that means (for those who know!) that it is part of the series of Parliamentary Papers, and you have to look it up in the separate index, which "you" are supposed to know about.

The other "series" above, where it says it is no. 49 in the NY State Law Library, is probably some kind of local numbering practice. (These last two points are strictly suppositions that may be wrong.)

Now that I have this additional information about the series, I could look again in the old LC catalogs to see if it's there, but I'm getting sick of it all.

I don't think this is an especially difficult item, although a novel published only once and written by an author with a unique name would certainly be an easier example. Other resources are far more complicated than this one. So, I think this is a realistic example of some of the difficulties we will face in trying to blend/merge, or simply to find, one another's metadata before these other records can even begin to be of help.

But on the positive side, I also realize that it is amazing I could do all this without ever moving from my couch in my apartment in Rome, Italy. Also, there is a possibility today, bordering on magic, of what I described as a "brute force" search. Twenty years ago, using only printed materials, doing exactly the same searching would have taken weeks of work, consulting multiple expert librarians and others, running all over the place, and I would have been left completely exhausted!

Finally, I notice how much I myself have changed, comparing what I have done sitting at home with my decision that this is now too much, unwilling to exert myself to complete the final part of the task. I don't think this is a symptom only of age; I feel I am relating to information in a fundamentally different way than before, and have entirely different expectations.

RE: ONIX data

Posting to NGC4LIB

Cory Rockliff wrote:
Isn't this, to some extent, a false either-or? In the card era, a "Do it once, and do it right" mentality made perfect sense, since any change meant pulling cards and revving up the electric eraser, and what today is a simple global find-and-replace could mean months of labor.
Nowadays, a more iterative approach to cataloging is possible, so perhaps the priority should be building better systems for collaborative editing and enhancement of bibliographic metadata, rather than trying to enforce standards.
This is why I added the possible option 3, which, when transferred to the food industry, would amount to doing a recall when quality control finds problems. While this is an option, the number of "recalls" must of course be kept to a minimum. What is this minimum: 90% compliance or 50%? In library terms, this is equivalent to the amount of recataloging. While some changes may be fixable through global updates, that assumes the errors are consistent. But if it's a matter of lousy titles because the title in the ONIX record is actually the title of an earlier version that was changed, or of typos (both from my own experience), then the only way of dealing with it is to retrieve the materials. If we are dealing with XML documents, however, the title and much of the descriptive information could quite literally come from the item itself, and this could mean a major savings. That is a ways off in the future, though.
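To illustrate why "globals" only go so far, here is a minimal sketch in plain Python (the record structure and the publisher names are invented for illustration): a global change can only fix errors that appear in one exact, consistent form.

```python
def global_replace(records, field, wrong, right):
    """Replace a consistently wrong value in one field across all records.

    Returns the number of records changed."""
    changed = 0
    for rec in records:
        if rec.get(field) == wrong:
            rec[field] = right
            changed += 1
    return changed

# Hypothetical records: two copies of the same imprint, keyed inconsistently.
records = [
    {"title": "Report of the Commissioners", "publisher": "W. Clowes & sons"},
    {"title": "Another report", "publisher": "W. Clowes and sons"},
]

# Only the record with the exact wrong form is fixed; the inconsistent
# variant ("W. Clowes & sons") slips through -- which is the point:
# global changes assume the errors are consistent.
n = global_replace(records, "publisher", "W. Clowes and sons",
                   "William Clowes and Sons")
```

Inconsistent errors, such as an obsolete title or random typos, defeat this approach entirely and push you back to retrieving the items themselves.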
Well, there we are. In my position, I do a certain amount of cataloging, and I vastly prefer editing any kind of existing text to transcribing.
But then again, I'm a lousy typist.
So, if all these records have to be changed as they come in, let's not delude ourselves that we are all following standards. As I have mentioned before, if an electronics store had to open up and check every TV set it sold to make sure the wiring was good enough not to electrocute the customers, or if a grocery store had to check each candy bar to make sure it isn't filled with rat hair, our society would disintegrate.

Once again, other fields can deal with real standards that are enforced, and I am thankful they do. I don't want to be electrocuted or to eat food that is rotten or nasty. The bibliographic world so far has not been able to do this. What will the future be? I don't know.

RE: ONIX data

Posting to NGC4LIB

Karen Coyle wrote:
It's not just conforming to standards or levels of standards: ONIX data has a different *purpose* from library data.
So even if the publishers perfectly follow their own standards, their records will not look like ours. And that is not a bad thing. We should be able to take what is useful for us from the publisher records, or link to them for further information. It's a different view, and a legitimately different view.
This would be easier to do if our record format were more open. You can't grab someone else's bit of data and add it into a MARC record. (It should be possible to combine publisher data into a MODS record, and I'd be interested to hear from anyone who has done that.) We need to see other peoples' data as additive, not a substitute, as I said before.

This is correct, but it doesn't cover the whole of the problem. The world of metadata is far bigger than just library metadata and ONIX from the publishers: there are lots of different types of metadata that we need to interoperate with, each with its own purposes, including a whole variety of open archives. We need to interoperate with all of this because our patrons want those resources. That was why I placed the "Conceptual Outline" section in the Cooperative Cataloging Rules Wiki, to allow metadata creators to begin to get some kind of understanding of what others are doing. Not much has gone on there, but maybe it will eventually.

Nevertheless, while it is true that cataloging the same thing in different ways because of "different purposes" is being done now and will continue to be done for a while, I believe it will prove to be a luxury that is financially unsustainable in the long run, just as the cooperative cataloging movement changed so much in libraries in the past. Our records are being mashed up now in Google Books and will continue to be--whether we like it or not--and doubtless Google Books will not be the only place these mashups will exist. This will force major changes. Somehow, I think bibliographic standards need to be reconsidered in this regard.

Thursday, December 23, 2010

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB
Bernhard Eversberg wrote:
22.12.2010 17:40, Ross Singer:
>> No technical or theoretical reason, yes. GBS is doing it, with structured metadata provided by you know who. One of the reasons for us is that we don't have the full text and won't get it.
> Although we have only ourselves to blame. It's not as though the internet archive or Open Content Alliance haven't been trying to find libraries to work with them - libraries just can't be bothered.
Only ourselves to blame? I don't think blame is the right term, and how much would blame help anyway?
I think Google Books represents the best of the "entrepreneurial spirit" the neo-conservatives are always applauding. Libraries are highly bureaucratized, and they are subservient to higher agencies; they could never have done what Google has done. It takes a highly dynamic, bold, and even somewhat crazy idea to achieve something like what Google has with Google Books. Before they started (and many thought they were crazy, given the threats of lawsuits, the enormity of the task, and so on), accepted thought was that it would take a few hundred years to scan everything in the libraries, but Google has proven otherwise, and now everybody wants to get on board with their own projects.

Of course it's not perfect; of course parts will have to be redone, but Google has achieved what is in essence the impossible; it is not ever going to disappear, and now everything is changing in ways that are completely unpredictable. Everybody having anything to do with this incredible amount of information will find almost everything they do changed. This includes, above all, libraries.

I may not like much of this, but Google's achievement is incredible and a public entity such as libraries could never have accomplished it. It's also happened before and the changes were fairly well documented: the introduction of the printing press and its profound changes for society (and libraries). Many of those changes had horrible consequences, such as the Counter-Reformation and the Index of Forbidden Books, but we survived and became stronger.

I hope history will repeat itself, but minus the nasty parts! I do think that, for better or worse, it's going to be a wild ride.

RE: ONIX data

Posting to NGC4LIB

Cory Rockliff wrote:
This is actually what I imagined would be the primary use case for ONIX in libraries--Even if there's a lot more to add in order to arrive at a "full" record (however we're to define that), deriving MARC record "stubs" from ONIX should significantly lessen the burden of transcription from the item-in-hand by catalogers.
The word "should" here carries the entire question of whether to use ONIX records for library cataloging. There are certain levels of standards that must be adhered to if the entire system is not to dissolve into complete chaos. These standards must be linked to a certain level of assurance that the records actually *do* conform to those standards, i.e. while you can never get 100% compliance in anything, what is acceptable? 98%? 90%? 75%? 50%? If there is no assurance (within tolerable limits) that a specific record will conform to the standards, there are essentially two options:
  1. to give up, accept everything, and admit that there are no standards; or
  2. recheck each and every record received to ensure the standards are met within your own catalog.
Of course, this problem is nothing new in the library cataloging world and has been going on since the beginning. It's no secret that some libraries produce "higher-quality" records than others, and that there is a lot of junk requiring mounds of reworking. Perhaps these other libraries *should* produce higher-quality records, but they just don't for all kinds of reasons, legitimate and less so. What is a solitary library supposed to do when they find a lousy copy record? Accept it or revise it?

[There is potentially a 3rd option, which would be quality control done retrospectively on random samples. That has always seemed dangerous to me, since what would you do if you found a major problem? Relating it to food, what happens when the controllers discover retrospectively that x% of canned tomatoes on the market have ptomaine? One thing that happens is that the producer responsible is severely punished. Other librarians may have some experience with such an option for catalog records.]

So, if taking ONIX records really were a matter only of *adding* information, i.e. essentially the headings, that would be one thing, but this assumes quite a bit: that the rest of the record conforms to your standards (AACR2, ISBD). From what I have read from others who have more experience working with these records, this is not at all the case, and therefore taking ONIX records will just mean having more lousy copy cataloging available. As a result, the cataloger has to recheck and/or redo the entire record anyway, saving very little or nothing, or even incurring the additional labor of fixing everything. The only other choice is to hold your nose, do nothing, and accept whatever you find.
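To make the "stub" mechanics concrete, here is a minimal sketch (Python; the field choices and the toy ONIX 2.1-style fragment are my own illustrations, not a real converter) of deriving a MARC-like stub from publisher data:

```python
import xml.etree.ElementTree as ET

# A toy ONIX 2.1-style fragment (reference tag names, heavily simplified;
# treat it as illustrative rather than a complete, valid ONIX record).
onix = """<Product>
  <Title><TitleText>Report of the Commissioners</TitleText></Title>
  <Publisher><PublisherName>W. Clowes and Sons</PublisherName></Publisher>
</Product>"""

def onix_to_marc_stub(xml_text):
    """Derive a bare-bones MARC-like stub (field tag -> value) from ONIX.

    Everything the publisher record lacks -- headings, subjects, full
    ISBD description -- still has to be supplied by the cataloger."""
    root = ET.fromstring(xml_text)
    stub = {}
    title = root.findtext("Title/TitleText")
    if title:
        stub["245"] = title          # title proper, as transcribed
    publisher = root.findtext("Publisher/PublisherName")
    if publisher:
        stub["260$b"] = publisher    # name of publisher
    return stub

stub = onix_to_marc_stub(onix)
```

Even in this best case, the stub carries only transcription-level data; whether the "savings" are real depends entirely on whether that transcription can be trusted, which is the point at issue.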

This is ostensibly one of the reasons for RDA: that we could receive RDA-compatible records from publishers through ONIX. I haven't seen any evidence for this. Why, if publishers won't give us AACR2-compatible records, would they be more willing to give us RDA-compatible records? I find that totally inconceivable.

One good point of having ONIX records (along with other metadata records) is that we may be forced to confront what I think is reality: I have mentioned before that perhaps our current AACR2 standards are too high, since so many libraries have problems reaching them. How could we design standards that these different bibliographic agencies could actually reach? The result would be a reliable, assured, minimal level of quality.

In my own opinion, ISBD could be a very good beginning.

Wednesday, December 22, 2010

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB

Laval Hunsucker wrote:
Most of us will often have encountered discussions of whether librarianship is or is not a profession, which sometimes go on ( and on ... ) on the question to what acknowledged professions it can then justifiably be compared. Many are adduced, but rarely or never mentioned is the one profession to which I myself believe that librarianship can best be compared -- that of the clergy. I could never see librarians as being in essence very much like physicians, or nurses, or engineers, or lawyers, or pilots, or ( good ) teachers, or even architects ( though perhaps a little like accountants ), but they *are* very much like priests, it seems to me, in very many respects. Fletcher was pretty much on the mark in his inclination to look at library systems and services [ and 'information' ? ] as a kind of religion. And indeed, if they were already a religion back then, such may well be even more the case now than it was in his time.
This may be correct, although I still prefer to think of catalogers (at least) more as mechanics or plumbers, but that betrays my working class background since I consider this a compliment. I have had the opportunity to get to know some wonderful clerics here in Rome, so the idea of librarianship as a religion doesn't seem that bad to me.

On the other hand, I have seen economists (especially lately!) compared to alchemists with their unshakable belief in the philosopher's stone; weather forecasters (who forever get forecasts wrong and rarely, if ever, feel the need to explain their failures) compared to soothsayers; and--given the often-remarked-upon intellectual and social isolation of much of the faculty, who so often speak in a jargonized language understandable only to the handful of colleagues "initiated" into ever-narrowing specialties--professors compared to "navel gazers", i.e. monks who lock themselves in cells for years at a time, whispering prayers to one another.

So, I personally see catalogers as mechanics, public service librarians as social workers, selectors as "government regulators" ensuring levels of quality, all trying their best in spite of failing very often. At the same time, I have seen clergy of different religions who still do a lot of good for society today, although they are going through scandalous times at the moment.

I'll accept the comparison with a religion. It could be worse.

RE: Dept. v. Department

Posting to Autocat

On Tue, 21 Dec 2010 14:16:19 -0700, john g marr wrote:
>On Tue, 21 Dec 2010, Jane Kelsey wrote:
I as just one librarian can only encourage the Library of Congress to review this decision soon and to adopt the full spelling of Department.
> Oh, good grief! This sort of (RDA) attention to hyper-minute detail is getting completly obsessive [hmmm... ] but definitely upholding our traditional image. :)
> It should be possible to design [program?] library catalogs [at least] to produce the same search results whether "dept." or "department" is the word used in specific searches (same goes for "cm" or "cm.").
I agree that the number of patrons, except those specially trained, who would ever even think of searching for "dept." when they want the Department of Education is essentially zero, but the solution today is *not* the 19th-century one of retyping zillions of headings. Those days should be over in today's informational universe, which has tools far more powerful than those the library world is using. It is precisely the same problem as RDA mandating that catalogers spell out all of the other abbreviations found in the catalog.

No, pardon--that's not correct. They don't want to spell out *all* of the abbreviations in the catalog, only those few allowed to be input by catalogers. You know: those abbreviations that people are forever hectoring reference librarians about because they cannot be expected to understand them; those terribly difficult ones such as p. or et al. or cm. (No--that last one isn't right either: cm is not an abbreviation but some sort of symbol or something, so it doesn't get a period. Well, that's not right either, because it does end with a period at the end of a sentence or field.)

Of course, my own experience doing reference work has shown that, while I am asked many questions about the catalog and catalog records, I have never been asked what those abbreviations mean. I honestly doubt that reference librarians are deluged with such questions; they would probably prefer catalogers to devote their efforts elsewhere. Plus, abbreviations are probably used by the public now more than ever before, with Internet slang, especially in texting (e.g. lol, imho, imnsho--some quite cryptic), as well as that incredible collection of military abbreviations, from AWACS to AWOL, to other more earthy ones such as snafu, fubar, and so on and so on. As an aside, this discussion about abbreviations in the library catalog reminds me of SUSFU! :-)

This is the world our patrons inhabit, so we should not expect that when they see a p. or et al. or s.n., they will suddenly fall into a state of total confusion and helplessness. So I agree with John: dept. (and I will go on to add the other abbreviations RDA wants us to spell out) is the sort of thing that can best be fixed by the programmers, and there are multiple ways to "fix" it. I'll bet they would love to try solving it, using something like what I discussed on RDA-L, or building something of their own.
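As one hedged illustration of the programmers' side of this, normalization at index or query time might look something like the following (plain Python; the expansion table is invented for the example). Both the heading as stored and the user's query pass through the same function, so "dept." and "department" match without a single heading being retyped:

```python
# Hypothetical abbreviation table; a real one would come from the
# cataloging rules and be far larger.
EXPANSIONS = {"dept.": "department", "govt.": "government"}

def normalize(text):
    """Expand known abbreviations so 'dept.' and 'department' match."""
    words = []
    for w in text.lower().split():
        words.append(EXPANSIONS.get(w, w))
    return " ".join(words)

# Both forms normalize to the same string, so either search finds the heading.
a = normalize("United States. Dept. of Education")
b = normalize("United States. Department of Education")
```

The same idea extends to p., et al., and the rest: fix the matching once in code, rather than the headings millions of times by hand.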

While the programmers are working on this, we should be working on setting up the system of URIs so that the labels for all of the headings work in a more fluid way, and therefore, will prove more useful to the world as a whole.

Tuesday, December 21, 2010

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB

Bernhard Eversberg wrote:
...catalogs into disregard and obsolescence, but full-text search, by its nature, has no way to do WEMI and FRBR. But it has some ways to find stuff that leave catalogs hopelessly behind.
and Jonathan Rochkind wrote:
There is no reason you can't have a system that indexes full text and supports full text search, AND has structured metadata ala FRBR. Supporting fielded searching as well as full text searching, or supporting searching that boosts hits in the result set when they match in certain fields over body text matches. And grouping or otherwise displaying items that belong to the same work in the result set. Etc.
I think librarians will be able to get all this to work together more or less, but I still question if that is what our patrons want. Do people want more links (no, that's not right; it's not even *more* links with FRBR/RDA since the number will remain the same, but somehow "different" links); do people want links that lead *into* our traditional "library universe"? Or do they want links from our materials that go *out* into the greater informational universe in some way? In my opinion, it's not even a question: people want the latter.

We really need to accept that people right now, if they want to, can find all the WEMI they want in a library catalog. Yes, maybe it's a little clunky, but no worse than it's ever been. It can be done now and can be improved using automated means. So catalogers and the entire library world should move on and put our ever-diminishing resources toward the needs that a modern "information-hungry" public wants.

I like Bernhard's "grounded Dreamliner, screwdrivers in hand". I've been doing some research and, among other things, just discovered an article from LJ, March 1905, p. 141+: "The Future of the Catalog" by William Fletcher. I'll quote his opening paragraph:
"Several years ago I wrote a paper for one of the meetings of the American Library Association on "Library superstitions." I am now inclined to add to those I then named, another--the Dictionary Catalog. I do not intend by this expression to intimate that the dictionary catalog is a thing to be disbelieved in and rejected, but rather to suggest that it has the character of a superstition in so far as it is accepted and religiously carried out on grounds that are traditional, rather than on any intelligent conviction that it meets present needs and is good for the future needs for which we must make provision."
I don't agree with everything Fletcher says, after all, he wrote this over 100 years ago, but his basic premise is still highly provocative: to reconsider the very purposes and traditions of a library catalog--and we are still creating what is basically a dictionary catalog. The purpose of the library catalog is closely connected with the purpose of the library itself. Certainly, I believe the library catalog is still needed today, but our public relates to it differently. As they do to libraries as a whole.

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB

Todd Puccio wrote:
Plain and simple, if our services as librarians are not helpful to our users so that they can be successful, then they don't need us. We should not have to convince them that we help them become successful. If it's not obvious to them - then maybe they really don't need us.
Or maybe we should just do a better job ?
Perhaps FRBR & RDA will become the best Librarian tools out there. And what's wrong with that ?
I agree with much of this, but I think a lot of it is still unknown because everything is changing so much and so fast: for example, while I want to do "a better job", I honestly don't know what that means today. With so many different excellent, new, wild, bizarre, etc. tools out there, it is hard to know what to learn first and what the public expects from us today.

One thing I know we need, though, is new tools. I think many librarians are finally seeing the many deficiencies of their tools, and much of this is focused on the library's catalog. I try to picture it to myself using various scenarios of mechanics standing around a car, scratching their heads:
  1. the mechanics are modern and competent, have state-of-the-art tools, but the car is a Model T
  2. the mechanics are from the time of the Model T, and they are looking at a Ferrari racer
  3. the mechanics are up-to-date and competent, the car itself may be OK, but all they have are stone tools
  4. ...
I can go on with variations of this image for quite a while, but I'm sure others can devise their own if they wish. Anyway, all the mechanics are always scratching their heads in bafflement.

Concerning RDA and FRBR, even though they don't really change anything substantial, it's true that they may turn out to be the best librarian tools out there.

Unfortunately, I find that a really depressing thought. There's absolutely *got* to be something better!

RE: our profession's bibliographic information

Posting to NGC4LIB

There are a few projects dealing with this [i.e. data mining for new types of resource discovery]. First, there is simply Google, which has the option in the left-hand menu of plotting any search to a timeline, e.g. search for "wisdom":

How this is generated, I have absolutely no idea, but just glancing at it, it looks as if the word "wisdom" was widely used around 200 BC; at AD 0 it stopped being used until about AD 50; it went through sporadic use until around AD 900, when it became popular again; and then, with the rise of printing, its use went up more or less steadily.

Does anybody really believe that?!

There is also the Corpus of Historical American English (COHA), which has many more controls. They have an interesting comparison with Google's Ngram tool.

And of course, there are the notable OCR problems, discovered and blogged simultaneously by many people (including myself!) who apparently think alike.

I mentioned my own amazement at finding this "specific word" in the book "The Act of Tonnage and Poundage, and Rates of Merchandize" from 1702, in the sentence:
"Every Merchant making an Entry of Goods, either Inwards or Outwards shall be dispatched in such Order as he cometh;..." The OCR misread the old spelling of "such": not only did it mistake the medial s for an f, it also misread the h as a k.

The poor author must be spinning in his grave! It appears that Google's OCR tool is more similar to many human beings than I had suspected: both have filthy minds! :-)

Of course, this is far from the only OCR problem. To be fair, this sort of "data mining" is in its very earliest stages, so it is easy to point out problems. It will take time, plus trial and error, to discover if these techniques lead to anything of value.

We are in a time of experimentation.

Monday, December 20, 2010

RE: Purpose of transcribed imprint (was: Form)

Posting to RDA-L
Erin Blake wrote:
For my own research, and for many users I serve professionally (in an independent research library), it is vital to have both transcribed and normalized information for primary resources. I can find things published in London, England, through MARC 752 ‡a Great Britain ‡b England ‡d London. I can find engravings published by John Bowles through 700 ‡a Bowles, John, ‡d 1701-1779, ‡e publisher; but through 260‡b I can see that there are two distinct versions of the plate, each with a varying address for Bowles' firm.
It is great that you can do those things, but when we get into the larger world of metadata, there are problems with the reliability of the result. For example, you can search 752$d for place of publication, down to the city, plus the publisher through a 700$e search, and that is fine. But these types of access points are not on every record in every library. So this works within the confines of the (magnificent!) Folger collection, but it ceases to function, or at least functions differently, the moment the searcher steps outside the single catalog: it may work for some records in some other catalogs, but even then it is so hit-or-miss that, for anyone except the genuine expert who knows the variations in all the different cataloging practices, the presence or absence of this information must be considered random.

I have only seen a few of these records, but, for example, the "Early American Imprints" series includes the 752, and it appears that when the author is also the printer, a separate 700$e is not made for the author as publisher. In Princeton's catalog, for instance, we can see that James Parker is added as a printer in only one of the books he printed, apparently because he authored one of them. Therefore, if there were a separate search for "printers" limited to 700$e, a search for Parker would retrieve only one of these.
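To make the problem concrete, here is a minimal sketch in plain Python (toy records, not real catalog exports or any ILS API) showing why a search limited to an explicit 700 $e "printer" access point misses records where the printer's name appears only in the transcribed imprint (260 $b):

```python
# Toy records (hypothetical data) illustrating the access-point gap:
# both imprints name Parker, but only one record carries a 700 $e.
records = [
    {"title": "Book one",
     "fields": [{"tag": "260", "subfields": {"b": "Printed by James Parker"}}]},
    {"title": "Book two",
     "fields": [{"tag": "260", "subfields": {"b": "Printed by James Parker"}},
                {"tag": "700", "subfields": {"a": "Parker, James", "e": "printer"}}]},
]

def search_printer_700e(recs, name):
    """Retrieve only records with an explicit 700 $e 'printer' access point."""
    hits = []
    for rec in recs:
        for f in rec["fields"]:
            if (f["tag"] == "700"
                    and name in f["subfields"].get("a", "")
                    and f["subfields"].get("e") == "printer"):
                hits.append(rec["title"])
    return hits

def search_imprint_keyword(recs, name):
    """Keyword search of the transcribed imprint (260 $b) instead."""
    return [rec["title"] for rec in recs
            if any(f["tag"] == "260" and name in f["subfields"].get("b", "")
                   for f in rec["fields"])]

print(search_printer_700e(records, "Parker"))    # ['Book two']
print(search_imprint_keyword(records, "Parker")) # ['Book one', 'Book two']
```

The searcher who runs the "printer" search sees one record and has no way of knowing a second one exists; that is the randomness described above.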

When this is translated into the world of union catalogs, the task for users of knowing what is really happening when they search a 752 field, or when they search for a "printer", becomes even more complex. For instance, there is a WorldCat record that has Parker's name only in the publication information, without a 752 or 700 of any kind with his heading. Naturally, this method is not used for (most) modern imprints.

By pointing this out, I am not finding fault with anything at all, just trying to emphasize the amount of knowledge assumed when someone would search, e.g. for someone as a printer: not only do they need to know about the history of the man or woman as printer, but also how all these different catalogs deal with this kind of information, and how each library's treatment is mashed/blended/wrung through union catalogs such as WorldCat. If researchers do not understand such intricacies, they could believe they are doing far more than they actually are when they do a search for James Parker as printer, or when they search for printers in Woodbridge, New Jersey; by definition, they are only getting subsets of the whole.

This is a fact, and it is important for searchers to realize it. This is what I mean by the reliability of the result. When it comes to matters of general intellectual input (authors, editors, translators to a lesser degree), there is a lot more standardization, but in other areas such as what you mention, there are special, local practices.

Today, I think it is becoming more and more important to always assume that our users do not understand this sort of complexity. And they won't ask questions. So, what can we do in this new environment to help users get some level of awareness of such problems and how to deal with them? I think there are many ways that the catalog can provide help, but we need to think in entirely new ways.

And let's not even contemplate what will happen when Google Books enters the fray!

Sunday, December 19, 2010

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB

Karen Coyle wrote:
Do people know this about us, though? If we are to fight the battle on our ethics, we need to make sure that people know what they are. In fact, we might need a good slogan.
People don't know this, and what's more, they don't even think about it.

I am really concerned that if we decide to fight the battle by declaring that our information is "better" or more "reliable", or that we "own" better information, we are doomed to ignominious failure. "Better" is difficult to prove, especially these days, and it leads down a very difficult path, riddled with booby-traps, where "better" tends to mean "sanctioned by approved authorities" and pretty much ignores open-access projects and other open resources, which are often very good and are precisely where all the dynamism and excitement lies.

As for maintaining that libraries provide "better" access: if we stay the course the library world is currently on (in the sense of FRBR/RDA types of access), we risk turning ourselves into laughing stocks. In the other sense of access, i.e. actually providing authorization to view copyrighted materials, librarians need to remember that in 90% of cases it is not the *libraries* that "own" the books and resources, but our respective universities, institutions, organizations, and so on. In this hierarchical sense, a library is nothing more than a unit that can be downgraded and merged with any other unit. As one rather drastic example, our British colleagues are seeing their "bureaucratic organizing" change from the Museums, Libraries and Archives Council (MLA) to Arts Council England (ACE).

Also, when it comes to providing "access" to remotely owned databases, such as Ebsco and Elsevier, these are simply links made available from the library's website, and those links can be anywhere else the powers that be decide to place them: they could go on the Student Services page, or individual academic departments, or "information services".

Once again, it would make sense to focus on what the library world provides that is unique, and to let others know, and a slogan is essential. Anyone who reads my postings can see that I am really bad at coming up with short, pithy statements! I realize that this is a failing of mine, but others are very good at it. One pitfall to avoid, though: when I have brought this up in my classes, people are automatically *very protective* of Google and these other services, and are quick to assume that this "filthy librarian" is dissing Google, saying that while librarians follow ethics, Google and the others are unethical.

No, not at all. It's not that Google is *un*-ethical; it's that ethics don't even enter into the discussion when we talk about Google. People at least appear to accept this, although even then I am not sure. So any slogan would have to be gentle, and fun.

Friday, December 17, 2010

RE: New "Cataloging Matters" podcast

Posting to NGC4LIB

Bernhard Eversberg wrote:
On this vast background, catalog search is a very narrow field. RDA's vision amounts to little more than making it an electrified and enhanced version of 19th century cataloging ambitions. Other players, like Amazon, LibraryThing, GoogleBooksearch, have already added many more features to their Search products while partly re-inventing our age-old ideas but only as much as required for their business model. This raises the vexing question again: What is our business model? Only after we answer this can we set out to define what our Search model ought to be. And then, what our code of cataloging rules should focus on and include. A much bigger project, I'm afraid, than what can be taken on by our "powers that be" in the ways they go about their business.
Yes, we need a business model, but in order to find it, I think we need to reconsider what we do in relation to the entire library endeavor. Apologies for referring to myself, but I have had what I think is an excellent exchange with John Vosmek on Autocat. His message brings up some really good points, and I think he speaks for a lot of people, including me. My reply is on my blog.

I really believe that "a library catalog, when looked at relative to the *totality of the goals* of how a library is to serve its community, does more than what FRBR says; thinking in this way, the catalog could potentially do *a lot* more than it does today". Libraries provide a service, which can be summarized as providing a level of trust over information, that people can find nowhere else. That trust is exemplified in our code of ethics. Almost everything we do reflects it: selection decisions, how to organize materials for retrieval, granting access to those materials, and so on. There is, and should be, at least some level of trust when someone enters a library. While people trust Google too, that is entirely different, since Google is a for-profit organization (and Google does change results based on political or societal pressure), whereas we have our ethics that we--at least--*should* be following.

I think an emphasis on ethics will make more of an impression on the public than librarians may think. I have told my info-lit classes that I will suggest databases for them to use, but that if it turned out I was making 5 or 10 euros from every person I could get to access, e.g., Lexis-Nexis, they should be more skeptical of my suggestions; they don't have to worry about that, however, because I am a librarian. I compare that to Google's "Don't be evil" or McDonald's "We do it all for you", then wonder aloud whether anybody out there really believes any of that, and we all wind up having a good laugh. This is something that people have no trouble understanding and, I think, appreciating.

Counting on Google Books

Google Books Ngram Viewer (Was: Fifty Years of Cataloging artifacts)

Related obliquely to all of this could be Google's textual analysis tool that searches the text of Google Books (announced in the NY Times). The article has some interesting searches, and I thought I would try my own, searching "catalog card" and "card catalog" in US and UK spellings.

Apart from some strange anomaly in the mid-1700s, we see that as the industrial revolution makes card production possible in the mid-1800s, usage skyrockets until computerization in the 1970s, when usage of the terms plummets along with card production, although the terms are still used a little today. When selecting for "different Englishes" (English Fiction, English One Million--whatever those mean), the trends still appear to be the same.

I am not really sure how this tool works; for example, I *believe* that typing multiple words automatically produces a phrase search, and I don't even want to bring up the problems of OCR. In any case, this could prove to be an interesting tool. How actually *useful* it will be remains to be seen, but it is interesting.
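Since the tool's inner workings are unclear, here is only a guess at the general idea: a sketch that computes a phrase's relative frequency per year over a toy corpus (bigram occurrences divided by total bigrams). The real Ngram pipeline certainly differs in tokenization, smoothing, and scale; the corpus below is invented for illustration.

```python
from collections import Counter

# Toy corpus: (year, text) pairs standing in for scanned books.
corpus = [
    (1950, "the card catalog was the heart of the library"),
    (1950, "every card catalog drawer was full"),
    (1990, "the card catalog was replaced by the online catalog"),
    (1990, "searching the online catalog is fast"),
]

def phrase_frequency(corpus, phrase):
    """Relative frequency of a two-word phrase per year: occurrences of
    the bigram divided by the total number of bigrams in that year."""
    target = tuple(phrase.split())
    counts, totals = Counter(), Counter()
    for year, text in corpus:
        words = text.split()
        bigrams = list(zip(words, words[1:]))
        totals[year] += len(bigrams)
        counts[year] += bigrams.count(target)
    return {year: counts[year] / totals[year] for year in totals}

print(phrase_frequency(corpus, "card catalog"))
```

On this toy data, "card catalog" is twice as frequent in 1950 as in 1990, which is the kind of trend the viewer plots.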

This database comes from the scanned books, and it would be interesting to be able to compare these words with "born digital" web pages as well.

Thursday, December 16, 2010

Cataloging Matters Podcast no. 7: Search

Cataloging Matters #7:

Hello everyone. My name is Jim Weinheimer and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy. The topic of this installment? Search!

What is Search? Is it really so different from what everybody has always done, or is it just another example of serving up new wine in old bottles?

Before I begin, I would like to spend just a moment on a couple of grammatical peculiarities I have noted. If you do some research on this topic, you will soon discover that the term “search” is used without an article: not “a search”, not “the search”, just “search”. Also, authors rarely use the gerund form (i.e. “search-ing”) for this concept.

So, once again: what is search?

From the library point of view, there would seem to be clear parallels between this newer concept of "Search" and the traditional library/FRBR user task of "Find" from Find/Identify/Select/Obtain; nevertheless, it is Search that is getting an increasing amount of attention in our society. Yet it is vital--for librarians especially--to understand that the two are quite different in their methods and in their goals. A lot of this difference has to do with user expectations and how they are changing, and it may give us some insight into how these expectations will change in the future. Personally, I believe that search, if it becomes widespread, as I think it will, may very well become an important political and even moral issue.

So what is search and what makes it so different from what people have always done?

Modern computer technology has made child's play of some tasks that were incredibly complex not so long ago. As only one example, Bing Travel allows someone to search for airline tickets in multiple databases at once, and will even predict whether the price you are quoted will most probably go up or down in the future. [There is a link to this, plus everything else I discuss, from the transcript.] At one time, this would have demanded a highly experienced and well-trained travel agent, but today all of it can be done in just a couple of seconds by a layperson who has had absolutely no training in how to do any of it. The obvious question is: how good a job does Bing Travel do? Only an expert travel agent could make an accurate determination, but from what I have read, Bing Travel appears to be not all that bad.

Another example that I find simply amazing is the Google Public Data site. Google has partnered with various agencies such as the World Bank, the OECD, Eurostat, and others, to use the power of Google’s tools to create something genuinely new using data that remains on each agency’s site. Today, anybody in the world with an internet connection can do their own statistical analyses in vital areas of concern, using some of the most powerful computers that exist. Of course here, the obvious question is: do people know how to interpret this information? That is another issue, but the fact remains that everyone can actually work with the same data.

Even though these kinds of projects take advantage of some of the power of modern computing, they do not deal with search, and many foresee options that are even more subtle and far more intrusive. Depending on who you are, such options can be viewed in either a positive or a negative light. In essence, this newer concept of search foresees a time when the computer will automatically look for things that even you yourself are not looking for consciously. In other words, search will do all of the work. Isn't that bizarre? How could something like that function in reality?

Let’s consider an example based on something we can all understand: a library catalog. Someone uses a catalog to find books on how money is divided among the population of the U.S. This person knows how to use a library catalog, goes to Worldcat, finds the subject heading “Income distribution -- United States” and is led by this subject to the record for Lisa Dodson’s book The Moral Underground: How Ordinary Americans Subvert an Unfair Economy. New York: The New Press, 2009.

Of course, using the traditional tools, this book can be found by other subjects that the cataloger has added, by searching the author’s name, by the title of the book, and if it is part of a series, by that title as well. These “access points” represent the FRBR user tasks of finding by author, title, and subject. This is also where FRBR pretty much stops, and if searchers want to continue, they are expected to repeat those same FRBR user tasks over and over again. But the alternate concept of search works quite differently. We can see one, very minor part of this new idea of search in the Worldcat record mentioned earlier, where, if we scroll to the bottom, we can see various “User lists” that have been created by individuals, and we can click on them. For example, this book is part of a list called “New Economics Books” created by someone named Joyline, and this list includes several other books on the same topic that may be of interest to the searcher.

It is important to note that finding these books in this way definitely falls outside the FRBR user tasks, since these materials most probably have different authors, titles, and subjects than what the searcher originally utilized. But we need to admit that even this represents nothing fundamentally new since people from time immemorial have been recommending books and articles to one another. Normally however, people have known at least something of the person recommending a book: they may be a friend, a relative, a teacher, a journalist of a newspaper or magazine, or maybe even a person talking on Oprah Winfrey that you can see and hear.

In the case we are examining, we know nothing of Joyline, since this person's profile is private. Joyline may be an economics professor, a high school teacher, a librarian, a truck driver, a dentist, or even a teenage girl from Japan. Even if this profile were public, it could be completely falsified. The anonymity of Joyline as a book recommender is something rather new; whether this anonymity matters much when a searcher decides to read a book on these lists remains to be seen.

I can’t resist a bit of self-advertisement at this point and I’ll mention the Extend Search that I have instituted in my own catalog at the AUR Library. In some of my postings, I have mentioned that I believe the information universe is composed of separate “intellectual microcosms”. These microcosms are defined when you choose a resource and then become aware of other resources related to it: perhaps books on similar topics, but also reviews, critical blog postings, public lectures, and all kinds of resources surrounding the item you are looking at. My Extend Search is an attempt to make it easier for the public to find and enter those “intellectual microcosms”. An advantage of this can be seen with the book by Lisa Dodson mentioned earlier: that book is not in my library, but my searchers can nevertheless get into its “intellectual microcosm” and find all kinds of other resources related to it; in this specific case, these resources include a 75-minute public lecture the author gave on her book. The methods I employ differ somewhat from the traditional library searching methods, but nevertheless, I want to make clear that the Extend Search methods I employ are also not a part of this new concept of search.

OK, end of advertisement.

Although these tools are useful and allow some new options, none have much to do with the new concept of search. Search is built on metadata, but not necessarily the library-bibliographic type of metadata that librarians think of: titles, authors, series statements and so on; it is built on metadata about you, and your interests such as what websites you go to, what kind of documents you download, what you buy, what you spend time reading and all kinds of extremely subtle bits of information about you. It is also built on similar metadata about your friends, as well as about me and my friends, about everyone else and their friends, and relating it all together. Search attempts to figure out what you want by indexing your documents, following your movements on your computer, and doing a semantic analysis to determine your interests. Not only that, but it links all of your metadata to similar metadata taken from your friends on Facebook, Twitter, and other projects to build an overall profile of you, your needs and your wants.

Based on this deep and profound profile of you that the computer now has at its command, when you look for something in a tool that utilizes this information, e.g. on Google, the result can be tailored much more finely to what you actually want. So, if I am logged in to my Google account and look for “cat” in Google, it would know immediately that I am looking for the animal, whereas if a construction worker were looking, it would know he or she wanted heavy machinery. Or if I enter chess, that I am primarily interested in the board game, but another person would be interested in the record label.
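How might a profile disambiguate "cat" between the animal and the machinery? Nobody outside Google knows how its personalization actually works; the sketch below is only a toy illustration of one plausible mechanism: build a term-frequency profile from texts the user has read, then score hypothetical "senses" of the ambiguous query against it by cosine similarity. All the data and sense descriptions are invented.

```python
from collections import Counter
import math

def profile_from_history(docs):
    """Build a simple term-frequency profile from texts a user has read."""
    terms = Counter()
    for doc in docs:
        terms.update(doc.lower().split())
    return terms

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical sense descriptions for the ambiguous query "cat".
senses = {
    "animal":    Counter("cat pet kitten fur animal".split()),
    "machinery": Counter("cat excavator bulldozer heavy machinery".split()),
}

# A librarian's reading history leans toward the animal sense.
librarian = profile_from_history(["my pet cat sleeps on the catalog",
                                  "feeding a kitten and an older animal"])
best = max(senses, key=lambda s: cosine(librarian, senses[s]))
print(best)  # prints "animal"
```

The same query against a profile full of construction terms would score the "machinery" sense higher, which is the whole point of personalized search.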

Once again, these computers will have an extensive profile not only of me, but of everyone who is linked to me because it is building a web of everyone. But this is far from the end.

There is a site called Aardvark (why people insist on using these ridiculous names is beyond me!), and I still do not fully understand it, but it apparently relies completely on the power of these personal profiles, so that when you type in a question, it will link you to a specific expert who can provide an answer. It does this by doing a semantic analysis of your search terms, looking for your "friends" in tools such as Facebook or LinkedIn, using that information to find their friends, the friends of those friends, and so on, to finally link to profiles of those who can answer your question. The site claims that you will get an answer in a few minutes, although I have never tried it.

But wait! Did I say the word “search”? What I have outlined so far is only the palest vision of what many want. The latest ideas are to get rid of search-ing (not search) completely! Well, not completely; of course, it’s more subtle than that. It’s just that you won’t have to do any search-ing anymore.

One example of this is the popularity of the new applications, such as what you can find at the Apple App Store. I do not own a phone that allows applications, so I only know this through reading articles, but if you own an iPhone, for example, you can download special applications that will do all kinds of things, from keeping up with the latest news on topics of your choice, such as movies or sports scores, to maps and directions, to social interaction, and on and on. This way, you can keep up with everything of your choice with practically no effort and without searching for anything. If you decide you need some kind of information and are lacking it, you just download a new app. How would this work? I could imagine someone downloading a Fine Arts app, which would bring you information on the fine arts, or a Music app, or a French Renaissance Music app, each at a greater or lesser level of specificity, and each configurable to your needs, as expert, undergraduate, high school student, interested layperson, or whatever.

But even this is not the end. There is also the idea of persistent, implicit search. This means that in the background the computer will be running, searching, and analyzing constantly, using your profile, which is itself constantly updated and refined; in this way the computer can actually interpret your needs. Let's imagine that you have spent the last few days searching for a new refrigerator. The computer has logged everything you have done, analyzed the kinds of refrigerators you happened to like by noting how long you looked at each one, and compared the similarities of the majority you looked at, or something like this. Then it continues to seek out information for you even though you may never have asked it to, because that is how your profile works. The goal of persistent, implicit search is that when you are walking or driving down a street, using a GPS system in your automobile or cell phone, the entire system would alert you that a few blocks away, a refrigerator you would like is on sale at the best price within a 500-mile radius, and here are the directions.
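The refrigerator scenario can be reduced to a very small sketch: a learned interest profile (here just hand-written weights standing in for what the system would infer from browsing) matched continuously against a stream of nearby offers, alerting only when both the interest score and the location fit. Everything here is hypothetical illustration, not any real system's API.

```python
# Hypothetical interest profile the system would have learned from
# recent browsing: terms weighted by inferred attention.
profile = {"refrigerator": 0.9, "stainless": 0.6, "energy-efficient": 0.5}

# A stream of nearby offers, as (description, distance_km, price).
offers = [
    ("stainless refrigerator, energy-efficient", 2.0, 650),
    ("leather sofa", 1.0, 300),
    ("refrigerator, used", 40.0, 200),
]

def alert(offers, profile, max_km=5.0):
    """Persistent, implicit search: surface nearby offers whose
    descriptions score highly against the profile, unprompted."""
    scored = []
    for desc, km, price in offers:
        score = sum(w for term, w in profile.items() if term in desc)
        if score > 0 and km <= max_km:
            scored.append((score, desc, price))
    return [desc for _, desc, _ in sorted(scored, reverse=True)]

print(alert(offers, profile))  # only the nearby, high-scoring refrigerator
```

The user never issues a query; the matching runs silently in the background, which is exactly what makes the idea both attractive and unsettling.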

Although many of the examples are presented in a marketing or shopping sense, it is pretty easy to imagine uses in more educational and informational settings. In fact, the future is here today, right now, and even, for free! Today, everybody can download their very own copy of Mendeley, which purports to do exactly what I outlined before except it is in the field of scholarship. After installing Mendeley, you add your documents to it and Mendeley does the work of semantic analysis to figure out what your interests are. Of course, in the process it will dig out the citations from your documents, and if it can’t find the citations within the documents, Mendeley will go out on the web and find them for you. You can then go online to share by joining groups of similarly-minded people, but none of this is all that new.

The new part is that while you are searching for information you want, Mendeley is learning all the while and will search all kinds of databases for you automatically, using the profile it has created and is constantly updating, to find resources it considers relevant to your needs, and it will even show you the latest trends in research! It does this by analysing you, creating your profile and comparing it to other researchers’ profiles, to better figure out what you want. Of course, Mendeley is as yet very new and still has a long way to go.
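Mendeley's actual algorithms are not public, so the following is only a guess at the general shape of the idea: compare keyword profiles extracted from researchers' libraries, find the most similar researcher, and recommend items from that person's library. The researcher names, keyword sets, and paper titles are all invented for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of extracted keywords."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical keyword profiles extracted from each researcher's library.
me     = {"cataloging", "metadata", "frbr", "search"}
others = {
    "alice": {"metadata", "linked", "data", "frbr"},
    "bob":   {"astronomy", "spectroscopy"},
}
their_papers = {
    "alice": ["Linked data for libraries"],
    "bob":   ["Stellar spectra"],
}

# Recommend papers from the most similar researcher's library.
nearest = max(others, key=lambda r: jaccard(me, others[r]))
recommendations = their_papers[nearest]
print(nearest, recommendations)
```

Scaled up to millions of profiles that update with every document added, this kind of matching is how a tool could claim to show you "the latest trends" in your own field unasked.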

So in this way, the goal is that you won’t even have to search anymore because the computer will do it all for you automatically, silently, persistently and implicitly. I mean, people can and will continue to do searching, but the idea is that they won’t have to anymore because the computer will have done it for them already, and will have done an even better job. You will do a search only when you think the computer hasn’t worked well enough.

For those who have read some of my postings where I have discussed “find” in the FRBR user tasks, and have mentioned that I am not sure if “find/identify/select/obtain” is what people are doing now, and what they will be doing, this is primarily what I have had in mind. With search, a tremendous amount of “search-ing” will still be going on behind the scenes, in fact there is so much “search-ing” that I believe it really does turn into something new and consequently, justifies the separate term “search” without the article “a” or “the”, as I pointed out at the beginning of this podcast. But it remains to be seen how much similarity there will be with “find” in the traditional library/FRBR sense; that is, if there will remain any similarity at all. I am not sure how to answer this, but in any case, it is clear that the future of “search” is only very remotely connected with the library ideas of “find/identify/select/obtain” or with “authors/titles/subjects”. Search goes far beyond these traditions.

Will “search” become predominant as the new information environment develops? Of course, it is impossible to tell, since even newer capabilities may become available, but search is one of the only really serious attempts I have read about that tries to deal with information overload, which is a serious problem already and can only get much worse in the future. Its promise of incredible simplicity and ease is also a point in favor of its adoption by the general populace.

It is difficult to imagine that anyone, least of all any librarian, who comes into contact with a profound change such as this will not have at least some opinion about it, and I am no exception. Personally, I do like the idea of getting away from a lot of the drudgery of research. I also like the idea of having more chances to come into contact with other scholars who share interests and communicate with them because this can often be very difficult.

Yet, the idea of a machine that silently collects all of this information about me, collates it, summarizes me to extract my “needs” that perhaps even I myself may not realize consciously, and then to search--persistently and implicitly--in a whole variety of places, makes me very uncomfortable. Perhaps this is because of the way I was raised; or perhaps I am just of another generation and those who are more comfortable with the Facebook-type “let it all hang out” mentality will have fewer qualms about it, I don’t know. Machines are storing vast amounts of information now, no matter how we may feel about it. For those who have Google accounts, take a look at your Web history if you haven’t yet. If you have never seen it, you might find it quite enlightening and perhaps even highly alarming. “They” (whoever “they” are) have a lot of information on you!

But another problem I have is that making everything so incredibly easy, with all of this information simply falling into your lap with little or no effort at all, would seem to be terribly numbing, and it reminds me of the story of the Land of the Lotus Eaters. For those who do not remember, this episode takes place in Homer’s Odyssey, as Odysseus and his men are coming home to Ithaca. Just before he meets the Cyclops, he discovers a land where people eat flowers. I quote from Robert Fagles’s translation:
“...on the tenth day we reached the land of the Lotus-eaters, who live on a food that comes from a kind of flower. Here we landed to take in fresh water, and our crews got their mid-day meal on the shore near the ships. When they had eaten and drunk I sent two of my company to see what manner of men the people of the place might be, and they had a third man under them. They started at once, and went about among the Lotus-eaters, who did them no hurt, but gave them to eat of the lotus, which was so delicious that those who ate of it left off caring about home, and did not even want to go back and say what had happened to them, but were for staying and munching lotus with the Lotus-eaters without thinking further of their return; nevertheless, though they wept bitterly I forced them back to the ships and made them fast under the benches. Then I told the rest to go on board at once, lest any of them should taste of the lotus and leave off wanting to get home, so they took their places and smote the grey sea with their oars.”

What was actually a death sentence seemed to those sailors bedazzled by the lotus, to be everything they could ever want, and all they had to do was reach out and eat the lotus. But those under its influence couldn’t even be aware of the dangers it held.

It seems to me that relying on such a “computerized lotus” to search for what it determines you need before you even know it, presenting the results before you have even thought about any of it, and very possibly in overwhelming quantities and complexity, would be the very prescription for killing curiosity. Of course, it is easy to expand such a scenario and imagine the spectre of some silent, ruling cabal behind everything, leading everyone to the resources it wants people to see, and we can picture ghoulish visions from the film “The Matrix”. Although such images are exaggerated, I believe dangers are looming even without them, and in any case, it demonstrates that the management of information really does have the potential to become a powerful political tool, especially in a world such as ours is becoming.

So to be frank, I am personally horrified by many of these possibilities and remain profoundly skeptical. For instance, although I have installed Mendeley on my computer, I still haven’t added any of my documents to it! Perhaps that is silly, yet I realize that skepticism, fear and even repugnance are natural reactions when someone confronts profound change. During the early days of printing, good Catholic folks were deeply shocked by some of the publications coming off the new-fangled presses. Although we may laugh and mock them today, such reactions are natural and easy to understand. Today we are witnessing similar reactions on the part of individuals, organizations, and even governments, to what they are seeing on the Internet.

I just hope that I can learn from the struggles of those people from the early days of printing as they tried to come to terms with what they saw, and let their experiences help me discover where the real problems lie. One point I am trying to learn: it is useless to get angry and try to clamp down on the changes we fear and perhaps even abhor. Stopping the changes doesn’t work and history takes a very dim view of you and your reactions.

At the same time, I admit I may be completely wrong, and it may turn out that innovations such as search will actually free the human mind from millennia of unnecessary, mindless toil and allow humanity to experience a new Renaissance and Enlightenment all at once. Let us hope so.

It seems to me, that search may very well represent a Darwinian challenge in the information environment that will force such deep and lasting change that it will prove to be evolutionary. I don’t think there is the slightest possibility of rolling things back to a former time, which I think most will agree was not really “better” at all, so there remains little to do but adapt to the new circumstances or die. Will a change toward a universal acceptance of search have an impact on libraries? Of course it will! I believe libraries are feeling the lightest initial impacts of search already, but of course, search itself is still in its infancy. Libraries will have to adapt to this in some way as well, or I think they will be fated to be discarded as anachronisms.

I wish I could offer some useful suggestions, but with new capabilities such as search--which will continue to develop in ways that are unpredictable--it seems everyone is entering virgin territory, whether they want to or not. All I can possibly suggest for librarians is to keep in mind the ALA Code of Ethics, especially those parts about trying to uphold “intellectual freedom” and not advancing “private interests at the expense of library users.”

For those who are interested in some philosophical reflections, I suggest a thought-provoking talk by the German journalist Frank Schirrmacher, available on the Edge website, where, among other things, he discusses these concerns in relation to free will and determinism, and possible political ramifications.

I would like to close with a wonderful piece by Andrea Falconieri, his Ciaccona, performed by the group Sonatori de la Gioiosa Marca from Treviso, a town in northern Italy. Falconieri worked around Italy and Spain in the early 1600s before dying from the plague. One episode from his life I found curious was that he lost his job at the Santa Brigida convent in Genoa, because the Mother Superior decided that his music was too unsettling for the nuns! Perhaps you’ll understand from this piece.

That’s it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.