Wednesday, June 29, 2011

The Catalog & the Public (Was: nice step toward what a catalog can become)

Posting to Autocat

On 28/06/2011 20:43, MULLEN Allen wrote:
<snip>
If I observe a shift on Autocat (or NGC4LIB, for that matter) so that a significant plurality of discussions for new directions in library-based discovery originates from a perspective of investigation and appreciation for the value of user interactivity and their contributions to the process of our services, the dichotomy will falter in my mind, just as it already does in your own. Until then, I observe study after study demonstrating dismal results for catalog use across public, community college, and university libraries, full of direct evidence for abysmally high failure rates. These studies span decades and don't show any significant trend toward improvement. Therefore, at the risk of curtailing the service of the catalog to highly experienced users (and catalog creators) such as most members of this list, I will continue to recognize the contributions of traditional metadata services but also recognize their inadequacies and prefer success for most users over traditional approaches.
</snip>
This is correct. Everybody has always repeated the mantra that the catalog is for the users, e.g. Cutter's "the convenience of the user should be preferred to the ease of the cataloguer," (I think it's in his "Rules", but I cannot find the exact quote at the moment) but he broke that rule himself once in a while, although he readily admitted it when he did. In spite of this declaration, the public has complained about the catalog from the beginning because it is so complicated, and they often complained that it did not do what they wanted. I have been going through the debates at the Royal Commission with Panizzi and other documents, and it is really enlightening to read their complaints in light of today.

The major problem has always been that there are so many different kinds of searchers, so many different types of needs, but there has always been only a *single tool* for everyone to search, from cataloger to layman. It appears to me that the arguments made by Panizzi reflect a different mindset from what we have today, or I will even say, the European mindset vs. the U.S. where in Europe there were (are) primarily closed stacks and in the U.S., primarily open stacks. For Panizzi, I think it is evident that the catalog was primarily a tool for the librarians, or perhaps it is more exact to phrase it that the catalog was the most important tool for librarians to help the readers in the library.

His arguments (that I have read so far) seem to hover around this stated desire of his to help the readers in the British Museum get the materials they wanted as quickly and efficiently as possible, and this had important implications. In his closed stacks, it meant that the public must utilize librarians. Therefore, the service that librarians are to provide to the public is always a central part of his discussions. This seemingly simple argument has major implications for the functions of the catalog, which wound up placing a lot of responsibility on those searchers, and they did not appreciate that *at all*. Nevertheless, he created what was essentially an efficient inventory tool that needed an expert to navigate, but of course, that is what reference librarians were for if someone had problems. Once people knew what they wanted, his library could provide the item very efficiently. Or, at least that is what he claimed.

So, this is how I am beginning to see it: for various reasons, many of them bureaucratic, we have created a tool with a history from at least the 1840s that almost every member of the public has had major troubles with. Librarians continue to need a really good and powerful inventory tool--we need it, there can be no doubt about that--but our users have had little need for such a tool, and the more we try to convince them that they are wrong and *they really do need it*, the more they resist and turn away. To use it even semi-efficiently, people have to learn and know quite a bit, but the number of reference questions is far down, and online catalog reference simply doesn't work. This has been happening for a long time, and the difference is that people had no alternatives back then: they could either use what Panizzi built them or go without; whereas today, there are *very attractive* alternatives for the public that could not be imagined back in the 1840s.

Someone could conclude from this that the catalog is obsolete, at least it is for the public, but the discussion doesn't end here. In my experience, the public often assumes that what they are getting through Google ... [et al.] actually provides authority control, that is, when they search Dostoyevsky, they think they are looking at "the set of resources about Dostoyevsky" when the librarian knows they are not. Or when they search "war on terrorism" that they are getting everything about that. People rarely search *text* but normally search *concepts*: of people, things, places and so on. They make the completely incorrect assumption that when Google returns 6 million hits, they must be getting "everything" since they have never thought about the difference between searching "text" vs. "concepts" and they have no idea whatsoever of the complexity of such a task. We see the same fallacy in the library all the time when somebody asks: where are your books on immigration? Or where are your books on art? The only answer is: all over the place. Look in the catalog!

It turns out that explaining the fallacy of this way of thinking is very difficult, because people are resistant: many desperately want to "trust" the results in Google and do not like even to think about these kinds of problems. When pressed, many respond that 6 million hits is already plenty to work with, but the fallacy in their assumptions still remains: if certain resources are not in the result set, it doesn't matter if that set is 12 million or 20 million.

In my own opinion, it is obvious that the inventory aspects of catalogs must continue for our purposes but they should be hidden from the public users. This is one reason why it becomes more and more obvious that the FRBR user tasks are actually the FRBR librarian tasks. The fact is, people want to spend as little time as possible with the catalog because they want the resources themselves. There is no longer a need for everyone to use exactly the same tool, as they did back in the 1800s; there are now all kinds of options available for everyone involved, including us. The "omnipresent librarian" is no longer a reality, nor was it all that much appreciated, even back in Panizzi's day.

Some words of Cutter may be appropriate. He gave a report to ALA in 1885, and he reproduced it in his "Rules". Although he was speaking about transliteration, his comments apply more generally:
"In determining the principles of transliteration it must be remembered that a catalogue is not a learned treatise intended for special scholars. It is simply a key to open the doors of knowledge to a partly ignorant and partly learned public, and it is very important that such a key should turn easily. A good catalogue therefore, will be a compromise between the claims of learning and logic on the one hand, and of ignorance, error and custom on the other."
http://www.archive.org/stream/rulesforadictio08cuttgoog#page/n117/mode/1up p. 108.

Tuesday, June 28, 2011

Re: nice step toward what a catalog can become

Posting to Autocat

On 28/06/2011 17:12, Laurence Creider wrote:
<snip>
Yes, a major flaw. One might even say that what the catalog becomes in such a case is no longer a catalog; this finding tool has ceased to fulfill one of the most basic functions stated by Cutter, the Paris Principles, and the Statement of International Cataloging Principles.

The look may be fine, the bells and whistles are nice, but the baby blunder means the project flunks, as do any managers who signed off on such a defective tool. Is there any reason in this day and age that collocation should be impossible?
</snip>
Looking at a few headings, Napoleon, even United States, it looks to me as if they may have decided to display only the subfield a. This may be either a design error (the headings can get awfully long) or perhaps the designers didn't really understand how bibliographic records work. It also could be that only a truncated record was brought over into this system.

Yet, the subject of this thread is "*nice step* forward" and I still think it is, although it can be improved. If, in this case, the full records are in there and the web designer was simply unaware of the additional information, that is the sort of problem that can be fixed in probably about 15 minutes but *designing* for the longer headings is obviously more difficult. Even Worldcat has similar displays to the ones at NYPL, e.g. for "napoleon emperor", his entire heading is not displayed:
http://nypl.bibliocommons.com/search?t=smart&search_category=keyword&q=napoleon%20emperor&commit=Search&searchOpt=catalogue
and
http://www.worldcat.org/search?q=napoleon+emperor&qt=results_page

In Worldcat, the links in the left column are only to "Napoleon". What we see with the 111 with Napoleon I Emperor, are all from OAIster where the internal coding has probably been lost in the various conversions.

So, I have sympathy for the web designer: how is he or she supposed to come up with a decent-looking display with headings like "International Conference "21st Century Slavery--the Human Rights Dimension to Trafficking in Human Beings" (2002 : Rome, Italy)" or "García y García, Julio Gabriel Salvador, 1928 or 9-" or "Austur-Eyjafjallahreppur (Iceland)"? I'm not even mentioning the subject arrays! It seems as if a better cut-off would be not by a subfield a, but only display a certain number of letters, then the searcher could roll the mouse over it or click it to see the entire heading(s).
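The cut-off idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `truncate_heading` and `heading_span` and the 40-character default are hypothetical, and nothing here reflects how the NYPL or Worldcat displays are actually built. The full heading goes into the HTML `title` attribute, so that rolling the mouse over the shortened text shows the entire heading.

```python
from html import escape


def truncate_heading(heading: str, max_chars: int = 40) -> str:
    """Return the heading unchanged if it fits, else a shortened form ending in an ellipsis."""
    if len(heading) <= max_chars:
        return heading
    # Prefer to cut at the last space before the limit so words are not split.
    cut = heading.rfind(" ", 0, max_chars)
    if cut <= 0:
        cut = max_chars  # no usable space: fall back to a hard cut
    return heading[:cut].rstrip(" ,;:(") + "…"


def heading_span(heading: str, max_chars: int = 40) -> str:
    """Build a <span> showing the short form, with the full heading as hover text."""
    full = escape(heading, quote=True)   # safe inside a double-quoted attribute
    short = escape(truncate_heading(heading, max_chars))
    return f'<span title="{full}">{short}</span>'
```

With a limit of 30, "García y García, Julio Gabriel Salvador, 1928 or 9-" is displayed as "García y García, Julio…", while a heading such as "Austur-Eyjafjallahreppur (Iceland)" fits within the default limit and is shown whole.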

This is what development is: not to simply declare something a disaster, but to point out how it can be improved, try it, and see, through trial-and-error. Ideally, the designers should genuinely welcome this type of input.

Re: rbms seminar

Posting to NGC4LIB

On Mon, 27 Jun 2011 09:20:47 -0400, Eric Lease Morgan wrote:
<snip>
Last Thursday (June 23) the Rare Books and Manuscripts Section of ALA hosted a seminar on "next-generation" library catalogs, and there were three of us presenters:

3. myself (Notre Dame) where I asked the question "Are we there yet?" http://bit.ly/mcF27t and I answered "No, not IMHO."
</snip>
Thanks for this excellent report, but I do have one point of difference: mentioning that Koha and Evergreen are "free" as in a "free kitten," where you discover that in reality, you have taken responsibility for this kitten, which means to feed it, teach it not to tear up your furniture, spend money for the vet, etc. Freeware and open source software are not "free" in this sense either, since anybody who has undertaken such a project quickly finds out that there is maintenance of the system, you have to do lots of server maintenance, you need to defend against attacks, and so on and on.

But this is not what freeware and open source software are. Richard Stallman is the father of free software, and Sam Williams's biography of him is titled "Free as in Freedom". (You can read it for free at http://oreilly.com/openbook/freedom/).

In my experience, this is very difficult for catalogers to really understand. Much of their training is: this is how you do this; I have done it this way for several years; now we have a new system and you no longer do it this way, you do it that way. Much of cataloging unfortunately has an "automaton" aspect to it and the advantage of freeware/open source software is that it can emancipate you from this way of thinking and working, so that you *can* say: I don't want to do it that way, I have a better way of doing it. With open source, if you have the knowledge, you can just change things yourself, right then, without asking permission from the "owners". If you want a link from one page to another, it only takes a minute. If you have an idea to actually have your database interoperate with another database, something more complicated, you don't have to beg the developer to *please, please, please!* implement this, wait for him to get around to it and then pay through the nose; you can just do it yourself, or hire somebody else if you don't have the knowledge.

Still, it is difficult to free yourself from the traditional need to adjust yourself to the system as opposed to the new need to adjust the system to yourself. Both demand some responsibilities on the part of the user base. But getting catalogers to think in these ways is difficult. They certainly can--and have done so--but for them, it is a step outside what they normally do. This is why I say that imagination is what is needed now: catalogers (and not only catalogers) need to speak out about what they don't like in their systems, and suggest better ways, but this is much easier said than done.

An example from my own career: we had cataloged in RLIN and inserting diacritics demanded something like three keystrokes; we switched to NOTIS and I needed (I think) four keystrokes; we switched to Voyager and it needed something like six or eight keystrokes! I remember thinking: "Man, this is going the wrong way!" and I began to experiment with MacroExpress. I remember getting it down to the same number of keystrokes as in NOTIS, and I thought: "Why not go back to RLIN?" So I did. But then I thought, "You know, I never did like doing it that way anyway. What more can I do?" I remember getting the keystrokes down to fewer than in RLIN, and in fact, I was even able to make a little keyboard with Russian characters that allowed students who knew Russian to input correct transliteration with no training!

I think there are lots of improvements out there just waiting for someone to give them a voice!

Sunday, June 26, 2011

Re: nice step toward what a catalog can become

Posting to Autocat

On 25/06/2011 18:44, MULLEN Allen wrote:
<snip>
In short, while the opinions of the "rank and file" may not matter to you, Jim, the ability of library virtual bibliographic presence to support and be enriched by the users beyond the cataloging community, is a keystone to transcending Googleization. The answer does not lie in the cataloging community, nor the programmers, nor library decision makers - it lies in incorporating and unleashing our users.
</snip>
Whether I like the opinions of the rank and file doesn't matter--my concern is to create tools that make "nice steps toward what a catalog can become", as you put it so well. Some may believe that the opinions of the rank and file are far more relevant to their own needs than the paid, and sometimes "bought and paid for" opinions of the so-called experts, while others concentrate on accepted "scholarly" opinion. I like to think that I am in the middle.

But as librarians, we make tools that *help* the public find relevant information that *they* want, not that *we* want--tools that are more useful than what they find today. These tools do not need to be perfect, just as getting the opinion only of the rank and file, or only of the "upper class" is also not perfect, but making it all easier to find is a nice step forward. To do substantially more we will need cooperation from all kinds of players, plus more powerful tools for searching.

Still, what I wanted to show is that there is a lot we can do right now just by using the technology we have and we don't have to kill ourselves with the despairing, Sisyphean task of manually making links to materials that we know will break eventually. Tools can be made to do a lot of the more distasteful tasks, although it may mean that we lose some control over the final product. And if those tools are seen as useful, we can continue to improve them. Or to not improve them if people don't like them. If they prove not to be useful, they can be retired far more easily because there is much less lost than if catalogers would start putting in gobs and gobs of links into individual records by hand. That is not the kind of road to go down--at least, not any longer.

Saturday, June 25, 2011

Re: nice step toward what a catalog can become

Posting to Autocat

On 24/06/2011 22:17, MULLEN Allen wrote:
<snip>
Here's a search result from the recently unveiled NYPL site: http://nypl.bibliocommons.com/item/show/18763007052_the_last_olympian Well done!
</snip>

This is a very nice step, although not a final one. As Pat Sayre-McCoy pointed out, not everyone wants to read the opinions of the rank and file. Can something be built fairly easily that would be an improvement? I think something can, but again, it is a "nice step", not a final one. Earlier, I built a site that lets you go to the latest book reviews in some of the most important newspapers and magazines, etc. at http://www.jweinheimer.net/news/bookreviews.html. From just the US newspaper book review sites, I created a Google Custom Search page http://www.google.com/cse/home?cx=000565588499581966796:mvnd8aio2za&hl=en.

This just took a few minutes, but from this widgets can be made to automatically search for reviews using the information from the catalog record, by taking the exact 245 and first author, to create a query, e.g. for Great Soul: Mahatma Gandhi and His Struggle With India By Joseph Lelyveld
or Untold Story by Monica Ali
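The query-building step of such a widget might look roughly like the following sketch. It assumes the title arrives as the 245 title string (possibly with trailing ISBD punctuation) and the author as an inverted catalog heading; the function names and the `base_url` parameter are hypothetical, and the address of the actual search page is deliberately left to the caller rather than guessed at here.

```python
from urllib.parse import urlencode


def review_query(title_245: str, first_author: str) -> str:
    """Combine the exact title (as a quoted phrase) with the author in direct order."""
    # Drop trailing ISBD punctuation such as " /" or " :" left over from the 245 field.
    title = title_245.strip().rstrip(" /:;.,")
    # Turn an inverted heading ("Ali, Monica" or "Lelyveld, Joseph, 1937-")
    # into direct order ("Monica Ali"); dates and other additions are ignored.
    parts = [p.strip() for p in first_author.split(",")]
    if len(parts) >= 2 and parts[1]:
        author = f"{parts[1]} {parts[0]}"
    else:
        author = parts[0]
    return f'"{title}" {author}'


def review_search_url(base_url: str, title_245: str, first_author: str) -> str:
    """Append the query to a search page address supplied by the caller (an assumption)."""
    return base_url + "?" + urlencode({"q": review_query(title_245, first_author)})
```

For the second example above, `review_query("Untold Story /", "Ali, Monica")` gives the query `"Untold Story" Monica Ali`, which a catalog page could pass to whatever review search it links to.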

This is the kind of tool that is quick and easy to make and update, *free*, and one that can undoubtedly be improved in all kinds of ways. There are probably better tools out there and ways to improve the search, but something like this would be another nice step forward to provide people with something better than what exists now.

Friday, June 24, 2011

Re: Expressions of manifestations (?)

Posting to Autocat

On 24/06/2011 16:01, Brenndorfer, Thomas wrote:
<snip>
Gene Fieg wrote:
We just came across something that may or may not be addressed in the new discussions of the new age of cataloging: expressions of manifestations.

We get the translated works of Bonhoeffer, published by Fortress Press. After consulting our vendor's website, there seems to be a one to one relationship between original volumes of collected works in German (by the German publisher) and the volumes of the English translations.

Since these are translations of the German volumes, is there any to indicate that in the record so that, for instance, the patron could compare the German with the English translation.
There are two types of expression relationships going on here.

The vertical relationship ("primary relationship") is Expression Manifested, and it's a Core element in RDA. Basically, this relationship just indicates that in hand is the English translation of a particular work in this particular manifestation.
...
</snip>
I think this exchange shows that some new thinking is in order. People have been complaining about how lousy the current PCC records are, and now we are all supposed to think that catalogers can supply something like this?! Expecting records of this complexity and difficulty is just beyond the realm of possibility if we can't even supply records of less complexity today. Sooner or later, I think there has to be a sense of what can be realistically attained with our ever-diminishing resources. In a more updated catalog, how could we furnish users with much the same functionality using what we have *right now*? I believe it is more than possible.

Right now, I can search Worldcat as "au: Bonhoeffer, Dietrich, 1906-1945 ti:ethik" and retrieve: http://www.worldcat.org/search?q=au%3ABonhoeffer%2C+Dietrich%2C+1906-1945+ti%3Aethik&qt=advanced&dblist=638
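For anyone who wants to generate such links from a record, here is a minimal sketch that rebuilds the address above from the author heading and the uniform title. The function name `worldcat_author_title_url` is hypothetical; the `q`, `qt` and `dblist` parameters are simply copied from the link in this post, not taken from any Worldcat documentation, so treat them as an observation rather than a supported interface.

```python
from urllib.parse import urlencode


def worldcat_author_title_url(author: str, title: str) -> str:
    """Build an author/title search link in the same shape as the one quoted above."""
    query = f"au:{author} ti:{title}"
    params = {
        "q": query,
        "qt": "advanced",   # copied from the example link
        "dblist": "638",    # copied from the example link
    }
    # urlencode percent-encodes ":" and "," and turns spaces into "+".
    return "http://www.worldcat.org/search?" + urlencode(params)
```

Calling it with "Bonhoeffer, Dietrich, 1906-1945" and "ethik" reproduces the address given above character for character.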

This uses the power of the uniform title and author to provide a very handy display that the searcher can re-sort in various ways: relevance, author, dates, and in the left-hand column, find the dates, other authors (in this case, no others), languages, and "topics" (although I have never completely understood how this part works). This power exists *only because* the catalogers have assigned uniform titles correctly and consistently.

This retrieves a very small result set, so let's try with something more: Homer's Odyssey. http://www.worldcat.org/search?q=au%3Ahomer+ti%3Aodyssey&qt=results_page Here, we can limit in all kinds of ways: by books, audiobooks, videos, etc.; we can limit by Pope's translation, by date of publication, by language, etc. etc. etc.

Can this be improved? Of course it can, and undoubtedly it will, but everyone needs to recognize that modern computer systems allow the catalog to extract information from the individual records in the search result and display it, as in this case, where the searcher can see the other authors, dates, languages, plus sorts and so on. Consequently our catalogs already *can do more than anyone could have imagined just 25 years ago*. It would completely blow the minds of our "barbarous" ancestors from the 19th century (to borrow an expression from Thomas Jefferson).
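The extraction described here is not complicated in principle. The sketch below counts the values of one field across the records in a result set, which is essentially what a left-hand "languages" or "dates" column displays. The record layout (plain dictionaries with `language`, `year` and `other_authors` keys) and the function name `facet_counts` are assumptions made for illustration; a real system would read these values from the MARC fields.

```python
from collections import Counter


def facet_counts(records, field):
    """Count how often each value of `field` occurs across a result set."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value is None:
            continue  # a record without this field contributes nothing
        # A field may hold one value or a list of values (e.g. several translators).
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    # Most frequent values first, as a facet column would list them.
    return counts.most_common()


# Invented sample records, only to show the shape of the input.
sample_results = [
    {"title": "The Odyssey", "language": "eng", "year": 1996, "other_authors": ["Fagles, Robert"]},
    {"title": "The Odyssey", "language": "eng", "year": 1726, "other_authors": ["Pope, Alexander"]},
    {"title": "Odyssee", "language": "ger", "year": 1958},
]
```

Here `facet_counts(sample_results, "language")` puts "eng" first with a count of 2, followed by "ger" with 1. The point is that the counts come entirely from what is already in the records, which is why complete and consistent records matter so much.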

Does it take some incredibly expensive, state-of-the-art catalog to get these kinds of displays? No, not at all. Koha allows it and the price is *free*! That's right, it's open source and available for download by anybody. Here is an example at Middletown Township Public Library: http://kohaopac.mtpl.org/cgi-bin/koha/opac-search.pl?idx=au%2Cwrdl&q=homer&idx=ti&q=odyssey&idx=kw&do=Search&sort_by=relevance

These are the sorts of directions our profession needs to take if we are to retain any kind of relevance to our public or even to society. The main task is to continue to create complete and consistent catalog records. Plus, we need to accept that FRBR merely presents a 19th century vision and people have moved on. At the same time, computer systems are fabulously powerful today. Let's figure out how to push those computer systems to their limits.

We need to work smarter.

Wednesday, June 22, 2011

Book chapter available "Realities of Standards in the Twenty-First Century"

Just to announce that the chapter I wrote in "Conversations with Catalogers in the 21st Century" published by Libraries Unlimited, is available for free to those who are interested. My chapter was titled "Realities of Standards in the Twenty-First Century" and is now available in the E-LIS open archive at http://hdl.handle.net/10760/15838

I did the scanning on a machine I had never used before, so everybody gets to see my fingers! Sorry about that, but ...

Tuesday, June 21, 2011

Re: Graduate student cat question

Posting to Autocat

On 21/06/2011 18:24, Janet Hill wrote:
<snip>
Perhaps we shouldn't be tricked into talking about "professional" catalogers, but should remember that we are talking about librarians who catalog. That is, the term "librarian" is applied to those who have earned the right to call themselves librarians through completion of the accepted professional degree. The term "professional librarian" is redundant, because if one does not have the appropriate professional credential, one is not -- by definition -- a librarian. Other individuals who work in libraries, but who do not have the accepted minimum academic credentials are not librarians even though they may perform their work in "a professional manner."
</snip>
Also, concerning Janet Hill:
On 21/06/2011 20:20, Janet Hill wrote:
<snip>
 I am using the standards of the profession in the United States, as articulated in ALA policy.
</snip>

Although I personally agree with all of these considerations, unfortunately (or not, depending on each person's opinion) there are lots of professional librarians without the librarian degree. They certainly wear the title of "Librarian" and are equal in status to those with the degree, if indeed, not "more equal" than everyone else. I mentioned Jeff Trzeciak's talk earlier (here's a link to the Annoyed Librarian's blog http://blog.libraryjournal.com/annoyedlibrarian/2011/04/18/academic-insecurity/) where he mentioned that he will probably not be hiring any more librarians into the library.

It appears as if the library degree is being devalued, and one of the reasons *may* be that libraries themselves are much less willing to do the post-degree training that they were willing to do earlier. Therefore, people come out of library school educated much as they have been, but libraries want more today. Then, they somehow assume that a person with a PhD in any subject is better trained (somehow) and in any case, there are a lot of unemployed PhDs floating around. I don't know what the solution is, but it's one reason why I think certification may be part of the solution--and long overdue anyway. (This reminds me of Ernest Richardson at Princeton who wrote somewhere that you need a PhD to classify a book! When I read this, I thought: a PhD in what? By the way, the classifier was Henry Van Hoesen, later head librarian of Brown University, who had his PhD in classics, I believe)

I do not compare librarianship to being a custodian, but I do compare it with being a mechanic. The possibility of comparing my skills with those of a master mechanic, or a master electrician, etc. is--I think--nothing to be ashamed of and in fact, quite a compliment. That is not easy, but this betrays my proletarian background.

It would seem that working in a small library is simpler and easier than a large library, but I can honestly state that is incorrect. You have to be able to do everything, while in the large library, you have the "luxury" to specialize.

Re: Graduate student cat question

Posting to Autocat

On 20/06/2011 22:41, john g marr wrote:
<snip>
I disagree that "certification" can guarantee competence. People who just want to *get by* will not change (especially if no one is checking their work), unless, through some critical thinking training (and excellent constructive supervision) they can be motivated to do more.
</snip>
Certification does indeed guarantee competence; that is its very purpose. If all someone does is "get by and not change", that person will not get the certification and cannot practice in the field. This way, an employer (e.g. you) knows that your dentist, who may have gotten his degree 35 years ago, did not stop learning but has kept up with the newest methods and technologies whether he wanted to or not. If he had not demonstrated this to a jury of his peers (i.e. the certification board) he would not be licensed to work on your teeth and would be barred from his practice. (I am reminded of that incredible novel "McTeague" by Frank Norris dealing with this! http://www.archive.org/search.php?query=title%3AMcTeague%20creator%3Afrank%20norris%20-contributor%3Agutenberg%20AND%20mediatype%3Atexts)

If someone took a couple of cataloging courses during their library degree 20 or 30 years ago, and did not increase their knowledge, are they competent catalogers today? The 20-year-old degree does not guarantee that. Today, we look at someone's resume and see if they have been working as catalogers or not, thereby assuming that they have kept up with the field, but other professions have shown there are actual ways of measuring this competence, and one way is through certification.

When there is no certification or enforceable standards, the public is left helpless and must simply *hope* that the expert they are paying actually knows what he is doing and will do a decent job. Critical thinking and constructive criticism can motivate someone, but being stopped from practicing because of a lack of demonstrable competence is an even better motivator!

Here is an organization that certifies records managers http://www.arma.org/careers/certification.cfm. There is certification in computers, accounting, and all kinds of professions. If there were certification for metadata creation, then somebody who wants to become e.g. a music cataloger and can't get an opportunity to do it at work could get a separate certification that would increase his or her future possibilities. Correctly implemented, certification empowers the individual who wants to get ahead.

Re: Suggested RDA improvements

Posted to Autocat

On 20/06/2011 18:48, MULLEN Allen wrote:
Concerning my remarks on:
<snip>
1. streaming of publisher/outside metadata
2. inheritance of work/etc. allows for less rekeying
3. potentially, a significant enough increase in the value of non-silo discovery systems that libraries will find alternatives more cost-effective than local record by record editing
4. more effective input of data through use of established element sets and other linked data potentials, lessening the amount of keyed input

I would simply indicate that the alignment of RDA and ONIX as part of the development argues for #1, it would seem. As for #2, again speaking from ignorance I would say that the depth and extent of relationships and differences in necessity for descriptions at the various levels (work, manifestation, expression) make this a somewhat different beast than the type of inheritance presently used for copying or deriving from existing records in the present environment. With regard to #4, I'd simply respond that if RDA is the vehicle that can accomplish these more indirect goals, this still can result in net cost reduction for libraries whether the same thing could be accomplished through different means or not.
...
Those advocating for RDA (which does not include me, by the way - I'm simply attempting to understand both the forest and the trees without being relentless critical and dismissive) would do well to provide much better indications of the worth ("business case") if there is any desire to convince a critical mass of the cataloging community. Convincing catalogers may not be a particularly fruitful or necessary endeavor, but again, I invite those in the RDA development community, and those well-versed in specifics of the intent and goals in the area of library-based metadata development, to help us understand the vision more clearly in our niche, similar to how Kelly McGrath and the OLAC testers have done with their FRBR-inspired work as a way to see the potential RDA offers for end-users. Lacking this, I expect this debate will continue to lack much of the knowledge base that will help achieve greater understanding.
</snip>
I agree with almost all of this. What is different is my attitude since I think it is time for the RDA community to demonstrate the tangible advantages of RDA, while the rest of the cataloging community should be encouraged to question those assertions very strongly. Still, *if* it can be demonstrated that publishers will be more forthcoming to supply ONIX metadata that will follow RDA *as opposed to AACR2*, then this would be a very important point in its favor. I have yet to see the slightest evidence of this. Concerning less re-keying with the WEMI framework, I maintain that it is difficult for me to imagine how it would be any better than how we do derives currently. I am more than willing to be shown, to be surprised, and to change my mind, but I am skeptical nevertheless. The report itself concluded that RDA would not increase production.

Finally, if RDA can be shown to work with other metadata systems *better than what we have now*, this would also be extremely important. But exactly what parts of the new RDA rules will accomplish any of this? Typing out the abbreviations? Getting rid of Latin? Getting rid of O.T. and N.T.? Eliminating the rule of 3 and [sic]? Maybe main entries for treaties? Or the 33x fields? This is when I find myself shaking my head and thinking: Let's get real ...

I am completely 110% in *favor* of all those points you make, but the reality of RDA and the changes it proposes have *nothing whatsoever* to do with any of those points you mention. So, we are left with just more promises for a vague FRBR-type of future that (I believe) doesn't seem so great anyway, while we are all expected to hold our breaths, spend lots of money and resources for.... what? It seems to me that we would be in exactly the same place, if not a few steps back. For what it's worth I, for one, am not buying it.

Perhaps in less severe economic circumstances, everyone could be less exacting with accepting the new rules and figure: well, let *them* figure it out because we can trust that *they* know what they are doing and will do best by us. I am not buying any of that either.

This is why I compared the catalog to the trilobites. Of course, I think and hope we are smarter than that, but the advantages of RDA need to be proven and demonstrated. There is absolutely nothing wrong with pointing this out since it will have to be done sooner or later.

I think people are fed up with the theories and the ifs and maybes and if onlys, and are beginning to insist on something substantial.

Monday, June 20, 2011

RDA Test Final Report

Posting to RDA-L

One of the most remarkable findings in the RDA Report is on p. 103:
"Intriguingly, a majority of test institutions thought that the U.S. community should implement RDA while at the same time, a majority believed that the implementation would have a negative impact on their local operations.
...
In general, formal test partners reported that they needed more time to create or update an RDA record than a record using their current rules; this was clear from responses to IQ question no. 2, “Please provide any general comments you wish concerning the test, the RDA Toolkit, or the content of the RDA instructions.” Comments showed major concern over the initial costs of an implementation that would be evident in reduced production for an unpredictable length of time. Several commenters also stressed the need for some kind of bridge document, possibly as part of the workflows in the RDA Toolkit: “The RDA instructions are organized according to FRBR and FRAD principles while the descriptive cataloging process remains linear by format. We found RDA to be a collection of rules that are ordered without respect to our existing workflows ….”"
I will grant that once people are trained, they will probably be able to make RDA records more or less as quickly as AACR2 records. But it seems obvious that an increase in productivity is not a reason to move to RDA; therefore, we must look elsewhere for reasons. Other reasons are addressed on p. 111, where people mentioned the benefits of RDA. Analyzing the benefits is also quite fascinating. They mention changes in "perspectives" and "how we look at the world". RDA "encourages" developments, is felt to be more "user-centric", and offers other highly nebulous benefits that make very little difference in a business case. I disagree with much of this, but it goes on to mention some specific benefits:
"RDA testers in comments noted several benefits of moving to RDA paraphrased as follows:
  • using language of users rather than Latin abbreviations,
  • seeing more relationships,
  • having more information about responsible parties with the rule of 3 now just an option,
  • finding more identifying data in authority records, and
  • having the potential for increased international sharing – by following the IFLA International Cataloguing Principles and the IFLA models FRBR and FRAD.”
My remarks:

- The inescapable fact is that the public will still be seeing Latin abbreviations until the end of time so long as there is no recon project. Therefore, merely changing the rule solves *nothing whatsoever* for our users, and we must be very clear about that.

- Seeing more relationships, that is, making the current relationships between records more explicit, may be useful to the public or not. According to the rest of the report, figuring out the exact relationship of an entry seems to be much more difficult for the cataloger than just making the entry in a 7xx field. I would say that this needs to be studied further to find out what the utility is to our users.
This reminds me of an actual instance in my own practice where an organization was considering adding "cause-effect" and other relationships for catalogers to encode in their subject descriptors. I gave an example where a cataloger assigns the terms "deforestation" and "desertification" as simple keywords, and compared this to having to add them as cause-effect relationships. Either can be a cause or effect of the other, and the only way to find out is to spend additional time actually reading the document much more closely than you would otherwise, and even then, I found cases that were very unclear. I felt that the effect on productivity could be quite large, as much more time would be needed for reading, considering, consulting, and so on, which would be, in effect, making judgments that others could very easily disagree with. The ultimate utility to the user appeared quite negligible and, I argued, potentially even harmful, since someone searching this could relate in entirely different ways to the information in the document. To repeat, studies need to be made to decide if the supposed gain of users seeing the relationships more explicitly is worth the necessary drop in production.

- Eliminating the rule of 3 "cuts both ways". As it stands now, it could just as easily wind up making fewer access points available. This also needs to be made very clear.

- More data in authority records is another one that "cuts both ways". If we add more data to authority records and actually increase the number of responsible parties in each record, total productivity must go down. To be fair however, if the additional data in authority records could be crowdsourced or even better in the current world: work with tools such as dbpedia, perhaps through tools such as Worldcat Identities, then this one could actually work. This has little to do with RDA however, and everything to do with changes in MARC format.

- having the "potential" for international sharing: a nebulous benefit once again. It would appear to make more sense that if we are to achieve increased international sharing, we should follow ISBD more strictly than less so. The authorized forms can be shared through the VIAF project, which, as Bernhard points out, is one place for interoperability. Still, in the VIAF we see very clearly how the actual *form of the name* becomes much less important than the concept (where we can see that one concept can have many forms of name), therefore the rules for "establishing the authorized form of the name or concept" as RDA does, are becoming anachronistic, since the actual textual forms are far less important than the concept URI, e.g. see http://viaf.org/viaf/35605/#John_Paul_II,_Pope,_1920-2005

Need for Certification (Was: Graduate student cat question)

Posting to Autocat

On 19/06/2011 08:26, Jennifer Morrissey-Myatt wrote:
<snip>
Hello,
I am a MLIS graduate student and will be in my final semester at SJSU this fall. I have not taken any cataloging classes but I did complete two internships to learn cataloging as well as coursework for LIBR240 Information Technology & Tools (I somehow thought it was cataloging related - not sure what I was thinking there).

In my job at the Yolo County Library, I completed a 4-week online course on cataloging basics and am currently training on the job to learn cataloging.

My question is this: in the real world of library jobs, would this alternate way of learning cataloging qualify in place of coursework in LIBR 248 & 249 (beginning & advanced cat)?
</snip>
I agree with the comments the others have given, but there is still a major sticking point in all of this: certification. If a professional field demands a special certification, then that should at least count for something. In this case, the certification is the MLIS. After getting the master's degree, you can find yourself in a very ambiguous situation: you are not a librarian, you are not a cataloger, you are not a manager. In fact, after receiving the degree, it's probably easier to describe "what you are" by "what you are not", since what has really happened is: you have finally arrived at the starting line. It seems that in earlier times, organizations accepted this more easily: the recent graduate had "paid his or her dues," had shown they were serious in devoting themselves to the profession, so the organization would then be willing to risk the real resources to train you into the profession of reference, or cataloging, or acquisitions, or rare books, or whatever. It seems as if organizations are much less willing to do this today and prefer people to arrive "already trained".

Of course, this is unfair to the recent graduate and also unfair to those who teach the master's programs since the idea of education was never to produce finished professionals. Even law schools, business schools or medical schools do not do that. Still, we can bemoan what has happened, but we nevertheless have to deal with the facts as they are.

Do you need an entire MLIS to become a cataloger? No. And probably most people who took a cataloging class 30 years ago have forgotten practically all of it. Therein lies the reason why some are devaluing the degree. For some of the latest controversy, see http://laikaspoetnik.wordpress.com/2011/04/20/a-library-without-librarians-the-opinion-of-a-phd-librarian-on-the-jeffrey-trzeciak-controversy/

Jeffrey Trzeciak wants to hire PhDs instead of people with an MLIS. My own opinion is that he is making precisely the same error, only on a different scale. Someone who has only gone to school and wound up with a PhD, although that is certainly good, is in exactly the same situation as I mentioned above: you are still not a librarian, you are still not a cataloger, you are still not a manager; you are standing at the starting line, only now you have a PhD instead of an MLIS.

Whether the MLIS is retained as a prerequisite or not, I think more and more that a separate certification for the field of "metadata creation" should be required and updated periodically, much like a doctor or dentist. If that were the case, I think the field of cataloging would command much more respect.

Sunday, June 19, 2011

Re: Announcing the LC/NAL/NLM RDA Implementation decision

Posting to Autocat

On 18/06/2011 22:44, Williams, Ann wrote:
<snip>
In an RDA world with new discovery tools, would we be better off keeping the eresources on their own records rather than adding them to the paper text record?
</snip>
I don't think it will make much difference with RDA since record structure remains the same, but in an FRBR world, this is the sort of distinction that would, in essence, go away since the unit record would disappear. The way it would work would be similar to a 19th-century book catalog. For example, if we look at "A catalogue of the books of the Boston Library Society in Franklin Place" (Jan 1844)
http://books.google.com/books?id=5uTTV2GYNxIC&pg=PA252#v=onepage&q&f=true
we see:
Shelf Na
54 23      Smith, H. and J. Rejected addresses
                 Boston. 1813. 12mo
99 25        Another copy

This could be added to with:

Shelf Na
54 23      Smith, H. and J. Rejected addresses
                 Boston. 1813. 12mo
99 25        Another copy
Internet         Boston, 1841 LINK
                  3rd American ed. "From the 19th London edition"
Internet         London, Methuen & Co., 1904 LINK
                  With an introduction and notes by A.D. Godley

[there are many more editions in the Internet Archive and Google Books, and probably other places as well]

This is what I see as the essence of FRBR: it gets away from the unit record and introduces another system based on book catalogs. This is why deciding whether something goes onto the same record or not will no longer make much difference in an FRBR universe because the unit record disappears. RDA does not posit these changes.

I do not question for a moment that such a display is better for users, and no cataloger from any time would have ever questioned it. What I do question is whether this is what people really want *today*, and more importantly, whether such displays can be generated from our formats right now. After all, if they can be generated from what we have now, why go through all the pain and suffering of retraining & retooling everybody and everything, plus making our users wait even longer when the final product could be created today? I personally think it could.
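To make that last point a little more concrete, here is a rough Python sketch of how a book-catalog style display might be generated from ordinary flat records. It is only an illustration under stated assumptions: the field names (author, title, imprint, location) are invented for the example and are not MARC tags, and grouping on an exact author/title match is a naive stand-in for real work-level clustering. The data comes from the Boston Library Society example above.

```python
from collections import defaultdict

# Flat "unit records", one per copy or online version (field names are hypothetical).
records = [
    {"author": "Smith, H. and J.", "title": "Rejected addresses",
     "imprint": "Boston. 1813. 12mo", "location": "54 23"},
    {"author": "Smith, H. and J.", "title": "Rejected addresses",
     "imprint": "Boston. 1813. 12mo", "location": "99 25"},
    {"author": "Smith, H. and J.", "title": "Rejected addresses",
     "imprint": "Boston, 1841", "location": "Internet"},
    {"author": "Smith, H. and J.", "title": "Rejected addresses",
     "imprint": "London, Methuen & Co., 1904", "location": "Internet"},
]

def group_for_display(recs):
    """Group flat records by (author, title), then by imprint."""
    works = defaultdict(lambda: defaultdict(list))
    for rec in recs:
        work_key = (rec["author"], rec["title"])  # naive work-level key
        works[work_key][rec["imprint"]].append(rec["location"])
    return works

def render(works):
    """Return display lines: one heading per work, one indented line per edition."""
    lines = []
    for (author, title), editions in works.items():
        lines.append(f"{author} {title}")
        for imprint, locations in editions.items():
            lines.append(f"    {imprint} [{', '.join(locations)}]")
    return lines

works = group_for_display(records)
for line in render(works):
    print(line)
```

The point of the sketch is only that the grouping happens at display time, while the underlying records stay exactly as they are.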

Does someone who is looking at a catalog record for this book want to know about the versions in the Internet Archive? Of course! But another method of doing it could perhaps be much simpler: just a link to http://www.archive.org/search.php?query=smith%20rejected%20addresses. Or perhaps better with http://openlibrary.org/search?q=smith+rejected+addresses or http://openlibrary.org/works/OL1145764W/Rejected_addresses.

These are some of the methods we could use and they exist right now! For free even!

Friday, June 17, 2011

Re: Suggested RDA improvements

Posting to Autocat

On 16/06/2011 23:42, MULLEN Allen wrote:
<snip>
Whatever the shortcomings of RDA, it does not assume business as usual in the future. If RDA actually is implemented and works, it could be more cost effective in several ways:

1. more effective streaming of publisher and outside metadata into the catalog process, decreasing the amount of time that original and copy catalogers spend creating records

2. inheritance of expression and manifestation metadata from work level records, decreasing the rekeying of significant portions of data for these records once the work record has been created. The same should be true for inheritance among expressions and manifestations

3. potentially, a significant enough increase in the value of non-silo discovery systems that libraries will find alternatives more cost-effective than local record by record editing

4. more effective input of data through use of established element sets and other linked data potentials, lessening the amount of keyed input
</snip>
I really don't see it in the same way. I think RDA actually does assume business as usual in the future; that RDA represents no substantial change from what we are doing now. This is one of its major problems. In the points you make:
1. streaming of publisher/outside metadata

- There has been no indication that I know of that publishers/outside metadata creators will be more willing to supply us with RDA metadata than with AACR2. In this regard, I think that the ISBD/XML Study Group of IFLA would be much more effective, but I don't know of the current state of their work http://www.ifla.org/node/1795. (If anybody knows, could you let me/us know?) If there were an accepted, truly international standard that could be implemented using modern formats (XML), with the standard being open and more or less simple (principally transcription), I think it would be possible that a suitable business case could be made for it that would convince the business mind.

Maybe.

2. inheritance of work/etc. allows for less rekeying

- I think this would have to be demonstrated, since it is difficult for me to imagine how separate W/E records would make anything faster than a simple derive from a copy record, which also needs almost no rekeying. Perhaps there could be some novel types of derives that take only specific kinds of information and therefore you would not have to e.g. delete a note for your version or something similar. Any time saving here, however, seems minuscule, especially as people would have to fight to create the work and expression records, which I personally think will be far more difficult conceptually than currently appears.

3 and 4

- These have much more to do with changing the format than with cataloging rules.

I am a stalwart believer in standards, and in both maintaining and enforcing those standards so that they rise to a level of genuine "trust" rather than "hope". "Trusting" that the water coming out of your pipes is safe is substantively different from merely "hoping" that it is. Trust is built on some kind of solid foundation of experience, while hope is.... well, it's just hope. It's just like pulling the lever on the one-armed bandit! In fact, that's often the way it's seemed to me when I've been looking for copy!

Still, it seems to me that instead of changing rules within individual catalog records, we should be focusing on how to change the catalog itself. I am beginning to think that it will take non-catalogers to solve this problem. The fact is, the public always had trouble with our catalogs and now they are searching in ways that were unheard of before. Search is changing everything, and in this sense, library catalogs are not even as far advanced as dinosaurs; they are the equivalent of trilobites!

In fact, it occurs to me how people in the future may end up viewing library catalogs (adapted from the Wikipedia entry for Trilobite http://en.wikipedia.org/wiki/Trilobite):
"Library catalog or Library catalogue (\ˈka-tə-ˌlȯg, -ˌläg\) is a well-known fossil of a class of extinct bibliographic and searching tools that formed how people found information in more primitive information times. The first appearance of a real library catalog in the fossil record was in the Library of Alexandria in ancient Egypt...  Library catalogs finally disappeared in the mass extinction at the beginning of the Age of the Internet x-number of years ago. Library catalogs were among the most successful of all primitive attempts of  search and retrieval, roaming almost completely unchallenged over the bibliographic landscape for millennia, but they steadfastly refused to adapt to the new harsher conditions until they finally collapsed, exhausted and shivering, to allow themselves to be eaten by more intelligent and larger predators."

Thursday, June 16, 2011

Re: What the announcement means

Posting to Autocat

On 16/06/2011 17:14, J. McRee Elrod wrote:
<snip>
There is *no* commitment to implement in 2013. There is a long list of things to accomplish before that, including "progress" on a MARC replacement, and rewriting the instructions in plain English. The time lines on some of those tasks are, in my opinion, very optimistic.

I do not expect January 2013 implementation. Others may not agree with that assessment, or the US national libraries might proceed without the task being accomplished.
</snip>
I still say that RDA cannot go anywhere until a decent business case has been made for it. It shouldn't go anywhere until then. All kinds of people can promise the moon, but sooner or later, you have to make a credible case for why you have a solution to the problem, and why your solution should be chosen over others. Demanding this kind of accounting is fully reasonable and makes perfect sense. As the report says very clearly http://www.nlm.nih.gov/tsd/cataloging/RDA_report_executive_summary.pdf:
"The test revealed that there is little discernible immediate benefit in implementing RDA alone. The adoption of RDA will not result in significant cost savings in metadata creation."
They go on to say:
"Immediate economic benefit, however, cannot be the sole determining factor in the RDA business case. It must be determined if there are significant future enhancements to the metadata environment made possible by RDA and if those benefits, long term, outweigh  implementation costs."

The report says there is *no business case* for RDA to be implemented by what exists now. So, all that remains is to maintain that the "future directions" of RDA will be worth it. What can people promise in the future? It seems to be pretty late in the game to start talking about this.

And let's be honest about what current RDA changes from AACR2 promise about future directions: Abbreviations? Latin? Eliminating the rule of three, going down to a single author? Changing main entry for treaties? Changing O.T. and N.T.? Adding three barely comprehensible fields to MARC? Eliminating the GMD? Keeping text in ALL CAPS? Does anybody really believe that this is worth it in the business sense? Does anybody really think these represent the directions of "significant future enhancements"?

Plus, maybe you really do believe it, but how do you convince that person who actually makes the decisions and is just aching to cut your budget?

Cataloging Matters #11: “Open Archives, pt. 2”

See also: Open Archives pt. 1

http://www.archive.org/details/CatalogingMatters11OpenArchivesPt.2

Hello everyone. My name is Jim Weinheimer and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy.

In this episode, I want to continue with part two of my discussion about Open Archives. I intend here to concentrate on some of the technical aspects of how to get these materials under control, primarily from the cataloger’s viewpoint.



In the first part of my discussion on Open Archives, I spoke in more general terms, and perhaps most people already knew much of that, but I believe it is necessary to emphasize the importance of the materials in the open archives. Although these materials are different in the sense that many of them have not gone through pre-peer review, or they may differ in several other ways, these facts still do not detract from their importance. They are important to our communities almost by definition, since they have been created and stored by scholarly institutions, some of them great and often including our own home institutions, all at great trouble and expense. Once we accept this, libraries cannot afford to ignore them. As I have said elsewhere, if libraries ignore the materials produced by their own communities, it should not be so surprising when those same communities begin to ignore libraries.

In an article Academic Libraries and the Struggle to Remain Relevant: Why Research is Conducted Elsewhere by John Law of Proquest, the author discusses the results of a project researching how academic patrons search. After discussing library catalogs, the myriad of databases, each with its searching peculiarities, and the real problems of Google Scholar, he writes: “Clearly, the desire among academic researchers is exceptionally high for credible, relevant results that can be refined to show only full-text resources.” This seems to me to be precisely what the open archive initiative is supposed to supply. http://www.serialssolutions.com/assets/publications/Sydney-Online-2009-John-Law.pdf

This is why I consider that, in library terms, the materials placed in open archives have already gone through the process of selection by their respective communities; there is no need to order anything, so the next step in the process that everyone is waiting for is description and organization, otherwise called cataloging.

So, how do we catalog these materials? For those who listened to part one and remember, I used the term “exponential growth” when describing open archives and mentioned that already open archives hold around 9 million items. While I’ve seen some pretty big backlogs, I’ve never seen anything nearly that big! Of course, these are only the open archives that are registered, and not all are registered, plus there are many wonderful sites floating around on the web that are not in open archives, but I’m not dealing with those at the moment, only those materials in open archives.

In many ways, I think the open archive initiative has taken us all back in time to the beginnings of journals. The librarians and publishers of long ago understood as well as we do today that most people want individual articles out of journals, and not the journals themselves. Back then, a journal would sometimes provide an index in their final issue of the year, so that people wouldn’t have to go through each and every issue, and then to make it easier to find articles, some began to cumulate these annual indexes every 5 years or so, and eventually some even cumulated the cumulations. It turned out however that even with all of these cumulations, people still complained about doing all that work for each journal.

What did the librarians do? They too, quickly learned that, although it was what their patrons wanted, cataloging each article of each journal was an impossible task. William Poole got the brilliant idea of creating an index of the articles from a bunch of journals, printing it, and selling the publication to libraries. In the transcript I use the miracle of the internet to give a link to an early edition from 1853. That edition has a preface where Poole very clearly describes the situation of the mid-19th century, and it mirrors today’s reality almost exactly. Others may find his comments useful. http://books.google.com/books?id=yO9GGjPbPjYC&printsec=frontcover#v=onepage&q&f=false   

It turned out that Poole’s solution made everybody happy: he and his publishers made money; the librarians could buy the index, and a nice addition was that librarians had a general guideline of which journals to buy since if a journal happened to be in Poole’s index, it was a good reason for them to buy that journal. Based on that fact, journal publishers naturally wanted records of their articles in Poole’s index. The index made the catalogers happy since they had, in effect, outsourced one of the most difficult parts of the collection, and the patrons were happy, that is, once they learned that if they wanted an article, they had to look in several places: first, they had to find Poole’s index, where they could see in which issue of which journal the article appeared; then go over to the library catalog and see if the library had the journal issue and where it was shelved; then finally go into the stacks. Some patrons never learned this vital skill, and even the ones that did nevertheless did not like it much. Many never really understood why they had to look in so many different places and why it couldn’t just all be in the catalog.

Today with open archives, as we have seen, there are a huge number of articles and repositories, while their numbers are growing all the time. Just as before, there are not enough librarians to catalog them all, and the work will have to be outsourced in some way, as indeed, it already has been. But now we run straight into the biggest problem: nobody wants to pay anyone to index these materials. As I mentioned in part one, the idea of open archives is not to make money, but to save money. So far, we have lacked the genius of a modern William Poole who, if he were with us today, may have figured out by now how to make money indexing those open archive materials. The remarkable Faculty of 1000 site I mentioned in part one may prove to be a starting point. But however that turns out, it’s clear that our traditional methods and solutions are broken. Therefore, we must consider matters anew, and seriously: do we have any advantages today that our predecessors didn’t have before?

As I already mentioned, open archives include a requirement for an associated metadata record created by the authors (or whoever it is that adds the item to the open archive). These records are then made available in such a way that bigger databases can “harvest” them, i.e. take copies of the records into their own databases so the searcher doesn’t have to search the thousands of open archives separately.

Metadata harvesting
The process of harvesting metadata from open archives is normally not much of a problem. There are various tools you can use, for instance, MarcEdit will do it http://people.oregonstate.edu/~reeset/marcedit/html/index.php, but there are other tools as well, http://www.openarchives.org/pmh/tools/tools.php; even your web browser will do it, although it’s not the most efficient tool. You just need the link, then select the format you want to take, sometimes you can change the query in various ways, by date or subject, and just start in.

For those who want an example, you can do your own harvesting for a series of records in An American Ballroom Companion: Dance Instruction Manuals, ca. 1490-1920, comprising around 200 records in American Memory. In the transcript, I provide links for harvesting these records using the formats OAI-DC (a special form of Dublin Core), MODS and MARC21, and you can do it yourselves. It takes a moment; remember, you’re downloading over 200 records, so have some patience!
http://memory.loc.gov/ammem/oamh/books.html
http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=musdibib (OAI-dc)
http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=mods&set=musdibib (mods)
http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=marc21&set=musdibib (marcxml)

If you examine the links, they are all the same except for the “metadataPrefix”, which defines the format of the record you want the computer to serve you: oai_dc, mods, or marc21.
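For those who would rather build these links in a script than type them, here is a minimal Python sketch. The endpoint, the “verb”, the “metadataPrefix” values and the set name are all taken from the links above; the function name and its defaults are my own invention for the example. It only constructs the request URLs and does not download anything.

```python
from urllib.parse import urlencode

# OAI-PMH endpoint taken from the American Memory links above.
BASE_URL = "http://memory.loc.gov/cgi-bin/oai2_0"

def list_records_url(metadata_prefix, set_spec="musdibib", base_url=BASE_URL):
    """Build a ListRecords request URL for one metadata format.

    Only metadata_prefix changes between the three links; the verb and
    the set stay the same.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return base_url + "?" + urlencode(params)

# One request URL per format offered for this collection.
for prefix in ("oai_dc", "mods", "marc21"):
    print(list_records_url(prefix))
```

Passing one of these URLs to a harvesting tool, or simply pasting it into a browser, should return the same records in the chosen format.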

Naturally, once you get the records, you then need some kind of repository to place them. There are lots of options for that too, many of them “free” open source options such as Drupal, but I won’t talk about any of that here.

There are problems with metadata harvesting however: since you are taking copies of the records, somehow those records should be coordinated. The moment a record in the original database is updated, yours becomes obsolete. New records made in the original repositories need to wait until you harvest them and put them into your repository. With harvesting, your database will always be behind. In practice, conversions are often unavoidable. Information may be lost in the process, and many times, the outside archive will have additional powers for search and display that your repository does not have. We’ll discuss an example at some length later in this program.
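The coordination problem can at least be reduced by harvesting again at regular intervals. OAI-PMH defines an optional “from” date argument for requesting only what has changed since the last harvest, and each record header carries an identifier, a datestamp and, in repositories that track deletions, a status of “deleted”. The following Python sketch shows how a local copy might be brought up to date from one such response. The sample response, its identifiers and the shape of the local store are invented for illustration, and whether a given repository reports deletions at all is an assumption you would need to check.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response; identifiers and dates are illustrative only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2011-06-01</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:2</identifier>
        <datestamp>2011-06-10</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

def apply_harvest(local, response_xml):
    """Update a local {identifier: datestamp} store from one response."""
    root = ET.fromstring(response_xml)
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            # The source repository dropped the record, so drop our copy too.
            local.pop(identifier, None)
        elif local.get(identifier, "") < datestamp:
            # ISO dates sort as strings, so a plain comparison is enough.
            local[identifier] = datestamp
    return local

# Hypothetical state of the local repository before re-harvesting.
local_copy = {"oai:example.org:1": "2011-01-15", "oai:example.org:2": "2011-02-01"}
apply_harvest(local_copy, SAMPLE)
print(local_copy)
```

Even so, the local copy is only as current as the last harvest, which is the limitation described above.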

Harvesting is not the only option if you want to work with open archives; there are tools such as APIs, which query the live database and often allow searchers to work with the original database in various ways. There is also my own method, which I won’t discuss right now either. There are many, many options available.

My current thinking is that open archives eventually will become specialized by topic, instead of the generalized ones we have now, based primarily on individual organizations. Specialized open archives will be much like the high-energy physics archive at Cornell and the E-LIS archive I mentioned in part one.

Open archives specialized by topic would mirror the history of web site creation. For those who can remember the early days of the World Wide Web when every institution was frantically creating its own web pages, it turned out that the websites of those organizations almost always mirrored the internal bureaucratic hierarchy within an institution. This happened because the websites were made by internal staff with the purpose of ensuring a “web presence” of the specific division or department. Some overall webmaster then collected the links to all of the divisions and departments into a single, overall page. Search engines were highly rudimentary at that time, and therefore, to find information, a searcher had to navigate the internal bureaucracy of an organization, through the divisions, departments, projects, and so on.

I can say that this really did seem logical at the time, but slowly, it dawned on website creators that the true logic of a site on the World Wide Web is to appeal to the greatest possible portion of the public, and this meant people who had no idea of the internal workings of your organization. This kind of website was compared to a closed intranet, a distinction that took me some time to understand. Lacking this internal knowledge, outside patrons could almost never find the information they wanted, no matter how hard they tried.

As a result of this change in focus, today the information architect concentrates on the person who doesn’t know anything about the internal peculiarities of an organization, and so organizes the site to help those people find the information they want. In a similar way, it seems to me that this could easily be the direction that open archives could evolve: open archives specializing by topic would seem to be what most people want, and the totality of what an individual institution creates is much less useful. This would also be similar to people wanting journal articles over journals. All of this would probably make searching easier, but it may very well turn out that I am proven wrong, since after all, that is the nature of prediction.

Another, much more serious problem is that there are almost no standards for the metadata records in the open archives. One obvious problem is with formats. There was a great solution--I thought it was anyway--in the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) http://www.utsc.utoronto.ca/~chan/oaindia/presentations/OAI_PMH.pdf. I won’t discuss it here, but I provide a link in the transcript. What happened was that Google unfortunately decided not to use it, in favor of the much simpler site maps. http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html. If Google had accepted OAI-PMH, it would have been a great shot in the arm for the protocol. What the impact of Google’s decision will be remains to be seen.

In this program, I do not want to focus on formats. While they are a very important issue, I do not believe formats are where the underlying problems lie; I prefer to discuss the quality of the information within the metadata record. This is where I have heard the quality described as I mentioned before: “Pretty good.”

Problems with Harvesting
Harvesting metadata records, if it is to be useful, must be considered in the context mentioned earlier. I like to visualize it as being in a large office building with hundreds of offices, each with a separate file cabinet. To do a thorough search of the documents in the office building, people have always had to go to every office and look through all of those hundreds of file cabinets. If we are to make the searchers’ task easier, one solution is to bring all of the file cabinets into one room so that they can be searched together. But after you do this, you find that this is not enough because you still have to search hundreds of separate file cabinets--all that has changed is that you no longer have to walk all over the building. While your feet may have it easier, the searching itself is not any easier at all since you still have to search each file cabinet separately. To make searching easier, it would need to be one search, and that means merging all of the files in all of the file cabinets into a single file. That’s a lot more work than just getting a few workmen to drag a bunch of file cabinets into one big room!

It means, in turn, that all of the idiosyncrasies that each person used in his or her own file cabinet have to be ironed out so that people can find materials by UN, ONU, United Nations, non-Roman materials, and all of the other details that I am very happy I do not need to explain here because my audience is catalogers and they understand these matters very well.
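
For those following along in the transcript, here is a minimal sketch in Python of what merging the file cabinets into a single file might involve. Everything in it is an assumption made for illustration: the table of variant forms, the record layout, and the function names normalize and merge_cabinets are all hypothetical, and a real system would draw on a proper authority file rather than a hand-typed table.

```python
# Hypothetical table of variant forms -> one authorized heading.
# In practice this would come from an authority file, not a hand-typed dict.
AUTHORIZED = {
    "un": "United Nations",
    "onu": "United Nations",
    "united nations": "United Nations",
}


def normalize(name):
    """Return the authorized heading for a name, or the name unchanged if unknown."""
    # Lowercase, drop full stops, and collapse whitespace before the lookup.
    key = " ".join(name.lower().replace(".", "").split())
    return AUTHORIZED.get(key, name)


def merge_cabinets(cabinets):
    """Merge records from many 'file cabinets' into one index keyed by heading."""
    index = {}
    for cabinet in cabinets:
        for record in cabinet:
            heading = normalize(record["name"])
            index.setdefault(heading, []).append(record["title"])
    return index


# Three invented "file cabinets", each using its own form of the same name.
cabinets = [
    [{"name": "UN", "title": "Report A"}],
    [{"name": "O.N.U.", "title": "Rapport B"}],
    [{"name": "United Nations", "title": "Report C"}],
]
merged = merge_cabinets(cabinets)
```

With the three variant forms mapped to one heading, a single lookup finds all three items. The hard part, of course, is building and maintaining the table, which the sketch simply takes for granted.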

Whatever methods you decide to employ to solve this situation, everything will also need to be maintained.

Therefore, although metadata harvesting is very important, it is only one step toward a solution. We can imagine everything in one location (that is, one room or one database), but the searching itself is just as difficult as it has ever been, if there is no standardization. Without standardization, at least on some level, whenever we create an open archive repository, we wind up creating our own little Google, where we are simply hoping that full-text algorithms of the Google-type will solve the problems in deus ex machina fashion. I do not share that faith and believe we need some kind of standardization.

Lacking a miraculous solution, it isn’t as if we are left with nothing at all: there are still all those metadata records for each item, and the question becomes: how can that metadata be standardized?

Before we discuss possible solutions we need to get some idea of the nature of the beast. Let’s examine a practical example in one of the biggest Open Archive harvesters: OAIster.

A very incomplete analysis of OAIster
Before I begin my analysis, please believe me since I am being very sincere: it is not my purpose here to criticize any initiative; I am a big fan of all of them. My purpose here is to try to show how important the task of cataloging is, and to show how well-trained, professional catalogers could help identify and solve some of the problems.

Here is a document I found at random:
Consumption inequality and partial insurance / by Richard Blundell, Luigi Pistaferri, and Ian Preston.

To sum up what follows, after an examination, I discovered there are different versions of this document, at least from 2003 through 2008, when it was published in The American Economic Review, and available in JSTOR for those with a subscription. Let me describe in more detail what I found.

http://oaister.worldcat.org/search?q=Consumption+inequality+and+partial+insurance&qt=notfound_page&search=Search

When I do a search in OAIster for this document, I find what appear to be duplicate records, but on further analysis, it turns out that these are records for different versions, and only some of the records allow access to the actual document. (I ask myself: What happened to the Open part of the Open Archives?) As I examine the metadata in these records, I find that most give the authors’ names in citation format, i.e. surname plus initial, but one record has their names as they appear in the document.

It turns out that all the authors are in the NAF. The forms of the first two, Blundell and Pistaferri, match their NAF forms but the third lacks dates, since the NAF form is Preston, Ian, 1964-. I conclude that no one consulted the NAF. The match of the first two names with the NAF is purely coincidental since their NAF forms are the same as in the document. http://oaister.worldcat.org/title/consumption-inequality-and-partial-insurance/oclc/2390295190605&referer=brief_results

When we begin to examine the documents themselves, matters become more complex. In the 2003 version of the document, that is, one of the versions that is open, the authors mention in a note that it is a version of an earlier paper “Partial insurance, information and consumption dynamics”, also available in the open archive. This is not mentioned in any of the records. (http://oaister.worldcat.org/oclc/2390295663776)

Further examining the documents, we find abstracts along with the keywords consumption, inequality, and insurance, that is, words that are rather useless for searching purposes since they are taken directly from the title “Consumption inequality and partial insurance”. I conclude these keywords were assigned either by the authors or by someone who had no interest in subject analysis. http://discovery.ucl.ac.uk/2854/1/2854.pdf

I discover that these records came from the open archive at the University College London and decide to search that archive separately. I find some interesting details, http://discovery.ucl.ac.uk/cgi/search/advanced?screen=Public%3A%3AEPrintSearch&_action_search=Search&_fulltext__merge=ALL&_fulltext_=&title_merge=ALL&title=Consumption+inequality+and+partial+insurance&creators_name_merge=ALL&creators_name=&editors_name_merge=ALL&editors_name=&abstract_merge=ALL&abstract=&divisions_merge=ALL&date=&satisfyall=ALL&order=-date%2Fcreators_name%2Ftitle. While there is no consistency among the records, we see that they contain additional information not in OAIster: none of the OAIster records for these documents have any subjects but in UCL, some records have subjects, yet once again, there is no consistency: some have no subjects, and others have differing subjects. One of the records has what appears to be real subject descriptors: http://eprints.ucl.ac.uk/15896/: LIFE-CYCLE EARNINGS, COVARIANCE STRUCTURE, TAX-REFORM, PANEL DATA, INCOME, HETEROGENEITY, DYNAMICS, WELFARE, UNCERTAINTY, VARIANCE. These terms appear to be authorized forms but I don’t know where these terms come from. Perhaps the EconLit thesaurus would be a good bet.

Again, none of the OAIster records concerning this document have any subjects at all and it appears that OAIster has decided not to harvest the keywords, probably because of the consistency concerns mentioned earlier. See how all of this recreates the scenario I described before? Bringing together hundreds of file cabinets into a single room saves the leather on your shoes, but it doesn’t make the searching itself any easier because you still have to make lots of separate searches.

This doesn’t end our analysis and in fact, it may actually just be starting. When you discuss searching today, everything naturally must be compared with Google, and in our present case, we find the same article in Google Scholar:  http://scholar.google.com/scholar?hl=en&q=Consumption+inequality+and+partial+insurance+blundell&btnG=Search&as_sdt=1%2C5&as_ylo=2007&as_vis=1. This has the normal link going to the restricted version at JSTOR, but in the right hand column, there is a link that goes to one of the free versions hosted at the University College London, the one dated Sept. 2003. This is very nice and handy for the patrons but I do not know why this version is singled out.


Yet the Google “metadata record” has something more that I find very impressive: a link going to different versions. If you click on the link labelled “All 46 versions”, you find many, many, many more versions of this article, including the one published in 2008 http://www.econ.upenn.edu/system/files/Blundell.pdf. Is this a legal copy? I don’t know; I don’t care. It’s available and that’s all that matters to me right now.

I confess that I have not looked at all 46 records, so I don’t know what else may be hiding there, but in any case, after this simple examination, the situation seems, to this experienced cataloger at least, to be a bit chaotic. Don’t get me wrong: the materials are all great; it’s just very confusing to understand what exists, and if it’s confusing to me, I must assume that it would be just as confusing to non-specialists.

I did not go out of my way to find this example; it seems to be a normal record, and a normal level of metadata quality in a normal open archive. Can and should professional catalogers conclude that such a level of quality is “pretty good”?

As an aside, I can imagine that if anyone has been listening to my podcasts, they could be thinking at this point, “But with all these versions and the chaos you describe, you are actually talking about how we need FRBR! You’ve gone into long tirades over how you don’t agree with FRBR! How do you get out of that one?”

In my defense, what I have said in the past is that FRBR does nothing more than restate the traditional operations of the catalog--it just uses other terminology and posits a different structure by eliminating the unit record. It provides nothing new in the way of searching. The only “innovations” it introduces are in display based on works-expressions-manifestations-items, and even those displays are based on 19th-century models. I maintain that what FRBR intends is designed for librarians over our patrons, and thus is no real change from our current library catalogs. Therefore, I say that what FRBR calls “User tasks” are actually “Librarian tasks”. Librarians have to know exactly what exists so they can organize it for later retrieval--I don’t question that. My stance is that the catalog as it stands today allows all of this right now for librarians and the FRBR “user tasks” are not what the public either wants or needs. Consequently, creating catalogs with FRBR in mind ignores what the public wants.

In the case we have just examined, what would patrons want? Would they want a detailed browsable listing of the 46 or so variants at their disposal, or would that just be too confusing and too big of a pain? My own opinion--and it is an opinion although based on experience--is that patrons would probably be happy with almost any version they could get, and if given the choice, most probably would simply opt for the latest one they could have for free.
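
To make that preference concrete, here is a minimal Python sketch of "the latest version available for free". The record layout, the function name latest_open_version, and the sample data are all invented for illustration; they are not the actual OAIster or Google Scholar records, and real records would rarely state openness so cleanly.

```python
def latest_open_version(versions):
    """Return the most recent version marked as openly accessible, or None."""
    open_versions = [v for v in versions if v["open"]]
    if not open_versions:
        return None
    return max(open_versions, key=lambda v: v["year"])


# Invented sample data: several versions of one paper, only some of them open.
versions = [
    {"year": 2003, "open": True, "source": "working paper"},
    {"year": 2005, "open": True, "source": "revised working paper"},
    {"year": 2008, "open": False, "source": "journal article"},
]
choice = latest_open_version(versions)
```

The rule itself is trivial; what it depends on is metadata that reliably records the date of each version and whether it is open, which is exactly what the records examined above do not consistently provide.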

As I said earlier, it is not my intention here to point out faults; my purpose is much more positive: I want to demonstrate that harvesting is only one part of the solution, in some ways, the easiest part, and there are other options besides harvesting. All the while, it is important to keep in mind the traditional cataloging concepts, which remain completely valid today, although the traditional practices or techniques used by catalogers may end up in the garbage can.

Are there solutions toward improving this situation? I believe there are, but I think it is clear that solutions should focus not on creating quality in these metadata records, but on managing the quality. The unavoidable fact is that there will not be enough catalogers to create quality, therefore we can only try to manage it as best we can. Accepting this would represent a major shift in the viewpoint of the traditional cataloger. There are many ways to include cataloger-type controls in open archives as metadata managers, and the only limit is our imaginations.

I suggest thinking in terms of creating new tools with the purpose of providing help to catalogers: tools that include items from a local open archive in the main cataloging workflow automatically; tools that allow catalogers to upgrade records en masse; tools to find items that exhibit inconsistencies or other inadequacies in metadata, perhaps through statistical analyses that can be viewed graphically, so that inconsistencies could show as stray dots on a graph, or to point out where subject terms are absent or simply repeat what is in the title. Ontologies need to be built so that when patrons see a subject heading or descriptor in one open archive, they can be led to similar materials in other databases that use different thesauri or subject systems. How about a tool that allows corrections and updates by members of the public, with everything done under the watchful eyes of the catalogers? Perhaps a difference in procedures would help: additional descriptive work could be done retrospectively depending on whether a new item is a version of something already in the database. Perhaps if there are no versions, less work can be done originally.
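
As one small illustration of such a tool, here is a rough Python sketch that flags records whose subject terms are absent or merely repeat words from the title. The record layout and the function names words and audit are hypothetical, made up for this example, and a real tool would need stemming, stopword handling, and much else.

```python
import re


def words(text):
    """Return the set of lowercase alphabetic words in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def audit(record):
    """Return a list of metadata problems found in one record."""
    problems = []
    subjects = record.get("subjects") or []
    if not subjects:
        problems.append("no subjects")
    else:
        title_words = words(record["title"])
        subject_words = set()
        for subject in subjects:
            subject_words |= words(subject)
        # Every subject word already occurs in the title: nothing gained for searching.
        if subject_words <= title_words:
            problems.append("subjects only repeat the title")
    return problems


title = "Consumption inequality and partial insurance"
report = [
    audit({"title": title, "subjects": ["consumption", "inequality", "insurance"]}),
    audit({"title": title, "subjects": []}),
    audit({"title": title, "subjects": ["PANEL DATA", "INCOME"]}),
]
```

Run over a whole archive, a check like this would give catalogers a worklist of records to upgrade instead of leaving them to stumble across the problems one at a time.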

The watchword should be as before: to help our patrons navigate in the information universe, not to expect everything to be in a single standard, that standard being the one that “we” use, whoever the “we” happens to be. That would be an impossible task leading us to disaster.

The biggest change of all however, would come when librarians honestly consider the whole of the materials in the open archives as fundamental parts of their own collections, just as important as the books on their shelves or the databases to which they subscribe. After all, that’s how our patrons consider them. No single library can control all of those materials; it must be done on a truly cooperative basis.

Librarians must make something that people can use, and I think it should be done soon, since expecting our patrons to wait longer and longer will be tantamount to self-obsolescence and suicide, especially in times such as these. In his Preface to the 1853 edition of the Index to Periodical Literature that I mentioned earlier, William Poole wrote:
“To persons who have given but little reflection to the subject, there are few things which appear simpler than the compilation of a Catalogue or an Index; while those who have had experience in such labor well know that the undertaking is full of difficulties. If the preparation of this work had been delayed until a plan had been fixed upon that reconciled all objections, it would never have been commenced; or, if the labor had been continued until the work was satisfactory to myself, it would never have been presented to the public. My endeavor was to bring the contents of some fifteen hundred volumes into as narrow a space as possible. The ordinary plan of indexing periodicals was, under the circumstances, wholly impracticable.”
http://books.google.com/books?id=yO9GGjPbPjYC&pg=PP13#v=onepage&q&f=false

Poole’s remarks describe very well the situation we are facing today. We also have to create something that will help our patrons, and it doesn’t have to be perfect; we just need to make tools that are better than what people have today--just today! That’s all. Everyone understands that whatever we make will improve. Nobody expects perfection, but they do expect improvements. Poole improved his index, others took up his baton later, and libraries should follow his example.

And with this we come to the end of my discussion of Open Archives, so no one need worry that it will go on and on like my personal journey with FRBR. I hope you enjoyed it or at least found it interesting. If you have any suggestions for future podcasts, please let me know.



The music I have chosen to end this programme is Pandolfi Mealli’s evocative Violin Sonata Op.4 No.1 called "La Bernabea" http://www.youtube.com/watch?v=oYsbdlyAAMU. Pandolfi Mealli lived in the mid-17th century and very little of his music remains, just two sets of violin sonatas numbered provocatively 3 and 4. This is an excerpt with Andrew Manze on the Baroque violin and Richard Egarr, harpsichord.

That’s it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.