Tuesday, March 29, 2011

RE: The next generation of discovery tools (new LJ article)

Posting to NGC4LIB

Jonathan Rochkind wrote: (concerning relevance ranking being a "crapshoot")
<snip>
Well, it depends on what you mean. That's a dangerous statement, because it sounds like the kind some hardcore old school librarians use to say we shouldn't do relevance ranking at all, I mean why provide a feature that's "a crapshoot", just sort by year instead. I don't think it's true that relevancy ranking provides no value, which is what "a crapshoot" implies.

Instead, relevance ranking, in actual testing (in general, not sure about library domain specifically), does _very well_ at putting the documents most users most of the time will find most valuable first. It does very well at providing an _order_. Thus the name "ranking".
</snip>
In these economically very troubled times, I don't think there is much of a chance that we won't do any relevance ranking at all--my concern is quite the opposite: administrators are indeed *desperate* to save money wherever they can, and today, computerization is seen as a method to save money because people are "so expensive". (A curious idea, by the way) If there is a danger, it is much more that it will be the practice of cataloging that will be tossed overboard, not the computerized relevance ranking.

Perhaps cataloging really will be thrown overboard--I don't know--and the millennia-old practice of cataloging will be done automatically or semi-automatically, by students and secretaries with only a few minutes or hours of training, following few standards, if any whatsoever. I am sure that if there is a danger, it is to cataloging and not to any kind of computerized rankings. In any case, it shouldn't be done without a full understanding of what we would be losing.

Let's discuss practice and consider whether Google-type relevance ranking really does "very well at putting the documents most users want". The only way to determine if this is true is to compare it with some kind of alternative. Do we have anything? How about the library catalog?

Let's take as an example that I want to do some kind of research (not for publication, just an undergraduate paper) on air warfare of WWI. Doing this search in Google retrieves 65,400 results http://tinyurl.com/5tbouru (at least on my machine) and gives me at first Wikipedia, something from Firstworldwar.com, Britannica, answers.com, life123.com, pages about games, and so on. In the "Wonder Wheel" I see synonyms for "air warfare wwi" except for naval warfare and surface warfare. Is this relevant to my search? (As an aside, Google's menus letting people re-sort the results in several ways implies that they admit that the single relevance ranking is not enough)

To me, this is similar to my own experiences with very poor reference librarians: you ask them for information on a topic, they run off into the stacks and come back with a book, often an encyclopedia, open to a chapter or article more or less on your topic. Then they leave you and return to their other work.

To decide if the Google result is relevant, we need to compare it with the correct, expert search in a library catalog, which, I admit, no regular person would ever do: subject search: "World War, 1914-1918--Aerial operations" http://tinyurl.com/6hkgrcq. There I see a result grouped by American, Australian, Austrian, Belgian, etc., i.e. concepts I would not have considered on my own.

Now, if we look at the very first record:
Main title: The achievements of the Zeppelins, by a Swede.
Published/Created: London, T. F. Unwin, ltd. [1916]
Description: 16 p. incl. pl. 15 cm.
Notes: "Reprinted from the Stockholms Dagblad of 19th March, 1916."
Subjects: World War, 1914-1918 --Aerial operations.
LC classification: D604 .A3
There is not a single word of the subject heading anywhere within the description of the item, and therefore, without the subject heading, the person interested in air warfare would not have found this and would have had to come up with "Zeppelins" independently somehow. Full-text would not have helped either, since this publication is from 1916, and WWI was not called WWI until WWII broke out.

What I am trying to show is that the subject heading arrangement--when used as it is designed to be used--is an incredible time saver for the searchers, since very quickly they can get a nice overview of what is in the local collection and decide, e.g. I am not interested in World War, 1914-1918--Aerial operations, Italian, and don't have to look at any of those. This system is far from perfect, but there is real power in the traditional subject headings that *is not replicated* in relevance ranking, that is, so long as the library catalogers do their jobs satisfactorily.
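As a rough illustration of the overview described above, here is a minimal Python sketch that groups records by the qualifier following a base subject heading. Everything in it is an assumption made for illustration: the function names, the simple string handling of the "--" subdivisions, and the placeholder records. Only the Zeppelin record comes from the catalog display quoted earlier. A real catalog would work from authority records, not from string matching.

```python
from collections import defaultdict


def normalize(heading):
    """Tidy spacing around '--' and drop a trailing period (a crude assumption)."""
    parts = [part.strip() for part in heading.rstrip(". ").split("--")]
    return "--".join(parts)


def group_by_qualifier(records, base_heading):
    """Group titles by the qualifier that follows base_heading, e.g. 'Italian'.

    records is a list of (title, [subject headings]) pairs.
    """
    base = normalize(base_heading)
    groups = defaultdict(list)
    for title, subjects in records:
        for subject in subjects:
            heading = normalize(subject)
            if heading == base:
                groups["(general)"].append(title)
            elif heading.startswith(base + ", "):
                groups[heading[len(base) + 2:]].append(title)
    return dict(sorted(groups.items()))


# The first record is the one quoted above; the other two are placeholders.
records = [
    ("The achievements of the Zeppelins",
     ["World War, 1914-1918 --Aerial operations."]),
    ("Placeholder title A",
     ["World War, 1914-1918--Aerial operations, Italian"]),
    ("Placeholder title B",
     ["World War, 1914-1918--Aerial operations, American"]),
]

overview = group_by_qualifier(records, "World War, 1914-1918--Aerial operations")
for qualifier, titles in overview.items():
    print(qualifier, "->", titles)
```

The searcher who is not interested in the Italian material can skip that whole group at a glance, and that is the time saving the heading arrangement provides.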

This traditional method was designed for printed catalogs and I readily admit that it does not work in the world of the web, but the question naturally arises: could these clear sorts of result sets be repurposed to function on the web? Of course they could, if the powers-that-be decided to devote the resources. Yet I fear that there is little chance that we will want to devote the resources to this task, and that we will instead put our faith in "relevance" ranking, which I think is, in reality, a search for the "perfect algorithm", which I do not believe exists.

*Everybody* from provider to searcher has an interest in maintaining the idea that relevance ranking really does give us what is "relevant" (in the normal meaning of the term), and that it is not actually the output of some kind of incredibly complex mathematical algorithm, producing rankings that no human being could ever explain since the mathematics are too complex, and that can only be accepted at face value. Yet, we must accept this since the only other choice would be to look at all of the 65,400 hits where the "relevance" really does trail down to .0001 sooner or later (and of course, we know this is only the tip of the iceberg of what is really on the web on this topic).

It would be nice if we could get the two methods to work in tandem somehow, because where the subject headings are strong, relevance is weak, and where the subject headings are weak, relevance is strong.

Something tells me that that will be a very hard case to make however, since the administrators will no doubt claim it is double work, although they can say that it would be nice in a different economic environment and so on and so on.

Monday, March 28, 2011

RE: The next generation of discovery tools (new LJ article)

Posting to NGC4LIB

Jonathan Rochkind wrote:
<snip>
I would be wary of assuming that this is reflected in the _math_ though. Jim, by "my own experience too is that this is correct", do you mean you've actually looked at the distribution of calculated relevance scores in the result, or just that your own judgements of relevance of hits would distribute like that, trail off into non-relevance?
</snip>
Google is a highly secretive organization and I would be surprised if they would release this sort of data, but maybe they do somewhere. Still, while the actual relevance numbers assigned by Google may be
100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1,
or
100, 70, 69, 68, 67, 66, 65, 64, 30, 29, 28, 27, 26, 10, 9, 8, 7
the fact is that very few people go past the first screen. This includes me. I don't think it is so much a matter of laziness but that the results past the first page just do not serve. As a consequence, it seems to me that below a certain threshold--I'll pull a number out of my hat, let's say 20--it may as well be 1 or 0.
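To make the threshold idea concrete, here is a small Python sketch. The scores are the made-up numbers from the first list above, not real Google data, and the cutoff of 20 is the same number pulled out of a hat. The function name is hypothetical.

```python
def flatten_below(scores, threshold=20):
    """Treat every score under the threshold as 0, i.e. as not worth looking at."""
    return [score if score >= threshold else 0 for score in scores]


# Illustrative scores from the discussion, not actual relevance values.
scores = [100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1]

flattened = flatten_below(scores)
useful = sum(1 for score in flattened if score > 0)
print(flattened)
print(useful, "of", len(scores), "results survive the cutoff")
```

With these numbers only the first five results survive, which is roughly what fits on a first screen anyway.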

Of course, speaking as the "information specialist" I have no doubt that there is far more that is really relevant on the web than the handful of what I see in the first couple of pages of a Google search (here, I am using the term "relevance" in the normal sense, and not in the Google sense). Since people love Google so much, I always feel I have to add that this is not a criticism of Google. Google is nothing more than a tool, like a hammer or a power saw, and any tool has its strengths and weaknesses. This is simply an illustration of the importance of understanding those strengths and weaknesses.

But in any case, that is why I suggested using Google Scholar results instead, since it seems as if the major ranking of the search is by number of citations, and this can be seen.

RE: The next generation of discovery tools (new LJ article)

Posting to NGC4LIB

Karen Coyle wrote:
<snip>
I have always assumed (and I would love for someone to post some real data related to this) that after a very small number of high ranked results the remainder of the set is "flat" -- that is, many items with the same value. What makes this flat section difficult is that there is no plausible order -- if your set is ranked:
100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1, ....
and you go to pick up results for page 2, they will all have the same rank and they will be in no useful order. (probably FIFO).
</snip>
My own experience too is that this is correct. Something that may or may not be relevant to this discussion: I have worked a bit with a Firefox plugin called Cloudlet http://www.getcloudlet.com/, which takes a search in Google, Yahoo, and some other databases, and returns a word cloud. In the Wired article at http://www.wired.com/epicenter/2008/12/firefox-add-ons/, they mention that to get better results, you should change your account to get 100 results per page, but otherwise I haven't discovered any more details concerning how it works. I've concentrated on trying to find out if it is genuinely useful.

I still haven't decided if it is or is not, but something within me says that it *has* to be useful. My concern is: when I click on a word in the cloud, I don't really know what I'm looking at.

In any case, this is a different take on the same idea as what we are discussing here.

Anyway, a suggestion for Karen is to relate the search to Google Scholar, which arranges results by number of citations (mostly). For more specific searches, i.e. not only single words but multiple terms, the citations die out after a couple of pages or so.

Tuesday, March 22, 2011

The Internet: For Better or for Worse

Posting to NGC4LIB

Re: The Internet: For Better or for Worse / Steve Coll, NY Review of Books (April 7, 2011) http://www.nybooks.com/articles/archives/2011/apr/07/internet-better-or-worse/

This article reviews a couple of books, but the one that interests me here is "The Master Switch: The Rise and Fall of Information Empires" by Tim Wu, which discusses how, in the past, new media were relatively free and open at first, only to be taken over by bigger powers in the end. He uses the example of the early days of radio, e.g.
"Churches, clubs, oddballs, gadget hounds, and sports entrepreneurs launched radio stations that could reach listeners over a few square miles. By the end of 1924, American manufacturers had sold more than two million radio sets capable of broadcasting. Dense urban areas such as Manhattan tuned in to a cacophony on the airwaves."

Many people thought that radio would really let democracy work since everybody could be connected in all sorts of new ways. Of course, this did not happen since the radio waves became controlled by business and the governments.

I am sure this sounds familiar with today's focus on the internet, but the author (Tim Wu) says: "The individual holds more power than at any time in the past century, and literally in the palm of his hand," Wu writes. "Whether or not he can hold on to it is another matter."

While I agree with this, my own experience demonstrates that in order to have this "power"--which consists of nothing more than having an incredible amount of useful information at your fingertips--it takes a lot of skill to be able to find what you want after wading through all of the garbage that is in the way. (I haven't read the book, so perhaps there is more there. By the way, there are several videos of him discussing his book, which I also haven't seen yet but plan to, e.g. http://www.youtube.com/watch?v=LVZLl4EKQis)

One thing I have learned by answering reference questions is a major difference I have with almost all of my patrons: to do good reference work, you need a lot of patience and that so-called "stick-to-it-iveness". The vast majority of people give up far too soon, or decide "good enough," or even--a surprising number of times--they change their question because they feel they are either getting too much information or not enough! People who know how to search, where to search and what to search for definitely have a certain amount of power, but the average person who has not been trained can only type a few words into Google, or the database of their choice, and assume (that is, they have no choice except to hope) that whatever comes up first is the most "relevant" and the rest must be "irrelevant".

Many times, the reference librarian's job is simply to use the right tool for the right job. I wrote about this before in a posting "Cablegate from Wikileaks: a case study" http://catalogingmatters.blogspot.com/2010/12/cablegate-from-wikileaks-case-study.html, where I mentioned that I am not a US government documents expert, but nevertheless based on my training, I could find the answer to a very specific question--yet it took me time. I am sure that the average person could not do what I did, because they don't have my training, but also: I didn't give up.

Robert Noyce (co-inventor of the integrated circuit, i.e. the microchip) said, "Knowledge is power. Knowledge shared is power multiplied." So, while I agree that the individual holds an incredible amount of power/knowledge, this remains only potential power if you can't find what you need, or if you are satisfied with letting some algorithm, created by who knows whom for who knows what purposes, decide for you what is "relevant". If this is the case, it is only logical to ask what exactly the "power" is that the individual is supposed to have, and who or what actually controls it.

I honestly believe that the traditional goals of librarianship are just as important now as ever before. What seems to be simple and easy more often than not turns out to be far more difficult and demanding.

Friday, March 18, 2011

RE: "Business case" for RDA changes (was: RDA "draft")

Posting to RDA-L

Mac wrote:
<snip>
But this could have been accomplished by an AACR2 revision page, and treaties make up a very small part of the collections of most libraries.
</snip>
Mac is right, and this idea, of course, goes beyond treaties to include the whole of RDA. While we can all agree that some rules can and should be changed, it does not add up to a *business case* to justify junking our old rules and spending our quickly diminishing budgets for an entirely new set of rules that everybody must be retrained in, especially when the final product will be practically the same as what we make now.

Another point is that perhaps we can use the computerized tools available to us much more wisely. While *perhaps* there is a problem with people understanding abbreviations (many of which they read in newspapers and books all the time, but for the sake of argument, for the moment I'll accept that abbreviations are indeed a problem, and so much of a problem that we must focus our resources on "correcting" abbreviations over other problems), it still doesn't follow that the best solution is to type out everything in full by hand. For instance, if we do so, it can be argued that we are not solving anything at all for our users since all records we have made up to now will have abbreviations. Unless we embark on a huge retrospective project, every user until the end of time will be looking at and dealing with abbreviations. That is an absolute fact that we *cannot ignore* because if we do so, we *will be ignoring* the needs of our users and only give non-catalogers yet another reason to say how little catalogers care about the users. At the same time, the techies are always complaining that we use text instead of codes. Here are the abbreviations, more or less: http://www.library.yale.edu/cataloging/abbrev.htm Can there be automated solutions to solve this problem? My answer: of course there are! I'll bet a Perl programmer could devise a preliminary solution in a few minutes.

In answer to Karen, I will point out that protestations that we cannot make a business case will fall on very deaf ears and be fatally counterproductive. There is absolutely no choice except to make a valid business case and one that will be convincing to non-librarians. Still, the final point is right on target and needs a lot more discussion:
"It could very well be that changing from AACR2 to RDA has a small return on investment, but that a much larger return on investment could come from other changes -- ones that we aren't even contemplating."

RE: RDA "draft"

Posting to RDA-L

Mike Tribby wrote:
<snip>
Should cost of access and the possibility of universal access have been concerns? I think they should have been-- but they were not. To perhaps put it crassly: theoretical purity was a higher concern than access. It's hard to blame the co-publishers very much since none of them are exactly rolling in extra money, and this process has been expensive, but some of us have been complaining about the assumed cost of subscriptions to RDA for some time now.
</snip>
The current metadata universe could not have been foreseen when FRBR and RDA were being created. I can't find fault with anyone on this. FRBR first came out in 1998 (which meant several years of development before that). It turned out to be the model for the later work on RDA, which didn't begin until 2005 or so. While this may be considered the "fast track" in traditional library experience, the revolutionary changes in information searching and retrieval brought about by Google and continued by many other very powerful companies didn't really begin until afterwards, about 2000 or so. In fact, Google didn't go public until 2004. Most of the really new possibilities of search have taken place only in the last few years and I think we all expect these changes to increase at a huge rate. People really like these new capabilities a lot, and in fact these tools are considered "the standard" by many who compare our tools to the full-text ones. Nobody could really have expected that in the mid-1990s.

Things often don't turn out as we wish. Those poor people in northern Japan could tell us a lot about that. But "stuff happens" and you have no choice except to deal with it. If the Google-type algorithms had not been discovered (created?), and the global economic meltdown hadn't happened and everybody were still swimming in money like before :-) , matters would be quite different for librarians and catalogers now. But libraries have lost whatever "primacy" they had in metadata, the black box has been opened (as I mentioned in my last podcast) and there is no telling what will happen.

But if RDA is implemented, it must split the library metadata world; that is clear.

Thursday, March 17, 2011

RE: RDA "draft"

Posting to RDA-L

Laurence Creider wrote:
<snip>
Please do not tell me to consult the workflows; if you are making a cataloging code, the rules should be structured not according to a theoretical model but to facilitate the production of metadata, in other words, the very nitty-gritty contact between any model or rules and the varieties and perversities of the ways information resources present themselves.
</snip>
Well said! Although theory is fine, it all ultimately comes down to a very practical task: I have this *thing* I need to add to the collection, so I have to make some kind of metadata for it. One part of this metadata is a title. My resource has three possible titles on it, plus a fourth if I use my imagination a bit (this happens a lot with books, which may have different titles on the title page, vs. the half title, vs. the running title, plus there may be a series title or title of a related variant). Which titles do I need to enter and how do I enter these titles into the record in a way that is coherent to patrons and to other librarians? This may seem easy, and often it is easy, and other times it is very difficult. Naturally, this same task needs to be extended to every single part of the record. Catalogers do not have time to sit around and theorize about these matters since there are mountains of materials waiting.

For many years, I was the moderator of SIG-IA, the ASIS&T list for Information Architecture, and sometimes the discussions would veer into metadata creation. Most of these people were web masters who had no idea that there even were any rules for such an esoteric task. I remember in particular, one woman who was trained as a dancer, who said that she created her metadata through "her feelings". Looking up rules and practices of others was *not* for her. Another time, I had a series of discussions with a faculty member who was trying to set up what would later be called an open archive and he wanted to get faculty to create metadata for their own resources. This fellow actually listened to my explanation of what cataloging is: following rules and trying as best you can to maintain consistency (I showed him the rules and LC classification tables, which blew him away), and he finally became very depressed about faculty making their own metadata, because he came to understand the importance of consistency, and as he put it: faculty members see metadata creation as extensions of their own creativity. Obviously, these popular views of metadata creation reveal something quite different from following cataloging standards as closely as you possibly can.

Also, there are practical issues that could be more or less safely ignored in the past, but we must discuss them today, as exemplified in the recent Autocat thread: "Help! "Elevator speech" for keeping a cataloger" http://tinyurl.com/645bdgl, where the thread's originator wrote: "I am hoping your feedback will sway my boss. She has a general disdain for "traditional" library activities, which includes the library catalog and cataloging in general. She has described authority control as a waste of time and wonders why we should bother with a catalog at all." This type of administrator is not at all unique today. In many cases, there are no "friends of the library" in upper echelons.

Ultimately, I think it is the lack of a sound and reasonable business case in favor of RDA which is the real problem. This has been brought up over and over again, including in the report of the Working Group. Everyone is just supposed to accept that it makes sense to spend all this time and effort training people how to use the new rules, with the final result that abbreviations are spelled out and that N.T. and O.T. are not entered into their headings anymore. People will see weird dates and some relator codes here and there, but otherwise, they will see no changes of substance. Searching will be the same, the records will look the same, everything will be the same except for some details here and there that probably, no one will even notice unless catalogers point them out.

Compare this with the ONIX Best Practices at http://www.bisg.org/docs/Best_Practices_Document.pdf and you will see that *each field* has an associated business case in its favor. Although I don't think this level is necessary for library cataloging rules, at some point something will have to be done, because otherwise the changes RDA offers seem completely random and strange with no overall purpose--at least this is how they seem to me and I am sure how they seem to many others. I still see *no tangible advantages* whatsoever and I cannot imagine that a non-library administrator would see any more than I do.

As a result, if there is no convincing business case--and it will have to be a pretty convincing one--I fear that the attitude above must win out among administrators who absolutely have to save money today. They are thinking: where can I cut? I can imagine that they would conclude: if this is all catalogers can offer, why should we bother with a catalog at all? What else can we do that will offer savings and access?

Monday, March 14, 2011

RE: [ACAT] "our" vs "or" (was:Re: Subjective Judgements in RDA 300s????)

Posting to Autocat

Pete Schult wrote:
<snip>
Nice. It is, of course, necessary to keep in mind that the results may not be ready-for-prime-time. The German translation had "Krank" for the "ill." in the physical description.
</snip>
That is so funny!! I guess there are some reasons why we can't simply use Google Translate for our purposes, but it's important to keep in mind that when we enter text into a computer or on a webpage, it is completely different from the text printed on a physical page or a card: because it is computerized, it can be transformed in all kinds of different ways for all kinds of purposes. For example, it would be possible to create an API, much simpler than Google Translate, that would work only with 300$b and would transform our few specific abbreviations in this field into full forms. While this would demand a certain amount of labor, it would work for every record in our database and be a much better solution than retyping everything. Then, instead of everybody typing everything out, one IT person would add an API, which is quite easy.
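A minimal sketch of such a field-specific expansion might look like the following Python. The abbreviation table is a small illustrative subset that I am assuming for the example, not an authoritative list, and the function name `expand_300b` is hypothetical. The 300$b text is assumed to arrive as a plain string, so no particular MARC library is involved.

```python
import re

# Illustrative subset of 300$b abbreviations and their spelled-out forms.
# This table is an assumption for the sketch, not a complete or official list.
EXPANSIONS = {
    "ill.": "illustrations",
    "col.": "color",
    "port.": "portrait",
    "ports.": "portraits",
    "facsim.": "facsimile",
    "facsims.": "facsimiles",
}

# Longest abbreviations first; the lookbehind keeps "ill." from matching
# inside an ordinary word such as "will."
_PATTERN = re.compile(
    r"(?<![A-Za-z])("
    + "|".join(re.escape(abbr) for abbr in sorted(EXPANSIONS, key=len, reverse=True))
    + r")"
)


def expand_300b(text):
    """Replace known abbreviations in a 300$b string with their full forms."""
    return _PATTERN.sub(lambda match: EXPANSIONS[match.group(1)], text)


print(expand_300b("ill. (some col.), ports."))
# prints: illustrations (some color), portraits
```

Because the lookup is limited to one subfield and one short table, "ill." can only ever become "illustrations", never "Krank". The stored records stay as they are; the expansion happens at display time.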

Also, even entertaining such a solution brings up a related question: would this be the best use of the new technology? If we could do something like this, who would it be for? For our patrons or for ourselves? If our patrons were to see such possibilities available, would they say: "Well, we want you to work on spelling out abbreviations in the record." or would they choose something quite different and much more useful--for *their* purposes?

I think the latter.

Thursday, March 10, 2011

RE: "Opacity" of AACR2 (was: question ...)

Posting to Autocat

J. McRee Elrod wrote:
<snip>
There is nothing in the physical form of the card catalogue to prevent more than one main entry, if the shelf list card were used to record entries as opposed to the main entry card; main entry is only a matter of unit card indentation, and two or more entries could share first indentation, being ticked for filing.
</snip>
The purpose of single main entry in a physical catalog was purely mechanical. If you had a single card for a single (or at least a complete) resource, and you did not have to write out additional cards by hand but had a mimeograph machine or something, matters were much simpler.

But if you have a longer record that needs more cards and has more complex holdings, then matters are different. Through the magic of the Internet, I have managed to find a real example; it's in Russian, but that is unimportant.

In Princeton's scan of their card catalog, we can see the main entry card for Marx & Engels's complete works in Russian. http://tinyurl.com/67jo35c It gives the extent 50 v. in 54. If you go to the next card, there's a dashed-on for an index published in 1974.

When we look for the AE card under Engels http://tinyurl.com/5tblcjb we see it lacks the extent statement, and there is no second card, with no notice of the other index.

When we get to a subject card http://tinyurl.com/6gn7zsa, it's also a single card, so you don't know about the other index, plus, there is the stamp: For Holdings see Main Card (which I guess people understood back then).

Also, when comparing these cards, the ME card has more extensive information than the AE cards. Finally, reading a bit more closely: this bookset started in 1955 and ended in 1981, so there was probably a lot of work done *in pencil* with temporary holdings throughout the years, and changed to pen at the end. Also, this card came originally from LC, cataloged by them in 1956, and was originally a single card. The second card was a locally typed one when the second index came out, and they stamped "See next card" on the LC card.

In a card catalog, they wanted above all to keep the number of cards to a minimum so the catalog would not get out of control; this is why AE cards were limited to single cards. Plus, just looking at the physical labor the ME card above shows, to copy that same information for, e.g. in this case, holdings in several different places, plus the shelflist and maybe even an official catalog, if you had one, was just too much work and begging for errors.

(I can't hold myself back from mentioning what appears to be a "correction" from the reviser, or at least some kind of a change. The original cataloger wanted to add a card under the title of this book in Russian (Sochineniia) for Karl Marx, but it was crossed out. Whether this happened at the time of cataloging, or this card was withdrawn later, I don't know. But the final result is rather confusing: you can find this book under Marx, Karl, 1818-1883. Works. Russian (the ME card), but not under the book's title "Sochineniia", whereas you *can* find this book under Engels, Friedrich, 1820-1895. Sochineniia, but *not* under Engels, Friedrich, 1820-1895. Works. Russian, exactly the opposite of Marx. Curious. Anyway, I like to think the cataloger got his or her hands slapped here!)

RE: question about RDA: title capitalization and ME relator codes

Posting to Autocat

On Wed, 9 Mar 2011 09:58:19 -0500, Brenndorfer, Thomas wrote:
<snip>
These authorized access points are there for backwards compatibility-- for card catalogs and for linking of headings in MARC environments. So no, the main entry is not just an anachronism-- it is still a critical component tied into the current functionality of our catalogs, whether collocating headings or in linking headings in a MARC environment. So one still has to know the instructions for main entry to create card catalogs and MARC catalogs. But RDA is organized in such a way that future catalogs can be constructed without them, and therefore without the idea of main entry.
</snip>
It's nice to know that choosing a single main entry will apparently no longer be needed someday in the future, but for now, we still need to do it. Nevertheless, I have always said that main entry is a vital concept, but a single one is, without a doubt, anachronistic.
<snip>
>"Finding" can mean scattershot keyword searching, or it can mean controlled searching where you find **all** the resources related to an entity. In the case of Clint Eastwood, you can find all resources he is associated with, or you can find those resources where he played a particular role. Indicating that role with something like a relationship designator helps people find the resources they're interested in. A statement on the resource specifying who's the director is an identifying element, but the function of the relationship designator (a different element) is to help people find those resources.
>
>The task is "find" not "identify" in this case.
</snip>
I still don't agree, but I am tired of discussing the 19th century user tasks, and I am sure others are sick of reading (or listening to) me about this. Besides, the next point is more important:
<snip>
>Many years ago, in my SQL course, a class assignment had everyone create a database for DVDs. Within no time we had tables for the equivalent of manifestations, expressions, items, persons, and roles. That's Mom-and-Pop stuff for DVD rental store databases, and every kid in the class got it. My take was that this was so similar to library cataloging except that we seemed scared stiff about things like relationship designators, when in fact the idea is elementary and widely used.
</snip>
I've done the same thing, but there is a tremendous difference between the rather elementary (today) task of creating a brand new relational database with different tables, and the library task of taking what has been handed down to us after over 100 years' worth of work and dealing with what we have. If librarians were setting up a brand-new database, I am sure they would never choose a MARC type of format, and I am even more sure that they would not choose ISO2709 for record transfer. If they did, they would be laughed out of the office! Plus, the amount of change in the cataloging procedures and rules themselves since the beginning of the modern library catalog has been breathtaking as well. These are just facts, and this is what we have now.
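For readers who have not done the classroom exercise, a brand-new database of that kind can be sketched in a few lines of Python with the standard sqlite3 module. The table and column names here are my own assumptions, and the sketch collapses the design to three tables; it is not anyone's actual schema. It shows how elementary the task is when you start from nothing, which is exactly the condition libraries do not enjoy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE manifestation (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- One row per person, per resource, per role (the relationship designator).
CREATE TABLE role (
    person_id INTEGER NOT NULL REFERENCES person(id),
    manifestation_id INTEGER NOT NULL REFERENCES manifestation(id),
    designator TEXT NOT NULL
);
""")

conn.execute("INSERT INTO person VALUES (1, 'Eastwood, Clint')")
conn.executemany(
    "INSERT INTO manifestation VALUES (?, ?)",
    [(1, "Unforgiven"), (2, "Dirty Harry")],
)
conn.executemany(
    "INSERT INTO role VALUES (?, ?, ?)",
    [(1, 1, "director"), (1, 1, "actor"), (1, 2, "actor")],
)


def titles_by_role(connection, name, designator):
    """Return the titles on which the named person has the given role."""
    rows = connection.execute(
        """
        SELECT m.title
        FROM role r
        JOIN person p ON p.id = r.person_id
        JOIN manifestation m ON m.id = r.manifestation_id
        WHERE p.name = ? AND r.designator = ?
        ORDER BY m.title
        """,
        (name, designator),
    )
    return [title for (title,) in rows]


print(titles_by_role(conn, "Eastwood, Clint", "director"))
print(titles_by_role(conn, "Eastwood, Clint", "actor"))
```

The query is trivial because every role was recorded as coded data from the first day. Our legacy records carry the same information only as free text in statements of responsibility and notes.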

Libraries are also different from most other organizations in that we are supposed to make resources cataloged from 150 years ago (and more in some cases) just as easy to search and available as anything processed today. Most other organizations do not need this sort of accessibility for their older materials and therefore, archive anything that is over 10 years old, if not less. Therefore, to find those materials is more difficult but they don't care because older materials are considered less important and needed very rarely.

I have met several database designers who say libraries need to do the same thing: just throw your old stuff into a separate archival database and start in with something entirely new. If this happened, then your example of a Mom and Pop DVD store might be applicable. (I have a sneaking suspicion that something like this may be in the plans, and it may actually be more or less achievable--but I am merely guessing)

But no matter what: we will still be dealing with our legacy *data*. Whether we like it or not, our predecessors did not include information that we would like to have today and we don't have the sci-fi option of getting into our time ship, travelling back and saying, "You dummies! Do it THIS way!!!" We have what we have.

So the only *correct* answer for someone wanting to find people by relator codes (relationship designators) of "editor", "author", "translator", "director", "actor" and so on is: the library catalog is not the correct tool. Although the information is there, you have to dig it out of the statements of responsibility and/or the notes. The library catalog is not and will never be the correct tool for such a task because it wasn't designed for it. There's nothing awful or terrible about stating this: it is also not the place to find journal articles, or specific chapters in books (although sometimes the chapter titles are entered) or tons of other things. Certainly, people expecting full-text keyword searching should not use a library catalog (but many don't understand this).

In the case of searching for directors of films however, this is not any kind of a problem at all for the searcher since there are other tools that can be consulted *just as easily* as our library catalogs, if not easier. With other types of resources, there may be other tools out there. For all I know, there may be a database that lists people working only as editors or translators. If so, maybe we could work with it.

The problem at this point is to figure out how to let searchers know that they must switch to other tools, either because the information doesn't exist in the catalog or because another tool is much more efficient for them. How do we let them know? At one time, people were supposed to ask the reference librarian, but not that many people asked, and it sure doesn't work today. Today, the catalog must supply that information somehow. I don't know what the best way is, but I know there are several possibilities. I am sure there must be a number of different solutions, so long as we are innovative and decide to honestly cooperate.

Working with the other information providers on the web would certainly be a lot of work for the IT staff, but it would be far more progressive than just adding the relator information to new records, which would forever provide false information to the catalog searcher. It would also be far *less* work than a mind-boggling retrospective conversion, which would waste tremendous amounts of library resources on a very minor task when we are faced with real problems.

Wednesday, March 9, 2011

RE: [ACAT] question about RDA: title capitalization and ME relator codes

Posting to Autocat

Brenndorfer, Thomas wrote:
<snip>
[James Weinheimer wrote]:
> What strikes me in these kinds of discussions is that we see an increasing complexity to reach what is precisely the same result.
No, the complexity is derived from the opacity in AACR2, where all of these terms have to be decided upon by catalogers in any case so they know what can be a main entry and what can only be an added entry. Instead of recording those decisions, we slot entries into 1XX's and 7XX's as if the only output that matters is card catalog sorting.
</snip>
The determination of a *single* main entry is a different, and quite difficult, matter, and as I have mentioned a number of times, selecting a *single* main entry is an anachronism left over from the physical forms of the catalog. Figuring out a single main entry really can be difficult, but if we could change it from what we have now (single 1xx and multiple 7xx) to something more like--even--Dublin Core with "creator" and "contributor", it would make it easier for us to work and train (e.g. no longer figuring out whether a corporate body should get main entry, although it is definitely a "creator").

But this is a problem of determining which of the many creators/contributors to place into the 1xx field, not figuring out which roles they have played. I don't think that has ever been much of a problem. Certainly not in my experience.
<snip>
[James Weinheimer wrote]:
> Adding the relator codes will not help anyone *find* anything more than they do now, it is that adding relators *may* help people *identify* a specific item they want.
Sure it will. Check out "Clint Eastwood" in Internet Movie Database (http://www.imdb.com/name/nm0000142/). The results are divided by his role-- actor, director, producer, and so on.
</snip>
No, what this shows is not *finding* but "identification". In the IMDB, you can *find* only by "All; Titles; TV Episodes; Names; Companies; ..." (from the drop down box). Once you have *found* in this way, then you can *identify*, in this case of Clint Eastwood, by Actor, Director, Producer. If you could choose in the drop-down box, by actor, director, producer, etc. then you could *find* by more, but as it is, it only helps with identification after the search result.
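
The distinction can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the sample data and the function names are hypothetical and have nothing to do with how the IMDB is actually built.

```python
from collections import defaultdict

# Illustrative sample data: (name, title, role).
CREDITS = [
    ("Eastwood, Clint", "Unforgiven", "director"),
    ("Eastwood, Clint", "Unforgiven", "actor"),
    ("Eastwood, Clint", "The Good, the Bad and the Ugly", "actor"),
    ("Leone, Sergio", "The Good, the Bad and the Ugly", "director"),
]


def find(name):
    """The search step: the only key the searcher supplies is a name."""
    return [credit for credit in CREDITS if credit[0] == name]


def group_by_role(results):
    """The display step: sort what was already found into role groups."""
    grouped = defaultdict(list)
    for _name, title, role in results:
        grouped[role].append(title)
    return dict(grouped)


print(group_by_role(find("Eastwood, Clint")))
```

The role never enters find() at all; it only organizes the display of what the name search has already retrieved, which is what I mean by identification after the search result.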
<snip>
How is it easier to wade through all results lumped together to find the movies he just directed? Currently we have to scan the entire record to find statements that indicate that relationship (and we put those statements in because we claim they're important for the end-user, yet we certainly don't make it easy for users to sift through results !!).
</snip>
This shows a larger area of concern. From a theoretical, even utopian point of view, I agree, but there are basic--even brutal--facts that state otherwise. What are those facts? 1) there is a finite number of library catalogers; 2) that number is not growing appreciably and may be heading downward; 3) library budgets are in great trouble; 4) yet the number of materials needing cataloging is growing, and perhaps exponentially. Conclusion: sooner or later, something has to go "POP!" In other words, what we have now is not a sustainable situation.

Does it make sense that we make the cataloging of individual records even more complex? Does it make sense to spend our resources to recreate a functionality currently available to everyone in the IMDB, while admitting that our product will forever be incomplete since we cannot retrospectively update our records? (What a waste of resources that would be!) What would be the patron's view if we decided to add the relator codes for films to our new records anyway? They would conclude (correctly) that our work was inferior to IMDB since the results would be complete in IMDB and not in a library catalog.

Libraries have only so many resources and there *definitely must be* a tradeoff: would our public rather have catalogers add this kind of role information (that they can find in the IMDB and scads of other sites online), or that we spend our time making more records?

There is a third way possible today: to admit that library catalogs are not separate, but exist within a tremendous universe of information that can be exploited in all kinds of ways. How can we use that universe of information? Would there be some way for libraries to use (interoperate with) the "superior" information in the IMDB or in one of these other sites, instead of redoing the same work?

We must focus on what is practical today, even if it may be unpleasant. It is a fact that people have many problems using our catalog records (as I tried to point out in my previous post). Is relator information one of the problems? Perhaps or perhaps not, but if it is, it is definitely far, far, far down the list. We need to focus our energies on the important problems.

Tuesday, March 8, 2011

RE: question about RDA: title capitalization and ME relator codes

Posting to Autocat

On Mon, 7 Mar 2011 16:58:21 -0500, Brenndorfer, Thomas wrote:
<snip>
The issue of "forcing" the variety of relationships through a limited set of designators is a complete red herring in discussing RDA.
>
>Justification of access points by other elements is one thing, but the problem is that's often the only method users have of inferring relationships. The reason why access points are what they are is not readily transparent. For example, how many users think of a "work" when looking at a 100 field? Catalogers have no choice but to identify a work in order to catalog an AACR2 record.
</snip>
What strikes me in these kinds of discussions is that we see an increasing complexity to reach what is precisely the same result. Adding the relator codes will not help anyone *find* anything more than they do now, it is that adding relators *may* help people *identify* a specific item they want. (That still remains to be demonstrated) By this I mean that when people search, they will select (at the most) "author, title, subject" as they do now, and not from the entire gamut of terms at http://www.loc.gov/marc/relators/relaterm.html.

While there may be some kind of minor utility for the user in seeing the relator codes, this discussion shows clearly that it will definitely be more complex for the cataloger and take more time. For example, let's say that you are cataloging a "remote-accessed electronic resource" and you come across someone who is the "web coordinator". Which relator code do you use? Often, you see names not with "editor" or "compiler" but with a person's job title, and who in the world can know what that is supposed to mean? How do you choose a relator code for "web coordinator"? If I choose one of the terms, how do I know if the next cataloger will do the same thing? This is the brilliance of the statement of responsibility, which is far more exact and easy to implement because all you do is transcribe what you see.

And Mac mentions that consistency will be broken, and this is a *very serious* matter that should not be brushed aside. The old records will never be "updated" and that is a concern since you potentially make all of your previous records obsolete.

So, complexity and difficulty of record creation increases, without a doubt, and the benefits to the users are highly dubious, and I would even say, remain theoretical. While it is clear that the public has lots and lots of problems understanding our records, they increasingly have problems even understanding *what our records are* because they have become used to seeing the results of Google full-text searches, with that small clip underneath each link that shows how their search term has been used. Of course, the Google "record" that they see is completely different from anything we do. To demonstrate, here are the first three results for the search "metadata" on Google:
<snip>
Metadata - Wikipedia, the free encyclopedia
The term Metadata is an ambiguous term which is used for two fundamentally different concepts (Types). Although a trite expression "data about data" is ...
Definition - Metadata types - Metadata structures - Metadata standards
en.wikipedia.org/wiki/Metadata - Cached - Similar

[PDF] Understanding Metadata
File Format: PDF/Adobe Acrobat - Quick View
Understanding Metadata is a revision and expansion of Metadata Made ... Metadata can be embedded in a digital object or it can be stored separately. ...
www.niso.org/publications/press/UnderstandingMetadata.pdf

Metadata Definition
The definition of Metadata defined and explained in simple language.
www.techterms.com/definition/metadata - Cached - Similar
</snip>
Of course, it is very natural for the public to relate our catalog records to these types of "records" which are just clips that show keyword in context. This type of result is pretty easy for anyone to understand, but comparing it to a catalog record is simply wrong since they are completely different, with separate purposes. This is the sort of thing that confuses our public, and not relator codes.

It seems to me that adding the relator codes is a solution in search of a problem. There are plenty of genuine problems out there that need genuine solutions. This is not one of those.

Friday, March 4, 2011

RE: "our" vs "or" (was:Re: Subjective Judgements in RDA 300s????)

Posting to Autocat

On Thu, 3 Mar 2011 01:11:31 -0500, Hal Cain wrote:
<snip>
>On Wed, 2 Mar 2011 10:33:54 -0500, Aaron Kuperman wrote:
>
>>Is there a way we could have the OPACs and cataloging systems automatically switch between British and American spellings in fields that don't involve transcription, such as the 300, the 5xx fields (other than quoted notes), etc. It seems to me a program would be told which spelling a library prefers, and could adapt the field without making any extra work for us. It probably could even do that for changing cataloging from non-English sources as well.--Aaron
>
Barbara Tillett's notion of authority control as "access control", offering the user the capability to switch between desired languages for searching (and seeing results) on names (like Confucius, or Vergil, to mention some classics) that vary according to the language environment, could be extended to cover alternative languages for bibliographic description.

In such a case, that would require that "language of cataloguing" be coded differently for British or American English.
</snip>
There are other possibilities though. The incredible example of Google Translate shows an alternative. I have implemented the API for my catalog, as you can see in a record, e.g. http://www.galileo.aur.it/cgi-bin/koha/opac-detail.pl?bib=26480 in the right column, you will see a drop-down box of Google Translate. (This is a mashup and the box may take a moment to load) Just select any language and watch the magic! I am still amazed when I see it.

Students at my institution think this is really cool, plus they have found it useful to translate into and out of Italian. To see the original, just run your mouse over the parts you want. Of course, not all the translations are perfect--people have no trouble at all understanding this today--but it can be implemented very easily and for *no cost* at all.

The differences between British and American spellings ("u" or "re") pale in comparison with what this tool from Google accomplishes. People who know English have no problem understanding that "labour" and "labor" or "centre" and "center" mean the same things, but people have genuine problems understanding languages they either do not know or have a shaky grasp of. Google Translate can provide highly substantial help.

All of this shows that providing translations that can *help* our patrons (and it is important to separate "providing help" from "being perfect") is quite achievable. Just examining a few of the abbreviations in the 300 field shows that Google Translate does a highly creditable job.

If catalogs had something similar that worked only in certain areas/fields of the record, we could create something of much greater utility for everyone concerned. But who knows? It could be that simply implementing the Google Translate API solves *all the problems,* while it is so much more flexible, saves scads of money, plus it allows the catalogers to focus their time on more productive tasks. For example, why would implementing Google Translate *not* solve the "problems" we are discussing here?

Again, modern technology allows an entire raft of solutions that were considered science fiction just a decade or so ago. We must utilize that power to improve what we do and how we do it.

Thursday, March 3, 2011

RE: Standards (Was: Subjective Judgements in RDA 300s????)

Posting to RDA-L


Karen Coyle wrote:
<snip>
Standards are only enforceable if they are measurable. There is no way to enforce a standard on transcribed data elements. The more that our data allows for free text input, the less we can do to ensure that standards are followed.
</snip>
What people are calling "free text" does not necessarily mean that you are free to enter the text you want. It is *text*, certainly, but anything but *free*. For example, the ISBD rules of exact transcription of the title page mean that the information in the 245 is *not* free at all, although as with every rule or standard, there is naturally a little bit of wiggle-room, and the more experience you have, the more wiggle-room you can find. Still, there are limits. No standard could, for example, accept as the final title of a book a preliminary title that the author chose when first writing it and had changed into something quite different by the time the book was finished. This would be like putting the title "Trimalchio in West Egg" on Fitzgerald's "The Great Gatsby" http://www.guardian.co.uk/books/2009/jan/19/1000-novels-top-10-trivia-rejected-titles-mullan. Sure, the title may be of interest and someone may want to record it, but it is *not* the title of the book.

All of this can certainly be measured, and has been a lot, as any cataloger (including myself) can testify who has undergone the (often humiliating) scrutiny of strict revisers. Those revisers would have kicked me out of the places I have worked if I had given them a record like this: http://chopac.org/cgi-bin/tools/az2marc.pl?ct=com&kw=0521348358
For this item:
http://tinyurl.com/4s2rlke
(and you can even compare the scan! http://www.amazon.com/Cicero-Cambridge-History-Political-Thought/dp/0521348358)

Practically every field needs updating. There are far worse records than this one. Still, such things *can* be measured and are, every day.

I also don't see that even in the 300$b it's all that "free". The terms and abbreviations used there have been strictly controlled over such a long period of time, and I have a suspicion that a good Perl programmer could probably work out a routine to display "ill." or "illus." as "illustrations" or in whatever language you would want. Doing this should be child's play. There are just not that many abbreviations, even historically. So, it wouldn't surprise me if it turned out that those abbreviations could perform essentially the same function as computer codes, maybe not quite so perfectly, but it would be far more flexible (different languages) and in any case, a better use of the cataloger's time than the tiresome-Sisyphean task of typing out all of those abbreviations in full (and would only make the programmer's job more complicated), or wishing that the creators of MARC had made special codes and click boxes for everything.
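
As a rough idea of what such a display routine might look like, here is a minimal sketch, written in Python rather than Perl. The abbreviation table is deliberately tiny and incomplete, and the choice of terms, the Italian labels and the function name are my own assumptions for illustration.

```python
import re

# Illustrative and incomplete: a real table would need the full controlled
# list of physical-description abbreviations, historical ones included.
EXPANSIONS = {
    "en": {
        "ill.": "illustrations",
        "illus.": "illustrations",
        "ports.": "portraits",
        "facsims.": "facsimiles",
    },
    "it": {
        "ill.": "illustrazioni",
        "illus.": "illustrazioni",
        "ports.": "ritratti",
        "facsims.": "facsimili",
    },
}


def expand(text, lang="en"):
    """Replace known abbreviations with full terms for display only."""
    table = EXPANSIONS[lang]
    # Longest keys first, anchored at a word boundary so that the end of a
    # word such as "will." is not mistaken for "ill."
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda match: table[match.group(0)], text)


print(expand("319 p. : ill., ports. ; 25 cm."))
print(expand("319 p. : ill., ports. ; 25 cm.", lang="it"))
```

The stored record is never touched; only the display changes, so the same abbreviation can be shown in whichever language the patron has chosen. Anything not in the table, such as "p." here, passes through unchanged.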

The headings are not free-text, by definition. The only place where there is true "free text" is in the note fields (I know--there are probably some others I've forgotten), but not all the note fields. There is nothing wrong with some free-text fields, and they are vital for a cataloger to do the job properly. And Google Translate has demonstrated in amazing fashion how much you can do with transforming true free-text.

It would be so much more fruitful to concentrate on the powers that the computer systems give us and to work with what has been given to us in new and powerful ways. We need to work with what we have. Our traditional controlled terminology should be exploited for all it is worth.

Standards (Was: Subjective Judgements in RDA 300s????)

Posting to RDA-L

Diane I. Hillmann wrote:
<snip>
I think what this discussion points out is a gap in how we think about who contributes to data and how it is created. In libraries we have this fantasy that catalogers are 'objective' and that's what we're trying to do when catalogers create data--provide one-size-fits-all, all-purpose objective data. The problem is that isn't necessarily what our users want, we just think that's it and go on serving it out (no matter that it's not objectivity we were aiming for, but consistency). And the issue of costs keeps coming up to justify why we can't do anything different from what we've always done.
</snip>
I believe the issue is actually one of adhering to *standards* and considering whether adhering to genuine standards is important for libraries or not. The idea of standards is one that the business world understands far better than the library world. Every single day in the business world, they work within real, genuine standards that everybody *absolutely must* follow, whether they happen to agree with them or not, because if you decide not to follow those standards, you may wind up in jail, sued, and the company closed down. Just imagine the number of standards followed when you do something supposedly so simple as buying a can of peas: there are standards for fertilizers, for storage, for grading, for canning, for labeling, and many other standards I do not know about. Everybody assumes much of this when you pick up that can of peas, e.g. do you care about how these peas that you are about to eat have been stored? (Yes) Do you want to know the details? (No) Do you want to be able to go about your business after you eat these peas and not wind up in an emergency room? (Yes) All this is assumed, but we do not consciously think about it since we figure that the experts behind the scenes can be trusted.

None of this has much to do with "objectivity". Even if the standards are not enforced through legal means, that still doesn't let you off the hook in the business world since there are other methods of enforcement, and your products could still easily wind up effectively boycotted by every other legitimate business on the planet, at which point bankruptcy is inevitable.

Of course, in the library cataloging world, not following bibliographic standards can be done more or less with impunity. I personally do not think this laissez-faire aspect of our traditional library practice can be transferred into the larger world of "metadata", which includes the business world. Business is starting to understand the importance of metadata (of the many recent articles, these are interesting, http://radar.oreilly.com/2010/07/metadata-not-e-books-can-save.html, http://www.digitalbookworld.com/2010/metadata-more-important-than-ever/, http://tinyurl.com/32ders5)

Somehow, sooner or later, real standards will be introduced that people will be forced to follow, or suffer negative consequences, just as in the regular business world.

I believe this concern is what underlies Diane's mention of *consistency*, which is correct but has tremendous consequences the moment we take the matter of "consistency" seriously. What does "consistency" mean, and what does it entail?

So, it is not so much a matter of "objectivity" since there may be hundreds or thousands of ways of doing a single task in a competent manner. The task of standards is to find a limited number of those "competent manners" and create a system that allows for *reliability*, i.e. guarantees that the item you receive fulfills these specified standards. For example, there is no single, objective, "best" way of counting the pages of a book. There are more or less accurate ways, but no "objectively best" way. What is far more important is to follow a standard, whether I happen to agree with that standard, or think it is completely wrong.

An associated concern of standards is that people must take a serious attitude toward the standards. So, do we want someone really *playing* with the quality of the water that comes out of our pipes, and laughing and laughing? Do we want an airplane mechanic examining a vital part of the airplane we are about to board, and he sighs, "Who cares?" In these cases, they had better take the standards seriously because people could die from bad water or a plane crashing. In those cases, standards are not a joke.

Maintaining standards in the wider world is more complicated today than before. For example, here in Italy, there are not only the Italian standards for food, but the more recent European standards must be merged somehow. This has been controversial, to put it mildly, but I think libraries could learn a lot from this since I believe we will experience something similar as our library standards must somehow be merged into the larger universe of metadata standards.

In my opinion, adhering to *real* standards that are enforceable, no joking, no excuses, is the only way forward. This also means standards must be realistic and not simply a policy statement. For example, governments could make a standard saying that automobiles must get 250+ miles to the gallon, but that is unrealistic. To ensure reliability, a standard must be achievable by the vast majority of those who are interested. Are RDA, or AACR2, or ISBD achievable in this respect? That remains to be seen. If not, what is achievable and useful for the public?

Still, I realize that setting up and enforcing genuine standards is really a tremendous undertaking, but it will have to be done sooner or later. The rewards are tremendous however: after all, "metadata" should be the librarian's "backyard", the place we've played all of our lives. While we have a lot to learn, we should know it better than anyone else.

Wednesday, March 2, 2011

RE: Subjective Judgements in RDA 300s????

Posting to RDA-L

Hal Cain wrote:
<snip>
My point is that what we provide in cataloguing should be accurate as far as it goes, and it should go as far as is reasonably foreseeable to be useful.
</snip>
Absolutely agreed, but my point is: in the environment we are entering willy-nilly, where everyone and everything is supposed to interoperate, the definition of the word "accurate" must be reconsidered. This is why I added the ONIX example:
"500 illustrations, 210 in full color, 35 figures, 26 line drawings, 8 charts" 
vs.
"illustrations (some colo*u*red)"
as possible descriptions of the same item depending on who made it.

How does "accuracy" figure into this? Are we to consider them "equally accurate"? How does looking at everything in the aggregate affect matters, i.e. multiple records displaying multiple methods and the user sees differing practices more or less randomly? Will this trouble the users, or will they not even notice? How does the librarian figure into this? Does trying to maintain consistency in this bibliographic area not matter? Also, in the huge metadata universe beyond RDA/AACR2/ONIX, there are even more practices.

Personally, I don't think maintaining consistency is worthwhile in this case, but I am sure others would have different opinions.

I grant that illustrations are less critical, but counting the pages (extent) is far more important to decide that we are--or are not--looking at the same resource. There are lots (and lots and LOTS) of ways to count the number of pages. I discussed this at some length in that book "Conversations with Catalogers in the 21st Century". (I need to put my chapter up on the web) This is an area that certainly requires consistency.
<snip>
A great deal of the detail provided in cataloguing has been irrelevant to the majority of users -- but vital to the people who manage the collections and make decisions about selection and discard, and significant to a fraction of end-users.
</snip>
This is a key point that needs to be kept in mind. Libraries have their own needs and purposes (collection management) and this is reflected in their metadata. For the ONIX community, illustration information is added for their purposes, and represents an important *advertising* point, as they say in the Best Practices document under "Business Case" (these sections are absolutely vital to understanding ONIX):
"For many illustrated books the details on the illustrations are a critical selling point. Customers purchasing art books, for example, want to know the number of color plates included in a book.
Customers purchasing atlases want to know the number of maps included in the book. This information can only aid in the sales of illustrated books to both trading partners and end consumers."
Of course, this is quite different from the collection management purpose within the library catalog since there it is not a matter of getting people to open up their wallets and actually purchase a book, but rather simply to help them decide whether to consult it. (ILL is another matter)

I keep going back to the talk Michael Gorman gave at the RDA@yourlibrary conference. It still strikes me as the best way to move forward in the current environment.

RE: Subjective Judgements in RDA 300s????

Posting to RDA-L concerning a record with the 300 description
"319 pages : |b illustrations (some coloured, all beautiful), maps ; |c 25 cm."

Jonathan Rochkind wrote:
<snip>
Because like I said, I suspect that whether illustrations are present in color or not is not of much concern to 99% of patrons 99% of the time. In fact, if you think about it too hard it's a bit frustrating that expensive cataloger time is being spent marking down whether illustrations are colored or not (let alone correcting or changing someone else's spelling of colored!), when our actual real world records generally can't manage to specify things the user DOES care about a lot -- like if there is a full text version of the item on the web and what its URL is. (Anyone that has tried to figure this out from our actual real world shared records knows what I'm talking about; it's pretty much a roll of the dice whether an 856 represents full text or something else, it can't be determined reliably from indicators or subfields.)
</snip>
Hal Cain wrote:
<snip>
I don't agree -- maybe so in an academic environment, but for other kinds of libraries (school and public, and maybe specials too) the presence of illustrations can be a significant element in making a choice of the possibilities. The LCRI for AACR2 which enjoins just "illus." for all kinds of illustrative material doesn't help!
</snip>
I think Jonathan is absolutely right. Cataloger time is valuable, and at least I *very much* hope cataloger time will become increasingly valuable in the future (since the opposite is a terrifying possibility!). It has always been the case that creating bibliographic records/metadata involves a tradeoff of including some information at the expense of other information. For example, the rule as it stands now is that a cataloger needs only to add the first of a number of authors, and "use cataloger's judgment" concerning adding any others. Why should there be such flexibility on a rule as important as this one (flexibility which I personally believe is unwarranted), while we worry so much over whether the illustrations are colored (or coloured)? And Jonathan is completely correct about the problems with the 856 field, which I see miscoded much of the time anyway.

Yet, it is always interesting to compare matters with the rest of the metadata universe out there, since we should be trying to interoperate with them. If you look at the ONIX Best Practices http://www.bisg.org/docs/Best_Practices_Document.pdf look at p. 85 for "30. Illustration details & description" and see their guidelines. Frighteningly detailed, e.g. "500 illustrations, 210 in full color" but we see it can also be: halftones, line drawings, figures, charts, etc.

So, how are we supposed to handle this? If we get an ONIX record with "500 illustrations, 210 in full color, 35 figures, 26 line drawings, 8 charts", do we devote the labor to edit it down to AACR2/RDA thereby eliminating some very nice information? But if we just accept it, what do we do then with the materials we catalog originally? "illustrations (some coloured)" looks pretty lame in comparison and can certainly lead to confusion.

Finally, we should ask: how important is this issue compared to the many others facing the cataloging world today, and how much time should we spend on this issue when, as Jonathan points out, one thing people really want to know is that there is a free copy of Byron's poems online for download in Google Books, the Internet Archive, plus lots of other places, and here are some links. While you're at it, you may be interested in these other links to related resources that deal with Byron's poetry in different ways.

My own opinion is: people are confused in general by library catalogs and their records, while the "illustrations" section is one of the least important areas of confusion.

Considering all of this, maybe "illustrations (some coloured, a few beautiful, several less than aesthetically pleasing, and a couple downright nasty)" isn't so bad after all!