Cataloging Matters Podcast no. 18:
Problems with Library Catalogs
Hello everyone and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy. My name is Jim Weinheimer.
In the last episode, I provided some examples of how people want to manipulate data instead of plowing their way through masses of printed text but I went on to express my doubts that the information in catalog records is actually the type of information that most people want to manipulate. I would like to continue that discussion.
I want to add one more example of the kind of data that people want to manipulate, because it has meaning to me personally.
I used to be a semi-serious chess player. Every beginning player has the experience that, after just a few moves, you find yourself looking at a position you do not understand, but your opponent knows everything. He is smiling, moving quickly and easily, while you are suffering and spending lots of time just to find moves that you hope don't lose. It doesn't take too many of these experiences, and lost games, before you figure out that if you want to have good results, you must prepare your first moves, also called the chess openings, and that means doing research.
This is genuine research by the way—nothing at all like those undergraduate papers where five or six scholarly articles fulfill an assignment that nobody cares about. No, you care. You want the best and you want to be thorough because otherwise, you will suffer and you will lose. So what does it mean to do this kind of research?
In the past, it meant spending money to get the largest library of chess books and magazines you could afford and borrowing anything you could get your hands on. These materials were—and still are—filled with games and notes, and you hoped everything was well indexed, so that you could bring it all together and write—manually—your own “opening book” of good moves, bad moves, plans, ideas and so forth. Doing this could take months of hard work and you were always adding to it.
Today, all this is done with databases and what used to demand so much labor and time to sift through this massive amount of information now takes only a few seconds. The first time I saw one of these tools in action, I was quite literally left speechless! Grandmaster Gennadii Sosonko says that before databases, it took anywhere from a year to a year and a half to prepare a new opening. But because of databases, the research takes only a few seconds, and the data can be mined in new ways, so today to reach the same level of preparation requires only... two weeks! Two weeks versus a year and a half. And you are as well prepared as anyone. That is incredible. For those who are interested, I have added a link to a video that demonstrates this. You don't need to know any chess to see the power of such a tool. Obviously, chess players who do not use these tools are probably at a serious disadvantage.
I have no doubt that others want to do something similar—not with chess, but with whatever topic they prefer. I know I would. The reason it works so well with chess is because the moves that once were printed on paper have now been made into data and that data can be manipulated by computers in all kinds of ways. To do something similar with other topics, it would be necessary to turn the information on paper into a kind of data that computers can understand and work with.
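For readers of the transcript, here is a minimal sketch of what "moves made into data" can look like. It is not how any particular chess database works; the sample games, the `SCORE` table and the function name `next_move_stats` are all hypothetical and chosen only for illustration. Given the first moves of an opening, it reports how often each continuation was played and how well White scored with it.

```python
from collections import defaultdict

# Hypothetical sample games: (moves in algebraic notation, result).
# A real database would hold millions of games; these four are invented.
GAMES = [
    (["e4", "c5", "Nf3", "d6"], "1-0"),
    (["e4", "c5", "Nf3", "Nc6"], "0-1"),
    (["e4", "e5", "Nf3", "Nc6"], "1/2-1/2"),
    (["d4", "d5", "c4", "e6"], "1-0"),
]

# Points scored by White for each possible result.
SCORE = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}


def next_move_stats(games, prefix):
    """For games starting with `prefix`, map each next move to
    (number of games, average score for White)."""
    totals = defaultdict(lambda: [0, 0.0])  # move -> [count, points]
    n = len(prefix)
    for moves, result in games:
        if moves[:n] == prefix and len(moves) > n:
            entry = totals[moves[n]]
            entry[0] += 1
            entry[1] += SCORE[result]
    return {move: (count, points / count)
            for move, (count, points) in totals.items()}


for move, (count, average) in next_move_stats(GAMES, ["e4"]).items():
    print(f"1.e4 {move}: {count} game(s), White scores {average:.0%}")
```

The point of the sketch is that every move is a separate, comparable value, which is why a computer can regroup and count games in seconds. A catalog record describing the book in which those games were printed offers nothing comparable to count.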
It also shows the problem with catalog information that I discussed in the previous episode of Cataloging Matters. As a chess player, I am interested in the data of the chess games themselves, that is, the individual moves, their evaluations, who played them and when, not the data about the books and the serials and the videos and everything else that contains the information I want. Therefore, as a chess player interested in improving my play, which information from the catalog record would I want to manipulate? The fixed fields, the standard numbers, the main or added entries, the titles, the publication information, the physical information, the series information, the notes, the subjects? None of that helps me improve my play. And yet, I am always interested in finding more “chess data” to put into my database.
In the same way, I think most people are interested in improving their knowledge and understanding of baroque architecture or political issues in their community or plasma physics or whatever interests them, but manipulating the bibliographical details of the containers that hold the information that interests them will not help them understand those topics.
This is why I say that while we can go ahead and “turn our catalog records into data”, it is, lacking any evidence to the contrary, at the very least extremely naive to expect the public to find new insights into the topics that interest them because they will be able to manipulate the standard numbers or the publication information or the notes or the publication patterns, or any of the other information that is in our records.
So, why would anybody need catalog records? What more could I want regarding my chess data? As I said before, I am always interested in finding more “chess data” to put into my chess database, and this is where the catalog information comes in. Although the catalog does not have chess data, it can lead me to chess data.
It can be argued that full-text searches can lead to more chess data, too. What is the difference between these tools?
Everyone recognizes that the public has changed its “information seeking” behavior in fundamental ways from what it was only 20 or so years ago. For those listening who may be relatively young, 20 years may seem like a long time, but in library-time, it must be recognized that 20 years is quite literally the blink of an eye. What this means is that every day almost everyone who uses a library's collection works with materials and records made long ago. Often, those materials are among the most important and valuable parts of the collection. This does not happen in many other fields, such as businesses and most other organizations. For them, the information made before a certain time, say five or ten years ago, is much less important for their needs and is discarded or archived, and when it is retained, it is kept as a curiosity.
Materials in libraries are very different.
Full-text search engines have profoundly changed the way people search and even the way people think about searching. It seems that even for many of those who did work in those earlier times, their memories have faded. I know my memories have until I start working to remember.
One example of how deeply we have changed is that today, everyone takes for granted the over-arching importance of “relevance ranking”. Relevance, a word that sounds innocent enough, has taken on semi-propagandistic uses in that it mixes the sense of its meaning in statistics and information science with the way it is more popularly understood. Companies such as Google, which make billions of dollars, are very interested in making sure that these two definitions remain mixed together in people's minds as much as possible.
In spite of what some may prefer to believe, the two senses are definitely not the same, but it can be difficult to see and comprehend the difference. We can discern that difference most clearly when we examine a search engine result versus a search in a library catalog, when the search in the library catalog has been correctly made and the library catalog also works correctly. I emphasize correctly because it is extremely difficult to do today.
How do people find materials with full-text searches? Research on search engines (I have some links in the transcript) has consistently shown that people concentrate almost all their attention on the top three or so results. People almost never go beyond the first page. It should be added that the default number of search results in Google is ten, and since people rarely change a default setting, the first page means ten results.
(Search User Interfaces: Presentation of Search Results / Alexander Schreiner. In: Themen des Information Retrieval : Suchmaschinen und Web-Suche : Beiträge des Seminars im Sommersemester 2012 / Andreas Henrich, Daniel Blank (Hrsg.). p. 35+ and Search User Interfaces / Marti Hearst. Cambridge University Press, 2009. p. 136)
I have personally been fascinated when I watch people work with Google. They put in a word or two or three, look at the top three results, or five at the most, and if they don't find what they want, they immediately try other words, look at the top three or five results, try yet other words, and so on.
I confess I have found myself searching Google in exactly this same way. Such actions betray a number of assumptions on the part of the searchers—and this apparently includes me when I do it.
Many of these assumptions are rather illogical but entirely understandable. As one example of these assumptions, it seems illogical to believe that a search through the vast information resources now on the internet and that retrieves several hundreds of thousands or millions of results could possibly have only a paltry three or four hits that are “relevant” and that the millions of other pages are therefore practically “irrelevant” and can be ignored. That really makes no sense but it is what I see with Google results. After the top few results, the rest really is almost completely irrelevant.
After the first few hits, I see more and more places to buy books or videos or tee shirts or bizarre email exchanges that are (I guess) somehow “relevant” to my search. I have always found this very strange. You would think you would find highly relevant items at first, then slowly you would see less relevant and gradually it would trail off to complete irrelevance, but my experience, which may be different from anyone else's, has been a more or less complete drop off after the first five or ten maximum. Therefore, I think people are right to stop looking after the top few. But I often think: is that true? I can't believe it. Furthermore, to believe that a machine could automatically bring the results to the top that are the “best” and “most appropriate” and to do it for me as an individual at any particular moment, is akin to magical thinking.
It begins to make more sense when we consider the information science meaning of the word “relevance”. That meaning of relevance is quite different and has to do with mathematics and algorithms, with precision versus recall and so forth. This is the meaning of relevance for a Google search—buried in statistics and algorithms (almost all secret by the way)—but it is something I don't believe the average person understands. When people hear that the top hits are the most “relevant” to their search, they confuse this algorithmic sense of “relevance” with “best” or “most appropriate” or “most useful” and then, they eventually come to believe that these pages, by definition, really are the “best” or “most appropriate” or “most useful”.
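To make the information science vocabulary concrete, here is a small sketch of the standard definitions of precision and recall. It says nothing about how Google actually ranks pages, which is not public. The document identifiers and the function name `precision_recall` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the searcher looks at four results, two of which
# are useful, while eight useful documents exist in the whole collection.
looked_at = ["d1", "d2", "d3", "d4"]
useful = ["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"]

precision, recall = precision_recall(looked_at, useful)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this invented case half of what the searcher looked at was useful, yet three quarters of the useful material was never seen. Both numbers are computed from sets and counts. Neither one measures whether a page is the “best” or “most appropriate” for a particular person.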
I can't prove it, though I don't think it can be disproved either, but I have come to suspect that Google does not so much find the most relevant sites (even in the information science meaning) as it has managed to move the completely irrelevant junk that had tormented everyone for such a long time to lower levels in the search result. What is left over is popularly interpreted as the “most relevant” or “best” but what is genuinely the “most relevant” or “best” may still lie buried inside the search result somewhere or not even in that result at all.
More importantly though, this matter becomes clearer when we compare it with a correctly done catalog search where everything works differently. Let's imagine that I am interested in “popular songs”. A reference librarian would immediately understand that my request most probably reflects a lack of focus and would begin to ask questions such as: popular songs from where, from what time, which genres, am I interested in recordings or texts, and so on. A reference librarian could help me a lot.
But even if I do not consult a reference librarian, there is a lot of help with a correctly done search in a correctly made catalog.
I know that on the lists and in my podcasts I discuss library and cataloging history and I hope it doesn't put too many people off. I do so not out of a sense of nostalgia, but because I believe it is impossible at this point in time to understand our current catalogs and decide in which directions they should change without clearly understanding what they are, and that means knowing at least a bit of their history. And for better or worse, that means discussing catalogs that existed in other formats. Never forget that the records we make today could easily fit into a card catalog of 1870.
During the days of the card catalog where everything was in alphabetical order, I would search for “popular songs” by opening the card drawer as close to “Popular” as possible and eventually come to a card like the one I have placed in the transcript, which is from the Princeton University scanned card catalog. It says:
When imagining someone doing this in reality, it is essential always to keep in mind that I could not come to this card directly as the hyperlink allows. It would take me some time to find this card, because first, I wouldn't know it existed, plus I would be browsing from the beginning of the drawer of cards. In this case, I would have seen and browsed past the title “Popular history of British ferns”, the subject heading “POPULAR LITERATURE--FRANCE”, the title “Popular political economy” a cross-reference for the corporate body “Popular revolutionary American alliance” and so on. That is, I would see many records that have nothing at all to do with what I want—popular songs.
After this browsing, I would find the card that would tell me that I should look under “Music, Popular (Songs, etc.)” so I would walk over to the “Ms” where I would once again browse just as I did before, seeing even more materials that had nothing at all to do with what I wanted, and would eventually find a special arrangement of cards. There is a link to this arrangement in the transcript. http://bit.ly/XPy0ek. Unfortunately the scans go a bit crazy for a while but you can still see each card. Click on “Next Card” and go through just a few of them. The searcher discovers that this topic “Music, Popular (Songs, etc.)” has been subdivided into groups, such as “Addresses, essays, lectures”, “Bibliography”, “Dictionaries”, and as I continued to browse, I would discover that I could also find popular songs of different geographical areas. Quite a bit of help.
For those who used the printed books of the Library of Congress Subject Headings (those terrifying, big, fat, red books that I never understood before library school), I would again browse alphabetically, looking for “popular songs”. We can see how it worked from a copy of the relevant page found in Google Books. I added it to the transcript.
Under “popular songs” I see that I should look under “Popular music”. The historian can see that the heading has changed since the card catalog. In this example, “Popular music” is on the same page in the printed book, so we just go to the top of the page.
We discover that added to the topic “Popular music” is the not very highly readable (May Subd Geog) which, to those who know, means that this can be subdivided by geographical area. We also see related classification numbers, the UF, BT, NT and a scope note. Continuing on, we can see some subdivisions and find that “Popular music – Louisiana” has a Narrower Term of “Zydeco music”.
All of this can be very helpful to someone interested in popular songs, and in the absence of a reference librarian can help people focus their thoughts and perhaps lead them from the vague notion of “popular songs” to something tangible that interests them. In this case “Zydeco music”.
That's how it worked in the printed world. It would have taken a lot more time than I have taken to explain it. You may also have had to wait because someone was using the card drawers you needed. Searching the card catalog was just a pain. And yet, there were advantages.
Let's compare this to browsing entries for the subject “Popular music” in the online LC catalog. There's a link in the transcript. http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&Search_Arg=popular%20music&Search_Code=SUBJ_&CNT=100&hist=1. We are assuming we already know that the subject to browse is “Popular music”. What do we see?
We see many, many more subdivisions than those in the printed LC Subject headings. Each geographical subdivision displays, resulting in an overwhelming list, and this illustrates how the cryptic (May Subd Geog), although not very comprehensible, actually came in very handy to help someone understand how a topic is sub-arranged. There are also many more subdivisions in this list than we see in the printed LC subject headings, and these come from the list of free-floating subdivisions that can be used under any topical heading. I provide a link to an old version of that list http://www.itcompany.com/inforetriever/form_subdivisions_list.htm, where we can find “Bibliography”, “Bio-bibliography”, “Discography” and many others.
After browsing through ten screens comprising 100 subject headings each under “Popular music” or 1000 subject headings—I repeat: 1000 subject headings—I am only up to “Popular music—France—1901-1910”. It's hard to say how many screens of popular music there are, but I think it is safe to conclude that only the tiniest percentage of a populace used to looking only at the top three hits would last to the bitter end, or even half-way through to see the key Narrower Term reference from Popular music – Louisiana that leads them to “Zydeco music”. No one will do that today. Including me. I refuse.
There was a similar problem with card catalogs of course. Although I can't demonstrate it physically—people will have to just take my word for it—it was a lot easier to flip through the cards in a card catalog or page through the subject headings in a book catalog than plow through these web pages. But it was still a pain.
Once I do find “Zydeco music” in the computer catalog http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&Search_Arg=zydeco%20music&Search_Code=SUBJ_&CNT=100&hist=1 I find some other intriguing subjects, such as “Zydeco music—Finland” along with a related term “Cajun music”.
This simple example illustrates that the catalog is based on creating intellectual groupings, that is, sets, of similar items and presenting those sets to the searcher in different ways. There is no concern at all for anything resembling “relevance”. It isn't as if you would look at the 200 items you find listed under “Zydeco music” in the LC catalog and think “I don't see what I want under the first three records listed here so I'll try another search”. At least, I hope searchers do not do that today, although from their point of view if they did it would be fully logical. So people may do this—I don't know. Does anybody know? Somebody should.
The assumption with a library catalog should be: if the information about Zydeco music you want exists, it will definitely be within this grouping labeled “Zydeco music”.
Is that true?
Not necessarily. Why? For several reasons. One of the main ones: catalog records base themselves primarily on complete resources (technically speaking, 20% or more of an item), so within a specific collection there may be many materials with information about “Zydeco music” but not everything warrants a separate heading. In fact, there may be a lot of information about Zydeco music in the resources found under the broader term “Popular music—Louisiana”, maybe even “Popular music—Southern States”. It would not be stretching the imagination that there may also be significant information on Zydeco music under materials with the related term “Cajun music”. How can someone be aware of all of that?
Let's look. What happens in the library catalog if I browse the subject headings for “Zydeco music” and I go forward and backward? If I browse backward, I find the heading Zydeco dance--Study and teaching—Louisiana which is perhaps not too bad, but next comes a subject heading about the word “Jew” in Lithuanian.
If I browse forward, I find Zydeco musicians but then come some place names and corporate bodies in Poland. While Zydeco musicians and dancing may be all right for my purposes, those other topics are of absolutely no value to me. They are so far off that they can't even be labeled serendipity. Some have claimed that alphabetical arrangement is essentially no different than random arrangement—or at least a completely arbitrary arrangement—and this demonstrates why.
Obviously, what someone really needs, when looking at records of “Zydeco music” is to know that there may be more information on Zydeco music at least in the groupings “Popular music—Louisiana” and “Cajun music” if not maybe others.
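As a rough sketch of that idea, and not a description of how any existing catalog is built, the relationships mentioned in this example could be held as data and shown next to the set of records. The table `SUBJECT_LINKS` and the function `other_groupings` are hypothetical. The broader, narrower and related links are the ones named in this talk, except the reciprocal link from “Cajun music” back to “Zydeco music”, which is my assumption.

```python
# BT = broader term, NT = narrower term, RT = related term.
# Only the links mentioned in the talk are included; a real subject
# authority file holds far more headings and relationships.
SUBJECT_LINKS = {
    "Zydeco music": {
        "BT": ["Popular music--Louisiana"],
        "RT": ["Cajun music"],
    },
    "Popular music--Louisiana": {
        "NT": ["Zydeco music"],
    },
    "Cajun music": {
        "RT": ["Zydeco music"],  # assumed reciprocal link
    },
}


def other_groupings(heading, links=SUBJECT_LINKS):
    """Return (link type, heading) pairs worth checking besides `heading`."""
    entry = links.get(heading, {})
    suggestions = []
    for kind in ("BT", "NT", "RT"):
        for target in entry.get(kind, []):
            suggestions.append((kind, target))
    return suggestions


for kind, target in other_groupings("Zydeco music"):
    print(f"{kind}: {target}")
```

A searcher looking at the records under “Zydeco music” would then be told about the two other groupings directly, without browsing forward or backward through an alphabetical list.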
These relationships exist now but as we have just seen, utilizing these relationships is practically impossible since even if you know how to do it, as I do, you have to fight with the catalog. This is why I have stated repeatedly that the catalog is broken.
Why do we have to fight with it? Because the catalog we have today was designed to present everything in alphabetical order, the arrangement you find in a dictionary. This is why Charles Cutter titled his rules “Rules for a Dictionary Catalog”.
That is, a dictionary of the 19th century—not one of the 21st century. For someone using merriam-webster.com or dictionary.com or Wikipedia, all of those tools work completely differently from the dictionaries and encyclopedias in the world of Panizzi and Cutter or even that of only 20 years ago. If I go to merriam-webster.com, I just type in the word I want to know. It helps me even if my spelling is atrocious. I can completely misspell the word “chrysanthemum” http://www.merriam-webster.com/dictionary/krisanthenum and still find it.
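I do not know how merriam-webster.com handles misspellings internally, so the following is only a sketch of one common technique, approximate string matching, using Python's standard `difflib` module. The word list `HEADWORDS` and the function name `suggest` are hypothetical.

```python
import difflib

# Hypothetical, tiny word list standing in for a dictionary's headwords.
HEADWORDS = ["chromium", "chrysalis", "chrysanthemum", "christen", "crisis"]


def suggest(word, headwords=HEADWORDS):
    """Return the headword most similar to `word`, or None if nothing
    is similar enough (similarity cutoff of 0.6, difflib's default)."""
    matches = difflib.get_close_matches(word.lower(), headwords, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(suggest("krisanthenum"))
```

The lookup compares the typed string against every headword by similarity rather than by alphabetical position, so starting under “k” instead of “c” no longer matters, and the misspelling still leads to “chrysanthemum”.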
Try looking for this word in a printed dictionary if you have one and notice along the way how much you see that has nothing to do with chrysanthemums or flowers or even biology. If you don't have a printed dictionary, I have a link where you can look for “chrysanthemum” in a dictionary from 1823. http://books.google.it/books?id=jlZBAAAAcAAJ This link goes to the cover. Don't cheat and do a text search for the word but browse for it like you would in a physical volume! I don't suggest looking up “chrysanthemum” under “k” but if you want, I have a link to volume 2. http://books.google.it/books?id=qVZBAAAAcAAJ.
Therefore, when we read that the library catalog is a dictionary catalog, which it is, these printed dictionaries are what we should envision. That is because the people who designed our catalogs had those tools right before their eyes since everyone used them all the time. Those old catalogers added all sorts of aids to searching their catalog but those were all made for a physical dictionary catalog and those aids have become useless today. The reason they are useless is that browsing alphabetically, and seeing a huge number of materials that are completely irrelevant to our search have become very strange in the modern world. This is a fact whether we like it or not.
The methods I have briefly described clearly do not work in the current environment. They are never, ever coming back and they shouldn't because they genuinely are obsolete. But that is the way our catalogs work now, whether we like those methods or not. Nevertheless I think it is important to consider that just because the methods may be obsolete doesn't mean that everything is obsolete.
What do I mean? Let's consider some differences from the past. What is a heading? For catalogers today, it means the 1xx, 240, 4xx, 6xx, 7xx or 8xx that today contains controlled vocabulary and provides a link that searchers can click on so they can find related records. In the past, it was something much less vague. It was the part written at the top of a card that determined where that card sat in the card catalog. In the transcript I have an example of a card where I denote the heading in red, and often, subject headings were typed in red too.
In book catalogs, the heading was printed one time at the beginning of a group of records and for groupings that went on at some length the heading would be repeated at the top of the column or the top of the page. In the transcript I provide an example of headings in a printed book catalog and again denote the headings in red. We can see how Cicero's name is not repeated even though there are six items.
Catalogue of the Mercantile Library in New York. New York : E.O. Jenkins, 1844. http://books.google.it/books?id=_mtMx5Z8J28C p. 43:
I also have an example of subject headings with subdivisions in a catalog from 1869. We see the beginning of the topic “Moral science” which comes after “Moors in Spain” (dictionary catalog at work) and we see its subdivisions “General works – History” and “Systematic treatises”. There are other subdivisions that come later, such as Miscellaneous works, and all kinds of Special subjects, Anger, Avarice, and others.
Catalogue of the Library of Congress : index of subjects. Washington [D.C.] : GPO, 1869. volume 2, p. 1177. http://books.google.it/books?id=RbtSAAAAcAAJ.
The purpose of the heading as a designation for a group of records on the same topic or author, is very clear in a book catalog. The methods are obsolete, there is no argument about that. But exactly what do we see here that is so obsolete?
No one today is going to look for “Moral science” by starting at “M” and browsing past Metallurgy, Meteorology and Monograms. But that does not necessarily mean that the groupings themselves are obsolete, that is: the sets of records found under each heading.
I believe it is clear that people still want the materials we see grouped together, for instance the materials grouped under the topic Moral science – General works – History. People in 1869 wanted the resources we see grouped there and there is every reason to think that people want exactly those same resources today. Therefore, the grouping, or the set, is not obsolete. Of course it needs to be updated to include modern resources. If we assume that people still want this group today, the question becomes: how does someone want to find this group? Naturally, people would prefer more modern words in place of “Moral science”, such as Ethics, Deontology, Morality, or Morals.
Yet, how can people who want such a group find it if they don't know how it is named, here Moral science – General works – History or even that the group exists? Also, when people find this group, how can they find other groupings of materials that may be of interest to them? Can it be done?
Yes. We saw it in the card catalog. The example with Zydeco music shows how people really could do this in earlier catalogs—if they used those catalogs and other tools correctly. It wasn't easy back then, but it's almost impossible today.
The library catalog provides groupings of resources that have been selected by experts. The groupings and arrangements of the individual resources are based not on statistical relevance, but on the intellectual contents of the items. Naturally, this system has never been perfect but neither are Google or any of the other systems. The traditional way relies on the ups and downs of human frailties, and consequently has missed a lot, but I'll just go ahead and say it: it can't be all that much worse than believing that when I do a full-text search and get a million hits, that only the first three I see are worth considering and that they have risen to the top by magic. I don't believe it.
The reason I don't believe it is that I understand what statistical relevance is and I also understand how library catalogs are supposed to work. I know there must be more.
These are some of the reasons why I don't think RDA or FRBR are going to make any substantial difference in the ways the public uses our catalogs. Even with linked data, why should the public want to manipulate bibliographic data that has no meaning for them? Our catalogs will still be based on the principles of a 19th century dictionary—not even on a 21st century dictionary! The problem is not our records or even the information in them—it is the reliance on alphabetical order that has become obsolete in our new environment.
When I am looking at the set of records under “Ethics”, I want to know that there are many subtopics available for me such as “Cross-cultural studies” that I would never have imagined. I want to know there may be more information under “Philosophy” and “Values”, and that there are all kinds of narrower terms such as “Akrasia” http://lccn.loc.gov/sh2006003161 even though I may have never heard of Akrasia in my life!
So, is there a solution? I think there is and in the next podcast (yes, it continues) I want to discuss something called Information Architecture and how Information Architecture could help library catalogs and even libraries.
The music to end this episode is a little different from what I have chosen before. This time it won't be Italian music, but more in keeping with the spirit of this talk, here is Zydeco music from Louisiana. This is “What you gonna do?” performed by Buckwheat Zydeco and his fabulous group. http://www.youtube.com/watch?v=AQL1eT4crZw
That's it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.