Tuesday, March 29, 2011

RE:The next generation of discovery tools (new LJ article)

Posting to NGC4LIB

Jonathan Rochkind wrote: (concerning relevance ranking being a "crapshoot")
Well, it depends on what you mean. That's a dangerous statement, because it sounds like the kind of argument some hardcore old-school librarians used to make for not doing relevance ranking at all: why provide a feature that's "a crapshoot"? Just sort by year instead. I don't think it's true that relevance ranking provides no value, which is what "a crapshoot" implies.

Instead, relevance ranking, in actual testing (in general, not sure about library domain specifically), does _very well_ at putting the documents most users most of the time will find most valuable first. It does very well at providing an _order_. Thus the name "ranking".
In these economically very troubled times, I don't think there is much chance that we will do no relevance ranking at all--my concern is quite the opposite. Administrators are indeed *desperate* to save money wherever they can, and today computerization is seen as a way to save money because people are "so expensive" (a curious idea, by the way). If there is a danger, it is much more that the practice of cataloging will be tossed overboard, not the computerized relevance ranking.

Perhaps cataloging really will be thrown overboard--I don't know--and the millennia-old practice will be carried out automatically or semi-automatically, or by students and secretaries with only a few minutes or hours of training, following few standards, if any at all. I am sure that if there is a danger, it is to cataloging and not to any kind of computerized ranking. In any case, it should not happen without a full understanding of what we would be losing.

Let's discuss practice and consider whether Google-type relevance ranking really does "very well at putting the documents most users want". The only way to determine if this is true is to compare it with some kind of alternative. Do we have anything? How about the library catalog?

Let's take as an example that I want to do some research (not for publication, just an undergraduate paper) on air warfare in WWI. Doing this search in Google retrieves 65,400 results http://tinyurl.com/5tbouru (at least on my machine) and first gives me Wikipedia, something from Firstworldwar.com, Britannica, answers.com, life123.com, pages about games, and so on. In the "Wonder Wheel" I see synonyms for "air warfare wwi", except for naval warfare and surface warfare. Is this relevant to my search? (As an aside, the fact that Google's menus let people re-sort the results in several ways implies an admission that the single relevance ranking is not enough.)

To me, this is similar to my own experiences with very poor reference librarians: you ask for information on a topic, they run off into the stacks and come back with a book, often an encyclopedia, open to a chapter or article more or less on your topic. Then they leave you and return to their other work.

To decide whether the Google result is relevant, we need to compare it with the correct, expert search in a library catalog--one that, I admit, no regular person would ever do: the subject search "World War, 1914-1918--Aerial operations" http://tinyurl.com/6hkgrcq. There I see a result grouped by American, Australian, Austrian, Belgian, etc., i.e. concepts I would not have considered on my own.

Now, if we look at the very first record:
Main title: The achievements of the Zeppelins, by a Swede.
Published/Created: London, T. F. Unwin, ltd. [1916]
Description: 16 p. incl. pl. 15 cm.
Notes: "Reprinted from the Stockholms Dagblad of 19th March, 1916."
Subjects: World War, 1914-1918 --Aerial operations.
LC classification: D604 .A3
There is not a single word of the subject heading anywhere within the description of the item; therefore, without the subject heading, the person interested in air warfare would not have found this and would have had to come up with "Zeppelins" independently somehow. Full-text searching would not have helped either, since this publication is from 1916, and WWI was not called WWI until WWII broke out.

What I am trying to show is that the subject heading arrangement--when used as it is designed to be used--is an incredible time saver for searchers: very quickly they can get a nice overview of what is in the local collection and decide, e.g., "I am not interested in World War, 1914-1918--Aerial operations, Italian," and not have to look at any of those. This system is far from perfect, but there is real power in the traditional subject headings that *is not replicated* in relevance ranking--so long as the library catalogers do their jobs satisfactorily.
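
The grouped overview described above is mechanically simple once the headings exist. As a rough sketch (the record data below is invented for illustration, apart from the Zeppelin pamphlet; this is not any real catalog's API), grouping records under a base heading by their further subdivision:

```python
from collections import defaultdict

# Hypothetical records, each carrying a cataloger-assigned LCSH string.
records = [
    ("The achievements of the Zeppelins", "World War, 1914-1918--Aerial operations"),
    ("Over the front", "World War, 1914-1918--Aerial operations, American"),
    ("The flying kangaroo", "World War, 1914-1918--Aerial operations, Australian"),
    ("Storks over the lines", "World War, 1914-1918--Aerial operations, French"),
]

def group_by_subdivision(records, base="World War, 1914-1918--Aerial operations"):
    """Group records under a base heading by their further subdivision."""
    groups = defaultdict(list)
    for title, heading in records:
        if heading == base:
            groups["(general)"].append(title)
        elif heading.startswith(base + ", "):
            # Keep only the part after "base, " as the group label.
            groups[heading[len(base) + 2:]].append(title)
    return dict(groups)

for subdivision, titles in sorted(group_by_subdivision(records).items()):
    print(subdivision, "->", titles)
```

The searcher then scans a handful of subdivision labels instead of every record--which is exactly the "decide what NOT to look at" power described above.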

This traditional method was designed for printed catalogs, and I readily admit that it does not work in the world of the web, but the question naturally arises: could these clear sorts of result sets be repurposed to function on the web? Of course they could, if the powers-that-be decided to devote the resources. Yet I fear there is little chance that we will devote the resources to this task; instead we want to put our faith in "relevance" ranking, which is, I think, in reality a search for the "perfect algorithm", which I do not believe exists.

*Everybody*, from provider to searcher, has an interest in maintaining the idea that relevance ranking really does give us what is "relevant" (in the normal meaning of the term), and not the output of some incredibly complex mathematical algorithm whose rankings no human being could ever explain--results that can only be accepted at face value. Yet we must accept this, since the only other choice would be to look at all of the 65,400 hits, where the "relevance" really does trail down to .0001 sooner or later (and of course, we know this is only the tip of the iceberg of what is really on the web on this topic).

It would be nice if we could somehow get the two methods to work in tandem, because where the subject headings are strong, relevance ranking is weak, and where the subject headings are weak, relevance ranking is strong.

Something tells me that will be a very hard case to make, however, since the administrators will no doubt claim it is double work--though they can say it would be nice in a different economic environment, and so on and so on.


  1. Since you are responding to my comments, I feel an obligation to defend them.

    You're not comparing apples to apples. Google does not contain the same collection of things as your catalog (which is sometimes good, and sometimes not, depending on what the user is looking for).

    In your catalog, you are assuming the user correctly finds that subject heading -- how do they do so, and do you know how many users with such a need successfully do so? Those who do not successfully do so may find NOTHING, which is even worse than Google. If the user enters "air warfare wwi" in your catalog, what do they get? Hmm.

    Also of course, relevancy ranking is not mutually exclusive with subject cataloging. These are two different things. We can have relevance-ranked results on a search that includes hits on cataloger-assigned subject headings -- even including the 'see from' lead in terms.

    To get closer to comparing apples to apples, let's try your same search in my catalog with relevance ranking:


    Okay, that's not so great, no hits! But you know what, let's try the exact same search in a traditional headings-browse no-relevancy-ranking interface:


    Uh oh, that still didn't get them much of anything, and certainly didn't get them to the subject heading you suggest.

    So comparing not-very-useful results in Google to pretty much no results in BOTH of these library catalogs.... I'm not sure which is superior.

  2. Continued...

    But in fact, relevancy ranking and subject cataloging can work great together. One thing missing from my first new-catalog example up there is including lead-in ("see from") terms from LCSH in the keyword index. That's not there yet, but really should be -- if it were, maybe "wwi" and "air warfare" would have matched some lead-in terms?
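
    To make that idea concrete, here is a minimal sketch of folding "see from" terms into a keyword index so a query in the searcher's own vocabulary lands on the authorized heading. (The lead-in terms below are invented for illustration; real LCSH authority records are far richer.)

```python
# Hypothetical miniature authority file: authorized heading -> lead-in ("see from") terms.
authority = {
    "World War, 1914-1918--Aerial operations": ["WWI air warfare", "First World War aviation"],
    "Vietnam War, 1961-1975--Aerial operations, American": ["Vietnam War air combat"],
}

# Build a keyword index over the authorized headings AND their lead-in terms.
index = {}
for heading, lead_ins in authority.items():
    for phrase in [heading] + lead_ins:
        for word in phrase.lower().replace("--", " ").replace(",", " ").split():
            index.setdefault(word, set()).add(heading)

def search(query):
    """Return headings matching every query word, in any indexed form."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(search("wwi air warfare"))
# Matches via the lead-in term, even though the heading itself
# contains neither "WWI" nor "air warfare".
```

    The point of the sketch: once lead-in vocabulary is indexed, the searcher never has to guess the cataloger's terminology--the index does the translation.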

    Of course, you've knowingly picked one of the trickiest searches, which is fine if we admit it. (And that it's difficult for BOTH systems). We know that the now-common "WWI" has been called different things at different points in history, and is not called that in the LCSH. Neither is "air warfare" called that in the LCSH.

    If LCSH has lead-in terms with those common names, the relevancy ranking PLUS subject cataloging approach might work well. Let's test it out with less technically challenging vocabulary.


    (I'll admit that trying "vietnam war air warfare" instead didn't work, but this is a reasonable one).

    Hey, the first hit is relevant, and has the subject heading "Vietnam War, 1961-1975--Aerial operations, American" on it; the user can click to look at that set, now that it's discovered. Relevancy ranking and controlled vocabulary working well together.

    Hey, even without doing that, you can expand the "topic" limit on the left, and click to narrow down on different aspects of your result set -- including "Aerial operations, American" and "Vietnam War, 1961-1975", getting at basically the same thing as above in another way. Or you can expand "Topic, Region" or "Topic, Era" to slice your result set differently. Or "Language".

    Yes, those left hand limits rely on cataloger assigned controlled vocabulary. Relevancy ranking and cataloging controlled vocab working well together.
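
    Under the hood, those left-hand limits are just counts over the cataloger-assigned vocabulary in the current result set. A toy sketch (the result data is invented; any real discovery layer gets this from its search engine's facet support):

```python
from collections import Counter

# Hypothetical result set: each hit carries its assigned topic facet values.
results = [
    {"title": "Hit 1", "topics": ["Vietnam War, 1961-1975", "Aerial operations, American"]},
    {"title": "Hit 2", "topics": ["Vietnam War, 1961-1975", "Campaigns"]},
    {"title": "Hit 3", "topics": ["Aerial operations, American"]},
]

def facet_counts(results, field="topics"):
    """Count facet values across a result set, most common first."""
    counts = Counter(v for hit in results for v in hit[field])
    return counts.most_common()

for value, count in facet_counts(results):
    print(f"{value} ({count})")
```

    Clicking a facet value is then just re-running the search filtered to hits carrying that value--which is why the facets are only as good as the assigned vocabulary behind them.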

    In traditional alphabetic headings browse, using the same query you suggested above but with 'vietnam war' instead of 'wwi':


    Not so hot. If I scroll through dozens of pages, I might find a useful subject heading eventually. (Yes, that catalog is using a non-standard filing order for LCSH. I don't like it myself, but it's there because the reference librarians insisted that the STANDARD filing order for LCSH was leaving users unable to find what they wanted. I'm sure that is true, but if you switch it, in some cases you make things easier to find and in others harder. Anyway, that's another topic.)

  3. Thanks for your comments. I would like to make a few observations of my own, but I think we are substantially in agreement: best would be for the traditional subject headings to work in tandem with relevance ranking. For this to happen, however, the system of finding the subject headings must be improved, because, as I mentioned, no one (other than a cataloger!) would ever come up with terms like "Vietnam War, 1961-1975--Aerial operations, American". The only way a system of controlled subject headings can work in practice is through the syndetic structure, i.e. the UF (Used For), RT (Related Term), NT (Narrower Term), and BT (Broader Term) references. For example, "Vietnam War, 1961-1975" has cross-references that people may want. Or, if someone is interested in "Vietnam War, 1961-1975--Aerial operations, American", they may also be interested in "Vietnam War, 1961-1975--Campaigns", with a whole slew of references.
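
    The syndetic structure is, in effect, a small graph over headings. A minimal sketch of walking it to suggest related headings (the relationships shown are illustrative, not copied from the actual LCSH authority records):

```python
# Hypothetical authority records: each heading carries its syndetic references.
# BT = Broader Term, NT = Narrower Term, RT = Related Term, UF = Used For.
authorities = {
    "Vietnam War, 1961-1975": {
        "NT": ["Vietnam War, 1961-1975--Aerial operations, American",
               "Vietnam War, 1961-1975--Campaigns"],
        "UF": ["Vietnamese Conflict, 1961-1975"],
    },
    "Vietnam War, 1961-1975--Aerial operations, American": {
        "BT": ["Vietnam War, 1961-1975"],
        "RT": ["Vietnam War, 1961-1975--Campaigns"],
    },
    "Vietnam War, 1961-1975--Campaigns": {
        "BT": ["Vietnam War, 1961-1975"],
    },
}

def suggestions(heading):
    """Gather the headings a searcher might also want: broader, narrower, related."""
    record = authorities.get(heading, {})
    return {rel: record[rel] for rel in ("BT", "NT", "RT") if record.get(rel)}

print(suggestions("Vietnam War, 1961-1975--Aerial operations, American"))
```

    This is the conceptual navigation that no amount of text-based relevance ranking supplies on its own: the cross-references exist because a cataloger asserted them, not because the words happen to co-occur.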

    This allows a power that you cannot get from full-text searching and relevance ranking: the one works with algorithms over text, while the other works with concepts. I have discovered that people find it very strange to work with a truly conceptual model, and this, I think, is one of the main difficulties they have in searching the catalog. Yet, once they understand it even a little bit, most like it a lot.

    The current structure of the subjects is based on the functionality of the printed book catalog and the card catalog, not on keyword searching, which is how people search today. So I admit it is dysfunctional and must be improved. I've discussed this in more depth here.

    But ultimately, my concern is that administrators will look at relevance ranking and subject headings as double work and a "luxury" that can no longer be continued in our difficult economic environment. One will have to go, and I think we all know which one it will be: the one that almost nobody understands.