Friday, October 12, 2012

Re: [ACAT] New Bibliographic Framework: Update with Eric Miller

Posting to Autocat

On 10/10/2012 18:44, Humpal, Nathan wrote:
<snip>
James Weinheimer wrote: "So far as I am concerned, Lucene indexing is perhaps the greatest revolution in the catalog since computers were invented. Or at least it should be considered that way. I can't believe that the cataloging world simply ignores it! Unbelievable." Could you expand on this a bit? It's entirely possible that I don't fully realize the implications of Lucene indexing, or what you're implying by its impact in the catalog, but to me its flattened out full text searching seems to de-emphasize the potential strengths of a catalog.
</snip>

Lucene-type indexing does a lot, including ranking and sorting, and it can work with any text-based format such as PDF, HTML, XML and so on. It has "fuzzy" matching, which means that it can allow for typos and word variants (if you want), plus lots more. It is incredibly fast and also allows for faceting. This means it extracts information from the records retrieved in a search, and that information can be used to limit the search further. (You can control what information is extracted.) The facets can also be thought of as all kinds of different "filters" that can be applied successively.

As an example, I can search "au:homer ti:iliad" in Worldcat http://www.worldcat.org/search?q=au%3Ahomer+ti%3Ailiad, then click on "Videos" (filtering it), then click on "Brad Pitt" (filtering it more) to discover the movie "Troy". I didn't have to know that the movie existed. It is a very simple and easy way to achieve something incredibly complex. How could FRBR structures improve on it? I honestly can't imagine.
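To make the idea of successive "filters" concrete, here is a minimal sketch in Python. It is not how Lucene or Worldcat actually work internally; the records, field names and helper functions are all made up for illustration. It only shows the principle: facet values are counted from the records a search retrieves, and each click narrows the result set further.

```python
from collections import Counter

# Hypothetical, hand-made records (not real catalog data).
RECORDS = [
    {"titles": ["The Iliad"], "names": ["Homer"], "format": "Book"},
    {"titles": ["Iliad (sound recording)"], "names": ["Homer"], "format": "Audio"},
    {"titles": ["Troy", "Iliad"], "names": ["Homer", "Pitt, Brad"], "format": "Video"},
    {"titles": ["The Iliad on stage"], "names": ["Homer"], "format": "Video"},
    {"titles": ["The Odyssey"], "names": ["Homer"], "format": "Book"},
]


def as_list(value):
    """Treat single values and lists of values the same way."""
    return value if isinstance(value, list) else [value]


def search(records, name, title_word):
    """Records where some name contains `name` and some title contains `title_word`."""
    name, title_word = name.lower(), title_word.lower()
    return [
        r for r in records
        if any(name in n.lower() for n in r["names"])
        and any(title_word in t.lower() for t in r["titles"])
    ]


def facets(records, fields=("format", "names")):
    """Count every value of the chosen fields across the retrieved records."""
    counts = {field: Counter() for field in fields}
    for r in records:
        for field in fields:
            counts[field].update(as_list(r[field]))
    return counts


def apply_filter(records, field, value):
    """Keep only the records that carry `value` in `field` (one facet click)."""
    return [r for r in records if value in as_list(r[field])]


hits = search(RECORDS, "homer", "iliad")            # like au:homer ti:iliad
videos = apply_filter(hits, "format", "Video")      # click on "Videos"
with_pitt = apply_filter(videos, "names", "Pitt, Brad")  # click on "Brad Pitt"
print(dict(facets(hits)["format"]))
print([r["titles"][0] for r in with_pitt])
```

The searcher never typed "Troy"; the title simply falls out after two filter clicks, because the facet values were extracted from the retrieved records themselves.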

The Koha catalog has this capability using Zebra (not Lucene itself but a comparable open-source indexing engine, and you can download it for free!), and it is also seen in Worldcat, although I don't know what indexing engine it uses. The best implementation I have seen, however, is in an Australian project at http://ll01.nla.gov.au/index.jsp. It used to be lightning fast but seems to have gotten a little slower.

In the Australian catalog I searched for "fascism italy" and retrieved not only the usual listing of records but also some suggested related searches, plus the facets that you can see in the right-hand column: years of publication, subjects--in full form (very nice!), genres, forms, authors, classifications, blah, blah, blah. Cutter and Panizzi could never, ever have imagined such a display extracted from the information inside the individual records. I wonder what they would have done with these kinds of powers?

In the case of WEMI, as I demonstrated, navigation is easy once the searcher has the correct uniform title (and the catalogers have done their jobs correctly, of course). Therefore, the actual task would be to get people to the correct name/title heading. This would mean that when someone does a keyword search, the system would somehow also search the authority file and return the correct heading. For example, someone would search "dostoevsky crime and punishment" and would retrieve "Dostoyevsky, Fyodor, 1821-1881. Prestuplenie i nakazanie" (how that would display to the searcher, I don't know, but the search would be correct). They would click on that heading to get the correct result, which they could then filter: http://www.worldcat.org/search?q=au%3A%22Dostoyevsky%2C+Fyodor%22+ti%3A%22Prestuplenie+i+nakazanie%22&qt=results_page
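A rough sketch of what "search also the authority file" could mean, again in Python and again purely illustrative: the authority entries, the variant forms and the function names are my own assumptions, not any real system's data or API. Each authorized heading carries its variant forms, and a keyword query resolves to a heading when every query word is found among the heading and its variants.

```python
import re
import unicodedata

# Hypothetical authority entries: an authorized heading plus variant forms.
AUTHORITY = [
    {
        "heading": "Dostoyevsky, Fyodor, 1821-1881. Prestuplenie i nakazanie",
        "variants": ["Dostoevsky", "Dostoevskii", "Crime and punishment"],
    },
    {
        "heading": "Homer. Iliad",
        "variants": ["Homerus", "Ilias", "Iliade"],
    },
]


def tokens(text):
    """Lowercase, strip accents, and split into a set of word tokens."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return set(re.findall(r"[a-z0-9]+", plain.lower()))


def build_lookup(authority):
    """Pair each heading with all the words of the heading and its variants."""
    lookup = []
    for entry in authority:
        vocabulary = tokens(entry["heading"])
        for variant in entry["variants"]:
            vocabulary |= tokens(variant)
        lookup.append((entry["heading"], vocabulary))
    return lookup


def resolve(query, lookup):
    """Headings whose vocabulary covers every word of the keyword query."""
    wanted = tokens(query)
    if not wanted:
        return []
    return [heading for heading, vocabulary in lookup if wanted <= vocabulary]


LOOKUP = build_lookup(AUTHORITY)
print(resolve("dostoevsky crime and punishment", LOOKUP))
```

The searcher types the form they know, and the system hands back the authorized name/title heading that the catalog records actually use. How to display that to the public is a separate question.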

Therefore, it is obvious that the user interfaces need a lot of improvement, but a lot can be done. Almost anything. (Again, to toot my own horn, my paper at ALA discussed precisely this: http://blog.jweinheimer.net/2012/06/reality-check-what-is-it-that-public.html)

Finally, Lucene is an index and not a database. If it is meant to index a database, the relational database remains the place where the records are created and maintained. The information is then exported from the database (normally as a flat XML file) and the index is generated from the XML. In this way, people search the indexes, not the DBMS.
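The database-to-XML-to-index workflow can be sketched in a few lines of Python. This is only a toy under stated assumptions: an in-memory SQLite table stands in for the relational database, the table layout and XML element names are invented, and a simple word-to-record-ids dictionary stands in for a real Lucene index.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

# 1. The relational database: where records are created and maintained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
db.executemany(
    "INSERT INTO records (id, title, author) VALUES (?, ?, ?)",
    [
        (1, "Iliad", "Homer"),
        (2, "Odyssey", "Homer"),
        (3, "Crime and Punishment", "Dostoyevsky, Fyodor"),
    ],
)


def export_xml(conn):
    """2. Export every record from the database as one flat XML document."""
    root = ET.Element("records")
    rows = conn.execute("SELECT id, title, author FROM records ORDER BY id")
    for rec_id, title, author in rows:
        rec = ET.SubElement(root, "record", id=str(rec_id))
        ET.SubElement(rec, "title").text = title
        ET.SubElement(rec, "author").text = author
    return ET.tostring(root, encoding="unicode")


def build_index(xml_text):
    """3. Generate the index from the XML alone (word -> set of record ids)."""
    index = defaultdict(set)
    for rec in ET.fromstring(xml_text).iter("record"):
        rec_id = int(rec.get("id"))
        for field in rec:
            for word in (field.text or "").lower().replace(",", " ").split():
                index[word].add(rec_id)
    return index


def search(index, query):
    """Searchers query the index, never the database: AND of all query words."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return sorted(set.intersection(*word_sets)) if word_sets else []


index = build_index(export_xml(db))
print(search(index, "homer"))
print(search(index, "dostoyevsky punishment"))
```

Note that `search` never touches `db`. Once the export is done, the database can be offline for maintenance and the public still searches happily; the price is that the index has to be regenerated (or updated) when records change.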
