Re: Completeness of records

Posting to RDA-L

On 09/08/2011 15:23, Brenndorfer, Thomas wrote:

<snip>
Perhaps the problem stems from the words you use, such as “allow WEMI”. FRBR is based on an entity-relationship analysis tool that first appeared in about the 1970s, not the 19th century. Catalogs don’t “allow” entities– the FRBR modelling exercise shows what entities have been the basis behind the conventions and mechanisms in traditional catalogs. Part II of AACR2 is very heavy on the concept of “work”, but related data about the work is scattered all over the place in Part I of AACR2. FRBR says here’s what we have always intended– let’s present it in a way that can be sufficiently abstracted so we can do it differently, do it better, do it in a machine-friendly way, do it in a way that is consistent with other, more modern technical standards, do it in a way that can be extended and modified, and, to boot, do it in a way that is compatible with the existing record structure.
</snip>

Well, I have studied databases as well and created a number of them. I personally don’t care one whit whether we do something that is friendly to machines. They can scream for all I care, so long as the job is done. I would much rather do something in a librarian- or cataloger-friendly way and let the machines do more work. This includes being able to achieve some notable successes now, not putting our faith in vague promises of the future, and using the machines to their fullest potential, whether it happens to be friendly or not.

<snip>
So, no, it is not moot. I ran into the FRBR issue in the late 1990s when customizing my first web-based version of the catalog. I couldn’t do things, not because of the limits of the technology (of which there were and still are many), but because of the limitations in the underlying data structure in traditional AACR/MARC records. It was one of those “the emperor has no clothes” moments. FRBR made more sense than the traditional catalog, because it was written in the modern language of databases. And in studying database design in courses, it was quite embarrassing to compare the comfort level students had with concepts such as primary keys and relationships between tables. In describing the traditional catalog– “well, we have relationship designators, but they mess up our displays, so we have traditionally made decisions since the 19th century, not based upon efficient database design, but on the vagaries of a medley of display conventions and encoding conventions that are overly contingent and conditional on extraneous factors.”
</snip>

Relational databases are not the only choice today, and we must keep our options open and set aside worries that “this is not the way it is supposed to be done”. Designing an RDBMS involves a certain sense of what I call “computer aesthetics”, which I believe should be irrelevant. There is now the very efficient and powerful option of Lucene search engine indexing (with its variants), which forgoes a relational database altogether and indexes flat files. I have read quite a bit on it, although much of it is highly technical and beyond my capabilities, but “the proof is in the pudding”. I have already mentioned that this must be how WorldCat is indexed. But here is an even better implementation (I believe) by our Australian colleagues: http://ll01.nla.gov.au/index.jsp. (They have some links to papers; one is broken, although I found it here: http://www.nla.gov.au/openpublish/index.php/nlasp/article/viewArticle/1047)

This works on a database of 16 million records. It is very fast and provides the extracted headings for further refinements, a major step forward in catalog technology. Even doing a ridiculous search for “of the a” and limiting it to online resources took less than 20 seconds! (It’s so fast, you don’t need a stopword list.) It’s important to note that Lucene indexing does not use a relational database at all! This Australian project says that it also uses Lucene to store the data, although from what I have read, a melding of the two is best: an RDBMS for storage and maintenance, and the full-text search engine for the public, just as Koha is designed. As an added bonus, Lucene-type technologies are open source!
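To make the idea concrete, here is a minimal sketch in Python of what indexing flat records and extracting headings for refinement can look like. This is not Lucene and not the NLA's implementation; the record fields, the sample data and the function names (`build_index`, `search`) are all assumptions made up for illustration. It only shows the principle: a word-to-record index built straight from flat records, with no tables or joins, plus a count of subject headings in the result set.

```python
from collections import Counter, defaultdict

# Hypothetical flat records; the field names are illustrative, not MARC.
RECORDS = [
    {"id": 1, "title": "The history of the book",
     "subject": "Books -- History", "format": "print"},
    {"id": 2, "title": "A history of libraries",
     "subject": "Libraries -- History", "format": "online"},
    {"id": 3, "title": "Cataloging the web",
     "subject": "Cataloging", "format": "online"},
]


def build_index(records):
    """Map every title word to the set of record ids containing it."""
    index = defaultdict(set)
    for rec in records:
        for word in rec["title"].lower().split():
            index[word].add(rec["id"])
    return index


def search(index, records, query, **filters):
    """Return records matching ALL query words, plus subject-heading counts.

    Keyword filters (e.g. format="online") narrow the result set, and the
    counts play the role of the extracted headings offered for refinement.
    """
    by_id = {rec["id"]: rec for rec in records}
    terms = query.lower().split()
    if not terms:
        return [], Counter()
    # Intersect the id sets of every query word (no stopword list needed).
    ids = set.intersection(*(index.get(term, set()) for term in terms))
    hits = [
        by_id[i] for i in sorted(ids)
        if all(by_id[i].get(key) == value for key, value in filters.items())
    ]
    facets = Counter(rec["subject"] for rec in hits)
    return hits, facets


index = build_index(RECORDS)
hits, facets = search(index, RECORDS, "history of", format="online")
for rec in hits:
    print(rec["id"], rec["title"])
print(dict(facets))
```

A real search engine adds relevance ranking, stemming and far more compact index structures, but the flat-record approach is the same in outline.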

Relevance ranking is a part of these technologies, and this is why (I think) Eric Hellman, in his talk “Library Data, why bother?” http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FipVVoH&h=4AQDVSolC, says that tweaking relevance ranking (i.e. search engine optimization) and adding microdata are more important for libraries than anything else. This was discussed in Autocat (I disagreed in part), and Eric Hellman got involved too. It was a very enlightening exchange of ideas.

A lot of this reminds me of my research into the “library catalog of the future” built under Ernest Richardson when he was at Princeton University back in the 1920s. It was built using the latest technology at the time (linotype slugs). I finally found the scans! They are in the Internet Archive at: http://www.archive.org/search.php?query=alphabetical%20finding%20list%20princeton and http://www.archive.org/search.php?query=%22classed%20list%22%20princeton (although it’s incomplete, those who are interested can see how it worked). One constraint was that each record could only be the length of a single linotype slug (around 180 characters, I think), so everything *HAD, HAD, HAD* to fit into that. And the catalogers did as they were told and shoehorned whatever they could into those 180 characters. Plus, he found some professors to publish research “proving” that any information resource could be described “adequately” within 180 characters (Ha!).

Anyway, the technology turned out to be not so great and his catalog of the future went down the tubes pretty fast. Physically rearranging all of the linotype slugs for the classed list and the author list was a tremendous amount of work, and there were other problems. This was actually fortunate, because Richardson went on to create that magnificent national treasure, the National Union Catalog, which displays many of his ideas on cataloging (but that is a different matter). If you just glance at one of the volumes of Princeton’s catalog (links above), however, you will pretty quickly see that his idea would be very possible today, since he was actually making simple Excel-type spreadsheets, which could now be sorted in an instant. Doing the same thing with linotype slugs and human power didn’t work out too well, though. Richardson was quite a futuristic thinker.
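As a rough illustration of why this is trivial today, the sketch below treats each entry as a spreadsheet row and produces both an author list and a classed list from the same data simply by sorting on different columns. The entries, the field names and the one-line layout are invented for the example and are not taken from the Princeton volumes; the 180-character limit is the approximate figure mentioned above.

```python
MAX_LEN = 180  # approximate slug length mentioned above; an assumption

# Hypothetical entries, one row per title.
ENTRIES = [
    {"author": "Smith, B.", "title": "Birds", "class_mark": "QL"},
    {"author": "Adams, A.", "title": "Zoology", "class_mark": "QL"},
    {"author": "Jones, C.", "title": "Algebra", "class_mark": "QA"},
]


def to_line(entry):
    """Render one entry as a single line, cut off at MAX_LEN characters."""
    line = f"{entry['author']} {entry['title']}. {entry['class_mark']}"
    return line[:MAX_LEN]


def finding_list(entries):
    """Author list: sort by author, then title."""
    ordered = sorted(
        entries, key=lambda e: (e["author"].lower(), e["title"].lower())
    )
    return [to_line(e) for e in ordered]


def classed_list(entries):
    """Classed list: sort by class mark, then author."""
    ordered = sorted(
        entries, key=lambda e: (e["class_mark"], e["author"].lower())
    )
    return [to_line(e) for e in ordered]


print("\n".join(finding_list(ENTRIES)))
print()
print("\n".join(classed_list(ENTRIES)))
```

What took physical rearrangement of slugs is here a change of sort key, and the length limit becomes an arbitrary setting rather than a property of the tool.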

My point in all this is: “When your only tool is a hammer, everything looks like a nail.” Linotype slugs were not the only tool to be used; an RDBMS is not either. So we shouldn’t make our data *fit* into the tool and then go on to explain why this is the way it must be done (as Richardson and his researchers did with the 180 characters), but instead fashion tools to fit our data. Lucene allows this.
