Re: [RDA-L] Access to the knowledge of cataloging

Posting to RDA-L

On 12/6/2013 11:12 PM, Kevin M Randall wrote:

<snip>
James Weinheimer wrote:

To be fair, the original version of FRBR came out before (or at least not long afterward) the huge abandonment by the public of our OPACs. Google had barely even begun to exist when FRBR appeared. Still, there could have been a chapter on the newest developments back then. But even today, nowhere in it is there the slightest mention of “keyword” or “relevance ranking” much less anything about Web2.0 or the semantic web or linked data or full-text or Lucene indexing (like what we see in the Worldcat displays). It’s as if those things never happened.

There’s no mention of that stuff because it is *irrelevant* to what FRBR is about. It has absolutely nothing to do with what technologies or techniques are being used to access the data. It’s about the *data itself* that are objects of those keyword searches, or relevance ranking, or Lucene indexing, or whatever other as-yet-undeveloped means of discovery there may be. How many times does this have to be said?
</snip>

There is one point where we can agree: it is irrelevant. And that is precisely why FRBR is also irrelevant to how the vast majority of the public searches every single day. It is also irrelevant to implementing the user tasks, since those can be done today. FRBR is irrelevant for linked data. Also (apparently) irrelevant is how much it will cost to change to FRBR structures.

But I must disagree that FRBR is about the data itself. We have gobs of data now, and it is already deeply structured. FRBR does not change any of that. There will still be the same data and it will still be as deeply structured. FRBR instead offers an alternative data model that is designed for relational databases. We currently have another model where all the bibliographic information is put into a single “manifestation” record and holdings information goes into another record. FRBR proposes to take out data that is now in the “manifestation” record and put certain parts of it into a “work” instance, while other data will go into an “expression” instance.

So why did they want to do that? Designers of relational databases want to make their databases as efficient as they can, and one way to do that is by eliminating as much duplication as possible. This is what FRBR proposes. It is clearest to show this with an example: Currently if we have a non-fiction book with multiple manifestations and this book has three subject headings, the subjects will be repeated in each manifestation record. With FRBR, the subjects will all go into the work instance, and as a result, each manifestation does not need separate subjects because the manifestation will reference the work instance and get the subjects in that way.

What is the advantage? A few. First, the size of the database is reduced (very important with relational databases!), plus if you want to change something, such as add a new subject, you would add that subject only once into the work instance and that extra subject would automatically be referenced in all the manifestations. The same goes for deleting subjects or adding or deleting creators. Nevertheless, the data itself remains unchanged and there is not even any additional access with the FRBR data model. It simply posits an alternative data model and one that I agree would be far more efficient in a relational database.
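To make the subject-heading example concrete, here is a minimal sketch in Python. The class names, fields, titles and subject strings are all invented for illustration; this is not a real FRBR schema or any catalog system's actual structure. It only shows the idea of storing subjects once on the work and having each manifestation reference them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    # Hypothetical "work" instance: holds the data shared by all manifestations.
    title: str
    subjects: List[str] = field(default_factory=list)


@dataclass
class Manifestation:
    # Hypothetical "manifestation" instance: keeps only its own details
    # plus a reference to the work (not a copy of the work's data).
    work: Work
    publisher: str
    year: int

    @property
    def subjects(self) -> List[str]:
        # Subjects are read through the work instead of being stored here.
        return self.work.subjects


work = Work(
    "A History of Printing",
    ["Printing--History", "Books--History", "Typography"],
)
editions = [
    Manifestation(work, "Publisher A", 1990),
    Manifestation(work, "Publisher B", 2005),
]

# Add a subject once, on the work; every manifestation now shows it.
work.subjects.append("Publishers and publishing")

for edition in editions:
    print(edition.year, edition.subjects)
```

Note that the subjects themselves are the same strings as before. Only where they are stored has changed.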

But as I have been at pains to point out, something that may at first seem rather benign, such as introducing a new data model, has many serious consequences that should be considered before adopting such a model. Something that makes the database designers happy may be a monster for everyone who uses it: both the people who input into the database and the people who search it. But the designers remain happy. This is what I say we are looking at now with FRBR.

Strangely enough, we have different technology today: Lucene-type indexing, such as we see in Google and in Worldcat with its facets, where everything is flattened out into different indexes, since this is how the indexing works. (The best explanation I have found so far is at http://www.slideshare.net/mcjenkins/the-search-engine-index-presentation but it also becomes pretty dense pretty quickly.) Essentially, what Lucene does is make an index (much like the index at the back of a book) out of the documents it finds. It indexes text by word, by phrase, and in other ways as well. It also adds links to each document where the index term has been used and ranks each term using various methods.

The advantage is that when you do a search, it does not have to scan through the entire database (like a relational database does). It just looks up your terms in its index, collates them together and presents the searcher with the result, and it does this blazingly fast, as anybody can see when they search Google. The Google index is over 100,000,000 gigabytes! http://www.google.it/intl/en/insidesearch/howsearchworks/crawling-indexing.html
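The back-of-the-book idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Lucene is actually implemented: the function names and sample documents are invented, it indexes single lowercase words only, and it leaves out phrases, facets and ranking entirely.

```python
from collections import defaultdict


def build_index(docs):
    # Map each word to the set of document ids that contain it,
    # much like a back-of-the-book index maps terms to page numbers.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    # Look each query term up in the index and keep only the documents
    # that contain all of them. The documents themselves are never scanned.
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample "documents" keyed by id.
docs = {
    1: "history of printing",
    2: "history of libraries",
    3: "printing and typography",
}
index = build_index(docs)

print(search(index, "history printing"))
```

The work is done once, when the index is built. A search is then just a handful of lookups and a set intersection, however many documents were indexed.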

If we want to discuss the usefulness of our catalog records as data, that is indeed a very interesting topic and I have discussed that in my podcast: “Cataloging Matters No. 17: Catalog Records as Data”
http://blog.jweinheimer.net/2013/01/cataloging-matters-no-17-catalog-records-as-data.html
