Monday, April 26, 2010

RE:... After MARC...MODS?

Posting to NGC4LIB

On Thu, 22 Apr 2010 21:53:44 +1000, Alexander Johannesen wrote:

>On Thu, Apr 22, 2010 at 21:18, Weinheimer Jim wrote:
>> How about this? Pretend that you are interested in the history of black people in agriculture in the United States. Tell me how you would go about searching and retrieving information in Google or Google Scholar or a related tool using full text.
>Well, I pop "black people in agriculture in the United States" and get back 1.3 million hits of which, one can assume, there is valuable information. I go through them, and copy and paste into my research document anything that smacks of gold.

Sorry I didn't respond to this earlier since I only saw it now. Your results and reactions are extremely interesting.

You say that you get back "...1.3 million hits of which, one can assume, there is valuable information. I go through them, and copy and paste into my research document anything that smacks of gold." What??? You go through 1.3 million hits???? You are a really fast reader! Sorry, but that one I will never believe. And this is one of the main problems my students find when they use a tool such as Google in practice: the result is completely out of control. When they get the same types of results for simple purposes relating to their own general interest, they don't care so much about the search result, since it can be fun to "surf." But when they are grappling with something important such as writing a paper, where they have to stick to a specific topic and could flunk out if they quote something stupid, they see Google as something much less useful and more similar to a toy. And it frightens them.

They feel there may be something there in the search result of 1.3 million (and I add: or not there, see below), but the results are in a completely unpredictable order that changes constantly, based on the number of links to an item (and thus, place #1 is determined primarily by bloggers), plus a number of other factors that determine ranking, which are business secrets of Google. It has been shown without a doubt that this order can be manipulated for all sorts of purposes (for obvious examples, see Google-Bombing in Wikipedia, but this is being done constantly in far more subtle ways). As I tell my students, Google's use of the term "relevance" does not at all equal their own understanding of the term, and they should not confuse the two. Google's use is a secretive business term, but one chosen strategically to make their customers more comfortable. It works.

What exactly are you looking at when you see the results from "black people in agriculture in the United States" and also, what are you not looking at? Well, you miss many original documents, because "black" was not the word used for African-American people in agriculture in the early United States. There were other terms used, some highly insulting today. When a cataloger puts in metadata, it's a completely different matter: the older terms are brought together under a single heading. In a library catalog, you don't have to search these older terms, but in full-text you do, or they will never come up in the result--and you will never realize it. As a result, you miss entire categories of really useful information.

Other problems: "agriculture" is unnecessarily limiting. You would also have to search at least "farming" but probably others as well. Searching "United States" will miss most of the information in the individual states, where there will be lots of possibly the most interesting resources.
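To make the point concrete, here is a minimal sketch in Python of the difference between matching the words in a text and matching a heading assigned by a cataloger. Everything in it is an assumption made for illustration: the four records, their wording, and the function names are invented, and the heading "African American farmers" simply stands in for whatever a real catalog would use. It is not how Google or any actual catalog is implemented.

```python
import re

# Hypothetical toy records, invented for illustration only.
# "text" stands in for the full text; "subjects" for cataloger-assigned headings.
RECORDS = [
    {"id": 1,
     "text": "Black people in agriculture in the United States: a survey",
     "subjects": {"African American farmers"}},
    {"id": 2,
     "text": "Freedmen and cotton farming in Georgia after 1865",
     "subjects": {"African American farmers"}},
    {"id": 3,
     "text": "Sharecropping among freed families in Mississippi",
     "subjects": {"African American farmers"}},
    {"id": 4,
     "text": "Agriculture in the United States: wheat yields",
     "subjects": {"Wheat"}},
]


def tokens(s):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def full_text_search(query, records=RECORDS):
    """Return ids of records whose text contains every word of the query."""
    wanted = set(tokens(query))
    return [r["id"] for r in records if wanted <= set(tokens(r["text"]))]


def subject_search(heading, records=RECORDS):
    """Return ids of records that carry the given assigned heading."""
    return [r["id"] for r in records if heading in r["subjects"]]


if __name__ == "__main__":
    print(full_text_search("black people in agriculture in the United States"))
    print(subject_search("African American farmers"))
```

In this toy data the full-text query finds only the record that happens to use the searcher's own words, while the heading brings back the records on farming in Georgia and Mississippi as well, even though none of the query words appear in them. The searcher who sees only the first result has no way to know the other two exist.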

I won't discuss "quality of information" here, which is another huge problem that people have to face every day. You say, "anything that smacks of gold" but how am I supposed to know that? Also, I won't discuss exactly what Google is and is not searching when you do a search, because this is another of their closely-held secrets.

So we see that what at first glance appears to be extremely simple, typing a few words into a box and getting a result, is incredibly complex and terribly limiting. It takes an expert to understand how limiting it is. Google has done an excellent job of making it seem simple, and they have done this by designing a tool to make people happy, but we should not confuse this with providing results that are reliable and comprehensible, which is what people really want. And it has serious consequences, as students will tell you.

I would suggest that when people see matters in this way, they will see the immensity of such a task, and that they will have a bit more respect for the work done in catalogs, which smooths the way for people. But of course, "we won't get no respect!"

Perhaps this is too detailed for your purposes, but it certainly is not too detailed for the students I work with, who are being serious about it and, as I say, terribly worried about it since not dealing with it could derail their entire careers.

Library catalogs are designed on different principles and have strengths in exactly these areas, and this is why I think the best course would be to create a tool that brings the strengths of library catalogs together with full-text retrieval tools. But simply ignoring what our tools can do would be the same as allowing superstition and bias and even censorship to run rampant.
