Wednesday, June 30, 2010

Problems With Selection in Today's Information World

Posting to NGC4LIB

On Mon, 28 Jun 2010 07:59:10 -0500, Mitchell, Michael wrote:

We use "Choice" and similar publications for suggestions of quality Web sites to include in our catalog. I have no trouble with the limited number of resulting catalog entries since our catalog is not a Web search engine and I don't think our students expect it to be. I've added 5-600 Web sites in the past year or so. These sites are good resources that happen to be on the Web (and free).

But what do students expect to find in the catalog? The local books plus a small number of websites, while they still must use Google anyway? I am not finding any fault with this--it is happening in every library, including my own--but it blurs the difference between the "Web search engine" and the library catalog. For example, there are wonderful tools such as Intute and Infomine, and they should not be ignored, plus there are lots of specialist sites. But how do you use these sites?

As a specific example, let's say that the selector decides that all of the resources selected for the "Humanities" section in Intute should be added to your catalog. I don't know how many there are, but there are a *lot* and for some libraries, it could easily be a significant proportion of the yearly catalog production.

Every one of the sites in Intute has been selected by a librarian and/or expert. Does it make sense to recatalog all of these resources one by one and then go through the hassle of maintaining all of the records whenever something changes? And remember, Intute is only one project among many, many more, and while these projects have metadata, they do not do MARC21/AACR2/LCSH/LCC. Actually, a site such as Intute provides real quality selection and can be pretty well trusted, while a more difficult site to work with would be, e.g., the Internet Archive, which has scads of wonderful resources but nowhere near the same quality of "selection". The old methods and workflows aim at creating new records in the local database (although when you are lucky you might find usable copy). This makes a certain amount of sense when dealing with unchanging physical resources located within the local library, but these same methods result in endless, and essentially useless, duplication when used for the so-called "remote accessed electronic resources".

And of course, this includes the duplication of selection.

As a selector, I would not want to burden my cataloging department by cataloging materials that are already in Infomine and Intute, plus it could take a very long time to get them done. Is there a better and more efficient way?

Copernicus, Cataloging, and the Chairs on the Titanic, Part 1 [Long Post]

Posting to NGC4LIB

Stephen Paling wrote:

I want the option of ignoring that kind of searching [that is, authors and subjects--JW]. The ease of Google and its lack of fields is hugely helpful in my professional life. For example, I'm using a lot of nonparametric statistics in my current research. But when I want to find out about a specific procedure, a catalog is nearly useless.
I don't understand this. While I completely understand that you want to be able to find information in all kinds of different ways, do these new requirements mean that people *no longer want* to find information by subjects and/or authors? (I am not discussing titles here.) So, when you say that you "want the option of ignoring that kind of searching", I don't understand what you mean, and it certainly flies in the face of all of my experience. The people I have worked with want to find the "set of all information/knowledge/documents/whatever" about all kinds of topics, e.g. the possibility of pressing legal action in train disasters in Russia (as only one example). This can be found in the catalog now--within certain known limitations (the topic must make up 20% or more of the text, there must be adequate staffing to have the time to do it, and the catalogers must be well trained)--by looking under "Railroad accidents--Law and legislation--Russia (Federation)--Criminal provisions."

I think getting this sort of information can help people in very substantial ways, and people want to be able to do this just as much as ever; that has not changed at all.

What has changed is that the traditional methods for *finding these sets* have fallen apart completely; for those looking for materials related to the set "Railroad accidents--Law and legislation--Russia (Federation)--Criminal provisions", these terms would probably never enter their minds. The method for finding such a heading worked after a fashion in the card catalog; perhaps the traditional methods were never all that good, but that would be a separate topic for library historians. In any case, no one would ever voluntarily come up with such terms, and the only realistic way of finding that kind of subject was by browsing the subject headings; and we must confess that *nobody does that anymore.* Period. But just because the methods of finding these sets have broken down does not mean that people no longer want information in this way, e.g. the works of Confucius, or Hundred Years' War, 1339-1453--Prisoners and prisons, or to know that if you are interested in topics under "Hundred Years' War, 1339-1453--Campaigns--France" you may also be interested in other sets you may never have even thought of:
Agincourt, Battle of, Agincourt, France, 1415.
Bulgneville, Battle of, France, 1431
Calais (France)--History--Siege, 1346-1347
Crecy, Battle of, Crecy-en-Ponthieu, France, 1346.
Orleans (France)--History--Siege, 1428-1429.
Poitiers, Battle of, France, 1356.

Google *cannot do this.* And my own opinion is that people still want the information organized under the above topics just as much as ever. And I further submit that when people search either in catalogs or in Google, e.g. WWII Rome, they believe they are getting the set of all materials on that topic when they *definitely are not*, that the order magically provides them the resources that are the most "relevant" to their needs, which is *definitely not true* ("relevance ranking" means something completely different and strange, and can be manipulated in all sorts of ways), that the results they get can be "trusted", and so on and so on. I use Google several times every day, but nobody I know has understood any of this until I explain it to them. Therefore, I also believe these misunderstandings are potentially dangerous for society.

I don't believe this is bashing Google; it is a simple statement of fact.

So this is my reply to the part of your second post where you mention:
"But library-assigned subject headings? Only 31%. And classification numbers? Only 14%." The implementation of the traditional methods is so poor in our catalogs that this does not surprise me. But as I attempted to demonstrate above, I believe it is erroneous to conclude that nobody wants the sets of resources found under those headings. Certainly libraries and their catalogs need a lot of changes, but one of the most important is to make the sets we arrange much easier to find for the untrained or semi-trained. We need about a zillion more cross-references and clearly written notes in the authority files.
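The kind of cross-reference structure I mean can be sketched in miniature. This is a hypothetical illustration, not any real authority-file API: the data structure and the `suggest` function are invented for the sake of argument, though the headings are the real ones listed above.

```python
# A toy "authority file": each heading carries "see also" references to
# related sets a searcher would likely never think of on their own,
# plus a clearly written scope note.
AUTHORITY = {
    "Hundred Years' War, 1339-1453--Campaigns--France": {
        "see_also": [
            "Agincourt, Battle of, Agincourt, France, 1415",
            "Crecy, Battle of, Crecy-en-Ponthieu, France, 1346",
            "Orleans (France)--History--Siege, 1428-1429",
        ],
        "note": "Here are entered works on the campaigns fought in France.",
    },
}

def suggest(heading: str) -> list[str]:
    """Return the 'see also' references recorded for a heading, if any."""
    record = AUTHORITY.get(heading)
    return record["see_also"] if record else []
```

The point is not the trivial lookup but where the intelligence lives: in the recorded relationships themselves, which no relevance-ranked keyword search can reconstruct.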

If you want and need information on the methods used in a document, that is fine, but the traditional catalog is not the tool you should use. Yet, just because it does not fulfill that kind of need (one for which it was never designed) it does not follow that it does not, or could not, fulfill the functions for which it was designed. Perhaps by using different partners and more robust formats, we could create something that could do it all.

Although much must change in how these items are searched and displayed to our users, moving in these directions would be far more fruitful for everyone concerned than chasing after RDA and FRBR.

Tuesday, June 29, 2010

Copernicus, Cataloging, and the Chairs on the Titanic, Part 1 [Long Post]

Posting to NGC4LIB

I agree with many of the points you make, but with some I have difficulty agreeing:

"Stop Bashing Google
Bashing Google for not allowing people to search by title, author, etc., misses the point of a general purpose search engine".
Does it also follow that people prefer to *not be able to find* resources reliably by authors, titles and subjects? My own experience is that when people search for, e.g., Samuel Clemens in Google, they actually believe they are getting "everything" by this author; the same goes for subjects.
I admit that titles are weird.

Just because we can point out problems with Google does not mean we are bashing it. As with any other information tool, we need to know its strengths and weaknesses.

Stop Bashing Users.
Have you ever implied that users don't use our tools because those users are ignorant of how our tools work? Because the users are impatient? Don't be surprised if those perceptions color our interactions with users. How many times on this list, and others, have we heard people advocate ignoring user desires and needs? Let's stop telling users what they need, and instead focus on meeting those needs as they are.
I think it is also important to point out that expecting people to find useful information/knowledge/whatever-you-want-to-call-it without any training at all is rather naive. People accept that they can't work on their cars or do anything on a computer without training. Yet research has shown that most people believe they are expert searchers, when of course they are not. Anybody who has done any reference work knows this. This doesn't mean that people are ignorant or stupid, only that they are untrained, just as I cannot competently work on a car or on my own electrical wiring.

It's nice to believe that Google-type searches, i.e. the black box, will solve all problems, but I think it is clear that this is not correct.

Consider Eliminating Cataloging.
[This one from pt. 2]
Users First, Technology and Standards Second.
I sympathize with the concept of users first, and while I may be old-fashioned, I think it is vital for professional librarians to be able to retrieve materials in a reliable, almost guaranteed way from the collection that is under their control, so that they can help people and maintain some kind of control. Perhaps the methods we use will not be obvious to the general populace, but that is not unusual: there are special ways into a car for a mechanic, and special methods for a plumber, and we all want these experts to have their own ways to maintain our cars and plumbing systems. If there are no standards, I submit there are no real possibilities for experts, i.e. librarians, to help users who experience trouble.

Standards for bibliographic retrieval are necessary, just as they are accepted as necessary in many other aspects of specialized areas of endeavor. It's no secret that I don't believe we need RDA and FRBR, but we need some kinds of standards, in any case. What those standards will be is a very interesting question.

Problems With Selection in Today's Information World

Posting to NGC4LIB

B.G. Sloan wrote:

Jim Weinheimer said:

"What do others think? And yes, in such scenarios the 'catalog' will change, but I think will still be the key to it all."

I'm not so sure I agree that, in some future world, the catalog "will still be the key to it all." I'm not even sure that the catalog is "the key to it all" now. I know it's not "the key to it all" for me. It's more like an inventory tool for me. I discover a book or journal article using some other discovery method. Quite a few times I can locate the item without even using the library. When I can't, then maybe I'll go to the catalog and ask "Is the item available through my library?" I can't believe I'm the only person who does this.

I remember reading a couple of studies about the discovery methods used by academics. I can't recall a lot of details, but I do remember that the library catalog wasn't "the key to it all" when it came to their information seeking behavior.

I guess I didn't make clear in my original post what I think of Google. In my opinion, Google is very much a "catalog". It is a way to find materials added to the "Google collection" using a type of arrangement, only Google does everything automatically and secretly: automatic selection, automatic "metadata" creation, automatic arrangement. I would like to emphasize the secrecy aspect here. Google's algorithm for page rank is completely a business secret, and we also don't know exactly what Google contains. My own, very incomplete research showed that around 75% of the open web is in Google, and there is also a limit on how deep Google goes into a site's file structure: back when I was working on it, Google seemed to go about five levels deep.
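The crawl-depth observation can be made concrete with a small sketch. The five-level cutoff is only my own informal observation from back then, and `likely_indexed` is a hypothetical helper for illustration, not anything Google publishes:

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Count the directory levels in a URL's path component."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

def likely_indexed(url: str, max_depth: int = 5) -> bool:
    """Guess whether a page is shallow enough to be crawled,
    under the (assumed) five-levels-deep observation above."""
    return path_depth(url) <= max_depth
```

A page at the third level of a site would pass such a test, while one buried seven levels down would not; the deeper parts of a collection simply never enter the "Google collection" at all.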

So, I agree that the "library catalog" as we know it today will definitely not be the key, but a "catalog" of some sort must be. The question is: are the controls that are available through a traditional catalog no longer needed? Do people no longer want to find materials within a collection by their titles, authors and subjects in a reliable way? I think they actually do want this, but there is an incredible amount of misunderstanding, misinformation, and just plain poor thinking in these areas. A lot needs to be sorted out before we can begin to answer a question like this. If it turns out that people genuinely *do not want* to find resources by their authors, titles and subjects, then I will agree that it is time to throw it all overboard, but I have seen nothing that shows this, although I have seen huge areas that display a lack of understanding. I am not saying that people are ignorant or stupid--what we are dealing with is genuinely complex and we are experts.

One of the main problems that people have with the web (which in the popular mind, equals Google) is that they want to be able to trust the information they get. From a superficial point of view, this means "evaluation of web pages" as we get in the Information Literacy workshops, but there is a lot more to this concept of trust than mere "evaluation of web pages," and this includes trusting the search result.

Monday, June 28, 2010

Problems With Selection in Today's Information World

Posting to NGC4LIB

Following Eric's slap-down (just kidding!), I have decided to pose a question that has not been very well addressed, so far as I know: how selection of digital resources, and especially open-access materials on the web, can be achieved.

Here are some of my own observations concerning the issues:

  1. The non-librarian does not understand traditional library selection;
  2. Library selection has traditionally meant being responsible for a limited budget and adding materials based on a limited amount of resources, both money and shelf space. In essence, it is a process of *inclusion* of specific materials, based on specific policies and limited resources;
  3. When it comes to web resources, the public wants selection of another type. They are very concerned about getting "bad" information. Faculty and scholars are just as concerned as students and the general public. While they like to know what is "recommended by the most people", this is not enough and they still have concerns;
  4. When we have millions of free materials and no problems with shelf space, library selection becomes something fundamentally different from what it has been; in essence, it becomes (I believe) a process of *exclusion*, i.e. taking the "best" and excluding the "worst", much as the traditional "bibliographies of best books" have tried to achieve (for examples, search the subject: "Best books" in Worldcat);
  5. While it no longer makes much sense to catalog the same text over and over and over in each library, I don't think it makes much more sense to "select" the same thing over and over and over in each library;
  6. The traditional library selector has had a lot of help from book dealers and library profiles. Without them, it would be pretty much impossible to do the work in any sort of comprehensive manner. Book dealers get paid to do this work through approval plans and other ways when libraries buy the physical books (or other resources). It is naïve to believe that similar organizations will do a comparable amount of work for materials that are available for free;
  7. Selecting materials on the web is being done now to a limited extent through heroic efforts in cooperative projects such as Intute, Infomine and other projects (to see the tool I created for my own "selection of web materials", see: ). If you look at these sites, you will see many items selected that are not in our library catalogs, plus there is metadata work done twice, on these sites and in our catalogs. The resources found through these projects are, however, not nearly all of the worthwhile digital sites;
  8. In the everyday practice of library selection, many people feel ignored and/or left out, since you cannot make everyone happy. Now, since there are no longer the concerns of a limited budget or of shelf space, each faculty member, teacher, whoever, could equally be a selector. This has obvious advantages as well as drawbacks.

This does not at all exhaust the concerns, but I think these represent a good beginning. Perhaps others are discussing these matters as well; if so, I hope someone can point me in the right direction. I can envision a cooperative tool that could solve these concerns, technically at least, but getting agreement on the huge number of issues would be the challenge, not the least being the explosive question: who will select?

What do others think? And yes, in such scenarios the "catalog" will change, but I think will still be the key to it all.

Meeting library patrons' expectations (Was: Death dates)

Posting to Autocat

Concerning the death dates, I think the real issue is whether it is worth the effort. It is still quite a bit of work for a library to change/update a heading. When an entire network decides to adopt the same heading, every library in the network must change too, which complicates the task substantially. This is one of the reasons for changing the old rule that said "use the fullest form of the name" to the rule of "use the form of the name found on the first item entered into the catalog" (or, in very rare cases, change the heading following the 80% rule). This keeps the number of updates to a minimum.

Adopting the fullest form entails a lot of extra work, almost always necessitating research in various reference works, and different libraries have access to different numbers of reference resources. Transferring this method into a union catalog, where hundreds of other libraries use the same forms of heading, means that an update to a heading results in changes to perhaps thousands of headings in hundreds of catalogs. Now that OCLC has grown so much (see the latest statistics), updating a heading must be considered a very serious matter; in fact, so serious that it must justify the hundreds of man-hours (and woman-hours!) needed to do the update; hours that could be used in better ways to help our patrons, such as cataloging new materials.

I was against updating the death dates, not because I enjoyed seeing Pope John Paul II or Leonid Brezhnev still walking around, but because we must face up to the fact that we have (and will continue to have) diminishing resources that must be put to their most productive uses. Since these headings were not "incorrect", still fulfilled their *function*, i.e. to bring together someone's works, and rarely led to any misunderstandings on the part of the users (there are far more difficult parts of a bibliographic record to understand than this), to me it is not worth the effort to update them; doing so is actually like swabbing the decks of the Titanic while it is sinking. While no one should doubt that the practice of cataloging is in deeply serious trouble, lack of death dates is more of an annoyance and is of very minor importance. No one will improve their opinion of the library catalog just because a date of death is updated or not. Things have changed too much.

Besides, with the use of more modern formats and tools, updating a heading could be done automatically, by changing the heading quite literally *one time* in the NAF, from which the change could propagate automatically into each catalog.
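To make the idea concrete, here is a minimal sketch of how identifier-based headings could work, assuming bibliographic records store an authority identifier instead of a text string. The identifiers, records, and function names are all invented for illustration:

```python
# A tiny stand-in for the NAF: identifier -> authorized heading.
authorities = {"n-0001": "John Paul II, Pope, 1920-"}

# Bibliographic records carry only the identifier, never the heading string.
bib_records = [
    {"title": "Crossing the threshold of hope", "author_id": "n-0001"},
    {"title": "Memory and identity", "author_id": "n-0001"},
]

def display(record: dict) -> str:
    """Resolve the identifier to the current heading at display time."""
    return f'{authorities[record["author_id"]]}. {record["title"]}'

# Updating the heading *one time* in the authority file updates every
# display everywhere, with no record-by-record maintenance:
authorities["n-0001"] = "John Paul II, Pope, 1920-2005"
```

Because no bib record ever stored the heading text, there is nothing to re-edit in the individual catalogs; the death date appears in every display the moment the single authority entry changes.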

Therefore, building that kind of system could *definitely be worth the effort* since it would simultaneously save time and increase accuracy in a number of ways. In fact, we could even return to the rule of "fullest form", which, I think everyone would agree, is a much better form than what we have now.

There are lots of possibilities out there if we decide to create and adopt the correct tools.

Friday, June 25, 2010

ALA Session on MODS and MADS: Current implementations and future directions

Posting to NGC4LIB

Bernhard Eversberg wrote:

More generally, we need to look at how people and services on the Web tend to refer to and talk about books (or "resources" anyway). The notion of "title" is not used and understood with any degree of formal consistency, to begin with. It depends on how long and specific or unspecific it is, and whether the author's name is more prominent than the title itself ("read the new Dan Brown yet?"). And if the "title proper" is significant, why bother with the rest of the verbiage on the title page when referring to the book?

Wow! And I thought I was radical! You are absolutely right, and now that people have a genuine choice on search tools other than the ones we create, we must deal with these situations, otherwise people won't use our tools at all.

This reminds me of something I read recently in an absolutely fabulous, scholarly, weighty tome by Elizabeth Eisenstein, "The printing press as an agent of change". (It is taking me forever to get through because I pause and think after every few sentences!) It is startling that there are so many parallels between the period of the late 1400s to the 1500s and our own time. One thing I read recently was that as the Bible was translated into the vernaculars, the Catholic Church naturally tried to stop them by forbidding vernacular translations. The problem was, they couldn't stop them everywhere, and this meant that when lay people were interested in Biblical teachings, the only materials available were Protestant translations. Many Catholic monks in Protestant regions saw this happening and wanted people to have a choice, so some went ahead and translated the Bible into various vernaculars, even at the risk of being labeled heretics.

If libraries do not give the public what it wants, the public will easily do without libraries, while certain people bemoan what is happening, all the while predicting the end of civilization.

Yet, I feel that there is--or should be--a certain level of dumbing down beyond which we should not go. Titles?! Still, many students that I have worked with do not understand searching by "author" or "title" or "subject". This is what we get with Google.

It's a difficult moment, but I think Alex is right: the first step is to come to some kind of mutual understanding--not agreement--over these different concepts.

ALA Session on MODS and MADS: Current implementations and future directions

Posting on NGC4LIB

Laval Hunsucker wrote:

Let's be frank about it, shall we? Instead of "we must assume that fewer people will be visiting our world", wouldn't it be more complete and more accurate to say "we must assume that fewer people will be populating our world and visiting our world"?

I.e., until it becomes so few people that we reach the point at which that world cannot but evaporate altogether. (Or, just possibly, has morphed into quite another kind of world altogether.)

I can't argue against this, but I also don't see much of a difference. In essence, my concern is what will happen, and is happening now: people are beginning their searches for information in places other than the library (that should come as a surprise to no one), and people find so much in these other places that there is little time left for them to use our resources, or even to know about them in the first place. This is the world that we must plan for because it is happening right now, and there is no indication that people will be clamoring for the work of libraries and librarians. The way I look at it, we can either just give up and find jobs in other fields, or actually try to do something about it, and perhaps even look at the present situation as an opportunity.

In this respect, I am mulling over and playing with Alex's idea that (as I read it) one of the main problems is that of gaining a mutual understanding of the differing concepts. As a practical example, there is the bibliographical idea of "title proper" and "other title information" vs. the more popular idea of "title". The first step is briefly to lay out the differences in the concepts, but then the second step would be to actually make decisions, i.e. is it really worthwhile for us to continue the separate coding of 245$a"title proper" and $b"other title information" or is it so important that we should try to get others to code them separately?
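For the sake of discussion, here is a simplified sketch of what the separate coding buys us, using the common "$" convention for writing MARC subfields in plain text. The `parse_245` function is invented for illustration and ignores indicators, $c, and the finer points of ISBD punctuation:

```python
def parse_245(field: str) -> dict:
    """Split a plain-text 245 into its subfields, e.g. $a (title proper)
    and $b (other title information). ISBD separator punctuation
    (' :', ' /') trailing each subfield is stripped."""
    parts = {}
    for chunk in field.split("$")[1:]:
        code, value = chunk[0], chunk[1:].strip(" :/")
        parts.setdefault(code, value)
    return parts

field = ("$aThe printing press as an agent of change :"
         "$bcommunications and cultural transformations in early-modern Europe")
title = parse_245(field)
```

With the subfields kept separate, a display can offer either the short popular "title" ($a alone) or the full title-page transcription; collapse them into one undifferentiated string and that choice is gone forever, which is exactly the trade-off a decision on step two would have to weigh.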

Obviously, determining the answer to the second step would be more ominous and traumatic in many ways, not the least is: who will actually make the decision? But no matter what the decision, it will represent change for somebody, somewhere unless we just decide to let everybody do whatever they want.

Still, there is no reason to undertake step 2 until step 1 is done. Actually, this is the *purpose* I have had for the separate section in the Cooperative Cataloging Rules on the "Conceptual Outline", which attempts to bring the different ideas together. I admit that it is very poorly implemented and is based entirely on ISBD, but it is something, anyway.

Of course, I am sure there are much better ways of handling this. It could all be deleted so that we could start over again.

Thursday, June 24, 2010

ALA Session on MODS and MADS: Current implementations and future directions

Posting to NGC4LIB

Bernhard Eversberg wrote:

Correct. But exactly "what is needed" is what we still don't know or can't pin down. And I think there will not be one easy answer but any community, no, any one person needs something ever so slightly different and something else the next time. So, our data model needs to support exports in many flavors and dressings.

But now, the mighty RDA toolkit is out. Go and look if it might get us closer to solutions, based on the FRBR data model as it is. But don't forget there are gigatons of legacy data.

This is exactly the problem. We have tons of metadata made over many decades (if not longer) that--at least I believe--will definitely prove useful in this new environment, but we don't know exactly *how* it will be useful. We do know, however, that in its current format (in various senses of that word), what we have will definitely *not* be useful outside our own library world, and we must assume that fewer people will be visiting our world.

The FRBR data model is already an anachronism and has never been tested. I find it amazing that FRBR makes this incredible statement that people want to "find / identify / select / obtain --> works / expressions / manifestations / items --> by their authors / titles / subjects" which we absolutely *know* is wrong! That describes how the card catalog worked--even in the OPAC, keyword searching quickly became the dominant search. Does anybody really do this anymore? Once in a very great while I do when I am searching for a specific edition of, e.g. Burton's Anatomy of Melancholy, but in addition to keyword searching, there is this thing called "Google" and "relevance ranking" that has come around in the last 20 years or so...

I certainly cannot imagine that RDA/FRBR changes will interest anybody in our data, and after looking at some of Karen's attempts, I suspect it isn't even a valid data model for our present records. (Is there really such a thing as an "expression", or even a "work"? This has turned out to be far more difficult in reality than we thought--or than I thought, anyway! Perhaps the work and the expression, as simple collating points for a card/printed catalog, are not "entities" with lots of "attributes" but something else entirely.)

In any case, one of the most basic business practices of new product development has never been tried, at least not to my knowledge: does our projected product (i.e. FRBR displays) fulfill our patrons' needs? Or a significant number of their needs? I haven't seen anything that deals with this. From my own experience with users, and my own searching patterns, I can't imagine that anybody would look at these displays and say, "Yes!! That is what has always been missing!!"

But it seems as if the library world is doomed to continue spending its limited resources to create an untested product, and to train staff to create that product, when quite probably nobody wants it.

Again, I want to mention that there is a choice: the Cooperative Cataloging Rules.

Wednesday, June 23, 2010

Blog reply to Not a Crisis, a Transition

I find this difficult to agree with. When the Google Books-Publishers agreement is eventually approved (and it will be someday if not very soon), I am sure that our patrons will want it. (I for one, will want it very much!) And it will be impossible to hold out against these people because if librarians say anything like, “Subscribing to Google Books is not a good deal. I am doing this to protect YOUR best interests,” nobody will believe it. The result would be merely to confirm the idea that librarians are dinosaurs from an earlier time and risk exposing ourselves to general ridicule.

After all, I think people would be correct to ridicule any librarian who wanted to deny patrons access to the riches of Google books–that is, if the patrons want it. And I can’t imagine any patron saying no.

Scholarly communication is changing in almost every way at a frighteningly fast pace, and it is only logical to assume that it will continue in this way for a long, long time. I have welcomed many of the changes, but many others I find quite negative. I'm sure this love-hate relationship will continue as new changes occur. But if librarians are to survive, I think we will have to represent openness and inclusion much more than closed stacks and some anachronistic idea that we are there to ensure some level of "quality".

Meeting library patrons' expectations (Was: Death dates)

Posting to Autocat

On Tue, 22 Jun 2010 11:10:12 -0600, john g marr wrote:

The rule stating that we are *not* supposed to change "b." dates to full date ranges when death dates become available thus seems entirely counterintuitive if the purpose of cataloging is to meet library patrons' expectations. So, I suggest we boycott the use of "b." and use "open dates" in all such cases, so that death dates can "legally" be applied when available.

Concerning the open death dates, I wonder how much it has to do with card practice, which, more often than not, went with the fullest form when setting up a name, so cards often left a lot of space after an initial in a heading (see: ). So, when a cataloger discovered later that this person's name was actually "Coot Johnson", they wouldn't have to go through a lot of work to update his name. On a humorous note, I have seen this practice carried into retrospective conversion projects, where the people inputting the cards obviously didn't understand and left lots of spaces in the headings! But the practice of leaving an open death date on the card may be related to this old practice.

But I would like to ask whether or not it is true that "the purpose of cataloging is to meet library patrons' expectations." Also, from the information expert's point of view, we should ask whether specific parts of a patron's expectation of a catalog are justified or not. For example, should a birth or death date found in a bibliographic heading be considered authoritative, even if a patron expects it to be? 99% of the time, the cataloger just copied what he or she found in the book and did no further research. In this same direction, should a copyright/printing/whatever date be considered valid for legal purposes? Again, the cataloger just copied down what he or she saw, ignoring some dates and printing statements. Speaking personally, I wouldn't want to be held legally responsible for a copyright date.

Patrons' expectations are changing constantly, and *if* the FRBR user tasks (almost copied verbatim from Cutter's rules) were correct and valid when they were written, they certainly are not any longer, because people expect a lot more. As one example, I find the Wikipedia disambiguation page to be much superior to our traditional library methods for differentiating concepts. But the *function* of our headings is of course much clearer than what is in Wikipedia.

My point is: trying to live up to patrons' expectations is a completely fruitless task: there are so many different kinds of patrons, expecting far too much, that I wouldn't even try. We should also be wary of imposing anachronisms on our patrons: the example of getting names to file "correctly" (open date vs. b.) is a non-starter, since browsing name headings in that way is a function of the card catalog that has disappeared almost entirely.

We should be focusing on finding out how people are accessing and using information, and on making ourselves relevant in that way. People expect something entirely different. To get an idea of some of these new problems and tasks, take a look at for a great discussion of the possibilities, and the pros and cons, of the Semantic Web. These are the directions where we should be focusing our efforts. I know there is a very important place for us in that new world.

New record use policy 'WorldCat Rights and Responsibilities for the OCLC Cooperative' - effective August 1

Posting to NGC4LIB

Karen Coyle wrote:

While the main thrust of the policy is the same as it always has been (protecting WorldCat), I am glad that OCLC clarified their stand on the copyright issue. That doesn't mean that I agree with their copyright statement, but in the past it hasn't been expressed this clearly and there was a lot of speculation. OCLC claims no copyright in the bibliographic records, but does claim copyright in WorldCat as a whole:

"While, on behalf of its members, OCLC claims copyright rights in WorldCat as a compilation, it does not claim copyright ownership of individual records."

I hope this is correct, but on the other hand, to me there appears to be a contradiction as shown in the definition of Worldcat Data (or Worldcat bibliographic data):
"WorldCat Data. For purposes of this policy, WorldCat data is metadata for an information object, generally in the form of *a record or records* [my emphasis--JW] encoded in MARC format, whose source is or at one point in time was the WorldCat bibliographic database."

The definition of Worldcat data appears in a slightly different form in Section 1:
"The purpose of the policy is to define the rights and responsibilities associated with the stewardship of the WorldCat bibliographic and holdings database by and for the OCLC cooperative, including the use and exchange of OCLC member-contributed data comprising that database."

Worldcat data makes up the Worldcat database, and from this, we see that the Worldcat database is also composed of "OCLC member-contributed data".

It seems to me from this reading that, because the data went through the Worldcat database, the following conclusion is mathematically inescapable:

Worldcat Data = OCLC member-contributed data

This seems to me to imply some level of ownership.

In any case, from this reading it seems that a local library, if it wants to do something with its own records, perhaps outside of its own catalog, must follow Section 3.A.4. The only out is the vague terminology of "should" and "are encouraged":
"OCLC members should ensure..."
"OCLC member's agent are encouraged, subject to the terms and conditions of a mutually acceptable separate agreement between the agent and OCLC..."

Of course, I am no lawyer either, but librarians have gotten all bent out of shape over the Google-Publisher agreement even though it does not deal with library information per se (we only own copies of intellectual property that belongs to others). Here we are speaking of library-created metadata that has definitely belonged to the library that created it. Calling it Worldcat data, which then has a number of conditions attached, seems to merit some attention too.

Tuesday, June 22, 2010

ALA Session on MODS and MADS: Current implementations and future directions

Posting to NGC4LIB

The way I look at the issue of MODS vs. MARC is this: since the vast majority of library records are now in some type of MARC, it is a matter of what is lost vs. what is gained by switching to MODS.

1) the map from MARC to MODS is at: , and it makes clear what semantics are lost. My previous example was the different subfields in 130/240 being lost when mapped to MODS <title type=uniform>, specifically the $s. Again, it is not the information that is lost but the semantics. There are also 111 and 711 $a$c$d$e$n$q, and others as well. There are certain concerns I have as well, e.g.
245 $a$f$g$k <title> with no <titleInfo> type attribute and
245 $b <subTitle>

which is rather strange from a logical point of view (e.g. 245 $g is for "bulk dates", which I believe are primarily for archival materials, but as a result it would semantically mix dates of creation together with title information). I don't believe I have ever used those fields before, so the issue would probably arise relatively rarely, and yet I may be completely wrong! In any case, this mapping makes clear what would definitely be lost when transferring a record. Whether that is seen as serious or not needs discussion. MODS could be further modified if some bit of information were deemed sufficiently important.
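To make the flattening concrete, here is a minimal sketch in Python of how 245 subfields might collapse into MODS title elements under the mapping described above. This is an illustration only, not LC's official transformation, and the field values are invented:

```python
# Illustrative sketch: flattening MARC 245 subfields into a MODS
# <titleInfo> element, following the mapping discussed above.
import xml.etree.ElementTree as ET

def marc245_to_mods(subfields):
    """subfields: list of (code, value) pairs from a MARC 245 field."""
    title_info = ET.Element("titleInfo")
    # $a, $f, $g, $k are all concatenated into a single <title>,
    # so dates of creation ($f, $g) lose their separate semantics.
    title_parts = [v for c, v in subfields if c in ("a", "f", "g", "k")]
    ET.SubElement(title_info, "title").text = " ".join(title_parts)
    # $b alone keeps a distinct element, <subTitle>.
    for c, v in subfields:
        if c == "b":
            ET.SubElement(title_info, "subTitle").text = v
    return title_info

field = [("a", "Adventures of Huckleberry Finn :"),
         ("b", "Tom Sawyer's comrade /")]
print(ET.tostring(marc245_to_mods(field), encoding="unicode"))
```

The round trip back to MARC cannot recover which words came from $f or $g once they have been merged into `<title>`, which is exactly the kind of semantic loss at issue.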

At the same time, Ashley decided that MODS granularity was too fine.

It is also important to remember that this information is lost *only* when transferring the record from the database.

What we would gain by changing to a more normal type of XML format does not need to be spelled out for this group.

At this point, the whole discussion may be moot because of the latest Worldcat record policy. I still haven't made up my mind. The section
"WorldCat Data. For purposes of this policy, WorldCat data is metadata for an information object, generally in the form of a record or records encoded in MARC format, whose source is or at one point in time was the WorldCat bibliographic database."
seems to claim as Worldcat data a record of 040 $aDLC$cDLC$dMyLibrary, if it was downloaded through Worldcat. If I downloaded it through direct Z39.50, it may be another matter, but I may have to prove that somehow. Again, to me it's like taking my car to the mechanic: I pay him to do some work on my car, and then he claims that I can't sell the car or do anything with it without his approval. And why? Because if I do so without his approval, it may hurt the "local collective" in some way.

It seems to be a done deal however. And I find it amazing that people have gotten so concerned about Google Books!

Thursday, June 17, 2010

New Possibilities in Cooperative Cataloging

Posting to alcts-eforum

Concerning Elaine Sanchez's post, I think she has summed up the problems very clearly. It still has never been shown that the FRBR user tasks offer anything that *our users* want (in fact, the FRBR displays I have seen tend to frighten even me!), although I will agree that FRBR may give librarians and catalogers a few of the tools that *they* want. So the "FRBR user tasks" should probably be renamed the "FRBR librarian tasks". As an example, I have mentioned several times on other lists that FRBR-type views will not help my patrons find much of anything, and I must confess, they don't help me find anything I want either. Nevertheless, it is important to keep in mind that librarians need specific views to more clearly understand what is available so that they can do their work.

To get a better understanding of what *our users* want, I would like to point out a fabulous short video on the Semantic Web that I have just discovered, and I don't believe anyone has mentioned it on any of the lists I follow. Here, you can watch one of the best discussions of the Semantic Web that I have seen, both pro and con, with interviews of Tim Berners-Lee, Clay Shirky, and several others. I suggest it to everyone.

I agree with the basic ideas discussed in this video concerning the reality of the problems and the immense possibilities open to people today, and these possibilities definitely have very little or nothing to do with the FRBR user tasks, but I do have some disagreements with several points of the video. In particular, about 7:50 into the video, Clay Shirky discusses the Semantic Web, comparing it to the idea of AI (artificial intelligence), and says:

"Instead of making machines think like people, we can describe the world in terms that machines were good at thinking about. So we would switch from trying to build up brains in silicon [i.e. artificial intelligence--JW] and re-render the actual world as information [i.e. the Semantic Web--JW]. And that gets very quickly to one of the deepest questions in all of western philosophy, which is: does the world make sense, or do we make sense of the world? I don't think you can unambiguously describe the world; I don't think you can describe the world or even large subdomains of the world in a way that all observers or even most of the observers will agree with."

The documentary has other sections similar to this. My own opinion is that Clay Shirky is right up to a point: there is no single way that everyone will agree with. A simple overview of the history of the classification of the sciences, plus the myriad rules we have had for bibliographic description and organization, will convince anyone that he is right. BUT stopping there avoids some important subtleties. One very important approach is to have a method that more or less guarantees, in a highly predictable manner, that you can find things and then know clearly what you are looking at; in other words, one method should be an expert mode.

We have expert modes everywhere in our world, and we want them. While the layperson needs certain tools, the expert needs other tools. The tools for both should advance. So for example, the toothbrush was an advance over the twig for cleaning your teeth, then came floss, then increasingly better toothpaste, the water pick, and whatever there is new available today. These are tools designed mainly for the layperson, who can do an increasingly better job cleaning his or her teeth because of the ever-evolving tools available. The dentist however, is not stuck with the same tools as the layperson, and although the layperson's water pick may be a major advance over the best tools that a dentist had 100 years ago, the dentist's tools have advanced as well. Therefore, the expert's tools have evolved *alongside* the layperson's tools. This has happened throughout modern society.

So, it seems to me that an underlying assumption of Clay Shirky's argument (or perhaps he is simply not dealing with it) is that in the new information/semantic web world everyone, from layperson to expert, will be using the same tools, an assumption which I hope will prove incorrect.

For some time, I have considered a related point: If you believe that the problem of information retrieval can be solved by devising better and more powerful search algorithms, then it seems to me that this attitude actually betrays a deeper, metaphysical belief: that information resources, at their very fundamental level, are already organized by their very nature, and consequently, if you believe this, the task turns from organizing and describing into finding the correct algorithm that will then discover this deeply-hidden, inherent organization that already exists. Once you achieve this, the resource will then be organized and can be exploited. This assumption seems to be unavoidable for those who believe that algorithms are the solution because the algorithms are searching for something and what else could it be other than this hidden organization?

My own opinion is that computers cannot "think" or "reason" as such; all they can do is perform mathematical operations. Some very clever people have worked these algorithms to such a point that they can take on the appearance of "thinking" and "reasoning", but we should not be confused, and we should certainly not let these computers do our thinking and reasoning for us. They are only tools after all, just like a hammer or a power drill, and they must be used with some degree of knowledge and skill. For example, "relevance ranking" is not the normal idea of "relevance" but something entirely different, although this is very poorly understood in the popular mind.

Therefore, for those who do not happen to believe that information is inherently arranged (I am one of those), and who hold instead that order must be imposed on an otherwise chaotic mass of materials, the search for the "perfect algorithm" becomes very similar to the medieval alchemists' search for the "philosopher's stone". I do not believe there is a perfect algorithm, even theoretically.

As a consequence, while it will never happen that everyone will agree on the "best" way to find information, this fact is almost irrelevant in my opinion, and it is not even unusual in the real world. We have precisely the same situation with driving a car or working a DVD player. Everyone would probably agree that there are many better ways of accomplishing either of these tasks (operating a DVD player is notoriously complicated, and in light of the oil spill in the Gulf, many people could come up with improvements for cars running on oil). Yet, to get along in the world, everyone is stuck operating DVD players and driving cars. While there is no single way to use and drive a car, there are still many, many more ways *not* to drive a car. Some drivers are experts and others are only more-or-less competent. We want expert drivers, such as drivers of large trucks or ambulances, to utilize their expertise to help everyone. We should not expect them to be "stuck" with the same cars as everyone else. The same possibilities should be ensured for information experts. This involves systems and standards that allow reliable and guaranteed retrieval and understanding.

Of course, this does not mean that "information experts" (i.e. librarians) do not need to change radically, and this video very clearly shows some of the directions in which they should change, but we should keep in mind that we still need our own special tools, and there is nothing at all strange about that.

Wednesday, June 16, 2010

cooperative cataloging

Posting to alcts-eforum

Maxine Sherman wrote:
There have been many spirited discussions on OCLC-Cat about these [i.e. B&T] records, but I agree that libraries that just input a completely new record without upgrading the level 3 record, just create more duplicate records. What I also find "amusing" are the number of B&T order records with input dates later than full records. Come on, B&T, can't you just add whatever you need to someone else's complete record rather than contributing to the duplicate record problem? Oh, and why can't the order records come with the appropriate type code? Doesn't the person who is ordering the item know in what format it is being ordered? Sigh. I, too, have a link to the duplicate records notification web page in my browser. A prominent one.
This illustrates one of the major problems with metadata quality as we begin to share with new partners. Probably B&T saw no problems with the quality of their local database before, but now they are expected to change what they do to suit our needs. While that would suit me just fine, it means that B&T would have to change what they have always done, which means training. Additionally, with RDA, what does that mean? Subscribing to the online RDA, and a much higher level of complexity. (While people may debate whether RDA is a step forward or a step backward, I don't think anybody can seriously argue that RDA is any simpler than what we have now)

I can imagine that if I were at B&T, while I might agree to share the data I have, I don't think I could justify a great expense in the sense of increased training and online subscriptions. Perhaps I could justify it if it were demonstrated that it would increase sales. Although I may be wrong, I don't see that putting a record in a local catalog or Worldcat is going to make that much of a difference in sales, although a record in Amazon, Alibris, or even in Google Books may make a difference.

Now multiply this by the number of other book publishers and related metadata creators, each facing similar situations and we see the scope of the problem.

I am a very firm believer in high-quality records, but I think it is obvious that the definition of "high-quality" must be reconsidered today. The world has changed and quality cannot be defined in the same way as it was 25 years ago.

Tuesday, June 15, 2010

Heads in the Sand

Posting to NGC4LIB

When the Google-Publisher agreement is approved--which could literally occur any minute now, although it's clear that it will happen sooner or later--then we will have millions upon millions of books fully available at the click of a button, 24 hours a day, wherever you happen to be. These materials definitely will be adequate for a high-school education or an undergraduate degree. Perhaps they will also be adequate for a master's degree in many fields. At the moment the agreement is approved, which will happen within a few years at most, everyone will discover the answer to the big question: will people still come to the library to consult copies of books that are online? I think they will not. For those who are interested, I discuss this more deeply in earlier posts, found at: I believe this issue is possibly the greatest challenge facing librarianship in its long history.

During a conference, I was having dinner with some colleagues and mentioned how libraries will change when the Google-Publishers agreement is approved sooner or later but the others didn't want to hear any of it. I tried to press my point for a bit, but finally, everyone clinked their glasses with a toast to "Heads in the Sand!" I clinked and drank, too.

On the email lists, I also get little public feedback when I bring this up, and the private replies either express strong doubt that people will actually want to read electronic books, or are resigned to just riding out the troubles, simply hoping for some sort of "deus ex machina". Although the consequences of the Google-Publisher agreement could turn out to be quite unpleasant for libraries, the fact that the true digital revolution must happen sooner or later is rather elementary to predict. In addition, we have before us a very useful cautionary tale of how *not* to react: the music industry, which has made itself almost universally hated, especially among its biggest fans. It must be especially irksome for people in the music industry, now reduced to threatening everyone, to watch their former immense power gradually wither away.

To their credit, the book publishers do not appear to want to follow the same painful example as the music publishers, but they too, are looking at tremendous changes that are inevitable, the consequences of which are quite literally impossible to foresee.

Of course, at the foundation of this tremendous struggle lies the issue of copyright: it is hard to predict how copyright will have to evolve and adapt itself to this new world, that is, if it is not to become ignored entirely. I suspect that as changes in copyright are hammered out, the changes in the various publishing and media industries will also become clearer.

Libraries can only look on from the sidelines as these events work themselves out; after all, while libraries probably own more copies of texts than anyone else, they own extremely few copyrights to the works they hold, if they own any at all. When arrayed against the forces of the most powerful information agencies in the world, it would be simple and entirely natural to throw up your hands and give up. And yet, although libraries may have very little control over the final outcome of this struggle taking place among greater powers, there is certainly nothing holding them back from preparing themselves for different possible outcomes. Therefore, a "head in the sand" attitude would not seem to be the wisest course at this moment. Such a course indicates that you resign yourself to whatever happens; that you have a complete lack of power. Such listlessness may be logical, and even the proper course, in those cases when someone has fought their best battle and still lost, but there is almost no situation that is completely without resources and totally hopeless. Especially when someone's survival is at stake, or that of an entire endeavor, the greatest efforts must be undertaken before giving up.

So, is it all really that bad? It depends on what someone thinks librarians do. If you think it is all about printed books, then it may be completely hopeless, or at least as hopeless as when the music industry insists that people continue buying music CDs at outrageous prices. But the music industry could still do very well if it reconsidered what it is really doing: not the antiquated task of merely creating different types of physical copies of intellectual creativity and sending those copies around the world to be bought in retail outlets. Such a world is disappearing, whether they or we happen to like it or not. I hope the book industry is thinking long and hard about this, because I am sure there will be an important and honored place for it in the future. If libraries and librarians do this same deep thinking, they too may still have a very important role to play in the future information world after the powers-that-be make the millions of books in the Google Books project completely available.

What can librarians contribute in this new world? It would seem simple enough: we should make it easier for people to know about the whole range of resources that are genuinely available to them. Here, I emphasize the term "easier", not "easy", since I fear the task will never be an easy one, and we should not set ourselves up for failure. But I am sure everyone can agree that it can be made progressively easier for people into the future. Plus, we should focus on the resources we believe are worthwhile and that are genuinely available to people. This means valuing a resource by its inherent quality rather than concentrating on the materials that a local institution happens to pay for (i.e. we should do the same as we tell our patrons to do!). If there are just as many or more sources of information available to people for free on the web, and these same resources cannot be found through the library's tools, we shouldn't be shocked when people have less and less trust and respect for the library's tools.

It is a tremendous undertaking so it will be very tempting to rely on others for much of this; for example, when the Google-Publisher agreement is approved, there will be an overwhelming number of books in the Google Books website and our institutions will be paying a lot for them. It will be logical to assume that people will tend to start their research in Google Books where they will find so much they will never find their way out to other resources. Yet, we as librarians know there will be many other highly valuable resources available outside the Google Books site, both on the web as well as physical books in our local collections. Somehow, librarians need to let the public know about these resources, because it is unrealistic to expect Google to point people away from their own resources where they can make money. How can we achieve this?

This is when the conversation becomes interesting for me because it shows that at least people have gained enough confidence to open their minds to new possibilities, and are talking about what might be done. It is only through the flow of ideas that possibilities can be tried, modified, accepted or discarded. We should try anything except to put our "heads in the sand."

Of course, there are those who believe that the public absolutely loves the printed book and will never give it up, so they believe we are not facing much of a problem. In their opinion, for the foreseeable future the majority, or at least a sizable minority, of people will still go to the library to borrow a printed copy of a book even though they are already looking at a digital one; also, our library administrators will allow libraries to buy physical copies when they are already paying for an electronic version, and will agree to pay for ILLs for materials available digitally as well.

While all this may be true, I do not believe that complacency is the correct response; otherwise we will be as surprised as the music publishers were when highly predictable events take place.

"Learning from libraries" (somewhat OT)

Posting to NGC4LIB

"Learning from Libraries: The Literacy Challenge of Open Data"

While I am certainly a very strong advocate for open data and I expect to see it expand in many ways--including major changes in the purposes and the very idea of "intellectual property" which is holding society back in many ways--I would like to pause for a bit on the author's views of "context":

As in the 19th century, these arguments must not prevail. Indeed, we must do the exact opposite. Charges of "frivolousness" or a desire to ensure data is only released "in context" are code to obstruct or shape data portals to ensure that they only support what public institutions or politicians deem "acceptable".
I think this is a rather simplistic way of putting it. "Context" is an important consideration and although it can be seen within a bibliographical system, the concept of "context" there remains rather vague and abstract, but fortunately, "context" can be seen very clearly in the realm of statistical data, and once it is seen there, we can go on to the field of bibliography.

A close friend of mine who is in charge of a major statistical project has described the problem of getting reliable statistics. For example: Here is a statistic giving the number of people living in poverty in a specific country. Who is providing this number? A government agency within that country? What kind of government is it: semi-democratic, military, dictatorial? Does it come from a government agency in another country? Is it an unfriendly government? Does it come from a non-governmental agency with a religious or political agenda? The questions go on.

It becomes clear that while we can take it as a given that somewhere out there, there may be a "real" number of people living in poverty within a specific country, it turns out that no matter what that number happens to be, it will actually be a political issue. As a result, anyone viewing this statistic can and must realize that it is suspect in many ways for all kinds of reasons. Consequently, the "context" surrounding this number becomes just as important as the number itself.

These concerns only scratch the surface of the multiple considerations surrounding "context" in something that at first glance appears as simple as a statistical number. For it to have real meaning, so that competent decisions can be made, someone must consider additional aspects of the "context" of that number.

Experienced librarians and catalogers know that there are similar concerns in our data, stemming from all kinds of issues: historical, cultural, budgetary, level of competence of the cataloger/inputter, and even personal matters. When we take these concerns to a more general "macro" level of the information universe, where all of this is supposed to be shared in some sort of meaningful way, we are dealing with something genuinely new, in my opinion.

How do we adjust to this new world? I don't know, but I do think that the general public needs help to realize at least some of the issues involved, just as someone should be skeptical of the "officially-sanctioned" number of poor people in a country run by a military dictatorship. These are some of the issues my friend is dealing with in the statistical realm, and I think we need to consider them as well.

Monday, June 7, 2010

RE: We have veered way off the topic... (was Are MARC subfields really needed?)

Posting to NGC4LIB

Ted Koppel wrote:

Traditional ILS OPACs (and for that matter, NG OPACs) are simply and *only* for displaying the structured data. As long as you know where to find that data in a structure (whatever structure - MARC or something else), then combining it, presenting it, sorting it, is the responsibility of the application, not the data. So if someone wants to have a bibliographic record display to a user with two main entries (note that I am purposely *not* saying two 1xx fields), that's the OPAC application's responsibility, not the data's. You want to browse by title? Fine - but that's an application and presentation issue, not a data issue. As long as the data is structured in some sensible way, it can be done.
I agree with you only up to a point: a lot of it *does* have to do with the data. The War and peace example is one: the titles proper file pretty well in Michele's catalog (I still see some problems as I continue browsing), *so long as the coding is consistent*. Again, for someone who hasn't done it, figuring out what is the title and what is "other title information" may seem simple--and mostly it is--but often it is difficult.

Here is an example of a book in the Internet Archive:

On the title page is:
The adventures of Huckleberry Finn
(Tom Sawyer's comrade)
Scene: The Mississippi Valley
Time: Forty or fifty years ago.

Now I will switch to the cataloging thought process:
So, the question is: what is (Tom Sawyer's comrade)? It renames Huck Finn and puts him into a relationship for those who know Tom. But is it part of the title proper? Is it a subtitle?

Essentially, it renames Huck Finn, and in a grammatical sense, it is closely related as a clarifying phrase and therefore would be in $a. Yet, it is in parentheses. If there were just a comma after the word Finn, I wouldn't hesitate to put it in $a, but the parentheses are a pain. Still, in my opinion, Twain threw in Tom Sawyer as a marketing ploy trying to capitalize on the popularity of his earlier novel. Therefore, (Tom Sawyer's comrade) is best placed in $b.

Of course, this would be a mistake, because the purpose of a catalog is to be consistent. Therefore, we should look to see whether there are other editions, whether someone has dealt with this before, and what they did. And of course, we have: we discover that it is consistently entered as part of the $a (or, from what I have seen, it's pretty consistent). Good catalogers! Therefore, based on the rule of consistency, I do *not* add the $b, even though I do not agree with it, because I am a cataloger. This is how to make the machine work correctly.

If we put in the $b, it would (should) file differently. There are hundreds of points like this that we have to deal with every day. This is the reality of what cataloging is; doing less is more akin to filling out forms.

We can argue that such a level of accuracy and consistency is no longer needed today. That is a separate point, but here I wanted to point out that the coding *and* the data go hand in hand.

As a later post in this thread mentioned, there are then additional considerations concerning a uniform title, or, the idea that what is essentially the same text should come together.

RE: Are MARC subfields really useful ?

Posting to NGC4LIB

Gray-Williams, Donna wrote:

I've been following all of these discussions about MARC and its limitations, and sometimes it's beyond me because I'm not especially familiar with XML and the like--so here's my question:
Why can't MARC be expanded to function like these other formats if the data isn't presented in an ideal way to be extracted? Why does it have to be totally thrown out to adopt something completely different? Is that so impossible?

I would like to point out the difference between MARC in its different formats: there is MARC21, UNIMARC, etc. Trying to bring these together is one level of complexity. Then there is the point I keep bringing up, the ISO2709 problem, which lies a level deeper.

An example of some basic structural problems with MARC21 is that it is based on a single main entry, i.e. the 1xx field is not repeatable. Therefore, we have in-depth rules for determining a single main entry although there may be many other authors just as important. While the single main entry was absolutely critical in a printed/card catalog, and in fact, the idea of "multiple main entries" in the card catalog was nonsensical, in computerized records, we have the same problem from the opposite point of view: the idea of making the 1xx field non-repeatable is just as nonsensical.

Yet, if we just decide to make the 1xx field repeatable, we have problems in other fields that allow only single main entries, e.g. analytics and subjects. So for example, while you could easily do something like:
100 1_ |a Masters, William H.
100 1_ |a Johnson, Virginia E.
100 1_ |a Kolodny, Robert C. |d 1946- (I made up this date)
245 10 |a Masters and Johnson on sex and human loving / |c William H. Masters, Virginia E. Johnson, Robert C. Kolodny.

You run into trouble with another book about this book:
600 1_ |a Masters, William H. |t Masters and Johnson on sex and human loving.
(how do we add Johnson and Kolodny?)

So, therefore, if we would make 1xx repeatable, it means that we would have to make the related subfields in 6xx, 7xx and 8xx repeatable as well. This is complex and one reason why we remain stuck with figuring out a single main entry, all remnants of the bygone print world. While it may be possible to fix this with our current formats, this would become much more complex and there are better formats to work with.

None of this presents any difficulties in more robust formats like XML, which could easily have something like:
<author>Masters, William H.</author>
<author>Johnson, Virginia E.</author>
<author>Kolodny, Robert C.</author>
<title>Masters and Johnson on sex and human loving.</title>

Each field and subfield can easily be entered. When you begin to play with it, the sky is the limit and you can even create things that look totally weird to an experienced cataloger:
<author>Masters, William H.</author>
<author>Johnson, Virginia E.</author>
<author>Kolodny, Robert C.</author>
<title>Masters and Johnson on sex and human loving.</title>
<publisher>Little, Brown</publisher>
<paging>598 p.</paging>
<dimensions>25 cm.</dimensions>
<statement>Rev. ed. of:</statement>
<title>Human sexuality.</title>
<edition>2nd ed.</edition>
<bibliographyNote>Bibliography: p. [565]-580.</bibliographyNote>

I'm not saying we should do this, but we could. Instead of entering all of this information manually, it could be imported as well, which is at least some of the idea of FRBR/RDA. (I personally don't think the problems lie with the cataloging rules--RDA even retains single main entry--but with the formats, which are simply worn out and unused by everyone in the world except libraries)
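To make the repeatable-author point concrete, here is a tiny Python sketch that builds a record like the one above with the standard library's ElementTree. The element names (record, author, title) are my own invention for illustration, not from any real schema:

```python
import xml.etree.ElementTree as ET

# Build a record with as many <author> elements as there are authors --
# no "main entry" decision required. Element names are illustrative only.
record = ET.Element("record")
for name in ("Masters, William H.", "Johnson, Virginia E.", "Kolodny, Robert C."):
    ET.SubElement(record, "author").text = name
ET.SubElement(record, "title").text = "Masters and Johnson on sex and human loving."

xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

Nothing here forces one author to be "first" in any meaningful sense; that decision moves to the application, where it belongs.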

There are lots of other problems with MARC as well, including MARCXML--primarily because it must be "round-trippable" between XML and ISO2709 and as a result, the limitations of ISO2709 are transferred en masse into MARCXML.

So, the idea of reworking ISO2709 seems to me a bizarre, modern example of the old "Horsey Horseless" from 1899.

It's time to move on. And when we do, I am pretty sure that we and our users will find it all much more interesting than what we have today.

RE: Title browse within the new systems (was Are MARC subfields really needed?)

Posting to NGC4LIB

Michele Newberry wrote:
You might want to look at our Endeca-based library catalog to see an example of a title browse within an interface that normally doesn't support this type of browsing.
Click on the "Search begins with" radio button and after the screen refresh, type your title. You can also select Author and Series. Lack of the Subject option is an indicator that we just couldn't quite work out all the issues of those pre-coordinated index entries within this technology.

In this instance, I think it aids the user not to have the content from the subfield c in the display, so that subfield has some value to me. We find some value in the subfield b for relevance ranking purposes when we're trying to bring the most likely results to the fore. We call it the "on the road" test. This uses the words being searched as a percentage of the words in the title. Differentiating the subtitle is helpful here.
Thanks for sharing this. Certainly, it is a much better display, but if I search for War and Peace, I still find various titles proper filing together. Still, my experience with people is that they almost never know the exact title of an item they want. Citations are very often incorrect, and the need for browsing titles proper is far more important to librarians and catalogers than to the public. [As an historical aside, from my research into early catalogs, some *never* made an entry for title, sometimes not even for the Bible. If the cataloger could find no author to enter the record under, they would place these records under "Anonymous, Pseudonymous Works" or something similar.]

My suspicion is that in the public's mind, much more common is what used to be termed the "catchword title", e.g. they would think "Bury's Later Roman History", and not "History of the later Roman Empire" or "Professor Thompson's book on Alfred Hitchcock" instead of "The moment of Psycho : how Alfred Hitchcock taught America to love murder".

Just to make it clear, I am *not* saying we should stop coding the subtitle separately, primarily because it is codified in ISBD. But its utility does have to be reanalyzed seriously in our new environment, along with *every other part* we do. There are also consequences to consider: if we want to accept metadata from other providers that do not code the subtitles separately, do we continue to edit the subtitles locally? Is that a wise use of our resources?

Yet, if we just accept these other records without recoding, consistency falls apart and what does that mean for quality? If we do not consider the implications and consequences of all of this, then when higher authorities ask what someone has done in the last week, they certainly will wonder when they hear: "I've added 245$b to 400 records!" and when this higher authority asks why this is so important, we won't be able to point to any adverse consequences, so there will be no other answer than: "it's the correct way to do it."

Is this the best use of the staff?

Friday, June 4, 2010

RE: Are MARC subfields really useful ?

Posting to NGC4LIB

Dan Matei wrote:

The question in the subject is brutal, in order to attract attention. I know it is a blasphemy :-)

Of course the subfields are useful (as the TEI tags in scholarly texts)... But:

"Does its usefulness justify the effort?" That was my real question.

And, of course, we can refine it:

Are all subfields useful enough to justify the effort to delimit them?

I thought we were looking to reduce the cost of cataloguing. Or not?

Dan is asking exactly the right question, and one that has long needed answering. Although I wasn't working way back then, when the MARC format was created, they obviously could have done it much more simply, without all of the subfields. It seems logical to me that they didn't yet know whether these deeper layers of (what we now call) semantics were necessary, but they didn't want to take a chance, so they decided to code everything within an inch of its life.

I think it's time to reconsider the usefulness of a lot of it. If some of these fields haven't been used in almost 50 years now, I think we have enough research data to come to some decisions.

We have already seen some of the real bloopers thrown overboard: some of the tags, the old "main entry in the body of the entry" and others. But nobody has asked the bigger questions yet. For example, I have brought up a few times the need to code separately the 245 $a from $b. I understand it had a purpose before (as I wrote to Alex once), primarily to prevent different texts from inter-filing, e.g. my example "War and peace : the definitive edition" and "War and peace in the nuclear age". In the card catalog, it files correctly, but I haven't seen any OPAC do it "right" and it has always been interfiled and therefore, "wrong".
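For anyone who wants to see the filing point in action, here is a toy Python sketch with made-up records. The "filing key" is a crude stand-in for real card-filing rules (drop punctuation, collapse spaces), so treat it as an illustration only:

```python
import re

def filing_key(s):
    """Rough card-filing normalization: drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^\w ]", " ", s).lower().split())

# (title proper, subtitle) pairs, i.e. 245 $a and $b -- invented examples.
records = [
    ("War and peace", ""),                        # Tolstoy, no subtitle
    ("War and peace", "the definitive edition"),  # Tolstoy, with a $b
    ("War and peace in the nuclear age", ""),     # a different work entirely
]

# Typical OPAC behavior: sort on the whole title string, so the
# unrelated work interfiles between the two Tolstoy entries.
interfiled = sorted(records, key=lambda r: filing_key(r[0] + " " + r[1]))

# Card-catalog behavior: titles proper group together first.
grouped = sorted(records, key=lambda r: (filing_key(r[0]), filing_key(r[1])))

print([r[0] for r in interfiled])
print([r[0] for r in grouped])
```

With separate $a/$b coding, the two "War and peace" entries stay together; sorted on the whole string, "War and peace in the nuclear age" slips in between them.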

Still, almost nobody browses titles like this anymore, and I have personally never even heard of a complaint, except from a librarian (like me, who it drives absolutely crazy!). But I ask: if there are no complaints and nobody even notices, is it still "wrong" to interfile different titles proper? I think a case can be made that the 245$b is really outmoded.

If this is accepted, then it is open season on all of the other subfields and fixed fields. Do we really need a separately coded 100$b? or 100$q? Why? Just because it's considered "correct" is not a reason to continue it.

Re: If libraries had shareholders by Peter Brantley

Posting to Autocat

On Thu, 3 Jun 2010 17:36:26 -0600, john g marr wrote:

>On Thu, 3 Jun 2010, J. McRee Elrod wrote:
>>The US is not so much running out of money and using it unwisely, it seems to me ... I don't think cutting corners in libraries is the answer ... a new set of priorities is needed.
> Actually, far more than that is needed-- an entirely new basis for priorities. Rather than meeting society's needs, we have to assist in facilitating the changing of society's needs...
Unfortunately, I don't think that this is going to happen anytime soon, especially as we enter this ominous Darwinian period of budget cuts. Throughout my life I have noticed something that constantly happens when money is being allocated, be it in Washington, in organizations, or even in our private lives: what is really needed is an honest reconsideration of what is *truly important*, followed by a reallocation of money based on those considerations; in reality, money is allocated based on sheer power. In many people's private lives, they know they should spend more on healthy food but can't find the strength to resist the power of McDonald's or cigarettes. In political terms, this same phenomenon displays itself when the budget goes to those with the most power, no matter whether they are positive or negative influences in the scheme of things, while those with less power end up with whatever is left over.

How do libraries, and especially catalogers, fit into this scenario? Mark Twain said, "Naked people have little or no influence on society." I will paraphrase this as: "Catalogers have little or no influence on society." Let's face it: it's becoming a jungle, and catalogers are not equipped with sharp teeth and claws.

But just because we don't have much power in that sense doesn't mean that we can't survive and thrive; we must use our wits and find some allies. Simply repeating the mantra that the tools we make are important and useful will not work--those in power simply don't believe it--so we must demonstrate it very clearly and very obviously.

There are people out there, however, who have some appreciation of what we do, although they may not understand it very well. Here, for example, is a public lecture, "Anya Kamenetz: DIY U: The Coming Transformation of Higher Education". She is discussing open-access education (completely free), which would be almost completely virtual. While I grant she may be describing the reality of education in the future, she is too radical even for me(!). Yet she does express admiration for librarians!! This starts at 22:00 into the talk, where she discusses the Open University and says that the role of the librarian will be very interesting: "the person who knows where all the books are; they know how to access information."

She displays the popular idea of the librarian's work here. Of course, in reality we don't just "know". We need our tools to be able to find and access, and if you take our tools away, we are *almost* as helpless as anyone else. I emphasize "almost" here because in my experience, I can still find things that non-librarians cannot. I have actually been considering this for some time, and have come to some preliminary conclusions about why I succeed where others do not.
  1. I know that not everything is in Google;
  2. I know some specific databases on specific topics that are out there;
  3. I don't give up too quickly;
  4. and this last part is more of a suspicion, but a very strong one: because of my training, I think in a hierarchical arrangement of concepts. Instead of only thinking in terms of synonyms, as most people search Google-type databases, it is natural for me to think in broader terms when I begin to have problems. This is often when I really begin to find things. As a result, I suspect that the traditional hierarchical arrangements of subject headings and subject descriptors could prove to be very powerful even in full-text searching. Still, this suspicion would need some research.
In any case, there are opportunities but I fear we need to find them and make them for ourselves. That may be frightening, but it is also rather liberating when you realize that your future may be more or less in your own hands.

Thursday, June 3, 2010

RE: If libraries had shareholders by Peter Brantley

Posting to Autocat discussing "If libraries had shareholders" by Peter Brantley

In all of these discussions and statistics, apparently taken mainly from the ARL paper, the results that stand out to me are the huge increase in ILLs with a corresponding huge fall in reference questions.

To me, the rise in ILLs illustrates the growing awareness among the public of the materials in the whole of the information universe. It appears logical that as the web grew and more library catalogs could be searched, along with LibraryThing and other sites, people slowly realized what they had been missing earlier and wanted what the local collection did not have. So, one way of looking at it: the earlier, lower rates of ILLs were based on the patrons' focus on the materials that could be found through the local library catalog, and therefore they lacked an awareness of lots of other materials (or, to put it bluntly, the local collections in fact failed the public, but the public wasn't aware of it). If people had been more aware of the materials in other libraries earlier, e.g. in the 1950s and 1960s, ILL rates would probably have been just as high. Only with the greater awareness of materials provided through the web can we see more precisely how the local collections have not supplied what people wanted.

The fall in reference questions masks a somewhat opposite trend, I think. I don't believe that many information experts will maintain that it is easier to find *reliable information* today than it was before the web. The emphasis on "evaluation" in information literacy workshops demonstrates this. On the other hand, it is much easier today to "use the machine", i.e. you almost never get a zero result with a full-text search, and the results are almost always more than you can look at. The propaganda term "relevance order" is taken literally by many people, i.e. many genuinely believe that these are the items most relevant to what they want, although "relevance" when applied to a database search result means something quite different.

This is not what people experienced searching a library catalog, especially back in the 1980s, when searches often turned up nothing. So, I conclude that the reason people ask fewer reference questions is that they are happier with the results they find today, and there is a lot of evidence that people believe they are already very good or expert searchers. If you believe you are an expert, some librarian (who specializes in books) can't help you much anyway.

Of course, my own experience shows that people do not know how to search; they understand very little of what happens when they see a search box and type in some words: is the Boolean operator "and" or "or"? What information is being searched? What is the arrangement of results? Is there controlled vocabulary? And so on and so on. As a result, I conclude that the number of reference questions actually should be going up, since I personally find it more complex than ever to discover a good site or database, and then to make a good search.

We come to one of my genuine concerns with the Google Books-Publishers agreement when it (or something like it) is eventually approved: the one area that shows real demand in libraries is ILL, but when many of those materials are available at the click of a button, it seems reasonable to assume ILL will drop precipitously. It seems inevitable, and we should be planning for this eventuality. If we are not careful, almost everyone may start and end with Google Books and Google Scholar. Is RDA the solution? Or even part of it? I don't think so. Although the situation may seem almost hopeless in many ways, I still think there are major services libraries and librarians provide that are found nowhere else, and are necessary for a vibrant society. But it means some soul-searching and fundamental changes.

Wednesday, June 2, 2010

RE: AACR2 and RDA sample records from LC

Posting to Autocat

On Wed, 2 Jun 2010 05:37:29 -0400, Hal Cain wrote:

>Does anybody bother about ISO 2709? System vendors I'm aware of don't -- they simply address (more or less adequately) MARC 21. Wasn't ISO 2709 basically written up on the basis of MARC formats as they were at the time of composition? Details like how many indicator positions are used (some types of MARC used 4, I saw it but have forgotten where) have to be implemented in the system software, and differences like this have a lot to do with why MARC 21 has displaced most other types.

I guess I am not making myself clear. Since ISO2709 is the format we use every day to transfer records, we are stuck with its limitations--unless people are transferring records in some other format, such as MARCXML. But when I get records from another library, it is always via the Z39.50 protocol as ISO2709 records, which my catalog then reworks and places into a MySQL database, where the tables probably differ from those of other relational databases used in library catalogs. ISO2709 is the standard method of transfer.

But because of ISO2709, records are limited in many ways, e.g. to 99,999 characters (the first 5 positions of the leader), plus other problems related to the structure of the ISO2709 standard. Since one purpose of MARCXML is to be "round-trippable", we are still stuck with those same limitations. So for example, say that I wanted to take a record from another catalog that uses an API to dynamically include information from elsewhere, e.g. Delicious tags or reviews from another source. That would be easy to do with XML, but not when transferring via ISO2709. And of course, if I want to put a non-MARC ISO2709 record into my database, there will be a lot of work.
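For those who have never looked inside the leader, here is a small Python sketch of where that 99,999-character ceiling comes from. The leader below is padded out to its full 24 characters by me (an assumption, since such leaders are usually shown truncated):

```python
# The ISO2709 leader stores the record length and base address as
# fixed-width decimal character fields -- five digits each, hence the
# hard 99,999-character limit on a record.
leader = "01142cam a2200301 a 4500"  # a constructed 24-character example

record_length = int(leader[0:5])    # positions 0-4: length of the whole record
base_address = int(leader[12:17])   # positions 12-16: where variable fields start

print(record_length)  # five digits, so 99999 is the absolute maximum
print(base_address)
```

Everything about the record's physical layout hangs off these positional numbers, which is exactly what makes the format so brittle to change.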

>And this is in the end the real argument for change: not that change is better, but that others will pass us by and forget about the value we have to offer. And the people (our top managers and boards) who determine our destinies have resounding in their ears the noise of those whose advantage is served by changing in the direction of systems and data formats that can be exploited by mainstream computer tools. "Better" or "worse" has little, if anything, to do with it.

I maintain that it's both reasons. The new formats are more powerful in all sorts of ways and we need to use those powers. We shouldn't think that a format created in the 1960s is still the state of the art.

People are changing in all kinds of ways. Look at the article in the Chronicle, "The Humanities Go Google" (May 28, 2010). They admit that they will need to keep track of everything, and I think our metadata will become extremely important in that kind of research, but it simply cannot be done using what we use now.

RE: AACR2 and RDA sample records from LC

Posting to Autocat

Myers, John F. wrote

Isn't this a bit of a strawman? I cataloged for over a decade, and successfully so, before I finally saw MARC in its raw form as cited below. The only reason I came to learn that MARC was not the nicely formatted display in OCLC or my ILS was that a co-trainer trotted out this raw form to have fun with the students in our class. And yes, it is a little bit scarier and a lot less readable in raw form than a MARCXML rendering. But having recently compared records in the two formats, it is clear that neither is serviceable without an interface.
On top of that, MARCXML requires more space (on the order of 10 times more). After comparing the two formats, I am even more impressed by the genius of Avram's efforts. There are valid arguments for the obsolescence of MARC, but indecipherability in its raw form is not one of them.

I wish you were right, but the fact is, MARC in its ISO2709 format is used for *record transfer* (i.e. a communications format) and this has lots of consequences. It is the format that others get when we share our records, whether we or they like it or not. Certainly they can go through the hassle of parsing, etc. but they ask: why should they have to jump over hurdles? And they won't do it. Yes, if you have the correct software, such as in library catalogs, everything gets parsed out, but not everybody has, or wants to have this parsing software. That's one reason why it's all obsolete. From that interesting article mentioned by Allen "If Libraries had shareholders", it's clear that if we want to join the larger world of information, we will be the ones who will have to change, not everybody else. Others are not standing around helplessly waiting for us. If we don't change, then our materials--and consequently our libraries and librarians themselves--will increasingly become a backwater.

But as I showed, even if somebody has the option, understands it, and chooses the MARCXML format, the fixed field information still needs to be parsed, and still nobody is going to do that.

As I wrote in my original message, there are a hundred limitations held in that horrifying list of numbers:
01142cam 2200301 a

These incomprehensible numbers define the record--not only how long the record as a whole is, but they define the length of each field, and where each starts and stops. If you change a field, e.g. to add "written by" in a 245$c, almost all of these numbers have to change as well.

This is why doing global changes was so terrifying: so many things needed to change. For example, changing "Russian S.F.S.R." to "Russia (Federation)" (one I remember well) meant that the field definitions in these numbers all had to change as well, since the former takes 16 positions and the latter 19. As a result, changing a MARC ISO2709 record is very complex, and global changes demanded serious computer resources (especially in the old days), with a real possibility of a crash in the middle--something that made you shudder even to think about. Therefore, you would just do it one by one, as we did.
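A small Python sketch (not a full ISO2709 writer) shows why one edit ripples through the whole record. Each directory entry packs a field's length and offset into fixed-width digits; the tags and data below are simplified for illustration:

```python
FT = "\x1e"  # the ISO2709 field terminator byte

def build_directory(fields):
    """fields: list of (tag, data) -> concatenated 12-char directory entries.
    Each entry is tag (3) + field length (4 digits) + start offset (5 digits)."""
    directory, offset = "", 0
    for tag, data in fields:
        length = len(data) + len(FT)
        directory += f"{tag}{length:04d}{offset:05d}"
        offset += length
    return directory

old = build_directory([("651", "Russian S.F.S.R."), ("650", "Arithmetic")])
new = build_directory([("651", "Russia (Federation)"), ("650", "Arithmetic")])
print(old)
print(new)  # the 650 entry's offset moves even though 650 itself never changed
```

"Russian S.F.S.R." occupies 16 positions and "Russia (Federation)" 19, so every directory entry after the changed field has to be recomputed, along with the record length in the leader.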

The new formats get around all of this and add a wealth of possibilities we haven't had before. My argument is that although almost no library catalog stores these records in this format, we still *transfer* them in this format, and as a result, we are all still stuck with those same ISO2709 limitations, while everyone still needs special software to do anything with them. This automatically cuts us off in many ways from the modern information community, who may want to work with our records but obviously refuse to use them in this format. And you can't convince them that MARCXML is much of an improvement.

One huge advantage of XML over ISO2709 is that bits and pieces can be taken out and not the entire record, so someone could get a list of just titles for a mashup without having to download entire records, parse it all, extract the titles and display the results, or you can easily get basic info for an RSS feed, along with a link to the record if people want to see more. In other words, with other formats, you can work with information live, and browsers are even designed to work with many of these formats independently, but with ISO2709, there is a lot of work to do first before you can use it in any way at all.
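Here is a rough Python sketch of that "bits and pieces" advantage: pulling just the 245 $a out of a MARCXML fragment with the standard library, without touching the rest of the record. The sample record is constructed by me for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal MARCXML fragment (invented for this example).
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Arithmetic /</subfield>
    <subfield code="c">Carl Sandburg.</subfield>
  </datafield>
</record>"""

NS = {"m": "http://www.loc.gov/MARC21/slim"}
root = ET.fromstring(MARCXML)
# Grab only the 245 $a values -- no directory arithmetic, no parsing
# of the whole record, just an XPath-style query.
titles = [sf.text for sf in
          root.findall('.//m:datafield[@tag="245"]/m:subfield[@code="a"]', NS)]
print(titles)
```

With ISO2709 you would have to read the leader, walk the directory, and slice the data area before you could see a single title; here one query does it, which is exactly what a mashup or RSS feed needs.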

I personally thought that OAI-PMH was to be the solution, but it was nixed by the new information behemoth, Google, and so we have to find another solution.

And concerns over the length of the record are unimportant today, especially for a communications format. The storage format will always be different, and may even vary from database to database.

To me, it's obvious that changing the rules to RDA will change absolutely nothing in this scenario. Abbreviations? Capitalization? Changing to more modern formats however, would be one important step toward bringing us into the modern world of information exchange. But it is only one step. There will be many more if we want to change those frightening trends we see in library usage shown in "If Libraries had shareholders" and other places, too.

Tuesday, June 1, 2010

RE: AACR2 and RDA sample records from LC

Posting to Autocat

On Tue, 1 Jun 2010 08:38:55 +1000, Tessa Piagno wrote:

I agree with everyone in this thread.

>Apart from the capitalization issue, do we really have to have these new fields 336, 337, 338?

I would like to ask the same question in a slightly different form: while there are some definite advantages to these fields, are they worth the effort? But before we can decide whether they are worth it, we must figure out how difficult it will be to repurpose our existing records to be useful with any new records we create. Therefore, in concrete terms, can the 336 be generated from Leader/06, the 337 from 007/00 and 338 from 007/01?

Has there been any research, or have there been any practical attempts, in this direction? We should be able to do it, but if we cannot generate these fields automatically, I would say it would not be worth the effort, since we would be making the millions of current records obsolete in this regard, and the advantages of the new fields are a rather paltry return.

But if they can be generated automatically, it would be another matter. There is a lot of useful information in our fixed fields, such as this, but if it is our goal to make our records more useful in the new information universe, we must confess that information buried in the fixed fields based on position(!) is just too antiquated to work with. The new formats work on different principles. If this information can be extracted into the new fields, such as here into the 336 etc. fields, it could be very good.
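To show the sort of automatic generation I mean, here is a hedged Python sketch. The mapping tables below cover only a couple of codes; a real implementation would need the full Leader/06 and 007 code lists, and the RDA terms shown ("performed music", "audio disc", etc.) are the usual ones but should be checked against the official lists:

```python
# Tiny, deliberately incomplete lookup tables (an illustration, not a
# complete mapping of the MARC coded values to RDA terms).
CONTENT_336 = {"a": "text", "j": "performed music"}   # from Leader/06
MEDIA_337 = {"s": "audio", "t": "unmediated"}         # from 007/00
CARRIER_338 = {"sd": "audio disc", "ta": "volume"}    # from 007/00-01

def rda_fields(leader, f007):
    """Derive candidate 336/337/338 terms from the coded positions."""
    return (CONTENT_336.get(leader[6]),
            MEDIA_337.get(f007[0:1]),
            CARRIER_338.get(f007[0:2]))

# The leader and 007 values from the sound-recording record shown below:
print(rda_fields("00925njm 22002777a 4500", "sdubumennmplu"))
```

If something this mechanical can populate the new fields from the millions of existing records, catalogers could go on working exactly as before, and the 336/337/338 would appear with no one the wiser.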

Here is a specific example of what I mean:
This is one of the ways in which I maintain that the MARC format in its ISO2709 form is obsolete. Digging the fixed field information out of this record is not easy. In this example, Leader/06 is the first "a". Determining where the 007 information is can be even more complicated, would take more time for me to figure out, and is far too much for a non-librarian type. (Remember, there is a position 0):
01142cam 2200301 a
92005291 DLC 19930521155141.9 920219s1993 caua j 000 0 eng a 92005291 a0152038655 : c$15.95 aDLC cDLC dDLC alcac 00 aPS3537.A618 bA88 1993 00 a811/.52 220 1 aSandburg, Carl, d1878-1967. 10 aArithmetic / cCarl Sandburg ; illustrated as an anamorphic adventure by Ted Rand. a1st ed. aSan Diego : bHarcourt Brace Jovanovich, cc1993. a1 v. (unpaged) : bill. (some col.) ; c26 cm. aOne Mylar sheet included in pocket. aA poem about numbers and their characteristics. Features anamorphic, or distorted, drawings which can be restored to normal by viewing from a particular angle or by viewing the image's reflection in the provided Mylar cone. 0 aArithmetic xJuvenile poetry. 0 aChildren's poetry, American. 1 aArithmetic xPoetry. 1 aAmerican poetry. 1 aVisual perception. 1 aRand, Ted, eill.

ISO2709 is actually even more confining than this: there are a hundred little places where you cannot expand. XML provides a much better format.

Still, digging this same information out of a MARCXML record is only slightly easier, because you *still* have to dig it all out based on position. E.g., in this example MARCXML record, we see:

<marc:leader>00925njm 22002777a 4500</marc:leader>
<marc:controlfield tag="001">5637241</marc:controlfield>
<marc:controlfield tag="003">DLC</marc:controlfield>
<marc:controlfield tag="005">19920826084036.0</marc:controlfield>
<marc:controlfield tag="007">sdubumennmplu</marc:controlfield>
<marc:controlfield tag="008">910926s1957 nyuuun eng

and in the leader and field 007, there are still the separate values in Leader/06 and 007 positions 0 and 1 to dig out. This is totally obsolete. There is no reason to do this anymore, and it is unrealistic to expect normal webmasters, such as those at Google, even to begin to understand or implement it. Therefore, opting for separate 336 etc. fields would be a step in the right direction, so long as the *rules for input* do not change, or at least not much. The more the rules for input change, the more obsolete our earlier records become. This must be considered with great care, especially in these times when there will probably not be many new catalogers hired.

The natural question, as you point out, is: should this demand more work from the cataloger? The answer is: of course not. And a related question: to create these 336 etc. fields, do we even need any retraining? Or could catalogers simply continue creating records as they always have, and these new fields would be made automatically with no one the wiser? Each catalog--and today, this should not be limited to library catalogs--could then take this information and display/search it however they want.

While the future for "catalogers" may be rather uncertain (I don't know), the situation may be brighter for "metadata creators" i.e. people who can work easily in a more universal bibliographic environment including that outside of traditional libraries.