Friday, January 28, 2011

RE: Initial articles: Do they matter anymore?

Posting to Autocat
On Fri, 28 Jan 2011 13:21:44 +1100, Bill Constantine wrote:
<snip>
I am trying to reconcile data from different data sources to search and display and sort consistently.

One problem I am having is that some data sources can handle initial articles nicely eg MARC, and other data sources I have been told should be able to do it easily but tying down a programmer to go back into what they thought was a perfectly good program is difficult, and yet other data sources don't seem to have the capability to do anything in regards to initial articles.
...
Should I persist with wrestling with initial articles or are these an artefact of the old days?
</snip>

My own opinion is that *if* we want to cooperate with other bibliographic/metadata-type organizations, it will mean change. When people search the "new and cool" information tools, e.g. Amazon, Google, the Internet Archive, they find many sort options--"relevance," date, ratings, etc.--but I cannot remember (or quickly find) a "sort by title" in any of these databases.

Programming it to work automatically is one of those tasks that seems simple enough to begin with, but can make your heart sink when you start to get into it. *If* all you have are, e.g., English-language records in your catalog, then it may be pretty simple, but if you have a more expansive catalog, matters get much more complex. For example, here is Princeton's list of initial articles: http://library.princeton.edu/departments/tsd/katmandu/catcopy/article.html. I don't think such a task can be automated.
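To make the difficulty concrete, here is a minimal sketch (Python, purely illustrative) of the naive approach; the language codes and article lists are my own toy examples, nothing like the real multilingual lists catalogers consult:

```python
# Naive initial-article stripping for sorting: a toy table only.
ARTICLES = {
    'eng': ('the', 'a', 'an'),
    'spa': ('el', 'la', 'los', 'las', 'un', 'una'),
    'ger': ('der', 'die', 'das', 'ein', 'eine'),
}

def sort_title(title: str, lang: str) -> str:
    first, _, rest = title.partition(' ')
    if first.lower() in ARTICLES.get(lang, ()) and rest:
        return rest
    return title

sort_title('The hobbit', 'eng')          # 'hobbit' -- fine
sort_title('A is for alibi', 'eng')      # 'is for alibi' -- wrong: "A" is not an article here
sort_title('Los Alamos science', 'spa')  # 'Alamos science' -- wrong: "Los" is part of a place name
```

The machine has no way of knowing when a first word only *looks* like an article, which is exactly why a human being has had to do the coding.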

So, to maintain consistency, everything would have to be coded manually, which seriously limits possibilities of cooperation. By this I mean that we could not accept a record from another database that ignored initial articles, even though it may conform to some kind of future standards for linked data, because somebody would have to code the initial articles.

I think this is one of those decisions that demands research and analysis. The practice may be an anachronism, a holdover from printed and card catalogs, or perhaps not. Still, it is my opinion that something like this should not get in the way of efficient cooperation with other metadata communities. If an automated solution could be found, OK, but cooperation and efficiency are far more important. (By the way, the practice of a title-*added* entry was rarely done in the old, old days. When the title was the main entry, it was normally traced, but sometimes not even then)

RE: Inspiring speech on libraries

Posting to Autocat
Ted P Gemberling wrote:
<snip>
There's a lot in it that people might disagree with, but this address by Philip Pullman is the most passionate defense of libraries I've seen. It deserves to be read widely. My apologies if you've already seen it.
http://falseeconomy.org.uk/blog/save-oxfordshire-libraries-speech-philip-pullman#
</snip>

I absolutely love libraries and have chosen to devote my career to them. I have said before that I am convinced that it was precisely in libraries where I was finally able to learn to think rationally and where I became a human being. But the situation I have loved is changing, and changing before my eyes. While I agree with every single point the author makes in this article, the more of these sorts of writings I read, the more they strike me as eulogies.

The fact is that libraries are in a state of change, and I think it is also clear that libraries are evolving into something different. They are undergoing this evolution not from some type of internal struggle, but as a response to changes in the "environment". Mr. Pullman mentions something in his article that is described in greater depth by the great editor Jason Epstein in his book "Book Business: Publishing Past, Present, and Future" (a fabulous book!) and in his recent article "Books: Onward to the Digital Revolution" http://www.nybooks.com/articles/archives/2011/feb/10/books-onward-digital-revolution/ (unfortunately, behind the paywall): the book publishing industry has been in serious trouble for a long time, and for lots of different reasons. In my own opinion, it is similar to the beginning of printing, when so many topsy-turvy changes took place. It turned out that many of the problems those people faced had been simmering away for a long time too; in their case, those problems were inherent to the medieval system, and printing was simply the catalyst that forced people to deal with the problems they had been able to ignore for so long.

With the introduction of the World Wide Web, I think matters are very similar: the problems inherent to our system have been around for quite some time, decades perhaps if not more, but it has been a relatively simple matter for people to ignore those problems to focus on other, "more pressing" matters. The web shoves a lot of this in our faces so that we can't just look away any longer.

Naturally, I want libraries to survive and thrive, but to do so, they must evolve somehow. What does a library *really* provide a community?

Of course, no one can know what final form libraries will take and that form will probably arise through some kind of trial and error--or in other words, similar to survival of the fittest.

Perhaps this is somewhat off-topic for this list, and perhaps not. The catalog has been, and I think will continue to be, the heart of the library. What will it become? Does the library catalog of today serve the community? If not, how does it need to change? I feel it is vital to answer these kinds of questions, since it seems to me that as the catalog survives or is discarded, so will libraries.

Thursday, January 27, 2011

RE: Subject analysis

Posting to Autocat

On Wed, 26 Jan 2011 12:13:06 -0500, Jerri Swinehart wrote:
<snip>
Thank you in advance for your responses!

I'm interested in learning what resources are available for learning how to assign subject headings to original records (any format). It's not that I haven't done this, but my experience is limited to DVDs and a few pieces of music. I know about the Subject Cataloging Manual, AACR (soon to be RDA), Bibliographic Formats and Standards. However, what else is out there?
</snip>
I helped create an online learning resource for the FAO IMARK program. The module "Digital Libraries, Repositories and Documents" http://www.imarkgroup.org/moduledescription_en.asp?id=111 has a unit, "Unit 6 - Metadata and Subject Indexing", where I discuss subject analysis (and other things). I do not focus on LCSH, but on the FAO/AGRIS standards of AGROVOC and AGRIS. Still, the basic principles of assigning subject terms are universal, and I try to be as expansive as possible, including of course AACR2 and LCSH, among a few others. I believe registration is free.

It may interest other catalogers to know that when I started at FAO, I began cataloging and indexing (the tasks were separate at FAO at the time) documents and individual articles using rules I didn't know, on a topic (agriculture) where I had only a limited understanding, and using a thesaurus (AGROVOC) I had never used.

The results turned out to be very interesting. From the very first record, I could create very acceptable records. This created quite a stir and surprised even me! In the revision I went through, the descriptive part was no problem at all; the problems I had were with subject analysis--not with the actual analysis itself, but, as I was to determine later, with finding the correct level of exhaustivity. As I eventually figured out, even though they did not word it this way, the level of exhaustivity there was about 10%, whereas we have 20%.

So, from the first I could do it, although I was slow. With my "success," I realized that something very important was happening in my brain and I wanted to write it down. Probably every experienced cataloger realizes that when confronted with something completely new, they go into a different "mode of cataloging" than they normally do. This is what I did, automatically. So, I analyzed what I did very minutely, and wrote down each step of the process. I think it was about 15 steps (I'd have to find it) that I went through almost subconsciously to determine the correct subject descriptors. And it worked!

This is another one of those articles I always meant to write, but never did!

RE: Reclassification of Materials per Faculty Request

Posting to Autocat

Back when I indexed articles, the article would often have an abstract, along with subject keywords (sometimes taken from a thesaurus), assigned by the author. It was quite amazing that the keywords, and often even the abstract, did not have much to do with the article itself. Writing an abstract is not something you can do without training, and doing subject analysis, even on someone's own article, is not such a simple thing either. Strangely enough, the problems that we saw were that the keywords assigned were normally far too general, and perhaps a term that had been mentioned once or twice in the article would be thrown in. The keywords assigned by the authors seemed random to me. I decided that doing subject analysis, even for your own article, demands at least some level of training and experience.

In this regard, I remember when I read the article "Author-generated Dublin Core Metadata for Web Resources: A Baseline Study in an Organization" http://journals.tdl.org/jodi/article/view/42/45, which was a study of authors assigning metadata for their own articles, I was absolutely astounded at one discovery, which went unremarked: they analyzed a total of 11 articles, and while everybody got the language correct (Thank God for that! It would be really difficult if they messed up the language they wrote in!), only 9 records of the 11 had the correct titles!

I remember being amazed and thinking that this deserves an article--if not an entire dissertation--all its own. Why and how could you possibly mess up the title of your own article? I still do not understand!

Friday, January 21, 2011

RE: Linked data

Posting to RDA-L

Jonathan Rochkind wrote:
<snip>
Okay, again, give me the algorithm for software to use to figure out what title(s) to use to display to the user (assuming we don't just want to put out the whole 245 with ISBD punctuation and all).
How does software know when to get exactly what from a 240 or 740 , and when to use it as a title label? (A 240 is often useless as a title label, eg "Selections", that is NOT in fact the title of the work it's attached to, to any user at all; A work cited in a 740 may or may not actually be the work in the record at hand, it could also be a 'related' work in some way).
</snip>
I am completely confused here. My original contention in this thread was that MARCXML is *not* a perfect solution, but still represents one small, but important, step in the direction libraries need to go, and can be implemented relatively easily. There are still problems with MARC in an XML format but this should not stop us from switching to MARCXML so that we can take that first step to placing our records into the wider information universe. Then it turned into parsing other titles from the 245, but I showed that this is unnecessary since catalogers do this manually at the point of cataloging, so long as the cataloger is following the standards. Any catalogers who do not do this are, automatically, producing sub-standard records.

Now that the problem of parsing has been dealt with, the "problem" shifts to how to *label* the different titles. I want to point out that this is not at all the same issue as the one discussed either by myself or in the article. But OK, let's discuss it.

All of this has been defined for a long time. The 245 is, by definition, the title appearing on the chief source of information of the item the cataloger is describing, transcribed as exactly as possible. 95% of the time, this is a no-brainer; probably 3% of the time it takes some thought but is still easy; 2% of the time, it may be tough and the cataloger must make a decision that others may disagree with. (In my own experience, these kinds of problems mostly stem not from which title to use, but deciding upon one chief source of information from several choices) In any case, the real purpose of the 245 field is *identification* and is *not* for access purposes, although lots of times in the past, a title added entry card was made, while 245s are invariably indexed today. Additional access is made for titles that are buried within the title statement.

All of the other title fields are there for additional access for various reasons, e.g. a uniform title, which is used to bring different versions with variant titles together, such as language versions, variant editions and so on.

Another type of title belongs to something larger or smaller in different ways: a series title, host item entry and constituent entry all describe greater "entities" or lesser ones that the item described relates to. These can be dealt with as separate records or within a single record. This is shown in the different treatments of series/serial records, host/constituent or analytics.

Another type of title is different ways that the title in the 245 may be entered. Today, these are in the 246 field, e.g. corrected spelling of the title, variant titles from the chief source of information, etc. 740 is kind of a grab bag for titles that can't be placed anywhere else. Maybe the 740 is muddled a bit, especially because the 246 field for books was instituted only in the 1990s, but nothing is perfect. (It would be interesting for OCLC to run a test and find out how many 740s there are now and how many of those are not from the 245)

This is about as accurate as we can get in practice. If we want to devote the resources for greater specificity, there should be some kind of evidence that the users of the system need it. I think that would be very difficult to prove. Certainly librarians don't need any greater specificity.

Cataloging theory decided long ago which of these titles to display as "THE" title: what is found on the chief source of the item, transcribed as exactly as possible, and this is placed in the 245 field. There is no problem here. A 240 "Selections" is not, by definition, the title of any item, but it is also not useless. It is useless *only* if the remainder of the record is ignored. 240 "Selections" implies, again by definition, a 1xx field. You cannot have a 240 without a 1xx, otherwise you would have a 130. These treatments are followed into the 7xx and 8xx fields and even the 6xx fields when required. (Why there is 1xx/240 as opposed to 1xx$a$t I have never understood, but that is a matter for historians to answer)
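For the programmers in the discussion, here is a rough sketch in Python of the logic just described, working against MARCXML. It is not an official algorithm, only my reading of the practice, and the subfield choices are simplified:

```python
import xml.etree.ElementTree as ET

NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

def first_field(rec, tag, codes):
    """Join the chosen subfields of the first occurrence of a field, or return None."""
    for df in rec.findall(f"marc:datafield[@tag='{tag}']", NS):
        text = ' '.join(sf.text for sf in df.findall('marc:subfield', NS)
                        if sf.get('code') in codes and sf.text)
        return text or None
    return None

def titles(rec):
    # The 245 is always the title of the item in hand: identification.
    item_title = first_field(rec, '245', 'abnp')
    # A uniform title only means something in context: a 130 stands alone,
    # while a 240 exists only together with its 1xx.
    uniform = first_field(rec, '130', 'adklmnoprs')
    if uniform:
        return item_title, uniform
    author = (first_field(rec, '100', 'abcdq') or first_field(rec, '110', 'abcdn')
              or first_field(rec, '111', 'acdnq'))
    uniform = first_field(rec, '240', 'adklmnoprs')
    work_label = f'{author}. {uniform}' if (author and uniform) else None
    return item_title, work_label
```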

I personally see no problems with any of this, and I think it is pretty well done. There are in-depth rules for all of it, and catalogers have gone out of their way for quite some time to input this information accurately. If there actually is a problem of understanding the practice, it seems to me this may be because everything catalogers do is based on methods designed for card catalogs, where everybody could see much more clearly how they worked than they can in an OPAC--but these methods nevertheless work. And they certainly work far better than *any kind* of metadata I have seen from publishers. So, if our records are poor (which I don't believe is correct), ONIX data cannot be seen as one bit better.

Besides, going back to my original point, none of this justifies the continuation of ISO2709. Certainly MARCXML format as it now stands is not perfect, but we need to take matters one step at a time! Otherwise, we will never start.

This does not end the argument by any means, primarily because we are in the Internet, and not every bibliographic agency follows AACR2 or ISBD, and I am positive many of these other agencies will not follow RDA. Ultimately, we must work with these other agencies. Perhaps sharing our records with the general public could help us solve problems of this sort that we encounter.

Wednesday, January 19, 2011

RE: Linked data

Posting to RDA-L

Jonathan Rochkind wrote:
Concerning: "One example of this can be found reported in this article: http://journal.code4lib.org/articles/3832"
<snip>
Okay, what would someone who "knows library metadata" do to get a  displayable title out of records in an arbitrary corpus of MARC data?
There's an easy answer that only those who know library metadata (apparently unlike people like Thomale or me who have been working with it for years) can provide? I have my doubts.
</snip>
I agree that this is an excellent article that everyone should read, but I wrote a comment myself there (no. 7) discussing how this article illustrates how important it is to know cataloging rules and/or to work closely with experienced catalogers when building something like this. It also shows how many programmers concentrate on certain parts of a record and tend to ignore the overall view, while catalogers concentrate on whole records.

In this case, the parsing is *always* done manually by the cataloger, who is directed to make title added entries, along with uniform titles, including the authors--that is, so long as the cataloger is competent and following the rules. So, it is always a mistake to concentrate only on a single field, since a record must be considered in its entirety. It would be unrealistic for systems people to know these intricacies, but it just shows how important it is that they work closely with catalogers.

Therefore, it's not *necessarily* arbitrary. Many of these issues have been known since the very beginnings and have been dealt with in various ways.

RE: Linked data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
This tells us something about LC, but about MARC?
LC might, in fact should and certainly could, add MARCXML to the options instead of providing merely ISO there. They might add EndNote and BibTeX as well, and more.
</snip>
I hate to keep repeating myself, but I feel it is important that I make sure my point is clear, whether or not others believe I am correct.

When I mention MARC, I am not talking about the codings and subfields, but about the ISO2709 instantiation of it, which is useful *only* to librarians. This is because, for all practical purposes, only librarians have the tools to deal with ISO2709 records. Excellent as it is (I use it every single day!), nobody but a librarian will use MarcEdit, and we shouldn't expect them to. This is why I say that so long as we rely on the ISO2709 protocol for the transfer of library records, we remain marooned on "Library Island," because nobody else can use them.

By making our records available in MARCXML, we make library records available to everyone in the world, in a format that allows people to do with them as they wish. Making BibTeX and EndNote available is fine, but that is only partial information. If we make the entire MARCXML record available, people could create their own style sheets for MARCXML and generate their own EndNote, BibTeX, or any other format(s) they want. Or, they could do much more.
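As a small illustration of what "create their own format" means (a sketch only: the field choices are simplified, and I am assuming MARCXML in the LC "slim" schema), someone outside libraries could turn a MARCXML record into a rough BibTeX entry with a few lines of code:

```python
import xml.etree.ElementTree as ET

NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

def sub(rec, tag, code):
    """Return the first matching subfield, stripped of ISBD punctuation."""
    for df in rec.findall(f"marc:datafield[@tag='{tag}']", NS):
        for sf in df.findall('marc:subfield', NS):
            if sf.get('code') == code and sf.text:
                return sf.text.strip(' /:,.')
    return ''

def to_bibtex(marcxml: str, key: str = 'rec1') -> str:
    rec = ET.fromstring(marcxml)
    return (f'@book{{{key},\n'
            f'  author    = {{{sub(rec, "100", "a")}}},\n'
            f'  title     = {{{sub(rec, "245", "a")} {sub(rec, "245", "b")}}},\n'
            f'  publisher = {{{sub(rec, "260", "b")}}},\n'
            f'  address   = {{{sub(rec, "260", "a")}}},\n'
            f'  year      = {{{sub(rec, "260", "c")}}}\n'
            f'}}')
```

Librarians would not have to write these transformations themselves; the point is simply that, once the full record is out there in XML, anyone can.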

The downside is: to work with MARCXML, developers need to know the MARC codings and subcodings. While this is quite a bit, I think that if people want it badly enough, they can deal with it. These developers are pretty clever folks and libraries should give them a chance, plus a bit of help, since they would be helping us too.

As we see how this works out, we can begin to think how to change MARC in the best ways for the public and for ourselves.

Tuesday, January 18, 2011

RE: [RDA-L] Linked data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
Where and how do you receive
ISO records from LC, as a non-librarian?
...
Jim, this gets us nowhere, your preoccupation with ISO! Rest assured,
it is a non-issue for as much as our dealings with the populace are
concerned. Where it still exists, it can be nicely circumvented.
</snip>
Obviously, I am not making myself clear somehow. The reason I am "preoccupied" is that I have succeeded with this myself, so I know how it works. First, though, how do you get ISO records from LC? From inside the database: http://tinyurl.com/4p2vgxz At the bottom, you'll see "Select Download Format", where you can save as text or as an ISO2709 record. Lots of catalogs allow that, and in fact, if they didn't, non-OCLC libraries would have trouble getting records. But remember: this only transfers a single record from one library catalog into another library catalog, using supplementary tools such as MarcEdit. Libraries can do this now, but they must do much more, and this is why XML is so important.

Let me show you how the XML can work. OCLC has created a web service for citation formats. See the sample in XML at: http://oclc.org/developer/documentation/worldcat-basic-api/rss-xml-sample. I have used this web service by having my catalog search that RSS feed when someone clicks a specific record in my catalog. So when they look at a record such as http://www.galileo.aur.it/cgi-bin/koha/opac-detail.pl?bib=26319 and click "Get a citation" in the right-hand column, the web service has *already searched* Worldcat and returned the citation; my machine has reformatted it on the fly and displayed it in this way. I could display this directly without a click, but have chosen to do it this way to keep from cluttering up the display. This cannot be done with ISO2709 records.
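For those curious about the mechanics, the pattern is roughly this (a sketch only: the feed URL below is a placeholder, and I am assuming the citation text arrives in the usual RSS 2.0 item/description element, as in the sample linked above):

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = 'http://example.org/citations?rec=26319'   # placeholder for the real web-service URL

def get_citations(url: str) -> list:
    # The catalog fetches the feed when the patron clicks "Get a citation"...
    with urllib.request.urlopen(url) as resp:
        feed = ET.parse(resp)
    # ...and simply lifts the formatted citation out of each <item><description>,
    # then drops it into the page display.
    return [item.findtext('description', '') for item in feed.findall('.//item')]
```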

This is not all that special, and anybody in the world who knows how can do the same thing right now. This is how mashups are created: by automatically searching different sites in the background and bringing information into a page that is reformatted in various ways. The tool to do this is XML. Amazon does exactly the same thing with its reviews and ratings, Librarything does the same thing, and Blackwell does something similar.

Mashups are one of the main ways people are personalizing the web. To do it you must use XML. It is my opinion that libraries and their records must enter that kind of mashup world, and the sooner the better.

So, what I am saying is that *if* the web service from OCLC returned not just citations in XML, but the entire MARC record in XML, then right now--today--I myself could create something with it, using the power of the headings (if not more), that I cannot make now. I have no doubt that I could create something of great use to my patrons. If queries could return multiple records somehow--while I can't imagine anything precisely at this point--I am sure something great could be built.

Even more important: *anyone else in the world* could do the same thing, just as all kinds of developers are doing now with the citations, or with data from Amazon, from Librarything, or from most open archives, which allow this sort of power. This is what it would mean to enter that universe.

I have worked with XML rather extensively, so I have seen what it can do. Browsers work with XML, not with ISO2709. If browsers could work with ISO2709, then that format would probably be fine. But I cannot believe anybody would make an ISO2709 parser as part of the browsers because that format is obsolete. This is why switching from ISO2709 is so important: it will be the first step into the greater world where *others* can begin to work with our data.

Librarians can work with our data now, and if that were all that mattered, there would be little reason to change much at all. We need to stop thinking in terms of: here is a record I want, how do I get that record into my catalog? There are well-established methods of doing that, and we must deal with new demands.

Is this really so difficult? I must not be explaining it right. It is crystal-clear to me.

RE: [RDA-L] Linked data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
So, please forget about ISO2709. For all the flaws that MARC is ridden with, and I can give you a long list, this is not among them. It has nothing to do with the *content* of MARC records, and about nothing else do we need to worry, and we can easily give that content to anyone without any trace of ISO.
</snip>
I wish I could forget it, but it's in our faces and we have to deal with it every single day for every single record. This is my entire point. Today, right now, if *anybody* wants to work with library records from any library catalog, e.g. LC's catalog, their *only choice* is ISO2709, except for the non-delimited formats of "Text-Brief" and "Text-Full". OCLC allows citation-type exports, e.g. see "Cite/Export" on any record in Worldcat http://www.worldcat.org/title/metadata/oclc/225088362&referer=brief_results

At least through the Worldcat API, OCLC supplies citations as partial XML (not MARCXML) over the RSS protocol. Using that XML, I am now able to take that information, reformat it *on the fly*, and automatically display it as I want to in my catalog. My patrons really appreciate that. If my only option were to get the ISO2709 record, I would have to devise some system that would download it, parse it, then create the XML before I could begin to do anything at all.

If I received an entire MARCXML record instead of the abbreviated RSS one for citations, I could do even more than I do now using the tools that can work with XML on the fly. I could apply my own style sheet to display what I want how I want, and more importantly, operate as I want, once again, *on the fly* with the browser doing all the work, just as it does now with the citations from Worldcat. If I could get groups of records, well... now that would be interesting.

Since this cannot happen with an ISO2709 record, the result is that the only people in the entire world who can work with library records are other *librarians* because they are the only ones with the special tools such as MarcEdit. As a result, we cannot share our records with *anybody* except other libraries today.

If all we care about is sharing with other libraries for placing records in their catalogs, I agree that there is no problem, since that has worked for a long time and everybody has the special tools, but our "world views" absolutely must change from this. Since web browsers can work with XML on the fly, we must use those tools. This means switching to XML. Easiest and quickest is MARCXML.

Today, so long as we stay with ISO2709 for record transfer, it leaves us marooned on "Library Island", completely separated from the rest of the information world. We must share outside our traditional boundaries, and it is 100% impossible to do that today.

Switching to MARCXML would make all of this 100% *possible* from the *technical standpoint*, but I admit, still 98% *impossible* because the general populace does not know what 300$b means.

Still, that is only 98% impossible instead of the 100% impossible we have today. That must be seen as some kind of advance in comparison with today. I agree: forget the ISO2709 format. Then let's get rid of it by no longer using it for record transfer. And good riddance.

RE: Linked data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
They could use the MARCXML records right away? You're sure about that? Has this assumption been tested with users who know nothing about MARC?
Of course, they cannot use ISO-wrapped records. But even to use MARCXML records, you still have to have quite a lot of MARC insight the records do not carry with them.
</snip>
You already supplied the same answer I would have: with MARCXML people *can*, i.e. it is possible, while with ISO2709, they *cannot*, i.e. it is impossible (practically).

But yes, you have to know a lot about MARC. Still, I don't see why the general populace would want an entire record; they could take just what they want, e.g. the ISBD information (simple enough to extract), the headings (especially if the links were there), the ISBN, etc. Librarians could supply those types of style sheets and people could play with them for their own purposes.

Perhaps someone could make an XSLT-generator that would let people choose what information they wanted, and they could choose titles, authors, Dewey, etc. Although this would be difficult for me, creating something like this would probably be child's play for someone out there.
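Just to show how little such an "XSLT-generator" would involve, here is a sketch that writes out a tiny stylesheet for whatever MARC tags a person picks. The output element names are my own invention, and the value-of crudely concatenates the subfields; a real tool would offer nicer choices:

```python
def make_xslt(tags):
    """Generate a minimal XSLT 1.0 stylesheet that keeps only the chosen MARC tags."""
    rules = '\n'.join(
        f"""      <xsl:for-each select="//marc:datafield[@tag='{tag}']">
        <field tag="{tag}"><xsl:value-of select="normalize-space(.)"/></field>
      </xsl:for-each>""" for tag in tags)
    return f"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim">
  <xsl:template match="/">
    <record>
{rules}
    </record>
  </xsl:template>
</xsl:stylesheet>"""

print(make_xslt(['245', '100', '082']))   # titles, authors, Dewey
```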

Once you have XML, there are possibilities. Without it, there are none at all.

RE: Linked data

Posting to RDA-L


Karen Coyle wrote:
<snip>
> Then how do you explain the fact that the specification for URIs includes a possibility for a query, http://tools.ietf.org/html/rfc3986#section-5.1 and the link to  http://tools.ietf.org/html/rfc3986#section-3?
We must be talking about different things, because I'm not seeing the relevance of the question. Yes, you can (and many do) send queries in URLs -- but something has to resolve the queries -- they aren't links, they are a way to carry a query from one system to another. I thought  you were saying that an OpenURL is a link to a document, and I was  saying: no, it's a query carried in a URL to some software that then does something with the query.
</snip>
Earlier, I was writing about the possibility of using OpenURL-type queries to make a *URI* (not URL), which would be very quick, efficient and easy, but it depends on: 1) being able to define a URI through a standard query (which you can, according to the specification; you just need to specify the base URI), and 2) of course, much more rigorously controlled, shared metadata--certainly much better than the records I see in the Amazon MARC records I've been looking at (from the other thread; while the records are pretty lousy, the tool is great!). This way, we could have ready-made URIs.
<snip>
We are demonstrating the first steps in places like http://id.loc.gov and http://metadataregistry.org/rdabrowse.htm. The first steps are defining our data elements and controlled lists in these standard formats that everyone working on linked data can understand. After that, it won't really matter greatly what we use internally as a data carrier for our defined data as long as we express it using linked data standards when we want to communicate to the larger world. I don't know how else to put this, but our problem is not just that MARC is an out of date record format -- a bigger problem is that we have no way to tell others what our data MEANS in a known, mainstream way. Linked data isn't a magic bullet, and something will undoubtedly replace it in the future (perhaps before we even get there), but it has the advantage of using standards to make ones' data *elements* usable in a mixed data environment.
</snip>
I realize this. But if we have to wait even longer before our RDF is verified, agreed to and fixed, PLUS in general operation, it will be forever.
<snip>
But the changes that some of us think matter are changes to our data model and data definitions, not to the carrier.
</snip>
Changing our current data model and the definitions, and most importantly, getting some kind of general agreement on such matters, probably will not happen in our careers. We must realize that in Internet time, this is the equivalent of centuries. It has already taken so long with FRBR and RDA that they are effectively obsolete (in my opinion of course! Apologies for saying this on the RDA-L list). If they had been implemented much earlier, we would be in quite a different situation today. I am not finding fault on this--it's just a very complicated and difficult task to undertake.

The most that I see could possibly happen with a new data model is that each community could define its own model, perhaps on national/cultural grounds, but perhaps by field of endeavor, and then these groups may reach some kind of agreement someday in the far future. Or not.

Our current carrier is totally shot in the greater world and has to change sooner or later. This is about the most painless thing we could do. Of course it will change in the future from an XML view of the ISO2709 record (at least I hope so! For example, the "roundtripability" must be eliminated) but the general populace could use our records right now and in a very public way that could help them. This would be giving developers at least a chance. People could write additional transformations to put the records into whatever formats they want, which--let's face it--they will anyway no matter what kind of RDF the library world may eventually devise.

I still see no reason to wait even longer. It is imperative that we move forward in practical ways that the public can see.

RE: Linked data

Posting to RDA-L

Karen Coyle wrote
<snip>
Actually, an OpenURL requires a program and a database to resolve it. It doesn't link directly to the resource. In fact, that's the point of the OpenURL: it goes through a resolver database in order to provide the "best" source for the resource to the user.
</snip>
Then how do you explain the fact that the specification for URIs includes a possibility for a query,
http://tools.ietf.org/html/rfc3986#section-5.1 and the link to http://tools.ietf.org/html/rfc3986#section-3?
<snip>
No, MARCXML does not move us toward dbpedia. At least, not the MARCXML that is the LC standard. That is, as Jonathan has pointed out, just a different format of the same old MARC record with all of the same constraints. Also, linked data and XML are VERY different approaches to data modeling, and many feel that XML actually gets in the way. The direction that I am trying out (and not sure yet how it will all work out) is to break MARC up into its logical component parts -- it's actual data elements. You can follow this as I develop it at:
http://futurelib.pbworks.com/w/page/29114548/MARC-elements
</snip>
I have major differences with this. When compared with today's ISO2709 records, the ability to add a little suffix to a link that says "&format=MARCXML" would present developers with a wealth of possibilities, for a single record, and even more for multiple hits. Even I, with my limited knowledge, could do quite a bit with that. And if those MARCXML records had links to the authority records in the headings, instead of just text, wow! It's not the end, but a start.

I think we agree in a lot of ways, but I suspect the actual disagreement involves something more expansive: what are these things called linked data and the Semantic Web? And even more important: what constitutes a real step toward them? Linked data and the Semantic Web are both rather vague ideas that people still disagree over, but they sound like something that would be great. This reminds me of stories that circulated in the West about Peking: how beautiful and rich it was, the amazing things to see, and so on. So, people wanted to go there, but they weren't sure how to get there, except "Go east."

Of course, "east" is a very indistinct direction, so if you don't know where something is, it's hard to find it. Maybe it's NE and you go ESE and you will never run into it. At the same time, you honestly don't know if Peking is a beautiful place, if it's really a dump, if it has been sacked and destroyed, or if it is even a real place at all, like the seven cities of gold, the fountain of youth, or the empire of Prester John. Still, in order to begin to find out the answers to any of these questions, you must begin your journey, otherwise you will never know.

This is where I think we are now: we want to get to the Semantic Web or linked data, but we aren't quite sure where they are, or at this point, if they are something we really want, or if they can even exist at all. Some may have more confidence than others, but yet, there is only one way to find out and that is to set out on the journey. That means you have to start moving in the general direction, reevaluate where you are, set out from there, reevaluate, etc. Maybe you'll reach your goal someday.

But different people react to this kind of situation in different ways. Some say that we aren't really making progress toward Peking until we are past the Urals, so we shouldn't really start the journey until we have everything mapped out to get to the Urals. This is what I think explains the statement that changing to MARCXML is not a step toward linked data, since it's not a big enough step.

Others sit around, and say how important it is to get to Peking, how there is no choice except to do so. Yet, when they get up from the couch, all they do is walk over to the refrigerator to grab a beer, and go back to sit on the couch. This is how I see FRBR/RDA, which bustles around but changes nothing.

There is the saying, "A journey of a thousand miles begins with one, single step." To start on the journey, we must take that first step out of the house. That is how a journey really begins: by taking one single step out the door, understanding there's still a long way to go. This is how I see doing things such as switching to MARCXML. If we can't take that step until we have RDA with RDF, that is years and years away. If then.

We absolutely have to take that first step out the door.

RE: Linked data

Posting to RDA-L

Karen Coyle wrote:
<snip>
One hint, though, if I may, is that the goal of linked data is NOT to then put the data in a database. The goal is this one that you list as the third rule
> The third rule, that one should serve information on the web against a URI ...
is the goal: to make your data available on the web. That means not in a closed database, but actually on the web. It's like putting your documents on the web so that anyone can link to them, but in this case you are putting your data on the web. Because each "thing" in your data has a URL, the web allows you to make links.
</snip>
But it is my understanding that you can go through a database to get to the data, and as a result an OpenURL can qualify as a URI. This is a "relative reference" URI, where you have to establish a base URI. See http://tools.ietf.org/html/rfc3986#section-5.1.

Here's an example OpenURL I found, which I think everybody on this list can understand:

http://resolver.ukoln.ac.uk/openresolver/?sid=ukoln:ariadne&genre=article&atitle=Information%20gateways:%20collaboration%20on%20content&title=Online%20Information%20Review&issn=1468-4527&volume=24&spage=40&epage=45&artnum=1&aulast=Heery&aufirst=Rachel
The big point is the "?" which means that this is a query, or a search in a database. OpenURL says that everything *after* the "?" should be standardized (i.e. any database should be able to accept this type of standardized query) while everything before the "?" (i.e. the base URI) can change.

This is why I have thought that OpenURL demonstrates the power of consistent, standardized metadata: so long as everything is consistent, the data in one database can be used to automatically find and reliably query another database. But of course, if everybody follows their own rules, the entire thing breaks down. In the example above, for the volume number, is it "24" or "XXIV"? Notice also the author's name, which follows UNIMARC practice of coding first and last names separately.

When the base URI changes, as base URIs always do, then so long as everything is set up correctly, all you have to do is change the part before the "?" one time in your wonderful relational database, and all is fine, since the rest is created automatically from the record itself. And since it is standardized, you can search any and all (in theory) databases just by adding new base URIs. In this way all you need is highly standardized data, which is the specialty of catalogers.
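In code, the whole idea fits in a few lines (a sketch; the metadata is taken from the example above, and the second resolver address is made up simply to show that only the base URI changes):

```python
from urllib.parse import urlencode

# Metadata of the kind catalogers produce every day -- from the example OpenURL above.
article = {
    'sid': 'ukoln:ariadne', 'genre': 'article',
    'atitle': 'Information gateways: collaboration on content',
    'title': 'Online Information Review', 'issn': '1468-4527',
    'volume': '24', 'spage': '40', 'epage': '45', 'artnum': '1',
    'aulast': 'Heery', 'aufirst': 'Rachel',
}

def openurl(base_uri: str, metadata: dict) -> str:
    # Everything after the "?" is standardized; only the base URI changes.
    return base_uri + '?' + urlencode(metadata)

openurl('http://resolver.ukoln.ac.uk/openresolver/', article)
openurl('http://resolver.example.edu/openresolver/', article)   # a made-up second resolver
```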

I have always thought that OpenURLs should be very powerful tools for catalogs, and it is crystal clear to everyone that if you use "Mark Twain" and the others use "Samuel Clemens" the system simply won't work, so the power of consistent standardized metadata becomes desirable even to the public.

Concerning linked data, it is simply the way the World Wide Web works, plus there is the assumption that the more links a resource has both to it and from it, the greater use it has for everyone. I am sure that almost any page you see on the web is composed of dozens of separate files brought together from all kinds of places: within the site itself with such files as headers and footers, images, and from other sites on the web. Some of these files in turn import other files from other places, which can also import still other files, and on and on.

Just examine the NY Times page, http://www.nytimes.com/: each image is a separate file, there are ads belonging to other sites, and there is a part giving the temperature in NYC that is probably brought in from outside. In the past I have seen news feeds from other newspapers as well, I think from Reuters. Linking files together in all kinds of ways--sometimes very small files, or even program files such as JavaScripts that you don't even see, such as web counters--is how the web has worked from the very beginning. This is why you need a browser that does the job of bringing it all together.

Linked data actually does more or less the same thing, so it is nothing really new from the technical standpoint, but it is based more on semantics and meaning. So, if somebody wants pictures of Rome, the system should be "smart enough" to know the different forms and get you Rome, Roma, Rzym, and so on. Of course, in practice this means that *computers* need to understand that these are the same. To make your resources work with linked data, you do things in RDF, which can be expressed in XML. This is no different from using style sheets on your website, where you do things in CSS.

But all of this is being done now in dbpedia. Look at http://dbpedia.org/page/Benjamin_Spock, scroll to the bottom and you can see the record in different types of RDF.
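If I remember the mechanics correctly (a sketch; I am assuming dbpedia answers content negotiation the way linked-data sites are supposed to), anyone can ask for the RDF behind that page instead of the human-readable view:

```python
import urllib.request

# Asking the /resource/ URI for RDF should redirect to the machine-readable
# /data/ document -- the same triples shown at the bottom of the /page/ view.
req = urllib.request.Request('http://dbpedia.org/resource/Benjamin_Spock',
                             headers={'Accept': 'application/rdf+xml'})
with urllib.request.urlopen(req) as resp:
    rdf_xml = resp.read()
```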

This is one of the reasons why I think switching to MARCXML would be one single step in the right direction, and also why we should really consider working with dbpedia: a lot of the technical work has been done already.

Friday, January 14, 2011

RE: Browse and search RDA test data

Posting to RDA-L

Jonathan Rochkind wrote:
<snip>
Many ILS use the MARC _schema_ (aka "vocabulary", aka "list of fields and subfields") as their internal data model, if not the serialized transmission format. The MARC 'schema' is kind of implicit, defined as a byproduct of the transmission format, which is in part what makes it so cumbersome to deal with.
And, unfortunately, it's actually the schema, NOT the transmission format, that is a problem with MARC. It is, as everyone keeps saying, easy enough to change the serialized transmission format to something else (MarcXML, an tab delimited spreadsheet, even RDF (based on marc tags!) if you want, no problem) -- which is exactly why it's not a barrier. The barrier is the lack of power in the actual 'vocabulary' -- a flat list of numeric tags each of which has a single flat list of no more than ~35 single character subfields -- is the barrier. And somewhat harder to change across an ecosystem developed assuming it.
</snip>
I completely agree. It's just that I consider this step #2. By switching our focus to providing MARCXML as a primary transmission format for our records, we will still be stuck with a completely flat everything--which is bad--but it could probably be done with not much pain, and it will at least be in XML, where we, and hopefully others in the world, can gain a bit of flexibility to begin to play in all kinds of different ways, especially compared with what we have now.

To wait even longer to find agreement on anything more is tough. I think we are running out of time. Look at the debate just over capitalization!

One baby-step at a time....

By the way, the Koha open source catalog stores the MARCXML records and uses them through Zebra indexing (exactly how, I'm not sure), plus there are various MySQL relational tables.

RE: [RDA-L] Browse and search RDA test data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
Really, I'm not a great fan of MARC, but we do it injustice if we insist  it go away because of ISO2709. The latter has to go, and can go, and isn't being used nor required nor standing in the way in many applications right now, with no harm done to MARC whatsoever.
</snip>
No, no. I guess I am not making myself clear. MARC does *not* have to go away, just its ISO2709 "incarnation". If you look at the MARC standards for the Leader and Directory, http://www.loc.gov/marc/bibliographic/bdleader.html and http://www.loc.gov/marc/bibliographic/bddirectory.html, they define the ISO2709 structure.
The record you show is *not* what people download when they get the MARC format for their catalog. Here it is, straight from the LC catalog:
01070cam a2200289 i 4500001000900000005001700009008004100026906004500067925004400112955012600156010001700282020001800299040002800317042000800345043001200353050002400365082002000389245007400409260006100483300004600544336002100590337002500611338002300636650004500659651004400704651003200748 16097519 20101123143634.0 100219t20102010ncuab 000 0 eng a7bcbccorignewd2eepcnf20gy-gencatlg 0 aacquireb2 shelf copiesxpolicy default apc20 2010-02-19axh00 2010-09-15 to USPL/STMirf08 2010-10-07 (telework) to SLerf08 2010-10-13 to Deweywrd07 2010-11-23 a 2010923073 a9780977968169 aDLCbengcDLCerdadDLC apcc an-us-nc 00aVK1024.N8bN67 2010 00a917.5604/44222 00aNorth Carolina lighthouses :ba field guide to our coastal landmarks. aGreensboro, N.C. :bOur State Magazine,c[2010], (c)2010. a103 pages :billustrations, maps ;c20 cm atext2rdacontent aunmediated2rdamedia avolume2rdacarrier 0aLighthouseszNorth CarolinavGuidebooks. 0aNorth CarolinaxDescription and travel. 0aNorth CarolinavGuidebooks.
So to get the display you showed for the ISBN
020 \\$a9780977968169
MARCedit had to dig out the 020 from the Directory and match it with the ISBN. It did all this by *counting characters*, not by fielding as it is handled today: <020a>9780977968169</020a>. In ISO2709, everything is buried and must be exhumed.
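To make the "counting characters" concrete, here is a bare-bones sketch of what any program must do just to pull the 020 out of that string. It treats the record as plain text and ignores character-set questions, so it is illustrative only:

```python
FIELD_TERMINATOR = '\x1e'   # in the raw record these are control characters, invisible above

def get_fields(raw: str, wanted_tag: str) -> list:
    base = int(raw[12:17])                            # base address of data, from the Leader
    directory = raw[24:raw.index(FIELD_TERMINATOR)]   # the "horrifying string of numbers"
    found = []
    for i in range(0, len(directory), 12):            # each Directory entry is exactly 12 characters
        tag = directory[i:i + 3]                      # e.g. '020'
        length = int(directory[i + 3:i + 7])          # length of the field, counted by hand
        start = int(directory[i + 7:i + 12])          # starting position, also counted by hand
        if tag == wanted_tag:
            found.append(raw[base + start: base + start + length].rstrip(FIELD_TERMINATOR))
    return found

# get_fields(raw_record, '020') -> ['  \x1fa9780977968169'] : the indicators and the
# subfield delimiter are still buried in the string and must be exhumed in turn.
```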

A move to using MARCXML for record transfer gets rid of these hassles (so long as we do not insist on the round-tripability with ISO2709, as it does now, see point 3 of http://www.loc.gov/standards/marcxml/marcxml-design.html) while maintaining the MARC codings. With XML, we can add all kinds of linked data.
MARC in its non-ISO2709 incarnation can stay forever; that's fine with me. Lots of programmers have issues with MARCXML, and they make some good points; still, I figure we need to move forward ASAP, and their--very legitimate--issues can be dealt with gradually. Those issues shouldn't stop us.
<snip>
It is not ISO2709 that has to do that handling, it is the software processing MARC records. And this processing, I'm very sure, nowadays doesn't, internally, use the ISO directory structure at all but just the tags and codes. Internally, records will most often be represented like this: (the MARCEDIT structure shown by my database)
</snip>
Internally, each database can be different, as each one is today. As I said, ISO2709 is no longer used for internal purposes (except in some CDS-ISIS catalogs); it is used only for record transfer.

I think we *may* actually agree?

RE: Browse and search RDA test data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
Am 14.01.2011 12:24, schrieb Weinheimer Jim:
>
> Bernhard, Sorry to press the point but I think it is a vital one:
> using MARC in its ISO2709 form *cannot* work with linked data.
For all I know, I have to disagree. It is all a matter of field content and then what the software does with that - no matter how it is wrapped up for communication. A MARC field can carry a link (an Identifier) and the software can use it in whichever way, wherever and whenever needed, no matter how the record is wrapped up during storage or transfer. This is in no way different from XML.

It may be a matter of what you are familiar with. If it is XSLT and nothing else, then XML is of course appealing. Other data manipulation languages can do just the same things, in different ways, and some do it more elegantly than XML.
</snip>
I hate to keep harping on this, but I think it is a crucial point since I believe that ISO2709 is one of the key problems holding us back; certainly more important than adopting FRBR or RDA. As I said before, ISO2709 may be able to be revamped to handle linked data, but it seems senseless to me to do that work if tools already exist that can handle the job. If, as you point out, linked data is merely a matter of adding some identifiers, *maybe* it can work although it seems to me that such a system will always need special parsing and recompiling over and over again to be useful. For example, merely doing a linked data search using an ISBN is impossible with an ISO2709 record as it stands.

And although XML may not be the best solution, it can do all of this right now, today, and browsers can handle XML. Somehow, I don't think an ISO2709 parser/compiler would make it into browsers today. And I think time is of the essence to demonstrate how libraries can fit into this coming information world.
ISO2709 served its purpose well, but it is a completely obsolete format that was created for the needs and the technology of its time. It needs to be placed in the trashcan of history.

Once again, I confess I may be wrong. I am willing to learn. But please, no theory; just some practical examples of how ISO2709 can fill the bill and how it would be better than MARCXML.

RE: Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

On Thu, 13 Jan 2011 09:10:20 -0600, Brian Briscoe wrote:
<snip>
I use the term "local" to mean the collection of materials that are acquired by our library specifically to meet the needs of ours users. Our subscribed users are accustomed to a certain level of cataloging and certain headings to assist them in consistently finding the information they seek. When we vary from that, they let us know.
</snip>
Well, I guess we will just have to agree to disagree. I think libraries must turn their gaze outward instead of inward since this is what our patrons are doing.

The people who use my library care a lot about the *reliability* of the information, but couldn't care less where it is located, so long as it's easy to get at. If they have to stand up and go to another library, they think it's a real chore. They want everything available digitally from home. They actually want those digitized books in the Internet Archive, the government documents and think tank publications and everything else available on the web--once they know about them.
<snip>
We need ILS developers to design catalogs that allow users to search both "locally acquired" materials and the rest of "out there" seamlessly without sacrificing the quality of searching that the library provides. Throwing everything into keyword is not adequate.
</snip>
I agree with this, and it is what I have attempted with my Extend Search http://tinyurl.com/345mf9t
Who knows if it's successful or not, but it is only a single baby-step. My patrons seem to like it.

RE: [RDA-L] Browse and search RDA test data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
That may be true for some ILS systems but certainly not for all of them. If it is, then it is a weakness of that system, not a feature of MARC. Get rid of those systems or get vendors to understand that this mode of communication is - though it needs not be thrown overboard - not the only mode that is required but what you need is configureble export. Even Z39.50 is not tied in with ISO2709, it is just a convention that this is most frequently used for communication.
</snip>
Bernhard,
Sorry to press the point but I think it is a vital one: using MARC in its ISO2709 form *cannot* work with linked data. At least, it cannot work without *major revamping* which is not worthwhile to undertake. So long as MARC is linked to ISO2709, we remain stuck in place since all you can do with it is transfer a complete record from one library catalog into another library catalog.

Once we are in an XML-type of world, retaining the numbered fields and subfields is OK, although at that point it is of relatively minor importance. Once the data is in XML, you can do anything--anything--with it: transform it into other bibliographic formats, into citations, PDFs, DOCs, or even movies. We could even create records in 3D, if that's what we want! http://www.youtube.com/watch?v=u7h09dTVkdw

Z39.50 itself may have a future or not; I don't much care one way or the other since tools exist today to do what we need to do.

RE: Browse and search RDA test data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
Am 14.01.2011 09:54, schrieb Weinheimer Jim:
>
> When we talk about MARC as it is used by libraries today, we cannot
> separate it from the underlying ISO2709 format,...
Oh but we can, we certainly can and we should and we do. A MARC record
can easily be rendered like this:
...
</snip>
I can, and have, reformatted native ISO2709 records into all kinds of other formats: CSV, Refworks, OAI-PMH, MARCXML and so on (although even then it's easier starting with XML, since you don't need to parse the thing to begin with). But when I then want to transfer a record that I have worked with into my catalog, I have to recompile it back into an ISO2709 record so that I can import it using Z39.50, and at that point we are stuck with each and every one of the limitations of ISO2709.


The one and only purpose today of ISO2709 is to transfer MARC21 records from one library database to another library database. That is the entire problem, since it impacts everything we do. It is the primary way of getting records from one catalog to another, e.g. records are uploaded to WorldCat in ISO2709 and downloaded in the same format. The Z39.50 search in MarcEdit uses ISO2709 and then recompiles. Since the method of transfer is ISO2709, we remain stuck with the limitations of that obsolete format.

But I may be wrong. How would we work with linked data, importing related information such as a contents note and a couple of analytics, in the current world of ISO2709? Can you give me an example? Of course, it would be relatively easy with MARCXML, but it must still end up in that ISO2709 string with all of the lengths defined at the beginning, as I wrote before.

I confess that I cannot imagine how the FRBR entity-relationship model, which is all based on linked data, could work in ISO2709, although in XML it would be no real problem.

RE: Browse and search RDA test data

Posting to RDA-L

Jonathan Rochkind wrote:
<snip>
I don't see any significant increase in flexibility to share Marc records by 'switching' to MarcXML. Am I missing something? What exactly would be the advantages of 'switching' to MarcXML?
</snip>
When we talk about MARC as it is used by libraries today, we cannot separate it from the underlying ISO2709 format, since this is the primary (and still the only?) means that we transfer records from one catalog/database to another. It is the ISO2709 format that is completely inflexible. What we see on our computer screens when we catalog something is *not* the MARC format that we are really working with.

Originally, library records were stored, searched, displayed, etc. directly from the ISO2709 format, but as relational databases and indexing engines such as Lucene and Zebra appeared, the storage of records took on different forms. The *transfer* of records, however, has always remained in ISO2709 format, which the computers compile at the time of transfer using Z39.50. So, if we look at http://tinyurl.com/6hfuqjf, go to the bottom, set "Select Download Format" to one of the MARC options, save the result, and open it in Notepad (or whatever), this is what is transferred. I am 100% certain that this is not how the record is stored within the Voyager database at LC, but when a library wants to transfer the record into their own catalog, this is the method: by compiling that information into an ISO2709 record.

The record itself:
01468cam a2200325 a 4500001000800000005001700008008004100025035002100066906004500087955012200132010001700254020003600271040001800307043001200325050002300337082001500360100002900375245011500404260005600519300003500575504006400610600006300674650005000737650005000787650003300837856009500870856008700965920004101052991004901093 1182910 20041208175945.0 971106s1998 caua b s001 0 eng 9(DLC) 97043755 a7bcbucorignewd1eocipf19gy-gencatlg apc14 to ja00 11-07-97; jk27/jk15 (desc) to subj. 11-13-97; jk14 to DDC 11-17-97;aa05 11-19-97; cip ver. jb09 09-22-98 a 97043755 a0520212010 (cloth : alk. paper) aDLCcDLCdDLC ae------ 00aNB623.C2bJ64 1998 00a709/.2221 1 aJohns, Christopher M. S. 10aAntonia Canova and the politics of patronage in revolutionary and Napoleonic Europe /cChristopher M.S. Johns. aBerkeley :bUniversity of California Press,cc1998. axvii, 271 p. :bill. ;c27 cm. aIncludes bibliographical references (p. 237-259) and index. 10aCanova, Antonio,d1757-1822xCriticism and interpretation. 0aArt patronagezEuropexHistoryy18th century. 0aArt patronagezEuropexHistoryy19th century. 0aNeoclassicism (Art)zEurope. 423Contributor biographical informationuhttp://www.loc.gov/catdir/bios/ucal052/97043755.html 423Publisher descriptionuhttp://www.loc.gov/catdir/description/ucal042/97043755.html a** LC HAS REQ'D # OF SHELF COPIES ** bc-GenCollhNB623.C2iJ64 1998tCopy 1wBOOKS
This is completely inflexible. The first five characters, 01468, define the length of the entire record; the record cannot vary by even one character from this length, otherwise it will break. The horrifying string of numbers that follows, the directory, defines the length and starting position of each field, counting the indicators and subfield codes. In fact, the field tags themselves, i.e. 100, 245, 300, etc., are embedded inside this horrifying string of numbers and are separated from their text fields below. So, for the heading of Canova as a subject, somewhere in that number string is "600" followed by the length of everything in that field, indicators and subfield codes included, right through "Criticism and interpretation."
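To make the structure a bit more concrete, here is a minimal sketch in Python of what any program must do just to read such a record: parse the leader, walk the fixed 12-character directory entries, and only then pull out the field data by counted offsets. It is purely illustrative; note that in the record as pasted above the non-printing field and subfield delimiters have been turned into spaces, so the sketch assumes an intact ISO2709 string such as the one you actually download.

# A minimal, illustrative sketch of how an ISO2709/MARC21 record is laid out.
# It only reads the leader and the directory; real tools (MARCedit, pymarc,
# the ILS itself) also handle character encodings and many edge cases.

FIELD_TERMINATOR = "\x1e"      # ends the directory and every field
SUBFIELD_DELIMITER = "\x1f"    # precedes each subfield code within a field

def parse_iso2709(raw):
    """Return (record_length, base_address, directory entries, fields)."""
    record_length = int(raw[0:5])     # first 5 characters: length of the whole record
    base_address = int(raw[12:17])    # leader positions 12-16: where the field data begins

    # The directory runs from position 24 up to the first field terminator,
    # in fixed 12-character entries: tag (3), field length (4), offset (5).
    directory_end = raw.index(FIELD_TERMINATOR)
    directory = raw[24:directory_end]
    entries = [(directory[i:i + 3],            # tag, e.g. "600"
                int(directory[i + 3:i + 7]),   # length, including the terminator
                int(directory[i + 7:i + 12]))  # starting position within the data
               for i in range(0, len(directory), 12)]

    fields = {}
    for tag, length, offset in entries:
        start = base_address + offset
        # Every count has to be exactly right, or the whole record "breaks".
        fields.setdefault(tag, []).append(raw[start:start + length - 1])
    return record_length, base_address, entries, fields

Change one character anywhere in a field and every length and offset after it has to be recalculated, which is exactly the inflexibility described above.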

For a long time there was no problem with a completely inflexible record like this since the whole idea was to transfer entire catalog cards from one library system to another. That is no longer the case at all today, and hasn't been for quite some time now. How can that format above work with linked data in various ways without breaking everything? How would the FRBR entity structure work in ISO2709? Because it must work with ISO2709 since that is how our records are transferred.

So, while we can do anything we want *within* our own catalogs, making all sorts of wonderful things, they can't be transferred anywhere else because of this obsolete transfer format, i.e. not without considerably reworking it. How could this work with linked data, bringing in information from other places? I compare it to working at Hoover Dam, filled to the brim with fresh water that we keep making sweeter and cleaner, while there are all kinds of towns out there desperately wanting our water, but the pipes out of the dam have a capacity of only 1 or 2 gallons per minute and they leak like sieves. My water ends up useless.

While I won't deny that it may be possible to rework the ISO2709 format to work with linked data, why should we even think about it when we have XML? It would be completely wasted work; nobody without an ISO2709 parser would even consider using such a format. This is why I mentioned switching to MARCXML, that is, for *record transfer*, which could be done with the least disruption. For instance, the FRBR/RDA structures would become more feasible, if that were desired (!). Internally, each catalog can be structured however its designers want, but for record transfer, it should be MARCXML.

This could be done completely behind the scenes with most not even noticing a difference--at least at first--and while there are problems with MARCXML, switching to any kind of XML format would open a wealth of possibilities.
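For comparison, here is a rough sketch, again only illustrative, of the same Canova subject heading serialized as a MARCXML data field. The point is simply that the markup carries the structure, so nothing has to be counted: a subfield can be added or dropped without recalculating a single length. (Existing tools such as MARCedit already convert between ISO2709 and MARCXML; this is just to show what the result looks like.)

# A rough sketch of serializing one data field as MARCXML instead of ISO2709.
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARCXML_NS)

def datafield(tag, ind1, ind2, subfields):
    """Build a MARCXML datafield; subfields is a list of (code, value) pairs."""
    field = ET.Element(f"{{{MARCXML_NS}}}datafield",
                       {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARCXML_NS}}}subfield", {"code": code})
        sub.text = value
    return field

# The Canova subject heading from the record above:
field = datafield("600", "1", "0", [
    ("a", "Canova, Antonio,"),
    ("d", "1757-1822"),
    ("x", "Criticism and interpretation."),
])
print(ET.tostring(field, encoding="unicode"))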

RE: Browse and search RDA test data

Posting to RDA-L

Bernhard Eversberg wrote:
<snip>
About current MARC practice, you're right.
While I've never been a dyed-in-the-wool MARC enthusiast, I'm realist enough to recognize that any migration into something else (and then into what, really?) would be a galactic task. There will have to be a MARC implementation of those ER values or RDA will remain either a theorist's glass bead game, or it will live on as nothing more than the dumbed-down version reflected in that test stuff. The ALL CAPS skirmish drawing on forever and overshadowing all the rest of the endeavor.
</snip>
Well, when you look at it that way, it's another one of those retrospective conversion projects that probably everybody on this list has known and loved. When I was doing research on the history of the library catalog of Princeton University, I found ten or twelve conversions since 1755, including after the library burned down a couple of times. (I include recopying the manuscript book catalogs in this, because things were normally reconsidered, reclassed, etc. By the way, one letter I found from a cataloger in 1866 was unforgettable, since he described the work he was doing and then lost his mind just a few months later!)

When comparing those jobs with what faces us now, I think this conversion will be much easier than before, i.e. in the sense of the actual work, but the task itself is far more complex. It's difficult at this point to understand exactly what we should *convert into*, while it was easier to sit and copy records from one volume to another, or from one size card to another, or from the card to the MARC record, etc. This is on a completely different plane.

My own suggestion would be to do this as quickly as possible with a minimum of information loss. So, the conversion itself could be done incredibly fast by converting to MARCXML (yes, I know that's bad) but at least then we would gain some genuine flexibility to be able to share records. An application profile could be drawn up to allow inclusion of XML information from other projects and perhaps allow some of the more specific relationships mandated by FRBR/RDA (which are in our records now but some are not coded so specifically).

This could probably be done rather quickly, and open source catalogs that deal with these records could be adapted, since many exist now. Koha, for instance, works on MARCXML. Some internal tweaking would be needed, but perhaps not all that much.

Of course, the separate records/entities for WEMI would be thrown overboard, at least for the foreseeable future, but people probably know my opinion on that....

I think this would be a definite step forward, but a small one.

Thursday, January 13, 2011

Comment to Conversations that Cataloguers Aren’t Hearing


When Lynne mentions the idea of "creative cataloging" in her comment, it makes me cringe. While there are ways to be creative during cataloging, it must be done with tremendous care. In essence, the task of cataloging is to describe a given resource and organize that description so that it fits into an existing intellectual structure as neatly and as well as it possibly can. The method for achieving this is the principle of "consistency".

The problem with being "creative" in cataloging is that it *can* lead to resources being misplaced within that intellectual structure. When something is inconsistent (or "creative"), the resource falls outside the expected places within the intellectual structure, and equally "inconsistent" (or "creative") methods must then be used to retrieve it, or to understand it.

So, even though many of the transliteration practices for some non-Roman languages may strike people as absurd or wrong, the cataloger *must* follow the prescribed practice, just as classifying a book in a "creative" way (i.e. not with the other, related items) cannot be done, because then it becomes far more difficult to find. The same goes for every part of the record, from each part of the description to the selection and construction of access points.

From this short explanation, it could be inferred that the practice of cataloging is the most incredibly conservative task there could ever be, and that is a valid point: it *can* be that conservative, and I think people see it all the time. There is a reason for this: our records must be made consistent with records created 100 years ago or more, otherwise things get lost.

At the same time, I want to emphasize that there is such a thing as creativity in the practice of cataloging, and some examples border on brilliance. I have seen it especially in authority work, but it can certainly exist at the point of the individual record. But I will say that "creativity" when applied to cataloging, means something quite different than when "creativity" is used in the everyday world.

RE: Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

Brian Briscoe wrote:
<snip>
But Jim, some patrons DO care where the information is located. Public libraries frequently have users looking for information and they need to use a book. I am thinking mostly of students, but they are not the only ones who, for one reason or another, want information that is local.
</snip>
I'm not really sure what you mean here. If, by "local", you mean "that which is held in the physical collection", we can do that easily enough now in almost any ILS by limiting to different kinds of physical materials: books, videos, etc. If you also mean "the digital resources the library purchases access to", you can get the digital books and the serials as a whole, but keeping up with the current holdings and the changes in the aggregators is a lot of work, so this is often outsourced through, e.g., Serials Solutions. And if you want individual articles, as everybody does, you have to search in supplementary indexes and return to the local catalog, use Serials Solutions, or abandon the catalog altogether and search the aggregators separately.

But this is exactly where the very concept of "local" disintegrates today, and we can see it best in Google Scholar. I have searched Google Scholar (on my machine!) for "library metadata" http://scholar.google.com/scholar?q=library+metadata&hl=en&btnG=Search&as_sdt=1,5 (I made a screen clip as well, since Google results are unpredictable).


The first two hits are into SpringerLink, and my library has no access to it. In the left column, there are the links into SpringerLink, and they ask me for 34 euros for one, and 25 euros for the other (I don't know why there is this difference). *BUT* in the right hand column are links into free versions of the articles available from Open Archives.

Here we see only a single, very tiny, example of the question libraries are facing today (or at least, what they should be facing): "What is local?" From the patron's point of view, the open archived ones are just as "local" as anything the library pays for or what is on a shelf.

If it weren't for Google Scholar providing this--in publishing terms--almost revolutionary display of the free version alongside the pay one (which one would *you* choose?!) many would still be left in the dark. (It also illustrates in graphic fashion how the interests of the scholarly publishers and authors do not necessarily coincide)

Just as illustrative is the Digital Book Index, which I don't like in many ways, but it does the same thing. Look at the works under Mark Twain:
http://www.digitalbookindex.org/_search/search002a.asp?AUTHOR=Twain,%20Mark
and you can get the 1917 ed. of "A Connecticut Yankee" for free, or pay money to Questia. Again, which one would you choose?

Are these kinds of materials "local"? In my own opinion: in the interests of my patrons, I have determined that there is no choice. They *are* local materials and therefore must be dealt with "locally" in some way. That is a huge decision, but one that I have solved to a certain extent, not through individual cataloging, but through other automated means.

Wednesday, January 12, 2011

RE: Browse and search RDA test data

Posting to RDA-L

Mike Tribby wrote:
<snip>
Perhaps not surprisingly, I find myself in agreement with both Mac and Hal. And I would ask Jonathan and any other list members who see value in all-caps display of titles if they have any thoughts on how to transcribe a title in which all letters are caps, but the letters at the start of the title (and possibly at the start of each word) are _larger_ caps than the caps that make up the rest of the title. I don't think my keyboard or my cataloging software is capable of creating caps in different sizes in the same field, at least not easily.
</snip>
I assumed that the idea of accepting all caps was to be able to accept ONIX data more easily, but I just looked it up in their guidelines (http://www.bisg.org/docs/Best_Practices_Document.pdf p. 11):
"Titles should be presented in the appropriate title case for the language of the title"
and then they have several examples of capitalization in English, Spanish, and French. In addition, on the next page, we read:
"Titles should never be presented in all capital letters as a default. [In fact, the word "never" is underlined--JW] The only times that words in titles should be presented in all capital letters is when such a presentation is correct for a given word. Acronyms (e.g. UNESCO, NATO, UNICEF, etc.) are an example of a class of words that are correctly presented in all uppercase letters. When acronyms are made possessive, however, the terminal "s" should not be capitalized."
And so, the plot thickens! Their guidelines look fairly good to me. Would this be a case of definitely saying that the information received was substandard?
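Out of curiosity, checking for this kind of substandard data is not hard, even mechanically. Here is a crude sketch in Python that flags an incoming title transcribed entirely in capitals, the case their guidelines say should never happen. The little acronym list and the sample title are invented for the example, and real title casing is of course language-dependent and much subtler than this.

# A crude, illustrative check for title data that arrives in all capitals,
# contrary to the ONIX best practice quoted above. Illustrative only.
KNOWN_ACRONYMS = {"UNESCO", "NATO", "UNICEF"}   # invented exception list

def looks_all_caps(title):
    letters = [c for c in title if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def naive_title_case(title):
    words = []
    for word in title.split():
        words.append(word if word in KNOWN_ACRONYMS else word.capitalize())
    return " ".join(words)

sample = "UNESCO AND THE POLITICS OF CULTURAL HERITAGE"   # invented sample
if looks_all_caps(sample):
    print(naive_title_case(sample))
    # -> "UNESCO And The Politics Of Cultural Heritage"; still not proper
    #    title case (which would lowercase "and", "the", "of"), only a sketch.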

RE: Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

On Tue, 11 Jan 2011 13:36:12 -0600, Mary Mastraccio wrote:
<snip>
>James, I think most everyone agrees with you that there needs to be some major changes in library catalog software, and there is a need to enhance the vocabulary with UF terms. However, addressing questions about what subject terms to use and rules/guidelines for constructing a controlled vocabulary cannot be said to "avoid the primary problems". When I see an optometrist about my eyesight and am given a prescription, I don't say the optometrist "avoided the primary problem"--I'm aging--I recognize he has addressed a specific issue and go elsewhere to address the aging problem [haven't found a solution yet]. Since the original question was what term to use, it was appropriate to answer that question and only that question.
</snip>
That's a fair point Mary, and I stand corrected on that. Still, in my own defense, I had changed the subject to "Helping the Searchers of the Catalog", but in essence, I think we all agree that major changes must take place.
<snip>
>There are three issues.
>1. what terms will be included in the controlled vocabulary;
>2. how will we embellish these terms (with UF and RT and restrictions, etc.); this includes both general rules as well as the mechanics of linking to other vocabularies and possibly languages;
>3. how will those terms be made useful to people using our catalogs.
>
>It is 2 and 3 that are critical to make the work on #1 of any lasting value. I share your frustration because such developments are long overdue.
</snip>
I want to add a corollary here: the universe to which these issues apply must change from the traditional view of the library cataloger, who is focussed on the local catalog, plus maybe WorldCat or LC, to that of the user, which is much, much more expansive and includes more or less the totality of the information universe. Again, this is impossible to achieve fully in reality, and anything we do will be gradual, but we can take some steps toward it.

VIAF points to a possible solution. For example, the record for Leo Tolstoy http://tinyurl.com/5vmxco2 can lead to related searches on the authorized forms used in other library catalogs. If everybody would simply point their headings to http://tinyurl.com/5vmxco2, patrons would find the correct form used, e.g., in Italy, which is something no English speaker would ever think of: Tolstoj, Lev Nikolaevič (incredibly useful for my patrons, who would never think of ending Tolstoy with a "j"!), plus lots of others. I also really appreciate the corporate bodies, e.g. "Office of the United Nations High Commissioner for Refugees" http://tinyurl.com/679adul

Of course, this tool could be reworked to make the various forms clickable, so as to search the respective catalogs for the correct form of the name. Something similar could be done with subject headings and descriptors, although I am sure it would be more complicated.
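To make this concrete, below is a hedged sketch of the kind of lookup a catalog interface could do behind the scenes: ask VIAF for a cluster record and pull out the authorized forms used by the various national files, which could then be offered as clickable searches. The endpoint name and the response structure are my assumptions based on VIAF's publicly documented JSON services, and the Tolstoy identifier in the comment is only illustrative, so everything here should be checked against the current VIAF documentation.

# A hedged sketch: fetch a VIAF cluster and list the authorized forms by source.
import json
import urllib.request

def viaf_cluster(viaf_id):
    url = f"https://viaf.org/viaf/{viaf_id}/viaf.json"   # assumed endpoint
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def authorized_forms(cluster):
    """Yield (source file, heading) pairs, assuming the mainHeadings structure."""
    headings = cluster.get("mainHeadings", {}).get("data", [])
    if isinstance(headings, dict):        # a single heading may come back bare
        headings = [headings]
    for entry in headings:
        sources = entry.get("sources", {}).get("s", [])
        if isinstance(sources, str):
            sources = [sources]
        for source in sources:
            yield source, entry.get("text", "")

# e.g. for the Tolstoy cluster mentioned above (the ID here is illustrative):
# for source, form in authorized_forms(viaf_cluster("96987389")):
#     print(source, form)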

I think that conceptually, VIAF is extremely important because it gives catalogers a sense of what patrons (and reference librarians) confront as they search for the information they need. People searching for information do not want to be (and should not be) limited only to the local catalog and WorldCat, i.e. where the LC authority file operates. This means the complexity increases enormously, and I think it is imperative that catalogers make this task much easier. A person wants information on the aging process in food, and they don't care where the information happens to be. If a specific database uses some kind of authority control, how does someone find that point of control? How does someone search all of these control points easily? I think the library system should be the place to find help. This is the Semantic Web.

Once again, I don't think this means that the individual records made by catalogers necessarily need to change, or only very little at most. The task will still be to assign headings consistently at the "correct" levels of specificity and exhaustivity. (Although the definition of "correct" may change. Right now, we theoretically assign subjects covering 20% or more of an item. This may change to 10% or 5% or 50%. Already there is the discussion over the rule of three for name headings.)

Where I think change needs to take place ASAP is to create tools similar to VIAF that bring related access points together at the higher "conceptual level", and then the information architects and programmers can take over and create useful and cool ways to interoperate with that tool.

Tuesday, January 11, 2011

RE: Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

On Tue, 11 Jan 2011 11:04:01 -0600, Mary Mastraccio wrote:
>Joan Jones asked:
>>wouldn't the subdivision "Effect of aging on" be a
>> better subdivision for Melissa Cookson's purposes?
>
>Not necessarily. If the material was about the process of aging cheese it would be better to have "Cheese--Aging".
All of these discussions are highly interesting and productive, but they still avoid the primary problems, especially in a networked environment. First, we are supposed to interoperate with other terminologies. In some places it may be "Fermentation", in others "Ripening", in others "Aging", or even "Effect of aging on". (By the way, I personally think this last one should be valid under personal names too!)

For example, in the NAL thesaurus, for cheese, it is "cheese ripening" http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5171&n=1&s=5&t=2, but for meat it is "meat aging" http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5173&n=1&s=5&t=2, and for wine it is "wine aging" http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5175&n=1&s=5&t=2 (no subdivisions at all).

In AGROVOC, the term is "Ripening" http://aims.fao.org/en/pages/594/sub?mytermcode=27924&mylang_interface=&myLanguage=EN. All of these terms have different language forms as well.

In dbpedia (through subj3ct.com), the term is "Fermentation (Food)": https://subj3ct.com/subject?si=http%3A%2F%2Fdbpedia.org%2Fresource%2FFermentation_%28food%29

This is the environment that our users inhabit. If they are going to navigate this effectively, they must know it. It is *unbelievably complicated*. No wonder so many just decide to settle for whatever Google throws out at them.
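Just to have the scatter in one place, here is the same concept restated as a small crosswalk, using only the examples and URLs above. A real mapping would of course be maintained as data (for instance as SKOS mapping links), not hard-coded like this; it is here only to show what "interoperating with other terminologies" has to contend with.

# One concept, as labeled in three different vocabularies (examples above).
AGING_OF_FOOD = {
    "NAL Thesaurus": [
        ("cheese ripening", "http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5171&n=1&s=5&t=2"),
        ("meat aging",      "http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5173&n=1&s=5&t=2"),
        ("wine aging",      "http://agclass.nal.usda.gov/mtwdk.exe?k=default&l=60&w=5175&n=1&s=5&t=2"),
    ],
    "AGROVOC": [
        ("Ripening", "http://aims.fao.org/en/pages/594/sub?mytermcode=27924&mylang_interface=&myLanguage=EN"),
    ],
    "dbpedia": [
        ("Fermentation (food)", "http://dbpedia.org/resource/Fermentation_(food)"),
    ],
}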

My point is, in deciding upon subject terms, you will *never ever* find a term that everyone will agree with. Somebody, somewhere, will say they would have never thought of such a ridiculous term for the concept they want. That is the reason for the absolute necessity of the UF terms. For the longest time, there were no cross-references at all in the OPAC; then they were implemented for the browse searches, but people overwhelmingly use keyword now, so they still miss them completely.

Second, many times a single subject in an item is represented by multiple subject terms, since there is not always a 1:1 relationship. My usual example is a book on the library reference interview. There is not a single term for this, and it has been handled as 1. Reference services (Libraries) 2. Interviewing, e.g. http://tinyurl.com/63847pb; there is also the example in 26.52 Post-coordination of "LCSH, structure and application": http://www.itsmarc.com/crs/shed0143.htm.

We all deal with this every day, and all I am saying is that this system is complicated. I think it actually worked better in the card/physical environment than online because people had no choice except to browse alphabetically (mostly). In any case, we should not conclude that making a new subdivision "Aging" or a bunch of new headings is going to really solve anything. The system we have today was designed for a completely different environment and is broken.

That doesn't mean at all that the public do not want the product of what we make, e.g. the "set of all resources on the aging process in foods" because this is what I think people want more than anything else. When I explain just the idea of what we provide through authority control to non-librarians, it is so completely different from Google-type searching that it comes as almost unbelievable to many people. People want it. We provide this control now; we *can* provide it in a way that may be usable in the environment that our public inhabits (as mentioned above); that is, so long as we, and others such as NAL, AGROVOC, and whoever, decide to cooperate, which means that *everyone* will have to change. Do we have it in us? I honestly don't know.

But how do we achieve this? Through trial and error. I think dbpedia would be a great place to begin, since it could be done pretty much now and lots of people could benefit from our work.

RE: Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

Vosmek, John J. wrote:
<snip>
James Weinheimer wrote: "Of course catalogers need to continue to provide all kinds of controls over our data, including authority control--I don't question that..."

To the contrary, that's exactly what you are doing, i.e., questioning the utility of controlled vocabularies. This discussion ("Why is there no term for this?" "Should there be a term for this?" etc.) is fundamental to the process of creating and maintaining controlled vocabularies, which are never complete and will always need tweaking. If you see this conversation as an indication of what's wrong with our catalogs, then you must see controlled vocabularies as a flaw.
Unless, of course, you are saying that the ability to keyword search full text would be a great additional option to our catalogs for the times when controlled vocabulary searching doesn't return the results that we know are there. Would anybody disagree with this?
</snip>
No, this is not at all what I am saying. There are crucial differences among 1) conceptually grouping materials for consistent retrieval, 2) matching someone's real-life queries to those groupings, and 3) actually finding those groupings. In the instance we have been discussing, I posited that someone wanted the concept "the aging of food". How is that translated into finding the related groups?

The conceptual groups suggested were: Food spoilage; Cooking (Meat); Food--Preservation; Game and game birds, Dressing of; Slaughter and slaughter houses; Fermented foods; Fermentation. How in the world can an untrained person come up with any of that? But they need to.

This problem is not new at all. In the card catalogs, it was even worse, because the person first had to pull out the correct drawer to get into the correct alphabetical arrangement and then browse to the heading and any subgroups. That was why the red books and, more importantly, the reference librarian were absolutely essential. Since the catalog has not changed in any essential way since that time, except for adding keyword searching, people are confronted with what is, in essence, the same problem, and the traditional "solution" of the reference librarian no longer holds, since users are asking many fewer questions of reference than before, especially the remote users.

Yet, the conceptual groupings we make are there and still being maintained, arranged by expert catalogers who (I hope) take their jobs seriously. I am saying that just laying it all on the users and reference librarians ("If you have problems, ask a reference librarian") does *not* work at all today. We must accept that. The catalog (or whatever library system our patrons will use to access our collections) must work for *them* and not just for *us*; otherwise, they'll just go someplace else.

How do we solve this problem? One thing that is absolutely necessary is to make the cross-references actually useful in our catalogs by incorporating them into full-text searches in a coherent manner, and adding about a zillion more UFs, based probably on logfile analysis and with lots of input from the reference department. Doing something like this in a cooperative fashion would be a lot more useful than retyping all of those abbreviations, of that I am sure!
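Here is a minimal sketch of that idea, with an invented two-heading authority "file" just for illustration: fold the UF terms into an ordinary keyword query so that a search on a variant form still lands on the conceptual grouping. In a real catalog the UFs would come from the actual authority records (plus the zillion new ones from logfile analysis), and the matching would be handled by the search engine rather than a loop.

# Fold UF cross-references into a keyword query. The tiny authority
# "file" below is invented purely for illustration.
AUTHORITY = {
    "Fermented foods": {"UF": ["aged foods", "aging of food", "cured foods"]},
    "Food spoilage":   {"UF": ["food going bad", "spoiled food"]},
}

def expand_keyword_query(query):
    """Return the query plus every authorized heading whose UFs match it."""
    terms = [query]
    q = query.lower()
    for heading, refs in AUTHORITY.items():
        if any(q in uf.lower() or uf.lower() in q for uf in refs["UF"]):
            terms.append(heading)
    return terms

print(expand_keyword_query("aging of food"))
# -> ['aging of food', 'Fermented foods']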

Helping the Searchers of the Catalog (Was: subject heading or subdivision for food aging?)

Posting to Autocat

On Mon, 10 Jan 2011 16:42:38 -0600, Mary Mastraccio wrote:
>> Melissa asked:
>> >Is there really no subject heading or subdivision for the
>> aging of food?
>>
>> Mac suggested
>> > Food spoilage
>>
>> Tina Gunther suggested:
>>
>> Cooking (Meat)
>> Food--Preservation
>> Game and game birds, Dressing of.
>> Slaughter and slaughter houses.
>>
>> "Food aging" is a rather ambiguous phrase, which is probably
>> why it hasn't been established.
>>
>
>"Aging" of food may be ambiguous (is it good aging? or bad aging? or food for the aging?, etc.) but it is a common concept. Aged cheese, aged wine, aged meat, fruitcakes, etc.
>Many things age/grow old not just people so it is not surprising some searchers want to qualify their search "[topic]--Aging."
There are also the possibilities of "Fermented foods" and "Fermentation".

But I think this is a great example of the problems facing the public who want to use our catalog: just among the people who have answered this question here, it wouldn't surprise me if there is close to 100 years or more of cataloging experience. If that kind of deep professional knowledge and vast experience has a problem--and this *is* a good problem--what can we conclude about a normal person with no training, sitting at his or her desk and just staring at that text box? How can we provide genuine help to that real-life person out there who is looking for the concept of "the aging of food", to steer them in the direction of all of these possibilities?

Compare this to: http://tinyurl.com/5u6uy5j in Google Books or http://tinyurl.com/63dr2hs in Google Scholar. Perhaps not perfect results, but it is very easy and at least there is something for them to grasp onto.

So, if we imagine somebody using our catalog to search for "the aging of food" who finds essentially nothing and gets no help (because people will not actively seek out help on their own), it is only reasonable that they will then go to Google Books and Scholar and begin to find some things that answer their questions. When the full text is available in Google Books, why would they ever come back? What would you prefer if you were searching on the topic? And next time it is very possible that you will go to the Google tools from the beginning... and not have time for our catalogs.

And when we solemnly proclaim that "our tools are better" can we blame them when they roll their eyes and turn away?

These are the types of absolutely important, 100% realistic issues facing the catalog today, *not* typing out abbreviations in full or getting into debates over what is a work or expression, and which "attributes" go with which "entities", all highly reminiscent of the arcana of medieval academics. If the "Statement of International Cataloguing Principles" proclaims (http://www.ifla.org/publications/statement-of-international-cataloguing-principles):
"Several principles direct the construction of cataloguing codes. The highest is the convenience of the user."
then we should take that statement seriously. This question provides a great example of a real problem facing users.

I think there are solutions out there to provide genuine help to people who are using our catalogs. For the longest time, the standard answer was, "If you have problems, ask a reference librarian," but that answer no longer holds (if it ever really did, but that is another topic). Of course catalogers need to continue to provide all kinds of controls over our data, including authority control--I don't question that--but our final product must also be easy enough and useful enough for the general public to use. We must take this task very seriously.

How do we do it?

Monday, January 10, 2011

Library cuts in Google Maps

Posting to NGC4LIB

While this is neither a fun nor an uplifting map, it is one that I am sure will strike many others on this list personally, and it provides yet another example of the possibilities of the new informational tools today:
Public Library Closures in the UK

http://maps.google.co.uk/maps/ms?ie=UTF8&hl=en&msa=0&msid=210849821991286385577.00049636af20aee18bb14

Zooming in displays closures and possible closures in highly graphic fashion, and for those who may be looking at their own neighborhoods, something like this may actually have more impact than simply reading words. From this map, it appears to me as if entire communities will be left with almost nothing at all.

I have found a similar one for the US for school libraries and librarians at:

http://maps.google.com/maps/ms?ie=UTF&msa=0&msid=117551670433142326244.000482bb91ce51be5802b

These are built on a collaborative basis. I find both of them very powerful.

Saturday, January 8, 2011

RE: 856 Linking tags

Posting to Autocat

On Fri, 7 Jan 2011 10:42:57 -0600, Tim Skeers wrote:
<snip>One thing this person might want to try, if it hasn't been tried already, is to talk to some of the other reference librarians who object to the links and pin down the exact reasons, e.g., why are they confusing/annoying, what do the patrons say, etc. That might help clarify what needs explaining. Then try to get them to attend a short meeting that could be set up sort of like a training session where they can be *shown* what the links do (as opposed to just verbally explaining), by picking out some records and bringing them up, clicking through, etc.
</snip>
I completely agree. Somewhere in the recesses of my mind, I remember reading that at one time our public was clamoring for *more information* in our records, specifically summaries and tables of contents. It must be pointed out that a library catalog record is certainly no more complicated than a metadata record in Google Books or in Amazon, e.g. compare the displays in my own catalog (semi-ISBD display): http://www.galileo.aur.it/cgi-bin/koha/opac-detail.pl?bib=19272
Amazon.com display: http://www.amazon.com/1491-Revelations-Americas-Before-Columbus/dp/140004006X
or my catalog: http://www.galileo.aur.it/cgi-bin/koha/opac-detail.pl?bib=22691
and Google Books: http://books.google.com/books?id=eaC7yKbxj8UC&source=gbs_navlinks_s

What accounts for the apparent fact that people prefer the Google and Amazon displays to ours? I personally suspect that this is signaling a change in user expectations: today, people assume that when they see a link, it will link to the resource itself, and if it does not link to the resource (or at least to a related resource that is nevertheless complete, e.g. a review), they tend to see such links as useless and consequently annoying. This is instead of the original intent of the 856s: as additional information that can help patrons decide whether they actually want the resource. After all, it should be better to see a table of contents or a summary note than nothing at all. This is definitely how *I* see the 856s; and yet I accept that most people may not think this way and that people's expectations are changing. (The underlying cause *may be* that people feel that clicking a link is a type of *work* that entails an investment of labor and therefore must have a definite return. Just a few years ago, clicking on a link was seen as incredibly easy, and this attitude may be changing in the popular mind. Still, these are guesses, since I don't know.)

Since the public is seemingly so enamored of Google and Amazon, perhaps we could draw a preliminary conclusion: *if* the same information were incorporated in some way into the record itself (as we see in the endlessly scrolling pages in Google Books and Amazon), and not presented as separate links to click on, the public would find this information more useful. Of course, the same functionality could be achieved in other ways, with onMouseOver events, or perhaps in the way I have implemented the OCLC API for citations (see either record in my catalog above, and click on "Get a citation" in the right-hand column). But it could certainly be done in much cooler ways, utilizing various APIs.
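As a very rough sketch of the "incorporate it into the record itself" idea, something like the following could run at display or indexing time: fetch the page behind an 856 link and fold a plain-text excerpt of it into the record display, instead of (or alongside) the bare link. Error handling, caching, and proper HTML cleanup are all waved away here, and whether a given 856 target permits this kind of reuse is a separate question entirely.

# A rough sketch: pull the content behind an 856 link into the record display.
import re
import urllib.request

def inline_856(url, max_chars=1500):
    """Fetch an 856 target and return a crude plain-text excerpt of it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)      # strip tags, very crudely
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]

# e.g. the publisher description link from the Canova record earlier:
# print(inline_856("http://www.loc.gov/catdir/description/ucal042/97043755.html"))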

Tuesday, January 4, 2011

Comment to: Death by Irony: How Librarians Killed the Academic Library

Comment to: Death by Irony: How Librarians Killed the Academic Library / Brian T. Sullivan (Chronicle of Higher Education, January 2, 2011)

I guess I'm going to lay myself open here, because I have a completely different take on this. I am a librarian of some years' experience as well, and I have seen how electronic resources and the web have pretty much killed off science libraries. In a highly provocative talk by Peter Murray-Rust, a chemist (see the video at http://www.jisc.ac.uk/whatwedo/campaigns/librariesofthefuture/debate), he tells the truth: he simply says that for the science, technology and mathematics (STM) fields, academic libraries are almost completely irrelevant.

To continue his line of thinking: the social sciences also need libraries less and less, certainly much less than only 10 years ago. Therefore, why would we believe that the humanities are somehow exempt from this trend? And anyway, is the purpose of libraries aimed primarily at the humanities, or equally at all fields? Already we are seeing brand-new possibilities with many projects that lie completely outside all of the traditional publisher/library controls, such as Google's Ngram Viewer http://ngrams.googlelabs.com/, which is already inspiring a lively debate. I think we all know this is only the first of many such projects.

Fortunately, Mr. Murray-Rust doesn't simply end there, but goes on to make several suggestions for some of the directions libraries should take. *In theory*, finding useful and reliable information should be easier today than ever before, but this makes a tremendous assumption: that people are much better at searching than any I have witnessed. Somebody may be very good at using their computer to find a hotel room to book on a specific date, to download a new app for their iPhone, or to look something up in Wikipedia, but using that same computer to get enough reliable information to write a paper on the consequences of the fall of communism in Hungary, or to understand the techniques of Raphael's paintings, or, for that matter, to get a decent idea of the performance of the Obama administration in order to decide how to vote, are completely different tasks. People have trouble with those searches.

In the pre-computer days, people also had lots of problems with these same sorts of questions, but back then they would tend to find far too little--or nothing--using the card catalogs and paper indexes, while today they normally find far too much that is irrelevant. In the old days, however, people had no choice except to ask a reference librarian for help, or just continue to struggle (as many did), but today with the web, there is always someplace else to go, or what I think many do: choose the easiest route by just giving up and accepting whatever the search engine spits out at them. The number of reference questions is clearly going down according to ARL statistics, and webmasters know that people very rarely consult the help pages of a site. They just go someplace else or "Google" it.

I don't know what the solutions are, and nobody can know without trial and error, but librarians have to get out of the rah-rah! corner and face facts: they are watching their profession fade away. We have lost the sciences, almost lost the social sciences, and the humanities are endangered. I have no doubt that librarian skills are desperately needed today, but in different contexts and in different venues. We must discover what they are.