Wednesday, April 16, 2014

Re: [ACAT] Advantages of RDA

Posting to Autocat

On 4/16/2014 7:47 AM, Hal Cain <hegcain@gmail.com> wrote:
<snip>
The alternative view is the pragmatic one: it's happened, we'll just do the best we can with it.
</snip>
I've been out for a few days, and just saw this thread.

The advantages of RDA are primarily theoretical: it is based on entities and attributes, so that the public can navigate the works, expressions, manifestations and items (WEMI) by their authors, titles and subjects (ATS). This is now being supplemented by adding more explicit relationships, so that the searcher knows the specific relationship between two entities: e.g. that a specific work is related to another entity who is not only an author, but e.g. a thesis advisor.

To believe this, you must ignore a lot of the reality that is right in front of our very faces: right now if someone wants to, anybody in the world can navigate the WEMI by their ATS by using the new technologies that allow facets. All you need is a uniform title and everything works from that. If implemented correctly, you don't even have to know that much to be able to use it, and anything can display however you want it to. As an additional plus: all of the technology is not owned by some monopolistic company that can charge any outrageous fees it wants and you are locked in forever, but because of the essential goodness of some very talented people, it is all open source and is downloadable for free. Those are simple facts and anyone can convince themselves of their correctness in a couple of minutes.

Another fact is that if the public is to search for the specific relationships, e.g. find people as thesis advisors (but there are far more relationships among all of the entities), then it is a fact that none of it can possibly work until those relationships are added to the records--including those that already exist--otherwise people will be searching only the tiniest fraction of what is really available to them. To add those relationships to all the entities (the relationships among the WEMI and ATS are truly complex) will demand vast resources and will have costs, but no one has even suggested any kind of ultimate figure. What we have seen so far in RDA/FRBR implementation represents only the barest costs, and already beyond the reach of many libraries in this economic crisis we are going through.

I shall ignore the obvious question of whether people actually want to find, identify, select and obtain WEMI by their ATS, or if they actually want something else, but will just point out that it is folly to assume that it is what they want without evidence.

To get more of a grasp upon the reality, we should also try to see things from the point of view of the users. I think that a very good way to do this is to watch the talk by a library IT person "E-Books Do Not Exist (and Other Conundrums of Digital Asset Management)" https://www.youtube.com/watch?v=BJ5sSUHJagg but to consider this talk not so much in terms of libraries dealing with this information, but of users dealing with this information. In short, he says that the number of records he is putting into the catalog is 8000+ per day. I may be wrong, but I would suspect that the average large library adds anywhere from 1000 to 2000 records per week. In a year's time, this would be the difference between

8000 x 5 (days per week) x 50 (weeks per year) = 2,000,000 per year
1000/2000 x 50 (weeks per year) = 50,000 to 100,000 per year
(or no more than 5% of what is being added to the collection)
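
The same comparison can be worked through as a short calculation. This is only a sketch using the figures given above; the 5-day week and the 50-week year are the assumptions already made in the post, and the variable names are invented for illustration.

```python
# Figures taken from the post; the 5-day week and 50-week year are the
# post's own assumptions, not measured values.
BATCH_RECORDS_PER_DAY = 8000
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 50
CATALOGED_PER_WEEK_LOW = 1000
CATALOGED_PER_WEEK_HIGH = 2000

# Records loaded in bulk over one year
batch_per_year = BATCH_RECORDS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR

# Records added by a large library's catalogers over one year (low and high guess)
cataloged_low = CATALOGED_PER_WEEK_LOW * WEEKS_PER_YEAR
cataloged_high = CATALOGED_PER_WEEK_HIGH * WEEKS_PER_YEAR

# Cataloged records as a fraction of the bulk-loaded records
share_low = cataloged_low / batch_per_year
share_high = cataloged_high / batch_per_year

print(f"Bulk-loaded per year: {batch_per_year:,}")
print(f"Cataloged per year: {cataloged_low:,} to {cataloged_high:,}")
print(f"Cataloged as a share of bulk-loaded: {share_low:.1%} to {share_high:.1%}")
```

With these inputs the share comes out between 2.5% and 5%, which is where the "no more than 5%" figure comes from.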

The speaker of this video mentions that since there is so much, he cannot know what records are being put in there--he cannot ensure any kind of authority control, cannot figure out what is and is not duplicated, which links work and which don't, etc. etc. etc. While this is obviously a serious matter for the library, I want to focus on the users. The speaker describes the situation he is in as "being squashed like a bug." I am sure he is right, but if he feels this way, what does this mean for the users? Also, we have to remember that these numbers do not in any way represent everything relevant that is available to the public. As an example, I am personally very interested in following the latest Ukrainian news, and the best place for me to find that information is on the web, not in a library, and certainly not in JSTOR, Proquest, Ebscohost, or anything libraries pay for. On almost any topic, there are wonderful materials on the web that are just as vital, interesting and useful for the public as many of the materials on our shelves--if not more so.

These comments are merely to introduce some of the facts that users and other non-catalogers have to deal with every single day. While Hal is right in one sense about RDA, "it's happened, we'll just do the best we can with it," it still doesn't mean that RDA is dealing with any of the very real problems that people and libraries are facing. It was introduced through executive fiat, without a business case, and we are seeing the consequences of such a decision.

Sooner or later, the cataloging community is going to have to deal with it, otherwise at a yearly rate of 5% of the whole, the catalog records that we make will constitute tinier and tinier proportions of the growing mass, eventually winking out of realistic existence in the information universe.

I am personally not so pessimistic: I believe there are many, many things catalogers and catalogs and libraries can do--and very important things at that--but first of all, they have to face some very obvious facts, and accept that any solutions must be both practical and sustainable. Concerning RDA and FRBR: do they offer real solutions to the very real problems that people are facing, and are those solutions practical and sustainable? What relationship do these records have with the far greater numbers of items available to the public through the entirety of what the library pays for, plus the even greater numbers of resources available through the web?

I and others have asked these questions repeatedly but neither RDA nor FRBR has been very concerned with the stark realities that catalogers, librarians and the public all face. There are serious questions that need to be asked, and answered, not "which relator term do I use?" or "Does this author go with the work or expression?"

Friday, April 4, 2014

Re: Transforming non-MARC metadata to MARC to the library catalog

This topic, some threads on various lists, and a talk on youtube (that I suggest all librarians should watch): "E-Books Do Not Exist (and Other Conundrums of Digital Asset Management)" https://www.youtube.com/watch?v=BJ5sSUHJagg have all really made me think.

Posting to various lists

The task of adding non-MARC metadata to the library catalog is absolutely huge and fabulously important--there is no doubt about that. It also doesn't seem to be discussed much, especially when compared to RDA, WEMI or which relator codes to use. But the challenge is exponentially larger and has the potential to sink the catalog, and some apparently think it already has.

Still, I question whether adding non-MARC metadata (or in other words, non-standard records) is really a new problem or not. Libraries have always had files to all kinds of materials that did not make it into the library's catalog: finding lists for archives, journal and newspaper indexes; lots of analytics, all kinds of subject bibliographies... this list can go on for a long time. It was always too much work for individual libraries to catalog everything that came into the library, so that isn't new. What is really different now is that these things can go into the catalog--that is, if you receive some kind of delimited format, it is possible to convert it into a type of MARC where you can load it into the catalog without the catalog noticeably blowing up. It can be done, but I question: should it be done? And if so, how should it be done?
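
To make the "it can be done" part concrete, here is a minimal sketch of turning a delimited file into MARC-like fields. It is an illustration only: the column names, the sample row, the tag mapping and the one-line-per-field output are all assumptions made for this example, not the format of any particular vendor file or loader. Names and topics are mapped to uncontrolled fields (720, 653) so that unverified strings are not mistaken for authorized headings.

```python
import csv
import io

# Hypothetical mapping: column name -> (tag, indicators, subfield code, repeatable).
# "#" stands for a blank indicator. The choice of tags is an assumption.
FIELD_MAP = {
    "title": ("245", "00", "a", False),
    "creator": ("720", "##", "a", True),
    "subject": ("653", "##", "a", True),
    "url": ("856", "40", "u", False),
}

def row_to_fields(row):
    """Convert one delimited row (a dict) into a list of MARC-like field tuples."""
    fields = []
    for column, (tag, indicators, code, repeatable) in FIELD_MAP.items():
        raw = (row.get(column) or "").strip()
        # Repeatable columns are assumed to separate values with semicolons
        parts = raw.split(";") if repeatable else [raw]
        for part in parts:
            part = part.strip()
            if part:
                fields.append((tag, indicators, code, part))
    # Keep fields in tag order; sorted() is stable, so repeats keep their order
    return sorted(fields, key=lambda f: f[0])

def format_field(field):
    """Render a field tuple as one line of text for inspection."""
    tag, indicators, code, value = field
    return f"={tag}  {indicators}${code}{value}"

# Invented tab-delimited sample standing in for a received metadata file
SAMPLE = """title\tcreator\tsubject\turl
An example report\tDoe, Jane\tCataloging; Metadata\thttp://example.org/report
"""

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
records = [row_to_fields(row) for row in reader]
for record in records:
    for field in record:
        print(format_field(field))
```

Nothing in this conversion checks the names or subjects against an authority file, which is exactly where the problems with incorrect headings come from.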

Zillions of problems arise when adding them to the catalog. The problems of incorrect headings, as discussed by Julie, plus the numerous problems laid out in the youtube talk make me think that adding those records causes as many problems as it proposes to solve. Still, the public wants to be able to search "everything" in one search, and I accept that. Does that also mean we have to destroy the consistency in our catalogs that our predecessors (and I!) have worked so hard to maintain through the decades, and even longer in some cases? I don't think so. The power of today's systems could offer a solution. For instance, with federated searching, searching "everything" does not mean that it all must be in one database. The information can be almost anywhere, in any format.

There are several open source tools now, such as MasterKey (demo at http://mk2.indexdata.com/). In this demo, you are searching library catalogs, OpenCourseWare, Wikipedia, and PLOS. There is also Wheke http://wheke.org, made by an acquaintance, based on Drupal. Most of the documentation is in French, however. There are probably other similar tools as well. Anyway, these will search MARC and non-MARC databases all at once, then sort and even merge the records they find. The MasterKey demo is very impressive. Of course, when you install it yourself, you can decide what you want to search and how to do it.

What is the advantage? Well, in Julie's case with the "free" Dublin Core records, she would not have to put them into the catalog but she could put the records into another local database, perhaps a very simple mysql one or something similar. It could be searched along with anything else she wanted using the federated searching tool, and the users wouldn't even know the difference. But the real advantage is: the records would be in their own database, you would have additional tools not available in a MARC database, and you could continue to update, edit, or completely overwrite them separately without worrying about how it affects your own catalog records. The result would be that a lot of the terrible headaches mentioned in the youtube talk would disappear.
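
The idea can be sketched in a few lines. This is not how MasterKey or Wheke work internally; it is only a toy illustration of the principle, with an in-memory SQLite database standing in for the "very simple mysql" database, a plain list standing in for the catalog, and invented records, call numbers and function names throughout.

```python
import sqlite3

def build_local_db():
    """Create a throwaway database holding the non-MARC (Dublin Core style) records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dc_records (title TEXT, creator TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO dc_records VALUES (?, ?, ?)",
        [
            ("Open Metadata Handbook", "Example, A.", "http://example.org/1"),
            ("Cataloging basics", "Example, B.", "http://example.org/2"),
        ],
    )
    return conn

# Stand-in for the library catalog; these records are invented
CATALOG = [
    {"title": "Cataloging Basics", "creator": "Example, B.", "call_number": "Z693 .E9"},
    {"title": "History of Italy", "creator": "Example, C.", "call_number": "DG467 .E9"},
]

def search_catalog(term):
    term = term.lower()
    return [dict(r, source="catalog") for r in CATALOG if term in r["title"].lower()]

def search_local_db(conn, term):
    rows = conn.execute(
        "SELECT title, creator, url FROM dc_records WHERE lower(title) LIKE ?",
        (f"%{term.lower()}%",),
    ).fetchall()
    return [{"title": t, "creator": c, "url": u, "source": "local_db"} for t, c, u in rows]

def federated_search(conn, term):
    """Query both sources, then merge hits that share a title and creator."""
    merged = {}
    for hit in search_catalog(term) + search_local_db(conn, term):
        key = (hit["title"].lower(), hit["creator"].lower())
        entry = merged.setdefault(
            key, {"title": hit["title"], "creator": hit["creator"], "sources": []}
        )
        entry["sources"].append(hit["source"])
        for extra in ("call_number", "url"):
            if extra in hit:
                entry[extra] = hit[extra]
    return sorted(merged.values(), key=lambda e: e["title"].lower())

conn = build_local_db()
results = federated_search(conn, "cataloging")
for r in results:
    print(r["title"], r["sources"])
```

The point of the design is that the table of non-MARC records can be dropped, reloaded or edited at will, and the catalog records are never touched.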

And, in the spirit of sharing, if Julie were kind enough to let other libraries and catalogers use her mysql database of those records, the entire workload could be shared out, for instance, updating headings or using URIs. Everyone would benefit.

Records for electronic documents are fundamentally different from records for physical items because they are all pointing to exactly the same files in exactly the same places. Although you may need specific permissions to access the files, that is a separate issue, and fully solvable as well. So, I don't see why each library needs separate records in its catalog that all point to exactly the same things--why not share it if you can, and do it efficiently?

This does raise other questions however, but I'll talk about those in other posts.

Friday, February 28, 2014

Re: [ACAT] Code s in character position 28 of 001

Posting to Autocat concerning the use of the "government publications" code in the MARC format

On 26/02/2014 1.48, john g marr wrote:
<snip>
First, let's start from the bottom instead of with generalities. Which ["very few"] publications by state university presses ARE "government publications" and why, and which state university presses are NOT affiliated with state universities?
With that, we can then establish some guidelines based on fact instead of mumbling about vagaries.
PS: I see that line "Treat an item published by an academic institution as a government publication if the government created or controls the institution" as being intended to distinguish such items from items published by private academic institutions (e.g. Bob Jones University), which brings up the question of whether such private institutions actually have MORE control (and censorship) over what their presses publish and whether that problem should be addressed.
</snip>
I think it's a little late in the day to start trying to figure out how to treat the government publications code. I wanted to check in "The MARC II Format. A Communications Format for Bibliographic Data" from 1968 to discover if it existed in the original format but unfortunately, neither Google Books nor Hathitrust allows me to see the text of the clearly public domain document. (As an aside, I have noticed that Google appears to be withdrawing several materials that used to be open. I have used proxies to check if this is only because I am in Italy, but it appears not to be the case.) In any case, I am sure there has been that code for several decades.

As I mentioned in my previous message, I suspect that the code was introduced so that people could limit by "government documents". That would be good and useful. But catalogers immediately had the very legitimate question of: "What is a government document?", a vague and nebulous idea. As happens so many times, to answer this legitimate question, nobody ever went back to the public to ask them what they wanted, and instead, catalogers decided to do it "theoretically": a government document is something that comes from a government body. So, what is a government body? Again, the catalogers fell back on theory and decided:
"Academic publications- In the U.S., items published by academic institutions are considered government publications if the institutions are created or controlled by a government.
University presses- In the U.S., items published by university presses are considered government publications if the presses are created or controlled by a government (e.g., state university presses in the United States)."

Obviously, this becomes useless and is certainly not the general idea of someone who wants to work with government documents. It is also quite a political statement to declare that anything that comes off of a university press--if it is a state university press, or from other academic institutions--creates government documents. Again, I don't think any cataloger ever researches their publishers to find out their relationship to the government. I never have and I won't do it.

This is one of those little points in the format that started with good intentions but became useless since there has been such wide variation in its implementation. We should either fix it, which would demand a huge amount of resources for no purpose, or consider eliminating it.

Unless there is no concern for consistency any longer. Then, we can just keep putting in a useless code.

Re: [ACAT] Online Encyclopedia

Posting to Autocat
This is about trying to catalog a multi-volume encyclopedia that is in the Internet Archive

On 2/27/2014 4:33 PM, Lisa Romano wrote:
<snip>
Unfortunately, these are Internet Archive records.
</snip>

The IA arranges its materials in a way that is not all that library-cataloger-friendly. In library-cataloging terms, it mixes manifestations. Still, it contains some fabulous resources, such as Migne's Patrologiae cursus completus in both Latin and Greek--something that would be outrageously expensive for a library that wants it, yet it exists in the IA for free! e.g.
https://archive.org/search.php?query=Patrologiae%20cursus%20completus%20migne

Unfortunately, there is no overall page, and finding a specific volume can be difficult if not practically impossible. There is no way to find out what is there and what is missing. This isn't the case just for the Migne, because of course there are zillions of booksets in the IA, some very large such as this one, but many (most?) booksets lack any volume numbering and I have discovered that even when the records say something is volume 13, there is absolutely no guarantee that the item is volume 13. Therefore, each volume must be examined.

Finally, the individual records in IA are structurally different from traditional library records. For instance, this one
https://archive.org/details/patrologiaecursu00hammuoft
mixes together different formats: 2 types of pdf, epub, kindle, DjVu, etc. In library cataloging terms, each format would be considered a separate manifestation and therefore each would get a separate record. So, the DjVu format would get one record, while the epub gets another and so on. What we see in the IA is a type of Expression record.
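
The difference between the two arrangements can be shown with a small sketch. The file listing below is invented and is not taken from any actual IA record; it only illustrates how the same files group one way under the IA's per-item view and another way under a per-format (manifestation) view.

```python
from collections import defaultdict

# Hypothetical file listing for two volumes; item names and file names are invented
FILES = [
    ("volume-01", "volume-01.pdf"),
    ("volume-01", "volume-01.epub"),
    ("volume-01", "volume-01.djvu"),
    ("volume-02", "volume-02.pdf"),
    ("volume-02", "volume-02.epub"),
]

def by_item(files):
    """IA-style view: one record per item, with every format listed under it."""
    grouped = defaultdict(list)
    for item, name in files:
        grouped[item].append(name)
    return dict(grouped)

def by_format(files):
    """Library-style view: one grouping per format, i.e. per manifestation."""
    grouped = defaultdict(list)
    for item, name in files:
        extension = name.rsplit(".", 1)[-1].lower()
        grouped[extension].append(name)
    return dict(grouped)

print(by_item(FILES))
print(by_format(FILES))
```

Every new derived format adds one more key to the per-format view, and so one more record per volume to maintain, while the per-item view only gains another entry in an existing list.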

Kind of....

To make matters even more complex, in the future, additional formats can be made automatically from what exists now, so if someone at IA wanted to provide .html or Apple's .ibook format, or some new format in the future, that format could be generated automatically and there would be a completely different format to manage. And generated in the blink of an eye! Of course, in the IA record, there would just be an additional option in the left column.

Whew! But the materials themselves are nevertheless highly valuable to our patrons. What is the best way to control it? That's tough, but I think these sorts of resources should make us rethink our normal methods. To do it correctly would mean to catalog these materials so that all manifestations are together, that is: all of the epubs, all of the pdfs, the DjVus and so on. That is a complete rearrangement of what exists now and would take many many hours, perhaps months of horrible work and besides--is that really what the public wants?

In practical terms, if someone wants to add the IA collection of Migne to their library catalog--thereby saving their libraries probably tens of thousands of dollars--and have a record for the entire set so you can see what you have and don't have, should there be one bookset record for all of the pdfs, another for all of the epubs, another for all of the DjVus and so on?

While that would be "correct" in cataloging terms, it is also a ton of work, and anyway: does the public want materials brought together by format in this way? Clearly, the IA does not think so and believes that the public is far more interested in the contents of a resource than its format.

After all of this, my suggestion is: don't even think about cataloging the bookset "correctly". Do it in the most economical way that will help your users find the information they need, but not the format they need. Create a separate html page that describes your encyclopedia and the links to the individual volumes can go to the IA mixed-format records and then catalog the page you made. Our normal methods fall apart here and lead to too much needless work.
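
Such a page does not need to be built by hand. The following is a rough sketch of generating it from a list of volumes; the title, the volume list and the example.org links are placeholders, and a real page would use the actual IA record addresses found by examining each volume.

```python
from html import escape

# Placeholder data: volume number and link, or None when no copy was found
VOLUMES = [
    (1, "https://example.org/details/encyclopedia-vol-01"),
    (2, "https://example.org/details/encyclopedia-vol-02"),
    (3, None),
]

def build_page(title, volumes):
    """Return a simple HTML page listing each volume in order with its link."""
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
        "<ul>",
    ]
    for number, url in sorted(volumes, key=lambda v: v[0]):
        if url:
            lines.append(f'<li><a href="{escape(url, quote=True)}">Volume {number}</a></li>')
        else:
            # Showing the gaps lets users see what is and is not available
            lines.append(f"<li>Volume {number} (not available)</li>")
    lines += ["</ul>", "</body>", "</html>"]
    return "\n".join(lines)

page = build_page("Example Encyclopedia", VOLUMES)
print(page)
```

The page then gets a single catalog record, and the question of formats is left to the IA record each link points to.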

You could also involve others for design of the webpage you make, and anyway, it may be a good idea to involve other library staff. It is an interesting situation and they can see some of the problems we face.

Thursday, February 6, 2014

Re: [RDA-L] Re: Future of WEMI

Posting to RDA-L

On 2/5/2014 7:43 PM, J. McRee Elrod wrote:
<snip>
Most of us are doing manifestation records in MARC. There are no work records (unless you consider 100$a$t and 130 authority records to be such), Bibframe has no expression records (calling them works instead). It seems to me that in MARC we have MI, and in Bibframe WII (work/instance and I assume item). WEMI does not seem to be happening.
</snip>
There is still an entire host of questions concerning WEMI that need to be addressed. First, it seems that with Bibframe, the works and expressions will be merged. Also, who will "own" the W/E instances or will they--along with all the W/E instances that will be made in the future--be in the public domain, or will libraries be expected to pay? Has that issue been solved or discussed? I cannot imagine any organization agreeing, in essence, to give away all of their headings (i.e. W/E information) if there is not some kind of iron-clad guarantee somewhere. Too many libraries have already been burned by losing rights to their own materials in the digital age.

Finally, it remains to be seen whether any of this will have the slightest impact on the users, especially after the fact that anyone is able to find WEMI in Worldcat now (as I have demonstrated several times), and this has had precisely zero impact on the public, on libraries, or on the world of cataloging itself. After we spend outrageous amounts of money converting and retooling each and every database and catalog in the world (and we are only at the very beginnings of the costs), what exactly is the major change that people will be able to experience that they cannot do right now; changes that will make the catalog more relevant to their needs, and will get them to appreciate library catalogs once again, other than merely for inventory control?

But these questions seem to be among those that nobody wants to discuss.

Wednesday, February 5, 2014

Re: proprietary interest in bib records?

Posting to RadCat

On 2/4/2014 1:45 AM, Linde B. wrote:
<snip>
In OCLC #827230862, there appears an interesting 588:
588 This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
</snip>
"Intellectual property" has become increasingly fraught by all kinds of problems ever since the introduction of the world wide web and it will only get worse. Everybody wants to "own" everything today and the ownership of "metadata" of all different types, is becoming more and more of a political issue.

Before the copyright act of 1976, if you wanted something copyrighted, you actually had to go through the process of registration, but since 1976, anything is copyrighted from the moment of creation, with no need to register anything. As a result, before 1976 anything that did not have the copyright symbol © or was not explicitly stated to have a copyright was fair game. We already know that whatever is produced today is copyrighted, so anyone who wants to use it needs to know what to do with it. Since that time, the copyright laws have only gotten more complex with longer terms tending toward infinity.

As an additional point for catalogs, there is the very real question whether metadata is subject to copyright--at least our kind of metadata that summarizes another resource and not the other types of personal metadata (such as phone records) being hotly disputed in the press today. There is little doubt that a huge part of any catalog record cannot be copyrighted, e.g. the title, the number of pages, the type of format, and so on. There must be some level of originality connected to it, so in an individual record, that seems to leave only a summary note (maybe) and, stretching it a long way, the subjects. (See: Karen Coyle's Metadata and Copyright)

While I could make a great argument for the originality and creativity of at least some of the cataloging records I have made, today it is far more important to make our records open. I think the intellectual property laws and concerns over ownership of information have become increasingly crazy and are making everything dysfunctional. When copyright really made a difference, it was when copies were difficult to make. Digitized copies are made automatically with no effort at all. You just did it when you got this email. A little kid can do it; a baby could do it; I'll bet I could teach a dog or a cat to do it.

In 200 years time, I believe people will look back with amazement at the incredible tangled web of our arguments over this strange thing called copyright. They will compare the importance of our current debates to the medieval academic debates such as that seen in Umberto Eco's "The Name of the Rose" where they debated the question, “Did Christ, or did He not, own the clothes that He wore?” That was an important point for those people back then, but it is a different matter today.

Concerning the University of Florida Libraries, I commend them. This is a noble effort to head off a huge debate that can only make everyone angry.

Re: [ACAT] Removal of foreign subject headings

Posting to Autocat

On 2/4/2014 6:34 PM, Zelesky, Mark C. wrote:
<snip>
Hello everyone. I am reaching out to the collective AUTOCAT wisdom about removing unwanted subject headings from your bibliographic records. I am wondering how other libraries approach this task. Do you automatically strip them using a bulk edit program (MARCEdit)? Do you keep them in the records and stop indexing them? I am trying to develop guidelines for handling foreign subject headings, so I want to know what other libraries are doing.
</snip>
Retaining the text in the records seems to present few problems and could even help some people find some materials through keyword. The real problems happen when the foreign subject is made into a link. For instance, here is a record in Worldcat: http://www.worldcat.org/oclc/803423329

If people click on the subject headings
Itàlia -- Guies
Itàlia - Descripcions i viatges - Guies

what do they think they are looking at? These are the sorts of very strange interactions that (I think) the public has with our catalogs. I don't see how they can possibly understand something of this complexity, and it can lead only to bad results (only a tiny percentage of our records will have a heading such as "Itàlia - Descripcions i viatges - Guies", even if the book itself is in Catalan!). How can an untrained person understand it and even if they did, what would they think about a link that doesn't really work?

Someday in the far future, this may be worked out with linked data but that will be quite a while yet. So my suggestion is: if you want to retain the different headings in the records, you must turn off the links; otherwise, it is not fair to the public.
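
How the links get turned off depends entirely on the catalog software, so the following is only a sketch of the logic, not the configuration of any real system. It assumes the display layer can see the second indicator of each 6XX field (0 for LCSH, 2 for MeSH, 7 for a source named in subfield $2) and that the library links only the thesauri it actually indexes. The field structure, the source code and the function name are invented for the example.

```python
# Second-indicator values whose headings are indexed and linked locally.
# Which thesauri belong here is a local decision; this set is an assumption.
LINKED_INDICATORS = {"0", "2"}

# Simplified stand-ins for two subject fields in one record
FIELDS = [
    {"tag": "651", "ind2": "0", "subfields": ["Italy", "Guidebooks"]},
    {"tag": "651", "ind2": "7", "source": "hypothetical-code", "subfields": ["Itàlia", "Guies"]},
]

def render_subject(field):
    """Keep the heading text for keyword searching, but link only known thesauri."""
    text = " -- ".join(field["subfields"])
    return {"text": text, "link": field["ind2"] in LINKED_INDICATORS}

for field in FIELDS:
    rendered = render_subject(field)
    marker = "[link]" if rendered["link"] else "[text only]"
    print(rendered["text"], marker)
```

The foreign heading stays in the record and remains findable by keyword; it simply stops promising a browse list that the catalog cannot deliver.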

Re: [RDA-L] Re: Our goal

Posting to RDA-L

On 1/30/2014 3:33 PM, Thomas Berger wrote:
<snip>
- - Never even try to extract "data" from traditional "headings"
- - "Labels" are a convenience. Consistency within your own environment is a must, uniformity between libraries is nice to have but in contrast to "headings" they shall not be overburdenend with the ambition of universality or eternal fitness to every task imagineable.
</snip>
Agreed, but the purpose of the heading was never to provide "data," especially in the modern notion of data mining or other data manipulation. The purpose of the heading was to provide a label (I must insist!) for a group of resources that have been brought together. Those groupings supplemented the single arrangement of the physical materials on the shelves, because while other physical arrangements were possible, that would have demanded multiple copies of everything and was impractical. So, you would rearrange your materials virtually, using cards. Catalogs do exactly the same thing today only they don't use the cards.

Still, the groupings and arrangements must ultimately mean something to those who use them (humans). No user has ever understood what "DG575.M8 A2 1998" meant but when you went for it, you were among the physical materials and you understood what the number meant when you got there by what you saw arranged under that number and the items around it. In the catalog however, there must be something else that people can understand attached to that alternative grouping. I call that a label, although it doesn't have to be text and could be an image (like a t.p. scan or a screenshot) or a bit of music, a video clip or who knows what that label could be, so long as it tells the human what the grouping contains. Plus, the human must be able to find that grouping, even though when he or she begins the search, they may have no idea the grouping even exists.

The card catalog handled that very nicely with the cross-references, but I have seen nothing similar in library catalogs, except in left-anchored text browses, when you are searching the OPAC like a card catalog.

Friday, January 31, 2014

Re: [RDA-L] Re: Our goal

On 1/30/2014 12:52 PM, Thomas Berger wrote:
<snip>
When you prescribe rules that allow to construct headings to the necessary precision that not only any real catalog but also any hypothetic catalog including descriptions for items that don't even exist at the moment can be arranged as a whole without any ambiguity - then identical items must fall together exactly in one spot (otherwise a future item could separate them).
</snip>
I think we are substantially in agreement, but I believe that thinking in terms of "headings" may be obsolete. Instead, there needs to be a method of some sort that allows human beings to discover what has been done through purpose 1 (i.e. when the catalogers bring together identical items (with respect to one property) at the same place). In the Googles, in the Amazons, this is done by machine through advanced algorithms using massive amounts of metadata of all different types, including information about the searcher and his/her friends and so on.

And then we are supposed to just accept that their search results are close enough to our purpose 1 groups, that the human-created groups of purpose 1 are either unnecessary or impractical. Many, many very powerful people accept this now. I disagree, but to get these people even to consider an alternative will mean creating a prototype that must show proven and demonstrable usefulness.

Adding URIs is but one step. There is still the problem of creating a label that the human will understand, e.g.
http://id.loc.gov/authorities/subjects/sh95009708.html is incomprehensible to a human and it still must have something that a human can understand, for instance "Militia movements", but such a label could take many other forms. There also needs to be a way (many ways!) for people to easily find this URI/Label among the zillions that exist.
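
As a rough sketch of the URI/label pairing: the machine works with the URI, the human sees a label, and the human finds the URI by searching any of its label forms. The preferred label comes from the example above; the alternate forms and the function names are invented for illustration and are not taken from the authority file.

```python
# One URI with a preferred label plus alternate forms a searcher might type.
# The alternates are hypothetical.
LABELS = {
    "http://id.loc.gov/authorities/subjects/sh95009708.html": {
        "preferred": "Militia movements",
        "alternates": ["Citizen militias", "Paramilitary militia movements"],
    },
}

def label_for(uri):
    """What the human sees: the preferred label, or the bare URI if none is known."""
    entry = LABELS.get(uri)
    return entry["preferred"] if entry else uri

def find_uris(text):
    """How the human gets there: match the typed text against every label form."""
    wanted = text.strip().lower()
    matches = []
    for uri, entry in LABELS.items():
        forms = [entry["preferred"]] + entry["alternates"]
        if any(wanted in form.lower() for form in forms):
            matches.append(uri)
    return matches

print(label_for("http://id.loc.gov/authorities/subjects/sh95009708.html"))
print(find_uris("citizen"))
```

A text label is only one option here; the value stored under "preferred" could just as well point to an image or a sound clip.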

Alphabetical arrangement or asking a librarian can still be allowed but of course, can no longer be seen as ultimate or primary solutions.

Much of the problem, I think, is to work on the visualization of the search results. Our search results haven't changed their look in substantial ways since the introduction of the card catalog. Sure, there are full displays, MARC displays, brief displays, but I have seen no real work on visualizing the search result. Aquabrowser doesn't work for me.

Lots of attempts are being made in other fields right now (such as statistics) to visualize the data in more interesting and more compelling ways.

Re: [RDA-L] Re: Our goal

Posting to RDA-L

On 1/30/2014 11:21 AM, Thomas Berger wrote:
<snip>
"Collocation" hat two slightly different meanings or purposes, as we all know:
1. Bringing together identical items (with respect to one property) at the same place
2. Arranging items (with respect to one property) in a meaningful manner
</snip>
In answer, I don't know how many people really understand this. I am glad you do, but the difference is vital to understand.

Purpose 1 is just as important as it has ever been, so that people really can find all materials on a concept, e.g. "WWI" or "Tolstoy, Leo" no matter what text may have been used. Of course, this is limited to the rule of three (now, the rule of one), or for subjects, 20% or more, and so on. So long as catalogers do their jobs properly, the catalog allows this just as well as it ever has. In this sense, catalogs and their purpose are unchanged. This is my opinion, and see what I write under purpose 2. Yet, we should recognize as a fact that if users wanted to access materials brought together by purpose 1, it demanded some real skills and required a well-trained reference librarian close at hand to solve problems.

Purpose 2, on the other hand, has completely broken down in the "new" information environment (although it is tough to call it "new" when it is decades old at this point). In fact, purpose 2 has broken down so completely and for such a long time that members of the public, and even many librarians, have forgotten purpose 1. When purpose 2 broke down, it didn't mean that people stopped finding information, but other tools became available that demanded much less of them than the tools traditionally provided by the library, and although the new tools did not provide purpose 1, they provided something people really liked: immediate access and ease of use.

Today when you bring up purpose 1, many say it never worked anyway, or it is obsolete, a pipe dream, or a folly to pursue. I have heard all of it. Today, many people, IT experts, and even many librarians do not see purpose 1 as a useful, achievable goal. That has huge consequences. My reply is that if we could get purpose 2 to work better today, people may begin to appreciate purpose 1 and start to want it.

Getting purpose 2 to work better is a tremendous undertaking, however, and it has almost nothing to do with changing cataloging rules or even formats.
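
The two purposes can be separated quite cleanly in a sketch. Purpose 1 is a grouping step that depends on authority work (here a small table of name variants), and purpose 2 is an arrangement step applied afterward to whatever groups exist. The variant table, the sample items and the choice of ordering are all illustrative assumptions.

```python
from collections import defaultdict

# Authority table: variant form (lowercased) -> established heading. Illustrative only.
VARIANTS = {
    "tolstoy, leo": "Tolstoy, Leo",
    "tolstoi, lev": "Tolstoy, Leo",
    "tolstoj, lev": "Tolstoy, Leo",
    "austen, jane": "Austen, Jane",
}

# (name as found on the item, title, year); sample data for illustration
ITEMS = [
    ("Tolstoi, Lev", "Voina i mir", 1869),
    ("Austen, Jane", "Emma", 1815),
    ("Tolstoy, Leo", "Anna Karenina", 1878),
    ("Tolstoj, Lev", "Anna Karenina", 1965),
]

def collocate(items):
    """Purpose 1: bring items together under one heading, whatever text was used."""
    groups = defaultdict(list)
    for name, title, year in items:
        heading = VARIANTS.get(name.lower(), name)
        groups[heading].append((title, year))
    return dict(groups)

def arrange(groups):
    """Purpose 2: present the groups in a meaningful order (heading, then date)."""
    return [
        (heading, sorted(groups[heading], key=lambda e: (e[1], e[0])))
        for heading in sorted(groups)
    ]

for heading, entries in arrange(collocate(ITEMS)):
    print(heading)
    for title, year in entries:
        print("   ", year, title)
```

Note that arrange() can be swapped for any other presentation without touching collocate(), which is the sense in which improving purpose 2 has little to do with the cataloging rules behind purpose 1.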