Cataloging Matters #11:
Open Archives, pt. 2
Hello everyone. My name is Jim Weinheimer and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy.
In this episode, I want to continue with part two of my discussion about Open Archives. I intend here to concentrate on some of the technical aspects of how to get these materials under control, primarily from the cataloger’s viewpoint.
In the first part of my discussion on Open Archives, I spoke in more general terms, and perhaps most people already knew much of that, but I believe it is necessary to emphasize the importance of the materials in the open archives. These materials are different, in the sense that many of them have not gone through pre-publication peer review, or they may differ in several other ways, but these facts do not detract from their importance. They are highly important to our communities almost by definition, since they have been created and stored by scholarly institutions, some of them great and often including our own home institutions, all at great trouble and expense. Once we accept that, libraries cannot afford to ignore them. As I have said elsewhere, if libraries ignore the materials produced by their own communities, it should not be so surprising when those same communities begin to ignore libraries.
In the article “Academic Libraries and the Struggle to Remain Relevant: Why Research is Conducted Elsewhere” by John Law of ProQuest, the author discusses the results of a project researching how academic patrons search. After discussing library catalogs, the myriad of databases, each with its searching peculiarities, and the real problems of Google Scholar, he writes: “Clearly, the desire among academic researchers is exceptionally high for credible, relevant results that can be refined to show only full-text resources.” This seems to me to be precisely what the open archive initiative is supposed to supply. http://www.serialssolutions.com/assets/publications/Sydney-Online-2009-John-Law.pdf
This is why I consider that, in library terms, the materials placed in open archives have already gone through the process of selection by their respective communities; there is no need to order anything, so the next step in the process that everyone is waiting for is description and organization, otherwise called cataloging.
So, how do we catalog these materials? For those who listened to part one and remember, I used the term “exponential growth” when describing open archives and mentioned that open archives already hold around 9 million items. While I’ve seen some pretty big backlogs, I’ve never seen anything nearly that big! Of course, these are only the open archives that are registered, and not all are registered. There are also many wonderful sites floating around on the web that are not in open archives, but I’m not dealing with those at the moment, only with the materials in open archives.
In many ways, I think the open archive initiative has taken us all back in time to the beginnings of journals. The librarians and publishers of long ago understood as well as we do today that most people want individual articles out of journals, and not the journals themselves. Back then, a journal would sometimes provide an index in their final issue of the year, so that people wouldn’t have to go through each and every issue, and then to make it easier to find articles, some began to cumulate these annual indexes every 5 years or so, and eventually some even cumulated the cumulations. It turned out however that even with all of these cumulations, people still complained about doing all that work for each journal.
What did the librarians do? They, too, quickly learned that, although it was what their patrons wanted, cataloging each article of each journal was an impossible task. William Poole got the brilliant idea of creating an index of the articles from a bunch of journals, printing it, and selling the publication to libraries. In the transcript I use the miracle of the internet to give a link to an early edition from 1853. That edition has a preface where Poole very clearly describes the situation of the mid-19th century, and it mirrors today’s reality almost exactly. Others may find his comments useful. http://books.google.com/books?id=yO9GGjPbPjYC&printsec=frontcover#v=onepage&q&f=false
It turned out that Poole’s solution made everybody happy: he and his publishers made money; the librarians could buy the index, and a nice addition was that librarians had a general guideline of which journals to buy, since if a journal happened to be in Poole’s index, it was a good reason for them to buy that journal. Based on that fact, journal publishers naturally wanted records of their articles in Poole’s index. The index made the catalogers happy since they had, in effect, outsourced one of the most difficult parts of the collection. And the patrons were happy, that is, once they learned that if they wanted an article, they had to look in several places: first, they had to find Poole’s index, where they could see in which issue of which journal the article appeared; then they had to go over to the library catalog to see if the library had the journal issue and where it was shelved; and then, finally, they had to go into the stacks. Some patrons never learned this vital skill, and even the ones that did nevertheless did not like it much. Many never really understood why they had to look in so many different places and why it couldn’t just all be in the catalog.
Today with open archives, as we have seen, there are a huge number of articles and repositories, and their numbers are growing all the time. Just as before, there are not enough librarians to catalog them all, and the work will have to be outsourced in some way, as indeed, it already has been. But now we run straight into the biggest problem: nobody wants to pay anyone to index these materials. As I mentioned in part one, the idea of open archives is not to make money, but to save money. So far, we have lacked the genius of a modern William Poole who, if he were with us today, might have figured out by now how to make money indexing those open archive materials. The remarkable Faculty of 1000 site I mentioned in part one may prove to be a starting point. But however that turns out, it’s clear that our traditional methods and solutions are broken. Therefore, we must consider matters anew, and seriously: do we have any advantages today that our predecessors didn’t have?
As I already mentioned, open archives include a requirement for an associated metadata record created by the authors (or whoever it is that adds the item to the open archive). These records are then made available in such a way that bigger databases can “harvest” them, i.e. take copies of the records into their own databases so the searcher doesn’t have to search the thousands of open archives separately.
The process of harvesting metadata from open archives is normally not much of a problem. There are various tools you can use: for instance, MarcEdit will do it http://people.oregonstate.edu/~reeset/marcedit/html/index.php, and there are other tools as well, http://www.openarchives.org/pmh/tools/tools.php; even your web browser will do it, although it’s not the most efficient tool. You just need the link; then you select the format you want to take (sometimes you can also narrow the query in various ways, by date or subject), and just start in.
For those who want an example, you can do your own harvesting for a series of records in An American Ballroom Companion: Dance Instruction Manuals, ca. 1490-1920, comprising around 200 records in American Memory. In the transcript, I provide links for harvesting these records using the formats OAI-DC (a special form of Dublin Core), MODS and MARC21, and you can do it yourselves. It takes a moment; remember, you’re downloading over 200 records, so have some patience!
If you examine the links, they are all the same except for the “metadataPrefix”, which defines the format of the record you want the computer to serve you: oai_dc, mods, or marc21.
Naturally, once you get the records, you then need some kind of repository in which to place them. There are lots of options for that too, many of them “free” open source options such as Drupal, but I won’t talk about any of that here.
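For those reading the transcript, here is a minimal sketch in Python of how such links are put together. The base address below is a made-up placeholder, not the American Memory address, and the set name "dance" is likewise only an assumption for illustration; the parameter names verb, metadataPrefix, set and from are the ones defined by OAI-PMH.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; substitute the base URL
# published by the repository you actually want to harvest.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix, set_spec=None, from_date=None,
                     base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix chooses the record format (e.g. oai_dc, mods, marc21);
    set_spec and from_date optionally narrow the harvest.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# The three links differ only in the metadataPrefix value.
for prefix in ("oai_dc", "mods", "marc21"):
    print(list_records_url(prefix, set_spec="dance"))
```

Pasting one of the printed links into a browser (with a real base address) is the "web browser" method mentioned above; a harvesting tool simply automates the same request.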
There are problems with metadata harvesting, however. Since you are taking copies of the records, those copies somehow have to be kept coordinated with the originals. The moment a record in the original database is updated, yours becomes obsolete. New records made in the original repositories need to wait until you harvest them and put them into your repository. With harvesting, your database will always be behind. In practice, conversions from one format to another are also often unavoidable, and information may be lost in the process. Many times, too, the outside archive will have additional powers for search and display that your repository does not have. We’ll discuss an example at some length later in this program.
Harvesting is not the only option if you want to work with open archives; there are tools such as APIs which query the live database, and often allow searchers the option to work with the original database in various ways. There is also my own method which I also won’t discuss right now. There are many, many options available.
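To make the coordination problem concrete, here is a small sketch in Python. It assumes, purely for illustration, that you have kept the datestamp of each record you harvested, and that you can obtain the current identifier and datestamp of each record at the source (OAI-PMH record headers carry both). The function name and the sample identifiers are invented.

```python
def plan_refresh(local, remote):
    """Compare local copies with the source repository.

    Both arguments map a record identifier to an ISO date string
    (YYYY-MM-DD), which sorts correctly as plain text.
    """
    new = sorted(i for i in remote if i not in local)
    stale = sorted(i for i in remote
                   if i in local and remote[i] > local[i])
    gone = sorted(i for i in local if i not in remote)
    return {"new": new, "stale": stale, "gone": gone}


# Invented sample data: what we hold versus what the source now reports.
local_copies = {"rec1": "2010-01-05", "rec2": "2010-02-01", "rec3": "2010-02-10"}
source_now = {"rec1": "2010-01-05", "rec2": "2010-03-15", "rec4": "2010-03-20"}

print(plan_refresh(local_copies, source_now))
```

Here rec2 has been updated at the source since we copied it, rec4 is new, and rec3 has disappeared; until the next harvest, our database is out of step on all three.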
My current thinking is that open archives eventually will become specialized by topic, instead of the generalized ones we have now, based primarily on individual organizations. Specialized open archives will be much like the high-energy physics archive at Cornell and the E-LIS archive I mentioned in part one.
Open archives specialized by topic would mirror the history of web site creation. For those who can remember the early days of the World Wide Web when every institution was frantically creating its own web pages, it turned out that the websites of those organizations almost always mirrored the internal bureaucratic hierarchy within an institution. This happened because the websites were made by internal staff with the purpose of ensuring a “web presence” of the specific division or department. Some overall webmaster then collected the links to all of the divisions and departments into a single, overall page. Search engines were highly rudimentary at that time, and therefore, to find information, a searcher had to navigate the internal bureaucracy of an organization, through the divisions, departments, projects, and so on.
I can say that this really did seem logical at the time, but slowly, it dawned on website creators that the true logic of a site on the World Wide Web is to appeal to the greatest portion of the public as possible, and this meant people who had no idea of the internal workings of your organization. This kind of website was compared to a closed intranet, a distinction that took me some time to understand. Lacking this internal knowledge, outside patrons could almost never find the information they wanted, no matter how hard they tried.
As a result of this change in focus, today the information architect concentrates on the person who doesn’t know anything about the internal peculiarities of an organization, and so organizes the site to help those people find the information they want. In a similar way, it seems to me that this could easily be the direction that open archives could evolve: open archives specializing by topic would seem to be what most people want, and the totality of what an individual institution creates is much less useful. This would also be similar to people wanting journal articles over journals. All of this would probably make searching easier, but it may very well turn out that I am proven wrong, since after all, that is the nature of prediction.
Another, much more serious problem is that there are almost no standards for the metadata records in the open archives. One obvious problem is with formats. There was a great solution--I thought it was anyway--in the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) http://www.utsc.utoronto.ca/~chan/oaindia/presentations/OAI_PMH.pdf. I won’t discuss it here, but I provide a link in the transcript. What happened was that Google unfortunately decided not to use it, in favor of the much simpler site maps. http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html. If Google had accepted OAI-PMH, it would have been a great shot in the arm for the protocol. What the impact of Google’s decision will be remains to be seen.
In this program, I do not want to focus on formats. While they are a very important issue, I do not believe formats are where the underlying problems lie; I prefer to discuss the quality of the information within the metadata record. This is where I have heard the quality described as I mentioned before: “Pretty good.”
Problems with Harvesting
Harvesting metadata records, if it is to be useful, must be considered in the context mentioned earlier. I like to visualize it as being in a large office building with hundreds of offices, each with a separate file cabinet. To do a thorough search of the documents in the office building, people have always had to go to every office and look through all of those hundreds of file cabinets. If we are to make the searchers’ task easier, one solution is to bring all of the file cabinets into one room so that they can be searched together. But after you do this, you find that this is not enough because you still have to search hundreds of separate file cabinets--all that has changed is that you no longer have to walk all over the building. While your feet may have it easier, the searching itself is not any easier at all since you still have to search each file cabinet separately. To make searching easier, it would need to be one search, and that means merging all of the files in all of the file cabinets into a single file. That’s a lot more work than just getting a few workmen to drag a bunch of file cabinets into one big room!
It means in turn, that all of the idiosyncrasies that each person used in his or her own file cabinet have to be ironed out so that people can find materials by UN, ONU, United Nations, non-Roman materials, and all of the other details that I am very happy I do not need to explain here because my audience is catalogers and they understand these matters very well.
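For readers of the transcript who like to see such things spelled out, here is a rough sketch in Python of the "ironing out": several variant forms are led to one preferred form. The tiny table below is made up for illustration and is not taken from any real authority file; a real system would load its variants from one.

```python
import unicodedata

# Illustrative table only: preferred heading -> variant forms.
AUTHORITY = {
    "United Nations": ["UN", "ONU", "U.N.", "Organisation des Nations Unies"],
}


def _key(text):
    """Reduce a heading to a comparison key: no accents, case or punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in stripped.casefold() if c.isalnum())


# Build one lookup from every known form to its preferred heading.
LOOKUP = {}
for preferred, variants in AUTHORITY.items():
    for form in [preferred] + variants:
        LOOKUP[_key(form)] = preferred


def preferred_form(heading):
    """Return the preferred heading, or None if the form is unknown."""
    return LOOKUP.get(_key(heading))


for form in ("UN", "onu", "u.n.", "United Nations", "Unesco"):
    print(form, "->", preferred_form(form))
```

The unknown form falls through as None, which is exactly the case that needs a cataloger's attention; the lookup only handles what someone has already recorded.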
Whatever methods you decide to employ to solve this situation, everything will also need to be maintained.
Therefore, although metadata harvesting is very important, it is only one step toward a solution. We can imagine everything in one location (that is, one room or one database), but the searching itself is just as difficult as it has ever been, if there is no standardization. Without standardization, at least on some level, whenever we create an open archive repository, we wind up creating our own little Google, where we are simply hoping that full-text algorithms of the Google-type will solve the problems in deus ex machina fashion. I do not share that faith and believe we need some kind of standardization.
Lacking a miraculous solution, it isn’t as if we are left with nothing at all: there are still all those metadata records for each item, and the question becomes: how can that metadata be standardized?
Before we discuss possible solutions we need to get some idea of the nature of the beast. Let’s examine a practical example in one of the biggest Open Archive harvesters: OAIster.
A very incomplete analysis of OAIster
Before I begin my analysis, please believe me, since I am being very sincere: it is not my purpose here to criticize any initiative; I am a big fan of all of them. My purpose here is to try to show how important the task of cataloging is, and to show how well-trained, professional catalogers could help identify and solve some of the problems.
Here is a document I found at random:
Consumption inequality and partial insurance / by Richard Blundell, Luigi Pistaferri, and Ian Preston.
To sum up what follows, after an examination, I discovered there are different versions of this document, at least from 2003 through 2008, when it was published in The American Economic Review, and available in JSTOR for those with a subscription. Let me describe in more detail what I found.
When I do a search in OAIster for this document, I find what appear to be duplicate records, but on further analysis, it turns out that these are records for different versions, and only some of the records allow access to the actual document. (I ask myself: What happened to the Open part of the Open Archives?) As I examine the metadata in these records, I find that most give the authors’ names in citation format, i.e. surname plus initial, but one record has their names as they appear in the document.
It turns out that all the authors are in the NAF. The forms of the first two, Blundell and Pistaferri, match their NAF forms but the third lacks dates, since the NAF form is Preston, Ian, 1964-. I conclude that no one consulted the NAF. The match of the first two names with the NAF is purely coincidental since their NAF forms are the same as in the document. http://oaister.worldcat.org/title/consumption-inequality-and-partial-insurance/oclc/2390295190605&referer=brief_results
When we begin to examine the documents themselves, matters become more complex. In the 2003 version of the document, that is, one of the versions that is open, the authors mention in a note that it is a version of an earlier paper “Partial insurance, information and consumption dynamics”, also available in the open archive. This is not mentioned in any of the records. (http://oaister.worldcat.org/oclc/2390295663776)
Further examining the documents, we find abstracts along with the keywords consumption, inequality, and insurance; that is, words that are rather useless for searching purposes since they are taken directly from the title “Consumption inequality and partial insurance”. I conclude these keywords were assigned either by the authors or someone who had no interest in subject analysis. http://discovery.ucl.ac.uk/2854/1/2854.pdf
I discover that these records came from the open archive at the University College London and decide to search that archive separately. I find some interesting details, http://discovery.ucl.ac.uk/cgi/search/advanced?screen=Public%3A%3AEPrintSearch&_action_search=Search&_fulltext__merge=ALL&_fulltext_=&title_merge=ALL&title=Consumption+inequality+and+partial+insurance&creators_name_merge=ALL&creators_name=&editors_name_merge=ALL&editors_name=&abstract_merge=ALL&abstract=&divisions_merge=ALL&date=&satisfyall=ALL&order=-date%2Fcreators_name%2Ftitle. While there is no consistency among the records, we see that they contain additional information not in OAIster: none of the OAIster records for these documents have any subjects but in UCL, some records have subjects, yet once again, there is no consistency: some have no subjects, and others have differing subjects. One of the records has what appears to be real subject descriptors: http://eprints.ucl.ac.uk/15896/: LIFE-CYCLE EARNINGS, COVARIANCE STRUCTURE, TAX-REFORM, PANEL DATA, INCOME, HETEROGENEITY, DYNAMICS, WELFARE, UNCERTAINTY, VARIANCE. These terms appear to be authorized forms but I don’t know where these terms come from. Perhaps the EconLit thesaurus would be a good bet.
Again, none of the OAIster records concerning this document have any subjects at all and it appears that OAIster has decided not to harvest the keywords, probably because of the consistency concerns mentioned earlier. See how all of this recreates the scenario I described before? Bringing together hundreds of file cabinets into a single room saves the leather on your shoes, but it doesn’t make the searching itself any easier because you still have to make lots of separate searches.
This doesn’t end our analysis and in fact, it may actually just be starting. When you discuss searching today, everything naturally must be compared with Google, and in our present case, we find the same article in Google Scholar: http://scholar.google.com/scholar?hl=en&q=Consumption+inequality+and+partial+insurance+blundell&btnG=Search&as_sdt=1%2C5&as_ylo=2007&as_vis=1. This has the normal link going to the restricted version at JSTOR, but in the right hand column, there is a link that goes to one of the free versions hosted at the University College London, the one dated Sept. 2003. This is very nice and handy for the patrons but I do not know why this version is singled out.
Yet the Google “metadata record” has something more that I find very impressive: a link going to different versions. If you click on the link labelled “All 46 versions”, you find many, many, many more versions of this article, including the one published in 2008 http://www.econ.upenn.edu/system/files/Blundell.pdf. Is this a legal copy? I don’t know; I don’t care. It’s available and that’s all that matters to me right now.
I confess that I have not looked at all 46 records, so I don’t know what else may be hiding there, but in any case, after this simple examination, the situation seems to this experienced cataloger at least, to be a bit chaotic. Don’t get me wrong: the materials are all great, it’s just very confusing to understand what exists, and if it’s confusing to me, I must assume that it would be just as confusing to non-specialists.
I did not go out of my way to find this example; it seems to be a normal record, and a normal level of metadata quality in a normal open archive. Can and should professional catalogers conclude that such a level of quality is “pretty good”?
As an aside, I can imagine that if anyone has been listening to my podcasts, they could be thinking at this point, “But with all these versions and the chaos you describe, you are actually talking about how we need FRBR! You’ve gone into long tirades over how you don’t agree with FRBR! How do you get out of that one?”
In my defense, what I have said in the past is that FRBR does nothing more than restate the traditional operations of the catalog--it just uses other terminology and posits a different structure by eliminating the unit record. It provides nothing new in the way of searching. The only “innovations” it introduces are in display based on works-expressions-manifestations-items, and even those displays are based on 19th-century models. I maintain that what FRBR intends is designed for librarians over our patrons, and thus is no real change from our current library catalogs. Therefore, I say that what FRBR calls “User tasks” are actually “Librarian tasks”. Librarians have to know exactly what exists so they can organize it for later retrieval--I don’t question that. My stance is that the catalog as it stands today allows all of this right now for librarians and the FRBR “user tasks” are not what the public either wants or needs. Consequently, creating catalogs with FRBR in mind ignores what the public wants.
In the case we have just examined, what would patrons want? Would they want a detailed browsable listing of the 46 or so variants at their disposal, or would that just be too confusing and too big of a pain? My own opinion--and it is an opinion although based on experience--is that patrons would probably be happy with almost any version they could get, and if given the choice, most probably would simply opt for the latest one they could have for free.
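That opinion can be written down as a very simple rule. The sketch below, in Python, is only an illustration of the rule "the latest version available for free"; the field names and the sample records are invented and do not describe the actual 46 versions.

```python
def pick_version(versions):
    """Return the most recent freely available version, or None."""
    free = [v for v in versions if v["open_access"]]
    if not free:
        return None
    return max(free, key=lambda v: v["year"])


# Invented sample records, for illustration only.
sample = [
    {"label": "A", "year": 2003, "open_access": True},
    {"label": "B", "year": 2006, "open_access": True},
    {"label": "C", "year": 2008, "open_access": False},
]

print(pick_version(sample)["label"])
```

With these sample records the rule passes over the restricted 2008 version and settles on the 2006 one, which is probably what most patrons would do for themselves.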
As I said earlier, it is not my intention here to point out faults; my purpose is much more positive: I want to demonstrate that harvesting is only one part of the solution, in some ways, the easiest part, and there are other options besides harvesting. All the while, it is important to keep in mind the traditional cataloging concepts, which remain completely valid today, although the traditional practices or techniques used by catalogers may end up in the garbage can.
Are there solutions toward improving this situation? I believe there are, but I think it is clear that solutions should focus not on creating quality in these metadata records, but on managing the quality. The unavoidable fact is that there will not be enough catalogers to create quality, therefore we can only try to manage it as best we can. Accepting this would represent a major shift in the viewpoint of the traditional cataloger. There are many ways to include cataloger-type controls in open archives as metadata managers, and the only limit is our imaginations.
I suggest thinking in terms of creating new tools with the purpose of providing help to catalogers: a useful tool could be one that included items from a local open archive into the main cataloging workflow automatically; tools that allow catalogers to upgrade records en masse; tools to find items that exhibit inconsistencies or other inadequacies in metadata, perhaps through statistical analyses that can be viewed graphically, so that inconsistencies could show as stray dots on a graph, or to point out where subject terms are absent or simply repeat what is in the title. Ontologies need to be built so that when patrons see a subject heading or descriptor in one open archive, they can be led to similar materials in other databases that use different thesauri or subject systems. How about a tool that allows corrections and updates by members of the public, while everything would be done under the watchful eyes of the catalogers. Perhaps a difference in procedures would help: additional descriptive work could be done retrospectively depending on whether a new item is a version of something already in the database. Perhaps if there are no versions, less work can be done originally.
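As one small illustration of such a tool, here is a sketch in Python that points out records where subject terms are absent or simply repeat what is in the title. The record layout (a dictionary with "title" and "subjects") is an assumption made for the example, not the format of any particular repository.

```python
import re


def title_words(text):
    """Lower-cased set of the words in a title or subject term."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def audit_subjects(record):
    """Classify a record by the usefulness of its subject terms."""
    subjects = record.get("subjects") or []
    if not subjects:
        return "no subjects"
    words = title_words(record.get("title", ""))
    # A term is "novel" if it contains at least one word not in the title.
    novel = [s for s in subjects if not title_words(s) <= words]
    if not novel:
        return "subjects only repeat title"
    return "ok"


title = "Consumption inequality and partial insurance"
print(audit_subjects({"title": title, "subjects": []}))
print(audit_subjects({"title": title,
                      "subjects": ["consumption", "inequality", "insurance"]}))
print(audit_subjects({"title": title,
                      "subjects": ["PANEL DATA", "INCOME"]}))
```

A report like this does not fix anything by itself; it gives the cataloger a worklist, which is the kind of managing of quality I have in mind.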
The watchword should be as before: to help our patrons navigate in the information universe, not to expect everything to be in a single standard, that standard being the one that “we” use, whoever the “we” happens to be. That would be an impossible task leading us to disaster.
The biggest change of all however, would come when librarians honestly consider the whole of the materials in the open archives as fundamental parts of their own collections, just as important as the books on their shelves or the databases to which they subscribe. After all, that’s how our patrons consider them. No single library can control all of those materials; it must be done on a truly cooperative basis.
Librarians must make something that people can use, and I think it should be done soon, since expecting our patrons to wait longer and longer will be tantamount to self-obsolescence and suicide, especially in times such as these. In his Preface to the 1853 edition of the Index to Periodical Literature that I mentioned earlier, William Poole wrote:
“To persons who have given but little reflection to the subject, there are few things which appear simpler than the compilation of a Catalogue or an Index; while those who have had experience in such labor well know that the undertaking is full of difficulties. If the preparation of this work had been delayed until a plan had been fixed upon that reconciled all objections, it would never have been commenced; or, if the labor had been continued until the work was satisfactory to myself, it would never have been presented to the public. My endeavor was to bring the contents of some fifteen hundred volumes into as narrow a space as possible. The ordinary plan of indexing periodicals was, under the circumstances, wholly impracticable.” http://books.google.com/books?id=yO9GGjPbPjYC&pg=PP13#v=onepage&q&f=false
Poole’s remarks describe very well the situation we are facing today. We also have to create something that will help our patrons and it doesn’t have to be perfect, we just need to make tools that are better than what people have today--just today! That’s all. Everyone understands that whatever we make will improve. Nobody expects perfection, but they do expect improvements. Poole improved his index, others took up his baton later, and libraries should follow his example.
And with this we come to the end of my discussion of Open Archives, so no one need worry that it will go on and on like my personal journey with FRBR. I hope you enjoyed it or at least found it interesting. If you have any suggestions for future podcasts, please let me know.
The music I have chosen to end this programme is Pandolfi Mealli’s evocative Violin Sonata Op.4 No.1 called "La Bernabea" http://www.youtube.com/watch?v=oYsbdlyAAMU. Pandolfi Mealli lived in the mid-17th century and very little of his music remains, just two sets of violin sonatas numbered provocatively 3 and 4. This is an excerpt with Andrew Manze on the Baroque violin and Richard Egarr, harpsichord.
That’s it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.