Cataloging Matters No. 17:
Catalog Records as Data
Hello everyone and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy. My name is Jim Weinheimer.
We hear that the problem with catalog records is that they are not data. This means that the records are meant primarily for display and, consequently, are of very limited use in the new information environments. What does this mean and, I ask, is it correct?
I have been criticised more than once for underrating, or simply not understanding, the importance of turning catalog records into data. In my own defense, I believe I do understand the concept of “data” and in fact, I think that I have recently discovered more precisely where the problem lies, and have also found a basic difference between the views of a cataloger and those of an information technologist. For many information technologists, the basic task is to turn the catalog records into data. Although that will not solve everything of course, it is the very first step because at least then our records can begin to work together with other projects on the web.
I question this assumption and in this episode, I want to try to explain why.
To begin though, I want to state that I appreciate that when information becomes data, it can be manipulated in new and interesting ways. Statistical information provides one of the best examples of these new possibilities. Statistics have always been some of the most boring and dreary information that people deal with, or at least they have been for me and those I have known. My statistics courses in college were always great places to catch up on my sleep, and genuine misery meant being handed a table of data and being told to work out graphs by hand. Out of all the subjects I have studied, I have always reserved a special enmity in my heart for anything having to do, even remotely, with statistics.
I know I am not alone.
This has changed in the new world. Let's consider a specific, very simple, example: a table of statistics containing the wheat production of each country in the world. If that table is printed on pieces of paper, it may be good for sitting on a shelf and allowing you to look up how much wheat was produced in India for a certain year, but not good for much else—except as an implement of torture aimed at unsuspecting, innocent students who then have to copy everything out on paper and spend hours calculating everything manually, having their mistakes pointed out gleefully, starting all over again, running out of paper..... It is just too painful to go on. To get anything useful out of the tables printed on paper demands significant amounts of manual work that is the definition of tedium. For me.
With the introduction of computers—and especially personal computers—if that same table is put into a spreadsheet format that a program such as Microsoft Excel can open, then you can run all kinds of statistical functions on it, create charts and so on. You still need to know what you are doing, but most of the tedium is eliminated. If you then put that file on the web, not just as a scan of the printed page in jpeg or pdf format, but in a format that can load into some kind of statistical program, then others can use your information much more easily. If you also code it in certain standardized ways, then your table can be included in with all kinds of tables that others have shared, or even with entirely different services that people have built and shared. For instance, allowing your table of wheat production to be displayed in Google Maps can be a new and highly effective way to display your data to people who want to know about wheat production but understand nothing about statistics. In this way, your table can contribute to the general advancement far more effectively than if it remained in paper on a shelf, in an Excel file on your machine, or in a jpeg image on the web.
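To make the spreadsheet point concrete, here is a minimal sketch in Python of what becomes possible once the wheat table is machine-readable. Everything in it is an assumption made for illustration: the column names, the function names, the fictional countries, and above all the figures, which are invented placeholders and not real production statistics.

```python
import csv
import io
import statistics

# Placeholder table invented for illustration only. The countries are
# fictional and the numbers are NOT real wheat production statistics.
WHEAT_CSV = """country,year,wheat_tonnes
Atlantis,2010,100
Atlantis,2011,120
Ruritania,2010,80
Ruritania,2011,60
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting the numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "country": row["country"],
            "year": int(row["year"]),
            "wheat_tonnes": float(row["wheat_tonnes"]),
        })
    return rows


def total_by_country(rows):
    """Sum production across all years for each country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["wheat_tonnes"]
    return totals


def mean_for_year(rows, year):
    """Average production across countries for a single year."""
    return statistics.mean(r["wheat_tonnes"] for r in rows if r["year"] == year)


rows = load_rows(WHEAT_CSV)
totals = total_by_country(rows)
mean_2010 = mean_for_year(rows, 2010)
print(totals)
print(mean_2010)
```

The same few lines would work on a table of any length, which is exactly the tedium that the paper version forces you to do by hand.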
Of course, this is an idealized scenario and there are a whole host of pitfalls, primarily centered on issues of consistency, but nevertheless we can see this happening with Google Public Data Explorer, where many agencies now provide their statistical information coded in special ways. You don't have to be a statistician to manipulate that data and get meaningful information out of it. If I were in school today, I might not hate statistics as much as I did.
Similar events are taking place in other fields, such as archaeology. Take a look at the Pleiades site and it is amazing. In the transcript, I have a link to their page for Rome, or as it is called on the site, Roma http://pleiades.stoa.org/places/423025. This page brings information together from all over the web: photos from Flickr, Google Maps, original sources, links into Wikipedia, Worldcat and Unesco. Of course Rome has an immense amount of archaeological information attached to it, and this site is relatively new so it is still lacking most of what is available. Lots of work remains on this page but—I can't restrain myself: Rome wasn't built in a day!
These are only two examples of the powerful developments taking place and the great promise of the web when people share their data. If we are to make these tools work coherently, the key is to turn the information that at one time was just on the printed page into data that can be manipulated by various machines. It only makes sense that libraries should do something similar with their catalog records.
Unfortunately, much of the information in our catalog records is not coded in a way that is very friendly for these purposes. First of all, nobody uses the raw MARC21 format except libraries, and the format must be changed if it is to be made useful. Whether that format is in XML or RDF or whatever, so long as our records are coded with MARC field and subfield tags, they are essentially locked out of these new developments. Although there are some good points about our format, anyone who wants to work with it quickly finds that it presents some terrible difficulties. I don't want to get into all of the details here, but for those who are interested, I provide a link to a very good overview by Alexander Johannesen, “MARCXML, Beast of Burden” http://shelterit.blogspot.it/2008/09/marcxml-beast-of-burden.html.
Therefore, I think I do understand the situation and the potential value of turning our records into data. The real problem lies elsewhere.
Catalogers have always claimed that they are the ones who determine the “metadata,” while the IT people claim that role for themselves. The difference can be seen in the following excerpt from a MARCXML record that is in the transcript (http://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml). It is very simple. I give just the title information for Carl Sandburg's Arithmetic coded in MARCXML. A cataloger who does not know XML at all still has no problem understanding it:
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Arithmetic /</subfield>
<subfield code="c">Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</subfield>
</datafield>
Where is the data and where is the metadata here? The IT person will say that the coding
<datafield tag="245" ind1="1" ind2="0">
is the metadata, and furthermore that it is incomprehensible. What is this “a” code and what does 245 mean? And when you look at the data itself, it is all messed up. Look at the stray punctuation floating around: there is that slash at the end of subfield a of the 245 datafield, and when you look at the rest of the MARCXML record, there are all kinds of punctuation marks scattered everywhere. For instance, look at that semicolon floating in the middle of the 245 subfield c. It apparently means something, so why isn't that bit coded?
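The IT person's complaint can be seen by trying to read the excerpt with a program. What follows is a rough sketch in Python, not a real MARC library: the lookup table and the function names are hypothetical, the closing datafield tag is supplied so that the fragment parses on its own, and the namespace that a full MARCXML record may declare is left out. The punctuation handling is deliberately naive.

```python
import xml.etree.ElementTree as ET

# The title field from the excerpt above, closed so it parses standalone.
FIELD_245 = """
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="a">Arithmetic /</subfield>
  <subfield code="c">Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</subfield>
</datafield>
"""

# Hypothetical lookup table for this sketch: a program has to be told
# what the codes mean, because the markup itself does not say.
LABELS_245 = {"a": "title", "c": "statement_of_responsibility"}


def read_245(xml_text):
    """Return the subfields of a 245 field keyed by a readable label."""
    field = ET.fromstring(xml_text.strip())
    result = {}
    for sub in field.findall("subfield"):
        code = sub.get("code")
        result[LABELS_245.get(code, code)] = sub.text
    return result


def strip_display_punctuation(value):
    """Naively drop a trailing ' /' or final '.' meant only for display."""
    return value.rstrip(" /.")


parsed = read_245(FIELD_245)
title = strip_display_punctuation(parsed["title"])

# The semicolon separates the author from the illustrator, but nothing in
# the coding says so; the program can only guess by splitting the text.
responsibility_parts = [
    part.strip(" .")
    for part in parsed["statement_of_responsibility"].split(" ; ")
]
print(title)
print(responsibility_parts)
```

Notice that the program needs an outside table to learn that “a” means title, and it has to guess at the meaning of the slash and the semicolon. A rule as crude as this one would also mangle a title that legitimately ends in a period.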
I understand and sympathize with all of this, but to me it is beside the point. What I want to emphasize here is that for the cataloger—and I believe also for the public—the information found in a catalog record—what the IT people call data—is not data—it is metadata. I have seen the word metacontent used for this information but it seems as if it has not come into general use. The distinction is important to understand however. What the IT person claims to be data, the cataloger claims is actually metadata. I realize this goes against the information science idea of metadata, but please continue to follow me because I think it is very important.
As I have pointed out several times, people come to the library to use the books, the articles, the films and the other materials the library contains. They do not come to use the library catalog; they use the library catalog to the minimum that they can: they use it only as a way to get into the materials they want. In this sense, a library catalog is almost like a series of signposts: I want to drive to Paris, or to downtown Main Street, and signs lead me in different directions. I need some groceries and I follow the signs that lead me to different places where I can find a grocery store.
I definitely need the signs, and yet my interest is always in getting downtown or buying the groceries, not in the signs themselves. After I do what I want, I no longer need any of those signs until the next time. By then, I may know the way so well that I no longer need the signs at all, or at least I think I don't need them.
Catalog records are very similar in their ultimate purpose to these signs. They are just far more sophisticated, organized, and focused than other signs we see; plus, they provide a necessary inventory control mechanism for the library.
Let's imagine someone who comes to the library because he or she is interested in learning about a topic, let's say the War of the Spanish Succession. The information, the data, they want is not in the library's catalog, but in the library's collection: that is, inside the books and articles and other materials that contain information about the War of the Spanish Succession. The catalog leads them in the different directions where they should go to get that information, but a person interested in the War of the Spanish Succession gets practically no information about that topic from the catalog alone. The data that the public wants is inside the collection. The public would love to be able to manipulate the information inside the books, the articles and other materials about the War of the Spanish Succession, so that a computer could summarize it for them, map it, compare, contrast, and do whatever. They would like that because then they really would be manipulating the “data” that interests them.
Now, let's consider catalog records from the point of view of someone interested in the War of the Spanish Succession. Which parts of the information in the catalog records would they be interested in manipulating to learn more about the War? Remember, they are interested in the War. The number of pages? The ISBNs? The publishers? The publication dates? The titles? The authors? The call numbers? Maybe the call numbers because at least then they could get to the data that interests them.
How would manipulating any of this information help someone understand the War of the Spanish Succession? Clearly it wouldn't. Manipulating this information could help someone learn about the publication history of materials on the topic. For instance, someone could learn which authors have written the most on the topic, what are the newest materials, the original materials; manipulating subjects could allow people to know there are subtopics, related topics and so on. But catalogs allow almost all of that now and have for quite some time.
How does this compare to what we discussed earlier about statistics? Let us now imagine a statistic, a number such as “400”. Alone, this number is meaningless and requires a batch of additional information, that is metadata, to make it meaningful: is this statistic in thousands or millions? In tons or hectares or bushels? Is it about the number of fish or number of births or amount of arable land? With enough metadata, this data (the statistic 400) can be manipulated in many useful ways as we can all see in tools such as Google Public Data Explorer.
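The point about the bare number can be sketched in a few lines of Python. The class, its field names, and both example records are hypothetical, made up only to show how the same “400” turns into different facts depending on the metadata attached to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statistic:
    """A number plus the metadata needed to interpret it (hypothetical model)."""
    value: float
    measure: str      # what is being counted
    unit: str         # tons, hectares, bushels, persons...
    multiplier: int   # 1 for units, 1000 for thousands, and so on

    def absolute(self):
        """The value expressed in single units."""
        return self.value * self.multiplier


def describe(stat):
    """Render a statistic as a readable phrase."""
    return f"{stat.absolute():,.0f} {stat.unit} ({stat.measure})"


# The same bare number, given two different meanings by its metadata.
# Both records are invented examples, not real figures.
wheat = Statistic(value=400, measure="wheat production", unit="tonnes", multiplier=1000)
births = Statistic(value=400, measure="births", unit="persons", multiplier=1)

print(describe(wheat))
print(describe(births))
```

In both records the data is still just 400; everything else is metadata that exists only to make that number usable.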
Why does the manipulation work with the statistic “400”? Because the “400” is the data, that is, it is the information that people want, while the information in a catalog record is a different type of information that leads people to what they want. For the catalogers, this information is clearly metadata because they have worked personally with the actual “data”, that is, they held the book in their hands, or the journal or map or video, and created the catalog record from it. This is also the way it appears to the public, which is what is really important.
What I have outlined should show that the purpose of a catalog record is not to serve as data for manipulation, although it can be manipulated and we should give it a try to see what happens. For the public however, the purpose of catalog records is to direct them to the information they want. Once they have reached the place where they can find the materials they need, they no longer need the catalog record. This is completely different from the example of the statistic 400, which is the data they want.
So, it seems to me that if people can manipulate catalog records like they can manipulate Google Public Data or something similar, it will probably make very little difference to them because it is not the data that they want. It would be like when I need groceries, I could have a tool that would manipulate all the signs that tell me how to get to the grocery stores. That might be OK but even at the end of all of that, I still wouldn't have my groceries and I wouldn't be any closer to getting them. But, if I could manipulate the information about the actual groceries themselves, how much they cost, their quality, their availability, compare them all so that I could measure the value versus the ease of getting them, that would be information I would want, but I could get relatively little from manipulating the signs that direct me to the grocery stores.
Of course, this does not mean at all that the catalog record is of little value—on the contrary, directions to get to information are just as important as the information itself. Our society would quickly disintegrate if there were no signs, or if the signs were bad. Remember the Apple Maps disaster a few months ago where people in Australia used the application for Apple Maps, were sent the wrong way in the desert and almost died? http://www.cbronline.com/blogs/cbr-rolling-blog/apple-maps-a-danger-to-life-say-australian-police-101212
We shouldn't kid ourselves: the same thing happens in other types of searching thousands of times every single day. Reference librarians see it when students panic because their normal searching strategies stop working. People are forever ending up in the same informational dead-end as those folks in the Australian desert. The failure is just not so obvious when people find themselves in various types of “filter bubbles” or they are overwhelmed and just accept whatever they can find, or give up and decide to watch the latest cat video. If people can't find what they want, if all they can find is junk, or if they don't know about the information in the first place, it doesn't matter if your information or your product is the best in the world. It may as well not even exist. Signs are very important.
In this sense, I believe that many librarians are not seeing the real value of their catalogs. Many think it is in the “data” it contains, but obviously I believe it is something else. The catalog is the doorway into the collection. Without a catalog, you may be able to enter the collection but you remain half-blind and mostly powerless. Without a collection, the catalog is meaningless and becomes a curious document to be placed into some other collection. On the other hand, a collection can be vastly enhanced by a great catalog. Therefore, I believe a collection and its catalog are inseparable. I don't see how the digital world has changed this in any way.
In my next podcast, which I plan to put out in the next couple of weeks, I want to discuss a bit more what the library catalog could provide that is unique, and what I think would be valued by the public.
The music to close this episode is a famous piece from Vivaldi, his Mandolin Concerto in C Major. He wrote it in 1725 in Venice. Incidentally, the story of the discovery of Vivaldi's manuscripts, now housed in the Turin National Library, is fascinating. I provide a link to the story. http://www.classicfm.com/composers/vivaldi/guides/trail-vivaldi-manuscripts/
And, in 2010 another concerto of his was found in Edinburgh. Libraries really do come in handy! http://www.bbc.co.uk/news/uk-scotland-edinburgh-east-fife-11491307
This is the first movement. You can listen to the entire concerto on YouTube.
That's it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.