Cataloging Matters No. 17:
Catalog Records as Data
Hello everyone and welcome to
Cataloging Matters, a series of podcasts about the future of
libraries and cataloging, coming to you from the most beautiful, and
the most romantic city in the world, Rome, Italy. My name is Jim
Weinheimer.
We hear that the problem with catalog
records is that they are not data. This means that the records
are meant primarily for display and consequently, are of very limited
use in the new information environments. What does this mean and, I
question, is it correct?
I have been criticised more than once
for underrating, or simply not understanding, the importance of
turning catalog records into data. In my own defense, I believe I do
understand the concept of “data” and, in fact, I think that I have
recently discovered more precisely where the problem lies, as well as
a basic difference between the view of a cataloger and
that of an information technologist. For many information
technologists, the basic task is to turn the catalog records into
data. Although that will not solve everything of course, it is the
very first step because at least then our records can begin to work
together with other projects on the web.
I question this assumption and in this
episode, I want to try to explain why.
To begin though, I want to state that I
appreciate that when information becomes data, it can be manipulated
in new and interesting ways. Statistical information provides one of
the best examples of these new possibilities. Statistics have always
been some of the most boring and dreary information that people deal
with, or at least they have been for me and those I have known. My
statistics courses in college were always great places to catch up on
my sleep, and genuine misery meant being handed a table of data and
being told to work out graphs by hand. Out of all the subjects I have
studied, I have always reserved a special enmity in my heart for
anything having to do, even remotely, with statistics.
I know I am not alone.
This has changed in the new world.
Let's consider a specific, very simple, example: a table of
statistics containing the wheat production of each country in the
world. If that table is printed on pieces of paper, it may be good
for sitting on a shelf and allowing you to look up how much wheat was
produced in India for a certain year, but not good for much
else—except as an implement of torture aimed at unsuspecting,
innocent students who then have to copy everything out on paper and
spend hours calculating everything manually, having their mistakes
pointed out gleefully, starting all over again, running out of
paper... It is just too painful to go on. To get anything useful
out of the tables printed on paper demands significant amounts of
manual work that is the definition of tedium. For me.
With the introduction of computers—and
especially personal computers—if that same table is put into a
spreadsheet format that a program such as Microsoft Excel can open,
then you can run all kinds of statistical functions on it, create
charts and so on. You still need to know what you are doing, but most
of the tedium is eliminated. If you then put that file on the web,
not just as a scan of the printed page in jpeg or pdf format, but in
a format that can load into some kind of statistical program, then
others can use your information much more easily. If you also code it
in certain standardized ways, then your table can be included in with
all kinds of tables that others have shared, or even with entirely
different services that people have built and shared. For instance,
allowing your table of wheat production to be displayed in Google
Maps can be a new and highly effective way to display your data to
people who want to know about wheat production but understand nothing
about statistics. In this way, your table can contribute to the
general advancement far more effectively than if it remained in paper
on a shelf, in an Excel file on your machine, or in a jpeg image on
the web.
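For readers of the transcript, here is a minimal sketch of what "a format that a program can load" buys you, assuming Python and only its standard library. The table, the country names and every figure in it are invented purely for illustration; they are not real wheat statistics, and the function names are hypothetical.

```python
import csv
import io

# Invented figures for illustration only; these are NOT real production data.
WHEAT_TABLE = """country,year,wheat_tonnes
Country A,2000,100
Country A,2001,120
Country B,2000,50
Country B,2001,40
"""


def load_rows(text):
    """Parse the CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "country": row["country"],
            "year": int(row["year"]),
            "wheat_tonnes": int(row["wheat_tonnes"]),
        })
    return rows


def total_by_country(rows):
    """Sum production per country, the kind of sum once done by hand."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["wheat_tonnes"]
    return totals


if __name__ == "__main__":
    print(total_by_country(load_rows(WHEAT_TABLE)))
```

The same few lines would not work on a scanned jpeg or pdf of the printed page, which is the whole difference between a table that is data and a table that is only a picture of data.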
Of course, this is an idealized
scenario and there are a whole number of pitfalls, primarily centered
on issues of consistency, but nevertheless we can see this happening
with Google Public Data Explorer, where many agencies now provide
their statistical information coded in special ways. You don't have
to be a statistician to manipulate that data and get meaningful
information out of it. If I were in school today, I might not hate
statistics as much as I did.
Similar events are taking place in
other fields, such as archaeology. Take a look at the Pleiades site
and it is amazing. In the transcript, I have a link to their page for
Rome, or as it is called on the site, Roma
http://pleiades.stoa.org/places/423025.
This page brings information together from all over the web: photos
from Flickr, Google Maps, original sources, links into Wikipedia,
Worldcat and Unesco. Of course Rome has an immense amount of
archaeological information attached to it, and this site is
relatively new so it is still lacking most of what is available. Lots
of work remains on this page but—I can't restrain myself: Rome
wasn't built in a day!
These are only two examples of the
powerful developments taking place and the great promise of the web
when people share their data. If we are to make these tools work
coherently, the key is to turn the information that at one time was
just on the printed page into data that can be manipulated by various
machines. It only makes sense that libraries should do something
similar with their catalog records.
Unfortunately, much of the information
in our catalog records is not coded in a way that is very friendly
for these purposes. First of all, nobody uses the raw MARC21
format except libraries, and the format must be changed if it is to
be made useful. Whether that format is in XML or RDF or whatever, so
long as our records are coded with MARC field and subfield tags, they
are essentially locked out of these new developments. Although there
are some good points about our format, anyone who wants to work
with it quickly finds that it presents some terrible
difficulties. I don't want to get into all of the details here, but
for those who are interested, I provide a link to a very good
overview by Alexander Johannesen, “MARCXML, Beast of Burden”
http://shelterit.blogspot.it/2008/09/marcxml-beast-of-burden.html.
Therefore, I think I do understand the
situation and the potential value of turning our records into data.
The real problem lies elsewhere.
Catalogers have always claimed that
they are the ones who determine the “metadata,” while the IT
people claim that they are the ones who determine metadata. The
difference can be seen in the following excerpt from a MARCXML record
that is in the transcript.
http://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml
It is very simple. I give just the title information for Carl
Sandburg's Arithmetic coded in MARCXML. A cataloger who does
not know XML at all still has no problem understanding it:
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="a">Arithmetic /</subfield>
  <subfield code="c">Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</subfield>
</datafield>
Where is the data and where is the
metadata here? The IT person will say that the coding
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">
is the metadata and, furthermore, that it is
incomprehensible. What is this “a” code and what does 245 mean?
And when you look at the data itself, it is all messed up: look at
the stray punctuation floating around, such as that slash at the end of
subfield a of the 245 datafield. When you look at the rest of the
MARCXML record, there are all kinds of punctuation marks scattered
everywhere. For instance, look at that semi-colon floating in the
middle of the 245 subfield c. It apparently means something so why
isn't that bit coded?
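To make the IT person's complaint concrete, here is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module, that reads the excerpt quoted above. The lookup table of labels is hand-written for this example and is not an official schema, and the function names are hypothetical. The excerpt is used without an XML namespace, exactly as quoted; a full MARCXML file would normally declare one.

```python
import xml.etree.ElementTree as ET

# The 245 field exactly as quoted in the transcript (no namespace).
MARCXML_EXCERPT = """
<datafield tag="245" ind1="1" ind2="0">
  <subfield code="a">Arithmetic /</subfield>
  <subfield code="c">Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</subfield>
</datafield>
"""

# Hand-written for this example: the program cannot guess what "245" or "a"
# mean, so a person who knows MARC has to supply the labels.
HYPOTHETICAL_LABELS = {
    ("245", "a"): "title",
    ("245", "c"): "statement of responsibility",
}


def read_datafield(xml_text):
    """Return the subfield values keyed by a human-readable label."""
    field = ET.fromstring(xml_text.strip())
    tag = field.get("tag")
    labelled = {}
    for subfield in field.findall("subfield"):
        key = (tag, subfield.get("code"))
        labelled[HYPOTHETICAL_LABELS.get(key, "unlabelled")] = subfield.text
    return labelled


def strip_trailing_punctuation(value):
    """Naive cleanup of display punctuation left at the end of a value."""
    return value.rstrip(" /;:,.")


if __name__ == "__main__":
    record = read_datafield(MARCXML_EXCERPT)
    for label, value in record.items():
        print(label, "->", repr(value), "->", repr(strip_trailing_punctuation(value)))
```

Two things stand out. The tags and codes mean nothing to the program until someone supplies the labels, and the trailing slash and the semi-colon inside subfield c arrive as part of the text, so anyone reusing the values has to clean them up or split them by hand. The cleanup shown here is deliberately crude and would also strip a final period that belongs to an abbreviation.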
I understand and sympathize with all of
this, but to me it is beside the point. What I want to emphasize here
is that for the cataloger—and I believe also for the public—the
information found in a catalog record—what the IT people call
data—is not data—it is metadata. I have seen the word
metacontent used for this information but it seems as if it
has not come into general use. The distinction is important to
understand however. What the IT person claims to be data, the
cataloger claims is actually metadata. I realize this goes against
the information science idea of metadata, but please continue to
follow me because I think it is very important.
As I have pointed out several times,
people come to the library to use the books, the articles, the films
and the other materials the library contains. They do not come
to use the library catalog; they use the library catalog to the
minimum that they can: they use it only as a way to get into
the materials they want. In this sense, a library catalog is almost
like a series of signposts: I want to drive to Paris, or to downtown
Main Street, and signs lead me in different directions. I need some
groceries and I follow the signs that lead me to different places
where I can find a grocery store.
I definitely need the signs, and yet
my interest is always in getting downtown or buying the groceries,
not in the signs themselves. After I do what I want, I no longer need
any of those signs until the next time. By then, I may know it so
well that I no longer need the signs at all, or at least I think I
don't need them.
Catalog records are very similar in
their ultimate purpose to these signs. They are just far more
sophisticated, organized, and focused than other signs we see; plus,
they provide a necessary inventory control mechanism for the library.
Let's imagine someone who comes to the
library because he or she is interested in learning about a topic,
let's say the War of the Spanish Succession. The information,
the data, they want is not in the library's catalog,
but in the library's collection: that is, inside the books and
articles and other materials that contain information about the War
of the Spanish Succession. The catalog leads them in the
different directions where they should go to get that information,
but a person interested in the War of the Spanish Succession
gets practically no information about that topic from the
catalog alone. The data that the public wants is inside the
collection. The public would love to be able to manipulate the
information inside the books, the articles and other materials about
the War of the Spanish Succession, so that a computer could
summarize it for them, map it, compare, contrast, and do whatever.
They would like that because then they really would be manipulating
the “data” that interests them.
Now, let's consider catalog records
from the point of view of someone interested in the War of the
Spanish Succession. Which parts of the information in the catalog
records would they be interested in manipulating to learn more
about the War? Remember, they are interested in the War.
The number of pages? The ISBNs? The publishers? The publication
dates? The titles? The authors? The call numbers? Maybe the call
numbers because at least then they could get to the data that
interests them.
How would manipulating any of this
information help someone understand the War of the Spanish
Succession? Clearly it wouldn't. Manipulating this information
could help someone learn about the publication history of materials
on the topic. For instance, someone could learn which authors have
written the most on the topic, what are the newest materials, the
original materials; manipulating subjects could allow people to know
there are subtopics, related topics and so on. But catalogs allow
almost all of that now and have for quite some time.
How does this compare to what we
discussed earlier about statistics? Let us now imagine a statistic, a
number such as “400”. Alone, this number is meaningless and
requires a batch of additional information, that is metadata, to make
it meaningful: is this statistic in thousands or millions? In tons or
hectares or bushels? Is it about the number of fish or number of
births or amount of arable land? With enough metadata, this data (the
statistic 400) can be manipulated in many useful ways as we can all
see in tools such as Google Public Data Explorer.
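The point about the bare number can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, and the measure, unit, place and year attached to "400" are invented, since the example in the talk deliberately leaves them open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    value: float      # the data itself, e.g. 400
    measure: str      # what was counted (metadata)
    unit: str         # tons, hectares, bushels... (metadata)
    multiplier: int   # 1, 1000, 1000000... (metadata)
    place: str        # where it applies (metadata)
    year: int         # when it applies (metadata)

    def absolute(self):
        """The value scaled by its multiplier."""
        return self.value * self.multiplier


def total(observations):
    """Add observations, but only if the metadata says they are comparable."""
    first = observations[0]
    for obs in observations:
        if (obs.measure, obs.unit) != (first.measure, first.unit):
            raise ValueError("cannot add %s in %s to %s in %s"
                             % (obs.measure, obs.unit, first.measure, first.unit))
    return sum(obs.absolute() for obs in observations)


if __name__ == "__main__":
    # All descriptive values below are invented for illustration.
    wheat = Observation(400, "wheat production", "tons", 1000, "Country A", 2000)
    more_wheat = Observation(150, "wheat production", "tons", 1000, "Country B", 2000)
    print(total([wheat, more_wheat]))
```

Without the surrounding fields, the program has no way to know whether two values of 400 may be added together; with them, it can refuse to add tons to hectares. In this picture the number is what people want, and everything else exists to make the number usable.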
Why does the manipulation work with the
statistic “400”? Because the “400” is the data, that
is, it is the information that people want, while the information in
a catalog record is a different type of information that leads people to
what they want. For the catalogers, this information is
clearly metadata because they have worked personally with the
actual “data”, that is, they held the book in their hands, or the
journal or map or video, and created the catalog record from it. This
is also the way it appears to the public, which is what is really
important.
What I have outlined should show that
the purpose of a catalog record is not to serve as data for
manipulation, although it can be manipulated and we should give it a
try to see what happens. For the public however, the purpose of
catalog records is to direct them to the information
they want. Once they have reached the place where they can find the
materials they need, they no longer need the catalog record. This is
completely different from the example of the statistic 400, which is
the data they want.
So, it seems to me that if people can
manipulate catalog records like they can manipulate Google Public
Data or something similar, it will probably make very little
difference to them because it is not the data that they want.
It would be like when I need groceries, I could have a tool that
would manipulate all the signs that tell me how to get to the grocery
stores. That might be OK but even at the end of all of that, I still
wouldn't have my groceries and I wouldn't be any closer to getting
them. But, if I could manipulate the information about the actual
groceries themselves, how much they cost, their quality, their
availability, compare them all so that I could measure the value
versus the ease of getting them, that would be information I would
want, but I could get relatively little from manipulating the signs
that direct me to the grocery stores.
Of course, this does not mean at all that the catalog
record is of little value—on the contrary, directions to get to
information are just as important as the information itself. Our
society would quickly disintegrate if there were no signs, or if the
signs were bad. Remember the Apple Maps disaster a few months ago,
when people in Australia who used the Apple Maps application were
sent the wrong way in the desert and almost died?
http://www.cbronline.com/blogs/cbr-rolling-blog/apple-maps-a-danger-to-life-say-australian-police-101212
We shouldn't kid ourselves: the same
thing happens in other types of searching thousands of times every
single day. Reference librarians see it when students panic because
their normal searching strategies stop working. People are forever
ending up in the same informational dead-end as those folks in the
Australian desert. The failure is just not so obvious when people
find themselves in various types of “filter bubbles” or they are
overwhelmed and just accept whatever they can find, or give up and
decide to watch the latest cat video. If people can't find what they
want, if all they can find is junk, or if they don't know about the
information in the first place, it doesn't matter if your information
or your product is the best in the world. It may as well not even
exist. Signs are very important.
In this sense, I believe that many
librarians are not seeing the real value of their catalogs. Many
think it is in the “data” it contains, but obviously I believe it
is something else. The catalog is the doorway into the collection.
Without a catalog, you may be able to enter the collection but you
remain half-blind and mostly powerless. Without a collection, the
catalog is meaningless and becomes a curious document to be placed
into some other collection. On the other hand, a collection can be
vastly enhanced by a great catalog. Therefore, I believe a collection
and its catalog are inseparable. I don't see how the digital world
has changed this in any way.
In my next podcast, which I plan to put
out in the next couple of weeks, I want to discuss a bit more what
the library catalog could provide that is unique, and what I think
would be valued by the public.
The music to close this episode is a famous piece from Vivaldi, his Mandolin Concerto in C Major. He wrote it in 1725 in Venice. Incidentally, the story of the discovery of Vivaldi's manuscripts, now housed in the Turin National Library, is fascinating. I provide a link to the story. http://www.classicfm.com/composers/vivaldi/guides/trail-vivaldi-manuscripts/
And, in 2010 another concerto of his
was found in Edinburgh. Libraries really do come in handy!
http://www.bbc.co.uk/news/uk-scotland-edinburgh-east-fife-11491307
This is the first movement; you can
listen to the entire concerto on YouTube.
That's it for now. Thank you for
listening to Cataloging Matters with Jim Weinheimer, coming to you
from Rome, Italy, the most beautiful, and the most romantic city in
the world.
The use of library catalogs is in serious decline. Accessing our records needs to be done through general search engines. So in that sense, local cataloging as it exists today is a waste of time. The Library of Congress or other major research collections may require traditional cataloging. But individual libraries doing local cataloging? What the heck for? People use large union catalogs like Worldcat and simply look for local holdings. Most local cataloging is done from OCLC records these days. Why not just link your holdings to these existing records? These are the databases that count.
You are bringing up a delicate matter that most catalogers, including me, do not want to discuss, but what you say is fully valid. Still, I would like to point out that records don't fall into Worldcat from heaven: people make them, and the first one must be made from scratch.
Still, it doesn't have to be a trained library cataloger who makes this first record. It could be made by some technician, secretary or even a computer, and I am sure many records are made in these ways right now. Ultimately, if catalogs are to survive--and I am not maintaining at the moment that they should--I believe that they must provide something that is different from the other services out there. Do catalogs provide something unique? Or if they don't do so now, can they?
I think that they can provide something quite different from all of the Googles and Yahoos and Amazons. This "something different" can be demonstrated. The real problem in my opinion, is that the user interfaces and structures of our present catalogs are out of the 19th century. If we could upgrade those interfaces, people may find our catalogs genuinely useful again.
I discuss this at some length in my next podcast, no. 18 "Problems with Library Catalogs".
Yet, if library catalogs cannot answer the questions that you ask, I pretty much agree that they are useless. But if the "finding aid" for a library becomes useless, I will venture to say that the library collection is useless as well.
I don't want to accept that. This is why I continue to write and argue about these matters.