Friday, March 2, 2012

Cataloging Matters Podcast no. 14: Musings on the Linked Data Diagram



Hello everyone. My name is Jim Weinheimer and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy.

In this episode, I want to share a few of my own thoughts on Linked Data, the Semantic Web and Web3.0. What are those things anyway? And how do they fit into the future of the library? Or do they?

“Linked data”, the latest hot word for the “Semantic Web” or “Web3.0” is what many librarians and other information professionals point to as our ultimate destination. We are told that the Linked Data universe, or the “Semantic Web,” or “Web3.0,” is the goal that information agencies should strive for, because it's where everyone else is going, and once we all arrive there, our problems will be solved.

The Linked Data universe or the “Semantic Web” or “Web3.0”. Why people feel a need to constantly change such terms has always been a mystery to me. I suspect it is just an attempt to keep most people bewildered. Yes, experts say that these terms represent different things: the semantic web is the idea, while linked data is the means, and Web3.0 is different in yet some other way, or not. But these kinds of discussions remind me of medieval academic disputations that, in practice, all come down to the same thing and serve only to keep the average person out of the discussion altogether.

Within the library world, “Linked Data” reigns pretty much unquestioned as the ultimate aspiration for our metadata. Such an unquestioned goal reminds me of the story of the Knights of the Round Table, who set off in search of the Holy Grail after Sir Galahad sat in the Siege Perilous and they all saw the vision of the Holy Grail. Or at least, they said they did. Going after the Grail was simply something that they had to do. Whenever I see or hear the words “Linked Data” now, I believe I can hear a chorus of angels in the background.

In the transcript is a diagram of the Linked Data Universe. This, or one of its many variants, is only a top-level overview. It shows only complete sites, or more precisely, datasets, that are part of the system but in reality, each dataset represents probably hundreds of thousands if not millions of individual links, so the underlying network is immeasurably more complex than this mere topmost overview. For example, if Worldcat were included in this, with all of its millions of records, and many, many, many more individual links, it would appear only as a single circle labelled Worldcat.

This diagram has been used to illustrate that Holy Grail that we in the “information community” are striving for. And yet, the ultimate goal remains rather elusive to me and seems based on a general sort of faith that once we get “there”--wherever “there” is--our problems will be solved. We read this all the time in articles that state, RDA is the first step to FRBR, and FRBR allows us to participate in Linked Data.

According to our modern fount of unquestioned wisdom and knowledge for all of our information needs today, that is: Wikipedia, the idea of Linked Data is rather new, dating back only to the late 1960s. Now, it seems to me that the idea of Linked Data is, in its basic essentials, rather similar to authority control in library catalogs, so in that case it could be traced much, much farther back. To be fair, at the time I am recording this, some unknown person actually mentioned this precise point in the Wikipedia article on Linked Data (I have added a screen clip to the transcription) but it has that dreaded Wikipedia Mayday warning “[citation needed]” so I had better be careful!

The term “Semantic Web” itself comes from none other than Tim Berners-Lee back in 1999 in his book “Weaving the Web”. Interestingly enough, it was when he described a dream. That's right, a dream. And a dream in two parts. The first part describes the possibilities of genuine collaboration with his human colleagues through the web (or what is now known as Web2.0). He then continues his dream. I am quoting him:
In the second part of the dream, collaborations extend to computers. Machines become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.
End of quote.

What are these “intelligent agents” he mentions? Those are machines that are designed to work for humans but do so without any need for us to monitor them all the time. There are many different types of intelligent agents, and we are surrounded by them. In fact, our bodies have them so that we don't have to remember consciously to breathe or for our livers to function. Mechanical intelligent agents can be thermostats that regulate the temperature in our homes automatically, or sensors of the oil level in our automobiles. Tim Berners-Lee imagines intelligent agents of a substantially different kind. I guess that The Terminator could also be considered an intelligent agent but a little different from a thermostat.

To continue with the dream. Quote:
Once the two-part dream is reached, the Web will be a place where the whim of a human being and the reasoning of a machine coexist in an ideal, powerful mixture. Realizing the dream will require a lot of nitty-gritty work. The Web is far from "done." It is in only a jumbled state of construction, and no matter how grand the dream, it has to be engineered piece by piece, with many of the pieces far from glamorous.
End of quote

His dream does strike me, at least, as something that could fairly be labelled “utopian.”

For those who have listened to my podcast on Search, this is what search is turning into: an intelligent agent for informational purposes that will work for you much as a thermostat does for your home: it will function to get you the information you want without any or much thought from you. Sometimes you may have to adjust it, just as you do your thermostat. I have my own opinions on that, and those who are interested can find them in my earlier podcast.

As for “Linked Data”, that term also apparently comes from Tim Berners-Lee, who cited four tasks if you want to implement linked data. The purpose is for machines to be able to talk to other machines in order to create the intelligent agents:
  1. Use URIs as names for things (that is, the name for a concept must use a constant reference instead of using human terminology)
  2. Use HTTP URIs so that people can look up those names. (that is, so that these constant references can actually work on the World Wide Web)
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL) (that is, use complicated computer coding standards)
  4. Include links to other URIs, so that they can discover more things.
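The four rules above can be sketched in a few lines of Python. This is only a toy illustration, not how any real dataset is published: the `example.org` URIs, the `ruled` property, and the `FAKE_WEB` dictionary standing in for the Web are all assumptions made up for the example. Only the `rdfs:label` URI is a real vocabulary term.

```python
# Rules 1 and 2: things are named with HTTP URIs.
# The example.org URIs below are hypothetical placeholders.
NAPOLEON = "http://example.org/person/napoleon-i"
FRANCE = "http://example.org/place/france"
RULED = "http://example.org/vocab/ruled"  # hypothetical property
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"  # real rdfs:label URI

# Rule 3: looking up a URI returns useful information.
# Here a dictionary of (subject, predicate, object) triples stands in
# for the Web; a real client would issue an HTTP GET instead.
FAKE_WEB = {
    NAPOLEON: [
        (NAPOLEON, LABEL, "Napoleon I, Emperor of the French, 1769-1821"),
        (NAPOLEON, RULED, FRANCE),  # Rule 4: a link to another URI
    ],
    FRANCE: [
        (FRANCE, LABEL, "France"),
    ],
}


def look_up(uri):
    """Simulate dereferencing a URI; unknown URIs return no triples."""
    return FAKE_WEB.get(uri, [])


def discover(start_uri):
    """Follow links outward from start_uri, collecting human-readable labels."""
    labels = {}
    to_visit = [start_uri]
    seen = set()
    while to_visit:
        uri = to_visit.pop()
        if uri in seen:
            continue
        seen.add(uri)
        for _subject, predicate, obj in look_up(uri):
            if predicate == LABEL:
                labels[uri] = obj  # text a human can actually read
            elif obj in FAKE_WEB:
                to_visit.append(obj)  # another URI to look up
    return labels


labels = discover(NAPOLEON)
for uri, label in labels.items():
    print(uri, "->", label)
```

Notice that the machine can hop from URI to URI on its own, but the only parts a person can read are the labels that somebody had to write.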
Since that time, other simpler methods besides RDF have been devised. Nevertheless, everything here seems good: the links seem good, the dream seems good, the methods seem good and improving, so what can be a problem?

Many times have I seen the linked data diagram, and I have watched it grow, but the last time I looked at it, I realized that my response was not altogether positive, and it occurred to me that I had never really had positive feelings for it. As I tried to sort out what my feelings were, I discovered that the Linked Data diagram actually gives me the “creeps”.

Based on this, I came to think that perhaps I should try to regard the diagram in an openly “creepy” way, and to follow the example of those in the past who would sometimes put a human skull in a place where they could see it all the time. That way, those people could muse over its deeper implications at different points throughout the course of their lives.

One of the most famous scenes where a person contemplates a skull is of course in Shakespeare's Hamlet. During this amazing scene, the individuals sitting in the audience watch the gravedigger's reaction to the skull, and then Hamlet's strange reaction, and the audience compares it with their own personal reactions. The members of the audience come to realize that the type of person holding and contemplating the skull is critical to how someone relates to it: are you a child seeing a skull for the first time; a chance passer-by discovering a crime; a murderer enjoying a moment of exultation; a gravedigger who knew and hated the fellow, or a tormented soul, such as Hamlet?

Today, television has shows such as C.S.I. where scenes of detached, putrefied body parts have strangely become commonplace. This must have at least some effect on how early 21st-century people react to this scene from Hamlet, as opposed to audiences only 20 or 25 years ago.

I have decided to muse on the linked data diagram, and to try to do so from the viewpoints of various types of people: an information “idealist,” two types of information “businessmen,” an information “Security enforcer”, an information “vandal”, plus a couple of others at the end.

I invite you to gaze upon the linked data diagram along with me, as I share my thoughts with you.

I would like to start with The Information Idealist. For this person, each link is a source of awe and wonder:
“Oh! Look at the wonders of what the Semantic Web can do! Or whatever the Semantic Web is called today! This incredible diagram just leaves me all aflutter! All of those links and inter-links and inter-inter-links simply must be wonderful to work with, and I am sure will allow me to go effortlessly from dbpedia to flickr-wrapper to NSZL catalog to MARC Code Lists to Linked LCCN to PBAC to Pokepedia to Jamendo to MusicBrainz (DBTunes) to MySpace (DBTunes) and other places. Just thinking about it gives me goosebumps!

Naturally, I understand very little of any of this: flickr-wrapper must have something to do with flickr and what MARC or LCCN are I don't care. What is PBAC? According to, PBAC can mean Program Budget Advisory Committee, Pharmaceutical Benefits Advisory Committee, Palm Beach Atlantic College, peripheral blood adherent cells, Peripheral Bus Access Controller, pictorial bloodloss assessment chart, policy-based access control, or Program Budget Activity Committee. Which one it is doesn't matter since I know it is all wonderful because it is linked. Oops! Well, MySpace (DBTunes) seems to be dead, but it's no loss since I didn't know what it was anyway. The only site I really knew about was Poképedia, which is a database about the game Pokémon.

It is all good, of that I have no doubt at all.”
Next will be the view of The Information “Businessman-Advertiser-Spammer.” For this person, each link represents a way to get the message out:
“Quite an interesting interlinked community here. Where can my “information business” fit into this in the most useful way? Hmmm.... It seems as if dbpedia is at the center of it all. Information placed there should eventually trickle out into the entire system. There is another major center called Freebase but best forget that one. My companies can't risk any negative fallout if potential customers should confuse it with freebasing crack cocaine.

Other fertile areas seem to be CiteSeer, ACM, or RAE2001. That last one seems to be pretty old and a quick Google search brings up something else called RAE2008. That also seems old. No idea what those are but it doesn't really matter. I just want to get as much of my information out as widely and as efficiently as possible.
Our first attempts should be to blanket as much of the system as possible with information for cheapest prices on Viagra and Botox. All we'll need is .001% to respond and we'll be profitable. If we get kicked out, no problem. That's happened to us several times before in other venues. I'm sure we'll find other ways to get in again.

I have no doubt lots of money could be made.”
From here, I proceed on to this businessman's colleague: The Information “Businessman-Acquirer”. Here each link equals monetary income:
“Which agencies here are in the most control of the entire system? Many would immediately say that the biggest ones are the most important and have the most power, so at first thought someone could decide that dbpedia and ACM would be the best targets for our acquisition attempts, and of course that's very possible. Everybody has a price, after all. If each of the other sites links to dbpedia and my company bought it, there would be an incredible number of ways to force others to pay for those links. Oh! I have to remember to say: “monetize” those links.

But although everybody has a price, I don't want to pay it if I don't have to and anyway, experience shows that more often than not, big agencies are completely at the mercy of much smaller ones. Those little ones are easier and cheaper to take over but have almost the same effects.

I've always admired China this way since it has that kind of power right now. They are just about the only place in the world where anybody can get rare earths. And the wealthy technology companies absolutely have to have rare earths or they shut down. China has enviable power. I suspect we can find something comparable here. Where exactly are the equivalent caches of “Chinese rare earths” to be found in this system? There may be several.

Therefore, it would probably be better worth our while to examine sites that are much smaller, like LinkedCT or even LOIUS (whatever they are) to see if by acquiring one or both we could gradually gain significant control over the whole. Other companies would probably be interested in these investment opportunities as well.
I have no doubt lots of money could be made.”

From here, we go on to The Information “Vandal”. Each link equals a place for a bomb:
“What would be the most efficient way to bring this system down? What would require the least amount of work from me and my underground group that would produce the greatest chaos and damage? I would like to focus on the areas that are perceived to be the strongest and perhaps the best protected because my group will enjoy the greater challenge of course, but more importantly, if we were to succeed in bringing down the biggest ones, we would make the greatest impression on the rest of the world.

This system is based on links, and that makes it even more interesting. I would imagine that it is not necessarily the biggest sites that make the system most vulnerable, but the ones providing links that are the most widely-used. I suspect there could be some artistry involved here to find the one weakness that appears insignificant, but where one slight tap of a hammer will bring everything crashing down. Now that would be something to be proud of!

Flickr-wrapper may be a good place to start. After all, Flickr is owned by Yahoo, and I have always been mad at them ever since they cooperated with the Chinese authorities a few years ago to send people to jail.

This should make for an interesting project for my group. I wonder if there is any security surveillance of any of this?”
In answer, we continue with The Information Security-Enforcer. Each link equals an unsecured open door, going in and going out:
“Looks like a lot of ground to keep secure. We need to find out which of these are the foreign sites, especially those that are unfriendly to our nation. We must assume that there are criminal organizations included here as well. We'll need to root them out.

We need several options, and perhaps soon. It will be crucial to be able to block propaganda from our enemies in times of national emergency and disrupt their communications, while we need to ensure that their populace will continue to receive the true information from us.

How can the unfriendly sites be brought down by sabotage? Where are their weaknesses? Could these other sites do the same to us? Can we hijack any of them? How can we monitor what is going on inside each of these sites to ensure there are no criminal activities going on?

What is this dbpedia thing and the other major sites? Perhaps we can get some cooperation from them to find out specifically who the people are using this system and how they are communicating.

We can start with some basic background checks, begin to monitor the discussions, and start data mining activities. Something will turn up, I am sure.”
Now, I go on to The traditional library Cataloger:
“Links based on millions of machine-readable URIs are fine but stopping there isn't enough. I am interested in human beings and if you want humans involved, then sooner or later those same links must lead to something that is human-readable. Although this system is designed primarily for machines, and human understanding is introduced only as an afterthought, the ultimate audience is human beings, after all. They should be included here somewhere.

It seems from this diagram that linked data is a variation on “passing the buck”: links come from links that come from links that come from links, but sooner or later, the buck has to stop and some text has to exist, or an image, or a sound, or something that a human can understand. Those parts that humans can understand are what I, as a cataloger, am interested in. The technology behind it is almost beside the point. What precisely is that text, or sound, or whatever it is that is found at the end of the URI that will communicate information to the human? How does it read, or what does it say, or what does it look like? Does it make sense—to a real, live person? And what kind of person are we talking about? A professor or a child; a journalist or someone who is barely literate?

Besides, who is responsible for making sure that the human-meaningful data itself makes sense? Is it supposed to make sense spontaneously? And makes sense both as an individual resource and in the aggregate? And is even semi-reliable? Links coming from all and sundry lead me to think that the data will become genuinely meaningful to humans only in some kind of magical or miraculous way. Magic only happens in fairy tales and miracles are found in sacred texts.

The Linked Data universe looks like chaos brought together. It will need a lot of work to make it genuinely useful to humans.”
Finally, what would Hamlet say? I have tried on my own, and taken some authorial liberties. I can imagine Hamlet holding the diagram of the linked data system, the diagram that he had loved during earlier and happier times, just as when he was walking through the cemetery with his friends and the gravedigger gave him the skull of his beloved Yorick:
“Alas, poor Web3.0! I knew it, Horatio: a project of infinite jest, of most excellent fancy: it hath borne me on its back a thousand times; and now, how abhorred in my imagination it is! my gorge rises at it.

Here hung those links that I have clicked I know not how oft. Where be your attributes now? your datasets? your rdf? your flashes of semantics, that were wont to set the Internet on a roar?

Not one now, to mock your own entities? quite value-fallen?”
Well, I apologise to everyone for that, and especially to the ghost of Shakespeare, but it seemed fitting and I just could not help myself.

Do I believe that the problems of libraries will be solved by making our metadata/catalog records available through linked data? No. I wish there were some kind of silver bullet that would solve these problems, but while including our information as linked data may make it more widely accessible, I still cannot see that that action will make our data more useful or relevant to the public at large. Much more needs to be done if we are to fit in that way.

Therefore, linked data is only the METHOD toward a solution and not a solution in itself, and it is only a single possible method at that. It may not solve much at all. Of course, we can go ahead and do it, just so long as we do not expect too much and we do it as cheaply and easily as possible. It seems to me that to find a solution or solutions, a library as a whole has to address the problems, and not as our artificially created individual departmental units of selection, acquisition, cataloging, circulation, conservation, and reference. Approaching the problems that way resembles sending out uncoordinated groups of soldiers against an enemy army that has proven itself highly formidable. It's a recipe for disaster and it was something that Napoleon Bonaparte himself relied on. He much preferred fighting allied armies since he knew that the so-called allies were almost always just as busy fighting each other as they were fighting him.

No one outside of libraries has the slightest idea of our workflows or of our bureaucratic structures nor should they need any. Bureaucratic structures and workflows are notoriously difficult to change. Since changes are so difficult to implement, they take time, so bureaucratic structures and workflows very often reflect capabilities and needs from an earlier era.

Libraries, libraries as a whole, need to rediscover what it is that they really, genuinely do to help their communities, putting aside their individual departments and personal responsibilities. For instance, I keep bringing up that the public today desperately wants selection, since choosing for themselves either takes too much of their time, or they don't feel competent to choose. But selection for our patrons must include what is in the patron's information universe, and not only the restricted area within any single library collection.

How does a patron see it? I think it's like in grocery stores when you ask a grocery clerk a question and he responds, “I'm sorry mister. That's not my department” and continues with his work, leaving you all alone.

So, what is it that libraries really do? It seems to me that reviewing the various statements on library ethics could be very helpful in figuring this out. For instance, when discussing selection, the excellent library ethics of not proselytizing a personally held belief, and of not making a monetary profit from recommendations to a user, should be much more widely circulated to the public than they are currently. I think that would be a tremendous point in favor of libraries since people would very quickly see how different we are from other search tools and how librarians really do have their interests in mind, while Google, Yahoo etc. have different purposes.

I would go into this more deeply but I have gone on long enough and I will save this topic for another episode of Cataloging Matters.

The music I have selected is “Variations on the folia of Spain, opus 45” by Mauro Giuliani. Giuliani was a major musician of his time, and I discovered that he even played in the first performance of Beethoven's Seventh Symphony. He lived in Rome for a while but ended up doing better in Naples, where he died in 1829. Giuliani seemed to have a special love for making variations on different themes, and we can certainly hear it in this piece.

That’s it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.


  1. Great piece, Jim, a few comments;

    The Semantic Web is a platform on which you can build anything you like, using a clearer separation between ontology and data. As you well know, that line is terribly blurred in most library technologies (I'm looking at you MARC / AACR2 / RDA / FRBR ...). Using this technology properly could solve a bunch of infra-structural things (or, inter-library, if you will) more than the outward-facing ones (which is where I see most of your podcast focusing). This is a very important distinction that has to be made, that the SemWeb technology solves more problems than whatever dream TBL had about it.

    However, to make it work for you (the library world), that very library world needs to put in the effort to solve their stuff using this technology, but because there's always been this global focus for the SemWeb it was kinda thought that a lot of smaller problems would solve themselves and the libraries could just hang on to whatever showed up. That was a big mistake, of course.

    The library world should have *pioneered* this technology and built a global platform for knowledge sharing, and they could not only have made themselves relevant to the future of computing, but indispensable to the world, like they used to be. But no, libraries wanted to stay put and not do anything too radical.

    The price of that is pretty much the summary of your podcast. Hopefully I'll make it to your lovely Rome. I've been almost everywhere *but* Rome (used to work in Caserta, even, and got family in Torino), but perhaps there's something that will one day draw me in. :)

    1. Thanks for the kind words, Alex.

      I especially like where you say, “...but because there's always been this global focus for the SemWeb it was kinda thought that a lot of smaller problems would solve themselves and the libraries could just hang on to whatever showed up. That was a big mistake, of course.”

      Yes. The "smaller" problems are incredibly big problems to others from the practical viewpoint. It seems like it has been a "If you build it, they will come" mentality. That mentality may be no problem if it only takes a day or so to build "it", costs you almost nothing and there is no problem if nobody really does show up, but if it instead takes you months or years to build "it", costs lots of money, and you are betting the house AND the baby's shoes on the outcome, it's a completely different matter.

      As I have mentioned in other posts, we see an example in AACR2/MARC21:

      100 0_ |a Napoleon |b I, |c Emperor of the French, |d 1769-1821

      what looks difficult to the non-cataloger is the coding: the 100 and subfields. But to the cataloger, they know without a doubt it is the information inside the coding that is much more difficult, and the coding is by far the easiest part about all of this.

      Turning this into a URI, I don't think, in any way changes the need and the complexity of what the human experiences. OK, it makes it easier for the machine, but how does it really make it easier for the human? They have to experience something, after all.
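      To make the point concrete, here is a rough sketch that pulls the coding apart from the content. The `parse_marc_field` helper and the `example.org` URI are made up for illustration and are not part of any real library system. It handles only the simple display form shown above.

```python
def parse_marc_field(line):
    """Split a display-form MARC field into tag, indicators, and subfields.

    Assumes the simple layout used above: a tag, an indicator pair, then
    subfields introduced by '|' followed by a one-letter code.
    """
    head, _, rest = line.partition("|")
    tag, indicators = head.split()
    subfields = []
    for chunk in rest.split("|"):
        code, value = chunk[0], chunk[1:].strip()
        subfields.append((code, value))
    return tag, indicators, subfields


field = "100 0_ |a Napoleon |b I, |c Emperor of the French, |d 1769-1821"
tag, indicators, subfields = parse_marc_field(field)

# The coding is the easy part: a few lines of string handling.
print(tag, indicators, subfields)

# The heading is the hard part: a cataloger had to decide every word of it.
heading = " ".join(value for _code, value in subfields)

# A hypothetical URI standing in for the heading. The machine can now
# use the URI, but a person still needs the heading text to read.
authority_uri = {"http://example.org/authority/napoleon-i": heading}

for uri, text in authority_uri.items():
    print(uri, "->", text)
```

      Swapping the heading for a URI moves the coding around, but the text at the end of the link is still the same heading a cataloger had to formulate.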

      And yes, libraries should have pioneered this. If the LC Subject Headings and name headings had been let out a decade or so earlier, I am sure they would be in the place of dbpedia now in the Semantic Web, although they would be substantially different from what they are now. I think they would have been vastly improved. But, that is one of those lost opportunities.

  2. I understand fully your suspicious thoughts.. you can imagine how I also have doubts such as the ones that you have..working on AGRIS as a Linked Open Data application (
    It surely does not solve the problems of catalogers.. -- mmmh..btw, what are the problems of the catalogers? -- but, at least make us .. "dream" of a world of data that is mashed up altogether, extending and expanding the (sometimes too) dry and not informative citation entered by catalogers, which may not suffice to students and researchers, who too often cannot access that piece of information for copyright reasons, especially when you face restrictions from some editors or commercial publishing companies (Elsevier, for example)..
    Yes, sometimes it seems like you think that you are doing something that may last forever and will never be fully accomplished.
    I do not think that the LOD believers are so presumptuous to hope that this will change the attitude of sharing and acquiring information, and that alive URIs + ontologies will change the world of catalogers.. There are far too many obstacles in getting in the LOD cloud that you show at the start of your podcast that we still need to work hard simply to disambiguate things in our collection..