Cataloging Matters Podcast no. 14:
Musings on the Linked Data Diagram
Hello everyone. My name is Jim Weinheimer and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy.
In this episode, I want to share a few of my own thoughts on Linked Data, the Semantic Web and Web3.0. What are those things anyway? And how do they fit into the future of the library? Or do they?
“Linked data”, the latest hot word for the “Semantic Web” or “Web3.0”, is what many librarians and other information professionals point to as our ultimate destination. We are told that the Linked Data universe, or the “Semantic Web,” or “Web3.0,” is the goal that information agencies should strive for, because it's where everyone else is going, and once we all arrive there, our problems will be solved.
The Linked Data universe or the “Semantic Web” or “Web3.0”. Why people feel a need to constantly change such terms has always been a mystery to me. I suspect it is just an attempt to keep most people bewildered. Yes, experts say that these terms represent different things: the semantic web is the idea, while linked data is the means, and Web3.0 is different in yet some other way, or not. But these kinds of discussions remind me of medieval academic disputations that, in practice, all come down to the same thing and serve only to keep the average person out of the discussion altogether.
Within the library world, “Linked Data” reigns pretty much unquestioned as the ultimate aspiration for our metadata. Such an unquestioned goal reminds me of the story of the Knights of the Round Table, who set off in search of the Holy Grail after Sir Galahad sat in the Siege Perilous and they all saw the vision of the Holy Grail. Or at least, they said they did. Going after the Grail was simply something that they had to do. Whenever I see or hear the words “Linked Data” now, I believe I can hear a chorus of angels in the background.
http://www.myzeitgeist.com/wp-content/uploads/2010/09/lod-datasets_2010-09-22_1000px3.png This diagram, or one of its many variants, is only a top-level overview. It shows only complete sites, or more precisely, datasets, that are part of the system. In reality, each dataset represents probably hundreds of thousands if not millions of individual links, so the underlying network is immeasurably more complex than this mere topmost overview. For example, if Worldcat were included in this, with all of its millions of records, and many, many, many more individual links, it would appear only as a single circle labelled Worldcat.
This diagram has been used to illustrate that Holy Grail that we in the “information community” are striving for. And yet, the ultimate goal remains rather elusive to me and seems based on a general sort of faith that once we get “there”--wherever “there” is--our problems will be solved. We read this all the time in articles that state, RDA is the first step to FRBR, and FRBR allows us to participate in Linked Data.
According to our modern fount of unquestioned wisdom and knowledge for all of our information needs today, that is: Wikipedia, the idea of Linked Data is rather new, dating back only to the late 1960s. Now, it seems to me that the idea of Linked Data is, in its basic essentials, rather similar to authority control in library catalogs, so in that case it could be traced much, much farther back. To be fair, at the time I am recording this, some unknown person actually mentioned this precise point in the Wikipedia article on Linked Data (I have added a screen clip to the transcription) but it has that dreaded Wikipedia Mayday warning “” so I had better be careful!
The term “Semantic Web” itself comes from none other than Tim Berners-Lee back in 1999 in his book “Weaving the Web”. Interestingly enough, it was when he described a dream. That's right, a dream. And a dream in two parts. The first part describes the possibilities of genuine collaboration with his human colleagues through the web (or what is now known as Web2.0). He then continues his dream. I am quoting him:
In the second part of the dream, collaborations extend to computers. Machines become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize. End of quote.
What are these “intelligent agents” he mentions? Those are machines that are designed to work for humans but do so without any need for us to monitor them all the time. There are many different types of intelligent agents, and we are surrounded by them. In fact, our bodies have them so that we don't have to remember consciously to breathe or for our livers to function. Mechanical intelligent agents can be thermostats that regulate the temperature in our homes automatically, or sensors of the oil level in our automobiles. Tim Berners-Lee imagines intelligent agents of a substantially different kind. I guess that The Terminator could also be considered an intelligent agent but a little different from a thermostat.
To continue with the dream. Quote:
Once the two-part dream is reached, the Web will be a place where the whim of a human being and the reasoning of a machine coexist in an ideal, powerful mixture. Realizing the dream will require a lot of nitty-gritty work. The Web is far from "done." It is in only a jumbled state of construction, and no matter how grand the dream, it has to be engineered piece by piece, with many of the pieces far from glamorous. End of quote.
His dream does strike me, at least, as something that could fairly be labelled “utopian.”
For those who have listened to my podcast on Search, this is what search is turning into: an intelligent agent for informational purposes that will work for you much as a thermostat does for your home: it will function to get you the information you want without any or much thought from you. Sometimes you may have to adjust it, just as you do your thermostat. I have my own opinions on that, and those who are interested can find them in my earlier podcast.
As for “Linked Data”, that term also apparently comes from Tim Berners-Lee, who cited four tasks if you want to implement linked data. The purpose is for machines to be able to talk to other machines in order to create the intelligent agents:
- Use URIs as names for things (that is, the name for a concept must use a constant reference instead of using human terminology)
- Use HTTP URIs so that people can look up those names. (that is, so that these constant references can actually work on the World Wide Web)
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL) (that is, use complicated computer coding standards)
- Include links to other URIs so that they can discover more things.
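For those reading the transcript, the four rules above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any real Linked Data service is built: every URI below is invented under example.org, the "lookup" is a search of an in-memory list rather than a real HTTP request, and the statements are plain tuples rather than RDF.

```python
# Hypothetical data: every URI here is made up under example.org for illustration.
# Each statement is (subject, property, value). Names are HTTP URIs (rules 1 and 2).
TRIPLES = [
    ("http://example.org/person/giuliani", "http://example.org/prop/name", "Mauro Giuliani"),
    ("http://example.org/person/giuliani", "http://example.org/prop/diedIn", "http://example.org/place/naples"),
    ("http://example.org/place/naples", "http://example.org/prop/name", "Naples"),
    ("http://example.org/place/naples", "http://example.org/prop/country", "http://example.org/place/italy"),
    ("http://example.org/place/italy", "http://example.org/prop/name", "Italy"),
]


def is_http_uri(value):
    """Return True if the value looks like an HTTP(S) URI rather than plain text."""
    return value.startswith("http://") or value.startswith("https://")


def describe(uri):
    """Rule 3: looking up a URI returns the statements made about it.

    Assumption: a real service would answer an HTTP request with RDF;
    here we simply filter the in-memory list.
    """
    return [(prop, value) for subject, prop, value in TRIPLES if subject == uri]


def discover(start):
    """Rule 4: follow links to other URIs, breadth-first, visiting each URI once.

    Only values are followed; property URIs are left alone in this sketch.
    """
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        for _, value in describe(uri):
            if is_http_uri(value):
                queue.append(value)
    return seen


if __name__ == "__main__":
    for uri in discover("http://example.org/person/giuliani"):
        print(uri, describe(uri))
```

Starting from one name, the machine is led from a person to a city to a country without any human telling it where to go next; the human-readable text only shows up at the ends of the chain, as the plain "name" values.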
Many times have I seen the linked data diagram, and I have watched it grow, but the last time I looked at it, I realized that my response was not altogether positive, and it occurred to me that I had never really had positive feelings for it. As I tried to sort out what my feelings were, I discovered that the Linked Data diagram actually gives me the “creeps”.
Based on this, I came to think that perhaps I should try to regard the diagram in an openly “creepy” way, and to follow the example of those in the past who would sometimes put a human skull in a place where they could see it all the time. That way, those people could muse over its deeper implications at different points throughout the course of their lives.
One of the most famous scenes where a person contemplates a skull is of course in Shakespeare's Hamlet. During this amazing scene, the individuals sitting in the audience watch the gravedigger's reaction to the skull, and then Hamlet's strange reaction, and the audience compares it with their own personal reactions. The members of the audience come to realize that the type of person holding and contemplating the skull is critical to how someone relates to it: are you a child seeing a skull for the first time; a chance passer-by discovering a crime; a murderer enjoying a moment of exultation; a gravedigger who knew and hated the fellow, or a tormented soul, such as Hamlet?
Today, television has shows such as C.S.I. where scenes of detached, putrefied body parts have strangely become commonplace. This must have at least some effect on how early 21st-century people react to this scene from Hamlet, as opposed to audiences only 20 or 25 years ago.
I have decided to muse on the linked data diagram, and to try to do so from the viewpoints of various types of people: an information “idealist,” two types of information “businessmen,” an information “Security enforcer”, an information “vandal”, plus a couple of others at the end.
I invite you to gaze upon the linked data diagram along with me, as I share my thoughts with you.
I would like to start with The Information Idealist. For this person, each link is a source of awe and wonder:
“Oh! Look at the wonders of what the Semantic Web can do! Or whatever the Semantic Web is called today! This incredible diagram just leaves me all aflutter! All of those links and inter-links and inter-inter-links simply must be wonderful to work with, and I am sure will allow me to go effortlessly from dbpedia to flickr-wrapper to NSZL catalog to MARC Code Lists to Linked LCCN to PBAC to Pokepedia to Jamendo to Muzik-Brainz (DBTunes) to MySpace (DBTunes) and other places. Just thinking about it gives me goosebumps!
Naturally, I understand very little of any of this: flickr-wrapper must have something to do with flickr and what MARC or LCCN are I don't care. What is PBAC? According to allacronyms.com, PBAC can mean Program Budget Advisory Committee, Pharmaceutical Benefits Advisory Committee, Palm Beach Atlantic College, peripheral blood adherent cells, Peripheral Bus Access Controller, pictorial bloodloss assessment chart, policy-based access control, or Program Budget Activity Committee. Which one it is doesn't matter since I know it is all wonderful because it is linked. Oops! Well, MySpace (DBTunes) seems to be dead, but it's no loss since I didn't know what it was anyway. The only site I really knew about was Poképedia, which is a database about the game Pokémon.
It is all good, of that I have no doubt at all.”
Next, will be the view of The Information “Businessman-Advertiser-Spammer.” For this person, each link represents a way to get the message out:
“Quite an interesting interlinked community here. Where can my “information business” fit into this in the most useful way? Hmmm.... It seems as if dbpedia is at the center of it all. Information placed there should eventually trickle out into the entire system. There is another major center called Freebase but best forget that one. My companies can't risk any negative fallout if potential customers should confuse it with freebasing crack cocaine.
Other fertile areas seem to be CiteSeer, ACM, or RAE2001. That last one seems to be pretty old and a quick Google search brings up something else called RAE2008. That also seems old. No idea what those are but it doesn't really matter. I just want to get as much of my information out as widely and as efficiently as possible.
Our first attempts should be to blanket as much of the system as possible with information for cheapest prices on Viagra and Botox. All we'll need is .001% to respond and we'll be profitable. If we get kicked out, no problem. That's happened to us several times before in other venues. I'm sure we'll find other ways to get in again.
I have no doubt lots of money could be made.”
From here, I proceed on to this businessman's colleague: The Information “Businessman-Acquirer”. Here each link equals monetary income:
“Which agencies here are in the most control of the entire system? Many would immediately say that the biggest ones are the most important and have the most power, so at first thought someone could decide that dbpedia and ACM would be the best targets for our acquisition attempts, and of course that's very possible. Everybody has a price, after all. If each of the other sites link to dbpedia and my company bought it, there would be an incredible number of ways to force others to pay for those links. Oh! I have to remember to say: “monetize” those links.
But although everybody has a price, I don't want to pay it if I don't have to and anyway, experience shows that more often than not, big agencies are completely at the mercy of much smaller ones. Those little ones are easier and cheaper to take over but have almost the same effects.
I've always admired China this way since it has that kind of power right now. They are just about the only place in the world where anybody can get rare earths. And the wealthy technology companies absolutely have to have rare earths or they shut down. China has enviable power. I suspect we can find something comparable here. Where exactly are the equivalent caches of “Chinese rare earths” to be found in this system? There may be several.
Therefore, it would probably be better worth our while to examine sites that are much smaller, like LinkedCT or even LOIUS (whatever they are) to see if by acquiring one or both we could gradually gain significant control over the whole. Other companies would probably be interested in these investment opportunities as well.
I have no doubt lots of money could be made.”
From here, we go on to The Information “Vandal”. Each link equals a place for a bomb:
“What would be the most efficient way to bring this system down? What would require the least amount of work from me and my underground group that would produce the greatest chaos and damage? I would like to focus on the areas that are perceived to be the strongest and perhaps the best protected because my group will enjoy the greater challenge of course, but more importantly, if we were to succeed in bringing down the biggest ones, we would make the greatest impression on the rest of the world.
This system is based on links, and that makes it even more interesting. I would imagine that it is not necessarily the biggest sites that make the system most vulnerable, but the ones providing links that are the most widely-used. I suspect there could be some artistry involved here to find the one weakness that appears insignificant, but where one slight tap of a hammer will bring everything crashing down. Now that would be something to be proud of!
Flickr-wrapper may be a good place to start. After all, Flickr is owned by Yahoo, and I have always been mad at them ever since they cooperated with the Chinese authorities a few years ago to send people to jail.
This should make for an interesting project for my group. I wonder if there is any security surveillance of any of this?”
In answer, we continue with The Information Security-Enforcer. Each link equals an unsecured open door, going in and going out:
“Looks like a lot of ground to keep secure. We need to find out which of these are the foreign sites, especially those that are unfriendly to our nation. We must assume that there are criminal organizations included here as well. We'll need to root them out.
We need several options, and perhaps soon. It will be crucial to be able to block propaganda from our enemies in times of national emergency and disrupt their communications, while we need to ensure that their populace will continue to receive the true information from us.
How can the unfriendly sites be brought down by sabotage? Where are their weaknesses? Could these other sites do the same to us? Can we hijack any of them? How can we monitor what is going on inside each of these sites to ensure there are no criminal activities going on?
What is this dbpedia thing and the other major sites? Perhaps we can get some cooperation from them to find out specifically who the people are using this system and how they are communicating.
We can start with some basic background checks, begin to monitor the discussions, and start data mining activities. Something will turn up, I am sure.”
Now, I go on to The traditional library Cataloger:
“Links based on millions of machine-readable URIs are fine but stopping there isn't enough. I am interested in human beings and if you want humans involved, then sooner or later those same links must lead to something that is human-readable. Although this system is designed primarily for machines, and human understanding is introduced only as an afterthought, the ultimate audience is human beings, after all. They should be included here somewhere.
It seems from this diagram that linked data is a variation on “passing the buck”: links come from links that come from links that come from links, but sooner or later, the buck has to stop and some text has to exist, or an image, or a sound, or something that a human can understand. Those parts that humans can understand are what I, as a cataloger, am interested in. The technology behind it is almost beside the point. What precisely is that text, or sound, or whatever it is that is found at the end of the URI that will communicate information to the human? How does it read, or what does it say, or what does it look like? Does it make sense—to a real, live person? And what kind of person are we talking about? A professor or a child; a journalist or someone who is barely literate?
Besides, who is responsible for making sure that the human-meaningful data itself makes sense? Is it supposed to make sense spontaneously? And makes sense both as an individual resource and in the aggregate? And is even semi-reliable? Links coming from all and sundry lead me to think that the data will become genuinely meaningful to humans only in some kind of magical or miraculous way. Magic only happens in fairy tales and miracles are found in sacred texts.
The Linked Data universe looks like chaos brought together. It will need a lot of work to make it genuinely useful to humans.”
Finally, what would Hamlet say? I have tried on my own, and taken some authorial liberties. I can imagine Hamlet holding the diagram of the linked data system, the diagram that he had loved during earlier and happier times, just as when he was walking through the cemetery with his friends and the gravedigger gave him the skull of his beloved Yorick:
“Alas, poor Web3.0! I knew it, Horatio: a project of infinite jest, of most excellent fancy: it hath borne me on its back a thousand times; and now, how abhorred in my imagination it is! my gorge rises at it.
Here hung those links that I have clicked I know not how oft. Where be your attributes now? your datasets? your rdf? your flashes of semantics, that were wont to set the Internet on a roar?
Not one now, to mock your own entities? quite value-fallen?”
Well, I apologise to everyone for that, and especially to the ghost of Shakespeare, but it seemed fitting and I just could not help myself.
Do I believe that the problems of libraries will be solved by making our metadata/catalog records available through linked data? No. I wish there were some kind of silver bullet that would solve these problems, but while including our information as linked data may make it more widely accessible, I still cannot see that that action will make our data more useful or relevant to the public at large. Much more needs to be done if we are to fit in that way.
Therefore, linked data is only the METHOD toward a solution and not a solution in itself, and it is only a single possible method at that. It may not solve much at all. Of course, we can go ahead and do it, just so long as we do not expect too much and that we do it as cheaply and easily as possible. It seems to me that to find a solution or solutions, a library as a whole has to address the problems, and not as our artificially created individual departmental units of selection, acquisition, cataloging, circulation, conservation, and reference. Approaching the problems that way resembles sending out uncoordinated groups of soldiers against an enemy army that has proven itself highly formidable. It's a recipe for disaster and it was something that Napoleon Bonaparte himself relied on. He much preferred fighting allied armies since he knew that the so-called allies were almost always just as busy fighting each other as they were fighting him.
No one outside of libraries has the slightest idea of our workflows or of our bureaucratic structures nor should they need any. Bureaucratic structures and workflows are notoriously difficult to change. Since changes are so difficult to implement, they take time, so bureaucratic structures and workflows very often reflect capabilities and needs from an earlier era.
Libraries, libraries as a whole, need to rediscover what it is that they really, genuinely do to help their communities, putting aside their individual departments and personal responsibilities. For instance, I keep bringing up that the public today desperately wants selection since it either takes too much of their time, or they don't feel competent to choose. But selection for our patrons must include what is in the patron's information universe, and not only the restricted area within any single library collection.
How does a patron see it? I think it's like in grocery stores when you ask a grocery clerk a question and he responds, “I'm sorry mister. That's not my department” and continues with his work, leaving you all alone.
So, what is it that libraries really do? It seems to me that reviewing the various statements on library ethics could be very helpful in figuring this out. For instance, when discussing selection, the excellent library ethics of not proselytizing a personally held belief, and of not making a monetary profit from recommendations to a user, should be much more widely circulated to the public than they are currently. I think that would be a tremendous point in favor of libraries since people would very quickly see how different we are from other search tools and how librarians really do have their interests in mind, while Google, Yahoo etc. have different purposes.
I would go into this more deeply but I have gone on long enough and I will save this topic for another episode of Cataloging Matters.
The music I have selected is “Variations on the folia of Spain, opus 45” by Mauro Giuliani. http://www.youtube.com/watch?v=aEYUPPi0scU Giuliani was a major musician of his time, and I discovered that he even played in the first performance of Beethoven's Seventh Symphony. He lived in Rome for a while but ended up doing better in Naples where he died in 1829. Giuliani seemed to have a special love for making variations on different themes, and we can certainly hear it in this piece.
That’s it for now. Thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.