Metadata Creation–Down and Dirty (Updated)

James L. Weinheimer

[This was originally written in 1999 and was cited in several places at that time. That text is now in the Internet Archive but without the images, so I am putting it on my blog. I have replaced the images more or less as I remember them, made some edits for additional clarity, and added an update at the end. I cannot remember one image at all, so I have deleted the reference in this version.]


How is metadata created? How should it be created? Who should do it? Who WANTS to do it?

These are not the usual questions that draw spirited discussion. While there are many articles and discussions about different metadata formats, there is very little discussion about how to go about making the metadata itself, i.e. the information that goes inside the metadata format:

<META NAME="Creator,Title,Format,etc." CONTENT="The part that hasn't been discussed very much.">

Traditional metadata creation has been practiced for millennia under various names and the actual procedures have changed tremendously. In its basics, however, it has always remained the same. Its fundamental purpose is: to bring similar items together. The idea of bringing similar items together ensures that a user does not have to look in dozens–or hundreds–of different places for the same thing. These items can be similar in all sorts of ways according to the needs of the collection: they may be items by the same author, or items using the same methodology, the same printers, the same colors, or anything at all.

This may sound simple enough but as happens so often, the practice of a simple rule results in incredible complexity.

Let’s start with a very simple example: deciding where to place a physical item in some sort of subject arrangement. Since it is a single physical item it can go only into a single place in the arrangement. Let us further suppose that this collection and its arrangement have been around for a long time and that many people have worked on it and used it. For our purposes, this subject arrangement is the Library of Congress Classification (it has been around for about 100 years) and its outlines are reproduced below.
[Figure: outline of the Library of Congress Classification]
Let us assume that I have a new item to add to the collection. The topic of my item is somewhat complex: let us assume that it is a book of photos of naval fighter planes used during WWII in the Pacific, which could go into several places in the arrangement. It could go under photography, WWII, airplanes, military avionics, naval studies, or, if the photos were taken by a single person, even under the photographer. I have a good understanding of the classification and feel that my item belongs best in a certain place, naval science:
[Figure: the classification outline, with the item placed under naval science]
Is this correct? After all, this is where I truly believe it should go. But, if I just put the item where I believe it should go, I may be making a serious mistake. Why? We must always remember that the purpose of all of this work is to put similar items together. Therefore, I must look to see whether there are similar items in the collection and where they are.

In this case, I find 25 similar items and discover that they have always been placed into a different area, under WWII. While I realize this area is also a possibility, I do not think it is nearly as good as mine. (This happens all the time.)
[Figure: the classification outline, with the 25 similar items under WWII]
Now I am faced with a dilemma. I actually disagree with the way the items have always been treated. What am I to do?

The answer may be surprising. If I decide that it is just a difference of opinion and simply put the item where I think it goes and forget it, it’s obvious that people will have to look in two different places to find all the similar items. How are people supposed to know that?
[Figure: the classification outline, with similar items split between two places]
Also, when the collection inevitably gets a similar item in the future, that librarian may think that these items belong in still another area. It turns out that if everybody did whatever they wanted, similar items inevitably would be scattered all over the place. This is not how information retrieval is supposed to work. Therefore, if I am resolved to put my item where I believe it should go, I must take the 25 similar items that were already cataloged, and rework them in some way so that they will all stay together.
[Figure: the classification outline, with the 25 items reworked to stay with the new item]
Now, what have I accomplished? Not only have I created a lot of additional work for myself and other library staff (which makes my library colleagues angry), I’ve also made the users angry because I have removed the items from the areas where they were accustomed to finding them. They knew how to find things before, but now I have forced them to go searching all over again.

In a few years’ time, a new item on a similar topic will inevitably arrive and that librarian may decide that where I have placed the books is wrong. He or she will move the books yet again—back to where they were originally or someplace different—thereby sending people wandering around again. This can go on forever, causing lots of extra work for everybody and only making things harder to find. The only practical solution is to swallow what I believe to be “correct” and place my item where everyone has been accustomed to finding such things, in this case, under WWII with the similar materials.
[Figure: the classification outline, with the new item placed under WWII alongside the similar items]
So, have I done something that is actually incorrect? No, because what is “correct” in terms of information access is different from other senses of “correct.” If I really believe my item is basically different from the others already there, then that would be another situation, but if I think it is the same, I have no real choice. “Correct” in the sense we are discussing here is: bringing similar items together. This is also called consistency.

The only time I will change what has already been done is when I have found a true error, which rarely happens. For instance in one collection I worked in, I found the works of D.H. Lawrence and T.E. Lawrence had been mixed together. I moved the items by T.E. Lawrence. Most of the time however, it amounts merely to a difference of opinion.

The natural response at this point is: “What does all this have to do with other sorts of items? Surely you’re talking about books on shelves here, but links to websites can be placed in more than one point on the line. In fact, the entire concept of lines is irrelevant, too, with the added possibilities we have with computers. Books can only go in one place, unless you want to buy extra copies.”

To answer this, let us continue to imagine a digital article that is placed somewhere on the web. A person has written something you violently disagree with; everything the fellow says is wrong. You disagree so much that you write a withering response and also place it on the web.

Now the task is to ensure that when people find one article, they find the other one.

You easily can guarantee that when people find your article, they will know about the first article merely by adding a link from your article to the other one, e.g.

To read this fellow’s ridiculous article, click HERE.

Things are rather more difficult in reverse, however. How can you ensure that when people find his page, they will find your page? That is what you really want, after all. One thing you could try would be to contact the fellow and politely ask if he would make a link to your article–which he probably wouldn’t do.

Let’s further assume that this fellow has added metadata to his paper and that you have examined it. You disagree with every word he has chosen and decide to use other, much better words in your metadata.

His Metadata:
<META NAME="Keywords" CONTENT="Inferior words">

Your Metadata:
<META NAME="Keywords" CONTENT="Other, much better words">

What have you succeeded in doing? You have just guaranteed that when people search the metadata used in his page, no one will find your page. Just as before, people would have to look in two different ways for the same things. How can people know that?

Remember that the purpose is to bring similar things together. So, even though you disagree with every word the fellow says, and you are free to write whatever you want in your article, if you want people to know about your article when they find his article, you must use the same metadata, even though it irks you, just as much as it did when we put the book in the “wrong” place.

Someday advanced search engines may allow searchers somehow to find the URL in your paper and bring your article together with his in some way, but let us further suppose that someone else disagrees with both articles and has written still another page on the same topic. This author is fed up with both of you and decides to ignore your writings completely. This fellow certainly wants searchers to find his article when they find yours. Neither of your URLs appears in his page, so that can’t work. But what sort of metadata should this author use?

It becomes clear: the same metadata as you two have used.

In this way, we can see that standardized terminology is a natural outgrowth of the primary task of bringing things together. It works on the Web just as it does in other media. The task is to describe similar items in a consistent fashion. [This will be discussed further in the Update]
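The effect of choosing different keywords can be sketched in a few lines of Python. This is only an illustration: the page names and the `build_index` helper are invented here, and a real search engine does far more than look up exact keyword strings.

```python
from collections import defaultdict

# Hypothetical pages and the metadata keywords their authors chose.
pages = {
    "his-article": {"inferior words"},
    "your-article": {"other, much better words"},
    "third-article": {"inferior words"},
}

def build_index(pages):
    """Map each metadata keyword to the set of pages that use it."""
    index = defaultdict(set)
    for page, keywords in pages.items():
        for keyword in keywords:
            index[keyword].add(page)
    return index

index = build_index(pages)

# Searching his keywords finds only the pages that share them.
found = index["inferior words"]
print(sorted(found))
```

The third author, who reused the existing keywords, is found together with the first article; the author who chose "better" words is not.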

More often than not, it turns out that a topic is expressed in more than one word, or that the subject of an article encompasses more than a single topic and can be expressed in various ways. For example, an item may be about the Aesthetic movement as it was seen in architecture during the late 1800s in England. Any of the aspects of this subject may be handled in very special ways. Obviously, the tasks of doing this analysis and relating it to the collection can become highly complex.

In some of the earliest years of information retrieval, various goals were laid down that determined what would constitute a successful–or unsuccessful–information structure. Among other things, an information retrieval system should allow a user to find what the collection has by its authors, titles, and subjects. [This will be discussed further in the Update]

It is important to note that this doesn’t mean users should be able to find just a few works by a certain author; they should be able to find everything in the collection by that author. Therefore, when users find the author Dostoyevsky, Fyodor, they should find everything that is in the collection by this author no matter which form of his name appears on any item. Or when they search Aesthetic movement, they should find everything. The forms of the words used in the metadata are based on bringing all works with the topic under a single form of name. Many times this form is based on the first item entered into the collection–a practice that should be more understandable now.

In reality, this is a tremendous undertaking and it doesn’t work perfectly. For instance, the idea of “everything in the collection” should not be taken literally but comparatively. In practice, for libraries, this means that a topic must make up at least 20% of any single item. So, if there is a paragraph about the Aesthetic movement in a 600 page book, it will not be included in the metadata, but if 120 pages of the book were about the Aesthetic movement, it would be included in the metadata. If the book is 100 pages, only 20 pages need to be devoted to the Aesthetic movement. Nevertheless, the goal is not to enable people merely to find something in the collection on a specific topic (which is a relatively simple undertaking), but to find all the things in the collection, within the 20% rule and some other guidelines. The two goals are completely different.
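The arithmetic of the 20% rule is simple enough to write down. The function name and its parameters are my own invention for this sketch, and real cataloging practice applies "some other guidelines" that are not modeled here.

```python
def meets_twenty_percent_rule(topic_pages, total_pages, threshold_percent=20):
    """Return True when a topic fills at least threshold_percent of the item.

    Integer arithmetic is used so that the boundary cases
    (120 of 600 pages, 20 of 100 pages) compare exactly.
    """
    return topic_pages * 100 >= threshold_percent * total_pages

print(meets_twenty_percent_rule(1, 600))    # a passing mention in a 600 page book
print(meets_twenty_percent_rule(120, 600))  # 120 of 600 pages
print(meets_twenty_percent_rule(20, 100))   # 20 of 100 pages
```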

The choice of the words themselves for the metadata is far less important for information access than the fact that similar items are brought together. Why? Because long experience has shown that no matter which word someone chooses, it is inevitable that others will come up with terms they believe to be “better”. The term an expert will use will very often be different from the one that a novice will use. Neither one is correct or incorrect.

One of the traditional tasks of information retrieval is to forego the “correctness” or “incorrectness” of a term (so far as this is possible) and concentrate efforts on helping people find the term used for bringing the items together. This is done through a system of Use: and Related: cross-references.

What is the “correct” name of Geneva, Switzerland? It depends on where you come from. It can be: Geneva, Genf, Ginevra, Jih-nei-wa, Ginebra, Cheneba, Geneua, or of course, Genève, along with lots of other possibilities. None of these forms is incorrect and no one should be faulted for searching under any of these forms, but they need help to find the one that has been chosen. Therefore, there are cross-references, e.g.

find Cheneba ==> Use: Geneva.

Additionally, someone may not be aware that Geneva was an independent republic from 1536 to 1798 and there are also items in the collection under another form. At these times, a cross-reference can come in very handy, e.g.

find Geneva ==> See also: Republic of Geneva.

For all of this to work, metadata creators who discover a new term for a concept that is already in the collection must record it as they create new metadata records. For example, if a librarian adds an item that uses the term “Cenevre” for Geneva, they must add a new cross-reference:

find Cenevre ==> Use: Geneva.
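As a rough sketch, the two kinds of cross-reference can be modeled as two lookup tables. The names `use_refs`, `see_also_refs`, `resolve`, and `add_use_reference` are made up for this illustration and do not come from any library system.

```python
# "Use" references point a variant form to the heading that was chosen.
use_refs = {
    "Cheneba": "Geneva",
    "Genf": "Geneva",
    "Ginevra": "Geneva",
    "Genève": "Geneva",
}

# "See also" references point a heading to related headings.
see_also_refs = {
    "Geneva": ["Republic of Geneva"],
}

def resolve(term):
    """Return the chosen heading for a term, plus any related headings."""
    heading = use_refs.get(term, term)
    return heading, see_also_refs.get(heading, [])

def add_use_reference(variant, heading):
    """Record a newly discovered variant so later searchers are redirected."""
    use_refs[variant] = heading

# Before the librarian adds the reference, "Cenevre" leads nowhere.
before = resolve("Cenevre")
add_use_reference("Cenevre", "Geneva")
after = resolve("Cenevre")
print(before, after)
```

The searcher never needs to know which form was chosen, only that some form they know has been recorded as a reference.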

In the last twenty years or so, the introduction of computers has allowed users to search databases by separate words in the record, called keyword searching. This can even be done with entire texts. From the beginning of study into such searches, the problems of bypassing the standardized terminology were clear. One question was obvious: how do searchers know whether a record (or text) with certain words in it has anything to do with their topic? This problem was compounded by another dilemma: what is the best way to order the keyword results for the user? Attempts have included: by date of publication, latest date of input into the collection, and by location of the word, among other attempts. [This will be discussed further in the Update]

The problems with keyword searches are many. There are homonyms, e.g. “fossil” is a term for the stratified remains of past life, a company that makes watches, a pejorative term for an old person, and so on. Also, the results of keyword searches are almost always much larger than those of traditional searches and users can find themselves sorting through masses of irrelevant material. There is also not even the possibility that a searcher is retrieving all the information on a topic, even if it is limited to 20% of an item.

Many users also make incorrect assumptions about their searches. They tend to believe that when they make a keyword search, e.g. “World War II”, they are searching for the concept of the Second World War, when they are actually searching for three words scattered in various ways throughout a text. If they searched the standardized terminology, they would indeed be searching concepts, but with keyword searching it cannot be assumed.
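The difference between searching words and searching concepts can be shown with a toy example. The three records and their subject headings below are invented for illustration, and the matching is deliberately naive: a record is a keyword hit if it contains every query word anywhere in its text.

```python
import re

# Hypothetical records: free text plus a heading assigned by a cataloger.
records = [
    {"text": "A history of World War II in the Pacific",
     "subject": "World War, 1939-1945"},
    {"text": "Richard II went to war and changed the medieval world",
     "subject": "Richard II, King of England"},
    {"text": "Naval aviation in the Second World War",
     "subject": "World War, 1939-1945"},
]

def words(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_search(query):
    """Match records whose text contains every query word, in any position."""
    wanted = words(query)
    return [r["text"] for r in records if wanted <= words(r["text"])]

def subject_search(heading):
    """Match records that carry exactly the given standardized heading."""
    return [r["text"] for r in records if r["subject"] == heading]

keyword_hits = keyword_search("World War II")
subject_hits = subject_search("World War, 1939-1945")
print(keyword_hits)
print(subject_hits)
```

The keyword search picks up the item about Richard II, because the three words happen to appear in it, and misses the item that says "Second World War". The subject search returns both items on the war and nothing else.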

Relevance is one of the latest attempts to order the results of keyword searches and has become the most popular method of searching the Internet. The results are ranked by the number of times a term is used, how it is used in a text, how often other articles cite it, or other ways, depending on what a specific database considers to be “relevant”. The results of relevancy searches can be excellent, but all of us have been mystified by some items at the top of the relevancy rankings that have nothing at all to do with what we want, while other sites that are much more relevant show up much farther down the list. [This will be discussed further in the Update]
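To make the first of these ranking methods concrete, here is a minimal sketch that orders documents purely by how many times a term occurs. The two documents are invented, and no actual search engine ranks this simply; the point is only to show how a count can put an irrelevant item first.

```python
import re
from collections import Counter

# Hypothetical documents, invented for illustration.
documents = {
    "watch-ad": "Fossil watches: buy a Fossil watch, Fossil straps, Fossil gifts",
    "paleontology": "How a fossil forms in stratified rock over millions of years",
}

def term_frequency(text, term):
    """Count how many times the term occurs as a whole word in the text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return counts[term.lower()]

def rank(term):
    """Order document names by term count, highest first."""
    return sorted(
        documents,
        key=lambda name: term_frequency(documents[name], term),
        reverse=True,
    )

ranking = rank("fossil")
print(ranking)
```

Someone looking for the remains of past life finds the watch advertisement at the top of the list.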

Relevance ranking also tacitly assumes that authors in the past have used similar terms in similar mathematical correlations and that they continue to follow these same criteria today–something that is highly dubious, at the very least.

Traditional information specialists welcome the increased power from keyword searching, including relevance ranking, and they have joined in the task of discovering new and better methods to improve keyword searching, but it has not eliminated the need for authorized forms and consistent analysis, although those methods are changing. It has turned out that one of the most powerful uses for keyword searching is that it can simplify the user’s task of finding the standardized terms used in the metadata records.

Traditional information specialists look at the present problems of information retrieval from a unique vantage point: how can we bring similar things together in a way that is useful to the user, and how can we make sure that we are retrieving everything from a search, i.e. when we search for the history of World War I, how can we guarantee that we are getting everything on WWI and not just a few random items? If we are getting just a few items, then which ones are we getting? If we can’t answer these questions, what is the goal of information retrieval today? Are the traditional goals of information retrieval even relevant in today’s environment? Is it that people no longer want to find items by their authors, titles, and subjects? Or is it a different problem altogether?

In the “free and open” world of the Internet, there are no accepted standards for metadata content at the moment, and little is being discussed. There are no authorized forms, and there is no way that one person can “correct” the metadata embedded in another’s site (which is a frightening thought!).

The examples given here are very simple and barely scratch the surface of traditional metadata creation. As we have seen, experts traditionally have had the authority to change any metadata they wish, but even this is more complicated than appears at first glance, and can entail lots of work with associated frustrations for the users. Traditional information retrieval may not have all the answers, but it can pose some very good questions. Information retrieval specialists have unique knowledge and experience which should be of tremendous help when we tackle the problem of finding items on the Internet.

Update

When I re-read this document I was surprised that it needs relatively few updates. One example is the antiquated coding. Today, the coding would be changed into RDF triples, XML, microformats or something else. The basic idea remains the same however.

Aside from this, the major changes lie primarily in three areas: Linked Data, Search and the Single Search Box.

Linked Data

With linked data, we are not dealing with much that is fundamentally different from what we have already discussed. Exactly the same things happen as described before, except that people must assign not the same words (text) but the same URIs. For instance, if there is an item about cats, the text can be in many words: cat, gatto, Katz, Кот, मांजर, قط or any of the words seen in the section owl:sameAs of this page http://dbpedia.org/page/Cat. This is the page for the URI, which is one part of the linked data network.

How would this work? In a correctly designed search system where everything is based on URIs, someone would be able to enter a search for “gatto” and, in the background, the system would translate this word into the URI http://dbpedia.org/page/Cat. It would then search for wherever this URI has been included and in this way find قط and Katz and cat and so on. Thus we see that the traditional method of adding standardized vocabulary still holds, except that it has turned into adding the same URIs.
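A minimal sketch of such a lookup follows, under the assumption that the labels and the tagged items are already held in two simple tables. The item titles, the dog URI, and the `search` function are invented here; the cat URI is the one given in the text.

```python
CAT = "http://dbpedia.org/page/Cat"  # the URI as given in the text

# Labels in several languages that all stand for the same URI.
label_to_uri = {"cat": CAT, "gatto": CAT, "Katz": CAT, "Кот": CAT, "قط": CAT}

# Hypothetical items, tagged with URIs rather than with words.
items = {
    "Il gatto di casa": {CAT},
    "Cats of the world": {CAT},
    "Dog training basics": {"http://dbpedia.org/page/Dog"},
}

def search(label):
    """Translate a label into its URI, then find every item tagged with it."""
    uri = label_to_uri.get(label)
    if uri is None:
        return []
    return sorted(title for title, uris in items.items() if uri in uris)

results = search("gatto")
print(results)
```

The search for "gatto" returns the English title as well as the Italian one, because both carry the same URI. An item that was never tagged with that URI cannot be found this way, whatever words it contains.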

This can be compared to the previous example of different people writing articles on the same topic and disagreeing violently with everything the others wrote: they had to use the same metadata keywords if others were to find all of their writings. The same thing happens here, except that everyone must use the same URIs. In a similar fashion, if people decide not to use the same URIs, they guarantee that when someone finds one item on a topic, they cannot find the others.

The result is that in a linked data universe there are no longer “headings” in the sense that librarians traditionally think of them. The “heading” for cat becomes http://dbpedia.org/page/Cat while all text forms associated with it become more similar to cross-references. The power of modern systems allows the human display of the “URI heading” to be almost anything: a single bit of text (cats), multiple bits of text (where the words cats, felines, kittens can display at the same time), an image of a cat, a sound (a cat’s meow), or a video. The display could even be customized by each person, so that one person might see text while another sees an image.

Nevertheless, linked data could solve some important parts of the problem: once a system of linked data is in place, people would no longer all have to search the same text, but somebody, somewhere would still have to add the same URIs to the items, just as earlier, they had to add the same text to the items. And as we have seen, there are many problems with that.

An additional problem arises however: there are several systems of URIs. Which is someone supposed to use?
http://dbpedia.org/page/Cat
http://id.loc.gov/authorities/subjects/sh85021262
http://aims.fao.org/aos/agrovoc/c_1390.html
http://vocab.getty.edu/aat/300265960
https://www.freebase.com/m/01yrx

Is the solution to link these systems together? Some are trying to do it but that’s not so easy either. What will be the final product for the searcher and how can it all be made coherent to a human? Will any of it really be useful? Nobody knows. Anyway, it is difficult to say how long it will take to build such a system and even once the system is done, it needs to be seen whether and how many people will be willing to implement it. The usefulness of the final product for the end users also needs to be demonstrated.
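Mechanically, linking the systems amounts to keeping a table that says which URIs stand for the same concept. The sketch below treats the five URIs listed above as equivalent, which is an assumption made for illustration and not something I have verified; the `same_concept` helper is likewise invented.

```python
# The URIs listed above, treated here as naming the same concept.
# That equivalence is an assumption of this sketch and would need checking.
equivalent = [
    "http://dbpedia.org/page/Cat",
    "http://id.loc.gov/authorities/subjects/sh85021262",
    "http://aims.fao.org/aos/agrovoc/c_1390.html",
    "http://vocab.getty.edu/aat/300265960",
    "https://www.freebase.com/m/01yrx",
]

# Map every URI in the group to one representative URI.
canonical = {uri: equivalent[0] for uri in equivalent}

def same_concept(a, b):
    """Return True when both URIs are known and map to the same representative."""
    return a in canonical and b in canonical and canonical[a] == canonical[b]

print(same_concept(equivalent[0], equivalent[1]))
print(same_concept(equivalent[0], "http://example.org/dog"))
```

The table itself is trivial. The hard part, which the code hides, is deciding whether two vocabularies really mean the same thing by a term, and who maintains the table as each system changes.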

Linked data has great promise to provide practical results to the majority of people, but a lot of work remains to be done and there are many unanswered questions.

Search

Search is probably the area where the most important changes have taken place since I wrote this article in 1999; it has changed almost completely since that time. Larry Page and Sergey Brin had founded Google as a company only a year before, so obviously much of what I wrote needs reconsideration.

At the end of the 1990s, the only search engines were AltaVista and WebCrawler and others that were similar, and the results obtained from them were inevitably subjects for some very funny jokes. It wasn’t until Google came up with PageRank that something substantively more useful came about, and there have been many developments since that time.

I wrote a podcast about Search, and everything there still holds true. Today, the growth of semantic technologies is changing the very concept of searching. These technologies are based primarily on the so-called “big data” about you: where you go, what you look at on the web, what is in your emails, who you talk to on your phone, where those people go, what is in their emails, any social interactions all of you may have, what you have bought over the web, and so on. In fact, searching as we have traditionally known it is expected by many to diminish as other methods take over.

This is what Tim Berners-Lee intended with his call for creating “intelligent agents” by building “The Semantic Web.” I discussed this in my podcast and it is now coming true.

For instance, there is something called “intuitive search” where the system, based on your activities, which will be monitored in increasingly detailed ways, will predict what you want even before you know you want it yourself. Many have experienced this already. Perhaps you are in another city; it is time to eat and you find a message on your smart phone with an ad for a nearby restaurant that has gotten high reviews, perhaps by friends of yours, or by their friends.

In the future, when the web turns into the “Internet of Things” and your refrigerator and almost everything else is hooked into the web, you could be coming home from work and get a message from your refrigerator saying that there is no milk, so you should stop and get some. With wearable technology, our very bodies can be monitored constantly, so that we can be told to take a medication that we forgot. Or perhaps the bathroom scale decides I weigh too much today and confers with the bracelet on my wrist that monitors my blood pressure, which communicates this to my Google glasses, which recognize a doughnut in my hand, and I get some sort of message that tells me to get some exercise and put that doughnut down.

Some may find such a future wonderful; others may find it horrifying, but no matter what we think, this is what many very powerful companies are planning for, and it is the thrust behind the idea of “intuitive search.” It is clear that for intuitive search to work at all, a very powerful system must know an awful lot about you. Some may consider this an invasion of privacy and others may not, but no matter: it is what many organizations are planning for and why many believe that “traditional search,” such as we see in Google today, will gradually disappear. (For more on this, see “What’s Ahead For Paid Search?” from Search Engine Land, and “Google Hummingbird: A Sophisticated, Intuitive Search Tool” from Syracuse University, School of Information Studies.)

While most of these developments are focused on business and in the social world, such developments will of course have a major impact on what users expect from libraries. Already, I have discovered that the idea of searching for authors or titles or subjects is being forgotten by many young people and they think only in terms of keywords. Even the notion that searching for information can actually be hard work is difficult for many to grasp when, in other spheres, they can find a new app or find reviews for a nearby restaurant in just a few seconds. When they have trouble finding information for a paper for class, they often think it is a problem not with themselves, but with the systems—especially library systems.

Users who have grown up on Google and Yahoo searches have already changed their expectations and their opinions of libraries, along with their catalogs, from what earlier generations expected. It is difficult to imagine how their expectations and opinions of libraries will change when “intuitive search” really gets going, but they will most definitely change.

My own opinion is that library methods still provide important “added value” found nowhere else and should be retained, but if libraries do not follow these new developments very closely and institute methods to adapt to them, our traditional methods will look more and more obsolete and antiquated.

Single Search Box

The single search box is one of the library answers to the point mentioned above, that the “… idea that searching for information can actually be hard work is difficult for many to grasp…” One answer to make things easier has been to institute the “single search box” that searches many databases and different kinds of tools at once. This can be done in a few ways. One option is to convert records from these other databases–that are based on other rules or no rules at all–into MARC21 (the library’s format) and put them into the local library catalog, so that when people search the library’s catalog, they are searching “everything.”

Or they can institute a type of federated searching. This happens when you implement a system that searches all of the sites simultaneously, brings back the results and puts them all together in a single result. (For a demo, see this search for the term metadata, which searches library catalogs, Wikipedia, OpenCourseWare and other resources all at the same time. This list can be expanded. These resources do not follow–and cannot follow–the simple rule to “bring similar items together” because many of these systems have no such rule.) No records need to be downloaded and converted into the local catalog and everything happens on the fly. Those who search such a system experience no real differences from the first option. The final product will obviously have problems with “keeping similar things together” as I have described.
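In outline, a federated search looks something like the sketch below. The three source functions are stand-ins that return canned records instead of querying anything, and their names and field layouts are invented to show how little the sources have in common.

```python
# Hypothetical sources. Each returns canned results in its own format;
# a real system would send the term to a live service instead.
def search_catalog(term):
    return [{"title": "Metadata fundamentals", "subject": "Metadata"}]

def search_wiki(term):
    return [{"page": "Metadata"}]

def search_courseware(term):
    return [{"name": "Intro to metadata", "tags": ["data", "info"]}]

SOURCES = {
    "catalog": search_catalog,
    "wiki": search_wiki,
    "courseware": search_courseware,
}

def federated_search(term):
    """Query every source and merge the raw records into one result list."""
    merged = []
    for source, search in SOURCES.items():
        for record in search(term):
            merged.append({"source": source, "record": record})
    return merged

results = federated_search("metadata")

# Only the catalog record carries a standardized subject heading.
with_subject = [r for r in results if "subject" in r["record"]]
print(len(results), len(with_subject))
```

Everything lands in one list, but only one of the three records has a subject heading that could be used to bring similar items together.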

I shall refer to a post I made recently that discusses the problem:

“… I will take a backseat to no one concerning the importance of consistency. It is one of the reasons why I have been against a lot of changes with RDA.

BUT (there is always a “but”), the fact is: we are living in transitional times. At one time–and not that long ago, just 10 or 15 years ago–the library catalog was a world apart. It was a closed system. A “closed system” meant that nothing went into it without cataloger controls, and when the catalog records went out into the wider world, they went into a similar, controlled union catalog, such as OCLC, RLIN, etc.

The unavoidable fact is, that world has almost disappeared already and the cataloging community must accept it. The cataloging goal of making our records into “linked data” means that our records can literally be sliced and diced and will wind up anywhere–not only in union catalogs that follow the same rules, not only in other library catalogs that may follow other rules, but quite literally anywhere. That is what linked data is all about and it has many, many consequences, not least of all for our “consistency”.

Plus there is a push for libraries to create a “single search box” so that users who search the library’s catalogs, databases, full-text silos and who knows what else, can search them all at once. Again, the world takes on a new shape because these other resources have non-cataloger, non-RDA, non-ISBD, non-any-rules-at-all created metadata, or no metadata at all: just full-text searched by algorithms. Those resources are some of the most popular materials libraries have, or have ever had. They are expanding at such an incredible rate that they would sink entire flotillas of catalogers working 24 hours a day. The very idea of “consistency” in this environment begins to lose its meaning.

For example, if a normal catalog department can add, let’s say, 70,000 AACR2/RDA records to their catalog per year, but the IT department is adding hundreds of thousands or even millions of records that follow no perceptible rules at all from the databases the library is paying lots of money for (this is happening in many libraries right now), then in just a few years, the records from non-library sources will clearly overwhelm the records from catalogers. That is a mathematical certainty.

Even without data dumps into the catalog itself, instituting the single search box will result in exactly the same thing from the searcher’s perspective: records of all types will be mashed together, where there will be far more non-library-created records than library-created records.

So the logical question about consistency is: Consistency over what, exactly? It is hard to escape the conclusion that it is consistency over an ever diminishing percentage of what is available to our users.

Yet, I still believe very strongly in consistency, but it must be reconsidered in the world of 21st century linked data and the abolition of separate “data silos”. It is all coming together, and both the cataloging community and the library community seems to want this. I want it too.

The idea and purpose of consistency will change. It must change, or it will disappear. Is it at all realistic to think that these other non-library databases will implement RDA? [That is, the current library rules for metadata] Hahaha! But if a huge percentage of a catalog follows no rules at all, how can we say that consistency is so important? If consistency is to mean something today and in the future, it will have to be reconsidered. What are its costs and benefits?

I consider these to be existential questions for the future of cataloging. I don’t see these issues being discussed in the cataloging community, but I have no doubt whatsoever they are being discussed in the offices of library administrators, whether they use words such as “consistency” or not.”
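The arithmetic in the quoted post can be checked with a few lines of code. The figure of 70,000 cataloger records per year comes from the post; the existing catalog of 1,000,000 records and the 500,000 records per year from other sources are numbers I have picked only to make the example concrete.

```python
def years_until_outnumbered(existing, cataloger_per_year, other_per_year):
    """Count the years until non-library records outnumber cataloger records.

    Assumes the catalog starts with `existing` cataloger-made records and
    no records from other sources.
    """
    if other_per_year <= cataloger_per_year:
        raise ValueError("other records never overtake cataloger records")
    cataloged, other, years = existing, 0, 0
    while other <= cataloged:
        cataloged += cataloger_per_year
        other += other_per_year
        years += 1
    return years

# Assumed figures: 1,000,000 existing records, 500,000 new outside records a year.
years = years_until_outnumbered(1_000_000, 70_000, 500_000)
print(years)
```

With these assumed figures, the records from other sources outnumber the cataloger-made records in the third year.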

Conclusions

To sum up how my own ideas have changed: today I am much less certain than I was 15 years ago about the superiority of library methods. I still believe in the library goals of providing resources that have been selected by professionals according to open, professional standards. Afterwards, those resources should be described and organized in professionally standardized ways, while all standards are governed by the good of the users and aim to be as objective as possible. I do not consider those goals obsolete in any way, and they are completely different from the goals of other projects such as “the semantic web” or “intuitive search.”

Traditional library methods are completely different however. As this article shows, methods used in other types of information processing are developing at an incredible rate and these methods are specifically designed to appeal to the public. These developments simply must have consequences for libraries, but it is difficult to predict exactly what those consequences might be. For instance, searching by last name [comma] first name, which everyone did automatically not that long ago, is for all intents and purposes, gone. As mentioned before, the simple concept of searching by author, by title, or especially by subject, is being forgotten. This doesn’t mean that people would not like these options if they knew about them, understood them, and found them easy to use, but simply getting to that point will be a huge undertaking.

Nevertheless, I think it would be a worthwhile and a noble endeavor.
