Cataloging Matters No. 21
A Few Thoughts about BIBFRAME
Hello everyone and welcome to Cataloging Matters, a series of podcasts about the future of libraries and cataloging, coming to you from the most beautiful, and the most romantic city in the world, Rome, Italy. My name is Jim Weinheimer.
The impetus for this podcast is Jeff Edmunds’ essay “BIBFRAME as Empty Vessel,” which I cite in the transcript. I began writing a response but then decided to turn it into a podcast, mainly because I haven’t done one in a while.
“BIBFRAME as Empty Vessel / Jeff Edmunds. 23 February 2017” https://drive.google.com/file/d/0B1IKJYVwLwHyX1VnblJFZ3EtS1U/view
First, I want to say that I second all the main points made by Jeff Edmunds’ excellent essay about the problems of BIBFRAME. In this podcast, I want to discuss something a little different.
It seems to me that if we are to evaluate BIBFRAME, we must relate our judgment to its purpose or purposes. The basic purpose of a traditional catalog has always been very simple: to help users of a library collection discover what materials are in that collection and then to let them know where those materials are located. Of course, the practice is much more complicated than this simple statement suggests but basically that’s what it does. The purpose of BIBFRAME seems to be quite different however.
So, what is that purpose? That is the subject of this podcast. The earliest statement of BIBFRAME’s purpose that I can discover is in the initial BIBFRAME announcement, where the Library of Congress quoted from the Working Group on the Future of Bibliographic Control:
“Recognizing that Z39.2/MARC are no longer fit for the purpose, work with the library and other interested communities to specify and implement a carrier for bibliographic information that is capable of representing the full range of data of interest to libraries, and of facilitating the exchange of such data both within the library community and with related communities.”
https://www.loc.gov/BIBFRAME/news/framework-103111.html [this link now goes into the Internet Archive because it disappeared from LC’s site and I can no longer find it]
I agree with that too-technical statement, but what does that part really mean: no longer fit for the purpose? After considering it at some length, it seems to me that what that statement actually says is that the goal of the ILMS (the Integrated Library Management System) turned out to have some major defects. I need to explain.
First, for those younger librarians who may be listening, the ILMS was the idealized dream of an earlier generation of librarians. It envisioned bringing together all the computerized systems that a library used to manage its technical services. At that time computers were far less powerful than they are today, and the development of hardware and software happened more slowly. Open source systems, feedback to the developers and collaborative development all took place at a slower pace than today. And it was expensive.
In the library, each task was considered so complex that more often than not it required a completely different computer system. Normally this meant one system for the online public access catalog, or OPAC, that the public used, and separate systems for each of the related technical tasks of the library: cataloging, acquisitions, and circulation. Staff therefore needed to know how to operate systems that were quite different from one another, and sometimes libraries even needed different IT services. Because each system made its own specialized demands, individual staff members specialized in specific systems, which in turn led to separate organizational departments built around those systems. Another result was duplicated labor: the same information often could not be shared among the systems and had to be reentered manually. It was all kind of crazy compared to today, and the dream at that time was to bring it all into one single system, which would make everything much easier for everyone.
When the dream turned into reality, it turned out there was a major problem: although the ILMS had integrated the tasks within the library more or less well—and I won’t get into that—it was insufficiently integrated with the other computer networks that had been built. For example, librarians would be excited that they had at last gotten their ILMS but then discovered that their acquisitions module would not “talk” with their university’s accounting system, or their circulation module did not “talk” with some other management system. All kinds of workarounds had to be devised, sometimes requiring … manually reentering information! Much of the problem stemmed from the MARC format: in the ILMS, all of the tasks tended to be centered around the bibliographic record, and the bibliographic record was in MARC while the non-library systems used other formats.
At around the same time the World Wide Web popped up–seemingly out of nowhere–and there was a mad rush for everyone to get on. The WWW was born when Tim Berners-Lee created HTML at CERN in Switzerland, and he based HTML on what was easily available to him: an in-house version of SGML called SGMLguid. (https://en.wikipedia.org/wiki/HTML)
A short digression here: over the years it has occurred to me that if Berners-Lee had chanced to be a librarian, perhaps he would have based his WWW language on MARC format instead of SGML, and a MARC variant would be the language of the web today, but it was fated to be otherwise.
In any case, computer technologies developed in any number of ways, and the Integrated Library Management System, which for so many years had been seen as the “El Dorado” among the library community, turned out—amazingly!—to be a dinosaur even before it was implemented. There was no time for reflection though: task number one was that libraries HAD to get their catalogs onto the World Wide Web. As fast as possible.
So they did, and they did it by making the front-ends of their OPACs compatible with the World Wide Web and they had their MARC records display in HTML. In this way, anyone in the world with a web browser could navigate to a library’s catalog, search that catalog, and display any records found in the OPAC within a web browser. Plus, thanks to the ILMS, people could even know if something was on order or if it was checked out and when it was due back in the library. Problem solved, and brilliantly too!
Or so it seemed.
It soon became clear that just putting an ILMS on the web wasn’t nearly enough. The problem was that part where I said that “… anyone in the world with a web browser could navigate to a library’s catalog …” It turned out that few people did that. What happened was that the public began to use the general search engines more and more often: among others there were Lycos, Web Spider (my favorite), AltaVista, Yahoo! and later came the new upstart, Google, and people relied on whatever they found in those search engines. They didn’t take the extra trouble to navigate to an individual library’s catalog.
Therefore, it turned out that, in a sense, anything that was not in the search engines didn’t exist—at least not for that group of people who relied almost exclusively on what they found in the search engines. And for better or worse–I thought it was worse, but what I thought didn’t matter–that group of people turned out to be multiplying at a nightmarish rate.
Libraries began to learn about something new: competition. Amazon.com seemed to understand this very quickly and they allowed their individual metadata records to be added to the search engines so that when someone searched for an individual book, e.g. someone would search “twain huckleberry finn” in Yahoo, they would find a link to the individual record in amazon.com where they could “easily” buy a copy with one click. But that same search “twain huckleberry finn” in Yahoo would not show a searcher a similar link into a library’s catalog, even though the catalog was on the web, the library itself was located right next door, and had a nice copy sitting on the shelf.
Why didn’t people know about that book on the library shelf? Because people preferred to stay in the search engine results and were not taking the trouble to search the library’s catalog separately, where they would have found it. The individual records of a library’s catalog were not in the search engines and couldn’t be found there, although Amazon’s records were. Why weren’t the library records there? There were several reasons, but one of the main ones was because the library records were in MARC format. Therefore, it turned out that as long as MARC didn’t change into something more web friendly, records of a library’s catalog could never be found in a search engine results page.
With so many individuals and institutions participating in that incredible land rush into the WWW at that time, people who used the search engines were already retrieving hundreds or thousands of results for almost any conceivable search, and it was much easier for them to choose one of those options instead of navigating to their library’s catalog. Insisting that the public was just being lazy! and that they should be taught how to use their library catalogs!—correct sentiment or not—was as useless as trying to beat back the ocean. As a result, not only were library catalogs being used less and less, library collections were used less, and libraries themselves were frightened that they were being used less too. Libraries risked being turned into isolated and bypassed islands, or to use the technical term, “information silos” that would remain forever separate from the wider web community. All because of an obsolete format.
So, there was increasing pressure to change formats. One of the main library initiatives was, and it seems to be still on-going, OAI-PMH (that mouthful means Open Archives Initiative Protocol for Metadata Harvesting). Lots of people worked on this method and it was all the rage in libraries for a time. There were negotiations between the OAI-PMH community and the search engines. I personally thought this was going to be the ultimate answer for libraries, but in 2011, negotiations were broken off and the search engine community of Google, Yahoo, Bing and later Yandex announced that they would support only their own method called schema.org. Around this same time, the BIBFRAME initiative also began with its focus on FRBR and Linked Data.
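For listeners unfamiliar with how OAI-PMH harvesting works in practice: a harvester simply issues HTTP requests built from a small set of protocol “verbs” defined in the OAI-PMH v2.0 specification. Here is a minimal sketch in Python; the repository base URL is invented for illustration.

```python
from urllib.parse import urlencode

def build_listrecords_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (per the OAI-PMH v2.0 protocol).

    'verb' and 'metadataPrefix' are required protocol parameters; 'from'
    enables selective harvesting of records changed since a given datestamp.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

# "https://example.org/oai" is a hypothetical repository endpoint.
url = build_listrecords_url("https://example.org/oai", from_date="2017-01-01")
print(url)
```

A harvester would fetch that URL, parse the returned XML, and repeat with a `resumptionToken` until the repository reported no more records.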
I think this foray into history is necessary to help understand what the purpose of BIBFRAME is, so that people can then determine for themselves whether it will succeed or fail. Also, it’s interesting to remember an earlier utopian idea among the library community, the Integrated Library Management System, and see how its realization did solve some problems but created some highly serious ones as well. As we have seen, one of those problems is that the public of today has turned to different information tools than they used before the World Wide Web and search engines; that was something the dream of the ILMS never imagined.
Based on these considerations, the solution today for many librarians is obvious: a library’s catalog records must be put into the search engines because that is where the users are. As we have also seen, the search engines have already decided that if you want to do that, it must be in schema.org.
Well, let’s be brutally honest here: it doesn’t HAVE to be schema.org, but that is the way it is supposed to work if you have structured data and you want to maintain it in the search engines. A library could have always converted their records into flat files—the electronic equivalent of catalog cards—and fed all of those into the search engines. That’s probably what Amazon.com did. I don’t know how many libraries have done that however. After all, if you can’t search Google or Bing by author, title or subject, only by keyword, what is the use of retaining all of that extra coding in the files you share with them?
But, let’s see how schema.org is supposed to work. To demonstrate the power of schema.org, I can’t do it with books right now but I can with something else: recipes. Suppose that someone has a favorite recipe they want to share. That person could put it on the web somewhere—anywhere, on their own blog or someone else’s, and mark up that recipe using the Recipe section of schema.org (https://schema.org/Recipe). Then, a search engine will come along, index the page and if the search engine builds the correct interface, as Google has done with Google Recipes, you can then interact with the search result.
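To make this concrete, here is what such markup looks like. A schema.org/Recipe is typically embedded in a page as JSON-LD inside a `<script type="application/ld+json">` tag. The sketch below builds a minimal example in Python; the recipe values themselves are invented for illustration.

```python
import json

# A minimal schema.org/Recipe object, serialized as JSON-LD: the markup a
# blogger would embed in their page so that search engines can index the
# recipe as structured data. The values below are made up.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Vegetable Biryani",
    "recipeIngredient": ["basmati rice", "mixed vegetables", "spices"],
    "cookTime": "PT45M",  # ISO 8601 duration: 45 minutes
    "nutrition": {"@type": "NutritionInformation", "calories": "350 calories"},
}
print(json.dumps(recipe, indent=2))
```

Because properties like `cookTime` and `calories` are machine-readable, a search engine that indexes this markup can offer exactly the kind of filtering by ingredient, cook time, or calories described below.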
Here, someone has searched for “vegetable biryani” and we can see how the result can be limited by ingredient, cook time or calories. All very nice and useful, but ….
Google cancelled Google Recipes, after who knows how many people marked up their recipes, and it did so without any notification at all. I still can’t find any notification. Today if you search Google for “chicken marsala” you will get a bunch of recipes plus the “Knowledge Graph” that essentially uses linked data to copy the information from Wikipedia. But you cannot interact with the search results the way you could when Google Recipes existed. I liked Google Recipes. It was fun and I was sad to see it go.
If you search Bing, you will find a type of interactivity but it is different from what was found in Google Recipes. On Bing, recipes from the site allrecipes.com have a type of interactivity within a single recipe where you can click to get an overview, the list of ingredients and so on. This works for searches done today.
After examining the search results it appears that all of this interactivity is with recipes from allrecipes.com. I suspect they pay Bing for this but I cannot be sure.
Today, people who want similar capabilities to what Google Recipes had must turn to special sites and apps such as Yummly, which uses schema.org for recipes. (http://www.yummly.co/how-it-works/) For those who want to share their own recipes, there are special plugins today that help bloggers to mark up their recipes with schema.org. (https://wordpress.org/plugins/yummly-rich-recipes/) Then you can try to get a service such as Yummly to index your recipe but they might not. Although you can still find lots of recipes in Google, those earlier capabilities we saw are gone and you must use other tools.
Within the more generalized Google search results, it seems as if schema.org is still used today but only in highly specific ways: pages marked up in schema.org are supposed to rise higher in the search results than those without, but that is only a single part of the huge, complex, deeply secret, and constantly changing Google algorithm that determines search result ranking. Some of the search engines, however, do use schema.org to allow web masters to have greater control over how their information displays in the search results by using something called “rich cards” or “rich snippets,” as we saw in the recipe from Bing. (https://webmasters.googleblog.com/2016/05/introducing-rich-cards.html)
Why am I discussing this at such length? To show that if the purpose is for libraries to get their records into the search engines where the public can see them, they don’t need BIBFRAME, so that cannot be a purpose of BIBFRAME. They need schema.org and to be honest, they don’t really even need that. It is true that MARC format wouldn’t be a viable option, though.
Although schema.org is not all that good for library materials right now, records could still be converted as well as possible, and then when schema.org for library materials is completed, the original conversion could be completely overlaid and re-fed into the search engines. That may sound like a major task but updating huge sections of their indexes is just a part of everyday business for the search engines.
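What would such a conversion look like? A rough sketch follows: a few MARC-style fields mapped onto a schema.org/Book object. The record here is a drastically simplified, hypothetical stand-in (real MARC parsing, with a library such as pymarc, involves subfields, indicators, and much more), but it shows the basic shape of the mapping.

```python
# A hypothetical, drastically simplified "MARC record": tag -> value.
# Real MARC records have subfields and indicators; this is only a sketch.
marc = {
    "100": "Twain, Mark, 1835-1910.",        # main entry, personal name
    "245": "Adventures of Huckleberry Finn",  # title statement
    "260": "1884",                            # publication date
}

def marc_to_schema_book(record):
    """Map a few MARC fields onto a schema.org/Book JSON-LD object."""
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "author": {"@type": "Person", "name": record.get("100", "")},
        "name": record.get("245", ""),
        "datePublished": record.get("260", ""),
    }

book = marc_to_schema_book(marc)
```

If the schema.org vocabulary for library materials later improved, the same records could simply be re-run through an updated mapping and re-fed to the search engines, as described above.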
In the transcript I have added an appendix that will not be in the podcast that briefly describes the problems search engines have with databases and how webmasters can get the contents of their databases indexed by the search engines.
Someone may take exception and say: but if we convert directly into schema.org, we lose the linked data we may have. My reply is: first, nothing is ever lost so long as you retain everything locally. Second, depending on whether your records contain linked data now and on how you do the conversion into schema.org, what you share with Google can include the linked data, because schema.org definitely allows for linked data. But all of that is beside the point: after your data is “fed” into and “digested” by the search engine indexes, what people see in the search results will not have linked data. Except for the Google Knowledge Graph, what we see in the search engine results pages does not use linked data at all; there is just the link to the item on the web along with a short description that can be customized a bit by the “rich snippets,” as we saw.
If it turned out that Google or Bing would someday create a special tool for library materials such as we saw with Google Recipes, and it utilized linked data, it’s true that our records would then be lacking the links. No problem. Our records could be updated in exactly the same way as I described before. No big deal. Yet, there is absolutely no evidence that the search engines are going to do anything like that and even if they did, any tool they made could disappear in an instant without notification, just as happened with Google Recipes and so many other of their products.
There is another, even more ominous finding to consider. The rise of the mobile web appears to be reducing the importance of the big search engines. Already, people access the web using their mobile devices more often than they use desktops or notebooks and the trend is for an even greater move into mobile use. There are two other findings of interest: the vast majority of mobile use is with smartphones over tablets (http://bgr.com/2016/11/02/internet-usage-desktop-vs-mobile/) and 90% of mobile use is within apps instead of browsers. (http://flurrymobile.tumblr.com/post/127638842745/seven-years-into-the-mobile-revolution-content-is)
What this means is that more and more people who want information, for instance on some political issue, will use the tiny screens on their smartphones and tap on the CNN app—or the Fox News app or BBC app or Facebook app or whatever their preferred source is—instead of opening the Chrome or Firefox app, navigating to a search engine, and searching for the issue there, where they would see links to all of those sources plus much more. Since mobile apps control everything that takes place inside them and the screen is much smaller, the result is a far different experience from using a desktop with a large screen, opening a web browser, and searching Google or Yahoo for information on the topic.
These are trends that do not show any sign of reversing.
I think it would be just too ironic if libraries would finally get their catalog records into the search engines, only to find that most users had already abandoned them in favor of all kinds of separate apps. The historian part of me would love to consider the parallel with the earlier generation who finally implemented their Integrated Library Management Systems—at great expense by the way—only to find that everybody had turned to web search engines! On the other hand, separate apps may prove to be the greatest boon for libraries. Maybe. The fact is, nobody can possibly know.
Beyond these technical issues, I personally believe there are other major concerns surrounding catalog records in the search engines. These concerns range from the ethical to the practical, but I’ll stick with the major practical one, the question that lots of web masters ask: How can I get my links to be no. 1 in the search results, or at least in the top 10? Otherwise my links may as well not be there at all. Getting catalog records into the search engines is only the very first step, similar to the classic children’s story of The Little Red Hen. For those who haven’t read that story, we learn that if we want to eat a slice of bread, the work does not stop after we find a few wheat seeds and push them into the ground. There’s a whole lot more to do and the real work is only beginning. If we don’t do that work, then somebody else is going to have to do it for us; otherwise there is no way we will be able to eat that slice of bread.
If the purpose is to get the public to actually SEE your records in Google, it takes a lot more work than simply putting your records INTO Google. A lot more. That kind of work has a name: Search Engine Optimization or SEO. For more details, read The Little Red Hen. (http://abralite.concordia.ca/pd/en/story5.pdf)
The other concern about success or failure of BIBFRAME deals with linked data. I have already discussed this at length so I will keep my remarks here short. I will state once again that the idea of linked data is not that it allows YOU to do new and wonderful things with your own data because you already understand your own data; you have complete control over that data right now and can do anything you want with it. Anything. Right now.
OTHERS who may want to use your data don’t have those advantages. Even if you just give it to them, they don’t understand the structures or the underlying meaning of some of your data: is this a date and if it is, is it a date of birth or a date of publication or a date of an event or what? Or is it some other kind of number and if so is it in acres, square kilometers, microns or light years? It makes a difference! The people who own and control the data know all of that already but anyone else has to be told, that is, if the data is to be shared in a coherent way.
This means that saying we need to make our records into linked data so that WE can do all kinds of things we couldn’t do before makes no sense. We have always been able to do whatever we wanted with our own data and it has been our choice not to. Whether that choice was justified or not is beyond the scope of this podcast.
Making our records into linked data will allow OTHERS, who do not understand how our data is structured, to use our data in ways THEY could never have done before. Also, it is important to keep in mind that if people use linked data for some tool they make, they are in no way obligated to provide any linked data to anybody else. So, you can be a linked data consumer without being a linked data provider.
How would that work? Lots of ways. Let’s say that a library wants to include the Google Knowledge Graph in its catalog. The Google Knowledge Graph uses linked data. All you have to do is add the Knowledge Graph Search API (https://developers.google.com/knowledge-graph/), which doesn’t take long to get working, and poof! You’re done and your catalog uses linked data! And you don’t have to provide any linked data to anybody.
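To give a feel for how little is involved, here is a sketch of the request a catalog would send to the Knowledge Graph Search API. The endpoint and parameter names come from Google’s published API; the API key is a placeholder, and actually fetching results requires registering for a real key.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Knowledge Graph Search API.
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def kg_search_url(query, api_key, limit=1):
    """Build a Knowledge Graph Search API request URL.

    The response (fetched with, e.g., urllib.request.urlopen) is JSON-LD
    describing matching entities. "YOUR_API_KEY" below is a placeholder;
    a real key comes from the Google API console.
    """
    params = {"query": query, "key": api_key, "limit": limit, "indent": "true"}
    return KG_ENDPOINT + "?" + urlencode(params)

url = kg_search_url("Mark Twain", "YOUR_API_KEY")
```

A catalog front end could call this for the author or subject on a record page and display whatever entity description comes back, consuming linked data without providing any.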
Of course, the real problems are only beginning. The tough part remains. You need to decide how best to present the Knowledge Graphs to the users of your catalog, and much more important, to determine if the Knowledge Graphs provide your users with the type of information they find useful or if it only gets in their way. Determining those sorts of facts can be very difficult and should not be underestimated.
Here is another possibility that is more complicated: a few teenage designers in someone’s basement build an app that uses linked data from 4 or 5 providers, and they obviously have no data of their own to share. Let us suppose that you are in charge of one of those 4 or 5 linked data providers they use. Let’s also assume the very best scenario: the new app they create that uses your linked data becomes wildly popular and makes the designers fabulously wealthy. You will get none of that. If you don’t like the fact that they are enriching themselves from your labor while you get nothing, you only have a couple of options: one is to ask them to pay you and if they will not, you can shut down your linked data service, or at least try to shut them out of it; the other option I’ll discuss in a moment.
Clearly, linked data is, at base, a deeply altruistic idea. A noble one perhaps, but definitely altruistic. It declares: I will share my data with you and you do not have to give me anything in return. Obviously, to do so is not free for me and it will cost me something since I must buy, set up and maintain, among other things, computers, bandwidth, and software along with the necessary staff. Perhaps it will cost me a lot, but I have faith that in the long run it will be worthwhile for everybody, including me.
The utopian aspects of the linked data idea are obvious. Many business people would undoubtedly leave at this point, laughing at such a naive scenario, and it is difficult to simply discount their skepticism out of hand. The linked data community expects that whatever money they need will just be there, happily supplied by … someone.
I am, of course, talking about linked OPEN data, which is available for free, and not linked CLOSED data, which is available for a price. To make your linked data closed is that second option I mentioned. I haven’t seen the issue of open vs. closed linked data discussed in the library linked data community, but furnishing linked data is very expensive, and it is tough to say how sustainable linked open data is in the long run. To make open data closed, however, and to expect developers to pay for the linked data they consume, changes everything. First, developers had better want your data very badly, in preference to other data sources that are open and cost them nothing. Although you may think your data is “special,” that doesn’t matter much. What is important is that developers believe your data is “special”; otherwise they will not buy it. Linked closed data also requires much more infrastructure, including sales, regular business practices, improved customer services and support, and security, and it creates a plethora of new problems that are far less important or even completely lacking with linked open data.
How open or closed will library linked data be, and if it is open, how will it be paid for and by whom? I have no idea and I have never seen any discussion about this.
Finally, linked data has yet to prove that it is such an advance that it is a worthy goal for libraries to pursue. It remains one of those technologies that is forever promising but has yet to be realized. We have yet to see the “killer app” of linked data. There remain many who are very skeptical of linked data, such as Google, which bought one of the main linked data nodes, Freebase, and soon dumped it after the initial attempt with the Knowledge Graphs. Google now seems to be focusing on improving its data mining and predictive algorithms as we see with their “Featured snippets” or as some are beginning to call them, the “one, true answer.” (http://searchengineland.com/googles-one-true-answer-problem-featured-snippets-270549)
Personally, I think linked data could be useful but for specialized types of projects. For instance, with statistical or geospatial data, it can provide new and very useful views that have never existed before. I can also imagine that linked data could be useful for other specific purposes. For instance, the creators of a specific resource will have a good idea of what its users will want. I am thinking of resources such as The Pompeii Bibliography and Mapping Project (http://digitalhumanities.umass.edu/pbmp/?page_id=1258), which is a database for archaeological excavations that supposedly uses linked data in some way. The specialists who created that project will know the specific resources that an archaeologist interested in a certain block of buildings will want. The creators can then link to those specific linked data tools for specific purposes.
For other, more general purposes, however, such as those found for users of a library catalog, I haven’t seen much that I consider really useful. If someone searches a library catalog for “Pompeii” they could want almost anything: video games, epic movies, novels or scholarly literature. Deciding which linked data sources to display is much more complex than on the specialist site I mentioned before.
The traditional purpose of a library catalog was always to help users find what materials are in a library’s collection, tell them where those materials are located in the library, and to do so as quickly and efficiently as possible. This idea assumed that the users of a library want to spend the vast majority of their time with the materials in the collection and the least possible time with the catalog.
BIBFRAME seems to turn this on its head: it expects people to spend much more time with the catalog, where they will work with the linked data information that has been brought together for them there, and that necessarily leaves less time for the materials in the library’s collection. I suspect that the goals of a tool that wants to bring together all kinds of related information using linked data are fundamentally at odds with a quick finding tool such as a library catalog, and will result in a contradiction of purposes. Even assuming that a linked data tool built by libraries brings in information that is “relevant”—whatever that will mean—and that a significant proportion of users love it, I expect there may still be very high demand for something much, much simpler: a quick and easy finding tool for the physical and digital materials in a library’s collection.
I’ll say it again: I am very much for the BIBFRAME initiative. I always have been, even though it has become far too complex for non-librarians to implement, because I basically agree that MARC format must change into something that can be shared more widely outside the catalog itself. It should have been done decades ago, as the very first of all the changes. If libraries had done so, I think they would have learned a lot by now about how the general public understands—or misunderstands—our data and how they would like to use it. We could have changed what we did based on that input. Unfortunately, we still don’t know much about that. Finally, I fear that librarians will be sadly disappointed after BIBFRAME becomes a reality, because many are expecting great improvements from BIBFRAME and linked data when there is little basis for such expectations.
Still, who knows? Of course I could be wrong. It could happen that some developer in Asia or Africa or South America may hit on some incredible idea that takes our data, turns it into something that is great for everyone, and that actually turns libraries into “the next big thing!” The chances of something like that happening are very small of course, and waiting for it is equivalent to simply waiting for some kind of divine intervention to solve our problems, such as we see at the end of Shakespeare’s play As You Like It. That play has one of my favorite endings, where the characters have gotten themselves into such a crazy and hopeless mess that it takes the god of marriage, Hymen himself, to come down from Heaven and set everybody straight. Small as the chance is that “the god of libraries” will come down to save the day, it is nevertheless undeniable that so long as libraries remain with MARC format, there is no chance at all of even that happening.
That’s why I am for BIBFRAME. Or schema.org, or OAI-PMH, or all of them, or none of them. I am for whatever allows people to more easily interact with our catalog records. Still, I expect no great changes even if and when that happens.
To sum up, if libraries are intent on using BIBFRAME and Linked Data and on getting their records where the public can actually see them, it seems as if they will have no choice except to make some kind of a Yummly for Library Materials, try their best to pitch it to the public, then hope a significant number of people will give it a fair try, and that a significant percentage of those people will actually like and use it. In other words, libraries will have to involve themselves deeply in app development just like everybody else. That kind of development will demand even more time and more funding before anything substantial can reach the public.
The music to end this podcast is from La Folia by Francesco Geminiani, who lived from 1687 to 1762. The melody of this piece is ancient, and many of the greatest composers, including Vivaldi, Handel, Bach and Purcell, wrote variations on it, as Geminiani does here. Geminiani studied under Scarlatti and then came to Rome to study under Corelli. After that he traveled widely, had a successful career, and ended up dying in Dublin—Ireland that is, not Ohio—heartbroken after learning that a musical treatise he had just finished had been stolen by one of his servants who had been deceiving him all along. Incidentally, he must have been quite a performer. His students gave him an interesting nickname: Il Furibondo, or The Wildman, apparently for the way he played the violin. (https://www.youtube.com/watch?v=rnkEnfXv3R8)
That’s it for now. I hope you enjoyed this episode and thank you for listening to Cataloging Matters with Jim Weinheimer, coming to you from Rome, Italy, the most beautiful, and the most romantic city in the world.
Search engines do their work using special robots called web spiders or web crawlers, which automatically find links on a web page, follow those links, and bring the contents of the linked pages back to the search engine, where those pages are added to the search engine’s database and indexed. The web spider then continues to crawl any links it finds in the new pages, brings those back for indexing, and continues on its way, finding more links to crawl.
This method works well with any static page on the web, say an organization’s web site. For catalogs and other databases however, it is more difficult to get to their contents because a search must be performed before the web spider can reach any page. Web spiders do not do searches, so the information held within databases cannot be included in the search engines unless additional work is undertaken.
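The link-following step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, with the network fetch replaced by a hardcoded page so the link-extraction step is visible on its own; the example.org URLs are placeholders:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A static page is easy for a spider: every link sits right in the HTML.
page = """<html><body>
  <a href="https://example.org/about">About</a>
  <a href="https://example.org/contact">Contact</a>
</body></html>"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # both URLs, ready to be added to the crawl queue
```

A real spider would fetch each of these URLs in turn and repeat the process; a database page that only appears after a search form is submitted never enters this loop, which is exactly the problem the next paragraphs describe.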
In concrete terms, a web spider can index all the information in this LC catalog record for “Fantastic beasts and where to find them / Newt Scamander ; special edition with a foreword by Albus Dumbledore.” found at https://catalog.loc.gov/vwebv/holdingsInfo?bibId=12233956
But to get to that page, someone needs to do a search first. Web spiders can’t do that. What the spider needs is that link to the individual record.
This is achieved by creating a sitemap. A sitemap is a collection of the URLs for the items in a database that you want included in a search engine. So, a sitemap for the LC catalog that wanted to include the book “Fantastic beasts” would include the link to the record https://catalog.loc.gov/vwebv/holdingsInfo?bibId=12233956. The sitemap will include all of the links to each record in the database that you want included in the search engine.
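In practice a sitemap is just an XML file listing those URLs, one per `<url>` entry. A minimal sitemap containing only that one LC record might look like this (the `<lastmod>` date is illustrative, not taken from the actual catalog):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://catalog.loc.gov/vwebv/holdingsInfo?bibId=12233956</loc>
    <lastmod>2017-02-23</lastmod>
  </url>
</urlset>
```

A full sitemap for a catalog would simply repeat the `<url>` block once for each record.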
After the sitemap is complete, you put it on a server somewhere and the next step is to go into the search engine’s Webmaster Tools. Each search engine has a similar set of webmaster tools [Google: https://www.google.com/webmasters/, Bing: http://www.bing.com/toolbox/webmaster, Yandex: https://webmaster.yandex.com/]. Once there, you enter the link to your sitemap file along with some other information. This allows the web spider to do its work because it now has separate URLs to follow and will add each of your records into the search engine as it normally would. Webmaster tools also allow you to manage how often your sitemaps are crawled, and they give you additional information such as statistics, backlinks and so on.
There is a standard protocol for sitemaps based on XML, and there are some limitations. For instance, Google allows a maximum of 50,000 URLs and 50 megabytes for a single sitemap. Therefore, if you have more than 50,000 URLs you will need multiple sitemaps: a site with 1 million records would need 20 sitemaps. This is some work, but it can be done and is done all the time.
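Splitting a record list across multiple sitemaps is mechanical. Here is a small sketch, assuming record URLs follow the bibId pattern shown above (the numbers are invented for illustration):

```python
# Google's limit is 50,000 URLs per sitemap file, so a large catalog
# must be split into sitemap-sized chunks, one output file per chunk.
MAX_URLS = 50_000

def chunk_urls(urls, size=MAX_URLS):
    """Yield successive sitemap-sized lists of URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

# Hypothetical record URLs in the style of the LC catalog link above.
urls = [f"https://catalog.loc.gov/vwebv/holdingsInfo?bibId={n}"
        for n in range(1, 1_000_001)]

chunks = list(chunk_urls(urls))
print(len(chunks))  # 1,000,000 records / 50,000 per file = 20 sitemaps
```

The sitemap protocol also defines a sitemap index file, which lists the individual sitemap files in the same way a sitemap lists records, so the 20 files can still be submitted as a single link.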
Adding schema.org to your database pages is a little different. It occurs after the sitemap work is completed and the web spider takes your page to add it to the search engine’s database. This is easier to demonstrate than explain. Worldcat records include schema.org markup. The record for Fantastic Beasts is at http://www.worldcat.org/oclc/952207432. If you go there, right click on your mouse and look at the page source, that is, the actual coding of the page, you will find (eventually, beginning on line 1134) the “Microdata Section” (it’s easiest to search the page for “Microdata Section”) and within that section, you will find codes with “http://schema.org”. I shall excerpt a few pieces here, removing unnecessary coding:
<a href="http://schema.org/alternateName">schema:alternateName</a> "<span property="schema:alternateName">Fantastic beasts and where to find them</span>"
<a href="http://schema.org/author">schema:author</a> <<a href="http://experiment.worldcat.org/entity/work/data/884179#Person/scamander_newt" property="schema:author" resource="http://experiment.worldcat.org/entity/work/data/884179#Person/scamander_newt">http://experiment.worldcat.org/entity/work/data/884179#Person/scamander_newt</a>> <span>Newt Scamander</span>
<a href="http://schema.org/datePublished">schema:datePublished</a> <span property="schema:datePublished">2017</span>
<a href="http://schema.org/description">schema:description</a> <span property="schema:description" xml:lang="en">Offers alphabetically arranged entries detailing the characteristics of such mythical beasts as hippogriffs, blast-ended skrewts, dragons, and unicorns.</span>
This is not a tutorial on schema.org, but a short examination of this code should provide a basic understanding of how schema.org works. The book’s title, given here as “schema:alternateName,” may seem strange to catalogers. Examining the schema for Books at https://schema.org/Book, we find the description for alternateName is “An alias for an item”; the usual property for a title is simply schema:name (“The name of the item”), and Worldcat appears to use alternateName here for a variant form of the title. There is also alternativeHeadline, which is “A secondary title of the CreativeWork,” but that is clearly something else.
Schema.org also allows additional possibilities, as we see how schema:author includes linked data, and the link to a special linked data resource at experiment.worldcat.org.
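It may help to see what such markup looks like when written by hand. Schema.org data can be embedded either as RDFa (the property= attributes in the Worldcat excerpt above) or as Microdata (itemprop= attributes). A minimal hand-written Microdata version of the same facts might look like this; it is an illustration, not the actual Worldcat markup:

```html
<!-- Each itemprop attaches a machine-readable property to visible text. -->
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">Fantastic beasts and where to find them</span>
  by
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Newt Scamander</span>
  </span>,
  published <span itemprop="datePublished">2017</span>.
</div>
```

A human reader sees an ordinary citation; a search engine’s parser sees a Book with an author, a name, and a publication date, which is what makes the interactive results described next possible.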
This is basically how schema.org works. When Google Recipes existed and a page was added for a recipe that included this kind of coding, the Google search engine results page would let the searcher interact with the results based on this coding. In theory, if Google made a similar tool for Book data, searchers would be able to interact with that information in a similar way.
Search Engine Optimization
Finally, I shall give a simple example of one difficulty with Search Engine Optimization (SEO). Let us say that a library wants its record for the same book we discussed before, Fantastic beasts and where to find them, to be findable in Google. A search on Google now for “Fantastic beasts and where to find them / Newt Scamander” brings back (on my machine) the featured snippet for the book from Wikipedia, followed by pages from special Harry Potter fansites, links to Barnes and Noble, Amazon, and Goodreads, reviews in various newspapers, and so on. But after looking at the first 300 results, I found no links to Worldcat, although the Worldcat record is in Google: I can see the record if I search for it specifically, but I don’t know where it actually falls in the general results list. (See: https://www.google.com/search?q=fantastic+beasts+and+where+to+find+them+%2F+newt+scamander+site%3Aworldcat.org)
What does this demonstrate? Here, it shows that any library that wants to put their records into the search engines must assume that their records will come up even lower in the search results than the records from Worldcat. If you want to use the search engines so that members of the public will actually see your records, it is not nearly enough simply to get your records into Google. Much more remains to be done.