Message to Autocat
On Fri, 5 Mar 2010 09:15:40 +1300, Anne wrote:
>Do you know if this is how OCLC's "Extract metadata" function is supposed to work?
>It seemed to work well in a recent OCLC webinar demonstration extracting from a PDF to populate a MARC record.
>But I haven't been able to produce anything useable myself.
I don't know how OCLC's system works, but I did something similar with PDF and other documents a few years ago. A PDF does not use XML (that I am aware of), but almost every document has something called "Properties." If you open any PDF file, there should be an item under a menu somewhere that says "Properties."
If you open this document http://www.ideals.illinois.edu/bitstream/handle/2142/3957/gslisoccasionalpv00000i00164.pdf?sequence=1, you can see the properties. In my version of the PDF viewer, I have to put the pointer on the text and right-click. Then I can see the Document Properties, which show all kinds of information: what program produced the file, security settings, fonts, and so on. In this case, someone has even entered the author's name in library-correct fashion. There are fields for other kinds of information as well. As you see, "Properties" is the local flavor of metadata. Other formats, such as PNG image files, allow this kind of metadata too.
Anyway, this is the information that I have dug out of a PDF file, and I suspect that OCLC is doing something similar. I have done similar things with doc files. I've always thought this could be used much more powerfully than it is now.
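To make "digging out" the properties a little more concrete, here is a rough sketch in Python. It is only an illustration, not how OCLC does it (I don't know that). It assumes the PDF stores its Properties as plain, uncompressed text entries such as /Author (...), which many older files do; newer files that compress or encode these entries need a real PDF library. The function name read_pdf_properties and the sample bytes are made up for the example.

```python
import re

# Matches entries like "/Author (Smith, Jane)" in an uncompressed Info dictionary.
# Assumption: values are simple literal strings with no nested or escaped parentheses.
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer)\s*\(([^()\\]*)\)"
)


def read_pdf_properties(data: bytes) -> dict:
    """Return the first value found for each common Properties key."""
    props = {}
    for key, value in INFO_ENTRY.findall(data):
        # Keep the first occurrence of each key; later duplicates are ignored.
        props.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return props


# Hand-written stand-in for the bytes of a real file (hypothetical values).
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Title (Occasional Papers) /Author (Smith, Jane) "
    b"/Producer (Example Distiller) >>\nendobj\n"
)

for name, value in read_pdf_properties(sample).items():
    print(name, "=", value)
```

With a real file you would pass in open(path, "rb").read() instead of the sample bytes.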
But returning to XML, perhaps the simplest way to explain what I mean is to see it in action with RSS feeds. Here is the "Top News" feed from Reuters: http://feeds.reuters.com/reuters/topNews. If you look at the page source, probably by right-clicking and selecting "View Page Source" or something like that, you will see the native XML, which looks absolutely horrible. But ignore that for a moment and search the page for <title> a few times, and you will see the title of the first story in the RSS list. For me right now, it is "House OKs $15 billion jobs bill."
This title is taken directly from the article itself, since the Reuters website uses XML. Nobody builds the RSS feed by copy and paste; it is built automatically from the website itself. If a reporter changed the title in the story, the change would be reflected automatically in the RSS feed.
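Searching the page source for <title> by hand is exactly what a program does automatically. Here is a minimal sketch in Python using its standard XML parser. The feed text below is a small hand-written sample in RSS 2.0 layout, not the live Reuters feed, and the links and the second story are placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in RSS 2.0 layout; links and second item are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Top News</title>
    <item>
      <title>House OKs $15 billion jobs bill</title>
      <link>http://example.com/story1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/story2</link>
    </item>
  </channel>
</rss>"""


def item_titles(feed_xml):
    """Return the <title> of every <item> in the feed, in document order."""
    root = ET.fromstring(feed_xml)
    # The channel has its own <title>; only the ones inside <item> are stories.
    return [item.findtext("title", default="") for item in root.iter("item")]


for title in item_titles(SAMPLE_FEED):
    print(title)
```

Note that the feed's own <title> ("Top News") is skipped, which is why you have to search "a few times" by hand before reaching the first story.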
The same thing could work in all sorts of other ways, including dynamically putting the information into our catalog records. Instead of
<title>
it could just as easily be
<datafield tag="245" ind1="1" ind2="0">
(See why transcription could even improve over what we do today?)
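As a sketch of what that swap might look like, the few lines of Python below wrap a title in a MARCXML-style 245 field. The function name title_to_245 is made up for the example, the whole title is simply put in subfield a, and the MARCXML namespace is left out to keep it short, so treat it as an illustration rather than a finished converter.

```python
import xml.etree.ElementTree as ET


def title_to_245(title, ind1="1", ind2="0"):
    """Build a <datafield tag="245"> element holding the title in subfield a."""
    field = ET.Element("datafield", {"tag": "245", "ind1": ind1, "ind2": ind2})
    subfield = ET.SubElement(field, "subfield", {"code": "a"})
    subfield.text = title  # copied as-is, so it always matches the source
    return field


field = title_to_245("House OKs $15 billion jobs bill")
print(ET.tostring(field, encoding="unicode"))
```

Because the title is copied by the program and not retyped, whatever appears in the source is what lands in the record.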
But it doesn't have to stop here. Applications can be built, and are being built now, that take information from all kinds of different sites (reviews, Amazon, OCLC, etc.) to create something entirely new.