Posting to NGC4LIB
There are a few projects dealing with this [i.e. data mining for new types of resource discovery]. First, there is simply Google, which has the option in the left-hand menu of plotting any search to a timeline, e.g. search for “wisdom”: http://www.google.com/search?hl=en&hs=7ZI&tbo=1&tbs=tl%3A1&q=wisdom&aq=f&aqi=g10&aql=&oq=&gs_rfai=
How this is generated, I have absolutely no idea. Just glancing at it, though, it looks as if the word “wisdom” was widely used around 200 BC, stopped being used at 0 AD until about 50 AD, went through sporadic use until around 900 AD, when it became popular again, and then, with the rise of printing, rose more or less steadily.
Does anybody really believe that?!
There is also the Corpus of Historical American English (COHA) at http://corpus.byu.edu/coha/, which has many more controls. They have an interesting comparison with Google’s Ngram tool at: http://corpus.byu.edu/coha/compare-culturomics.asp.
And of course, there are the notable OCR problems, discovered and blogged simultaneously by many people (including myself!) who apparently think alike. http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181 is one example.
I mentioned my own amazement at finding this “specific word” in the book “The Act of Tonnage and Poundage, and Rates of Merchandize” from 1702. The exact usage is at http://books.google.com/books?id=Zjk7AAAAcAAJ&pg=PA201&dq=%22fuck%22&hl=en&ei=ilALTbPpIo72sgb8h63jDA&sa=X&oi=book_result&ct=result&resnum=3&ved=0CC8Q6AEwAjgK#v=onepage&q=%22fuck%22&f=false and the sentence actually reads:
“Every Merchant making an Entry of Goods, either Inwards or Outwards shall be dispatched in such Order as he cometh;…” Google’s OCR misread the old spelling of “such”, which is printed with a long s (ſ). So, not only did it mistake the long s for an f, it also misread the h as a k.
The poor author must be spinning in his grave! It appears that Google’s OCR tool is more similar to many human beings than I had suspected: both have filthy minds! 🙂
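For anyone curious how such a misreading might be caught after the fact, here is a minimal sketch in Python. It is purely illustrative and says nothing about how Google's OCR works: the confusion table (f for long s, k for h) is taken from the single example above, and the names `CONFUSIONS`, `KNOWN_WORDS`, `candidate_readings` and `plausible_corrections` are my own hypothetical inventions. The idea is to undo each suspected character confusion and see whether the result is an ordinary word.

```python
# Hypothetical confusion table: character the OCR produced -> character
# that may actually be on the page. Only the two confusions from the
# example above are included.
CONFUSIONS = {"f": "s", "k": "h"}

# Tiny stand-in vocabulary (an assumption); a real check would need a
# word list suited to early 18th-century English.
KNOWN_WORDS = {"such", "order", "merchant", "goods"}


def candidate_readings(token):
    """Return every spelling reachable by undoing zero or more confusions."""
    readings = {""}
    for ch in token.lower():
        options = {ch}
        if ch in CONFUSIONS:
            options.add(CONFUSIONS[ch])
        # Extend every partial spelling with every option for this character.
        readings = {prefix + opt for prefix in readings for opt in options}
    return readings


def plausible_corrections(token):
    """Known words that the OCR token could be a misreading of."""
    return sorted((candidate_readings(token) & KNOWN_WORDS) - {token.lower()})


print(plausible_corrections("fuck"))  # prints ['such']
```

A check like this can only suggest candidates; deciding which reading is right would still take a look at the page image.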
Of course, this is far from the only OCR problem. To be fair, this sort of “data mining” is in its very earliest stages, so it is easy to point out problems. It will take time, plus trial and error, to discover if these techniques lead to anything of value.
We are in a time of experimentation.