Book Search Category Archive
Google has announced that it is acquiring reCAPTCHA. reCAPTCHA is a clever use of scanned, poorly OCRed text as a CAPTCHA that prevents bots from spamming forms and at the same time helps improve OCR (Optical Character Recognition, the process of taking a scanned image of a page of text and converting it into searchable text).
I had always liked the idea of reCAPTCHA, especially since it was reputedly helping the Internet Archive with its scanning of books, which (unlike Google Books) it makes open to everyone and which focuses on clearly out-of-copyright works.
However, with the Google announcement, I saw very little mention of how this might impact the Internet Archive. I assumed that Google would switch the reCAPTCHA underlying data from the Internet Archive to the Google Books project (which is not open; it remains to be seen how willing, if at all, Google will be to let other search engines use the searchable data from all its scanning).
Then I was even more surprised to read at reddit that the Internet Archive had never received any correction data from reCAPTCHA. "I don't expect to get any data from the reCaptcha project, since we've asked several times and received no response."
Just another example of a great-sounding project that failed to deliver the results it implied. I'm sure Google will make sure to have it help their scanning and OCR projects, but I, for one, am no longer interested in using it.
I am very disappointed to see that Amazon seems to no longer have the Search Inside The Book feature. Clued into this loss by Marydee's post yesterday, I can no longer even get Marydee's workarounds of going to the Canadian or British versions of Amazon to work. So what's missing?
- No more Search Inside the Book arrows on certain cover shots
- None of the text statistics and ability to search within a book on those books
- A search on a unique term like tilinca which only occurs in the contents of books now gets no results.
- No Look Inside excerpts or any view of the contents of the book
I certainly hope that this is a temporary issue and not a permanent closing of the program. Some of the initial cover shots still have the Search Inside graphic on top as in the following example. But clicking on the link takes me to a page with no ability to view any of the contents.
Searching via the nearly defunct A9, I was able to get a list of results for books including my search term, but I still could not get inside the books at Amazon. If this continues, it would be a major loss of searchable content in books!
[Update: Just before clicking post, I am starting to see the Search Inside links again. Phew! I hope this was just a temporary glitch and not a major change in policy!]
I've been meaning to add a page to help keep track of some of the book search engines -- those that search the full text of digitized books, not book store searches -- and now I have finally added a book searching page. You can also view my presentation on book searching that I gave last week at the Computers in Libraries 2007 conference in DC. I plan on expanding the analysis and trying some more in-depth comparisons of these tools in the future.
Last week, Bowker announced an agreement with Microsoft that its Global Books In Print database will be used for "basic and value-added data that will enhance descriptions of books incorporated in the new Live Search Books." Considering that Live Books are primarily out-of-print, out-of-copyright books and that Global Books In Print covers, surprise, in-print books, it would be interesting to know how many matches between the two are found. I have yet to see any examples. Today, Google announced the addition of geographic data to its books. Books are analyzed for place names, and a Google Map with a list of names and text snippets appears on some books' "About this book" page. It includes some snippet, limited preview, and full text books. According to Google,
When our automatic techniques determine that there are a good number of quality locations from a book to show you, you'll find a map on the "About this book" page.
The only way to find out if a particular book has been so analyzed is to look at that book's "About this book" page.
I'm somewhat surprised that I've not heard more librarians complaining about this. I had not really considered all the ramifications of it. Philipp Lenssen on Google Blogoscoped posts about Freeing Google Books. Basically, he notes that Google scans public domain books available from libraries and then appears to add further restrictions for those books, including restricting commercial republication and the removal of the "digitized by Google" mark. Since that bothered him, he has pulled 100 titles from Google Books and "set them free" on his own Authorama Public Domain Books site (with the "digitized by Google" mark removed).
Walt Crawford gives an extensive overview in his Open Content Alliance (OCA) and Google Book Search update. It is an excellent summary of other comments and discussions over the past several months. Of course, as soon as he publishes this, stating "The Internet Archive includes 35,000 books scanned as part of OCA (as of early December)," the OCA adds a whole new collection and announces passing the 100,000 volume mark.
Wednesday, I came across an AP article about book searching, "Google Book-Scanning Efforts Spark Debate." It mentions a "$1 million grant to the Internet Archive, a leader in the Open Content Alliance, to help pay for digital copies of collections owned by the Boston Public Library, the Getty Research Institute, the Metropolitan Museum of Art." The article also discusses concerns with Google's project. As interesting as this article is, it led to an even more fascinating (to me, at least) comparison of news sources and news search engines.
Last Tuesday, Microsoft announced the launch of Live Search Books. Consisting of copyright-free works, Live Search Books launches with a considerable collection. I have a more in-depth review in Information Today's NewsBreaks. In working on this story, I came across an interesting example of how some books may only be found at one service or the other.
You will only notice this once you click on a result, and in particular on a Full View or Limited Preview result. The Snippet view has changed a bit with the addition of 'Key words and phrases' at the top and a 'Contents' section and some other additional information depending on the book record.
But take a look at a Full View or Limited Preview record to see significantly more differences. The left frame lets the reader scroll down from one page to the next without clicking on a next page link. Google has finally added a zoom option. Many records also have a list of 'Related Books.'
Mick O'Leary has an excellent overview, "Google Book Search Has Far to Go," for his Nov. column in Information Today. In particular, he compares Google Book Search to Amazon's Search Inside the Book and notes that
. . . Amazon's feature has several critical advantages over Book Search. The most important is that Amazon has the latest books; Book Search does not. Perhaps because of differing licenses with the publishers, Book Search is often several years behind; Amazon has the latest releases and also lists forthcoming titles. For example, Amazon's feature has the latest books by Pat Buchanan, James Lee Burke, Ann Coulter, Jeffrey Deaver, Tom Friedman, and Robert B. Parker; Book Search does not (and is usually two or three books behind with these popular authors). This seriously devalues Book Search as a tool for finding, buying, or researching books.
Since the announcement of the Open Content Alliance (OCA) back in Oct. 2005, I have been waiting to see the results of the project and view some of the scanned books. While the major search engine partners like Yahoo! and Microsoft have not yet launched any product, I noticed that the Internet Archive does have quite a few books already available in their
Both Live and Cornell have announced that Cornell University is joining the Live Book Search project. As with such announcements from Google Book Search, that means it will be a while (perhaps a year or more) before any books from that library are available online. For that matter, Live Book Search is not yet available, although it is supposed to go live (no pun intended) later this year.
FreeTechBooks.com is a nicely organized directory of freely available programming and computer science books and lecture notes. The free books listed are usually available directly on the publisher's Web site. The classifications on the left side include books in Computer Science, Operating Systems, Programming and Scripting, and Related Fields. A search box will search titles and descriptions of the books but not their full text. Based on the numbers per category, it looks like it includes about 100-200 titles. While some of the titles are available via sources like The Online Books Page, others are not included there.
Yesterday, A9 launched a major redesign of its site, accompanied by a major loss of features. The A9 announcement notes that they have "redesigned the A9.com website to make it easier and quicker to discover information from more sources." It has more of a Web 2.0 look and feel, and I think they have achieved a more usable site. The databases are grouped together in the left column and are customizable. Searchers can build their own groups from the more than 400 source databases. Each column (one per database) now features continuous scrolling (like the beta of Live Search used to offer and the Live Image search database still does).
The Google Blog, in a post entitled "Find the wealth in your library," talks about the expansion of links to national library union catalogs at Google Books. More than 15 union catalogs are included, not just Open WorldCat. It is not always easy to connect to each of these union catalogs, and I still find plenty of records without a "Find this book in a library" link, even when the books are listed in WorldCat. Gary makes some pointed comments as well.
Even more interesting is the report of the availability of some of these scanned books from within the University of Michigan's online catalog, MIRLYN. Some of the government publications which Google only shows in snippet view are available in full text via Michigan. The problem is finding them. Try going to MIRLYN, click on the Advanced Search link near the top, change the Format limit to "electronic resources," and then you might find one. However, this does not limit the results to the Google-scanned books. Try adding "Michigan Digitization Project" as a "Words Anywhere" search and look for records with links both to "Google Online" and "U-M Online." The latter gives the Michigan version.
Search Engine Watch reports on a People's Daily article about Google's plans to launch a book search service in China. Meanwhile, Baidu has announced a plan to integrate book catalog records into its search engine, reports Shanghai Daily. "Baidu.com, has signed an agreement with the mainland's top libraries to include their catalogues on its search site, making it the largest Chinese books database in the world." Perhaps, but it is not a full text database if it only includes library catalog records.
Seen intermittently earlier in March, Amazon now has Statistically Improbable Phrases or SIPs displayed in its book records when they also have Search Inside the Book access to the full text.
In keeping with a sudden frenzy of new initiatives, Google is now starting to include records and extracts from published books along with a few connections into library holdings information. These two initiatives are currently separate from each other, and since they are both experimental, they may change or stop appearing at any time. Neither one tends to show up in search results very often, but here are a few links to see what they look like.
First, the Google Print inside the book content, which is not as useful as the Amazon Search Inside the Book since it only includes extracts. Note that the text actually resides on Google's servers. See the Google Print FAQ for more information.
Second, Google has some links to OCLC's Open WorldCat pilot project. It took a while to find a search that would retrieve one, but Maureen Whitebrook toads seems to work. After finding such a record, the user will need to enter a ZIP code to have it identify local libraries holding the particular book.
While both of these book-related efforts seem like good ideas at first glance, they may well just confuse the sense of what Google is indexing. Most library patrons are still better served by checking directly with their library's own catalog. And book buyers are likely better served at Amazon or another book retailer.