Speed reader

Two days ago, a colleague forwarded an article from the New Yorker that I’ve been talking about ever since: Google has plans to scan every book ever published.

They’ve already started.

Every weekday, a truck pulls up to the Cecil H. Green Library, on the campus of Stanford University, and collects at least a thousand books, which are taken to an undisclosed location and scanned, page by page, into an enormous database being created by Google.

Google scans tens of thousands of books a week from various libraries, and dumps all that text into a fully searchable database.

It’s already online: books.google.com

[Image: Google Books sample page]

Of course, only a smattering of the total content is there right now. How big is the project?

No one really knows how many books there are. The most volumes listed in any catalogue is thirty-two million, the number in WorldCat, a database of titles from more than twenty-five thousand libraries around the world. Google aims to scan at least that many. “We think that we can do it all inside of ten years,” Marissa Mayer, a vice-president at Google who is in charge of the books project, said recently, at the company’s headquarters, in Mountain View, California. “It’s mind-boggling to me, how close it is. I think of Google Books as our moon shot.”

Yowzer!

I’ve always been a fan of public domain efforts like Project Gutenberg. That effort has been running since 1991, when they entered two books a month, typed by hand (it now has 20,000 books available for free download, all scanned by volunteers).

[Image: Kirtas book scanner]

And today, by coincidence, I saw the mother of all scanners.

Google is using its own “custom-made scanning equipment,” but I imagine it’s similar to the mechanical marvel that I spotted at the Ontario Library Association’s Superconference (librarians really dig the CBC Digital Archives, so we do our dog and pony show there each year).

The particular scanner shown here is the Kirtas APT BookScan 2400 – so named because it can scan a purported 2,400 pages an hour.

[Image: animation of book scanner]

It’s quite a sight to behold (at least at a librarians’ conference it is… then again, it was the only thing in the building that moved or made noise – and nobody shushed).

As you can see in this image from their site, a vacuum-equipped arm turns the pages every couple of seconds. The rig I saw had a pair of 16-megapixel Nikon digital SLRs mounted up top to snap the pictures (cheaper versions have one camera and a mirror to capture both sides). The images are fed to a pair of CPUs for processing, OCR, etc. The unit I saw ran about $200,000 – a lot of money for a librarian, but peanuts for a Google.

Another cool feature – the guys at the booth showed me an ancient children’s book that was falling apart. They ran it through the scanner, sent the resulting file to one of those publish-on-demand book machines, and got back a brand new copy of the book with no noticeable decline in quality.

I picture the two linked machines replicating food, like on Star Trek (or maybe the machines from The Fly…)

[Image: LongPen]

The OLA conference had something like that, too (but less sinister): a booth where people could chat with an author located in another city, and get their books signed remotely. The machine, called LongPen, was developed by Margaret Atwood’s company Unotchit (“you no touch it!”).

There was a long lineup today of people wanting to interact remotely with an author in Montreal.

Ironically, yesterday Margaret Atwood was at the conference in person. When I saw her, there were perhaps three people lined up to speak to her.

I may have glimpsed the future of the book industry: a machine that writes, and a machine that reads. (What would happen if we plugged Unotchit into BookScan?!?)

I just hope actual people are still required somewhere…

Posted by: Paul Gorbould | 02-02-2007 | 05:02 PM
Posted in: Teh Internets