Saturday, June 14, 2008

Reading OCR'd Scans

I've been doing a lot of reading recently, a fair amount of it on the
computer because of the move. All my hard copy books, and anything
else I want, are buried under piles of my cruft and other people's
junk. It's a mildly amusing situation. Anyway, most of the scans of
current books I get have not been proofread. Someone scans in the book,
OCRs the whole thing, and dumps the text output online.
Unfortunately, OCR isn't all it's cracked up to be, and sometimes the
recognition is a little off. In the particular book I'm reading now,
"~" swapped for an "s" is a fairly common occurrence, and bad
recognition like "W~Uiam Gib~3on" instead of "William Gibson" happens
occasionally. The neat part is that unless I'm reading word by word,
the typos are irrelevant and I fill in the blanks from context. It
makes me wonder why there hasn't been an adaptation to the OCR
software to try to catch these things: if you know it's a book, and
that it's in English, you can do all sorts of word checking. It would
certainly be a boon to organizations like Project Gutenberg, as well
as Google Book Search.
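
To make the idea concrete, here is a minimal Python sketch of that kind of
word checking. It is not how any real OCR package works. The confusion
table, the tiny vocabulary, and the function names are all assumptions made
up for illustration, based only on the misreadings mentioned above. A real
tool would need a full English word list and a much larger table.

```python
import itertools
import re

# Hypothetical table of OCR misreadings: each suspicious character maps to
# the strings it might really stand for. "~" for "s" is the common case from
# the book; the other entries are guesses taken from "W~Uiam Gib~3on".
CONFUSIONS = {
    "~": ["s", "i"],
    "3": ["", "s"],
    "U": ["ll"],
}


def candidates(token):
    """Yield every spelling reachable by swapping suspicious characters."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in token]
    for combo in itertools.product(*options):
        yield "".join(combo)


def correct_token(token, vocabulary):
    """Return a dictionary word for the token if one can be found."""
    if token.lower() in vocabulary:
        return token  # already a known word, leave it alone
    for candidate in candidates(token):
        if candidate and candidate.lower() in vocabulary:
            return candidate
    return token  # no known word found, keep the OCR output as-is


def correct_text(text, vocabulary):
    """Check each word-like run of characters against the vocabulary."""
    return re.sub(
        r"[A-Za-z0-9~]+",
        lambda match: correct_token(match.group(0), vocabulary),
        text,
    )


# Toy vocabulary standing in for a real English word list.
vocabulary = {"william", "gibson", "wrote", "books"}
print(correct_text("W~Uiam Gib~3on wrote book~", vocabulary))
```

The sketch only changes a word when the original is not in the vocabulary
and a substituted spelling is, so correctly recognized text passes through
untouched. Trying every combination gets slow for long tokens with many
suspicious characters, which is one reason this is a sketch and not a tool.
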
Robert Alverson


