TMG-L Archives



From: "John Cardinal" <>
Subject: Re: [TMG] A Collection of Examples
Date: Thu, 10 Feb 2011 11:06:11 -0500
In-Reply-To: <2011210104512.070418@Terry>


Terry Reigel wrote, in part:
> Strangely enough, it found Wiemann, but neither Fenker nor
> Gapsch are found, though they appear quite clearly in several
> directories in the collection.

Terry,

In the words of the poet C. Berry, "it goes to show you never can tell."

> I suspect they may use a
> spell-checker of some sort to improve the results of the OCR
> process and it dismisses those rare names. But it does find
> Glabe, but that surname isn't quite so rare.

Most OCR programs use dictionaries to try to improve the results, but in my
experience they do not exclude words that fail the spellchecker. They will
attempt to convert an unrecognized word to a recognized word if there is a
close match, but they leave the unrecognized word as-is if there is not high
confidence in the correction.
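
As a rough illustration of that "correct only when confident" behavior, here
is a minimal Python sketch. It is not how any particular OCR product works:
the tiny dictionary, the correct_word name, and the 0.85 cutoff are all
assumptions made for the example, and the standard library's difflib stands
in for whatever matching a real engine uses.

```python
import difflib

# Hypothetical toy dictionary; a real OCR engine would use a far larger one.
DICTIONARY = ["directory", "collection", "surname", "genealogy"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.85):
    """Return a close dictionary match if one exists, else the word unchanged."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    # get_close_matches returns at most n candidates scoring >= cutoff.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    # No confident match: keep the unrecognized word rather than discard it.
    return matches[0] if matches else word


print(correct_word("directcry"))  # close to a dictionary word, so it is corrected
print(correct_word("Fenker"))     # rare surname with no close match, left alone
```

With this approach a rare surname is kept in the text rather than dropped,
because nothing in the dictionary is close enough to replace it.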

Way back when, I was involved in writing some software that processed OCR'd
text, and our spellcheck was optimized for the sorts of character recognition
errors I described earlier rather than for the more typical spelling correction
where software tries to improve the writing skill of a human author. If a
word failed the spellcheck, our OCR software would attempt to correct for a
series of possible recognition errors. If the word contained two successive
Ns, it would change the two Ns to a single M and retry the spellcheck, for
example. The confidence factor was computed based on the types of
corrections made and the number of corrections relative to the length of the
word. The state of the art has advanced considerably since I worked with
OCR, and I suspect the methods have changed, but I bet they still have
spellcheck software that is optimized for words with possible character
recognition errors.
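
The approach described above can be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not the original software:
the confusion table, the dictionary, the ocr_correct name, the confidence
formula, and the 0.6 threshold are all hypothetical, and for brevity the
sketch tries only one substitution at a time rather than a series of them.

```python
# Hypothetical confusion table: (text as misread by OCR, likely intended text).
CONFUSIONS = [("nn", "m"), ("rn", "m"), ("vv", "w")]

# Hypothetical toy dictionary standing in for a real word list.
DICTIONARY = {"summer", "family", "william"}


def ocr_correct(word, dictionary=DICTIONARY, min_confidence=0.6):
    """Try single OCR-style substitutions; return (result, confidence)."""
    lowered = word.lower()
    if lowered in dictionary:
        return word, 1.0
    best = (word, 0.0)
    for wrong, right in CONFUSIONS:
        start = lowered.find(wrong)
        while start != -1:
            candidate = lowered[:start] + right + lowered[start + len(wrong):]
            if candidate in dictionary:
                # Assumed formula: confidence drops as the replaced characters
                # make up a larger share of the word.
                confidence = 1.0 - len(wrong) / len(lowered)
                if confidence > best[1]:
                    best = (candidate, confidence)
            start = lowered.find(wrong, start + 1)
    # Below the threshold, leave the unrecognized word as it was read.
    return best if best[1] >= min_confidence else (word, 0.0)


print(ocr_correct("sunnmer"))  # "nn" replaced by "m" gives a dictionary word
print(ocr_correct("Fenker"))   # no substitution helps, so the word is kept
```

A word that no substitution can rescue, such as a rare surname, comes back
unchanged with a confidence of zero.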

Eventually, genealogy gets us involved in just about every area of
technology, eh?

John


