Lossy compression vs. archiving and OCR (was Re: Many things)

Jim Battle frustum at pacbell.net
Mon Jan 31 17:48:25 CST 2005


Eric Smith wrote:

> Jim Battle wrote:
> 
>>When you
>>scan to bilevel, exactly where an edge crosses the threshold depends
>>on the exact placement of the page, on what the scanner's threshold
>>is, and probably on what phase the 60 Hz AC is in, since it couples
>>to some degree into the lamp brightness (hopefully not much at all,
>>but if you are splitting hairs...).  Thus there is no "perfect" scan.
> 
> Never claimed there was.  But I don't want software to DELIBERATELY
> muck about with the image, replacing one glyph with another.  That's
> potentially MUCH WORSE than any effect you're going to get from the
> page being shifted or skewed a tiny amount.

"potentially" is the key word.  if the encoding software is crappy, then 
they such a substitution could turn all "e"s into "x"s.  sure.  but the 
djvu encoder doesn't make gross substititutions like that.
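
To be concrete about what a "gross" substitution would mean: a
symbol-matching encoder of this kind only reuses a dictionary glyph
when it differs from the freshly scanned glyph by a few pixels, and
falls back to encoding the glyph verbatim otherwise.  A toy sketch of
that matching idea in Python (this illustrates the principle, not
LizardTech's actual JB2 code; the 5% pixel budget and the bitmap
representation are invented for the example):

    def match_glyph(glyph, dictionary, max_diff_frac=0.05):
        """Return a dictionary glyph close enough to stand in for
        'glyph', or None.  Glyphs are equal-sized 2-D lists of 0/1
        pixels; alignment and padding are omitted here."""
        h, w = len(glyph), len(glyph[0])
        budget = int(h * w * max_diff_frac)  # tolerated differing pixels
        for cand in dictionary:
            if len(cand) != h or len(cand[0]) != w:
                continue
            diff = sum(a != b
                       for grow, crow in zip(glyph, cand)
                       for a, b in zip(grow, crow))
            if diff <= budget:
                return cand    # close match: share one shape
        return None            # no match: store this glyph verbatim

At 300 DPI an "e" and an "x" differ in far more than 5% of their
pixels, so a sane budget can't confuse one for the other.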

Contrary to what you say, skew has a much larger effect on the
sampling than DjVu's encoder does, and so does your choice of
scanner.
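
You can see the scale of those effects with a toy example (the sample
values and the 3% dimming figure below are invented for illustration):

    def binarize(row, threshold=128):
        """Threshold one scan line of 8-bit gray samples; 1 = ink."""
        return [1 if v < threshold else 0 for v in row]

    # gray samples across the edge of a glyph
    edge = [200, 160, 130, 100, 60, 20]
    print(binarize(edge))                    # [0, 0, 0, 1, 1, 1]

    # the lamp dims a few percent, or the page shifts a fraction of
    # a pixel so the resampled values drop slightly:
    print(binarize([v - 8 for v in edge]))   # [0, 0, 1, 1, 1, 1]

The sample sitting near the threshold flips, so two "lossless" bilevel
scans of the same page won't even agree with each other at the edges.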

...
> I normally scan at 300 or 400 DPI; when there is very tiny text I
> sometimes use 600 DPI.
> 
> Even at those resolutions, it can be difficult to tell some characters
> apart, especially from poor-quality originals.  But usually I can do
> it if I study the scanned page very closely.  No, OCR today cannot do
> as good a job at that as I can.  Someday OCR may be better.  But
> arbitrarily replacing the glyphs with other ones the software considers
> "good enough" is going to f*&# up any possibility of doing this by
> either a human OR OCR.

Eric, in picking a case where the DjVu algorithm *might* cause
problems, you must also concede that scanning that page in bilevel,
even losslessly, is a bad choice.  If the original is that poor, you
should be scanning it in grayscale.

Why be religious about losslessness, claiming anything less is going
to "f*&#" up your efforts, when by scanning bilevel you've already
tossed away the bulk of the information?

> And all to make the file a little smaller.  DVD-R costs about $0.25
> to store 4.7GB of data, so I just can't get excited about using lossy
> encoding for text and line art pages that usually don't encode with
> lossless G4 to more than 50K bytes per page.

"A little" can be 3x.  For distribution, it is a big deal.  Until 
recently, it made a signficant difference on disk price too, but now 
that you can get 120 GB hard drives in a box of cereal, that isn't so 
much of a concern.
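
Putting rough numbers on it (a back-of-the-envelope sketch using your
~50K/page G4 figure and the ~3x ratio above):

    dvd_bytes = 4.7e9            # one DVD-R
    g4_page   = 50e3             # ~50K/page lossless G4
    djvu_page = g4_page / 3      # assuming the ~3x ratio holds

    print(round(dvd_bytes / g4_page))    # ~94,000 pages per disc
    print(round(dvd_bytes / djvu_page))  # ~282,000 pages per disc

Either way a disc swallows an enormous number of pages; the 3x only
bites when somebody has to download the file.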

Of course you can use whatever format you want for your archiving,
but making it available in a more accessible format means that more
people are likely to take advantage of it.

For most documents, it is the information that I care about
preserving, not the pixels.  I would be overjoyed if Adobe would buy
out LizardTech and adopt some of their technology, even the lossy
bits.




