Many things

Jim Battle frustum at pacbell.net
Mon Jan 31 20:27:14 CST 2005


Antonio Carlini wrote:

>Although it doesn't really know what text is, per se, one of its
>>algorithms is to find glyph-like things.  Once it has all glyph-like
>>things isolated on a page, it compares them all to each other, and if
>>two glyphs are similar enough, it will just represent them both (or N
>>of them) with one compressed glyph image.
> 
> 
> That looks like information loss to me.

yes, it is information loss.  scanning bilevel is a much worse 
information loss.  scanning at 300 dpi, or 600 dpi, or 1000 dpi is 
information loss.  viewing the document on a CRT is information loss.

> If one of those glyph-like
> things was not the same symbol as the others, then the algorithm
> has just introduced an error.

yes, you are right, *if*.  And that is where you are wrong to assume it 
is likely to make a difference.

>So for OCR purposes, I don't think this type of compression really
>>hurts -- it replaces one plausible "e" image with another one.
> 
> 
> But one of them might have been something other than an "e".

Antonio --

yes, if you assume that the encoder is going to make gross errors, then 
it is a bad program and it shouldn't be used.  but have you ever used 
it?  it doesn't do anything of the sort.

imagine a page with 2000 characters, all of one font and one point size, 
and that 150 of them are the letter "e".  In a tiff image, there will be 
150 copies of that e, all very slightly different.  In the djvu version, 
the number of unique 'e's will depend on the scanned image, but it isn't 
going to replace them all with a single 'e' -- there might be 50 'e's 
instead of 150.  Think about that -- to the naked eye, all 150 look 
identical unless you blow up the image with a magnifying tool.  djvu is 
still being selective enough about what matches and what doesn't that it 
still has 50 copies of the 'e' after it has collapsed ones that are 
similar enough.  It isn't very aggressive at all about coalescing glyphs. 
As far as I know there is a bound on how small a glyph it will try 
to group, so that for really small point sizes, nothing bad happens at 
all.  The differences it allows are truly inconsequential.
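The idea above can be sketched in a few lines of Python.  This is a toy 
illustration, not DjVu's actual JB2 matcher (the threshold, the bitmaps, 
and the greedy pass are all my own simplifications): each scanned glyph 
is compared against the representatives found so far, and a 
representative is reused only when the pixel difference is below a 
strict bound -- so near-identical 'e's collapse, while a 'c' stays 
distinct.

```python
# Toy glyph-coalescing sketch (NOT the real JB2 algorithm).
# Glyph bitmaps are flat tuples of 0/1 pixels, all the same size.

def hamming(a, b):
    """Count differing pixels between two equal-sized bitmaps."""
    return sum(p != q for p, q in zip(a, b))

def coalesce(glyphs, max_diff=1):
    """Greedily map each glyph to the first representative within
    max_diff differing pixels; otherwise it becomes a new rep."""
    reps = []        # unique representative bitmaps
    assignment = []  # index into reps for each input glyph
    for g in glyphs:
        for i, r in enumerate(reps):
            if len(g) == len(r) and hamming(g, r) <= max_diff:
                assignment.append(i)
                break
        else:
            reps.append(g)
            assignment.append(len(reps) - 1)
    return reps, assignment

# Two scans of an 'e' that differ by a single pixel, plus a 'c' that
# differs in several pixels -- the 'c' is kept as its own glyph.
e1 = (0,1,1,0, 1,0,0,1, 1,1,1,1, 1,0,0,0, 0,1,1,1)
e2 = (0,1,1,0, 1,0,0,1, 1,1,1,1, 1,0,0,0, 0,1,1,0)  # one pixel off
c1 = (0,1,1,1, 1,0,0,0, 1,0,0,0, 1,0,0,0, 0,1,1,1)

reps, assignment = coalesce([e1, e2, c1])
print(len(reps))   # 2 representatives: one 'e', one 'c'
print(assignment)  # [0, 0, 1]
```

With a tight max_diff, the encoder errs on the side of keeping extra 
copies -- which is exactly the conservative behavior described above: 50 
'e's instead of 150, rather than one 'e' for everything round.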

It is like complaining that mp3 (or insert your favorite encoder here) 
sucks because in theory it can do a poor job of it.  In fact, ones that 
do a poor job get left behind and the ones that do a good job get used.





More information about the cctalk mailing list