Compute magazine (retry)

Jim Battle frustum at pacbell.net
Wed Oct 19 02:12:12 CDT 2005


Holger Veit wrote:
...
> 2. Scanning: You find a sample issue at 
> http://www.ais.fraunhofer.de/~veit/v2n7.pdf (2MB). This was scanned B&W 
> 400dpi, stored as TIF and  converted with Acrobat.
...
> Regards
> Holger

Holger, others have pointed out rightly that for low contrast pages, you have to 
spend some time playing with thresholding to get it to digitize OK.

What I want to address here is the file size.  If you assemble a PDF from a
collection of page images (JPEGs, TIFFs), Acrobat appears to simply put a wrapper
around them and that is that.

If you read in a PDF and use the File/Reduce File Size... menu, it often doesn't 
have much effect (I think if you have high DPI images it may resample them, 
which may not be what you want).  I tried it on your file and there was barely 
any reduction in size.

However, if you scan the pages from within Acrobat (Create PDF/From scanner...),
it applies a lot more intelligence to the task.  It is also affected by a few of
the preferences you can set.

Something that has worked for me for reprocessing existing files is this
procedure (I've only used it for 1 bpp images).  Read in the PDF.  File/Save As
TIFF.  This produces one TIFF file for each source page.  Then use Create
PDF/From Multiple Files... to read back in all the TIFF images.  It will
recompress them.  Now if you are using G4 compression, this will help only if
the page images were encoded with something worse (like LZW).  The important
step is to set the preference for reading in TIFF files to allow JBIG2
compression -- this saves about 20% for typical pages, and can be dramatically
better for images with halftoning.  The real savings come when you select JBIG2
(lossy).  Yes, it does change the image in imperceptible ways.  For some, this
is heresy, but I'd point out that you are scanning at 1 bpp, so why be a
stickler about what you get?
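Acrobat's JBIG2 step isn't scriptable from the outside, but the G4 half of the
repack (recompress each saved page as a Group 4 TIFF) can be sketched outside
Acrobat with a few lines of Python and the Pillow library.  The function name
and file paths here are my own invention, just for illustration:

```python
# Sketch only: recompress a bilevel page image with CCITT Group 4.
# Assumes the Pillow library; paths are placeholders.
from PIL import Image

def recompress_g4(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path)
    # G4 applies only to bilevel images, so force 1 bit per pixel.
    if img.mode != "1":
        img = img.convert("1")
    img.save(dst_path, format="TIFF", compression="group4")
```

Run over every page saved out of the PDF, this gives the same "recompress with
something better than LZW" effect, minus the JBIG2 savings.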

To make this concrete, your original document is 28 pages and is 2266 KB.  After 
my preprocessing step, it is 849 KB and looks every bit as good to my eyes.  See 
for yourself:

	http://home.pacbell.net/frustum/v2n7-repacked.pdf

Looking closely (like at 800% magnification), your scan shows a LOT of dithering
on all characters.  Perhaps this is the result of the low contrast source, but
more likely your scanner is doing the dithering.  I've seen this on the high
speed office scanner at my work -- the dithering is nice visually when making
copies, but it introduces a lot of edge deltas that G4 compression spends a lot
of bits encoding.  My cheap home scanner doesn't dither so aggressively and
produces smaller scans.
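The reason dithering hurts is that G4 codes each scanline as runs of
alternating black and white (plus vertical offsets against the line above), so
its bit cost roughly tracks the number of color transitions per line.  A toy
sketch with Pillow makes the difference concrete -- a solid bar versus the same
bar rendered as a 50% checkerboard dither (both images are made up here, just
for counting):

```python
# Sketch: count black/white transitions per scanline, the quantity
# G4 run-length coding has to spend bits on.  Assumes Pillow.
from PIL import Image

def transitions_per_line(img: Image.Image) -> float:
    bw = img.convert("1")
    w, h = bw.size
    px = bw.load()
    total = 0
    for y in range(h):
        for x in range(1, w):
            if px[x, y] != px[x - 1, y]:
                total += 1
    return total / h

# A solid black bar on white: 2 transitions per line.
solid = Image.new("1", (64, 16), 1)
for y in range(16):
    for x in range(16, 48):
        solid.putpixel((x, y), 0)

# The same bar as a 50% checkerboard dither: 32 transitions per line.
dithered = Image.new("1", (64, 16), 1)
for y in range(16):
    for x in range(16, 48):
        if (x + y) % 2 == 0:
            dithered.putpixel((x, y), 0)
```

Sixteen times the transitions for the same-looking gray bar, which is the kind
of blowup a dithering scanner hands to G4 on every character edge.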

Then zoom in on the repacked pdf that I made.  The vertical edges of characters 
have a lot less dithering. No doubt this leads to the smaller file size.

One problem that I'm still wrestling with is this.  If you scan via Acrobat and
use a grayscale or color option, Acrobat tries to identify regions of pages,
perhaps the whole page, that can be quantized down to 1 bpp B&W for the best
compression.  Sometimes it works brilliantly; other times it decides sections of
a page are best encoded as JPEG images, resulting in barfalicious artifacts.


