Inventory for handling scanned documents (was: Better indexing on bitsavers)

J. David Bryan jdbryan at acm.org
Wed May 25 13:53:00 CDT 2005


On 20 May 2005 at 11:20, Al Kossow wrote:

> Since it turned out that scanning wide-edge first results in straighter
> scans....

In preparation for scanning and PDFing the 200-odd HP manuals in my 
possession, I've been experimenting with batch image-processing programs.  
For deskewing scans, the Leptonica library at:

  http://www.leptonica.com/

...provides a simple solution that works quite well.  Deskewing takes
little more than calls to the "pixRead", "pixDeskew", and "pixWrite"
library functions.

For pages containing text and screened photos, I scan twice: once at 600 dpi
bilevel for the text, and once at 200 dpi grayscale for the photo, using the
descreening feature of the (horrible) HP imaging software that came with the
scanner.  I then manually erase the screened area from the first image and
crop the second to the photo, saving the latter as a JPEG.  I've modified
"tumble" to composite images, so the resulting PDF page has the TIFF G4
text as the background with the JPEG photo superimposed.

I wrote a simple masking program to clean up the edges of the scanned 
images.  I wrote another program that takes a directory of image files, 
parses the filenames for section, chapter, and page number information 
encoded in the names, and creates a "tumble" control file to create the PDF 
with appropriate bookmarks, page labels, and blank pages -- the latter to 
allow for easy duplex printing (I've also modified "tumble" to create blank 
PDF pages instead of embedding a blank TIFF page image).
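
The filename-driven step described above might be sketched roughly as
follows.  The naming scheme (s<section>_c<chapter>_p<page>.tif), the
"chapter-page" label style, and the function names are all assumptions made
for illustration; the post does not give the actual scheme.  The sketch
stops at an ordered page plan and does not attempt to emit "tumble"
control-file syntax.

```python
import re

# Hypothetical naming scheme: s<section>_c<chapter>_p<page>.tif
NAME_RE = re.compile(r"^s(\d+)_c(\d+)_p(\d+)\.tif$", re.IGNORECASE)


def parse_name(filename):
    """Return (section, chapter, page) as ints, or None if the name does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def plan_pages(filenames):
    """Build an ordered page plan with blanks inserted for duplex printing.

    Returns a list of (kind, label, filename) tuples, where kind is "scan"
    or "blank".  Within each chapter, page numbers with no scan become
    blanks, and a chapter ending on an odd page gets one trailing blank so
    that the next chapter starts on a front side.
    """
    # Group scans by (section, chapter), keyed by page number.
    scans = {}
    for name in filenames:
        key = parse_name(name)
        if key is not None:
            section, chapter, page = key
            scans.setdefault((section, chapter), {})[page] = name

    plan = []
    for (section, chapter), pages in sorted(scans.items()):
        last = max(pages)
        if last % 2:  # pad odd-length chapters to an even page count
            last += 1
        for page in range(1, last + 1):
            label = f"{chapter}-{page}"
            if page in pages:
                plan.append(("scan", label, pages[page]))
            else:
                plan.append(("blank", label, None))
    return plan
```

Each entry in the plan could then be translated into whatever page, label,
and bookmark statements the PDF-building tool expects.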

Finally, I use Ghostscript to linearize the tumbled PDF.

The only significant manual work is rescanning the photos in order to 
descreen them.  I'd like to find a batch descreener that would take a 
bilevel screened image file and produce a grayscale image.  The Leptonica 
library has such a function, but my first attempts yielded visible Moire 
patterns.  I need to investigate further.
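
For reference, the simplest form of bilevel-to-grayscale conversion is
plain box averaging, sketched below.  This is not Leptonica's routine, only
an illustration of the general idea; the function name and the list-of-rows
image representation are assumptions.  A block size of 3 would correspond to
going from a 600 dpi bilevel scan to a 200 dpi grayscale image.

```python
def descreen(bits, block):
    """Average each block x block tile of a bilevel image into one gray pixel.

    bits is a list of equal-length rows of 0/1 values, with 1 meaning black.
    Returns rows of gray levels from 0 (black) to 255 (white).  Partial
    tiles at the right and bottom edges are dropped.
    """
    rows = len(bits) // block
    cols = len(bits[0]) // block if bits else 0
    area = block * block
    gray = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Count the black pixels in this tile.
            ink = sum(
                bits[r * block + dy][c * block + dx]
                for dy in range(block)
                for dx in range(block)
            )
            # More ink means a darker gray level.
            out_row.append(255 - (255 * ink) // area)
        gray.append(out_row)
    return gray
```

When the tile size does not line up with the period of the halftone screen,
the averages can beat against the dot pattern, which is one plausible source
of Moire in a simple approach like this.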

                                      -- Dave


