Inventory for handling scanned documents (was: Better indexing on bitsavers)

Jules Richardson julesrichardsonuk at yahoo.co.uk
Fri May 20 14:02:46 CDT 2005


On Fri, 2005-05-20 at 19:29 +0200, Jan-Benedict Glaw wrote:
> On Fri, 2005-05-20 17:08:34 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
> 
> > I can't say I've found many bad TIFF viewers for single images though
> > (on any platform); it's only when multiple images are put into the same
> > file that a lot of tools start falling over.
> 
> We've now named quite a lot of applications and concepts about how to
> handle scanned documents. I'd like to get the big picture:
> 
> - How do you scan a paper document? Page by page? Two pages at once
>   (with a sufficiently large scanner)? Do you use a script or something
>   like that? ...or are there well-working applications out there that
>   aid in scanning some 100 pages?

Single page here for A4 docs; for A5 I'll do two pages at once and have
written scripts before to automate the rotating / cropping process (and
chop out some of the 'noise' in the white background to reduce file
size). I tend to use 300dpi.
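
As a rough illustration of the kind of script meant here (not the actual
scripts, which aren't shown), the sketch below splits a two-up scan down
the middle and trims the white border from each half. It assumes the page
is held as a list of rows of 8-bit grey values; the names split_two_up and
crop_to_content and the white_level cutoff of 240 are hypothetical, and
rotation is left out entirely.

```python
def split_two_up(image):
    """Split a two-page scan (a list of pixel rows) into left and right halves."""
    mid = len(image[0]) // 2
    left = [row[:mid] for row in image]
    right = [row[mid:] for row in image]
    return left, right


def crop_to_content(image, white_level=240):
    """Trim border rows and columns in which every pixel is >= white_level.

    white_level is an assumed cutoff for "background white", not a value
    taken from the original post.
    """
    rows = [y for y, row in enumerate(image)
            if any(p < white_level for p in row)]
    if not rows:
        return image  # blank page: keep it as it is
    cols = [x for x in range(len(image[0]))
            if any(row[x] < white_level for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


if __name__ == "__main__":
    # Tiny 4x6 "scan": white background with one dark pixel on each half.
    scan = [[255] * 6 for _ in range(4)]
    scan[1][1] = 0
    scan[2][4] = 10
    for page in split_two_up(scan):
        print(crop_to_content(page))
```

Blank pages are returned unchanged so that they still come out as a page of
their own.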

(Lots of Windows scanner software seems to try and be clever and adjust
the contrast on the fly btw - which isn't so good when you're trying to
get consistency across multiple pages!)

>  Do you directly scan b/w, or first use grayscale/colour and then degrade that to b/w?

I use 8 bit greyscale for all pages, 24bit colour for covers. The former
because I don't want my scanning process to harm future OCR attempts;
I'd rather leave as much info in the images as possible rather than chop
to b/w at a certain threshold and find much later down the line that
information had been lost. If it's safe to assume that the docs will
never be OCR'd then that's not a problem though. I could probably get
away with 16 grey levels actually rather than 256 as a reasonable trade-
off between flexibility and storage requirements.
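
To make the 256-to-16 level idea concrete, here is a minimal sketch that
snaps 8-bit grey values onto 16 evenly spaced levels. The function name
quantise_grey and the even spacing are assumptions for illustration; 16
levels fit in 4 bits per pixel, so an uncompressed page would need half
the space of the 8-bit version.

```python
def quantise_grey(pixels, levels=16):
    """Snap 8-bit grey values (0-255) to `levels` evenly spaced values.

    Even spacing is an assumption; the result is still expressed on the
    0-255 scale so it can be viewed like any other 8-bit image.
    """
    if not 2 <= levels <= 256:
        raise ValueError("levels must be between 2 and 256")
    step = 255 / (levels - 1)  # 17.0 when levels is 16
    return [round(round(p / step) * step) for p in pixels]


if __name__ == "__main__":
    print(quantise_grey([0, 8, 9, 128, 255]))
```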

> - How do you work on the scanned images: Do you cut off the white rim as
>   much as possible? How do you deal with images that are a tad rotated?
>   Accept that? Re-scan to hopefully get a better image? Revert rotation
>   in software? 

I've been known to clean up scans and sort out rotation (in particular
photocopied docs tend to be not very straight). It's a time-consuming
job though; I'm almost tempted to say it's not relevant at the scanning
stage and can be deferred (much like the OCR process). 

> How do you deal with single black dots in white areas or the other way around?

I don't. Lots of storage space can be saved by altering the black and
white threshold of the image though (without checking, I think I found
treating the bottom 10% as black and the top 20% as white worked
particularly well)
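
One way to read the figures above, sketched under the assumption that the
percentages refer to the 0-255 grey range (rather than, say, to a share of
the pixels): anything in the darkest 10% becomes pure black, anything in
the lightest 20% becomes pure white, and the mid-tones are left alone. The
name clamp_extremes and its parameters are hypothetical. The saving would
come from the long runs of identical values this produces, which
compress better than noisy near-white or near-black.

```python
def clamp_extremes(pixels, black_pct=10, white_pct=20, max_value=255):
    """Force near-black pixels to 0 and near-white pixels to max_value.

    black_pct and white_pct are percentages of the grey range; that
    reading of the figures is an assumption.
    """
    black_cutoff = max_value * black_pct / 100          # 25.5 by default
    white_cutoff = max_value * (100 - white_pct) / 100  # 204.0 by default
    result = []
    for p in pixels:
        if p <= black_cutoff:
            result.append(0)
        elif p >= white_cutoff:
            result.append(max_value)
        else:
            result.append(p)  # mid-tones keep their original value
    return result


if __name__ == "__main__":
    print(clamp_extremes([3, 25, 26, 120, 200, 204, 250]))
```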

> - What digital format do you like to get when it's all finished? Plain
>   PDF? PDF with some bookmarks? PDF with all headings as bookmarks? A
>   new PDF-hyperref based index? Multiple TIFF/PNG/whatever images?
>   Something like a web-based slide-show? ...or multiple formats
>   (web-based for viewing, PDF for printing, ...)?

Separate TIFF images, one image per page in the original doc. I scan
blank pages too (!) just so that things will work out at any printing
stage. I suppose I want to capture a bit of the "feel" of the original
document as well as the raw content.

> - What do you currently use as your software:
> 
> 	Operating system:

Linux for processing / scripting images, scanner's currently an awful
USB thing hooked up to a Windows box though (urgh).

> 	PDF viewer:

Try to avoid it where possible. I think I'll end up settling on Acrobat
5.1 under Linux; 7.0 is way too bloated. XPDF and KDE's offering don't
handle PDFs of document scans at all well. Ghostscript seems to have
issues with rendering quality (that could just be a misconfig / RTFM
issue on my part though)

> 	TIFF viewer:

GQView's my current favourite general image viewer; it seems fast at
decoding and does a good job as a browser / thumbnail handler. I've not
tried it on multi-page TIFFs yet though to see what it does...

> 	Browser/other viewers you'd love to use:

Well, I'd still like a native Linux version of Paint Shop Pro... :)

cheers

Jules
