Inventory for handling scanned documents (was: Better indexing on bitsavers)
Jules Richardson
julesrichardsonuk at yahoo.co.uk
Fri May 20 14:02:46 CDT 2005
On Fri, 2005-05-20 at 19:29 +0200, Jan-Benedict Glaw wrote:
> On Fri, 2005-05-20 17:08:34 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
>
> > I can't say I've found many bad TIFF viewers for single images though
> > (on any platform); it's only when multiple images are put into the same
> > file that a lot of tools start falling over.
>
> We've now named quite a lot of applications and concepts about how to
> handle scanned documents. I'd like to get the big picture:
>
> - How do you scan a paper document? Page by page? Two pages at once
> (with a sufficiently large scanner)? Do you use a script or something
> like that? ...or are there well-working applications out there that
> aid in scanning some 100 pages?
Single page here for A4 docs; for A5 I'll do two pages at once and have
written scripts before to automate the rotating / cropping process (and
chop out some of the 'noise' in the white background to reduce file
size). I tend to use 300dpi.
(Lots of Windows scanner software seems to try and be clever and adjust
the contrast on the fly btw - which isn't so good when you're trying to
get consistency across multiple pages!)
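As a rough illustration of the rotate-and-split step for two-up A5 scans, here is a minimal Python sketch. It is not the script referred to above. It assumes the scan is held in memory as a list of pixel rows, that the spread comes off the scanner turned a quarter turn (so it needs one clockwise rotation before splitting), and the function names rotate_cw and split_spread are made up for the example.

```python
def rotate_cw(image):
    """Rotate a row-major image (a list of pixel rows) 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise turn.
    return [list(row) for row in zip(*image[::-1])]


def split_spread(scan):
    """Turn a two-up scan upright and cut it into left and right pages.

    Assumption: the spread was scanned sideways, so a single clockwise
    rotation puts both pages the right way up, side by side.
    """
    upright = rotate_cw(scan)
    mid = len(upright[0]) // 2  # cut at the middle column (the gutter)
    left = [row[:mid] for row in upright]
    right = [row[mid:] for row in upright]
    return left, right
```

A real script would also need to find the gutter rather than assume it sits exactly in the middle, and the rotation direction depends on how the document was laid on the glass.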
> Do you directly scan b/w, or first use grayscale/colour and then degrade that to b/w?
I use 8-bit greyscale for all pages, 24-bit colour for covers. The former
because I don't want my scanning process to harm future OCR attempts;
I'd rather leave as much info in the images as possible rather than chop
to b/w at a certain threshold and find much later down the line that
information had been lost. If it's safe to assume that the docs will
never be OCR'd then that's not a problem though. I could probably get
away with 16 grey levels actually rather than 256 as a reasonable
trade-off between flexibility and storage requirements.
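To put rough numbers on the 256-versus-16 grey level trade-off, here is a small Python sketch. It assumes plain uncompressed storage and an A4 page of 210 x 297 mm; the function names quantise_16 and raw_page_bytes are invented for the illustration, and real TIFF files will differ once compression is applied.

```python
def quantise_16(value):
    """Map an 8-bit grey value (0-255) onto one of 16 evenly spaced levels."""
    # Keep the top four bits, then scale back up so 0 stays 0 and 255 stays 255.
    return (value >> 4) * 17


def raw_page_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    # Round up to a whole number of bytes.
    return (width_px * height_px * bits_per_pixel + 7) // 8
```

For an A4 page at 300dpi this works out to roughly 8.7 MB uncompressed at 8 bits per pixel and about 4.35 MB at 4 bits per pixel, so the saving before compression is a factor of two.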
> - How do you work on the scanned images: Do you cut off the white rim as
> much as possible? How do you deal with images that are a tad rotated?
> Accept that? Re-scan to hopefully get a better image? Revert rotation
> in software?
I've been known to clean up scans and sort out rotation (in particular
photocopied docs tend to be not very straight). It's a time-consuming
job though; I'm almost tempted to say it's not relevant at the scanning
stage and can be deferred (much like the OCR process).
> How do you deal with single black dots in white areas or the other way around?
I don't. Lots of storage space can be saved by altering the black and
white threshold of the image though (without checking, I think I found
treating the bottom 10% as black and the top 20% as white worked
particularly well).
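One way to read the threshold adjustment above is as clamping the ends of the grey range: anything in the darkest 10% becomes pure black, anything in the lightest 20% becomes pure white, and the mid-tones are left alone. The Python sketch below follows that reading, which is an assumption (the percentages could equally refer to a histogram percentile), and clamp_levels is a hypothetical name.

```python
def clamp_levels(pixels, black_pct=10, white_pct=20, max_value=255):
    """Force near-black pixels to black and near-white pixels to white.

    Assumption: the percentages are fractions of the grey range, so with
    the defaults values up to 25 become 0 and values from 204 become 255.
    """
    black_cut = max_value * black_pct // 100
    white_cut = max_value * (100 - white_pct) // 100
    result = []
    for value in pixels:
        if value <= black_cut:
            result.append(0)
        elif value >= white_cut:
            result.append(max_value)
        else:
            result.append(value)  # mid-tones are kept as scanned
    return result
```

The storage saving comes from the background: once speckled near-white paper becomes a uniform white, the long runs of identical pixels compress far better.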
> - What digital format do you like to get when it's all finished? Plain
> PDF? PDF with some bookmarks? PDF with all headings as bookmarks? A
> new PDF-hyperref based index? Multiple TIFF/PNG/whatever images?
> Something like a web-based slide-show? ...or multiple formats
> (web-based for viewing, PDF for printing, ...)?
Separate TIFF images, one image per page in the original doc. I scan
blank pages too (!) just so that things will work out at any printing
stage. I suppose I want to capture a bit of the "feel" of the original
document as well as the raw content.
> - What do you currently use as your software:
>
> Operating system:
Linux for processing / scripting images, scanner's currently an awful
USB thing hooked up to a Windows box though (urgh).
> PDF viewer:
Try to avoid it where possible. I think I'll end up settling on Acrobat
5.1 under Linux; 7.0 is way too bloated. XPDF and KDE's offering don't
handle PDFs of document scans at all well. Ghostscript seems to have
issues with rendering quality (that could just be a misconfig / RTFM
issue on my part though).
> TIFF viewer:
GQView's my current favourite general image viewer; it seems fast at
decoding and does a good job as a browser / thumbnail handler. I've not
tried it on multi-page TIFFs yet though to see what it does...
> Browser/other viewers you'd love to use:
Well, I'd still like a native Linux version of Paint Shop Pro... :)
cheers
Jules