Inventory for handling scanned documents (was: Better indexing on bitsavers)

Antonio Carlini a.carlini at ntlworld.com
Fri May 20 14:05:43 CDT 2005


Jan-Benedict Glaw wrote:

> - How do you scan a paper document? Page by page? Two pages at
>   once (with a sufficient large scanner)? Do you use a script
>   or something like that? ...or are there well-working
>   applications out there that aid in scanning some 100 pages?

Both the previous place I was at and my current employer have
B&W scanners with an autofeeder and scan to PDF. The previous
one would drop a PDF on the desktop, the current one emails
it to me. Both would do double-sided. 

>   Do you directly scan b/w, or first use grayscale/colour and
> then degrade that to b/w? 

I scan in bitonal (1-bit, B&W). These days I usually
post-process the result into G4-encoded TIFF images
wrapped in a PDF.
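
For readers unfamiliar with what "bitonal" means at the byte
level, here is a minimal sketch of the step that precedes G4
compression: thresholding grey values to one bit per pixel and
packing eight pixels into each byte. It is not the tooling
described above. The function names, the threshold of 128 and
the choice of 1 = black are all assumptions made for the
illustration, and the G4 encoding itself is left to a real
TIFF library.

```python
def to_bitonal(gray_rows, threshold=128):
    """Threshold 8-bit grey rows to 1-bit pixels (assumed: 1 = black, 0 = white)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


def pack_row(bits):
    """Pack a row of 0/1 pixels into bytes, most significant bit first."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))  # pad the last byte with white
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


# One tiny 10-pixel-wide "page" row of grey values (made-up data).
page = [[255, 255, 0, 0, 255, 200, 10, 90, 255, 0]]
bits = to_bitonal(page)
packed = [pack_row(row) for row in bits]
```

Ten pixels end up in two bytes here, against thirty bytes for
the same row stored as 24-bit colour.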

> - How do you work on the scanned images: Do you cut off the
>   white rim as much as possible?

I leave them as they come. If I scan a booklet, or something 
that cannot be non-destructively taken apart (and later
reassembled) then I will scan two pages at a time and
post-process manually. In that case I'll probably end up 
cropping at the physical page edges.
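
The manual split of a two-page scan could in principle be
scripted. The following is only a sketch under stated
assumptions, not the procedure used above: the image is
bitonal (1 = black), the binding shows up as the darkest
column near the middle of the spread, and the names
find_gutter and split_spread are hypothetical.

```python
def find_gutter(rows, search_fraction=0.2):
    """Return the column near the middle that holds the most black pixels."""
    width = len(rows[0])
    mid = width // 2
    span = max(1, int(width * search_fraction / 2))
    lo, hi = max(0, mid - span), min(width, mid + span + 1)
    best_col, best_count = mid, -1
    for col in range(lo, hi):
        count = sum(row[col] for row in rows)
        if count > best_count:
            best_col, best_count = col, count
    return best_col


def split_spread(rows, gutter=None):
    """Split a two-page spread into (left, right), dropping the gutter column."""
    if gutter is None:
        gutter = find_gutter(rows)
    left = [row[:gutter] for row in rows]
    right = [row[gutter + 1:] for row in rows]
    return left, right


# A 9-pixel-wide toy spread with a dark binding shadow in column 4.
spread = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
]
gutter = find_gutter(spread)
left, right = split_spread(spread)
```

On a real scan the gutter may be blank rather than dark, in
which case the search would look for the emptiest column
instead.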

>                                  How do you deal with images
>   that are a tad rotated? Accept that? Re-scan to hopefully
>   get a better image? Revert rotation in software?

If it is bad I will try to rescan, especially if it is only a
few pages. I've never tried to rotate in software.

>                                                     How do you
>   deal with single black dots in white areas or the other way
> around? 

Never worried about that.

> - What digital format do you like to get when it's all
>   finished? Plain PDF? PDF with some bookmarks? PDF with all
>   headings as bookmarks? A new PDF-hyperref based index?
>   Multiple TIFF/PNG/whatever images? Something like a
>   web-based slide-show? ...or multiple formats (web-based for
>   viewing, PDF for printing, ...)? 

I produce PDF as the final format. I do not usually bother
adding bookmarks or whatever.

> - What do you currently use as your software:
> 
> 	Operating system:

Linux (Debian), Solaris, VMS, DOS and Windows.

> 	PDF viewer:

Acrobat or xpdf.

> 	TIFF viewer:

IrfanView, but for most things I use PDF.

> 	Browser/other viewers you'd love to use:

I'd love to see near-perfect OCR (average error
rate of say one missed/misinterpreted character
per 500 pages on 5th generation photocopies of
technical manuals from the 1960s).

If I get a second wish, I would like something that
can take multicolour text pages (like the
RSX-11M-PLUS manuals), slice and dice them
into multiple layers (blue/pink/red/black
etc.) and put them back together as a PDF.
I still have hundreds (maybe more) of pages waiting
to be processed - each is currently a 24MB (or so)
24-bit colour TIFF (at 600dpi). Help :-)
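
One naive way to approach the layer idea is to assign every
pixel to the nearest of a few reference ink colours and emit
one bitonal mask per ink. This is a sketch only, not an
existing tool: the RGB values in INKS are guesses rather than
measurements from any manual, and nearest_ink and
separate_layers are hypothetical names.

```python
# Assumed reference colours; real values would have to be sampled from the scans.
INKS = {
    "paper": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (200, 30, 30),
    "blue": (30, 60, 200),
    "pink": (240, 130, 170),
}


def nearest_ink(pixel, inks=INKS):
    """Name of the reference colour closest to pixel (squared RGB distance)."""
    return min(
        inks,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, inks[name])),
    )


def separate_layers(rows, inks=INKS):
    """Return one 0/1 mask per ink; pixels classified as paper are dropped."""
    layers = {
        name: [[0] * len(row) for row in rows]
        for name in inks
        if name != "paper"
    }
    for y, row in enumerate(rows):
        for x, px in enumerate(row):
            name = nearest_ink(px, inks)
            if name != "paper":
                layers[name][y][x] = 1
    return layers


# A 3x2 toy page of slightly off-colour pixels (made-up data).
page = [
    [(250, 250, 250), (10, 10, 10), (210, 40, 35)],
    [(25, 70, 190), (235, 120, 160), (255, 255, 255)],
]
layers = separate_layers(page)
```

Each mask is bitonal, so it could then be compressed the same
way as an ordinary B&W scan. Faded or photocopied pages would
likely need something more robust than a plain RGB distance.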

Antonio

-- 

---------------

Antonio Carlini arcarlini at iee.org





