Inventory for handling scanned documents (was: Better indexing
on bitsavers)
Antonio Carlini
a.carlini at ntlworld.com
Fri May 20 14:05:43 CDT 2005
Jan-Benedict Glaw wrote:
> - How do you scan a paper document? Page by page? Two pages at
> once (with a sufficiently large scanner)? Do you use a script
> or something like that? ...or are there well-working
> applications out there that aid in scanning some 100 pages?
Both the previous place I was at and my current employer have
B&W scanners with an autofeeder and scan to PDF. The previous
one would drop a PDF on the desktop, the current one emails
it to me. Both would do double-sided.
> Do you directly scan b/w, or first use grayscale/colour and
> then degrade that to b/w?
I scan in bitonal (1-bit, B&W). Then I (usually, these days)
post-process to convert to G4-encoded TIFFs within a PDF.
> - How do you work on the scanned images: Do you cut off the
> white rim as much as possible?
I leave them as they come. If I scan a booklet, or something
that cannot be non-destructively taken apart (and later
reassembled) then I will scan two pages at a time and
post-process manually. In that case I'll probably end up
cropping at the physical page edges.
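The two-pages-at-a-time case can be sketched roughly as below. This is only an illustration, not the process used above (which is manual): it assumes the scan is already a bitonal bitmap held as rows of 0/1 pixels and that the gutter sits at the horizontal centre. The names `split_spread` and `crop_to_box` are hypothetical.

```python
def split_spread(rows):
    """Split a two-page scan into (left, right) page bitmaps.

    rows is a list of equal-length lists of 0/1 pixels (1 = black).
    Assumption: the gutter is at the horizontal centre of the scan.
    """
    if not rows:
        return [], []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same width")
    mid = width // 2
    left = [r[:mid] for r in rows]
    right = [r[mid:] for r in rows]
    return left, right


def crop_to_box(rows, top, left, bottom, right):
    """Crop to a manually chosen box; bottom and right are excluded."""
    return [r[left:right] for r in rows[top:bottom]]


# Tiny made-up 8x2 "spread" for demonstration.
spread = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
]
left_page, right_page = split_spread(spread)
```

In practice the crop box would be picked by eye at the physical page edges, so `crop_to_box` takes the coordinates as arguments rather than trying to detect them.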
> How do you deal with images
> that are a tad rotated? Accept that? Re-scan to hopefully
> get a better image? Revert rotation in software?
If it is bad I will try to rescan, especially if it is only a
few pages. I've never tried to rotate in software.
> How do you
> deal with single black dots in white areas or the other way
> around?
Never worried about that.
> - What digital format do you like to get when it's all
> finished? Plain PDF? PDF with some bookmarks? PDF with all
> headings as bookmarks? A new PDF-hyperref based index?
> Multiple TIFF/PNG/whatever images? Something like a
> web-based slide-show? ...or multiple formats (web-based for
> viewing, PDF for printing, ...)?
I produce PDF as the final format. I do not usually bother
adding bookmarks or whatever.
> - What do you currently use as your software:
>
> Operating system:
Linux (Debian), Solaris, VMS, DOS and Windows.
> PDF viewer:
Acrobat or xpdf.
> TIFF viewer:
IrfanView, but for most things I use PDF.
> Browser/other viewers you'd love to use:
I'd love to see near-perfect OCR (average error
rate of say one missed/misinterpreted character
per 500 pages on 5th generation photocopies of
technical manuals from the 1960s).
If I get a second wish, I would like something that
can take multicolour text pages (like the
RSX-11M-PLUS manuals), slice and dice them
into multiple layers (blue/pink/red/black
etc.) and put them back together as a PDF.
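One way the layer-splitting half of that wish might be approached is to assign each pixel to its nearest reference ink and emit one bitonal mask per ink. The sketch below is a guess at the idea, not an existing tool: the RGB values in `INKS` are invented placeholders (real ones would have to be sampled from the scans), the names are hypothetical, and reassembling the layers into a PDF is not covered.

```python
# Hypothetical reference colours; real values would be sampled from the scans.
INKS = {
    "paper": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (200, 30, 30),
    "blue": (30, 60, 180),
    "pink": (240, 150, 180),
}


def nearest_ink(pixel, inks=INKS):
    """Return the name of the ink closest to pixel (squared RGB distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(pixel, inks[name]))
    return min(inks, key=dist2)


def split_layers(rows, inks=INKS):
    """Build one 0/1 mask per ink (paper excluded) from rows of RGB tuples."""
    layers = {
        name: [[0] * len(row) for row in rows]
        for name in inks
        if name != "paper"
    }
    for y, row in enumerate(rows):
        for x, pixel in enumerate(row):
            name = nearest_ink(pixel, inks)
            if name != "paper":
                layers[name][y][x] = 1
    return layers


# Tiny made-up 3x2 page: near-white, near-black, reddish / bluish, pinkish, white.
page = [
    [(250, 250, 250), (10, 10, 10), (190, 40, 35)],
    [(35, 70, 170), (235, 160, 185), (255, 255, 255)],
]
layers = split_layers(page)
```

Each mask is bitonal, so in principle each layer could be compressed the same way as an ordinary B&W scan. Plain RGB distance is the simplest possible classifier and would likely struggle with faded or photocopied colour.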
I still have hundreds (maybe more) of pages waiting
to be processed - each is currently a 24MB (or so)
24-bit colour TIFF (at 600dpi). Help :-)
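For a rough sense of scale, the raw (uncompressed) size of a scan follows directly from page size, resolution and bit depth. The sketch below assumes an 8.5 x 11 inch page, which is not stated in the post, and the function name is hypothetical.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes; each row is padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: 8.5 x 11 inches at 600 dpi.
colour = raw_scan_bytes(8.5, 11, 600, 24)   # 24-bit colour
bitonal = raw_scan_bytes(8.5, 11, 600, 1)   # 1-bit B&W
print(colour, bitonal)  # 100980000 4210800
```

Under that assumption a raw 24-bit page is about 101 MB and a raw 1-bit page about 4.2 MB, before any compression such as G4. The 24MB figure quoted above would then imply either smaller pages or some compression in the TIFFs; the post does not say which.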
Antonio
--
---------------
Antonio Carlini arcarlini at iee.org