Inventory for handling scanned documents (was: Better indexing on bitsavers)
Jan-Benedict Glaw
jbglaw at lug-owl.de
Tue May 24 02:38:44 CDT 2005
On Mon, 2005-05-23 15:42:23 -0700, Eric Smith <eric at brouhaha.com> wrote:
> Jan-Benedict Glaw wrote:
> If you're going to use TIFF as your storage format, why not store
> them as multi-page TIFF? Anyone that needs individual pages can
TIFF has a 2GB file-size limit, and BigTIFF is still at an early draft
stage. This *might* become a problem at some point.
> easily enough "burst" them with a utility like tiffsplit. If you
> store them as a bunch of single files, it increases the chance that
> someone will end up with only a partial document, bad pages, etc.
> (same reason programs are often distributed as ZIP or tar files rather
> than a bunch of smaller files).
I'm not sure if the available tools can handle that, esp. copying unknown
tags (esp. newer ones, like those using the "new" UNDEFINED field type of
TIFF 6.0). But libtiff is available, so if existing tools cannot handle it
yet, they could either be extended easily or we could write our own.
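
To make the "unknown tags" concern a bit more concrete, here is a minimal sketch in Python (standard library only, not libtiff) that walks the page directories (IFDs) of a classic TIFF and lists each tag with its field type, without having to understand the tag. The helper build_demo_tiff() and the tag number 65000 are assumptions made up for this illustration; the demo bytes contain directories only and no image data.

```python
import struct

# Field types from the TIFF 6.0 specification: type code -> name.
TIFF_TYPES = {
    1: "BYTE", 2: "ASCII", 3: "SHORT", 4: "LONG", 5: "RATIONAL",
    6: "SBYTE", 7: "UNDEFINED", 8: "SSHORT", 9: "SLONG",
    10: "SRATIONAL", 11: "FLOAT", 12: "DOUBLE",
}


def walk_ifds(data):
    """Yield one list of (tag, type_name, count) per IFD (page)."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")
    seen = set()
    while offset:  # an offset of 0 ends the chain of directories
        if offset in seen:
            raise ValueError("loop in IFD chain")
        seen.add(offset)
        (n,) = struct.unpack_from(order + "H", data, offset)
        entries = []
        for i in range(n):
            # Each directory entry is 12 bytes: tag, type, count, value/offset.
            tag, ftype, count = struct.unpack_from(
                order + "HHI", data, offset + 2 + 12 * i)
            entries.append((tag, TIFF_TYPES.get(ftype, "?"), count))
        yield entries
        # The 32-bit offset of the next IFD follows the last entry.
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * n)


def build_demo_tiff():
    """Hypothetical helper: two little-endian IFDs, no image data."""
    def ifd(entries, next_offset):
        body = struct.pack("<H", len(entries))
        for tag, ftype, count, value in entries:
            body += struct.pack("<HHII", tag, ftype, count, value)
        return body + struct.pack("<I", next_offset)

    # 256/257 are ImageWidth/ImageLength; 65000 is an arbitrary private tag
    # of type UNDEFINED (7) carrying four opaque bytes inline.
    page1 = [(256, 3, 1, 8), (257, 3, 1, 8), (65000, 7, 4, 0x4A424721)]
    page2 = [(256, 3, 1, 8), (257, 3, 1, 8)]
    first = 8
    second = first + 2 + 12 * len(page1) + 4
    return (b"II" + struct.pack("<HI", 42, first)
            + ifd(page1, second) + ifd(page2, 0))


if __name__ == "__main__":
    for number, entries in enumerate(walk_ifds(build_demo_tiff()), start=1):
        print("page", number, entries)
```

A copying tool built along these lines could carry over entries it does not recognise byte for byte, since the type and count are enough to know how many bytes belong to the value. The 32-bit offsets read here are also where the file-size ceiling of classic TIFF comes from.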
> Are you talking about a TIFF reader? I'd like to see better PDF readers.
Basically yes, but with the ability to use additional embedded data like
captions, printed page numbers and index words.
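
As a rough idea of what a reader could do with such embedded data, the following Python sketch builds a lookup from index words to printed page numbers. The record layout, the field names (file, printed_page, caption, index_words) and the sample values are all assumptions invented for this example, not an existing format.

```python
from collections import defaultdict

# Hypothetical per-page metadata, as it might be extracted from the scans.
PAGES = [
    {"file": "0001.tif", "printed_page": "1-1",
     "caption": "Introduction", "index_words": ["Unibus", "overview"]},
    {"file": "0002.tif", "printed_page": "1-2",
     "caption": "", "index_words": ["overview"]},
    {"file": "0057.tif", "printed_page": "4-12",
     "caption": "Bus timing", "index_words": ["unibus", "timing"]},
]


def build_index(pages):
    """Map each lower-cased index word to the printed pages it appears on."""
    index = defaultdict(list)
    for page in pages:
        for word in page["index_words"]:
            index[word.lower()].append(page["printed_page"])
    return dict(index)


if __name__ == "__main__":
    for word, where in sorted(build_index(PAGES).items()):
        print(word, ", ".join(where))
```

Keying on the printed page number rather than the file number lets the viewer jump to "page 4-12" the way the paper manual's own index would.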
> Evince is already much better than xpdf was, but there's probably room
> for further improvement.
I'd really give it a try :)
> > Eric, I don't know how well-working your bookmark generation code is.
> > Can it already handle really tree-like looking bookmarks if the data was
> > available in tumble's input files?
>
> Yes.
Super :)
> I don't like the way the tumble control files work now, which is
> part of why they're not documented. I'm probably going to redesign it
> to use an XML-based control language.
I'm really thinking about doing some testing with custom TIFF tags. With
that, an external control file may not be needed at all. The tools would
still need to export/import that data to/from text files, of course.
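
The export/import step could look roughly like the following Python sketch, which converts page metadata to "key: value" text lines and back, and to the raw bytes that might be stored as the value of a custom tag. The text format, the function names and the sample fields are assumptions for illustration only; actually writing the bytes into a TIFF would be left to libtiff or a similar library.

```python
def meta_to_text(meta):
    """Render metadata as 'key: value' lines, sorted by key."""
    return "".join(f"{key}: {value}\n" for key, value in sorted(meta.items()))


def text_to_meta(text):
    """Parse 'key: value' lines back into a dict of strings."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def to_tag_payload(meta):
    """Bytes suitable as the value of a (hypothetical) custom tag."""
    return meta_to_text(meta).encode("utf-8")


def from_tag_payload(payload):
    """Inverse of to_tag_payload()."""
    return text_to_meta(payload.decode("utf-8"))


if __name__ == "__main__":
    meta = {"caption": "Figure 3: Bus timing", "printed_page": "4-12"}
    print(meta_to_text(meta), end="")
    assert from_tag_payload(to_tag_payload(meta)) == meta
```

This simple format assumes keys contain no colon and values contain no newline; anything richer (nested bookmarks, for instance) would need a more structured encoding.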
> In my copious free time. Sigh.
That's always a problem, I guess. I just quit my contract (I will remain at
my current employer's site for another month, continuing to hack _and_
teach coworkers), and of course I'd like to hack on vax-linux, too. (It
was about 04:20 when I went to bed last night and my alarm clock rings at
07:00, and I've been doing that for too long already...) But don't
rush into a new XML interface. Maybe the TIFF files will carry
all that data for free at some point :)
Regards, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _
"A Free Opinion in a Free Mind | Against Censorship | Against War _ _ O
for a Free State full of Free Citizens" | on the Internet! | in Iraq! O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));