Better indexing on bitsavers
Jan-Benedict Glaw
jbglaw at lug-owl.de
Fri May 20 07:05:36 CDT 2005
On Fri, 2005-05-20 11:36:24 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
> On Fri, 2005-05-20 at 09:31 +0200, Jan-Benedict Glaw wrote:
> > On Thu, 2005-05-19 22:20:53 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
> > > On Thu, 2005-05-19 at 23:46 +0200, Jan-Benedict Glaw wrote:
> > > > I'm still thinking about how paper-based documentation can be
> > > > packaged cleverly enough to capture text as well as images, with
> > > > meta-data mixed in. Maybe I'll do some C programming and hack up
> > > > something nice that produces PDF files holding everything? But
> > > > first, I'd need to understand PDF (whose specification is actually
> > > > about 8cm thick...)
> > >
> > > Doesn't this sort of imply that PDF is the wrong choice of format for
> > > jobs like these? (plus I'm pissed at Adobe because their current reader
> > > for Linux eats close on 100MB of disk space just to let me read a PDF
> > > file :-)
> >
> > There are alternatives, like:
> >
> > - A tarball containing all the TIFF (or whatever) images as well
> > as some (generated)
>
> See my other post; that's my preference and what I tend to do with all
> image-based PDF content I download from anywhere anyway...
For the record (and my education), how do you extract these?
> > HTML page (containing some kind of slide
> > show) as well as a small description file (use this with some
> > program, yet to be written, to generate the HTML file(s)).
> >
> > This opens up the chance to make the description file
> > quite clever, so you'll get e.g. a clickable index for the TIFF
> > files (though that needs to be done manually; but this work
> > load can then actually be *distributed*)
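To make that concrete (the format and all names below are made up, just
to illustrate), the description file could be as simple as:

  # index.desc -- hypothetical format: image file name, then a title
  page001.tiff  Cover
  page002.tiff  Table of contents
  page003.tiff  Chapter 1: Processor overview

and the to-be-written program a small C filter turning that into a
clickable index:

  /* desc2html.c -- sketch only: turn the (made-up) description file
   * on stdin into a clickable HTML index on stdout. */
  #include <stdio.h>

  int main(void)
  {
      char line[512], file[256], title[256];

      printf("<html><body><ul>\n");
      while (fgets(line, sizeof line, stdin)) {
          if (line[0] == '#' || line[0] == '\n')
              continue;       /* skip comments and blank lines */
          if (sscanf(line, "%255s %255[^\n]", file, title) == 2)
              printf("<li><a href=\"%s\">%s</a></li>\n", file, title);
      }
      printf("</ul></body></html>\n");
      return 0;
  }

Run it as "desc2html < index.desc > index.html"; anyone scanning a
manual would only have to contribute the .desc file.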
>
> One of the things that I was working on a few years back was layering
> multiple delivery mechanisms over one form of content (where the dataset
> was sufficiently large that storage in multiple formats wasn't
> justified).
>
> Data was kept in the "purest" form on the server side, and a client
> could ask for content in whatever format they wanted (in this case raw
> images, PDF, HTML etc.) and over whatever interface mechanism they
> wanted (HTTP, FTP, WAP, email, network filesystem etc.)
Actually, I was working on something like that as well, but with a
different ulterior motive: build something like this as a redundant,
peer-to-peer capable database, and many of the archiving-old-data
problems just vanish. (Indeed, it would make a nice P2P system as well.)
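For the give-me-format-X part, a rough sketch of how I'd start in C
(all names here are invented, and the converter bodies are left out):

  /* Sketch: map a requested format onto a converter; the "pure"
   * dataset stays on disk, everything else is derived on demand. */
  #include <stdio.h>
  #include <string.h>

  typedef int (*convert_fn)(const char *dataset, FILE *out);

  static int to_raw(const char *d, FILE *o)  { (void)d; (void)o; return 0; }
  static int to_pdf(const char *d, FILE *o)  { (void)d; (void)o; return 0; }
  static int to_html(const char *d, FILE *o) { (void)d; (void)o; return 0; }

  static const struct {
      const char *format;
      convert_fn  convert;
  } converters[] = {
      { "raw",  to_raw  },
      { "pdf",  to_pdf  },
      { "html", to_html },
  };

  convert_fn lookup_converter(const char *format)
  {
      size_t i;

      for (i = 0; i < sizeof converters / sizeof converters[0]; i++)
          if (!strcmp(converters[i].format, format))
              return converters[i].convert;
      return NULL;        /* unknown format */
  }

The transport side (HTTP, FTP, mail, ...) would then just be a second
table of the same kind, wrapping whatever the chosen converter writes.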
> I could see some of the big archives around the planet (regardless of
> content) going this way in the future; user base is maximised through
> offering different formats whilst the "pure" dataset is all that's
> backed up and actually kept on disk.
That's what I dream about on long nights, though not as a centralized
database, but as a distributed one. Imagine you feed in a raw-encoded
audio CD (with all the interleaved stuff intact, even including the
mischievous, intentional errors), and a front-end that extracts WAV
files from it (which another front-end could use to produce
Ogg/MP3/WMA/you name it).
Concepts like that can be applied to nearly all media types. Just
record what the underlying recording/reading machinery delivers, and
write filters for that. These filters may get as complex as filesystem
drivers; in fact, this *is* a layered filesystem. Remember the thread(s)
about how to rescue tapes? ...or HDD images? Apply these general
concepts and things may get easier :)
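In code, I picture one such filter layer roughly like this (a sketch
only, every name invented):

  /* Sketch of stackable filters: each layer decodes what the layer
   * below delivers, e.g. raw CD -> audio frames -> WAV -> Ogg. */
  #include <stddef.h>

  struct filter {
      const char *name;
      struct filter *lower;                   /* next layer down */
      /* Can this filter decode what "lower" delivers? */
      int  (*probe)(struct filter *lower);
      /* Deliver decoded data, pulling raw data from "lower". */
      long (*read)(struct filter *self, void *buf, size_t len);
  };

  /* Try all known filters on top of "lower"; return the new top
   * of the stack, or "lower" itself if nothing matched. */
  struct filter *stack_filter(struct filter *candidates[], int n,
                              struct filter *lower)
  {
      int i;

      for (i = 0; i < n; i++)
          if (candidates[i]->probe(lower)) {
              candidates[i]->lower = lower;
              return candidates[i];
          }
      return lower;
  }

A raw CD image, a tape image or an HDD image would just be different
bottom layers for the very same machinery.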
For now (as we're not (yet) there), providing space isn't my main
problem. It's about having (time to write) the software.
Regards, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _
"Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O
fuer einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));