Disk archival techniques
Brian Wheeler
bdwheele at indiana.edu
Tue May 17 17:06:22 CDT 2005
As coincidence would have it, I work at Indiana University's Digital
Library Program, and there was a lecture on archiving audio that touched
on many of the same issues that have come up here. The conclusions
they reached for that project included:
* There's no such thing as an eternal medium: the data must be
transportable to the latest generation of storage
* Metadata should be bundled with the content
* Act like you get one chance to read the media :(
While this is a different context, the principle is basically the same.
I've got a pile of TK50 tapes I'm backing up using the SIMH tape format,
so this is relevant to that process as well.
I think the optimum format for doing this isn't a single file, but a
collection of files bundled into a single package. Someone mentioned
tar, I think, and zip would work just as well. The package could hold
these components:
* content metadata - info from the disk's label/sleeve, etc
* media metadata - the type of media this came from
* archivist metadata - who did it, methods used, notes, etc
* bad-block information - which zeroed blocks are actually bad
* content - a bytestream of the data
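To make that concrete, here's a minimal sketch of building such a package
with zip. The member names, metadata fields, and sample values are all
assumptions for illustration, not a proposed standard:

```python
import json
import zipfile

# Hypothetical layout: one zip per disk, each component a named member.
content_meta = {"label": "CP/M 2.2 system disk", "source": "sleeve notes"}
media_meta = {"type": "5.25in DSDD", "tracks": 40, "encoding": "MFM"}
archivist_meta = {"who": "bdwheele", "method": "single-pass read", "notes": ""}
bad_blocks = [142, 143]  # blocks that were zero-filled in the content stream

with zipfile.ZipFile("disk0001.zip", "w") as pkg:
    pkg.writestr("content-metadata.json", json.dumps(content_meta))
    pkg.writestr("media-metadata.json", json.dumps(media_meta))
    pkg.writestr("archivist-metadata.json", json.dumps(archivist_meta))
    pkg.writestr("badblocks.json", json.dumps(bad_blocks))
    pkg.writestr("content.img", b"\x00" * 512 * 720)  # raw bytestream
```

Using zip rather than tar also gets you per-member checksums for free.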
I don't think there's any real need to document the physical properties
of the media for EVERY disk archived -- there should probably be a
repository of 'standard' media types (1541's different-sectors-per-track
info, FM vs MFM per track information, etc) plus overrides in the media
metadata (uses fat-tracks, is 40 track vs 35, etc).
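A sketch of what that repository-plus-overrides lookup might look like;
the profile names, field names, and sector counts here are illustrative
assumptions (the 1541's real zone layout should come from the repository
itself):

```python
# Hypothetical repository of standard media profiles; a disk's media
# metadata names a profile and records only what differs from it.
STANDARD_MEDIA = {
    "c1541": {"tracks": 35, "encoding": "GCR",
              "sectors_per_track": {"1-17": 21, "18-24": 19,
                                    "25-30": 18, "31-35": 17}},
    "pc360k": {"tracks": 40, "encoding": "MFM",
               "sectors_per_track": {"1-40": 9}},
}

def resolve_media(profile, overrides=None):
    """Merge per-disk overrides over a standard media profile."""
    merged = dict(STANDARD_MEDIA[profile])
    merged.update(overrides or {})
    return merged

# A 40-track 1541 variant only needs to record the difference:
media = resolve_media("c1541", {"tracks": 40})
```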
Emulators could use the content part of the file as-is and collectors
would have enough information to recreate the original media. It would
also allow for cataloging fairly easily.
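As a self-contained sketch of that (using an in-memory zip and assumed
member names like content.img), an emulator or catalog tool could pull
the pieces out directly:

```python
import io
import json
import zipfile

# Build a tiny example package in memory, then read it back the way an
# emulator or cataloguer would. Member names are an assumed convention.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as pkg:
    pkg.writestr("media-metadata.json", json.dumps({"type": "pc360k"}))
    pkg.writestr("content.img", b"\xe5" * 512)

with zipfile.ZipFile(buf) as pkg:
    image = pkg.read("content.img")  # usable as-is by an emulator
    meta = json.loads(pkg.read("media-metadata.json"))  # for cataloging
```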
Brian
On Tue, 2005-05-17 at 21:04 +0000, Jules Richardson wrote:
> On Tue, 2005-05-17 at 15:07 -0500, Randy McLaughlin wrote:
> > I like and prefer media images as straight data dumps but I want the
> > formatting information of the original media somewhere. I even want data
> > from media that is incomplete or has errors, also documented.
>
> Yep, me too. From when we were bashing around ideas about this though a
> few months back it seems that's a minority viewpoint; most people want
> data embedded in the metadata.
>
> For hard drive images I zero-pad any bad data but also include metadata
> in a separate file - including disk geometry, which blocks are bad,
> resulting dump checksum, timestamp etc., along with anything else that
> might be particularly useful. For floppy images things would be
> significantly more complex though (due to the factors mentioned - variant
> sectors/track, different encoding for different tracks, etc.)
>
> The idea behind futurekeep though was to make the metadata highly
> structured and in a similar vein to HTML in that clients could handle as
> much of the data as needed (eg. someone not dealing with variable bit
> rate images wouldn't need a decoder that could handle them). Ideally
> it'd be human-readable too (after a fashion) - e.g. XML - so that the
> data could be reconstructed into a disk image "by hand" even if some
> whizzy util to do it wasn't present. (understanding it at a file level
> is obviously outside the scope)
>
> That doesn't seem *too* much to ask; basic metadata can be created for
> existing images without a lot of hassle *if desired*. To me such a
> format's more useful for future image creation though, particularly in
> the case of less-common systems; the popular machines are likely to be
> covered by their own archive formats already, and their following is
> large enough that lack of data is not (yet) a problem. Rather than messing
> around with proprietary image formats for those, or formats that aren't
> particularly descriptive, it'd be nice to start from day 1 using
> something that allows us to capture all the useful stuff that goes along
> with the raw data.
>
> cheers
>
> Jules