Disk Maintenance


Disks fail. The most catastrophic computer disaster is a destroyed disk pack.
New parts can be ordered to fix a broken CPU. A lost disk, however, will always
contain irreplaceable data. 

An RSX system manager can minimize disk failures and, more importantly, the
damage caused when failures do occur. RSX provides various tools to help
maintain disks. The same tools are provided with each RSX system. However, the
techniques used for a large, multi-user PDP-11/70 system may not apply to a
small Micro/RSX system. System managers must strike a balance between the
cost of using the tools and the damage caused by a failure.


		Bad Blocks
		--- ------

Disk surfaces are not perfect. Imperfections show up as bad disk blocks. It is
not unusual for an RP06 disk pack to have 2 to 10 bad blocks. Some disks record
on special tracks a list of bad blocks which were found when the manufacturer
formatted the disk. These disks, called last-track devices, include the RK06/07,
RL01/02, RP07, and the RM02/03/04/80. The latest fixed media disks such as the
RA80 come with hardware mechanisms which allow bad blocks to be automatically
revectored to replacement blocks. 

Regardless of the type of disk pack, RSX has its own mechanism for handling bad
blocks. The utility BAD writes and reads a specified pattern to every disk
block and logs resulting bad blocks on the last good block on the disk. When
the disk is initialized by INI or BRU, this block is read and any bad blocks
found are allocated to the special file [0,0]BADBLK.SYS. In theory, bad disk
blocks cannot show up in any user files. 

Disk imperfections will worsen over time. A block on a new disk which passes
BAD's test may develop into a bad block over time. If you are initializing a
disk pack which will be in constant use, it is a good idea to run BAD several
times with different test patterns and mark all blocks found as bad. One site
which uses single fixed-media LSI-11/73 systems stress tests its disks with a
20-hour procedure. With no second disk drive, any disk failure which occurs
after the system is in production will take the entire system down. The command
file repeatedly runs BAD using seven different patterns. 

The commands below run BAD with several different patterns. The /LI switch
causes each test to list any bad blocks found. The final command uses the
/MANUAL switch to store any bad blocks detected by the previous tests.

	>BAD ddn:/LI/PATTERN=000000:000000
	>BAD ddn:/LI/PATTERN=177777:177777
	>BAD ddn:/LI/PATTERN=000000:000000
	>BAD ddn:/LI/PATTERN=177777:177777
	>BAD ddn:/LI/PATTERN=052525:052525
	>BAD ddn:/MANUAL
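
A longer stress procedure is mostly a matter of generating more of these lines.
The Python sketch below is only an illustration of that idea, not the command
file used by the site mentioned above. It builds one list-only BAD pass per
pattern followed by the final /MANUAL pass. The switches are copied from the
commands shown above; the function name and the pattern list are assumptions
made for the example.

```python
# Sketch only: build the BAD command lines for a multi-pattern stress test.
# The switches (/LI, /PATTERN, /MANUAL) are copied from the commands in the
# text. The function name, device argument, and pattern list are placeholders.

def bad_commands(device, patterns):
    """Return one list-only BAD pass per pattern, then the final /MANUAL pass."""
    lines = [f">BAD {device}/LI/PATTERN={p}:{p}" for p in patterns]
    lines.append(f">BAD {device}/MANUAL")
    return lines

# Octal patterns taken from the example: all zeros, all ones, alternating bits.
PATTERNS = ["000000", "177777", "052525"]

for line in bad_commands("ddn:", PATTERNS):
    print(line)
```

Repeating entries in the pattern list gives additional passes, as in the
sequence above where the first two patterns are run twice.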

Even after the most rigorous testing, bad blocks will occasionally still
develop. These show up as disk read or write errors. The usual error codes
are parity error (-4) or fatal hardware error (-59). The best solution for bad
block problems is to copy the entire disk to a new empty volume and recheck and
initialize the old volume. If this is not possible, the new bad block can be
added to the bad block file ([0,0]BADBLK.SYS) using the BAD /ALLOCATE command.
The file containing the bad block must first be deleted. 


		Removable Media
		--------- -----

Ten years ago, when an RK05 was a state-of-the-art disk drive, it was not
uncommon for an RK05 disk pack to be in and out of many different disk drives
on a daily
or weekly basis. Such an environment would turn a minor disk failure into a
major disaster. A failed disk would corrupt the heads on one drive. The bad
heads would then scratch the disk surface of the next pack inserted. The second
pack would then corrupt the heads of the next drive and so forth. Error
recovery mechanisms may hide developing problems until the chain has progressed
some distance. I first observed this problem when a site I was working at
ruined seven RK05 drives and 18 disk packs. 

With today's high-capacity disks, roving killer disks are less frequent.
The capacity for damage, however, is correspondingly greater. If you use
removable media, never move a disk pack you are having trouble with to a new
disk drive. You may be ruining a second set of disk heads. The same logic
applies to putting a new pack into a questionable disk drive.

The first rule when a suspicious error occurs is not to panic. Think calmly
about the problem. If necessary, toss everyone out of the machine room. The
drive and pack should be visually inspected using a flashlight before being put
back into use. The slightest sign of burn marks, dust, or particles should
result in a field service call. 

The best preventative measure is to always use the same removable media with
the same disk drives. A log should be kept for any media which moves freely
among several disk drives. The log will help in backtracking any ongoing
problems. 


		INI and HOM
		--- --- ---

RSX organizes disks into directories and files. The formal name for this
organization is Files-11, On Disk Structure Level 1 (ODS1). The initial ODS1
files and parameters are created when the disk is initialized using the INI
utility. HOM is the INI task running under a different name. HOM can be used to
change parameters after a disk is initialized. 

The main INI switch used for disk maintenance is /BAD. INI processes the
bad block information and creates [0,0]BADBLK.SYS. The /BAD switch controls how
the processing occurs. The default value, /BAD=[AUTO], reads the
manufacturer-recorded last-track data, if present, and the information left
by the BAD utility. INI will prompt for bad blocks if the /BAD=[MAN] switch
is typed. Both operations may be combined by specifying /BAD=[AUTO,MAN].

Certain blocks of a Files-11 disk are critical. If these blocks cannot be read,
it will be difficult if not impossible to recover any data from the volume. Two
key parts are the index file bitmap and the volume bitmap file. The index file
bitmap is a special area used to keep track of free file headers. The volume
bitmap is the special file [0,0]BITMAP.SYS, which keeps track of free blocks on
the disk.

If an error is ever reported regarding these areas, do not dismount the volume.
Critical information is still in memory. Instead, immediately attempt to copy
the disk using BRU to another medium. This sometimes recovers from the problem.


		VFY
		---

System crashes will occasionally corrupt a Files-11 disk. At a minimum, space
on the disk is lost after a crash. For instance, TKB marks its temporary work
files for delete when the file is opened. When the file is closed, the system
deletes it. If RSX crashes while TKB has a temporary work file open, the
file is still left on the disk.

The VFY utility checks a disk for proper ODS1 structure. There are three basic
VFY checks: integrity, directory validation, and lost files. The first is the
default VFY operation. VFY reads each file header and checks the blocks shown
allocated against the volume free block bitmap. Any discrepancies are serious
disk errors. The last two checks ensure that all directory entries are valid
and that all files have directory entries.

The integrity check first reports any problems with file headers. File headers
are the key part of the ODS1 specification. No part of a file can be accessed
if the file header cannot be read. A bad file header probably indicates a spot
on the disk which is going bad. The best step is to copy the volume to a new
disk, restore file(s) which had bad file headers, and run BAD on the old disk.
If this is not possible, the VFY /HD switch can be used to delete the bad file
header(s). The integrity check should then be retried.

One common error message in this phase which does not indicate a serious
problem is 'file marked for delete'. This message indicates a file whose header
shows the file should be deleted. The TKB work files from the example above
will show up as such files. 

	>VFY DL0:

	FILE ID 13,2 FOO.BAR;1 OWNER [7,114]
	FILE IS MARKED FOR DELETE
	
You can complete the file deletion by using PIP. Delete the file by using
the File-ID instead of the file name. There is no directory entry; the name
shown is from the file header. 

	>PIP DL0:/FI:13:2/DE
	
	PIP -- FAILED TO MARK FILE FOR DELETE-NO SUCH FILE

The above error message is normal. The file is deleted but no directory entry
could be found. 

Once all headers are read, the VFY integrity check reports the number of blocks
free and allocated according to the headers and free block bitmap. Three
potential problems can result: multiply allocated blocks, allocated blocks
shown free in the bitmap, and lost blocks. 
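
The three problem classes are easy to picture as set operations. The Python
sketch below is a simplified model, not VFY's actual implementation. It assumes
each file header is reduced to a list of block numbers and the bitmap to the
set of blocks it marks as allocated. The function name and the file IDs in the
example are hypothetical.

```python
from collections import Counter

def integrity_check(file_blocks, bitmap_used):
    """Model of the integrity check.

    file_blocks: dict mapping a file ID to the block numbers its header claims.
    bitmap_used: set of block numbers the bitmap marks as allocated.
    """
    # Count how many times each block is claimed across all file headers.
    claims = Counter(b for blocks in file_blocks.values() for b in blocks)
    claimed = set(claims)
    return {
        # Blocks claimed more than once.
        "multiply_allocated": sorted(b for b, n in claims.items() if n > 1),
        # Blocks a header claims but the bitmap shows as free.
        "allocated_but_free": sorted(claimed - bitmap_used),
        # Blocks the bitmap shows as used but no header claims.
        "lost": sorted(bitmap_used - claimed),
    }

# Hypothetical volume: block 12 is claimed twice, block 20 is missing from
# the bitmap, and block 30 is marked used but belongs to no file.
headers = {"13,2": [10, 11, 12], "14,1": [12, 13], "15,1": [20]}
bitmap = {10, 11, 12, 13, 30}
print(integrity_check(headers, bitmap))
```

In this model a clean volume returns three empty lists.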

Any multiply allocated blocks must be resolved first. VFY makes a second pass
and reports each file containing such blocks. Use PIP to copy and delete these
files. You should expect some garbage in these files where the same block(s) is
used twice.  Then rerun VFY to see if all multiply allocated blocks have been
fixed. 

Allocated blocks which are shown free in the bitmap are fixed using the /UP
switch. A sample of this problem is shown below. The file headers in the
index file show 18400 blocks used. The bitmap has one of these blocks marked
as free.

	>VFY DL0:

	CONSISTENCY CHECK OF INDEX AND BITMAP ON DL0:

	INDEX  INDICATES 2000. BLOCKS FREE, 18400. BLOCKS USED OUT OF 20480.
	BITMAP INDICATES 2001. BLOCKS FREE, 18399. BLOCKS USED OUT OF 20480.


The opposite problem, lost blocks, is fixed using the /RE switch. The number of
free blocks in the index file line above would be greater than the free blocks
on the second line showing the bitmap totals. 
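
The comparison of the two totals lines can be done mechanically. The Python
sketch below assumes the report lines look exactly like the sample shown
above; real VFY output may differ in spacing or wording. The function name is
hypothetical. It only compares the free-block counts, so it should be used
after any multiply allocated blocks have been resolved.

```python
import re

# Matches the INDEX and BITMAP totals lines as they appear in the sample.
TOTALS = re.compile(
    r"(INDEX|BITMAP)\s+INDICATES\s+(\d+)\.\s+BLOCKS FREE,\s+(\d+)\.\s+BLOCKS USED"
)

def suggest_switch(report):
    """Return "/UP", "/RE", or None based on the free-block totals."""
    free = {m.group(1): int(m.group(2)) for m in TOTALS.finditer(report)}
    index_free, bitmap_free = free["INDEX"], free["BITMAP"]
    if bitmap_free > index_free:
        # The bitmap shows free blocks that file headers claim as allocated.
        return "/UP"
    if index_free > bitmap_free:
        # The bitmap shows blocks in use that no file header claims.
        return "/RE"
    return None  # The totals agree.

sample = (
    "INDEX  INDICATES 2000. BLOCKS FREE, 18400. BLOCKS USED OUT OF 20480.\n"
    "BITMAP INDICATES 2001. BLOCKS FREE, 18399. BLOCKS USED OUT OF 20480.\n"
)
print(suggest_switch(sample))
```

Equal totals do not prove the volume is clean, since errors in both directions
can cancel out. The sketch is a reading aid, not a replacement for VFY.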

The integrity check makes sure all file headers are valid. ODS1 separates files
from directories. Two additional checks should always be made to make sure
directories are valid. The /DV switch verifies that all directory entries point
to valid files. The /LO switch checks for any files which do not have a
corresponding directory entry. Any such files are entered in directory [1,3]
and can be disposed of in any fashion (typically deleted).


		Error Logging
		----- -------

Another essential tool for disk maintenance is the RSX error logging facility.
All RSX disk drivers report errors to the error logger. An error report
includes the size and location of the transfer, a copy of the hardware
registers and retry status.

A report based on raw data in the error log file is produced by RPT. If your
system has periodic preventative maintenance checks, it is a good practice to
get error log reports two days before the Field Service visit. The following
command line generates the information useful to Field Service:

	RPT list=err/DA:R:90./DE:A/F:B/SU:(E:G)/TY:A

This command selects a one-line summary (/F:B) of all packets (/TY:A) for all
devices (/DE:A) that have been produced in the previous 90 days (/DA:R:90.,
or whatever preventative maintenance cycle your system uses). Finally, error
and geometry summaries (/SU:(E:G)) will be produced. The geometry summary is
particularly useful for spotting tracks and surfaces that are developing
problems.

If the summary reports show any repeating problems, you should advise Field
Service by telephone so they can be prepared to look at the problem. You
may wish to get full listings (/F:F) for some specific error packets.


		Disk Backups
		---- -------

Disk backups take three forms: image, structure, and partial. An image backup
is a block-by-block copy of a disk to another disk or magtape. The entire disk,
including free blocks, is copied. The old RSX utility PRESRV performed an image
copy. 

A structure backup copies all current directories and files. The Disk Save and
Compress utility (DSC) is an example of this type of backup. DSC copies files
to the beginning of the new volume. Free disk space is collected at the end. 

Partial backups allow some subset of the disk volume to be copied. Files can be
selected by directory, name, or date. RMSBCK/RMSRST and BRU are examples of
structure backup utilities that also can be used for partial backups. Some
sites have also used PIP and FLX for copying some set of files. 

Unless you have some specific purpose, BRU should be used for your RSX disk
backups. BRU can perform structure or partial backups. A full restore will
compress the disk files together. BRU is also the fastest backup utility
available from Digital. 

One disadvantage of BRU is lack of support on other systems. BRU cannot be used
to move files to RT-11, RSTS/E, or VAX/VMS systems. If it is important that
backup tapes be readable on other systems, RMSBCK/RMSRST should be used. 

There is no single backup schedule appropriate for all systems. Disk failures
will occur. Your disk backup schedule should let you recover from the failure
in an acceptable time period. At some sites this means a single backup is
taken when the disk is created. Other disk volumes may need daily backups so
the previous 24 hours of work and data can be recovered. 

Backups fall into three classes: major, full, and incremental. A major backup is
not just a full backup but also involves verifying the tape(s) and off-site
storage. RSX has a rocky history of disk backup tools. None have been perfect.
The latest tool, BRU, had a series of major problems in its first releases.
Current BRU versions seem to work much better. It is good practice to
occasionally restore a backup volume to make sure the procedure works. The
checkout performed in a major backup guarantees the tapes can be used to
restore the disk volume. The following steps are recommended for a major
backup: 

    1.	Obtain exclusive access to the disk. For system and public disks,
	this means standalone operations.

    2.	Verify the disk volume and resolve any disk structure errors.

    3.	Perform a full backup of the disk, including backup verification.

    4.	Restore the disk to a separate disk volume and verify the new disk
	volume. The new disk now becomes the on-line disk. The old disk is saved
	in case of problems.

Step 4 is only performed when a separate disk volume is available. The extra
step checks for any errors in the restore procedure. Never restore over the
initial volume. Any error during the restore will leave you with an invalid
disk. 

Always store a separate copy of the backup utility used to make
a major backup. Past history shows that later versions of BRU have not always
been able to correctly restore earlier backup tapes. It is a good practice to
store standalone, bootable copies of each version of the backup tool which is
used.

Full backups are not as rigorous as the major backup. It is good practice to
verify the disk structure integrity before performing the backup. Backup
verification is usually not done, although it is a good idea if time permits. A
common practice is to rotate among three sets of media for the full backups.
At each backup period, rotate to the next set of media. If your site has
removable media, it is less time-consuming and more reliable to use disk packs
as the media rather than magtapes.

Many sites find regular full backups sufficient coverage. If not, BRU is
capable of selecting only files created and/or revised since a specific date.
Partial backups are usually done on-line, but at a nonpeak hour.

Backup frequency will vary from site to site. A typical schedule for an active
disk is daily partials, once-a-week full backups, and quarterly major backups.
On a smaller system, backups will be less frequent. Whatever the schedule,
periodically check that the backup mechanism works. Do not wait until after
a disk is destroyed to find out about a bug in BRU unique to your system.
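
A schedule like the typical one above is simple enough to write down as a rule.
The Python sketch below is one possible reading of it, not a recommendation.
It assumes full backups are taken on Sundays and that the first Sunday of each
calendar quarter is promoted to a major backup. Both choices, and the function
name, are assumptions made for the example.

```python
from datetime import date

def backup_kind(day):
    """Return "partial", "full", or "major" for the given calendar day.

    Assumptions: full backups run on Sundays, and the first Sunday of
    January, April, July, and October is a major backup instead.
    """
    if day.weekday() != 6:  # Monday is 0, Sunday is 6
        return "partial"
    first_sunday_of_quarter = day.month in (1, 4, 7, 10) and day.day <= 7
    return "major" if first_sunday_of_quarter else "full"

# Example days: a Sunday opening a quarter, the next day, the next Sunday.
for d in (date(1986, 1, 5), date(1986, 1, 6), date(1986, 1, 12)):
    print(d.isoformat(), backup_kind(d))
```

A smaller system would stretch the periods, for example by taking partials
only on selected weekdays.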

Micro/RSX systems present a special challenge for disk backups. It is extremely
unwieldy to perform major or full backups of an RD51/RD52 Winchester disk to
RX50 floppy diskettes. It takes 25 RX50 diskettes to save a full RD51 and over
3 times as many for an RD52.
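
The diskette counts follow from simple division. The Python sketch below
assumes nominal capacities of 10 MB for the RD51, 31 MB for the RD52, and
400 KB for an RX50 diskette, and it ignores any overhead the backup utility
adds to each diskette. These figures are assumptions for the example, so a
real backup set may need a few more diskettes.

```python
def diskettes_needed(disk_kb, diskette_kb=400):
    """Number of diskettes needed to hold disk_kb kilobytes (rounded up)."""
    # Ceiling division using integers only.
    return -(-disk_kb // diskette_kb)

# Nominal capacities in kilobytes (assumed, see the text above).
RD51_KB = 10_000
RD52_KB = 31_000

print(diskettes_needed(RD51_KB))  # 25
print(diskettes_needed(RD52_KB))  # 78
```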

Unless the Micro-11 system has one of the new cartridge tape drives, the
techniques described above are impractical. One technique is to copy files to
backup floppies on an application and/or user basis. The files are copied
whenever sufficient change has occurred to warrant a new backup copy.

If the Micro system is located on a fast network such as Ethernet, the network
can be used for backup. You could use the same technique as above and copy
files using NFT to another node in the network. At the remote node, files could
be written to tape. The advantage over floppies is that no operator
intervention is required to change media. The network backup copying could be
done at night using batch control files.

LAD/11 from my company is another solution for Micro/RSX system backup. With
LAD/11, you can run BRU on the Micro/RSX system and write to a tape located on
some other system in the network (or alternatively, read the disk remotely and
write to a local tape). Meridian has developed a standalone version of LAD/11
which can be booted from floppy and used to initially load a Micro/RSX system
disk from a central node. 

Disk maintenance takes time and patience. You can spend hundreds of hours
checking disks for bad blocks, verifying disks, and writing full backups and
never see a single problem. From my past experience, it is still worth it.