Disk Maintenance

Disks fail. The most catastrophic computer disaster is a destroyed disk pack. New parts can be ordered to fix a broken CPU. A lost disk, however, will always contain irreplaceable data. An RSX system manager can minimize disk failures and, more importantly, the damage caused when failures do occur.

RSX provides various tools to help maintain disks. The same tools are provided with each RSX system. However, the techniques used for a large, multi-user PDP-11/70 system may not apply to a small Micro/RSX system. System managers must strike a balance between the cost of using the tools and the damage caused by a failure.

Bad Blocks
--- ------

Disk surfaces are not perfect. Imperfections show up as bad disk blocks. It is not unusual for an RP06 disk pack to have 2 to 10 bad blocks. Some disks record, on special tracks, a list of the bad blocks that were found when the manufacturer formatted the disk. These disks, called last-track devices, include the RK06/07, RL01/02, RP07, and the RM02/03/04/80. The latest fixed-media disks, such as the RA80, come with hardware mechanisms that allow bad blocks to be automatically revectored to replacement blocks.

Regardless of the type of disk pack, RSX has its own mechanism for handling bad blocks. The utility BAD writes and reads a specified pattern to every disk block and logs the resulting bad blocks in the last good block on the disk. When the disk is initialized by INI or BRU, this block is read and any bad blocks found are allocated to the special file [0,0]BADBLK.SYS. In theory, bad disk blocks cannot show up in any user files.

Disk imperfections will worsen over time. A block on a new disk which passes BAD's test may develop into a bad block later. If you are initializing a disk pack which will be in constant use, it is a good idea to run BAD several times with different test patterns and mark all blocks found as bad. One site that uses LSI-11/73 systems with a single fixed-media disk stress tests its disks with a 20-hour procedure.
With no second disk drive, any disk failure which occurs after the system is in production will take the entire system down. The command file repeatedly runs BAD using seven different patterns.

The sequence below shows BAD commands using different patterns. The /LI switch causes each test to list any bad blocks found. The final command uses the /MANUAL switch to store any bad blocks detected by the previous tests.

    >BAD ddn:/LI/PATTERN=000000:000000
    >BAD ddn:/LI/PATTERN=177777:177777
    >BAD ddn:/LI/PATTERN=000000:000000
    >BAD ddn:/LI/PATTERN=177777:177777
    >BAD ddn:/LI/PATTERN=052525:052525
    >BAD ddn:/MANUAL

Even after the most rigorous testing, bad blocks will occasionally still develop. These show up as disk read or write errors. The usual error codes are parity error (-4) or fatal hardware error (-59). The best solution for bad block problems is to copy the entire disk to a new empty volume, then recheck and initialize the old volume. If this is not possible, the new bad block can be added to the bad block file ([0,0]BADBLK.SYS) using the BAD /ALLOCATE command. The file containing the bad block must first be deleted.

Removable Media
--------- -----

Ten years ago, when an RK05 was a state-of-the-art disk drive, it was not uncommon for an RK05 disk pack to be in and out of many different disk drives on a daily or weekly basis. Such an environment would turn a minor disk failure into a major disaster. A failed disk would corrupt the heads on one drive. The bad heads would then scratch the disk surface of the next pack inserted. The second pack would then corrupt the heads of the next drive, and so forth. Error recovery mechanisms may hide developing problems until the chain has progressed some distance. I first observed this problem when a site I was working at ruined seven RK05 drives and 18 disk packs.

With today's high-capacity disks, roving killer disks are less frequent. The capacity for damage, however, is correspondingly greater.
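
The chain described above can be reconstructed from a record of which pack was mounted on which drive. The Python sketch below is only an illustration, not an RSX tool. It assumes a chronological list of (pack, drive) mounts and one pack already known to be damaged, and it treats any drive that held a suspect pack, and any pack later mounted on a suspect drive, as suspect. The pack labels, drive names, and the function name trace_suspects are all hypothetical.

```python
from typing import Iterable, List, Tuple

Mount = Tuple[str, str]  # (pack label, drive name), oldest mount first


def trace_suspects(mounts: Iterable[Mount], bad_pack: str) -> Tuple[List[str], List[str]]:
    """Return (suspect packs, suspect drives) in the order they became suspect.

    Assumes bad_pack was already damaged at the start of the log.
    """
    packs: List[str] = [bad_pack]
    drives: List[str] = []
    for pack, drive in mounts:
        if pack in packs and drive not in drives:
            # A suspect pack may have damaged this drive's heads.
            drives.append(drive)
        elif drive in drives and pack not in packs:
            # Suspect heads may have scratched this pack.
            packs.append(pack)
    return packs, drives


# Hypothetical mount history for illustration only.
mount_log = [
    ("PACK1", "DK0:"),
    ("PACK2", "DK0:"),
    ("PACK2", "DK1:"),
    ("PACK3", "DK2:"),  # PACK3 was still clean here, so DK2: is not suspect
    ("PACK3", "DK1:"),
]

suspect_packs, suspect_drives = trace_suspects(mount_log, "PACK1")
print(suspect_packs)   # ['PACK1', 'PACK2', 'PACK3']
print(suspect_drives)  # ['DK0:', 'DK1:']
```

Everything the sketch flags would need visual inspection before further use. It only narrows the search and cannot prove that a pack or drive is undamaged.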

If you use removable media, never move a disk pack you are having trouble with to a new disk drive. You may be ruining a second set of disk heads. The same logic applies to putting a new pack into a questionable disk drive.

The first rule when a suspicious error occurs is not to panic. Think calmly about the problem. If necessary, toss everyone out of the machine room. The drive and pack should be visually inspected using a flashlight before being put back into use. The slightest sign of burn marks, dust, or particles should result in a field service call.

The best preventive measure is to always use the same removable media with the same disk drives. A log should be kept for any media which moves freely among several disk drives. The log will help in backtracking any ongoing problems.

INI and HOM
--- --- ---

RSX organizes disks into directories and files. The formal name for this organization is Files-11, On Disk Structure Level 1 (ODS1). The initial ODS1 files and parameters are created when the disk is initialized using the INI utility. HOM is the INI task running under a different name. HOM can be used to change parameters after a disk is initialized.

The main INI switch used with disk maintenance is /BAD. INI processes the bad block information and creates [0,0]BADBLK.SYS. The /BAD switch controls how the processing occurs. The default value, /BAD=[AUTO], reads the bad block data, if present, from the manufacturer-recorded last track and from the information left by the BAD utility. INI will prompt for bad blocks if the /BAD=[MAN] switch is typed. Both operations may be combined by specifying /BAD=[AUTO,MAN].

Certain blocks of a Files-11 disk are critical. If these blocks cannot be read, it will be difficult if not impossible to recover any data from the volume. Two key parts are the index file bitmap and the volume bitmap file. The index file bitmap is a special area used to keep track of free file headers.
The second is the special file [0,0]BITMAP.SYS, which keeps track of free blocks on the disk. If an error is ever reported regarding these areas, do not dismount the volume. Critical information is still in memory. Instead, immediately attempt to copy the disk to another medium using BRU. This sometimes recovers from the problem.

VFY
---

System crashes will occasionally corrupt a Files-11 disk. At a minimum, space on the disk is lost after a crash. For instance, TKB marks its temporary work files for delete when the file is opened. When the file is closed, the system deletes it. If RSX crashes while TKB has a temporary work file open, the file is left on the disk.

The VFY utility checks a disk for proper ODS1 structure. There are three basic VFY checks: integrity, directory validation, and lost files. The first is the default VFY operation. VFY reads each file header and checks the blocks shown allocated against the volume free block bitmap. Any discrepancies are serious disk errors. The last two checks ensure that all directory entries are valid and that all files have directory entries.

The integrity check first reports any problems with file headers. File headers are the key part of the ODS1 specification. No part of a file can be accessed if the file header cannot be read. A bad file header probably indicates a spot on the disk which is going bad. The best step is to copy the volume to a new disk, restore the file(s) which had bad file headers, and run BAD on the old disk. If this is not possible, the VFY /HD switch can be used to delete the bad file header(s). The integrity check should then be retried.

One common error message in this phase which does not indicate a serious problem is 'file marked for delete'. This message indicates a file whose header shows the file should be deleted. The TKB work files from the example above will show up as such files.

    >VFY DL0:
    FILE ID 13,2 FOO.BAR;1 OWNER [7,114]
    FILE IS MARKED FOR DELETE

You can complete the file deletion by using PIP. Delete the file by using the file ID instead of the file name. There is no directory entry; the name shown is from the file header.

    >PIP DL0:/FI:13:2/DE
    PIP -- FAILED TO MARK FILE FOR DELETE-NO SUCH FILE

The above error message is normal. The file is deleted, but no directory entry could be found.

Once all headers are read, the VFY integrity check reports the number of blocks free and allocated according to the headers and the free block bitmap. Three potential problems can result: multiply allocated blocks, allocated blocks shown free in the bitmap, and lost blocks.

Any multiply allocated blocks must be resolved first. VFY makes a second pass and reports each file containing such blocks. Use PIP to copy and delete these files. You should expect some garbage in these files where the same block(s) is used twice. Then rerun VFY to see if all multiply allocated blocks have been fixed.

Allocated blocks which are shown free in the bitmap are fixed using the /UP switch. A sample of this problem is shown below. The file headers in the index file show 18400 blocks used. The bitmap has one of these blocks marked as free.

    >VFY DL0:
    CONSISTENCY CHECK OF INDEX AND BITMAP ON DL0:
    INDEX INDICATES 2000. BLOCKS FREE, 18400. BLOCKS USED OUT OF 20480.
    BITMAP INDICATES 2001. BLOCKS FREE, 18399. BLOCKS USED OUT OF 20480.

The opposite problem, lost blocks, is fixed using the /RE switch. In that case, the number of free blocks on the index line would be greater than the number of free blocks on the bitmap line.

The integrity check makes sure all file headers are valid. ODS1 separates files from directories. Two additional checks should always be made to make sure directories are valid. The /DV switch verifies that all directory entries point to valid files. The /LO switch checks for any files which do not have a corresponding entry.
Any such files are entered in directory [1,3] and can be disposed of in any fashion (typically deleted).

Error Logging
----- -------

Another essential tool for disk maintenance is the RSX error logging facility. All RSX disk drivers report errors to the error logger. An error report includes the size and location of the transfer, a copy of the hardware registers, and the retry status.

A report based on the raw data in the error log file is produced by RPT. If your system has periodic preventive maintenance checks, it is a good practice to get error log reports two days before the Field Service visit. The following command line generates the information useful to Field Service:

    RPT list=err/DA:R:90./DE:A/F:B/SU:(E:G)/TY:A

This command selects a one-line summary (/F:B) of all packets (/TY:A) for all devices (/DE:A) that have been produced in the previous 90 days (or whatever preventive maintenance cycle your system uses). Finally, error and geometry summaries (/SU:(E:G)) will be produced. The geometry summary is particularly useful for spotting tracks and surfaces that are developing problems.

If the summary reports show any repeating problems, you should advise Field Service by telephone so they can be prepared to look at the problem. You may wish to get full listings (/F:F) for some specific error packets.

Disk Backups
---- -------

Disk backups take three forms: image, structure, and partial. An image backup is a block-by-block copy of a disk to another disk or magtape. The entire disk, including free blocks, is copied. The old RSX utility PRESRV performed an image copy.

A structure backup copies all current directories and files. The Disk Save and Compress utility (DSC) is an example of this type of backup. DSC copies files to the beginning of the new volume. Free disk space is collected at the end.

Partial backups allow some subset of the disk volume to be copied. Files can be selected by directory, name, or date.
RMSBCK/RMSRST and BRU are examples of structure backup utilities that can also be used for partial backups. Some sites have also used PIP and FLX for copying some set of files.

Unless you have some specific purpose, BRU should be used for your RSX disk backups. BRU can perform structure or partial backups. A full restore will compress the disk files together. BRU is also the fastest backup utility available from Digital.

One disadvantage of BRU is the lack of support on other systems. BRU cannot be used to move files to RT-11, RSTS/E, or VAX/VMS systems. If it is important that backup tapes be readable on other systems, RMSBCK/RMSRST should be used.

There is no single backup schedule appropriate for all systems. Disk failures will occur. Your disk backup schedule should let you recover from the failure in an acceptable time period. At some sites this means a single backup taken when the disk is created. Other disk volumes may need daily backups so the previous 24 hours of work and data can be recovered.

Backups fall into three classes: major, full, and incremental. A major backup is not just a full backup but also involves verifying the tape(s) and off-site storage.

RSX has a rocky history of disk backup tools. None have been perfect. The latest tool, BRU, had a series of major problems in its first releases. Current BRU versions seem to work much better. It is good practice to occasionally restore a backup volume to make sure the procedure works.

The checkout performed in a major backup guarantees the tapes can be used to restore the disk volume. The following steps are recommended for a major backup:

1. Obtain exclusive access to the disk. For system and public disks, this means standalone operations.
2. Verify the disk volume and resolve any disk structure errors.
3. Perform a full backup of the disk, including backup verification.
4. Restore the disk to a separate disk volume and verify the new disk volume. The new disk now becomes the on-line disk.
The old disk is saved in case of problems.

Step 4 is only performed when a separate disk volume is available. The extra step checks for any errors in the restore procedure. Never restore over the initial volume. Any error during the restore will leave you with an invalid disk.

Always store a separate copy of the backup utility used to make a major backup. History shows that some versions of BRU have not been able to correctly restore earlier backup tapes. It is a good practice to store standalone, bootable copies of each version of the backup tool which is used.

Full backups are not as rigorous as the major backup. It is good practice to verify the disk structure integrity before performing the backup. Backup verification is usually not done, although it is a good idea if time permits. A common practice is to rotate through three sets of media for the full backups. At each backup period, rotate to the next set of media. If your site has removable media, it is less time consuming and more reliable to use disk packs as the media rather than magtapes.

Many sites find regular full backups sufficient coverage. If not, BRU is capable of selecting only files created and/or revised since a specific date. Partial backups are usually done on-line, but at a nonpeak hour.

Backup frequency will vary from site to site. A typical schedule for an active disk is daily partials, once-a-week full backups, and quarterly major backups. On a smaller system, backups will be less frequent. Whatever the schedule, periodically check that the backup mechanism works. Do not wait until after a disk is destroyed to find out about a bug in BRU unique to your system.

Micro/RSX systems present a special challenge for disk backups. It is extremely unwieldy to perform major or full backups of an RD51/RD52 Winchester disk to RX50 floppy diskettes. It takes 25 RX50 diskettes to save a full RD51 and over 3 times as many for an RD52.
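
The diskette counts can be checked with a rough calculation. The Python sketch below assumes nominal capacities of about 10 MB for an RD51, 31 MB for an RD52, and 400 KB for one RX50 diskette. These figures are assumptions added for illustration and are not taken from the text above. The calculation ignores Files-11 and backup-set overhead, so real counts will differ somewhat.

```python
def diskettes_needed(disk_kb: int, diskette_kb: int) -> int:
    """Number of diskettes needed to hold disk_kb, rounding up."""
    return -(-disk_kb // diskette_kb)  # integer ceiling division


# Assumed nominal capacities in kilobytes (illustrative, not from the article).
RX50_KB = 400
RD51_KB = 10_000
RD52_KB = 31_000

rd51_count = diskettes_needed(RD51_KB, RX50_KB)
rd52_count = diskettes_needed(RD52_KB, RX50_KB)

print(rd51_count)  # 25
print(rd52_count)  # 78
```

Under these assumptions, a full RD52 needs a little more than three times as many diskettes as a full RD51, which matches the estimate given.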
Unless the Micro-11 system has one of the new cartridge tape drives, the techniques described above are impractical.

One technique is to copy files to backup floppies on an application and/or user basis. The files are copied whenever sufficient change has occurred to warrant a new backup copy.

If the Micro system is located on a fast network such as Ethernet, the network can be used for backup. You could use the same technique as above and copy files using NFT to another node in the network. At the remote node, files could be written to tape. The advantage over floppies is that no operator intervention is required to change floppies. The network backup copying could be done at night using batch control files.

LAD/11 from my company is another solution for Micro/RSX system backup. With LAD/11, you can run BRU on the Micro/RSX system and write to a tape located on some other system in the network (or alternatively, read the disk remotely and write to a local tape). Meridian has developed a standalone version of LAD/11 which can be booted from floppy and used to initially load a Micro/RSX system disk from a central node.

Disk maintenance takes time and patience. You can spend hundreds of hours checking disks for bad blocks, verifying disks, and writing full backups and never see a single problem. From my past experience, it is still worth it.