Recovering from Disk Disasters

Mike Higgins
The Computer Entomologist
Duncans Mills, CA

ABSTRACT

Several years ago, I gave a talk on types of disk disasters and how to recover from them. Since then, a few programs have been written to make this even easier to do, and a lot of data has fortunately been recovered. In this talk, I would like to review some of the problems that can happen with FILES11 disk volumes, and describe some of the programs available that can help you recover if you ever have a disk disaster on your hands. The programs that I have used were all for RSX11-M systems, but the algorithms, if not the programs themselves, should be adaptable to VMS.

BIG DISKS CAN HAVE PROBLEMS

Several years ago, the company that I worked for purchased a large 300 megabyte disk drive. Since we have had this disk, we've had a number of problems with it. Some of the problems were caused by the fact that it was not a DEC disk, and some of them were brought about by problems in the hardware. However, most of our problems were related to the fact that it is just a very large disk, and a certain number of problems is to be expected. Everyone expects disks to occasionally make mistakes, and it makes sense to assume that the bigger the disk, the greater the chance that you are going to get an error. Sometimes I wonder if DEC ever expects to get errors on disks, since some of these minor problems can be turned into disasters by "features" of FILES11 itself. But first, let's run through some of the problems that can occur on a disk.

LOST HEADER BITS

Along with every block on the disk, a collection of bits, called the header (not to be confused with file headers), is also stored. This header area is used by the disk controller to make sure it has the right block, and as a place in which to store things like the parity, checksum or ECC data. These headers are written on the disk when you format a pack, and are never (purposely) destroyed until you reformat the pack.
Since this information is used to identify the block when it is read, lost (or corrupted) header information is bad news. Some other types of errors are transient problems and can be repaired, but if the header bits are corrupted, the disk controller cannot even find the block to attempt a read or write to it. If this happens, even when there is no physical bad spot on the disk, you cannot fix the header without reformatting the entire pack. Until the pack is reformatted, that block is lost forever.

PARITY ERRORS

A parity error will occur if the disk discovers that its error checking bits, be they parity, checksum, or ECC values, do not match the data that it just read. This means, to a user, that one or more bits in the block just read are bad. Most DEC programs simply abort with a message if they get this kind of error. Some sample DEC programs that will simply give up on a simple error are: FILES11, MOU, DSC, BRU. Of course it's really not quite that simple: the system will attempt to read the bad block several times before giving up. But some of these programs cause worse problems by giving up than they would if there were a way to ignore the error.

BAD BLOCKS IN THE BIT MAP FILE

Getting one bad bit in the bit map file doesn't sound like a very bad problem, but this is one of the cases where FILES11 turns a small glitch into a major disaster. The symptoms that we got from this type of problem were FCS -32 and -33 errors (device read/write errors) every time we tried to allocate a new file. Our first reaction was to dismount the disk so that it could not be messed up any more than it might be already. The plan was to bring up another copy of our system on another disk, then mount the bad disk and run VFY on it to find out what the problem was. This did not work because MOU refuses to mount disks that have bad blocks in the bit map file. Once we had dismounted the volume, we could never get it back up again.
So there we sat with 300MB of very valuable data, and presumably one bit in one block of one file (albeit this file was the bit map file) was preventing us from reading all the rest of this large disk volume. We further discovered that DSC refused to copy a disk with a bad bit map block. You can imagine that we were not very happy about this.

INDEX BLOCK PROBLEMS

Another problem that we have had was bad blocks in the index file. Once again, you would think that one bad block in an index file (which is over 14,000 blocks on our system disk) would only mess up one file, but this is not the case. When RSX finds a bad index block while allocating a new file, it tries an extraordinary number of times to read the block and then gives up, returning an FCS error code of -32. Since RSX allocates blocks in the index file from the top down, once you run into one of these bad blocks, you cannot allocate any more files at all, because the bad one is always the next in line to be used. Purging any or all directories on the disk would open up a few file headers before the trouble spot, and things would appear to be O.K. again for a while.

A curious aside on this error was the fact that RSX seemed to try an inordinate number of times to read the bad header before returning the -32 error. We think that this was aggravated by our brand X disk, because it made several attempts slightly off of the disk track to try to find the block. If you have read the SYSGEN options for the RP04 disk drive, this feature will sound familiar to you. The DEC RP03 does not have this option, but our brand X disk does this in the controller without notifying the PDP unless all the attempts fail. This means that for each of the normal 8 retries that RSX does, our controller tries 8 more times, for a total of 64 attempts. When reading bad index blocks, RSX appears to try a great deal more than 8 times, and multiplied by 64, this takes a finite amount of time to do.
While the disk is busily performing these 512+ read attempts, all of the terminals lock up and refuse to echo. The symptoms from the terminal look just like a crash due to lack of pool space, except that the system starts working again after the -32 error is printed.

WHAT TO DO ABOUT: PARITY ERRORS IN BIT MAP AND INDEX FILE BLOCKS

If you get errors in your bit map file or in your index file, there is a possibility of recovering the block. First, find the disk address of the error, using a program I wrote called CHECK. Then, write a small MACRO program to read the block, ignore read errors, and write the bad data back in. When the disk controller rewrites the block, it recalculates the parity of the block -- taking the bad bit(s) into account. The next time FILES11 tries to read that block, there will be no parity error, so everything should work again. If this happens in the bit map file, you should mount the disk and run VFY to allocate or recover blocks whose status was changed by this error. If the problem returns, then the parity error was not a transient problem, and you must copy to another disk. Fortunately, running the small MACRO program again will clear up the problem enough to allow DSC to copy the disk for you. Perhaps the best policy would be to always copy to another disk, then run BAD on the problem disk before using it again.

The CHECK program performs a read check on all blocks of a disk. This program was written to read check disks that we couldn't run VFY on because they were unmountable. Unlike some DEC programs, CHECK does not "know" how many blocks there are on a disk. The program keeps going until the system returns an illegal block number error code. This means it even works on brand X disks! The program reads 32 blocks at a time, barring errors, so it can read check a 300MB disk in 5 to 10 minutes. We built CHECK with the /PR:0 taskbuild option, so it can even be run on a mounted disk.
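The small rewrite program described above was a few lines of MACRO-11 issuing QIO$s, and it is long gone; as a rough modern sketch of the same read-ignoring-errors-then-rewrite idea (assuming a Unix-style raw device that can be opened by path, and 512-byte blocks), it amounts to:

```python
import os

BLOCK_SIZE = 512  # FILES11 disk blocks are 512 bytes


def rewrite_block(device_path, block_number):
    """Read one block, ignoring a read error, and write the data
    straight back so the controller recomputes its check bits."""
    fd = os.open(device_path, os.O_RDWR)
    try:
        offset = block_number * BLOCK_SIZE
        try:
            data = os.pread(fd, BLOCK_SIZE, offset)
        except OSError:
            # Parity/ECC failure: carry on with a zero-filled block.
            # (The original kept whatever bad data the controller
            # handed back, which a modern OS usually won't give you.)
            data = bytes(BLOCK_SIZE)
        os.pwrite(fd, data, offset)  # the rewrite fixes the parity
    finally:
        os.close(fd)
```

The point is only the shape of the trick: the write itself is what clears the error, because the controller recalculates the check data over whatever bits are now in the block.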
If errors are found, they are reported in the following format: ERR NNN HHHHHH LLLLLL, where NNN is the FCS error code (usually 374), and HHHHHH and LLLLLL are the high order and low order words of the 32 bit block number where the problem occurred.

WHAT TO DO ABOUT: LOST HEADER BITS IN BIT MAP AND INDEX FILE BLOCKS

If you perform the above procedure and discover that the small MACRO program gets errors when it tries to write to the bad block, then you have lost the header bits for the block. (Without error logging, there is no other way I know of to tell the difference; RSX lumps parity errors and lost header bits into the same error code before returning it to your program.) The easiest solution to this problem is to copy the disk to another using the PRESERVE program. PRESERVE is not too smart for its own good (as DSC is), so when it finds an unreadable block, PRESERVE simply prints out the block number and does not copy that block to the new disk. When the transfer is done, what you end up with is a disk that is identical to the bad one, except that the bad block is replaced with a block that is simply empty. These empty blocks can, however, be read or written, so you can now mount the volume and fix bad bit map blocks by running VFY. If the problem was in the index file, the block will simply be allocated and initialized by the system the next time it needs an index block.

We had one additional problem with PRESERVE: it would not copy all of our brand X disk. We had to write our own special purpose program, called ERRCPY. The PRESERVE program, alas, is too smart for its own good in this case: it "knows" how many blocks there are on an RP03, and it only copies that many. ERRCPY copies DP0: to DP1:, similar to the way PRESERVE does; but like the CHECK program, it keeps copying until the system tells it to stop. Why have system structures like the UCB unless you put them to good use?
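The loop ERRCPY implements can be sketched in Python (my sketch, not the MACRO-11 original; the 32-block transfer size and the CHECK-style error report follow the text, while the zero-fill-on-error behavior is the PRESERVE behavior described above and is my assumption for ERRCPY as well):

```python
BLOCK_SIZE = 512
CHUNK = 32  # blocks per transfer, as in CHECK and ERRCPY


def err_copy(src, dst):
    """Copy src to dst a chunk at a time until the source runs out.
    On a read error, retry block by block and substitute a zeroed
    block for anything unreadable, reporting it as CHECK would.
    Assumes the source holds a whole number of blocks."""
    block = 0
    while True:
        src.seek(block * BLOCK_SIZE)
        try:
            data = src.read(CHUNK * BLOCK_SIZE)
        except OSError:
            # Chunk failed: fall back to one block at a time.
            data = b''
            for i in range(CHUNK):
                src.seek((block + i) * BLOCK_SIZE)
                try:
                    one = src.read(BLOCK_SIZE)
                except OSError:
                    bad = block + i
                    print('ERR 374 %06o %06o' % (bad >> 16, bad & 0o177777))
                    one = bytes(BLOCK_SIZE)
                if not one:
                    break
                data += one
        if not data:
            break  # the system finally said stop: end of the disk
        dst.write(data)
        block += len(data) // BLOCK_SIZE
```

Note that, like ERRCPY, this does not "know" the size of the disk in advance; it simply copies until a read comes back empty.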
ERRCPY can copy a whole 300MB disk in about 20 minutes, unless errors are found, in which case it runs a little slower. Errors are printed in the same format as the CHECK program. If ERRCPY is built with the privileged option (/PR:0), it can even copy from a mounted device, making a PRESERVE-like backup while minimal usage of the disk is still going on. This may be a little dangerous, however, because the program would also have permission to write on mounted volumes (ouch!).

If the bad block happens to be in the index file, there is another approach to the problem. In the Software Dispatch, Sequence 5.1.15.1, a procedure is given for eliminating a bad index block by marking it as active in the index file bitmap. This is a risky maneuver, so if you can possibly copy the disk, you should do that instead of playing with the index file. The first time this problem occurred, we did not have a second disk drive, and we were forced to use the procedure from the Software Dispatch, so we know that it works, but be careful!

OPERATOR ERRORS

Everyone who uses RSX a lot has probably accidentally deleted files with PIP and then wished that they could be un-deleted. The worst case we have had was when a privileged user executed the PIP command >PIP [*,*]*.*;*/DE. This disk had originally been created by me way back in the past, so of course my main directory was created first and is the first one searched by PIP. This means my files are in the front lines when someone starts deleting. This isn't really a "vulnerable point" in the file system, but is an example of a type of operator error that probably happens more often than others.

WHAT TO DO ABOUT: ACCIDENTALLY DELETED FILES

We had a program (named LAZARUS) on RSX11M V2.0 which could resurrect a file in place after it had been deleted, but this program was made obsolete by some changes in FILES11 that came out at the same time as RSX11M V3.0.
The approach is to find the index block for the deleted file, patch it back up, then go patch up the index file bitmap and bitmap file. On the symposium tape there should be a program, written by Larry Baker at USGS, that undeletes files. If it can be arranged, I am going to include this program in my directory on the tape: [307,22]. Larry's program, called UNDELETE, has a select switch like SRD. You can undelete files by directory, name, extension, and version number. UNDELETE only cleans up the header and the index file, however; it is imperative that you immediately run VFY to mark all the recovered blocks on the disk as allocated in the bitmap file. If new files were created between the time that your files were deleted and the VFY run, you are taking the chance that index blocks or data blocks from your files will be re-used. Most people seem to realize their mistake immediately after it is too late to abort the delete, so there is usually time to run UNDELETE.

RUNNING INITVOLUME ON THE WRONG VOLUME

A very vulnerable area on a FILES11 disk is the index file, and DEC has a program specifically designed to clobber it. This program is called INITVOLUME or INI. An error that you don't want to happen on your system is to have an operator run INITVOLUME on a disk that still has valuable data on it. However, when INI initializes a disk, it only destroys the beginning of the index file, which includes, among other things, the index block of the index file itself. On large disk packs, the index file starts out small and then grows as new files are allocated. The index file on our system disk is now over 14,000 blocks big. Since INI only initializes the first part of a new index file (about 25 blocks), the majority of the index blocks will still be there on the disk somewhere. However, blocks in the index file are allocated just like blocks in any other file, so they can be scattered anywhere at all on the disk.
This means that with the index file index block gone, it is not possible to predict where all those intact index blocks are, as much as you might like to find them.

RUNNING BAD ON THE WRONG DISK

If you thought initializing the wrong disk was bad, imagine your operators running BAD on a disk with valuable data on it. When BAD runs to completion, it writes on every block of the disk, and there is no hope of recovering anything. But think positively: perhaps you can abort the BAD run very soon after it starts. If you do this soon enough, the program will only have destroyed part of the disk, and you can treat the recovery the same as a lost index file.

I have never had a disk destroyed by BAD or INI, but we had something just as good happen. One day our DEC Field Circus Technician, while initializing some new diagnostic RK05 cartridges, accidentally initialized our big disk. It was really stupid of us to leave a valuable data disk up and running on a system that had diagnostics running on it, but we did not realize that until it was too late. You can bet we haven't done it again since. Well, initializing a diagnostic pack seems to involve writing mostly zeros over the first 222 (octal) blocks of the disk. Since our index file was at the beginning of the disk, this had about the same effect as running INI on a pack.

You may ask, what was our index file doing at the beginning of the pack? Good question; I asked that of myself at the time also, since INI will put it in the middle of the pack by default. I let INI default, so who was moving the index file around? The culprit turned out to be DSC, the backup utility available to us at the time. DSC always puts the index file at the beginning and doesn't give you any choice in the matter. Nowadays we are using BRU, which gives you the option to put it where you want it. If the index file had been in the middle, the first 222 blocks of the disk would have included data from a bunch of files at random.
Since I know how to recover from lost index files now, the prospect of losing a bunch of data blocks at random sounds much worse. Who cares if a chunk of my index file gets clobbered? I'm not sure where to recommend you put your index file anymore.

WHAT TO DO ABOUT: A LOST INDEX FILE

When the index file is lost, it is still possible to find index blocks by reading the whole disk and looking at every block. There are clues in a header block to help you identify them. The first word of a header always contains 27027 (octal). This word is really two offsets, H.IDOF and H.MPOF, but they always seem to contain the same values. If the header is active, the second word will contain a non-zero file ID number. The last word of the header must contain a correct checksum: the sum of all the preceding words in the block. The program I wrote to do this, called UICREC, also checks the owning UIC number in the header to see if the file belongs to a specific directory.

After a block has been identified as an index block, it is possible to link through all the retrieval pointers in it and obtain the disk block numbers for all the blocks in the file. In our recovery program, it was most convenient to recover files off of the corrupted disk with QIO$ IO.RLB's and recreate them on another disk with normal FILES11 WRITE$'s. In order to recreate these files with the file attributes of the originals, the user file attributes section of the index block is also lifted and used when opening the reconstructed file. The blocks are then read one at a time from the corrupted disk and written into the new file. The current version of UICREC recovers files from DP0: onto SY: in the current UIC. This means you must SET /UIC to the UIC that you wish to recover and ASN SY: to the disk you want to store recovered files on. The recovery disk must be mounted and must have the appropriate UFD created on it.
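The header test just described can be expressed as a short predicate. This is my sketch, not UICREC itself; it assumes a 512-byte block of little-endian 16-bit words and a 16-bit checksum, and includes as an option the deleted-header variant (zero file ID and zero checksum) that Robert Brown's modification checks for:

```python
import struct

WORDS = 256  # a 512-byte block holds 256 16-bit words


def looks_like_index_block(block, deleted=False):
    """Apply the clues from the text: the H.IDOF/H.MPOF offsets in
    word 0, a file ID in word 1, and a checksum in the last word.
    With deleted=True, look for a deleted header instead (zero file
    ID and zero checksum)."""
    if len(block) != 2 * WORDS:
        return False
    w = struct.unpack('<%dH' % WORDS, block)
    if w[0] != 0o27027:                  # H.IDOF and H.MPOF
        return False
    checksum = sum(w[:-1]) & 0o177777    # 16-bit sum of preceding words
    if deleted:
        return w[1] == 0 and w[255] == 0
    return w[1] != 0 and w[255] == checksum
```

A scan program then just reads the disk block by block (or 30 or so at a time, as UICREC does) and applies this test to each block.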
Every time UICREC finds a header that is owned by your UIC, it prints (on LP1: -- this is a taskbuilder option) the file name, the two word disk address of the index block, and four FCS error codes. The four codes are for (respectively): 1) opening the output file, 2) allocating the required number of blocks, 3) the last write into the new file, and 4) closing the file. If all four codes are 1, then the recovery was successful. Errors that you might get are: -16 -- you forgot to mount the recovery disk; -26 -- you forgot to create a UFD for this UIC; -24 -- the recovery disk is full.

UICREC cannot recover multi-header files. The program does not even check for multi-header files, so you must find them by scanning the error output. If the same file name and version appears several times with different index block numbers, it's probably a multi-header file. UICREC scans for index blocks 30 or so at a time, so it can search a 300MB disk in 5 to 10 minutes, plus the transfer time of any files recovered. An important note: the UIC that this program sees is the owning UIC, not necessarily the directory that the file was last seen in. If your file does not pop up where you expect it, try to remember what UIC you were in when you created the file.

Our system did not have many large multi-header files, so the way I used to recover them was not very "clean". When the UICREC program finds a header that belongs to a multi-header file, it attempts to read the file but usually fails. In any case, UICREC does print the file name and disk address of the header on our line printer. For each multi-header file that we wanted to recover, we dumped all the headers found by UICREC and examined their extension segment numbers (M.ESQN). With these numbers, we were able to sort all the block numbers into correct order. It is important to make sure that all headers are accounted for, since if any were destroyed, the file is not recoverable.
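The manual sorting we did from the dump listings is easy to express in code. A sketch (assuming, as in ODS-1, that extension segment numbers count up from zero, one per header):

```python
def order_extension_headers(headers):
    """headers: (disk_block, esqn) pairs read from dumped header
    blocks, where esqn is the M.ESQN extension segment number.
    Returns the header block numbers in segment order, or raises
    if a segment is missing, in which case the file is not
    recoverable."""
    ordered = sorted(headers, key=lambda h: h[1])
    for expected, (_, esqn) in enumerate(ordered):
        if esqn != expected:
            raise ValueError('extension segment %d is missing' % expected)
    return [block for block, _ in ordered]
```

The missing-segment check is the important part: a gap in the M.ESQN sequence means a header was destroyed, and there is no point copying the rest of the file.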
To recover the file, we built the sorted list of index blocks into the MACRO program named BIG. Every time you need to recover such a file, you must edit the program at the label LIST to add your list of index disk block numbers. The list is terminated by a -1 after your last block number. The program comes with a sample list built into it if you are unsure how to do this. BIG recovers files from DP0: and writes a recovery tape on MT0:. Sorry, only one file to a tape. BIG reads all the blocks pointed to by all the headers pointed to by your list, and writes them onto the tape. BIG also writes enough information on the first block of the tape for the file name and attributes to be recreated correctly.

A slightly more general purpose program, called READ, can read any tape created by BIG and reconstruct the file on a mounted volume later. READ does not need to be reassembled for each file; it gets the file name and attributes from crumbs that BIG leaves behind on the tape. READ writes the file onto SY: in the UIC that you happen to be SET to, so you need not be in the same UIC that the file was recovered from.

There are several reasons why the multi-header recovery was done in this manner. First of all, we did not need to recover many large files. At the time we did not have another big disk drive on which to recover large files. And finally, the manual sorting of the headers saved us the complexity of a program "smart" enough to find all the headers for a multi-header file. Remember, the index blocks are scattered virtually at random over the disk, and without the index file index block, the whole disk must be searched to find all the headers for a multi-header file. It is unreasonable to expect a program to remember all block numbers of all extension headers that it finds, just in case they are needed. The more realistic approach would be to ignore all extension blocks until the first header block of a large file is found.
Then the disk would be scanned again specifically for extension blocks that belong to that file. This would, by the way, add another 5-10 minutes every time a multi-header file is recovered. The multi-header programs I just described went off to tape and back on again because we had only one disk drive large enough to hold any files that big. Since then, we have purchased a second drive, and I wrote a program that copies files directly. This program, called MULTI, has not been debugged, but I am putting it on the tape anyway. I do so with the warning: ***USER BEWARE***!

After writing the program to recover files without the index file, I came up with a different approach to recovering deleted files that merits thinking about. Consider a LAZARUS program that digs blocks off of one volume, but reconstructs the file on a separate volume with normal FILES11 macros. This means that both disks are mounted, and all allocation of new blocks is done through the file system the normal way. There is no chance of corrupting either of the volumes, and a VFY run is not necessary. Robert Brown of the National Center for Atmospheric Research modified the UICREC program to do just that. The only differences in this new version of the program were to check for a zero in word two (the file ID number), and to check for a zero checksum instead of a correct one.

My final vision is of a super recovery program that does it all: recovers files whether they are deleted or not; single or multi-header; by UIC, name, type, or date; and recovers off of disk volumes whether they are mounted or not, and have or no longer have an index file, parity errors, or lost headers! Two years ago I wanted to write that program, but too many other projects got higher priority. There was a program on the spring 1980 RSX symposium tape, called REI, that seems to come close to this.
This program was submitted by Howard Palmer while he worked at NASA/AMES, and it should be in directory [307,4]. If I can get a copy in time, it will be included on the RSX tape from this symposium as well. Happily, I have never had the need to try out this program, so I cannot vouch for it at this time. REI, like UICREC, can recover files from a disk without a good index file, but it is a little easier to use. It allows you to specify a UIC and filespec on the command line, including wildcards. The program can recover deleted or undeleted files.

This program does have its problems, however. It prompts you for every file and asks you where you want to put the recovered one. Recovering a lot of files can thus be tedious, although it's still easier than doing it with ZAP. It is possible with REI to recover a deleted file back onto the disk it was deleted from, but the blocks for the file are re-allocated as the file is copied. This means it may re-allocate some of the file's own blocks and destroy them as it saves them. One user of REI has told me that Murphy's law guarantees that the first file you recover this way will destroy the second file you wish to recover. If you use a second disk to recover on, as I prefer anyway, you avoid this problem completely.

PROGRAMS USED IN RECOVERY OPERATIONS

All the programs mentioned above that were written by me were included on the fall 1979 RSX symposium tape in directory [307,22]. I am going to include them again for convenience on the fall 1981 RSX symposium tape as well, in the same directory. The programs are not very easy to use; please be forgiving. Remember that they were all written under extreme duress, so the code is not very clean or well documented. I also normally do a slightly better job of "userizing" a program. All of the programs do ALUN$ assignments to devices that I had at my last job, so they will require reassembling to make them work on most other systems.
Of course I can make even fewer promises about programs I include that were written by other people. I'm ashamed to notice, however, that both UNDELETE and REI are much easier programs to use than the ones I originally wrote.