Recovering from Disk Disasters

Mike Higgins
The Computer Entomologist
Duncans Mills, CA

ABSTRACT

Several years ago, I gave a talk on types of disk disasters and how to recover from them. Since then, a few programs have been written to make this even easier to do, and a lot of data has fortunately been recovered. In this talk, I would like to review some of the problems that can happen with FILES11 disk volumes, and describe some of the programs available that can help you recover if you ever have a disk disaster on your hands. The programs that I have used were all for RSX11-M systems, but the algorithms, if not the programs themselves, should be adaptable to VMS.

BIG DISKS CAN HAVE PROBLEMS

Several years ago, the company that I worked for purchased a large 300 megabyte disk drive. Since we have had this disk, we've had a number of problems with it. Some of the problems were caused by the fact that it was not a DEC disk, and some of them were brought about by problems in the hardware. However, most of our problems were related to the fact that it is just a very large disk, and a certain number of problems is to be expected. Everyone expects disks to occasionally make mistakes, and it makes sense to assume that the bigger the disk, the greater the chance that you are going to get an error. Sometimes I wonder if DEC ever expects to get errors on disks, since some of these minor problems can be turned into disasters by "features" of FILES11 itself. But first, let's run through some of the problems that can occur on a disk.

LOST HEADER BITS

Along with every block on the disk, a collection of bits, called the header (not to be confused with file headers), is also stored. This header area is used by the disk controller to make sure it has the right block, and as a place in which to store things like the parity, checksum or ECC data. These headers are written on the disk when you format a pack, and are never (purposely) destroyed until you reformat the pack.
Since this information is used to identify the block when it is read, lost (or corrupted) header information is bad news. Some other types of errors are transient problems and can be repaired, but if the header bits are corrupted, the disk controller cannot even find the block to attempt a read or write to it. If this happens, even when there is no physical bad spot on the disk, you cannot fix the header without reformatting the entire pack. Until the pack is reformatted, that block is lost forever.

PARITY ERRORS

A parity error will occur if the disk discovers that its error checking bits, be they parity, checksum, or ECC values, do not match the data that it just read. This means, to a user, that one or more bits in the block just read are bad. Most DEC programs simply abort with a message if they get this kind of error. Some sample DEC programs that will simply give up on a simple error are: FILES11, MOU, DSC, BRU. Of course it's really not quite that simple: the system will attempt to read the bad block several times before giving up. But some of these programs cause worse problems by giving up than they would if there were a way to ignore the error.

BAD BLOCKS IN THE BIT MAP FILE

Getting one bad bit in the bit map file doesn't sound like a very bad problem, but this is one of the cases where FILES11 turns a small glitch into a major disaster. The symptoms that we got from this type of problem were FCS -32 and -33 errors (device read/write errors) every time we tried to allocate a new file. Our first reaction was to dismount the disk so that it could not be messed up any more than it might be already. The plan was to bring up another copy of our system on another disk, then mount the bad disk and run VFY on it to find out what the problem was. This did not work because MOU refuses to mount disks that have bad blocks in the bit map file. Once we had dismounted the volume, we could never get it back up again.
So there we sat with 300MB of very valuable data, and presumably one bit in one block of one file (albeit this file was the bit map file) was preventing us from reading all the rest of this large disk volume. We further discovered that DSC refused to copy a disk with a bad bit map block. You can imagine that we were not very happy about this.

INDEX BLOCK PROBLEMS

Another problem that we have had was bad blocks in the index file. Once again, you would think that one bad block in an index file (which is over 14,000 blocks on our system disk) would only mess up one file, but this is not the case. When RSX finds a bad index block while allocating a new file, it tries an extraordinary number of times to read the block and then gives up, returning an FCS error code of -32. Since RSX allocates blocks in the index file from the top down, once you run into one of these bad blocks, you cannot allocate any more files at all, because the bad one is always the next in line to be used. Purging any or all directories on the disk would open up a few file headers before the trouble spot, and things would appear to be O.K. again for a while.

A curious aside on this error was the fact that RSX seemed to try an inordinate number of times to read the bad header before returning the -32 error. We think that this was aggravated by our brand X disk, because it made several attempts slightly off of the disk track to try to find the block. If you have read the SYSGEN options for the RP04 disk drive, this feature will sound familiar to you. The DEC RP03 does not have this option, but our brand X disk does this in the controller without notifying the PDP unless all the attempts fail. This means that for each of the normal 8 retries that RSX does, our controller tries 8 more times, for a total of 64 attempts. When reading bad index blocks, RSX appears to try a great deal more than 8 times, and multiplied by 64, this takes a finite amount of time to do.
While the disk is busily performing these 512+ read attempts, all of the terminals lock up and refuse to echo. The symptoms from the terminal look just like a crash due to lack of pool space, except that the system starts working again after the -32 error is printed.

WHAT TO DO ABOUT: PARITY ERRORS IN BIT MAP AND INDEX FILE BLOCKS

If you get errors in your bit map file or in your index file, there is a possibility of recovering the block. First, find the disk address of the error, using a program I wrote called CHECK. Then, write a small MACRO program to read the block, ignore read errors, and write the bad data back in. When the disk controller rewrites the block, it recalculates the parity of the block -- taking the bad bit(s) into account. The next time FILES11 tries to read that block, there will be no parity error, so everything should work again. If this happens in the bit map file, you should mount the disk and run VFY to allocate or recover blocks whose status was changed by this error. If the problem returns, then the parity error was not a transient problem, and you must copy to another disk. Fortunately, running the small MACRO program again will clear up the problem enough to allow DSC to copy the disk for you. Perhaps the best policy would be to always copy to another disk, then run BAD on the problem disk before using it again.

The CHECK program performs a read check on all blocks of a disk. This program was written to read check disks that we couldn't run VFY on because they were unmountable. Unlike some DEC programs, CHECK does not "know" how many blocks there are on a disk. The program keeps going until the system returns an illegal block number error code. This means it even works on brand X disks! The program reads 32 blocks at a time, barring errors, so it can read check a 300MB disk in 5 to 10 minutes. We built CHECK with the /PR:0 taskbuild option, so it can even be run on a mounted disk.
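The small rewrite program described above was a few lines of MACRO-11 issuing QIO$s, and it is long gone; as a rough modern sketch of the same read-ignoring-errors-then-rewrite idea (assuming a Unix-style raw device that can be opened by path, and 512-byte blocks), it amounts to:

```python
import os

BLOCK_SIZE = 512  # FILES11 disk blocks are 512 bytes


def rewrite_block(device_path, block_number):
    """Read one block, ignoring a read error, and write the data
    straight back so the controller recomputes its check bits."""
    fd = os.open(device_path, os.O_RDWR)
    try:
        offset = block_number * BLOCK_SIZE
        try:
            data = os.pread(fd, BLOCK_SIZE, offset)
        except OSError:
            # Parity/ECC failure: carry on with a zero-filled block.
            # (The original kept whatever bad data the controller
            # handed back, which a modern OS usually won't give you.)
            data = bytes(BLOCK_SIZE)
        os.pwrite(fd, data, offset)  # the rewrite fixes the parity
    finally:
        os.close(fd)
```

The point is only the shape of the trick: the write itself is what clears the error, because the controller recalculates the check data over whatever bits are now in the block.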
If errors are found, they are reported in the following format: ERR NNN HHHHHH LLLLLL, where NNN is the FCS error code (usually 374), and HHHHHH and LLLLLL are the high order and low order words of the 32 bit block number where the problem occurred.

WHAT TO DO ABOUT: LOST HEADER BITS IN BIT MAP AND INDEX FILE BLOCKS

If you perform the above procedure and discover that the small MACRO program gets errors when it tries to write to the bad block, then you have lost the header bits for the block. (Without error logging, there is no other way I know of to tell the difference; RSX lumps parity errors and lost header bits into the same error code before returning it to your program.) The easiest solution to this problem is to copy the disk to another using the PRESERVE program. PRESERVE is not too smart for its own good (as DSC is), so when it finds an unreadable block, PRESERVE simply prints out the block number and does not copy that block to the new disk. When the transfer is done, what you end up with is a disk that is identical to the bad one, except that the bad block is replaced with a block that is simply empty. These empty blocks can, however, be read or written, so you can now mount the volume and fix bad bit map blocks by running VFY. If the problem was in the index file, the block will simply be allocated and initialized by the system the next time it needs an index block.

We had one additional problem with PRESERVE: it would not copy all of our brand X disk. We had to write our own special purpose program, called ERRCPY. The PRESERVE program, alas, is too smart for its own good in this case: it "knows" how many blocks there are on an RP03, and it only copies that many. ERRCPY copies DP0: to DP1:, similar to the way PRESERVE does; but like the CHECK program, it keeps copying until the system tells it to stop. Why have system structures like the UCB unless you put them to good use?
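The loop ERRCPY implements can be sketched in Python (my sketch, not the MACRO-11 original; the 32-block transfer size and the CHECK-style error report follow the text, while the zero-fill-on-error behavior is the PRESERVE behavior described above and is my assumption for ERRCPY as well):

```python
BLOCK_SIZE = 512
CHUNK = 32  # blocks per transfer, as in CHECK and ERRCPY


def err_copy(src, dst):
    """Copy src to dst a chunk at a time until the source runs out.
    On a read error, retry block by block and substitute a zeroed
    block for anything unreadable, reporting it as CHECK would.
    Assumes the source holds a whole number of blocks."""
    block = 0
    while True:
        src.seek(block * BLOCK_SIZE)
        try:
            data = src.read(CHUNK * BLOCK_SIZE)
        except OSError:
            # Chunk failed: fall back to one block at a time.
            data = b''
            for i in range(CHUNK):
                src.seek((block + i) * BLOCK_SIZE)
                try:
                    one = src.read(BLOCK_SIZE)
                except OSError:
                    bad = block + i
                    print('ERR 374 %06o %06o' % (bad >> 16, bad & 0o177777))
                    one = bytes(BLOCK_SIZE)
                if not one:
                    break
                data += one
        if not data:
            break  # the system finally said stop: end of the disk
        dst.write(data)
        block += len(data) // BLOCK_SIZE
```

Note that, like ERRCPY, this does not "know" the size of the disk in advance; it simply copies until a read comes back empty.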
ERRCPY can copy a whole 300MB disk in about 20 minutes, unless errors are found, in which case it runs a little slower. Errors are printed in the same format as the CHECK program. If ERRCPY is built with the privileged option (/PR:0), it can even copy from a mounted device, making a PRESERVE-like backup while minimal usage of the disk is still going on. This may be a little dangerous, however, because the program would also have permission to write on mounted volumes (ouch!).

If the bad block happens to be in the index file, there is another approach to the problem. In the Software Dispatch, Sequence 5.1.15.1, a procedure is given for eliminating a bad index block by marking it as active in the index file bitmap. This is a risky maneuver, so if you can possibly copy the disk, you should do that instead of playing with the index file. The first time this problem occurred, we did not have a second disk drive, and we were forced to use the procedure from the Software Dispatch, so we know that it works, but be careful!

OPERATOR ERRORS

Everyone who uses RSX a lot has probably accidentally deleted files with PIP and then wished that they could be un-deleted. The worst case we have had was when a privileged user executed the PIP command >PIP [*,*]*.*;*/DE. This disk had originally been created by me way back in the past, so of course my main directory was created first and is the first one searched by PIP. This means my files are in the front lines when someone starts deleting. This isn't really a "vulnerable point" in the file system, but is an example of a type of operator error that probably happens more often than others.

WHAT TO DO ABOUT: ACCIDENTALLY DELETED FILES

We had a program (named LAZARUS) on RSX11M V2.0 which could resurrect a file in place after it had been deleted, but this program was made obsolete by some changes in FILES11 that came out at the same time as RSX11M V3.0.
The approach is to find the index block for the deleted file, patch it back up, then go patch up the index file bitmap and bitmap file. On the symposium tape there should be a program, written by Larry Baker at USGS, that undeletes files. If it can be arranged, I am going to include this program in my directory on the tape: [307,22]. Larry's program, called UNDELETE, has a select switch like SRD. You can undelete files by directory, name, extension, and version number. UNDELETE only cleans up the header and the index file, however; it is imperative that you immediately run VFY to mark all the recovered blocks on the disk as allocated in the bitmap file. If new files were created between the time that your files were deleted and the VFY run, you are taking the chance that index blocks or data blocks from your files will be re-used. Most people seem to realize their mistake immediately after it is too late to abort the delete, so there is usually time to run UNDELETE.

RUNNING INITVOLUME ON THE WRONG VOLUME

A very vulnerable area on a FILES11 disk is the index file, and DEC has a program specifically designed to clobber it. This program is called INITVOLUME or INI. An error that you don't want to happen on your system is to have an operator run INITVOLUME on a disk that still has valuable data on it. However, when INI initializes a disk, it only destroys the beginning of the index file, which includes, among other things, the index block of the index file itself. On large disk packs, the index file starts out small and then grows as new files are allocated. The index file on our system disk is now over 14,000 blocks big. Since INI only initializes the first part of a new index file (about 25 blocks), the majority of the index blocks will still be there on the disk somewhere. However, blocks in the index file are allocated just like blocks in any other file, so they can be scattered anywhere at all on the disk.
This means that with the index file index block gone, it is not possible to predict where all those intact index blocks are, as much as you might like to find them.

RUNNING BAD ON THE WRONG DISK

If you thought initializing the wrong disk was bad, imagine your operators running BAD on a disk with valuable data on it. When BAD runs to completion, it writes on every block of the disk, and there is no hope of recovering anything. But think positively: perhaps you can abort the BAD run very soon after it starts. If you do this soon enough, the program will only have destroyed part of the disk, and you can treat the recovery the same as a lost index file.

I have never had a disk destroyed by BAD or INI, but we had something just as good happen. One day our DEC Field Circus Technician, while initializing some new diagnostic RK05 cartridges, accidentally initialized our big disk. It was really stupid of us to leave a valuable data disk up and running on a system that had diagnostics running on it, but we did not realize that until it was too late. You can bet we haven't done it again since. Well, initializing a diagnostic pack seems to involve writing mostly zeros over the first 222 (octal) blocks of the disk. Since our index file was at the beginning of the disk, this had about the same effect as running INI on a pack.

You may ask, what was our index file doing at the beginning of the pack? Good question; I asked that of myself at the time also, since INI will put it in the middle of the pack by default. I let INI default, so who was moving the index file around? The culprit turned out to be DSC, the backup utility available to us at the time. DSC always puts the index file at the beginning and doesn't give you any choice in the matter. Nowadays we are using BRU, which gives you the option to put it where you want it. If the index file had been in the middle, the first 222 blocks of the disk would have included data from a bunch of files at random.
Since I know how to recover from lost index files now, the prospect of losing a bunch of data blocks at random sounds much worse. Who cares if a chunk of my index file gets clobbered? I'm not sure where to recommend you put your index file anymore.

WHAT TO DO ABOUT: A LOST INDEX FILE

When the index file is lost, it is still possible to find index blocks by reading the whole disk and looking at every block. There are clues in a header block to help you identify them. The first word of a header always contains 27027 (octal). This word is really two offsets, H.IDOF and H.MPOF, but they always seem to contain the same values. If the header is active, the second word will contain a non-zero file ID number. The last word of the header must contain a correct checksum: the sum of all the preceding words in the block. The program I wrote to do this, called UICREC, also checks the owning UIC number in the header to see if the file belongs to a specific directory.

After a block has been identified as an index block, it is possible to link through all the retrieval pointers in it and obtain the disk block numbers for all the blocks in the file. In our recovery program, it was most convenient to recover files off of the corrupted disk with QIO$ IO.RLB's and recreate them on another disk with normal FILES11 WRITE$'s. In order to recreate these files with the file attributes of the originals, the user file attributes section of the index block is also lifted and used when opening the reconstructed file. The blocks are then read one at a time from the corrupted disk and written into the new file. The current version of UICREC recovers files from DP0: onto SY: in the current UIC. This means you must SET /UIC to the UIC that you wish to recover and ASN SY: to the disk you want to store recovered files on. The recovery disk must be mounted and must have the appropriate UFD created on it.
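The header test just described can be expressed as a short predicate. This is my sketch, not UICREC itself; it assumes a 512-byte block of little-endian 16-bit words and a 16-bit checksum, and includes as an option the deleted-header variant (zero file ID and zero checksum) that Robert Brown's modification checks for:

```python
import struct

WORDS = 256  # a 512-byte block holds 256 16-bit words


def looks_like_index_block(block, deleted=False):
    """Apply the clues from the text: the H.IDOF/H.MPOF offsets in
    word 0, a file ID in word 1, and a checksum in the last word.
    With deleted=True, look for a deleted header instead (zero file
    ID and zero checksum)."""
    if len(block) != 2 * WORDS:
        return False
    w = struct.unpack('<%dH' % WORDS, block)
    if w[0] != 0o27027:                  # H.IDOF and H.MPOF
        return False
    checksum = sum(w[:-1]) & 0o177777    # 16-bit sum of preceding words
    if deleted:
        return w[1] == 0 and w[255] == 0
    return w[1] != 0 and w[255] == checksum
```

A scan program then just reads the disk block by block (or 30 or so at a time, as UICREC does) and applies this test to each block.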
Every time UICREC finds a header that is owned by your UIC, it prints (on LP1: -- this is a taskbuilder option) the file name, the two word disk address of the index block, and four FCS error codes. The four codes are for (respectively): 1) opening the output file, 2) allocating the required number of blocks, 3) the last write into the new file, and 4) closing the file. If all four codes are 1, then the recovery was successful. Errors that you might get are: -16 -- you forgot to mount the recovery disk; -26 -- you forgot to create a UFD for this UIC; -24 -- the recovery disk is full.

UICREC cannot recover multi-header files. The program does not even check for multi-header files, so you must find them by scanning the error output. If the same file name and version appears several times with different index block numbers, it's probably a multi-header file. UICREC scans for index blocks 30 or so at a time, so it can search a 300MB disk in 5 to 10 minutes, plus the transfer time of any files recovered. An important note: the UIC that this program sees is the owning UIC, not necessarily the directory that the file was last seen in. If your file does not pop up where you expect it, try to remember what UIC you were in when you created the file.

Our system did not have many large multi-header files, so the way I used to recover them was not very "clean". When the UICREC program finds a header that belongs to a multi-header file, it attempts to read the file but usually fails. In any case, UICREC does print the file name and disk address of the header on our line printer. For each multi-header file that we wanted to recover, we dumped all the headers found by UICREC and examined their extension segment numbers (M.ESQN). With these numbers, we were able to sort all the block numbers into correct order. It is important to make sure that all headers are accounted for, since if any were destroyed, the file is not recoverable.
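The manual sorting we did from the dump listings is easy to express in code. A sketch (assuming, as in ODS-1, that extension segment numbers count up from zero, one per header):

```python
def order_extension_headers(headers):
    """headers: (disk_block, esqn) pairs read from dumped header
    blocks, where esqn is the M.ESQN extension segment number.
    Returns the header block numbers in segment order, or raises
    if a segment is missing, in which case the file is not
    recoverable."""
    ordered = sorted(headers, key=lambda h: h[1])
    for expected, (_, esqn) in enumerate(ordered):
        if esqn != expected:
            raise ValueError('extension segment %d is missing' % expected)
    return [block for block, _ in ordered]
```

The missing-segment check is the important part: a gap in the M.ESQN sequence means a header was destroyed, and there is no point copying the rest of the file.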
To recover the file, we built the sorted list of index blocks into the MACRO program named BIG. Every time you need to recover such a file, you must edit the program at the label LIST to add your list of index disk block numbers. The list is terminated by a -1 after your last block number. The program comes with a sample list built into it if you are unsure how to do this. BIG recovers files from DP0: and writes a recovery tape on MT0:. Sorry, only one file to a tape. BIG reads all the blocks pointed to by all the headers pointed to by your list, and writes them onto the tape. BIG also writes enough information on the first block of the tape for the file name and attributes to be recreated correctly.

A slightly more general purpose program, called READ, can read any tape created by BIG and reconstruct the file on a mounted volume later. READ does not need to be reassembled for each file; it gets the file name and attributes from crumbs that BIG leaves behind on the tape. READ writes the file onto SY: in the UIC that you happen to be SET to, so you need not be in the same UIC that the file was recovered from.

There are several reasons why the multi-header recovery was done in this manner. First of all, we did not need to recover many large files. At the time we did not have another big disk drive on which to recover large files. And finally, the manual sorting of the headers saved us the complexity of a program "smart" enough to find all the headers for a multi-header file. Remember, the index blocks are scattered virtually at random over the disk, and without the index file index block, the whole disk must be searched to find all the headers for a multi-header file. It is unreasonable to expect a program to remember all block numbers of all extension headers that it finds, just in case they are needed. The more realistic approach would be to ignore all extension blocks until the first header block of a large file is found.
Then the disk would be scanned again specifically for extension blocks that belong to that file. This would, by the way, add another 5-10 minutes every time a multi-header file is recovered. The multi-header programs I just described went off to tape and back on again because we had only one disk drive large enough to hold any files that big. Since then, we have purchased a second drive, and I wrote a program that copies files directly. This program, called MULTI, has not been debugged, but I am putting it on the tape anyway. I do so with the warning: ***USER BEWARE***!

After writing the program to recover files without the index file, I came up with a different approach to recovering deleted files that merits thinking about. Consider a LAZARUS program that digs blocks off of one volume, but reconstructs the file on a separate volume with normal FILES11 macros. This means that both disks are mounted, and all allocation of new blocks is done through the file system the normal way. There is no chance of corrupting either of the volumes, and a VFY run is not necessary. Robert Brown of the National Center for Atmospheric Research modified the UICREC program to do just that. The only differences in this new version of the program were to check for a zero in word two (the file ID number), and to check for a zero checksum instead of a correct one.

My final vision is of a super recovery program that does it all: recovers files whether they are deleted or not; single or multi-header; by UIC, name, type, or date; and recovers off of disk volumes whether they are mounted or not, and have or no longer have an index file, parity errors, or lost headers! Two years ago I wanted to write that program, but too many other projects got higher priority. There was a program on the spring 1980 RSX symposium tape, called REI, that seems to come close to this.
This program was submitted by Howard Palmer while he worked at NASA/AMES, and it should be in directory [307,4]. If I can get a copy in time, it will be included on the RSX tape from this symposium as well. Happily, I have never had the need to try out this program, so I cannot vouch for it at this time. REI, like UICREC, can recover files from a disk without a good index file, but it is a little easier to use. It allows you to specify a UIC and filespec on the command line, including wildcards. The program can recover deleted or undeleted files.

This program does have its problems, however. It prompts you for every file and asks you where you want to put the recovered one. Recovering a lot of files can thus be tedious, although it's still easier than doing it with ZAP. It is possible with REI to recover a deleted file back onto the disk it was deleted from, but the blocks for the file are re-allocated as the file is copied. This means it may re-allocate some of the file's own blocks and destroy them as it saves them. One user of REI has told me that Murphy's law guarantees that the first file you recover this way will destroy the second file you wish to recover. If you use a second disk to recover on, as I prefer anyway, you avoid this problem completely.

PROGRAMS USED IN RECOVERY OPERATIONS

All the programs mentioned above that were written by me were included on the fall 1979 RSX symposium tape in directory [307,22]. I am going to include them again for convenience on the fall 1981 RSX symposium tape as well, in the same directory. The programs are not very easy to use; please be forgiving. Remember that they were all written under extreme duress, so the code is not very clean or well documented. I also normally do a slightly better job of "userizing" a program. All of the programs do ALUN$ assignments to devices that I had at my last job, so they will require reassembling to make them work on most other systems.
Of course I can make even fewer promises about programs I include that were written by other people. I'm ashamed to notice, however, that both UNDELETE and REI are much easier programs to use than the ones I originally wrote.