                            VULNERABLE POINTS OF THE
                               RSX FILES11 SYSTEM


                                  Mike Higgins
                               CETUS Corporation
                              Berkeley California


                                    ABSTRACT

                    There are areas on a FILES11 volume that
                    can  cause  major  problems  if they are
                    lost.  As disk drives  get  bigger  and
                    bigger, these vulnerable areas also get
                    larger and  more  susceptible.   Several
                    programs  are  described which were used
                    to recover data from disk packs that had
                    bad  bit  map  blocks, bad  file  header
                    blocks, and even a lost index file.


                           BIG DISKS CAN HAVE PROBLEMS

               CETUS has a PDP11/45 with several RK05s,  and  about  a
          year  ago we purchased a large 300 megabyte disk drive.  (Of
          course this is a brand  X  disk  because  DEC  doesn't  make
          anything  like that yet.) Since we have had this disk, we've
          had a number of problems with it.  Some of the problems were
          caused  by  the fact that it was not a DEC disk, and some of
          them  were  brought  about  by  problems  in  the  hardware.
          However,  most of our problems were related to the fact that
          it is just a very  large  disk,  and  a  certain  number  of
          problems  are  to  be  expected.  Everyone  expects disks to
          occasionally make mistakes, and it  makes  sense  to  assume
          that  the  bigger  the disk, the greater the chance that you
          are going to get an error.  Sometimes I wonder if  DEC  ever
          expects  to  get  errors on disks, since some of these minor
          problems can be  turned  into  disasters  by  "features"  of
          FILES11  itself.   But  first, let's run through some of the
          problems that can occur on a disk.

                                LOST HEADER BITS

               Along with every block on the  disk,  a  collection  of
          bits,  called  the  header  (not  to  be  confused with file
          headers), is also stored.  This header area is used  by  the
          disk controller to make sure it has the right block and as a
          place in which to store things like the parity, checksum  or
          ECC data.  Since this information is used  to  identify  the
          block when it is read, lost or corrupted header  information
          is bad news.  Some errors are transient problems and can  be
          repaired, but if the header bits are corrupted,
          the  disk controller cannot even find the block to attempt a
          read or write to it.  If this happens, even when there is no
          physical  bad  spot  on  the disk, you cannot fix the header
          without reformatting the entire pack.

                                  PARITY ERRORS

               A parity error will occur if the  disk  discovers  that
          its  error  checking  bits, be they parity, checksum, or ECC
          values, do not match the  data  that  it  just  read.   This
          means,  to  a  user, that one or more bits in the block just
          read are bad.  Most DEC programs simply abort with a message
          if  they  get  this kind of error.  (FILES11 is one of those
          programs.)

                         BAD BLOCKS IN THE BIT MAP FILE

               Getting one bad bit in the bit map file  doesn't  sound
          like  a very bad problem, but this is one of the cases where
          FILES11 turns a small glitch into  a  major  disaster.   The
          symptoms  that we got from this type of problem were FCS -32
          and -33 errors (device  read/write  errors)  every  time  we
          tried  to  allocate  a  new file.  Our first reaction was to
          dismount the disk so that it could not be messed up any more
          than  it might be already.  The plan was to bring up an RK05
          version of our system, then  mount  the  bad  disk  and  run
          verify  on  it  to  find out what the problem was (we keep a
          reasonably up-to-date bootable version of our system on both
          types  of  disks for just this kind of emergency).  This did
          not work because MOU refuses to mount disks  that  have  bad
          blocks  in  the  bit  map  file.  Once we had dismounted the
          volume, we could never get it back up again.   So  there  we
          sat  with  300MB  of  very valuable data, and presumably one
          bit in one block of one file (albeit this file was  the  bit
          map  file)  was  preventing  us from reading all the rest of
          this large disk volume.   We  further  discovered  that  DSC
          refused  to  copy  a disk with a bad bit map block.  You can
          imagine that we were not very happy about this.

                              INDEX BLOCK PROBLEMS

               Another problem that we have had was bad blocks in  the
          index  file.  You would think that one bad block in an index
          file (which is over 14000 blocks on our system  disk)  would
          only  mess  up one file, but this is not the case.  When RSX
          finds a bad index block while  allocating  a  new  file,  it
          tries  many  times  to  read  the  block  and then gives up,
          returning an FCS error code of  -32.   Since  RSX  allocates
          blocks  in  the  index  file from the top down, once you run
          into one of these bad blocks, you cannot allocate  any  more
          files at all, because the bad one is always the next in line
          to be used.  Purging any or  all  directories  on  the  disk
          would  open  up  a few file headers before the trouble spot,
          and things would appear to be O.K. again for a while.

               A curious aside on this error was  the  fact  that  RSX
          seemed  to try an inordinate number of times to read the bad
          header before returning the -32 error.  We think  that  this
          was  aggravated  by our brand X disk, because it made several
          attempts slightly off the disk track to  try  to  find  the
          block.   If  you  read  the SYSGEN options for the RP04 disk
          drive, this feature will sound familiar  to  you.   The  DEC
          RP03  does  not  have this option, but our brand X disk does
          this in the controller without notifying the PDP unless  all
          the attempts fail.  This means that for each of the normal 8
          retries that RSX does, our controller  tries  8  more  times
          each,  for  a  total  of  64  times.  When reading bad index
          blocks, RSX appears to try a great deal more than  8  times,
          and  multiplied by 64, this takes a finite amount of time to
          do.  While the disk is busily  performing  these  512+  read
          attempts  from  the  disk,  all of the terminals lock up and
          refuse to echo.  The symptoms from the  terminal  look  just
          like  a  crash  due  to  lack of pool space, except that the
          system starts working again after the -32 error is printed.

                               WHAT TO DO ABOUT:
                            PARITY ERRORS IN BIT MAP
                             AND INDEX FILE BLOCKS

               If you get errors in your bit map file or in your index
          file,  there  is  a  possibility  of  recovering  the block.
          First, find the disk address of the error, using  the  CHECK
          program  described later;  then, write a small MACRO program
          to read the block, ignore read errors,  and  write  the  bad
          data  back in.  When the disk controller rewrites the block,
          it recalculates the parity of the block --  taking  the  bad
          bit(s)  into  account.   The next time FILES11 tries to read
          that block, there will be  no  parity  error  so  everything
          should work again.  If this happens in the bit map file, you
          should mount the disk and run VFY  to  allocate  or  recover
          blocks  whose  status  was  changed  by  this error.  If the
          problem returns, then the parity error was not  a  transient
          problem,  and  you  must copy to another disk.  Fortunately,
          running the small MACRO program  again  will  clear  up  the
          problem  enough  to  allow  DSC  to  copy  the disk for you.
          Perhaps the best policy would be to always copy  to  another
          disk,  then  run  BAD  on  the  problem disk before using it
          again.
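
               The read-and-rewrite trick can be modelled in a few lines.
          This is only a sketch in Python, not the small MACRO program
          itself:  ToyDisk, its two sets of bad blocks and rewrite_block
          are all hypothetical names, and the sketch assumes that a read
          with a parity error still delivers the (bad) data to the
          buffer.

```python
BLOCK_SIZE = 512  # bytes per disk block


class ToyDisk:
    """Hypothetical in-memory stand-in for a disk; not a real driver API."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.parity_bad = set()   # blocks whose check bits do not match the data
        self.header_bad = set()   # blocks the controller cannot locate at all

    def read(self, lbn):
        """Return (ok, data); data is delivered even when the check fails."""
        if lbn in self.header_bad:
            return False, bytes(BLOCK_SIZE)
        return lbn not in self.parity_bad, self.blocks[lbn]

    def write(self, lbn, data):
        """Rewrite a block; on success the check bits are recalculated."""
        if lbn in self.header_bad:
            return False
        self.blocks[lbn] = data
        self.parity_bad.discard(lbn)
        return True


def rewrite_block(disk, lbn):
    """Read a block, ignore any read error, and write the data back."""
    _ok, data = disk.read(lbn)     # read status is deliberately ignored
    return disk.write(lbn, data)   # False means the write itself failed
```

          A True result means the block can be read again, bad bits and
          all.  A False result means the write failed as well, which is
          the lost header case.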


                               WHAT TO DO ABOUT:
                          LOST HEADER BITS IN BIT MAP
                             AND INDEX FILE BLOCKS

               If you perform the above procedure  and  discover  that
          the  small  MACRO program gets errors when it tries to write
          to the bad block, then you have lost the header bits for the
          block  (RSX does not differentiate between parity errors and
          lost header bits).  The easiest solution to this problem  is
          to  copy  the  disk  to  another using the PRESERVE program.
          Unlike DSC, PRESERVE is not too smart for its own  good,  so
          when  it  finds  an unreadable block, PRESERVE simply prints
          out the block number and does not copy that block to the new
          disk.   When the transfer is done, what you end up with is a
          disk that is identical to the bad one, except the bad  block
          is  replaced with a block that is simply empty.  These empty
          blocks can, however, be read or written so you can mount the
          volume and fix bad bit map blocks now by running VFY.

               We had one additional problem with PRESERVE:  it  would
          not  copy  all of our brand X disk.  We had to write our own
          special purpose program, called ERRCPY, which copies  blocks
          until  it  gets  an  illegal  block  number.   The  PRESERVE
          program, alas, is too smart for its own good in  this  case:
          it "knows" how many blocks there are on an RP03, and it only
          copies that many.

               If the bad block happens to be in the index file, there
          is  another  approach  to  the  problem.   In  the  Software
          Dispatch,  Sequence  5.1.15.1,  a  procedure  is  given  for
          eliminating a bad index block by marking it as active in the
          index file bitmap.  This is a risky maneuver, so if you  can
          possibly  copy  the  disk,  you  should  do  that instead of
          playing with the index file.  The first  time  this  problem
          occurred,  we  did not have a second disk drive, and we were
          forced to use the procedure from the Software  Dispatch,  so
          we know that it works, but be careful!

                                 OPERATOR ERRORS

               Everyone who uses RSX a lot has  probably  accidentally
          deleted  files  with  PIP and then wished that they could be
          un-deleted.  The worst case we have  had  was  when  a  user
          deleted  all files in a directory and wanted to get them all
          back.  This isn't really a "vulnerable point"  in  the  file
          system,  but  is an example of a type of operator error that
          probably happens more often than others.


                     RUNNING INITVOLUME ON THE WRONG VOLUME

               A very vulnerable area on a FILES11 disk is  the  index
          file, and DEC has a program specifically designed to clobber
          it.  This program is called INITVOLUME  or  INI.   An  error
          that  you  don't want to happen on your system is to have an
          operator run INITVOLUME on a disk that  still  has  valuable
          data  on  it.  However, when INI initializes a disk, it only
          destroys the beginning of the index file,  which  includes,
          among  other  things,  the  index  block  of  the index file
          itself.  On large disk packs,  the  index  file  starts  out
          small  and then grows as new files are allocated.  The index
          file on our system disk  is  now  over  14,000  blocks  big.
          Since  INI  only  initializes  the first part of a new index
          file  (about  25  blocks),  the  majority  of  the  index
          blocks  will still be there on the disk somewhere.  However,
          blocks in the index file are allocated just like  blocks  in
          any  other file, so they can be scattered anywhere at all on
          the disk.  This means with the index file index block  gone,
          it  is  not possible to predict where all those intact index
          blocks are, as much as you might like to find them.

                          RUNNING BAD ON THE WRONG DISK

               If you thought initializing the  wrong  disk  was  bad,
          imagine  your  operators running BAD on a disk with valuable
          data on it.  When BAD runs to completion, it writes on every
          block  of  the  disk,  and  there  is  no hope of recovering
          anything.  But think positively:  perhaps you can abort  the
          BAD  run  very  soon  after  it starts.  If you do this soon
          enough, the program will only have  destroyed  part  of  the
          disk,  and  you  can  treat  the recovery the same as a lost
          index file.

               At CETUS, we have never had a disk destroyed by BAD  or
          INI, but we had something just as good happen.  One day, our
          DEC repairman, while initializing some new  diagnostic  RK05
          cartridges, accidentally initialized our big disk.   It  was
          really stupid of us to leave a valuable  data  disk  up  and
          running  on a system that had diagnostics running on it, but
          it was too late before we realized that.   You  can  bet  we
          haven't   done   it   again  since.   Well,  initializing  a
          diagnostic pack seems to involve writing mostly  zeros  over
          the  first  222 (octal) blocks of the disk.  Since our index
          file was on the beginning of the disk, this  had  about  the
          same effect as running INI on a pack.


                               WHAT TO DO ABOUT:
                               A LOST INDEX FILE

               When the index file is lost, it is  still  possible  to
          find  index  blocks by reading the whole disk and looking at
          every block.  There are clues in a header block to help  you
          identify  them.   The first word of a header always contains
          27027 (octal).  This word is really two offsets, H.IDOF  and
          H.MPOF, but they always seem to contain the same values.  If
          the header  is  active,  the  second  word  will  contain  a
          non-zero  file  ID number.  The last word of the header must
          contain a correct checksum:  the sum of all the other  words
          in  the  block.   Our  UICREC program (described below) also
          checks the owning UIC number in the header  to  see  if  the
          file   belongs  to  a  specific  directory.   An  even  more
          sophisticated program could check the  blocks  for  specific
          file names, types, or even creation and modification dates.
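
               The three clues above translate directly into a test on
          one block.  The following Python sketch is not the UICREC
          code;  it assumes 512-byte blocks holding 256 little-endian
          16-bit words, and it assumes the checksum is the 16-bit sum
          with any overflow discarded.  The names looks_like_header and
          split_offsets are made up for the illustration.

```python
import struct

HEADER_WORDS = 256   # one 512-byte block holds 256 sixteen-bit words
MAGIC = 0o27027      # first word: H.IDOF in the low byte, H.MPOF in the high byte


def split_offsets(first_word):
    """Split the first header word into its (H.IDOF, H.MPOF) byte values."""
    return first_word & 0xFF, first_word >> 8


def looks_like_header(block):
    """Apply the three clues from the text to one 512-byte block."""
    if len(block) != 2 * HEADER_WORDS:
        return False
    words = struct.unpack("<256H", block)   # PDP-11 words are little-endian
    if words[0] != MAGIC:                   # clue 1: the two offset bytes
        return False
    if words[1] == 0:                       # clue 2: zero means header not active
        return False
    # clue 3: last word is the sum of all the other words (16 bits assumed)
    return (sum(words[:-1]) & 0xFFFF) == words[-1]
```

          Splitting the first word shows why it is always the same:
          split_offsets(0o27027) gives 27 and 56 (octal), the two
          offsets packed one per byte.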

               After a block has been identified as an index block, it
          is possible to link through all the retrieval pointers in it
          and obtain the disk block numbers for all the blocks in  the
          file.   In  our  recovery program, it was most convenient to
          recover files off of the corrupted disk  with  QIO$ IO.RVB's
          and  recreate  them  on  another  disk  with  normal FILES11
          WRITE$'s.  In order to recreate these files  with  the  file
          attributes  of  the  originals,  the  user  file  attributes
          section of the index block is  also  lifted  and  used  when
          opening  the  reconstructed  file.  The blocks are then read
          one at a time from the corrupted disk and written  into  the
          new file.
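
               Once the retrieval pointers have been pulled out of the
          index block, the copy loop is short.  The Python sketch below
          assumes each pointer has already been decoded into a pair
          (starting block, number of blocks);  the decoding itself and
          the read_block and write_block callbacks are hypothetical
          stand-ins for the QIO$ and WRITE$ calls, not real interfaces.

```python
def expand_pointers(pointers):
    """Turn decoded (start_block, block_count) pairs into block numbers.

    The blocks come out in file order: all of the first pointer's
    blocks, then all of the second's, and so on.
    """
    blocks = []
    for start, count in pointers:
        blocks.extend(range(start, start + count))
    return blocks


def copy_file(read_block, write_block, pointers):
    """Read each block of the old file and append it to the new file."""
    copied = 0
    for lbn in expand_pointers(pointers):
        write_block(read_block(lbn))   # one block at a time, as in the text
        copied += 1
    return copied
```

          For example, expand_pointers([(100, 3), (50, 2)]) yields
          blocks 100, 101, 102, 50 and 51, in that order.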

               The above  procedure  does  not  work  on  multi-header
          files,  but  on  our  system  there  were  only four or five
          multi-header files that needed to be  recovered.   When  the
          UICREC program finds a header that belongs to a multi-header
          file, it attempts to read the file but  usually  fails.   In
          any  case,  UICREC does print the file name and disk address
          of the header on our line printer.   For  each  multi-header
          file  that  we  wanted to recover, we dumped all the headers
          found by UICREC and examined their extension segment numbers
          (M.ESQN).   With these numbers, we were able to sort all the
          block numbers into correct order.  It is important  to  make
          sure  that  all headers are accounted for, since if any were
          destroyed, the file is  not  recoverable.   To  recover  the
          file,  we  built  the  sorted  list of index blocks into the
          MACRO program named BIG.  This program reads all the  blocks
          pointed  to  by all the headers and writes them onto a tape.
          BIG also writes enough information on the first block of the
          tape  for  the  file  name  and  attributes  to be recreated
          correctly.  A slightly more general purpose program,  called
          READ,  can  read any tape created by BIG and reconstruct the
          file on a mounted volume later.
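
               The manual sort we did can be written down as a small
          routine.  This Python sketch assumes each header found by
          UICREC has been reduced to a pair (extension segment number,
          header block number), and that segment numbers start at 0 for
          the first header and go up by one.  Both the pair layout and
          the name order_headers are invented for the example.

```python
def order_headers(headers):
    """Sort (segment_number, header_block) pairs and check for gaps.

    Raises ValueError if a segment is missing or duplicated, since the
    file cannot be recovered unless every header is accounted for.
    """
    ordered = sorted(headers)
    segments = [seg for seg, _ in ordered]
    if segments != list(range(len(ordered))):
        raise ValueError("missing or duplicate header segment: %r" % segments)
    return [lbn for _, lbn in ordered]
```

          The list that comes back is the one that gets edited into
          BIG, in order.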


               There are several reasons why the multi-header recovery
          was  done  in this manner.  First of all, we did not need to
          recover many large files.  At  the  time  we  did  not  have
          another big disk drive on which to recover large files.  And
          finally, the manual sorting of  the  headers  saved  us  the
          complexity  of  a  program  "smart"  enough  to find all the
          headers for a multi-header file.  Remember, the index blocks
          are scattered virtually at random over the disk, and without
          the index file index block, the whole disk must be  searched
          to  find  all  the  headers  for a multi-header file.  It is
          unreasonable to expect  a  program  to  remember  all  block
          numbers of all extension headers that it finds, just in case
          they are needed.  The more realistic approach  would  be  to
          ignore  all extension blocks until the first header block of
          a large file is found.  Then the disk would be scanned again
          specifically  for extension blocks that belong to that file.
          This would, by the way, add another 5-10 minutes every  time
          a multi-header file is recovered.

                               WHAT TO DO ABOUT:
                           ACCIDENTALLY DELETED FILES

               Recovering accidentally deleted files requires  one  of
          two  things:  either someone spends the time to write a file
          recovery program before it is needed or you madly run  about
          writing  a  special purpose program to recover the files you
          delete after it happens.  We had a program  (named  LAZARUS)
          on  RSX11M V2.0  which could resurrect a file in place after
          it had been deleted, but this program was made  obsolete  by
          some  changes  in  FILES11 that came out at the same time as
          RSX11M V3.0.  The LAZARUS program,  besides  being  obsolete
          now,  was cumbersome to use:  it could only be run on a disk
          mounted with the /UNL switch, and  VFY  had  to  be  run  to
          re-mark   blocks   as   allocated  every  time  a  file  was
          resurrected.  In the process of writing the UICREC  program,
          I  came  up with several techniques that are superior to the
          old approach.  The traditional approach is to find the index
          block  for  the  deleted  file,  patch it back up, then go
          patch up the index file bitmap and  bitmap  file.   Although
          this  technique  would undoubtedly work if written properly,
          it has a great potential for  corrupting  good  volumes.   I
          envision a new LAZARUS program looking a lot like the UICREC
          program, digging blocks off of one volume the hard way,  but
          reconstructing  the  file  on  a separate volume with normal
          FILES11 macros.  This means that both disks are mounted, and
          all allocation of new blocks is done through the file system
          the normal way.  There is no chance of corrupting either  of
          the  volumes.   For  systems  that  lack  the advantage of a
          second disk drive, there is one other possibility:  the code
          for a recovery program does not amount to much, and there is
          address space left over for at least 32  disk  blocks.   For
          small files and when only one file needs recovering, LAZARUS
          could recover the file into core, then write it out onto the
          same  volume  that it came off of.  If the file is too large
          or there are several files to  recover,  the  program  could
          have  a switch to allow it to write back to the same volume.
          There would be a chance of losing some blocks  if  they  are
          reallocated before you recover them, but that is better than
          no chance at all.

               My final vision is of a  super  recovery  program  that
          does  it  all:   recovers  files whether they are deleted or
          not;  single or multi-header;  by UIC, name, type, or  date;
          and recovers off of disk volumes whether they are mounted or
          not, and have or  no  longer  have  an  index  file,  parity
          errors,  or  lost  headers!   Needless to say, I will not be
          able to justify writing such a program until the  next  time
          my  DP  manager accidentally deletes a bunch of files in his
          directory.


                      PROGRAMS USED IN RECOVERY OPERATIONS

               All the following programs, as well as the RNO file for
          this  document, will be available on the RSX symposium tape.
          The programs are not currently very easy to use,  so  please
          be  forgiving.   Remember  that  they were all written under
          extreme duress, so the  code  is  not  very  clean  or  well
          documented.   I  also  normally  do a slightly better job of
          "userizing"  a  program.   All  of  the  programs  do  ALUN$
          assignments  to  devices that we have here at CETUS, so they
          will require reassembling to make them work  on  most  other
          systems.

          CHECK

               This program performs a read check on all blocks  of  a
          disk.   Unlike  some DEC programs, like PRESERVE, CHECK does
          not "know" how many blocks there are on a disk.  The program
          keeps going until the system returns an illegal block number
          error code.  This means it even works on brand X disks!  The
          program reads 32 blocks at a time, barring errors, so it can
          read check a 300MB disk in 5 to 10 minutes.  We built  CHECK
          with  the /PR:0 taskbuild option, so it can even be run on a
          mounted disk.  If errors are found, they are reported in the
          following  format:   ERR NNN HHHHHH LLLLLL, where NNN is the
          FCS error code (usually 374), and HHHHHH and LLLLLL are  the
          high  order  and  low order words of the 32 bit block number
          where the problem occurred.
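
               The scanning strategy and the report format can be
          sketched as follows.  This is Python rather than the MACRO
          source of CHECK, and the read_blocks callback with its return
          values is a hypothetical interface.  The sketch also assumes
          that a chunk with an error is retried one block at a time to
          pin down the bad block, which the description above does not
          spell out.

```python
CHUNK = 32  # blocks per read when no errors occur


def format_error(code, lbn):
    """Render 'ERR NNN HHHHHH LLLLLL' with every field in octal."""
    return "ERR %03o %06o %06o" % (code & 0xFF, (lbn >> 16) & 0xFFFF, lbn & 0xFFFF)


def check_disk(read_blocks):
    """Read-check a disk without knowing its size in advance.

    read_blocks(lbn, count) is a hypothetical callback that returns 0 on
    success, None when lbn is at or past the end of the disk, and a
    negative error code if any block in the range cannot be read
    (including a range that runs off the end of the disk).
    """
    errors = []
    lbn = 0
    while True:
        status = read_blocks(lbn, CHUNK)
        if status is None:               # illegal block number: we are done
            return errors
        if status != 0:
            # Something in this chunk is bad; retry one block at a time.
            for one in range(lbn, lbn + CHUNK):
                status = read_blocks(one, 1)
                if status is None:
                    return errors
                if status != 0:
                    errors.append(format_error(status, one))
        lbn += CHUNK
```

          With an error code of -4, the low byte prints as 374 (octal),
          so a bad block 40 (decimal) would be reported as
          ERR 374 000000 000050.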

          ERRCPY

               This program copies DP0:  to DP1:, similar to  the  way
          PRESERVE does;  but like the CHECK program, it keeps copying
          until  the  system  tells  it  to  stop.   Why  have  system
          structures  like  the  UCB  unless you put them to good use?
          ERRCPY  can  copy  a  whole 300MB disk in about 20 minutes,
          unless  errors  are  found,  in  which case it runs a little
          slower.  Errors are printed in the same format as the  CHECK
          program.   If  ERRCPY  is  built  with the privileged option
          (/PR:0), it can even copy from a mounted  device,  making  a
          PRESERVE-like  backup  while  minimal  usage  of the disk is
          still going on.  This may be a  little  dangerous,  however,
          because  the  program would also have permission to write on
          mounted volumes (ouch!).
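
               In outline, the copy loop looks like the Python sketch
          below.  It is not the ERRCPY source.  The three callbacks are
          hypothetical, and the sketch assumes that an unreadable block
          is reported and left uncopied, the way PRESERVE was described
          as doing earlier.

```python
def copy_disk(read_block, write_block, report):
    """Copy block by block until the source reports an illegal block number.

    read_block(lbn) is a hypothetical callback returning (status, data):
    (0, bytes) on success, (negative code, None) on a read error, and
    (None, None) once lbn is past the last block of the disk.
    Returns (blocks_examined, blocks_skipped).
    """
    lbn = 0
    skipped = 0
    while True:
        status, data = read_block(lbn)
        if status is None:          # the system has told us to stop
            return lbn, skipped
        if status == 0:
            write_block(lbn, data)
        else:
            report(status, lbn)     # note the bad block and leave it uncopied
            skipped += 1
        lbn += 1
```

          Nothing in the loop depends on the size of the disk, which is
          the whole point.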

          UICREC

               The current version recovers files from DP0:  onto  SY:
          in the current UIC.  This means you must SET /UIC to the UIC
          that you wish to recover and ASN SY:  to the disk  you  want
          to  store  recovered  files  on.   The recovery disk must be
          mounted and must have the appropriate  UFD  created  on  it.
          Every  time UICREC finds a header that is owned by your UIC,
          it prints (on LP1:  -- this is  a  taskbuilder  option)  the
          file name, the two word disk address of the index block, and
          four  FCS   error   codes.    The   four   codes   are   for
          (respectively):   1)  opening the output file, 2) allocating
          the required number of blocks, 3) the last  write  into  the
          new file, and 4) closing the file.  If all four codes are 1,
          then the recovery was successful.  Errors that you might get
          are:   -16  -- you forgot to mount the recovery disk, -26 --
          you forgot to create a UFD for this UIC, -24 -- the recovery
          disk   is   full.    Remember  that  UICREC  cannot  recover
          multi-header files;  if  the  same  file  name  and  version
          appear  several  times  with  different  block numbers, it's
          probably a multi-header file.  Read the procedure above, and
          use the BIG and READ programs described below.  UICREC scans
          for index blocks 30 or so at a time,  so  it  can  search  a
          300MB disk in 5 to 10 minutes, plus the transfer time of any
          files recovered.  An important  note:   the  UIC  that  this
          program   sees  is  the  owning  UIC,  not  necessarily  the
          directory that the file was last seen in.  If your file does
          not pop up where you expect it, try to remember what UIC you
          were in when you created the file.
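
               Reading the four codes off the line printer gets tedious,
          so here is a sketch of how they could be decoded.  It is
          Python, not part of UICREC, and the explain function and its
          wording are invented;  only the three error codes listed
          above are known to it.

```python
KNOWN_ERRORS = {
    -16: "recovery disk is not mounted",
    -26: "no UFD exists for this UIC",
    -24: "recovery disk is full",
}
STEPS = ("open", "allocate", "last write", "close")


def explain(codes):
    """Summarise the four FCS codes printed for one recovered file."""
    if len(codes) != len(STEPS):
        raise ValueError("expected four codes")
    problems = []
    for step, code in zip(STEPS, codes):
        if code != 1:                # 1 is the success code for every step
            reason = KNOWN_ERRORS.get(code, "unlisted error")
            problems.append("%s: %d (%s)" % (step, code, reason))
    return problems or ["recovered"]
```

          A line of four 1's comes back as "recovered";  anything else
          names the step that failed.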

          BIG

               This program  is  used  to  help  recover  multi-header
          files.  Every time you need to recover such a file, you must
          edit the program at the label LIST to add your list of index
          disk  block  numbers.   The list is terminated by a -1 after
          your last block number.  The program  comes  with  a  sample
          list  built  into  it if you are unsure how to do this.  BIG
          recovers files from DP0:  and  writes  a  recovery  tape  on
          MT0:.  Sorry, only one file to a tape.

          READ

               This program reads tapes created by  the  BIG  program.
          READ does not need to be reassembled for each file;  it gets
          the file name and attributes from  crumbs  that  BIG  leaves
          behind  on  the tape.  READ writes the file onto SY:  in the
          UIC that you happen to be SET to, so you need not be in  the
          same UIC that the file was recovered from.