See CRASH.CMD for figure.


	Crashes - (Part 1)

Have you ever noticed how RSX says goodbye to you just as you are getting ready
to leave on Friday for the big weekend in the mountains: 

	CRASH -- CONT WITH SCRATCH MEDIA ON MT0

System crashes are a fact of life. While 5:00 pm Friday is certainly not a
convenient time for a crash, neither is any other time of the week. A crash is
never a welcome experience, but if you are prepared, crashes need not be fatal
to your system or your home life. 

This article lists the steps you should take at your site to prepare for when
your RSX system crashes. Next month's article will discuss how to be a crash
detective and solve the really tough crash dumps. 

Crashes come in several forms. Most people equate RSX system crashes with the
above message and subsequent CPU halt. However, I define a crash as any system
failure which can be fixed only by rebooting the system. This broader
definition includes conditions such as CPU halts or uninterruptible loops, pool
exhaustion, memory deadlocks, critical task failure (e.g. F11ACP aborts), and
other miscellaneous errors such as removing INStall. 

All computer systems crash. The suggestions in this article are mostly common
sense and apply to handling system crashes for all systems. However, examples
and specific details will be drawn from RSX systems. 

The first question you need to answer regarding crashes is whether you need to
bother with them at all. There is a dollar cost associated with handling and
analyzing crashes. The cost comes mostly in the form of manpower, either your
own or outside experts when you have no in-house systems expertise. 

The payback for the time and money invested in solving crashes comes from the
production losses you avoid in the future. A system crash is not a random
event. Unlike a spurious hardware glitch, a crash will repeat: if you do not
find and fix the problem which caused your system to crash, the next time the
same set of events occurs the system will crash again.

For the lucky sites which consider interruption of service to be only a minor
annoyance, crashes can be ignored. However, most sites can calculate a real cost
if their systems fail. The higher this cost, the more steps it is realistic to
take to avoid system outages - 7x24 Field Service coverage, hot and cold system
backups, or redundant systems. Crash handling and analysis procedures are an
integral part of these steps. 
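
The cost argument can be put into rough numbers. The sketch below is Python
used as a calculator; the function names and every figure fed to it are
hypothetical, so substitute your own site's rates and outage lengths.

```python
def outage_cost(hours_down, cost_per_hour):
    """Production loss from one outage."""
    return hours_down * cost_per_hour

def analysis_pays_off(analysis_hours, analyst_rate,
                      hours_down, cost_per_hour, repeats_avoided):
    """True if the outages avoided are worth more than the analysis time."""
    spent = analysis_hours * analyst_rate
    saved = repeats_avoided * outage_cost(hours_down, cost_per_hour)
    return saved > spent

# Hypothetical site: 3 hours of analysis at 50 per hour, against one
# avoided repeat of a 2 hour outage costing 500 per hour.
print(analysis_pays_off(3, 50, 2, 500, 1))
```

If the answer comes out False even for several avoided repeats, your site is
probably one of the lucky ones that can ignore crashes.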

There are three distinct phases to solving system crashes: collecting
information at the time the system fails, checking the symptoms for any
obvious causes, and finally systematically analyzing the crash dump and other
data to pinpoint and fix the reason for failure. 

Collecting information about system failures presents some real problems. When
your system is down you will be under a great deal of pressure to get the
system back up. No one will be happy if it takes you 20 minutes to find and
load a scratch magtape in order to get a crash dump. But if you give in to peer
pressure and reboot without taking the dump, the same problem could rear its
head next month when you go into real production. 

Responding to system failures is something you practice. Anyone who knows how
to reboot the computer system should know how to take a crash dump and log
the crash data. 

Training to get crash dumps starts with the four different methods you can use
to bring RSX systems to the standard crash routine. The simplest case is when
the system has already printed the crash message on the console and you are ready
to mount a scratch medium and proceed. If the executive has broken to XDT
instead, typing an "X" will cause XDT to jump to the crash routine. 

RSX always stores a jump instruction to the crash routine in physical memory
location 40. If the system is in a tight loop, halted, or is in its console
ODT mode you can force a crash dump by starting execution at location 40. For
instance, I would type 40G on the console of my PDP-11/24. When you take this
approach, you should write down the program counter (PC) of where the system
halted as the PC is not saved in the crash dump.

If you can still type MCR commands and want to force a crash, I recommend
depositing a 101(8) at location 100(8) using the OPEN command.

	>OPEN 100/KNLD
	00000100 /xxxxxx 101<ESC>

The system will crash with an odd-address trap the next time the system clock
ticks (sometime in the next 1/60th of a second). This technique can also be
used from the console when the system seems to be in a tight but interruptible
loop. The advantage over the 40G mechanism is that the current program counter
and processor status are pushed onto the stack by the clock interrupt.
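
Why 101(8) does the job can be checked with a little arithmetic. The sketch
below is Python, not anything that runs on RSX, and the function name is made
up for illustration. It assumes location 100(8) holds the new-PC word of the
line clock interrupt vector, as it does on standard PDP-11 configurations, so
the value deposited there becomes the address the CPU tries to execute at the
next tick.

```python
def is_odd_address(octal_text):
    """Return True if the octal address has its low bit set."""
    return (int(octal_text, 8) & 1) == 1

# 101(8) is the value deposited at location 100(8) in the example above.
print(is_odd_address("101"))   # odd, so an instruction fetch there traps
print(is_odd_address("100"))   # even, a legal instruction address
```

PDP-11 instructions must start on an even address, which is why any odd
value would serve equally well.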

If none of these methods results in the crash message, either the executive
crash routine is corrupted or there is a severe hardware error. In either case
it will be impossible to get a crash dump and you should simply log the failure
and try to reboot the system. 

Once your system halts after writing the crash message, you are ready to load
the crash medium and write the crash dump. You select the crash device when you
generate your RSX-11M or RSX-11M-Plus system. Micro/RSX and pre-generated
RSX-11M-Plus systems have loadable crash device support. Obviously the
crash device must have been loaded into memory prior to the system failure. 

You should always choose magnetic tape devices over disk drives and floppy
disks over hard disks as crash devices. If you must use a disk device, be sure
you train your people to write protect other disk drives. The crash routines
write over the first 'n' blocks of a disk, which is a sure-fire method of
destroying the boot block and home block of an RSX disk. 

The system halts again after the crash dump has been written to the scratch
medium. If you continue from this halt, the crash routine loops and writes
another crash dump. If a problem occurs writing the crash dump, such as a
missing write ring in the tape, wait until the system halts again, fix the
problem, and continue. 

The various executive crash routines work by writing physical memory to the
crash device until a non-existent memory error occurs. Thus the crash dump does
not include any device or CPU registers located in the I/O page. If you suspect
a hardware device error caused the crash, it is a good idea to manually examine
the device registers from the console when the system halts after writing the
crash dump.
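
The "write until memory runs out" behavior can be modeled in a few lines. This
is only a Python model of the idea described above, not the executive's actual
code; the block size, the exception class, and the function names are my own
assumptions.

```python
BLOCK_WORDS = 256  # assumed dump unit: one 512-byte block

class NonExistentMemory(Exception):
    """Stands in for the trap taken when an address has no memory behind it."""

def read_block(memory, block_number):
    """Return one block of words, or trap if the block is past installed memory."""
    start = block_number * BLOCK_WORDS
    if start >= len(memory):
        raise NonExistentMemory(start)
    return memory[start:start + BLOCK_WORDS]

def dump_memory(memory):
    """Copy blocks from address 0 upward until a read traps."""
    dumped = []
    block_number = 0
    while True:
        try:
            dumped.append(read_block(memory, block_number))
        except NonExistentMemory:
            break  # first gap ends the dump; nothing above it is written
        block_number += 1
    return dumped

memory = [0] * (BLOCK_WORDS * 8)  # hypothetical machine with 8 blocks installed
print(len(dump_memory(memory)))   # number of blocks written to the crash medium
```

Because the loop stops at the first gap, anything that lives above the top of
installed memory, the I/O page included, never reaches the crash medium.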

A system crash is the wrong time to look through hardware manuals for
various addresses. Put together a simple 1-2 page document with contact names,
reboot procedure, and any relevant CPU and device registers and tape the
list to the side of the CPU.

A system crash is also the wrong time to scramble around looking for a scratch
medium. You should dedicate tapes or disks as crash media and not use them for
any other purpose. You do not want to be in the situation where you lose a crash
dump because you do not have a scratch tape. You then have to suffer another
production loss before you can get the information needed to solve the problem
and avoid system outages. 

You are not quite finished when you have taken the crash dump and rebooted the
system. The crash dump records what happened inside the computer at the time of
the crash. You need to record what was happening outside the system. Do not
rely on your own memory or that of others. It may be days or months before you
really dig into this crash dump.

This is really nothing more than informally asking different people what they
were doing when the system crashed and what they think went wrong. Pay special
attention to anything new which was just attempted. You should also look at the
console terminal and other terminals for messages out of the ordinary. You
are looking to get a picture of what was going on when the system crashed. 

The resulting information needs to be logged in some fashion. You do
not expect system crashes. You probably have not allowed for the 2-3 hours it
takes to start digging into a crash dump. For the time being, you take the
crash dump, reboot the system, get the picture of what was going on,
and get back to productive work. The crash becomes tomorrow's problem
(unless of course the system crashes again in ten minutes). 
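
If you would rather keep the log on-line than on paper, one possible record
layout is sketched below in Python. The field names and the sample values are
hypothetical; they simply mirror the items this article says to write down.

```python
from dataclasses import dataclass, field

@dataclass
class CrashLogEntry:
    when: str                       # date and time of the failure
    how_dump_was_taken: str         # e.g. crash message, XDT, 40G, OPEN
    halt_pc: str = ""               # PC written down by hand, if any
    console_messages: list = field(default_factory=list)
    user_reports: list = field(default_factory=list)
    new_activity: list = field(default_factory=list)

    def summary(self):
        """Return the entry as plain text, ready to print and staple."""
        lines = [f"Crash at {self.when} (dump via {self.how_dump_was_taken})"]
        if self.halt_pc:
            lines.append(f"Halt PC: {self.halt_pc}")
        for label, items in (("Console", self.console_messages),
                             ("Users", self.user_reports),
                             ("New", self.new_activity)):
            for item in items:
                lines.append(f"{label}: {item}")
        return "\n".join(lines)

# Hypothetical entry for illustration only.
entry = CrashLogEntry(when="Friday 17:00", how_dump_was_taken="40G",
                      halt_pc="001234",
                      new_activity=["first run of a newly built task"])
print(entry.summary())
```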

Some sort of filing system for crash dumps is required. I use the piles of
crash listings approach. As soon as the system comes back up, I use the command
file CRASH.CMD to read the dump to a disk file and generate an initial listing.
I put all my crash dumps in the same account and name the files according to
the date.

The crash listing becomes my file for this crash. I write all notes about the
crash directly on the listing. If there is any console output or other notes to
keep, I staple the paper to the crash listing. My filing system is the stack(s)
of crash listings.

When I solve a crash, the crash dump file extension is changed from CDA to YEA.
The notes on the crash listing are updated and all but the first few pages are
thrown away. These cover sheets and any attachments are put into an actual
file drawer.
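
The naming convention is easy to mechanize. The Python sketch below shows one
possible scheme; the CDA and YEA extensions are the ones I use, but the "CR"
prefix, the yymmdd layout, and the function name are assumptions made up for
this example.

```python
from datetime import date

def dump_file_name(crash_date, solved=False):
    """Build a dump file name from the crash date; YEA marks a solved crash."""
    name = "CR" + crash_date.strftime("%y%m%d")  # 8 characters, within RSX's 9
    extension = "YEA" if solved else "CDA"
    return f"{name}.{extension}"

print(dump_file_name(date(1986, 3, 14)))               # CR860314.CDA
print(dump_file_name(date(1986, 3, 14), solved=True))  # CR860314.YEA
```

Putting the year first means a directory listing sorted by name is also
sorted by date.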

Eventually crash files and listings accumulate and I need to recover the disk
and office space. Any solved and dead-end crash dump files are appended to a
BRU tape used just to hold such files. Again just the cover sheets from the
discarded crashes are kept and everything else is thrown away. 

The full output from the Crash Dump Analyzer can easily use several trees to
print (over 300 pages for large systems). I prefer to use just a few switches
in the initial listing and get more detailed output later if I need it. The two
most useful switches are /ACT and /PCB. The /ACT switch generates a list of all
the active tasks in the system. The /PCB switch is useful because of the memory
map it outputs. 

You start analyzing a crash dump by looking through the crash listing and
getting a picture of what was happening inside the system when the crash
occurred. You look for the obvious and the unexpected.

Probably the most common RSX problem is pool exhaustion. The beginning
section of the crash listing shows if pool is low. If so, sometimes the
initial listing shows what is consuming pool. Look for large I/O counts
and long receive queues.
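
The same scan can be written down as a rule. In this Python sketch the task
data is assumed to have been copied by hand from the listing into
dictionaries; the key names, the task names, and the thresholds are all
hypothetical and would need tuning for your system.

```python
def pool_suspects(tasks, max_io=10, max_queue=10):
    """Return names of tasks with a large I/O count or a long receive queue."""
    suspects = []
    for task in tasks:
        if task["io_count"] > max_io or task["receive_queue"] > max_queue:
            suspects.append(task["name"])
    return suspects

# Hypothetical figures for three made-up tasks.
tasks = [
    {"name": "TASKA", "io_count": 2, "receive_queue": 0},
    {"name": "TASKB", "io_count": 45, "receive_queue": 1},
    {"name": "TASKC", "io_count": 0, "receive_queue": 60},
]
print(pool_suspects(tasks))
```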

One example of the unexpected is active tasks which should not be active. Tasks
such as TKTN or PMD mean some task has aborted. If ERRLOG is active, some
device errors are being logged. 

You should also look for the opposite, tasks which should be active but are
not. The tasks ...LDR, F11ACP, and MCR... should be active at all times. Your
application probably has similar tasks. The active task portion of the crash
listing should also be examined for tasks with outstanding I/O. This is
probably not a direct indicator of the problem, but many system crashes are
related to I/O in progress. 
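
Both checks amount to comparing the active task list against two short lists
of names. The Python sketch below uses the task names mentioned above; the
function name and the idea of typing the active list in by hand are my own
assumptions, and you would add your application's tasks to the expected set.

```python
EXPECTED_ACTIVE = {"...LDR", "F11ACP", "MCR..."}
WARNING_SIGNS = {
    "TKTN": "some task has aborted",
    "PMD": "some task has aborted",
    "ERRLOG": "device errors are being logged",
}

def review_active_tasks(active):
    """Return (tasks that should be active but are not, warning-sign tasks found)."""
    active = set(active)
    missing = sorted(EXPECTED_ACTIVE - active)
    unexpected = {name: WARNING_SIGNS[name]
                  for name in sorted(active & set(WARNING_SIGNS))}
    return missing, unexpected

# Hypothetical active task list copied from a crash listing.
missing, unexpected = review_active_tasks(["...LDR", "MCR...", "TKTN"])
print(missing)
print(unexpected)
```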

You now have pictures of what was happening inside and outside the system
when the crash occurred. You are now ready to play detective (but you will have
to wait for next month's exciting conclusion).