Crashes - (Part 2)

Different programmers enjoy different parts of the software cycle. The majority
take the greatest job satisfaction from actual coding, while others like the design
phase. Several years ago I heard rumors about someone down in Texas who thought
writing documentation was the best part of their job. 

My speciality is debugging. I especially like solving system crashes. Crash
dump files have the same effect on me as a mysterious caller to 221B Baker
Street had on Sherlock Holmes. 

My approach to crash dumps also follows Holmes' behavior at a crime scene.
After quickly scanning a few crash dump listing pages, I alternate between
periods of staring into space and frenzied page turning. When the culprit is
apprehended, what I consider a perfectly logical explanation might seem to you
to have been reached only through black magic.

Just as Sherlock Holmes was an expert in cigar ash, poisons, and footprints, I
have mastered tracing through the system stack, locating the stray I/O, and
accounting for all pool usage. Our expertise in our respective professions
gives both Holmes and myself the ability to observe minute facts and make
startling deductions. 

While crime is probably too large an area to be mastered by current
artificial intelligence technology, crash analysis is a perfect application for
an expert system. In fact, such a system could work almost unattended, getting
the necessary input from the crash dump file and working on a rule base which
includes the current sources and all known problems. 

This article looks at some design goals for a crash analysis expert system and
the initial rule base an RSX implementation of the system might follow. 

The goal of crash analysis is to find and fix the logic which caused the system
to crash. If we cannot reach this goal, our purpose is to better understand why
the system crashed at this particular point in time. Learning to avoid a crash
is almost as good as an actual fix. 

Our crash analysis expert system starts with our observations of what was going
on at the time of the crash. The first question the expert system would always
ask is "What was new or unique about the system at this point in time?" If we
answer "Joe ran his new program for the first time," the system would ask Joe
to run the program again before allowing normal usage. A second crash might
result in giving Joe the rest of the day off as a means of avoiding future
crashes. 

One crash in 10 is resolved by just reviewing your observations, especially
crashes caused by pilot error. More often, the review only isolates what
triggered the crash, and you move further crash analysis work to off-hours.

    	Physical Crash Point
    	-------  ----- -----

In any case, nine times out of 10 you need to dig into the crash dump. The
hypothetical expert system would first ask some questions on how the system
crashed: did a failure take you to XDT, directly to the crash routine, or did
you force a crash by starting at location 40? See Part 1 (February 1987) for
details on all the different ways an RSX system crashes. 

Our expert system then starts examining the crash dump file. Its first goal is
to find the instruction which caused the system fault. This instruction is
called the physical crash point. For instance, the following RSX executive
instruction will cause a crash if R5 is an odd address instead of the expected
UCB address:

	MOV	U.RED(R5),R5		;Get next redirect chain

The various ways you can crash an RSX system determine how the expert system
would find the Program Counter (PC), Processor Status (PS), stack pointer (R6)
and general registers (R0-R5) for the physical crash point. The easiest case is
an XDT crash. If the crash was generated by typing 'X' to XDT, the physical
crash PC/PS are shown in the crash listing on the "BEFORE CRASH PC/PS" line on
the first page. The next line lists the general registers at the time XDT was
called, and the SP(K) value on the first line is the kernel stack pointer (R6)
when the crash occurred.

The same values are used if a crash is forced by starting at location 40, with
one exception: the BEFORE CRASH PC/PS. The Program Counter and Processor Status
values in the listing are meaningless. Instead you use the PC/PS you wrote down
when you halted the system. These crucial values are lost when you restart the
CPU at location 40.
 
The crash listing is misleading about the physical crash point when the system
has jumped directly to the crash routine. The BEFORE CRASH PC/PS
are for the standard instruction (an IOT) RSX issues when it has decided the
system must crash. Furthermore, the registers and kernel stack pointer reflect
the processing RSX did in deciding to crash. We really want to find the place
which forced RSX to execute this code. 

You need to remember that all crashes in this category are initiated by some
form of trap: odd address, illegal instruction, segment fault, stack limit
violation, or trap instruction (BPT, IOT, EMT, or TRAP). All of these traps
are processed in the same way in the module SSTSR: the registers are saved,
some trap-specific data is pushed onto the stack, and RSX decides if the trap
belongs to a user task or the executive. 99.9% of the time user code is blamed
and some task gets aborted (or an SST trap is declared to the task).

One time in a thousand, executive code caused the trap and the system must
crash. To get to the crash code, SSTSR issues an IOT instruction. The IOT
processing in SSTSR includes a jump to the crash routine. The PC/PS, registers,
and kernel stack of this IOT instruction are reported on the first page of the
crash listing. 

However, it is the contents of the kernel stack which interest us. There are
six different cases to consider. Some simple pattern matching makes it easy to
tell the cases apart.

In all cases, you start by getting the kernel stack pointer from the first page
of the crash listing. Then you turn to the kernel stack dump on the second page
and read up from the value of the stack pointer. If the kernel stack pointer
has a value of 376, the trap is a stack violation and the stack looks like
Figure 1.

The next four cases - odd address (trap 4), illegal instruction (trap 10),
breakpoint (trap 14), and segment fault (trap 250) - are all variations on the
same theme. Figure 2 shows the general case. Reading up the stack, you find the
PC/PS of the final IOT in SSTSR, two constant values which vary according to
the type of trap, and any optional parameters. Next come the return address of
the executive coroutine used to save the registers and the saved R0-R5. We
finally come to the PC/PS of the actual trap. This is the physical crash point.
The registers at the time of the crash have been neatly saved on the stack.

The two constant values make it easy to find out which trap occurred. The table
in Figure 3 shows the relationships. The pattern of odd address trap (4,0) is
probably the most common. Only segment fault (12,2) has any optional
parameters. The memory management registers SR0, SR2, and SR1, respectively,
are pushed onto the stack.
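
The pattern matching can be sketched in a few lines of Python. This is only a
sketch under stated assumptions: the word order is taken from the description
of Figure 2 above, the two constants are assumed to appear in the order quoted,
and the order of the six saved registers is not assumed at all, so they are
returned unlabeled. Only the two constant pairs quoted in the text are
included; the function name and table are hypothetical.

```python
# Sketch of the Figure 2 pattern match. The word order follows the prose
# description; the order of the two constants and of the six saved
# registers is an assumption, so the registers are returned unlabeled.

# (first constant, second constant) -> (trap name, optional word count).
# Only the two pairs quoted in the text appear here; Figure 3 has the rest.
KNOWN_TRAPS = {
    (0o4, 0o0): ("odd address", 0),
    (0o12, 0o2): ("segment fault", 3),  # followed by SR0, SR2, SR1
}


def decode_trap_frame(stack):
    """Decode kernel stack words listed upward from the crash-time SP."""
    if len(stack) < 4:
        return None
    key = (stack[2], stack[3])  # stack[0:2] is the PC/PS of the final IOT
    if key not in KNOWN_TRAPS:
        return None  # not one of the two pairs this sketch knows
    name, optional_count = KNOWN_TRAPS[key]
    start = 4 + optional_count  # index of the coroutine return address
    if len(stack) < start + 9:
        return None  # listing is too short to hold a whole frame
    return {
        "trap": name,
        "optional": list(stack[4:start]),
        "saved_registers": list(stack[start + 1:start + 7]),
        "pc": stack[start + 7],
        "ps": stack[start + 8],
    }
```

The returned "pc" and "ps" are the physical crash point. A result of None means
the stack does not match either pattern and one of the other cases applies.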

The last case is none of the above and is the IOT crash. The IOT instruction is
the RSX crash instruction. There are over twenty places inside RSX
where the executive becomes confused enough to deliberately crash the system.
You get the physical crash point from the first page of the crash listing just
as if this were an XDT crash.

It only takes a few seconds to get the Program Counter, Processor Status, and
general registers of the physical crash point (after you have done it once or
twice). The expert system now has its starting point for further detective
work.
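
As a first rule for the hypothetical expert system, the choices described in
this section can be written down as a small lookup table. The Python below is
a sketch only; the function name and the crash kind keys are made up for
illustration, and the table records nothing beyond what is stated above.

```python
def crash_point_sources(crash_kind):
    """Return where the physical crash point values are found."""
    first_page = "first page of the crash listing"
    stack = "trap frame on the kernel stack"
    rules = {
        # Crash generated by typing 'X' to XDT.
        "xdt": {"pc_ps": first_page, "registers": first_page},
        # Crash forced by starting at location 40: listing PC/PS is useless.
        "location_40": {
            "pc_ps": "values written down when the system was halted",
            "registers": first_page,
        },
        # Odd address, illegal instruction, breakpoint, or segment fault.
        "trap": {"pc_ps": stack, "registers": stack},
        # Deliberate executive crash: treated just like an XDT crash.
        "iot": {"pc_ps": first_page, "registers": first_page},
    }
    return rules.get(crash_kind)
```

An unknown crash kind returns None, which is the cue to go back to the
questions about how the system crashed.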

	Getting to Code
	------- -- ----

The next step is to start reading code. It takes four stages to get from the
physical crash data to the correct source listing. You map the Program Counter
to physical memory by using the memory management registers listed on the first
page of the crash listing. You then find which component is loaded into this
part of memory by scanning the memory map page in the crash dump listing.
This page is found at the beginning of the partition information section. You
then look at the component's task map to find the source module, and
finally look in the source module to see the instruction which caused the
fault.

This process is actually not as involved as it sounds. Crashes always happen in
kernel CPU state so you are only concerned with the Kernel I Space registers
(lower left corner). If the PC is from 0 to 120000, the physical crash point is
in the executive and you look directly in the executive map ([1,34]RSX11M.MAP)
to find which module was involved. If the executive source listing is not
available, make one by using the following command: 

	MAC ,file=LB:[1,1]EXEMC/ML,SY:[11,10]RSXMC,file

Addresses above 120000 are either device drivers or privileged tasks. The
number in the appropriate memory management register gives the physical memory
address in 32-word units. You append two zeros to get the octal physical
address. It is then quick to scan along the base column in the memory map to
find the component which begins just below some given physical address. You now
know which component might need a map and listings generated.
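
The arithmetic is short enough to sketch in Python. The translation follows
the standard PDP-11 scheme: the top three bits of the virtual address select
one of eight page address registers, and the low thirteen bits are added to
the register value times 100 (octal). The function names, the register values,
and the memory map entries in the usage note are hypothetical; a real memory
map comes from the crash dump listing.

```python
from bisect import bisect_right


def kernel_physical_address(pc, kernel_pars):
    """Translate a 16-bit kernel virtual address to a physical address."""
    apr = (pc >> 13) & 0o7   # top three bits select one of eight registers
    offset = pc & 0o17777    # low thirteen bits are the offset in the page
    # Each register counts 32-word (100 octal byte) units of memory.
    return kernel_pars[apr] * 0o100 + offset


def component_at(physical, memory_map):
    """Return the name of the entry with the highest base at or below physical."""
    entries = sorted(memory_map)
    bases = [base for base, _ in entries]
    index = bisect_right(bases, physical) - 1
    return entries[index][1] if index >= 0 else None
```

For example, with kernel register 5 holding 3452, the virtual address 123404
maps to physical address 345200 + 3404 = 350604. A memory map holding a
component based at 345200 would then name that component as the one to
investigate.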

The translation from crash dump Program Counter to source listing instruction
occurs over and over again during crash analysis. You need to remember that the
Processor Status at the time the Program Counter was used determines which set
of memory management registers are used to compute the physical memory address.
The PDP-11 Processor Handbook has detailed information on the format of the
Processor Status and the operation of the memory management registers. 
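
As a reminder of the Processor Status layout on PDP-11 processors with memory
management, here is a small Python sketch that pulls out the fields which
matter for this translation. Bits 15-14 hold the current mode, bits 13-12 the
previous mode, and bits 7-5 the processor priority. The function name is
hypothetical; check the handbook for your processor model before relying on
it.

```python
MODES = {0: "kernel", 1: "supervisor", 2: "reserved", 3: "user"}


def decode_ps(ps):
    """Extract the mode and priority fields from a Processor Status word."""
    return {
        "current_mode": MODES[(ps >> 14) & 0o3],   # selects the register set
        "previous_mode": MODES[(ps >> 12) & 0o3],
        "priority": (ps >> 5) & 0o7,
    }
```

A Processor Status of 000340 decodes as kernel mode at priority 7, so the
kernel registers apply. A value of 140000 decodes as user mode, and the user
registers must be used instead.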

    	Crash Detective
    	----- ---------

A crash is a sequence of time-ordered events: the system is performing
normally, the logical error occurs, the system continues processing, the
physical error occurs, and the system crashes. 

The element of time is crucial to crash analysis. A crash dump gives you a
snapshot of the system at the physical point of failure. All registers, stacks,
and variables are frozen at this point in time. We need to try to use this
information to walk backwards in time to the logical error. The ideal crash
analysis utility would be a screen-mode debugger which could single-step
backwards!
Register and variable contents would be backdated as far as possible. 

Backwards debugging is possible because the kernel stack provides the past
history of what happened in the system since the last switch from user to
kernel state. Walking up and down the stack lets you reconstruct the immediate
chain of events.

The easiest crashes to solve are those where the sequence of events is trap to
executive (for example the set event flag directive is issued), some
processing, logical error, some more processing, physical error, and crash.
When the logical error occurs in the same context as the physical crash point,
it will probably take only one crash to solve the problem because all the
history is captured by the crash dump. 

The difficult crashes are those where the history from the logical error to the
physical error is lost. There is no cookbook approach which applies. You have
to collect the available evidence and use your judgement as to the next
steps to take. 

It helps if you can learn to think like a PDP-11. It is critical to remember a
system crash is a completely logical event, at least from the point of view of
the computer. If a routine has been successfully executed once, the exact same
input to the same code will work the second, third, and all subsequent calls.
If a call to this routine crashes the system, we need to find out what has
changed.

Occam's Razor also applies to crashes. Crashes often leave what seems to be
unrelated or conflicting evidence. But when the crash is finally solved, the
logical error turns out to be an obvious explanation for all the facts. As you
accumulate evidence and symptoms about the problem, try to theorize the
simplest glitch which would fully explain the facts. When you attempt to
prove the theory, you will either find the bug or learn even more about
the problem.

This is the point where an expert system becomes valuable. The evidence gleaned
from crash dumps can point to actual problems if you apply a knowledge database to
the data. For instance if you tell me one of your symptoms is your system clock
occasionally loses 18.2 minutes, I know the problem is a device forking twice.
This and probably another hundred different rules can be coded. 

	Tricks
	------

Unfortunately it would take hundreds of pages to express the same knowledge as
printed material because in effect these rules are all that I and others know
about RSX. Experience is probably the most difficult subject you can attempt to
document.

What I can cover is some of the tools and tricks I use to solve the tough
crashes. The first tool is what could be called the crash dump editor - the ZAP
utility. ZAP lets you look at any word in any file. This is perfect for
looking around in crash dump files. 

ZAP uses an ODT-like syntax and is fully explained in the RSX Utilities Manual.
ZAP is the only RSX utility which does not accept the filename on the initial
command line. Use the following sequence to open a crash dump file: 

	>ZAP
	ZAP>filename/AB/RO

The /AB switch tells ZAP to open the file in absolute mode. The /RO switch
prevents you from accidentally modifying the crash file. Nothing can be more
frustrating than spending 12 hours chasing down a bogus piece of evidence you
put into the dump.

You use ZAP to dig out evidence from the crash dump. For instance, suppose
you have looked up the physical crash point in the source code and found
that the instruction we failed on was:

	MOV	U.RED(R5),R5		;Get next redirect chain

However, R5 at the time of the crash is not odd. What can explain the problem?
ZAP commands can be used to find that the MOV instruction has been clobbered. 

ZAP can only access 32KW from a given base address. When looking at a crash
dump file, it is useful to set relocation registers to the base of various
components. For example, 1:0;1R sets relocation register 1 to physical memory
address 0 (block 1, byte 0). 

Assume you want to examine location 123404 in F11ACP. According to the crash
listing, F11ACP is located at physical address 345200. The following commands
examine the desired address:

	_346:200-120000;2R	or 	_346:200;2R
	_2,123404/			_2,3404/
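
The block:offset values can be computed rather than worked out by hand. The
Python sketch below is inferred from the two examples above (physical address
0 is 1:0 and 345200 is 346:200), so it assumes the memory image starts at
block 1 of the dump file and that blocks are 1000 (octal) bytes long. The
function name is hypothetical.

```python
def zap_base(physical):
    """Return the ZAP block:offset string for a physical memory address."""
    block = physical // 0o1000 + 1   # 512-byte blocks, numbered from 1
    offset = physical % 0o1000       # byte offset within the block
    return f"{block:o}:{offset:o}"
```

Here zap_base(0o345200) gives the string 346:200 used in the relocation
register commands above.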

It is not difficult to write your own programs to do special analysis of crash
dump files, as the file format is fixed-length 512-byte records. For example,
suppose your analysis shows something is illegally referencing location 30 (EMT
trap) and storing a 2456 in it. These numbers have to come from somewhere. It
is worth a shot to write a throwaway program to read the entire crash dump file
and output all occurrences of 30 or 2456.
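
Such a throwaway program might look like the Python sketch below, run on
whatever machine holds a copy of the dump file. It treats the file as
little-endian 16-bit words, which is the PDP-11 byte order, and reports each
hit as a 1-based block number and byte offset so it can be examined with ZAP
in absolute mode. The function name is hypothetical.

```python
import struct


def find_words(data, targets):
    """Return (block, offset, value) for every word in data found in targets."""
    hits = []
    for index in range(len(data) // 2):
        byte_offset = index * 2
        (value,) = struct.unpack_from("<H", data, byte_offset)
        if value in targets:
            # 512-byte records, numbered from 1 to match ZAP block numbers.
            hits.append((byte_offset // 512 + 1, byte_offset % 512, value))
    return hits
```

To use it, read the whole dump file in binary mode and call
find_words(data, {0o30, 0o2456}), printing each field in octal. Expect plenty
of innocent hits for a small value like 30; the interesting ones are those
which sit near each other.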

These tricks and more are discussed in the workbook Ken Johnson and I used for
the Crash Dump Analysis presymposium seminar. The book can be found on the
Fall 1982 RSX SIG tape.

It is probably too late in the lifetime of RSX to develop a crash analysis
expert system. On the other hand, I have heard estimates as high as 100,000 RSX
systems have been sold. If each system crashes only once a month, over 30
systems have crashed in the 15 minutes since you started reading this article.
Is your system still up?