Crashes - (Part 2)

Different programmers enjoy different parts of the software cycle. The majority take greatest job satisfaction from actual coding while others like the design phase. Several years ago I heard rumors about someone down in Texas who thought writing documentation was the best part of their job. My speciality is debugging. I especially like solving system crashes.

Crash dump files have the same effect on me as a mysterious caller to 221B Baker Street had on Sherlock Holmes. My approach to crash dumps also follows Holmes' behavior at a crime scene. After quickly scanning a few crash dump listing pages, I alternate between periods of staring into space and frenzied page turning. When the culprit is apprehended, what I consider a perfectly logical explanation might seem to you to have been reached only through black magic. Just as Sherlock Holmes was an expert in cigar ash, poisons, and footprints, I have mastered tracing through the system stack, locating the stray I/O, and accounting for all pool usage. Our expertise in our respective professions gives both Holmes and me the ability to observe minute facts and make startling deductions.

While crime is probably too large an area to be mastered by current artificial intelligence technology, crash analysis is a perfect application for an expert system. In fact, such a system could work almost unattended, getting the necessary input from the crash dump file and working on a rule base which includes the current sources and all known problems. This article looks at some design goals for a crash analysis expert system and the initial rule base an RSX implementation of the system might follow.

The goal of crash analysis is to find and fix the logic which caused the system to crash. If we cannot reach this goal, our purpose is to better understand why the system crashed at this particular point in time. Learning to avoid a crash is almost as good as an actual fix.

Our crash analysis expert system starts with our observations of what was going on at the time of the crash. The first question the expert system would always ask is "What was new or unique about the system at this point in time?" If we answer "Joe ran his new program for the first time," the system would ask Joe to run the program again before allowing normal usage. A second crash might result in giving Joe the rest of the day off as a means of avoiding future crashes.

One crash in 10 is resolved by just reviewing your observations, especially crashes caused by pilot error. More likely, you will isolate the reason for the crash and move further analysis of the problem to off-hours.

Physical Crash Point
-------- ----- -----

In any case, nine times out of 10 you need to dig into the crash dump. The hypothetical expert system would first ask some questions on how the system crashed: did a failure take you to XDT, did it go directly to the crash routine, or did you force a crash by starting at location 40? See Part 1 (February 1987) for details on all the different ways an RSX system crashes.

Our expert system then starts examining the crash dump file. Its first goal is to find the instruction which caused the system fault. This instruction is called the physical crash point. For instance, the following RSX executive instruction will cause a crash if R5 is an odd address instead of the expected UCB address:

        MOV     U.RED(R5),R5    ;Get next redirect chain

The various ways you can crash an RSX system determine how the expert system would find the Program Counter (PC), Processor Status (PS), stack pointer (R6) and general registers (R0-R5) for the physical crash point.

The easiest case is an XDT crash. If the crash was generated by typing 'X' to XDT, the physical crash PC/PS are shown in the crash listing on the "BEFORE CRASH PC/PS" line on the first page.
The next line lists the general registers at the time XDT was called, and the SP(K) value on the first line is the kernel stack pointer (R6) when the crash occurred.

The same values are used if a crash is forced by starting at location 40, with the exception of the BEFORE CRASH PC/PS. The Program Counter and Processor Status values in the listing are meaningless in this case. Instead you use the PC/PS you wrote down when you halted the system. These crucial values are lost when you restart the CPU at location 40.

The crash listings are misleading when you try to find the physical crash point after the system has jumped directly to the crash routine. The BEFORE CRASH PC/PS are for the standard instruction (an IOT) RSX issues when it has decided the system must crash. Furthermore, the registers and kernel stack pointer reflect the processing RSX did in deciding to crash. We really want to find the place which forced RSX to execute this code.

You need to remember that all crashes in this category are initiated by some form of instruction trap: odd address, illegal instruction, segment fault, stack limit violation, or trap instruction (BPT, IOT, EMT, or TRAP). All of these traps are processed in the same way in the module SSTSR: the registers are saved, some trap specific data is pushed onto the stack, and RSX decides if the trap belongs to a user task or the executive. 99.9% of the time user code is blamed and some task gets aborted (or an SST trap is declared to the task). One time in a thousand executive code caused the trap and the system must crash.

To get to the crash code, SSTSR issues an IOT instruction. The IOT processing in SSTSR includes a jump to the crash routine. The PC/PS, registers, and kernel stack of this IOT instruction are reported on the first page of the crash listing. However, it is the contents of the kernel stack which interest us.

There are six different cases to consider. Some simple pattern matching makes it easy to tell the cases apart.
In all cases, you start by getting the kernel stack pointer from the first page of the crash listing. Then you turn to the kernel stack dump on the second page and read up from the value of the stack pointer.

If the kernel stack pointer has a value of 376, the trap is a stack violation and the stack looks like Figure 1.

The next four cases - odd address (trap 4), illegal instruction (trap 10), breakpoint (trap 14), and segment fault (trap 250) - are all variations on the same theme. Figure 2 shows the general case. Reading up the stack, you find the PC/PS of the final IOT in SSTSR, two constant values which vary according to the type of trap, and any optional parameters. Next come the return address of the executive coroutine used to save the registers and the saved R0-R5. We finally come to the PC/PS of the actual trap. This is the physical crash point. The registers at the time of the crash have been neatly saved on the stack.

The two constant values make it easy to find out which trap occurred. The table in Figure 3 shows the relationships. The pattern of the odd address trap (4,0) is probably the most common. Only the segment fault (12,2) has any optional parameters: the memory management registers SR0, SR2, and SR1, respectively, are pushed onto the stack.

The last case is none of the above and is the IOT crash. The IOT instruction is the RSX crash instruction. There are over twenty places inside RSX where the executive becomes confused enough to deliberately crash the system. You get the physical crash point from the first page of the crash listing just as if this were an XDT crash.

It only takes a few seconds to get the Program Counter, Processor Status, and general registers of the physical crash point (after you have done it once or twice). The expert system now has its starting point for further detective work.

Getting to Code
------- -- ----

The next step is to start reading code. It takes four stages to get from the physical crash data to the correct source listing.
You map the Program Counter to physical memory by using the memory management registers listed on the first page of the crash listing. You then find which component is loaded into this part of memory by scanning the memory map page in the crash dump listing. This page is found at the beginning of the partition information section. You then look at the component's task map to find the source module, and finally look in the source module to see the instruction which caused the fault.

This process is actually not as involved as it sounds. Crashes always happen in kernel CPU state, so you are only concerned with the Kernel I Space registers (lower left corner). If the PC is from 0 to 120000, the physical crash point is in the executive and you look directly in the executive map ([1,34]RSX11M.MAP) to find which module was involved. If the executive source listing is not available, make one by using the following command:

        MAC ,file=LB:[1,1]EXEMC/ML,SY:[11,10]RSXMC,file

Addresses above 120000 are either device drivers or privileged tasks. The number in the appropriate memory management register gives the physical memory address in 32 word units. You append two zeros to get the octal physical address. It is then quick to scan along the base column in the memory map to find the component which begins just below the given physical address. You now know the component for which you might need to generate a map and listings.

The translation from crash dump Program Counter to source listing instruction occurs over and over again during crash analysis. You need to remember that the Processor Status at the time the Program Counter was used determines which set of memory management registers is used to compute the physical memory address. The PDP-11 Processor Handbook has detailed information on the format of the Processor Status and the operation of the memory management registers.
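
The address arithmetic in this section can be sketched in a few lines of Python. This is only an illustration and not part of any RSX tool: the function names and the sample register and memory map values are invented, except that the F11ACP base of 345200 reuses the figure from the ZAP example later in this article. The sketch assumes the top three bits of the 16-bit virtual address select one of eight kernel page address registers, treats each register value as a base in 32 word (100 octal byte) units, and ignores the page length and access checks the hardware performs.

```python
import bisect

PAGE_UNIT = 0o100  # 32 words = 64 bytes, i.e. "append two zeros" in octal


def virtual_to_physical(virtual_addr, page_address_registers):
    """Map a 16-bit kernel virtual address to a physical byte address."""
    if not 0 <= virtual_addr <= 0o177777:
        raise ValueError("virtual address must fit in 16 bits")
    if len(page_address_registers) != 8:
        raise ValueError("expected eight page address registers")
    page = virtual_addr >> 13          # top 3 bits select the register
    offset = virtual_addr & 0o17777    # low 13 bits are the byte offset
    return page_address_registers[page] * PAGE_UNIT + offset


def find_component(physical_addr, memory_map):
    """Return the component with the highest base not above the address.

    memory_map is a list of (base_physical_address, name) pairs sorted
    by base, like the base column of the memory map page.
    """
    bases = [base for base, _ in memory_map]
    index = bisect.bisect_right(bases, physical_addr) - 1
    return memory_map[index][1] if index >= 0 else None


# Hypothetical sample data: register 5 maps virtual 120000 to 345200.
SAMPLE_KERNEL_PARS = [0, 0o200, 0o400, 0o600, 0o1000, 0o3452, 0, 0o177600]
SAMPLE_MEMORY_MAP = [(0, "EXEC"), (0o345200, "F11ACP"), (0o400000, "TASKX")]

physical = virtual_to_physical(0o123404, SAMPLE_KERNEL_PARS)
print(f"{physical:o}", find_component(physical, SAMPLE_MEMORY_MAP))
# prints: 350604 F11ACP
```

With these made-up values, virtual address 123404 falls in page 5 at offset 3404, so the physical address is 345200 + 3404 = 350604, which lands inside the component based at 345200.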

Crash Detective
----- ---------

A crash is a sequence of time-ordered events: the system is performing normally, the logical error occurs, the system continues processing, the physical error occurs, and the system crashes.

The element of time is crucial to crash analysis. A crash dump gives you a snapshot of the system at the physical point of failure. All registers, stacks, and variables are frozen at this point in time. We need to use this information to walk backwards in time to the logical error.

The ideal crash analysis utility would be a screen-mode debugger which single-steps backwards! Register and variable contents would be backdated as far as possible. Backwards debugging is possible because the kernel stack provides the past history of what happened in the system since the last switch from user to kernel state. Walking up and down the stack lets you reconstruct the immediate chain of events.

The easiest crashes to solve are those where the sequence of events is trap to executive (for example, the set event flag directive is issued), some processing, logical error, some more processing, physical error, and crash. When the logical error occurs in the same context as the physical crash point, it will probably take only one crash to solve the problem because all the history is captured by the crash dump.

The difficult crashes are those where the history from the logical error to the physical error is lost. There is no cookbook approach which applies. You have to collect the available evidence and use your judgement as to the next steps to take.

It helps if you can learn to think like a PDP-11. It is critical to remember a system crash is a completely logical event, at least from the point of view of the computer. If a routine has been successfully executed once, the exact same input to the same code will work on the second, third, and all subsequent calls. If a call to this routine crashes the system, we need to find out what has changed.

Occam's Razor also applies to crashes. Crashes often leave what seems to be unrelated or conflicting evidence. But when the crash is finally solved, the logical error turns out to be an obvious explanation for all the facts. As you accumulate evidence and symptoms about the problem, try to theorize the simplest glitch which would fully explain the facts. When you attempt to prove the theory, you will either find the bug or learn even more about the problem.

This is the point where an expert system becomes valuable. The evidence gleaned from crash dumps can point to actual problems if you apply a knowledge database to the data. For instance, if you tell me one of your symptoms is that your system clock occasionally loses 18.2 minutes, I know the problem is a device forking twice. This and probably another hundred different rules can be coded.

Tricks
------

Unfortunately, it would take hundreds of pages to express the same knowledge as printed material because in effect these rules are all that I and others know about RSX. Experience is probably the most difficult subject you can attempt to document. What I can cover is some of the tools and tricks I use to solve the tough crashes.

The first tool is what could be called the crash dump editor - the ZAP utility. ZAP lets you look at any word in any file. This is perfect for looking around in crash dump files. ZAP uses an ODT-like syntax and is fully explained in the RSX Utilities Manual.

ZAP is the only RSX utility which does not accept the filename on the initial command line. Use the following sequence to open a crash dump file:

        >ZAP
        ZAP>filename/AB/RO

The /AB switch tells ZAP to open the file in absolute mode. The /RO switch prevents you from accidentally modifying the crash file. Nothing can be more frustrating than spending 12 hours chasing down a bogus piece of evidence you put into the dump.

You use ZAP to dig out evidence from the crash dump.
For instance, suppose you have looked up the physical crash point in the source code and found the instruction we failed on was:

        MOV     U.RED(R5),R5    ;Get next redirect chain

However, R5 at the time of the crash is not odd. What can explain the problem? ZAP commands can be used to find that the MOV instruction has been clobbered.

ZAP can only access 32KW from a given base address. When looking at a crash dump file, it is useful to set relocation registers to the base of various components. For example, 1:0;1R sets relocation register 1 to physical memory address 0 (block 1, byte 0).

Assume you want to examine location 123404 in F11ACP. According to the crash listing, F11ACP is located at physical address 345200. Either of the following command sequences examines the desired address:

        _346:200-120000;2R      or      _346:200;2R
        _2,123404/                      _2,3404/

It is not difficult to write your own programs to do special analysis of crash dump files, as the file format is fixed length 512 byte records. For example, suppose your analysis shows something is illegally referencing location 30 (EMT trap) and storing a 2456 in it. These numbers have to come from somewhere. It is worth a shot to write a throwaway program to read the entire crash dump file and output all occurrences of 30 or 2456.

These tricks and more are discussed in the workbook Ken Johnson and I used for the Crash Dump Analysis presymposium seminar. The book can be found on the Fall 1982 RSX SIG tape.

It is probably too late in the lifetime of RSX to develop a crash analysis expert system. On the other hand, I have heard estimates that as many as 100,000 RSX systems have been sold. If each system crashes only once a month, 30 systems have crashed in the 15 minutes since you started reading this article. Is your system still up?
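
As a postscript to the throwaway-program trick in the Tricks section, here is a minimal sketch of such a scanner in Python. It is an illustration under stated assumptions, not a tested RSX utility: the function names are invented, the dump is assumed to be small enough to read into memory, words are assumed to be stored low byte first as on the PDP-11, and block 1, byte 0 is taken to be the first byte of the file, following the 1:0 example given for ZAP. Hits are printed in octal block:offset form so they can be examined with ZAP.

```python
import struct

BLOCK_SIZE = 512  # crash dump records are fixed length 512 byte blocks


def find_words(dump_bytes, targets):
    """Yield (block, byte_offset, value) for each 16-bit word in targets.

    Words are read little-endian on even byte boundaries. Blocks are
    numbered from 1, matching the block:offset notation used with ZAP.
    """
    wanted = set(targets)
    usable = len(dump_bytes) - (len(dump_bytes) % 2)  # ignore a stray odd byte
    for position in range(0, usable, 2):
        (value,) = struct.unpack_from("<H", dump_bytes, position)
        if value in wanted:
            yield position // BLOCK_SIZE + 1, position % BLOCK_SIZE, value


def scan_file(path, targets=(0o30, 0o2456)):
    """Print every occurrence of the target words in a dump file, in octal."""
    with open(path, "rb") as dump:
        data = dump.read()
    for block, offset, value in find_words(data, targets):
        print(f"{block:o}:{offset:o}  {value:o}")
```

Calling scan_file with the name of a crash dump file lists every word equal to 30 or 2456. Expect many hits for a small value like 30; the interesting ones are those that sit next to each other or inside a data structure that should not contain them.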