See CRASH.CMD for figure.

Crashes - (Part 1)

Have you ever noticed how RSX says goodbye to you just as you are getting ready to leave on Friday for the big weekend in the mountains:

     CRASH -- CONT WITH SCRATCH MEDIA ON MT0

System crashes are a fact of life. While 5:00 pm Friday is certainly not a convenient time for a crash, neither is any other time of the week. A crash is never a welcome experience, but if you are prepared, crashes need not be fatal to your system or your home life.

This article lists the steps you should take at your site to prepare for when your RSX system crashes. Next month's article will discuss how to be a crash detective and solve the really tough crash dumps.

Crashes come in several forms. Most people equate RSX system crashes with the above message and the subsequent CPU halt. However, I define a crash as any system failure which can be fixed only by rebooting the system. This broader definition includes conditions such as CPU halts or uninterruptible loops, pool exhaustion, memory deadlocks, critical task failure (e.g. F11ACP aborts), and other miscellaneous errors such as removing INStall.

All computer systems crash. The suggestions in this article are mostly common sense and apply to handling system crashes for all systems. However, examples and specific details will be drawn from RSX systems.

The first question you need to answer regarding crashes is whether you need to bother with them at all. There is a dollar cost associated with handling and analyzing crashes. The cost comes mostly in the form of manpower, either your own or that of outside experts when you have no in-house systems expertise. The payback for the time and money invested in solving crashes comes from the production losses you avoid in the future.

A system crash is not a random event. Unlike a spurious hardware glitch, a crash has a cause: if you do not find and fix the problem which caused your system to crash, the system will crash again the next time the same set of events occurs.
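
The payback argument above can be made concrete with a little arithmetic. The following is only a rough sketch in Python: the function name, its parameters, and the figures in the usage note are hypothetical illustrations, not numbers from any real site.

```python
def crash_handling_payback(outage_cost_per_hour, hours_lost_per_crash,
                           repeat_crashes_avoided, analysis_hours,
                           analyst_rate):
    """Compare the cost of analyzing one crash with the losses it avoids.

    All inputs are estimates you supply for your own site (assumptions,
    not measured values). Returns (cost, savings, net).
    """
    # Manpower spent taking the dump, logging, and analyzing the crash.
    cost = analysis_hours * analyst_rate
    # Production losses avoided if the same crash never recurs.
    savings = (outage_cost_per_hour * hours_lost_per_crash
               * repeat_crashes_avoided)
    return cost, savings, savings - cost
```

For example, with a made-up outage cost of 500 per hour, 2 hours lost per crash, 3 repeat crashes avoided, and 8 hours of analysis at 60 per hour, `crash_handling_payback(500.0, 2.0, 3, 8.0, 60.0)` returns `(480.0, 3000.0, 2520.0)`. A net value at or below zero would suggest a site that can afford to ignore crashes.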

For the lucky sites which consider interruption of service to be only a minor annoyance, crashes can be ignored. However, most sites can calculate a real cost if their systems fail. The higher this cost, the more steps it is realistic to take to avoid system outages: 7x24 Field Service coverage, hot and cold system backups, or redundant systems. Crash handling and analysis procedures are an integral part of these steps.

There are three distinct phases to solving system crashes: collecting information at the time the system fails, checking the symptoms for any obvious causes, and finally systematically analyzing the crash dump and other data to pinpoint and fix the reason for the failure.

Collecting information about system failures presents some real problems. When your system is down you will be under a great deal of pressure to get the system back up. No one will be happy if it takes you 20 minutes to find and load a scratch magtape in order to get a crash dump. If you give in to peer pressure and simply reboot, the same problem could rear its head next month when you go into real production.

Responding to system failures is something you practice. Anyone who knows how to reboot the computer system should know how to take a crash dump and log the crash data.

Training to get crash dumps starts with the four different methods you can use to bring RSX systems to the standard crash routine. In the simplest case, the system has already printed the crash message on the console and you are ready to mount a scratch medium and proceed.

If the executive has broken to XDT instead, typing an "X" will cause XDT to jump to the crash routine.

RSX always stores a jump instruction to the crash routine in physical memory location 40. If the system is in a tight loop, halted, or in its console ODT mode, you can force a crash dump by starting execution at location 40. For instance, I would type 40G on the console of my PDP-11/24.
When you take this approach, you should write down the program counter (PC) at which the system halted, as the PC is not saved in the crash dump.

If you can still type MCR commands and want to force a crash, I recommend depositing a 101(8) at location 100(8) using the OPEN command.

     >OPEN 100/KNLD
     00000100 /xxxxxx 101

The system will crash with an odd-address trap the next time the system clock ticks (sometime in the next 1/60th of a second). This technique can also be used from the console when the system seems to be in a tight but interruptible loop. The advantage over the 40G mechanism is that the current program counter and processor status are pushed onto the stack by the clock interrupt.

If none of these methods results in the crash message, either the executive crash routine is corrupted or there is a severe hardware error. In either case it will be impossible to get a crash dump and you should simply log the failure and try to reboot the system.

Once your system halts after writing the crash message, you are ready to load the crash medium and write the crash dump. You selected the crash device when you generated your RSX-11M or RSX-11M-Plus system. Micro/RSX and pre-generated RSX-11M-Plus systems have loadable crash device support. In that case, obviously, the crash device support must have been loaded into memory prior to the system failure.

You should always choose magnetic tape devices over disk drives, and floppy disks over hard disks, as crash devices. If you must use a disk device, be sure you train your people to write-protect other disk drives. The crash routines write over the first 'n' blocks of a disk, which is a sure-fire method of destroying the boot block and home block of an RSX disk.

The system halts again after the crash dump has been written to the scratch medium. If you continue from this halt, the crash routine loops and writes another crash dump.
If a problem occurs writing the crash dump, such as a missing write ring on the tape, wait until the system halts again, fix the problem, and continue.

The various executive crash routines work by writing physical memory to the crash device until a non-existent memory error occurs. Thus the crash dump does not include any device or CPU registers located in the I/O page. If you suspect a hardware device error caused the crash, it is a good idea to manually examine the device registers from the console when the system halts after writing the crash dump.

A system crash is the wrong time to look through hardware manuals for various addresses. Put together a simple 1-2 page document with contact names, the reboot procedure, and any relevant CPU and device registers, and tape the list to the side of the CPU.

A system crash is also the wrong time to scramble around looking for a scratch medium. You should dedicate tapes or disks as crash media and not use them for any other purpose. You do not want to be in the situation where you lose a crash dump because you do not have a scratch tape. You then have to suffer another production loss before you can get the information needed to solve the problem and avoid system outages.

You are not quite finished when you have taken the crash dump and rebooted the system. The crash dump records what happened inside the computer at the time of the crash. You also need to record what was happening outside the system. Do not rely on your own memory or that of others. It may be days or months before you really dig into this crash dump.

This is really nothing more than informally asking different people what they were doing when the system crashed and what they think went wrong. Pay special attention to anything new which was just attempted. You should also look at the console terminal and other terminals for messages out of the ordinary. You are looking to get a picture of what was going on when the system crashed.
The resulting information needs to be logged in some fashion.

You do not expect system crashes. You probably have not allowed for the 2-3 hours it takes to start digging into a crash dump. For the time being, you take the crash dump, reboot the system, get the picture of what is going on, and get back to productive work. The crash becomes tomorrow's problem (unless of course the system crashes again in ten minutes).

Some sort of filing system for crash dumps is required. I use the piles-of-crash-listings approach. As soon as the system comes back up, I use the command file CRASH.CMD to read the dump to a disk file and generate an initial listing. I put all my crash dumps in the same account and name the files according to the date.

The crash listing becomes my file for this crash. I write all notes about the crash directly on the listing. If there is any console output or other notes to keep, I staple the paper to the crash listing. My filing system is the stack(s) of crash listings.

When I solve a crash, the crash dump file extension is changed from CDA to YEA. The notes on the crash listing are updated and all but the first few pages are thrown away. These cover sheets and any attachments are put into an actual file drawer.

Eventually crash files and listings accumulate and I need to recover the disk and office space. Any solved and dead-end crash dump files are appended to a BRU tape used just to hold such files. Again, just the cover sheets from the discarded crashes are kept and everything else is thrown away.

The full output from the Crash Dump Analyzer can easily use several trees to print (over 300 pages for large systems). I prefer to use just a few switches in the initial listing and get more detailed output later if I need it. The two most useful switches are /ACT and /PCB. The /ACT switch generates a list of all the active tasks in the system. The /PCB switch is useful because of the memory map it outputs.
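
The filing convention described earlier (dump files named according to the date, with the extension changed from CDA to YEA once a crash is solved) might be sketched as follows. This is a Python illustration only, not the contents of CRASH.CMD. The "CR" prefix and the yymmdd layout are assumptions made for the example, since the exact naming format is not given here, and file version numbers are ignored.

```python
from datetime import date


def dump_file_name(crash_date):
    """Build a dump file name from the crash date.

    Hypothetical layout: "CR" + yymmdd + ".CDA", which fits within a
    9-character name and 3-character extension.
    """
    return "CR" + crash_date.strftime("%y%m%d") + ".CDA"


def mark_solved(name):
    """Return the name a solved dump would carry (CDA becomes YEA)."""
    stem, _dot, ext = name.rpartition(".")
    if ext.upper() != "CDA":
        raise ValueError("not an unsolved crash dump file: " + name)
    return stem + ".YEA"
```

With these assumptions, a crash on 13 March 1987 would be filed as `CR870313.CDA` and renamed to `CR870313.YEA` once solved. Two crashes on the same day would need an extra sequence character, which this sketch does not handle.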

You start analyzing a crash dump by looking through the crash listing and getting a picture of what was happening inside the system when the crash occurred. You look for the obvious and the unexpected.

Probably the most common RSX problem is pool exhaustion. The beginning section of the crash listing shows if pool is low. If so, sometimes the initial listing shows what is consuming pool. Look for large I/O counts and long receive queues.

One example of the unexpected is active tasks which should not be active. Active tasks such as TKTN or PMD mean some task has aborted. If ERRLOG is active, some device errors are being logged. You should also look for the opposite: tasks which should be active but are not. The tasks ...LDR, F11ACP, and MCR... should be active at all times. Your application probably has similar tasks.

The active task portion of the crash listing should also be examined for tasks with outstanding I/O. This is probably not a direct indicator of the problem, but many system crashes are related to I/O in progress.

You now have pictures of what was happening inside and outside the system when the crash occurred. You are now ready to play detective (but you will have to wait for next month's exciting conclusion).
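
As a footnote to the active task checks described above, the same comparison can be written down as a short Python sketch. It assumes you have already copied the task names from the /ACT portion of the listing into a list by hand. It does not parse Crash Dump Analyzer output, and the function name and result layout are hypothetical.

```python
# Tasks whose presence in the active list deserves a second look.
UNEXPECTED = {
    "TKTN": "some task has aborted",
    "PMD": "some task has aborted",
    "ERRLOG": "device errors are being logged",
}

# Tasks that should be active at all times; add your application's own.
REQUIRED = ("...LDR", "F11ACP", "MCR...")


def review_active_tasks(active):
    """Return (suspicious, missing) for a list of active task names."""
    names = {name.strip().upper() for name in active}
    # Active tasks that normally should not be active.
    suspicious = sorted(n for n in names if n in UNEXPECTED)
    # Tasks that should be active but are not in the list.
    missing = [n for n in REQUIRED if n not in names]
    return suspicious, missing
```

For instance, `review_active_tasks(["...LDR", "MCR...", "TKTN", "MYAPP"])` returns `(["TKTN"], ["F11ACP"])`: TKTN is flagged because some task has aborted, and F11ACP is reported missing.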