Caught in the Act


All programmers encounter program errors. The majority are simple to track down
and fix. There is a straight line you can trace back from the point where the
error became visible to the point where it actually occurred.

There are two parts to any program error. The physical error is the point where
the program faulted. In RSX systems, this usually means a task abort,
infinite loop, or some other abnormal behavior. The logical error is the actual
point in the program which must be corrected. Debugging programs is simply a
matter of tracing back in time from the physical error to the logical error. 

Some fraction of program errors defy solution. There may not be any tracks left
by the time the physical error occurs. All you know for sure is that sometime
since the system was booted, a value in global common was corrupted or a vital
record in the master database file was overwritten. Such errors are analogous
to a lightning bolt striking a tree. It is obvious from the splinters what the
damage is, but it is impossible to trace back up to the exact point in the sky
where the bolt originated.

Once you have established the symptoms of a problem, there are a variety of
debugging techniques which let you catch the guilty party in the act.


Fortran Traceback
------- ---------

One error which is difficult to trace is a corrupted variable in a common area.
If you are programming with Fortran, you can use the Fortran module traceback
mechanism to track down the guilty module. This technique lets you check the
problem variable before and after each subroutine call. If a bad value is
detected, you can trap the program immediately. 

Unless a Fortran module is compiled with /TR:NONE, the first action any Fortran
module takes is to call the NAM$ routine to add the module's name to the
traceback list. When the subroutine exits, control is passed back to NAM$ and
the module name is removed from the list. 

By adding some code to NAM$, you can use it to check for the problem symptoms
on entry and exit from each subroutine. Thus you can narrow the logical problem
to the specific module. Figure 1 shows the disassembled Macro-11 code for NAM$
(comments added by author). 

Figure 2 shows how NAM$ was changed to track down a specific problem. The
updated NAM.OBJ module is included explicitly in the task build of the target
program. In this case, variable FOO in common area BAR should never be set to
zero. The Fortran listing shows FOO is offset 102 (octal) bytes from the start
of common area BAR. When a zero value is detected, a Fortran error 98 is
declared (user-declared error). The module listed first in the traceback chain
has set FOO to zero.
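
The entry/exit check can be modeled outside RSX. The sketch below is a Python
model of the idea only, not the Macro-11 code of Figure 2. The names
COMMON_BAR, traceback, nam_hook, and WatchTripped are hypothetical; a wrapper
stands in for NAM$ and an exception stands in for Fortran error 98.

```python
# Python model only: every name here is hypothetical, not part of RSX
# or the Fortran run-time system.

COMMON_BAR = {"FOO": 5}   # stands in for variable FOO in common area BAR
traceback = []            # stands in for the Fortran traceback list


class WatchTripped(Exception):
    """Plays the role of the user-declared Fortran error 98."""

    def __init__(self, when, chain):
        super().__init__(f"FOO is zero on {when}; traceback chain: {chain}")
        self.chain = chain


def check_foo(when):
    # The symptom being watched for: FOO must never be zero.
    if COMMON_BAR["FOO"] == 0:
        # Most recently entered module is listed first.
        raise WatchTripped(when, list(reversed(traceback)))


def nam_hook(func):
    """Bracket a routine the way NAM$ brackets every Fortran module."""
    def wrapper(*args, **kwargs):
        traceback.append(func.__name__)      # entry: add name to the list
        check_foo("entry to " + func.__name__)
        result = func(*args, **kwargs)
        check_foo("exit from " + func.__name__)
        traceback.pop()                      # exit: remove name again
        return result
    return wrapper


@nam_hook
def good():
    COMMON_BAR["FOO"] += 1


@nam_hook
def bad():
    COMMON_BAR["FOO"] = 0                    # the logical error


@nam_hook
def main():
    good()
    bad()


def run_demo():
    """Reset the model, run it, and return the chain seen at the trap."""
    COMMON_BAR["FOO"] = 5
    traceback.clear()
    try:
        main()
    except WatchTripped as err:
        return err.chain
    return None
```

Calling run_demo() returns the chain with the guilty routine first, because the
exit check runs while that routine's name is still on the list.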

 
T-Bit Trap
----- ----

The technique above only works because of the method used to implement Fortran
module traceback. Some other technique must be used if you are coding in Macro-11 or
using a language which does not have a traceback facility similar to Fortran. 

Although it generates a fair amount of overhead, it is possible to trace every
instruction executed by a task. The trace bit (T-bit) in the Processor Status
Word causes a trap after every instruction. One little-known debugging aid
supplied by Digital that uses this feature is LB:[1,1]TRACE.OBJ. When linked to
a task, TRACE outputs to the console listing device (CL:) a register dump for
every instruction. The routines traced can be controlled by task build
parameters. TRACE is fully documented in Chapter 6 of the IAS/RSX-11 ODT
Reference Manual. 

The Digital TRACE module produces volumes of output. It may be simpler to code
a trace module which watches for your specific error. Figure 3 shows a trace
module (TRACE.MAC) which again watches variable FOO, checking that it stays
non-zero, and halts when the error (a zero value) is detected. In this case,
the technique detects the exact instruction causing the error.

The tracing module is designed to be linked to the target task as a debugging
aid. This is done by specifying the module in the task build command file as
TRACE.OBJ/DA. When the task is started, the tracing module is called. The trace
module first establishes a T-trap handler using the SVDB$ directive. It then
sets the T-bit and passes control to the actual task entry point by faking an
interrupt return.

RSX traps after every instruction and passes control to the T-trap handler.
If no error is detected, the module simply returns using the RTT instruction
instead of RTI to prevent an immediate trap. While the sample trace module
halts on an error, you can modify it to take whatever action is appropriate.
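
The trap-after-every-instruction idea can be illustrated with a toy machine.
This is a Python model under stated assumptions, not PDP-11 code and not the
TRACE.MAC module of Figure 3. The two-opcode instruction set and the names
run_traced, foo_is_zero, and find_culprit are hypothetical.

```python
# Python model only: a toy machine with a hypothetical instruction set.

def run_traced(program, memory, trap_handler):
    """Execute one instruction at a time, trapping after each one."""
    pc = 0
    while pc < len(program):
        op, target, value = program[pc]
        if op == "MOV":                  # store a value
            memory[target] = value
        elif op == "ADD":                # add to the stored value
            memory[target] += value
        else:
            raise ValueError(f"unknown opcode {op!r}")
        # The "T-bit trap": the handler sees the state this instruction left.
        if trap_handler(pc, memory):
            return pc                    # halt; pc names the guilty instruction
        pc += 1
    return None                          # ran to completion, no error seen


def foo_is_zero(pc, memory):
    """Trap handler: signal an error when FOO has become zero."""
    return memory["FOO"] == 0


PROGRAM = [
    ("MOV", "FOO", 3),
    ("ADD", "COUNT", 1),
    ("ADD", "FOO", -3),                  # the logical error: FOO becomes zero
    ("ADD", "COUNT", 1),
]


def find_culprit():
    """Run the toy program under trace and return the offending index."""
    memory = {"FOO": 1, "COUNT": 0}
    return run_traced(PROGRAM, memory, foo_is_zero)
```

Because the handler runs after each instruction completes, the index returned
by find_culprit() is the instruction that stored the bad value, not a later
one that merely tripped over it.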


Fortran Execution Profile
------- --------- -------

The Fortran traceback mechanism and the T-bit trap can also be used to profile
a Fortran program's execution. The NAM$ routine lets you total the number of
times each subroutine is called. The T-bit trap lets you count each instruction
executed. 

The module TRACK.MAC combines the two techniques. You start execution profiling
at the appropriate point in your task with a call to TRKBEG. This routine takes
three arguments: the number of subroutine counting entries, the starting
address of the entries, and an overflow entry. Each subroutine entry takes 3
double precision words: the Radix-50 subroutine name, the number of times the
routine is called, and the number of instructions executed. All values should
initially be set to zero. The execution profile is stopped with a call to
TRKEND. 
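
The bookkeeping described above can be sketched as a small table. This is a
Python model, not TRACK.MAC. The Profile class and its methods are
hypothetical. Two behaviors are assumptions the text does not state: a zeroed
entry is claimed by the first routine that needs it, and routines that do not
fit in the table are charged to the overflow entry.

```python
# Python model only: class and method names are hypothetical.

class Profile:
    def __init__(self, n_entries):
        # Each entry mirrors the three fields in the text:
        # routine name, number of calls, number of instructions.
        self.entries = [self._blank() for _ in range(n_entries)]
        self.overflow = self._blank()
        self.stack = []                  # routines currently active

    @staticmethod
    def _blank():
        return {"name": None, "calls": 0, "instructions": 0}

    def _find(self, name):
        for entry in self.entries:
            if entry["name"] == name:
                return entry
            if entry["name"] is None:    # assumption: claim first free entry
                entry["name"] = name
                return entry
        return self.overflow             # assumption: table full

    def enter(self, name):
        """Subroutine entry (the NAM$ side): count the call."""
        entry = self._find(name)
        entry["calls"] += 1
        self.stack.append(entry)

    def leave(self):
        """Subroutine exit: charge later instructions to the caller."""
        self.stack.pop()

    def step(self):
        """One executed instruction (the T-bit side)."""
        if self.stack:
            self.stack[-1]["instructions"] += 1


def run_demo():
    """Profile four calls using a table with room for two routines."""
    prof = Profile(n_entries=2)
    calls = [("INIT", 3), ("SOLVE", 10), ("SOLVE", 10), ("PRINT", 4)]
    for name, n_instructions in calls:
        prof.enter(name)
        for _ in range(n_instructions):
            prof.step()
        prof.leave()
    return prof
```

In this run INIT and SOLVE each claim an entry, and PRINT, arriving after the
table is full, is counted in the overflow entry.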

The program TEST shows a simple example of how TRACK can be used to profile
execution. This technique can be used to find out where a program is spending
most of its execution time. It is also useful for comparing two or more
different algorithms.