1. Figure 1 is TEST.FTN.
2. Figure 2 is CHKP.MAC.
3. TSTROT.MAC, TSTTBL.MAC, TSTUTL.MAC, and TST.CMD should be moved to ARIS.


		How Fast (Slow) Is RSX?

In many real-time systems, milliseconds can make the difference between
success and failure of an application. This article looks at how fast (or
slow) RSX is. Simple test programs have been written to measure the execution
speed of most RSX directives. Some of the results are surprising.

A few months ago I was looking at an RSX-11M-Plus task accounting report from
a system I was seeing that day for the first time. The report covered a
ten-minute interval when the application was in full production. A quick
glance through the listing revealed six tasks that consumed the majority of
CPU time.

These tasks were issuing system directives at rates as high as 400
directives/second. The RMDEMO system page displayed directive rates for the
whole system as high as 800 directives/second. The accounting report showed
some of the tasks context switching more than 100 times a second.

These rates seemed excessive, but I had nothing on which to base this
judgement. If a typical directive takes one millisecond to execute, 800
directives/second would consume 80% of the CPU, a very significant load! But if
the average directive takes 0.1 milliseconds, only an insignificant 8% of the
CPU is involved.
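The arithmetic behind these two cases is just directive rate times per-directive cost, divided by the 1,000 milliseconds available in each second. A minimal Python sketch of that calculation (the function name `cpu_fraction` is illustrative, not part of TST or RSX):

```python
def cpu_fraction(directives_per_sec: float, ms_per_directive: float) -> float:
    """Fraction of one CPU spent processing directives.

    Each second offers 1000 ms of CPU time, so the load is
    rate * cost / 1000.
    """
    return directives_per_sec * ms_per_directive / 1000.0


print(f"{cpu_fraction(800, 1.0):.0%}")  # 80%
print(f"{cpu_fraction(800, 0.1):.0%}")  # 8%
```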

A few simple tests showed 800 directives/second would be a substantial load on
this particular PDP-11 model. A close examination of the source code revealed a
frequently called subroutine which was always disabling and enabling
checkpointing around some of its operations. The programs using this subroutine
were the most important tasks in the system and thus always in memory. We
installed the tasks as noncheckpointable, removed the directives from the
subroutine, and eliminated over 300 directives/second. 

This experience led to further questions about the speed of specific RSX
features. What is the overhead of a QIO? How many bytes of data can be moved
around a system using send/receive messages? How long does it take to context
switch or execute an AST? Is stopping more efficient than waiting? Are there
any differences between RSX-11M and RSX-11M-Plus, or between old and new
versions of the RSX operating systems?

The answers to these questions can be found with some very simple test
programs and a standalone system. The standard 60-cycle system clock ticks only
once every 16.7 milliseconds, which is not enough resolution to measure how
long it takes to execute a single directive. However, accurate results can be
obtained by measuring the elapsed time for 10,000 iterations of the directive.
A standalone system is needed; otherwise the elapsed wall-clock time will
include execution time for other tasks.
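To see why looping helps, consider that any elapsed-time reading from a 60-cycle clock can be off by up to one tick. Spreading that one-tick uncertainty across N iterations divides it by N. A small Python sketch of the arithmetic, assuming a worst-case error of exactly one tick (`per_iteration_error_ms` is a hypothetical helper):

```python
TICK_MS = 1000.0 / 60.0  # one tick of the 60-cycle clock, in milliseconds


def per_iteration_error_ms(iterations: int, tick_ms: float = TICK_MS) -> float:
    """Worst-case timing error per iteration, assuming the total elapsed
    time is known only to within one clock tick."""
    return tick_ms / iterations


for n in (1, 10_000):
    print(f"{n:>6} iterations: +/- {per_iteration_error_ms(n):.4f} ms per iteration")
```

At 10,000 iterations the worst-case error is under 0.002 milliseconds per iteration, far below the directive times reported in this article.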

The initial test program I used was a Fortran-77 main program named TIMER,
which timed how long it took to make a given number of calls to the subroutine
TEST. I then wrote a whole series of TEST subroutines, each one issuing some
directive sequence. Figure 1 shows the Fortran-77 main program. Figure 2 lists
a TEST subroutine named CHKP which issues the disable and enable checkpointing
directives.
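TIMER itself is the Fortran-77 program in Figure 1. Purely as an illustration of the same structure, here is a Python analogue: read a clock, call the test routine N times, read the clock again, and divide. `time.perf_counter` stands in for the RSX system clock, and `timer` and `null_test` are hypothetical names; Python call overhead bears no relation to PDP-11 timings.

```python
import time
from typing import Callable, Tuple


def timer(test: Callable[[], None], iterations: int = 10_000) -> Tuple[float, float]:
    """Call `test` `iterations` times; return (calls per second, ms per call)."""
    start = time.perf_counter()            # clock reading before the loop
    for _ in range(iterations):
        test()
    elapsed = time.perf_counter() - start  # total elapsed seconds
    return iterations / elapsed, 1000.0 * elapsed / iterations


def null_test() -> None:
    """Analogue of the Null test: return immediately."""
    return None


rate, ms = timer(null_test)
print(f"null test: {rate:,.0f}/sec  {ms:.5f} ms")
```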

The first couple of TEST subroutines led to other tests for such things as
mark times, send/receive, and event flags. Soon I had accumulated 23 different
TEST subroutines and task-built 23 different tasks. A more systematic approach
seemed appropriate.

I started over and wrote the TST utility. TST contains a database of almost
all RSX-11M and RSX-11M-Plus directives. The TST command line selects a test by
directive name and specifies the loop count. The resulting timings are output
to the user terminal. TST uses only the basic PDP-11 instruction set and should
run on any PDP-11 model.
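TST's own command syntax and database layout are not reproduced here. Only as an illustration of a test database selected by directive name, the following Python sketch assumes a hypothetical `NAME [COUNT]` command line and a dictionary keyed by directive name; the placeholder routines stand in for code that would issue the real directives.

```python
from typing import Callable, Dict, Tuple


def gssw_test() -> None:
    """Placeholder: a real test would issue the GSSW directive here."""


def setf_test() -> None:
    """Placeholder: a real test would issue the SETF directive here."""


# Hypothetical test database keyed by directive name.
TEST_TABLE: Dict[str, Callable[[], None]] = {
    "GSSW": gssw_test,
    "SETF": setf_test,
}


def parse_command(line: str, default_count: int = 10_000) -> Tuple[str, Callable[[], None], int]:
    """Split an assumed 'NAME [COUNT]' command into (name, routine, loop count)."""
    fields = line.split()
    if not fields:
        raise ValueError("no test name given")
    name = fields[0].upper()
    if name not in TEST_TABLE:
        raise ValueError(f"unknown test: {name}")
    count = int(fields[1]) if len(fields) > 1 else default_count
    return name, TEST_TABLE[name], count


print(parse_command("setf 5000")[::2])  # ('SETF', 5000)
```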

TST is too long to reproduce in this article. The TST sources and a command
file to execute all available tests can be obtained through ARIS. The program
is also on the San Francisco Fall 1986 RSX SIG Tape. The submission also
includes 15 small test tasks that act as targets for some of the directives.

The initial tests simply establish the baseline performance of the TST utility
and the particular PDP-11 CPU. I ran all tests on a PDP-11/24 running
RSX-11M-Plus V3.0, Autopatch A. The Null test measures how fast the test loop
runs when the test itself is just a return instruction. My PDP-11/24 could
execute 57,142 null tests/second, so the test loop code takes only about 0.0175
milliseconds per iteration. We will see that this is only a small fraction of
the time taken by even the fastest RSX directive.

The Block Move test measures how long it takes to copy a 512-byte block of
data using eight passes through a series of 32 MOV instructions (each MOV
copies one 16-bit word, so 8 x 32 x 2 bytes = 512 bytes). This test gives us a
means of comparing directive execution time to an easily visualized workload;
that is, we can state that some directive took 2.1 block moves to execute. The
PDP-11/24 system executed 733 block moves per second, or 1.36 milliseconds to
move one block of data.

The Get Sense Switch directive (GSSW) is effectively the RSX null directive.
The actual directive processing takes only one instruction. As the test results
below show, the common processing performed for any directive is about 1/3 of a
block move. 

	Block Move	 733/sec  1.36 ms  1.00 bm	
	GSSW		2310/sec  0.43 ms  0.31 bm
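Each row in these tables gives the same measurement three ways: calls per second, milliseconds per call (1000 divided by the rate), and cost in block moves (733 divided by the rate, since one block move takes 1/733 second on this PDP-11/24). A minimal Python sketch that derives the last two columns from the first; the helper names are illustrative:

```python
BLOCK_MOVES_PER_SEC = 733  # Block Move rate measured on the PDP-11/24


def ms_per_call(per_sec: float) -> float:
    """Milliseconds per call, from calls per second."""
    return 1000.0 / per_sec


def block_moves(per_sec: float) -> float:
    """Cost of one call in block moves (one block move = 1/733 second)."""
    return BLOCK_MOVES_PER_SEC / per_sec


def table_row(name: str, per_sec: float) -> str:
    return (f"{name:<12}{per_sec:>6.0f}/sec "
            f"{ms_per_call(per_sec):6.3f} ms {block_moves(per_sec):6.3f} bm")


print(table_row("Block Move", 733))  # 1.364 ms, 1.000 bm
print(table_row("GSSW", 2310))       # 0.433 ms, 0.317 bm
```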

Some of the fastest RSX directives are those that set, clear, and read event
flags. The following timings are for local flags:

	SETF		1790/sec  0.55 ms  0.41 bm
	CLEF		1810/sec  0.55 ms  0.40 bm
	RDEF		1810/sec  0.55 ms  0.40 bm

Using global event flags is only slightly slower than using local flags.
However, a uniform 0.13 millisecond difference could be seen between local and
group-global flags, as the Set Event Flag directive shows:

	SETF/Local	1790/sec  0.55 ms  0.41 bm
	SETF/Global	1760/sec  0.57 ms  0.42 bm
	SETF/Group	1440/sec  0.69 ms  0.51 bm

Surprisingly, it is slightly faster to wait for the logical OR of event flags
than to wait for a single event flag. Also, waiting is slightly faster than
stopping on an event flag. The following results are the test times when
waiting on event flag 1, which has already been set:

	WTLO		1210/sec  0.83 ms  0.61 bm
	WTSE		1140/sec  0.88 ms  0.64 bm
	STSE		1120/sec  0.89 ms  0.65 bm

Some directives are almost always issued in pairs. These include enable and
disable checkpointing, and enable and disable AST recognition. These directive
pairs are very fast. But as discussed in the introduction, these directives are
often used around frequently called critical code segments. The CPU time used
for these directives becomes very significant when they are issued 100 or more
times per second.

	ENCP/DSCP	1050/sec  0.95 ms  0.70 bm
	ENAR/DSAR	 974/sec  1.03 ms  0.75 bm

Perhaps the most common RSX directive is the QIO. The QIO directive is tested
by issuing read virtual functions to the null device. The first test issues
separate QIO and WTSE (wait for single event flag) directives. The second test
uses the more common and more efficient QIOW form.

	QIO/WTSE	 328/sec  3.05 ms  2.23 bm
	QIOW		 401/sec  2.49 ms  1.83 bm

As the above numbers show, the QIO and QIOW directives take roughly four to
six times the processing needed for simple event flag directives. However, the
QIO is certainly not the slowest RSX directive. The TST utility showed that the
new Parse FCS and Parse RMS directives win the award for slowest:

	PFCS		  45/sec  22.2 ms 16.29 bm
	PRMS		  45/sec  22.2 ms 16.29 bm

Perhaps the second most common set of directives issued by RSX applications is
the PLAS directives. A new feature of RSX-11M-Plus V3.0 is the fast-mapping
mechanism, and the TST utility results show the new feature can be well worth
the time it takes to convert MAP$ directives to the new calling sequence.

	MAP		 555/sec  1.80 ms  1.32 bm
	fast no length  5833/sec  0.17 ms  0.13 bm
	fast w/ length  3488/sec  0.29 ms  0.21 bm

A fast-mapping call with no length change can reposition you inside a region
10 times faster than the equivalent MAP directive. This is the normal case, as
you almost always set the window size to the full 4KW boundary. Even if you
change the length, the fast-mapping feature is a big winner. The TST utility
showed that the execution speed of the MAP directive was the same whether the
call changed the offset into the region, the window length, or even the region
being mapped.

The send and receive directives are the most common method used for RSX
intertask communication. Both RSX-11M and RSX-11M-Plus offer a fixed-length
(13-word) mechanism. RSX-11M-Plus extends the mechanism to any length up to 256
words. The simplest case measured by TST is a send to itself followed by one
receive. While the 13-word SDAT/RCVD test executes twice as fast as the
256-word variable send and receive, its effective data rate is only 1/10 as
fast. The figures in parentheses are effective data rates in kilobits per
second.

	SDAT/RCVD	 226/sec  4.42 ms  3.24 bm ( 47 kb)
	VSDA/VRCD	 112/sec  8.93 ms  6.54 bm (458 kb)
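The kilobit figures follow from the message rate, the message length in words, and 16 bits per PDP-11 word. A Python sketch of that calculation for the two self-send tests (`kilobits_per_sec` is a hypothetical helper):

```python
BITS_PER_WORD = 16  # a PDP-11 word is 2 bytes


def kilobits_per_sec(messages_per_sec: float, words_per_message: int) -> float:
    """Effective data rate of a send/receive test, in kilobits per second."""
    return messages_per_sec * words_per_message * BITS_PER_WORD / 1000.0


fixed = kilobits_per_sec(226, 13)      # SDAT/RCVD, 13-word messages
variable = kilobits_per_sec(112, 256)  # VSDA/VRCD, 256-word messages
print(f"fixed: {fixed:.1f} kb/s  variable: {variable:.1f} kb/s  ratio: {fixed / variable:.2f}")
# fixed: 47.0 kb/s  variable: 458.8 kb/s  ratio: 0.10
```

The ratio of about 0.10 is the "1/10 as fast" figure in the text.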

The above tests are unrealistic, as very little useful work is accomplished by
sending data to yourself. The following tests show the TST utility sending data
to other tasks, using different techniques to wake up the receiving task: event
flags, ASTs, receive-data-or-stop, and receive-data-or-exit. Note that the
receiving tasks always do two receives for each message sent. The second
receive is needed to detect that no more messages are queued from the
lower-priority TST utility.

	SDAT/RCVD flag	 126/sec  7.94 ms  5.82 bm ( 26 kb)
	SDAT/RCVD AST	 126/sec  7.94 ms  5.82 bm ( 26 kb)
	SDAT/RCST 	 123/sec  8.13 ms  5.96 bm ( 25 kb)
	SDRC/RCVX 	  81/sec 12.34 ms  9.05 bm ( 17 kb)

	VSDA/VRCD flag	  79/sec 12.66 ms  9.28 bm (324 kb)
	VSDA/VRCD AST	  79/sec 12.66 ms  9.28 bm (324 kb)
	VSDA/VRCS 	  77/sec 12.99 ms  9.52 bm (315 kb)
	VSRC/VRCX	  59/sec 16.95 ms 12.42 bm (242 kb)

In all of the tests above, the receiving task was fixed in memory, so no disk
load time is involved. The send-and-request / receive-and-exit tests show the
penalty paid for starting and stopping a task.

The TST utility results can also be used to measure how much CPU time is used
to process an AST or to context switch between two tasks. AST time is measured
by adding an AST to the QIOW and simple send/receive data tests. The results
show that the time RSX uses to declare and exit an AST is about the same as one
block move.

	QIOW w/AST	 257/sec  3.89 ms  2.85 bm
	QIOW		 401/sec  2.49 ms  1.83 bm
	------------------------------------------
	AST processing	          1.40 ms  1.02 bm

	SDAT/RCVD w/AST	 171/sec  5.85 ms  4.29 bm
	SDAT/RCVD	 226/sec  4.42 ms  3.24 bm
	------------------------------------------
	AST processing	          1.43 ms  1.05 bm

The same technique can be used to measure task context switching. The time
needed for send/receive processing within one task can be compared against the
time needed for the same set of directives split between two tasks.

	S/R (2 task)	 126/sec  7.94 ms  5.82 bm
	S/R (1 task)	 150/sec  6.67 ms  4.89 bm
	------------------------------------------
	Context switch	          1.27 ms  0.93 bm
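Both the AST and context-switch figures come from subtracting the time per call of a baseline test from that of a variant. A Python sketch of that subtraction, working from the raw rates (the helper name `overhead_ms` is illustrative). Because it starts from unrounded rates rather than the already-rounded millisecond columns, its results can differ slightly from the tables in the last digit.

```python
BLOCK_MOVE_MS = 1000.0 / 733  # one block move on the PDP-11/24, in ms


def overhead_ms(rate_with: float, rate_without: float) -> float:
    """Per-iteration cost of the added work: time per call of the variant
    minus time per call of the baseline, both derived from calls/second."""
    return 1000.0 / rate_with - 1000.0 / rate_without


cases = {
    "AST (QIOW)": (257, 401),
    "AST (SDAT/RCVD)": (171, 226),
    "Context switch": (126, 150),
}
for label, (with_rate, without_rate) in cases.items():
    extra = overhead_ms(with_rate, without_rate)
    print(f"{label:<16} {extra:5.2f} ms  {extra / BLOCK_MOVE_MS:5.2f} bm")
```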

The typical RSX system spends more time in the executive than executing your
application code. As these results show, RSX directives are not free. If
accounting and other data indicate that particular tasks are issuing large
numbers of directives, you should question what those tasks are accomplishing
with these directives. You will often find directives being issued inside a
loop that could be moved outside the loop.
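As a simple illustration of moving directives out of a loop, the Python sketch below counts the directives issued when a disable/enable checkpointing pair wraps each record versus the whole batch. `CountingTask`, `inside_loop`, and `outside_loop` are hypothetical; a real task would issue DSCP and ENCP rather than incrementing a counter. Hoisting changes behavior: checkpointing stays disabled for the whole batch, which is only acceptable if the task can tolerate that.

```python
class CountingTask:
    """Hypothetical stand-in for a task: counts directives instead of
    issuing them (a real task would trap to the executive each time)."""

    def __init__(self) -> None:
        self.directives = 0

    def issue(self, name: str) -> None:
        self.directives += 1


def inside_loop(task: CountingTask, records: list) -> None:
    for _ in records:
        task.issue("DSCP")  # disable checkpointing for this record
        # ... critical processing of one record ...
        task.issue("ENCP")  # enable checkpointing again


def outside_loop(task: CountingTask, records: list) -> None:
    task.issue("DSCP")      # disable once for the whole batch
    for _ in records:
        pass                # ... same critical processing ...
    task.issue("ENCP")      # enable once afterwards


a, b = CountingTask(), CountingTask()
inside_loop(a, list(range(100)))
outside_loop(b, list(range(100)))
print(a.directives, b.directives)  # 200 2
```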

Directive timings are also important for establishing the performance limits
of your system. The TST utility, or simple programs like TIMER, can easily find
the limits of your particular system.

I intend to publish a complete table of RSX directive timings in the future.
This table will include as many different PDP-11 CPUs and versions of RSX-11M
and RSX-11M-Plus as I am able to test. If you are able to run the TST utility
on your system, please forward a copy of the results to me care of the DEC
Professional.