.PAGE SIZE 55,80
.RIGHT MARGIN 80.LEFT MARGIN 5
.FIRST TITLE
.TITLE Programs for testing the speed of CPUs
.SUBTITLE B.#Z.#Lederman
.PARAGRAPH
We have applications which run on PDP-11/70s, and until recently there was no
other PDP-11 which was as fast. With the introduction of the PDP-11/84, it was
suggested that applications could be moved to the new machine, but we wanted
some method of verifying the speed of the 11/84, and there were no published
instruction times. I therefore wrote some programs which measure, with
reasonable accuracy, the speed at which a CPU executes instructions: at the
very least they give a method of comparing one CPU to another.
.PARAGRAPH
The method I use is quite simple: a single instruction is executed many times
(generally a minimum of 1,000,000 executions), and the time this takes is
measured. Because the largest single program segment on a PDP-11 is about
32,000 words, it is necessary to use a loop: a long sequence of instructions
is repeated as many times as needed to get the total number of executions
wanted. There is also the effect of cache: a short loop will fit entirely into
cache and will execute at the fastest possible speed; a longer loop will not
all fit into cache, and the effect of the speed of main memory can be seen.
Finally, it is necessary to test the speed of many different instructions (one
at a time) to get an adequate profile of a given CPU's performance.
.PARAGRAPH
The program S1D.MAC implements this type of test. Although I have tried to make
it as straightforward as possible, it should be remembered that this is
basically a diagnostic program, and that it does some things that I normally
wouldn't allow in a production program: specifically, it treats instruction
space as data, is self-modifying, and there are two tables which must be kept
in alignment manually. Sometimes this sort of thing is necessary in a
diagnostic program. While modifications would be fairly easy for a programmer
with some experience, I don't suggest making any changes unless you are fairly
certain you know what to do. (The program is not privileged, however, so the
worst that can happen is the program will abort: it won't crash your system.)
.PARAGRAPH
There are two global equates at the beginning of the program: $NREP is set to
the number of instructions wanted in the loop, and $NPAS is the number of times
the loop is executed. Any combination whose product is a large round number
will do, such as 1,000 and 1,000; 3,125 and 320; 2,500 and 400; or any
other suitable combination. Having the numbers multiply out to 1,000,000 or
10,000,000 isn't absolutely necessary, but it makes it easier to read out the
times: at 1,000,000 repetitions the time measured in seconds is the instruction
time in micro-seconds. Because the loop needs 3 instructions to form the
loop repeat logic, there is another global called $NREAL which is $NREP minus
3: this is the number of copies of the instruction under test. As long as the
loop is reasonably long (several hundred instructions or more) the effect of
the loop repeat instructions will be negligible. I considered making $NREP and
$NPAS task build time options, but it would have made the program far too
complicated. Don't be surprised if the program takes more time to assemble
when $NREP is large, as it takes more time to fill the ISTRT area.
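.PARAGRAPH
As a rough illustration of the arithmetic above (it is not part of the MACRO-11
programs), the relationship between the two equates, the loop overhead and the
measured time can be sketched in Python. The function names are invented for
this example, and the overhead of 3 instructions is simply the figure quoted
above.
.LITERAL
```python
def loop_plan(nrep, npas, overhead=3):
    """Return (total executions, tested instructions per pass, overhead fraction).

    nrep     -- instructions in the loop (the document's $NREP)
    npas     -- number of passes through the loop (the document's $NPAS)
    overhead -- instructions used for the loop repeat logic (3 in the text)
    """
    nreal = nrep - overhead        # corresponds to $NREAL = $NREP - 3
    total = nrep * npas            # every instruction executed, overhead included
    return total, nreal, overhead / nrep


def microseconds_per_instruction(elapsed_seconds, total_executions):
    """Average time per instruction, in micro-seconds."""
    return elapsed_seconds * 1_000_000 / total_executions


if __name__ == "__main__":
    # The three combinations mentioned in the text all give 1,000,000.
    for nrep, npas in ((1000, 1000), (3125, 320), (2500, 400)):
        total, nreal, fraction = loop_plan(nrep, npas)
        print(nrep, npas, total, nreal, round(fraction, 5))
    # With 1,000,000 executions, seconds read directly as micro-seconds.
    print(microseconds_per_instruction(0.3, 1_000_000))
```
.END LITERAL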
.PARAGRAPH
There are two tables which must match: the first starts at MNTAB, and is a list
of mnemonics which print out when you run the program to tell you what
instruction is being tested. This must match the set of instructions starting
at TABLE if you want to know what is going on. The instructions were placed in
their present order after running the program many times so the data comes out
in what I feel is the most useful form, though the instructions could be in any
order. All of the instructions are position independent, as they have to be
moved by the program into the execution area: testing other addressing modes
would be difficult as the addresses would all have to be adjusted, and I'm not
sure that the information obtained would significantly improve the comparisons
between machines. In this program, all instructions are one word long: another
program called S2D.MAC is identical except that two word instructions are
tested (actually instructions with addressing modes that cause the instruction
to be followed by one word of address information or literal data), and
similarly for S3D.MAC which tests three word instructions. Using three separate
programs makes things very much easier (for me, the programmer). There are also
a few special cases: the BRANCH instructions are set to always branch to the
next instruction, which allows them to be tested even though they are relocated
within the program. They have been selected so that one will always branch and
the other will never branch: on many machines there will be a difference in
execution speed for these two conditions. The other special case is the Jump to
Subroutine instruction: for this to be tested there must be a Return from
Subroutine instruction (which is in a fixed location), and the time of
execution really is for both instructions together. As most subroutines end
with a return, this should be a valid test as long as one remembers that the
times obtained are for two instructions. The data which is used in the
instructions is either absolute references or loaded into registers: the data
for the Multiply and Divide instructions may not be the best possible test of
these instructions, but it does work. Auto-increment and Auto-decrement
addressing modes are difficult to test as each test executes at least 1,000,000
instructions, and any continuous test would push the register out of
addressing range: the only way out I could see was to have the same instruction
do a decrement and increment together, so only two addressing locations are
used, and this can be seen in the MOV#-(SP),#(SP)+ test.
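.PARAGRAPH
The need to keep the mnemonic list and the instruction list in step can be
pictured with a small Python sketch. This is only an illustration of the idea,
not the layout of MNTAB or TABLE in the actual programs: the entries shown are
a handful of standard one word PDP-11 encodings chosen for the example, and the
function names are invented here.
.LITERAL
```python
# Hypothetical sample entries: (mnemonic text, one-word instruction in octal).
# The real MNTAB and TABLE contents and order are not reproduced here.
ENTRIES = [
    ("NOP", 0o000240),
    ("CLR R0", 0o005000),
    ("INC R0", 0o005200),
    ("MOV R0,R1", 0o010001),
    ("MOV -(SP),(SP)+", 0o014626),
]


def build_tables(entries):
    """Derive both tables from one list so they cannot drift apart."""
    mnemonics = [name for name, _ in entries]
    words = [word for _, word in entries]
    return mnemonics, words


def tables_aligned(mnemonics, words):
    """Minimal check for two tables that are maintained by hand:
    same number of entries, and every instruction fits in one 16-bit word."""
    same_length = len(mnemonics) == len(words)
    one_word_each = all(0 <= w <= 0o177777 for w in words)
    return same_length and one_word_each


if __name__ == "__main__":
    names, words = build_tables(ENTRIES)
    for name, word in zip(names, words):
        print(f"{name:18s} {word:06o}")
    print(tables_aligned(names, words))
```
.END LITERAL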
.PARAGRAPH
It should be noted that these programs include Floating Point Processor
instructions: though it would be easy to include an assembly condition to
include these only when wanted, I haven't been able to get around to doing that
yet. Persons wishing to test machines which don't have FPP should
conditionalize or comment out the FPP instructions: this would also apply to
anyone who does not have EIS instructions (MUL, DIV, etc.). The programs
do not test FIS or CIS instructions, because none of the machines I have access
to have these instructions.
.PARAGRAPH
In the main body of the program there is space reserved beginning at ISTRT:
this is initially filled with No Operate instructions, and the program fills
this space with one of the instructions from the TABLE in order to test its
execution time. The system directive GTIM$ (Get Time) is used just before the
start of the loop and just after the loop to see how long it took: as system
time is delivered in 0.1 second intervals, it is necessary to make the test
loop long enough to be greater than this. The fastest 11/70 and 11/84
instructions are about 0.3 micro-seconds long, so if a 1,000,000 instruction
loop is used, the measured times are about 0.3 seconds, only a little above the
measurement granularity. Though this is enough for an approximate comparison,
running a 10,000,000 instruction test brings the shortest times up to about 3.0
seconds, and gives one more significant digit of data. Running tests longer
than this should not be necessary. For slower machines (such as the PRO-350
and 11/23), 10,000,000 instruction loops are nice but a little slow, and
1,000,000 instruction loops are probably adequate if you don't have much test
time available.
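.PARAGRAPH
The effect of the 0.1 second clock granularity can be put into numbers with a
short Python sketch. It assumes, as a simplification, that a reading may be
off by up to about one clock interval; the function names are invented for
this example.
.LITERAL
```python
def expected_elapsed(instruction_us, executions):
    """Elapsed seconds for a test, given micro-seconds per instruction."""
    return instruction_us * executions / 1_000_000


def relative_uncertainty(elapsed_seconds, tick_seconds=0.1):
    """Approximate worst-case error as a fraction of the measured time,
    assuming the reading can be off by about one clock interval."""
    return tick_seconds / elapsed_seconds


if __name__ == "__main__":
    # A 0.3 micro-second instruction at the two test lengths from the text.
    for executions in (1_000_000, 10_000_000):
        elapsed = expected_elapsed(0.3, executions)
        print(executions, round(elapsed, 3),
              round(relative_uncertainty(elapsed), 3))
```
.END LITERAL
.PARAGRAPH
Under that assumption the longer test cuts the uncertainty by a factor of ten,
which is the extra significant digit mentioned above.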
.PARAGRAPH
The use of the system directive to measure time brings up several important
factors in running these tests, of which the first is the amount of time it
takes to execute the GTIM$ directive. I measured this on our 11/70 by having a
program issue several GTIM$ directives sequentially and measuring the time
each directive took to execute with SPM-11: this package from DEC hooks into
the executive and measures time in 0.00001 second intervals (100,000 per
second). The GTIM$ took about 0.0005 seconds (500 micro-seconds) to execute,
which is so much less than the minimum resolution of 0.1 second that it can be
ignored if the loop is long enough to register at least 0.1 seconds. A related
point is that when the test program is running, it should be the only program
running, and at a high enough priority that it will "grab" the CPU more or less
exclusively (the CPU will still have to service the system clock if you want
the test times). I have run the tests on otherwise idle systems, and at a
priority of 151 or 152 which is high enough to be above just about everything
except MCR: I want the test to be below MCR in case I want to abort it. While
this might make the instructions test a little slow, it is still a valid
comparison between different machines, and when I have compared the test
results obtained with the published instruction times in the processor
handbooks, I have obtained quite good matches, indicating that the program is
testing accurately. Under these conditions, it should be possible to run the
test either on RSX-11M-Plus (as I did), or RSX-11M, or possibly on RSX-11S and
obtain valid results. The basic program would probably also run on IAS, but it
might be necessary to change the calls to the system library that format the
output messages and the QIOs if they are not the same on IAS.
.PARAGRAPH
I have only been able to test 3 machines: the 11/70, 11/84P and PRO-350. One of
the reasons I am placing these programs on the DECUS tape is the hope that
other users with other machines will run the programs and publish the results,
so the user community will have additional data to make comparisons between
models of PDP-11s when choosing a new system or contemplating an upgrade path.
Also, by testing more machines, it will be possible to compare the times with
the published speeds in the processor handbooks, and obtain additional
verification that the speed tests are accurate (or not, though I have
reasonable evidence now that the test is correct). I would be very interested
in seeing the results of tests on other CPUs, or suggestions on other
instructions which might be added to the test set.
.PARAGRAPH
I have also started on a version which will run on the VAX family of machines:
the preliminary program is included here as S3V.MAR, and it tests a limited set
of 3 byte instructions. It will be necessary to expand the number of
instructions tested by this program, and to produce programs which test
instructions of different lengths in the same manner as the three PDP-11
programs. This program works, but I have not yet had an opportunity to run it
as the only program on the system or at a high priority, so the times I have so
far are not valid and are not included in this submission. Also, I have not yet
been able to investigate the effect of longer or shorter loops on data caching
or page boundary crossing. The program GETTIM.MAR is used to test the length of
time needed to do a $GETTIM_S, which is used to time the loop: it does this by
measuring the time which elapses while issuing 1000 directives. This works out
to 0.00012 seconds on an 11/750 (120 micro-seconds), which is surprisingly less
than the RSX system: I suspect that this is because the system call is simpler,
and that the directive itself has less work to do as it simply returns the 8
byte system time register. The program prints out the 8 byte time values as I
was also using it to test that my time subtraction logic was correct before
using it in the S3V program. Because the system clock keeps time in smaller
intervals, the instruction times on the 750 come out in 0.01 second intervals,
though I would still run instruction loops of at least 1,000,000 for best
accuracy. I would like very much to hear from other users who can suggest a
good instruction set for the VAX tests, and who obtain test results.
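.PARAGRAPH
The time subtraction and the averaging over 1000 directives can be sketched in
Python as follows. This is not the code in GETTIM.MAR: it only assumes that the
8 byte time is handled as two 32 bit longwords (low, high) and that the VMS
system time counts 100 nanosecond units. The sample values and the function
names are invented for this example.
.LITERAL
```python
TICKS_PER_SECOND = 10_000_000   # 100-nanosecond units per second
MASK32 = 0xFFFFFFFF


def quad_subtract(end, start):
    """Subtract two 64-bit times held as (low, high) 32-bit longword pairs,
    propagating the borrow from the low longword into the high one."""
    low = end[0] - start[0]
    borrow = 1 if low < 0 else 0
    high = end[1] - start[1] - borrow
    return (low & MASK32, high & MASK32)


def quad_to_seconds(quad):
    """Convert a (low, high) difference into seconds."""
    return ((quad[1] << 32) | quad[0]) / TICKS_PER_SECOND


def seconds_per_call(start, end, calls=1000):
    """Average cost of one time request, from times taken before and
    after issuing the given number of requests back to back."""
    return quad_to_seconds(quad_subtract(end, start)) / calls


if __name__ == "__main__":
    # Invented sample: the low longword wraps around between the two readings.
    start = (0xFFFFFF00, 0x00000001)
    end = (1199744, 0x00000002)      # start plus 1,200,000 units
    print(quad_subtract(end, start))
    print(seconds_per_call(start, end))
```
.END LITERAL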
.NO JUSTIFY.NO FILL
.BLANK
B.#Z.#Lederman
2572 E.#22nd St.
Brooklyn, N.Y. 11235
.BLANK
(212)#250-2300#7:30#AM#-#3:30#PM#Eastern time
or DCS#(Lederman)