PDP-11 CPU instruction timing tests
B. Z. Lederman

The following are the results of the timing tests, arranged to show the effects of addressing modes.

The category "Basic Instructions" includes the TeST, ROtate Right, SWAp Bytes, MOVe, MOVe Byte, CoMPare, ADD, and BIt Test instructions, all of which can normally be expected to execute at the same speed when operating on registers, and did in fact do so in the tests. SWAp Bytes is used again in one of the addressing mode tests. In these tests "addr" is a symbolic address reference; when there are two references to "addr" in the same instruction, they are to two different locations.

The Branch EQual and Branch Not Equal instructions are used to show the difference in speed when the branch is taken and when it is not taken: all branch instructions should execute at the same speed under the same circumstances.

Another special case is the Jump to SubRoutine instruction: the time given also includes the time for the ReTurn from Subroutine instruction, and so is the time needed to execute two instructions. This is reasonable, as most subroutines conclude with the RTS instruction.

The references to "Short" and "Long" are to the length of the instruction sequence: a short sequence should fit entirely into cache and give the fastest possible CPU times. The long sequence does not fit into cache, and shows the effect of main memory speed. All times are converted to microseconds.

                       11/70   11/70   11/84   11/84
                        Core    Core                     PRO-
Instruction            Short    Long   Short    Long      350

Basic Instructions      0.31    0.91     0.3     0.7     2.38
TST (R0)                0.76    1.36     0.7     1.3     4.20
CMP (R0), (R3)          1.06    1.67     1.6     2.1     6.11
MOV -(SP), (SP)+        1.70    2.55     2.1     2.7     6.50
MOVB -(SP), (SP)+       2.00    2.86     2.1     2.7     6.51
MUL R1, R0              3.19    3.81     5.7     6.5    26.64
DIV R1, R0              1.98    2.59     9.3    10.2     8.96
BEQ addr                0.31    0.90                     2.38
BNE addr                0.61    1.21                     2.38
CCC, NOP                0.61    1.21     0.8    1.25     3.23
MOV #1, R0              0.77    1.95     0.6     1.5     4.07
ASHC #1, R0             1.38    2.56     2.1     3.1    11.92
MOV @#addr, R1          1.67    2.42     1.1     2.0     5.49
TST @#addr              1.67    2.42     1.1     2.0     5.49
SWAB @#addr             2.03    3.66     1.9     2.9     6.78
MOV addr, R1            1.43    2.27     1.4     2.3     6.07
TST 2(R0)               1.51    2.26     1.3     2.2     6.11
TST addr                1.48    2.27     1.3     2.3     6.10
MOV 2(R0), (R3)         1.87    3.52     2.3     3.4     8.12
JSR PC, (R2)            3.06    3.68                     9.03
JSR PC, @#addr          3.80    4.63                    10.25
MOV #1, @#addr          2.04    4.08     1.9     3.2     7.34
MOV @#addr, @#addr      2.76    4.54     2.4     3.7     8.96
MOV 4(R0), 2(R3)        2.62    4.44     3.0     4.3     9.68
ADD #1, @#addr          3.02    4.34     2.1     3.5     8.39
ADD @#addr, @#addr      2.81    4.84     2.7     4.0    10.10
ADD 4(R0), 2(R3)        2.68    4.75     3.2     4.5    10.73
SETF                    1.22    1.83     1.6     2.0     6.14
MULF AC2, AC0           2.57    2.68    15.4    16.3    61.35
ADDF AC0, AC1           2.72    2.81    14.3    15.3    72.78
NEGF AC0                1.22    1.83     4.8     5.4    15.83
ABSF AC0                1.22    1.83     5.2     5.9    15.99
LDF R0, AC0             1.22    1.82                    11.47
LDF (R3), AC0           1.97    2.60                    14.67
LDCIF R0, AC0           2.27    2.37                    29.14
LDCIF (R3), AC0         2.27    2.59                    20.47
STCFI AC0, R0           2.73    3.36                    36.00
STCFI AC0, (R3)         3.36    4.17                    37.59
SETD                    1.22    1.82     1.6     2.1     6.13
MULD AC2, AC0           3.94    4.01    44.4    46.4   215.44
ADDD AC0, AC1           2.72    2.83    19.2    20.4    98.83
NEGD AC0                1.22    1.82     5.9     6.5    19.32
ABSD AC0                1.22    1.82     6.2     7.0    19.29
LDF #1, AC0             1.81    2.88     3.4     4.4    14.29
LDF @#addr, AC2         2.72    3.49                    16.06
STF AC0, @#addr         4.22    5.26     4.0     5.1    11.80
CMPF @#addr, AC0        2.89    3.49     5.7     6.9    30.40
LDCIF #1, AC0           2.61    2.89                    42.01
LDCIF @#addr, AC0       2.44    3.18                    25.20
STCFI AC0, @#addr       3.41    4.89     9.9    11.1    38.24

The results of these tests need to be considered carefully.
First, I had a limited amount of time on the 11/84, and ran only the shortest tests: it would be desirable to run them again to obtain the same precision as for the 11/70 and PRO-350 tests. Also, as can be seen below, the longest sequence might not have hit the maximum memory access times on the 11/84.

The times obtained for the 11/70 are the same using core and MOS memory, but it is suspected that in both cases memory was not interleaved. Interleaving makes main memory access faster, so the times on the 11/70 for the long loop might improve on a machine with interleaved memory.

The times for the MULtiply and DIVide instructions don't always match the processor handbook values. This is probably because the data values used are special cases, and a divide by one is faster on some machines than a divide by another number. The values for the short loop, where the entire loop is in cache, do match the published values in the 11/70 processor handbook very well.

It is apparent, however, that the 11/84 certainly comes very close to the speed of an 11/70. For some addressing modes, the 11/84 is faster: when this is combined with the larger cache, the net effect is that some other test programs actually ran significantly faster on the 11/84 than on the 11/70. A program which was designed to heavily exercise memory mapping (it used a Virtual array in Fortran, and essentially did nothing but re-map) was the one CPU-bound test which ran slower on the 11/84: I do not know the reason for this.

The two cases where the 11/84 is obviously slower are arithmetic instructions (MUL, DIV, ASHC), where the 11/84 is slightly slower, and Floating Point instructions, where the 11/84 is considerably slower.
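The size of these differences can be read off the table by dividing the 11/84 short-loop time by the 11/70 short-loop time. The following is a minimal sketch of that arithmetic in Python; the times are copied from the table above, while the function names and the choice of instructions are only illustrative and are not part of the original tests.

```python
# Short-loop times in microseconds, copied from the main table:
# (11/70 core, 11/84). The selection of rows is illustrative.
SHORT_LOOP_US = {
    "MUL R1, R0": (3.19, 5.7),
    "DIV R1, R0": (1.98, 9.3),
    "ASHC #1, R0": (1.38, 2.1),
    "MULF AC2, AC0": (2.57, 15.4),
    "ADDF AC0, AC1": (2.72, 14.3),
    "MULD AC2, AC0": (3.94, 44.4),
    "MOV @#addr, R1": (1.67, 1.1),
}


def slowdown(t_1170, t_1184):
    """Ratio of 11/84 time to 11/70 time; above 1.0 means the 11/84 is slower."""
    return t_1184 / t_1170


def slowdown_table(times):
    """Map each instruction to its 11/84 : 11/70 ratio, rounded to 2 places."""
    return {name: round(slowdown(*pair), 2) for name, pair in times.items()}


if __name__ == "__main__":
    for name, ratio in slowdown_table(SHORT_LOOP_US).items():
        print(f"{name:16s} {ratio:6.2f}")
```

A ratio below 1.0 (as for MOV @#addr, R1) marks an addressing mode where the 11/84 is the faster machine; the floating point rows give the largest ratios.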
How much this affects an application depends upon how many floating point instructions are executed: even a program with floating point variables generally spends a relatively low percentage of its total execution on actual floating point arithmetic. A program which did use floating point heavily (large matrix inversion, Fast Fourier Transforms, Network Analysis, and similar scientific and engineering applications) would see a reduction in speed on an 11/84.

The story does not end here, however. DEC has announced an upgraded 11/84A, whose processor runs 20% faster than the machine I tested, so all of the instruction times given above can probably be reduced by 20%: this would make the 11/84 faster than the 11/70 in nearly all instances. This machine will also accommodate the Floating Point Co-Processor, which is said to execute floating point instructions 5 to 8 times faster than the basic J-11 processor: this would narrow or eliminate the gap between 11/70 and 11/84 floating point performance.

The times for the PRO-350 are close to, but slightly slower than, the times published for an 11/23, which uses the same processor. The PRO-350 CPU has to do extra work such as updating the video screen and keeping the system clock working, which may account for the difference.

Both the 11/70 and the 11/84 have cache memory: this means that the speed at which instructions execute depends upon whether the CPU fetches each instruction from fast cache memory or from relatively slower main memory. To examine the effect this has, the length of the timing loop was varied from a very small loop (which fits entirely in cache and therefore executes at the maximum CPU speed) to a very large loop which does not fit into cache, so that the instructions are mostly fetched from main memory. By using loops of intermediate size, one can see the effect of a larger cache.
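The arithmetic behind a test of this kind can be sketched as follows. This is only an illustration in Python, not the original test program: it assumes a fixed total of 1,000,000 executed instructions (the figure given for these tests), and the elapsed time and the optional loop-overhead correction are hypothetical inputs, since the article does not say how the elapsed time was measured or whether loop overhead was subtracted.

```python
def repetitions(total_instructions, loop_length):
    """Number of passes through a loop of loop_length instructions
    needed to execute total_instructions in all."""
    if loop_length <= 0 or total_instructions % loop_length != 0:
        raise ValueError("loop length must divide the total instruction count")
    return total_instructions // loop_length


def microseconds_per_instruction(elapsed_seconds, total_instructions,
                                 overhead_seconds=0.0):
    """Average time per instruction in microseconds.

    overhead_seconds is a hypothetical correction for loop control;
    it defaults to zero because the source does not describe one.
    """
    return (elapsed_seconds - overhead_seconds) * 1_000_000 / total_instructions


if __name__ == "__main__":
    total = 1_000_000
    for length in (100, 250, 500, 1000, 2500, 5000, 10000):
        print(f"loop of {length:6d} instructions: {repetitions(total, length)} passes")
    # Illustrative only: an elapsed time of 0.31 s over 1,000,000
    # instructions corresponds to the 0.31 microsecond table entries.
    print(microseconds_per_instruction(0.31, total))
```

Keeping the total instruction count fixed while varying the loop length means every run is timed over the same amount of work, so any change in the result comes from where the instructions are fetched from.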
The instructions are divided into three sets: single word instructions, instructions followed by one word of address, and instructions followed by two words of address. The numbers at the top of each table give the number of instructions in each loop. Though the tests were run for all instructions, only a few need be given here to show the effect of cache. All times are in microseconds.

PDP-11/84

Instruction           100    250    500  1,000  2,500  5,000 10,000
ROR R0                 .3     .3     .3     .3     .3     .4     .8
TST (R0)               .7     .8     .8     .8     .8    1.0    1.3
CMP (R0), (R3)        1.6    1.6    1.6    1.6    1.6    1.8    2.1
MOV -(SP), (SP)+      2.1    2.2    2.2    2.2    2.1    2.3    2.7
MULF AC2, AC0        15.4   15.6   15.7   15.8   15.8   16.1   16.3
MULD AC2, AC0        44.4   45.2   45.5   45.7   45.9   46.1   46.4

Instruction           100    200    320    400    500    800  1,000  2,000 10,000
ASHC #1, R0           2.1    2.1    2.1    2.1    2.1    2.2    2.2    2.2    3.1
MOV @#A, R1           1.1    1.0    1.1    1.0    1.0    1.1    1.0    1.1    2.0
STCFI AC0, @#A        9.9   10.0   10.0   10.1   10.1   10.2   10.2   10.3   11.1

Instruction           100    125    200    320    400    500    800  1,250  2,500
MOV #1, @#A           1.9    1.9    1.9    1.9    1.9    1.9    1.9    1.9    3.2
ADD #1, @#B           2.1    2.1    2.1    2.2    2.2    2.2    2.6    2.2    3.5

PDP-11/70, Core Memory

Instruction           100    250    500  1,000  2,500  5,000 10,000
ROR R0                 .3     .3     .3     .4     .8     .9     .9
TST (R0)               .7     .8     .8     .7    1.2    1.3    1.3
CMP (R0), (R3)        1.0    1.1    1.1    1.4    1.6    1.7    1.7
MOV -(SP), (SP)+      1.7    1.7    1.7    2.0    2.5    2.5    2.5
MULF AC2, AC0         2.5    2.6    2.6    2.6    2.6    2.7    2.7
MULD AC2, AC0         3.8    3.9    4.0    4.0    3.9    3.9    3.9

Instruction           100    200    320    400    500    800  1,000  2,000 10,000
ASHC #1, R0           1.3    1.3    1.4    1.4    1.4    2.2    2.2    2.4    2.6
MOV @#A, R1           1.2    1.2    1.4    1.5    1.7    2.1    2.2    2.3    2.4
STCFI AC0, @#A        3.3    3.4    3.4    3.4    3.4    3.6    3.6    3.6    3.7

Instruction           100    125    200    320    400    500    800  1,250  2,500
MOV #1, @#A           2.0    2.0    2.0    2.1    2.7    3.5    4.0    4.0    4.0
ADD #1, @#B           2.3    2.3    2.5    3.1    3.5    3.9    4.1    4.1    4.2

Though the tests showed the effects of the larger cache on the 11/84, the transition point was shifted more towards long loops than I had anticipated.
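One way to locate the transition point in tables such as these is to find the first loop length at which the time rises clearly above the shortest-loop time. The following Python sketch does this for the ROR R0 and TST (R0) rows; the data are copied from the tables, but the function and the 0.15 microsecond threshold are assumptions of this sketch, chosen so that a change of one unit in the last printed digit is not counted as a transition.

```python
# Loop lengths and times (microseconds) copied from the single word
# instruction tables for the two machines.
LOOP_LENGTHS = [100, 250, 500, 1000, 2500, 5000, 10000]

ROR_R0_US = {
    "11/84": [0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.8],
    "11/70": [0.3, 0.3, 0.3, 0.4, 0.8, 0.9, 0.9],
}

TST_R0_US = {
    "11/84": [0.7, 0.8, 0.8, 0.8, 0.8, 1.0, 1.3],
    "11/70": [0.7, 0.8, 0.8, 0.7, 1.2, 1.3, 1.3],
}


def transition_length(lengths, times, tolerance=0.15):
    """First loop length whose time exceeds the shortest-loop time by more
    than tolerance, or None if no such length was measured.

    tolerance is an assumed threshold, not a value from the tests.
    """
    baseline = times[0]
    for length, t in zip(lengths, times):
        if t > baseline + tolerance:
            return length
    return None


if __name__ == "__main__":
    for label, rows in (("ROR R0", ROR_R0_US), ("TST (R0)", TST_R0_US)):
        for machine, times in rows.items():
            print(label, machine, transition_length(LOOP_LENGTHS, times))
```

With this threshold the 11/70 rows cross at a shorter loop length than the 11/84 rows, which is consistent with the larger cache on the 11/84; a different threshold would move the reported lengths somewhat.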
If I had the opportunity to run the tests again I would use a longer loop (31,250 instructions for the first test, 15,625 for the second and 10,000 for the third) to ensure obtaining the maximum execution times. I would probably also run the tests for 10,000,000 total instructions rather than the 1,000,000 used for the above tests, to obtain one more significant digit of precision. I would not have to run as many tests between the shortest and longest loop, as the point at which the effect of cache begins to be lost can be determined sufficiently well from the above tests.

It is interesting to see what happens with certain instructions, especially floating point. The instructions which take the most time to execute suffer least when the loop is long, as the time needed to fetch the instruction from memory is only a small part of the total instruction execution time. The fastest instructions suffer most, as here the time needed to fetch the instruction can be as much as or longer than the time the CPU needs to execute the instruction once it has been fetched.
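This last point can be checked against the 11/84 figures by comparing the added time in the longest loop with the shortest-loop time. The Python sketch below uses the 100 and 10,000 instruction columns of the PDP-11/84 table; the function name and the choice of rows are illustrative only.

```python
# (shortest loop, longest loop) times in microseconds for the 11/84,
# taken from the 100 and 10,000 instruction columns.
SHORT_AND_LONG_US = {
    "ROR R0": (0.3, 0.8),
    "TST (R0)": (0.7, 1.3),
    "MULF AC2, AC0": (15.4, 16.3),
    "MULD AC2, AC0": (44.4, 46.4),
}


def long_loop_penalty(short_us, long_us):
    """Return (added microseconds, added time as a fraction of the short time)."""
    added = long_us - short_us
    return added, added / short_us


if __name__ == "__main__":
    for name, (short_us, long_us) in SHORT_AND_LONG_US.items():
        added, fraction = long_loop_penalty(short_us, long_us)
        print(f"{name:14s} +{added:.1f} us  ({fraction:.0%} of short-loop time)")
```

For ROR R0 the added time is larger than the short-loop time itself, while for MULD AC2, AC0 it is only a few percent, even though the absolute increase is larger for MULD.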