PDP-11 CPU instruction timing tests                                      Page  1
B. Z. Lederman


          The  following  are  the  results  of  the timing tests, and have been
     arranged  to show the effects of addressing  modes.   The  category  "Basic
     Instructions" includes the TeST, ROtate Right, SWAp Bytes, MOVe, MOVe Byte,
     CoMPare, ADD, and BIt Test instructions,  all  of  which  can  normally  be
     expected  to execute at the same speed when operating on registers, and did
     in fact do so in the tests.  SWAp Bytes is used  again  in  one  of  the
     addressing  mode  tests.   In  these  tests  "addr"  is  a symbolic address
     reference;  when there are two references to "addr" in the same instruction
     they are to two different locations.  The Branch EQual and Branch Not Equal
     instructions are used to show the difference in speed when  the  branch  is
     taken  and when it is not taken:  all branch instructions should execute at
     the same speed under the same circumstances.  Another special case  is  the
     Jump  to  SubRoutine instruction:  this  must also include the time for the
     ReTurn from Subroutine instruction, and so is the time  needed  to  execute
     two  instructions.   This  is reasonable, as most subroutines conclude with
     the RTS instruction.  

          The  reference  to  "Short"  and  "Long"  is  to  the  length  of  the
     instruction sequence:  a short sequence should all fit into cache and  give
     the fastest possible CPU times.  The long sequence does not fit into cache,
     and shows the effect of main memory speed.   All  times  are  converted  to
     microseconds.

                           11/70  11/70  11/84  11/84
                            Core   Core                   PRO-
         Instruction       Short   Long  Short   Long      350

     Basic Instructions     0.31   0.91    0.3    0.7     2.38

     TST     (R0)           0.76   1.36    0.7    1.3     4.20
     CMP     (R0), (R3)     1.06   1.67    1.6    2.1     6.11
     MOV    -(SP), (SP)+    1.70   2.55    2.1    2.7     6.50
     MOVB   -(SP), (SP)+    2.00   2.86    2.1    2.7     6.51
     MUL       R1, R0       3.19   3.81    5.7    6.5    26.64
     DIV       R1, R0       1.98   2.59    9.3   10.2     8.96
     BEQ     addr           0.31   0.90                   2.38
     BNE     addr           0.61   1.21                   2.38
     CCC, NOP               0.61   1.21    0.8    1.25    3.23

     MOV       #1, R0       0.77   1.95    0.6    1.5     4.07
     ASHC      #1, R0       1.38   2.56    2.1    3.1    11.92
     MOV   @#addr, R1       1.67   2.42    1.1    2.0     5.49
     TST   @#addr           1.67   2.42    1.1    2.0     5.49
     SWAB  @#addr           2.03   3.66    1.9    2.9     6.78
     MOV     addr, R1       1.43   2.27    1.4    2.3     6.07
     TST    2(R0)           1.51   2.26    1.3    2.2     6.11
     TST     addr           1.48   2.27    1.3    2.3     6.10
     MOV    2(R0), (R3)     1.87   3.52    2.3    3.4     8.12
     JSR       PC, (R2)     3.06   3.68                   9.03
     JSR       PC, @#addr   3.80   4.63                  10.25


     MOV       #1, @#addr   2.04   4.08    1.9    3.2     7.34
     MOV   @#addr, @#addr   2.76   4.54    2.4    3.7     8.96
     MOV    4(R0), 2(R3)    2.62   4.44    3.0    4.3     9.68
     ADD       #1, @#addr   3.02   4.34    2.1    3.5     8.39
     ADD   @#addr, @#addr   2.81   4.84    2.7    4.0    10.10
     ADD    4(R0), 2(R3)    2.68   4.75    3.2    4.5    10.73

     SETF                   1.22   1.83    1.6    2.0     6.14
     MULF     AC2, AC0      2.57   2.68   15.4   16.3    61.35
     ADDF     AC0, AC1      2.72   2.81   14.3   15.3    72.78
     NEGF     AC0           1.22   1.83    4.8    5.4    15.83
     ABSF     AC0           1.22   1.83    5.2    5.9    15.99
     LDF       R0, AC0      1.22   1.82                  11.47
     LDF     (R3), AC0      1.97   2.60                  14.67
     LDCIF     R0, AC0      2.27   2.37                  29.14
     LDCIF   (R3), AC0      2.27   2.59                  20.47
     STCFI    AC0,  R0      2.73   3.36                  36.00
     STCFI    AC0, (R3)     3.36   4.17                  37.59

     SETD                   1.22   1.82    1.6    2.1     6.13
     MULD     AC2, AC0      3.94   4.01   44.4   46.4   215.44
     ADDD     AC0, AC1      2.72   2.83   19.2   20.4    98.83
     NEGD     AC0           1.22   1.82    5.9    6.5    19.32
     ABSD     AC0           1.22   1.82    6.2    7.0    19.29

     LDF       #1, AC0      1.81   2.88    3.4    4.4    14.29
     LDF   @#addr, AC2      2.72   3.49                  16.06
     STF      AC0, @#addr   4.22   5.26    4.0    5.1    11.80
     CMPF  @#addr, AC0      2.89   3.49    5.7    6.9    30.40
     LDCIF     #1, AC0      2.61   2.89                  42.01
     LDCIF @#addr, AC0      2.44   3.18                  25.20
     STCFI    AC0, @#addr   3.41   4.89    9.9   11.1    38.24
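
          As a rough illustration of how the table can be read, the short
     sketch below (in Python, which is not the language of the original tests)
     subtracts the short loop time from the long loop time for a few of the
     11/70 rows above.  The function name and the choice of rows are my own
     assumptions; the difference is only an estimate of the extra time spent
     fetching from main memory instead of cache.

```python
# Times in microseconds, copied from the 11/70 columns of the table above.
# Each entry is (short loop, long loop).
TIMES_11_70 = {
    "Basic Instructions": (0.31, 0.91),
    "TST (R0)": (0.76, 1.36),
    "MUL R1, R0": (3.19, 3.81),
    "MOV #1, R0": (0.77, 1.95),
    "MOV #1, @#addr": (2.04, 4.08),
}


def long_loop_penalty(short_us, long_us):
    """Extra time per instruction once the loop no longer fits in cache."""
    return round(long_us - short_us, 2)


for name, (short_us, long_us) in TIMES_11_70.items():
    print(f"{name:20s} {long_loop_penalty(short_us, long_us):5.2f}")
```

          For these rows the difference is about 0.6 microseconds for the
     single word instructions, and larger for the instructions that are
     followed by immediate data or address words.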

          The  results of these tests need to be considered carefully.  First, I
     had a limited amount of time on the 11/84, and ran the shortest tests:   it
     would  be  desirable  to run them again to obtain the same precision as for
     the 11/70 and PRO-350 tests.  Also, as  can  be  seen  below,  the  longest
     sequence  might  not have hit the maximum memory access times on the 11/84.
     The times obtained for the 11/70 are the same using core  and  MOS  memory,
     but  it  is  suspected  that  in  both  cases  memory  was not interleaved:
     interleaving makes main memory access faster, and so the times on the 11/70
     for  the long loop might improve on a machine with interleaved memory.  The
     times for the MULtiply and  DIVide  instructions  don't  always  match  the
     processor  handbook  values:  this is probably because the data values used
     are special cases, and a divide by one is faster on some  machines  than  a
     divide  by another number.  The values for the short loop, where the entire
     loop is in cache, do match the published values  in  the  11/70  processor
     handbook very well.

          It  is apparent, however, that the 11/84 certainly comes very close to
     the speed of an 11/70.  For some addressing modes,  the  11/84  is  faster:
     when  this  is  combined with the larger cache, the net effect is that some
     other test programs actually ran significantly faster on the 11/84 than  on
     the 11/70.  A program which was designed to heavily exercise memory mapping
     (it used a Virtual array  in  Fortran,  and  essentially  did  nothing  but
     re-map) was the one CPU bound test which ran slower on the 11/84:  I do not
     know the reason for this.  The two  cases  where  the  11/84  is  obviously
     slower are the arithmetic instructions (MUL, DIV, ASHC),  where  it  is
     slightly slower, and the Floating Point  instructions,  where  it  is
     considerably slower.  How much this affects an application depends upon how
     many floating  point  instructions  are  executed:   even  a  program  with
     floating  point  variables generally spends a relatively low percentage of
     its total execution on actual floating point arithmetic.  A program  which
     did  use  floating  point  heavily  (large  matrix  inversion, Fast Fourier
     Transforms,  Network  Analysis,  and  similar  scientific  and  engineering
     applications) would see a reduction in speed on an 11/84.  
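
          The dependence on the proportion of floating point instructions can
     be sketched as a weighted average.  The example below is hypothetical:
     it assumes a program made up only of "basic" instructions and MULF
     instructions, uses the short loop times from the first table, and
     ignores everything else a real program does.  The names are my own.

```python
# Short-loop times in microseconds from the first table.
BASIC = {"11/70": 0.31, "11/84": 0.3}
MULF = {"11/70": 2.57, "11/84": 15.4}


def mean_time(machine, fp_fraction):
    """Average time per instruction for a mix of basic and MULF instructions.

    fp_fraction is the assumed fraction of executed instructions that are
    MULF; the remainder are treated as basic instructions.
    """
    return (1.0 - fp_fraction) * BASIC[machine] + fp_fraction * MULF[machine]


def slowdown_on_11_84(fp_fraction):
    """Ratio of 11/84 to 11/70 run time for the assumed instruction mix."""
    return mean_time("11/84", fp_fraction) / mean_time("11/70", fp_fraction)


for fp_fraction in (0.0, 0.01, 0.05, 0.25, 1.0):
    print(f"{fp_fraction:5.2f}  {slowdown_on_11_84(fp_fraction):5.2f}")
```

          A ratio below 1 means the 11/84 is faster for that mix, and a ratio
     above 1 means it is slower.  A real instruction mix would need measured
     frequencies for each instruction rather than the two used here.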

          The  story  does not end here, however.  DEC has announced an upgraded
     11/84A, whose processor runs 20% faster than the machine I tested,  so  all
     of  the instruction times given above can probably be reduced by 20%:  this
     would make the 11/84 faster than the 11/70 in nearly all  instances.   This
     machine  will  also  accommodate  the Floating Point Co-Processor, which is
     said to execute floating point instructions 5 to 8 times  faster  than  the
     basic J-11 processor:  this would narrow or eliminate the gap between 11/70
     and 11/84 floating point performance.  
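
          The two estimates in the preceding paragraph can be applied to the
     measured times as follows.  This is only arithmetic on the figures quoted
     in the text: the 20% reduction and the factor of 5 to 8 are taken at face
     value, neither has been measured here, and the function names are my own.

```python
ASSUMED_REDUCTION = 0.20        # the text's estimate for the 11/84A, not measured
ASSUMED_FP_SPEEDUP = (5.0, 8.0)  # quoted range for the Floating Point Co-Processor


def estimated_11_84a(t_11_84_us):
    """Measured 11/84 time reduced by the assumed 20%."""
    return t_11_84_us * (1.0 - ASSUMED_REDUCTION)


def coprocessor_range(t_11_84_us):
    """(best, worst) time if the instruction ran 5 to 8 times faster."""
    low, high = ASSUMED_FP_SPEEDUP
    return (t_11_84_us / high, t_11_84_us / low)


# Short-loop times in microseconds from the first table: (11/70, 11/84).
for name, (t70, t84) in {"TST (R0)": (0.76, 0.7), "MUL R1, R0": (3.19, 5.7)}.items():
    print(f"{name:12s} 11/70 {t70:5.2f}  11/84A est. {estimated_11_84a(t84):5.2f}")

best, worst = coprocessor_range(15.4)  # MULF on the 11/84
print(f"MULF         11/70  2.57  with co-processor {best:.2f} to {worst:.2f}")
```

          The co-processor estimate is applied to the measured 11/84 times
     only; whether the two improvements would combine is not stated in the
     announcement as quoted here.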

          The  times  for  the PRO-350 are close to but slightly slower than the
     times published for an 11/23, which uses the same processor.   The  PRO-350
     CPU  has to do extra work such as updating the video screen and keeping the
     system clock working, which may account for the difference.  

          Both  the  11/70 and the 11/84 have cache memory:  this means that the
     speed at which instructions will execute depends upon whether the CPU
     fetches the instruction from fast cache memory or from  relatively  slower
     main memory.  To examine the effect this has, the length of the timing loop
     was
     varied  from  a  very  small  loop  (which  will  fit entirely in cache and
     therefore execute at the maximum CPU speed), to a  very  large  loop  which
     will  not  fit  into cache, so that the instructions will mostly be fetched
     from main memory.  By using loops of intermediate size,  one  can  see  the
     effect  of  a  larger cache.  The instructions are divided into three sets:
     single word instructions, instructions followed by one word of address, and
     instructions  followed  by two words of address.  The numbers at the top of
     the table give the number of instructions in each loop.  Though  the  tests
     were  run  for  all instructions, only a few need be given here to show the
     effect of cache.  
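
          To relate the loop lengths to memory, the sketch below computes how
     many bytes of instruction stream each loop occupies, and how many passes
     through the loop are needed for the 1,000,000 total instructions that the
     text later says were used.  It assumes the loop-control overhead is
     negligible; the names are hypothetical and not taken from the test
     programs.

```python
BYTES_PER_WORD = 2  # a PDP-11 word is 16 bits

# Words occupied by one instruction in each of the three sets described above.
WORDS_PER_INSTRUCTION = {
    "single word": 1,
    "one word of address": 2,
    "two words of address": 3,
}


def loop_bytes(instructions, instruction_set):
    """Bytes of instruction stream in the loop, ignoring loop-control overhead."""
    return instructions * WORDS_PER_INSTRUCTION[instruction_set] * BYTES_PER_WORD


def repetitions(total_instructions, instructions_per_loop):
    """Number of passes through the loop needed to reach the requested total."""
    return total_instructions // instructions_per_loop


# Shortest loop and the longest loop used for each set in the tables.
for instruction_set, longest in (("single word", 10_000),
                                 ("one word of address", 10_000),
                                 ("two words of address", 2_500)):
    print(instruction_set,
          loop_bytes(100, instruction_set),
          loop_bytes(longest, instruction_set),
          repetitions(1_000_000, longest))
```

          Comparing these byte counts with the cache size of each machine
     indicates roughly where the transition from cache to main memory speed
     should be expected.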


                                     PDP-11/84 

         Instruction        100    250    500  1,000   2,500  5,000 10,000

       ROR R0                .3     .3     .3     .3      .3     .4    .8
       TST (R0)              .7     .8     .8     .8      .8    1.0   1.3
       CMP (R0), (R3)       1.6    1.6    1.6    1.6     1.6    1.8   2.1
       MOV -(SP), (SP)+     2.1    2.2    2.2    2.2     2.1    2.3   2.7
       MULF AC2, AC0       15.4   15.6   15.7   15.8    15.8   16.1  16.3
       MULD AC2, AC0       44.4   45.2   45.5   45.7    45.9   46.1  46.4

        Instruction       100   200   320   400   500   800 1,000 2,000 10,000

      ASHC #1, R0         2.1   2.1   2.1   2.1   2.1   2.2   2.2  2.2  3.1
      MOV @#A, R1         1.1   1.0   1.1   1.0   1.0   1.1   1.0  1.1  2.0
      STCFI AC0, @#A      9.9  10.0  10.0  10.1  10.1  10.2  10.2 10.3 11.1

        Instruction       100   125   200   320   400  500  800 1250  2500

      MOV #1, @#A         1.9   1.9   1.9   1.9   1.9  1.9  1.9  1.9   3.2
      ADD #1, @#B         2.1   2.1   2.1   2.2   2.2  2.2  2.6  2.2   3.5

                               PDP-11/70, Core Memory 

         Instruction        100    250    500  1,000   2,500  5,000 10,000

       ROR R0                .3     .3     .3     .4      .8     .9    .9
       TST (R0)              .7     .8     .8     .7     1.2    1.3   1.3
       CMP (R0), (R3)       1.0    1.1    1.1    1.4     1.6    1.7   1.7
       MOV -(SP), (SP)+     1.7    1.7    1.7    2.0     2.5    2.5   2.5
       MULF AC2, AC0        2.5    2.6    2.6    2.6     2.6    2.7   2.7
       MULD AC2, AC0        3.8    3.9    4.0    4.0     3.9    3.9   3.9

        Instruction       100   200   320   400   500   800 1,000 2,000 10,000

      ASHC #1, R0         1.3   1.3   1.4   1.4   1.4   2.2   2.2  2.4  2.6
      MOV @#A, R1         1.2   1.2   1.4   1.5   1.7   2.1   2.2  2.3  2.4
      STCFI AC0, @#A      3.3   3.4   3.4   3.4   3.4   3.6   3.6  3.6  3.7

        Instruction       100   125   200   320   400  500  800 1250  2500

      MOV #1, @#A         2.0   2.0   2.0   2.1   2.7  3.5  4.0  4.0   4.0
      ADD #1, @#B         2.3   2.3   2.5   3.1   3.5  3.9  4.1  4.1   4.2

          Though  the tests showed the effects of the larger cache on the 11/84,
     the transition point  was  shifted  more  towards long  loops  than  I  had
     anticipated.  If I had the opportunity to run the tests again I would use a
     longer loop (31,250 instructions for the first test, 15,625 for the  second
     and  10,000 for the third) to ensure obtaining the maximum execution times.
     I would probably also run  the  tests  for  10,000,000  total  instructions
     rather  than  the  1,000,000  used  for  the above tests to obtain one more
     significant digit of precision, and I would not have to run as  many  tests
     between  the shortest and longest loop, as the point at which the effect of
     cache begins to be lost can be determined sufficiently well from the  above
     tests.   It  is  interesting to see what happens with certain instructions,
     especially floating point:  the instructions which take the  most  time  to
     execute suffer least when the loop is long, as the time needed to fetch the
     instruction from memory is only a  small  part  of  the  total  instruction
     execution  time.   The  fastest  instructions suffer most, as here the time
     needed to fetch the instruction can be as much as or longer than  the  time
     the CPU needs to actually execute the instruction once it has been fetched.
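
          The last point can be checked directly from the 11/84 table by
     expressing the increase from the shortest loop to the longest loop as a
     percentage of the shortest loop time.  The sketch below does this for the
     single word instructions; the function name is my own.

```python
# (shortest loop, longest loop) in microseconds, from the PDP-11/84 table.
PDP_11_84 = {
    "ROR R0": (0.3, 0.8),
    "TST (R0)": (0.7, 1.3),
    "CMP (R0), (R3)": (1.6, 2.1),
    "MOV -(SP), (SP)+": (2.1, 2.7),
    "MULF AC2, AC0": (15.4, 16.3),
    "MULD AC2, AC0": (44.4, 46.4),
}


def percent_increase(short_us, long_us):
    """Increase from the short-loop time to the long-loop time, in percent."""
    return 100.0 * (long_us - short_us) / short_us


# Rank the instructions from most affected to least affected.
ranked = sorted(PDP_11_84.items(),
                key=lambda item: percent_increase(*item[1]),
                reverse=True)

for name, (short_us, long_us) in ranked:
    print(f"{name:18s} {percent_increase(short_us, long_us):6.1f}%")
```

          With these figures ROR R0 is the most affected instruction and
     MULD AC2, AC0 the least.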