Timing of PDP-11 Instructions

Don North ak6dn at mindspring.com
Sat Nov 26 21:46:05 CST 2005


My suspicion is you are seeing cache line conflicts within your
code, such that code/data accesses keep pushing valid data
out of your cache. When your code fits, and the address does
not collide with your data accesses, it runs 'fast'. When you
get address collisions that cause cache invalidate/replacement,
it runs slow.

A little bit of research indicates:
            Cache   Set    Block
Model   Size     Size    Size       Type
11/44    8KB      1       2B        Unifiied I/D
11/60    2KB      1       2B        Unifiied I/D
11/70    2KB      2       4B        Unifiied I/D
11/73    8KB      1       2B        Unifiied I/D

As you can see, the 11/73 has a very simple direct-mapped cache.
Any two memory locations that have the same low 13 physical address
bits will map to the same cache location, and contend for that slot.

By varying how your loop is unrolled, and the placement of your code
and data in memory, you can either allow thrashing to never occur
or always occur. Since the PDP-11 caches were all unified I/D caches
(not separate I/D caches like modern processors) and have very
simplistic set organizations (direct mapped, except for 11/70) thrashing
is very easy to achieve (either on purpose or by accident).

For your code, you want to guarantee that any code loops or data buffers
are not offset by 8KB in physical address (the caches are physically 
tagged).

Jerome H. Fine wrote:

> Hi All,
>
> I am not sure if there is still sufficient interest, however, I have
> encountered a rather bizarre timing situation when a specific
> subroutine is executed on my PDP-11/73.
>
> Background:  I want to calculate the reciprocal (inverse) of a
> number between 1 and 65535 to an accuracy of 256 bits after
> the binary (or decimal) point.  This result will then be used to
> calculate the logarithm of that value which in turn will be used
> to calculate li(x) for values of x up to 10**38.  The code is in
> FORTRAN 77 (could also be in FORTRAN IV) which calls
> MACRO-11 code to do the really low level stuff such as
> repeated addition of many words with MANY Adc instructions
> following each addition (99.9% of the time the Adc instructions
> are never used).
>
> Preliminary Details:  The subroutine first converts the INTEGER * 4
> to a REAL * 8 to produce:
> R8ARG =  I2ARG
> R8RCP = 1.0D0 / R8ARG
> as the initial 56 bit accurate estimate.
>
> The subroutine then continues:  I must then convert R8RCP to a:
> REAL * 64 = INTEGER * 32 / FRACTION * 32
> which is actually quite simple since the R8RCP value already
> contains the value EXP ** 2 which is the number of bits to
> be shifted right after the R8RCP fraction of 56 bits is unpacked.
> Next the unpacked 56 bits are moved by a whole number of
> words for each multiple of 16 bits to be right shifted.  Completion
> is done by shifting right a group of 5 words when the last portion
> of the total shift is less than 16 bits of shifting.
>
> Unpacking uses the instructions:
> Mov   (R0)+,-(R1)
> Word shifts use the following pair of instructions four times:
> Mov   (R0),-(R1)
> Clr     (R0)+
> The final shifting (if needed) uses five instructions of:
> Ror    -(R0)
> as many times as are required (1 to 15 times)
>
> The problem is as follows:  When the destination argument
> address is in PAR0 (address is less than 17600), there is
> no problem.  When I use an argument address in PAR1
> (address is above 21000), there is a substantial speed reduction
> in the subroutine.  It takes an average of 1611 microseconds
> for each of the 65535 cases when the destination address is
> in PAR0 (specifically 15236) while it takes an average of 6055
> microseconds for each if the 65535 cases when the destination
> address is in PAR1 (specifically 21000) or about 3.75 times as
> long.
>
> What is EXTREMELY interesting is that the above timing is also
> repeated under E11, although everything is over TWENTY times
> as fast.  Under E11, it takes 65 microseconds using PAR0 and
> 230 microseconds using PAR1 with the same identical program.
>
> I run using RT-11 with RT11XM for all testing, although I did
> check with RT11FB under E11 with almost the same timing results.
> I could also check with RT11FB on the real DEC PDP-11/73
> if anyone thinks it might be worthwhile.
> -----------------------------------------------------------------
>
> HOT  FROM  THE  COMPUTER  -  A  BIT  MORE  DETAIL!
>
> At first I was most suspicious of the  Mov / Clr pairs since they
> are the code that is extra.  HOWEVER, if I eliminate the last loop
> with the five Ror instructions, then (although the answer is incorrect -
> up to 2 ** 15 too large without the final shifts) the timing is the same
> for both PAR0 and PAR1 arguments.  But this result contradicts the
> earlier version of the code which uses ONLY Ror instructions (a total
> of 16 Ror instructions are needed to span the 32 bytes in question) to
> accomplish all the shifting.  Because the timing for the earlier version
> of the code (which is still present, just not used when the branch is
> taken to the "faster??" version) can now be compared by changing
> the branch instruction to use the less efficient code, I am still able to
> compare the timing for both versions - and the timing for the original
> less efficient version (which still includes the four unpack instructions
> and many more Ror instructions) is still the same as before for both
> PAR0 and PAR1 destinations arguments and equal to each other.
>
> SO  I  AM  STUMPED!!  Even when I turn the five Ror instructions
> loop into five Nop instructions (as opposed to not doing the loop
> even when there are more shifts needed - the code checks the remaining
> shift count, so pretending the remaining shift count is always zero to
> stop using the last five Ror instruction loop was easy), the timing test
> with a PAR1 destination argument takes longer even though NOTHING
> is being done!  In fact using a PAR1 destination argument with a five Nop
> instructions loop still takes more than half of the extra timing (i.e. 
> much
> more time than for the PAR0 destination argument which still uses the
> five Ror instructions loop).
>
> ----------------------------------------------------------------------
>
> HOTTER  FROM  THE  COMPUTER  -  STILL  MORE  DETAILS!
>
> To my surprise, there is considerable variation even when different PAR0
> destination addresses are used.  Since the portion of the code up to the
> final right shifting within 5 words is all that can be different, I am 
> even
> more mystified.  However, after running 3 different code versions at
> 3 different addresses (2 in PAR0 and 1 in PAR1) on both a real DEC
> PDP-11/73 and under E11 on a Pentium III 750 (at over 20 times the
> speed of the PDP-11/73), the results are consistent.  When I gradually
> increase the PAR0 destination argument address right up to the end of
> PAR0, the timing gradually decreases.  When the address is in PAR1
> (all or none - the field is 64 bytes long, so the last example in PAR0
> is 17600), the average time suddenly jumps, but only for the "more
> efficient" code version.
>
> Since the results of all calculations are correct (verified against a 
> version
> which uses the less efficient code), I am now totally mystified.  I have
> never seen this occur before in all of the 45 years I have been using
> computers.
>
> Sincerely yours,
>
> Jerome Fine
> -- 
> If you attempted to send a reply and the original e-mail
> address has been discontinued due a high volume of junk
> e-mail, then the semi-permanent e-mail address can be
> obtained by replacing the four characters preceding the
> 'at' with the four digits of the current year.
>
>



More information about the cctalk mailing list