.PAGE SIZE 55,80
.RIGHT MARGIN 80
.LEFT MARGIN 5
.FIRST TITLE
.TITLE Programs for testing the speed of CPUs
.SUBTITLE B.#Z.#Lederman
.PARAGRAPH
We have applications which run on PDP-11/70s, and until recently there was no other PDP-11 which was as fast. With the introduction of the PDP-11/84, it was suggested that applications could be moved to the new machine, but we wanted some method of verifying the speed of the 11/84, and there were no published instruction times. I therefore wrote some programs which measure, with reasonable accuracy, the speed at which a CPU executes instructions: at the least, they give a method of comparing one CPU to another.
.PARAGRAPH
The method I use is quite simple: a single instruction is executed many times (generally a minimum of 1,000,000 executions), and the time this takes is measured. Because the largest single program segment on a PDP-11 is about 32,000 words, it is necessary to use a loop: a long sequence of instructions is repeated as many times as needed to get the total number of executions wanted. There is also the effect of cache: a short loop will fit entirely into cache and will execute at the fastest possible speed; a longer loop will not all fit into cache, and the effect of the speed of main memory can be seen. Finally, it is necessary to test the speed of many different instructions (one at a time) to get an adequate profile of a given CPU's performance.
.PARAGRAPH
The program S1D.MAC implements this type of test. Although I have tried to make it as straightforward as possible, it should be remembered that this is basically a diagnostic program, and that it does some things that I normally wouldn't allow in a production program: specifically, it treats instruction space as data, is self-modifying, and has two tables which must be kept in alignment manually. Sometimes this sort of thing is necessary in a diagnostic program.
While modifications would be fairly easy for a programmer with some experience, I don't suggest making any changes unless you are fairly certain you know what to do. (The program is not privileged, however, so the worst that can happen is that the program will abort: it won't crash your system.)
.PARAGRAPH
There are two global equates at the beginning of the program: $NREP is set to the number of instructions wanted in the loop, and $NPAS is the number of times the loop is executed. Any combination which will multiply to a large round number will do, such as 1,000 and 1,000; 3125 and 320; 2500 and 400, or any other suitable combination. Having the numbers multiply out to 1,000,000 or 10,000,000 isn't absolutely necessary, but it makes it easier to read out the times: at 1,000,000 repetitions the time measured in seconds is the instruction time in micro-seconds. Because the loop needs 3 instructions to form the loop repeat logic, there is another global called $NREAL which is $NREP minus 3. As long as the loop is reasonably long (several hundred instructions or more) the effect of the loop repeat instructions will be negligible. I considered making these values task build time options, but it would have made the program far too complicated. Don't be surprised if the program takes more time to assemble when $NREP is large, as it takes more time to fill the ISTRT area.
.PARAGRAPH
There are two tables which must match: the first starts at MNTAB, and is a list of mnemonics which print out when you run the program to tell you what instruction is being tested. This must match the set of instructions starting at TABLE if you want to know what is going on. The instructions were placed in their present order after running the program many times so the data comes out in what I feel is the most useful form, though the instructions could be in any order.
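.PARAGRAPH
As an illustration of the $NREP and $NPAS arithmetic described above, the following Python sketch converts a measured elapsed time into a per-instruction time and estimates how much of the loop is taken up by the 3 loop repeat instructions. It is not part of the distributed programs: the function names are hypothetical, and the 0.3 second figure is only an example input.
.LITERAL
```python
LOOP_OVERHEAD = 3  # instructions used for the loop repeat logic (from the text)


def total_executions(nrep: int, npas: int) -> int:
    """Instructions executed in one test: loop length times number of passes."""
    return nrep * npas


def microseconds_per_instruction(elapsed_seconds: float, nrep: int, npas: int) -> float:
    """Average time per instruction, in micro-seconds."""
    return elapsed_seconds * 1_000_000 / total_executions(nrep, npas)


def overhead_fraction(nrep: int) -> float:
    """Share of the loop occupied by loop repeat instructions rather than
    the instruction under test ($NREAL = $NREP - 3)."""
    return LOOP_OVERHEAD / nrep


if __name__ == "__main__":
    # The three combinations suggested in the text.
    for nrep, npas in [(1000, 1000), (3125, 320), (2500, 400)]:
        print(nrep, npas, total_executions(nrep, npas), overhead_fraction(nrep))
    # Example input only: 0.3 seconds measured for 1,000,000 executions.
    print(microseconds_per_instruction(0.3, 1000, 1000))
```
.END LITERAL
.PARAGRAPH
With any of the three combinations the product is 1,000,000, so the elapsed time in seconds reads directly as micro-seconds per instruction, and the loop repeat instructions occupy well under one percent of the loop.
.PARAGRAPH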
All of the instructions are position independent, as they have to be moved by the program into the execution area: testing other addressing modes would be difficult as the addresses would all have to be adjusted, and I'm not sure that the information obtained would significantly improve the comparisons between machines. In this program, all instructions are one word long: another program called S2D.MAC is identical except that two word instructions are tested (actually instructions with addressing modes that cause the instruction to be followed by one word of address information or literal data), and similarly for S3D.MAC which tests three word instructions. Using three separate programs makes things very much easier (for me, the programmer). There are also a few special cases: the BRANCH instructions are set to always branch to the next instruction, which allows them to be tested even though they are relocated within the program. They have been selected so that one will always branch and the other will never branch: on many machines there will be a difference in execution speed for these two conditions. The other special case is the Jump to Subroutine instruction: for this to be tested there must be a Return from Subroutine instruction (which is in a fixed location), and the time of execution really is for both instructions together. As most subroutines end with a return, this should be a valid test as long as one remembers that the times obtained are for two instructions. The data used by the instructions is either referenced at absolute addresses or loaded into registers: the data for the Multiply and Divide instructions may not be the best possible test of these instructions, but it does work.
Auto-increment and Auto-decrement addressing modes are difficult to test as the loop is at least 1,000,000 instructions long, and any continuous test would push the register out of addressing range: the only way out I could see was to have the same instruction do a decrement and increment together, so only two addressing locations are used, and this can be seen in the MOV#-(SP),#(SP)+ test.
.PARAGRAPH
It should be noted that these programs include Floating Point Processor instructions: though it would be easy to add an assembly conditional to include these only when wanted, I haven't been able to get around to doing that yet. Persons wishing to test machines which don't have FPP should conditionalize or comment out the FPP instructions: this would also apply to anyone who does not have EIS instructions (MUL, DIV, etc.). The programs do not test FIS or CIS instructions, because none of the machines I have access to have these instructions.
.PARAGRAPH
In the main body of the program there is space reserved beginning at ISTRT: this is initially filled with No Operate instructions, and the program fills this space with one of the instructions from the TABLE in order to test its execution time. The system directive GTIM$ (Get Time) is used just before the start of the loop and just after the loop to see how long it took: as system time is delivered in 0.1 second intervals, it is necessary to make the test loop long enough to run for longer than this. The fastest 11/70 and 11/84 instructions are about 0.3 micro-seconds long, so if a 1,000,000 instruction loop is used, the measured times are about 0.3 seconds, only a little above the measurement granularity. Though this is enough for an approximate comparison, running a 10,000,000 instruction test brings the shortest times up to about 3.0 seconds, and gives one more significant digit of data. Running tests longer than this should not be necessary.
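.PARAGRAPH
The effect of the 0.1 second clock interval on the two loop lengths can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the error model (a measured interval may be off by up to one clock interval) is an assumption made for the sketch, not a measured property of GTIM$.
.LITERAL
```python
CLOCK_INTERVAL = 0.1  # seconds; the time resolution stated in the text


def expected_seconds(instruction_us: float, executions: int) -> float:
    """Total loop time for a given per-instruction time in micro-seconds."""
    return instruction_us * executions / 1_000_000


def worst_case_relative_error(instruction_us: float, executions: int) -> float:
    """Assumed worst case: one clock interval out of the whole loop time."""
    return CLOCK_INTERVAL / expected_seconds(instruction_us, executions)


if __name__ == "__main__":
    # 0.3 micro-seconds is the fastest instruction time quoted in the text.
    for n in (1_000_000, 10_000_000):
        t = expected_seconds(0.3, n)
        err = worst_case_relative_error(0.3, n)
        print(f"{n:>10} executions: {t:.1f} s, error up to {err:.0%}")
```
.END LITERAL
.PARAGRAPH
Under that assumption, a 0.3 second measurement can be uncertain by about a third of its value, while a 3.0 second measurement is uncertain by about one part in thirty, which is the extra significant digit mentioned above.
.PARAGRAPH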
For slower machines (such as the PRO-350 and 11/23), 10,000,000 instruction loops are nice but a little slow, and 1,000,000 instruction loops are probably adequate if you don't have much test time available.
.PARAGRAPH
The use of the system directive to measure time brings up several important factors in running these tests, of which the first is the amount of time it takes to execute the GTIM$ directive. I measured this on our 11/70 by having a program issue several GTIM$ directives sequentially and measuring the time each directive took to execute with SPM-11: this package from DEC hooks into the executive and measures time in 0.00001 second intervals (100,000 per second). The GTIM$ took about 0.0005 seconds (500 micro-seconds) to execute, which is so much less than the minimum resolution of 0.1 second that it can be ignored if the loop is long enough to register at least 0.1 seconds. A related point is that when the test program is running, it should be the only program running, and at a high enough priority that it will "grab" the CPU more or less exclusively (the CPU will still have to service the system clock if you want the test times). I have run the tests on otherwise idle systems, and at a priority of 151 or 152, which is high enough to be above just about everything except MCR: I want the test to be below MCR in case I want to abort it. While this might make the instructions appear a little slow, it is still a valid comparison between different machines, and when I have compared the test results with the published instruction times in the processor handbooks, I have obtained quite good matches, indicating that the program is testing accurately. Under these conditions, it should be possible to run the test either on RSX-11M-Plus (as I did), or RSX-11M, or possibly on RSX-11S and obtain valid results.
The basic program would probably also run on IAS, but it might be necessary to change the QIOs and the calls to the system library that format the output messages if they are not the same on IAS.
.PARAGRAPH
I have only been able to test 3 machines: the 11/70, 11/84P and PRO-350. One of the reasons I am placing these programs on the DECUS tape is the hope that other users with other machines will run the programs and publish the results, so the user community will have additional data to make comparisons between models of PDP-11s when choosing a new system or contemplating an upgrade path. Also, by testing more machines, it will be possible to compare the times with the published speeds in the processor handbooks, and obtain additional verification that the speed tests are accurate (or not, though I have reasonable evidence now that the test is correct). I would be very interested in seeing the results of tests on other CPUs, or suggestions on other instructions which might be added to the test set.
.PARAGRAPH
I have also started on a version which will run on the VAX family of machines: the preliminary program is included here as S3V.MAR, and it tests a limited set of 3 byte instructions. It will be necessary to expand the number of instructions tested by this program, and to produce programs which test instructions of different lengths in the same manner as the three PDP-11 programs. This program works, but I have not yet had an opportunity to run it as the only program on the system or at a high priority, so the times I have so far are not valid and are not included in this submission. Also, I have not yet been able to investigate the effect of longer or shorter loops on data caching or page boundary crossing. The program GETTIM.MAR is used to test the length of time needed to do a $GETTIM_S, which is used to time the loop: it does this by measuring the time which elapses while issuing 1000 directives.
This works out to 0.00012 seconds on an 11/750 (120 micro-seconds), which is surprisingly less than on the RSX system: I suspect that this is because the system call is simpler, and that the directive itself has less work to do as it simply returns the 8 byte system time register. The program prints out the 8 byte time values as I was also using it to test that my time subtraction logic was correct before using it in the S3V program. Because the system clock keeps time in smaller intervals, the instruction times on the 750 come out in 0.01 second intervals, though I would still run instruction loops of at least 1,000,000 for best accuracy. I would like very much to hear from other users who can suggest a good instruction set for the VAX tests, and who obtain test results.
.NO JUSTIFY
.NO FILL
.BLANK
B.#Z.#Lederman
2572 E.#22nd St.
Brooklyn, N.Y. 11235
.BLANK
(212)#250-2300#7:30#AM#-#3:30#PM#Eastern time
or DCS#(Lederman)