RSX11M System Tuning, Performance Optimization and Measurement

This is going to be a workshop on System Tuning and Performance Optimization for RSX11M systems. While most of what will be discussed is particularly applicable to RSX11M V3.2 systems, some of it will be generally applicable to other systems. The first part of this session will be devoted to presentations by the speakers on a wide-ranging series of topics. This will take about 1.5 hours. Then, in the last half hour, we will welcome discussion, comments, suggestions and questions from the floor. We have a considerable amount of material to cover, so if questions can be held to a minimum it would be useful.

I became interested in optimizing the performance of our RSX11M system over two years ago. We currently have an 11/45 with 11 terminals and a very large number of users, each of whom covets the entire resources of our system. Two years ago we were critically tight on disk space and terminal time. This problem led me into disk block and terminal time accounting, and finally into the area of resource accounting in general. After getting the disk space and terminal time problems under control, I found that the growth of system usage was leaving POOL and memory critically tight. At this point I implemented Richard Kirkman's CCL, updating it for V3.1, and a whole raft of MCR, INDIRECT, and BATCH enhancements. For a while our memory and POOL problems vanished. Then the users discovered how convenient the system was, and usage went up, again pushing us against the proverbial wall. To make matters worse, V3.2 of RSX11M seemed to require substantially more core than before. This pushed us into reconfiguring the system to make heavy use of an FCS resident library, and into monitoring the performance of the system and of various software subsystems. These measurements helped justify adding a cache memory and additional disk capacity.

At present, performance measurement and capacity planning are our critical interest. We are trying to get a handle on the ultimate capacity limits of our machine. Users now want to increase the number of terminals from 11 to 16 or more. Serious thought is being given to connecting the PDP-11/45 to 3-4 satellite 11/23's via DECNET to handle program development for a new laboratory data acquisition system. At some point a larger machine will be needed, and hopefully performance measurements will give us enough information to plan our data acquisition system's growth.

By the end of this workshop, we hope to have presented material which will show you, the user, that very substantial performance improvements are possible. Depending on the amount of effort one wishes to invest, 10% to over 50% improvements in performance are not unreasonable. The goal of this workshop is not to tell you how to make these improvements, or to overly extol the virtues of any one method. Rather, we want to tell you what can be done, give you an estimate of the degree of benefit involved, and give some feeling for the complexity of the task and any system consequences. All of the improvements I will discuss we have implemented on our system. The net result is that throughput on our system is perhaps 2-4 times greater than on a standard 11/45 running a standard RSX11M operating system. The lower limit of that range is the result of adding a cache memory; the upper limit is the result of many software and system changes.

What do I mean by performance?
Performance can be considered either system-wide or on a per-task basis. A development system will probably be concerned with overall optimization of the number of tasks processed through the system. A dedicated system will be primarily interested in optimizing the performance of a set of tasks. Toward this end we will address a number of topics.

Types of Optimization

1. Overall System Throughput
2. Subsystem or Application System Throughput
3. Ease of use (system/subsystem)
4. Memory Usage
5. Pool Usage (see Dan Steinberg's talk)
6. Disk Usage

(SEE SLIDE OF TOPICS AND SPEAKERS)

TUNING A SYSTEM

A) HARDWARE OPTIMIZATION

The easiest way to improve the performance of an existing system is to install a cache memory, if it doesn't have one already. I believe that one can now get cache memories for any DEC processor larger than a PDP-11/20, either through DEC or through a brand X company. For about $6K or so your system can run 30-50% faster. Swapping disks or disk emulators can also provide substantial benefit to a heavily loaded system. Often just redistributing the disk usage on a multi-disk system can provide significant benefit. The trick is to find out where bottlenecks are occurring and to know which changes help the situation the most. Finally, users doing a lot of high speed terminal I/O can improve their system's performance quite significantly by using a DH11 (DMA) terminal interface and the V3.2 full duplex terminal driver.

B) SOFTWARE AND OPERATING SYSTEM OPTIMIZATION

Without resorting to performance measurement tools or hardware enhancements, there are a number of software options readily available to the user which will improve system performance.

1) SHUFFLER and RMDEMO

The SHUFFLER and RMDEMO affect the performance of a heavily loaded system such as ours. First, comparing two identically loaded systems, one running the new RMDEMO and the other the old RMDEMO, one finds that the new RMDEMO imposes at least 25% greater overhead than the old version. Part of this is because it is larger and more swapping occurs. Second, the new RMDEMO issues one QIO for each line changed on the screen since the last plot. On systems selecting checkpointing on both input and output, RMDEMO can get into a mode in which it checkpoints between QIO's if the system is loaded. In some cases the system will swap itself to a standstill. For those who wish to use the old RMDEMO, it is on the SAN DIEGO tape from the last DECUS symposium. It is modified for V3.2 and has many of the features of the new RMDEMO. We plan to support it at least through V4.0.

A second very useful trick for a heavily loaded system is to place SHF in its own partition and FIX it there. This costs almost no memory, since SYSPAR is much too large as it stands for a mapped system; it can almost hold SHF and MCR... side by side as it is now. Tests of system throughput for a system doing a lot of swapping show that throughput is increased by about 5% if SHF is fixed in its own partition.

If you have an extremely heavily loaded system, you will find that a large fraction of the system's time is spent with SHF trying to make room and not being able to. It is a design 'feature' of the shuffler that there is no limit placed on the number of times per second it will try to recover memory to run a task. There is a relatively simple way around this problem if one wishes to modify the EXEC slightly. At each Executive swapping interval, load a counter with the maximum number of times one feels SHF should run during a swapping interval. Then, each time the shuffler is requested, decrement the counter; if the counter runs out, reject the request. One should also have the SHUFFLER set a flag to tell the EXEC that it is active but waiting, since it will only make matters worse if SHF gets requested partway through a run. We have done this on our system and gained 10-20% more throughput out of a thrashing system.
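As an illustration only, the change might look something like the following in MACRO-11. This is a minimal sketch of the counting scheme just described; the symbol names (MAXSHF, $SHFCT, $SHFWT) and the exact hook points are assumptions, not the actual RSX11M Executive sources or the KMS patch.

        ; In the clock service code, once per Executive swapping interval:

        MOV     #MAXSHF,$SHFCT  ; reload the per-interval run budget

        ; Wherever the Executive requests the shuffler:

        TSTB    $SHFWT          ; flag set by SHF: active but waiting?
        BNE     10$             ; yes - a new request would only hurt
        DEC     $SHFCT          ; charge this request against the budget
        BLT     10$             ; budget used up - reject the request
        ...                     ; fall through to the normal request path
10$:                            ; rejected; SHF can run again next interval

With at most MAXSHF runs allowed per swapping interval, a thrashing system spends its time running tasks instead of endlessly rerunning the shuffler.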
2) QIO OPTIMIZATION

If you have selected QIO optimization at SYSGEN, you have another easily modified tuning parameter. As a quick and dirty test of what one might expect, I dumped an RK05 to NL: using PIP, first with MAXPKT set to 15 and then with it set to 0. I found that it took 2.6% less time to complete the transfer with MAXPKT set to 15. Next, I wrote a test program which wrote 10,000 records to NL:, installed it six times with different names, ran all six tasks at once, and timed how long it took for all six tasks to finish and exit. If MAXPKT was set to 15 rather than 0, the process took 6.6% less time to complete. This benefit would be less pronounced if one were transferring to a real device, but it indicates that the system may be spending 5% or so less time servicing the QIO before it gets to the driver.

3) ROUND ROBIN SCHEDULER AND SWAPPING INTERVAL

Another tuning parameter which can be tweaked is the Round Robin and Swapping interval. As part of the KMS System Accounting and Performance Measurement package, I included a task for displaying and modifying these intervals on-line. To test the effects of heavy swapping on throughput, I started up six simultaneous taskbuilds with versions of BIGTKB which were 23K in size. Only three of the TKB versions could fit in core at one time, and the system was running with anywhere from 80-102K of tasks swapped out. Holding the Round Robin interval fixed at 5/100 of a second, the swap time was varied from 20/100 of a second to 1 second. The results are displayed in the graph. The baseline is the time it would take for six sequential taskbuilds to take place. As you can see, as the swapping interval is lengthened, the time for the entire taskbuild approaches the baseline asymptotically. The 'dots' represent the timings taken with SHF in SYSPAR, and the 'x's' represent timings with SHF fixed in SHFPAR. These timings were done for an RK07 disk. From them we see that a swapping interval of 40/100 of a second is about the minimum we should use. A slower disk (RK05) should use a somewhat longer swapping interval.
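Both of the tests above rely on the same trick: installing one task image several times under different names so that several copies can run concurrently. A hypothetical indirect command file for such a load test might look like the following; the task name QIOTST and its UIC are assumptions, not the actual test program.

     .; Install one test image under six different task names.
     INS [200,200]QIOTST/TASK=QIO1
     INS [200,200]QIOTST/TASK=QIO2
     INS [200,200]QIOTST/TASK=QIO3
     INS [200,200]QIOTST/TASK=QIO4
     INS [200,200]QIOTST/TASK=QIO5
     INS [200,200]QIOTST/TASK=QIO6
     .; Note the start time, start all six at once, and compare with
     .; the time at which the last copy exits.
     TIM
     RUN QIO1
     RUN QIO2
     RUN QIO3
     RUN QIO4
     RUN QIO5
     RUN QIO6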
4) MEMORY USAGE

On our system memory is very tight. In the face of an increasing number of users and terminals, it was necessary to examine very critically how tasks use memory. One of the first things noticed was that a significant number of DEC taskbuild command files for mapped systems build utilities and privileged tasks with default sizes far, far in excess of what is needed. A classic case in point is that the new despooler task is built with a task size of 8K words when it only needs 2-3K words. A second example is INS.TSK, which is also built as an 8K task. If it is installed via VMR with an INC=0, or taskbuilt to the correct size, its size is reduced by a little over 1K words. By going through the command files and modifying the partition size prior to taskbuilding, you will create tasks which use less memory.

Some tasks, however, like BIGTKB, use the extra space for symbol storage or buffering. For TKB, what we did was to find the size required for a small taskbuild and start TKB out at that size. Since our system supports the extend task directive, TKB will grow in size if needed. Since having to do too many extend tasks slows down performance, we use a CCL command to create a super large version of TKB for use during SYSGENs and RMS taskbuilds. This technique saves about 3K of memory for a typical use of TKB.

5) RESIDENT LIBRARIES

Since the taskbuilder is one of the biggest users of core, a substantial overall improvement in system throughput can be accomplished by making it run faster. This is one of the advantages of using resident libraries. Taskbuilding a given task using a resident library often takes only a small fraction of the time required for taskbuilding without one. Those of us who have taskbuilt RMS tasks are well aware of how long taskbuilds using the RMS ODL's can be. On a small system such as ours, the length of the taskbuild times ensures that if other users are doing program development from other terminals, checkpointing is almost certain to take place. This problem is considerably alleviated by using an RMS resident library. However, a small system such as ours just cannot support an extra 4-12K of resident libraries all the time. There are two ways to circumvent this problem.

First, one can use the SET /TOP command to shrink GEN and make room for a temporary (but otherwise very normal) resident library. The required software patches to MCR for doing this were published in the MULTITASKER a few months ago and are on the San Diego tape. This method is only useful during program development and is not transparent to users.

A second and much better way is to use transient resident libraries. Brian McCarthy (DEC) developed a method of loading resident libraries into PLAS regions. When a task is run, the resident library is automatically loaded into core and linked to the task(s). When the task exits, the resident library PLAS region vanishes. The code to produce loadable resident libraries will be on this DECUS tape. We have been using it in production for 5 months and it has worked very well. There are two drawbacks. The first is that PLAS regions cannot be shuffled. The second is that the method does not work on PIC libraries or on BP2 libraries linked with separate RMS libraries. However, a combined BP2/RMS 12K library is supplied.

6) BATCH

Another thing which can be done to improve overall throughput on a crowded system is to implement BATCH. An improved single-stream BATCH and a multi-stream program development queue utility will be on this DECUS SIG tape. Numerous in-house tests have shown that 4-8 users all trying to compile or taskbuild at the same time will swap the system almost to a standstill. Under conditions like these, it is prudent and to everyone's advantage to submit taskbuilds and compilations to a single or multiple stream BATCH queue.

While working to optimize BATCH, we realized that the indirect file processor, while a great tool, is used very inefficiently if it has to keep swapping in and out of core during the course of a program compilation and taskbuild. Since the majority of program development work uses simple MCR commands, we implemented a Procedure Interpreter (PIN), four times smaller than ...AT., which spawns off simple commands to MCR from a command file. Indirect is used on the fly to create the user-specific PIN command files. Even if PIN gets swapped out, it is far more efficient to swap it in and out instead of ...AT. A sample PIN command file is sketched below.
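For illustration, a PIN command file for one compile-and-taskbuild cycle might contain nothing but simple MCR command lines such as these (the file name PROG, the F4P compiler, and the build command file PROGBLD are assumptions about a typical use, not the actual KMS files):

     F4P PROG,PROG=PROG
     TKB @PROGBLD
     PIP PROG.OBJ;*/DE

Because every line is a plain MCR command to be spawned, PIN needs none of Indirect's directive processing, which is what allows it to be so much smaller than ...AT.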
7) CCL

I don't want to say too much about CCL; once I get started it's hard to stop. CCL is not quite a panacea for all your problems, but it comes very close. It consists of two parts: a user-extensible, file-driven catchall task, and modifications to INSTALL to allow passing MCR command lines to uninstalled tasks. The biggest benefit of CCL is to free up POOL. Most tasks which were previously installed can now be removed, transparently to the user; command lines are parsed and sent on from CCL to non-installed tasks. In addition, it can make your system unbelievably friendly. Simple, user-understandable commands can be used to aid the user through the complexity of our RSX wonderland. For example, we have a BUILD command. The user types BUILD FILNAME, and in a completely transparent way a FORTRAN task is correctly compiled and taskbuilt. Of any of the CCL's available on DEC systems (RSTS or VAX), this is perhaps the most powerful. I firmly believe that if you haven't got CCL on your system yet, you should run, not walk, to your nearest San Diego DECUS tape and get a copy to put on your system.

8) FCS Resident Library

Building a system based on using an FCSRES is perhaps the hardest of the various performance/tuning options I have discussed. However, it holds the greatest potential for DRAMATICALLY improving your system performance. Among its benefits and disadvantages are:

     BENEFITS OF USING FCSRES
     1. Faster taskbuild times
     2. Smaller on-disk task size
     3. Smaller in-core task size
     4. Tasks less overlaid, run faster
     5. Less loading on the system disk
     6. Smaller tasks swap faster
     7. More tasks can fit in core

     DISADVANTAGES
     1. More tasks to build for SYSGEN
     2. On-line SYSGEN from one base level to another is hard

On the 1979 San Diego tape, I supplied command and ODL files which allow building all DEC unprivileged utilities, and a few privileged tasks, with FCSRES. The table below compares on-disk task sizes of DEC utilities built with and without FCSRES. For just these tasks there is a saving in disk blocks of 23%.

     COMPARISON OF TASK SIZES (WITH AND WITHOUT FCSRES)

                      SIZE (Disk Blocks)
     TASK        No FCSRES    With FCSRES
     BIGMAC          71            57
     BIGTKB         161           145
     CDA            159           114
     CMP             50            29
     CRF             36            24
     DMP             57            41
     EDI             60            41
     EDT            108            88
     FLX            129           106
     FMT             65            57
     IOX             99            79
     LBR             72            52
     PAT             44            25
     PIP             67            51
     SLP             48            30
     VFY             57            39
     VMR            144           124
     ZAP             38            25
     ------------------------------------
     TOTAL         1465          1126

The amount of in-core size reduction for DEC utilities built with FCSRES is not as large as one might hope. The savings run from about 0.5 to 2K words, with the average around 1K words. The reason is that the utilities are very heavily overlaid. Also, the get-command-line code is not in the resident library (size??, PIC??) and is included in most of the utilities. Even so, consider the effect that a 1K-word average size reduction has on a system with, say, 64K words of free core in GEN. Assuming an average task size of 8K words, eight tasks can fit in core before swapping starts. If you lop off 1K words from each task, this frees up an additional 8K words, enough for one more task to be in core before swapping starts. An additional benefit is that initial task loads are significantly faster, by an average of 13% for the typical utility. For the same reason, checkpointing takes significantly less time.
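For reference, building a utility against the resident library is an ordinary taskbuild with one extra option. A minimal sketch of the TKB dialog follows; the LIBR=FCSRES:RO option names the library described here, but the switches shown are assumptions, and the command and ODL files on the 1979 San Diego tape are the authoritative versions.

     TKB>PIP/CP,PIP/-SP=PIP/MP
     TKB>/
     Enter Options:
     TKB>LIBR=FCSRES:RO
     TKB>//

The :RO suffix maps the library read-only, so every task built this way shares the single in-core copy of FCS.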
The real benefit of building tasks with an FCSRES is the dramatic increase in speed of the standard DEC utilities. These utilities are normally very heavily overlaid; building them with FCSRES removes almost all of the FCS overlays in the task. Consider the following speed improvements:

     TASK     OPERATION                              SPEED IMPROVEMENT
     PIP      NL:=DM:[*,*]*.*/FU                     2.4 times faster
     PIP      NL:=[1,2]*.HLP                         1.2 times faster
     BIGMAC   NL:=HELLO.MAC                          1.1 times faster
     ...AT.   A file with a loop including
              .INC, .IF, .TESTFILE, .GOTO            1.7 times faster

A very significant benefit is that the overall responsiveness of the system improves. Since the tasks have far fewer overlays, the system disk stays far less busy, and task requests to read and write files are not interrupted by reading in overlays from the task image. At present I do not have a reliable indicator of how much this affects performance, but on a busy RK05 system disk with 4-5 users at work with the utilities, the busy light flickers only intermittently if the tasks are built with FCSRES. If the tasks are not built with FCSRES, the busy light never goes off until the users exit.

Using an FCS resident library does have a few disadvantages. First, one must rebuild all utilities for the initial FCSGEN. Normally this need only be done once per release cycle. Due to the number of tasks involved (both DEC utilities and our Fortran application programs), this job makes SYSGEN seem trivial. Even so, it can be done on line in a day of continuous taskbuilding. However, if a bug in FCS appears which requires patching, FCSRES would have to be rebuilt along with all the tasks using it. This is not something to be undertaken lightly.

Because DEC has modified FCS at each release level, an FCS resident library must be rebuilt each time a new release of RSX11M is obtained. At present, building all tasks with a new version of FCS/FCSRES from an existing system using FCSRES is a bit of a pain. I managed to do it because I have multiple disks: I mount an alternate disk for the new SYSLIB and FCSRES and assign it as my local LB:. I then taskbuild all utilities and application programs, placing them in a UIC which will become my new LBUIC. A user with only one large disk would have substantially more difficulty doing an on-line FCSGEN using the previous release of the operating system. However, such users have the same problem in general in doing an on-line SYSGEN from one base level to another.

PERFORMANCE MEASUREMENT

I have said a great deal so far about ways to improve system performance. But before you go trying to tweak your system, you must answer a very important question first, namely: "How do I know whether, or by how much, performance has changed?" Our approach was evolutionary and piecemeal: I would make one change and then write a test program or procedure to test it out. For changes such as varying the value of MAXPKT or changing the Round Robin/Swapping times, this approach is adequate. It does not, however, provide any standardized measure of how a system performs under load.

The KMS accounting and performance measurement package provided information on performance, but it lacked the ability to snapshot performance on a fine enough scale without gathering huge amounts of data. Also, there was no easy way to relate the data to the degree of system load. The selective task accounting feature easily enables one to measure the performance of a specific task as one attempts to optimize it. I felt that a similar system-wide capability was needed for tuning system performance.
What was needed was a method of simulating the load imposed on the system by 'N' users on 'N' terminals busily working away. In this way pool requirements, processor speed requirements, and disk access time requirements could all be estimated prior to committing to hardware and software. During the course of such a test, I wanted to snapshot the statistics gathered by our accounting package and to measure the activity on the system disk. I did not want to use physical terminals, since I wanted to observe the effect of having more terminals active than we currently have.

To do this, 'N' copies of INS were installed with the names I00, I01, ..., Inn. These versions of INS were used to start a procedure file running with a unique name for each procedure interpreter task (PINXnn). An indirect command file creates a unique procedure file for each PINXnn task. Each file contains commands of the form

     Inn $F4P/TASK=F4PXnn/RUN=REM/PRM="FOO=FOO"

The indirect file processor monitors the status of the 'N' job streams and, when all have exited, writes out all the statistics to a log file. For completeness, the command file also executes a program to snapshot our system performance statistics (CPU time, Shuffler count, Checkpoint count) before and after the load test. This data is passed to the command file, which also logs these statistics. A separate task monitors the busy time of LB:, and this too is reported.

This performance measurement package has proved to be very versatile and useful. Using it, we were able to pinpoint a serious contention problem between TKB searching SYSLIB and tasks swapping in and out of the checkpoint file on LB0:. This package will be on this fall's RSX SIG tape.
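To make the mechanism concrete, here is a hypothetical instantiation for stream 05. The first line is the one-time setup that creates the fifth copy of INS (assuming the copies are installed under MCR-invocable names, which the text implies); the second is the line in PINX05's procedure file, which is just the command form quoted above with nn = 05 and the /PRM argument left as the placeholder from the text:

     INS $INS/TASK=...I05

     I05 $F4P/TASK=F4PX05/RUN=REM/PRM="FOO=FOO"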