RSX11M System Tuning, Performance Optimization and Measurement

This is going to be a workshop on System Tuning and Performance Optimization for RSX11M systems. While most of what will be discussed is particularly applicable to RSX11M V3.2 systems, some of it will be generally applicable to other systems. The first part of this session will be devoted to presentations by the speakers on a wide-ranging series of topics. This will take about 1.5 hours. Then, in the last half hour, we will welcome discussion, comments, suggestions and questions from the floor. We have a considerable amount of material to cover, so if questions can be held to a minimum it would be useful.

I became interested in optimizing the performance of our RSX11M system over two years ago. We currently have an 11/45 with 11 terminals and a very large number of users, each of whom covets the entire resources of our system. Two years ago we were critically tight on disk space and terminal time. This problem led me into disk block and terminal time accounting, and finally into the area of resource accounting in general. After getting the disk space and terminal time problems under control, I found that the growth of system usage was leaving POOL and memory critically tight. At this point I implemented Richard Kirkman's CCL, updating it for V3.1, and a whole raft of MCR, INDIRECT, and BATCH enhancements. For a while our memory and POOL problems vanished. Then the users discovered how convenient the system was, and usage went up, again pushing us against the proverbial wall. To make matters worse, V3.2 of RSX11M seemed to require substantially more core than before. This pushed us into reconfiguring the system to make heavy use of an FCS resident library, and into monitoring the performance of the system and of various software subsystems. These measurements helped justify adding a cache memory and additional disk capacity.

At present, performance measurement and capacity planning are our critical interest. We are trying to get a handle on the ultimate capacity limits of our machine. Users now want to increase the number of terminals from 11 to 16 or more. Serious thought is being given to connecting the PDP-11/45 to 3-4 satellite 11/23's via DECNET to handle program development for a new laboratory data acquisition system. At some point a larger machine will be needed, and hopefully performance measurements will give us enough information to plan our data acquisition system's growth.

By the end of this workshop, we hope to have presented material which will show you, the user, that very substantial performance improvements are possible. Depending on the amount of effort one wishes to invest, 10% to over 50% improvements in performance are not unreasonable. The goal of this workshop is not to tell you how to make these improvements, or to overly extol the virtues of any one method. Rather, we want to tell you what can be done, give you an estimate of the degree of benefit involved, and give some feeling for the complexity of the task and any system consequences. All of the improvements I will discuss we have implemented on our system. The net result is that throughput on our system is perhaps 2-4 times greater than on a standard 11/45 running a standard RSX11M operating system. The lower limit of that range is the result of adding a cache memory; the upper limit is the result of many software and system changes.

What do I mean by performance?
Performance can be considered either system-wide or on a per-task basis. A development system will probably be concerned with overall optimization of the number of tasks processed through the system. A dedicated system will be primarily interested in optimizing the performance of a set of tasks. Toward this end we will address a number of topics.

Types of Optimization

1. Overall System Throughput
2. Subsystem or Application System Throughput
3. Ease of use (system/subsystem)
4. Memory Usage
5. Pool Usage (see Dan Steinberg's talk)
6. Disk Usage

(SEE SLIDE OF TOPICS AND SPEAKERS)

TUNING A SYSTEM

A) HARDWARE OPTIMIZATION

The easiest way to improve the performance of an existing system is to install a cache memory, if it doesn't have one already. I believe that one can now get cache memories for any DEC processor larger than a PDP-11/20, either through DEC or through a brand X company. For about $6K or so your system can run 30-50% faster. Swapping disks or disk emulators can also provide substantial benefit to a heavily loaded system. Often just redistributing the disk usage on a multi-disk system can provide significant benefit. The trick is to find out where bottlenecks are occurring and to know which changes help the situation the most. Finally, users doing a lot of high speed terminal I/O can improve their system's performance quite significantly by using a DH11 (DMA) terminal interface and the V3.2 full duplex terminal driver.

B) SOFTWARE AND OPERATING SYSTEM OPTIMIZATION

Without resorting to performance measurement tools or hardware enhancements, there are a number of software options readily available to the user which will improve system performance.

1) SHUFFLER and RMDEMO

The SHUFFLER and RMDEMO affect the performance of a heavily loaded system such as ours. First, comparing two identically loaded systems, one running the new RMDEMO and the other the old RMDEMO, one finds that the new RMDEMO imposes at least 25% greater overhead than the old version. Part of this is because it is larger and more swapping occurs. Second, the new RMDEMO issues one QIO for each line changed on the screen since the last plot. On systems selecting checkpointing on both input and output, RMDEMO can get into a mode in which it checkpoints between QIO's if the system is loaded. In some cases the system will swap itself to a standstill. For those who wish to use the old RMDEMO, it is on the SAN DIEGO tape from the last DECUS symposium. It is modified for V3.2 and has many of the features of the new RMDEMO. We plan to support it at least through V4.0.

A second very useful trick for a heavily loaded system is to place SHF in its own partition and FIX it there. This costs almost no memory, since SYSPAR is much too large as it stands for a mapped system; it can almost hold SHF and MCR... side by side as it is now. Tests of system throughput for a system doing a lot of swapping show that throughput is increased by about 5% if SHF is fixed in its own partition.

If you have an extremely heavily loaded system, you will find that a large fraction of the system's time is spent with SHF trying to make room and not being able to. It is a design 'feature' of the shuffler that there is no limit placed on the number of times per second it will try to recover memory to run a task. There is a relatively simple way around this problem if one wishes to modify the EXEC slightly. At each Executive swapping interval, load a counter with the maximum number of times one feels SHF should run during a swapping interval. Then, each time the shuffler is requested, decrement the counter; if the counter runs out, reject the request. One should also have the SHUFFLER set a flag to tell the EXEC that it is active but waiting, since it will only make matters worse if SHF gets requested partway through a run. We have done this on our system and gained 10-20% more throughput out of a thrashing system.
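As an illustration only, the change might look something like the following in MACRO-11. This is a minimal sketch of the counting scheme just described; the symbol names (MAXSHF, $SHFCT, $SHFWT) and the exact hook points are assumptions, not the actual RSX11M Executive sources or the KMS patch.

        ; In the clock service code, once per Executive swapping interval:

        MOV     #MAXSHF,$SHFCT  ; reload the per-interval run budget

        ; Wherever the Executive requests the shuffler:

        TSTB    $SHFWT          ; flag set by SHF: active but waiting?
        BNE     10$             ; yes - a new request would only hurt
        DEC     $SHFCT          ; charge this request against the budget
        BLT     10$             ; budget used up - reject the request
        ...                     ; fall through to the normal request path
10$:                            ; rejected; SHF can run again next interval

With at most MAXSHF runs allowed per swapping interval, a thrashing system spends its time running tasks instead of endlessly rerunning the shuffler.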
2) QIO OPTIMIZATION

If you have selected QIO optimization at SYSGEN, you have another easily modified tuning parameter. As a quick and dirty test of what one might expect, I dumped an RK05 to NL: using PIP, first with MAXPKT set to 15 and then with it set to 0. I found that it took 2.6% less time to complete the transfer with MAXPKT set to 15. Next, I wrote a test program which wrote 10,000 records to NL:, installed it six times with different names, ran all six tasks at once, and timed how long it took for all six tasks to finish and exit. If MAXPKT was set to 15 rather than 0, the process took 6.6% less time to complete. This benefit would be less pronounced if one were transferring to a real device, but it indicates that the system may be spending 5% or so less time servicing the QIO before it gets to the driver.

3) ROUND ROBIN SCHEDULER AND SWAPPING INTERVAL

Another tuning parameter which can be tweaked is the Round Robin and Swapping interval. As part of the KMS System Accounting and Performance Measurement package, I included a task for displaying and modifying these intervals on-line. To test the effects of heavy swapping on throughput, I started up six simultaneous taskbuilds with versions of BIGTKB which were 23K in size. Only three of the TKB versions could fit in core at one time, and the system was running with anywhere from 80-102K of tasks swapped out. Holding the Round Robin interval fixed at 5/100 of a second, the swap time was varied from 20/100 of a second to 1 second. The results are displayed in the graph. The baseline is the time it would take for six sequential taskbuilds to take place. As you can see, as the swapping interval is lengthened, the time for the entire taskbuild approaches the baseline asymptotically. The 'dots' represent the timings taken with SHF in SYSPAR, and the 'x's' represent timings with SHF fixed in SHFPAR. These timings were done for an RK07 disk. From them we see that a swapping interval of 40/100 of a second is about the minimum we should use. A slower disk (RK05) should use a somewhat longer swapping interval.
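Both of the tests above rely on the same trick: installing one task image several times under different names so that several copies can run concurrently. A hypothetical indirect command file for such a load test might look like the following; the task name QIOTST and its UIC are assumptions, not the actual test program.

     .; Install one test image under six different task names.
     INS [200,200]QIOTST/TASK=QIO1
     INS [200,200]QIOTST/TASK=QIO2
     INS [200,200]QIOTST/TASK=QIO3
     INS [200,200]QIOTST/TASK=QIO4
     INS [200,200]QIOTST/TASK=QIO5
     INS [200,200]QIOTST/TASK=QIO6
     .; Note the start time, start all six at once, and compare with
     .; the time at which the last copy exits.
     TIM
     RUN QIO1
     RUN QIO2
     RUN QIO3
     RUN QIO4
     RUN QIO5
     RUN QIO6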
4) MEMORY USAGE

On our system memory is very tight. In the face of an increasing number of users and terminals, it was necessary to examine very critically how tasks use memory. One of the first things noticed was that a significant number of DEC taskbuild command files for mapped systems build utilities and privileged tasks with default sizes far, far in excess of what is needed. A classic case in point is that the new despooler task is built with a task size of 8K words when it only needs 2-3K words. A second example is INS.TSK, which is also built as an 8K task. If it is installed via VMR with an INC=0, or taskbuilt to the correct size, its size is reduced by a little over 1K words. By going through the command files and modifying the partition size prior to taskbuilding, you will create tasks which use less memory.

Some tasks, however, like BIGTKB, use the extra space for symbol storage or buffering. For TKB, what we did was to find the size required for a small taskbuild and start TKB out at that size. Since our system supports the extend task directive, TKB will grow in size if needed. Since having to do too many extend tasks slows down performance, we use a CCL command to create a super large version of TKB for use during SYSGENs and RMS taskbuilds. This technique saves about 3K of memory for a typical use of TKB.

5) RESIDENT LIBRARIES

Since the taskbuilder is one of the biggest users of core, a substantial overall improvement in system throughput can be accomplished by making it run faster. This is one of the advantages of using resident libraries. Taskbuilding a given task using a resident library often takes only a small fraction of the time required for taskbuilding without one. Those of us who have taskbuilt RMS tasks are well aware of how long taskbuilds using the RMS ODL's can be. On a small system such as ours, the length of the taskbuild times ensures that if other users are doing program development from other terminals, checkpointing is almost certain to take place. This problem is considerably alleviated by using an RMS resident library. However, a small system such as ours just cannot support an extra 4-12K of resident libraries all the time. There are two ways to circumvent this problem.

First, one can use the SET /TOP command to shrink GEN and make room for a temporary (but otherwise very normal) resident library. The required software patches to MCR for doing this were published in the MULTITASKER a few months ago and are on the San Diego tape. This method is only useful during program development and is not transparent to users.

A second and much better way is to use transient resident libraries. Brian McCarthy (DEC) developed a method of loading resident libraries into PLAS regions. When a task is run, the resident library is automatically loaded into core and linked to the task(s). When the task exits, the resident library PLAS region vanishes. The code to produce loadable resident libraries will be on this DECUS tape. We have been using it in production for 5 months and it has worked very well. There are two drawbacks. The first is that PLAS regions cannot be shuffled. The second is that the method does not work on PIC libraries or on BP2 libraries linked with separate RMS libraries. However, a combined BP2/RMS 12K library is supplied.

6) BATCH

Another thing which can be done to improve overall throughput on a crowded system is to implement BATCH. An improved single-stream BATCH and a multi-stream program development queue utility will be on this DECUS SIG tape. Numerous in-house tests have shown that 4-8 users all trying to compile or taskbuild at the same time will swap the system almost to a standstill. Under conditions like these, it is prudent and to everyone's advantage to submit taskbuilds and compilations to a single or multiple stream BATCH queue.

While working to optimize BATCH, we realized that the indirect file processor, while a great tool, is used very inefficiently if it has to keep swapping in and out of core during the course of a program compilation and taskbuild. Since the majority of program development work uses simple MCR commands, we implemented a Procedure Interpreter (PIN), four times smaller than ...AT., which spawns off simple commands to MCR from a command file. Indirect is used on the fly to create the user-specific PIN command files. Even if PIN gets swapped out, it is far more efficient to swap it in and out instead of ...AT. A sample PIN command file is sketched below.
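For illustration, a PIN command file for one compile-and-taskbuild cycle might contain nothing but simple MCR command lines such as these (the file name PROG, the F4P compiler, and the build command file PROGBLD are assumptions about a typical use, not the actual KMS files):

     F4P PROG,PROG=PROG
     TKB @PROGBLD
     PIP PROG.OBJ;*/DE

Because every line is a plain MCR command to be spawned, PIN needs none of Indirect's directive processing, which is what allows it to be so much smaller than ...AT.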
7) CCL

I don't want to say too much about CCL; once I get started it's hard to stop. CCL is not quite a panacea for all your problems, but it comes very close. It consists of two parts: a user-extensible, file-driven catchall task, and modifications to INSTALL to allow passing MCR command lines to uninstalled tasks. The biggest benefit of CCL is to free up POOL. Most tasks which were previously installed can now be removed, transparently to the user; command lines are parsed and sent on from CCL to non-installed tasks. In addition, it can make your system unbelievably friendly. Simple, user-understandable commands can be used to aid the user through the complexity of our RSX wonderland. For example, we have a BUILD command. The user types BUILD FILNAME, and in a completely transparent way a FORTRAN task is correctly compiled and taskbuilt. Of any of the CCL's available on DEC systems (RSTS or VAX), this is perhaps the most powerful. I firmly believe that if you haven't got CCL on your system yet, you should run, not walk, to your nearest San Diego DECUS tape and get a copy to put on your system.

8) FCS Resident Library

Building a system based on using an FCSRES is perhaps the hardest of the various performance/tuning options I have discussed. However, it holds the greatest potential for DRAMATICALLY improving your system performance. Among its benefits and disadvantages are:

     BENEFITS OF USING FCSRES
     1. Faster taskbuild times
     2. Smaller on-disk task size
     3. Smaller in-core task size
     4. Tasks less overlaid, run faster
     5. Less loading on the system disk
     6. Smaller tasks swap faster
     7. More tasks can fit in core

     DISADVANTAGES
     1. More tasks to build for SYSGEN
     2. On-line SYSGEN from one base level to another is hard

On the 1979 San Diego tape, I supplied command and ODL files which allow building all DEC unprivileged utilities, and a few privileged tasks, with FCSRES. The table below compares on-disk task sizes of DEC utilities built with and without FCSRES. For just these tasks there is a saving in disk blocks of 23%.

     COMPARISON OF TASK SIZES (WITH AND WITHOUT FCSRES)

                      SIZE (Disk Blocks)
     TASK        No FCSRES    With FCSRES
     BIGMAC          71            57
     BIGTKB         161           145
     CDA            159           114
     CMP             50            29
     CRF             36            24
     DMP             57            41
     EDI             60            41
     EDT            108            88
     FLX            129           106
     FMT             65            57
     IOX             99            79
     LBR             72            52
     PAT             44            25
     PIP             67            51
     SLP             48            30
     VFY             57            39
     VMR            144           124
     ZAP             38            25
     ------------------------------------
     TOTAL         1465          1126

The amount of in-core size reduction for DEC utilities built with FCSRES is not as large as one might hope. The savings run from about 0.5 to 2K words, with the average around 1K words. The reason is that the utilities are very heavily overlaid. Also, the get-command-line code is not in the resident library (size??, PIC??) and is included in most of the utilities. Even so, consider the effect that a 1K-word average size reduction has on a system with, say, 64K words of free core in GEN. Assuming an average task size of 8K words, eight tasks can fit in core before swapping starts. If you lop off 1K words from each task, this frees up an additional 8K words, enough for one more task to be in core before swapping starts. An additional benefit is that initial task loads are significantly faster, by an average of 13% for the typical utility. For the same reason, checkpointing takes significantly less time.
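For reference, building a utility against the resident library is an ordinary taskbuild with one extra option. A minimal sketch of the TKB dialog follows; the LIBR=FCSRES:RO option names the library described here, but the switches shown are assumptions, and the command and ODL files on the 1979 San Diego tape are the authoritative versions.

     TKB>PIP/CP,PIP/-SP=PIP/MP
     TKB>/
     Enter Options:
     TKB>LIBR=FCSRES:RO
     TKB>//

The :RO suffix maps the library read-only, so every task built this way shares the single in-core copy of FCS.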
The real benefit of building tasks with an FCSRES is the dramatic increase in speed of the standard DEC utilities. These utilities are normally very heavily overlaid; building them with FCSRES removes almost all of the FCS overlays in the task. Consider the following speed improvements:

     TASK     OPERATION                              SPEED IMPROVEMENT
     PIP      NL:=DM:[*,*]*.*/FU                     2.4 times faster
     PIP      NL:=[1,2]*.HLP                         1.2 times faster
     BIGMAC   NL:=HELLO.MAC                          1.1 times faster
     ...AT.   A file with a loop including
              .INC, .IF, .TESTFILE, .GOTO            1.7 times faster

A very significant benefit is that the overall responsiveness of the system improves. Since the tasks have far fewer overlays, the system disk stays far less busy, and task requests to read and write files are not interrupted by reading in overlays from the task image. At present I do not have a reliable indicator of how much this affects performance, but on a busy RK05 system disk with 4-5 users at work with the utilities, the busy light flickers only intermittently if the tasks are built with FCSRES. If the tasks are not built with FCSRES, the busy light never goes off until the users exit.

Using an FCS resident library does have a few disadvantages. First, one must rebuild all utilities for the initial FCSGEN. Normally this need only be done once per release cycle. Due to the number of tasks involved (both DEC utilities and our Fortran application programs), this job makes SYSGEN seem trivial. Even so, it can be done on line in a day of continuous taskbuilding. However, if a bug in FCS appears which requires patching, FCSRES would have to be rebuilt along with all the tasks using it. This is not something to be undertaken lightly.

Because DEC has modified FCS at each release level, an FCS resident library must be rebuilt each time a new release of RSX11M is obtained. At present, building all tasks with a new version of FCS/FCSRES from an existing system using FCSRES is a bit of a pain. I managed to do it because I have multiple disks: I mount an alternate disk for the new SYSLIB and FCSRES and assign it as my local LB:. I then taskbuild all utilities and application programs, placing them in a UIC which will become my new LBUIC. A user with only one large disk would have substantially more difficulty doing an on-line FCSGEN using the previous release of the operating system. However, such users have the same problem in general in doing an on-line SYSGEN from one base level to another.

PERFORMANCE MEASUREMENT

I have said a great deal so far about ways to improve system performance. But before you go trying to tweak your system, you must answer a very important question first, namely: "How do I know whether, or by how much, performance has changed?" Our approach was evolutionary and piecemeal: I would make one change and then write a test program or procedure to test it out. For changes such as varying the value of MAXPKT or changing the Round Robin/Swapping times, this approach is adequate. It does not, however, provide any standardized measure of how a system performs under load.

The KMS accounting and performance measurement package provided information on performance, but it lacked the ability to snapshot performance on a fine enough scale without gathering huge amounts of data. Also, there was no easy way to relate the data to the degree of system load. The selective task accounting feature easily enables one to measure the performance of a specific task as one attempts to optimize it. I felt that a similar system-wide capability was needed for tuning system performance.
What was needed was a method of simulating the load imposed on the system by 'N' users on 'N' terminals busily working away. In this way pool requirements, processor speed requirements, and disk access time requirements could all be estimated prior to committing to hardware and software. During the course of such a test, I wanted to snapshot the statistics gathered by our accounting package and to measure the activity on the system disk. I did not want to use physical terminals, since I wanted to observe the effect of having more terminals active than we currently have.

To do this, 'N' copies of INS were installed with the names I00, I01, ..., Inn. These versions of INS were used to start a procedure file running with a unique name for each procedure interpreter task (PINXnn). An indirect command file creates a unique procedure file for each PINXnn task. Each file contains commands of the form

     Inn $F4P/TASK=F4PXnn/RUN=REM/PRM="FOO=FOO"

The indirect file processor monitors the status of the 'N' job streams and, when all have exited, writes out all the statistics to a log file. For completeness, the command file also executes a program to snapshot our system performance statistics (CPU time, Shuffler count, Checkpoint count) before and after the load test. This data is passed to the command file, which also logs these statistics. A separate task monitors the busy time of LB:, and this too is reported.

This performance measurement package has proved to be very versatile and useful. Using it, we were able to pinpoint a serious contention problem between TKB searching SYSLIB and tasks swapping in and out of the checkpoint file on LB0:. This package will be on this fall's RSX SIG tape.
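To make the mechanism concrete, here is a hypothetical instantiation for stream 05. The first line is the one-time setup that creates the fifth copy of INS (assuming the copies are installed under MCR-invocable names, which the text implies); the second is the line in PINX05's procedure file, which is just the command form quoted above with nn = 05 and the /PRM argument left as the placeholder from the text:

     INS $INS/TASK=...I05

     I05 $F4P/TASK=F4PX05/RUN=REM/PRM="FOO=FOO"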