.BOTTOM NUMBER .NONUMBER .roman numerals .UPPER CASE .FILL .SPACING 1 .JUSTIFY .RIGHT MARGIN 70 .DOCUMENT ID CLUSTR USERS MANUAL .BOTH SIDES .FIGURE 8 .CENTRE 75 ^A NUMERICAL CLASSIFICATION SUITE\ .SKIP 4 .CENTRE 75 ^CLUSTR\ .SKIP 6 .CENTRE 75 ^C.J. ANDREWS\ .DOCUMENT ID .PAGE .DOCUMENT ID CLUSTR USERS MANUAL .PAGE .LEFT MARGIN 5 .CENTRE ^&ABSTRACT\& .lower case .NUMBER .PARAGRAPH 5 ^THIS MANUAL DESCRIBES A SUITE OF PROGRAMS WHICH ARE CAPABLE OF DEALING EFFECTIVELY WITH SETS OF DATA WHICH ARE TO BE NUMERICALLY CLASSIFIED. ^THE DATA REPRESENT SEVERAL ENTITIES WHICH ARE DESCRIBED BY RELEVANT ATTRIBUTES. .PARAGRAPH ^THE METHOD BY WHICH THE CLASSIFICATION IS PERFORMED MAY BE CONTROLLED IN A MOST FLEXIBLE MANNER, BY SEVERAL EASILY SET USER OPTIONS. ^THESE OPTIONS CONTROL THE FOLLOWING STEPS IN THE CLASSIFICATION PROCESS: .LEFT MARGIN 10 .RIGHT MARGIN 60 .UPPER CASE .skip 1 (i) A transformation of the raw data may optionally be carried out in one of several ways. .break (ii) A choice of dissimilarity indices may be made. .break (iii) A choice of sorting and clustering strategies is available. .break (iv) Output optionally available includes printouts of trellis diagrams, two way tables and summaries of the raw data, and plots of dendrograms derived from the sorting strategies. .break (v) Optional ordination, derived from the methods of Principal Component Analysis and/or Principal Coordinate Analysis, may be selected. .skip 1 .RIGHT MARGIN 72 .left margin 5 .paragraph The program as outlined performs both normal and inverse analyses of two-dimensional raw data in the form of entities versus attributes. Such data are commonly generated in psychological, taxonomic, and biological studies and also in studies in other social sciences. .paragraph CLUSTR can also be used to classify three dimensional data (entity-1 x entity-2 x attribute) as is often required in ecological study, for example in sites x species x times analysis. 
This extension in no way affects the two-dimensional study of data, and is entirely transparent to users of the latter facility. .paragraph An extension of the well-known TAXAN program has been accomplished, and the output options of CLUSTR may be coupled with the ability of TAXAN to handle disordered multistate data. This facility is useful for taxonomists. .DOCUMENT ID .page .DOCUMENT ID CLUSTR USERS MANUAL .NONUMBER .PAGE .NUMBER .centre ^&ACKNOWLEDGEMENTS\& .paragraph The author gratefully acknowledges the help of several people in the preparation of these programs. Prof. W. Stephenson provided the initial motivation and need for the suite, and over a long period has provided the author with many helpful discussions and suggestions as to its functioning. Dr. H.T. Clifford has contributed to the author's thinking in many areas of classification and is similarly acknowledged. .paragraph Dr. Clifford provided access to Dr. E. Burr's program TAXAN which forms the basis of the sorting section of CLUSTR. It has been modified with the help of Mr. R.A. Cook of La Trobe University. Mr. R.D. Nilsson provided certain ideas for flexible core management which the author has subsequently incorporated into CLUSTR. .paragraph The University of Queensland Prentice Computer Centre has provided finance for the latest stage of development of this suite and is also gratefully acknowledged. .skip 2 .centre ^&SUPPORT\& .paragraph Support for this suite is available in two forms. .paragraph Users who desire advice on classificatory methodology may consult Dr. H.T. Clifford of the Botany Department, University of Queensland. Such users should first have gained familiarity with the methods, via one of the standard texts (e.g. 1,2). .paragraph Those who desire some assistance with the construction and running of CLUSTR data decks may consult the University Computer Centre via its user consultation services. 
An attempt should first have been made to set up the deck before consultation, as shown in this manual. .DOCUMENT ID .page .DOCUMENT ID CLUSTR USERS MANUAL .NONUMBER .PAGE .SPACING 2 .NUMBER .centre Contents .nofill .nojustify (i) Abstract (ii) Acknowledgements (iii) Support (iv) Contents .skip 1 Part I The Analysis of Two Dimensional (2D) Data 1.0 The Classification Process - an overview 2.0 CLUSTR operations for 2D data 3.0 Data input format for 2D data 4.0 TAXAN-CLUSTR interface .skip 1 Part II The Analysis of Three Dimensional (3D) Data 5.0 3D Classification - an overview 6.0 CLUSTR operations for 3D data 7.0 Data input format for 3D data .skip 1 Part III CLUSTR on the PDP-10 8.0 8.1 Controlling CLUSTR on the PDP-10 Batch System 8.2 Controlling CLUSTR via a PDP-10 remote terminal 9.0 CLUSTR system components and interactions .SPACING 1 .fill .justify .chapter 1 .arabic numerals .skip 10 .centre Part 1 .skip 15 .centre The Analysis of Two Dimensional (2D) Data .page 1.0 ^&The Classification Process - an overview\& .paragraph Throughout this part it is assumed that a user has a data matrix that he wishes to "have classified" and that he has some basic knowledge of what sort of classification he wishes to have performed on that matrix. Those users who desire more information on the theory of numerical classification itself are referred to one of the standard texts on the subject (e.g. 1,2). .footnote 6 .CENTRE * * * * * * * * * * 1. Clifford, H.T., and Stephenson, W., "An Introduction to Numerical Classification", Ac. Pr., 1975. .break 2. Sneath, P.H.A., and Sokal, R.R., "Numerical Taxonomy", Freeman, 1973. ! It will be the object of this section (section 1.0) to define terminology used with regard to this program, and set the stage for describing the options this program provides at each stage of its execution. .paragraph An examination is now made of the processes which are carried out on a set of data in classifying it. 
.paragraph It is assumed that n entities are described by the values of v variates. The data for each of the entities can be set out in a matrix X (an n x v matrix), viz: .test page 10 .blank 10 .paragraph In this matrix Xij represents the value of the ith attribute (variate) of the jth entity. .paragraph Taking into account all v variates for each entity it is possible to produce (n#x#n) indices of dissimilarity between all the possible pairs of entities. It may not, however, always be desirable to produce these indices from the raw data, and so a new matrix F (n x v) can first be produced such that each element of F is a TRANSFORMED version of the corresponding element of X (i.e. as in matrix F below). .figure 10 The transforming function could be defined for example as .skip 1 .nofill .nojustify f(Xij)=log(Xij+1) or f(Xij)=Xij i.e. operate on raw data .fill .justify .paragraph From matrix F can now be produced the matrix D, a square n x n matrix whose i,jth element Dij represents the dissimilarity between elements i and j of the F matrix. This matrix has elsewhere been termed the Q matrix (3); .footnote 3 .CENTRE * * * * * * * * * * 3. Gower, J.C., "Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis", Biometrika 53, 325-338, 1966. ! Also the term TRELLIS DIAGRAM has been applied. .paragraph The trellis diagram can now be "sorted" so that elements of closest affinity are brought together progressively in clusters. This process has variously been termed sorting, clustering, grouping or fusing. The means of determining which element of D is next fused into an existing group is called the "sorting strategy". Initially each entity is regarded as a group of size 1. The sorting strategy therefore operates on the D matrix, and successive fusions progressively reduce the size of the matrix. What is produced in essence is a re-ordering of the rows (or columns) of the D matrix. 
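.paragraph The transformation and dissimilarity steps just described can be sketched as follows. This is a hypothetical Python illustration only, not part of CLUSTR itself (which is controlled entirely through its data deck); the matrix values and the choice of the Manhattan metric are invented for the example.

```python
import math

# Hypothetical illustration: a small entity-by-attribute matrix X is
# transformed and then converted into a dissimilarity matrix D
# (the "trellis diagram").  Values here are invented.
X = [[0.0, 3.0, 9.0],
     [1.0, 4.0, 9.0],
     [8.0, 0.0, 2.0]]

# Transformation step: F(Xij) = log(Xij + 1), the example given above.
F = [[math.log(x + 1.0) for x in row] for row in X]

# Dissimilarity step: the Manhattan metric (sum of absolute
# differences) is used purely for illustration; CLUSTR offers a
# choice of indices (see section 2.3).
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

n = len(F)
D = [[manhattan(F[i], F[j]) for j in range(n)] for i in range(n)]
# D is square and symmetric with a zero diagonal, ready for sorting.
```

In this small example rows 1 and 2 of X are nearly alike, so D places them closest; the sorting strategy would fuse them first.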
.paragraph Two factors are noted when entities fuse with existing groups (which may possibly be other single entities). These are, the entities which fuse, and the dissimilarity level at which they fuse. The result of all fusions for one trellis diagram can be expressed in terms of these two factors in a DENDROGRAM e.g. .test page 10 .blank 10 .paragraph In the above case entities 1 and 2 have fused to form group 6, 5 and 4 to give 7, and entity 3 has fused with group 7 to form a larger group 8. Similarly, 6 and 8 fuse to give 9 and the process is complete at this stage since one cluster alone finally exists. .paragraph The base line of this dendrogram can be used to give an ordering of the initial entities into "groups of closest affinity" at various levels of affinity (see for example groups B and C above). .paragraph It is now possible to rearrange the columns of the original X matrix to reflect this ordering, and in so doing the grouping of entities should become visually more obvious in that matrix. .paragraph So far this general discussion has centred on classifying entities against each other taking account of all possible variates. It is also possible to perform the so called inverse analysis, classifying attributes against each other. In this case it is possible, and may be desirable, though not mandatory, to use a different transformation, a different dissimilarity index, and a different sorting strategy from those used for the "normal" analysis. The end products, however, are the same - a dendrogram specifying in this case attribute correlations, and a resorting this time of the rows of the X matrix. .paragraph When both rows and columns of the X matrix are resorted, a so called TWO-WAY table is produced. .paragraph Once a trellis diagram has been obtained, it is possible to perform a principal co-ordinate analysis of this diagram to obtain a representation of the n entities or v variates in a space which can be interpreted geometrically. 
The clusters may then be viewed visually. The method by which this can be done is described elsewhere and is quite complex. For details of its derivation the user is referred to references (3) and (4). .footnote 3 .CENTRE * * * * * * * * * * 4. Gower, J.C., "Multivariate Analysis and Multidimensional Geometry", Statistician, 17, 13-28, 1967. ! .paragraph Principal Component Analysis may also be performed, and by this alternative method the n entities may be represented in a v-dimensional space. The latter analysis operates on the original F matrix, and is equivalent to a rotation of the original v axes, against which the entities of F are represented, to new orthogonal positions which maximise attribute variance along their length. .paragraph The two Principal Analyses are members of a group of analyses known under the general title of Ordination. .paragraph The CLUSTR program provides users with all the features outlined above, i.e. transformation, dissimilarity calculation, sorting, production of two-way tables, and ordination, for both normal and inverse analyses, with possible differences in strategy for each. .paragraph The various strategies available will be defined in Section 2.0, and how they may be specified in running the program will be shown in Section 3.0 of this manual. .page 2.0 ^&CLUSTR operations for 2D data\& .paragraph The various options available to a user at each level of the clustering process are now outlined. It is assumed at this stage that the raw data matrix has been read from an input medium. The method by which this is done will be given in Sections 3.0 and 8.0. .skip 2 2.1 ^&Summary of Raw Data\& .paragraph An initial summary of the raw data is printed in 2D analysis. For an X matrix of the following form, .test page 10 .blank 10 .paragraph a quantity Pi is derived which may be either .indent 10 (a) the total of all variate values for that entity. 
.nofill .nojustify .skip 1 .test page 3 Pi = SUM(j=1..v) Xji [later termed S] .fill .justify .skip 1 .indent 7 or (b) the average variate value for that entity .nofill .nojustify .skip 1 .test page 5 Pi = ( SUM(j=1..v) Xji ) / v [later termed A] .skip 2 .fill .justify 2.2 ^&Transformation Option\& .paragraph The transformation option available for each element of the X matrix (F(Xij) of 1.0) may be: .nofill .nojustify .skip 1 either (i) F(Xij) = log(Xij+1) log transform [L] .skip 1 or (ii) F(Xij) = Xij no transformation [N] .skip 1 .test page 5 or (iii) F(Xij) = Xij / SUM(k=1..n) Xik [S] (standardize by row total for normal analysis) .skip 1 .test page 5 F(Xij) = Xij / SUM(k=1..v) Xkj [S] (standardize by column total for inverse analysis) .skip 1 or (iv) F(Xij) = Xij ** (1/n) where n is integral [P] .skip 1 or (v) F(Xij) = (Xij-m)/s [V] where m and s are the mean and standard deviation in rows or columns. .skip 2 .fill .justify 2.3 ^&Dissimilarity Index Option\& .paragraph The current dissimilarity indices allowed are: .skip 1 .nofill .nojustify (i) Bray-Curtis [B] .skip 1 .test page 7 dij = SUM(k=1..v) |Fki-Fkj| / SUM(k=1..v) (Fki+Fkj) (For normal analysis) .skip 1 .test page 7 dij = SUM(k=1..n) |Fik-Fjk| / SUM(k=1..n) (Fik+Fjk) (For inverse analysis) .skip 1 .fill .justify .indent 10 where Fij is the element found in the transformation step as F(Xij). 
.skip 1 .nofill .nojustify (ii) Manhattan Metric [M] .skip 1 .test page 3 dij = SUM(k=1..v) |Fki-Fkj| .skip 1 .test page 3 or dij = SUM(k=1..n) |Fik-Fjk| for analyses as before .skip 1 (iii) Canberra Metric [C] .skip 1 .test page 3 dij = (1/v) SUM(k=1..v) |Fki-Fkj| / (Fki+Fkj) .skip 1 .test page 3 or dij = (1/n) SUM(k=1..n) |Fik-Fjk| / (Fik+Fjk) .skip 1 for analyses as before .skip 1 (iv) D-squared Euclidean Distance [D] .skip 1 .test page 3 dij = SUM(k=1..v) (Fki-Fkj)**2 .skip 1 .test page 3 or dij = SUM(k=1..n) (Fik-Fjk)**2 .skip 1 for analyses as before .skip 1 (v) Matching Coefficient [A] .skip 1 .test page 5 dij = SUM(k=1..v) (Wij)k / v .skip 1 .test page 5 or dij = SUM(k=1..n) (Wij)k / n .skip 1 .fill .justify .paragraph where (Wij)k#=#1 if elements i and j have equal values for Fik and Fjk, and 0 otherwise. Thus both negative and positive matches are considered. .nofill .nojustify .skip 1 .test page 6 (vi) Ratio Coefficient [R] .skip 1 .test page 4 dij = v - SUM(k=1..v) (Fik/Fjk)**g .skip 1 .test page 4 or dij = n - SUM(k=1..n) (Fki/Fkj)**g .skip 1 .fill .justify .paragraph where g is chosen to be either +1 or -1 to make each fraction less than or equal to unity. .break .skip 1 General Note on Dissimilarity Measures .paragraph It will be seen later that missing data may be specified in the CLUSTR data input deck. If any element of F required by a term of the above summations is missing, that term contributes nothing to the summation. .fill .justify .skip 2 2.4 ^&Sorting Strategies\& .paragraph Several commonly used sorting strategies are available in this program suite. The point at which a sorting strategy becomes identifiable is that point at which the dissimilarity index is recalculated between a new group and all other groups, after the former group has received a new member. Lance and Williams (5) have shown that some common strategies (at least most of those described here) could be expressed in terms of the following general form: .footnote 3 .CENTRE * * * * * * * * * * 5. Lance, G.N. 
and Williams, W.T., "A Generalised Sorting Strategy for Computer Classifications", Nature (Lond.) 212, 218, 1966. ! .test page 5 .blank 5 .paragraph where the choice of the parameters alpha(i), alpha(j), beta and gamma determines the strategy. The following diagram defines the other terms in the expression. .test page 15 .figure 15 .paragraph The strategies allowed in CLUSTR are defined in terms of these parameters as follows: .paragraph (i) Nearest Neighbour .test page 3 .figure 3 .paragraph (ii) Furthest Neighbour .test page 3 .figure 3 .paragraph (iii) Group Average .test page 3 .figure 3 .paragraph (iv) Simple Average .test page 3 .figure 3 .paragraph (v) Centroid .test page 3 .figure 3 .paragraph (vi) Incremental Sum Of Squares .indent 15 Not of the Lance and Williams type previously mentioned. The user is referred to Burr (6). .footnote 3 .centre *#*#*#*#*#*#*#*#*#*# (6) Burr, E.J., 'Cluster Sorting with Mixed Character Types. II. Fusion Strategies', Aust. Comp. Jour. 2, 98-103, 1970. ! .paragraph (vii) Variance .indent 15 Again not of the Lance and Williams type; see (6). .paragraph (viii) Flexible .test page 3 .figure 3 .skip 2 2.5 ^&Other Options\& .paragraph At each stage of the classification process, the user may optionally select output of various results. These include trellis diagrams, dendrograms and two way tables. .skip 2 2.6 ^&Ordination\& .paragraph Ordination is treated as a slightly separate topic as it may proceed fairly independently of other analyses. Once trellis diagrams have been obtained, Principal Coordinate analysis may proceed, and once transformations have been carried out (i.e. matrix F options have been specified) Principal Components Analysis may be performed. .skip 2 2.6.1 ^&Method of Principal Coordinate Analysis\& .paragraph The method used for Principal Coordinate Analysis is that given by Gower (4) as modified by Williams and Dale (7). .footnote 2 .LEFT MARGIN 5 .CENTRE * * * * * * * * * * 7. Dale, M.B. Personal Communication .left margin 10 ! 
.paragraph Given a dissimilarity matrix D of elements Dij the procedure is as follows: .left margin 21 .indent -6 1. (a) For Bray-Curtis trellises form the similarity matrix Sij = 1 - Dij .break .indent -6 ##(b) For Manhattan Metric trellises form the matrix Sij = (-1/2)*Dij**2 .break .indent -6 2. "Transform" the elements by Sij <= Sij - S.j - Si. + S.. where . represents the mean over the appropriate row or column .break .indent -6 3. Find the eigenvectors and eigenvalues of the matrix (Sij) and standardise each vector so that the sum of squares of its elements is equal to the corresponding eigenvalue. .break .indent -6 4. The kth elements of the n vectors then represent the co-ordinates of the kth point with respect to axes defined by the n vectors. The relative magnitude of each eigenvalue gives the relative "importance" of each eigenvector axis. .left margin 5 .skip 2 2.6.2 ^&Principal Component Analysis\& .paragraph The method used for Principal Component Analysis is that outlined by various authors, e.g. Seal (8), Blackith and Reyment (9), and Sneath and Sokal (2). .footnote 6 .centre * * * * * * * * * * (8) Seal, H.L., "Multivariate Statistical Analysis For Biologists", Methuen, London, 1964. .break (9) Blackith, R.E., and Reyment, R.A., (eds), "Multivariate Morphometrics", Ac. Pr., N.Y., 1971. .left margin 10 ! .paragraph Given a matrix F the method is .break .left margin 15 1. Form the matrix, R, of variance and covariance between entities or attributes. .break 2. The eigenvectors of R give the Principal Components we are seeking. .break 3. Standardize the eigenvectors so that they are of length 1, giving matrix V. .break 4. The matrix of new coordinate points P of the entities represented by F is given by P = V'F. .left margin 5 .skip 1 2.7 ^&Conclusion\& .paragraph Section 2 has given the formal definition of each of the facilities available in CLUSTR. How to specify these to the running program will be examined in Section 3. 
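.paragraph The Principal Coordinate steps of 2.6.1 can be sketched as follows. This is a hypothetical Python illustration only, not CLUSTR's own code: the trellis values are invented, only the leading axis is extracted (by simple power iteration, whereas CLUSTR computes every eigenvector), and the Manhattan-metric form of step 1 is assumed.

```python
import math

# Invented Manhattan-metric trellis D for three entities.
D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 3.0],
     [4.0, 3.0, 0.0]]
n = len(D)

# Step 1(b): Sij = -(1/2)*Dij**2  (for Bray-Curtis, step 1(a)
# would instead form Sij = 1 - Dij).
S = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]

# Step 2: "transform" (double-centre): Sij <= Sij - Si. - S.j + S..
rmean = [sum(S[i]) / n for i in range(n)]
cmean = [sum(S[i][j] for i in range(n)) / n for j in range(n)]
gmean = sum(rmean) / n
S = [[S[i][j] - rmean[i] - cmean[j] + gmean for j in range(n)]
     for i in range(n)]

# Step 3: leading eigenpair by power iteration.  The start vector
# must not be the constant vector, which the centring annihilates.
v = [1.0] + [0.0] * (n - 1)
for _ in range(200):
    w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
    big = max(abs(x) for x in w) or 1.0
    v = [x / big for x in w]
sv = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
lam = sum(v[i] * sv[i] for i in range(n)) / sum(x * x for x in v)

# Step 4: scale the vector so its sum of squares equals the
# eigenvalue; its elements are the coordinates on the first axis.
scale = math.sqrt(lam / sum(x * x for x in v))
axis1 = [x * scale for x in v]
```

The invented distances above happen to be exactly those of three collinear points, so the first axis alone reproduces every inter-entity distance; in general several axes are needed, ranked by their eigenvalues as step 4 describes.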
.page 3.0 ^&Data Input Format For 2D Data\& .paragraph To run the CLUSTR program, a user codes his data onto cards and forms a data deck. The data deck contains four types of card, arranged as described in 3.1: .skip 2 3.1 ^&Data Deck Structure\& .paragraph The data deck consists of four types of cards placed in the following order: .test page 15 .blank 15 .paragraph Each of these cards will be discussed in the following sections. The samples card and dimensions card will be outlined first as they are least complex. The options card will next be detailed, and the section concluded with details of the missing data cards. .skip 2 3.2 ^&Dimensions Card\& .paragraph The dimensions card contains the maximum dimensions of the matrix to be input. The format is: .test page 10 .blank 10 .paragraph The title may be up to 65 characters in length and serves to identify all subsequent output. .skip 2 3.3 ^&Samples Cards\& .paragraph The samples cards contain the data for the n x v attribute matrix X. .paragraph There are two different forms for the input, selectable according to which the user finds more convenient. The forms are either to input the data by entity, or by attribute, i.e. row-wise or column-wise in the X matrix. .skip 1 Entity format: .paragraph A user specifies, for a given entity, each attribute together with its score. The form is .test page 10 .blank 10 .paragraph More than one card for a given entity is permissible. .skip 1 Attribute Format: .paragraph A user specifies the scores for several entities for one given attribute per card. The form is .test page 10 .blank 10 .paragraph More than one card per attribute is permissible. .skip 2 3.4 ^&Options Card\& .paragraph The options card specifies the manner in which the analysis is to be performed. It contains codes to indicate to the program which of the particular facilities in section 2.0 a user wishes to specify at each stage of his analysis. 
.paragraph It is organized into 11 blocks each of which contains 5 columns, as follows: .test page 10 .figure 10 .test page 10 .figure 10 .paragraph Each block declares the options required for an identifiable stage in the analysis. The form of the blocks will next be given. .SKIP 2 3.4.1 ^&DATA INPUT BLOCK\& .paragraph The data input block contains three fields which signify the form of the input deck following. .left margin 15 .skip 1 (a) Samples Card Format .break Column 1 contains an option which specifies which format the samples cards will appear in: .indent 10 E - entity format .indent 10 A - attribute format .skip 1 (b) Divide Option .skip 1 Columns 2-4 contain a numeric constant which will be divided into each incoming sampled number. If left blank no division is performed. This allows the user to scale his input deck by, for example, 10 or 100. .skip 1 .test page 3 (c) Listing Option .skip 1 The program initially lists summary totals or averages as previously defined. Column 5 contains one of the following: .indent 10 S - summed values .indent 10 A - average values .left margin 5 A sample form is: .test page 10 .blank 10 .skip 2 3.4.2 ^&Logarithm Option Block\& .paragraph While the transformation log (Xij + 1) might more logically appear to be included in the transformation block (3.4.3), it is included in a separate block as this provides greater flexibility for the 3D analysis extension later described. .paragraph If a log (Xij + 1) transformation of the complete X matrix is required then the L option is specified in column 9 of the logarithm option block. Further transformation may optionally be applied, as previously defined, by using options in the transformation block. .paragraph A sample form for this block is: .figure 10 .skip 2 .test page 3 3.4.3 ^&Transformation Block\& .paragraph The transformations of the X matrix (log'ed or not) are specified in columns 11-15 of the options card. 
The options are: .left margin 15 (i) standardise (by row or column) - S option .break (ii) power option - P option .break .left margin 20 the power to which each data element is raised is specified in 3.4.4. This option is used for taking the nth root of an element. .left margin 15 (iii) do nothing to the element - N option .break (iv) express as std deviation from mean - V option .left margin 5 .skip 1 The form of the Transformation block is: .break .test page 9 .blank 9 .skip 2 3.4.4 ^&Power Block\& .paragraph If the P option has been used in 3.4.3 the power block specifies the power to which each element is raised. If 3 is placed in the appropriate column each element is raised to the 1/3 power, and so on. In general if n is specified the elements are raised to the 1/n power. .paragraph The form is: .test page 9 .blank 9 .paragraph If the P option is not used, these columns may be left blank. .skip 2 3.4.5 ^&Dissimilarity Block\& .paragraph This block indicates the type of dissimilarity measure to be used in creating the trellis diagrams for later sorting. The options are: .left margin 15 .skip 1 (i) "B" option - Bray-Curtis measure .break (ii) "M" option - Manhattan Metric .break (iii) "C" option - Canberra Metric .break (iv) "D" option - Euclidean distance squared .break (v) "A" option - Matching Co-efficient .break (vi) "R" option - Ratio Measure .break .skip 1 This field may be blank if dissimilarity calculation is not required. The form is e.g.: .left margin 5 .test page 10 .blank 10 .skip 2 3.4.6 ^&Printout Block\& .paragraph This block is a simple Yes/No option block to control the printout of each triangular trellis diagram. The options are: .skip 1 .indent 10 (i) Y - Yes, print it out .indent 10 (ii) N - don't print it out. .test page 10 .blank 10 .skip 2 3.4.7 ^&Sorting Block\& .paragraph This block indicates the type of sorting algorithm to be used in generating the dendrograms from the triangular matrices. 
The algorithms are: .left margin 25 .nofill .nojustify ^&Option\& N nearest neighbour F furthest neighbour G group average A simple average C centroid I incremental sum of squares V variance X flexible .skip 1 This field may be left blank if a sort is not required. .left margin 5 .fill .justify A sample form is: .test page 10 .blank 10 .skip 2 3.4.8 ^&Dendrogram Block\& .paragraph This block controls whether each dendrogram is produced and the form of its output. The options are: .skip 1 .LEFT MARGIN 19 .indent -4 L - produce dendrogram, output is listing of group fusions .indent -4 P - produce plot of dendrogram .indent -4 B - produce both plot and listing of dendrogram .indent -4 N - (or blank column) dendrogram not required .skip 1 .LEFT MARGIN 5 Example: .test page 10 .blank 10 .skip 2 3.4.9 ^&Two Way Table Block\& .paragraph Data which have previously been log-transformed may be de-transformed when printing out a two-way table. The options which may be specified are: .skip 1 .left margin 15 L - 'leave as is' printout (don't attempt to detransform) .break U - 'un-logged' printout (detransform the data) .break B - both L and U above .break N - (or blank column) not required .skip 1 .break .left margin 5 .test page 11 Example: .blank 10 .skip 2 3.4.10 ^&Ordination Block\& .paragraph The Ordination block specifies whether or not to produce an ordination analysis of the appropriate trellis diagram, or F matrix. The options available are: .skip 1 .break .left margin 15 R - produce Principal Coordinate analysis .break M - produce Principal Component analysis .break B - produce both R and M above .break N - (or blank column) ordination not required .break .left margin 5 .paragraph Its form is: .blank 10 .paragraph In Principal Component analysis the F matrix is multiplied by a weighting matrix, W, to give the matrix of new coordinates, C, i.e. 
.test page 10 .blank 10 .paragraph Using the normal notation of matrix algebra, it will be seen that the element Wij represents the weighting placed on old axis j in the F data, in its contribution to the score on the new axis i in the C data. .paragraph The W matrix is automatically printed out in Principal Component analysis. .paragraph It should be noted that the analysis of large matrices for eigenvectors and eigenvalues involves extensive calculation, and may therefore be expensive to perform. .skip 2 3.4.11 ^&Note\& .paragraph If any particular option is not required for the user's particular analysis it may be left blank. An interpretation of the options card is printed out at the beginning of the output to serve as a check for the user. .skip 2 3.4.12 ^&Examples\& .paragraph The following are sample options cards for illustrative purposes: .figure 10 .skip 1 .nofill .nojustify .left margin 15 .indent -5 Example 1 above specifies .skip 1 i input deck in entity form, to be divided by 10 ii initial listing to be averaged values iii log transform the data iv standardize the logged data in columns by total for entity analysis, and in rows by total for attribute analysis v use Bray-Curtis for entity analysis, and Manhattan Metric for attribute analysis vi print out the entity trellis but not the attribute trellis vii sort both trellises using group average sorting viii produce both dendrograms ix print both logged and unlogged two-way tables x perform a principal component analysis of the attributes .figure 10 .skip 1 .indent -5 Example 2 above specifies .skip 1 i deck is in attribute format, no division, summated printout ii no log transformation iii entities untouched, attributes standardised iv entities Bray-Curtis, attributes Manhattan Metric v attribute trellis only vi entities nearest neighbour sorting, attributes flexible sorting vii attribute dendrogram only viii principal coordinate analysis of entities; both analyses of attributes .left margin 5 .fill .justify .skip 2 3.5 
^&Missing data options\& .paragraph Any entity-attribute pair which will not occur, and should be flagged as such, is specified on a missing data card. This card has the following form: .figure 10 .paragraph Up to seven entity-attribute pairs may be specified, and more than one missing data card is permissible. Users are referred to the general note regarding missing data in section 2.3 of this manual. .page 4.0 ^&TAXAN-CLUSTR Interface\& .paragraph TAXAN is a program which provides a subset of the facilities of CLUSTR, but in addition handles disordered multistate data. .paragraph TAXAN has been modified so that it calls on sections of CLUSTR to provide some of the advanced output and analysis features available in other CLUSTR analyses. The user thereby gains the advantage of being able to use disordered multistate data. .paragraph The input form of a TAXAN data deck is summarized below. The program as modified provides: .left margin 15 .skip 1 (a) normal TAXAN output .break (b) plot of a dendrogram .break (c) principal coordinate analysis of the produced dissimilarity matrix .break .left margin 5 .skip 3 TAXAN Card Deck .skip 1 .left margin 15 .indent -5 Card 1 .skip 1 A 0-65 character title starting in column 1 .skip 1 .indent -5 Card 2 .skip 1 The numbers NENT NB NN ND NO J1 J2 PLOT ORDIN in that order in any columns across the card. The numbers must be separated by spaces, and zeroes must be explicitly stated. 
The numbers represent: .skip 1 NENT - number of entities in analysis .break NB - number of binary attributes .break NN - number of numeric attributes .break ND - number of disordered m/s attributes .break NO - number of ordered m/s attributes .break J1 - clustering option .break .left margin 20 0,1 nearest neighbour .break 2 furthest neighbour .break 3 group average .break 4 simple average .break 5 centroid .break 6 incremental sum of squares .break 7 variance .break .left margin 15 J2 - standardisation option (numeric and ordered m/s only) .break .left margin 20 1 standardise by division by range .break 2 standardise by division by twice variance .break .left margin 15 PLOT - optional plot of dendrogram .break .left margin 20 0 plot required .break 1 no plot required .break .left margin 15 ORDIN - optional Principal Coordinate analysis .break .left margin 20 0 ordination required .break 1 ordination not required .left margin 15 .skip 1 .indent -5 Card 3 .skip 1 Size of multistates, 40 per card, using sufficient cards to give ND+NO values. The values on each card should appear in columns 1,3,5,...,77,79. .skip 1 .indent -5 Data Cards .skip 1 The data cards for each entity follow in the following order: .skip 1 .left margin 20 .nofill .nojustify Binary data for entity 1 numeric " " " 1 disordered m/s " 1 ordered m/s " 1 repeat for entity 2 . . . repeat for entity NENT .fill .justify .left margin 15 .skip 1 .indent -5 Binary data format .skip 1 Binary data appear as 1, 0, '*', or ' ' in successive columns of data cards, 60 values per card, using as many cards as necessary to fill the required NB number. .skip 1 .indent -5 Numeric data format .skip 1 Numeric data appear 30 numbers per card, again using as many cards as are necessary to fill NN values. The numbers appear in any columns, separated by spaces. .skip 1 .indent -5 Multistate data format .skip 1 Multistate data appear as numerically coded values or '*' or ' '. 
They appear 40 values per card, using as many cards as are necessary to satisfy ND or NO values respectively. They are coded to appear in columns 1,3,5,...,77,79 of a data card. .left margin 5 .skip 1 .TEST PAGE 6 LIMITATIONS .paragraph Presently TAXAN caters for a maximum of .skip 1 .nofill .nojustify NENT = 84 NN = 40 NB = 40 NO+ND = 40 .fill .justify .paragraph Further enquiries regarding the properties of TAXAN should be directed to Dr. Clifford of the University of Queensland's Botany Department. Problems with its operational aspects should, as with CLUSTR, be directed to the Computer Centre. .chapter .skip 20 .centre Part II .skip 15 .centre The Analysis of Three Dimensional Data .page 5.0 ^&3D CLASSIFICATION - AN OVERVIEW\& .BREAK .paragraph Three dimensional classification has found application particularly in the ecological literature (e.g. 10,11). .footnote 7 .centre *#*#*#*#*#*#*#*#*#* (10) Williams, W.T., and Stephenson, W., "The analysis of three-dimensional data (sites x species x times) in marine ecology", J. Exp. Mar. Biol. Ecol. 11, 207-227. .break (11) Stephenson, W., Williams, W.T., and Cook, S., "The macrobenthos of soft bottoms in southern Moreton Bay (south of Peel Island)", Mem. Queensl. Mus. 17, 73-124. ! The processes and operations which are performed in 3D analysis have analogues in 2D analysis, and so this section of the manual merely extends the concepts introduced in sections 1 and 2. .paragraph The X matrix now takes on the following form .blank 10 .paragraph The value of attribute k when entity-1 has the value i and entity-2 the value j is given by Xijk. .paragraph It may therefore be seen that 3D analysis concerns itself with classifying two entities using the values of given attributes occurring for each possible co-incident pair of entities. It may of course not be possible for a given entity-1 entity-2 combination to occur, and so missing data options in 3D analysis are extended. 
.paragraph Having defined the 3D data matrix, the 3D classification reduces itself to two 2D classification problems. From the X matrix (3D), one produces two 2D auxiliary data matrices, A1 and A2, shown diagrammatically below. .test page 10 .blank 10 .paragraph The A1 (n1 x v) matrix is produced by a mathematical reduction over all entity-2 values. .paragraph The A2 (n2 x v) matrix is produced by a mathematical reduction over all entity-1 values. .paragraph Once this reduction is accomplished, both normal and inverse analyses of both the A1 and A2 matrices can be performed as outlined in part#I of the manual. .paragraph The various methods of mathematical reduction are defined formally in section 6.0 of this manual, and the method of setting out 3D data, options, dimensions, and missing data cards is given in section 7.0. .page 6.0 ^&CLUSTR operations for 3D data.\& .paragraph The various options available to a user at each level of the classification process are now formally defined. It is assumed at this stage that the raw data matrix has been read from an input medium. The method by which this may be done is given in sections 7.0 and 8.0. .break .skip 2 6.1 ^&Summary of raw data\& .break .paragraph An initial summary of the raw data is printed in 3D analysis. For an X matrix of the following form: .test page 10 .blank 10 .paragraph quantities Pi and Qj are derived, representing entity-1 and entity-2 summaries.
The quantities may be: .nofill .nojustify .break .left margin 15 .test page 8
(a) totals [S]       Pi = SUM(j,k) Xijk
                     Qj = SUM(i,k) Xijk
.test page 11
(b) averages [A]     Pi = SUM(j,k) Xijk / (n2 v)
                     Qj = SUM(i,k) Xijk / (n1 v)
( SUM(...) denotes summation over the subscripts shown )
.fill .justify .left margin 5 .skip 2 .test page 18 6.2 ^&Reduction options\& .paragraph For an X matrix of the form shown below, .figure 10 .tab stops 40,45 .nofill .nojustify
n1      number of entity-1 values
n2      number of entity-2 values
v       number of variates
Xijk    value of variate k when entity-1 = i and entity-2 = j
.fill .justify .paragraph the quantities in the A1 matrix may be defined as: .nofill .nojustify .left margin 15 .test page 5
(a) log-total [T]      A1,ik = log10 ( SUM(j) Xijk + 1 )
.test page 7
(b) log-average [L]    A1,ik = log10 ( SUM(j) Xijk / n2 + 1 )
.test page 7
(c) average [A]        A1,ik = SUM(j) Xijk / n2
.test page 6
(d) summed [S]         A1,ik = SUM(j) Xijk
.fill .justify .left margin 5 .paragraph Similarly, the quantities in A2 are defined by: .nofill .nojustify .left margin 15 .test page 5
(a) log-total [T]      A2,jk = log10 ( SUM(i) Xijk + 1 )
.test page 7
(b) log-average [L]    A2,jk = log10 ( SUM(i) Xijk / n1 + 1 )
.test page 7
(c) average [A]        A2,jk = SUM(i) Xijk / n1
.test page 6
(d) summed [S]         A2,jk = SUM(i) Xijk
.fill .justify .left margin 5 .paragraph Subsequent transformation of the A1 and A2 matrices, and indeed all subsequent classification processes, are carried out exactly as if A1 and A2 were independent 2D matrices for analysis as in part I of this manual. In particular, A1 and A2 are operated on as in section 2.2 to produce independent F matrices. .paragraph It will now be obvious why the statement in the introductory paragraph of 3.4.2 is made. The log transformation of data is part of the process (at least computationally) of reducing 3D data to 2D data, and so merits a separate block on the options card. Not that the effect obtained is conceptually different from other conditionings#-#but the allocation of a separate block allows log transformations to be performed in conjunction with, for example, standardisations.
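.paragraph The reduction options defined above may be illustrated by a short sketch. The following is modern Python, not CLUSTR's FORTRAN; the divisor in the average forms is taken as the number of values summed over (n2 for A1), which is the natural reading of the formulas, and the data are invented:

```python
# Illustrative sketch of the reduction options (modern Python, not
# CLUSTR's FORTRAN).  X[i][j][k] is the 3D data matrix; the option
# letters match the manual: T (log-total), L (log-average),
# A (average), S (summed).
import math

def reduce_A1(X, option):
    """Reduce over entity-2 (index j) to give the n1 x v matrix A1."""
    n1, n2, v = len(X), len(X[0]), len(X[0][0])
    A1 = [[0.0] * v for _ in range(n1)]
    for i in range(n1):
        for k in range(v):
            s = sum(X[i][j][k] for j in range(n2))
            if option == 'S':                     # summed
                A1[i][k] = s
            elif option == 'A':                   # average
                A1[i][k] = s / n2
            elif option == 'T':                   # log-total
                A1[i][k] = math.log10(s + 1.0)
            elif option == 'L':                   # log-average
                A1[i][k] = math.log10(s / n2 + 1.0)
    return A1

# A2 is formed identically with the roles of i and j exchanged
# (reduction over entity-1, divisor n1).

X = [[[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]],
     [[4.0, 2.0], [0.0, 0.0], [1.0, 5.0]]]
print(reduce_A1(X, 'S'))    # [[3.0, 4.0], [5.0, 7.0]]
```

With n2 = 1 the log options reduce to a plain log transformation of a 2D matrix, the degenerate case mentioned in the text.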
Also it is noted that log transformation of a 2D matrix (in the part#I sense) is achieved by treating it as a degenerate 3D matrix. .paragraph How to specify subsequent operations on the A1 and A2 matrices is discussed in section 7.0 following. .PAGE 7.0 ^&Data input format for 3D data.\& .paragraph The data deck structure for 3D data is exactly the same as that for 2D data shown in section 3.1. In general, the options specified for 2D data are immediately transferable to 3D data; however, some extensions are allowable and will be discussed in the following paragraphs. .skip 1 7.1 ^&Dimensions card\& .paragraph The format for the dimensions card is as follows: .blank 10 7.2 ^&Samples cards\& .paragraph Once again there are two formats for data input. .skip 1 Entity format .paragraph A user specifies attribute scores for each entity1-entity2 combination as follows: .test page 10 .blank 10 .paragraph More than one card for a given entity pair is allowed. .skip 1 Attribute format .paragraph A user specifies the scores for several entity pairs for that one attribute on one or more cards. The form is: .test page 10 .figure 15 .blank 10 .paragraph Each entity pair is specified in its own 5 column block as follows: .test page 9 .blank 9 .paragraph More than one card per attribute is permissible. .skip 1 7.3 ^&3D options card\& .paragraph The options card for 3D analysis is merely an extension of the 2D options card to allow for the extra 2D analyses required for A2. .paragraph It is organized into exactly the same eleven blocks as shown in 3.4, with the exception that the log option block (cols 6-10) is now better termed the REDUCTION BLOCK (see also 6.2). .skip 1 7.3.1 ^&Data input block\& .paragraph The form of the data input block is exactly the same for 3D analysis as it is for 2D analysis, as shown in 3.4.1. .skip 1 7.3.2 ^&Reduction block\& .paragraph The reduction block specifies the way in which the 3D matrix is reduced to two 2D matrices.
The options as defined previously are .skip 1 .left margin 15 .nofill .nojustify
T - log total
L - log average
A - average
S - summation
.left margin 5 .fill .justify .paragraph The form of the block is: .test page 11 .blank 11 .nofill .nojustify .test page 6
7.3.3 Transformation block
7.3.4 Power block
7.3.5 Dissimilarity block
7.3.6 Printout block
7.3.7 Sorting block
7.3.8 Dendrogram block
.fill .justify .paragraph Each of these blocks in 3D analysis contains exactly the same range of options as in 2D analysis (see 3.4.3 to 3.4.8 for the available options). The blocks specify options for each of the four possible analyses, as follows: .test page 15 .blank 15 7.3.9 ^&Two way table block\& .break .paragraph The options available for two-way table printout are the same as those outlined in 3.4.9, the ordering within the block being the same as in 7.3.2. .skip 1 7.3.10 ^&Ordination block\& .paragraph The options available for ordination are the same as those outlined in 3.4.10, the ordering within the block being the same as shown in 7.3.3-7.3.8. .skip 1 7.3.11 ^&Note\& .paragraph If a particular analysis is not required for a particular user's task, the appropriate option column may be left blank where indicated in the previous text. CLUSTR recognises the situation where 3D data are input but only one of the 2D analyses is required by detecting the appropriate blank column in the reduction block. .skip 1 7.3.12 ^&Example\& .paragraph The following example is an extension of the example in 3.4.12. As well as all the options specified there, the extra options specified for the entity-2 attribute analysis are enumerated below.
.test page 7 .blank 7 .nofill .nojustify .left margin 15
(i) 'average' reduction for ent2 attribute matrix
(ii) standardise for normal analysis
     no transform for inverse analysis
(iii) Canberra metric for normal analysis
     Euclidean distance squared for inverse analysis
(iv) no trellis printout
(v) flexible sorting for both analyses
(vi) all dendrograms required
(vii) both two way tables required
(viii) no ordination required
.fill .justify .left margin 5 .skip 2 7.4 ^&Missing data options\& .paragraph There are three types of missing data cards in 3D analysis. They are enumerated below. .skip 1 (a) Missing entity1-entity2 combination .paragraph If a given entity1-entity2 combination does not occur, i.e. no attribute values can occur for that combination, the following missing data card is used: .test page 10 .blank 10 (b) Missing entity-1 attribute combination .paragraph If for a given entity-1 a given attribute is impossible, regardless of the value of entity-2, the following form is used: .test page 10 .blank 10 (c) Missing entity-2 attribute combination .paragraph If for a given entity-2 a given attribute cannot occur, regardless of the value of entity-1, the following form is used: .figure 10 .centre *#*#*#*#*#*#*#*#*#*#*#*#* .chapter .centre Part III .skip 15 .centre CLUSTR on the PDP-10 .page 8.0 ^&CLUSTR on the PDP-10\& .paragraph This section defines the system control cards required to control CLUSTR on the University of Queensland PDP-10 timesharing system, where it currently operates under version 6.02 of the monitor. .skip 1 8.1 ^&The UQ Batch System\& .paragraph CLUSTR requires input from a card reader, and gives output to both line printer and plotter. Intermediate storage is required on moving head disc, on the user's area. .paragraph The line printer, card reader and plotter on the PDP-10 are so-called spooling devices. That is, input from, or output to, the real devices occurs well before or after the program's actual running.
In the intervening time interval, data to or from these devices are stored as files of information on moving head disc. .paragraph The task of providing control cards for CLUSTR therefore involves associating the appropriate spooled disc files with the running of the program at the appropriate time. .paragraph The following deck setup is typical and will be examined in some detail. .nofill .nojustify .left margin 15 .skip 2
$SEQUENCE
$JOB
$DATA
< CLUSTR data deck >
$EOD
$TOPS10
_.NOERROR
_.R STA:CLUSTR
_.PRINT *.LPT
_.PLOT PLT1:=*.PLT
_.DEL *.DAT
$EOJ
.fill .justify .LEFT MARGIN 5 .skip 2 .paragraph The $SEQUENCE and $JOB cards identify the current user's job to the system, the job being terminated by a $EOJ card. .paragraph The $DATA to $EOD cards enclose the CLUSTR data deck and instruct the batch system to set up a disc file. This file will later be used for input to CLUSTR as if it had come directly from the card reader. It is thus one of the spooled files. The disc file will be placed on the disc area set aside for the current user. .paragraph Having set up the appropriate input streams, the command .R STA:CLUSTR then causes CLUSTR to begin execution. .paragraph During the course of its execution CLUSTR may produce disc files meant for the line printer and plotter, i.e. spooled output files. These files are placed in the appropriate queues for the real devices by the .PRINT and .PLOT commands. .paragraph The .DELete command tidies the user's disc area by deleting the disc files which CLUSTR has created during the course of its running ( xxxxxx.DAT files ). .paragraph This sample deck setup is quite typical of most operations a user is likely to perform. It is modifiable, and will largely depend on whether the CLUSTR data deck already exists on disc from a previous run, or is to be read during this run using a $DATA - $EOD construction. .skip 1 8.2 ^&The UQ Remote Terminal\& .paragraph CLUSTR is primarily designed for use via the PDP-10 batch stream.
To use it via a remote terminal, a little knowledge of the FORTRAN operating system is required, though once again only sufficient to organise the required input/output. .paragraph CLUSTR recognises when it is being run from a remote terminal, and expects input from FORTRAN unit number 5 and writes output to unit number 6 (as opposed to 2 and 3 when running from the batch system). A user then uses interactive commands to associate units 5 and 6 with the appropriate (possibly spooled) peripheral devices. The following command sequence is typical: .left margin 15 .nofill .nojustify
#.ASSIGN CDR:5
#.ASSIGN LPT:6
# .
# .
#.SET CDR ABC
# .
# .
#.R STA:CLUSTR
.fill .justify .left margin 5 .paragraph The .ASSIGN commands associate units 5 and 6 with the card reader and line printer respectively. The .R command is as previously discussed in 8.1. .paragraph Then a user invokes exactly the same .PRINT, .PLOT and .DEL commands as shown in 8.1. .left margin 5 .paragraph It has been assumed that a .CDR file has already been set up on the user's disc area, perhaps by a $DATA - $EOD combination from a previous batch run, or perhaps by using EDITOR to create the file interactively. .paragraph A possible variation on the above interactive procedure is the following: .skip 1 .left margin 15 .nofill .nojustify
#.ASSIGN DSK:5
#.ASSIGN DSK:6
#.R STA:CLUSTR
.fill .justify .left margin 5 .paragraph In this case the non-spooled disc device has been assigned for input and output. By convention the disc file expected for input is FOR05.DAT, and output is sent to FOR06.DAT. No .SET CDR command is required, as the (spooled) card reader is no longer involved. .left margin 5 .page 9.0 ^&CLUSTR system components and interactions\& .paragraph This section is not aimed at the average user, but is rather designed to give an overview of CLUSTR structure to the programmer who may be faced with the maintenance or modification of CLUSTR.
A resume of each program is given, together with the way in which inter-program communication is achieved via disc files. Further, the structure of each of the disc files is given. The explanations are all in terms of 3D analyses because, as far as CLUSTR is concerned, ALL analyses are 3D, possibly with unit thickness in the 2D case. .paragraph It is assumed that the entity1 x entity2 x attribute analysis to be performed has dimensions n1#x#n2#x#v. .paragraph The diagram in fig. 9.1 summarizes CLUSTR action. .paragraph It may be seen that CLUSTR is really only the first of seven programs which are run in sequence automatically. The programs communicate via disc data files, which are all binary files. It will also be evident that if execution aborts at any stage due to some external error (etc.), execution may be re-commenced at one of the intervening stages. .skip 1 9.1 ^&Program synopses\& .skip 1 CLUSTR .paragraph The first program in the suite reads the dimensions and options cards, decodes them, and writes codified versions of them to disc for use by the later programs. It then reads the samples cards, storing two matrices for each of the A1 and A2 reduced matrices. The two matrices in each case record the total entity-1 or entity-2 attribute values, and the number of occurrences of each entity-1 or entity-2, to allow for possible missing data. .paragraph Throughout the listings copious references to sites, times and species occur. CLUSTR was originally written for ecological analyses, and later generalised. For sites, times and species in the listings, read entity1, entity2 and attributes respectively. .paragraph Routines LISTC and LISRAW generate data summaries. .paragraph Routines LOGRD1, AV1, LOGAV1, LOGRD2, AV2, and LOGAV2 then perform the reductions using the four 'A1 and A2' matrices. These are written to disc for CONDIT. .test page 45 .page .skip 1 CONDIT .paragraph The second program is responsible for conditioning the reduced matrices ready for dissimilarity analysis by BRACUR.
.paragraph Four matrices are generated in the conditioning process, corresponding to one for each of the inverse and normal analyses for both entity1/attribute and entity2/attribute data. .paragraph As BRACUR does dissimilarity calculations only by column, for inverse analyses CONDIT writes the matrix transpose to disc. .skip 1 BRACUR .paragraph The third program writes four trellises ready for sorting by DENDRO. .skip 1 DENDRO .paragraph This program is the later part of Dr. Burr's TAXAN2 program. The CLUSTR suite grew from an original desire to add facilities to this program. It was considered that better facilities could be made available if only the sorting section of TAXAN2 was added to CLUSTR, and not vice versa. Consequently most of TAXAN2 has been discarded, and this is the only section of code remaining that was not written by the author. It would, for reasons obvious from the listings, bear replacement, and may be so replaced at a later stage. .paragraph The basic method used is that given by Clifford and Stephenson (1), following the Lance and Williams approach (5). .skip 1 DENPLT .paragraph The DENPLT program plots dendrograms from the output of the DENDRO program. .paragraph It is basically a recursive binary tree drawer(!), made iterative due to the constraints of FORTRAN. .paragraph The ordering of leaves along the baseline of the dendrogram defines the resorting of the original data matrix, and is written to disc for use by the two-way table printer. .skip 1 RESLST .paragraph The RESLST program prints a resorted version of the matrices contained in AVGE.DAT, after resorting according to the ordering reflected in RESVC.DAT, written out by DENPLT. .paragraph RSAV.DAT is an auxiliary file used in the sorting. .skip 1 PCA .paragraph PCA is the program which performs all the ordination, using either trellis diagrams (DENn.DAT) or conditioned data matrices (MATn.DAT). .skip 2 .left margin 10 .indent -2 General notes .skip 1 .indent -2 1.
At each stage of the classification process, only those options specified in AVGE.DAT from the options card are performed. .skip 1 .indent -2 2. Owing to a quirk on the author's part, all arrays are stored by row and not by the conventional FORTRAN column. This means that the first subscript always refers to a column position along a row, and the second refers to a row position down a column, for each element. .paragraph Particular note of this fact should be taken in PCA, where interfacing to standard SSP routines occurs. .skip 1 .indent -2 3. Use is made throughout of FORTRAN's variable dimension facility, and of the utility routine MORCOR. For further details consult the listing of MORCOR. .skip 1 .indent -2 4. In each case where four matrices are written out, their names are of the form MATn.DAT. The 'n' specifies the analysis for which the data are intended, viz: .skip 1 .nofill .nojustify .left margin 15
1 - entity1/attribute data, normal analysis
2 - entity1/attribute data, inverse analysis
3 - entity2/attribute data, normal analysis
4 - entity2/attribute data, inverse analysis
.fill .justify .left margin 5 .skip 2 9.2 ^&File Formats\& .paragraph The following is a record by record description of each binary file: .skip 2 .nofill .nojustify .left margin 10 .indent -3 AVGE.DAT .skip 1
a) 1 record of 3 words (n1 n2 v) and 13 words of A5 title
b) 1 record of 36 words - coded options
c) v records of n1 words - A1 matrix
d) v records of n2 words - A2 matrix
(the TAXAN dummy AVGE.DAT does not include c and d)
.skip 2 .indent -3 MATn.DAT (e.g. MAT1.DAT) .skip 1
a) 1 record of 2 words (number of rows, number of columns) e.g. v n1
b) conditioned matrix, e.g. v records of n1 words
.skip 2 .indent -3 DENn.DAT (e.g. DEN1.DAT) .skip 1
a) 1 record of 1 word - size of the (square) matrix
b) 1 record of 21 words - title and subtitle
c) lower triangle of the square matrix, e.g. n1-1 records of 1,2,3,...,n1-1 words each
.skip 2 .indent -3 DENIn.DAT (e.g. DENI1.DAT) .skip 1
a) 1 record of 1 word - no.
of entities on the dendrogram baseline
b) 1 record of 21 words - title and subtitle for dendrogram
c) fusions, e.g. n1-1 records each of four words a,b,c,d; these indicate that entities a and b fuse to give c at level d
.skip 2 .indent -3 RESVC.DAT .skip 1
a) n1 records of 1 word each, giving order along the baseline of entity-1 from dendrogram 1
b) v records similarly from dendrogram 2
c) n2 records similarly from dendrogram 3
d) v records similarly from dendrogram 4
.skip 2 .indent -3 RSAV.DAT - resorted conditioned matrix .skip 1
a) v records of n1 entities
b) v records of n2 entities
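.paragraph The resorting reflected in RESVC.DAT may be illustrated by a short sketch. The following is modern Python, not the FORTRAN of RESLST, and the matrix and ordering are invented purely for illustration:

```python
# Illustrative sketch of the RESLST resorting step (modern Python,
# not CLUSTR's FORTRAN).  'order' plays the role of one block of
# RESVC.DAT records: the entity occupying each position along the
# dendrogram baseline.

def resort_rows(matrix, order):
    """Return the rows of 'matrix' permuted into baseline order."""
    return [matrix[i] for i in order]

A1 = [[3.0, 4.0],    # entity-1 = 0
      [5.0, 7.0],    # entity-1 = 1
      [0.0, 1.0]]    # entity-1 = 2

order = [2, 0, 1]    # baseline order from dendrogram 1

print(resort_rows(A1, order))   # [[0.0, 1.0], [3.0, 4.0], [5.0, 7.0]]
```

For a two-way table the columns would then be resorted similarly, using the ordering from the corresponding inverse-analysis dendrogram.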