KK KK EEEEEEEE YY YY WW WW RRRRRR DDDDD KK KK EE YY YY WW WW RR RR DD DD KK KK EE YYYY WW WW WW RR RR DD DD KKKKK EEEEE YY WW WWWW WW RRRRRR DD DD KKK KK EE YY WWWW WWWW RR RR DD DD KK KK EE YY WWW WWW RR RR DD DD KK KK EEEEEEEE YY WW WW RR RR DDDDD 10 August 1980 KEYWRD, Word and Phrase Recognition Logic Generator ------ ---- --- ------ ----------- ----- --------- The KEYWRD program produces a sequence of tests which can identify the leading word or the leading phrase formed of a fixed sequence of words in a line of text without ever having to test a character which has already been identified. Such a leading word or phrase, which will be referred to as a command in this document, does not need to include any characters to the right of the first of the characters which uniquely identify the command. The word or each of the words in a phrase can be abbreviated by truncation, leaving at least the left character in each word of a phrase if additional words or their abbreviations appear to the right. Spaces are allowed between the words in a phrase, but are not required. A single sequence of tests is used to recognize the initial portions of commands which start with a common series of characters, then the unique portion of each command is identified by a separate sequence of tests. After the unique portion of each command has been identified by the separate sequence of tests, then a single sequence of tests is similarly used to recognize the final portions of commands which end with a common series of characters. The KEYWRD program reads a single input file and produces an output listing file and an output FORTRAN language file containing DATA statements which represent the sequence of tests. The sequence of tests could, of course, easily be ouput in a form other than FORTRAN source code. These DATA statements must be merged into the FORTRAN program by which these tests are to be used. The KEYWRD program is written in FORTRAN, and is machine independent except for the short subroutine which asks for the names of the files and then opens these files. The number of tests which are necessary to describe an entire glossary is usually less than proportional to the number of characters in the glossary, although it is possible to design a small glossary for which each character requires a separate test. The tests are linked together so that when a word or phrase is identified using these tests, each character in the word or phrase is tested only against those characters which could logically follow the characters which have already been identified in the word or phrase. KEYWRD, Word and Phrase Recognition Logic Generator Page 2 The description of each test is packed into 2 array locations and consists of 5 items: the serial location in the alphabet of the letter to be matched, an indication of whether spaces can appear before the letter, the location in the arrays of the description of the next test to be applied if the current match fails, the value to be associated with the command if the current match succeeds, and the location in the arrays of the description of the next test to be applied if the current match succeeds. The KEYWRD program required 247 seconds on a DECsystem1091 computer to produce 395 tests which could be used to identify 45 single words and 84 phrases representing 85 different commands. The processing of this glossary, which consisted of 249 words and 1094 letters, required 5582 locations in each of 6 arrays. Each of these arrays is dimensioned at 3000 in the distributed version of the KEYWRD program. The sizes of these arrays can be changed if the value of the variable named LMTPNT is also changed to the new number of locations in each array. If the arrays are too small for the glossary being processed, then the KEYWRD program will inform the user of the required array dimensions. The number of locations which must be available in each array for the processing of a glossary is equal to no more than the sum of the lengths of the single word commands and of the products of the lengths of each of the words in each of the phrases times the lengths of all of the words appearing to the left in the same phrases. The actual number of locations needed will be smaller if some commands start with the same character sequences as others. A glossary consisting of the single word SPACING and the phrase NO FLAGS UPPER CASE would require 7+2+(2*5)+(2*5*5)+(2*5*5*4) or 269 locations in each array. The execution time for processing a glossary is proportional to the square of the maximum number of locations which are used in each array. If the execution time and the size of the KEYWRD program are to be minimized, then the glossary should contain relatively few phrases, and any phrases which must be included should be kept short. Of course, the program which uses the tests produced by the KEYWRD program will identify phrases nearly as efficiently as single words. On the DECsystem1091 computer, the excecution time for the KEYWRD program is given by seconds = 8E-6*(number of locations used in single array**2) Each line in the input file which does not start with one of the reserved characters *, /, ( or ), which are described later, specifies a single word or phrase preceded by the nonzero value by which the word or phrase is to be identified. The number should not duplicate the number to be associated with any other command unless these commands are synonyms or unless some of these commands are abbreviations which would otherwise be ambiguous. The number can be preceded by one or more spaces, but leading KEYWRD, Word and Phrase Recognition Logic Generator Page 3 spaces are not required. The number cannot contain any characters other than the digits 0 through 9 and a leading minus sign if the value is negative or an optional leading plus sign if the value is positive. Spaces and/or a single comma can appear between the leading number and the following word or phrase, but are not required. The words within a phrase must be separated by at least 1 space. Extra spaces are ignored. Words and phrases can be constructed from any characters, but upper and lower case alphabetic letters are considered to be equivalent. The sequence of tests produced by the KEYWRD program is independent of the order in which the commands are defined in the input file, but with the exception that the numerical values assigned to nonalphabetic characters are 26 plus the order in which these unexpected characters are first encountered. The input file is terminated by a line containing a number which is not followed on the same line by any word or phrase. The following is an input file which defines 4 very similar phrases. 10 NEXT RUNE 20 NEAT RULE 30 NEXT RULE 40 NEAT RUNE 0 The listing file produced by the KEYWRD program summarizes each of the words and phrases and phrase abbreviations which can be recognized. The following listing file would be produced if the above input file is processed. 0 NE RULE NE RUNE N RULE N RUNE 10 NEXT RU(N)E NEX RU(N)E 20 NEAT RU(L)E NEA RU(L)E 30 NEXT RU(L)E NEX RU(L)E 40 NEAT RU(N)E NEA RU(N)E The numbers at the left in the listing file are the values which are to be associated with the words and phrases which are shown to the right. The unique portion of each word or phrase is enclosed in parentheses. An abbreviation of a command is unambiguous if it includes the first of the letters which are enclosed in parentheses. Since the final letter E in each of the above phrases is not necessary for the recognition of the phrases, the test sequences can be merged at the end as well as at the start, so that the same test for the final letter E can be included in all of the test sequences. The tests for the next to the last KEYWRD, Word and Phrase Recognition Logic Generator Page 4 character, the letter N, cannot be merged in the 2 phrases which contain the word RUNE, and the tests for the letter L cannot be merged in the 2 phrases which contain the word RULE, since no tests for unique characters would then remain in the sequences of tests which identify these phrases. The abbreviations of all but the rightmost words in each phrase are included in the listing file since the manner in which the previous words in a phrase are abbreviated can change which of the following characters are unique. Abbreviations such as NE RULE, NE RUNE, N RULE and N RUNE, in the above example, contain no unique characters and so are ambiguous. Ambiguous abbreviations are shown in the listing file with the associated value being zero. The FORTRAN statement file produced by the KEYWRD program contains a description of the input file, a description of the tests, the DATA statements defining 2 arrays, MCHPNT and NOTPNT, which specify the tests, and a DATA statement specifying the sizes of these arrays, KNTPNT. The names which are assigned to all of the arrays and variables which are represented in the DATA statements in the FORTRAN statement file can be specified by a line starting with an initial slash as described later. The following FORTRAN statement file would be produced if the above input file is processed. In the comment lines, a minus sign to the left of a letter indicates that the letter appears at the start of a word. C 10 NEXT RUNE C 20 NEAT RULE C 30 NEXT RULE C 40 NEAT RUNE C C FINAL STORAGE USED= 19, MOST USED= 62, LIMIT= 3000 C C CHECKSUMS 990, 32,2894,2575, 866 C C COUNT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 C COMMAND 0 0 0 0 0 0 20 40 0 0 0 0 30 10 0 C LETTER -N E A T -R U L N X T -R U L N -R C SUCCESS 2 3 4 5 6 7 19 19 10 11 12 13 19 19 16 C FAILURE 0 15 9 5 0 0 8 0 15 11 0 0 14 0 0 C C COUNT 16 17 18 19 C COMMAND 0 0 0 0 C LETTER U L N E C SUCCESS 17 19 19 0 C FAILURE 0 18 0 0 C C DIMENSION MCHPNT(19) DIMENSION MCHPN1(19) EQUIVALENCE (MCHPN1(1),MCHPNT(1)) C DIMENSION NOTPNT(19) DIMENSION NOTPN1(19) EQUIVALENCE (NOTPN1(1),NOTPNT(1)) KEYWRD, Word and Phrase Recognition Logic Generator Page 5 DATA KNTPNT,KNTXTR/ 19, 0/ DATA MCHPN1/2,3,4,5,6,7,419,819,10,11,12,13,619,219, 116,17,19,19,0/ DATA NOTPN1/-280,115,29,405,-360,420,248,280,495,411, 1-360,420,254,280,-360,420,258,280,100/ If the words and phrases are constructed from characters other than spaces and the alphabetic letters A through Z, then an additional DATA statement is generated which specifies a third array, LTRXTR, containing the unexpected characters. KNTXTR, which is specified by one of the DATA statements which are always generated, is the size of the LTRXTR array. If the words and phrases are constructed only from spaces and the alphabetic letters A through Z, then KNTXTR has the value zero and a DATA statement defining the LTRXTR array is not generated. The first location in the NOTPNT array describes the first test which is to be performed. The absolute value of each entry in the NOTPNT array is the sum of the location within the alphabet of the letter to be matched times (KNTPNT+1), plus the subscript of the location in the NOTPNT array which describes the next match which is to be attempted if the current match is a failure. The subscript of an array location is its serial position within the array, counting the first value in the array as being at subscript 1, the second value as being at subscript 2, and so on. If the entry in the NOTPNT array is negative, then the character starts a word and any spaces in the input line can be skipped. If the location within the alphabet is greater than 26, then this minus 26 is the location within the LTRXTR array of the character to be matched. If the match succeeds and the value to be associated with the command is positive or if the value is zero indicating that the match does not uniquely identify a particular command, then the parallel location in the MCHPNT array contains the sum of the value of the command times (KNTPNT+1), plus the subscript of the location in the NOTPNT array which describes the next test. If the match succeeds and the value to be associated with the command is negative, then the parallel location in the MCHPNT array instead contains the value of the command times (KNTPNT+1), minus the subscript of the location in the NOTPNT array which describes the next test. If the subscript of the location in the NOTPNT array which describes the next test is indicated to be zero, either by the MCHPNT array if the current match is a success or by the NOTPNT array if the current match is a failure, then no additional test remains to be performed, and the command is identified by the last nonzero value encountered in the MCHPNT array for a match which succeeded. The sequence of tests which the KEYWRD program produces to identify the commands in the above input file is diagrammed below. In this diagram, the horizontal lines formed of KEYWRD, Word and Phrase Recognition Logic Generator Page 6 minus signs indicate the next tests to be performed if the matches succeed, and the diagonal lines formed of slashes indicate the next tests to be performed if the matches fail. The numbers above the letters are the locations in the NOTPNT and MCHPNT arrays which describe the tests. The numbers below the letters are the values of the associated commands. The word "space" indicates a location at which an optional space character can appear. 18 19 N---E /0 /0 15 16 17/ / /---/-----------space---R---U---L---/ / / 0 0 0 / / / / / / 14 / / / N---/ / / /10 / / / 11 12 13/ / / / /---/---space---R---U---L---/ / 9/ 10/ / 0 0 30 / / X---T---/ / / /0 0 8 / / / N---/ / / /40 / / / 5 6 7/ / / / /---/---space---R---U---L---/ 1 2/ 3/ 4/ / 0 0 20 N---E---A---T---/ 0 0 0 0 The following input file defines 4 commands which start with the same first 3 letters. 10 FOGHORN 20 FOG HORN 30 FOG 40 FOGGY 0 The sequence of tests which the KEYWRD program produces to identify the commands in the above input file is diagrammed below. This sequence of tests is independent of the order in which the commands appear in the input file. If a continuation of the current word starts with the same letter as the next word of a phrase, then the continuation is always sought first and the space which must precede the next word is sought only if the test for the continuation fails. Thus, FOGHORN without a space before the letter H will always be identified as command 10 and FOG HORN with a space before the letter H will always be identified as command 20. FHORN and FOHORN are not abbreviations of the single word FOGHORN, and so are identified as command 20 whether or not a space appears before the letter H. KEYWRD, Word and Phrase Recognition Logic Generator Page 7 7 8 9 10 /---/---/---space---H---O---R---N / / / 20 /0 0 0 / / / / / / 6/ / / / H---------------/ / / /10 / / / 1 2/ 3/ 4/ 5 F---O---G---G---Y 0 0 30 40 40 Undesired abbreviations can be made unidentifiable by including them in the input file with zeroes as the associated values. If the glossary defined by the input file shown above is to include the abbreviations F HORN and FO HORN when a space or spaces appear between the abbreviations of the words of the phrase, but FHORN and FOHORN are not to be allowed as abbreviations, then the input file might be changed to the following. 10 FOGHORN 20 FOG HORN 30 FOG 40 FOGGY 0 FHORN 0 FOHORN 0 The following listing file would be produced if the above input file is processed. 0 FHORN FOHORN 10 FOG(H)ORN 20 FOG (H)ORN FO (H)ORN F (H)ORN 30 FO(G) 40 FOG(GY) The sequence of tests which the KEYWRD program produces to identify the commands in the above input file is diagrammed below. The only differences between this sequence of tests and that shown for the previous example is that this sequence of tests does not assign a nonzero value to the command when the letter H is found immediately after an initial letter F or after the initial letters FO. KEYWRD, Word and Phrase Recognition Logic Generator Page 8 9 10 11 12 /---/---/---space---H---O---R---N / / / 20 /0 0 0 / / / / / 8/ / / H-------------------/ / /0 / / / / / / / 7/ / / / H---------------/ / / /10 / / / / / 3/ 4/ 5/ 6 / O---G---G---Y / /0 30 40 40 / / / 1 2/ / F---H-----------------------/ 0 0 If, instead, the glossary is to include the abbreviations FHORN and FOHORN, but F HORN and FO HORN are not be be allowed as abbreviations when a space or spaces appear between the words, then the input file might be changed to the following. 10 FOGHORN 20 FOG HORN 30 FOG 40 FOGGY 20 FHORN 20 FOHORN 0 FO HORN 0 No harm would result from explicitly assigning the value zero to the phrase F HORN in the above input file, but this is not necessary since this phrase is an abbreviation of the phrase FO HORN. The following listing file would be produced if the above input file is processed. 0 FO HORN F HORN 10 FOG(H)ORN 20 F(H)ORN FOG (H)ORN FO(H)ORN 30 FO(G) 40 FOG(GY) The sequence of tests which the KEYWRD program produces to identify the commands in the above input file is diagrammed below. KEYWRD, Word and Phrase Recognition Logic Generator Page 9 10 11 12 13 /---/-------space---H---O---R---N / / 0 /0 0 0 / / / / 9/ / / H-------------------/ / /20 / / / / / / 8 / / / /---space---H---/ / / / 20 / / / / / / / 7/ / / / H---------------/ / / /10 / / / / / 3/ 4/ 5/ 6 / O---G---G---Y / /0 30 40 40 / / / 1 2/ / F---H-----------------------/ 0 20 When 2 or more commands start with the same sequence of characters and the input file defines a shorter command which is itself an abbreviation of the nonunique sequence, then the shorter command can be followed by the rest of the nonunique characters and still be identified as the shorter command. For example, if the above input files defined FOG HIDDEN as having the value 50, then then both FOG and FOG H would be identified as command 30. This could be prevented by assigning FOG H and all other undesired abbreviations of this sort some single nonzero value which is not used for any valid command and which will indicate that these undesired commands are illegal if they are ever encountered. If a short command which is explicitly declared in the input file would, if not declared, serve as an abbreviation of a longer command having a different identifying value, then the abbreviations of the shorter command will themselves be ambiguous. If the abbreviations of the shorter command are to be recognized, then the shortest possible abbreviations of the word, or of the rightmost words in the phrases of each possible length, must be explicitly declared. For example, if the input file contains only the following entries 1 FOOT NOTE 2 FOOT NOTE GAP 3 FOOT NOTE HEADER then abbreviations such as F, FO, FOO, FOOT, FN, FON and FNO will be unrecognizable. Any abbreviation of the command FOOT NOTE will be recognizable only if it contains the full KEYWRD, Word and Phrase Recognition Logic Generator Page 10 spelling of the word NOTE. This could be prevented by declaring F and FOOT N (the shortest abbreviations of the rightmost words in the phrases of each possible length) to have the value 1. If F already was declared as yet another command, then it would be FO and FOOT N which would be declared to have the value 1. Handling of Input Lines Starting with Special Characters -------- -- ----- ----- -------- ---- ------- ---------- Lines in the input file which start with a slash, an asterisk, a left parenthesis or a right parenthesis are treated specially. These initial characters cause the following actions to be performed. / An initial slash indicates that the line specifies the names of the arrays and variables which are to be represented in the DATA statements which are to be written into the output FORTRAN statement file. * An initial asterisk indicates that the line specifies 5 numbers which characterize the sequence of tests produced by the KEYWRD program. Such a line would appear in the input file only when the result is already known and the operation of the KEYWRD program is being verified. ( An initial left parenthesis indicates that the rest of the current line is to be copied into the output FORTRAN statement file unchanged. ) An initial right parenthesis indicates that the DATA statements which represent the sequence of tests are to be written into the output FORTRAN statement file. If a line starts with an slash, then the next 5 groups of printing characters on the line are used as the names of the 3 arrays and 2 variables which are represented in the DATA statements when the next line is encountered which starts with a left parenthesis or which contains only a number. These groups of printing characters can be separated by spaces and/or by single commas. The names of the 3 arrays must each differ from the others in their first 3 characters. 1. The first group of up to 6 characters is used as the name of the array which specifies the next operation if a match fails. This name is NOTPNT if a line starting with an slash does not appear in the input file. 2. The second group of up to 6 characters is used as the name of the array which specifies the next operation if a match succeeds. This name is MCHPNT if a line starting with an slash does not appear in the input KEYWRD, Word and Phrase Recognition Logic Generator Page 11 file. 3. The third group of up to 6 characters is used as the name of the nondimensioned variable which contains the number of tests specified by the NOTPNT and MCHPNT arrays. This name is KNTPNT if a line starting with an slash does not appear in the input file. 4. The fourth group of up to 6 characters is used as the name of the array which specifies all characters, other than spaces and the letters A through Z, appearing in the commands. This name is LTRXTR if a line starting with an slash does not appear in the input file. 5. The fifth group of up to 6 characters is used as the name of the nondimensioned variable which contains the number of characters in the LTRXTR array. This name is KNTXTR if a line starting with an slash does not appear in the input file. If a line starts with an asterisk, then the rest of the line predicts the values of 5 checksums which are to be calculated from the glossary which is being defined. These checksums can be separated by spaces and/or by single commas. The user of the KEYWRD program will be informed if the predicted values and the calculated values do not agree. The predicted checksums are used in standardized versions of the input file which are processed to determine if the KEYWRD program still produces the same results after the KEYWRD program itself has been changed. A sixth group of up to 6 characters on the line beginning with an asterisk can define a label which is to be displayed to the user. The characters appearing to the right of an initial left parenthesis are copied directly into the FORTRAN statement file. The appearance of a line starting with a left parenthesis does not interrupt the specification of the glossary of keywords, and, in particular, does not cause the DATA statements to be generated. Lines with an initial left parenthesis can be used to introduce FORTRAN statements and FORTRAN comment lines into the FORTRAN statement file. Lines which begin with a right parenthesis cause the DATA statements representing the sequence of tests to be written into the FORTRAN statement file, but do not indicate that the end of the input file has been reached. All characters which follow the initial right parenthesis on the same line are ignored. If the input file is used for testing the KEYWRD program, then additional glossaries might be defined on the lines which follow that which begins with the right parenthesis, but, usually, only lines containing information to be copied directly to the FORTRAN statement file would appear before the final line which contains only a number. The final line will cause the DATA statements to be written out if a line with an initial right parenthesis has not KEYWRD, Word and Phrase Recognition Logic Generator Page 12 already done so. For example, the following input file ( BLOCK DATA ( COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR ( COMMON/LTRKEY/LTRXTR(20) 10 FOGHORN 20 FOG HORN (C COMMENT LINE TO BE COPIED DIRECTLY TO OUTPUT 30 FOG 40 FOGGY )GENERATE THE DATA STATEMENTS ( END 0 would, when processed by the KEYWRD program, produce the following FORTRAN statement file. BLOCK DATA COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR COMMON/LTRKEY/LTRXTR(20) C 10 FOGHORN C 20 FOG HORN C COMMENT LINE TO BE COPIED DIRECTLY TO OUTPUT C 30 FOG C 40 FOGGY C C FINAL STORAGE USED= 10, MOST USED= 24, LIMIT= 3000 C C CHECKSUMS 650, 8, 736, 306, 101 C C COUNT 1 2 3 4 5 6 7 8 9 10 C COMMAND 0 0 30 40 40 10 20 0 0 0 C LETTER -F O G G Y H -H O R N C SUCCESS 2 3 4 5 0 8 8 9 10 0 C FAILURE 0 7 7 6 0 7 0 0 0 0 C C DIMENSION MCHPNT(10) DIMENSION MCHPN1(10) EQUIVALENCE (MCHPN1(1),MCHPNT(1)) C DIMENSION NOTPNT(10) DIMENSION NOTPN1(10) EQUIVALENCE (NOTPN1(1),NOTPNT(1)) DATA KNTPNT,KNTXTR/ 10, 0/ DATA MCHPN1/2,3,334,445,440,118,228,9,10,0/ DATA NOTPN1/-66,172,84,83,275,95,-88,165,198,154/ END KEYWRD, Word and Phrase Recognition Logic Generator Page 13 A Sample Program Using the Output from the KEYWRD Program - ------ ------- ----- --- ------ ---- --- ------ ------- The following FORTRAN program demonstrates the manner in which the variables and the arrays generated by the KEYWRD program are used. This sample program accepts a short line of text typed by the user, then informs the user what this line of text contains. The user can type more than a single command on each line if these commands are separated by semicolons. COMMON/NUMKEY/NOTPNT(300),MCHPNT(300),KNTPNT,KNTXTR COMMON/LTRKEY/LTRXTR(20) DIMENSION LTRBFR(30),LTRABC(26),LWRABC(26) C NUMBER OF UNIT FROM WHICH TEXT IS READ DATA ITTY/5/ C DIMENSION OF LTRBFR ARRAY, NUMBER OF LETTERS TO READ DATA LMTBFR/30/ C UPPER CASE LETTERS A THROUGH Z DATA LTRABC/1HA,1HB,1HC,1HD,1HE,1HF,1HG,1HH,1HI,1HJ, 11HK,1HL,1HM,1HN,1HO,1HP,1HQ,1HR,1HS,1HT,1HU,1HV,1HW, 21HX,1HY,1HZ/ C LOWER CASE LETTERS A THROUGH Z DATA LWRABC/1Ha,1Hb,1Hc,1Hd,1He,1Hf,1Hg,1Hh,1Hi,1Hj, 11Hk,1Hl,1Hm,1Hn,1Ho,1Hp,1Hq,1Hr,1Hs,1Ht,1Hu,1Hv,1Hw, 21Hx,1Hy,1Hz/ C THE SPACE CHARACTER FOR FINDING SPACES IN TEXT C ASTERISK FOR MARKING UNKNOWN LETTERS IN MESSAGE DATA LTRSPC,LTRSEM/1H ,1H;/ C C CALCULATE FACTOR FOR EXTRACTING BYTES FROM ARRAYS IOFFST=KNTPNT+1 C C GET NEXT LINE OF TEXT TO BE INTERPRETED 1001 WRITE(ITTY,1002) 1002 FORMAT(2H *,$) READ(ITTY,1003)LTRBFR 1003 FORMAT(30A1) C C SET UP POINTERS WITHIN THE INPUT LINE BEING EVALUATED LOCBFR=0 1004 MINPRT=LOCBFR+1 C C SET UP POINTERS WITHIN THE TEST SEQUENCE KMDNOW=0 LOCPNT=1 C C GET NEXT CHARACTER TO BE TESTED 1005 MAXPRT=LOCBFR LOCBFR=LOCBFR+1 IF(LOCBFR.GT.LMTBFR)GO TO 1014 C C ATTEMPT TO IDENTIFY THE CHARACTER 1006 IF(LOCPNT.LE.0)GO TO 1014 IVALUE=NOTPNT(LOCPNT) KEYWRD, Word and Phrase Recognition Logic Generator Page 14 LOCABC=IVALUE IF(IVALUE.LT.0)LOCABC=-LOCABC LOCABC=LOCABC/IOFFST 1007 IF(LTRBFR(LOCBFR).EQ.LTRSEM)GO TO 1014 IF(LOCABC.LE.26)GO TO 1008 IF(LTRBFR(LOCBFR).EQ.LTRXTR(LOCABC-26))GO TO 1012 GO TO 1009 1008 IF(LTRBFR(LOCBFR).EQ.LTRABC(LOCABC))GO TO 1012 IF(LTRBFR(LOCBFR).EQ.LWRABC(LOCABC))GO TO 1012 C C LETTERS DID NOT MATCH 1009 IF(IVALUE.GE.0)GO TO 1011 IVALUE=-IVALUE C C CHECK FOR SPACES BEFORE NEXT WORD 1010 IF(LTRBFR(LOCBFR).NE.LTRSPC)GO TO 1007 LOCBFR=LOCBFR+1 IF(LOCPNT.EQ.1)MINPRT=LOCBFR IF(LOCBFR.LE.LMTBFR)GO TO 1010 GO TO 1014 C C GET NEXT LETTER TO BE TESTED IF FAILURE 1011 LOCPNT=IVALUE-(IOFFST*LOCABC) GO TO 1006 C C LETTERS MATCHED 1012 IVALUE=MCHPNT(LOCPNT) IF(IVALUE.GE.0)GO TO 1013 IVALUE=-IVALUE KMDNEW=IVALUE/IOFFST LOCPNT=IVALUE-(IOFFST*KMDNEW) IF(KMDNEW.NE.0)KMDNOW=-KMDNEW GO TO 1005 1013 KMDNEW=IVALUE/IOFFST LOCPNT=IVALUE-(IOFFST*KMDNEW) IF(KMDNEW.NE.0)KMDNOW=KMDNEW GO TO 1005 C C FIND RIGHTMOST PRINTING CHARACTER FOR USE IN MESSAGES 1014 MAXTST=LOCBFR MAXSHO=MAXPRT 1015 IF(LOCBFR.GT.LMTBFR)GO TO 1017 IF(LTRBFR(LOCBFR).EQ.LTRSPC)GO TO 1016 IF(LTRBFR(LOCBFR).EQ.LTRSEM)GO TO 1017 MAXSHO=LOCBFR 1016 LOCBFR=LOCBFR+1 GO TO 1015 C C CHECK IF KNOWN AND NOT FOLLOWED DIRECTLY BY LETTER 1017 IF(LOCPNT.EQ.1)GO TO 1024 IF(KMDNOW.EQ.0)GO TO 1022 IF(MAXTST.GT.MAXSHO)GO TO 1019 IF(MAXTST.GT.(MAXPRT+1))GO TO 1019 LTRNOW=LTRBFR(MAXTST) DO 1018 I=1,26 KEYWRD, Word and Phrase Recognition Logic Generator Page 15 IF(LTRNOW.EQ.LTRABC(I))GO TO 1022 IF(LTRNOW.EQ.LWRABC(I))GO TO 1022 1018 CONTINUE C C REPORT WHAT WAS FOUND 1019 WRITE(ITTY,1020)KMDNOW,(LTRBFR(I),I=MINPRT,MAXPRT) 1020 FORMAT(8H COMMAND,1I3,2H: ,31A1) IF(MAXTST.GT.MAXSHO)GO TO 1026 WRITE(ITTY,1021)(LTRBFR(I),I=MAXTST,MAXSHO) 1021 FORMAT(13H ARGUMENTS: ,31A1) GO TO 1026 1022 WRITE(ITTY,1023)(LTRBFR(I),I=MINPRT,MAXSHO) 1023 FORMAT(13H UNKNOWN: ,31A1) GO TO 1026 1024 WRITE(ITTY,1025) 1025 FORMAT(11H MISSING) C C GET NEXT STATEMENT ON SAME LINE 1026 IF(LOCBFR.LE.LMTBFR)GO TO 1004 GO TO 1001 END The following is a typical dialog between the user and the program which is listed above. The glossary which is used consists of the words FOGHORN, FOG HORN, FOG and FOGGY. * MISSING *; MISSING MISSING *FOH;FOGH;FOG H COMMAND 20: FOH COMMAND 10: FOGH COMMAND 20: FOG H *FOG;FOGHORN;FOG HORN;FOGGY COMMAND 30: FOG COMMAND 10: FOGHORN COMMAND 20: FOG HORN COMMAND 40: FOGGY *FOG123;FOG ABC COMMAND 30: FOG ARGUMENTS: 123 COMMAND 30: FOG ARGUMENTS: ABC *ABC;FO;FOGABC UNKNOWN: ABC UNKNOWN: FO UNKNOWN: FOGABC KEYWRD, Word and Phrase Recognition Logic Generator Page 16 A Description of the Algorithm Used by the KEYWRD Program - ----------- -- --- --------- ---- -- --- ------ ------- The first printing character in each new line read from the input file is checked to determine if it is one of the reserved characters. If the line instead begins with a number, then this number is evaluated and the spaces and/or a single comma following the number are discarded. A decision tree is then constructed which describes the paths by which the characters in the word or in the phrase could have been recognized. Each node in the decision tree specifies the node from which it can be reached and whether it is reached upon the success or upon the failure, never both, of the test described by that node. The first character of the second and of each of the subsequent words in a phrase is included in separate tests which are reached on failures to match each appearance of the second and of each of the subsequent characters of the previous word, and is also included in separate tests which are reached upon successful matches of each appearance of the final character of the previous word. If the command consists of a single word, then the decision tree is a simple chain of tests, each node indicating that it is reached only if the test described by the previous node is successful. If the command consists of more than a single word, then each character in the second word appears in as many tests as there are characters in the first word. Each character in a third word would appear in a number of tests equal to the number of characters in the first word times the number of characters in the second, and so on. The decision tree which can be used to identify the new command is appended to the arrays which are used to store the decision trees already obtained for the previously read commands. For example, the input file 10 WHO ARE YOU 20 WHO AM I 0 would be converted into the 2 trees which are shown below after the second line of the input file is read. KEYWRD, Word and Phrase Recognition Logic Generator Page 17 17 26 35 Y--O--U /10 10 10 /20 29 38 / Y--O--U / /10 10 10 5 8/11/14 23 32 A--R--E--Y--O--U /10 10 10 10 10 10 / 18 27 36 / Y--O--U / /10 10 10 / /21 30 39 / / Y--O--U / / /10 10 10 / 6 9/12/15 24 33 / A--R--E--Y--O--U / /10 10 10 10 10 10 / / 16 25 34 / / Y--O--U / / /10 10 10 / / /19 28 37 / / / Y--O--U / / / /10 10 10 1 2/ 3/ 4 7/10/13 22 31 W--H--------O--------A--R--E--Y--O--U 10 10 10 10 10 10 10 10 10 and 53 I /20 44 47/50 A--M--I /20 20 20 / 54 / I / /20 / 45 48/51 / A--M--I / /20 20 20 / / 52 / / I / / /20 40 41/ 42/ 43 46/49 W--H-----O-----A--M--I 20 20 20 20 20 20 After a new command has been read and converted to the back pointing decision tree, nodes in the new tree which are identical to nodes in the old trees, and which are reached under identical conditions, are removed and all references to the removed nodes in the new tree are revised to point to the nodes in the old trees to which the removed nodes are KEYWRD, Word and Phrase Recognition Logic Generator Page 18 identical. The portion of the new tree which is stored above the node which is being removed is shifted downward 1 location and all references in this portion of the new tree to the node being removed are changed to instead point back to the node in the old trees to which the removed node is identical. The decision tree shown below would represent the input file shown above after the redundant subphrase WHO A is removed from the second command. There are then 2 nodes which branch from node 4 on a success, 2 from 5 and 2 from 6. KEYWRD, Word and Phrase Recognition Logic Generator Page 19 17 26 35 Y--O--U /10 10 10 /20 29 38 / Y--O--U / /10 10 10 8/11/14 23 32 :--R--E--Y--O--U : 10 10 10 10 10 : 47 : I : /20 5 41/ 44 A--M--I /0 20 20 / 18 27 36 / Y--O--U / /10 10 10 / /21 30 39 / / Y--O--U / / /10 10 10 / 9/12/15 24 33 / :--R--E--Y--O--U / : 10 10 10 10 10 / : 48 / : I / : /20 / 6 42/ 45 / A--M--I / /0 20 20 / / 16 25 34 / / Y--O--U / / /10 10 10 / / /19 28 37 / / / Y--O--U / / / /10 10 10 / / 7/10/13 22 31 / / :--R--E--Y--O--U / / : 10 10 10 10 10 / / : 46 / / : I / / : /20 1 2/ 3/ 4 40/ 43 W--H--------------O--------------A--M--I 0 0 0 0 20 20 The roots of the remaining trees, which will each begin with a different character, are chained together in a series of failure transfers when the end of the input file is read. The resulting single back pointing decision tree must then be converted into a forward pointing decision tree in which each node indicates the node which is to follow it in the event of a success or of a failure. A node in the back pointing decision tree can be reached only upon the success or the failure of a single node which is stored lower in the KEYWRD, Word and Phrase Recognition Logic Generator Page 20 arrays since all linking and relinking has been to nodes in older trees and since the nodes are still in the same relative order as when they were constructed. Since nodes can only be reached from nodes which are stored lower in the arrays, a single pass through the arrays can be used to convert from the back pointing decision tree to the forward pointing decision tree. If a node indicates that it is reached by a failure of a node which represents a letter which is higher in the alphabet, then the tree must be relinked to place the new node earlier in the chain of failures than that which is indicated to be its predecessor. In this reordering, as in all situations in which characters are compared, characters which appear at the start of words are considered to be in a second and higher alphabet. The following forward pointing decision tree would represent the sample input file shown above. 17 26 35 Y--O--U /10 10 10 47/20 29 38 I Y--O--U /20/10 10 10 8/11/14 23 32 R--E--Y--O--U /10 10 10 10 10 5 41/ 44 A--M--I /0 20 20 / 18 27 36 / Y--O--U / /10 10 10 / 48/21 30 39 / I Y--O--U / /20/10 10 10 / 9/12/15 24 33 / R--E--Y--O--U / /10 10 10 10 10 / 6 42/ 45 / A--M--I / /0 20 20 / / 16 25 34 / / Y--O--U / / /10 10 10 / / 46/19 28 37 / / I Y--O--U / / /20/10 10 10 / / 7/10/13 22 31 / / R--E--Y--O--U / / /10 10 10 10 10 1 2/ 3/ 4 40/ 43 W--H-----------O-----------A--M--I 0 0 0 0 20 20 Each chain of failures is next purged of identical KEYWRD, Word and Phrase Recognition Logic Generator Page 21 characters, since the second appearance of a particular character in a chain of failures can never be reached. When a duplicate is found, then the chain of failures must be linked around the node being removed, and the chain of failures extending from the success transfer from the node being removed must be merged with the chain of failures extending from the success transfer from the node being kept, weaving the 2 chains together to obtain a single chain in which the characters are tested in alphabetical order. Once the tree has been purged of identical characters in the chains of failures, the nodes which represent the unique characters in all possible abbreviations of each command must be identified so that these nodes can be preserved when identical branches of the decision tree are merged. The nodes which must be preserved are those which are on the chains of failures extending from the success transfers from the nodes which are known to be ambiguous by their having been on roots which were merged. Identical nodes which are at the ends of the branches or which transfer to the same nodes are merged if these identical nodes are not marked as representing unique characters in different commands. Since it is not necessary to preserve the relative order of the storage of the nodes in the arrays, the node being removed is interchanged with that at the upper end of the arrays and all references to both nodes are patched. This interchange prevents having to shift several nodes each time a node must be removed. Pruning of the identical branches would convert the decision tree shown above to that which is shown below. 6 11 12 /--/--/--Y--O--U 5/ / / 10 10 10 I / / /20/ / 7/10/ / R--E--/ /10 10 4 9/ 8 /--/--/--A--M--I 1 2/ 3/ / 0 20 20 W--H--O--/ 0 0 0 To make the decision tree easier to diagram, the nodes are rearranged into the order in which the nodes are encountered when the decision tree is climbed. The decision tree shown above would finally be reduced to that which is shown below. KEYWRD, Word and Phrase Recognition Logic Generator Page 22 10 11 12 /--/--/--Y--O--U 9/ / / 10 10 10 I / / /20/ / 7/ 8/ / R--E--/ /10 10 4 5/ 6 /--/--/--A--M--I 1 2/ 3/ / 0 20 20 W--H--O--/ 0 0 0 When the optional spaces before each of the words of each of the phrases are included, the decision tree would instead appear as is shown below. 10 11 12 /--/--/-space-Y--O--U 9/ / / 10 10 10 /-space-I / / / 20/ / 7/ 8/ / R----------E--/ /10 10 4 5/ 6 /--/--/-space-A--M--space--I 1 2/ 3/ / 0 20 20 W--H--O--/ 0 0 0 The following listing file would be produced when the above decision tree is constructed. 10 WHO A(RE YOU) WHO A(R YOU) WHO A (YOU) WH A(RE YOU) WH A(R YOU) WH A (YOU) W A(RE YOU) W A(R YOU) W A (YOU) 20 WHO A(M I) WHO A (I) WH A(M I) WH A (I) W A(M I) W A (I) KEYWRD, Word and Phrase Recognition Logic Generator Page 23 The following FORTRAN statement file represents the decision tree shown above. C 10 WHO ARE YOU C 20 WHO AM I C C FINAL STORAGE USED= 12, MOST USED= 54, LIMIT= 3000 C C CHECKSUMS 880, 30,1121, 448, 288 C C COUNT 1 2 3 4 5 6 7 8 9 10 11 12 C COMMAND 0 0 0 0 20 20 10 10 20 10 10 10 C LETTER -W H O -A M -I R E -I -Y O U C SUCCESS 2 3 4 5 6 0 8 10 0 11 12 0 C FAILURE 0 4 4 0 7 0 9 10 10 0 0 0 C C DIMENSION MCHPNT(12) DIMENSION MCHPN1(12) EQUIVALENCE (MCHPN1(1),MCHPNT(1)) C DIMENSION NOTPNT(12) DIMENSION NOTPN1(12) EQUIVALENCE (NOTPN1(1),NOTPNT(1)) DATA KNTPNT,KNTXTR/ 12, 0/ DATA MCHPN1/2,3,4,5,266,260,138,140,260,141,142,130/ DATA NOTPN1/-299,108,199,-13,176,-117,243,75,-127, 1-325,195,273/ The KEYWRD program and this documentation were written at the Harvard Business School by Donald E. Barth, who can be reached at the following address. Baker Library 21 Graduate School of Business Administration Harvard University, Soldiers Field Boston, Massachusetts 02163 Appendix: KEYWRD Development History -------- ------ ----------- ------- May 1980 Original release of the KEYWRD program through DECUS library. August 1980 (modification) Listing indicates commands which are subsets of other commands. (modification) Program estimates necessary array size if glossary is too large.