TPARS Users Manual ** ** ************************ * * * TPARS Users Manual * * * * 28-May-1976 * * * * A. C. Goldstein * * * * 130-958-021-02 * * * ************************ TPARS User s Manual PAGE 2 1.0 INTROD UCTION Fair warning: The reader is assumed to have a cursory knowledge of table driven parsing and/or finite state auto- mata. TPARS is a general purpose table driven parser. It is f un- damentally capable of handling anything that is finite state parsable. Due to its structure and choice of basic lexical t ypes, however, it works most efficiently for keyword gram- mars such as operating system command lines, etc. A string is parsed by int erpreting the transitions in a state table supplied by the cal ler. The state table is a list of states. Each state consists of a list of transi- tions to other states. Each transition is compo sed of the following elements: Type The type of syn tactic element which, when recog- nized in the input stri ng, will cause this transi- tion to be taken. Target The label of the state to transfer to. Action Address of a user supplied action routine to call when taking this transition. Mask A bit mask to set when taking this transitio n. Address The address into which to set the mask when taking this transition. The action, mask, and address co mponents of a transition are optional. The target component of a t ransition may be op- tional under certain circumstances explained lat er. When a transition is taken, the section of the input string that matches the specified type is passed over, and the characters following it become available for matching by transit ions in the next state. The structure of the state table specifies the syntax the parser will accept; the action routines specify t he seman- tics. TPARS Users Manual PAGE 3 2.0 STATE TABLE SOURCE CODING A set of macros is provided to generate the state table. The state table proper is built in a Psect named $STATE. All keyword strings are accumulated in a separate keyword table in Psect $KS TR. A keyword pointer table is built in Psect $KTAB. If the user de fines the symbol $RONLY, the $STATE, $KSTR, and $KTAB Psects w ill be generated as read only. The state table is coded at the sourc e level with the following macro calls: 2.1 ISTAT$ Statetable,Keytable ISTAT$ initializes table generation. It equat es the label "Statetable" to the start of the state table, and the label "Keytable" to the start of the keyword pointer table (expla- ined below). 2.2 STATE$ Label STATE$ d eclares the beginning of a state. The label is op- tional. If pr esent, it is equated to the address of the state. Because of the w ay the table generation macros work, the state table must be term inated with one extra STATE$ call to force completion of the last tra nsition entry. 2.3 TRAN$ Type,Label,Action,Mask,Address TRAN$ specifies a state transition. 2.3.1 Type - "Type" specifies the syntactical type that will cause this transition to be taken. "Type" may be one of the following: $A NY Matches any single character. $ALPHA Matches any single alphabetic character. $DIGIT Matches any single digit (0-9). $LAMDA Matches the empty string. This transition is al- ways successful. Lambda transitions are useful for getting action routines called without passing any of the input string. TPARS Users Manual PAGE 4 $NUMBR Matches a number. A number consists of a string of digits, followed optionally by a period . If the number is not followed by a period, it is taken as octal; if it is, it is taken as decimal, and the decimal point is included in the matching string. A number is terminated by any non-numeric chara cter. Values up through 2**32 - 1 will be converted co rrectly into a 32 bit unsigned in- teger. $ DNUMB Matches a decimal number. The string of digits is interpreted as decimal. The matched string does not inc lude trailing decimal points. Otherwise, it is treated t he same as $NUMBR. $STRNG Matches any alphanumeric string. The string will not be null. $RAD50 Matches a ny legal Radix-50 string, i.e., any non-null string containing alphanumerics and/or dot and dollar sign. Conv ersion to Rad-50 is up to the user's action routine. $BLANK Matches any non-null string of blanks and tabs. $EOS Matches end of the input string. Once TPARS has reached the end of the input string, $EOS will match as many times as it is encountered in the state table. Char Matches the single character whose ASCII code cor- responds to the value of the expression "Char". The value of the expression must be a 7 bit ASCII co de; i.e., must lie in the range 0 to 177 (octal). Constructions such as 'X are legal and in fact recommende d. "Keyword" Matches the specified keyword. Keywords may be any length, must contain only alphanumeric char- acters, and must be terminated by a non-alphanumeric character in the string being parsed . A state table may contain a maximum of 64 keywords. !Label Matches the sub-expression described by the sec- tion of the state table entered at state "Label" and terminated by a $EXIT. i.e., TPARS is caused to c all itself recursively. This construction is useful for ha ndling complex syntactical types that must appear in sever al places in the state table. It is also useful for recursi ve or indeterministic syntactical structures. Note, howev er, that each level of recursion requires six words of stack TPARS Users Manual PAGE 5 space. 2.3.2 Label - "Label" specifies the target state of the transition. If "Lab el" is null, the target is defaulted to the state imme- diately foll owing. A null label may be used only in the last transition in a state. If the label is $EXIT, the parse is terminated and TPAR S returns to the calling program with carry clear to indicate a succe ssful parse. 2.3.3 Action - "Action" is the a ddress of a caller supplied action routine which is called when th e transition is taken. If "Action" is null, no call is made. The fo llowing internal variables are available to action routines: .PSTCN Character count and .PSTPT address of the portion of the input string matched by this transition. These two variables are good for all types recognized by TPARS, including subexpressions. .PNUMH High ord er and .PNUMB low order binary value of the number found by $NUMBR or $DNUMB. .PCHAR The character found by $ANY, $ALPHA, $DIGIT, or "Char". .PFLAG The flag word supplied by the call to .TPARS in R1 (see calling sequence below). The action routine may modify t his word to change the options on the fly. R3 Byte count and R4 address of the remainder of the inpu t string. At the time the action routine is called, the string does not include the characters matched by the transition being taken. Action routines are cal led with JSR PC. They may modify R0, R1, and R2 but must preserve al l other registers. An action routine may reject the state transitio n from which it is being called by returning at call+4 instead of the normal call+2. (i.e., the action routine performs the equi- valent of ADD #2,(SP) before returning.) This mechanism TPARS Users Manual PAGE 6 permits an action routine to perform additional discrimina- tion o f syntactic types, and allows arbitrary extensibility of the set of " hard wired" syntactic types made available by TPARS. When an acti on routine rejects a transition, the overall effect is as if that tr ansition had never matched. TPARS will continue to attempt to matc h the remaining tran- sitions in the state. 2.3.4 Mask - "Mask" specifies a mask word that is to be stored when the transition is taken. If "Mask" is null, nothing is stored. 2.3.5 Address - "Address" specifies the location i nto which this mask word is to be stored. Functionally, TPARS executes a "BIS #MASK,ADDRESS" when the transition is tak en. This construct allows the user to avoid having to supply an a c- tion routine when all he wants to do is flag the fact that a particular transition has been taken. The "Mask" and "Ad- dr ess" arguments must be present or absent together. 2.4 Genera l Coding Considerations Note that the sets of strings that can be r epresented by the transitions in a state are not necessarily disjoint. Therefore the order in which types are presented in th e transition table of a state is critical. Transitions are always scanned in the order in which they are listed; hence the following ordering of mixed types is recommended: All Ch ar All "Keyword" $EOS $ALPHA $DIGIT $BLANK $NUMBR $DNUMB $STRNG $RAD50 $ANY $LAMDA The placement of !Label transitions in a state depends on t he basic types used within the subexpression. TPARS Users Manual PAGE 7 The handling of argume nts by the assembler's macro processor causes some problems when cha racter expressions are used in the "Char" syntactical type. While mo st characters can be matched with a straightforward expression, suc h as TRAN$ '= a comma must be encl osed in angle brackets to avoid being taken as an argument separat or TRAN$ <',> Angle brackets canno t be entered as character literals at all, but must be expressed s ymbolically, i.e. LA = '< TRAN$ LA 2.5 Blank Suppress TPARS may optionally be caused to flush blanks in the string being parsed. If so requested, all blanks and tabs will be ignored at the syntactic e lement level; i.e., blanks and tabs may appear between any syntac tic elements without being called for in the state table. The string being parsed is not modified. Note that under this option the $ BLANK type will never match. Note also that when a "!Label" transiti on is taken, the matching string specified by .PSTCN and .PSTPT may contain blanks or tabs, even though none were called for in the subexpression. The matching strings for the basic types will never contain extraneous blanks, however. Under the blank flush op tion, blanks and tabs function as invisible terminators. Consider the string ABC DEF This str ing is matched by the following state table segments with and without the blank flush option: Without With STATE$ TRAN$ $STRNG STATE$ STATE$ TRAN$ $STRNG TRAN$ $BLANK STATE$ STATE$ TRAN$ $STRNG TRAN$ $STRNG TPARS Users Manual PAGE 8 3.0 INTERNAL REPRESENTATION The state table is coded as follows. Each state consists of its transition entries concatenated in the order in which they are specified. The state label, if present, is equated to the address of the first transition in the state. Each transition entry consist s of 1 to 6 words as follows: Type - 1 byte. May assume the following values: $LAMDA = 300 $NUMBR = 302 $STRNG = 304 $BLANK = 306 $SUBXP = 310 U sed in the !Label type. $EOS = 312 $DNUMB = 314 $RAD50 = 316 $ANY = 320 $ALPHA = 322 $DIGIT = 324 Char = Ascii value of the character. "Keyword" = 200+n Where n is an index into the keyword table. The keyword table is an array of pointers to the keyword strings. Each string is stored terminated with a sin- gl e "377". Flags - 1 byte. Bits are used as follows: Bit 0 Type extension exists. Bit 1 Action exists. Bit 2 Mask word exists. Bit 3 Address word exists. Bit 4 Target exists. Bit 7 This is last transition in thi s state. The remaining five words are optional: Type extension - 1 word. Present only in the !Label type. Contains sta rting state of the subexpression. Action - 1 word. Present if acti on routine was specified. Contains address of the action routine. Mask - 1 word. Present if mask argument was supplied. This is the bit mask to set. Address - 1 word. Present if address a rgument was supplied. Contains address into which to set the mask. TPARS Users Manual PAGE 9 Target - 1 word. Present if explicit target state was spec- ified. Contains address of target state. The target of $EXIT is coded as a zero word. The assembly listing produced by the s tate table macros is not straightforward. The object code was des igned for com- pactness and convenience in processing, and the source macro language was designed for clarity and convenience in writing state tables. As a result, the mechanism that translates between the two (the macros) is of necessity somewhat ob- scure. The binary output of the macros is delayed by one statement. Thus, if the listing of macro generated binary code is en- ab led, the binary code that appears after a particular macro call is in fact the result of the preceding macro. Error messages generated by the macros are similarly delayed. This is the reason for the extra STATE$ call at the end of a state table. The macr os generate code in three Psects ($STATE, $KTAB, and $KSTR). At th e end of each macro expansion, the Psect is reset to the blank Psect , to prevent user code from being loaded into any of these Psects accidentally. As a result, the location counter displayed for each macro call is not meaningful, since it is simply the current pos ition in the blank Psect. TPARS Users Manual PAGE 10 4.0 CALLING SEQUENCE TPARS is called with the following arguments: R1 = O ptions word R2 = Pointer to the keyword table R3 = Length of string to be parsed R4 = Address of string to be parsed R5 = Label of starting state in state t able CALL .TPARS On return, R3 = Length and R4 = address of the unscanned portion of the string Other registe rs are preserved. The low byte of the options word contains flag bits. The only flag defined is bit 0; it controls the processing of blanks. If the bit is set, blanks are processed normally by transitions in the state table; if it is clear, blanks are i gnored as described above. The high byte of the options word is the match count on key- words. If it is zero, all keyword strings i n the string being parsed must match the keywords exactly as listed in the state table. If this byte is set to the value "n", then all keywords may be abbreviated down to a minimum of n char- acters. Keywords given with more than the minimum number of characte rs must still be spelled correctly. Keywords that are shorter th an n characters must still be given exactly. Prevention of ambiguitie s is the responsibility of the state table designer. The C bit is cleared to indicate a successful parse. This occurrs wh en a transition is made to $EXIT which is not within a subexpress ion. If a syntax error is encountered, the parse is terminated and TPARS returns with the C bit set. A syntax error occurrs when t here are no types listed in the current state which match the por tion of the input string currently being scanned. Illegal type co des, and other forms of bad data in the state table, will also caus e a syntax error return. Note that TPARS trusts addresses in the state and keyword tables, and that bad addresses may cau se program termination. Note that when the end of the input stri ng is reached, the only types that will match are $LAMDA and $EOS. TPARS Users Manual PAGE 11 5.0 SPACE REQUIREMENTS The routine .TPARS proper is 253 words in length. In addi- tion, it calls .OD2CT and .DD2CT in th e system library (an additional 68 words). 6.0 AVA ILABILITY TPARS and the table generation macros are available in the system library of RSX-11M, RSX-11D, and IAS. Copies of the routines may also be obtained from the author. 7.0 EXA MPLE 1 A sample state table and set of action routines appear on the following pages. Their function is to parse the command line to the UFD command in RSX-11M. The general form of the comm and line is as follows: UFD DK0:LABEL[201,202]/ALLOC=100./PRO=[RWED ,RWED,RWE,R] The associated action routines return the following va lues: $UDEV Device name (2 characters ascii) $UUNIT Unit number (binary) $UVNML Byte count and $UVNAM address of volume label string $UUIC Binary UIC for which to cre ate directory $UALL Number of directory entries to preallocate $UPRO Binary protection word for UFD $FLAGS Flags wor d containing the following bits: UF.ALL Set if allocation was spec ified. UF.PRO Set if protection was specified. Obse rve that the "LABEL" and the "/ALLOC" and "/PRO" switches are optional. TPARS Users Manual PA GE 12 .TITLE STATE TABLE FOR UFD COMMAND LINE .MCALL ISTAT$,STATE$,TRAN$ ; TO BE USED WITH BLANK SU PPRESS OPTION ISTAT$ UFDSTB,UFDKTB ; READ OVE R COMMAND NAME .GLOBL START STATE$ S TART TRAN$ "UFD" ; READ DEVICE AND UNIT NUMBER STATE$ TRAN$ $ALPHA,,SETDV1 STATE$ TRAN$ $ALPHA,,SETDV2 STATE$ TRAN$ $NUMBR,DEV1,SETUNT TRAN$ $LAMDA STATE$ DEV1 TRAN$ ': ; READ VOLUME LABEL STATE$ TRAN$ $STRNG,RUIC,SETLAB TRAN$ $LAMDA ; READ UIC STATE$ RUIC TRAN$ !UIC ; S CAN FOR OPTIONS AND END OF LINE STATE$ OPTS TRAN$ $EOS,$EXIT TRAN$ '/ STATE $ TRAN$ "ALLOC",ALC,,UF.ALL,$FLAGS TRAN$ "PRO",PRO,,UF.PRO,$FLAGS TPARS Users Manual PAGE 13 ; SET ALLOCATION ST ATE$ ALC TRAN$ '= STATE$ TRAN$ $NUMBR,OPTS,SETALC ; PROTECTION STATE$ PRO TRAN$ '= STATE$ TRAN$ '[,,IGROUP STATE$ SPRO TRAN$ '],OPTS,ENDGRP TRAN$ <',>,SPRO,NXGRP TRAN$ 'R,SPRO,SETRP TRAN$ 'W,SPRO,SETWP TRAN$ 'E,SPRO,SETEP TRAN$ 'D,SPRO,SETDP ; SUBEXPRESSION TO READ AND STORE UIC STATE$ UIC TRAN$ '[ STATE$ TRAN$ $NUMBR,,SETGN STATE$ TRAN$ <',> STATE$ TRAN$ $NUMBR,,SETPN STATE$ TRAN$ '],$EXIT STATE$ State Table Size: 60 words Keyword Table Size: 8 words Keyword Pointer Space: 3 words T PARS Users Manual PAGE 14 .S BTTL ACTION ROUTINES FOR THE COMMAND LINE PARSER ; DEVICE NAME CHAR 1 SETDV1::MOVB .PCHAR,$UDEV RETURN ; DEVICE NAME CHAR 2 SETDV2::MOVB .PCHAR,$UDEV+1 RETURN ; UNIT NUMBER SETUNT::MOV .PNUMB,$U UNIT RETURN ; VOLUME LABEL SETLAB:: MOV .PSTCN,$UVNML MOV .PSTPT,$UVNAM RETURN ; PPN - GROUP NUMBER SETGN:: MOVB .PNUMB,$ UUIC+1 BR TSTPPN ; PPN - PROGRAMMER NUMBER SETPN:: MOVB .PNUMB,$UUIC TSTPPN: TST .PNUMH ; CHECK FOR ZERO HIGH ORDER BNE 10$ TSTB .PNUMB+1 ; CHECK FOR BYTE VALUE BEQ 20$ 10$: ADD #2,(SP) ; BAD VALUE - REJECT TRANSITION 20 $: RETURN ; NUMBER OF ENTRIES TO ALLOCATE SETALC::M OV .PNUMB,$UALL RETURN TPARS Users Manual PAGE 15 ; SET PERMISSIONS ; INITIALIZE IGROUP::MOV #4,GRCNT ; MOVE TO NEXT PERMISSIONS CATEGORY NXGRP:: SEC ; FORCE ONE S ROR $UPRO ASR $UPRO ; SHI FT TO NEXT GROUP ASR $UPRO ASR $UP RO DEC GRCNT ; COUNT GROUPS BGE 30$ ; TOO MANY IS AN ERROR BADGRP: ADD #2,(SP) ; IF SO, REJECT TRANSITION 30$: RETURN ; SET READ PE RMIT SETRP:: BIC #FP.RDV*10000,$UPRO RETURN ; SET WRITE PERMIT SETWP:: BIC #FP.WRV*10000,$UPRO RETURN ; SET EXTEND PERMIT SETEP:: BIC #FP.EXT*10000,$UPRO RETURN ; SET DELETE PERMIT SETDP:: BIC #FP.DEL*10000,$UPRO RETU RN ; END OF PROTECTION SPEC ENDGRP::TST GRCNT ; CHECK THE GROUP COUNT BNE BADGRP ; MUST HAVE 4 RETURN .END UFD TP ARS Users Manual PAGE 16 8.0 EXAMPLE 2 As an illustration of the use of subexpressions and tr ansi- tion rejection, consider the following excerpt of a state table which parses a string quoted by an arbitrary character (i.e., the first character is taken to be the quote char- acter). Such a construct occurrs in many editors and some programming lan guages. The associated action routines leave the byte count and addr ess of the string in locations "QSTC" and "QSTP", respectively, and t he quoting character in loca- tion "QCHAR". ; MAIN L EVEL STATE TABLE ; ; PICK UP THE QUOTE CHARACTER STATE$ STRING TRAN$ $ANY,,SETQ ; ACCEPT THE QUOTED STRING STATE$ TRAN$ !QSTRG,,SETST ; GOBBLE UP THE TRAILING QUOTE CHARACTER STATE$ TRAN$ $ANY,NEXT,RESET ; SUBE XPRESSION TO SCAN THE QUOTED STRING ; THE FIRST TRANSITION WILL MATCH UNTIL IT IS REJECTED ; BY THE ACTION ROUTINE ST ATE$ QSTRG TRAN$ $ANY,QSTRG,TESTQ TRAN$ $LAMDA,$EXIT TPARS Users Manual PAGE 17 ; ACTION ROUTINES ; ; STORE TH E QUOTING CHARACTER SETQ: MOVB .PCHAR,QCHAR INCB .PFLAG ; TURN OFF SPACE FLUSH RETURN ; TEST FOR QUOTING CHARACTER IN THE STRING TESTQ: CMPB .PCH AR,QCHAR BNE 10$ ADD #2,(SP) ; REJECT TRANSITION ON MATCH 10$: RETURN ; STORE THE STRING DESCRIPTOR SETST: MOV .PSTPT,QSTP MO V .PSTCN,QSTC RETURN ; RESET THE SPACE FLUSH FLAG RESET: DECB .PFLAG RETURN TPARS Users Manual PAGE 18 9.0 EXAMPLE 3 The following state table excerpt illustrates how s ubexpres- sions may be used to parse indeterministic grammars. Th e state table excerpt accepts a number followed by a qualifier keyword; depending upon the keyword, the number is inter- pre ted either as octal or as decimal. The binary value of the number is left in the words tagged NUMBER. The follow- ing sort of strings will be accepted: 10/OCTAL 359/DECI MAL 77777/OCTAL This sy ntax is ambiguous for an LR(1) grammar; it is unam- biguous for an LR(2) grammar. ; MAIN STATE TABLE ENTRY - ACCEPT THE EXPRESSION A ND ; STORE ITS VALUE STATE$ T RAN$ !ONUMB,NEXT,SETNUM TRAN$ !DNUMB,NEXT,SETNUM ; SUBEXPRESSION TO ACCEPT OCTAL NUMBER STATE$ ONUMB TRAN$ $NUMBR STATE$ TRAN$ '/ STATE$ TRAN$ "OCTAL",$EXIT ; SUBEXPRESSION TO ACCEPT DECIMAL NUMBER STATE $ DNUMB TRAN$ $DNUMB STATE$ TRAN$ '/ STATE$ TRAN$ "DEC IMAL",$EXIT ; ACTION ROUTINE TO STORE THE NUMBER SET NUM: MOV .PNUMB,NUMBER MOV .PNUMH,NUMBER+2 TPARS Users Manual PAGE 19 RETURN Note that the contents of .PNUMB and .PNUMH ar e left undis- turbed by all state transitions except for the $NUM BR and $DNUMB types. Because of the way in which subexp ressions are processed, calls to action routines from within subex pressions must be handled with some care. When a subexpression is en countered in a transition, TPARS saves its current context and call s itself, using the label of the subexpression as the new starting state. If the subexpression parses successfully and returns with a $EXIT, all is well and that transition is taken. If the subexpression fails (encounters a syntax error), then TPARS restores the saved context to back out of the subexpression and trie s the next transition in the ori- ginal state. However, there is no facility for undoing ac- tions performed by the user's action r outines that were called from within the subexpression. Therefore , action routines called by transitions in a subexpression that ma y fail should store their results in some intermediate area, and let their results be made "permanent" by an action rou- tin e which is called from the main level state table.