TPARS Users Manual ** ** ************************ * * * TPARS Users Manual * * * * 28-May-1976 * * * * A. C. Goldstein * * * * 130-958-021-02 * * * ************************ TPARS Users Manual PAGE 2 1.0 INTRODUCTION Fair warning: The reader is assumed to have a cursory knowledge of table driven parsing and/or finite state automata. TPARS is a general purpose table driven parser. It is fundamentally capable of handling anything that is finite state parsable. Due to its structure and choice of basic lexical types, however, it works most efficiently for keyword grammars such as operating system command lines, etc. A string is parsed by interpreting the transitions in a state table supplied by the caller. The state table is a list of states. Each state consists of a list of transitions to other states. Each transition is composed of the following elements: Type The type of syntactic element which, when recognized in the input string, will cause this transition to be taken. Target The label of the state to transfer to. Action Address of a user supplied action routine to call when taking this transition. Mask A bit mask to set when taking this transition. Address The address into which to set the mask when taking this transition. The action, mask, and address components of a transition are optional. The target component of a transition may be optional under certain circumstances explained later. When a transition is taken, the section of the input string that matches the specified type is passed over, and the characters following it become available for matching by transitions in the next state. The structure of the state table specifies the syntax the parser will accept; the action routines specify the semantics. TPARS Users Manual PAGE 3 2.0 STATE TABLE SOURCE CODING A set of macros is provided to generate the state table. The state table proper is built in a Psect named $STATE. All keyword strings are accumulated in a separate keyword table in Psect $KSTR. A keyword pointer table is built in Psect $KTAB. If the user defines the symbol $RONLY, the $STATE, $KSTR, and $KTAB Psects will be generated as read only. The state table is coded at the source level with the following macro calls: 2.1 ISTAT$ Statetable,Keytable ISTAT$ initializes table generation. It equates the label "Statetable" to the start of the state table, and the label "Keytable" to the start of the keyword pointer table (explained below). 2.2 STATE$ Label STATE$ declares the beginning of a state. The label is optional. If present, it is equated to the address of the state. Because of the way the table generation macros work, the state table must be terminated with one extra STATE$ call to force completion of the last transition entry. 2.3 TRAN$ Type,Label,Action,Mask,Address TRAN$ specifies a state transition. 2.3.1 Type - "Type" specifies the syntactical type that will cause this transition to be taken. "Type" may be one of the following: $ANY Matches any single character. $ALPHA Matches any single alphabetic character. $DIGIT Matches any single digit (0-9). $LAMDA Matches the empty string. This transition is always successful. Lambda transitions are useful for getting action routines called without passing any of the input string. TPARS Users Manual PAGE 4 $NUMBR Matches a number. A number consists of a string of digits, followed optionally by a period. If the number is not followed by a period, it is taken as octal; if it is, it is taken as decimal, and the decimal point is included in the matching string. A number is terminated by any non-numeric character. Values up through 2**32 - 1 will be converted correctly into a 32 bit unsigned integer. $DNUMB Matches a decimal number. The string of digits is interpreted as decimal. The matched string does not include trailing decimal points. Otherwise, it is treated the same as $NUMBR. $STRNG Matches any alphanumeric string. The string will not be null. $RAD50 Matches any legal Radix-50 string, i.e., any non-null string containing alphanumerics and/or dot and dollar sign. Conversion to Rad-50 is up to the user's action routine. $BLANK Matches any non-null string of blanks and tabs. $EOS Matches end of the input string. Once TPARS has reached the end of the input string, $EOS will match as many times as it is encountered in the state table. Char Matches the single character whose ASCII code corresponds to the value of the expression "Char". The value of the expression must be a 7 bit ASCII code; i.e., must lie in the range 0 to 177 (octal). Constructions such as 'X are legal and in fact recommended. "Keyword" Matches the specified keyword. Keywords may be any length, must contain only alphanumeric characters, and must be terminated by a non-alphanumeric character in the string being parsed. A state table may contain a maximum of 64 keywords. !Label Matches the sub-expression described by the section of the state table entered at state "Label" and terminated by a $EXIT. i.e., TPARS is caused to call itself recursively. This construction is useful for handling complex syntactical types that must appear in several places in the state table. It is also useful for recursive or indeterministic syntactical structures. Note, however, that each level of TPARS Users Manual PAGE 5 recursion requires six words of stack space. 2.3.2 Label - "Label" specifies the target state of the transition. If "Label" is null, the target is defaulted to the state immediately following. A null label may be used only in the last transition in a state. If the label is $EXIT, the parse is terminated and TPARS returns to the calling program with carry clear to indicate a successful parse. 2.3.3 Action - "Action" is the address of a caller supplied action routine which is called when the transition is taken. If "Action" is null, no call is made. The following internal variables are available to action routines: .PSTCN Character count and .PSTPT address of the portion of the input string matched by this transition. These two variables are good for all types recognized by TPARS, including subexpressions. .PNUMH High order and .PNUMB low order binary value of the number found by $NUMBR or $DNUMB. .PCHAR The character found by $ANY, $ALPHA, $DIGIT, or "Char". .PFLAG The flag word supplied by the call to .TPARS in R1 (see calling sequence below). The action routine may modify this word to change the options on the fly. R3 Byte count and R4 address of the remainder of the input string. At the time the action routine is called, the string does not include the characters matched by the transition being taken. Action routines are called with JSR PC. They may modify R0, R1, and R2 but must preserve all other registers. An action routine may reject the state transition from which it is being called by returning at call+4 instead of the normal call+2. (i.e., the action routine performs the equivalent of ADD #2,(SP) before returning.) This mechanism TPARS Users Manual PAGE 6 permits an action routine to perform additional discrimination of syntactic types, and allows arbitrary extensibility of the set of "hard wired" syntactic types made available by TPARS. When an action routine rejects a transition, the overall effect is as if that transition had never matched. TPARS will continue to attempt to match the remaining transitions in the state. 2.3.4 Mask - "Mask" specifies a mask word that is to be stored when the transition is taken. If "Mask" is null, nothing is stored. 2.3.5 Address - "Address" specifies the location into which this mask word is to be stored. Functionally, TPARS executes a "BIS #MASK,ADDRESS" when the transition is taken. This construct allows the user to avoid having to supply an action routine when all he wants to do is flag the fact that a particular transition has been taken. The "Mask" and "Address" arguments must be present or absent together. 2.4 General Coding Considerations Note that the sets of strings that can be represented by the transitions in a state are not necessarily disjoint. Therefore the order in which types are presented in the transition table of a state is critical. Transitions are always scanned in the order in which they are listed; hence the following ordering of mixed types is recommended: All Char All "Keyword" $EOS $ALPHA $DIGIT $BLANK $NUMBR $DNUMB $STRNG $RAD50 $ANY $LAMDA The placement of !Label transitions in a state depends on the basic types used within the subexpression. TPARS Users Manual PAGE 7 The handling of arguments by the assembler's macro processor causes some problems when character expressions are used in the "Char" syntactical type. While most characters can be matched with a straightforward expression, such as TRAN$ '= a comma must be enclosed in angle brackets to avoid being taken as an argument separator TRAN$ <',> Angle brackets cannot be entered as character literals at all, but must be expressed symbolically, i.e. LA = '< TRAN$ LA 2.5 Blank Suppress TPARS may optionally be caused to flush blanks in the string being parsed. If so requested, all blanks and tabs will be ignored at the syntactic element level; i.e., blanks and tabs may appear between any syntactic elements without being called for in the state table. The string being parsed is not modified. Note that under this option the $BLANK type will never match. Note also that when a "!Label" transition is taken, the matching string specified by .PSTCN and .PSTPT may contain blanks or tabs, even though none were called for in the subexpression. The matching strings for the basic types will never contain extraneous blanks, however. Under the blank flush option, blanks and tabs function as invisible terminators. Consider the string ABC DEF This string is matched by the following state table segments with and without the blank flush option: Without With STATE$ TRAN$ $STRNG STATE$ STATE$ TRAN$ $STRNG TRAN$ $BLANK STATE$ STATE$ TRAN$ $STRNG TRAN$ $STRNG TPARS Users Manual PAGE 8 3.0 INTERNAL REPRESENTATION The state table is coded as follows. Each state consists of its transition entries concatenated in the order in which they are specified. The state label, if present, is equated to the address of the first transition in the state. Each transition entry consists of 1 to 6 words as follows: Type - 1 byte. May assume the following values: $LAMDA = 300 $NUMBR = 302 $STRNG = 304 $BLANK = 306 $SUBXP = 310 Used in the !Label type. $EOS = 312 $DNUMB = 314 $RAD50 = 316 $ANY = 320 $ALPHA = 322 $DIGIT = 324 Char = Ascii value of the character. "Keyword" = 200+n Where n is an index into the keyword table. The keyword table is an array of pointers to the keyword strings. Each string is stored terminated with a single "377". Flags - 1 byte. Bits are used as follows: Bit 0 Type extension exists. Bit 1 Action exists. Bit 2 Mask word exists. Bit 3 Address word exists. Bit 4 Target exists. Bit 7 This is last transition in this state. The remaining five words are optional: Type extension - 1 word. Present only in the !Label type. Contains starting state of the subexpression. Action - 1 word. Present if action routine was specified. Contains address of the action routine. Mask - 1 word. Present if mask argument was supplied. This is the bit mask to set. Address - 1 word. Present if address argument was supplied. Contains address into which to set the mask. TPARS Users Manual PAGE 9 Target - 1 word. Present if explicit target state was specified. Contains address of target state. The target of $EXIT is coded as a zero word. The assembly listing produced by the state table macros is not straightforward. The object code was designed for compactness and convenience in processing, and the source macro language was designed for clarity and convenience in writing state tables. As a result, the mechanism that translates between the two (the macros) is of necessity somewhat obscure. The binary output of the macros is delayed by one statement. Thus, if the listing of macro generated binary code is enabled, the binary code that appears after a particular macro call is in fact the result of the preceding macro. Error messages generated by the macros are similarly delayed. This is the reason for the extra STATE$ call at the end of a state table. The macros generate code in three Psects ($STATE, $KTAB, and $KSTR). At the end of each macro expansion, the Psect is reset to the blank Psect, to prevent user code from being loaded into any of these Psects accidentally. As a result, the location counter displayed for each macro call is not meaningful, since it is simply the current position in the blank Psect. TPARS Users Manual PAGE 10 4.0 CALLING SEQUENCE TPARS is called with the following arguments: R1 = Options word R2 = Pointer to the keyword table R3 = Length of string to be parsed R4 = Address of string to be parsed R5 = Label of starting state in state table CALL .TPARS On return, R3 = Length and R4 = address of the unscanned portion of the string Other registers are preserved. The low byte of the options word contains flag bits. The only flag defined is bit 0; it controls the processing of blanks. If the bit is set, blanks are processed normally by transitions in the state table; if it is clear, blanks are ignored as described above. The high byte of the options word is the match count on keywords. If it is zero, all keyword strings in the string being parsed must match the keywords exactly as listed in the state table. If this byte is set to the value "n", then all keywords may be abbreviated down to a minimum of n characters. Keywords given with more than the minimum number of characters must still be spelled correctly. Keywords that are shorter than n characters must still be given exactly. Prevention of ambiguities is the responsibility of the state table designer. The C bit is cleared to indicate a successful parse. This occurrs when a transition is made to $EXIT which is not within a subexpression. If a syntax error is encountered, the parse is terminated and TPARS returns with the C bit set. A syntax error occurrs when there are no types listed in the current state which match the portion of the input string currently being scanned. Illegal type codes, and other forms of bad data in the state table, will also cause a syntax error return. Note that TPARS trusts addresses in the state and keyword tables, and that bad addresses may cause program termination. Note that when the end of the input string is reached, the only types that will match are $LAMDA and $EOS. TPARS Users Manual PAGE 11 5.0 SPACE REQUIREMENTS The routine .TPARS proper is 253 words in length. In addition, it calls .OD2CT and .DD2CT in the system library (an additional 68 words). 6.0 AVAILABILITY TPARS and the table generation macros are available in the system library of RSX-11M, RSX-11D, and IAS. Copies of the routines may also be obtained from the author. 7.0 EXAMPLE 1 A sample state table and set of action routines appear on the following pages. Their function is to parse the command line to the UFD command in RSX-11M. The general form of the command line is as follows: UFD DK0:LABEL[201,202]/ALLOC=100./PRO=[RWED,RWED,RWE,R] The associated action routines return the following values: $UDEV Device name (2 characters ascii) $UUNIT Unit number (binary) $UVNML Byte count and $UVNAM address of volume label string $UUIC Binary UIC for which to create directory $UALL Number of directory entries to preallocate $UPRO Binary protection word for UFD $FLAGS Flags word containing the following bits: UF.ALL Set if allocation was specified. UF.PRO Set if protection was specified. Observe that the "LABEL" and the "/ALLOC" and "/PRO" switches are optional. TPARS Users Manual PAGE 12 .TITLE STATE TABLE FOR UFD COMMAND LINE .MCALL ISTAT$,STATE$,TRAN$ ; TO BE USED WITH BLANK SUPPRESS OPTION ISTAT$ UFDSTB,UFDKTB ; READ OVER COMMAND NAME .GLOBL START STATE$ START TRAN$ "UFD" ; READ DEVICE AND UNIT NUMBER STATE$ TRAN$ $ALPHA,,SETDV1 STATE$ TRAN$ $ALPHA,,SETDV2 STATE$ TRAN$ $NUMBR,DEV1,SETUNT TRAN$ $LAMDA STATE$ DEV1 TRAN$ ': ; READ VOLUME LABEL STATE$ TRAN$ $STRNG,RUIC,SETLAB TRAN$ $LAMDA ; READ UIC STATE$ RUIC TRAN$ !UIC ; SCAN FOR OPTIONS AND END OF LINE STATE$ OPTS TRAN$ $EOS,$EXIT TRAN$ '/ STATE$ TRAN$ "ALLOC",ALC,,UF.ALL,$FLAGS TRAN$ "PRO",PRO,,UF.PRO,$FLAGS TPARS Users Manual PAGE 13 ; SET ALLOCATION STATE$ ALC TRAN$ '= STATE$ TRAN$ $NUMBR,OPTS,SETALC ; PROTECTION STATE$ PRO TRAN$ '= STATE$ TRAN$ '[,,IGROUP STATE$ SPRO TRAN$ '],OPTS,ENDGRP TRAN$ <',>,SPRO,NXGRP TRAN$ 'R,SPRO,SETRP TRAN$ 'W,SPRO,SETWP TRAN$ 'E,SPRO,SETEP TRAN$ 'D,SPRO,SETDP ; SUBEXPRESSION TO READ AND STORE UIC STATE$ UIC TRAN$ '[ STATE$ TRAN$ $NUMBR,,SETGN STATE$ TRAN$ <',> STATE$ TRAN$ $NUMBR,,SETPN STATE$ TRAN$ '],$EXIT STATE$ State Table Size: 60 words Keyword Table Size: 8 words Keyword Pointer Space: 3 words TPARS Users Manual PAGE 14 .SBTTL ACTION ROUTINES FOR THE COMMAND LINE PARSER ; DEVICE NAME CHAR 1 SETDV1::MOVB .PCHAR,$UDEV RETURN ; DEVICE NAME CHAR 2 SETDV2::MOVB .PCHAR,$UDEV+1 RETURN ; UNIT NUMBER SETUNT::MOV .PNUMB,$UUNIT RETURN ; VOLUME LABEL SETLAB::MOV .PSTCN,$UVNML MOV .PSTPT,$UVNAM RETURN ; PPN - GROUP NUMBER SETGN:: MOVB .PNUMB,$UUIC+1 BR TSTPPN ; PPN - PROGRAMMER NUMBER SETPN:: MOVB .PNUMB,$UUIC TSTPPN: TST .PNUMH ; CHECK FOR ZERO HIGH ORDER BNE 10$ TSTB .PNUMB+1 ; CHECK FOR BYTE VALUE BEQ 20$ 10$: ADD #2,(SP) ; BAD VALUE - REJECT TRANSITION 20$: RETURN ; NUMBER OF ENTRIES TO ALLOCATE SETALC::MOV .PNUMB,$UALL RETURN TPARS Users Manual PAGE 15 ; SET PERMISSIONS ; INITIALIZE IGROUP::MOV #4,GRCNT ; MOVE TO NEXT PERMISSIONS CATEGORY NXGRP:: SEC ; FORCE ONES ROR $UPRO ASR $UPRO ; SHIFT TO NEXT GROUP ASR $UPRO ASR $UPRO DEC GRCNT ; COUNT GROUPS BGE 30$ ; TOO MANY IS AN ERROR BADGRP: ADD #2,(SP) ; IF SO, REJECT TRANSITION 30$: RETURN ; SET READ PERMIT SETRP:: BIC #FP.RDV*10000,$UPRO RETURN ; SET WRITE PERMIT SETWP:: BIC #FP.WRV*10000,$UPRO RETURN ; SET EXTEND PERMIT SETEP:: BIC #FP.EXT*10000,$UPRO RETURN ; SET DELETE PERMIT SETDP:: BIC #FP.DEL*10000,$UPRO RETURN ; END OF PROTECTION SPEC ENDGRP::TST GRCNT ; CHECK THE GROUP COUNT BNE BADGRP ; MUST HAVE 4 RETURN .END UFD TPARS Users Manual PAGE 16 8.0 EXAMPLE 2 As an illustration of the use of subexpressions and transition rejection, consider the following excerpt of a state table which parses a string quoted by an arbitrary character (i.e., the first character is taken to be the quote character). Such a construct occurrs in many editors and some programming languages. The associated action routines leave the byte count and address of the string in locations "QSTC" and "QSTP", respectively, and the quoting character in location "QCHAR". ; MAIN LEVEL STATE TABLE ; ; PICK UP THE QUOTE CHARACTER STATE$ STRING TRAN$ $ANY,,SETQ ; ACCEPT THE QUOTED STRING STATE$ TRAN$ !QSTRG,,SETST ; GOBBLE UP THE TRAILING QUOTE CHARACTER STATE$ TRAN$ $ANY,NEXT,RESET ; SUBEXPRESSION TO SCAN THE QUOTED STRING ; THE FIRST TRANSITION WILL MATCH UNTIL IT IS REJECTED ; BY THE ACTION ROUTINE STATE$ QSTRG TRAN$ $ANY,QSTRG,TESTQ TRAN$ $LAMDA,$EXIT TPARS Users Manual PAGE 17 ; ACTION ROUTINES ; ; STORE THE QUOTING CHARACTER SETQ: MOVB .PCHAR,QCHAR INCB .PFLAG ; TURN OFF SPACE FLUSH RETURN ; TEST FOR QUOTING CHARACTER IN THE STRING TESTQ: CMPB .PCHAR,QCHAR BNE 10$ ADD #2,(SP) ; REJECT TRANSITION ON MATCH 10$: RETURN ; STORE THE STRING DESCRIPTOR SETST: MOV .PSTPT,QSTP MOV .PSTCN,QSTC RETURN ; RESET THE SPACE FLUSH FLAG RESET: DECB .PFLAG RETURN TPARS Users Manual PAGE 18 9.0 EXAMPLE 3 The following state table excerpt illustrates how subexpressions may be used to parse indeterministic grammars. The state table excerpt accepts a number followed by a qualifier keyword; depending upon the keyword, the number is interpreted either as octal or as decimal. The binary value of the number is left in the words tagged NUMBER. The following sort of strings will be accepted: 10/OCTAL 359/DECIMAL 77777/OCTAL This syntax is ambiguous for an LR(1) grammar; it is unambiguous for an LR(2) grammar. ; MAIN STATE TABLE ENTRY - ACCEPT THE EXPRESSION AND ; STORE ITS VALUE STATE$ TRAN$ !ONUMB,NEXT,SETNUM TRAN$ !DNUMB,NEXT,SETNUM ; SUBEXPRESSION TO ACCEPT OCTAL NUMBER STATE$ ONUMB TRAN$ $NUMBR STATE$ TRAN$ '/ STATE$ TRAN$ "OCTAL",$EXIT ; SUBEXPRESSION TO ACCEPT DECIMAL NUMBER STATE$ DNUMB TRAN$ $DNUMB STATE$ TRAN$ '/ STATE$ TRAN$ "DECIMAL",$EXIT ; ACTION ROUTINE TO STORE THE NUMBER SETNUM: MOV .PNUMB,NUMBER MOV .PNUMH,NUMBER+2 TPARS Users Manual PAGE 19 RETURN Note that the contents of .PNUMB and .PNUMH are left undisturbed by all state transitions except for the $NUMBR and $DNUMB types. Because of the way in which subexpressions are processed, calls to action routines from within subexpressions must be handled with some care. When a subexpression is encountered in a transition, TPARS saves its current context and calls itself, using the label of the subexpression as the new starting state. If the subexpression parses successfully and returns with a $EXIT, all is well and that transition is taken. If the subexpression fails (encounters a syntax error), then TPARS restores the saved context to back out of the subexpression and tries the next transition in the original state. However, there is no facility for undoing actions performed by the user's action routines that were called from within the subexpression. Therefore, action routines called by transitions in a subexpression that may fail should store their results in some intermediate area, and let their results be made "permanent" by an action routine which is called from the main level state table.