..... use tbl and troff \-ms
.if \nP=0 .IM
.TL
Updating Publication Lists
.AU
M. E. Lesk
.NH
Introduction.
.PP
.\".if \nP>0 .pn 14
This note describes several commands to update the
publication lists.
The data base consisting of these lists is kept in
a set of files in
the directory
.I /usr/dict/papers
on the Version 7
.UX
system.
The reason for having special commands to update these files is
that they are indexed, and the only reasonable way to find the
items to be updated is to use the index.
However, altering the files
destroys the usefulness of the index,
and makes further editing difficult.
So the recommended procedure is to
.IP (1)
Prepare additions, deletions, and changes in separate files.
.IP (2)
Update the data base and reindex.
.LP
Whenever you make changes, additions, or deletions, it is necessary to run
the ``add & index'' step before logging off; otherwise the
changes do not take effect.
The next section shows the format of the files
in the data base.
After that, the procedures for
preparing additions, preparing changes, preparing deletions,
and updating the public data base are given.
.NH
Publication Format.
.PP
The format of a data base entry is given completely in ``Some Applications
of Inverted Indexes on UNIX'' by M. E. Lesk,
the first part of this report,
.if \nP=0 (also TM 77-1274-17)
and is summarized here via a few examples.
In each example, first the output format for an item is shown,
and then the corresponding data base entry.
.LP
.DS
.ti 0
Journal article:
.fi
.ll 5i
A. V. Aho, D. S. Hirschberg, and J. D. Ullman, ``Bounds
on the Complexity of the Maximal Common Subsequence Problem,''
.I
J. Assoc. Comp. Mach.,
.R
vol. 23, no. 1, pp. 1-12 (Jan. 1976).
.nf
.ll
.sp
%T Bounds on the Complexity of the Maximal Common
Subsequence Problem
%A A. V. Aho
%A D. S. Hirschberg
%A J. D. Ullman
%J J. Assoc. Comp. Mach.
%V 23
%N 1
%P 1-12
%D Jan. 1976
.if \nP=0 %M TM 75-1271-7
.if \nP>0 %M Memo abcd...
.DE
.DS
.ti 0
Conference proceedings:
.fi
.ll 5i
B. Prabhala and R. Sethi, ``Efficient Computation of Expressions with Common
Subexpressions,''
.I
Proc. 5th ACM Symp. on Principles of Programming Languages,
.R
pp. 222-230, Tucson, Ariz. (January 1978).
.nf
.ll
.sp
%A B. Prabhala
%A R. Sethi
%T Efficient Computation of Expressions with
Common Subexpressions
%J Proc. 5th ACM Symp. on Principles
of Programming Languages
%C Tucson, Ariz.
%D January 1978
%P 222-230
.DE
.DS
.ti 0
Book:
.fi
.ll 5i
B. W. Kernighan and P. J. Plauger,
.I
Software Tools,
.R
Addison-Wesley, Reading, Mass. (1976).
.nf
.ll
.sp
%T Software Tools
%A B. W. Kernighan
%A P. J. Plauger
%I Addison-Wesley
%C Reading, Mass.
%D 1976
.DE
.DS
.ti 0
Article within book:
.fi
.ll 5i
J. W. de Bakker, ``Semantics of Programming Languages,''
pp. 173-227 in
.I
Advances in Information Systems Science, Vol. 2,
.R
ed. J. T. Tou, Plenum Press, New York, N. Y. (1969).
.nf
.ll
.sp
%A J. W. de Bakker
%T Semantics of programming languages
%E J. T. Tou
%B Advances in Information Systems Science, Vol. 2
%I Plenum Press
%C New York, N. Y.
%D 1969
%P 173-227
.DE
.DS
.ti 0
Technical Report:
.fi
.ll 5i
F. E. Allen, ``Bibliography on Program Optimization,''
Report RC-5767, IBM T. J. Watson Research Center,
Yorktown Heights, N. Y. (1975).
.nf
.ll
.sp
%A F. E. Allen
%D 1975
%T Bibliography on Program Optimization
%R Report RC-5767
%I IBM T. J. Watson Research Center
%C Yorktown Heights, N. Y.
.DE
.DS
.di xx
.ti 0
Technical Memorandum:
.fi
.ll 5i
A. V. Aho, B. W. Kernighan and P. J. Weinberger,
``AWK \- Pattern Scanning and Processing Language'',
TM 77-1271-5, TM 77-1273-12, TM 77-3444-1 (1977).
.nf
.ll
.sp
%T AWK \- Pattern Scanning and Processing Language
%A A. V. Aho
%A B. W. Kernighan
%A P. J. Weinberger
%M TM 77-1271-5, TM 77-1273-12, TM 77-3444-1
%D 1977
.di
.if \nP=0 .xx
.rm xx
.DE
.LP
Other forms of publication can be entered similarly.
Note that conference
proceedings are entered as if journals,
with the conference name on a
.I %J
line.
This is also sometimes appropriate for obscure publications
such as series of lecture notes.
When something is both a report and an article, or
both a memorandum and an article, enter all necessary information
for both; see the first article above, for example.
Extra information (such as ``In preparation'' or ``Japanese translation'')
should be placed on a line beginning
.I %O .
The most common use of %O lines now is for ``Also in ...'' to give
an additional reference to a secondary appearance of the same paper.
.PP
Some of the possible fields of a citation are:
.TS
c c 5 c c
a l   a l .
Letter	Meaning	Letter	Meaning
A	Author	K	Extra keys
B	Book including item	N	Issue number
C	City of publication	O	Other
D	Date	P	Page numbers
E	Editor of book	R	Report number
I	Publisher (issuer)	T	Title of item
J	Journal name	V	Volume number
.TE
Note that
.I %B
is used to indicate the title
of a book containing the article being entered;
when an item is an entire book, the title should
be entered with a
.I %T
as usual.
.PP
Normally, the order of items does not matter.  The only exception is
that if there are multiple authors (%A lines) the order of authors
should be that on the paper.
If a line is too long, it may be continued on to the next line;
any line not beginning with % or . (dot) is assumed to be
a continuation of the previous line.
Again, see the first article above for an example of a long title.
Except for authors, do not repeat any items; if two %J lines are
given, for example, the first is ignored.
Multiple items on the same file should be separated by blank lines.
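The parsing rules just described (fields introduced by %, continuation lines joined to the previous field, repeated %A lines kept in order, other duplicate fields ignored after the first) can be sketched in a few lines of Python. This is an illustrative sketch, not the actual refer software; the function name and the dictionary layout are inventions for this example.

```python
def parse_entry(text):
    """Parse one refer-style entry into a dict of field letter -> value.

    A line not beginning with % or . continues the previous field;
    %A may repeat (authors, in order); for any other repeated field
    the first occurrence wins, as described in the text above.
    """
    fields = {}
    authors = []
    current = None          # letter of the field being collected
    for line in text.strip().splitlines():
        if line.startswith('%'):
            current = line[1]
            value = line[2:].strip()
            if current == 'A':
                authors.append(value)
            elif current in fields:
                current = None          # duplicate non-author field: ignored
            else:
                fields[current] = value
        elif line.startswith('.') or current is None:
            continue                    # troff request, or ignored field
        elif current == 'A':
            authors[-1] += ' ' + line.strip()
        else:
            fields[current] += ' ' + line.strip()
    if authors:
        fields['A'] = authors
    return fields

# The first example entry above, with its long title continued on a
# second line.
entry = """%T Bounds on the Complexity of the Maximal Common
Subsequence Problem
%A A. V. Aho
%A D. S. Hirschberg
%A J. D. Ullman
%J J. Assoc. Comp. Mach.
%V 23
%N 1
%P 1-12
%D Jan. 1976"""

fields = parse_entry(entry)
```

Note how the continuation line is rejoined to the %T field, and the three %A lines come back as an ordered list.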
.PP
Note that in formatted printouts of the file, the
exact appearance of the items is determined by
a set of macros and the formatting programs.
Do not try to adjust fonts, punctuation, etc. by editing
the data base; it is wasted effort.  In case someone has
a real need for a differently-formatted output, a new set
of macros can easily be generated to provide alternative
appearances of the citations.
.NH
Updating and Re-indexing.
.PP
This section describes the commands that are used to manipulate
and change the data base.
It explains the procedures for (a) finding references in the data base,
(b) adding new references, (c) changing existing references, and (d)
deleting references.
Remember that all changes, additions, and deletions are done by preparing
separate files and then running an `update and reindex' step.
.PP
.I
Checking what's there now.
.R
Often you will want to know what is currently in the data base.
There is a special command
.I lookbib
to look for things and print them
out.
It searches for articles based on words in the title, or the author's name,
or the date.
For example, you could find the first paper above with
.DS
lookbib aho ullman maximal subsequence 1976
.DE
or
.DS
lookbib aho ullman hirschberg
.DE
.LP
If you don't give enough words, several items will be found;
if you spell some wrong, nothing will be found.
There are around 4300 papers in the public file; you should
always use this command to check when you are not sure
whether a certain paper is there or not.
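The behavior described here, where too few keys find several items and a misspelled key finds nothing, can be approximated by requiring every key to occur somewhere in an entry. The sketch below mimics this with case-insensitive substring matching over blank-line-separated entries; the real lookbib uses an inverted index and is not implemented this way.

```python
def find_entries(database_text, keys):
    """Return every entry in which all keys occur (case-insensitive).

    A simplification of lookbib-style search: plain substring matching
    over whole entries, rather than the indexed word lookup the real
    command performs.
    """
    entries = [e for e in database_text.split('\n\n') if e.strip()]
    keys = [k.lower() for k in keys]
    return [e for e in entries
            if all(k in e.lower() for k in keys)]

# A two-entry toy database in the format shown earlier.
database = """%T Bounds on the Complexity of the Maximal Common Subsequence Problem
%A A. V. Aho
%A D. S. Hirschberg
%A J. D. Ullman
%D Jan. 1976

%T Software Tools
%A B. W. Kernighan
%A P. J. Plauger
%D 1976"""

hits = find_entries(database, ['aho', 'ullman', '1976'])
```

With all three keys only the Aho-Hirschberg-Ullman paper matches; with the single key ``1976'' both entries are found, and a misspelled key finds none.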
.PP
.I
Additions.
.R
To add new papers, just type in, on one or more files, the citations
for the new papers.
Remember to check first if the papers are already in the data base.
For example, if a paper has a previous memo version, this should
be treated as a change to an existing entry, rather than
a new entry.
If several new papers are being typed on the same file, be
sure that there is a blank line between each pair of papers.
.PP
.I
Changes.
.R
To change an item, it should be extracted onto a file.
This is done with the command
.DS
pub.chg key1 key2 key3 ...
.DE
where the items key1, key2, key3, etc. are
a set of keys that will find the paper,
as in the
.I lookbib
command.
That is, if
.DS
lookbib johnson yacc cstr
.DE
will find an item (in this case, Computing Science Technical Report
No. 32, ``YACC: Yet Another Compiler-Compiler,''
by S. C. Johnson)
then
.DS
pub.chg johnson yacc cstr
.DE
will permit you to edit the item.
The
.I pub.chg
command
extracts the item onto a file named ``bibxxx'' where ``xxx''
is a 3-digit number, e.g. ``bib234''.
The command will print the file name it has chosen.
If the set of keys finds more than one paper (or no papers) an
error message is printed and no file is written.
Each reference to be changed must be extracted with a separate
.I pub.chg
command, and each will be placed on a separate file.
You should then edit the ``bibxxx'' file as desired to change the item,
using the UNIX editor.
Do not delete or change the first line of the file, however, which begins
.I %#
and is a special code line to tell the update program
which item is being altered.
You may delete or change other lines, or add lines, as you wish.
The changes are not actually made in the public data
base until you run the update command
.I pub.run
(see below).
Thus, if after extracting an item and modifying it, you decide
that you'd rather leave things as they were, delete the
``bibxxx'' file, and your change request will disappear.
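The extract-edit-update cycle can be sketched as below. This is only an illustration of the workflow, not the real pub.chg program: the content of the %# code line (here, the entry's position in the file) is a guess made for the example, since this note does not document the real code line's format.

```python
import os

def extract_for_change(database_text, keys, dirname='.'):
    """Sketch of pub.chg-like extraction: find the single entry matching
    all keys and copy it to a fresh ``bibxxx'' file, prefixed with a %#
    code line identifying which item is being altered.

    Hypothetical details: the %# payload and the numbering scheme are
    inventions for this sketch, not the real program's behavior.
    """
    entries = [e for e in database_text.split('\n\n') if e.strip()]
    keys = [k.lower() for k in keys]
    hits = [(i, e) for i, e in enumerate(entries)
            if all(k in e.lower() for k in keys)]
    if len(hits) != 1:
        # More than one paper (or none): print an error, write no file.
        raise SystemExit('%d entries match; no file written' % len(hits))
    index, entry = hits[0]
    n = 100
    while os.path.exists(os.path.join(dirname, 'bib%03d' % n)):
        n += 1                      # pick an unused 3-digit file name
    path = os.path.join(dirname, 'bib%03d' % n)
    with open(path, 'w') as f:
        f.write('%%# entry %d\n%s\n' % (index, entry))
    return path
```

Deleting the written file before the update step abandons the change, matching the behavior described above.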
.PP
.I
Deletions.
.R
To delete an entry from the data base,
type the command
.DS
pub.del key1 key2 key3 ...
.DE
where the items key1, key2, etc. are a set
of keys that will find the paper, as with the
.I lookbib
command.
That is, if
.DS
lookbib aho hirschberg ullman
.DE
will find a paper,
.DS
pub.del aho hirschberg ullman
.DE
deletes it.
Upper and lower case are equivalent in keys;
the command
.DS
pub.del Aho Hirschberg Ullman
.DE
is an equivalent
.I pub.del
command.
The
.I pub.del
command will print the entry being deleted.
It also gives the name of a ``bibxxx'' file on which the deletion
command is stored.
The actual deletion is not done until the changes, additions, etc.
are processed, as with the
.I pub.chg
command.
.hc ~
.bd I 2
.de TS
.br
.nf
.SP 1v
.ul 0
..
.de TE
.SP 1v
.fi
..
.de PT
.if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
.if \\n%>1 'sp
..
.ND July 21, 1975
.RP
.TM 75-1274-15 39199 39199-11
.TL
Lex \- A Lexical Analyzer Generator
.AU ``MH 2C-569'' 6377
M. E. Lesk and E. Schmidt
.AI
.MH
.AB
.sp
.bd I 2
.nr PS 8
.nr VS 9
.ps 8
.vs 9p
Lex helps write programs whose control flow
is directed by instances of regular
expressions in the input stream.
It is well suited for editor-script type transformations and
for segmenting input in preparation for
a parsing routine.
.PP
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program
which reads an input stream, copying it to an output stream
and partitioning the input
into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed.
The recognition of the expressions
is performed by a deterministic finite automaton
generated by Lex.
The program fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.
.if n .if \n(tm .ig
.PP
The lexical analysis
programs written with Lex accept ambiguous specifications
and choose the longest
match possible at each input point.
If necessary, substantial look~ahead
is performed on the input, but the
input stream will be backed up to the
end of the current partition, so that the user
has general freedom to manipulate it.
.PP
Lex can generate analyzers in either C or Ratfor, a language
which can be translated automatically to portable Fortran.
It is available on the PDP-11 UNIX, Honeywell GCOS,
and IBM OS systems.
This manual, however, will only discuss generating analyzers
in C on the UNIX system, which is the only supported
form of Lex under UNIX Version 7.
Lex is designed to simplify
interfacing with Yacc, for those
with access to this compiler-compiler system.
..
.nr PS 9
.nr VS 11
.AE
.SH
.ce 1
Table of Contents
.LP
.ce 100
.TS
r 1l 2r .
1.	Introduction.	1
2.	Lex Source.	3
3.	Lex Regular Expressions.	3
4.	Lex Actions.	5
5.	Ambiguous Source Rules.	7
6.	Lex Source Definitions.	8
7.	Usage.	8
8.	Lex and Yacc.	9
9.	Examples.	10
10.	Left Context Sensitivity.	11
11.	Character Set.	12
12.	Summary of Source Format.	12
13.	Caveats and Bugs.	13
14.	Acknowledgments.	13
15.	References.	13
.TE
.ce 0
.2C
.NH
Introduction.
.PP
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem oriented specification
for character string matching,
and
produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions.  At the bound~aries
between strings
program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
.de MH
Bell Laboratories, Murray Hill, NJ 07974.
..
The user supplies the additional code
beyond expression matching
needed to complete his tasks, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general purpose programming language employed for the
user's program fragments.
Thus, a high level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This avoids forcing the user who wishes to use a string manipulation
language for input analysis to write processing programs in the same
and often inappropriate string handling language.
.PP
Lex is not a complete language, but rather a generator representing
a new language feature which can be added to
different programming languages, called ``host languages.'' 
Just as general purpose languages
can produce code to run on different computer hardware,
Lex can write code in different host languages.
The host language is used for the output code generated by Lex
and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages
are also provided.
This makes Lex adaptable to different environments and
different users.
Each application
may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local
implementations.
At present, the only supported host language is C,
although Fortran (in the form of Ratfor [2]) has been available
in the past.
Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
.PP
Lex turns the user's expressions and actions
(called
.ul
source
in this memo) into the host general-purpose language;
the generated program is named
.ul
yylex.
The
.ul
yylex
program
will recognize expressions
in a stream
(called
.ul
input
in this memo)
and perform the specified actions for each expression as it is detected.
See Figure 1.
.GS
.TS
center;
l _ r
l|c|r
l _ r
l _ r
l|c|r
l _ r
c s s
c s s.

Source \(->	Lex	\(-> yylex

.sp 2

Input \(->	yylex	\(-> Output

.sp
An overview of Lex
.sp
Figure 1
.TE
.GE
.PP
For a trivial example, consider a program to delete
from the input
all blanks or tabs at the ends of lines.
.TS
center;
l l.
%%
[ \et]+$	;
.TE
is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
which matches one or more
instances of the characters blank or tab
(written \et for visibility, in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates ``one or more ...'';
and the $ indicates ``end of line,'' as in QED.
No action is specified,
so the program generated by Lex (yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule:
.TS
center;
l l.
%%
[ \et]+$	;
[ \et]+	printf(" ");
.TE
The finite automaton generated for this
source will scan for both rules at once,
observing at
the termination of the string of blanks or tabs
whether or not there is a newline character, and executing
the desired rule action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule all remaining strings of blanks or tabs.
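As a rough illustration of what the two rules accomplish (using ordinary regular expressions rather than Lex itself), their combined effect on a single line of text can be mimicked in Python:

```python
import re

def normalize_blanks(line):
    """Mimic the two Lex rules above on one line of input:
    rule 1 deletes blanks and tabs at the end of the line;
    rule 2 collapses any remaining run of blanks or tabs to
    a single blank.
    """
    line = re.sub(r'[ \t]+$', '', line)    # [ \t]+$   ;
    return re.sub(r'[ \t]+', ' ', line)    # [ \t]+    printf(" ");
```

The real Lex-generated automaton scans for both patterns simultaneously and decides between them by whether a newline follows the run; applying the end-of-line rule first reproduces that preference here.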
.PP
Lex can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
Lex can also be used with a parser generator
to perform the lexical analysis phase; it is particularly
easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context free grammars,
but require a lower level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate.
When used as a preprocessor for a later parser generator,
Lex is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
The flow of control
in such a case (which might be the first half of a compiler,
for example) is shown in Figure 2.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by Lex.
.BS 2
.TS
center;
l c c c l
l c c c l
l c c c l
l _ c _ l
l|c|c|c|l
l _ c _ l
l c c c l
l _ c _ l
l|c|c|c|l
l _ c _ l
l c s s l
l c s s l.
	lexical		grammar
	rules		rules
	\(da		\(da

	Lex		Yacc

	\(da		\(da

Input \(->	yylex	\(->	yyparse	\(-> Parsed input

.sp
	Lex with Yacc
.sp
	Figure 2
.TE
.BE
Yacc users
will realize that the name
.ul
yylex
is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies
interfacing.
.PP
Lex generates a deterministic finite automaton from the regular expressions
in the source [4].
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a Lex program
to recognize and partition an input stream is
proportional to the length of the input.
The number of Lex rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of re~scanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by Lex.
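The table-driven interpretation described here can be sketched as follows. The transition table is a toy (it recognizes the expression a+), chosen only to show the key property: each input character costs one table lookup, so scan time grows with the input length, not with the number or complexity of the rules that produced the table.

```python
def run_dfa(table, accepting, start, text):
    """Interpret a DFA transition table over text: one lookup per
    input character, rejecting as soon as no transition exists.
    """
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False           # no transition: input rejected
    return state in accepting

# Toy automaton for the regular expression a+ (one or more a's):
# state 0 is the start; reading 'a' moves to (or stays in) state 1.
table = {(0, 'a'): 1, (1, 'a'): 1}
```

A larger rule set enlarges the table (and hence the generated program), but the interpreter loop above is unchanged.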
.PP
In the program written by Lex, the user's fragments
(representing the
.ul
actions
to be performed as each regular expression
is found)
are gathered
as cases of a switch.
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or to
add subroutines outside this action routine.
.PP
Lex is not limited to source which can
be interpreted on the basis of one character
look~ahead.
For example,
if there are two rules, one