"lz" File Compress/Decompress Utilities                   Page 1
        lzcomp -- File Compression


synopsis

        lzcomp [-options] [infile [outfile]]

description

        lzcomp  implements  the  Lempel-Ziv  file compression algorithm.
        (Files compressed by lzcomp  are  uncompressed  by  lzdcmp.)  It
        operates  by finding common substrings and replacing them with a
        variable-size code.  This is deterministic, and can be done with
        a  single pass over the file.  Thus, the decompression procedure
        needs no input table, but can track the way the table was built.

        Options may be given in either case.

        -B      Input file is "binary", not "human readable text".  This
                is necessary on DEC operating systems, such as  VMS  and
                RSX-11M, that treat these files differently.  (Note that
                binary support is rudimentary and probably  insufficient
                as  yet.)  (On VMS version 4, this is ignored unless the
                -x  option  is  specified   or   the   input   file   is
                record-oriented.)

          Under RT-11 or TSX-Plus, the "-b" option should be used at all
          times or the last buffer of decompressed data will most likely
          NOT be written to the disk when decompressing!

        -M bits Write  using the specified number of bits in the code --
                necessary for  big  machines  making  files  for  little
                machines.   For  example,  if  compressing a file on VMS
                which is to be read on a PDP-11, you  should  select  -M
                12.

        -V [n]  Verbose  if specified.  If a value is specified, it will
                enable debugging code (if compiled in).

        -X [n]  "Export"  --  write  a  file  format that can be read by
                other operating systems.  Only the bytes in the file are
                copied;   file  attributes are not preserved.  If speci-
                fied, the value determines the level  of  compatibility.
                If not specified, or specified with an explicit value of
                zero, and lzcomp is running on Vax/VMS version  4  under
                VaxC  and  the  input  file  is  a  disk or magtape file
                (block-oriented), a VMS-private output  format  is  used
                which  is  incompatible  with the Unix compress utility,
                but which preserves VMS file attributes.  -X may take on
                the following values:

                 0  Choose VMS private format.  See restrictions below.
                 1  Compatible  with Unix compress version 3.0:  this is
                    the default if -x is given without a value.
                 2  As above, but suppress "block compression"
                 3  Suppress  block  compression  and  do  not  output a
                    compress header block.   This  is  for compatibility
                    with a quite early version of Unix compress (and re-
                    quires conditional-compilation to use).

                Note  that  the -B (binary) option is ignored unless the
                input file is "record-oriented", such as a  terminal  or
                mailbox.

        The  other  two  arguments  are  the  input and output filenames
        respectively.  Redirection is  supported;  however,  the  output
        must be a disk/tape file.

        The  file  format  is  almost  identical to the current Unix im-
        plementation of compress (V4.0).  Files written by Unix compress
        should be readable by lzdcmp.  Files written by lzcomp in export
        (-x) format will be  readable  by  Unix  compress  (except  that
        lzcomp  outputs  two "clear" codes to mark EOF.  A patch to Unix
        compress is available.)

VMS Restrictions

        VMS  Private  mode  stores  the  true name and attributes of the
        input file into the compressed file and lzdcmp restores the  at-
        tributes  (and  filename  if requested).  The following restric-
        tions apply -- they may be lifted in  the  future  as  they  are
        primarily  due  to the author's lack of understanding of the in-
        tricacies of VMS I/O:

            All files must be stored on disk.
            The lzcomp output file must be specified directly.

        Also,  for all usage on VMS, the compressed file must be written
        to, and read from, disk.

LZW compression algorithm

        This section is abstracted from Terry Welch's article referenced
        below.  The algorithm builds a  string  translation  table  that
        maps  substrings  in  the  input  into  fixed-length codes.  The
        compress algorithm may be described as follows:

             1.  Initialize table to contain single-character strings.
             2.  Read  the first character.  Set <w> (the prefix string)
                 to that character.
             3.  (step):  Read next input character, K.
             4.  If at end of file, output code(<w>);  exit.
             5.  If <w>K is in the string table:
                    Set <w> to <w>K;  goto step 3.
             6.  Else <w>K is not in the string table.
                    Output code(<w>);
                    Put <w>K into the string table;
                    Set <w> to K;  Goto step 3.

        "At  each execution of the basic step an acceptable input string
        <w> has been parsed off.  The next character K is read  and  the
        extended string <w>K is tested to see if it exists in the string
        table.  If it is there, then the  extended  string  becomes  the
        parsed  string  <w> and the step is repeated.  If <w>K is not in
        the  string  table,  then  it  is  entered,  the  code  for  the
        successfully parsed string <w> is put out as compressed data, the
        character K becomes the beginning of the next  string,  and  the
        step is repeated."
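
        As an illustration only, the compression steps above  can  be
        sketched  in Python.  The function name lzw_compress is hypo-
        thetical, the table is an ordinary dictionary with  no  size
        limit,  and  codes  are returned as a list of integers rather
        than packed into bits.  It is not the lzcomp source.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Return the LZW codes for data (illustrative; no bit packing)."""
    if not data:
        return []
    # Step 1: the table starts with every single-character string.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    w = data[:1]                    # Step 2: <w> is the first character
    for k in data[1:]:              # Step 3: read next character K
        wk = w + bytes([k])
        if wk in table:             # Step 5: <w>K is known, keep extending
            w = wk
        else:                       # Step 6: emit code(<w>), remember <w>K
            codes.append(table[w])
            table[wk] = next_code
            next_code += 1
            w = bytes([k])
    codes.append(table[w])          # Step 4: end of file, flush code(<w>)
    return codes
```

        For example, the seven bytes "ABABABA" reduce to four  codes:
        the  two literal characters, the code for "AB", and the code
        for "ABA".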

        The decompression algorithm translates each received code into a
        prefix string and extension [suffix] character.   The  extension
        character  is  stored  (in  a  push-down  stack), and the prefix
        translated again, until the prefix is a single character,  which
        completes  decompression  of this code.  The entire code is then
        output by popping the stack.

        "An  update  to  the string table is made for each code received
        (except the first one).  When a code has  been  translated,  its
        final  character  is  used  as the extension character, combined
        with the prior string, to add a new string to the string  table.
        This  new  string  is assigned a unique code value, which is the
        same code that the compressor assigned to that string.  In  this
        way, the decompressor incrementally reconstructs the same string
        table that the compressor used....   Unfortunately  ...   [the
        algorithm] does not work for an abnormal case.

        The abnormal case occurs whenever an input character string con-
        tains the sequence K<w>K<w>K, where K<w> already appears in  the
        compressor string table."

        The  decompression  algorithm,  augmented to handle the abnormal
        case, is as follows:

             1.  Read first input code;
                 Store in CODE and OLDcode;
                 With CODE = code(K), output(K);  FINchar = K;

             2.  Read next code to CODE;  INcode = CODE;
                 If at end of file, exit;

             3.  If CODE not in string table (special case) then
                    Output(FINchar);
                    CODE = OLDcode;
                    INcode = code(OLDcode, FINchar);

             4.  If CODE == code(<w>K) then
                    Push K onto the stack;
                    CODE = code(<w>);
                    Goto 4.

             5.  If CODE == code(K) then
                    Output K;
                    FINchar = K;

             6.  While stack not empty
                    Output top of stack;
                    Pop stack;

             7.  Put OLDcode,K into the string table.
                 OLDcode = INcode;
                 Goto 2.
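
        The  decompression  steps  above can be sketched in Python as
        follows.  This is a simplified illustration, not the  lzdcmp
        source:  the  name  lzw_decompress  is hypothetical, and the
        table stores whole strings instead of prefix/extension pairs,
        so  the  stack  of  steps 4 to 6 collapses into a dictionary
        lookup.  The abnormal case is the branch where  the  received
        code is not yet in the table.

```python
def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes from a list of LZW codes (illustrative)."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    old = table[codes[0]]           # Step 1: first code is a single character
    out = bytearray(old)
    for code in codes[1:]:          # Step 2: read next code
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Step 3, abnormal case K<w>K<w>K: the string is the previous
            # string followed by its own first character.
            entry = old + old[:1]
        else:
            raise ValueError(f"invalid code {code}")
        out += entry                # Steps 4-6: output the translated string
        # Step 7: previous string plus first character of this one.
        table[next_code] = old + entry[:1]
        next_code += 1
        old = entry
    return bytes(out)
```

        The code sequence 65, 66, 256, 258 exercises the abnormal case:
        code 258 arrives before the decompressor has entered it.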

        The algorithm as implemented here introduces two additional com-
        plications.

        The  actual codes are transmitted using a variable-length encod-
        ing.  The lowest-level routines increase the number of  bits  in
        the code when the largest possible code is transmitted.
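
        A  rough  Python  sketch  of variable-length code packing is
        given below.  Several things in it  are  assumptions  and  not
        taken  from  lzcomp: the names code_width, pack_codes and un-
        pack_codes, the 9-bit starting width, the 16-bit  limit,  and
        the  low-bit-first  packing order.  For simplicity the sketch
        derives the width from a count of codes already sent  (assum-
        ing  one  table entry is added per code), which is a substi-
        tute for watching for the largest possible code as  described
        above.

```python
def code_width(index: int, min_bits: int = 9, max_bits: int = 16) -> int:
    """Width in bits of the index-th code (0-based).

    Assumes the table starts with 256 entries and gains one entry per
    code already sent, so the largest possible code is 255 + index.
    """
    largest_possible = 255 + index
    return min(max_bits, max(min_bits, largest_possible.bit_length()))


def pack_codes(codes: list[int]) -> bytes:
    """Pack codes low-bit-first into bytes (bit order is an assumption)."""
    acc = 0                         # bit accumulator
    nbits = 0                       # number of valid bits in acc
    out = bytearray()
    for i, code in enumerate(codes):
        width = code_width(i)
        if code >> width:
            raise ValueError(f"code {code} does not fit in {width} bits")
        acc |= code << nbits
        nbits += width
        while nbits >= 8:           # flush whole bytes
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                       # flush the final partial byte
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_codes(data: bytes, count: int) -> list[int]:
    """Inverse of pack_codes for a known number of codes."""
    acc = 0
    nbits = 0
    pos = 0
    codes = []
    for i in range(count):
        width = code_width(i)
        while nbits < width:        # refill from the input bytes
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        codes.append(acc & ((1 << width) - 1))
        acc >>= width
        nbits -= width
    return codes
```

        Because  both  sides compute the width from the same count, no
        width information needs to be stored in the file.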

        Periodically, the algorithm checks that compression is still in-
        creasing.   If  the  ratio  of  input  bytes  to  output   bytes
        decreases,  the entire process is reset.  This can happen if the
        characteristics of the input file change.
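
        The  periodic  check  might  be  sketched in Python as shown
        below.  The class name RatioMonitor, the checking  interval,
        and  the exact reset rule are all assumptions made for illus-
        tration; the actual thresholds used by lzcomp are not  stated
        in this document.

```python
class RatioMonitor:
    """Decide when to reset, based on the input/output byte ratio."""

    def __init__(self, interval: int = 10000):
        self.interval = interval    # input bytes between checks (assumed)
        self.next_check = interval
        self.best_ratio = 0.0       # best ratio seen since the last reset

    def should_reset(self, bytes_in: int, bytes_out: int) -> bool:
        """Return True when the ratio has dropped since the last check."""
        if bytes_in < self.next_check or bytes_out == 0:
            return False            # not yet time to check
        self.next_check = bytes_in + self.interval
        ratio = bytes_in / bytes_out
        if ratio >= self.best_ratio:
            self.best_ratio = ratio # compression is still improving
            return False
        self.best_ratio = 0.0       # ratio fell: caller resets the table
        return True
```

        When  should_reset  returns  True, the caller would clear the
        string table and signal the reset in the output.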

VMS Private File Structure

        In  VMS  Private  mode,  the  compressed  data  file  contains a
        variable-length (but compressed) file header with the file  "at-
        tributes"  needed by the operating system to construct the file.
        This allows the decompression program to recreate  the  file  in
        its  original  format,  which is essential if ISAM databases are
        compressed.

        The overall file format is as follows:

        LZ_SOH  "start  of  header"  signal (this value cannot appear in
                user data).

                A  variable-length  data record (maximum 256 bytes) con-
                taining the header name, followed  by  whitespace,  fol-
                lowed by header-specific information.  In this case, the
                name record will  contain  the  string  "vms$attributes"
                followed  by  the  number of bytes in the attribute data
                block.  (I assume that the name record will consist of a
                facility name, such as "vms", followed by a dollar sign,
                followed by a facility-unique word.)

        LZ_EOR  Signals "end of record".

                This  is  followed  by  a  VMS  file  attributes  record
                (generated by a VMS system library routine).

        LZ_ETX  Signals "end of segment".

        LZ_STX  Signals "start of text" (i.e., start of data file).

                This is followed by the user data file.

        LZ_ETX  Signals "end of segment".

        LZ_ETX  Two in a row signals "end of file".

        Note  that this format can easily be extended to include trailer
        records (with file counts and checksums)  and/or  multiple  data
        files in one compressed file.

        Note  also  that the LZ_CLEAR code may appear in headers or data
        files to cause the decompression program  to  "readapt"  to  the
        characteristics  of the input data.  LZ_STX and LZ_SOH reset the
        compression algorithm.  LZ_EOR does not.
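
        The ordering of markers and records in the VMS private format
        can be summarized with a short Python sketch.  The  function
        name private_file_layout is hypothetical, the markers are re-
        presented by their names only (their code values are not given
        here), and writing the attribute byte count as decimal text
        after a single space is an assumption.  LZ_STX is used for the
        start-of-text marker, following the name used in the note
        above.

```python
def private_file_layout(attributes: bytes, data: bytes) -> list:
    """Return the sequence of markers and records, in file order.

    Markers are given as strings naming them; records are bytes.
    """
    # Name record: header name, whitespace, header-specific information.
    # Decimal text for the byte count is an assumption of this sketch.
    name_record = b"vms$attributes " + str(len(attributes)).encode("ascii")
    if len(name_record) > 256:
        raise ValueError("name record exceeds 256 bytes")
    return [
        "LZ_SOH",       # start of header
        name_record,
        "LZ_EOR",       # end of record
        attributes,     # VMS file attributes record
        "LZ_ETX",       # end of segment
        "LZ_STX",       # start of text
        data,           # user data file
        "LZ_ETX",       # end of segment
        "LZ_ETX",       # second in a row: end of file
    ]
```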

Authors

        The  algorithm  is  from  "A Technique for High Performance Data
        Compression." Terry A. Welch.  IEEE  Computer  Vol  17,  No.   6
        (June 1984), pp 8-19.

        This revision is by Martin Minow.

        Unix Compress authors are as follows:

        Spencer W. Thomas       (decvax!harpo!utah-cs!utah-gr!thomas)
        Jim McKie               (decvax!mcvax!jim)
        Steve Davies            (decvax!vax135!petsd!peora!srd)
        Ken Turkowski           (decvax!decwrl!turtlevax!ken)
        James A. Woods          (decvax!ihnp4!ames!jaw)
        Joe Orost               (decvax!vax135!petsd!joe)

        lzdcmp -- File Decompression


synopsis

        lzdcmp [-options] [infile [outfile]]

description

        lzdcmp  decompresses files compressed by lzcomp.  The documenta-
        tion for lzcomp describes the process in greater detail.

        Options may be given in either case.

        -B      Output  file  is  "binary",  not  text.  (Ignored in VMS
                private mode.)

          Under RT-11 or TSX-Plus, the "-b" option should be used at all
          times or the last buffer of decompressed data will most likely
          NOT be written to the disk when decompressing!

        -X 3    To  read  files  compressed  by an old Unix version that
                doesn't generate header records.

        -V val  Verbose  (print  status  messages and debugging informa-
                tion).  The value selects the amount of verbosity.

Author

        This version by Martin Minow.  See lzcomp for more details.
                                                                        