






                             DDDDMMMM: A Data Manipulator
                        Tutorial Introduction and Manual

                                  Gary Perlman
                          Cognitive Science Laboratory
                      University of California, San Diego


          _1.  _I_n_t_r_o_d_u_c_t_i_o_n

                  DDDDMMMM is a powerful data manipulating  program  with  a
             rich  set  of operators for manipulating columnated files
             of numbers and strings.  DDDDMMMM allows you to  avoid  writing
             little BASIC or C programs every time you want to do some
             transformation to a file of data.  To use DDDDMMMM, you write a
             list of expressions, and for each line of data, DDDDMMMM prints
             the result of evaluating each expression.

             _1._1.  _I_n_t_r_o_d_u_c_t_o_r_y _E_x_a_m_p_l_e_s

                     Usually, the input to DDDDMMMM is a file of lines, each
                with  an equinumerous number of fields.  Equivalently,
                DDDDMMMM's input is a file with some set number of columns.

                _1._1._1.  _C_o_l_u_m_n _E_x_t_r_a_c_t_i_o_n.   DDDDMMMM can be used to extract
                   columns  in a very simple manner.  If "data" is the
                   name of a file of five columns, then the  following
                   will  extract  the  third  string  followed  by the
                   first, followed by the fourth, and  print  them  to
                   the standard output.

                                   dm s3 s1 s4 < data

                   Thus DDDDMMMM is  very  useful  for  putting  data  in  a
                   correct  format for input to many programs, notably
                   the statistical programs developed available on the
                   UNIX* operating system.[1]
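
                   For readers who prefer to see the idea spelled out,
                   the following is a minimal Python sketch of column
                   extraction.  It is an illustration only, not DM's
                   implementation: the function name extract_columns is
                   hypothetical, and fields are assumed to be separated
                   by spaces or tabs.

```python
def extract_columns(line, columns):
    """Return the requested 1-based columns of a line, joined by tabs."""
    fields = line.split()  # split on runs of spaces or tabs
    return "\t".join(fields[c - 1] for c in columns)


# Rough equivalent of: dm s3 s1 s4 < data (for a single line of data)
print(extract_columns("a b c d e", [3, 1, 4]))
```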

                _1._1._2.  _S_i_m_p_l_e _E_x_p_r_e_s_s_i_o_n_s.   In the  preceding  exam-
                   ple, columns were accessed by typing the letter 's'
                   (for string) followed by a column number.  You  can
                   also access the numerical value of a column by typ-
                   ing 'x' followed by a column number.  This is  use-
                   ful  when you want to form simple expressions based
                   on various columns.  Suppose "data" is  a  file  of
                   four  numerical columns, and that you want to print
          ____________________
9             * UNIX is a trademark of Bell Laboratories.
             [1]Gary Perlman.  Data  analysis  programs  of  the  UNIX
          operating    system.     _B_e_h_a_v_i_o_r   _R_e_s_e_a_r_c_h   _M_e_t_h_o_d_s   _a_n_d
          _I_n_s_t_r_u_m_e_n_t_a_t_i_o_n, 1980, _1_2, 554-558.



9







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                   the sum of the first two columns  followed  by  the
                   difference  of  the second two.  The easiest way to
                   do this is by the following.

                                  dm x1+x2 x3-x4 < data

                   Almost any arithmetic operation is available, and
                   expressions can be of arbitrary complexity.  However,
                   care must be taken because many of the symbols used
                   by DM (such as * for multiplication) have special
                   meaning to the UNIX shell.  This problem can be
                   avoided by putting expressions in double quotes.  For
                   example, the following will print the sum of the
                   squares of the first two columns followed by the
                   square of the third, a sort of Pythagorean program.

                             dm "x1*x1+x2*x2" "x3*x3" < data
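
                   As a rough illustration of what this call computes
                   for one line of input, here is a small Python sketch.
                   The function name is hypothetical, and the sketch
                   assumes every field on the line is numeric.

```python
def sum_of_squares_and_square(line):
    """Mimic: dm "x1*x1+x2*x2" "x3*x3" for one input line."""
    x = [float(field) for field in line.split()]  # x[0] plays the role of x1
    return x[0] * x[0] + x[1] * x[1], x[2] * x[2]


print(sum_of_squares_and_square("3 4 5"))  # (25.0, 25.0)
```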


                _1._1._3.  _L_i_n_e  _E_x_t_r_a_c_t_i_o_n  _B_a_s_e_d  _o_n  _C_o_n_d_i_t_i_o_n_s.    DDDDMMMM
                   allows  you  to  test  conditions  and print values
                   depending on whether the conditions are  satisfied.
                   The DDDDMMMM call

                      dm "if x1 >= 100 then INPUT else KILL" < data

                   will print only those lines whose first column has
                   a value greater than or equal to 100.  The
                   variable INPUT refers to the whole input line.  The
                   special variable KILL instructs DM to terminate
                   processing of the current line and go on to the next.
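
                   The same filtering idea can be sketched in Python as
                   follows.  This is only an analogy: keep_large is a
                   hypothetical name, yielding the line stands in for
                   INPUT, and skipping the line stands in for KILL.

```python
def keep_large(lines, threshold=100.0):
    """Yield only the lines whose first column is >= threshold."""
    for line in lines:
        fields = line.split()
        if float(fields[0]) >= threshold:
            yield line  # plays the role of INPUT
        # otherwise nothing is yielded, which plays the role of KILL


data = ["150 a", "99 b", "100 c"]
print(list(keep_large(data)))  # ['150 a', '100 c']
```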

          _2.  _D_a_t_a _T_y_p_e_s

             _2._1.  _S_t_r_i_n_g _D_a_t_a

                     To access or print a column in a file, the string
                variable,  s, is provided.  si (s followed by a column
                number, such as 5) refers to the  ith  column  of  the
                input,  treated  as a string.  The most simple example
                is to use an si as the only part of an expression.

                                     dm s2 s3 s1

                will print the second, third and first columns of the
                input.  One special string is called INPUT, and is the
                current input line of data.  String constants in
                expressions are delimited by single quotes, as in
                'five'.



9

9                                     ---- 2222 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


             _2._2.  _N_u_m_e_r_i_c_a_l _D_a_t_a

                     Two variables are available to refer to the input
                columns  (xi)  and  the  numerical result of evaluated
                expressions (yi).  xi refers to the ith column of  the
                input,  treated as a number.  xi is the result of con-
                verting si to a number.  If si contains  non-numerical
                characters,  xi may have strange values.  A common use
                of the xi is in algebraic expressions.

                                    dm x1+x2 x1/x2

                will print out two columns, first the sum of the first
                two input columns, then their ratio.

                     You can  refer  to  the  value  of  a  previously
                evaluated  expression,  and  avoid evaluating the same
                sub-expression more  than  once.   yi  refers  to  the
                numerical  value  of  the  ith expression.  Instead of
                writing:

                               dm x1+x2+x3 "(x1+x2+x3)/3"

                you could write:

                                 dm x1+x2+x3     y1/3

                y1 is the value of the first expression, x1+x2+x3.
                String values of expressions are unfortunately
                inaccessible.
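
                     One way to picture how yi works is the Python sketch
                below, in which expressions are evaluated in order and
                each result is stored where later expressions can read
                it.  The names are hypothetical, and representing
                expressions as Python functions is an assumption made
                for the illustration; DM parses its own expression
                syntax.

```python
def evaluate_in_order(x, expressions):
    """Evaluate expressions left to right; each may read earlier results.

    x and y are 1-based here (index 0 is unused) to mirror xi and yi.
    """
    y = [None]
    for expression in expressions:
        y.append(expression(x, y))
    return y[1:]


# Rough equivalent of: dm x1+x2+x3 y1/3
x = [None, 3.0, 6.0, 9.0]
results = evaluate_in_order(x, [
    lambda x, y: x[1] + x[2] + x[3],  # first expression: the sum
    lambda x, y: y[1] / 3,            # second expression reuses y1
])
print(results)  # [18.0, 6.0]
```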

                     _I_n_d_e_x_i_n_g _n_u_m_e_r_i_c_a_l _v_a_r_i_a_b_l_e_s is usually  done  by
                putting  the  index  after x or y, but if you want the
                value of the index to depend on the input such as when
                there  are  a variable number of columns, and only the
                last column is  of  interest,  the  index  value  will
                depend  on the number of columns.  If a computed index
                is desired for x or y the index should be  an  expres-
                sion  in  square brackets following x or y.  For exam-
                ple, x[N] is the value  of  the  last  column  of  the
                input.  N is a special variable equal to the number of
                columns in INPUT.  You have the option to  use  x1  or
                x[1]  but  x1 will execute faster so computed indicies
                should not be used unless necessary.
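
                     The effect of x[N] on lines of differing length can
                be illustrated with a short Python sketch.  The function
                name last_column is hypothetical, and the sketch assumes
                the last field is numeric.

```python
def last_column(line):
    """Return the numeric value of the last field, like x[N]."""
    fields = line.split()
    n = len(fields)  # plays the role of N
    return float(fields[n - 1])


print(last_column("1 2 3"))      # 3.0
print(last_column("4 5 6 7 8"))  # 8.0
```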

             _2._3.  _S_p_e_c_i_a_l _V_a_r_i_a_b_l_e_s

                     DDDDMMMM offers  some  special  variables  and  control
                primitives  for  commonly desired operations.  Many of
                the special variables have more than one name to allow
                more readable expressions.

9

9                                     ---- 3333 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                N       N is the number  of  columns  in  the  current
                        input line.

                SUM     SUM is the sum of  the  xi,  i  =  1,N.   This
                        number  may  be of limited use if some columns
                        are non-numerical.

                INLINE  INLINE is the line number of the  input.   For
                        the first line of input, INLINE is 1.0.

                OUTLINE OUTLINE is the number of lines so far  output.
                        For  the first line of input, OUTLINE is zero.
                        OUTLINE will not increase if a line is  killed
                        with KILL.

                RAND    RAND is a random number uniform in [0,1).

                INPUT   INPUT is the original input line, all  spaces,
                        etc. included.

                NIL     If an expression evaluates to NIL, then there
                        will be no output for that expression (for
                        that input line).  This should not be confused
                        with KILL, which suppresses output for a whole
                        line, or with a prefix X, which unconditionally
                        suppresses output for an expression.

                KILL    If an expression evaluates to KILL, then there
                        will be no output for the present line.  No
                        expressions after a KILLed expression are
                        evaluated.  This can be useful to avoid
                        nastiness like division by zero.  NEXT can be
                        used as a synonym for KILL.

                EXIT    If an expression evaluates to EXIT, then DM
                        immediately exits.  This can be useful for
                        stopping a search after a target is found.

          _3.  _U_s_e_r _I_n_t_e_r_f_a_c_e

             _3._1.  _E_x_p_r_e_s_s_i_o_n_s

                     Expressions  are  written  in   common   computer
                language syntax, and can have spaces inserted anywhere
                except (1) in the middle of constants, and (2) in  the
                middle of multicharacter tokens, such as <= (less than
                or equal to).  Four modes are available for specifying
                your  expressions  to DDDDMMMM.  They give you the choice of
                entering expressions from your terminal or a file, and
                the option to use DDDDMMMM interactively.


9

9                                     ---- 4444 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                _3._1._1.  _A_r_g_u_m_e_n_t _M_o_d_e.   The most restrictive mode  is
                   to supply them are arguments to your call to DDDDMMMM, as
                   featured in previous examples.   The  main  problem
                   with  this  mode is that many special characters in
                   UNIX are used as  operators,  requiring  that  many
                   expressions   be  quoted  (with  single  or  double
                   quotes).  The main advantage is that this  mode  is
                   most  useful  in  constructing  pipelines and shell
                   scripts.

                _3._1._2.    _E_x_p_r_e_s_s_i_o_n   _F_i_l_e   _M_o_d_e.     Another   non-
                   interactive method is to supply DDDDMMMM with a file with
                   your expressions in it (one to each line)  by  cal-
                   ling DDDDMMMM with:

                                     dm E<filename>

                   where <filename> is your file of expressions.  This
                   mode  makes  it easier to use DDDDMMMM with pipelines and
                   redirection.

                _3._1._3.   _I_n_t_e_r_a_c_t_i_v_e  _M_o_d_e.    DDDDMMMM  can  also  be  used
                   interactively  by calling DDDDMMMM with no arguments.  In
                   interactive mode, DDDDMMMM will first ask you for a  file
                   of  your  expressions.   If  you  do  not have your
                   expressions in a  file,  and  wish  to  enter  them
                   interactively,  type  RETURN  when  asked  for  the
                   expression file.  A null filename tells DDDDMMMM to  read
                   expressions  from your terminal.  In terminal mode,
                   DDDDMMMM will prompt you with the expression number,  and
                   print  out how it interprets what you type in if it
                   has correct syntax, otherwise it lets  you  correct
                   your  mistakes.   When  you  have typed in the last
                   expression, an empty line tells DDDDMMMM  you  are  done.
                   If your expressions are in a file, type in the name
                   of that file, and DDDDMMMM will read them from there.

             _3._2.  _I_n_p_u_t

                     If you use DDDDMMMM in interactive mode,  you  will  be
                asked for an input file.  You can type in a file name,
                or type RETURN without typing any characters in  which
                case  DDDDMMMM  will  read  data from your terminal.  Out of
                interactive mode,  DDDDMMMM  will  read  from  the  standard
                input.

                     DDDDMMMM reads in your data a line at a time and stores
                that  line in a string variable called INPUT.  DDDDMMMM then
                takes each column in INPUT,  separated  by  spaces  or
                tabs, and stores each in the string variables, si.  DDDDMMMM
                then tries to convert these  strings  to  numbers  and
                stores  the  result in the number variables, xi.  If a
                column is not a number (eg. it is  a  name)  then  its


                                     ---- 5555 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                numerical  value  will  be inaccessible, and trying to
                refer to such a column will buy you an error  message.
                The number of columns in a line is stored in a special
                variable called N, so variable numbers of columns  can
                be  dealt with gracefully.  The general control struc-
                ture of DDDDMMMM is summarized in Figure 1.

                     Referring to a column that was not read in will
                result in an error message, as will trying to use a
                string as a number.  Older versions of DM did not
                check for such problems.

                     ____________________________________________

                         Figure 1: Control structure for DM.

                     read in n expressions; e1, e2, ..., en.
                     repeat while there is some input left
                             INPUT  = <next line from input file>
                             N      = <number of fields in INPUT>
                             SUM    = 0
                             RAND   = <a new random number in [0,1)>
                             INLINE = INLINE + 1
                             for i  = 1 until N do
                                     si  = <ith string in INPUT>
                                     xi  = <si converted to a number>
                                     SUM = SUM + xi
                             for i = 1 until n do
                                     switch on <value of ei>
                                             case EXIT: <stop the program>
                                             case KILL: <go to get new INPUT>
                                             case NIL : <go to next expression>
                                             default  : OUTLINE = OUTLINE + 1
                                                        yi = <value of ei>
                                                        print yi
                             <print a newline character>
                     ____________________________________________
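
                     The control structure in Figure 1 can be sketched in
                Python as shown below.  This is a simplified
                illustration under stated assumptions, not DM's source:
                expressions are represented as Python functions, NIL,
                KILL and EXIT are modelled as sentinel objects, the s
                list is 0-based, OUTLINE is counted once per output
                line, and a line on which EXIT occurs is assumed to
                produce no output.  SUM, xi, yi and the X prefix are
                left out for brevity.

```python
import random

# Sentinels standing in for DM's special values (names taken from the text).
NIL, KILL, EXIT = object(), object(), object()


def run(expressions, lines):
    """Apply each expression to each input line and return the output lines.

    Each expression is a callable that receives a dict of per-line variables.
    """
    output = []
    inline = 0
    for line in lines:
        fields = line.split()
        inline += 1
        env = {
            "INPUT": line,
            "N": len(fields),
            "s": fields,               # 0-based here, unlike DM's s1, s2, ...
            "INLINE": inline,
            "OUTLINE": len(output),    # lines output so far
            "RAND": random.random(),   # uniform in [0, 1)
        }
        values = []
        stop = killed = False
        for expression in expressions:
            value = expression(env)
            if value is EXIT:          # stop the whole run
                stop = True
                break
            if value is KILL:          # drop this line, skip later expressions
                killed = True
                break
            if value is NIL:           # no output for this expression only
                continue
            values.append(str(value))
        if stop:
            break
        if not killed:
            output.append("\t".join(values))
    return output


demo = run(
    [lambda env: KILL if "bad" in env["INPUT"] else env["s"][0],
     lambda env: env["N"]],
    ["a b", "bad x", "c d e"],
)
print(demo)  # ['a\t2', 'c\t3']
```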



             _3._3.  _O_u_t_p_u_t

                     In interactive mode, DDDDMMMM will ask you for an  out-
                put file or pipe.

                     Output file or pipe:

                You can supply a filename, type a "pipe command," or
                type RETURN.  A null filename tells DM to print to
                your terminal.  If you are printing to a file, it
                should be different from your input file.  DM will
                ask you if you want to overwrite any file that
                contains anything, but that does not mean you can write
                to the file you are reading from.

                     You can specify that the output from DM be
                redirected to another program by having the first
                character of the output specification be a pipe
                symbol, the vertical bar: |.  For example, the following
                line tells DM to pipe its output to tee, which prints
                a copy of its input to the terminal, and a copy to
                its argument file.

                     Output file or pipe: | tee dm.save


                     Outside interactive mode, DM prints to the
                standard output.


                     DDDDMMMM prints the values of all  its  expressions  in
                %.6g  format  for  numbers  (maintaining  at  most six
                digits of precision and printing in the fewest  possi-
                ble  characters), and %s format for strings.  A tab is
                printed after every column to insure separation.
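
                     The effect of the %.6g format can be seen with
                Python's printf-style formatting, which accepts the
                same conversion.  The helper name format_value is
                hypothetical; only the two format strings come from the
                text above.

```python
def format_value(value):
    """Format numbers with %.6g and strings with %s."""
    if isinstance(value, (int, float)):
        return "%.6g" % value
    return "%s" % value


for value in (100.0, 3.14159265, 1234567.0, "abc"):
    print(format_value(value))
# prints 100, 3.14159, 1.23457e+06 and abc, one per line
```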

          _4.  _O_p_e_r_a_t_i_o_n_s

                  DDDDMMMM offers a rich  set  of  numerical,  logical,  and
             string  operators.   The  operators  are evaluated in the
             usual orders (eg. times before plus) and expressions tend
             be evaluated from left to right.  Parentheses can be used
             to make the order of operations clear.  The way DDDDMMMM inter-
             prets  expressions  can  be  verified  by  entering  them
             interactively,  in  which  case   DDDDMMMM   prints   a   fully
             parenthesized form.

                  An assignment operator is  not  directly  available.
             Instead,  variables  can  be evaluated but not printed by
             using the expression suppression flag, X.  If  the  first
             character  of  an  expression is X, it will be evaluated,
             but not printed.  The value of  a  suppressed  expression
             can later be accessed with the expression value variable,
             yi.

             _4._1.  _S_t_r_i_n_g _O_p_e_r_a_t_i_o_n_s

                     Strings can be lexically  compared  with  several
                comparators: < (less-than), <= (less-than or equal), =
                (equal), != (not equal), >= (greater-than  or  equal),
                and > (greater than).  They return 1.0 if their condi-
                tion holds, and 0.0 otherwise.  For example,

                                  'abcde' <= 'eeek!'

                is equal to 1.0.  The length of strings can be found
                with the # operator.

                                       # 'five'

                evaluates to four, the length of the string argument.

                     Individual characters inside strings can be
                accessed by following a string with an index in square
                brackets.

                                     'abcdefg'[4]

                is the ASCII character number (100.0) of the 4th
                character in 'abcdefg'.  Indexing a string is mainly
                useful for comparing characters because it is not the
                character that is printed, but the character number.
                A warning is appropriate here:

                                     s1[1] = '*'

                will result in an error because the left side  of  the
                '='  is a number, and the right hand side is a string.
                The proper (although inelegant) form is:

                                    s1[1] = '*'[1]


                     A substring test is available. The expression:

                                  string1 C string2

                will return 1.0 if string1 is  somewhere  in  string2.
                This can be used as a test for character membership if
                string1 has only one character.  Also available is  !C
                which returns 1.0 if string1 is NOT in string2.
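
                     For comparison, the string operations described in
                this section have close Python counterparts, sketched
                below.  The function names are hypothetical, and the
                sketch assumes ASCII input, for which Python's string
                ordering is the same as ordering by character number.

```python
def str_le(a, b):
    """a <= b compared lexically, returning 1.0 or 0.0."""
    return 1.0 if a <= b else 0.0


def length(s):
    """The # operator: number of characters in s."""
    return float(len(s))


def char_number(s, i):
    """s[i] with a 1-based index: the character number, not the character."""
    return float(ord(s[i - 1]))


def contains(a, b):
    """a C b: 1.0 if a occurs somewhere in b."""
    return 1.0 if a in b else 0.0


print(str_le("abcde", "eeek!"))                       # 1.0
print(length("five"))                                 # 4.0
print(char_number("abcdefg", 4))                      # 100.0
print(char_number("*x", 1) == char_number("*", 1))    # True
print(contains("bad", "a bad line"))                  # 1.0
```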

             _4._2.  _N_u_m_e_r_i_c_a_l _O_p_e_r_a_t_o_r_s

                     <, <=, =, !=, >=, and > are  the  numerical  com-
                parators  and  have  the  analogous  meanings as their
                string counterparts.

                     The binary operators, + (addition), - (subtraction
                or "change-sign"), * (multiplication), and /
                (division) are available.  Multiplication and division
                are evaluated before addition and subtraction, and all
                are evaluated left to right.  Exponentiation, ^, is
                the binary operator of highest precedence and is
                evaluated right to left.  Modulo division, %, has the
                same properties as division, and is useful for tests
                of even/odd and the like.  NOTE: Modulo division
                truncates its operands to integers before dividing.
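
                     The truncation noted above matters when the
                operands have fractional parts.  The Python sketch below
                illustrates the described behavior; dm_modulo and
                is_even are hypothetical names, and the sketch is
                restricted to non-negative operands because languages
                differ in how they treat the sign of a remainder.

```python
def dm_modulo(a, b):
    """Modulo after truncating both operands to integers (non-negative only)."""
    return float(int(a) % int(b))


def is_even(value):
    """An even/odd test of the kind the text mentions."""
    return 1.0 if dm_modulo(value, 2) == 0.0 else 0.0


print(dm_modulo(7.9, 2.5))  # operands become 7 and 2, so the result is 1.0
print(is_even(8.2))         # 8.2 is truncated to 8, so the result is 1.0
```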
9

9                                     ---- 8888 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                     Several unary functions are available: l (natural
                log [log]), L (base ten log [Log]), e (exponential
                [exp]), a (absolute value [abs]), f (floor [floor]), c
                (ceiling [ceil]).  Their meaning can be verified in
                the UNIX programmer's manual.  You can use single
                letter names for these functions, or the more mnemonic
                strings bracketed after their names.

             _4._3.  _L_o_g_i_c_a_l _O_p_e_r_a_t_o_r_s

                     Logical operators are of  lower  precedence  that
                any  other operators.  Both logical AND (&) and OR (|)
                can be used to form complicated tests.   For  example,
                to  see  if  the  first  three  columns  are in either
                increasing or decreasing order, you could test  if  x2
                was between x1 and x3:

                            x1<x2 & x2<x3 | x1>x2 & x2>x3

                would  equal  1.0  if  the  condition  was  satisfied.
                Parentheses  are  unnecessary  because  < and > are of
                higher precedence than & which is of higher precedence
                than  |.   The unary logical operator, ! (NOT), evalu-
                ates to 1.0 if its operand is 0.0, otherwise it equals
                0.0.   Many  binary  operators can be immediately pre-
                ceded by ! to negate their value.  !=  is  "not  equal
                to", !| is "neither", !& is "not both", and !C is "not
                in".
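
                     The betweenness test above translates directly into
                Python, where "and" binds more tightly than "or" in the
                same way that & binds more tightly than |.  The function
                names below are hypothetical.

```python
def is_monotone(x1, x2, x3):
    """1.0 if x2 lies strictly between x1 and x3, else 0.0."""
    # No parentheses needed: comparisons bind tighter than `and`,
    # and `and` binds tighter than `or`.
    return 1.0 if x1 < x2 and x2 < x3 or x1 > x2 and x2 > x3 else 0.0


def logical_not(value):
    """The unary ! operator: 1.0 for 0.0, otherwise 0.0."""
    return 1.0 if value == 0.0 else 0.0


print(is_monotone(1, 2, 3))               # 1.0 (increasing)
print(is_monotone(3, 2, 1))               # 1.0 (decreasing)
print(is_monotone(1, 3, 2))               # 0.0
print(logical_not(is_monotone(1, 3, 2)))  # 1.0
```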

             _4._4.  _C_o_n_d_i_t_i_o_n_a_l _E_x_p_r_e_s_s_i_o_n_s

                     The expressions:

                     if expression1 then expression2 else expression3
                        expression1  ?   expression2   :  expression3

                evaluate to expression2 if  expression1  is  non-zero,
                otherwise  they  evaluate  to  expression3.  The first
                form is more mnemonic than the second  which  is  con-
                sistent with C syntax.  Both forms have the same mean-
                ing.  expression1 has to be numerical, expression2  or
                expression3  can be numerical or string.  For example,
                The following expression will filter  out  lines  with
                the word 'bad' in them.

                        if 'bad' C INPUT then KILL else INPUT

                As another  example,  the  following  expression  will
                print  the ratio of columns two and three if (a) there
                are at least three columns, and (b)  column  three  is
                not zero.

9

9                                     ---- 9999 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                  if (N >= 3) & (x3 != 0) then x2/x3 else 'bad line'

                These are the only expressions, besides si or a string
                constant, that can evaluate to a string.  If a
                conditional expression does evaluate to a string, then
                it CANNOT be used in some other expression.  The
                conditional expression is of absolute lowest precedence
                and groups left to right; however, parentheses are
                recommended to make the semantics obvious.
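
                     The guarded ratio example can be sketched in Python
                as follows.  The function name is hypothetical.  Note
                one assumption in particular: Python's "and" stops
                evaluating as soon as the first test fails, whereas this
                manual does not say whether DM's & operator does the
                same.

```python
def ratio_or_message(line):
    """Mimic: if (N >= 3) & (x3 != 0) then x2/x3 else 'bad line'."""
    fields = line.split()
    n = len(fields)  # plays the role of N
    # `and` short-circuits, so fields[2] is read only when n >= 3.
    if n >= 3 and float(fields[2]) != 0:
        return float(fields[1]) / float(fields[2])
    return "bad line"


print(ratio_or_message("1 6 3"))  # 2.0
print(ratio_or_message("1 6 0"))  # bad line
print(ratio_or_message("1 6"))    # bad line
```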

          _5.  _E_x_p_r_e_s_s_i_o_n _S_y_n_t_a_x

                  Arithmetic expressions may be formed using variables
             (with  xi  and  yi) and constants and can be of arbitrary
             complexity.  In the following  table,  unary  and  binary
             operators  are  listed along with their precedences and a
             brief description.  All UNARY operators are prefix except
             string indexing, [], which is postfix.  All binary opera-
             tors are infix.

                  Operators of higher precedence are  executed  first.
             All binary operators are left associative except exponen-
             tiation, which groups to the right.  An operator,  O,  is
             left associative if "xOxOx" is parsed as "(xOx)Ox", while
             one that is right associative is parsed as "xO(xOx)".
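
                  The difference between the two groupings can be made
             concrete with a short Python sketch.  The helper names
             fold_left and fold_right are hypothetical; they simply
             apply an operator to a list of operands using one grouping
             or the other.

```python
from functools import reduce


def fold_left(op, operands):
    """Group as ((a op b) op c), the left-associative reading."""
    return reduce(op, operands)


def fold_right(op, operands):
    """Group as (a op (b op c)), the right-associative reading."""
    result = operands[-1]
    for operand in reversed(operands[:-1]):
        result = op(operand, result)
    return result


def subtract(a, b):
    return a - b


def power(a, b):
    return a ** b


print(fold_left(subtract, [8, 3, 2]))   # (8 - 3) - 2 = 3
print(fold_right(subtract, [8, 3, 2]))  # 8 - (3 - 2) = 7
print(fold_right(power, [2, 3, 2]))     # 2 ^ (3 ^ 2) = 512
```

             Subtraction is read the first way and exponentiation the
             third way under the rules stated above.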

          ____________________________________________________________

                       Table 1: Functions and operators.

          UNARY OPERATORS:
              op   prec  description
              l    10    base e logarithm [log]
              L    10    base 10 logarithm [Log]
              e    10    exponential [exp]
              a    10    absolute value [abs]
              c    10    ceiling (rounds up to next integer) [ceil]
              f    10    floor (rounds down to last integer) [floor]
              #    10    number of characters in string
              []   10    ASCII number of indexed string character
              -    9     change sign
              !    4     logical not

          BINARY OPERATORS:
              op   prec  description
              ^    8     exponentiation
              *    7     multiplication
              /    7     division
              %    7     modulo division
              +    6     addition
              -    6     subtraction
              =    5     test for equality (also !=)
              >    5     test for greater-than (also >=)
              <    5     test for less-than (also <=)
              C    5     substring (also !C)
              &    4     logical AND (also !&)
              |    3     logical OR (also !|)
          ____________________________________________________________



          _6.  _P_r_o_g_r_a_m_m_i_n_g _C_o_n_s_i_d_e_r_a_t_i_o_n_s

                  This section is  only  for  persons  interesting  in
             modifying or compiling DDDDMMMM.

             _6._1.  _P_r_o_g_r_a_m _L_i_m_i_t_a_t_i_o_n_s

                     Several limitations are built into DDDDMMMM,  including
                a  few  reserved numbers as well as limitations of the
                size of problems DDDDMMMM can be  used  for.   The  reserved
                numbers  act as special flags that probably could have
                been implemented another way, however....  If  one  of
                your  expressions  evaluates  to  one of these numbers
                then unexpected results will occur.  The  numbers  are
                all  less than -10e15 so it is unlikely that this will
                be a practical problem.

                The size limitations of the program are:

                     BUFSIZ    maximum length of an input line
                     100       maximum number of input columns
                     31        maximum length of an input column string
                     100       maximum number of numerical constants
                     100       maximum number of output columns

                The number of and length of string constants are  lim-
                ited  by  dynamic  allocation  constraints,  as is the
                potential complexity of expressions.

             _6._2.  _R_e_c_o_m_p_i_l_i_n_g DDDDMMMM

                     The source for DDDDMMMM is in dm.y.  It must be prepro-
                cessed  by  running  yacc  on it, which will produce a
                file called y.tab.c which should be compiled with  the
                ly  and lS libraries on version 6 UNIX and with the lm


                                     ---- 11111111 ----







          DDDDMMMM: A Data Manipulator                   Tutorial and Manual


                library on version 7.



















































9

9                                     ---- 11112222 ----










                                DDDDMMMM: A Data Manipulator
                           Tutorial Introduction and Manual

                                     Gary Perlman
                             Cognitive Science Laboratory
                         University of California, San Diego




                                  Table of Contents



          1   Introduction
          1.1     Introductory Examples
          1.1.1       Column Extraction
          1.1.2       Simple Expressions
          1.1.3       Line Extraction Based on Conditions
          2   Data Types
          2.1     String Data
          2.2     Numerical Data
          2.3     Special Variables
          3   User Interface
          3.1     Expressions
          3.1.1       Argument Mode
          3.1.2       Expression File Mode
          3.1.3       Interactive Mode
          3.2     Input
          3.3     Output
          4   Operations
          4.1     String Operations
          4.2     Numerical Operators
          4.3     Logical Operators
          4.4     Conditional Expressions
          5   Expression Syntax
          6   Programming Considerations
          6.1     Program Limitations
          6.2     Recompiling DM













9

9



