SORTING LIBRARY OF CONGRESS BOOK CALL NUMBERS


                          SORTING LIBRARY OF CONGRESS
                               BOOK CALL NUMBERS

                                  Mike Higgins
                               CETUS Corporation
                              Berkeley, California


                                    ABSTRACT

                    The format of Library of  Congress  book
                    call  numbers  makes  them  difficult to
                    sort on a  computer.   An  algorithm  is
                    described  which transforms call numbers
                    into a form that will sort correctly  in
                    a normal ASCII sort.


                           LIBRARY PROGRAMS AT CETUS

               We at CETUS Corporation have developed  a  computerized
          book  catalog  for  our library.  The system consists of two
          large application programs:  a catalog editing program and a
          book list program.  Both programs use an in-house library of
          file management routines to manipulate a database-like file.
          These  general  file management routines were developed some
          time ago for handling biological  data  at  CETUS,  so  some
          general  utility  programs  were  already  available for the
          catalog file.  Also, the book list program uses  a  multiple
          column RUNOFF-like program to format the output.  Figure   1
          shows all the programs in the system and how they interact.

                               BOOK LIST OUTPUTS

               I wrote the book list program,  which  is  called  LIB.
          This program selects subsets of books from the catalog file,
          sorts  the  subset  by  different  fields  (title,   author,
          subject, etc.), and prints out the books in a format similar
          to a catalog card.  When this  program  was  first  written,
          both  the programmers and librarians assumed that all fields
          could be sorted as normal ASCII  text.   If  this  were  the
          case,  then  the book list program would be fairly simple to
          write, and a general flowchart is given for it in figure  2.
          However,  when  a  call  number sort was first requested, we
          discovered that a normal ASCII sort  would  not  work.   The
          format  of a call number is quite involved.  Some fields are
          treated as alphabetic, some as numeric.  Some numeric fields
          are  treated  as  floating  point  numbers, others should be
          treated as if they were left justified and zero filled,  and
          some  should be right justified.  See figure 3 for a diagram
          of all the  possible  fields  in  a  call  number  and  some
          examples  of  call  numbers  that  we have in our file.  The
          example call numbers are sorted in a normal ASCII  way,  and
          the  column  of  numbers down the left hand side of the page
          shows the order in  which  they  should  have  been  sorted.
          Following is a description of each of the possible fields in
          a call number.

                             CLASSIFICATION NUMBER

               I  designate  the  first  section  the   classification
          number,  even  though  this term is often used for the whole
          call number.  The classification number consists of  one  or
          two  alphabetic  characters  followed by a number with up to
          seven digits.  This numeric section is treated as a floating
          point  number.   There can be a decimal point in the number,
          and it is significant.  For example, 69.2 sorts  after  9.6.
          The classification number reflects the subject of a book, so
          books with the same classification number will have the same
          subject.   In  a  library, books are shelved by call number.
          Once you find a book on the shelf that  you  like,  you  can
          "browse"  through  the nearby books and be assured that they
          will be similar to each other.
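
               The difference between  a  character  comparison  and  a
          floating  point  comparison can be seen in a few lines of
          Python.  This is an illustration only; the variable names and
          the extra sample values are assumptions, not taken from the
          CETUS catalog or the LIB program.

```python
# Illustration only: the numeric part of a classification number,
# compared first as text and then as a floating point value.
numbers = ["69.2", "9.6", "72.6", "301"]

ascii_order = sorted(numbers)             # character-by-character compare
float_order = sorted(numbers, key=float)  # numeric compare

print(ascii_order)   # the order a normal ASCII sort produces
print(float_order)   # the order the books belong in on the shelf
```

          In the ASCII ordering 9.6 lands last, because the  character
          "9" is greater than "6", "7", and "3".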

                                 CUTTER NUMBERS

               The third section of the call number (I'll get  to  the
          second field in a minute) is the Cutter number.  The  format
          of a Cutter number is one alphabetic character, followed  by
          one to four numerals.  These numerals are treated as if they
          were left justified and zero filled.  For example, 692 would
          sort  before  96  instead  of the other way around as in the
          classification number.  The Cutter number is related to  the
          name  of  the  author  of the book.  In fact, the alphabetic
          character in the Cutter number is the first  letter  of  the
          author's name.  When the same author writes several books on
          the same subject, the  Cutter  number  guarantees  that  the
          books will be shelved next to each other in a library.
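
               As a rough sketch of what "left  justified  and  zero
          filled" means, the Python below pads the digits on the right
          and contrasts the result with an integer comparison.  The
          helper name cutter_key and the sample Cutter numbers are
          hypothetical; the width of four comes from the "one to four
          numerals" rule above.

```python
def cutter_key(cutter, width=4):
    """Hypothetical helper: keep the letter, then pad the digits on the
    right with zeros so they compare like a decimal fraction."""
    letter, digits = cutter[0], cutter[1:]
    return letter + digits.ljust(width, "0")

cutters = ["A96", "A692", "A7", "A11"]

# Treating the digits as whole numbers gives the wrong shelf order.
by_integer = sorted(cutters, key=lambda c: (c[0], int(c[1:])))

# Left justified and zero filled: A692 (A6920) sorts before A96 (A9600).
by_cutter = sorted(cutters, key=cutter_key)

print(by_integer)
print(by_cutter)
```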

                           SUB-CLASSIFICATION NUMBERS

               Some call numbers  also  include  a  sub-classification
          number following the classification number.  This number has
          the same format as the Cutter number, making the call number
          appear  at  first  glance  to  have two Cutter numbers.  The
          sub-classification number sorts in the same way as a  Cutter
          number,  but  relates to the book's subject, not its author.
          This is doubly confusing (to a computer  or  to  some  other
          non-librarian)  when  books  that  have a sub-classification
          number are compared to books that do not.  Even  though  the
          Cutter  number  and  the  sub-classification  number are not
          related, you compare the sub-classification numbers  to  the
          Cutter   numbers   in   the   books   that  do  not  have  a
          sub-classification  number.   For  example,  the  two   call
          numbers  QP72.6A11 and QP72.6A12B887 would be stored next to
          each other on the shelf because the author's Cutter  number,
          A11,  in the first book is similar to the sub-classification
          number, A12, of the second book.

                             THE EXTENSION SECTION

               I designate the last section of  the  call  number  the
          extension  section.  When the same author writes two or more
          books on exactly the same subject, the  call  numbers  would
          all  be  the  same,  were it not for the extension area.  In
          this field, the first letter of the title is often  used  to
          tell the books apart.  But if an author writes several books
          on the same subject with the same title, the  book's  volume
          number,  issue  number,  or  both, are used to make the call
          numbers unique.  A publication date is used  to  distinguish
          different  editions of the same book.  Because of the volume
          and issue numbers, even the extension section will not  sort
          properly as is.  These numbers can run up into the hundreds,
          and  so  they  must  be  right  justified  before  they  are
          compared.   For example, VOL 3 must sort before VOL 10, even
          though "3" is greater than "1".
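
               The effect of right justification can be shown with a
          short Python sketch.  The helper name pad_volume, the fixed
          width of three, and the "VOL 21" sample are assumptions made
          for the illustration.

```python
def pad_volume(label, width=3):
    """Hypothetical helper: right justify the number in 'VOL n',
    filling with zeros on the left."""
    word, number = label.split()
    return word + " " + number.rjust(width, "0")

volumes = ["VOL 10", "VOL 3", "VOL 21"]

plain = sorted(volumes)                   # normal ASCII sort
padded = sorted(volumes, key=pad_volume)  # compares 003, 010, 021

print(plain)
print(padded)
```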

                        EXTRA CHARACTERS IN CALL NUMBERS

               Except for the extension section, there are  no  spaces
          in  the  call  number, but there can be extra decimal points
          between  the  classification  numbers,  before  the   Cutter
          number,  or  before  the  extension  section.   Ideally, the
          solution to the call number sort  should  take  these  extra
          characters  into  account.   For  example,  QP67.8.A11.B887A
          should sort next to QP67.8A11B887.B.

                            SOLUTIONS TO THE PROBLEM

               One solution would have been to enter all call  numbers
          in  a  standard  form, but this was unacceptable for several
          reasons:  first, we already had  a  large  number  of  books
          entered in the catalog file;  and second, we wanted to print
          the call numbers in the  output  in  the  normally  accepted
          format.   Another  solution  would  be to write a comparison
          routine that treats  call  numbers  differently  than  other
          fields.   This  is a poor solution because in the process of
          sorting, the same  call  number  is  compared  against  many
          others.   Performing  this complex comparison burns up a lot
          of CPU time.  Also, the LIB program was  designed  to  do  a
          nested  sort,  combining chunks of different fields together
          into one large sort record.  Since the call number could  be
          anywhere in this sort record, the compare routine would have
          to be told somehow which parts of the record to sort one way
          or  another.   The solution used was to add a routine to the
          LIB program which converted call  numbers  to  a  form  that
          would  sort correctly.  As the books are read from the file,
          the call numbers  are  converted  and  stored  in  the  sort
          records  in this "normalized" form.  The call numbers in the
          file remain as they are, so librarians can enter them in the
          usual  way.   The  conversion  process is complex, but it is
          only performed once on each book, so the CPU  overhead  does
          not get out of hand.
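
               The cost argument can be made concrete with a  small
          Python sketch that counts conversions under both approaches.
          Everything here is an assumption made for the illustration:
          the normalize function is only a stand-in (it upper-cases
          the text) and the sample call numbers are invented.

```python
import functools

calls = {"convert": 0}

def normalize(call_number):
    """Stand-in for the real conversion; it only upper-cases the text
    and counts how often it is called."""
    calls["convert"] += 1
    return call_number.upper()

books = ["qp72.6a11", "QD15b3", "qh301c2", "QA76x9", "qp67.8a11"]

# Rejected approach: a special compare routine converts both call
# numbers every time two records are compared.
def compare(a, b):
    ka, kb = normalize(a), normalize(b)
    return (ka > kb) - (ka < kb)

by_compare = sorted(books, key=functools.cmp_to_key(compare))
conversions_in_compare = calls["convert"]

# Adopted approach: convert once per book, store the result in the
# sort record, and let an ordinary sort compare the stored text.
calls["convert"] = 0
records = sorted((normalize(b), b) for b in books)
conversions_up_front = calls["convert"]

print(conversions_in_compare, conversions_up_front)
```

          The second count is always equal to the number of books,
          while the first grows with the number of comparisons the
          sort performs.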

               When the LIB program prints out a call number, or  any
          other  field,  the  text  in  the sort record is usually not
          used.  LIB  dips  back  into  the  file  and  retrieves  the
          unnormalized  call number for printing (see figure 2).  This
          seemingly redundant reading of the catalog file is  actually
          necessary  on  all  other  fields  also.   It is possible to
          request a "truncated" sort on  any  field,  using  only  the
          first  n  characters to make the sort run faster or to allow
          the sort to be  nested  deeper.   You  would  not  want  the
          truncated field to appear in the output.  Also, you will not
          necessarily want to print the same fields that you sort  by.
          Finally,  many  fields  are allowed to have multiple entries
          (like several authors), and all of these must appear in  the
          output, even though only one of them will appear in a single
          sort record.  The only time LIB ever uses text from the sort
          record  is when printing subject headings.  Call numbers are
          specifically designed to be unique  for  each  book,  so  it
          should  never  be  necessary  to  print  them  as  a subject
          heading.  We solved this problem by adding  a  line  to  the
          documentation:   "Don't  ask  for subject headings when you
          sort by call number."

                        DETAILS OF THE CALLNO SUBROUTINE

               A subroutine, called CALLNO,  was  written  to  perform
          this normalizing conversion on call numbers.  The subroutine
          scans a call number twice, once to identify each  field  and
          check  for  correctness,  and  a  second  pass  to  transfer
          characters into the normalized form.  When scanning the call
          number,   CALLNO   first  divides  the  characters  up  into
          alternating areas of alphabetic and numeric text (see figure
          4).   In this process, the decimal point is considered to be
          a numeric character.  For each  area  found  this  way,  the
          following  things  are  recorded:   a  pointer  to the first
          character in the area and the number  of  characters.   Also
          recorded  for  numeric fields is a flag indicating whether a
          decimal point was found  and  how  many  digits  there  were
          before the decimal.  If at any time a non-alpha, non-numeric
          character is  found,  the  subroutine  assumes  that  it  is
          already  in  the middle of the extension section of the call
          number and scanning can stop.  After all  the  pointers  and
          counters  are  set up, they can be checked for legality.  If
          the first alpha area of the call number is longer  than  two
          characters,  or  if  any other alpha area is longer than one
          character,  that  alpha  area  must  be  the  start  of  the
          extension section.  If any of the numeric areas has a  width
          of  zero,  then  the  extension  section  must start at the
          previous alphabetic area.
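
               The first scanning pass might look roughly like the
          Python sketch below.  It is not the RATFOR source; the
          function name scan_areas and the dictionary keys are
          hypothetical, and the legality checks described above are
          left out.

```python
def scan_areas(call_number):
    """Split a call number into alternating alphabetic and numeric
    areas.  Returns (areas, stop), where stop is the index at which
    scanning ended."""
    areas = []
    i = 0
    while i < len(call_number):
        ch = call_number[i]
        if ch.isalpha():
            kind = "alpha"
        elif ch.isdigit() or ch == ".":
            kind = "numeric"      # the decimal point counts as numeric
        else:
            break                 # anything else: extension section
        if not areas or areas[-1]["kind"] != kind:
            # A new area starts: record where it begins.
            areas.append({"kind": kind, "start": i, "length": 0,
                          "has_point": False, "int_digits": 0})
        area = areas[-1]
        area["length"] += 1
        if kind == "numeric":
            if ch == ".":
                area["has_point"] = True
            elif not area["has_point"]:
                area["int_digits"] += 1   # digits before the decimal
        i += 1
    return areas, i

areas, stop = scan_areas("QP72.6A12B887 VOL 3")
for area in areas:
    print(area)
print("scanning stopped at index", stop)
```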

                    CONVERSION OF THE CLASSIFICATION NUMBER

               There is a technique used for  storing  floating  point
          numbers  that  allows  them  to  be  compared  with ordinary
          integer compares.   I  hit  upon  the  idea  of  using  this
          technique on the classification number.  When floating point
          numbers  are  stored  in  this  format,  they   are   called
          "normalized",  and this is primarily why I call the sortable
          format of call numbers "normalized"  also.   Basically,  the
          idea  is  to  convert  the  floating point number to a value
          between zero and one, dividing it by the  appropriate  power
          of ten.  (In most floating point hardware, powers of two are
          used,  of  course).   The  resulting  number  is  called   a
          mantissa,  and the power that you raise ten to is called the
          exponent.  The normalized number is stored with the exponent
          in  the  most  significant  position.   When two numbers are
          compared, a larger exponent "fools" the  integer  comparison
          logic  into  making  the  correct decision.  I am guaranteed
          that call numbers will all  be  positive  numbers  and  have
          positive  exponents,  so  I  don't have to worry about or go
          into what you do with negative  values.   In  practice,  the
          technique  is  much easier than it sounds.  Since the CALLNO
          subroutine already knows the number  of  digits  before  the
          decimal point (the exponent), this number is converted to an
          ASCII numeral and placed  in  the  appropriate  column  (see
          figure   5).   Then  all  the  digits  in  the  number  (the
          mantissa),  are  transferred   into   their   columns   left
          justified,  dropping  any  decimal  points  on the way.  The
          alphabetic section is also transferred left justified to its
          field, completing the classification number.
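
               A minimal Python sketch of this conversion follows.
          The column layout of figure 5 is not reproduced here, so the
          field widths (two columns for the letters, one for the
          exponent, seven for the digits) and the zero fill on the
          right are assumptions, as is the name classification_key.

```python
def classification_key(letters, number):
    """Hypothetical sketch: letters left justified, then the count of
    digits before the decimal point (the exponent), then all digits
    left justified with the decimal points dropped (the mantissa)."""
    whole, _, fraction = number.partition(".")
    exponent = len(whole)
    mantissa = whole + fraction.replace(".", "")
    return letters.ljust(2) + str(exponent) + mantissa.ljust(7, "0")

numbers = ["69.2", "9.6", "301", "72.6"]
ordered = sorted(numbers, key=lambda n: classification_key("QP", n))

print(classification_key("QP", "72.6"))
print(ordered)
```

          Because the exponent occupies the leading column, 9.6 (one
          digit before the decimal) sorts ahead of 69.2 (two digits)
          in a plain character comparison.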

                  CONVERSION OF THE SUB-CLASSIFICATION NUMBER
                               AND CUTTER NUMBER

               Sub-classification  and  Cutter  numbers   present   no
          problem  to  transform.   The single alpha character and the
          numeric  area  are  simply  transferred  as  is.    In   the
          normalized  format,  space  is  reserved  for  two  of these
          fields, but the left most one is always filled first.  If  a
          book  has  only  a Cutter number, this number ends up in the
          first field.  However, if a book has both, the  first  field
          ends  up  containing the sub-classification number, and this
          number is compared against Cutter numbers in books that have
          only  one  field.   This  solved the problem mentioned above
          where the numbers of different type must be compared against
          each other.
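
               The two reserved fields can be sketched in Python as
          follows.  The slot width (one letter plus four digits), the
          zero fill, the blank fill for an unused slot, and the helper
          names are assumptions made for the illustration.

```python
def slot(field, width=4):
    """One letter followed by digits, left justified and zero filled."""
    return field[0] + field[1:].ljust(width, "0")

def cutter_slots(fields, slots=2, width=4):
    """Hypothetical sketch: fill the leftmost slot first and leave any
    unused slot blank."""
    filled = [slot(f, width) for f in fields[:slots]]
    empty = " " * (width + 1)
    return "".join(filled + [empty] * (slots - len(filled)))

# QP72.6A11 has only a Cutter number; QP72.6A12B887 has a
# sub-classification number (A12) followed by a Cutter number (B887).
books = {
    "QP72.6A12B887": ["A12", "B887"],
    "QP72.6A11": ["A11"],
    "QP72.6A12": ["A12"],
}
ordered = sorted(books, key=lambda b: cutter_slots(books[b]))
print(ordered)
```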


                      CONVERSION OF THE EXTENSION SECTION

               Because volume and issue numbers can occur anywhere  in
          the  extension  section,  this  section is transferred as is
          until a numeric character  is  found.   When  this  happens,
          CALLNO  scans forward on the input string to find the length
          of the number.  Only numbers that are less than four  digits
          long  are treated specially.  If the number is small enough,
          the routine scans backwards over any spaces  in  the  output
          string   that  may  have  preceded  the   number,  then  the
          characters  are  transferred  in,  zero  filled  and   right
          justified.
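
               The steps above can be sketched in Python as shown
          below.  This is an approximation of the described behavior,
          not the RATFOR source; the function name and the output
          width of three columns are assumptions.

```python
def normalize_extension(text, width=3):
    """Copy the extension section as is, except that numbers of fewer
    than four digits are right justified and zero filled."""
    out = []
    i = 0
    while i < len(text):
        if text[i].isdigit():
            # Scan forward to find the length of the number.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            digits = text[i:j]
            if len(digits) < 4:
                # Back up over spaces already written to the output.
                while out and out[-1] == " ":
                    out.pop()
                out.append(digits.rjust(width, "0"))
            else:
                out.append(digits)   # e.g. a publication date
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(normalize_extension("VOL 3"))
print(normalize_extension("VOL 10"))
print(normalize_extension("B VOL 12 NO 4"))
print(normalize_extension("1978"))
```

          Backing up over the spaces means that "VOL 3" and "VOL3"
          produce the same normalized text.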

                      RETURNING THE NORMALIZED CALL NUMBER

               It was easier to add the  CALLNO  routine  to  the  LIB
          program if the normalized call number was returned in place.
          For this reason, CALLNO has an internal buffer in which  the
          normalized string is built.  After the last character in the
          extension section has been stored in this buffer, the buffer
          is copied back into the input parameter.

                       AVAILABILITY OF THE CALLNO ROUTINE

               The source for the  CALLNO  routine,  as  well  as  the
          RUNOFF  file for this document, will be submitted to the RSX
          symposium tape.  CALLNO was written  in  RATFOR,  a  FORTRAN
          pre-processor  that  allows  a programmer to use many of the
          structures available in the UNIX programming language called
          "C".   In case you would like to use this subroutine, but do
          not have a RATFOR pre-processor, the FORTRAN source is  also
          included  on  the  tape.  In addition, the source listing of
          the RATFOR version of CALLNO is included in the back of this
          paper (figure 6).