SORTING LIBRARY OF CONGRESS BOOK CALL NUMBERS

Mike Higgins
CETUS Corporation
Berkeley, California

ABSTRACT

The format of Library of Congress book call numbers makes them difficult to sort on a computer. An algorithm is described which transforms call numbers into a form that will sort correctly in a normal ASCII sort.

LIBRARY PROGRAMS AT CETUS

We at CETUS Corporation have developed a computerized book catalog for our library. The system consists of two large application programs: a catalog editing program and a book list program. Both programs use an in-house library of file management routines to manipulate a database-like file. These general file management routines were developed some time ago for handling biological data at CETUS, so some general utility programs were already available for the catalog file. Also, the book list program uses a multiple-column RUNOFF-like program to format the output. Figure 1 shows all the programs in the system and how they interact.

BOOK LIST OUTPUTS

I wrote the book list program, which is called LIB. This program selects subsets of books from the catalog file, sorts the subset by different fields (title, author, subject, etc.), and prints out the books in a format similar to a catalog card. When this program was first written, both the programmers and the librarians assumed that all fields could be sorted as normal ASCII text. If this were the case, the book list program would be fairly simple to write; a general flowchart for it is given in figure 2. However, when a call number sort was first requested, we discovered that a normal ASCII sort would not work. The format of a call number is quite involved. Some fields are treated as alphabetic, some as numeric.
Some numeric fields are treated as floating point numbers, others should be treated as if they were left justified and zero filled, and some should be right justified. See figure 3 for a diagram of all the possible fields in a call number and some examples of call numbers that we have in our file. The example call numbers are sorted in a normal ASCII way, and the column of numbers down the left hand side of the page shows the order in which they should have been sorted. Following is a description of each of the possible fields in a call number.

CLASSIFICATION NUMBER

I designate the first section the classification number, even though this term is often used for the whole call number. The classification number consists of one or two alphabetic characters followed by a number with up to seven digits. This numeric section is treated as a floating point number. There can be a decimal point in the number, and it is significant. For example, 69.2 sorts after 9.6, even though an ASCII comparison would put "69.2" first. The classification number reflects the subject of a book, so books with the same classification number will have the same subject. In a library, books are shelved by call number. Once you find a book on the shelf that you like, you can "browse" through the nearby books and be assured that they will be similar to it.

CUTTER NUMBERS

The third section of the call number (I'll get to the second field in a minute) is the Cutter number. The format of a Cutter number is one alphabetic character, followed by one to four numerals. These numerals are treated as if they were left justified and zero filled. For example, 692 would sort before 96, instead of the other way around as in the classification number. The Cutter number is related to the name of the author of the book. In fact, the alphabetic character in the Cutter number is the first letter of the author's name.
When the same author writes several books on the same subject, the Cutter number guarantees that the books will be shelved next to each other in a library.

SUB-CLASSIFICATION NUMBERS

Some call numbers also include a sub-classification number following the classification number. This number has the same format as the Cutter number, making the call number appear at first glance to have two Cutter numbers. The sub-classification number sorts in the same way as a Cutter number, but relates to the book's subject, not its author. This is doubly confusing (to a computer or to some other non-librarian) when books that have a sub-classification number are compared to books that do not. Even though the Cutter number and the sub-classification number are not related, you compare the sub-classification numbers to the Cutter numbers in the books that do not have a sub-classification number. For example, the two call numbers QP72.6A11 and QP72.6A12B887 would be shelved next to each other because the author's Cutter number, A11, in the first book is similar to the sub-classification number, A12, of the second book.

THE EXTENSION SECTION

I designate the last section of the call number the extension section. When the same author writes two or more books on exactly the same subject, the call numbers would all be the same, were it not for the extension section. In this field, the first letter of the title is often used to tell the books apart. But if an author writes several books on the same subject with the same title, the book's volume number, issue number, or both are used to make the call numbers unique. A publication date is used to distinguish different editions of the same book. Because of the volume and issue numbers, even the extension section will not sort properly as is. These numbers can run up into the hundreds, and so they must be right justified before they are compared.
For example, VOL 3 must sort before VOL 10, even though "3" is greater than "1".

EXTRA CHARACTERS IN CALL NUMBERS

Except for the extension section, there are no spaces in the call number, but there can be extra decimal points between the classification and sub-classification numbers, before the Cutter number, or before the extension section. Ideally, the solution to the call number sort should take these extra characters into account. For example, QP67.8.A11.B887A should sort next to QP67.8A11B887.B.

SOLUTIONS TO THE PROBLEM

One solution would have been to enter all call numbers in a standard form, but this was unacceptable for two reasons: first, we already had a large number of books entered in the catalog file; and second, we wanted to print the call numbers in the output in the normally accepted format.

Another solution would be to write a comparison routine that treats call numbers differently from other fields. This is a poor solution because in the process of sorting, the same call number is compared against many others. Performing this complex comparison each time burns up a lot of CPU time. Also, the LIB program was designed to do a nested sort, combining chunks of different fields together into one large sort record. Since the call number could be anywhere in this sort record, the compare routine would have to be told somehow which parts of the record to compare in which way.

The solution used was to add a routine to the LIB program which converts call numbers to a form that sorts correctly. As the books are read from the file, the call numbers are converted and stored in the sort records in this "normalized" form. The call numbers in the file remain as they are, so librarians can enter them in the usual way. The conversion process is complex, but it is only performed once on each book, so the CPU overhead does not get out of hand.
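
The "convert once, then sort as ordinary text" approach can be illustrated with a small Python sketch. This is not the LIB program (which is not shown in this paper); the names `pad_numbers` and `sort_with_converted_keys` are hypothetical, and zero-filling digit runs to three columns is only a stand-in for the real conversion.

```python
import re

converted = []  # records every conversion, to show each book is converted once


def pad_numbers(text, width=3):
    """Hypothetical stand-in for the real conversion: zero-fill digit runs."""
    converted.append(text)
    return re.sub(r"\d+", lambda m: m.group().zfill(width), text)


def sort_with_converted_keys(records, convert):
    # Build one sort record per book: (converted key, original text).
    sort_records = [(convert(r), r) for r in records]
    # The sort itself uses nothing but ordinary string comparison.
    sort_records.sort()
    # Output comes from the original text, never from the converted key.
    return [original for _, original in sort_records]


volumes = ["VOL 10", "VOL 3", "VOL 21", "VOL 100"]
print(sorted(volumes))                                  # plain ASCII order
print(sort_with_converted_keys(volumes, pad_numbers))   # shelf order
print(len(converted))                                   # one conversion per record
```

The number of conversions equals the number of records, whereas a special comparison routine would redo the same work on every comparison the sort performs.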
When the LIB program prints out a call number, or any other field, the text in the sort record is usually not used. LIB dips back into the file and retrieves the unnormalized call number for printing (see figure 2). This seemingly redundant reading of the catalog file is actually necessary for all other fields as well. It is possible to request a "truncated" sort on any field, using only the first n characters to make the sort run faster or to allow the sort to be nested deeper. You would not want the truncated field to appear in the output. Also, you will not necessarily want to print the same fields that you sort by. Finally, many fields are allowed to have multiple entries (like several authors), and all of these must appear in the output, even though only one of them will appear in a single sort record.

The only time LIB ever uses text from the sort record is when printing subject headings. Call numbers are specifically designed to be unique for each book, so it should never be necessary to print them as a subject heading. We solved this problem by adding a line to the documentation: "Don't ask for subject headings when you sort by call number."

DETAILS OF THE CALLNO SUBROUTINE

A subroutine, called CALLNO, was written to perform this normalizing conversion on call numbers. The subroutine scans a call number twice: once to identify each field and check for correctness, and a second time to transfer characters into the normalized form.

When scanning the call number, CALLNO first divides the characters up into alternating areas of alphabetic and numeric text (see figure 4). In this process, the decimal point is considered to be a numeric character. For each area found this way, two things are recorded: a pointer to the first character in the area and the number of characters. Also recorded for numeric areas are a flag indicating whether a decimal point was found and a count of the digits before the decimal point.
If at any time a non-alpha, non-numeric character is found, the subroutine assumes that it is already in the middle of the extension section of the call number, and scanning can stop.

After all the pointers and counters are set up, they can be checked for legality. If the length of the first alpha area of the call number is greater than two, or if the length of any other alpha area is greater than one, that alpha area must be the start of the extension section. If any of the numeric areas has a width of zero, then the extension section must start at the previous alphabetic area.

CONVERSION OF THE CLASSIFICATION NUMBER

There is a technique for storing floating point numbers that allows them to be compared with ordinary integer compares. I hit upon the idea of using this technique on the classification number. When floating point numbers are stored in this format, they are called "normalized", and this is primarily why I call the sortable format of call numbers "normalized" also. Basically, the idea is to convert the floating point number to a value between zero and one by dividing it by the appropriate power of ten. (In most floating point hardware, powers of two are used, of course.) The resulting number is called the mantissa, and the power to which ten is raised is called the exponent. The normalized number is stored with the exponent in the most significant position. When two numbers are compared, a larger exponent "fools" the integer comparison logic into making the correct decision. I am guaranteed that classification numbers will all be positive and have positive exponents, so I do not have to go into what is done with negative values.

In practice, the technique is much easier than it sounds. Since the CALLNO subroutine already knows the number of digits before the decimal point (the exponent), this number is converted to an ASCII numeral and placed in the appropriate column (see figure 5).
Then all the digits in the number (the mantissa) are transferred into their columns left justified, dropping any decimal points on the way. The alphabetic section is also transferred left justified to its field, completing the classification number.

CONVERSION OF THE SUB-CLASSIFICATION NUMBER AND CUTTER NUMBER

Sub-classification and Cutter numbers present no problem to transform. The single alpha character and the numeric area are simply transferred as is. In the normalized format, space is reserved for two of these fields, but the leftmost one is always filled first. If a book has only a Cutter number, this number ends up in the first field. However, if a book has both, the first field ends up containing the sub-classification number, and this number is compared against Cutter numbers in books that have only one field. This solves the problem mentioned above, where numbers of different types must be compared against each other.

CONVERSION OF THE EXTENSION SECTION

Because volume and issue numbers can occur anywhere in the extension section, this section is transferred as is until a numeric character is found. When this happens, CALLNO scans forward on the input string to find the length of the number. Only numbers that are less than four digits long are treated specially. If the number is small enough, the routine scans backwards over any spaces in the output string that may have preceded the number, then the characters are transferred in, zero filled and right justified.

RETURNING THE NORMALIZED CALL NUMBER

It was easier to add the CALLNO routine to the LIB program if the normalized call number was returned in place. For this reason, CALLNO has an internal buffer in which the normalized string is built. After the last character in the extension section has been stored in this buffer, the buffer is copied back into the input parameter.
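
The scanning and conversion steps described in the preceding sections can be sketched in Python. This is an illustration under stated assumptions, not the RATFOR source of CALLNO: the column widths are guesses (figure 5 is not reproduced here), unused mantissa and Cutter columns are assumed to be zero filled, input is folded to upper case, over-long numbers are not checked, and the result is returned rather than written back in place. The names `scan`, `normalize_extension`, and `normalize` are hypothetical.

```python
import re

# Assumed column widths; the actual layout is the one shown in figure 5.
ALPHA_W, MANTISSA_W, CUTTER_DIGITS_W, VOLUME_W = 2, 7, 4, 3
AREA = re.compile(r"[A-Z]+|[0-9.]+")   # the decimal point counts as numeric


def scan(call_number):
    """Split the front of a call number into alternating alpha/numeric areas.

    Scanning stops at the first character that is not a letter, digit, or
    decimal point; everything from there on is returned as `rest`.
    """
    areas, pos = [], 0
    while (m := AREA.match(call_number, pos)):
        areas.append(m.group())
        pos = m.end()
    return areas, call_number[pos:]


def normalize_extension(ext):
    """Right justify and zero fill numbers shorter than four digits."""
    def fix(m):
        spaces, number = m.groups()
        if len(number) < 4:
            return number.zfill(VOLUME_W)   # back up over preceding spaces
        return spaces + number              # longer numbers pass through as is
    return re.sub(r"( *)(\d+)", fix, ext)


def normalize(call_number):
    """Return a string that sorts correctly under plain string comparison."""
    areas, rest = scan(call_number.upper())
    if len(areas) < 2 or not areas[0].isalpha() or len(areas[0]) > ALPHA_W:
        raise ValueError(f"no classification number in {call_number!r}")
    letters, number = areas[0], areas[1]
    exponent = len(number.split(".")[0])    # digits before the decimal point
    mantissa = number.replace(".", "")      # all digits, decimal points dropped
    out = letters.ljust(ALPHA_W) + str(exponent) + mantissa.ljust(MANTISSA_W, "0")

    # Up to two "letter + digits" fields; the leftmost slot is filled first.
    fields, i = [], 2
    while len(fields) < 2 and i + 1 < len(areas) and len(areas[i]) == 1:
        digits = areas[i + 1].replace(".", "")
        fields.append(areas[i] + digits.ljust(CUTTER_DIGITS_W, "0"))
        i += 2
    out += "".join(fields).ljust(2 * (1 + CUTTER_DIGITS_W))

    # Whatever is left over is the extension section.
    return out + normalize_extension("".join(areas[i:]) + rest)


books = ["QP69.2A1", "QP9.6A96", "QP9.6A1 VOL 10", "QP9.6A692",
         "QP9.6A1 VOL 3", "QP9.6A1"]
for book in sorted(books, key=normalize):
    print(repr(normalize(book)), book)
```

With these assumptions the sample list comes out as QP9.6A1, QP9.6A1 VOL 3, QP9.6A1 VOL 10, QP9.6A692, QP9.6A96, QP69.2A1, which matches the rules given earlier: 9.6 before 69.2, Cutter 692 before 96, and VOL 3 before VOL 10.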

AVAILABILITY OF THE CALLNO ROUTINE

The source for the CALLNO routine, as well as the RUNOFF file for this document, will be submitted to the RSX symposium tape. CALLNO was written in RATFOR, a FORTRAN pre-processor that allows a programmer to use many of the structures available in the UNIX programming language called "C". In case you would like to use this subroutine, but do not have a RATFOR pre-processor, the FORTRAN source is also included on the tape. In addition, the source listing of the RATFOR version of CALLNO is included in the back of this paper (figure 6).