The article below appeared in Key Words, the newsletter of the American Society of Indexers, Volume 5/number 1 (Jan/Feb 1997).


Encyclopedia Sorting Rules

An article in the Nov/Dec 1995 Key Words by Michael Brackney, entitled "On Sorting Index Headings: Quandries and Queries," described in great detail the different methods used to sort entries. The article gave examples that used the punctuation inherent in the headings to sort people against places against titles against inanimate objects. The conclusion was that the punctuation in the text of the headings, when properly mapped for sorting, could result in a consistent, if not ideal, ordering.

Mr. Brackney’s article reminded me of a data conversion project Leverage Technologies undertook for P.F. Collier in New York, NY. (I’d like to thank P.F. Collier for allowing me to use their project and data as examples in this article.) P.F. Collier’s electronic data was typographically coded and, literally, manually placed on the page; they needed it transformed into data that could be loaded into CINDEX (Indexing Research's index preparation software). The conversion had several distinct objectives: (1) change the style codes (bold, italics, special characters, etc.) to CINDEX equivalents; (2) build the index entries from the indented structure; (3) remove forced line breaks and other hard-coded pagination information; and (4) insert codes that would cause the entries to sort properly. "Properly" here means the traditional encyclopedia ordering.

Before going further, I should add a caveat: I’m not a publisher, an editor, or an indexer. This article is written from my perspective as a computer consultant to indexers and publishers. I hope, however, that a glimpse into this specialized area of indexing and sorting might be applicable to indexers in other situations.

At this point I should also describe the size of the task. The index in question is for Collier’s Encyclopedia. The CE consists of 23 volumes of articles, maps, and illustrations, plus Volume 24, which contains an extensive (200-page) categorized bibliography and the index. The bibliographic citations are also indexed, and their locators include an ordinal number within each category as well as the page number. The index contains over 470,000 entries. In printed form the index is 828 8.5x11-inch pages set in 5 columns of 5.5-point type.

In most smaller indexes, if two or more entries don’t sort properly on their own, their sequence can be altered by setting global sorting parameters, or by manually inserting coding that forces them into the proper order. CINDEX gives users two coding methods: one, <>, inhibits text from being used in sorting, and the other, {}, adds text that is sorted but not printed. Uncoded text is both sorted and printed. Below are some examples.

heading                                 sorts as

American Society of Indexers            american society of indexers
<The >National Geographic Society       national geographic society
{three}3M Corporation                   three m corporation
{saint}<St.> Joseph Church              saint joseph church
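The two coding methods described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article (text in <> is printed but not sorted, text in {} is sorted but not printed); it is not CINDEX's actual implementation. The function names sort_key and printed_form are my own, and CINDEX's handling of digits and punctuation is not modelled.

```python
import re

def sort_key(coded: str) -> str:
    """Text used for sorting: drop <...> spans, keep the text inside {...}."""
    text = re.sub(r"<[^>]*>", "", coded)      # <> text is printed but not sorted
    text = re.sub(r"[{}]", "", text)          # {} text is sorted, so keep its contents
    return " ".join(text.lower().split())     # fold case and collapse whitespace

def printed_form(coded: str) -> str:
    """Text that appears in print: drop {...} spans, keep the text inside <...>."""
    text = re.sub(r"\{[^}]*\}", "", coded)    # {} text is sorted but not printed
    return re.sub(r"[<>]", "", text)          # <> text is printed, so keep its contents

for coded in ("American Society of Indexers",
              "<The >National Geographic Society",
              "{saint}<St.> Joseph Church"):
    print(f"{printed_form(coded)!r:38} sorts as {sort_key(coded)!r}")
```

Uncoded text passes through both functions unchanged apart from case folding, which matches the statement that uncoded text is both sorted and printed.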

Given the size of the Collier’s Encyclopedia index, a manual review of the sort order of all the entries was cost prohibitive. Further, since only a percentage of articles are revised or added for each annual edition, we wanted to develop sort sequence coding that could be applied by a common set of rules to each entry without having to review each entry in alphabetical context. This allows some of the index maintenance to be performed by freelancers, yet lets the entries merge into alphabetical sequence without review.

Initially the goal was to program the insertion of sort sequences during conversion to avoid costly manual work. Because of ambiguities in the wording of headings, it turned out not to be possible to sequence every entry correctly. But most of them could be, and entries that the program couldn’t decide how to code from context were flagged for manual review.

The hierarchy

The desired sequence of entries was set by the publisher from long-standing rules developed over years. It follows a structured hierarchy that first ranks headings that begin alike; if two such entries are at the same level within the hierarchy, they are sorted alphabetically. [The encyclopedia uses letter-by-letter rather than the word-by-word sorting which Mr. Brackney’s article considered more natural.] The broadest levels or divisions of the hierarchy consist of persons, places, and things. The first two divisions are broken down further, while things are not. The persons division is divided next into religious personalities, nobility, and commoners. Both of the first two divisions at this level are further structured as shown below.

religious personalities: saint, pope/antipope, cardinal, archbishop, bishop, etc.

nobility: emperor/empress, king/queen, prince/princess, etc.

The places division is divided into two sections: political and geographical. The political grouping is further subdivided into, for example, empire, country, state, territory, etc.
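For readers less familiar with the difference between the letter-by-letter ordering the encyclopedia uses and word-by-word ordering, here is a minimal Python sketch. The three headings are my own illustrations, not entries from the Collier’s index, and the key functions are simplified assumptions: real letter-by-letter rules usually treat some punctuation specially, which is not modelled here.

```python
import re

def letter_by_letter(heading: str) -> str:
    """Sort key that ignores spaces and punctuation entirely."""
    return re.sub(r"[^a-z0-9]", "", heading.lower())

def word_by_word(heading: str) -> list[str]:
    """Sort key that compares whole words, so a word break sorts before any letter."""
    return re.findall(r"[a-z0-9]+", heading.lower())

# Illustrative headings only (not taken from the encyclopedia's index).
headings = ["New York", "Newark", "New Jersey"]
print(sorted(headings, key=letter_by_letter))  # ['Newark', 'New Jersey', 'New York']
print(sorted(headings, key=word_by_word))      # ['New Jersey', 'New York', 'Newark']
```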

Each level in the hierarchy is assigned a combination of numbers, and sometimes a letter, as the code to force the proper sequencing. For example, a thing is simply 3 (coded {3} in CINDEX). A duke would be {1.3k}, a lake {2.2}, and a province {2.1f}. An excerpt of the hierarchy appears here:

Identifying heading types and adding special coding

There were two aspects to inserting the sort coding sequences: determining the level in the hierarchy (e.g., persons, places, or things) from the main heading's text, to know which sequence to use, and identifying where in the main heading this sequence should be inserted. For both aspects it was necessary to scan the heading for keywords and typography (e.g., italics or bold). Typography could indicate the division of the entry or where special coding should be placed.

For persons and places, most level determinations were made on the basis of a keyword list provided by the publisher. The keywords appeared in the main headings following a comma and space or within parentheses, e.g., "Adams, Henry (hist.)" or "Adams, co., Miss.". If the heading included "king" or "hist." (historian) then it was a person; if it had "town" or "co." (county), it was a place. The list, however, had inherent ambiguities: "emp." could be emperor/empress (person), or empire (place). Any heading that could not be identified as a person or place was coded as a thing.
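The keyword test can be sketched roughly as follows. This is not the conversion program itself: the keyword sets contain only the handful of keywords quoted above (the publisher's full list is not reproduced here), the function names are hypothetical, and the sketch simply flags every ambiguous keyword for review, whereas the real program tried to decide from context first.

```python
import re

# Only the keywords quoted in the article; the publisher's full list is much longer.
PERSON_KEYWORDS = {"king", "hist."}
PLACE_KEYWORDS = {"town", "co."}
AMBIGUOUS_KEYWORDS = {"emp."}   # emperor/empress (person) or empire (place)

def candidate_keywords(heading: str) -> list[str]:
    """Collect text in parentheses plus each segment that follows a comma and space."""
    in_parens = re.findall(r"\(([^)]*)\)", heading)
    after_commas = heading.split(", ")[1:]
    return [c.strip() for c in in_parens + after_commas]

def classify(heading: str) -> str:
    """Return 'person', 'place', 'thing', or 'review' for a main heading."""
    found = set(candidate_keywords(heading))
    is_person = bool(found & PERSON_KEYWORDS)
    is_place = bool(found & PLACE_KEYWORDS)
    if found & AMBIGUOUS_KEYWORDS or (is_person and is_place):
        return "review"          # simplification: flag for manual review rather than guess
    if is_person:
        return "person"
    if is_place:
        return "place"
    return "thing"               # anything unidentified is coded as a thing

print(classify("Adams, Henry (hist.)"))  # person
print(classify("Adams, co., Miss."))     # place
```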

The sequence placement rules depended on the highest division in which the heading was classified. For persons, the insertion fell generally at the end of the last, or only, name, e.g., "Franklin{1.4}, Benjamin" or "Socrates{1.4}". For names such as "Leopold I", "Leopold II", "Leopold X", etc., the insertion had to precede the roman numerals, e.g., "Leopold{1.3a} I".
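A minimal Python sketch of the placement rule for persons, using the three headings quoted above. The function name insert_person_code and the regular expression are my own assumptions; the sketch treats everything before the first comma as the name and would mistake any trailing capital I, V, X, L, C, D, or M for a roman numeral, which is the kind of case that would need review.

```python
import re

# Trailing roman numeral on a name, e.g. the " I" in "Leopold I".
ROMAN_SUFFIX = re.compile(r"^(?P<name>.*\S)(?P<numeral>\s+[IVXLCDM]+)$")

def insert_person_code(heading: str, code: str) -> str:
    """Place {code} at the end of the last (or only) name, before any roman numeral."""
    name, comma, rest = heading.partition(",")
    match = ROMAN_SUFFIX.match(name)
    if match:
        name, numeral = match.group("name"), match.group("numeral")
    else:
        numeral = ""
    return f"{name}{{{code}}}{numeral}{comma}{rest}"

print(insert_person_code("Franklin, Benjamin", "1.4"))  # Franklin{1.4}, Benjamin
print(insert_person_code("Socrates", "1.4"))            # Socrates{1.4}
print(insert_person_code("Leopold I", "1.3a"))          # Leopold{1.3a} I
```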

For places, the insertion preceded the comma connected to the identifying keyword (lake, mtn., etc.). This was necessary because commas could be used as part of the name itself.
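The rule for places might be sketched like this. The function name and both sample headings are hypothetical (they are not taken from the Collier’s index), and the sketch assumes the identifying keyword has already been determined; only the {2.2} code for a lake comes from this article.

```python
import re

def insert_place_code(heading: str, keyword: str, code: str) -> str:
    """Place {code} just before the comma that introduces the identifying keyword."""
    pattern = re.compile(r",\s+" + re.escape(keyword) + r"(?=[\s,)]|$)")
    match = pattern.search(heading)
    if match is None:
        return heading           # keyword not found; leave the heading for review
    return f"{heading[:match.start()]}{{{code}}}{heading[match.start():]}"

# Hypothetical headings, for illustration only.
print(insert_place_code("Superior, lake", "lake", "2.2"))
# Superior{2.2}, lake
print(insert_place_code("Woods, Lake of the, lake", "lake", "2.2"))
# Woods, Lake of the{2.2}, lake
```

The second heading shows why the comma tied to the keyword matters: an earlier comma that is part of the name is left alone.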

Here are some actual headings as converted:

In the final conversion, everything fell into place rather well and the data came through very clean. Because the entries had previously been placed in the desired order manually, some cleanup remained: for instance, getting roman numeral nines (ix) to sort properly (after viii), as you would need to do in any index. The index has now been produced completely through CINDEX, and work has begun on the second edition under the new system.
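The roman numeral problem is easy to see in a short Python sketch. Sorted as plain text, "ix" lands before "v" and "viii"; sorted by numeric value, it falls after "viii". This illustrates the general issue only, not how the cleanup was actually done in CINDEX, and the function name roman_to_int is my own.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a roman numeral such as 'ix' to its integer value."""
    total = 0
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4, ix = 9).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

numerals = ["ix", "v", "viii", "x", "iv", "i"]
print(sorted(numerals))                    # ['i', 'iv', 'ix', 'v', 'viii', 'x']
print(sorted(numerals, key=roman_to_int))  # ['i', 'iv', 'v', 'viii', 'ix', 'x']
```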

Summary

It is clear that sorting entries for encyclopedias is complex but can be accomplished using off-the-shelf software, such as CINDEX, with a classification scheme applied to each entry. This approach may be helpful to indexers who face similar complexity in sections of the indexes on which they work.


------------------------------------------------------------------------------------------------------------------------

David K. Ream is president of Leverage Technologies, Inc., which provides computer consulting and programming for indexers and the publishing industry. LevTech is also the corporate & government account representative for CINDEX, a product of Indexing Research.