Indexers and publishers employ various software in their workflow. Situations can arise that require index data to be moved between, or processed by, different software. Index data can be rich. Beyond the basic ASCII characters, it can include Unicode (accented letters, symbols, punctuation, Greek, Arabic, etc.), applied styles (bold, italics, etc.), super/subscripts, font changes, and sorting instructions. Index data can be thought of as a set of records, with each record representing an index entry. Records can carry date and time stamps, labels or flags, etc.
Because of this richness, transferring index data between software applications using the historically available file formats (.dat, .tab, etc.) was unreliable and lossy. Only very simple data could be transferred intact.
This led to a multi-company effort to develop a new format that addressed all of these issues. The result was the Index XML, or IXML, format, which is based on a DTD. Currently these companies have software that uses this format:
- Indexing Research – CINDEX v3.0
- Sky Software – Sky Index v7.0
- Leverage Technologies – CINDEX Publishers' Edition v3.0, EntryExpander, FlipNames, FootnoteRanges, H/Check, Index/Check, IXMLembedder, Mapper, PageAdjuster, SumDex, Xelect
- Barry Campbell's IndexConvert
- abavo's SmartIndex
Several other companies have been informed of this format and will be added to the list when they accommodate it.
The purpose of this format is simply to encode index data for transfer. It is not intended to be an XML format that would support composition of an index. The various stand-alone indexing and other software have other methods of providing that capability. In fact, publication XML is often specific, and possibly proprietary, to a publisher and cannot be provided via a standard, even a de jure standard. As a data exchange format, an IXML file does not contain any settings for the index as a whole, for example, sorting and formatting settings.
Note: when software importing an IXML file encounters an element or attribute that is unknown to it or unsupported by it, it is up to the software to ignore it or report it. Syntax or validation errors should be reported to the user. Similarly, if the importing software does not handle Unicode but encounters any characters with code points greater than 255, these should be reported to the user and/or converted to a default character or string.
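The character check described in the note could be sketched as follows. This is only an illustration, not part of the IXML definition: the function name check_non_latin1, the default replacement "?", and the shape of the report are all assumptions made for the example.

```python
# Illustrative sketch: names, default replacement, and report layout are assumptions.
def check_non_latin1(text, replacement="?"):
    """Replace characters above code point 255 and report each one found."""
    report = []   # (position, character, code point label) for each hit
    out = []
    for pos, ch in enumerate(text):
        if ord(ch) > 255:
            report.append((pos, ch, f"U+{ord(ch):04X}"))
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out), report


# "o with diaeresis" (U+00F6) is below 256 and is kept; alpha (U+03B1) is not.
converted, report = check_non_latin1("G\u00f6del, \u03b1-decay")
```

An importer could show the report to the user, apply the converted text, or do both, as the note allows.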
The following summary is provided here but the DTD should be reviewed for the complete details of attributes used.
An IXML file repeats the DTD at the top of the file, followed by the encoded index data. The first elements in an index should be the source element and the fonts element. The former identifies the creating software, its version, and the time of creation. The latter lists the font names used in the index and associates each with an id number that is used in the index data if and when font changes occur. At least one font should be specified.
Each entry is represented by a record element which contains two or more fields. Each record should contain at least one heading (the main heading), optionally any subheadings, and lastly, the locator field. Each field is represented by the field element.
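As a rough sketch of this record structure, the following Python reads record and field elements and separates the headings from the locator. Several things here are assumptions for illustration: the wrapper element name records, the sample data, and the choice to identify the locator by position (last field) rather than by its attribute, whose name should be taken from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <records> wrapper and the entries are made up.
SAMPLE = """<records>
  <record><field>cats</field><field>breeds</field><field>12</field></record>
  <record><field>dogs</field><field>45</field></record>
</records>"""


def read_records(xml_text):
    """Return one dict per record: main heading, subheadings, locator."""
    root = ET.fromstring(xml_text)
    entries = []
    for record in root.iter("record"):
        # itertext() flattens any markup inside a field to plain text.
        fields = ["".join(f.itertext()) for f in record.findall("field")]
        if len(fields) < 2:
            raise ValueError("a record needs a main heading and a locator field")
        *headings, locator = fields  # assumption: the locator is the last field
        entries.append({
            "main": headings[0],
            "subheadings": headings[1:],
            "locator": locator,
        })
    return entries


entries = read_records(SAMPLE)
```

A real importer would also read the source and fonts elements and keep the markup inside each field instead of flattening it.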
Data in a field can consist of
- Unicode using UTF-8 encoding
- character entity references
- text elements for changes of style, fonts, colors, small caps, and baseline offset (super/subscripts) via attributes
- sorting elements: literal, hide, sort
The text element is not a container element. Think of it more as representing a state space change rather than a pair of tags enclosing data. Each attribute sets the appropriate style setting. If an attribute is not repeated, then the setting for that attribute is set back to the default. A text element with no attributes resets all settings to normal. For example,
<field><text style="b"/>Main Bold <text style="bi"/>Italics <text style="biu"/> Underlined <text/>Plain</field>
would represent text that would display as
Main Bold Italics Underlined Plain
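One way to read the text element as a change of state is to walk the field and pair each stretch of character data with the attributes of the most recent text element. The sketch below does this with Python's standard ElementTree module. The function name styled_runs is an assumption, and the content of other child elements (hide, sort) is deliberately skipped to keep the example short.

```python
import xml.etree.ElementTree as ET


def styled_runs(field_xml):
    """Return (attributes, text) pairs, one for each run of character data."""
    field = ET.fromstring(field_xml)
    runs = []
    current = {}                      # no attributes means all settings normal
    if field.text:
        runs.append((current, field.text))
    for child in field:
        if child.tag == "text":
            # Each text element replaces the whole state: attributes that are
            # not repeated fall back to their defaults.
            current = dict(child.attrib)
        if child.tail:                # character data following the element
            runs.append((current, child.tail))
    return runs


runs = styled_runs(
    '<field><text style="b"/>Main Bold <text style="bi"/>Italics '
    '<text style="biu"/> Underlined <text/>Plain</field>'
)
```

For the field shown above, this yields four runs with the styles b, bi, biu, and none, in that order.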
The literal element is also not a container. It should precede a character that is not usually sorted on, for instance a punctuation character in a list of Symbols at the beginning of an index. The hide element encloses text that should display but should not be used during sorting. The sort element encloses text that is used when sorting but is not displayed. For example,
<field><hide>The </hide>Origin of the Species</field>
The first example shows how to sort the heading in the Os rather than the Ts. The second example shows how to sort the heading in the Fs. NOTE: Different software will import these elements and represent them internally in different ways.
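The effect of the hide and sort elements can be sketched by deriving two strings from a field: the text that displays and the text that is sorted on. This is an illustration only. The function name is an assumption, the sort text is assumed to take part in sorting at the position where it appears, and the "St. Louis" input is a made-up sample rather than one taken from the IXML documentation. The literal element is treated as an empty marker and not interpreted.

```python
import xml.etree.ElementTree as ET


def display_and_sort_key(field_xml):
    """Return (displayed text, text used for sorting) for one field."""
    field = ET.fromstring(field_xml)
    shown, key = [], []

    def add(text, display, sort):
        if text:
            if display:
                shown.append(text)
            if sort:
                key.append(text)

    add(field.text, True, True)
    for child in field:
        inner = "".join(child.itertext())
        if child.tag == "hide":       # displayed, but ignored when sorting
            add(inner, True, False)
        elif child.tag == "sort":     # used when sorting, but never displayed
            add(inner, False, True)
        # text and literal are empty markers, so there is no content to collect
        add(child.tail, True, True)
    return "".join(shown), "".join(key)


origin = display_and_sort_key("<field><hide>The </hide>Origin of the Species</field>")
# Hypothetical input combining both elements:
louis = display_and_sort_key("<field><sort>Saint </sort><hide>St. </hide>Louis</field>")
```

Here origin displays as "The Origin of the Species" and sorts as "Origin of the Species", while the hypothetical louis field displays as "St. Louis" and sorts as "Saint Louis".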
The last field in a record is the locator (also specified by an attribute). This field would contain the page number, section number, or other citation to the text. It is also where cross references would be encoded, usually beginning with lead-in words such as See or See also. It can be empty, for instance in a set of records that defines a heading structure or in a record that is an editorial note. The locator field could also contain electronic IDs of some kind, used to embed entries in XML or to serve as links in an electronic index. NOTE: XML never puts rules on content. Some indexing software allows multiple locators to be present in the locator field, usually separated by a comma, although a semicolon or other character may also be used. If the software that will import the IXML cannot split up locator lists, it is up to the users exchanging data to provide an IXML file with a single locator in this field.
A version of the index to the IDPF EPUB Indexes Specification is provided. It contains some of the features discussed above, such as sort sequences, an editorial note, and style changes.
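Where the importing software cannot split locator lists, the exporting side could expand each record into one record per locator before writing the file. The sketch below shows the idea on plain strings. The function names and the tuple layout are assumptions, and the code does not try to recognize cross references (for example, text beginning with See also), which may contain commas that must not be split.

```python
def split_locators(locator_text, separator=","):
    """Split a locator list on the separator, dropping empty pieces."""
    parts = [p.strip() for p in locator_text.split(separator)]
    return [p for p in parts if p]


def expand_record(headings, locator_text, separator=","):
    """Return one (headings, locator) pair for each locator in the list."""
    # An empty locator field is kept as a single record with no locator.
    locators = split_locators(locator_text, separator) or [""]
    return [(list(headings), loc) for loc in locators]


expanded = expand_record(["cats", "breeds"], "12, 15-17, 203")
```

The separator is a parameter because, as noted above, some software uses a semicolon or another character instead of a comma.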