Repurposing Print Indexes for the Web

Publishers who want to deliver material in both print and electronic formats, such as on the Internet or a CD-ROM, must solve the challenge of adapting print indexes and repurposing them as web pages. These print indexes traditionally have locators pointing to page numbers, journal citations, section numbers, legal citations, etc.

The publisher may want the “print” locators to become hypertext links to the material on the web, or may want the locators simply to display because the index is meant to be a sales tool. For example, the publisher may use the index to show potential users the depth and types of coverage of the publication, in the hope that the index will persuade companies to pay for a subscription to access the information.

This article will describe various print-to-web index migrations based on actual past and present projects using HTML/Prep¹. It also lists questions that may help guide an indexer in what to discuss with a client when beginning a web index project. As an indexer, you need to know how to respond when a client asks you to assist with this migration. This article is intended to point you in the right direction.

The process of creating a “navigational” index for a web site is not covered in this article. This type of index requires a different process that can be accomplished directly in the major indexing software² or by using specialized web indexing software³. In a web site index, each locator is a URL⁴ to a web page or a point internal to a page. Thus the link values that have to be applied to locators vary widely from one another and likely need to be added manually. For more on such indexes, see Indexing Specialties: Websites.⁵

Creating a new index to collections of documents, articles, etc., that are only on the web and not in print but have uniform URLs, may also be accomplished using the techniques described herein. Web-only material could be journal or periodical articles, recipes, abstracts, reports, debates, letters, etc.

The advantages of using software to transform a print index into a web index are (1) removing the tedium of manually adding link information to the locators, (2) getting it done quickly, and (3) reducing the potential for human error. In fact, programs can often point out errors in the locators of the print index as well, giving you the opportunity to correct them prior to release of the print index.

The primary focus of this article is the process of creating the pages and the methods for transforming the locators into linkable text; however, there are other design considerations to take into account. This type of work requires being part of a team, possibly working with the editor, IT staff, web managers, et al. In return, you can bring more value to your clients and perform additional work for them, leading to new sources of income for you.

The Design Phase

Prior to starting, there are some questions that must be asked of the publisher or their web/IT folks:

What is the architecture of the web site? Specifically, what are the URLs that are required to make the hyperlink from a locator? Depending on the variety of locators, there may be more than one type of URL format. Can the IT/web staff provide the URLs as a list? Or is there documentation on them?

Do locators need to link to internal points in web pages or just link to the top of pages? Generally the latter is easier to accomplish. Otherwise, anchors need to be incorporated into the web pages, which is usually not something the indexer is granted permission to do. If anchors already exist, then the indexer needs a list of them or a way to view them online. Links can also be made to PDF files or to a specific page within a PDF file.

What should be the displayed text on which the user would click? Is the locator to serve this linking purpose? Or will the subheading that the locator is attached to serve that purpose? In other words, “Films, 54” or the actual word “Films” could each serve as the link. In the latter case a further limitation exists where each entry can have only one locator attached to it since multiple locators will be “lost” or require a complicated process for activation.

Should the entire index be one web page, or should each letter be its own page? For larger indexes, and indeed for any index with more than a few hundred entries, separate pages for the letters help the user move around efficiently.

When there are separate pages, should there be a list of main headings for easy browsing of the top-level concepts with each heading linking to its full array?

Are cross-references actively linked to their target headings?
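The per-letter pages, the browsable list of main headings, and the active cross-references raised in the questions above can be sketched in a few lines of Python. This is only an illustration: the file naming (`index_a.html`), the anchor id convention, and the function names are assumptions made for the example, not something prescribed by HTML/Prep or by any client site.

```python
from collections import defaultdict
from html import escape


def page_key(heading):
    """Return the lowercase first letter of a heading, or 'other' for digits/symbols."""
    first = heading.strip()[:1].lower()
    return first if first.isalpha() else "other"


def group_by_letter(headings):
    """Group main headings by the letter page they belong on, in sorted key order."""
    groups = defaultdict(list)
    for heading in headings:
        groups[page_key(heading)].append(heading)
    return dict(sorted(groups.items()))


def anchor_id(heading):
    """Derive an anchor id from a heading (hypothetical convention for this sketch)."""
    return "".join(ch.lower() if ch.isalnum() else "-" for ch in heading).strip("-")


def heading_link(heading):
    """Build a link usable in a main-heading list or as a cross-reference target."""
    page = f"index_{page_key(heading)}.html"  # hypothetical file naming
    return f'<a href="{page}#{anchor_id(heading)}">{escape(heading)}</a>'
```

The same `heading_link` function could serve both the main-heading browse page and "See" or "See also" references, provided the target headings carry matching anchor ids on their letter pages.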

The Process

Here are steps in a typical workflow:

1. Write or obtain the index for the print product.

2. Create web pages with XHTML⁶ or HTML⁷ tagging.

3. Transform the locators into hypertext with link values.

4. Review the index “locally”, i.e., on the indexer’s PC.

5. Upload/transmit the index to the client’s QA area for review by them in their environment; possibly they will apply a web style sheet to the pages.

6. Publish the index to the public web site.

Steps 2 and 3 are sometimes transposed depending on the most appropriate method to transform the locators. If it is easier to add links within the indexing software, then that step needs to be performed before creating the web pages.

Methods of Adding Links

Finding the best approach to transform locators depends on what you are starting with and what you need to end up with. Several approaches are possible. One common element of each approach is determining the folder path on the web site to the target material. [Note: this does not apply if the links are database queries, an example of which appears later.] If the index web pages will reside in the same web folder as the pages/files that will be linked to, then the index’s hyperlinks do not require a path. The examples below will not address this issue every time it could come up, but the first method does show how this might apply.

The addition of link values is the part of the process that often requires special software to accomplish—either off-the-shelf or custom as noted below.

Simple Substitution: this type of change can usually be done in your editing software. If locators are document or abstract numbers (10339, 75434, etc.), and the documents to be linked to are PDF files named using those numbers, then a global substitution in your software can produce the necessary transformation. [Caution: do this to a copy of your actual file!] Simply find the numbers in the page field and replace each one with the number followed by .pdf (for example, 75349 becomes 75349.pdf).

But this is not the whole solution. You need to specify both the URL value and the displayed text on which the user clicks. In HTML, this type of link would be tagged:

<a href="75349.pdf">75349</a>

which would look like this on the web page

75349

When using HTML/Prep, the desired result to export from the page field is

7534975349.pdf

which HTML/Prep will convert to the proper HTML link tagging.

When clicked, it would display the file 75349.pdf that is in the same web folder. If the documents reside in another folder then that folder path, relative or absolute, is required:

75349 www.website.com/document/75349.pdf

75349 ../articles/75349.pdf

In other words, the path must be included in any substitution that is performed. Relative paths are preferable, since the path above them, including the domain, could change and the links in the index would remain usable.
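For readers who prefer to script the substitution rather than do it in their indexing or editing software, here is a minimal Python sketch of the same idea. It assumes, as in the example above, that every locator is a document number and that each document is a PDF named after that number; the function names and the comma-separated page field are assumptions made for illustration. It produces the final HTML tagging directly rather than the intermediate form that HTML/Prep expects.

```python
import re
from html import escape


def link_locator(locator, base_path=""):
    """Return an HTML link for a numeric document locator such as '75349'.

    base_path is an optional relative or absolute folder path ending in '/'.
    """
    if not re.fullmatch(r"\d+", locator):
        # Catching malformed locators here is one way software exposes index errors.
        raise ValueError(f"not a document number: {locator!r}")
    href = f"{base_path}{locator}.pdf"
    return f'<a href="{escape(href, quote=True)}">{locator}</a>'


def link_page_field(page_field, base_path=""):
    """Link every document number in a comma-separated page field."""
    parts = [part.strip() for part in page_field.split(",")]
    return ", ".join(link_locator(part, base_path) for part in parts)
```

Calling `link_locator("75349", "../articles/")` yields a link whose target is `../articles/75349.pdf`, so a change of folder only requires a different `base_path` value.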

Note also that in other situations the path may vary depending on the locator. For instance, if the index is a cumulation covering many years of a journal, each year’s articles may reside in a separate folder path that embodies the year:

../2009/articles/jan.html
../2010/articles/mar.html

Multiple substitutions might be required to get this result.

Algorithmic Transformation: this method presumes that the locator contains the information needed to build or derive a link value but is not easily created by a substitution in your software. Here are some examples from the Rochester History Index⁸ showing the print locator and associated link value:

49(1):7-8 (Jan 1987) ./v49_1987/v49i1.pdf
44(1&2):25 (Jan & Apr 1982) ./v44_1982/v44i1-2.pdf

The locator elements include: the volume number, issue number(s), page number or range, month(s), and year. Knowing that, you can see how the links can be built where the folder name and the filename are derived from information in the locator. For this client, a special program was created to parse the locators to ensure they were correctly entered and to build the link value. Then the index was run through HTML/Prep to create the web pages.
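The following Python sketch shows how such a parse-and-build step might look for the two sample locators above. It is not the special program written for the client; the regular expression and the function name are assumptions inferred only from those two examples, and a real index would likely contain locator variants that need additional handling.

```python
import re

# Pattern inferred from the two sample locators: volume(issue[&issue]):pages (months year)
LOCATOR_RE = re.compile(
    r"(?P<vol>\d+)\((?P<issues>\d+(?:&\d+)*)\):"
    r"(?P<pages>\d+(?:-\d+)?)\s+"
    r"\((?P<months>[A-Za-z]+(?:\s*&\s*[A-Za-z]+)*)\s+(?P<year>\d{4})\)"
)


def build_link(locator):
    """Validate a journal locator and derive its link value.

    Raises ValueError when the locator does not fit the expected pattern,
    which flags entries that were keyed incorrectly.
    """
    match = LOCATOR_RE.fullmatch(locator.strip())
    if match is None:
        raise ValueError(f"unrecognized locator: {locator!r}")
    issues = match["issues"].replace("&", "-")  # combined issue "1&2" becomes "1-2"
    return f"./v{match['vol']}_{match['year']}/v{match['vol']}i{issues}.pdf"
```

Because the parse fails loudly on anything that does not match, running every locator through `build_link` doubles as a check of the print index itself.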

Table of Contents Lookup: For page number based indexes, a potential transformation uses a file (created using your indexing software) that has a set of page number ranges and the link value needed:

8925-39 40302a
8995-9080 40303p

This strategy is most useful when you can construct a table of contents with page ranges that correspond to a distinct URL. This example is from the debates of a Canadian provincial parliament⁹ where the link value denotes the year, month, day, and sitting (am or pm).

For this workflow, the index is exported and processed through HTML/Prep to create the temporary web pages. Next a custom program is run on these temporary pages which looks up each page number in the table of contents file to determine the link value that should be substituted and a new set of final web pages is produced.
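The lookup at the heart of that custom program can be sketched in Python as follows. This is an illustration only: it assumes the table of contents file holds one page range and one link value per line, separated by whitespace, and that an abbreviated range such as 8925-39 means pages 8925 through 8939. The function names are hypothetical.

```python
from bisect import bisect_right


def expand_range(text):
    """Expand a page range, restoring abbreviated endings: '8925-39' -> (8925, 8939)."""
    start_s, _, end_s = text.partition("-")
    if not end_s:
        end_s = start_s  # a single page
    elif len(end_s) < len(start_s):
        end_s = start_s[: len(start_s) - len(end_s)] + end_s
    return int(start_s), int(end_s)


def load_toc(lines):
    """Parse 'page-range link-value' lines into a list sorted by starting page."""
    toc = []
    for line in lines:
        if not line.strip():
            continue
        page_range, link = line.split()
        start, end = expand_range(page_range)
        toc.append((start, end, link))
    toc.sort()
    return toc


def lookup(toc, page):
    """Return the link value whose range contains page, or None if no range does."""
    starts = [start for start, _, _ in toc]
    i = bisect_right(starts, page) - 1
    if i >= 0 and toc[i][0] <= page <= toc[i][1]:
        return toc[i][2]
    return None
```

A page number that falls in no range returns `None`, which is worth reporting rather than silently skipping, since it usually signals a wrong locator or a gap in the table of contents.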

The examples so far have been limited to URLs that are links to static web pages. For another debates project (in process at publication time), the table of contents approach was used. However, its URLs are database queries:

541-558 http://www.ontla.on.ca/web/house-proceedings/house_detail.do?locale=en&Date=2009-01-26&Parl=39&Sess=1&detailPage=/house-proceedings/transcripts/files_html/26-JAN-2009_L105.htm

In principle there is no difference in the workflow or in how the user interacts with the web site. This is presented as an “under the hood” look at this type of link.
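To make the “under the hood” view concrete, the Python sketch below assembles a query URL of the shape shown above from its parts. The parameter names come from the sample URL; the function name and the choice of which values to pass in are assumptions for illustration, and the transcript file name is treated as an opaque value supplied by the client rather than derived.

```python
from urllib.parse import urlencode

BASE = "http://www.ontla.on.ca/web/house-proceedings/house_detail.do"
TRANSCRIPT_FOLDER = "/house-proceedings/transcripts/files_html/"


def query_link(date, parliament, session, transcript_file):
    """Assemble a database-query URL from a sitting date and transcript file name."""
    params = {
        "locale": "en",
        "Date": date,
        "Parl": parliament,
        "Sess": session,
        "detailPage": TRANSCRIPT_FOLDER + transcript_file,
    }
    # safe="/" keeps the slashes in detailPage unescaped, matching the sample URL.
    return BASE + "?" + urlencode(params, safe="/")
```

When such a URL is written into an HTML page, each `&` should be written as `&amp;` inside the `href` attribute.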

Correspondence List Mapping: When all other methods are infeasible, then the last recourse is to use a correspondence list. The list contains each possible locator and the link value to match it. Such a situation might be one where documents are assigned unique identifiers of a random nature (or in arbitrarily named folders):

Appendix A ./2009/appendices/39021.htm
Memorandum 35-1 ./memoranda/455821.htm

This project and the next made use of Mapper¹⁰ to apply the correspondence list to the print index and to append the link values to the display text. Then HTML/Prep was run on the export of those index files with the augmented locators.
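Conceptually, applying a correspondence list is a dictionary lookup. The Python sketch below illustrates the idea using the two sample pairs above; it is not how Mapper is implemented, and the function name and return shape are assumptions made for the example.

```python
def apply_mapping(locators, mapping):
    """Pair each locator with its link value from the correspondence list.

    Returns (linked, unmatched): linked is a list of (display text, link value)
    pairs; unmatched lists locators missing from the correspondence list.
    """
    linked, unmatched = [], []
    for locator in locators:
        link = mapping.get(locator)
        if link is None:
            unmatched.append(locator)
        else:
            linked.append((locator, link))
    return linked, unmatched


# Correspondence list taken from the sample pairs in the text.
CORRESPONDENCE = {
    "Appendix A": "./2009/appendices/39021.htm",
    "Memorandum 35-1": "./memoranda/455821.htm",
}
```

The `unmatched` list matters as much as the links: any locator that has no entry in the list is either a keying error in the index or a gap in the list supplied by the client.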

In one project, the requirements demanded that the list contain HTML coding which was supplied to the indexer:

AB PSSPBA 4 *

In this mapping situation, the displayed text was inserted for the asterisk (*) in the HTML coding.

Other Considerations

Some web indexes may need additional functionality. In a large (100,000+ entry) legal index¹¹, another program was developed that provided a search capability for terms within the index web pages. In large indexes, not necessarily as large as this but when there are multiple pages, it may be useful to have a web page that lists the main topics in the index on a single page without any of the other index sublevels. This facilitates quick browsing of the high-level terms and users can link quickly to that term’s array in the index.

Here are some other questions to ask your client: How does the user of a web site become aware there is an index to use? Is there a link on the home page? Should there be a landing page with head notes or usage instructions, to begin using the index? In indexes with long subheading displays, the upper level headings may scroll out of view. Do you need to provide bread crumbs or pop-ups to let the user know the current context? These issues will need to be addressed both during the design and in the workflow.

Summary

The endnotes provide URLs to several of the cited web indexes for you to see and use the end result of the print index transformations and the various features applied to them.

This type of work is a new offering for you to present to clients, but one which in coming years will become more common in the industry. You will need to be conversant with web technology, may need to purchase new software or contract for custom software or consulting, and need to work with new departments at your client’s organization. In most cases, your client will recognize that these additional factors are not covered by a per-page billing rate. Determine what kind of hourly rate or fee schedule you would charge. Ask your client if you can pass through directly any programming consultant and software development costs or whether you need to build these into your pricing. While some of the methods described above are possible to accomplish in Excel with macros if you have a technological bent, you should determine if this is the best use of your time. It may be best to bring in a programming consultant to deal with technical issues especially if your client needs hand-holding as well in the workflow and design phase.

Remember: the web needs more indexes, and opportunity awaits!

1 Leverage Technologies’ HTML/Prep™ (http://www.levtechinc.com/ProdServ/LTUtils/HTMLPrep.htm).

2 Indexing Research’s CINDEX™ (www.indexres.com), Macrex™ (www.macrex.com), Sky Software’s SkyIndex™ (www.sky-software.com).

3 Brown’s HTML Indexer™ (www.html-indexer.com).

4 Uniform Resource Locator which may refer to a static web page containing a web site name, a folder path, a file name, and possibly an identifier or anchor to a point within a web page; or, it could invoke a query to a database (http://en.wikipedia.org/wiki/Uniform_Resource_Locator).

5 Hedden, H. (2007), Indexing Specialties: Websites. Wheat Ridge, CO: American Society for Indexing.

6 eXtensible HyperText Markup Language (http://www.w3.org/TR/xhtml).

7 HyperText Markup Language (http://www.w3.org/TR/html4).

8 Rochester, NY History Index (http://www.libraryweb.org/~rochhist/indexa.htm).

9 Legislative Assembly of British Columbia (http://www.leg.bc.ca/hansard/8-2.htm).

10 Leverage Technologies’ Mapper™ (http://www.levtechinc.com/ProdServ/LTUtils/Mapper.htm).

11 BNA’s Labor Relations Reporter General Index (http://www.bna.com/lrr/lrrindx.htm).

David K. Ream is LevTech’s chief consultant. He has an M.S. degree in Computer Science from Case Western Reserve University. Mr. Ream has spent over 25 years working with publishers in the areas of typesetting design and production, database creation, editorial systems, and electronic publication design and production. LevTech is the corporate/government sales partner for Indexing Research’s CINDEX products. LevTech also performs computer consulting and programming for editorial & web applications and batch composition services.